FOSDEM 2024 Transcribed / Subtitled by Whisper

Welcome to FOSDEM 2024
Welcome to FOSDEM 2024. I'll start with a reminder to be kind. We lost quite a few volunteers over the pandemic. We lost muscle memory. We lost stuff. We lost everything. Last year was pretty rough in the background. From what I've heard, it was not super noticeable for the attendees, which was nice, but still there were a few warts. We hope that we are basically almost back to pre-pandemic levels of efficiency and effectiveness. We'll see. But if something breaks, ideally tell the infodesk about it and just move on; you can't do anything else anyway.

Speaking of losing a lot of hands: please volunteer. As of right now, we still have plenty of volunteering opportunities open. You'll find a bunch of QR codes, so you can just take photos of the slides if you want to and you'll have everything. We still need more hands to help. As per usual, if you help with cleanup: any of the network cable, which is not that much anymore because we have been pretty good at using ULB infrastructure these days, any network cable which you take out under our supervision you can keep. I'm going to repeat this in the closing talk. Also, there is food for everyone who stays until the end. Historically, this sometimes went until midnight, but the last few years we were always done-ish at ten, and more hands, more better. So if you are around and if you have time, please help.

If you need help yourself: yellow shirts or jackets mean staff. Blue shirts mean devroom managers; those are the people who basically run the conference-in-a-conference in the various rooms. Green shirts are the video team; there are a few back there. You'll also see orange shirts, which are general-purpose volunteers for all the things. Speaking of not everything running super smoothly just now: the supplier didn't send all the t-shirts which we ordered, and we noticed too late. Long story short, we don't have enough of the shirts. As you can see, I'm wearing a black one, not a yellow one.
So not every volunteer might have a shirt. If in doubt, give them the benefit of the doubt; they are probably a volunteer if they claim so.

This is ULB. The map is not oriented to true north; you see true north down below. We also have the navigation system if you need to find your way around, which is way, way better than just five years ago, when we didn't have it and everyone got lost all the time. The QR codes all go into this system. If you need anything, from a t-shirt or a hoodie, a printed map, if you want to give us a donation, or if you just want to say hi and thanks, the infodesk is your main landing spot. As per usual, extra large and extra small might be gone sooner rather than later, so going there early might be good. We are also on Matrix, and you can also email us at all times; all of those are monitored, not 24/7, but almost.

This is the difference between last year and this year: there is more than a person-week of work between those two photos, and more than 10K in investment. As a reminder for those who don't know: the FOSDEM network is IPv6-only on the Wi-Fi, with NAT64 and DNS64, so all the translation to IPv4 happens for you automatically. Except for, I believe, RIPE meetings, we are the first conference to actually do this, and we found a bunch of bugs. Learning in public, full transparency: we just found a bug in internal Grafana tooling with IPv6. Anyway, please primarily use this network, and if you maintain any software and something breaks, this is great, because now you can fix it. At the same time, if you want the more legacy taste of internet on your devices, you can use the dual-stack network, and you will get full IPv4. But we have been doing this for years, it is pretty stable, so just try it, and you will see: hey, this IPv6 stuff actually works.
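As a side note on how that NAT64 translation works: an IPv6-only client that needs to reach an IPv4-only host is handed a synthesized IPv6 address with the IPv4 address embedded in the low 32 bits of a /96 prefix. Here is a minimal sketch of that synthesis, assuming the RFC 6052 well-known prefix 64:ff9b::/96 (the conference network may well use a different, site-specific prefix):

```python
import ipaddress

# RFC 6052 well-known NAT64 prefix; real deployments may use their own /96.
WKP = ipaddress.IPv6Network("64:ff9b::/96")

def synthesize_nat64(ipv4: str, prefix: ipaddress.IPv6Network = WKP) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of a /96 NAT64 prefix."""
    v4 = int(ipaddress.IPv4Address(ipv4))
    return ipaddress.IPv6Address(int(prefix.network_address) | v4)

# A v4-only host such as 198.51.100.7 appears to IPv6-only clients as:
print(synthesize_nat64("198.51.100.7"))  # 64:ff9b::c633:6407
```

DNS64 performs the matching trick on the name-resolution side: for names that only have an A record, the resolver synthesizes an AAAA record of exactly this shape, so applications never see the IPv4 address at all.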
We also have a lightning talk, or rather longer than a lightning talk, the infrastructure review; as per usual, the QR code goes there. And as a reminder, because we have funny people who think they are funny when they try to attack stuff: A, we are going to find you and shut you down, but B, please encrypt all your traffic, because some people are just maybe not as nice as they should be.

Social interactions: well, the hallway track, that is the main thing while we are here at FOSDEM, because we can't get into the rooms anyway, and there are also lots of new joiners. We have Matrix as the primary thing; we also have IRC with reasonable effort, but there is no bridge this year. If you don't know what that means, it doesn't matter, just go to Matrix, long story short. Also, go to Mastodon and look at stuff. If you need to register for Matrix: chat.fosdem.org, you click on the upper left, then on create account, and you are done. If you want to toot about the things you see, ideally with the consent of the people who are in the picture, use the hashtag FOSDEM. There is our main account; we also have other social media, but we try to ignore those, so that's the main one. You will find it via the QR code. Anyone who doesn't want to be in the picture, now is a good time to hide your face, and everyone else, just raise your hands for a second and say hi. Those are going to be posted on Mastodon in a few.

So, the schedule. You probably know this, but this is the schedule. It builds dynamically, and we are pretty much settled, but if any devroom manager still needs to change something, someone fell sick, whatever, it will update within minutes. So this is the absolute source of truth for everything regarding the schedule. And I still don't think anyone told us about an Apple-compatible thing for the... There is one? Okay. Can you send it to info at fosdem.org, so next time I can put this on the slides? Thank you.
Just say RichiH or to RichiH and it will find its way. Thank you.

So, the stats: we have even more events crammed into these old buildings. We have even more speakers. We do have more tracks and devrooms, but I forgot the old number, so you don't see the diff. We managed to cram even more stands into the buildings, and maybe we will be able to do even more next year; we don't know yet, because we are always out of space, as you know. And fewer lightning talks. These are all the devrooms. You won't be reading all of those right now, but just to give you a rough idea of scale: this is only page one, and this is page two. And for those who don't know: yes, this is the largest open source conference on Earth.

We have a bunch of stands. We rejuggled them quite a bit. So if you have been walking to that one place for the last three or five years where you always saw that one thing: look online. Some of them didn't make it, but most we just rejuggled and tried to organize with a new system, so they might just be somewhere else.

We don't have a hacker room this year; we needed it for staff and for boxes of stuff. So please go to the cafeteria next to the food area on the main campus. Even if you know where the old room is, please don't go in there, because there are going to be people in there working and trying to make FOSDEM happen.

Speaking of being nice about things: when, not if, you need to queue here, because you will need to queue, please form an orderly queue. Don't skip, blah, blah, blah, the usual, but also don't block any of the pathways and don't block any of the fire exits. We are literally filling this campus to capacity, and maybe a little bit beyond breaking capacity. So please try to keep the pathways free. If staff and volunteers come with requests or instructions, hey, this is a fire hazard or whatever, please don't argue.
Just do it, because they probably have a good reason for what they tell you. If staff or volunteers are running, make way. We are not running often, but when we are running, we mean it, so please just make way. In the food queue, please consider letting people skip ahead if they have one of the color-coded t-shirts; but also, if you have one of those, don't abuse it, and don't use it when you actually would have time. That also happens sometimes; it goes both ways. The one exception: we are doing food runs for the teams which are stuck in rooms all day, like myself, and those will just skip the queue and go to the front.

If you see this sign: you saw a green one on the sides or on the doors as you came in, which was nice; all of them on the back have this one. If there is a sign that the room is full, that means the room is full. It does not mean that because you have a really good reason and really wanted to see that one talk and your friend is speaking, you can sneak in. It means the room is full. This sucks, but it means the room is full. Maybe it's going to work next time, but please: that sign means do not enter. Also, if you try to open the door, sometimes they are a little bit broken, but if you try to open the door and someone pulls back, that means the volunteer has a reason why the door is currently closed; usually, of course, it's super loud inside.

Maybe I should have put this on the slides too: when you leave and enter a room, please don't talk and try not to bump the chairs and everything. It's incredibly loud up front, and quite a few devrooms try to optimize for Q&A, which means the audience needs to be quiet while entering and leaving. I know it's unusual, but those buildings are super, super, super loud, in particular for anyone up front. This room is literally built to echo everything from me up to you, but also everything from you down to me. It's super loud, and this is true for quite a few buildings.
So please be quiet when entering and leaving. I need to put this on the slides for next year.

If you have any feedback, send it there. We don't get enough feedback. If you say nice things, that's also appreciated; that happens almost never. If you want to say crazy things, maybe don't, but anything you want to send us, send it; we like reading your stuff and we read everything.

If you need first aid, we have one first aid station in the K building on the second level. From most of the entryways to the second level it would be at the right end, or if you come in at the one end, it's at the other end, if you don't see it. There's also the link here. I'm just going to wait a little bit longer for people to take a picture of this slide, because if you need it, you will need it, so now is a good time to take a picture. We also have both professional security and the Red Cross walking around campus; they have their own schedule and everything. They are the professionals, we are not. If you need anything, go to them.

A bit about health and safety. You hear the coughing all around you. Some of you are sick right now, some with COVID. Some of you will fall sick while here, and some of you will fall sick at home. If you are currently infected, ideally go home. At the same time, I recognize that this might be harsh if you're only coughing; in that case, please just wear a mask. We have free masks at the infodesk. I also see a few people with masks here. You will also see me with a mask when I'm not talking. Also, for the first-timers and for those who might not have been here so often: FOSDEM flu, Chaos Grippe, Oktoberfest, those are mass spreader events. A statistically significant number of people will come home sick. Done. Plain statistics and science. You can get sick if you want to, but maybe don't.

This is the Mozilla devroom. I saw this on Mastodon, not on any other social network, which is nice.
The left one is 20 years ago, 2004 I think they said, and they are almost in the same place by pure accident this year. The wall changed color, but beyond that it's the same. Why am I bringing this up? Well, next year we will have the 25th anniversary. So if you have old shirts, from as old as possible ideally, please do bring them. Even if they don't fit, still bring them. Someone might be able to wear them, or you can just pull them over your belly, or maybe you slimmed down so much that you can get a second person into the shirt, depending which way it went. But please bring the old stuff. If you have an old hoodie, an old sweater, this one project shirt from ages ago, whatever: bring the old stuff. We are going to figure something out for next year. Maybe take a picture here or outside, we'll see. Most likely Sunday next year, because then we can gauge on Saturday how many people come. Anyway, if you have anything old which you think might be nice in a photo, bring it.

We also have a code of conduct, and we follow the code of conduct both ways. I need to put a QR code here, I just realized. Yeah, this is the code of conduct. Please read it, but the short version is: be nice. If anything happens, mail conduct at fosdem.org. If anything is really, like, currently ongoing, there is a cell phone, and the cell phone is going to wake people up. So if anything really bad happens, call this number and wake whichever poor soul, probably Michael, is currently carrying this thing. Or go to anyone who has a yellow shirt with stuff on it, or a jacket or something. By proxy, if you can't find someone in yellow, a green, blue, or orange shirt also works, but primarily, ideally, go to the yellow shirts; the infodesk in K also works.

To the thank-yous: to all the volunteers, all the devroom speakers, all the devroom managers, all the staff, everyone. Thank you. APPLAUSE Also, thank you to all the sponsors.
APPLAUSE If you want to give thanks the other way, donations can be made at the infodesk in H or K, or online. So for those who don't know... oh, for whom is this the first FOSDEM? Wow, good! APPLAUSE So for those who don't know this: we limit the number of sponsors, basically. The reason is we don't want to depend on anyone, any one company, or on any long-standing support or anything. We deliberately finance ourselves substantially through donations and t-shirt sales and things like these. That's basically it. This is why we call for donations: because we don't want to be dependent on corporate overlords who might pull funding or something. We want this to be a grassroots thing, not only on the organizing side, but also on the financial side. And thank you! APPLAUSE To be clear, also thank you even if you don't donate.

Also, for those who don't know: it's really loud up here, so even if you just whisper, like two, five, twenty people... Thank you. Whispering is super loud. You cannot imagine how good or bad, depending on your vantage point, these rooms are. It's insane how loud it is up front for the speakers if you even mumble. Yes, person over there on the right. Yeah. So, have fun, and as always, be excellent to each other. APPLAUSE
Where have the women of tech history gone?
Good morning everyone. I hope everyone is settling down; we can get started with our first talk. Our first talk is "Where have the women of tech history gone?". Our speaker is Laura Durieux. She has been a developer for six years and was awarded at WorldSkills Belgium in the web technologies category. She has been doing monthly YouTube live discussions on the latest developments in the tech industry. Additionally, she has also started a career of Fourier in France. The talk is about where the women of tech history have gone: Ada Lovelace, Hedy Lamarr, the ENIAC girls, Grace Hopper, Joan Clarke. Stemming from the role of calculator, the profession of developer was initially considered a woman's job, while hardware design was seen as a man's job. But who are these women who have shaped the world of tech? Why don't we hear more about them? With Laura, she'll attempt to set the record straight bit by bit and provide the role models in tech you've always needed. Thank you.

Hi, can you hear me right? Hi everyone, thank you so much for coming today. I just wanted to say that at first I tried to get my talk scheduled on Sunday, because this is way too much for me to handle. Please be kind to me, thank you so much. We're going to talk today about the women in tech history.

First of all, I wanted to tell you a little anecdote that happened to me when I was in college. During my first year I had an art history class, and I was kind of sad to see there were at most two women represented. I decided to send an email to my teacher and ask him why he presented so few women. He answered, kindly and honestly, that he didn't have enough time to add more artists to his syllabus, because otherwise some students might not get the required basics for their future career; think about students in illustration, in painting, in art, etc. At first I didn't really pay attention to it; I didn't really see the huge problem behind this.
Then I started to realize that this is kind of weird, that this is not normal, this is not fair, and I had two questions in my mind. The first one is: why are women not considered part of the required basics? Why do they count less than men? The second one is: who is the person, or the group of persons, who decides that someone deserves more than another to be in a syllabus? Spoiler alert: I don't have the answer to this question. I have ideas, I have theories. This is not the aim of this talk, but I hope you can think about this question yourself. What I can do is pay tribute and give a place to the women who did fantastic work to revolutionize the computer science field. This is something I have had in my mind for many years; in fact, it's only natural that I'm here today in front of you to speak about it. The problem is present in the majority of fields, but today we're going to concentrate and talk only about computer science, the reason why we are all here today. Personally, if you go home and you remember two names of women you learned about today, it's a huge win for me.

What about you? Do you know some names of women in tech history? Ada Lovelace. Kathleen Booth. Margaret Hamilton. Belinda Pearson. Oh sorry, I can't hear. Belinda Pearson, yeah, that's true; I don't think I know anything about her. Okay, I got a lot of names, that's really nice. Thank you so much for that.

So let's go discover together the stories throughout computer science history. And for that we need to go back in time, and we're going to begin at the Age of Enlightenment. The ancestors of computing machines were human computers, especially in the field of astronomy. Basically, computer was a job. It was about mathematical calculations, and very often the work was divided: the computers were divided into groups to compute long and difficult calculations.
And the job was done in a way that the calculations were executed at the same time, in parallel. I wanted to mention this because it's really funny: still today, this is something we look at in our computers, how many operations a computer can execute at the same time. It was already a way of working that people created a long time ago.

Like every profession, it was dominated by men. However, the first woman to be quoted in articles about computer science history is Nicole-Reine Lepaute, one of the most famous astronomers of the Age of Enlightenment. She is famous because, with two other men, she calculated the return date of Halley's comet as April 13, 1759; it actually returned on March 13 of the same year. I don't know if you realize: we are in the 18th century, and they calculated by hand the return date of the comet with only one month of error. It's really amazing. Maria Mitchell also made a splash for discovering the first telescopic comet, which means it was invisible to the naked eye. It was named after her, and she received a gold medal for this achievement.

During the 19th century there were a few barriers and contradictions regarding women in the scientific fields. Despite the fact that they had access to degrees, they were forced to resign as soon as they got married. A kind reminder that a woman who was not married at that time didn't exist in the eyes of society.

The history of computer science starts in 1840 with a woman that you obviously know, and if you don't know her, you should ask yourself some serious questions. Who's that Pokémon? Well, of course, it's Ada Lovelace. I think everyone in this room knows who Ada Lovelace is. But for me she is not only the first programmer, and this is my thought, and I wanted to speak with you about that.
So for that I need to explain something. Charles Babbage is the person who built the difference engine and designed the analytical engine. However, he was messy, and he couldn't stand back from the machine he was building. He had ideas, but he didn't have a concept that embraced his machine. Hence the arrival of our sweet and dear Ada. She invented the concepts behind the analytical engine by providing the first algorithms. And ladies and gentlemen, computer science was born. This is why I think Ada Lovelace isn't just the first programmer: she is the mother of computer science, by giving us these first algorithms. And by the way, you can find the first notions of loops and functions in these algorithms. Despite this extraordinary invention, it was way too innovative for that time; I remind you, we are in 1840. So it was way too innovative, and the analytical engine was forgotten for lack of funding, before being rediscovered in 1937 to inspire the Mark I, the first general-purpose electromechanical computer. But let's take it easy.

Alright, we are at the end of the 19th century, and Edward Charles Pickering is the founder of a group of women called the Harvard Computers. These women listed over 10,000 stars and developed a system to describe them. But one particular woman stood out: Annie Jump Cannon. She pioneered, this one is hard to remember, a new spectral type classification system, and she developed the Harvard classification scheme, which is the basis of the system still in use today. Between 1911 and 1915 she classified over 5,000 stars a month, at a rate of one star per three seconds. I don't know what you can do in three seconds. I mean, I can chug a beer in three seconds, but that's all I can do, right? Okay girl, you have my respect.
In the 19th century, the growth of industry opened up opportunities for women to join the field of technology. One notable woman, Grete Hermann, made significant contributions with her advanced work in mathematics and physics. She played a key role with her early philosophical work on the foundations of quantum mechanics. But in the 1920s, her doctoral thesis laid the groundwork for computer algebra: it first established the existence of algorithms for many of the basic problems of abstract algebra. We are going to see a little more computing here, I promise it's coming. There is a definition of computer algebra over there if you want to look at it afterwards.

Between the 1940s and the 1970s, women were widely hired as coders, and there are a number of reasons. The first one is that programming was an emerging field, so you didn't need a diploma to be hired; new hires only had to pass a straightforward logic test to work in a computer science job. Another factor was that even when women had degrees in scientific fields, they faced a lot of challenges, like finding a job or even advancing in their career, so they turned to opportunities in the IT field. The last one is the shortage of manpower during this time, and the fact that women cost very little.

Grace Hopper. During World War II, Grace Hopper, a 36-year-old mathematician, decided to serve her country. This is very American; I'm sorry for the Americans over there. She decided to leave her teaching position at Vassar College to enroll in the US Navy, expecting to decode enemy messages and serve her country. Surprisingly, the US Navy sent her to Harvard, where she became the third programmer of the Mark I. If you remember, earlier I mentioned the analytical engine and how it was rediscovered, inspiring Howard Aiken to create the Mark I in 1937.
Well, the Mark I is a versatile, punch-card-programmable calculator, and it was Grace who had the honor, or rather the heavy burden, of taming this machine. She wrote its 521-page user manual from scratch, without any help from anybody. Like they said: okay, this is the machine, go figure it out yourself, see you next time. This is really impressive to know, and with her work she was engaged in top-secret calculations crucial to the war effort, involving tasks like determining rocket trajectories, generating range tables for new anti-aircraft guns, and calibrating minesweepers. Now look at your computer. Look how easy it is to code. Now imagine doing this with a big, big machine, doing this all day long, and all night long too.

We continue through history, and we are in the 1940s, which mark a milestone in the history of computing: the first fully electronic computer, the ENIAC. It was developed to automate and speed up the work of calculators and computers, who were first humans. But even if it was faster, it still needed human intervention, by someone called the operator, and this job was largely performed by women. The operator is the person who would manually enter problems into the machine through switches and cables. You have a little overview here. Can you see it? Well, it's kind of dark, I'm sorry about that. You see a lot of cables over there. And six astounding women, Kathleen, Marlyn, Betty, Frances, Betty, and Ruth, were the first six ENIAC programmers, and by extension the first programmers. They had to install and assemble this machine. You have to know that the operator was the programmer of today. And even so, at that time the job didn't receive a lot of credit, and it was very often belittled, because it was performed by women, and hardware was seen as the main job.
Yet the line between these two jobs wasn't really clear-cut, because the women, the operators, needed in-depth hardware knowledge to control and program these machines. This was still hardware; there was no graphical interface or anything like that. You needed to touch the hardware, to use the cables and the switches. This is where we see there is a big difference between a job description and what these women really had to do.

I have a little anecdote. First of all, ENIAC, for those who don't know, means Electronic Numerical Integrator and Computer. All six of these women had a mathematics degree in common. They were responsible for installing and assembling the ENIAC. And most importantly, they were the ancestors of the debugger. Look again at this machine and imagine you have a bug, but you don't know where it is. They were six, they were a group, so they had to work together to understand where a bug came from, and why it was a bug. So they created a system to work together as a debugger whenever there was a bug. This is quite impressive. Has anyone in this room already seen a machine like that? Yeah, okay. That's so nice, I'm jealous.

Now we are in 1942, and a significant innovation emerged, unintentionally, driven by Hedy Lamarr, a renowned movie star. To understand what happened, we need to rewind a little bit and delve into her background. Hedy Lamarr is famous for her role in the first non-pornographic film featuring a nude orgasm scene, which people at the time found really shocking. She was also recognized as the inspiration for the face of Disney's animated Snow White. But she was facing a troubled marriage, and Lamarr decided to flee from Austria. And she had a really interesting alter ego: she was super duper into war technologies and advancements.
Well, it was influenced by her former husband, who was a prominent Austrian arms manufacturer. During that time she crossed paths with a pianist named George Antheil, and together they invented a top-secret communication system for radio-controlled torpedoes called, if I remember correctly, Frequency Hopping Spread Spectrum. Is that right? Yes, it is. Okay, aka FHSS, thank you. They patented this idea in 1942, and what is really awesome is to see that this technology is still in use today. For all those who are on a social network right now, on the web: you can thank Hedy Lamarr, because it's thanks to her that we have Wi-Fi and Bluetooth today. And a little thing I have to say: when it comes to unusual career changes, I think we are reaching new heights there.

At the same time, a new way of thinking emerged in the 50s. Programming was evolving way faster than hardware, which is still the case today, and programmers had to begin optimizing their algorithms. This led to an image of the singular creative genius wielding a form of black magic, and with that, the first stereotype of the programmer emerged: the white, hairy, antisocial man. And even if this belongs more to the realm of fantasy, studies in the 60s showed that it was a profile sought after and more easily hired by companies.

So you thought you were done with Grace Hopper? Now she's back. You have to know that after the war, she worked on the UNIVAC, the most powerful computer at that time. And when she was put in charge of the automatic programming department, she had the idea of the compiler. So this person there, she saved our lives, because now our computers can understand languages that we can read; we don't have to deal with zeroes and ones or very low-level languages.
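Going back to Lamarr and Antheil's frequency hopping for a moment: the core idea is that both ends of the link share a secret hopping schedule, so they change radio channels in lockstep while a jammer who lacks the schedule cannot follow. Their patent synchronized the hops mechanically with piano-roll strips over 88 frequencies; here is a toy sketch of the same idea using a shared PRNG seed (the key value is invented for illustration, and real FHSS systems derive their schedules far more carefully):

```python
import random

def hop_sequence(shared_key: int, n_hops: int, channels: int = 88) -> list[int]:
    """Derive a channel schedule from a shared secret.

    Both radios seed the same PRNG, so they produce the identical
    sequence of channels and hop in lockstep.
    """
    rng = random.Random(shared_key)
    return [rng.randrange(channels) for _ in range(n_hops)]

transmitter = hop_sequence(shared_key=1942, n_hops=8)
receiver = hop_sequence(shared_key=1942, n_hops=8)
assert transmitter == receiver  # synchronized hopping: the link works
# A listener with a different key derives a different schedule and loses the signal.
```

Classic Bluetooth's hopping across 79 channels is a direct descendant of this scheme.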
So thank you, thank you Grace Hopper for that. As the idea was revolutionary, she started to observe that every manufacturer, every brand of computer, started to develop their own compiler. So in 1959, facing the potential chaos this could become, she decided to call on her old Navy connections to organize a meeting with every manufacturer in the country. And when they came out of the meeting, they had all agreed on a simple universal language: the Common Business-Oriented Language, or COBOL, was invented, and it is still in use today in banks. Who here can code in COBOL? Some people, not a lot, okay. Are you happy with that? Okay, that's nice, thank you.

I have two little anecdotes about Grace Hopper. I mean, who didn't know Grace Hopper before coming today? You're going to love her, okay? The first anecdote is that she was also the person who thought about software portability. Before, we had to rewrite every program for every computer, and she had the idea: why couldn't we compile the code so we could move a program between computers without having to rewrite it? Thank you Grace, thank you so much, oh my god. And the second thing, which is a little bit funny, is that she is the one who decided to call the process of writing instructions "coding". And it's funny to know that this term was replaced by "programming", because, you know, this came from a woman, so no coding, we're going to say programming. Today "coding" is coming back into our vocabulary, and today it's way cooler to say coding than programming.

Okay, now look at this graph. This is the percentage of women majors by field: medical school, law school, physical sciences, and computer science.
And what we can see is that there is a kind of rupture between women and computer science between 1980 and 1995. This is a big question, and I think that if you are interested in women in computer science, you have already heard about this, about what happened and why. This is not the aim of this talk, but I think it's still important to speak about. There are a lot of reasons, a lot of theories about it. And I really invite you to discuss it with people, older people, younger people, and to see what can be done to try to make this curve go up again, really higher. But one of the reasons I saw when I did my research is the arrival of the personal computer in 1981. Woohoo, PC. Before the PC, the thing is that university students had little to no exposure to computers, because they were rare, expensive and, oh my god, the size of a house. So students were relatively on an equal footing. However, with the introduction of the PC, a new stereotype emerged, and I love this one. This is a joke. The perception arose that to be a proficient programmer, you had to spend countless hours obsessively on a computer, which is still the case today, leading to the notion of the "real programmer" who sported a tan from constant screen time. This is my case. I don't know if I'm good, but this is my case, sadly. The funny thing is that many men in the business didn't even fit the stereotype, and it could be a little bit controversial. However, for the women it was different. You couldn't apply this kind of stereotype to women, because either they were not tough enough, or they were too tough and therefore annoying. So many women began to doubt their ability to code and dropped out of school.
And the last thing I have to say about that is the fact that when households acquired a PC, a personal computer, it was mostly put in the boys' room, with the father taking a coaching role and trying to push his son to explore programming. Did people here live that? Or not? Yeah? Okay. And this is one of the multiple reasons why the gender gap began. It's not the only one. I'm not saying that, because people after my conference were like, no, this is not the only reason. No, I know, I didn't say that. I'm sorry. And so before, I said students were relatively on an equal footing, and with the PC they weren't, because the girls, not all of them, there are exceptions, all right, but a majority of girls weren't pushed to try the computer or programming. And so in the end, before university, the boys were more experienced than the girls. So today we hear every day about ChatGPT and AI. That's so cool. I'm sick of it. Thank you. That's cute. And during my research, I discovered several women who have advanced the field of artificial intelligence, including Alice Recoque and Karen Spärck Jones. And today we're going to speak about Karen Spärck Jones, because I had to make a choice. A scientist and researcher in computer science, Karen Spärck Jones's work focused on natural language processing, or NLP, and information retrieval. So this is a good anecdote to drop when you are at a party with your programming friends, you know, to seem intelligent, smart. She developed TF-IDF. I don't know if people know that. Perfect. Yes, some of you. Okay, nice. So this is term frequency - inverse document frequency. And allow me to read this, because it is impossible to recite by heart since this is not my field: it is a weighted relevance measure that is still used today by most search engines. And it's an important tool for SEO.
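To make the measure she is describing concrete, here is a minimal sketch of the TF-IDF idea in Python. The toy corpus, the function name, and the unsmoothed logarithmic weighting are illustrative choices of mine, not Spärck Jones's original formulation:

```python
import math

def tf_idf(term, doc, corpus):
    """Score how relevant `term` is to `doc`, given a whole `corpus`.

    TF  = how often the term appears in this document (normalized),
    IDF = how rare the term is across all documents in the corpus.
    """
    words = doc.split()
    tf = words.count(term) / len(words)
    doc_count = sum(1 for d in corpus if term in d.split())
    if doc_count == 0:
        return 0.0  # term never appears anywhere in the corpus
    idf = math.log(len(corpus) / doc_count)
    return tf * idf

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "open source conferences are fun",
]

# "the" appears in most documents, so its IDF drags its score down;
# "dog" is rare in the corpus, so it scores higher in the document
# that contains it.
print(tf_idf("the", corpus[1], corpus))  # low score
print(tf_idf("dog", corpus[1], corpus))  # higher score
```

This is the intuition behind why a search engine can rank pages: words that are common everywhere (like "the") carry almost no signal, while words that are frequent in one document but rare elsewhere identify what that document is about.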
So if you are web developers, it's kind of important to know it. And this is the woman who developed it. This method combines the physical presence of a word in a text with the weight of its importance in general. It makes it possible to define the relevance of a specific keyword in a text. So finally, this is kind of what lets ChatGPT understand what you're saying when you write a prompt, I don't know. And then she decided to work with Margaret Masterman, and they wanted a little challenge, to challenge themselves. So she decided to program a computer to understand words with multiple meanings. And the result of that was a dictionary of synonyms. Karen published an article in 1964 that is considered a fundamental document, the foundation of the field of natural language processing. I think that if you are interested in that, if you are coding in this field or just curious, it could be really nice to read more about her and to let people know about her work. Her ideas were little appreciated at the time, but they are implemented today and continue to inspire. Okay, I'm going to say something now. Please don't leave, okay? People might go out because what I'm about to say is going to be a little bit provocative. She also mentored a generation of researchers, both men and women, and she coined the slogan: computing is too important to be left to men. Thank you. Thank you to her. Nobody is leaving? Perfect. I also discovered something really interesting: there is supposedly no sexism in hacking. Why? Because the philosophy of the hacker is that only the work of the hacker is judged, and not the hacker themselves. So it means that we don't care about where you come from, your age, your gender, what you look like, or your orientation. It's hard to say, this one. You are only judged by your work.
However, I had the luck to do a search on Google in French for the top 10 female hackers of the world. So, yeah. The funny thing is that, for the French speakers here, it's written "Les dix plus belles hackeuses du monde", which is a literal translation from another language. So half of it makes sense and half really doesn't. So: the 10 hottest female hackers in the world. I read the article, and they were quite impressive for their work. Well, it is true they were impressive for their work, but it's sad to be reduced to that. And what I wanted to say is that we see a will to make progress, to do better about all these ethical things. But, however, we see that in society the female hacker is still a fantasy, like this, and we have a lot of stereotypes about female hackers. So the woman I would like to highlight here is Joanna Rutkowska, sorry for my pronunciation, a Polish computer scientist and security expert. She is best known for her research on low-level security and stealth malware. This is the conclusion. I could go on for hours and hours about women. To be honest with you, in the first version of this conference I think I had like 20 women. And they said to me, calm down, okay, okay, okay. So today, many actions and associations are being set up to give a place and a voice to women in IT. And this conference is one of them. I have some questions for you: have you ever had a role model in your life? Did this role model help you to dream and give you the motivation to project yourself and believe in your dreams? Yes? No? Okay. Yes, okay. Did it allow you to say to yourself, I can do it? Well, role models... I would like now to speak a little bit about my own experience. Sorry, this is my conference, okay, so you're here to hear me now.
I would like to talk a little bit about my experience of discovering my own role model. This is really weird to say like that. Role models have a lot of consequences, and all of them are positive. Not only can they make us dream of reaching great heights, just as they do, but above all, we allow ourselves to think that we have the right to do so. It may sound weird and simplistic, you know. Often I suggest things to my female friends, because I'm passionate about what I'm doing and I don't have a lot of coder friends. Well, I have Twitch, okay, it's good. So I'm like, oh, do you want to learn a little bit of HTML, CSS? It's really funny. Trigger warning: there are going to be a lot of flashy colors, okay. You know, the little rotations and colors, CSS animations, this is so funny. I love to do that. Okay, it's going away, trigger warning is done. And they always say to me, oh no, no, I don't want to, because I'm not good at math. But even if computer science has a basis in mathematics, it doesn't require a lot of mathematics, depending on the field. And I love this: you'd be surprised by how many of my buddies who were not brilliant at math at all have gone on to study computer science or engineering without ever asking themselves whether they're good at math or not. I love that, I love that. And this is kind of a sad situation, all right. So now, I think we all agree in the room here today that women and men are equally capable of doing mathematics. The stereotypes linked to women in mathematics no longer find people in agreement, and I think we all agree today in saying that. But the fact is that they persist unconsciously in society.
A woman will often feel inferior to her male peers in math because of conditioning and stereotypes that persist. I know that this is not the case for everyone. I felt that way until maybe I was 15, and then I met people who helped me learn math and say, okay, no, I'm good at math and I love it. So personally, I discovered my role model maybe two years ago, and her name is Aurélie Jean. I don't know if you know her here in the room. Okay, so yeah, she's from France and, I never know how to describe what she's doing, all right. She's a computational scientist. I don't know how to explain. She's doing AI, she's a physicist, she's doing a lot of things, and she's really impressive. She wrote a lot of books. She's trying to help people understand AI. And I just fell in love with what she has done, her background, her career. When I read her book... I don't have a translation, but if you want to read it, you should really read her first book. Where is the mic? Over there? Okay. And if you want to know a little bit more about the book, don't hesitate to come after and ask me. I can show you the book, so you can see if you want to buy it or not. And discovering this woman made me think that, okay, even if I was already a programmer, you know, I was already working, I had already done my studies and everything, it made me think that I can do more. Because I wanted to do more, but I was afraid. I was like, what do I have to say? What can I say? I mean, I'm a woman, I'm afraid. It's sad, but I think that this is what I thought unconsciously before. And seeing this woman in the spotlight, in front of people, writing books and being known, gave me the courage, opened the door for me, to say, okay, I can do it too, and I have the right to do it.
So the aim of this conference is to highlight women who have changed the course of IT history and who can inspire young girls today, or women, or everyone really. But I ask those of you who have patiently listened to these stories: when you get home, write down at least two names you discovered today and spread the word. Share the stories of these women with your daughters, with your students, with your friends, with your cousins, your nieces, with the people in the street, your bar mate, I don't know. Show these women to the girls around you. They don't have to become programmers, but you can open their horizons and show that being a girl doesn't have to limit their choices and their dreams. So please: narrate, create and propagate. Thank you. It's a literal translation from French, so if you have a better translation, don't hesitate to tell me. So, to finish my talk... oh no, this is the internet. Oh no, no internet. Come on. Okay, try again. I know you have talks to see, so I hope I'm going to do this fast. Okay, we're going to do it like that. So, nice to meet you, my name is Laura Durieux, a.k.a. DevGirl. I'm a full-stack web developer, WorldSkills Belgium gold medalist in 2020 and 2021. I am a streamer on Twitch, and we code on Twitch, so don't hesitate to come and say hi. I'm also the presenter of the show "On est pas des Yankees" on RTBF, the national media of Belgium. Here you can take a picture and come see me on my social media. The slides are going to be available afterwards. Thank you, if you have questions, don't hesitate. Thank you so much.
Outreachy: 1000 interns
Hello, folks. Good morning, evening, afternoon, wherever you are. Welcome to the Outreachy talk and celebration of 1,000 interns. Before we start, I just want to see a show of hands: has anyone participated as an Outreachy mentor, a coordinator, or an intern before? Woohoo! Thank you for coming. And for folks who haven't heard about Outreachy before, Outreachy is a program that provides internships in free and open source software and open science. Our internships are open to people who are subject to systemic bias or discrimination and impacted by underrepresentation in the technology industry of their country. Outreachy is truly remote, all around the world. Our mentors are remote, our interns are remote; we have interns on all the different livable continents, not Antarctica yet, but maybe soon. And the interns are paid $7,000 total as the internship stipend, for a three-month internship. We run internships twice a year, May to August and December to March. And as of our most recent cohort, December 2023, we have had 1,097 internships. And to celebrate those 1,000 interns, we had a bunch of celebrations. Awesome. Okay, so we celebrated the milestone in six countries. We had celebrations in Cameroon, in India, Nigeria, Kenya, and of course in the US. And these celebrations were awesome because we had past interns, folks who have gone through this program; they were able to organize, they led the celebrations, and they made everybody feel included. Aside from the six countries where we celebrated, we also celebrated virtually. We had three sessions, and it was really awesome. I also want to talk a little bit about our accomplishments. Not only do we have 1,000 interns, we have a 96% internship completion rate. And that's partly because we consider our internships more of a fellowship. We want to make sure that the interns complete the internship.
If they get sick, if they have family issues, we extend the internship. We want to make sure that this is more about them learning about free software and open science than about getting a particular project done. And we not only have this great completion rate, we also retain people in free software. 80% of past interns continue to contribute to free software, and 44% of those interns are employed to contribute to free software as part of their job. So we want to talk a little bit about how we got here. How did we get to 1,000 Outreachy interns? And as we talk about that, you're probably wondering who we are. So let us introduce ourselves. My name is Karen Sandler. I am a co-founder of Outreachy. I'm the executive director of Software Freedom Conservancy, which is the home of Outreachy. I'm from Brazil; I came here on a trip of 11,000 kilometers, it took me a while to get here. I was a past intern, and I'm the current information and process architect of Outreachy. Awesome. And I'm Omotola Eunice Omotayo. I'm from the giant of Africa, Nigeria, and I'm the community manager at Outreachy. Hi, I'm Sage Sharp. I use they/them pronouns, and I'm one of the Outreachy organizers, from the USA. So we're going to go back through Outreachy history. Oh, right. Before that, I'm going to quickly explain why I wanted to help co-found Outreachy. I have a heart condition. I literally have a big heart. I used to think it was very rare, but it's actually quite common. I'm at a high risk of suddenly dying, and so I have a pacemaker-defibrillator implanted in my body. I can't see the source code running in my own body, and I was shocked unnecessarily, actually more than once, while pregnant, because my heart was doing what a normal pregnant woman's heart does, but my defibrillator thought I was in distress. The only way to stop it was to take drugs to slow my heart rate down.
And this made me realize that our technology may not be made for us despite the best intentions, and what are we going to do when that happens? And so I became really passionate about software freedom. As I've lived with this heart condition and participated in free and open source software communities, it has become very clear that our software can never be made for everyone unless it's made by everyone, unless everybody has a chance to contribute. And so this is where I entered: as I found out about my heart condition and started speaking about it, I became the executive director of the GNOME Foundation, where I met a woman named Marina Zhurakhinskaya. So this is a picture of Marina, and this is me, ages ago, presenting an award to her. Marina was a GNOME Shell developer, and she was very involved in the GNOME community. And when the GNOME board evaluated their applications to Google Summer of Code, they noticed that out of 181 applicants, none appeared to be women, and they realized that there was a problem. And so the GNOME board eventually brought Marina in and said, what should we do about this? And Marina wanted to start a program to help address this issue. So she looked back: in 2006, the GNOME board had decided to do a summer outreach program, with a few internships, and it was a one-off thing. It was successful, the interns finished their internships, but none of them continued with the GNOME project, and it was just kind of left behind. So Marina decided to reinvigorate that program. You're probably wondering why she is not on stage now. She's not on stage because she died of breast cancer last year, which is really tragic, but she leaves this amazing legacy of Outreachy that she created, and I'm so excited to be able to tell her story to you.
And so at GUADEC in 2009, there were so few women attendees that the GNOME board and Marina decided that this was the moment to pick this up and create this internship program. Raise your hand if you were at that Desktop Summit in 2009. Nobody! That's great, I'm so excited to tell you about it. It was a really interesting experience. The GNOME board went back with Marina and we decided to launch a new internship program, and Marina very thoughtfully tried to ask: what are all of the ways that women are not participating in free and open source software? Why don't they get started? And she systematically tried to address those issues, connecting interns with mentors and helping them make their first contribution. And so in 2010 we had the first outreach round; this is the beginning of what we consider to be Outreachy. For a while we said the first round, the second round, and then we started using months and years, because saying that you were part of the 13th round or the 15th round didn't make a lot of sense. So we started with that. If we could just go back to that previous slide. If you notice, this program at the time was for women, and so you see we have this logo of a karate lady sticking her foot out, kicking forward. I love this picture, but it's very much how the program started: very, very gendered. It was open to anyone who identified as a woman, and the program had interns, and it was a really amazing cohort. So in 2010 we had eight interns, and you can see all these pictures of the interns at the different GUADECs in the coming years. And so a community was starting to form, and one of the things that Marina did was create meetups so that people could meet each other before a conference, so that you could walk in with the confidence of knowing you had met someone before you entered.
So as the program progressed, the internships continued to be all with GNOME, and I was executive director of the GNOME Foundation, and the internships were so successful. The interns that came through the program were core contributors to GNOME. We had Planet GNOME, and so the interns would be blogging on the planet, and we would see their avatars, and people would come to GUADEC and become so connected, and we realized that this was a program that really needed to expand beyond the GNOME project. And so I started talking with my friend Bradley Kuhn, who was the executive director of Software Freedom Conservancy; he still works with me at Software Freedom Conservancy. And Marina connected with Jessica McKellar of the Twisted project, and Twisted was a Software Freedom Conservancy member project, and so we decided to experiment and see if we could expand the internships beyond GNOME. And we did, and it was hugely successful, and so we went from there and offered it to a lot of other member projects. So now, today, we tend to have 35 to 40 different free software and open science communities participating in each cohort. Yeah, we used to have a slide where we put all of the communities on it, but it just became too difficult to read. So, as Karen mentioned, originally in 2010 our criterion for who could participate in the internships was anyone who identified as a woman, and then in 2013 we decided to expand that to make it more trans and queer inclusive, and we said the internships are open to women, both cis and trans, trans men, and genderqueer people as well.
I think in 2014, or around that time, we also started expanding. Tech companies published a lot of data about their employees, and so we realized that in the United States we were able to expand to people of color who were underrepresented in the US tech industry, and I launched this effort to try to expand Outreachy country by country. I was talking to lawyers in France and lawyers in Australia, and we were starting to figure out a way to expand place by place, and it was a lot of work and very difficult. And, you know, free software is global, and Outreachy participants were always global, the mentors and the interns, and it really didn't make a lot of sense to do it that way. Yeah, so instead of country by country, the internship criterion we have now is anyone who faces underrepresentation, systemic bias, or discrimination in the tech industry of their country. Now, how do we determine that? We've come up with a series of essay questions that we ask applicants: tell us which country you are going to live in during the internship; how are you underrepresented in that country; how has your learning environment been; did you see few role models who looked like you, who represented your identity and background (the last talk spoke about role models); and what systemic bias or discrimination have you faced, both while building your skills and if you were to apply for a job in the tech industry of your country. And over time we found ways to evaluate these essays at a global scale while still allowing people to talk about their experiences at a local level. I love this, because we don't decide what it means to be discriminated against. We don't decide what counts as discrimination. We don't have a list of everyone who is subject to systemic bias. We don't have classes of people.
We let people tell us about their own experiences, because we don't presume to understand every single experience of systemic bias, discrimination, and underrepresentation. So then we get into sort of middle history. Well, can I do one more piece of ancient history? Because it's so exciting here at FOSDEM. I was on this very stage in 2014 when I announced that Outreachy was coming... well, that the Outreach Program for Women was rebranding to Outreachy, because it was no longer just for women, and we also announced that it was coming to Software Freedom Conservancy. The project outgrew the GNOME Foundation. You know, there were still only a handful of GNOME interns, and the rest of the internships were with the Linux kernel and Wikimedia and Mozilla and a ton of other communities. And so the GNOME board and Software Freedom Conservancy and the Outreachy team all got together, and we moved the program over to Software Freedom Conservancy, where it remains today. So I got involved with Outreachy in, was it 2014 or 2015? One of the two. I think 2014. Yeah. As the Linux kernel coordinator. I originally helped find mentors in the Linux kernel, connected them to Outreachy, and got them prepared to help applicants during the contribution period. And then in 2016, I stepped up to become part of the actual Outreachy organizer team and passed the Linux kernel coordinator position off to someone else. So by 2015, we had opened up our program and said, hey, let's write these essays about the discrimination and bias that applicants face. We started having issues with reviewing those, because we started to get thousands and thousands of initial applications, and a lot more communities were involved too.
So in 2017, I sat down with my spouse, Jamie, and he helped me understand a little bit of Django, and we built the Outreachy website, a Django-based site where mentors could sign up, where applicants could sign up, and it really fit the custom needs of our program. So a big shout-out to Django and Python and that wonderful community. And I want to say, this is a reflection of what I talked about earlier, about Marina and how she founded the program. One of the most impressive parts of her legacy is that she built up this program, but then Sage came on board, and she worked with them and was able to transfer that knowledge and create a program that was robust and could exist without her. And so we're here on stage with this project that Marina started with her personal passion, but she thought about how it would continue without her. And so Sage coming on was absolutely essential to maturing the program. Yes, and I would say my role has been: how do we scale? And the next part was that it really needed to be more than just me and Karen at this point. And so we brought on Anna. My story with Outreachy starts in early 2017. I heard about Outreachy from an Outreachy intern working with Mozilla; she gave a lightning talk at a women-in-technology conference in my city. And at the time I had the soul-crushing realization, as a mechanical engineering major, that as a partially sighted person I wouldn't be able to find a job in my state or even my country; I had too many obstacles to face and to overcome. So I applied to the December 2017 cohort and was accepted on my first try. And I had a really good experience in my internship. I had mentors who believed in me, and if you're seeing this, Beno and Johan, thank you. And the community was happy to have me as a member. It was a really transformative experience for someone who has faced ableism all my life.
I had people who believed in me and my potential and didn't question whether I was capable of doing my job. And when you are switching careers through a program like this, you experience something called a liminal moment. You are not the person you were before it started, and you are not yet the person you are about to become. You are in between states. It's disorienting and scary. And you have to find yourself again at the end of the program, and that can be a really difficult task. Interestingly, when I joined Outreachy, Outreachy itself was facing a liminal moment. Things were changing. And we asked ourselves, what is Outreachy exactly? I remember when we created a Zulip server and started connecting with interns by running bi-weekly intern chats about careers in free and open source software, conferences, et cetera. Interns were no longer experiencing their internship in isolation, and they were connecting to each other without depending on proprietary software or proprietary social media. That was when something clicked. What was once more of a liminal online space, where people would just pass through alongside an adjacent community, became a communal space. And with a communal space come coexistence, the need for permanence, and a sense of belonging. With a thriving community come management challenges that were beyond our capacity. At that time, we were just too few. And we published a call for a community manager. And I will say that before we posted the call for a community manager, we tried to scale by improving our documentation. We said, okay, if we can't answer all the applicants' questions, especially with so many, could we scale our documentation? And that worked for a while. But eventually we said, no, we really need an actual person who can help us. Present day, yeah. So we can... we're going back. What would you like to do? We can continue. All right.
So, present day. One of the things is that as we expanded, we really needed to make sure that we could find additional funding. Right. I do want to start by saying Outreachy was originally funded by corporate sponsorship, which was great. I definitely want to give a shout-out to Google, which is the company that sponsored the first rounds and every round since then; it is the only company that has sponsored every single round of Outreachy. Plus, they gave us a lot of help. The program is modeled in part after Google Summer of Code, and the Google staff has always been very supportive and helpful and has given us information and assistance throughout. And I really also want to give a huge shout-out to Red Hat, because Marina worked at Red Hat, and Red Hat contributed her time. It's safe to say that there would be no Outreachy without Red Hat's contributions, early on and continuing in the years after. But nonetheless, while we deeply appreciate our corporate sponsorships, it is very tough on the program to have to continually seek corporate sponsorships and then be responsive to the interests that a lot of companies want to attach to internships they're funding. And so in this period we shifted a lot more to grant funding to supplement the corporate sponsorship. And that was really transformative for the program, because we were able to plan a little more long-term; the Ford Foundation, ARDC and the Chan Zuckerberg Initiative were the foundations that came in. I would like to say, if any of you work at a company that wants to sponsor Outreachy, definitely get in touch. We really can use the support. We also have some individual funding support. And having that mix of funding is really important to be able to have the internships that we want to have.
And honestly, being able to say no, without having to think twice, to a company that wants us to have an internship that's too tightly tied to one company: we're not going to do it. An internship that is not going to be a good experience for an intern: we're not going to do it. With this independent funding, we would have said no before, but now it's even easier. And one of the interesting things that comes with grant funding is that we can decide, hey, there might be some initiatives that really need our support. So one of the things we did in 2020 was start funding humanitarian free software. This is things like Public Lab, which did citizen science, and... Mboalab. Mboalab as well, which is an open science hackerspace doing biomedical research. All peer. Yes. All kinds of interesting things. These are projects that don't necessarily have enough funding on their own to support an intern. But because we were able to get grant funding, we could offer funding for humanitarian open source, and eventually we moved to funding open science as well. So again, citizen science, scientific research: we had Outreachy projects that were actually looking at COVID, trying to estimate hospital capacity during COVID. It was really a proud moment to be able to fund that kind of research. And then in 2022, we had our lovely community manager come in. Okay. So, a little bit about where I was coming from. I have past experience working with marginalized populations, supporting them, especially when it comes to their rights and to them receiving the right support that they need. And I also have past experience empowering people into tech through She Code Africa: coming up with programs, supporting them, and standing in the gap as an intermediary between them and the organization. 
Then, coming into Outreachy as a community manager, I now stand as an intermediary between the Outreachy applicants, the Outreachy community, and the program itself. I managed the Outreachy social media platforms, supporting and responding to Outreachy applicants, and putting out content that helped applicants, people who were interested in what Outreachy is doing, understand what Outreachy stands for. I was also able to come up with coffee chats. So via the Outreachy platforms as well, we were able to have real conversations, real-life conversations, helping Outreachy applicants understand the Outreachy program better, and also bringing in past interns, mentors, and community coordinators to answer the questions that applicants have and to share their experience of the Outreachy program. And I've also been able to create more awareness about the Outreachy program by attending and speaking at various conferences. This has really been awesome. Especially at different conferences, I was able to empower people and tell them about the Outreachy program, and that has created very good awareness about the program. And this, I would say, has created a very good and resounding growth in applications. We have seen big growth in Outreachy applicants, especially from Africa, right? People coming not just to participate but also to give back to Outreachy. As you can see, we had zero interns from Africa in 2010, and as of the December 2023 cohort, we have over 44 African interns. So this means that folks from Africa now understand better that there's a space for them in the open source ecosystem. They are coming into this program to contribute to and improve open source and open science projects, and also to give back to Outreachy and the open source ecosystem in general. I want to say that before, we had a solid program, an amazing program, but you gave it a voice, you gave it faces, the recognition it deserves. Thank you. And I'm grateful for that. Thank you so much. 
I would also like to add that since I joined the Outreachy program, folks, especially the applicants, now understand the different parts of open source that they can contribute to, especially the fact that it's not just about the code. They don't have to come into open source to be a programmer; they can come in to contribute and give back in various areas: documentation, even event planning, community management, and so on and so forth. And now over to the next Outreachy organizer. Yeah, we talked about the sense of belonging that comes with finding a community. Another thing that comes up often is this desire to give back. You were offered a great opportunity; you want others to have access to opportunities similar to the one that happened to you. This is why I joined the Outreachy team back in 2018. We found that many interns come back as mentors, some as new mentors, some as experienced mentors. Either way, challenging situations require extensive support, and we decided that we needed someone dedicated to supporting and advocating for our mentors. Yes. After Omotola's outstanding year of supporting applicants and interns, we hired Tilda Udufo. She is someone who has extensive experience with the program: she was an intern, she was a mentor, she was a coordinator, all of it, for Public Lab. And I'm proud to say that in turn I've become her mentor since she joined the team. She's been facilitating conversations with mentors in office hours, interviewing them so we can highlight their work, and working hard at facilitating relationships between mentors and interns. And I think all of it is an indication of a phase of maturity within the program. We are not just looking to always grow; we are looking to grow sustainably and keep our community flourishing. And I would also love to add that Sage and Karen have mentioned how Outreachy has grown, I mean the background of Outreachy and the growth so far. 
And with this we can also point out how Outreachy has grown, not just in terms of why we should have Outreachy, but in how we better support the applicants: I came in as a community manager, right? And we also have Tilda. So Outreachy is not just supporting applicants, we are also supporting mentors, because we understand that the program is not just about interns coming to contribute to open source; the program is also about people staying in open source and working together to give back to the open source ecosystem. Yes, this is about open source sustainability as a whole: our ability to continue to exist as a community, supporting contributions and making sure that the software still exists and is still maintained. And I would say that, you know, you can look at the number of people who find jobs contributing to free and open source software and the number of people who continue to contribute, but no matter where our interns go after that, they always take the values of software freedom with them. They're exposed to software freedom, they take those values with them, and there's a follow-on effect from these internships. And I would say our interns have won awards, they've joined boards of directors, they've been mentors, grand-mentors, great-grand-mentors, and we see graduates of Outreachy everywhere. All right, so then the question becomes: what's next for Outreachy? What is the future of Outreachy? And the future of Outreachy, maybe it's you. Maybe you would like to mentor, maybe you would like to coordinate. If you'd like to know more about Outreachy, you can come and ask questions, but there's also a BoF in AW1.121 at 1300, or 1pm, and if you'd like to come talk with us and figure out how to get involved, we would love to hear from you, we would love to hear what you're doing in free software, so come connect with us. 
If you're interested in signing up as a free software community, the deadline to sign up as a community is February 15th, so please do check out our call for mentors and communities. This is a celebration, you know; we're celebrating the fact that we got to this point, and we can only do it with you, really. We are actually gated, generally, by the number of mentors that we can find. We do push for funding, of course, but realistically, most of the time it's finding enough mentors to provide those internships. And so, you know, really that's all of you who are here. Actually, raise your hand if you're here and it's your first FOSDEM. Wow, so it's like a third of the room, that's great. So yeah, I think one of the things that I'm most proud about with Outreachy is that it's a real grassroots program. It's something that we started by offering something really pragmatic: just offer internships, have that work, pair interns with mentors and have them learn, and then we've just been growing it slowly. I remember when we started back in the day, and I was a new executive director, there were a lot of diversity initiatives coming up at the time. It was very fashionable to start diversity initiatives, and there were programs getting millions of dollars based on glossy brochures they had made, having not done anything in the past. But we founded Outreachy with a different mentality, with the bottom-up free and open source software mentality of: we're going to do the work, and then if people find it valuable, the resources, the time and the money, will come after. Outreachy is our thousands of volunteers, and I'm proud that it is itself a free software project. And also, we want to tell folks who are listening to us that you can support Outreachy in several ways. 
You can go back to your local communities to tell the story of Outreachy, to become an advocate for Outreachy. Tell folks who could be part of Outreachy as an intern to apply to the Outreachy program. You can also contribute to Outreachy through your various communities and projects by bringing your projects, I mean your community; you can be a mentor, you can come in as a community coordinator, and you can also support Outreachy by going back and creating more awareness about the Outreachy program. So tell folks about Outreachy, tell your communities about Outreachy, bring your projects to Outreachy, and you can also partner with Outreachy in various ways. You can reach out to us and connect with us, right. You can connect with us later today to ask us questions and discuss the several ways you can contribute to Outreachy. Additionally, you may not have the capacity to work as a mentor, but you may have the capacity to review pull requests, to review contributions made by the applicants. Communities need that so much; they get so overwhelmed with our applicants, and it would be a great help if we could help them. Yeah, and even if you have experience with any particular community that's involved with Outreachy, going in to help out and answer questions in the community chat is a great way to help those communities. Questions? No, we're going to the thank-yous, because there's a lot here. Yeah, so I don't think we want to read all the thank-yous. No, we're not going to. I want to highlight a few people, though. We've already talked about the organizers and the reviewers and all of our volunteers. We always joke that Outreachy is like a python swallowing a goat: there is so much logistical work to be done to manage Outreachy. It is huge, and so we want to thank the Conservancy accounting, logistical, and financial staff, including Bradley and Rosanne. 
And also... they're amazing. And I also really want to thank Rosanna, who is on the... it says GNOME Board. Oh, and the GNOME Board. Right, Rosanna, who did that logistical work at GNOME and helped launch the program. We want to thank the GNOME Board, because there were times when running a program like this was difficult. It's a lot. Yeah, and we've had our times where there's been misinformation and people attacking the work that we do, going beyond that to calling us names and threatening us. It's been really stressful, and the GNOME Board spent a lot of time making sure that they were defending the program and supporting it. And then I also want to applaud them for realizing that it had outgrown the GNOME community and that it made sense to move it to another organization. The Outreachy leadership of the past: Cindy Pallares, and also Tony Sebro, who is now on SFC's board and was our general counsel and is still involved with Outreachy. Justin Colannino, who has given us pro bono legal help actually from Outreachy's inception and has been supporting us with legal work. Ropes & Gray, who give us pro bono legal work, Otter Tech, and also Owen Taylor and Jamie Sharp. I did read most of them, I'm sorry. But they deserve it. All right, so... we can take some questions if anyone has any questions. We don't have all the microphones, so we have to share one here. It is so hard to hear in this room, so you have to speak really loud. Okay, first of all, a huge big thank you. It's hard to overstate the value of what you do. And because it is so valuable, my question is: in the end you kind of dodged the topic a bit about the future. So my question would be, since it's so valuable, how can you transcend from an organization that depends, like so many others, on the efforts of some individuals for survival, into something that is actually hard to stop, that has a life of its own, that you couldn't shut down if you wanted to? Is that for me? Yeah, that's you. 
I mean, they're pointing at me to answer that question, because, you know, I'm the executive director, so I have to be the visionary of this program and give that voice. But I do have an answer, after you. But Sage will have the answer. No, I mean, I think that the whole point of it being a grassroots, free and open source software project is that we grow sustainably, we grow slowly, and we grow carefully. We bring stability. We've been working for the last five years on redundancy, making sure that we have a team that isn't going to completely burn out. It's so much work to do this program. I don't know how Marina did all of those logistics. She basically did them herself for a really long time, and she maintained all of these wiki pages where she wrote down the names of... she just stayed in touch with every Outreachy intern and wrote down where they went to work because she ran into them in, you know, a hallway. So what we've done, with Anna's help and Sage's help, and now Omotola's, is to make that a lot more systematic. So we've got robustness, so that if any one of us is no longer part of the program, it has a life of its own. I think, too, to bring in some of the values of free software: what we have done in Outreachy is we have talked to different communities and learned what the best practices are for being inclusive, for onboarding new members, for designing projects for interns, and we've documented that. So if you look at the Outreachy documentation for mentors and communities, there's a lot of knowledge there that we have learned that was siloed across different communities. And so even if Outreachy goes away, I think we still have an impact on those communities. Our documentation, our knowledge sharing, the lessons we've learned will live on. 
And so I think in the future we'd like to be a little more vocal about why we design the program in specific ways and how to be more inclusive, and coach our communities on that. And I think, too, that grant funding, and companies that want to fund the Outreachy general fund rather than specific internships, is going to be the way forward. So we keep pushing our sponsors towards that, hoping that they'll allow us to make sure that our team continues and also that we can decide which communities have the strongest interns and allocate funding that way. We have a dream of having an open mentorship alliance with other mentorship programs. We know we are not the only ones, and there are many, many more that do things differently, but they are as important and as fundamental to the open source ecosystem. I would also say that, historically, we have improved something about Outreachy every single round. The idea, the whole point of free and open source software, is that no one and nothing is perfect, right? And so we've been changing something every round. If you have feedback for us about things that could be made better, we would love to hear it, because we're looking at it ourselves, and so we expect to change and improve. I was also going to add to that, but it's also really nice to see all four of you on stage, and the diversity of the organizer group is also really, I think, a special part of this. But my question is actually building on what you, Anna, said about the mentoring side. So I'm definitely seeing a challenge in a lot of open source communities and projects around that mentoring side: in general, how do you do mentoring, and how do you scale mentoring in a community? So my question is, from your perspective, doing all of this and working so closely with projects and with mentors, what are the greatest challenges that you see around mentoring and mentorship in open source? And do you have all the answers? 
No? Do you have any ideas or tips about what you think the open source movement needs today to grow and scale mentoring? I can think of some, like cultural differences: the way you talk to someone in Brazil may be different from the way someone talks to another person in the United States or in Nigeria. So reconciling those differences when you are doing asynchronous communication, for example, can create a lot of conflict for some. Another one is safeguarding. This is something that some mentors have told me, especially when we work with more marginalized communities: it can be challenging to ensure that everyone is in a safe environment. We had some folks who had really challenging lives at home and needed safeguarding. That can be a difficult situation, both for the mentor and the intern. So having psychological support for both of them is important, and also challenging. And I think, as well, that for mentors, having a pathway to mentorship is important. A lot of people assume, hey, I have to have been a maintainer for five years to be a mentor. So we're finding ways to define a path towards mentorship that doesn't feel like you have to be an absolute expert. One of the things we've been trying to do in our Outreachy chats is talk about what mentoring means, how you get to be a mentor, and emphasize that you don't have to know everything. And so with Outreachy, what we've done is encourage people to co-mentor. So you've got someone who is more experienced in the project and someone who has just been an intern, either with Outreachy or Google Summer of Code, and they shadow the mentor. They start helping out. We're training mentors. So figuring out how to create those pathways is going to help. It's difficult to aspire to be something if you don't even know that being that something is a possibility. 
And also to add to what Sage just mentioned: the Outreachy organizers, especially on the mentor support side, have been able to come up with different initiatives to support mentors from the contribution stage onward. We understand it's not easy, from the contribution stage to giving feedback at every point during the internship, so we run our mentor office-hour sessions. Because we want to understand the various challenges that mentors face, we want to be able to support them. Sometimes we also want mentors to come together, through our coffee chats with mentors and our office-hour sessions, so that mentors from different communities can discuss with one another, state the challenges they are facing, and learn from a mentor in another community how they've been able to address those challenges. That way, mentors learn from one another. I would also say, you commented on our diversity as organizers, but one of the strengths of the program is recognizing that the burden of bringing diversity to free and open source software shouldn't rest on the people who are underrepresented. And so mentorship is a great way to be an ally, right? It's a great way to shift that burden. And I think that's one of the strengths of the program over time: it's a great way to get folks who are not subject to systemic bias, who are not underrepresented, to help bring people up. I think... do we have time for one more question? Or are we? We're done. We're out of time. Thank you so much for joining us. Thanks for supporting Outreachy. Have a great FOSDEM. Thank you.
How to Chart your own Career Path in Open Source - Panel Discussion
Okay. Can people hear me okay? We're good. All right. Thanks. So I guess we can officially say good afternoon. Thanks for coming. My name is Ray Paik. I'll let the panelists introduce themselves in a few minutes. It's a little weird because we have to be here for the camera, but we'll make it work. So I'm a community manager at PingCAP. If you're not familiar with PingCAP, we're the company behind the open source database TiDB. And if you're part of CNCF, you may have heard about a couple of other projects that we donated to the foundation: the first one is Chaos Mesh, and the other one is TiKV, which is our key-value database. I've been at PingCAP since April of last year, and I started my career in open source community management about 10 years ago when I joined the Linux Foundation. I was there for about four years. Then I had community manager roles at GitLab and Cube Dev before I ended up at PingCAP. I don't know how you all felt about 2023. 2023 felt somewhat difficult, especially on the job front for people in open source. I myself was laid off, but I was fortunate: I was wrapping up my interview process at PingCAP, so I think I accepted my offer a week or two after I was given notice. So I was fine, but then you had this constant drumbeat of negative news. It seemed like companies that we thought were at the forefront of open source were making significant cuts to their community teams, open source program offices just completely being obliterated. For a long time last year I couldn't tell whether this was just another boom-and-bust cycle in the high-tech industry or whether something more fundamental was going on. I did a lot of thinking about open source careers, and so I decided to propose this panel. I'm glad to have a wonderful panel this year. I'll let the panelists introduce themselves. Ildikó, do you want to start? Yes, I hope you can hear me too. So thanks. My name is Ildikó Váncsa. 
I work for the Open Infrastructure Foundation as Director of Community. The Open Infra Foundation is an open source foundation that hosts and helps support open source software development communities in the software infrastructure space; OpenStack, Kata Containers, and StarlingX are all examples of the projects that we have. I joined the foundation seven and a half years ago, so it's already a record in terms of the longest employment of my life. So you can tell that I like working here. Before the foundation, I used to work for Ericsson, which is a large telecom vendor company, so a very different environment. However, that's where I got in touch with open source. I started to contribute to the OpenStack project, and my first experience was so wonderful that I just couldn't stop afterwards. I became a really big open source advocate, and open source became a fundamental part of my life, to the level that now my full-time job is all about open source and working with communities and the ecosystem, and with anyone who would like to get involved, or maybe doesn't know yet that they would like to get involved; that's where I come in and convince them that it's the best idea they will have in their lives. So yeah, that's me in a nutshell. Okay, I had technical difficulties. So, I'm Dawn Foster. I am the Director of Data Science and a Governing Board member for the CHAOSS Project. I'm also on the board of an organization called OpenUK. I live just outside of London. And I'm also a co-chair of the CNCF Contributor Strategy technical advisory group. So I tend to wear a few different hats. I got my start... well, I came out of university with a computer science degree in the mid-90s, and I somehow managed to luck my way into a Unix system administration job as my very first job out of university. Back then I worked for a manufacturing company, and manufacturing companies do not like to spend money on software. 
So I used a lot of open source software just in the nature of being a system administrator. Then, fast forward a couple of years, I was at Intel around 2000-2001, and they needed someone to look at which open source projects were going to be strategic for them over the next number of years: which ones should we be engaged in, which ones should we be working with. I was working at the time mostly in the Linux developer tools space, things like compilers and IDEs. So that was my first role that was more focused on open source. And then over the years I managed to somehow turn that into a full-time thing, where I was a community manager at a few different companies. I've done lots of different things in my open source career over the years. Most recently, before CHAOSS, I was at VMware as their Director of Open Source Community Strategy. So I've done little bits of things in open source over the years. Allison Randal. So, I also started my career in open source in the 90s, or in free software, since we didn't have the name open source yet then. I was working at a startup, an online bookseller that just happened to use Perl as their development language. I'd used Perl a little bit before for linguistic research, but that was when I really got into it. Within a year, I was teaching Perl at the local Linux user group. And then I got sucked into Perl design work by the development team, and then I got sucked into being project manager and the president of the foundation, and it kind of went from there. So I've been involved in a lot of different projects, but you'll know some of them: Debian, Ubuntu, OpenStack. I'm currently chair of the board of Software Freedom Conservancy, on the board of the Open Infrastructure Foundation, and also on the board of Open Usage Commons. Cool. 
So yeah, I was really excited about this panel because we bring a lot of different backgrounds and different ways of getting introduced to open source. So I guess I'll ask this question to you, Allison: after 20-plus years, what motivates you to keep staying in open source? What are the things that you enjoy the most about open source communities? I mean, for me, hindsight is clear. It really has always come down to the people and the things that we built together. That's partly the software that we build together; we've built some really amazing tech. But it's also the communities we've built and the styles of collaboration we've invented, and, you know, the legal structures that supported those very different ways of working together. And that's what really stands the test of time. You can get distracted by the politics and all of that; that's not really what matters. What matters is the people and what you're building. Anything else you want to add? Yeah, plus one to the people. It's been an amazing career, right? I've met people and I know people all over the world, and I can go almost anywhere and find someone that I've worked with on a project somewhere to sit down and have a coffee with, no matter where I am. So yeah, I've just met so many amazing, wonderful people. I can also plus-one that notion. And the other thing is that when it comes to open source, the majority of the people are there because they are interested in the project and the technology; they share the goals, they work on something that's in their common interest. So you find people who are enthusiastic about what they do, and it is a great environment to be in and to be part of. And by knowing people all around the globe, you learn a lot about cultures, and you just have access to so much knowledge that we share with each other on a daily basis. 
And you get so many different points of view that it's just very hard to match in any corporate environment, in my experience. So the flip side of that question, because we talked about all the positives and what we enjoy the most: are there examples of times when you wondered to yourself, what am I doing with my life, and maybe this isn't for me? I mean, maybe it doesn't have to be that dramatic, but anything you want to share? I mean, I do like what I do today, and that's why I keep doing it. There are ups and downs no matter what you do. When it comes to open source, back in my, let's say, corporate days, I think it would have been better if I had spent a little more time understanding corporate politics and navigating how open source can fit into a product development environment, and figuring out how to work with our managers to also help them understand. Because there are a lot of examples where you're a developer, you're working on the code, you know what you're doing, you know why you're doing it, you're enthusiastic about it. But there are so many other people in the company who are trying to make sure that there's a product schedule, that the customer is happy, that the company makes revenue, because otherwise we are all in big trouble. So there are a lot of moving pieces. To you, who are actively participating in an open source community, it's crystal clear what's happening. But someone like a program manager, who is trying, again, to make sure the product is on track, doesn't have that experience; they just see something from the outside. So helping them understand how these communities work, and what you need to do to be effective in that community and also be effective in the company where you're working, that can be an interesting balance and an interesting challenge. 
And when I was very new to it, I think I stumbled into a few mistakes that I would do differently today. Cool. Anything else you want to add, or we can move on? Well, you mentioned balance, and I think one of the challenges that you hear at a lot of different events, including here at FOSDEM, is that people talk a lot about work-life balance, or trying to maintain balance in general. Maybe, Dawn, I'll ask this question to you, because before you came on board at CHAOSS full time, you were at VMware. So you were actively involved in Kubernetes and other communities, but that wasn't 100% of your job; you had responsibilities as a VMware employee. How difficult is that balancing act, trying to be a good open source citizen but also trying to be a good employee? Yeah, that can be... okay, so that can be a real challenge. You know, on the one hand, I was just super lucky, right? Because my managers at VMware were really supportive of the work that I did. At the time I was contributing to Kubernetes and to the CHAOSS project and a few other things, and they were very supportive of me spending that time. But I also take the approach, and I didn't always do this, I've burnt out a couple of times in tech, like many people have, where I tried to do all the things. Now I'm super protective of my personal time. I work a set number of hours, and then when I'm done, I'm done. And the only way I can do that is by being really brutal about prioritization and just saying no to the things that aren't that important, so that I can focus on the things that are, whether they're the things I'm working on in open source communities or parts of my, at the time, real day job. And I'm sort of lucky now; I will admit right now I have my dream job. 
So the data piece, the open source metrics with CHAOSS, has always been my passion project. So being able to do that full time has been pretty great. I would like to applaud Dawn for being able to do that prioritization, that you made the decision and you're sticking to it, because I suck at it. I am also in my dream job. But to me, that did not help with not spending too much time on it. And I think when it comes to open source and also what we are doing right now, especially after COVID, so many of us are working from home. And to me, just the working-from-home setup, whether it's open source or not open source work, I like to be enthusiastic about whatever I do. So that setup, to me, makes it already really hard to find a balance, because like the left corner of the table is the work corner and the right is where I have my personal time when I eat lunch. That just doesn't really work well for me. And Dawn also mentioned burnout; that is something that probably most of us who are enthusiastic to an extreme level will experience at least once. So I share all the challenges, and I can only recommend that once you experience burnout once, you do have a choice from that point, because you have the full end-to-end experience. You know what the signs are that lead to burnout. So you do have the choice, when you are seeing the signs next time, to stop, to know that, okay, I'm not going forward like this anymore, because I know where it leads. So you do have the tools with the experience that you're gaining, even if you don't find the right balance right at the beginning. Cool. There are just so many interesting things to do in open source. It's hard to choose one or two. No, like Dawn said, I'm all for setting boundaries. I mean, I work with a lot of colleagues in China, and I'm in the Pacific time zone. And between five and seven, it's really difficult to say no to a quick call.
But I think most of my colleagues now know that between six to 8pm, that's family time, I need to have dinner with them. And they understand. If you have the right corporate culture, that works, but it's really hard to do sometimes. So, go ahead. Just a quick note: I just started to work with a new community, and they are very active in Europe and Asia Pacific. I'm in the US on West Coast time. And I work with two communities in total, and the other community is very North America centric, with a few people in Asia Pacific. So I have all three major time zone regions to cover. So I'm currently in the process of trying to find a new balance, because I can work 24/7 so easily, because there's always someone awake who's very active in the community that I'm working with, who I could talk to, whose challenge I could solve. And it can be very hard when it comes to the time zone challenges, especially if you're really working with global communities. So, when I first opened our talk, we talked about the job market last year. But for people that are looking for jobs in open source, is there any advice any of you would like to share in terms of, you know, first of all, finding the interesting openings that you might want to pursue, interviewing tips, et cetera, finding the right culture? Okay, I can start. I would say that my biggest piece of advice when you're looking for work is to use your network. I think in my entire career, I have only ever had one job that I got from applying through the traditional channels. Every other job I've ever gotten has been because of someone I knew. And in a lot of cases, these were people that I knew through my work in open source, through these open source communities.
So when you're looking for work, just spend some time talking to some of the people that work in the communities that you're interested in, and who work at companies or organizations that you might want to work at, and talk to them. Ask them what it's like at that company and see if it might be a good fit for you, ask them what kind of job openings they have, and just talk to people and get other people's suggestions. Because once you talk to enough people, they will generally know of other people that you can talk to that maybe you weren't already connected to. So don't be shy about talking to the people that you know and asking them their advice and what it's like where they're working. Yeah, I think that when it comes to open source, you're operating in a public environment; whatever you do is public. So you can also point to things that you've done. It's much easier to build a resume as well if you're active in open source. So it's the connections and also the work that you've already done. And another thing that kind of connects back to early mistakes, I think that was the first question, or along those lines: building connections really is truly important. When you're attending an event, you can prioritize listening to talks. But I would challenge you and say that if you're not interested in talking to the speaker after the talk, or talking to people in the room who are interested in the same topic, then is that really the best session you could choose in that particular time slot? Because you can always have access to the content later. Many conferences are recording presentations, and even if they don't, the information is out there floating on the internet one way or the other. But the person isn't. And the in-person connection is invaluable.
I have a lot of experience, you know, jumping into new communities, and you do that on the online channels first. But whenever you get the opportunity to actually talk to a few people in person, the online interaction just becomes completely different, way more efficient and usually a much more pleasant experience. And then those connections could also be the ones that land you a new job, because those people know you, they trust you, and they can give a recommendation at the company where they work: hey, there's this person, we've been working together in this community and they are so amazing, and they're looking for a job, or maybe they are not looking for a job, but we should get them anyway. So that's a great way to go. I would add: keep in mind that there's not just one way to have a job in open source. Pretty much any job these days that's related to software is going to be related to open source. So in my career, I've often switched between doing all my open source development as a volunteer while doing paid work, like running an open source conference or managing an open source foundation, and doing it the other way around, where my paid work was open source development, and then as a volunteer I was serving as a board member or, you know, a community manager or something like that in an open source project. So don't be afraid to mix things up, and yeah, find a way to get paid but also find a way to live your passions. Cool. So somewhat related to that, I guess: you've done lots of hiring over the years for open source roles. When you interview candidates, what do you typically look for? I can start. Obviously the skills that you need for a particular job depend on the job. But when it comes to open source, interacting with people and being a team player is kind of a requirement. It doesn't mean that you have to be an extrovert. I'm an introvert.
I know so many people in open source who are total introverts. But since we are all so passionate about what we are doing, that is not a barrier for us to participate. So the willingness to interact with people, even if you're not fully comfortable with the public environment yet, the willingness to be out there and to do so, that is very important, because you will need to interact with people from all over the place. And if you're quiet and shy and you don't want to be out there, then it is very hard to be successful in open source, in my experience. So that's definitely up on the list. Do you want to take the mic? I would say that I generally look for someone who has enough of the skills that we're looking for that they can probably do the job, knowing that there will be pieces of the work that they'll need help with later. So one of the things I will caution you about is that job descriptions on the website are wish lists. They are not requirements. I have never, in all of my years, had every single thing listed on a job description as a skill. And they still gave me the job. And I still, I guess, seem to be successful. So don't look at those as a list of requirements; look at those as a list of things that they would like that person to have, because they're not going to get that unicorn. They're not going to get that person with every single one of those skills. They're going to get somebody who has enough of those skills to do the job, and then they're going to train them on the rest of it. So make sure that you go ahead and apply for stuff, even if you don't think that you have everything, because in a lot of cases they'll be willing to take a chance on you and train you up on some of the other bits. And also, if they see that you're passionate about that particular job and you have an idea of why you would be the best person to do that job, that usually gets you through the interviews as well.
And if you don't have a skill that's listed, then they will more likely overlook that, because you're someone who's already in the mindset that you're ready for that job. So I totally agree with that observation. I don't think I ever checked all the boxes either. I think that's impossible. Most of the time the job description is also written in a way to just be a little bit scary. I assume they are trying to limit how many people submit applications, just because the job description looks like you need 200 years of work experience before you apply. But really, most of the things you will be able to learn, and don't be afraid to learn. And if you're also open about that, at least to me that was always appealing, when a person is honest about: okay, this I don't know yet, but I can learn it. And in most tech jobs, you will never stop learning. So if you have that ambition that will take you from A to B and then from B to C, and you're able to grow, that is always very appealing. Again, the ability to grow, that is another thing that at least I personally look for, to see that the person will be able to grow into the job that they are applying for. But then they will also be able to grow further, out of that job, and do something else in the company. Really, a job where you already know everything before you start is super boring. So look for the jobs where you will learn something really interesting, and that will lead you on to other jobs where you learn even more interesting things. Yeah, I mean, we don't mean to harp on job descriptions too much, but the other thing I want to add is, by the time you accept the position and start, it could have been three, four, five months since that job description was originally written. So think about that: after a quarter, things change, the market's changed, there was a reorganization of the company.
So, I mean, that's why I try not to take the job description as gospel, although it's very tempting, because you want to check as many of the boxes as you can. But it's just a guideline. It's an educated guess as to what the new person might be doing, but it's still a guess. So, sorry, I'm going through the list here. I think what I've seen some people do is, you start in open source, but then you step back, you take a different role, a non-open-source role. I think some of you have done that in your career. Can you talk about that experience and why, I don't know if you were forced to do that, or why did you just step away, and what was it like coming back into open source? I've done it multiple times. I mean, three decades is a long time. Some of it was layoffs, you know, it happens, but more often there are reasons like family health, there are reasons like kids, there are reasons like I took a break to do a PhD. There are all kinds of good reasons to take a break, but another one is to avoid burnout. So if you think you have to stay in the one project forever, you will work yourself and work yourself and work yourself. But you can recognize it's totally okay to just go away for a couple of years and either come back to that project or a totally different project that excites you two years from now. It feels less devastating to step aside from a project, and you can do a well-planned, orderly handoff instead of a flame-out burnout. If you push yourself all the way to flame-out burnout, chances are you will never work in open source again, because you burned yourself too far down. You can come back, but it's much, much harder than if you just recognize the signs and say, oh, you know, I should really take a couple of years off and do something else. And then you come back revitalized. So, yeah, I highly recommend taking a break from time to time. It's a really good idea. Cool.
Yeah, I've also had the occasional detour. I had one where, at the company that I worked for, the politics around open source internally just got to be too much for me. And so I spent six months working in, like, a market research department or something, something kind of random that I thought was a little bit interesting. But, you know, the other way I think you can help with burnout, and just kind of do something new, is that I've worked in loads of different open source communities. CHAOSS is probably the one that I've worked in the longest, because I've been working with these tools since before the organization existed. But in a lot of cases, I've worked in kind of a series of open source projects based, frankly, on what the company that I was working for was particularly interested in. But the thing that I found was that every time I switched from one open source community to another, there was at least one person that I knew from a previous open source community. So even when you kind of bounce from community to community, there are usually other people that you know from previous lives in other communities. I do not have that experience yet. I'm still burning myself out to learn the lesson. But to Dawn's point, I did see people moving from one employer to another, but still working in the same open source project. And also then popping up in another community, and like, oh, hi. You're here too. Cool. And it kind of shows you that the world gets a little bit smaller if you keep being involved in open source, and you just know people. And the connections that you make are more likely to stay with you longer than in a corporate environment where you're just jumping companies. And that's a really nice experience.
And I assume that even when you're taking a break and you come back, you see some familiar faces, but the project is new; that's kind of a nice mixture of I'm doing something new, but I don't have to make all new buddies to get from A to B. So yeah. Cool. So, I mean, I think earlier we were talking about regrets or mistakes. The one I made personally was, I was working at Intel, we got reorged, and then I had to stop working on open source, which was devastating. And I think the mistake I made was I just spent time sulking and being depressed. And I mean, that's fine. But what I should have done is be more productive and try to get engaged in the open source community somehow, find a different project, show up to meetups, et cetera, rather than feeling sorry for myself. So I think getting reorged and then maybe getting laid off are two good examples. If you're forced away from open source, what advice do you have? You know, maybe just stay engaged in the community somehow. I think nowadays it's easier. But what kind of approaches do you take to find a new community to join, or how do you keep up to date on what's happening out there? I have an example that's slightly connected to the question, so I will share it. I used to do trainings, like how-to-contribute-to-OpenStack trainings, one or two days prior to big events that a lot of people traveled to. And I met a lot of different people with very different motivations for why they were at that training. Some of them were just there because it was free and they were already there and it seemed interesting. And a lot of people kept asking: they do the training, they learn about the tools, the processes in the community, so what should they do next, what should they work on? And I always kept asking back: what are you interested in? Because I can point you to the low-hanging-fruit bugs. That's easy.
But once you fix the bug and then you fix another one, then by the third one it's like, why am I doing this in the first place, if you're not interested in the particular technology or you don't have a motivation to be at that particular place? So I would say: don't ever let anyone else tell you what you should do. Go where you feel passionate, where you learn something new, where you're interested in the technology. And if you get involved, then you will have the connections, and then the job will come around as well, if you want a paid job that also works with that particular technology. So I would say, make sure that you prioritize your interests and invest in yourself through that. I think it also partly goes back to the people. Multiple times we've talked about open source contributors moving from project to project. I was working at Canonical, and when I left Canonical, a lot of other people who had also been at Canonical were working on the OpenStack project. And I thought, that's interesting. What's that all about? And that is how I got involved in OpenStack. It was just by talking to other people and seeing what they were interested in now, and kind of keeping those connections. So your network in open source can be really, really valuable in staying connected and finding out where the new things are and where you might want to keep working. Cool. Okay. I think we have about 13 minutes left. So I'll ask one more question and then leave the last 10 minutes for the audience. And I'm not going to hold any of you to this. I think, as I said earlier when I started, I felt pretty depressed for large parts of last year, because I wasn't sure if this shift is unique or we're just dealing with another pendulum swing. So open source careers, in general: what's your outlook? I mean, to be honest, I think I'm a little bit more optimistic now than I was in the middle of last year. The middle of last year just seemed daunting.
And it was just devastating to see a lot of my friends get laid off. But what are your thoughts on where things are headed? Or are we dealing with more of the same? Okay, I can go first. So, as I work for an open source foundation that is a nonprofit organization, and I work with a lot of communities, I do see the effect still of where the economy is right now. However, at the same time, even just in the past two days at the co-located events, people were throwing out numbers, like, if we didn't have open source, it would be four-point-something billion dollars to rebuild what we would lose, and there are trillions of dollars of demand driven by open source software. So those kinds of numbers show that open source will not go away. So even if the economy is restructuring itself, companies will restructure themselves too. And I don't think that anyone really has a choice of not using open source software anymore. The software also needs to be maintained, because otherwise you're not able to use it. Security is a high-priority item in every single conversation that I've been participating in in the past few months, and maybe it's getting up to years now. So there's a lot to do in open source. It is also a model that is very sustainable, if it's done right, in terms of investment. So I think we will bounce back overall. And I think that the job market will have a lot of opportunities that are more directly focused on open source. And, I think Alison mentioned that there isn't really a job that has nothing to do with open source anymore; it's just maybe not called out directly. But I'm optimistic. Yeah, I'm also optimistic. I do think that the pendulum has swung too far in the cutting of jobs. And in particular, I think some of the open source groups have been particularly badly hit. But I think it's not going to take companies long to realize that somebody has to do that work on the projects that they depend on. And so, you know, I work a lot with CNCF projects.
Most of them are understaffed, and they don't have enough resources to maintain the software over the long term. Companies have pulled people off of some of those projects, so they need more contributors coming back. And for so many companies, their whole product line relies on a lot of these projects. So I think they're going to quickly realize that for their new features, their bug fixes, the things they're going to need in the software, they're going to have to resource some of that. But the other trend that I find particularly promising as well is some of the alternative funding sources. So you look at groups like the Sovereign Tech Fund out of Germany, who are funding core infrastructure projects. You look at things like GitHub Sponsors. You look at a lot of these other groups that have started funding individual projects, individual developers. And so I think that's also an interesting trend from a career and a job standpoint for open source. I don't know. From personal experience, I was laid off last year, and I haven't looked too hard, because I was having fun working full time on my volunteer open source projects. But I did see that towards the end of the year, it was a lot of, oh, this year it's totally blocked off. And at the beginning of the new year, it was like, we're hiring, we're hiring, we're hiring, we have a lot of spaces to fill. So if you think it was a difficult year last year, look again, because things are changing now. Cool. So I think cautiously optimistic is the phrase I like to borrow from the economists. So I think we can open up to audience questions. I don't know if we have a microphone for the audience, or I can just bring one. Thanks. So I have a question for Ildiko. You mentioned earlier that when you are at events like this, you have to take advantage of getting to know people and interacting. So I'm also an introvert.
How do you get past the barrier of talking to strangers at an event like this? Not an easy question, I know, but... Excellent question. I can only share my personal experience. To me, if I'm passionate about something, that will push me through the first few seconds of awkward experience. The other thing is that what I found is I have days when I just wake up and I'm feeling more social. And there are days when, whatever I do, I could write a script for myself before I walk up to a person who I don't know yet, and I would still be totally awkward. And I learned to say that it's okay. I have days like this, and it is okay. I also started to be a bit more open about this, sometimes just telling the person: you know, I'm socially awkward sometimes. I'm not ashamed about it. So I'm not afraid of putting it out on the table. And many times the other person can relate: yeah, well, it's not easy for me either. There are just so many examples I have where I said something like this and all of a sudden that was the icebreaker, and the other person is also like, oh yeah, it is hard for me too. And then we have something to talk about. So it is hard. I know that it will drain me after a conference like this. I need a few days to recover. My mother also knows that she should not call me for two, three days, because I will not necessarily be a pleasant experience on the phone. But yeah, you learn how the social interaction affects you, and then you will also learn how to navigate yourself. So I can only encourage people to get through the first few awkward experiences and then build on what you learned about yourself. Yeah, I mean, just to build on that, talking to strangers is hard, right?
And I share some of Ildiko's experience; I tend to be a little socially awkward too. But what helps me is to talk to people in more social situations. So, you know, you're in line for a coffee, or you're at one of the after-parties or something where it's a little bit more social. And I just sort of have ways of coping with it. My question for people is always, you know, are you enjoying the conference? And building on that: oh, what was your favorite thing you saw today, or what are you looking forward to tomorrow? So you're talking about the conference, and even if this person isn't all that interesting to you or working on the same kinds of things, maybe you learn something about what they found at the conference and what looks interesting to them. And it can be a good icebreaker. And then sometimes, if you're standing in a group of people, somebody else will chime in, and pretty soon you've got a conversation. But that's how I start. That's my coping strategy for awkward conversations with strangers. I'm also an extreme introvert. For me, it's about understanding that it's difficult for them too. So if I'm focused on trying to make them comfortable, I'm not thinking about how uncomfortable I am. And also just being super curious. Like, ooh, what do you do? What are you interested in? And I get so focused on whatever project they're involved in that, again, I just completely forget about my own awkwardness. But also, planning time off, even in the middle of the conference, planning like half a day off: oh, I don't have a whole lot of talks I want to see right now, I'm just going to go back to the hotel. And it really helps, because you recharge that introvert battery and then you're ready to deal with people again.
I just want to say plus one thousand to that one, because I think it took me years to be comfortable with saying, I don't really need to talk to anyone in the next two hours. In these sessions, I don't have a target topic where I need to network with people. So I just leave, find a nice coffee shop somewhere outside of the convention center, get some fresh air, and say that that's okay. Also, once you get into the environment and you start to know more people, you don't have to go to every single social event after the conference, because there are usually happy hours every evening. You don't have to go to all of them. Once you have a base network, you can pick which ones you want to go to, and for the rest, don't sweat it. Because at the very beginning, I was like: my company's sending me overseas, it's a very expensive trip, I'm missing a week of work, like the day-job kind of work. And I felt obligated to go to every session, talk to people, go to every social event. And if I stepped outside of the convention center during the conference day, I felt guilty. So letting that go, yeah, let it go. It's very important for you to take care of yourself first. I mean, in addition to the social awkwardness that I also deal with, at crowded conferences like this it's challenging to find time to talk to people, because they're all busy, especially the speakers. And I've been on both sides of this. It's completely okay to say, could I message you on LinkedIn, and then have a call like a week later. And I actually did that with one of the speakers last year. It was in the K building, one of the larger sessions. And he was just inundated with a lot of people. And I just said, hey, can I connect with you on LinkedIn? And then you're in a more relaxed environment on Zoom, and you just have a conversation about his talk or his background.
So, you know, don't force yourself to have all the conversations in two days. It's just very difficult logistically. So, any other questions? Oh, go ahead. Yeah. So my question is, have you experienced any sort of significant difference in terms of revenue, working on a heavily open-source type of project or job versus, let's say, a normal one, if there's such a thing? Thank you. The only thing I can say about salaries and things: in my experience, that's more tied to geographic location than to the job role itself. Like, if you're at a hyperscaler in, I don't know, a VP position, then I assume you will not have money problems for the rest of your life. But at the same time, my experience is I moved geographic locations, and that affected my salary more than anything else. I have not noticed a difference either. To the degree that when I was putting my son through college, I very much focused on salary, which I don't anymore, and I was in the 1% while working fully open source. Like, nothing proprietary. It does have a lot to do with the company. Different companies have different salary bands. So you're more likely to get more if you work at a big company than at a small startup; startups tend to be a bit more weighted towards stock options. It just depends. But yeah, there isn't really a difference whether it's open source or not. Yeah. And, I mean, also comparing, because I asked this question since I worked at a foundation: working at a nonprofit versus a for-profit organization. When they need to hire people, they need to be competitive. If you're a nonprofit, you can't offer stock options, that's not viable, but they have to find other ways to make it appealing to attract good people. Right.
So they can't be at a complete disadvantage, salary-wise, as an example. That's my experience. Other questions? Anyone? All right. Cool. Well, just the final thing I want to say: if you want to connect with us, I mentioned LinkedIn. All of us are on LinkedIn and also on Twitter. If you want to continue the conversation, feel free, and enjoy the rest of the weekend. Thank you. Thank you.
The Regulators Are Coming: One Year On
Okay. Testing, testing. Yeah, there we go. If I can call your attention: in the next session, we have one hour on The Regulators Are Coming. Your chair for this session is going to be Simon Phipps, and he will tell you all about it. Welcome. Thanks for coming. There we go. Yay! Hi. So I'm Simon Phipps from OSI, and I'm part of a group of people from open source foundations that have been engaging with the European legislators this year to fix the issues that you all told Benjamin about after his talk at FOSDEM last year. And the TL;DR, for when you leave early, is that thankfully Benjamin and Omar down here listened very carefully and have, I believe, addressed all of our concerns with the impact of the CRA on open source developers and open source charities. There are some remaining issues that are a little more complex to deal with, and they will be dealt with in some guidance that comes from the European Commission. So to speak to you today, first of all I've got Benjamin Bögel, who is now head of sector at DG Connect, and he was one of the authors of the Cyber Resilience Act and has been intimately involved in fixing it with us all year. And he is going to tell us all about the CRA. After that we're going to hear from Gaël Blondelle from the Eclipse Foundation, who was also part of our group that was interacting with the Commission, and he's going to tell you whether Benjamin is telling you the truth or not. And then Omar is going to tell us the same things about the Product Liability Directive, and then Dirk-Willem from Apache is going to tell you whether Omar told you the truth, and then Enzo here is going to run an audience Q&A so you can ask these people all the questions that you want to. We've only got 50 minutes, so if your question doesn't get answered, come to our dev room, which is all day tomorrow, in AW1.120.
It's on Open Source in the European Legislative Landscape, and we're running four two-hour workshops to give written feedback to the Commission on their digital agenda legislative program. So with all that said, Benjamin, thank you so much for coming back, and they've promised not to throw anything. So go for it. Thank you. Thank you so much, Simon. Thanks for having me again. It's been an exciting year. I was here exactly one year ago. Last year when I was here, I was presenting the Commission proposal, which is the first step of the legislative process. We as the Commission make the proposal, and then the co-legislators, the European Parliament as well as the Council, which represents the Member States, negotiate on the basis of our proposal, and now I'm here to report back after one year of negotiations. The text is almost done. It's quite stable. We still need the final vote by the European Parliament, so it's not entirely finished, but we are quite confident that what I'm going to present to you today is a rather stable version of the Cyber Resilience Act, the newest kid on the block when it comes to cybersecurity legislation. Last year I presented the proposal. I will repeat some of that this year, but I will focus much more on the open source elements, because there are many more open source elements in the final version compared to the original version. For those that weren't there: what is the CRA about? It essentially requires developers, hardware and software manufacturers, to introduce security by design in their development processes. The cheese on the left represents a product with digital elements, as we call them, filled with holes and security vulnerabilities. On the right-hand side, once you've complied with the CRA, there will be far fewer holes, although we do acknowledge, of course, that it will be impossible to get rid of all the holes. That's just the nature of cybersecurity.
Here is a brief introduction to the main elements of the law. As I said, it's about cybersecurity rules for the placing on the Union market, the entire European Union, of hardware and software products. We have three main actors in this legislation. The manufacturers will bear the brunt of those rules: they have to make sure that their products are secure. But then there are also obligations on other types of actors, mostly the distributors, so these are essentially either brick-and-mortar stores or online shops. They have to make sure that the products that they sell are secure. As well as importers that import from outside the Union onto our market. The rules come in the shape of essential requirements. Essential requirements are high-level, objective-oriented, technologically neutral requirements for the placing on the market of the products. They are things like ensure access control, ensure the confidentiality and integrity of stored and transmitted data, and so forth. You know, all these are high-level. This is the cybersecurity 101 that we're essentially putting in the law. To make it more useful and easier for manufacturers to comply with those requirements, the European Standardization Organizations will develop harmonized standards, and then you can use those standards to comply with those requirements. The European Standardization Organizations essentially gather the manufacturers, so it will be the manufacturers themselves who will develop those standards. Depending on the level of risk that is associated with a product, there will also be different types of conformity assessment. I will explain that in a moment. I also want to mention separately that there are going to be reporting obligations, so if you discover vulnerabilities in your products that are being actively exploited, or you have an incident on your network that affects the security of your product, then you would need to report that.
And finally, another important element, of course, is market surveillance and enforcement. So all 27 member states will be required to set up their own national market surveillance authorities to check products and ensure that the products that are on the market are actually secure, or at least compliant with the CRA. So these are the main elements. We are tapping into an existing framework. You've probably all seen it: the CE mark. On your smartphone chargers, for instance, you have the CE mark. The CE mark tells you that this product that you're holding in your hands is essentially compliant with all European product regulation. And in the future, when you see the CE mark, it will not only mean that the product is compliant with safety regulation at the Union level, but also with cybersecurity legislation, the Cyber Resilience Act. So which products are we talking about? The scope is quite wide and deep. When I say wide, I mean that it applies to all sorts of hardware and software products, such as laptops or operating systems. But it applies not only to the final products but also to the components, because the nature of cybersecurity is, as you all well know, that vulnerabilities in components can often have an impact on the security of the final product. And in many cases it is very difficult for the integrator who builds a final product to find all the vulnerabilities in those components; components are often black boxes, in particular when they don't come in the shape of open source. So they also need to be secured. And so all components that are placed on the market as separate products are also in the scope of this regulation. What is not in the scope? I already explained that last time, but it was not sufficient for you. I explained that non-commercial products would not be in the scope. And I think this has been quite an issue that has been discussed at length.
A lot of people have asked what non-commercial means, in particular in the context of open source. And this is one of the reasons why for the last year we've tried to flesh out in more detail what non-commercial means for open source. And I can tell you that during the last year, barely a single day has passed by when I didn't wake up to a message from Simon, Dirk-Willem or Enzo trying to help along with this process. So non-commercial products are not in the scope. I will explain in a moment what that means for open source. Stand-alone services, in particular software as a service, that don't come with a product, that you just access through a website, are also not covered. And we also have a few outright exclusions of products that are already regulated when it comes to cybersecurity, so they don't need to be covered by the CRA. That includes, for instance, motor vehicles and medical devices. Okay, so just to understand: I said the scope is wide and deep. I want to talk a bit about what it means that it's deep, right? So when you are a manufacturer of a final product, in this case a smartphone, you will be integrating two types of components: on the one hand, like in blue here, components that you've developed yourself, as well as components, here in yellow, that you are buying or sourcing from the market and integrating. So you are responsible for the security of the entire product as a whole and for its compliance with the CRA. But when it comes to the components that you source from third parties, of course it's much more difficult to have assurance about the security. And for those components we've introduced a due diligence requirement. That means that as a manufacturer you will have to do the utmost to make sure that the components that you integrate are secure. That can mean that you simply check the change log. Is this a component that is regularly maintained?
You check the vulnerability databases that are out there on the internet to see if the latest version contains any vulnerabilities. And if it's a commercial product and it is subject to the Cyber Resilience Act, you can also check whether it carries the CE marking. So this is how you can achieve that the product as a whole is CRA compliant. So now to the conformity assessment. I mentioned it earlier, and this is the first time I'm going to mention open source more explicitly; this is where it's explicitly mentioned in the text. For the vast majority of products, which we call the default category, manufacturers will have to undergo a self-assessment. That means that it's the manufacturer him- or herself who will check and ensure that the product is compliant. But then there are some products, explicitly listed in the annex of this regulation, that the co-legislators have considered important or critical from a cybersecurity point of view, and they will have to undergo a more stringent type of conformity assessment. So first we have the category of important products, and manufacturers in this category will have to apply at least a harmonized standard, the ones that I mentioned earlier, or in some instances they will even have to submit their product to a third party to have it checked whether it's secure and compliant with the law. Products in this category are, for instance, operating systems, antivirus software, or also firewalls. Then there are also critical products. They are also listed in the annex. These are products such as smart cards and secure elements that we consider to be even more important. By the way, these are only hardware products, no software products, nothing that is potentially open source. And for these products we may in the future even go a step further and require a certification of the products.
Now when it comes to free and open source software, we have a special provision in the CRA that says, irrespective of whether your product is important or not, you will always be allowed to undergo a self-assessment. So you will not have to submit any free and open source software that is in the scope of the CRA to a third party. And the reason behind that is that open source is a transparent product, and anyone, including the users or integrators, can check for themselves whether the product is secure. So you do not need a third party that vouches for the product. Now we also try with the CRA to shift the responsibility from the developers of open source components to their integrators, because so far integrators have often been free-riding on open source components and not giving enough back to the community in terms of fixing vulnerabilities in these products. So coming back to the smartphone product that I presented earlier, right? Imagine a smartphone product that integrates an open source component, here a silly open source component that prints fruit. So far it was a one-direction thing, right? The integrator would take the component and, I mean, not always, of course, sometimes integrators also contribute a lot back, but in many cases they would just integrate the component into their own product and that would be it. From now on the CRA will say: if you find a vulnerability in your component, you have to inform the developer of that component, so that developer can also provide a fix to that vulnerability. In addition to that, since as a manufacturer of a final product you are responsible for the product as a whole, in the absence of a fix from the upstream manufacturer you will also be required to provide a fix. I mean, either you fix the vulnerability in that component or you replace that component with a different component. You just have to make sure that your product is secure.
But if you do provide a fix, then you will also have to provide that fix to the upstream manufacturer so that the upstream manufacturer can integrate it. So this is how we want to share the burden of security between the developers of final products and the developers of free and open source software. So, is your open source software project covered by the CRA? I think this is the question that you are all asking yourselves. I said initially the Commission proposal said: if you are not commercial, you are out of scope, right? And now we've fleshed this out in much more detail, and we've even introduced a new type of actor, the open source software steward, which I will also present in a moment. So if you are merely contributing to someone else's project, you are definitely not a manufacturer. You're not subject to any obligations. That was a worry that was expressed several times, but here I can assure you: you can just keep contributing and you do not need to worry about CRA compliance. Now if you are providing the project and not merely contributing to it, the question is: are you developing in the course of a commercial activity? If you're not, if it's really just a hobby project, again, you're not in the scope of the CRA. Now if it is in the course of a commercial activity, the next question is: are you directly monetizing that product? Because we know that many open source projects do not directly monetize but are still in a wider commercial setting, right? Many companies coming together to jointly develop a component that they will use for their own products: that's a wider commercial setting. But we only look here at the direct monetization of the project. If you're directly monetizing it, then you are a manufacturer and then you are subject to the security by design requirements of the CRA.
If you're not directly monetizing the project but it's still taking place in this wider commercial context, this is where the new type of actor we introduced comes in: the open source software steward. These are essentially foundations, not-for-profits and so forth. Here we've invented a new, very light touch regime. So if you are a legal person that provides support to specific FOSS projects on a sustained basis, and these projects are intended for commercial activities, then you will have to comply with the light touch regime of the CRA as regards the open source software steward. But if you're just a collaborative project, no governance framework to speak of, no direct monetization, then again you're not in the scope of the CRA. That means the vast majority of open source projects will not be in the scope of the CRA. So, I don't know if we still have time. I can maybe quickly explain what the open source software steward will be. I already gave some examples, right? So foundations, not-for-profits, also companies that build open source for themselves, for their own monetization or integration into their own products, but then make it available to the public: they will all be open source software stewards. And I already said it's a light touch approach. It's not going to be heavy, but the idea is to place some responsibilities on these types of actors, but only responsibilities that they can also bear, given the nature of their project and their organization. So there are basically three types of obligations. First, you have to put in place a cybersecurity policy. The CRA is not very prescriptive about what that cybersecurity policy should look like. It provides some basic elements that need to be mentioned, such as supporting the community in providing information about security vulnerabilities, describing how you will mitigate vulnerabilities, and so forth.
Secondly, you will be required to cooperate with market surveillance authorities, just like any other actor in the Cyber Resilience Act. And thirdly, you will also be required to report incidents and vulnerabilities, but only to the extent that you are involved in the development. So if you're not involved in the development and you know nothing about the project and the vulnerabilities, then you will not be required to report vulnerabilities. Okay, so this was a high-level overview of the CRA. Just maybe very briefly, what are the next steps? We are hoping to conclude the CRA very quickly in the coming months. The entry into force, I cannot be sure, but it will be roughly around the middle of 2024, maybe a little bit later. And then there's going to be a three-year transition period. During that three-year transition period, the European standardization organizations are going to develop the standards. We as the Commission are going to develop guidance. For this we will need you, because of course the CRA is high-level legislation. Many of the concepts need to be fleshed out through the guidance. So I'm actually very much looking forward to all your questions, because these questions will help us determine what is relevant for the guidance. Yes, and then three years from maybe June this year, so maybe in June 2027, the CRA will start to apply. Thank you very much for your attention. Thank you very much. Still got to turn it on. There we go. Thank you very much for all that. Now, Gaël Blondelle is one of the leaders of the Eclipse Foundation. Eclipse has been speaking up frequently for the open source community in this legislative process. They've had two staff working on it quite a lot of the time, Deb Bryant and Enzo over here, who you'll hear from later. Gaël, could you come and tell us how the Eclipse Foundation feels about the state of the CRA now? Yes, thank you very much, Simon, and thank you, Benjamin, for the presentation.
Well, thank you for coming. You see, that went well. That was okay. So far. So one first point is that I think we have always said that we agree with the goal of the CRA. That was in the first blog post that was published by Mike on the topic. We agree on the goal, but initially it was very scary. And I think that was the conclusion of your presentation last year: hey, come on, what are you doing? How can you put us on the spot like that? Because putting CE marking on all the open source projects was just not an option. One thing, and that's very important, is that we know we have lots of open source developers who are volunteers. And even when they are paid to do open source development, what they focus on is building the features of their project. And non-functional topics like security, etc.: I think that as a community, as an ecosystem, we know we have to take care of that, because we had lots of issues in the past. But having that come through a regulation was something completely new to us. And yeah, even if there is a legislative process that is kind of obscure for most of us, I think what's interesting is to see that during the year we managed to establish enough connections that the co-legislators listened to the open source community. So from your presentation today: the obligation to push corrections, to push fixes, upstream; also the fact that people who contribute are not responsible, have no obligations, etc. And also, from my perspective, the introduction of a new kind of organization. That's the first time a regulation talks about open source foundations, or those kinds of organizations, as something specific. Those are very interesting aspects. But to conclude, and maybe that's an opening for the conversation after, that's just the beginning, because we have roughly three years in front of us. And in those three years, you will write guidelines.
And hopefully we can collaborate well on writing guidelines. But there will also be the standards. And maybe from the point of view of the open source community, it's fair to say that the standards organizations have not been the best friends of the open source community. So I think that when you say harmonized standards, I guess a few people in the room say: hmm, it's unlikely we will like such a thing as a harmonized standard. So that's something we need to keep on our radar. And the fact that the regulators are coming, that's the title of the panel, I think that's a good thing, because it also reflects the fact that open source has won and is present everywhere. We used to be under the radar. And now I see several faces from the European Commission in the audience. You are here to explain things to us, and we have established some connections. So those are good things. And yeah, the conversation continues tomorrow in the panel, in the EU policy dev room. And that's it. Thank you. So that's the CRA. Now, the CRA sets the rules for the market surveillance authorities. It says how countries are going to make sure their citizens are safe from the products that are being sold in those markets. When it turns out those products aren't safe, Europe's Product Liability Directive gives citizens recourse to have justice brought into their lives. And the Product Liability Directive has been in place for many years in Europe, but it doesn't impose any liability on software producers. And that, within boundaries, is about to be fixed: the European Commission is going to do something about it. Those big, bold, lettered disclaimers at the end of your software licenses do not apply in Europe anymore. And that's because the Product Liability Directive is being updated to give software producers liability towards consumers.
And to tell us about that, we've got the legal and policy officer from DG Grow, Omar Anagi, who was one of the primary authors of the PLD, and he's going to tell us what's in it. So, Omar, please. Thank you, Simon, and good afternoon, everyone. It's a pleasure to be here again this year. Same as the CRA: we are a year on, and we now have more than just a proposal. We have a legislation that still needs to go through adoption by the Parliament. But just as a small introduction: whatever has been said just before, let's try to forget it for the next 12 minutes, because it is not applicable in our case here. When we speak about the PLD, it basically applies to any type of product. The only element that is necessary is whether they are made available on the market, and made available on the market basically means any supply or distribution for use, whether in return for payment or free of charge. And the most important element is actually the commercial activity. I know that everyone always asks, especially here last year, the question: what is a commercial activity? Unfortunately, I cannot tell you exactly whether your own product or your own software is in a commercial activity. This is an assessment that is done by the judge. There are elements, the scale of supply of the product, the use of the product, but this cannot be determined beforehand for the PLD, because of its nature as a safety net. So the assessment will be done for each individual product. Even if it's, let's take a more traditional product like a bottle, you will have to look at the specific bottle and not the series of bottles to determine whether it is in a commercial activity or not. So that is the scope, but then we arrive at the product itself.
Any product. The definition is really legalistic, so you don't need to really get into that, but basically it's everything, and we have clarified that software, raw materials, and digital manufacturing files are also products under the PLD. There is no definition of what software is, because, as you probably know, the software of 20 years ago is not the same as today's. So the idea was to leave it as open as possible to ensure the future-proof, safety-net nature of the PLD. You asked me if SaaS is covered: yes, it is covered. The PLD disregards how the product is supplied, how the product is bought, how the product is used, what the business model of the product is, where it is stored. All of this is totally disregarded. Any software is covered by the PLD: algorithms, operating systems, AI systems, apps, whatever you want, all of them are covered under the word software. As Simon said, the PLD does not just kick in. I mean, you do your job, and in the PLD we're not telling you how to do your job. The only thing that we are telling you is: know the risk profile of your product. Because if something wrong happens, and maybe none of you will ever experience the PLD in your life, but if something wrong happens, someone has to get compensated for the damage. The damages are pretty straightforward. It's basically death and personal injury, including psychological health, the destruction of property, and the last one is destruction or corruption of data. Those are the three main categories of damages that need to be compensated. If there is a single one of these, you would then have to compensate basically everything that is related to that. As I said, you will not have a case if there is no damage, and you will not have to face the PLD itself. Except in certain situations, you might have liability even if the damage has not yet occurred. Let's take a pacemaker. You know that the pacemaker has an issue. You will not wait until the person dies because of it.
You will get the compensation preemptively, namely the damages of going back for surgery again, etc. I use the pacemaker because pacemakers are part of the wider range of medical devices, and medical devices also sometimes include software. This is a specific situation in itself. When we talk about the liability, the question is: for how long? The main rule is ten years. This is the general rule. Namely, if you place your product on the market, you may have it available on the market from the first day; this is when the time starts running. But as you know, a software product might evolve, an AI system, for example, as well. Consider a software product that was placed on the market 15 years ago and has been changed through a lot of updates, for instance: it would be quite limiting to only apply the ten years, because it would mean that someone who bought the software ten or eleven years ago would not be able to recover the damages in case something wrong happens, even though the software has been updated. So we have also included a new starting point, which is when the product is substantially modified. I'm not going to go into detail; we're not explaining exactly what a substantial modification is. In most of the legislation you will find what a substantial modification is, but roughly, for software, I don't know if the CRA has a definition of substantial modification, but for instance you would have to go to the CRA to see what a substantial modification is in the case of cyber vulnerabilities. What we say is basically: if you update your software and the update is such that it changes the risk profile of your software, it is a new product, it is a new software, and the time limitation starts running from that moment again. So each time you make a change to that point, to that element of your software, you restart the clock in that sense. If it doesn't, then it doesn't, and your ten years stay ten years.
The extension of the liability has also been put at 25 years in one specific situation, which is latent health injuries. That shouldn't concern you that much, but just for you to know: it's basically pharmaceuticals. That's the easiest example, where you only realize that you have some damages because of it more than ten years later. So this is a specific situation, but for software, this is just for you to know. We talked about the time limitation, and then we also need to talk about the exemptions. Exemption means that even though your product caused the damage, one of the three, you might be exempted from your liability. There is a full list of exemptions; I'm not going to go into details, but maybe two are important for you, and I will explain the first one a bit later. If you did not place your product on the market, but it was placed by someone else; or the development risk defense, what we call the state of the art, which I think in your field is the most relevant one. And just to be clear, it's not the knowledge of the developer, it's the knowledge of the community, of the science around. And it's not about the known unknowns, it's only about the unknown unknowns. Only in those cases will you be able to be exempted from your liability. So just to take an example for you, and maybe to make it as clear as possible: the PLD does not apply to any product when it is supplied outside of a commercial activity. This is the same for free and open source software. If your free and open source software is developed or supplied outside of a commercial activity, but someone decides to integrate it into another product, and the product is then sold to a person and causes harm, the liability is pretty clear. The person will only be able to go against the integrator of the software, but not against the developer of the free and open source software that was supplied outside of a commercial activity.
That's a bit of clarity that is now in the text, which was not there before, but just for you to really understand how it will work. And the very last point: I know that you have clauses in your licenses. The PLD is pretty simple: no matter your clause, you cannot use it against a natural person that is claiming compensation. So there is no leeway for avoiding liability. If a natural person, so me, you, anyone else, comes against you, has a damage, asks for compensation, brings you to court, you cannot say that you had a clause in your license that said you would not be held liable. That will not be accepted by a court. That's a general principle that works for everyone and for any type of product; it is to avoid that the weakest party, namely the consumer, suffers from an imposed contract. But what we have clarified in the legislation is basically: if you, a small company, a very small company, decide, okay, you sell your software to another company to integrate it, but you do not want to take over the liability. If this is your case, you can then have a clause in your license or in your contract. And in that case, the manufacturer of the overall product, the integrator of the software, will not be able to come against you once he has compensated the natural person. What happens usually is the natural person goes against the manufacturer named on the product, and then it is that manufacturer of the overall product that will go against the other component manufacturers to get part of the compensation. This would then not be possible if you have such a clause. So that's a bit of a small panorama of the PLD. I leave you on that, and I hope you enjoyed it. Thank you. Thank you very much. Perfect. I'll come get it from you now. And to respond for the defence, we have Dirk-Willem van Gulik from the Apache Software Foundation. Thanks, Simon.
So, yeah, basically, I think in many ways what's happening here is that software is becoming very grown up, just like, I don't know, a phone charger or an electric drill: we're being put under the same rules. Now, I think the positive news here is that in this process the open source side, the development side, and also the micro-enterprises are largely out of scope. However, what I want to stress, and also want to stress about the CRA, is that it is a massive change for our industry. We as open source developers are not alone; we're actually part of that IT industry, and the PLD and the CRA will probably, or will absolutely, affect our industry way more than they do open source, because the industry has to come to the table. The industry is squarely in the view of the CRA and squarely in the view of the PLD. So I think one thing we can be positive about and celebrate is that all the worries we had last year around the CRA, and especially about the PLD, didn't really come to fruition. I mean, things are now, we've got a fair balance, I think. But at the same time, as an IT community, we've got some massive challenges left. And I think some of your questions may well be in that area. Thanks. Thank you. Okay. And so we're going to move to an audience Q&A. If you've got a question that you would like to ask the panel, or particularly the guys from the Commission, then raise a hand. There's a hand raised down here, and Omar is going to moderate for us. I'm Enzo, not Omar. Sorry, Enzo is going to moderate. My brain is gone. Yeah, go ahead. Go ahead. Please, yeah. Yeah. So we're glad that a lot of the concerns of the open source community were heard. We can't hear you. Yeah, okay.
So we're glad that a lot of the concerns of the open source community were heard. Linux distributions like Debian, for example, will be exempt, because we don't do anything commercially. But we are worried about our downstream users, who of course use Debian commercially. For example, a lot of very small local IT providers sell computers with Debian, or do other business using Debian and integrating it into their products. And we are worried about how they will be able to comply with the CRA obligations, because they are so small that they can't do it themselves; it would be really hard for them. Also, the margins in the computer industry are not so big that they can just say, okay, I'm going to employ somebody who does that; that's not possible for most of them. So that's what we want to have guidance for. And it's also really difficult for them to understand all these regulations and what this means concretely in practice for somebody who is, for example, just selling computers with Debian. Thank you very much. I think it's a very good question. I guess, Benjamin, it's pretty obvious that this question is for you, if you want to answer real quick. Yeah, thanks. It's a great question, I think. So indeed, if you are selling a laptop, for instance, with an operating system installed, if you're building that laptop, if you're the manufacturer of that laptop with the operating system, you will be in the scope of the CRA. And the due diligence requirements as regards the integration of the operating system will also apply to you. I explained before what due diligence means, right? There are a lot of ways in which you can do due diligence. The CRA is on purpose not very prescriptive, because we want to give a lot of flexibility to the integrators. But one thing is for sure: it doesn't mean that you can only integrate CE-marked products. You can integrate any open source component that you like.
And there is a myriad of ways in which you can demonstrate that the components you integrate are secure. I think in a case like this one, where the upstream provider, the Debian project, is such a massive undertaking, it would be extremely helpful for your integrators if you provide them with useful documentation on how Debian as software addresses the various security requirements of the CRA. Just because the CRA doesn't apply to you doesn't mean that you shouldn't take security seriously, and I'm sure you do, right? I'm sure many of the things that the CRA requires, such as access control and so forth, modern operating systems like Debian obviously do. So if you document in a transparent manner how you are actually complying with security-by-design principles, you're essentially doing the work for your integrators, and then they can just recycle that work for their own documentation. So their documentation doesn't need to be heavy anymore. Thank you very much, Benjamin. Is there another question here, over there? Thank you. Yeah, this is a question for the Eclipse and Apache foundations. Aren't you afraid that you have kind of doomed the software foundations by shielding the developers? Because when I look at this, the first thing that jumped out at me was: okay, I have to make sure that I'm not going to be a software steward. So if somebody wants to pay me for work, then the best thing I can do is dump the project into one of the foundations and make myself just a contributor. Thank you very much. Dirk, maybe first, or Gaël? Right, so I think the question is really: what do I do as a small developer, and does this force me to dump my projects into one of the foundations? And I think it's useful perhaps to turn this around. What is happening here is that society is asking software developers to start producing good, secure software, to basically use industry best practices.
Now, in open source, we by and large do that. In fact, we pretty much set every industry best practice around security. And it's our downstream people in the commercial markets who are often not updating. I mean, we update log4j within 24 hours, and now, years later, it's still not being done universally. So I think to a large extent the answer to that question is that as developers we'll have to get more systematic and more explicit about documenting the good things we're doing. And I fully expect that a year from now, two years from now, we'll basically all have documented that in more or less the same way. Because at Apache we've documented some of the things, at Eclipse, at Python, we're basically all doing the same thing. So yes, of course, we're going to steal each other's documents, right? It's open source; that's just the easiest way of doing it. And then indeed, that foundation style, all those things which are part of being an open source steward, like being sustained in the market and being responsible about these things, simply becomes much more widely available. Thank you, Dirk. Yeah, just maybe to add something: I hear your point that, okay, if there is some constraint due to the fact that there is an open source steward, I absolutely want to avoid being in this situation. But I don't think that people or organizations bring their projects to a foundation just to avoid the CRA or to do something like that. The main point is more likely to set up collaborations, or to have a vendor-neutral governance, or stuff like that. I think our main point, in my opinion, is that we help create consortia, and the open source steward is a good way to implement the requirements of the CRA in such a context.
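The suggestion from the panel, that an upstream project documents how it addresses each security requirement so integrators can recycle that work, is not tied to any particular format. Purely as a hypothetical sketch (the structure and every field name below are invented for illustration, not prescribed by the CRA), such documentation could be as simple as:

```yaml
# Hypothetical upstream security documentation for downstream
# integrators; all names here are illustrative only.
project: example-os
security_measures:
  access_control:
    description: Administrative actions require explicit authentication;
      user privileges are separated by default.
    evidence: docs/security/access-control.md
  vulnerability_handling:
    description: Security reports are triaged promptly and fixes are
      shipped as signed updates.
    evidence: docs/security/updates.md
```

An integrator could then reference such entries from their own due-diligence documentation instead of rewriting them.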
How open source projects approach Functional Safety
Hello everybody. For our next speakers we have Nicole and Philip. Nicole is from AlektoMetis and Philip from Bosch, and they'll be talking about how open source projects approach functional safety. Thank you. Thanks, and welcome. As the title says, we're talking about how open source projects approach functional safety. There are so many projects out there, so we just took three examples, which will be the Xen, Zephyr, and Linux parts. We do this together, so just to give you a short intro on who we are: my name is Philip, as introduced. I'm currently a product manager for embedded open source inside Bosch. But in the time where I'm speaking here, what I mainly do is chair the technical steering committee within the ELISA project, which works on enabling safety, and I'm also leading a working group there which basically puts the different pieces of open source projects together. I'm currently also a member of the Linux Foundation Europe advisory board, which was started last year, and personally I've been into open source for like 15 to 20 years, mainly as a user. And this hands over to Nicole directly. Yeah, so this works. So yeah, I'm Nicole. I'm mainly a safety person. I started off in production maintenance, not maintenance like a maintainer that you might know, but really like: something breaks and somebody has to fix it. I then went to software development, was a software developer and tester, always with some safety background, in the automotive and the defense industry. And then I changed to do more safety-centric work. About five years ago we decided we can do things better and founded AlektoMetis, so that's where I'm currently at, doing safety consulting and a little bit of what we heard before, the license compliance stuff. I'm in and out of the OpenChain project. I'm mainly involved with the ELISA project, in the systems and the medical working groups. I'm in the SPDX project, in a safety working group.
I'm the safety manager of the Zephyr project, and so on. So yeah, if you want to know about open source and safety, maybe we should talk. And if you talk to me and I don't recognize you again: I'm not arrogant, I just cannot remember faces, so if you know me, please just tell me, hey, we know each other; that's completely normal for me. You can also text me; on basically each social media platform I have the same handle, Nick Pablo, so please feel free to reach out. So, starting with the real thing. We're talking about functional safety, so who in the room is familiar with the term functional safety at all? Oh, that looks good. For those who aren't: we're not talking security, we're talking safety here, the part of the system that should not kill you in case of an internal fault. It's mainly a do-no-harm thing: the system always behaves in a safe way, whatever happens, whatever input you have, whatever thing breaks. That's a part of safety, plus the stuff that you need to ship with your product to really prove that the system cannot kill you, or most probably won't kill you, let's put it like that. So with functional safety, the main things that you expect are that the software behaves as specified, which already implies that you have a specification; that it does not interfere in any way with the other system components you have; that hazardous events are addressed somehow, so that you can avoid them, or at least detect them and have some mitigation in there; and that you have sufficient evidence that really proves this. Right, and I guess the next one is for me. As in the title, we talk about the three example projects, so we have one as the Linux representative, while there's so much more; we'll see this later.
We have a real-time OS with Zephyr; there are recently also others coming out which claim to have safety certification and may all be open source, but this is the one in here. And we have the last one, the Xen hypervisor, as a virtualization solution, and they all run under the Linux Foundation as projects. To just get an idea of who's in there: we have a large number of members within Zephyr, actually only a few in Xen, and the middle size in ELISA. As you could see from my introduction, I'm coming from Bosch, so we're doing a lot of automotive work, and in this respect I just wanted to show how different the members are. If you think about medical devices, automotive, railway, industrial parts, that's where safety standards really come into the picture. And if you look into Zephyr, there's not really a mobility or aerospace member in there; it's mainly hardware driven, so we have microcontroller companies, sensor manufacturers, and so on, which really have this visibility in there. For the Xen project you have some mobility suppliers, but it originated from the server space; there's no real car manufacturer in there. What I also have to mention here is that they have a bunch of other members who are just not supporting the project financially. If you look into the ELISA project, there are mobility, aerospace, and system providers in there, from OEMs to the tier-one supplier base. So we will now go through the different projects one by one, just say where they are, how they have common parts on safety, and how they differ. And the next slide goes to Nicole again, because she comes from Zephyr.
Yeah, so I'm not sure how many people in the room already know about Zephyr. Zephyr is, let's say, the coolest real-time OS that you can find out there. It's open source, it's permissively licensed, and you can bring it down to a really small resource footprint, really even smaller than the Linux kernel. We're currently heading for safety certification; we have the safety working group that is now preparing and, let's say, enhancing what's there in the project, so that anybody can take the project artifacts to go through their own certification or qualification, or at least have a look into it and say, hey, that really brings everything that I need for my application. At the moment, the safety awareness in the community is honestly limited. There's a high awareness of quality, but safety sometimes really is the hot potato nobody wants to touch. But we have this working group, we're working on it, and we're getting more and more support from the complete community. What's important: it's POSIX-compatible, so for all of you who come here from the automotive domain and think, hey, we need something, you can use your POSIX stuff on that. And the main part of Zephyr is hardware agnostic; there's a very strict hardware abstraction layer, so it's really easy to port from one application or one platform to the other. So yeah, Xen, that's you.
And in contrast, if you take Xen: we saw that it's a much smaller community, but they really come from a strong security background. They were used to having virtualization to isolate systems, so from the beginning of the project they had a very strong quality mindset. You can see that they have every commit tested, with two different CI pipelines for this testing, and they also have a strong, rigorous quality process, so that they really have full commit traceability: from the first commit to the end, you can see everything which happened. This is organized that way because it is in official production, also in data centers, which shows you how much care you need to take, because there's a lot of chance for intrusion there. AMD Xilinx are the ones mainly driving the Xen project; they are also the ones working on the safety certification. And that's also what they said: if you have an open source project, you need to do this continuously, right? Because traditional things are often safety certified in one shot, and then you have a hard time updating things, so it's a challenge which many have to follow now. And what they said: in the first phase, they show that open source actually is certifiable, which is what they have as the certification concept approval, and then they go for an assessment. If you want to do the same thing in Linux, just imagine all the distributions out there, the flavors, how you build and create things; life gets much more complicated. This is really the open source superlative: you have so many contributors, such a large code size and everything, and also a very large community and much more flexibility in the user use cases. In this way, I just took, by searching the web, some examples from all the companies which are making their attempts at how to do this, either together with others or independently. There was a first activity with the SIL2LinuxMP project, roughly 10 years back already, where
they worked on it; it stopped a little bit and transitioned into the ELISA project, which I'm representing, and it recently got new momentum due to what's called the software-defined vehicle, where you have much more centralized high-performance compute, cloud connectivity, and so on, and where people want more open source usage and want to bring more things into the space. But then the first question they ask me when they come to the ELISA project is: when will you have a safety-certified Linux which I can simply use? And I was like, well, that's something which we are not able to provide directly, because when you are in an environment, you need to make sure that your system is safe, and we cannot make sure that you do your engineering properly; we also cannot guarantee that you follow certain processes which are required. We can just give you guidance on how to use things. And the last thing is like, oh, will you just have one snapshot which is then safety certified? That doesn't make sense, because you know how many fixes get into a product: there's most likely a vulnerability coming up, things are connected, you want to enhance features. So we need to go on a continuous path, and that's something completely new also for traditional safety approaches. And we always put the disclaimer in: you cannot be relieved from your responsibilities; it's like a legal notice here. But yeah, we have different projects which can provide a fast forward, and as we share the same burdens on regulations, that goes together; certification will be the key part of it at the end. And this gets complicated, because certification is very expensive: it takes authorities, a lot of checks, audits, and so on. How this is financed you can see with three different approaches: for Zephyr it's the Platinum members which finance it, and they get the full access; for Xen it's AMD Xilinx, which is the business in there, and they mainly spend the effort in the project; and for Linux there are integrators, so the
strongest people involved there are currently Red Hat and Codethink; they are really trying to bring things forward. And you can also see this difference in how open the people are. Nicole mentioned there is the workgroup on safety: the safety working group in Zephyr opened last year. There have always been safety activities considered from the beginning, but to get a wider outreach this was opened also to the non-Platinum members, so that there are new inputs and new activities; the requirements management is a good example which came out of it. Some things are kept a little bit behind the scenes, because it's basically then a benefit for the Platinum members, to get the commitment and to guarantee that there is financing. Xen has the approach from AMD: they are working on getting the code MISRA C compliant, which is a special activity that was often asked for by automotive, and they also provide documentation and their parts upstream. So later on, for Zephyr and for Xen, you have the software at hand, and this is software which is running on safety systems, but you don't know how to use it, because you maybe miss a manual, or you miss certain test cases, or you don't know how to really bring it into the picture. In the ELISA project we really focus on the enablement; we don't do a safety certification, we just enable others; we want to help pave the path. Why we do so: code complexity is one reason for it. As you see, we have millions of lines of code in Linux and small footprints in the RTOS and in the hypervisor, due to the nature of the software. And I put "it can be easier" in here; I don't say it is easier, but due to the small code size you have a much better chance of reaching these things faster. But to get code coverage, testing, and so on, you need trainings, which Nicole for example has done a lot of in the past and currently, so maybe she can give a word on this. Yeah, so asking whether the people contributing to the safety elements have enough
skills: it's a frequent question, and we have different approaches in the different projects. In Zephyr we had a training by a usual training provider, for whoever was interested, some while ago; there is the option to have another training for that; and we have two committees, a safety committee and a safety working group, which are full of seasoned safety experts that can guide and train and are just there to help people out with open questions. The Xen project had a little bit of a different approach: they are very centric about their code quality and about complying with coding standards, so everybody who wants to contribute needs to know about MISRA, and they even had a training by BUGSENG to do so; and they mainly have one safety expert that really spreads the word. In ELISA we again have a different approach: for the complete community we have open webinars that you can either dial in to or watch again on YouTube, and there are the safety experts again in the working groups, of which there are several. So there is no direct training provided, but you can have the webinars and just learn as you go from the experts in there. Right, and then if you think, well, that's a lot of things which I have to do, I'll just go with the traditional approach, I'll save some money and take a normal RTOS: then I just took the Linux example; you can read it from left to right or from right to left. You don't have hard real-time requirements, you don't have safety certification, yeah, these are some topics; but you have a really rich ecosystem, portability, features, experts, any kind of hardware support; it's there to tackle complex products. That's something which you don't have with a certified RTOS, because it's often proprietary and just for a very limited set of devices. So it depends on which point you are at; you need to tackle both sides, you need to see: how does my system look, what do I want to achieve, where is my complexity,
and how do I want to solve the complexity. For this, we put together different working groups within ELISA and created an example system; it's also why we have this talk here, where we really bring together, on a microcontroller and microprocessor base, different Linux flavors and put things together to showcase a reproducible system which you can make use of. Because from our mission we say we would like to give you elements, processes, tools, software, and documentation which bring things forward, and this especially means: if you have safety-critical systems, you need to get an understanding of the systems. This is what Nicole wants to talk about. Yeah, so I think everybody here in the room will agree with me: to assess whether a system is safe, you first need to know about the system. You need to understand the complete system to see if a system, or a subsystem in the system, is safe enough. You really need to understand whether this element or subsystem that you want to qualify for safety is, in the context of your complete system, safe enough or capable enough to do what you want it to do. You really need to choose which features are important for your safety, to evaluate them, to qualify them, and to identify the gaps that you might need to fill yourself, with your own application, with your own basic software, with your own whatever. In safety there's an approach for that called "safety element out of context". The market approach is that you have a generically safety-qualified or safety-certified system, software, component, whatever, that you can integrate into your system, and it has been developed without knowing where it will go into. So in fact it's not a safety element out of context; it's a safety element with an assumed context. Whatever you do, for example as Zephyr or as Xen, you assume what will be needed for functional safety, and you write this down and work according to that. So typically these elements are
then coming with obligations: they come with a safety manual that you need to adhere to. You prove, for your assumed context, that the features you identified as maybe safety critical for your user are developed sufficiently: that they have requirements and a suitable implementation, that there are tests, that there's completeness, that there's planning, and that there are these obligations of use for when you want to integrate it into your final system. And yeah, it sounds all very great, but we still come across some community challenges, or some general challenges, when we want to bring open source to the safety world. We still get a lot of pushback from the safety world: open source does not behave like commercial software, and "we can't do this" as a death sentence. Yeah, it's true: in an open source project you have less influence on maintainers. You're not the boss; you can't tell them "do this". What you can do is say: hey, I need this, I'll propose the change to do this, I'll do it, and we will do it together. It's not a top-down hierarchical approach there. It's harder to bring skills into the community; you can't just say, hey, these 20 developers will get this training. People tell us: hey, who will be liable for certified software, who will be liable for this, who will be liable for that? This is something that still needs more understanding out there, because the community will never be liable, whatever the CRA will say. We need a development process that is somehow present in the documents, saying: hey, I need to ship this with requirements, there needs to be a system architecture, there needs to be auditable code, whatever that is; so really to map what a safety integrity standard needs from a product. And, yeah, I guess as we approach the last three to four minutes already, I'll keep this one short: we cannot do this alone, and we really try to find communities,
areas where we can collaborate. That's why the project really reaches out to all the different areas, seeing what the related activities are, to go through development together and really share ideas; and the Zephyr case, for example, can be something to learn from for other communities too. Then I'll also keep this very short: what we are currently approaching is to apply something like a V-model, as a knowledge model, not as a process model, to the Zephyr project, to really have requirements and traceability and everything that we need. There's a lot of stuff there, and there's a lot of stuff happening. As we already said, we have two committees working on that: there's the safety committee, consisting of the Platinum members, that really makes the strategic decisions about a final certification, and there's a working group really creating the value for everybody, so that you can have all the artifacts and all the information that you need for a safety qualification. This is a current snapshot of our requirements work: we do requirements using StrictDoc (Stan from StrictDoc is over there), and for everybody who wants to know more about that, we have a talk on Sunday in the SBOM devroom, around lunchtime. We also get asked what to do if you want to contribute. Same thing as always: just show up, please. When you show up and you have ideas, you have best practices, please share them; the communities are all open to that. Even when you don't know much about safety, just show up, because everybody will just tell you: hey, listen and learn, and we do this together. And when you're a safety person that wants to contribute, or that wants to bring open source to their products, just become an ambassador for open source in safety, because the quality usually is really high, the functionality is very high, and there's a lot of stuff around that can be used, where we have the value through collaboration and not just through purchasing license agreements and all that, so
we have a lot of value here. One final thing: there is no certification date set, so please don't come to us and ask; we could collect money and bet on when we will be ready. We are very far along in a lot of the projects. There won't be a certification for ELISA, because ELISA should help you to certify or to qualify; and for Xen and Zephyr, we are on it, we are creating the stuff, let's say, soon. And that's it. Before you leave: we have a little bit of swag left over there, but not the hat, that is from someone else.
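For anyone curious about the requirements tooling mentioned in this talk: StrictDoc keeps requirements in plain-text files. As a rough, hypothetical illustration (the document title, UID, and wording below are invented for this example, not taken from the actual Zephyr requirements), a fragment might look like:

```text
[DOCUMENT]
TITLE: Kernel Scheduling Requirements

[REQUIREMENT]
UID: REQ-SCHED-001
TITLE: Preemption by higher-priority threads
STATEMENT: >>>
The scheduler shall preempt the running thread when a thread of
higher priority becomes ready to run.
<<<
```

Tooling can then trace such requirement UIDs to code and tests, which is exactly the kind of traceability the V-model-as-knowledge-model idea is after.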
Privacy-respecting usage metrics for free software projects
Hello, hello, everyone. Welcome to FOSDEM 2024. Our next speaker is Will Thompson; please welcome him. Yeah. Hello, can you hear me? Great. How about this? Is that better? We'll see how we go. Cool. Hi, everyone. Thanks for coming today. I've seen a lot of really great talks in this room over the years; it's a real privilege to be on this side of the auditorium for the first time. So, a little bit about me. I'm an engineer at the Endless OS Foundation, where I've been for seven or so years, and I've been working on GNOME and GNOME-adjacent stuff for longer than that. Today I want to talk about why it's useful for free software projects to collect usage data, and about how this can be done in a privacy-respecting fashion. I'll talk about the Endless OS system for this as an example, maybe as an existence proof. I'm not necessarily suggesting that other projects should take what we've built and use it directly, though of course you can. But I hope to encourage other free software projects, on the desktop or otherwise, to consider adopting similar techniques so we can better understand how our software is used. I mentioned Endless; what's that? I work for the Endless OS Foundation. We are a nonprofit organization. Our vision is simple: the whole world is empowered. And access to the digital tools of the modern world is a prerequisite for being empowered. So we strive to ensure access to these tools and create opportunities for underserved and under-resourced communities around the world. We do a lot of things which are not Endless OS, even though Endless OS is in our name, but it's Endless OS I'll be referring to today. So I'll talk briefly about what Endless OS is and what it's for. In brief, it's an immutable Linux desktop distro. Visually it's GNOME, with some modest customizations to suit our target users. The groups we work with typically have little to no previous computing experience, but they have probably used a smartphone.
You can download Endless OS from our website, and in some parts of the world you can buy it preinstalled on OEM systems. But as an organization we are more focused on working with other nonprofits and with companies aligned with our mission to bring computing to underserved communities. This might be partnering with another foundation to set up a computer lab in a disconnected rural village, or we might work with microfinance organizations to make computers affordable to low-income families, and so on. In these contexts there's often limited or intermittent internet connectivity. So part of the point of Endless OS is that we pre-install lots of apps and lots of offline learning resources, and we make sure the whole system is fully usable offline. So what do I mean when I say the word metrics? I'm going to use the words telemetry, metrics, analytics, usage data and so on interchangeably; sorry if there are technical nuances to those words. I'm referring to the concept of end-user software, so software that runs on a device in your hands, collecting data about how it is used and then periodically sending this to its developers. You might be saying that sounds a lot like spying. Please hear me out; I'm not talking about that. The other part of the title was privacy respecting. You might be skeptical, because when people talk about usage data, they're often talking about slurping up all kinds of personal data about each user, building profiles of each individual, and then selling it to advertisers. So the easiest way to explain what I mean is that privacy respecting means the opposite of that. Now, the easiest privacy-respecting thing to do is to do nothing: you don't collect any usage data, you don't have to write any code, and you don't have to think about the ethical or legal issues with data collection, because you're not doing it. And maybe for a lot of projects, that's fine. So you might ask: why? Why would you do this?
Well, software is not made in a vacuum. Normally you're trying to help some group of people do something they couldn't do before. And so in order to build good software, it's useful to know how your software is being used. Is it being used at all? What hardware is it being used on? Which features are used? Which features are not used? And so on. If we have this information, we can make informed decisions about how to build the software, rather than basing it on assumption and guesswork and vision alone. The other strand to this is that a lot of people are developing free software at work. I work for a nonprofit, and I would like us to continue to do the work that we do, to advance our mission and also to contribute to the open source commons. And part of doing that is to demonstrate that the work we're doing has the impact that we are trying to have on the world. The organizations we work with have similar needs: they need to justify to themselves and to their own sponsors that it's worth putting their time and resources into working with us. So having quantitative data helps to support the impact we're making. And you might say, okay, that's fine, but why don't you just ask your users, run some interviews, do some surveys, some usability testing and so on? Wouldn't that be ideal? And yes, of course, there's no substitute for actually talking to the humans who are using our software. But it's quite rare, particularly in free software projects, to have the resources to scale that. And for some things, users are not consciously aware of the ways they're using the software. There are also limits to what you can learn from a half-hour or one-hour testing session, as opposed to usage over time as part of day-to-day work or life. It's very useful to find volunteer testers from the community, and you can learn very interesting things from that, but those groups tend to also be quite self-selecting.
So this will skew the results towards people who have a higher motivation to tell you what they would like you to do with your software. So ideally, you want both, I think. I think you want to talk to end users and explain the why behind what you can find in the data that you have. And in the other direction, having data about how the software is used can drive the kinds of questions that you want to ask your end users. And essentially every website, online store, app, and mainstream OS provides something like this. I'm not arguing that we should do something just because everyone else does it. And hearing that a big tech company does something might often be a reason to do the opposite thing. But there are non-evil reasons to want to do this. And I think it's reasonable to assume that the people who are developing software, free or non-free, typically want it to be good and useful. And other projects have similar requirements and constraints to what I've just discussed. So even with more resources, you can't constantly interview your users. And we're often at a disadvantage compared to commercially backed software. The big ones are in people and time and money. All of these things are, of course, related. And I think that rejecting the idea of collecting usage data outright just creates more unnecessary disadvantages for ourselves. We should want to have the information that we need to focus the limited time and resources that we do have. And we have the opportunity to use the structure and the transparency of free software projects to do something that's actually better than the status quo in the wider industry. We want to respect our users and preserve their privacy while still being able to make better decisions and make our software serve them better. The kind of axiomatic thing here is: we do not want to collect personal data. We don't want to track individuals. We don't want to sell that data, or worse, have it stolen through some database hack.
We don't want to serve targeted advertising and so on. Of course, handling personal data comes with legal responsibilities as well, so if you can just not collect personal data, it's much better for everybody. So if you want to hold a word in mind, think tally, not surveillance. An analogy, thanks to Cassidy, who's here: think about a library. So near me, our local library is run by volunteers. And you might imagine that one day you go to the library and there's someone at the door holding one of these little tally clickers. And for each person that goes through the door, they click it once. And this helps them to get some kind of measure of how well used the library is. Maybe they can collect a similar tally on different days of the week or at different times. And this can help them decide how they staff the library, advocate for more funding from the local government, and so on. The other end of the scale is if you imagine someone kind of following you around in the library, looking over your shoulder, saying: okay, you've gone to the computer book section. You've gone to the children's book section. You probably have a child. Okay, watching what you're reading. Obviously this is hyperbole, but this is really not what we're talking about here. So sometimes you can get this kind of tally information from some kind of service that you control. So Flathub is the de facto standard Flatpak repo. We recently announced that it's reached one million users. So how do we measure this? Well, it's measured by a proxy. There is a runtime which, we claim, most users of Flatpak have at least one app installed which uses. And due to the way that Flatpak downloads updates, you can tell the difference between an update and a fresh install.
So when a new point release of that runtime is made, you simply count how many downloads there are of updates for that runtime in a given period of time, say, a week after you've released the runtime. And this gives you a pretty reasonable lower bound on how many installations of that runtime there must be. And no identifier was needed. We didn't need to look at IP addresses or machine IDs or anything, just having some knowledge of the ecosystem and how the Flatpak client behaves. And there are other places where this idea is used. So in Fedora, there's this thing called Count Me. Endless OS has something similar. So DNF is the package manager for Fedora, and it has to periodically update the list of packages that are available. And so the approach here is that in one random request per week, and these requests would be happening anyway, an extra parameter is added, countme. And it has a value which refers to how long it's been since you first installed your system, which gives you some indication of what retention is like for the system. Then from the user agent, it's possible to infer what the distro version is, what variation of Fedora it is, the architecture and so on. And it's a clever idea to piggyback on the metalink request. And again, it lets users be counted without personal data, because there's a fixed frequency. So they publish the aggregate data, which I've doctored a little bit to fit on the slide. So here, again, there's the fixed frequency, which meant that no identifier was needed. And there are also these kind of statically determined segments of the user base, which don't identify any individual. They identify a massive group of systems. So the three main ideas for doing something else here: we want to generalize this approach to finer-grained data. But it's data that we wouldn't otherwise have, because we don't get anything as a side effect of stuff that's happening entirely on your local device.
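A minimal sketch of the Count Me idea just described, in Python. The function name and the bookkeeping are my own; Fedora's real implementation buckets the install age into ranges and tags one random metadata request per week, whereas here the first request of each ISO week is tagged:

```python
from datetime import date

def countme_param(install_date, today, last_counted_week):
    """Sketch of a Count Me-style parameter: tag at most one metadata
    request per calendar week with a counter of how many weeks ago the
    system was installed. No machine identifier is ever sent."""
    iso_year, iso_week, _ = today.isocalendar()
    week_key = iso_year * 100 + iso_week
    if week_key == last_counted_week:
        return {}, last_counted_week          # already counted this week
    weeks_since_install = (today - install_date).days // 7 + 1
    return {"countme": weeks_since_install}, week_key
```

On the server, counting the tagged requests per week gives a user tally, and the counter value gives a rough retention curve, without any identifier.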
In the library analogy, they do have this information, I suppose, of which books people are borrowing. It doesn't matter who's borrowing them; it's just, in general terms, what's popular. So, the three ideas here. The first one I've mentioned already is sending on fixed frequencies. If you send information, or rather record it, on a daily basis, this means that when you look at the data you can be sure that if two different events on the same day appear, they must have come from two different systems. But you don't have to identify which particular system they came from, and you can't tell which events from one week and the next came from the same systems. On the other axis, we're not interested in individual events. We're interested in patterns of usage data. You generally want to be able to compare those patterns between different groups of your users. Maybe it's by software version, maybe it's by locale, maybe it's by hardware. It depends on what you're trying to learn. But these are determined ahead of time. They're static and they are common across a large group of users, or devices, rather, I should say. The third piece: for data which is instantaneous, or which you're collecting on a timer, this is easy. But some things are continuous data, which this doesn't work for. For example, app usage. You might want to understand which apps are used the most in terms of time. This is something where you might, on a given day, open and close an app several times. You need to do some kind of client-side aggregation to turn this continuous value into a single data point on a fixed frequency that you can record by itself. So the Endless OS metrics system, you'll be shocked to hear, works as I've just described. It breaks down into a few components, which I'll go through, kind of following the direction of the arrows in this diagram.
We'll talk about what happens on the end user's device, then how that's transmitted to a server, and then what happens once the data reaches the server. So for local event recording, we have a daemon which runs system-wide. It's a D-Bus service, and applications on the system use a D-Bus API to talk to the daemon to record when certain events happen locally. Some of these components are just regular system components doing the things that they normally do. So our updater, for example, you can see in the red box in the bottom left, is recording an event when an update has failed. There's also one extra daemon, this metrics instrumentation thing, which is for capturing just general stuff about the system: CPU information, disk usage, and so on. We actually also have a mediocre crash reporting system using this mechanism. It's not ideal, but it's better than nothing. And as we'll see, it works for a system which is intermittently connected. So each of these events has some kind of payload associated with it. We'll zoom in on the red event from the updater. When an update fails, we capture some information here. We capture the time at which the update failed. We capture the OS version that it occurred on. We have this UUID. Now, this is not specific to this event that happened on this one machine. This ID is the same for all updater failures. It identifies the category of event that's occurred. And then we have a payload, which in this case is just a human-readable and localized error message. And that's kind of gross. We have some nasty pattern matching to untranslate the string in some cases and take out the values that vary, just to narrow this down. We transmit the raw event because it was the only practical thing to do given the way error handling works in the updater. But it's still very useful. From this we can determine that the most common reason updates fail is that the disk is full.
The updater runs in the background, so it's unlikely that people will be actively checking it. So it's useful for us to know: are there fixable errors that we can sort out somehow? I also talked about app usage. We've patched GNOME Shell on Endless OS to record how long particular apps are used for. And this one gets aggregated in the way I described earlier, where you coalesce this continuous variable and slice it by day and by month. Here I'm showing by day. And it's actually the metrics daemon which does this. The shell tells the daemon: start recording an event with this UUID and this payload. And then sometime later, when you close the app, it says stop. And the daemon takes care of coalescing multiple instances of that into one in any given time period, and of slicing it if it runs over midnight or over the end of a month. Okay, so now we've got a load of events buffered in this daemon. We have an in-memory and an on-disk buffer with a size limit, so we just delete old stuff if we run out of space. And then, if and when you're online, the daemon periodically reports these to our server and then deletes the local copy of the events. This is an HTTP request. You might be saying: you said there's no device identifier, but yes, there's an IP address. We'll come to that. That's an artifact of the internet. And this upload contains as many of the events we have cached as we can fit in a single request, plus a timestamp. Actually, there's more than one timestamp; there's a clever algorithm to correct for incorrect clocks. And a channel. What's a channel? This is the kind of static segmentation that I referred to earlier, and on Endless OS we have just a couple of things here. There are some flags for: is it a standalone install, a dual boot or a live system? Interesting. But the main thing is this image identifier. And this is an artifact of how we build and distribute the OS.
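The buffering behavior described above, a size-limited store that drops the oldest events when full and deletes events once a batch is taken for upload, can be sketched in a few lines (the class and method names are illustrative, not the daemon's real API):

```python
from collections import deque

class EventBuffer:
    """Sketch of a size-limited event buffer: when full, the oldest
    events are silently dropped; events taken for an upload batch are
    removed locally, mirroring delete-after-submit behavior."""
    def __init__(self, max_events):
        self._events = deque(maxlen=max_events)  # full => oldest falls off

    def record(self, event):
        self._events.append(event)

    def take_batch(self, batch_size):
        """Remove and return up to batch_size events for one upload."""
        n = min(batch_size, len(self._events))
        return [self._events.popleft() for _ in range(n)]
```

Using `deque(maxlen=...)` gets the drop-oldest policy for free; a real daemon would persist the buffer to disk as well.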
When you install Endless OS, you're taking a disk image, which has been pre-built with a load of apps in it, and you install that by just dd-ing it directly to your disk. You image the disk with the same image. And we have customized these in various dimensions. There's this product ID, which is how this came to end up on a computer. Was it a download version? Was it an OEM partnership? Or was it another organization we're working with? Or is it a custom-built image that someone has built using the tools we provide? There's some other stuff about the original OS branch that was installed and the hardware architecture. And then there's this personality, which again is an artifact of the way the OS works. If you're pre-installing lots of learning resources, you want them to be in the language that the user speaks. So we have different variations for different locales. And we have a basic one which doesn't contain all of the massive reference apps. And when we work with partners or on particular projects, we often make a customized version for that, and that identifier ends up in this personality field. So if you go to the website today, or in fact at any point since the third of January, and you download the French version, you will get this image. It has this OS product, which is what we call the download version, some attributes about the branches, the timestamp of when that image was built, and the personality. And so any system installed having chosen French will end up with exactly the same identifier, ever since the start of the year. So this is what's on my laptop. And I happen to know that there are only two other users of this one, and one of them is over there. That's a unique case, because we built this specially for a bunch of laptops in the UK Endless team, and we never published this image. And that's an edge case. In general, the same OS image is used by many different systems.
So we have submitted a batch of events, together with the channel, to the server. What happens? Well, first of all, we discard the IP address. We don't want that. The HTTP endpoint adds yet more timestamps to this bundle of events and puts it in a Redis queue. Now, something totally separate, which has no idea where this bundle of events came from, pulls from the Redis queue, splits the events apart and stashes them into a SQL database. There's one table in that database for each category of event. So I talked earlier about the daily app usage event. This table has a field for the day, a field for the app, and a field for the duration. In this example, of course, in the real database, there'd be many more rows. But just by way of example, you can see there were two different GNOME Terminal events on the 30th of January. So we do know that there are two different systems. We don't know if the Chromium user on the same day was either one of those two users or a third user. The next day, there's an event for GNOME Terminal, two and a half hours' usage. We don't know if that was any of those two or three users we've already talked about, or a fourth user. We also have this aggregation by calendar month, which has higher latency, but it tends to be less noisy. And these tables are not linked to a device identifier. They're linked to the channel that was associated with the event. So this has the image identifier, which is shared between many systems. And so we can't match up which different events came from the same system. We can't even identify which different instances of the same event came from the same system. Of course, there's an element of trust in this: the server could be behaving not in the way I described. The best answer we have for this is that we're not doing that, and the server is all open source. So you can go and take a look at it; what's on our GitHub is what we run.
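By way of illustration, the per-category table and the "two events for the same app on the same day means two distinct systems" reasoning can be reproduced with a small in-memory SQL database (the schema, values and the `eos-fr` channel string are made up for the example):

```python
import sqlite3

# Sketch of one per-event-category table: the "channel" column holds
# the shared image identifier, never a per-device ID.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE daily_app_usage
                (day TEXT, app TEXT, seconds INTEGER, channel TEXT)""")
conn.executemany(
    "INSERT INTO daily_app_usage VALUES (?, ?, ?, ?)",
    [("2024-01-30", "org.gnome.Terminal", 3600, "eos-fr"),
     ("2024-01-30", "org.gnome.Terminal", 5400, "eos-fr"),
     ("2024-01-30", "org.chromium.Chromium", 1200, "eos-fr"),
     ("2024-01-31", "org.gnome.Terminal", 9000, "eos-fr")])
# Two rows for the same app on the same day must come from two distinct
# systems, so the per-day maximum row count is a lower bound on users.
(lower_bound,) = conn.execute(
    """SELECT MAX(n) FROM (SELECT COUNT(*) AS n FROM daily_app_usage
       WHERE app = 'org.gnome.Terminal' GROUP BY day)""").fetchone()
```

Note what the table cannot tell you: whether the Chromium row came from one of the Terminal users or a third system.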
And the system is on by default, because we've designed it to be privacy respecting. When you first install Endless OS, like many GNOME systems, you get an initial setup wizard which takes you through some steps to set up your system. This is actually from the development branch; it looks a little different in the released version. There's a toggle for enabling or disabling this feature. The toggle is enabled by default, but nothing will be submitted until the user setting up their system has gone past this page and continued to the end of initial setup. If you set the switch to off, then nothing will be captured, and anything that's already been buffered but not submitted will be deleted. Of course, you can control this later as well. Once the events have been submitted to our server, there's no way for us to delete the events for a particular system, because we don't know which events came from which system. And defaults are very powerful. The overwhelming majority of systems leave the default enabled. You might say, well, of course they would. Everyone likes defaults, right? The point of this is to get more representative data about a large body of systems. The system collects no personal data; it's designed not to be invasive. Being on by default keeps us honest about that. We really have to be sure that we're not collecting anything questionable. And some people, you can see some number here, may prefer that we don't do this. Of course we allow that, but we don't force everyone to make a choice. Decision fatigue is real, particularly during first boot. We've seen that people get scared off by the number of questions that are asked. What's a keyboard layout? So adding more questions which people don't have the context to answer is not necessarily helpful. I acknowledge that not everyone agrees. There are other opinions. This is what we do for now. So what have we learnt? Some people may have read a blog post that I wrote six months ago with some examples of what we learnt.
So for those who have read it, everything here is new. Parental controls. Some time ago we developed a feature in Endless OS to allow parents to disable access to certain apps which are installed on the system, to control whether their child using the system can install new apps, and to set age rating thresholds on those. As part of integrating this into GNOME, which is now upstream, and this screenshot is from GNOME OS, we added this to the initial setup flow, so it's more easily discoverable. When you create a new user, as well as choosing their name and the username, you can tick a box, which is a little out of focus here. The box at the bottom says: set up parental controls for this user. It's unticked by default; some people tick it. If you tick it, three things happen. The user you create is a standard user, not an administrator. A separate administrator user is created with a separate password. And then on the very next page you're offered the option to choose which parental controls you want to apply to this child. Now in this screenshot, if you squint at it, no controls have actually been applied. The default is that you have to actively choose which things you want to restrict. Do you want to restrict access to web browsers? Do you want to turn off certain apps? Do you want to set an age limit on which apps people can install from GNOME Software? And we instrumented this, and a large minority just left the defaults. So 40-something percent of people who chose parental controls didn't actually enable anything. That doesn't tell us why they didn't do that. I mean, you can come up with some good theories, but it tells us that there's research to do in this area, and it can help us to guide what we do next with this feature. The tour. So GNOME 40 introduced a tour that's offered when you first log in, and that's whether you've previously used an older version of GNOME on the system or this is a fresh installation.
Endless OS 5 was the first release to include GNOME 40, and it looked, as I showed you earlier, very GNOME-y, which is rather different to what the previous versions of Endless OS looked like. So we inherited this tour. When you first log in you get this prompt, and if you choose to take the tour, you get a tour which just briefly walks you through how to use the desktop. I was curious whether people actually take it, so I added a very quick patch to instrument this. This isn't really a show-me-the-code kind of talk, but just as an example, this is what you need on the client side. It's legible. So the top line is where we define a constant for the UUID; we just generate an ID. Then you have the two lines where you create the payload, which is a Boolean which is true if they chose to take the tour and false otherwise. And then we call this method on the event recorder class to record the event. That's all you need on the client side. This is a small C library around a small D-Bus API, and there's GObject Introspection around it, so you can access it from JavaScript and Python and all the other things. Then the server. This is using SQLAlchemy as the ORM. You define a table like this, which has the name of the table, the same UUID, again, this is for all events in this category, the payload, and how to turn the payload into an instance, or into a row of this table. It's a little annoying that you have to do database migrations to add or remove events on the server. It has the downside of having the data in these nice structured tables, but there's an upside in that we can generate the documentation, which is on Read the Docs, of which events the server understands. So the results are in. We captured this bit of information from 35,000 systems, and across those 35,000 systems, about 19% chose to take the tour.
My assumption was that more people who are upgrading would take the tour than new installs, because if you're upgrading, you're surprised: this looks a bit different, what's going on? Actually, it's the reverse. In the top row we see users who are fresh installations, and 32% of those users took the tour, out of 5,000 total in the period we sampled this. Whereas for upgrades from Endless OS 4, it was just 15% who chose to take the tour, out of a total of 29,000. This is just a snapshot, because now that we've answered the question, we've deleted this data. We've erased the data from the database. We add the UUID to a list of events which get discarded as soon as they're received from old clients. We've also updated the OS to remove the three lines of JavaScript I showed you, so we no longer collect this data on up-to-date systems, and we discard it if we receive it from old ones. This is the part where I talk about all the things that are subpar about this system and what we might do in future. The big one is that it's actually really annoying to have the data split out in this way. All the app usages are atomized, and we can't answer questions like: does someone who uses app X also use app Y? Is there any correlation between groups of apps that people use? We could of course submit one event which contains all of the apps that a given user uses, but maybe that's a bit too fingerprinty. It would be nice to find some way to answer questions like that without implicitly fingerprinting users. It's also hard in general to slice this in new dimensions that you haven't already chosen to slice by. One question might be whether parentally controlled accounts behave differently in some ways to accounts that do not have parental controls enabled. The parental controls flag is not part of the channel, so we can't see for any other event whether it came from a parentally controlled user or not. This is all just a consequence of what you choose to slice the data by.
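As a quick sanity check on the rounded figures above: 32% of 5,000 fresh installs plus 15% of 29,000 upgrades works out to roughly 17.5% overall, in the same ballpark as the quoted "about 19% of 35,000" once you allow for the rounding of every input.

```python
# Weighted average of the two rounded tour take-rates quoted in the talk.
fresh_total, fresh_rate = 5_000, 0.32
upgrade_total, upgrade_rate = 29_000, 0.15
took_tour = fresh_total * fresh_rate + upgrade_total * upgrade_rate
overall = took_tour / (fresh_total + upgrade_total)   # ~0.175
```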
I think the trade-off is worth it, but I need to acknowledge that it is annoying to not have an identifier. There's also some kind of indeterminate upload latency. The problem here is: how do you know when you have basically all of the data for the last time period? It's particularly bad for monthly events. Today is the third of February. Let's say I left my desktop at home. I switched it off on the 31st of January. We can't submit any data for January until February has started, because otherwise we might have to add a bit more to the tally after the fact, and you can't do that on the server. Now my computer at home is switched off while I'm here. I'll switch it back on on Monday. That's the fifth. That's a five-day lag. Is that typical? Maybe we could look at the timestamps when we receive the events, but we can't do that, because we don't store the received timestamp for each event; if we stored that, we could figure out which events came together. You can probably imagine ways to solve this by reducing the precision of timestamps, and I think that's true in general. There are some cases where we have more precise timestamps than we might like, largely for historical reasons. There are some complications if you can't assume that the local clock is accurate. Of course NTP exists, but many Endless OS systems are used mostly offline, and we've also found it's quite common for the real-time clock battery to have run flat. So it's not that unusual for people's laptops to have a totally incorrect time until they connect to the internet, and then when they go offline and run out of power, it goes back to some time in the past. There's a lot of research into how to randomize the data that's submitted. There's randomized response, differential privacy. I'm sure there are people here who know more about this than me. We haven't really explored this, but the basic intuition is that you add noise to the data you record.
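The "reduce the precision of timestamps" idea mentioned above can be as simple as truncating a server-side received-at time before it is stored (a sketch of the idea, not what the real server does):

```python
from datetime import datetime

def coarsen_received_at(ts):
    """Truncate a received-at timestamp to day precision, so stored
    timestamps can no longer reveal which events arrived together in
    the same upload."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)
```

The trade-off is exactly the one described: you gain unlinkability between events, but lose the ability to measure upload lag precisely.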
Suppose you're recording a Boolean, maybe the parental controls flag as an example. In 50% of cases you just always say true, and then in the other 50% of cases you submit the true value. That of course changes the results you get, but once you aggregate it, you know that of the 100 responses you get, you expect to see 50 trues just from the coin flip, and so then you can look at the rest of the batch of events to figure out the true ratio, without actually having to know whether any individual data point which says true is really true. This might be a way to allow collecting more interesting facts without getting into personal data. There are lots more questions we might like to ask about the software we ship. There are questions like: are most desktop Linux systems single-user, or do they have multiple different Unix users on the system? What are the common monitor configurations? How common is it to have an external monitor most of the time? Do people change this around? Do people have their screens arranged horizontally or vertically or in a cool circle shape? Do people use workspaces? How do they use them? Which GNOME Shell extensions are in use? I could go on for an hour; I won't. I think this data would be much more interesting if we had comparable data from other GNOME distributions. I'm using GNOME as an example just because that's what we ship; insert project name here. Every distro reaches a different group of people. Those groups will have different behavior. For example, I would claim that the typical Fedora user is probably quite different, geographically, perhaps economically, perhaps in terms of technical skills, from the typical Endless OS user. If we had a common structure of data that was shared between all distributions of a given project, we could compare how the same upstream software is used in different contexts. Other organizations who do this kind of telemetry have public dashboards of the aggregate data. I showed you Fedora's published data from their repo servers.
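The coin-flip scheme just described is the classic randomized response technique. A sketch in Python, including how the aggregate gets debiased (the function names are mine):

```python
import random

def randomized_response(truth, rng=random.random):
    """With probability 1/2 report True unconditionally; otherwise
    report the real value -- the coin-flip scheme described above."""
    return True if rng() < 0.5 else truth

def estimate_true_rate(reports):
    """Invert the bias: P(report True) = 0.5 + 0.5 * true_rate,
    so true_rate = 2 * observed - 1 (clamped at zero)."""
    observed = sum(reports) / len(reports)
    return max(0.0, 2 * observed - 1)
```

No individual report can be trusted, which is the point, yet the aggregate estimate converges on the real ratio as the sample grows.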
Mozilla has this great Firefox public data report, which gives you daily active users, monthly active users, version statistics, locale, top add-ons, and you can slice this by country as well as looking at it globally. Steam has this very interesting hardware survey. They've made a different choice: this is opt-in, with a pop-up dialogue, and still anonymous. It's very interesting. The median gamer is probably quite a different user to the median desktop Linux user. And, kind of, you know, a little tongue-in-cheek, haha, only serious. In December, Spotify publishes this thing where you can open your Spotify app and it tells you, in really garishly bad images, if you are, like, one of the top 100 listeners to some artist. And you see a lot of people remarking, when they do this, that this is kind of creepy: they have all this data. It's very cynically free marketing for Spotify. Now, of course, that's true. It is free marketing for Spotify. Other streaming services are available. But it's also fun and sociable. I've had conversations off the back of this that I wouldn't have had otherwise. And maybe we can have free marketing too. But we could do this differently. The central entity doesn't know anything about any individual, but we could potentially publish percentile distributions. And then, on the local device, you could fetch this and determine: oh, right, you are actually in the top 5%. Maybe this is a bad idea. I don't know. Anyway, just to wrap up, I hope to have made the case that telemetry doesn't have to be creepy. There are ways that you can gather data about how your software is used without being invasive and building profiles of your users. And in an industry where I think not enough thought is given to this, I think we in the free software community can lead by example.
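The "publish percentile distributions, compute your own rank locally" idea could look something like this, entirely hypothetically: the server only ever publishes aggregate cut-offs, and the device compares its own number against them without sending it anywhere.

```python
from bisect import bisect_right

def decile_band(my_value, published_cutoffs):
    """Place a device's own value against published decile cut-offs.
    published_cutoffs[i] is the value at the (i + 1) * 10th percentile;
    the return value is the highest decile boundary my_value exceeds."""
    return bisect_right(published_cutoffs, my_value) * 10
```

A usage sketch: if the published monthly-hours cut-offs were `[10, 20, ..., 90]`, a device with 85 hours would place itself above the 80th percentile, all locally.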
We can build something that is better and allows us to improve our software, while showing that a better way is possible to the broader industry. And if we do that, we can make decisions based on the combination of data and vision. The two work together to make something that's really great. Tomorrow morning, there's a telemetry BoF in room AW1.121. I hope to see other interested parties there, and for people to tell me about all the prior art that we didn't know about. That would be great. Hope to see you there. Otherwise, that's all I've got. There are some various links. If you follow me on Mastodon, don't expect too much discussion of this, but you're welcome to come. My blog has an older write-up on the same topic, which has some more detail in places and less in others. And the source code is on GitHub under the endlessm organization: the server, and the event recorder, which is the service that buffers and submits the events. endlessos.org is the place to go for more information about the Endless OS Foundation and our work. Thank you. Hey, hey, hey. Does anyone have some questions? Please raise your hand. Oh, okay. I was wondering, you showed us that 10% opted out of sharing metrics, but how do you know that? So, in case the question didn't come across the PA: I think, if I'm right, the question is, I said that 10% of people opted out; how can I know this? So, I mentioned that we have a system similar to Fedora's Count Me system. It sends a daily ping with a retention counter, with no other identifying information, plus there's a Boolean which says: is the full-fat metrics system on or not? Okay. Thanks. Does anyone have another question? Oh, I see you. Thank you. Hello. Hello. So, your talk has mainly been focused on how to get metrics. Sorry, I can't quite hear the question. Sorry. I couldn't quite hear what you were saying. Sorry. Is this coming through? Yeah, okay. So, your talk was mainly focused on anonymous metrics, effectively, making them as unidentifiable as possible.
And you did say that one of your problems is, if you wanted to aggregate, if you wanted to sort of correlate these metrics to kind of figure out, okay, if person X uses this app, do they also use the other one? Have you given any thought internally to how you might do this in a way which wouldn't impact privacy? You mentioned fingerprinting would be one concern. Have you elaborated on that at all? I didn't catch all of that, but I think you're saying that I mentioned that we would be interested in knowing. This talk is mostly focused on an anonymous system, and so this is one of the reasons we can't answer the question: who uses both app A and B? And, I think if I understand the question right, it's: do we have any ideas for how we could do this? Effectively, yes. Okay, there are a few ways you could do this, right? One idea that we haven't explored, but I think would be interesting, is to layer onto this an opt-in system. So you could prompt people to be part of a time-limited study, and you could temporarily add something extra to this channel which identifies them specifically for a fixed period of time. Then we'd turn it off on the client side, analyze it on the server, and then delete it. And I think it's easier to temporarily add more stuff to the channel than to remove it. The other way to do this would be to look at some of these differential privacy techniques, and then submit a single event containing aggregate app usage for all apps on the system in any given week, let's say, but add artificial noise to that. So, with some probability, change the numbers, replace the names of the apps, or remove apps from it, in a more systematic way than just shuffling it around. And there are techniques you can use to add noise while keeping the distribution of the data the same. We haven't had an opportunity to go into that, but I think that's probably, in the general case, the way to address this point. Maybe there are other ideas.
I'd love to hear more. Thanks. Any questions? If you have any questions, you can raise your hand. Hello. We still have 10 minutes to ask a question. 10 minutes left. Any questions? You can raise your hand. Okay. Okay. A round of applause for the speaker. Thank you very much.
Learning from disaster response teams to save the internet
Hi, everyone. It is great to be here. Thank you for coming to this talk. If you're here for the magic show, I'm afraid you have 30 minutes to wait. I'm here to guide us in an exploration of what we as a community, as open source practitioners, can learn from some of the most finely tuned and highly performant teams in the world: first responders. Through the interdisciplinary lens of social network science. So perhaps there is some magic in this talk. The magic of people working together. My name is Hannah Aubrey. I lead Fast Forward at Fastly. Let's save the Internet. In a past life, I was lucky to serve as a study coordinator of SONIC. No, not the one with the roller skates and the hamburgers. Dang, I knew that joke wouldn't play in the EU. The Science of Networks in Communities research group. SONIC advances social network theories, methods and tools to better understand and meet the needs of diverse communities. They develop cutting-edge techniques to study and improve social and knowledge networks in distributed working groups, online communities, virtual teams, and other large communities like the one we're all in. I am thrilled and a little bit abashed to share that the director of SONIC, Professor Noshir Contractor, is here in the audience today. Thank you for coming, Nosh. And my dear friends, if you have any tough questions, please direct them at him. Let's start with a history reminder. Our earliest ancestors not only had to contend with the same natural disasters we experience today, they also had to adapt to and survive nature itself. First, we became bipedal, freeing our hands to reach and to grasp and also to communicate simply with each other. Next, we developed complex brains with prefrontal cortexes, our personality centers, which enabled us to make split-second decisions based not only on external stimuli, but also our past experiences.
Then we developed symbolic language to communicate complex ideas, and then finally tools to take control of and shape our surroundings. So you see, what makes us uniquely human, actually, what has brought us here together today, the abilities to ponder, convene, reflect, build, collaborate, and coordinate, are not only what make us so special, but also so successful. So then our tools got a lot better. The first fire pump was invented in Alexandria in the third century BCE. Unfortunately, it could not save the library, but I digress. As societies and civilizations began to form, the blast radius of disasters grew. We settled into towns that could burn down and buildings that earthquakes could topple. And so those smart brains of ours formed teams whose sole purpose was to patrol for and respond to natural and man-made disasters, in the form of firefighters and police forces. Then societies became more complex. And with that came more complex disasters: not only fire and flood, but we created monetary systems, banks that could collapse, and food systems that were prone to mass famine, not always for lack of food, but sometimes for lack of transportation or poor planning. Our close proximity to each other in cities, and long-distance cultural exchange made possible by ships, brought diseases, colonization, and war, which ravaged human populations. We think of these ages as dark or undeveloped, but their responses to such crises were surprisingly neither. In fact, we begin to see thoughtful and multifaceted disaster response: not only search and rescue or medical aid, but tax relief, temporary infrastructure, even what we now call refugee camps, providing long-term food and shelter for displaced peoples. In 1493, the Knights Hospitaller shipped doctors and surgeons to the Greek island of Kos after an earthquake. And so we see some of the first evidence of multiple different groups or organizations coordinating across disciplines and borders to respond to a disaster.
In the intervening years, we've continued to hone our disaster response strategies. Humanity's impact on this planet has required us to do so. And besides, those prefrontal cortexes of ours have a lot more data to lean on than our friends the ancient Alexandrians had. If they knew then what we know now, maybe they could have saved that library. I should pull it together. Anyway, today we have entire organizations, governmental bodies, NGOs, and community groups dedicated to such activities. We have laws, by country and internationally, to enshrine basic human rights and ideal responses in crises. And now we're building a new frontier, a new form of transit. We're creating massive new civilizations, hosted on smallish, inscrutable, blinky boxes. In this new world, we can't even really see the threats, the crises. We're throwing people together in a way that's affecting global social structures and people's everyday lives. Like every form of infrastructure, like most every place where humans gather to live, to work, to learn, to play, the internet has grown up in an unplanned way. And we're still scrambling to understand it, to learn from our mistakes, to apply those lessons, to build the best internet, to build systems that protect people and systems that react when people are harmed. But don't worry too much. We'll survive these dark ages. Our species has survived every disaster it's encountered, at least so far. A common organizational structure found in groups undertaking large-scale operations to solve big, big problems is called a multi-team system: a system comprised of multiple teams working towards a shared goal. These structures can be found throughout all sorts of industries, working on all sorts of problems: disaster response, space exploration, governing humans, building stuff. If you're part of a business with multiple departments, you're in one. If you attend or work in a university, you're in one.
And if you maintain, contribute to, support, or care about an open-source project, you're also in such a system. Because no matter what corner of the internet you occupy or to which technology you contribute, you're working in service of our shared mission to keep the internet open and free. So what makes up a multi-team system? Within the superordinate team, the entire system, we have subordinate teams working on local or proximal goals, which may even be split further into component teams. And directing the subordinate teams is the leader, or perhaps the team of leaders, which shares a global or system goal. And when you examine these teams using social network analysis, you find common patterns between successful MTSs. There are many more patterns we could discuss, but let's focus on three: a plan for coordination paired with frequent, clear communication; highly performant and resilient local teams; and finally, empowered and effective leaders who are willing to sacrifice their local goal in service of the global goal. So before we explore each of these patterns, I want to share this diagram with you to underscore the importance of these patterns in disaster response. Because that term, disaster response, makes such activity sound reactive, doesn't it? But in reality, the most effective disaster responses begin long before the disaster happens, or, second best, right after a disaster occurs. So I ask you to bear that in mind through the rest of this talk. After all, the best time to plant a tree was 10 years ago, and the second best time is today. First, let's talk about planning, coordination and communication. I don't think I need to talk about docs too much here. I think the OSS communities know this one quite well. And engineers know all about retrospectives. Like I mentioned, disaster response begins well before the disaster occurs.
So in terms of coordination and communication, knowing where to turn for help or resources before a disaster occurs spares valuable time, energy and mental load during a crisis. Effective communication prevents errors in the field, helps the even distribution of resources, and helps us learn from the mistakes we made last time so we don't make them again next time. During disasters, response teams crucially over-communicate. They share reports on the situation as it evolves. They communicate with stakeholders on the ground. And they report changes or progress to make the best decisions. Leadership and subordinate teams must have the most accurate and up-to-date information. Because knowledge sharing fosters a coordinated and collaborative environment. It reinforces the multi-team system as a single unit, not a set of separate teams. And because knowledge sharing makes it easier to be flexible and adaptable in rapidly changing environments. Interestingly, research has found that inter-communication, communication between local teams, is more important to the success of the whole system than intra-communication, communication within the local team. In fact, there's actually a Goldilocks zone of inter- to intra-communication. Local teams should communicate half as much between teams as they communicate within their own team. Any more inter-communication than that and performance declines; any less than that and it declines too. When we talk about the viability of a team, we mean the success of the team. In moments of disaster or crisis, the stakes are life and death. And at the end of the day, disaster response teams, and open source maintainers too, they're people. They have feelings. So viable teams, or successful teams, support each other. They lend a hand. They take emotions into account when making decisions. Viable teams engage in what are called disruption-buffering behaviors, which is to say change management.
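That Goldilocks ratio, roughly one between-team message for every two within-team messages, is something you could actually measure on a project's communication logs. A hypothetical sketch; the function and data shapes are my own illustration, not anything from the talk:

```python
from collections import Counter

def inter_intra_ratio(messages, team_of):
    """Given (sender, receiver) message pairs and a person -> team mapping,
    return the ratio of between-team to within-team messages."""
    counts = Counter(
        "inter" if team_of[a] != team_of[b] else "intra"
        for a, b in messages
    )
    return counts["inter"] / counts["intra"]
```

A value near 0.5 would sit in the Goldilocks zone the research describes; a ratio far above that suggests dissolving team boundaries, and one far below it suggests silos.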
They try to anticipate changes that may occur, plan ahead in the event that some change or disruption occurs, and, again, support each other through those changes. Viable teams also try to balance performance and resilience, because when you work with people and you're so hell-bent on performance, then the team's physical or mental health is at stake, and the team becomes brittle, and the team does not perform well. Because people do not want to be a part of such a team, right? So I'll say that again. There's a difference between successful teams and teams that people want to be a part of. And in the long term, teams that strike the right balance are the ones that are the most successful. Finally, the most performant teams strike the right balance between boundary-reinforcing behaviors, which is to say reinforcing the identity or team spirit of the local team, and boundary-diminishing behaviors, which reinforce the local team as part of something larger, as part of the whole system. So a little bit of silo is actually good, but not to the extent that teams develop an us-versus-them mentality. Which brings us to our last assertion today: empowered and effective leaders. Strong leaders serve as an ambassador to the team and for the team. Internally, they help teams understand why the team has a certain goal or is performing some task. Within the system, they advocate for the team's priorities and points of view. Those are called boundary-spanning behaviors. They make sure that the team has the information it needs, not only the what, but the why of a task or priority, and that they understand their own team's priorities. In a disaster response scenario, time is of the essence. Rapid decision making allows teams to quickly assess the situation, evaluate available options and act promptly to address emerging challenges.
Delays in decision making can lead to missed opportunities, increased risks, and further escalation of the situation. And as much as we're proud to be a part of our own team, we must recognize and understand other teams, respect and contribute to their priorities, and not be too selfish in our own focus. That's why a crucial feature of successful multi-team systems, of disaster response effectiveness, is that local leaders and teams are willing to sacrifice their local goal if it means more for the common good. So now that we've immersed ourselves in the theory of effective multi-team system performance, let's illustrate it with a real-world example. I recently discovered this amazing YouTube channel. It's called Brick Immortar. It's all about infrastructure disasters, ship sinkings, critical failures. It's fascinating. If you're into this kind of stuff, check it out. You'll never look at bridges or tall buildings the same again. The sinking of the ferry MV Sewol on April 16, 2014, off the southwestern coast of South Korea, was a disaster not only in and of itself, but also of multi-team system performance. Over 300 people paid the price for these failures with their lives. On what seemed to be a trip like any other, the ferry suddenly made a series of sharp turns. But as we know, a disaster such as this starts long before the immediate catalyst. Over the years, this ferry had been repurposed many times, and additions had been made that affected its balance point. For this trip in particular, the ship had taken on excessive cargo, which compromised the vessel's stability and made it more susceptible to capsizing. What's more, the ship's crew had drained the ballast, that's water that's kept in a ship to make sure it doesn't sink, to make sure it's properly balanced. They didn't want it to sit too low in the water; they wanted to be able to pass inspection, knowing they'd taken on way more weight than they were supposed to. So, the communication breakdowns.
First, when the ship began to list, the captain refused to send a distress call during the crucial first moments, delaying rescue efforts as the ship began to sink. He told passengers to go to lower levels of the ship, after refusing to tell them anything about the impending disaster, for crucial moments when they should have been getting onto the deck, getting ready to be rescued. When he finally sent the distress call and rescue ships came, they quickly learned that the actual communication infrastructure, the radios that the ship needed to call the disaster teams, were either malfunctioning or broken. Something had gone wrong with them. So despite the rescue teams trying to raise the ship's crew on the radio, vital communications failed during those crucial first moments. So you can see the ferry Sewol had no plan for intra-communication in the event of a disaster. They coordinated poorly, not only within their local team but also with the rescuers. So they failed to inter-communicate with the other local teams. So the system, the global team, failed. So for the sake of this section, let's quickly divide up the various local teams. The crew is a team, the rescuers are a team, the passengers are a team, and the South Korean government is a team. What were each of those teams' goals? Passengers wanted a safe trip. The crew should have wanted to get them there safely, but they just wanted to maximize profit. The rescuers wanted to make it to the site quickly and save as many passengers as possible. You would think the South Korean government would want to save their people and prevent such a disaster from happening again, but unfortunately that was not the case. Their true goal was to save face on the international stage. We'll talk more about that in a second. Now, each of these teams had goals that were in opposition to another team's goals.
And as the circumstances evolved, there was no ability of any of these teams to shift their priorities or manage this change, to negotiate the priorities and evolve. And each team in the system saw the other teams as a detriment to achieving their own goals, rather than as part of a system, as allies, as individuals worthy of consideration. In fact, the crew had never received proper safety training. So even if their goals had been aligned, they were not properly equipped to perform. Now, the next example from this horrible tragedy is an example of leadership failure and boundary reinforcing. When rescuers arrived on site, the assembled parties included the Japanese Coast Guard and the US Navy. When a ship sinks, often there will exist air pockets within the ship. If passengers can find them, they can survive for seven days, as long as they have food, or water, pardon me. The US Navy and the Japanese Coast Guard, and private citizens too, were on site and all had the equipment necessary to conduct such a rescue. But due to South Korea's rigid hierarchical culture and their government's desire to save face, the teams that had the equipment necessary were not allowed to perform the rescue. It's an example of unwillingness to sacrifice the local goal, and a lack of emotional, and really, life support to the passengers who just wanted to survive. In fact, throughout the crucial hours, then days, when those high school children that were trapped in that ship could have been saved, the South Korean government lied to the parents who had assembled to wait for news about their kids. They said that all the kids had been saved, despite that being quite far from the truth. So what do I hope the open source community will take from this line of scientific inquiry, from the lessons of the MV Sewol? Because folks, this ship is sinking. Our planet's ecosystem is failing. The climate is changing.
I hope when projects, especially leaders, see someone building something similar to what they're doing, they start to think: that other project is an ally. That other project is an ally, not a competitor. They think, how can we help each other? Not, how can I win? Or worse yet, how can I sabotage them? I hope maintainers who make the commitment to serve their community understand the commitment they're making and live up to that responsibility. Because remember, it's not a commitment you have to make. You can make something and choose not to maintain it, choose not to accept issues, change anything about it. But if you make that choice, I hope you live up to it. And I hope you respect your community and listen to what they need. I hope BDFLs, benevolent dictators for life, focus more on the benevolent part and less on the dictator part. I hope we can take better care of each other. So many maintainers and contributors out there and in this room are carrying so much weight and holding so much space for all of us. I hope we can do more to help them, or at the very least, I hope we can spare them kind words. I'm not under any illusions here. I don't expect what I've said here today to do all that much. People have said a lot of what I said here many times before, but maybe, just maybe, I've touched one heart or one mind, and maybe that heart or mind will go out there and make a different choice because of what I said here today. Or maybe they'll speak up and share what was touched on today with the next person when they see something wrong. Maybe, like our very first ancestor who looked up and reached, maybe we can make a little difference now that will make a really big difference for the people who come after us. Because the last 10 years, the platformification of the web, the enshittification of those platforms, that was not a new normal. That was a glance at a future that doesn't have to be.
Our power as a community is in our principles and it's in our numbers. If we can convene, if we can coordinate, if we can collaborate, if we can take good care of each other and choose kindness every day, if our leaders stay humble and choose the greater good over their own enrichment, ego or fame, we can change the course of this information age. We can change the course of history. But it will take all of us working together, and it will be damn hard work. The wonderful organizers of FOSDEM have given me this stage, so to close this talk, I will now issue a challenge, as if all of that wasn't already a challenge. From my perspective, and I'm speaking especially to our leaders, we must focus our collaborative energy and kindness on the following three areas. We must make the internet more efficient. We must make our code bases smaller. We need to reduce storage usage and duplicated requests, and reduce the distance data needs to travel. We are in the midst of an energy and an environmental crisis. Half our world is drowning and the other half is on fire. And as the diaspora of people across digital social spaces continues, we must collaborate across the internet community to protect disadvantaged, disenfranchised, and marginalized people. When diversity and inclusion suffer, we all suffer. Our pursuit of knowledge, societal progress, and the advancement of humanity only succeeds when we are inclusive of all walks of life, of all creeds, of all religions, of all races, of all colors, of all communities, barring those who promote violence or enable hate. And we must protect science and knowledge. We must stand for the truth, not only from a geopolitical and societal perspective, but also on an individual level. We need to protect people and the systems through which we organize into collectives. We have to make the truth resilient. Whether you recognize it or choose to identify as part of it, you are part of a movement.
Whether you're doing this in your spare time as a passion or as a hobby, or if you're one of the lucky people who has found a company to pay you to do this, you are part of a movement. You have experience and passion, and you're smart as heck. We need you. And I believe in us. Thank you.
Magic and Software
Okay, so welcome to the last talk for today, about magic and software. The talk is given by the software developer Steven Goodman and the magician James Merlin, and you decide who you see. Okay, thank you all. As we said, my name is James Merlin. Thank you for that. I haven't done anything yet. So my name is James Merlin and I'm a magician. Now, when I said I was going to present at FOSDEM, I went to the magic shop in Brussels and I asked the magic man in the magic shop, who stood behind the magic counter, have you got any new magic tricks that I can show those lovely folk at FOSDEM? So the magic man behind the magic counter had a look around, and he looked on his magic shelf and he saw a little red box, and he thought, that's a nice little red box. But I don't know what it's for. Maybe inside the box there'll be a folded-up playing card or a coin or a prediction of something yet to happen. Do you know what's inside the little red box? It was a little black box. And I thought, okay, that's nice enough. It's a little black box. I quite like it. It's magician-y, it's mysterious. So I asked the magic man behind the magic counter what's inside the little black box. Maybe there's a folded-up playing card or a coin or a prediction of something yet to happen. And do you know what's inside the little black box? Nothing, it's empty. I thought, oh well, never mind. I quite like the little black box, it's mysterious. I don't have a use for the red box. So the magic man behind the magic counter said, well, since it's FOSDEM, if you buy the little black box, you can have the little red box for free. So I thought, that's great. I know what I'll do with the little black box. I'll use it to store a folded-up playing card or a coin or a prediction of something yet to happen. But for now, I'll put the little red box back inside the little black box and put it on my magic shelf for later. Now, that's my trick. I call it Little Red Box. But is it really my trick?
I mean, the whole idea of one box that is inside another box, and the bigger box goes back inside the smaller box it came from, that wasn't me. That was, I think, Lubor Fiedler who came up with the idea. He did it as a stage effect and he had big boxes on stage. This small version of the little red box, that was an Ali Bongo idea. And that box used to belong to Ali Bongo. So the patter about the magic shop, sorry to spoil it. There is a magic shop in Brussels, but I didn't go there to get the little red box. I take it some of you twigged that that was all a lie, right? Yes. I wrote that, spent some time honing the words, thinking about what I wanted to say. Well, I can say that I wrote the words to that trick, but I didn't come up with the trick. So is it still mine? And this is one of the things we're going to look at over the next 20 minutes or so. So we should probably start at least by sort of putting out our stall. What is magic? What are the parameters that I'm going to be going through today? Well, it's not Magic: The Gathering. No one's walking out yet, so good. We've got the right audience. Magic, we'll say, is something that maybe we don't understand; the most common question is, how did you do that? We like to think of it as entertaining. And we also consider the allied arts. So these are things like this chap, Harry Houdini, an escapologist, in what is his favorite picture of himself. Is that magic? I mean, it's entertaining. There's a process that goes on that you don't know about, that the performer does. So there's that sort of thing. What about clairvoyance? Is that magic? Psychics? You don't know what's going on? They're doing something you don't know? I mean, it is said there are two types of psychic, the fraudulent and the delusional. But, for me, for this talk, I say no. I'm going to focus primarily on the magic-magic things. The debate is open on this fella still. Uri Geller, he's the one on the right. I consider him a magician.
He doesn't. He thinks he's real. Hands up if you think he's real and bending the spoon with his mind. Yeah, he's a magician. So now we know the type of magic that we're going to be looking at. Let's then start breaking it down. We start with the effect. What are the broad strokes of what roughly happens? What am I trying to convince you is actually going on? Am I trying to convince you something is appearing or disappearing or changing? They say there's only sort of seven different types of storylines that you can ever have. So maybe the same is true in magic. So, what do you think is happening? That's the broad strokes. Then we have the presentation: what am I doing to convince you that effect is happening? And then the method: what is the thing that I'm really doing, that you don't know about, that makes the whole thing work? And the presentation can come in many, many shapes. And I'll show you just two very briefly. So there's a typical plot in magic called card to impossible location. You have a deck of playing cards. Someone will pick a card. It will get lost in the pack. And then miraculously the card is no longer in the pack, but it's over there. Or it's in my shoe, or it's in the back of the room. Now, there are so many variations of this card to impossible location, you can't possibly have all of them and try and claim a right on every single version. It's no different: if the card goes from here to that shoe to there, you can't claim it's a different trick. Magicians often say, well, if you change any two of these things, it's a new trick. So the effect, the broad strokes, as the public you generally don't care. The method, as the public, you should never know. So if I as a magician change the effect and the method but keep the presentation the same, I've got a brand new trick. But everyone else thinks it's exactly the same one as before. If Penn and Teller were doing this: take a card, lose it.
Penn and Teller will make that card turn up in the guts of some fish that you pick randomly. They'll make it appear on a billboard. They'll do something gross quite frequently. But that's them. If I was doing a kids' show, I might have something like this. I can say, OK, do you all want to pretend that you're seven years old? Yeah! Hello, everybody. Hey! My name is Billy Nomates and I'm here to do some magic. So we're going to start with a book. Can you see the book? Lots of blank paper. So, anyone who can draw here, can you draw? You're going to draw on the book for me? Draw a picture. Draw a picture. And look, you've now got pictures in the book. Aren't you clever? You've drawn pictures. Now, oh, thank you. Now, who likes painting? We've got people who like to paint. You like to paint? What colors do you like? Do you like red? Take some red, take it out the end, throw it at the book. Who likes blue? Throw some blue at the book. Well done. Throw some yellow. Well done. Isn't it amazing? Woo-hoo! But you know what, boys and girls? When you stop applauding, the magic goes. And there really are no pictures and no colors in the book. Also available for weddings, christenings and bar mitzvahs. Now, that's a kids' presentation. If I was doing this for a group of sensible, grown-up people, I would present it in a completely different way. I certainly wouldn't take it seriously. I certainly don't think anyone here believes the book is magically changing. But I would have a presentation that would suit me somewhat better. It would certainly be comedic, and it would be something along the lines of this. I would like, everybody, to try a bit of mass hypnosis. Because at the moment, you can see I've got a book of blank pages. Now, I'm going to hypnotize this side of the audience to see images. Can you see images? That's because you've all been hypnotized. Can you see images? No, because you haven't been hypnotized yet.
Let me hypnotize you over this side. You're now all hypnotized to see images. Can you see images? Can you see images? But can you see images? No. Okay, you're all now hypnotized. So you can see images, and I'm going to hypnotize you extra special, so now you can see colored images. Can you see colored images? Can you see colored images? Okay, now you're hypnotized. Now you can see colored images. And I'd better unhypnotize you all, otherwise you'll still think there were pictures, when in fact we're just left with blank paper. It's exactly the same trick. APPLAUSE That's exactly the same trick, isn't it? I've just changed the presentation, yet it's completely different. So, what are we actually going to be protecting here? So, anyone remember this song by Sting, Shape of My Heart? It was really quite popular a number of years ago, and, you know, good enough song. I'll read the words for those at the back. I know that the spades are the swords of a soldier. I know that the clubs are weapons of war. I know that diamonds mean money for this art, but that's not the shape of my heart. The four suits of the playing cards are in the first verse of the song. Do you not think that every single magician who heard that song is sitting there in their basement going, ooh, I think there's a magic trick in this? And then we get to the second verse. He may play the jack of diamonds, he may lay the queen of spades. Do you think any of the magicians were stupid enough not to have realised there's a magic trick at this point, not to have got a magic trick at this point? Of course they have. Every magician had that idea, and because it's independent invention, they both lay claim to it. But because it's exactly the same trick, cards suddenly appear, there were two magicians who were fighting each other, almost physically, about who had the rights to perform the trick. No one bothered to ask Sting's permission first. So let's head back. So when we talk about magic, are we looking at art? Is magic art?
It can be. I don't think it's ever going to change anybody's viewpoint on the world, breathe life into the conversation about the human condition. But it's art, probably low art, but it's art. Is it science? Well, ultimately, there is a process going on, there's nothing mystical here, and in the case of real magic, it has to be science. It's something that TV producers, movie producers always like to skirt over if they're doing a fantasy thing that has a magic element in it. They say, oh, magic exists in this world. Well, if magic is able to exist within that world, then those things happen in that world, therefore that is science. Maybe you don't know what the science is, but it's a scientific fact that that thing happens. They just haven't considered it. And when we say, is magic a business? Oh, absolutely it is. One of the biggest parts of magic is the business. People are creating magic tricks for the sole purpose of selling them, not for performing them. So I suppose we should get to the juicy bits about secrets, right? So the first secret is, we don't call them secrets. There's no such thing as a secret. It is always a method, it's a process. Any one particular trick might use more than one secret, more than one method. Sometimes that method will be the sleight of hand that's being used. Sometimes the words themselves will be very specific to achieve a certain result. Sometimes these words might be almost inconsequential. You, as someone that's watching, will not think that those words were chosen for a reason, but they actually were. So we don't refer to secrets, we refer to them as method. And there is a lot of method in certain routines. The second part of the secret is what I call the disconnect. There is a gap between what you know and when you actually know it.
So I think most people will be aware that if I took a deck of playing cards and started dealing out perfect hands of poker, you'd go, well, the cards are already stacked in an order that gives you perfect hands of poker. You know about stacked cards, you know about bottom dealing and false shuffles that don't really shuffle the cards. But if you see a magician doing a card trick, you might not actually realize they're doing that, maybe because you've just forgotten, or they did something that disconnected that moment. It's very, very common if you really start analyzing magic. If you know what's going to happen, you can spot it: I can spot when magicians move their hand in a particular way, and I know what piece of sleight of hand they're going to do. That's just part and parcel of being a magician. But if you see that and you don't know what's coming, you forget that it ever happened. So when you see the actual thing, it's like, yeah, I know there are stacked decks, but I thought they shuffled it. I thought the spectators shuffled it. And magicians will do this all the time. They will say: if I give you the deck of cards to shuffle, and you shuffle the cards, it's probably a legitimate shuffle. But if I shuffle the cards, and then I ask you to just maybe cut them, at the end of the trick I will say, "and you shuffled the cards", and you'll say yes. You didn't; you cut the cards, I shuffled them. But I'm just rewriting the rules for you so you remember it in a different way. Magicians get stumped by this all the time. I've had people tell me about magic tricks that I have done, which have been incredible, and I'm going, I can't do that magic trick, that's impossible. And it's because they've remembered it incorrectly. Also, there's a secret in the vocabulary that we use. Special words that only we know about, that we can talk about. ID, which stands for... which is a...
And then a magician will automatically know what I'm talking about without having to keep explaining everything. So now we know roughly what we're talking about. How do we protect all of this stuff? We could use copyright. Works for software, right? Mm-mm. Doesn't work here. You can't copyright the effect. There's not enough there; it is just "something disappears". As we've said, there are only so many types of magic effect, just as there are only seven different types of story. So we can't copyright that; there's nothing there to copyright. What about the presentation? We can copyright the presentation. Me talking about the little red box, that's a script. That's a piece of prose. That is copyrightable. I can do that. That's fine. And the method? Can't copyright the method. That method is just a list of instructions, and traditionally speaking, you can't actually copyright a list. You copyright the expression of that list. Recipes don't have copyright, but you can have copyright in recipe books, because you're expressing the idea, you're creating a presentation for the recipe. The same is true of our method. So we can't use copyright. We could use patents, right? I know, software patents are evil, yeah? Are we still roughly in agreement on that? So can we patent the effect? No, it's not a patentable entity. There's nothing there. Can we patent the presentation? No, that's the thing that's handled by copyright. We've got that. Can we patent the method? Yes. If something is sufficiently advanced, a method can be protected by a patent. So you write down the method, you write out how it works, and if it's inventive enough, you file it at the patent office, and you have a patent, and you've legally protected your trick from being used by anyone else. Anyone spot the problem with that? To protect the secret, you have to file it publicly, where people can look at it. Yeah. Problem. This is a patent. This is a patent for a magic trick.
Anyone recognize which magic trick? No, because there's a disconnect. That patent applies to this magic trick: David Copperfield's flying. I don't want to spoil it for you, but if you don't want to know the answer, look away now. That's the patent to look up if you want to learn how to fly like David Copperfield. It's a matter of public record. You can go and find it. I've actually taken that number down. It was created by John Gaughan. He patented it just so that he could claim it. We all know he's not really flying, but he protects it anyway. Could we use a license agreement? We have copyright in code, and we choose to put it under a particular license. We could license our magic. Well, could we? Could we license the effect? No. What about the presentation? Certainly, it's a copyright thing, so you can say, I allow you to perform this presentation. And the method? No. Again, there's nothing there to license. Magic books, when they're sold in shops, specialist magic shops, are sold like other specialist books: in clear wrappers. So if you have a license in the book that says you are only allowed to use these tricks for your own personal use, and that's a fairly common thing to include, you have to buy the book and unwrap it to see the license agreement you've already agreed to, even though you couldn't read it. Anyone say "end user license agreement"? Has anyone ever found those to actually be a good thing? But they're in magic. We got them somehow. I don't know. So how do we actually protect all this stuff? Well, the simplest way is "in doculus privata loci", as I believe it's pronounced in Latin. Do we actually have any Latin scholars in? Good, so no one's going to correct me on that one. And that just basically means: keep your mouth shut. If you don't tell anyone what your secret is, no one's going to find out. Great. We can protect it with money. Yep. If someone wants to know how the little red box trick worked, pay me, and I'll tell you. Obfuscation.
We've mentioned having a vocabulary in magic that the lay public don't know about. We can protect it by just hiding bits of things in plain sight, essentially. If you find magic forums online, they do exist, and they do talk about how these things work. They will have little initials, letters removed from words, little stars inserted into words, so you can't tell what it is unless you already know. And ethics. Ethics is the thing that you do when no one else is watching. Business ethics is the oxymoron, but that's because business is not a community. The ethics work in magic because you don't want to get kicked out of your magic club. It's the only place in the entire world where you can get a bunch of weirdos in a room and they all feel normal. Present company excepted, maybe. But we've got a little problem here. This is a magic trick. It's not a very good magic trick. It's $15.56. It makes things float. Magicians like making things float. And this is a new device. It's obviously not a rip-off of something old; it says it's a new device. It contains no threads, no magnets, and no wires. See, you don't know what you're buying, but you do know you're not buying any threads. You know you're not buying magnets, and you know you're not buying wires. So here's another trick. This also makes things float. This is called Selfvention. It makes a mobile phone float. And this one is also a new trick, and this uses no thread, no magnets, no clear plastic, no wires. Wait, hang on. This floating thing uses no threads, no magnets, and no wires. This floating thing uses no thread, no magnets, no clear plastic, and no wires. Anyone want to guess how this works? That's an online magic shop. And what do all magic shops have? A forum where you can review things. And then someone will post: "the main gimmick came misaligned and the gimmick cards that hold the magnet..." Okay, so there was a magnet. "...popped apart after just a few minutes."
Easy to repair, but there we go. A normal random magician has just told everyone: by the way, magnet. We can't even protect them amongst ourselves. So how are we going to learn this stuff? Well, there's lots of ways we can learn it, and because I'm a magician, I am contractually obliged to do a card trick of some sort. So I've got these cards. Can you see these at the back, hopefully? So that's kind of... I'll come out here. I do know some people; find someone I don't know. I don't think I know anyone here. So could you take a few cards off the top for me? Lift them all up as a block. Yeah, any number. Lift them all up as a block, turn them all upside down, and put them back on the top. Excellent. Now, who thinks we're in on this? Who thinks I came to him earlier and said, by the way, when I do the thing with the cut... You're not very trusting. I like you. Okay. I'd like you to take a nice big handful of cards off the top. So, you came to me earlier, right? Oh, yeah. Take a big handful. Actually, I said a cut, but maybe we'll do it this way. Big, big, big handful. Turn them all upside down and put them back on the top. Okay. Now, who thinks we're in on it as well? I think you're in cahoots. Yeah, we possibly are. Anyone? Do you think we're both in on it? Yeah, do you want to come up with me? Hold the cards. No, no. Do you want to come up with me? No. Who wants to come up to the front with me and hold these? Yes? Hold them. Don't let me... People think I palm these cards. I'm not. Bring them down to the front. Somebody might even give you a bit of applause for that. Yeah, if you hold out for it. Okay. Hold them up to the front so we can see. So, you took some cards and turned them over. You took some cards. So, somewhere these cards will go from face up to face down. Yes. Can we find out where that is? Can you move through the cards? Somewhere there we should see the first face-down card. No, not that one. That's face up again. So, it's somewhere there.
See, there's no cards stuck behind them. Okay, so it's this one. So, if you'd have picked a few cards more, we'd be this side. If you'd have picked a few cards less, we'd be that side. So, it's this one. So many ways this can go wrong. Could you take that one? If you take the card and hold it up to the audience... and get ready, I've got my prediction here. Show everyone what it is and shout it out in a nice loud voice. The name of the card is the Eight of Hearts. No? Okay, so it's not. Oops. Hang on. Is it the Three of Hearts? No. Hang on, I know what I can do. The name of the card is a heart. Is that correct? Is it a heart? Okay. I don't need this anymore then. What is the card, by the way? The Ten of Hearts. Not the Eight of Hearts. Not the Three of Hearts. It's the Ten of Hearts. I don't need this then, do I? Thank you very much. Put the cards back on the table and we'll take a seat. There we go. Brand new magic thing? I learned that from a book. I could have also learned it from a DVD or the internet or from lecture notes. I could have learned it from anywhere, but I happened to learn it from a book. What about the copyright? It's a book; it still has a copyright. That particular thing was done by this guy, Ted Annemann. If you're googling on your web pages right now, you can find books by Ted Annemann. I'm not telling you which book by Ted Annemann; you've got to spend the time and effort finding it. I can tell you that I've read his book and that comes from his book. Instead of a dry-erase board, he used chalk on a chalkboard. The difference between the permanent marker and the dry-erase marker: he used paint and chalk. Who else reads Ted Annemann's books? That guy. Derren Brown and myself, James Merlin, have both learned these same things from these books. Incredible.
And yes, there's a copyright, but Ted Annemann died about 80 years ago. So even if you say, well, I'm going to buy the book... it's out of copyright now; you could quite literally just copy it freely. But it doesn't matter. Even though that secret could well be out there, could just be disseminated by anyone for free, it's a 100-year-old trick and it still gets a round of applause. So does it really matter if the trick is known? So how are we going to go about sharing it? Well, all of the aforementioned things. We have our own little nomenclature that we use, our nice little words and phrases. Everyone has that in every field of endeavor; every person has got their own private in-jokes, every industry has got its own special jargon. Magicians have it too. We have the names of gaffed and gimmicked decks. Everything's got a name for some reason. Again, I'm not telling you which book that trick comes from, but you try searching for that. What are you going to type into Google? "Trick with a ten of hearts that gets wiped off a board"? That's not going to come up with any search results. Unless you know the name of that trick, you can't actually search for it. There's no way of just grepping it. I can't remember the name of the trick. I've been doing it for 20 years now. I can't remember what it's called. I can barely remember how to do it. So the naming of things throughout: the naming of sleights, for example. If someone's got a sleight or something with their hands, it will have a name, often named after the person who came up with it. Fine, name things after yourself if you want; it doesn't make you egocentric at all. But if you know the name of that sleight, you can look it up, you can find out how to do it, and you'll probably find YouTube videos explaining it, usually by some 12-year-old kid who's doing it better than I can, because they've actually got time to do this stuff. We also do this by tipping the methods.
Magicians are like computer games, inasmuch as they have levels. If you don't have enough health points, you can't get through to the next bit of the game. If you don't have enough experience in this part of the game, you can't do the next bit. The game locks you out of content if you're not good enough. Magicians will lock out other magicians if they don't think they're good enough. The mechanics of some of these things are very, very simple. I could show you that and talk you through it. Yeah, great. But it's more than just doing a set of steps. It's about knowing how you're doing them, why you're doing them. Funnily enough, there is a certain level of skill in standing up in front of a load of people doing this stuff. How do you teach that? Is that actually learnable? And if I don't think you're good enough to put a good version of that trick on display, I won't teach you what you need to know. I'll wait until you are good enough to be able to do it. And all magicians are like that. It sounds very cruel, but we do. We give just that little hook into the secret, the method. Just like a foothold. Just enough to get you started. So we say it's the end, but it's not quite. No one has put up a board yet, so I've got time for one last thing. Some of you may have seen this, which I put on the table earlier. So you see here we have a series of what are called Zener symbols, created by Professor J.B. Rhine at Duke University. And these symbols have nothing special about them. Actually, could I ask you to just mix them around, muddle them up or something? But make sure that there are five cards there, and the symbols on each of them are different. I'll come over to this side, because these are the same; these are the ESP symbols. And on this side we have numbers. Numbers were not invented at Duke University. They are quite a bit older than that. Could I actually ask you to mix up these? How many cards have you got there? Three? Mix up those other two as well.
Count them, make sure there are five of them, and make sure all five symbols are different. I'm going to come back to you. I've got to get my steps in today. Can I have those for a moment? You're happy they're all different? I'm going to put these here, and you're going to watch me, and you're going to make sure I don't palm these out for other cards. So I'm going to put one there. One there. One here. One there. And one there. Okay, I'm not completely sure... but those are the cards that you checked, they're not... So over this side, you have five cards, and they're all different. Okay, would you give me the first one? And would you like this to go in number one, two, three, four, or five? You've got five choices. Where would you like it? Number three? Exactly right. Good. Next one. Where would you like it to go? You've got anything from one, two, four, or five. Four. Okay, have another one. Actually, let's try reading someone else. Can you give me a number? We've got one, two, or five left. What would you like? Five. And two more left. Let me try and mind-read someone right at the back. Yes? One. So, can we finally have someone who's very bad at making decisions? Exactly. Now, I say I don't do mind reading, and that's kind of true. I do what we do as programmers: the easiest way to read someone's mind is to put the idea in there in the first place. So when I say "three's a free slot", I'm not actually saying there's a free slot, I'm saying three. What did you say first? Three. Of course you did. And the same for the other ones. When you said four, I double-tapped the card onto four. Who spotted the two I got wrong? Two of these are the wrong way round, so I need to swap two of these over. Who hasn't had a chance to shout out anything yet? Want to shout out which two of these I need to swap? I need a random number. Which ones do you want to go? If I choose, then obviously it's going to look fake. So which ones do you want? Three and one.
Are you sure you're happy with that? Is everyone happy he's not a stooge, and I haven't stooged them and them and everyone? Three and one. So that's number one, and that's number three. Now, some of you are probably good enough at maths to work out this could happen by chance. There's a one-in-five chance that these are going to match. Which means there's a one-in-four chance that these will match. So that's five times four; that's 20. There's a one-in-three chance that these are going to match, so that's five times four times three. There's a one-in-two, so that's five times four times three times two. That's 120. There are 120 possibilities for this, and that's just by chance. That's just pure luck. But it's a 120-to-one chance. I've done this trick 119 times so far, so I'm really lucky this is FOSDEM. So here we have two circles, two stars, two pluses, two wavy lines, and two squares. Now, I said I don't do mind reading; well, I do mind writing. I put the ideas in your head. Who saw what I was doing to put those ideas in? I mean, where did we end? We ended up on chapter five, right? How did chapter five start? Here, we talked about sharing the magic. Let me stand this up. What's the very first symbol on chapter five? The symbol next to the words "chapter five", the symbol next to the five, and the symbol at the very end. Five is square. Who remembers chapter four? When we had the title of chapter four, what's the symbol at the beginning, middle, and end? What's the fourth symbol? Chapter three. Who remembers chapter three? The symbol is around the title on chapter three. There were a lot of slides. It's a plus. What's in the third place? It's a plus. You've been looking at these slides for the last half an hour. How are you not going to pick that? You're always going to pick the plus for number three. What about number two? There's a star. Chapter two: star, star, star, star. And finally, chapter one, what is magic? Well, it had a circle, a plus, a star, a plus, or wavy lines.
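The 120 the speaker arrives at is just the number of orderings of five distinct cards. A quick check in plain Python (purely illustrative, not part of the talk):

```python
import math
from itertools import permutations

# Five distinct symbol cards can be laid out in 5 * 4 * 3 * 2 * 1 ways.
symbols = ["circle", "star", "plus", "waves", "square"]
orderings = len(list(permutations(symbols)))

print(orderings)          # 120
print(math.factorial(5))  # 120
```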
All because of that. What is magic? My name is James Merlin. Thank you. Good night. If there's time for questions, I can take questions. Thank you.
Alexandria3k: Researching the world's knowledge on your laptop
Thank you. Good morning. How many of you have read a scientific paper in the last month? Okay, a fair number of you; that's probably why you are here. It turns out that we are churning out ever more scientific papers; I'll show figures later on. And for this reason, we are conducting studies on them, seeing how they accumulate. Some types of studies you see here: systematic studies, that is, studies of previous science that can be reproduced and are done with a specific, objective method, not just picking out a few papers from a pile; scientometric or bibliometric studies, where we measure things such as a scientist's output or the work done in a specific field; and then meta-analyses, where we use statistical techniques to combine existing studies, say on how cancer is related to smoking, in order to get better results, and other secondary studies. And as you see here, such studies have been rising from the 1970s onward at an exponential rate. Look at the scale on the left: it's a logarithmic scale, something that we will see many times in this presentation. They have now reached tens of thousands of studies every year. We've been conducting such studies in our group as well, such as looking at how various open-data papers are being used by others, or how software engineering research, the thousands of papers that are published on software engineering every year, is actually used in practice. And lately, also looking at how machine learning is associated with and used in software engineering. There have been so many papers in this area that we conducted what is called a tertiary study: we didn't look at all those thousands of papers, but at the papers that summarized those studies, a few dozen papers. Research that uses existing publication data and builds on it demands quantitative data. Two scientists are famous for establishing the field. The one you see on the left is Eugene Garfield, a linguist and businessman.
He established the Institute for Scientific Information, which produced the Science Citation Index, was then bought by Thomson Reuters, and is now part of a firm called Clarivate. And on the right, another famous scientist, Derek de Solla Price, who also worked in this field and, together with Garfield, established the disciplines of scientometrics and bibliometrics. We can measure scientific output and study it using a variety of services, such as the ones you see here. How many of you have used any of those in the past month, say? Wow, a fair number. However, there are problems associated with them. First of all, there's a lack of transparency, repeatability, and reproducibility: a query that you give today on, say, Google Scholar will give completely different results in a year. And we have no idea how the results are produced or why they appear in a specific order. Latency can be high and bandwidth can be low: if you want to run a query on tens of thousands or millions of publications, good luck going back and forth over an API. And if this wasn't bad enough, there are also rate limits. I have had the privilege of being kicked out of various digital libraries for 25 years now. The query languages that we use are also proprietary, so each service has its own query language. And they can be restricted in what you can do: maybe you can add terms, but you cannot order the results, or you cannot search in specific fields. The coverage may be limited; some services contain only a subset of what we want to search. And finally, there's the issue of availability and cost. Some are lucky enough to have subscriptions to some of these services; others don't, and even if you have a subscription, getting access to the full dataset may be difficult. Thankfully, two developments are changing the status quo. The first is the rise in the computing power that we have in our hands. I don't know if you have seen this picture.
Here is a group of people delivering an Elliott computer to Norwich City Council. And a few decades later, somebody photographing a Raspberry Pi Zero in front of the same building. I've compared the two machines. Interestingly, both are European endeavors. And there are three to six orders of magnitude of increase in power. So compared to the things we could do in '56, we can now do a thousand to a million times more. Obviously we can use this power. Remember, scientometrics was established in those past decades; we can use this power to do a lot more. The second development is the open availability of datasets. We are here celebrating openness in software and data and hardware, and a large number of datasets associated with research have now become available, such as Crossref, ORCID, United States patents, PubMed, and the Research Organization Registry. I'll go over them. So, alluding, with some lack of modesty I admit, to the famous Library of Alexandria as it should be in the third millennium, I've developed a system called Alexandria3k that allows us to perform publication metadata analytics on our desktop. What it does: it provides relational access to about two terabytes of data without needing to actually have that amount of space on your computer. Who has two terabytes available on their laptop? Exactly. You don't need it with Alexandria3k. It gives us access to about four billion records in 75 tables. You install it as a single Python module. You don't need to install and maintain a graph database or a cluster; I know that's not a problem for you, but it is a problem for people who are not in computing. And it's also super efficient. If you run a query on the whole dataset, but sample it, it can finish in minutes. If you build slices of the data to study, that can take between five hours and a couple of days, but then you can run queries that finish in seconds.
And the space requirements start at about 160 gigabytes for the downloaded data. You can then process it in compressed form, without needing to decompress it. What I will show in the next half hour or so is the model that you get access to and the type of data that you can use. I will explain how it can be used in practice, go deep into how it is implemented, giving you perhaps ideas for how you can do similar things, and finish with some issues, limitations, and how to move forward. So what you see here is the schema of all 75 tables. They are colored based on the dataset they come from. At the top you see the United States patents. The various yellow ones are Crossref; this is the main dataset. It contains details about publications and their authors. There's also a similar set from PubMed regarding the health sciences, plus information about researchers, open access journals, research organizations, and some other things I will explain. As I said, the main dataset is Crossref. You see it here. It contains mainly works, and these contain references to their authors, their updates, subjects, funders, licenses, and also the affiliations of the authors and the references in each work. In numbers: you get about 135 million publications. Not all of them contain a subject or references; you see these numbers diminishing, but they have been going up over the years. About 360 million authors, and about 1.7 billion references: each work has references at the end, and you get that many of them. Many of the works are also associated with subjects, telling us what area they appear in. Most of the works are journal articles; then come book chapters and proceedings, and other elements such as books, posted content, and dissertations, far, far smaller. If we look at the publications that appear in the dataset each year (all of these charts, by the way, have been derived through Alexandria3k)...
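As a toy illustration of the relational layout just described, here is a miniature stand-in built with Python's sqlite3 module. The table and column names are simplified placeholders of my own, not the exact Alexandria3k schema:

```python
import sqlite3

# Miniature stand-in for a Crossref-style schema: works,
# their authors, and the references each work cites.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE works(id INTEGER PRIMARY KEY, doi TEXT, year INTEGER);
CREATE TABLE work_authors(work_id INTEGER, name TEXT);
CREATE TABLE work_references(work_id INTEGER, cited_doi TEXT);
""")
db.executemany("INSERT INTO works VALUES(?, ?, ?)",
               [(1, "10.1/a", 2020), (2, "10.1/b", 2021)])
db.executemany("INSERT INTO work_authors VALUES(?, ?)",
               [(1, "Ada"), (1, "Grace"), (2, "Ada")])
db.executemany("INSERT INTO work_references VALUES(?, ?)",
               [(1, "10.1/x"), (2, "10.1/a"), (2, "10.1/y")])

# Join works to their authors, the kind of query behind the
# authors-per-work metrics shown later in the talk.
rows = db.execute("""
    SELECT w.doi, COUNT(a.name) AS n_authors
    FROM works w JOIN work_authors a ON a.work_id = w.id
    GROUP BY w.id ORDER BY w.id
""").fetchall()
print(rows)  # [('10.1/a', 2), ('10.1/b', 1)]
```

The real system exposes the same shape of data, only at the scale of 135 million works and 1.7 billion references.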
You see that they've been increasing at a very large rate over the years. If you think it's exponential, you're indeed right: if we plot it on a logarithmic scale, you see a linear rise, which means an exponential rise in the numbers. You can even see two dips in the rise. Any idea why these are there? What happened? Wars. Wars, exactly. The world wars, apart from the extreme human tragedy, also affected the science output of the world. Regarding availability, the various lines here show which works have an abstract, a subject, a funder, a researcher identifier, or awards associated with them. You see that these numbers have been rising in most areas; the subject is a special case. Another dataset associated with Alexandria3k is ORCID, the Open Researcher and Contributor ID. How many of you have such an identifier? Good. If you don't have one, go and get one. If you publish, it helps us, all scientists, associate you with what you have been publishing. These records have some basic details about the persons, and then further details associated with distinctions, education, invited positions, memberships, peer reviews conducted, and other resources associated with researchers: who has access to a specific large telescope, for example. Again, the completeness of this data is not uniform: most people have works associated with them, but fewer have associated employment, education, or personal data. A dataset similar to the works is the dataset of publications made available through PubMed, a United States government effort, which has publications associated with the health sciences, so health and biomedicine. It is similar to Crossref, but it also contains some more specialized fields, such as pathogens, a very complete taxonomy of where something belongs, or the chemical substances mentioned in a specific paper, so it allows you to do more concentrated and specific research.
Also available are the United States patents for the past 20 years, containing about 5 million records, and a registry containing about 600,000 records of research organizations: the organization you are associated with, if it conducts research, should appear there. It's a taxonomy, so it contains the parent organization of your organization, in many cases up to the top, maybe the government. What else is there? Some smaller datasets: journal names, so that you can directly associate them with their ISSN; funder names; and also Directory of Open Access Journals metadata, about 19,000 records. All these are tied together through identifiers, such as digital object identifiers (DOIs), researcher identifiers, ISSNs, URLs, and many other identifiers, in the way you see here: these are just some representative tables from the diverse datasets, linked together. How are we using Alexandria3k in practice? You can use it as a command-line tool, with the pattern typical nowadays for large tools of running the a3k command and then sub-commands such as populate or process or query. Here's an example of that: I'm running the a3k command, asking it to populate a database called covid.db from the Crossref dataset found in this directory, and selecting only the rows that have a title or an abstract that matches COVID. I will show you later on how this can be useful. You can also use it through Python. Here's an example: you import it, you create an instance for the specific data source that you are interested in, and then you call a method that performs a similar function. Typically, the way to use A3K is to download the data for Crossref, about 160 gigabytes, which can be downloaded through a torrent in about three hours; by the way, this gives you a plausible-deniability argument for why you are using torrents.
You can then run various exploratory data analytics queries directly on the compressed data, sampling it; this can finish in about two minutes for 1% of the records, with no need to uncompress it into the terabytes otherwise required. Or you can populate a database, an SQLite database. This can take from four to 20 hours, and the database can be some four to 200 gigabytes in size, depending on what you store in it, if you are selective. And there, on the database, you can test, define, and refine analysis queries; the queries can run in minutes, or in hours if they are very complicated. So, mainly, how you can use it: you can run ad hoc SQL queries directly on the compressed data, or you can populate SQLite databases, where you can select elements either horizontally, so you can say, I want only the records that match a specific condition, works published in the last two years, or vertically, selecting specific columns: I'm not interested in the abstract, for instance, because it takes up a lot of space and I'm not going to use it. And once you index the SQLite database, you can have many queries finishing in seconds, even on the complete dataset. Here is an example of a query directly on the dataset, without creating an intermediate database, measuring how many publications appear each year, the chart I showed you earlier. Here is another example of a query that performs sampling: it calls Python's random function and keeps a record when the value is less than 0.01, so I get a 1% sample from the dataset, to find out how many works contain abstracts and how many don't. This is the answer, the same as there would be on the complete dataset, but obtained by sampling in about two minutes. Another example here populates a database in order to extract the metrics that I showed you previously: how many works have authors or subjects or abstracts associated with them. Similarly, another population, this time of the ORCID dataset, showing how many elements there are, with the corresponding query on that dataset. Let's see some more advanced elements.
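The 1% Bernoulli sampling described here can be sketched in plain Python. The `sample` predicate below is my stand-in for the row-sampling hook the talk describes, not Alexandria3k's actual API:

```python
import random

def sample(_record) -> bool:
    """Keep a record with probability 1%, as in the talk's example."""
    return random.random() < 0.01

# Simulate scanning a large record stream and counting how many
# records the predicate would keep.
random.seed(42)  # seeded only to make this demo reproducible
kept = sum(sample(None) for _ in range(100_000))
print(kept / 100_000)  # close to 0.01
```

Because every record is kept or dropped independently, counts computed on the sample (works with abstracts versus without, say) can simply be scaled up by 100 to estimate the full-dataset figures.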
Here are two papers, both written by Nobel-prize-winning authors. On the left is one by Kohn and Sham, a Nobel-winning paper that used established theorems to develop a method for calculating the structure of electrons, and on the right another famous paper, you are probably aware of it, by Watson and Crick, also Nobel winners, introducing a model of the structure of DNA. However, the way these papers are associated with science turns out to be very different. If we look at the papers that cite them, the red ones here (the two blue ones are this one and this one), we see that on the left the papers that cite the electron paper also cite previous work, the works cited by it, whereas papers that cite the DNA paper do not cite the work published before it. So people have said that the paper on the left consolidates existing research and advances it, whereas the paper on the right, the DNA paper, is a disruptive paper that changes everything, and therefore people no longer cite other works; they cite only this paper and the ones that follow it. There is a measure you can calculate here; the paper that first established it, published in Nature, gave these numbers. My own calculation with Alexandria 3K gives similar numbers, with a highly significant statistical correlation showing that the methods are equivalent. The one on the top, the published one, is opaque: you cannot reproduce it, because the data is not openly available. The one on the bottom can be run on your laptop. Here are some other measures: the evolution of scientific publishing after the Second World War. Before that, things were completely different; that's why I'm not looking at them.
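The consolidation/disruption measure mentioned above can be computed from citation sets alone. Below is my own minimal sketch of the commonly used CD-index formula, not A3K's actual query: for each later paper that cites the focal paper or its references, score -2fb + f (f: it cites the focal paper; b: it cites the focal paper's references), then average.

```python
def cd_index(focal, focal_refs, citing_refs):
    """CD index of a focal paper.

    focal:       identifier of the focal paper
    focal_refs:  set of papers the focal paper cites
    citing_refs: dict mapping each later paper to the set of papers it cites
    """
    terms = []
    for refs in citing_refs.values():
        f = focal in refs            # cites the focal paper
        b = bool(refs & focal_refs)  # cites the focal paper's references
        if f or b:
            terms.append(-2 * f * b + f)
    return sum(terms) / len(terms)

# Maximally disruptive: everyone cites the paper, nobody cites its references.
disruptive = cd_index("dna", {"r1", "r2"},
                      {"a": {"dna"}, "b": {"dna"}, "c": {"dna"}})
# Consolidating: citers cite the paper together with its references.
consolidating = cd_index("electrons", {"r1", "r2"},
                         {"a": {"electrons", "r1"}, "b": {"electrons", "r2"}})
print(disruptive, consolidating)  # 1.0 -1.0
```

A purely disruptive paper scores +1, a purely consolidating one -1, matching the two contrasting papers in the slides.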
We see many interesting changes: the number of authors per work has risen from 1.5 to 4; works per author fell from 1.99 to 1.59; references rose from 13 to 46; pages doubled to 12; the consolidation-disruption index fell, so if you think that science is becoming less disruptive these days, it is true. The number of citations, works published, journals, works cited at least once, and impact factors have all risen exponentially. All these were calculated with queries you can find in the software you can download from GitHub. Here's another interesting chart, showing how the volume of publications has changed in specific fields. You can see the big rise in computer science, the purple band that has increased substantially over the years, and also the relative fall in publications in arts and humanities, the other purple band that has diminished in this way. The absolute number has risen, don't be fooled, because publications have risen exponentially, but it still occupies a lot less than it did in 1945. Here are examples from two other data sets. Here is the evolution of applicants of US patents, by year and country. You see a fall in the number of patents associated with the United States and Japan, and a rise, from a low base, of patents associated with China, the blue line at the bottom. Another one is associated with replicating a paper looking at specific software, statistical software used in public health research. Again, you see in green the original paper and in orange the results replicated with Alexandria 3K. This was completed by a TU Delft student in a couple of weeks. Let me show you a more substantial example of what can be done with Alexandria 3K: a proof-of-concept study on a specific topic, namely COVID-19.
What you are seeing is something I found while checking the data set: a publication in the American Journal of Ethics, "COVID Care in Color", by an author who is a nurse, working in a Bronx emergency room team for six years and painting for 25 years. She captioned it: fear of an unknown contagion is dreadful, especially without proper protective equipment, so at the beginning of the pandemic, when we were receiving conflicting information on how and when to use our PPE, we relied on each other. I created the data set in the way you see here: I populated a vertical and horizontal slice into an SQLite database, selecting the records that matched COVID in their title or abstract. The run took nine hours and produced about three gigabytes of data; once I indexed it, it rose to 3.6 gigabytes. We can see some numbers: half a million published articles, 2.6 million authors (imagine the amount of effort that went in there) and eight million references on top of that. What are the topics associated with this research? Everything you can imagine. You can see education and engineering featuring very high, of course, after general medicine, but even strategy, management, law, cultural studies, pollution, anthropology, AI, waste management, ocean engineering; it's all over the place. Who funded that research? This is the query I ran: the National Natural Science Foundation of China has the highest number of publications, then the National Institutes of Health and the National Science Foundation from the United States, followed by various trusts and councils. However, we can also look at the affiliations associated with COVID publications; these are the queries that established this. We see that first came the government of the US, then the University of California system, the University of Toronto, and so on.
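The horizontal and vertical slicing described above amounts to a filtered, column-restricted copy of the data. A minimal SQLite sketch of the idea, with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (doi TEXT, title TEXT, abstract TEXT, pages TEXT)")
conn.executemany("INSERT INTO works VALUES (?, ?, ?, ?)", [
    ("10.1/a", "COVID-19 in the Bronx", "...", "1-10"),
    ("10.1/b", "Soil chemistry", "...", "11-20"),
])

# Horizontal slice: only the rows matching the condition.
# Vertical slice: only the columns we need (no bulky abstract).
conn.execute("""
    CREATE TABLE covid_works AS
    SELECT doi, title FROM works
    WHERE title LIKE '%COVID%' OR abstract LIKE '%COVID%'
""")
rows = conn.execute("SELECT * FROM covid_works").fetchall()
print(rows)  # [('10.1/a', 'COVID-19 in the Bronx')]
```

Indexing the resulting table (CREATE INDEX) is what brings query times down to seconds, at the cost of some extra disk space, as the talk notes.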
Here I used the research organization registry, and I moved the organizations up to their parent organization, so that, for instance, Berkeley and UCSD rose up to the University of California system. Another question is, how quickly could scientists build on each other's work? How soon could scientists publishing COVID research cite each other? Was it taking too long, given that things were advancing so fast? What I have plotted here is publications citing other COVID publications, published each month. And you see that fairly quickly, even in April 2020, there were thousands of citations to other COVID publications, which rose to hundreds of thousands in late 2020 and early 2021. The number you see at the beginning is an artifact: journals that got published with a January date, although they appeared a lot later. This shows us we shouldn't blindly trust our data. I also looked at collaboration, and I found some amazing things: there were articles authored by thousands of people. So you see articles with 2,000 or 1,700 authors. I thought this cannot be true, so I looked at it, and I saw that indeed most articles have, say, five authors, but there were articles with 7,000 authors. I looked, couldn't believe my eyes, I hadn't seen such a thing before, and it was indeed published: the names of the authors appear in a footnote, spanning pages 20 to 28. And this is not an isolated case. Through other queries, I found many articles with thousands of authors, showing an amazing way of collaborating. People probably contributed by giving data from hundreds or thousands of hospitals, and all these were collected in papers. Let me switch subject. How many of you have heard of the impact factor? Yeah. What is the impact factor, for those who haven't?
It's a measure that tries to capture how important a journal is by taking the citations that articles in the venue receive in a given year, divided by the number of articles the venue published in the previous two years. It has been severely criticized, especially when it's used for measuring the scientific worthiness or productivity of authors. But another problem is that it is opaque: Clarivate publishes it, but we have no idea how exactly it derives those numbers. It publishes the method, but we cannot replicate it. We can with Alexandria 3K, with queries and populations such as the ones you see here, and we can get results that rank journals by impact factor with a highly significant correlation with the ones published by Clarivate. Through this, we can do queries such as: find the most cited article of the last two years. It is this article, establishing a method with which you can determine how the properties of materials depend just on facts associated with their atomic structure. I found that very strange, a very specialized subject, so I ran a query to find what publications are citing this article, and I got results with titles such as these, at a rate of about eight records per second over the entire data set. If you cannot understand the titles, well, this is how people outside computing view us when we talk shop. I also looked at the most cited articles over the last two years that were also published in that period, and predictably this was "Clinical features of patients infected with the 2019 novel coronavirus in Wuhan, China", the first article reporting on COVID-19. Another metric of author productivity is the so-called H5 index: the largest number h such that you have published h papers in the past five years that have each been cited at least h times. I found an author with an index of 76, which amounts to 15 papers a year, many authors with an index of more than 60, and 100 with more than 38.
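The H5 arithmetic just defined is easy to state in code. A minimal sketch, with made-up citation counts:

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Papers from the past five years with their citation counts (invented data).
print(h_index([10, 8, 5, 4, 3, 2]))  # 4: four papers have at least 4 citations each
```

For an H5 index, the input list is simply restricted to papers published in the past five years.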
This was too hard to believe, because it implies not only what had been established previously, hyperproductive authors who publish one paper every five days, but also papers that get cited a lot. So how could this happen? Using A3K again, I looked at the papers that those authors cited, created the corresponding citation graphs, and what I found is that those papers cited each other a lot more than the papers of other authors with high, but not that high, productivity. So this seems to suggest that some sort of clique is at work there: authors citing each other, and thereby artificially elevating their H5 index. Whatever number you use, it can be gamed, as you can understand. Finally, here's another interesting chart. Here I'm looking at which topics cite other topics. These are the 50 strongest relationships, and we can see, for instance, that cancer research heavily cites organic and inorganic chemistry. I will finish with some details regarding the implementation of Alexandria 3K; I hope that you will pick up a few tricks that you can use. A3K is based on a plugin architecture. At the top you have the command-line interface, which uses a Python API that you can also use directly, and below it there are two sets of plugins. Data sources, the ones I showed you; you can create new ones by writing a file that establishes a new data source, say arXiv publications. And processes, things that manipulate data in new ways, for example matching authors with affiliations, or disambiguating authors with the same name based on what they publish. And at the bottom is a plugin API that these plugins use in order to function. The main idea behind the Crossref implementation and the other databases is SQLite and virtual tables. Virtual tables are a feature of SQLite that's magical.
You can create tables that don't physically exist in a real database, but appear as tables you can access with SELECT and other SQL statements. A3K uses a method to partition the data, because these large data sets come in many, many files, tens of thousands of files in the case of Crossref. I use the number of each file as an index, so that SQLite's virtual table implementation does not jump from one partition to the next, because decompressing partitions is expensive. Another trick is to understand, once you have written a query or a selection of the data you are interested in, which tables and which fields you need from each table. I don't want to parse SQL, especially with all the implementation-specific details of SQLite. So what I do is ask SQLite to trace the analysis of the query, and thereby I can understand which columns and tables it touches. I also create vertical slices of the partitions for the queries, in order to run faster. And I use various queries to look only within a partition in order to populate records. The Crossref data appears in JSON format, and a self-contained file, one of the 26,000 files, will contain all the references of each work in it; I don't need to go to other partitions. So here's an example. When you run the query on the top, what happens underneath is that a virtual table is created, and the query is run on this very simple table. When, however, you also do joins, what happens is that it creates the tables, but, as you see here, the tables have a container ID restricting them to each container in turn, one, two, three, up to 26,000, so that each partition is decompressed in turn, rather than all of them being processed together, and then the query is run on tables that are actually realized. For population, similar things happen. Say you populate something with a condition (I want only the subjects associated with library and information sciences) and you want only some columns.
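One standard-library way to discover which tables and columns a query touches, similar in spirit to the query-tracing trick just described (though not necessarily A3K's exact mechanism), is SQLite's authorizer callback, which fires once per column a statement reads:

```python
import sqlite3

touched = set()

def authorizer(action, arg1, arg2, db_name, source):
    # SQLITE_READ fires once per (table, column) the statement reads.
    if action == sqlite3.SQLITE_READ:
        touched.add((arg1, arg2))
    return sqlite3.SQLITE_OK

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (doi TEXT, title TEXT, abstract TEXT)")
conn.set_authorizer(authorizer)

# Preparing the query is enough: SQLite reports what it would touch,
# so only those columns need to be realized from the compressed data.
conn.execute("SELECT doi FROM works WHERE title LIKE '%COVID%'")
print(sorted(touched))
```

Here SQLite reports that only doi and title are read, so the bulky abstract column never needs to be materialized.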
First of all, tracing establishes the table names and the fields that you are interested in, and then, again per partition, tables are populated, and queries are run to fill in the data that you want from the populated tables. I found that this is faster than using virtual tables, because of the various joins. A technique called topological sorting is used to establish the order in which the joins must happen, based on the names of the tables you want to join. Similar ideas apply to populating ORCID, United States patents, and so on. Here, because the data comes as XML records, we can skip the parsing of XML records that we are not interested in, and thereby gain an additional efficiency advantage. Let me finish with some issues and limitations of A3K. The coverage of authors is fairly low: only about 17 million out of 360 million author records have an ORCID associated with them. Keep in mind the data go back to the beginning, to the Second World War, and ORCID wasn't a thing then. But even now, not all authors have an ORCID. This is improving, because many institutions ask them to get one, and we're also investigating ways to disambiguate authors who don't have an ORCID through machine learning methods. Affiliations, too, are either missing or appear in diverse forms, so the same university, say the one here, can appear as ULB or as the full name of the university. Abstracts are also not always available; only a small number of works have one, but many publications have a text mining link that you can use to obtain the full text for data mining purposes. The subjects of publications are based on an identifier established by Scopus, which is associated with complete journals. So if something appears in a zoology journal, we assume it has to do with zoology, even if it actually has to do with, say, biology or informatics.
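The topological sorting of joins mentioned above has direct standard-library support in Python. A sketch with a hypothetical parent-child table graph (the real A3K schema has many more tables):

```python
from graphlib import TopologicalSorter

# Hypothetical relationships: each child table depends on its parent table
# being populated first, because the child joins against the parent's keys.
parents = {
    "work_authors": {"works"},
    "author_affiliations": {"work_authors"},
    "work_references": {"works"},
}

order = list(TopologicalSorter(parents).static_order())
print(order)  # parent tables come before the tables that join against them
```

Populating in this order guarantees that every join target already exists when it is needed.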
Again, we're working on using machine learning methods to obtain better results here, looking specifically at the impact factor calculation, which many are interested in. Establishing what is a citable item is tricky; Clarivate uses a proprietary method where, for instance, an editorial or a letter is not considered citable. It's difficult to do this automatically; I assume they have people working on that. As for the way forward, on how we can work as a community to improve A3K: first of all, these are the early days, so I'd be very happy to help the community conduct studies. If you have an idea for a way to use A3K, please contact me. I would like to integrate more open access data. Here are some ideas: arXiv; DBLP, which is a database of computer science papers; a taxonomy of medical research called MeSH, which is extremely interesting; and the wider one used by the Public Library of Science, which I also think would be worth integrating and associating works with. Related to that is improving the various processes, the ways in which we can process the data: disambiguate authors (there are many John Smiths, or Zhous in China), find out which ones wrote a specific article, and classify the topics of the publications. And finally, and this is relevant to us here at FOSDEM, evangelize more and better data availability, more use of ORCID, and improvements to the published metadata. With this, I thank you for your interest and attendance. I think we should have time for one or two questions. Do we have questions? Thank you for your talk. About 10 years ago, I participated in a Kaggle contest, which was about disambiguating authors, to link papers written by the same author. In the work you've done, have you observed name ambiguity in different forms? Sorry, can you repeat? I'm not hearing very well.
So I was saying that about 10 years ago, I participated in a Kaggle contest where the topic was finding the different papers written by the same author, because the names had different variations and formats and so on. In your work, do you also observe that the names of the authors are written in different ways, and does that make it harder for you to link papers together? All right. The question, regarding a contest that was run 10 years ago, is whether there are authors whose names appear in different forms. Absolutely. First names often get abbreviated to the first letter. Middle names appear and disappear in random order. So it's indeed a problem; ORCID helps. But efforts such as what you did, to develop ways to uniquely associate an author with all their works, are also helpful, and they can be integrated as processes, either with a pull request on A3K or by using the API and doing it on your own. Okay, so it's a two-parter. First, how often do you update the dataset? And is there a way to download the delta, to just get the new stuff? Okay, two things. How often do I update the dataset? The answer is never, in contrast to, say, OpenAlex. A3K is a tool for working with existing data. So whenever you want, you go and fetch the data; it doesn't come with the data. You use the data sources from their primary source. I don't pretend to curate the data; A3K allows you to use existing open data sources. Can you work with incremental updates? If you have incremental parts and a database that you populate incrementally, you can run a selection on the increments and populate the database with them as well. Thanks. Thanks. So I was at a university more than 10 years ago, and at that time our articles were published mostly as unstructured text. Is that still a thing? Are you aware of any efforts to structure articles? Because unstructured texts are difficult to analyze.
If I understand the question correctly, whether there are efforts to structure articles in a way that allows us to analyze them better: there are tools such as GROBID, which we heard about yesterday in the Open Science dev room, that do that and create XML associated with an article's text. Of course this cannot always be perfect. Ideally we'd want the complete pipeline, from authors to publishers to publication, to carry this structured content along, and not reverse engineer the structure after it has been published. Well, when I look at the time, we should stop the Q&A here in this room. Maybe he has some more time afterwards. So, some applause. Thank you very much.
Firefox power profiling: a powerful visualization of web sustainability
Okay, welcome to Janson for Firefox Power Profiling. We'll listen to Florian Quèze. Welcome. [Applause] Hello. I'm Florian Quèze. I'm a performance engineer at Mozilla. For the last few years, I have worked on understanding how much power is used by Firefox, how it's used, and how we could reduce it. And today, I will be sharing how the tooling we've put in place to understand Firefox's power use can be used to improve web sustainability. So first, I will explain what I mean by web sustainability. I mostly mean carbon footprint when I'm thinking sustainability here, and there are three main components to the carbon footprint of browsing the web. The first and biggest one is the user device. The second component is whatever's not in front of the user, so that includes networking and server equipment. And then there is the power used on the device by the browser when displaying the web page the user wants to see. So let's look at each of the three. First, the user device. Usually, we think it's not within our control when we develop a website, because it's the user who picks the device; we have nothing to do there. The emissions we are talking about here are the embodied emissions, whenever someone buys a new computer or a new smartphone: the emissions to produce this device, to manufacture it, but also to ship it to the user. And even though we don't get to do anything about the actual emissions when creating the device, we can reduce the incentive for the user to replace the device. One way we can do that is to ensure good performance, because something feeling too slow is a strong incentive for someone to replace the device. And the other is ensuring web compatibility, because if the device becomes incompatible, the user has to replace it or update it in some way. And on this topic, I would like to mention that Firefox is currently the only browser left supporting Windows 7 users.
And we actually still have millions of users running Windows 7. So if you are thinking about sustainability, web compat is one of the first things; think about Firefox ESR. The second piece of the carbon footprint is the emissions for the infrastructure: anything server or networking. I'm not going to talk a lot about this, because it's already well covered, and there's a reason for that: the financial cost of operating the services scales mostly with the emissions. So there's a strong incentive to optimize, and there's already a lot of tooling available. And last, and maybe not least, very often neglected: the emissions caused by using the browser to display the web page. And the reason why it's neglected is that, in people's minds, there's no good tooling available to look at those. Is that correct? Well, not really, not anymore, and what we've done to change this is what I will be talking about today. Because the talk is 40 minutes long, I want to give you a structure. First, I will explain why we at Mozilla care about this. Then I will explain our journey to measure power use locally, what we've done to be able to understand it. And then I will go deep into the topic of power profiling. But first, I will introduce the Firefox Profiler, so that even if you don't know it yet, you can make sense of it. Then I will explain what power profiling does and show examples. Examples are important, because this is where we see why all of this makes sense. And then we will take a break from the structured presentation of slides: I will try to do a live demo, if I have enough internet. And then I will explain what we call external power profiling, whatever that is, and give some more examples. So let's start. Why do we care about this at Mozilla? There are three main reasons. The first one is sustainability.
Mozilla made climate commitments: being carbon neutral; reducing our footprint year over year; leading openly by sharing materials, tools and methodologies, which is what I'm doing today; and improving products from a sustainability perspective. There's a reason why this last one matters; it's on the next slide. When we look at Mozilla's carbon footprint, the footprint caused by the use of our products on our users' devices is more than 90%. And by the way, when we say we are leading on sustainability: we are the only browser maker organization that actually publishes this kind of data. The others are shy about it, and we would like to encourage them to also publish this kind of data. The second reason for caring about the power use of Firefox, a very important one, is user experience. Nobody likes to use a computer that uses too much power, and there are multiple reasons: it causes noise with the fans, and if it's a laptop and it's super hot, it's painful. And even the people who couldn't care less about climate change, because they think it's somebody else's problem, hate running out of battery. Battery life matters to everybody. And last but not least, we do this for a better web. The reason we want to do this is Mozilla's mission: we are building a better internet. This is what we are here for. And because Firefox is built on web technologies, everything we do to make Firefox more energy efficient, and all the tooling we put in place, is directly reusable for web pages. Now let's dig into our journey to figure out local power use. It started a couple of years ago. My task was to figure out how Firefox uses power and what we should do about it. It was not clearer than that; it was just: let's look into it. So when you want to understand the power use of something, the first thing you do is take an energy meter, or watt meter. So that's what I did. It's cheap, it's easy.
It's pretty accurate in terms of the data it shows. It's also pretty useless for the case of software, because software is not something that you start and it does the same thing all the time. You need to see the evolution over time, and I was not seeing anything. So the next step is to get a better watt meter. I got one that communicates with the computer through Bluetooth. It sends me something like this, so I can see a chart. It's much better. It's still not so great, because I still can't correlate it with what we were doing with the code in the browser. And at that point I wondered: how is the competition doing it? And I found this blog post from Microsoft that I found very interesting, back when they were working super hard on Edge battery life; they were super proud of it. It was before they switched to Chromium. And one sentence really caught my attention, the one highlighted in blue here: power was measured on the Surface Book because it has integrated hardware instrumentation. So that's how Microsoft did it. They have their own brand of laptops. They put built-in power meter chips in those, so that they could compare the power use of Edge with competing browsers. That's not really something we can do at Mozilla; we don't make laptops. But can I get some of those Microsoft laptops? Well, sure. So I actually found two that include those power meters. They are pretty old, because they date back to when Microsoft was doing this kind of work, but they still work. I tried to find newer devices, but unfortunately on all the devices where I found that a power-meter-like device is exposed in the ACPI tables, it doesn't actually seem to be present. The power meters were put in by manufacturers for prototyping and calibrating battery discharge rates, and then not put into production devices. When we look at the tool called Perfmon on Windows on those computers, we get something like this.
We see energy meters, and we have four channels here: battery, CPU, GPU and the Wi-Fi chip. Which means we can measure the power use of each of those components. When we measure, we see something like this. So we now have charts; we can try to correlate them with stuff that happens. I'm not sure if you think like me, but I really dislike this UI. I find it absolutely terrible; I can't make sense of it. Even the unit I can't make sense of: here I selected the CPU cores energy, and the last time it was measured, it was 5.0-something E+011, whatever that means. While searching for those devices with built-in power meters, I had a good surprise. On some laptops, the names of the energy meters were pretty familiar if you have used Intel Power Gadget: those are the RAPL channels that are exposed by the CPU itself. After some investigation, it turns out that all Windows computers with Intel CPUs expose the CPU itself as a built-in power meter, and we can access it. Because, another nice surprise, there's a documented API for it. I have not found any example of someone using the API, but the API is documented. Which means I could now understand what the unit was: the E+11 was because it was pico-something. And we can collect many times per second; I know that because I experimented with it, and there's very little overhead, so I can collect many times. And the most important thing for us is that it's accessible from user land. I absolutely don't want anybody to run Firefox as root to be able to power profile it. So accessing it from user land, without requiring users to install anything, was very important for us. All of this makes it pretty tempting to use for profiler counters. At this point, I started prototyping something, started hacking. And this is the screenshot of the very first power profiling prototype I got in Firefox. You see the same names here, the RAPL channels; not very user friendly as names.
And the units were not correct, but the thing that matters a lot in this screenshot is the shape of the track here. This is the CPU package, and you can see that it matches the shape of when we were using the CPU. So the data seems correct: we were using CPU here too. And there's a shape that moves here: PP1 is the GPU, and there's a spike whenever we do something on screen. We were doing graphics work here, and we had something here, here and here. So the shape is correct, and this is a key validation, because until then we were thinking we could not power profile. Power profiling means running the profiler; running the profiler means using power, and we were afraid we would be profiling the profiler, which is not what we want. This validates that it actually worked. So I decided to polish it and make it something we could ship. But I see I shared a screenshot of the profiler without introducing it, so it's a good time to do that. It's a profiler built into Firefox. No additional tooling required; it's always there. The user interface is not shown by default, because we don't want to clutter things for all users, but it's trivial to make it show up. It was created for performance work, and by performance here I mean making things faster. It's useful for users, so that they can make a useful bug report very quickly: instead of saying something is slow, they can say, here's a profile of what happened, please have a look. And it's useful for developers, because those profiles are easily actionable. It's one of the best profilers that currently exist, and there's a good reason for that: we started investing heavily in it in the Firefox Quantum days. Those were the days when we decided that the Firefox engineering teams being several times smaller than all of the competitors' teams was definitely not a good reason for Firefox to be slower. So we needed better tooling. The profiler uses multiple sources of data.
It's a sampling profiler, which means that it uses a timer, and at a fixed interval it stops the execution of the program and captures information. Typically it captures the stacks of all the threads we care about, and it also captures counter values. Counters: for example, if we care about memory use, whenever memory is allocated or released, we increment or decrement a counter, and when sampling we record the value of that counter at the time we sampled. The last source of information is markers. Markers can be seen as annotations left by the developers, so that whenever we see a profile, we see what happened at the time. I had a screenshot showing how to use the profiler, but I will try to do a demo instead; it will be more interactive. So this is a Firefox Nightly instance that was created fresh, with no user profile, and I will go to profiler.firefox.com. It loads this web page, and I will click the big Enable Firefox Profiler menu button. When I click this, I see a new icon appear in my toolbar. It was already there; we show the icon when clicking that button. And I have settings here. In most cases the default settings will be good for what you want to do, so you can just click start recording and then start doing what you want to profile. I will, for example, load the Wikipedia home page. And once I'm done doing the thing I want to profile, I can click the button in the toolbar again. A second or so later, a new tab opens, which is my profile. The UI might be a bit intimidating at first; I will go through it with you. There are two main pieces of the UI. The first one is the top half here, which is what we call the timeline, because everything is drawn against time; there's a time axis here. And then there are panels at the bottom. In the timeline, you can see what we call tracks. There are tracks for processes, like the parent process you see here.
In Firefox, the parent process handles the user interface and the rendering. And you can see a track for each content process. So for example, I have the Wikipedia process here that I will select. And here there's activity, so I can make a selection and zoom into it by clicking this magnifier icon. Now we'll see in a lot more detail what happens. The UI is very interactive: whenever I move the mouse somewhere, I see a tooltip showing me what happens. Here I'm seeing the stacks that were sampled. I see there's some JavaScript in here. I assume some of you are web developers and probably care more about what JavaScript runs, so we can filter the frames to JavaScript only. And here I'm seeing which JavaScript code was run by Wikipedia when loading the page. I also said we have markers, so I will show an example. In the panel at the bottom, I have a marker chart. If I go here, I have DOM events; they show whichever events were sent to the web page. And if I scroll down, I have many more. I won't go through all of them, of course, because we clearly don't have time. I just wanted to show the Awake markers here, which show whenever the thread was active, which is important for power use. And the Runnable markers we have at the bottom here, which show the name of the C++ task that was running, which is very important for Gecko developers when you send them one of those profiles. So that will be it for an introduction of the profiler itself. I will go back to the presentation and skip all the screenshots I had that show the same thing. And now we'll talk about actual power profiling. So I said we had a working prototype; now we want to make it work for real. It's built in: again, no extra tooling required. It supports the three major desktop platforms. We shipped it in Firefox 104, so that's already a little while ago. We got a lot of great feedback on it, especially from people who care about the sustainability of the web.
To the best of my knowledge, it's not been copied yet, but that might happen at some point. Platform support. On Windows 10, we support only the devices that include built-in power meters. On Windows 11, we support Intel CPUs, and we have information about integrated GPU and memory power use. And on Windows 11 22H2, with recent updates, we started seeing Windows supporting AMD Ryzen CPUs. A good surprise here is that there is separate information for each core, which means that if we can track which process was using which core for which thread, we can know exactly how much power is used by the code we are running. On Mac, we support both Apple Silicon CPUs and Intel CPUs, in different ways. For Apple Silicon, we actually have an API from the kernel that gives us information about the amount of energy used by each process. I think Apple can do this because they control both the CPU chip and the kernel: it's very likely that whenever they context-switch a thread, they record the value of a counter, and that's how we get the value. For Intel, there's a very obscure system call, which can only be called from assembly or it will crash, that gives us the value of a RAPL model-specific register. I'm very happy I didn't have to figure this out on my own; a former colleague did it several years ago, and I could just copy-paste the code. That was nice. Last but not least, Linux. On Linux, we use RAPL perf events. Those used to be unrestricted, but then there was a side-channel attack. And the reason for that is: on Windows, you can access the power data at most once per millisecond; if you try to access it faster, you will get the same data returned again. On Linux, there was no rate limit like this, which means that you could query it so often that you could get some information about the data being processed. So when this was discovered, access was restricted to root only.
Thankfully, there's a command we can run as root that lets the kernel know that it's fine not to be paranoid about it. And that's the exact same command that needs to be run to use Linux perf, the built-in profiler for Linux. So I think it's fine: as long as users don't need to run Firefox as root, I'm happy with this workaround. AMD CPUs are supported. Unfortunately, it doesn't work with Snap packages if you are on Ubuntu; the binaries provided by Mozilla work. I think that's because the packaging system puts some restrictions on what's allowed or not. If you wonder how to configure the profiler: I said in my previous demo that the default settings are right in most cases. If you want to do a power profile, they are not, because obviously you want to also include the Power Use feature, that's the checkbox you see down here, and also markers for all threads. The reason for that is that waking up a thread has a cost in terms of power, and we want to know about it even if it happens on threads we are not profiling; we want to know wherever it happens. Or just pick the Power preset, of course. The other thing we want to do is reduce the overhead. As I said, we were concerned we might be power profiling the profiler itself, which is useless. So what we are doing is increasing the sampling interval: instead of sampling every millisecond, we sample every 10, reducing significantly the overhead of waking up. And when sampling, we don't capture the stacks; we only capture counter values. We still have all the markers, which still give plenty of information. For the presentation, I will give plenty of examples. And as I said, the profiler UI is very interactive, so here's a link to my slides, and whenever I have a screenshot of a profile, I also have a link to the profile so that you can explore it for yourself. One thing I should mention is that the share.firefox.dev domain doesn't work with IPv6.
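For reference, the "command we can run as root" mentioned above is the kernel's perf_event paranoia setting. The following is an assumption about the typical invocation, not something stated in the talk; the exact level required can differ between kernels and distributions:

```shell
# Assumed command to let unprivileged processes read perf events
# (including the RAPL energy counters) without running Firefox as root.
# The exact paranoia level needed may vary; check your distro's docs.
sudo sysctl kernel.perf_event_paranoid=0
```

This is the same knob that Linux perf users adjust, which matches the speaker's remark.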
So you need to be on the FOSDEM dual-stack Wi-Fi network if you want to be able to click the links right now. First example of power profiling; I will go through it with you because I know the profiler can be intimidating. It's the same thing as what I was doing before, which is loading the Wikipedia home page. We have a screenshot track near the top. Here on the screenshots we see the page was not loaded yet, but there was CPU activity; here something starts appearing, and here it's visually complete. So I made a selection of the part of the profile that we care about. You see we have a parent process. We no longer have colors, and we no longer have stack samples, because these are the power profiling settings. We still see the network requests; they are also shown here. And we have a new process power track here that shows how much power is used by the parent process. We have a similar track here for the content process that tells us how much power is used at any given time, and the amount of energy used: here, 134 mW. So that's how much power it takes to load Wikipedia, apparently, on that computer. Second example: this is profiling on Windows 11, and this is starting Firefox. So this is the CPU activity when starting Firefox. You can see that here we had a window, and here it was visually complete and the activity was done. We can see how much power is used by the CPU cores, the built-in GPU, and the entire CPU package, and with this selection we see it used 10 mW to start Firefox on that specific Windows laptop. I'm not sure about you, but whenever I have new tools, I like to play with them and test their limits. I was wondering what's the tiniest thing I could power profile reliably, and this is the smallest thing I have found. I'm not sure if you've read what's written at the bottom of the slide, but basically, I was profiling Firefox doing exactly nothing.
And it was not exactly nothing, actually, because when I looked at the profile, there were those small spikes. What are those spikes? It turns out they were the cursor blinking in the address bar. And if I select one of those spikes, like I did here, I can see that making the cursor blink in the address bar uses 1.5 mW. I was surprised; I didn't expect this to be that precise, but yes, it is that precise. Now is a good time for a live demo, if it works. I assume many of you, when you came here, wanted to see a map of the campus, especially if you came for the first time. And because you're into open source, you probably used OpenStreetMap. So I will try to figure out how much power it takes to search for the campus on OpenStreetMap. I will configure the profiler to use the Power preset here. Start recording. Open a new tab. Type openstreetmap.org. It loads. I have a text field in the top right corner with a cursor in it. So I will type ULB and see what happens. So I'm on a university campus; this is very nice. I don't really recognize the shape of the buildings though, so I will zoom out to see if it's in the right place. Zoom out again. Oh, that's probably not the correct place. So I will additionally type Brussels. Now that looks better. If I zoom into it, this is actually the building we are in. Okay, I will stop the demo for now, and we will look at the profile I captured. Again, we have multiple tracks here. Google.com, I don't know why it's here, but I don't care about it; it was probably in the background. I don't care about this either. So I have OpenStreetMap here, and I have its power track and process here. I also don't care about how much memory we used. Okay, so this is filtered down to what's useful. I have a screenshot track here that orients me pretty quickly. So here I was typing the address of the map. Here we have the home page that's loaded. And here I had my first results.
Here it was animating while I was zooming out to try to figure out where I was. Okay, so let's look at power use now. The part we might care about is when we actually loaded the right campus. That's here. You see the network requests here, and you see there's a spike here in power use, and a similar one here. So I will zoom into this part of the profile. Now this is what happened in the Firefox parent process: we can see it used about 100 mW, and the content process about 33. So this is about 130 mW to load the search results showing the campus here. But because we have a profile, which is nice, whenever you have a profile you very often explore other things, because they might be interesting. I'm most interested by the shape that I see here, which seems pretty interesting in terms of power, so I will zoom into it. We see there was significant power use here. Looking at the screenshots, there were animations going on, so it makes sense that the process doing the rendering uses power. Here we used more than 300 mW in the parent process, plus 150 here. So actually zooming out to see where I was used a lot more power than searching for something. We can also see the bandwidth used. This is the network request track, and we see how much network bandwidth was used: zooming out took about 4 megabytes. I will zoom out again. I see there's some activity here, and I'm curious about it. I see there are markers, so I will zoom into it and try to see what's going on. There are regular spikes here. I will select the content process here, because the markers are in the content process, which is likely where the activity started. I will zoom a bit more on one of the spikes. And as I said, Runnable markers are very interesting for browser developers because they let us know what's going on. I will try to zoom into it to make it more readable.
So the marker here is the caret blink callback timer. This is actually the timer used to blink the cursor that I had in the search field where I was typing "ULB Brussels". So we can look at how much power it used: 0.17 mWh in the content process, plus 0.8 here. So a little bit less than 1 mWh was used making that cursor blink. So what I was showing in the previous slide, we can actually do it, and it really, really works. And I think that will be it for the live demo; I will switch back to the slides. I talked about many things already, so let's recap a little. We have power profiling that works on all three major desktop platforms: Windows, Linux, and Mac. It's reliable, it's easy to use, you don't need to be root to use it, you don't need to install anything; you just have everything in Firefox. So what about the other platforms? Firefox is not shipping only on those three platforms. There's something called Android where we ship Firefox, with lots of users there too. So what about it? So far, we've not found good APIs that we could use for power profiling, but we had another idea, and this is what I will explain now when talking about external power profiling. Taking a step back from what I was showing before: my very first step was to look at how much power was drawn at the power socket. That gives us the full picture of how much power is used by the entire computer. But there's one problem: the maximum sampling rate is 50 hertz, which in Europe is the rate at which the current oscillates for AC power, and you can't get data much faster than that. At the other extreme, we have data coming from the CPU itself: very precise data, but missing parts of the computer. And it's even worse on a phone, because we miss the entire screen, and things like that.
So the question was: is there anything in the middle we could look into? And yes, there is. If we are talking about mobile phones, or also laptops, there's the charger. Maybe we could instrument the charger instead. And yes, we can. It turns out there are devices already on the market whose purpose is to test chargers: to verify how good a charger is, and to check that the current and voltage it delivers are stable. To be able to do that, they need to sample very quickly, which is very interesting for us. And those devices are affordable, at least if you compare them to the smartphone you use to test your web application on. Some can export data to a computer over USB or over Bluetooth. One thing that's really important to understand about how this works: when you unplug your charger, you want the battery to be completely full. So anything the smartphone does while the battery is already full is still powered from the charger, because if it was taking power from the battery, the battery would not be full when you unplug the charger. That means that if we wait long enough for the battery to be completely full and then measure how much power goes through the charger, we are actually measuring how much power is used by the phone, and that's exactly what we want if we want to power profile. Another interesting detail: some of those power meters support more than 200 watts, which is more than enough to power profile any laptop that charges over USB Power Delivery. So here it's a MacBook that I was charging. Looking at power data from these kinds of power meters is what we call external power profiling, and we shipped it in Firefox 121. So how did we make this work? These charger-tester devices, there are a few available, and they only come with Windows software.
The software is not in English, which means that when we see those Chinese characters, we're not sure what they mean. And there's a poorly documented API, which was not nice when I wrote this. What actually happened is that I found one device that has what they call an "open API". Open API means there's one page of example C++ source code with Chinese comments. That was enough to get started. And then, thankfully, there are great and powerful reverse-engineering tools. I tested with this USB light that you can see on the slide here, which has various levels of brightness. This was a stable load that could let me know which data to expect. All the power meters on this slide are compatible with the Firefox Profiler. They all produce nice power tracks. Well, some not so nice; some produce very nice power tracks. And it's plug and play: if you run the script that's in this GitHub repository, it will see the device and start power profiling, and you will see this power track appear as if by magic. It's nice that it just works, because all the Windows software that came with those devices is terrible; you don't want to have to use it. The readme file in this GitHub repository includes a list of supported devices, so that's basically the names of the devices you saw in the previous pictures. Next to the name of each device, there's a link to an example profile of what you can get from it. These were captured with the USB test light that I was using, at various levels of brightness. You can see in the good example profile that there's what looks like a lot of noise here. It's actually not noise: if you zoom into the profile, you will see that it's a very regular pattern, and it exposes internal details of how the light is using power. It's sampling every millisecond. If you look at the bad example here, there's no noise, but that's just because it's sampling every 10 milliseconds, so it's taking an average. And the worst part about it is what's happening here.
We are turning off the light. It should use zero power. But here we see a linear decline for 500 milliseconds. If I want to profile anything I'm doing with my software, a latency of 500 milliseconds is completely useless, so this device can almost go in the trash. In all the future examples I will be sharing, whenever you see something labeled USB power, it means power data coming from these kinds of external power meters. Here's the first example of power profiling using this system. It's what we call an Android remote profile. Remote profiling means that the profiler was not running on the same device as the one used to control it. In this case, the profiler was running on an Android phone running Firefox, and I was controlling the profiler from my laptop, which was also controlling the power meter. When capturing the profile, both sources of data were merged, and we got this profile. We can again validate that it makes sense: you see the shape of the CPU use on the Android device, you see the shape of the power track, and they match pretty well, which shows that we're actually measuring the right thing. The baseline is relatively high here; that's probably because the screen was on at the time. And it's again a profile of loading the Wikipedia home page; we can see it on the screenshots here. As I said before, I have a link to all my profiles at the bottom of the slides. You can look at them now if you have a laptop in front of you, or later when you look at the slides again. I have more examples coming next, so I'm giving the links to the slides again. I will mostly be telling you two stories of how we used power profiling to understand what was happening. The first story: one of my colleagues told me, hey Florian, have you seen this new green leaf icon in the Windows Task Manager next to Edge? What is it? Can we have it too? So we were wondering what it is. Is it greenwashing?
Or is it Microsoft doing something fantastic for the environment that we should know about? It turns out there's a Windows 11 API to let Windows know that a process is doing nothing that's immediately visible to users, and that instead of optimizing for finishing as quickly as possible, the kernel should optimize scheduling for resource use. We could use this API for Firefox too. The power profile you see on screen here is the result when using a test case. In the first half of the profile, the test case was in a foreground tab; the test case is a stupid piece of JavaScript using as much CPU time as it can with an infinite loop. In the second half of the profile, the tab was in the background, and we can see the dramatic difference in power use. So yes, it actually does something, and it's pretty significant. Putting background content processes in the EcoQoS quality-of-service level on Windows 11 is something we shipped in Firefox 108, so that's quite a while ago. We were the first browser to do it, if we exclude Edge, which did it when the API was introduced in Windows, of course. Chrome followed a couple of months later, so now I think everybody on the web benefits more or less, and this is great because it actually saves a lot of power. I will explore a bit more how this works in the next few slides. We tried to do the same thing on Mac. This is a profile on a Mac with an Intel CPU, and we see the same nice decline in power use. You will see here that I have power reported by the CPU itself, but also power from the USB power meter, so I checked the power used by the entire laptop too. They all decline at the same point, when we switch to the background. And the numbers are pretty dramatic: the cores drop from 18 to 1.6 watts, and the entire MacBook from 30 to 10 watts. The numbers are even better on Apple Silicon, but this example is an Intel CPU to be able to compare with what I was showing for Windows.
And next, I wondered: using less power when doing something stupid like an infinite loop is great, but that's usually not what you do with the code in your web pages. So what if the test case was doing an actual computation? This one is computing Fibonacci numbers. And you can see that when it's in the background, it uses dramatically less power to do the same thing, but it also takes a lot longer. I have the numbers in the table here: it took more than three times as long; the CPU used less energy, but the entire computer used more during that time. So if you control the entire computer, typically in server environments where you can shut down the server once you are done with your task, try to finish as quickly as possible. If, like us, you are in the situation of a web browser, where there are things in the background that have no user impact but you don't control what happens to the computer, because it's the user's computer, then reducing the resources for everything that's in the background makes a lot of sense. And the way this works on CPUs where all the cores are the same is probably by reducing the CPU frequency. There's one slide where I'm trying to check this, because the profiler can also record the CPU frequency. This profile was on Android. You can see that whenever I have a spike in the CPU frequency, I have a small spike in power use, and when the CPU frequency remained high for a while, the power use was also high. So that kind of confirms the hypothesis. Second story. This is a real-life story. I was trying to fill in a survey that had many checkboxes, and I moved that web page to my external 4K display in a maximized window so that I could see everything I was being asked to fill in. And I got distracted, maybe by a baby or something.
Came back a few minutes later, my laptop was super hot, the fan was extremely noisy, and I was wondering what was going on. Of course, I profiled; you could have guessed that. In the profile, I noticed something like this. What you see here is an artificial test case; I'm not naming the bad web page. I could see that the color of the background was slowly changing with a gradient. With my eyes only, I could never have seen it: it was changing over the course of a few minutes. A completely useless animation. I tried to replicate this with an animation that moves slightly faster, so that we can see it well. You can see very high CPU core power use, high memory power use, high GPU power use, high everything power use. It's terrible. If you think about it: I said an external 4K display, that means 8 million pixels. It's a modern MacBook that has a refresh rate of 120 Hz. That means we forced the laptop to compute the colors of a billion pixels per second. So no surprise that it was hot. Then I tried to verify my hypothesis, because I'm claiming it's because there are many pixels, many times per second; maybe we can check that this is correct. On the next slide is the same test case, but I reduced the size of the window to a minimum. We can see from the shape of the chart the impact that has on power use. On GPU power use, the impact was very dramatic: when the window was tiny, there was almost no power use left. You can see there are big spikes here while I was resizing; this is because whenever we change the size of the window, even by one pixel, we need to recompute the layout of the browser UI. So high CPU use while resizing, but very low otherwise, once the window was small. I have all the numbers on the slide; I won't read them out loud, but you can look at the slides later. Another thing I did to test this: there's a hidden preference in Firefox to limit the refresh rate, and I tried various refresh rates.
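As an aside, the "billion pixels per second" figure above is easy to check; a quick sketch of the arithmetic, assuming a standard 4K resolution:

```javascript
// Why an animated background on a 4K external display is so expensive:
// every displayed frame forces the colors of all pixels to be recomputed.
const width = 3840, height = 2160;           // standard 4K resolution
const pixels = width * height;               // 8,294,400 ≈ "8 million pixels"
const refreshHz = 120;                       // modern MacBook refresh rate
const pixelsPerSecond = pixels * refreshHz;  // 995,328,000 ≈ 1 billion per second
```

The cost scales with both window area and refresh rate, which is exactly what the tiny-window and refresh-rate experiments confirm.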
And we can also see that the power use declines dramatically when we reduce how many frames we display per second. So this validates the hypothesis we had: it's just too many pixels, too many times. One thing to take away, if you're a web developer thinking about animating the background of a web page: think about it more than twice. It's absolutely terrible; you should not do it. Then I was thinking, okay, many pixels on screen: there's one case where we typically do that, and it's when we are watching a video. And there it makes sense, right? So this is a profile of watching a YouTube video, first in a frame and then full screen. You can see here the amount of GPU power used in a frame, and then full screen here. There are spikes while we were entering and leaving full screen, because there's a big animation and things we need to do with the UI. We can also see that the CPU power use was relatively low; I think this validates that graphics acceleration and hardware decoding were working well. So this is all good news. One last example of things to avoid as a web developer: timers. Waking up a CPU is expensive, especially if you wake it up to do nothing. Using the web API setTimeout, you can wake up the thread up to every 4 milliseconds, and this is what we see in this profile. This is a test case that just wakes up the CPU for the sake of doing nothing, sleeping again, and waking up again. You can see a spike in power use whenever the CPU wakes up. Then the tab is put in the background, and when the tab is in the background, Firefox limits timers to one per second. You can see this one tiny spike in power use here at the very end. This shows that throttling timers is a good idea. And this is just about the CPU wakeups: if you were doing something using actual CPU time in those wakeups, that would dominate the power profile, of course. And if I have a few more minutes, I have a few more things that are worth sharing.
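The wakeup counts behind the timer comparison above are worth sketching; the 4 ms figure is the HTML spec's minimum clamp for nested setTimeout, and one per second is the background throttling described:

```javascript
// Rough wakeup arithmetic for the setTimeout example: a page chaining
// timers as fast as allowed wakes the thread ~250 times per second,
// while a throttled background tab wakes it only once per second.
const wakeupsPerSecond = (intervalMs) => Math.round(1000 / intervalMs);

const foreground = wakeupsPerSecond(4);     // clamped minimum: 250 wakeups/s
const background = wakeupsPerSecond(1000);  // throttled background tab: 1 wakeup/s
```

Each of those wakeups costs power even if the callback does nothing, which is why the spikes appear in the profile at every tick.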
One is that Firefox has a built-in process manager. Whenever you have the name of a content process, which is typically what you will care about if you work on a website, a profiler icon will appear here. If you click it, five seconds later you will see a profile of the entire process. And if you are on an Apple Silicon machine, where we have per-process power data, you will see a power track showing how much power your website used at any given time. You might have seen in my slides and in my demo that whenever I was showing energy values, next to them there was a CO2-equivalent value, which is the equivalent carbon emissions. Those were computed using the CO2.js library from the Green Web Foundation. This was a very welcome contribution we received, so thanks for that. I gave you a very quick look at the bandwidth track while we were looking at the OpenStreetMap demo; that track lets us know how much data has been transferred. Regarding CO2 equivalents, there's a big question that I got after giving a previous version of this talk in a different place a couple of months ago. A participant told me that power profiling is fantastic: we wished we had a tool like this for a very long time, and it's great for optimizing for performance and sustainability. But, you know, what everybody else is looking at is how much data has been transferred, because that's what everybody was measuring until we had power profiling. And in the Firefox Profiler, you already have all the information about networking, because all the network requests are shown. Could you just show how much data has been transferred and put a CO2 equivalent somewhere? That sounded like a great idea, so we did it. This is shipping in Firefox 123, which is currently in beta, shipping in a couple of weeks. Maybe the last thing: what if you're out of luck and you can't power profile?
And there could be a few reasons. Maybe you're on a virtual machine, so you don't have direct access to the CPU hardware. Maybe you are using a Snap package, and then there's nothing you can do. Maybe you're on Linux and you're not root, so you can't run the magic command to let the kernel know that it's fine to let us see how much power is used. There's a hidden feature in the profiler, hidden because it's not fully polished yet: if you open the DevTools console and type experimental.enableProcessCPUTracks, you will see new tracks appear that say Process CPU. And you can see in this example that the shape of the power track and the shape of the process CPU track match extremely well. The one case where they won't match is if you do massive animations, like the full-screen animation I was showing before. But I said you should not do that anyway. So if you are not doing anything completely silly with graphics, that's something you could look into as an alternative. As a conclusion: power profiling is possible. It's easy, it's fun, and I encourage you to do it. Play with it; it's really simple. This is why I did a live demo, so that you could see how simple it is to use. But if you are really thinking about web sustainability, where you will have the biggest impact, even though it's less visible, is ensuring web compatibility with all browsers, and especially with all devices that still have supported browsers, so things like Firefox ESR. And think about good web performance, because even if something is still compatible, if it's super slow, people will want to replace their hardware, and that's where we really lose in terms of sustainability. Thank you very much for the talk. We've still got roughly 5 minutes for Q&A, more or less. Do we have questions? Please raise your hands so we can come to you with a microphone.
So, have you ever thought, when someone loads a website on localhost and Firefox can detect that it might use too much power, about actually showing a pop-up: hey, your localhost app is using too much power, check out the Firefox Profiler and the power meter? Would that be feasible, to basically push devs to fix their own apps? Would that be a good idea? I have not understood; there's so much echo that I couldn't understand what you were saying. Yeah, there was too much echo in there. So basically the question is: have you ever thought of pushing the Firefox Profiler towards developers when they're running apps on localhost and using too much CPU, by showing a message like: hey, check out the Firefox Profiler, it will show you your app is slow and why it's slow. Okay, so you're suggesting that we could detect the case of web developers, because they are running something on localhost and they are definitely developers, and then we could show warning messages. Yeah, and promote the profiler to devs directly this way. That's an interesting idea, not just for excessive power use, but also when something is dramatically slow we could let them know: hey, you know, we have good performance tools, you should have a look at them. Thanks for the idea. More questions? You have another question? So, does optimizing for power usage differ a lot from optimizing for CPU usage? Because I guess the less CPU you use, the less power you use, and the less network you use, the less power you use. But are there other things to consider when you're trying to optimize for power usage than just using less resources? So if I understood, the question is: are there other things to optimize, to use less power, other than CPU use? Is that the question? Yeah, of course.
So the power use is typically, first, CPU, that really dominates; it's both CPU time and waking up the CPU. Then there's graphics power use, which is what I was trying to show with the examples, and network power use, but you don't consider it as much. If you're getting data from the network over time, that will use power, because it will wake up the Wi-Fi chip, for example. But in terms of scale, CPU dominates so much, that's really where most people should focus their attention, at least when thinking about web pages. So, more questions? Another thing I should have mentioned is I have Firefox Profiler stickers on the table here, and Firefox stickers, so you might want to take them. Shiny, shiny, shiny. Thank you very much. Okay, thank you very much. Maybe another applause. Thank you.
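The speaker's claim that the power track and the Process CPU track "match extremely well" can be checked numerically: a Pearson correlation near 1.0 between the two sampled tracks is what that matching means. A minimal sketch, where the two sample tracks below are made-up illustrative numbers, not real profiler output:

```python
# Compare the "shape" of two profiler tracks with a Pearson correlation.
# The sample data is invented for illustration; real data would come from
# exported Firefox Profiler tracks sampled at the same instants.
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

power_mw = [120, 480, 510, 130, 115, 495]  # assumed power track samples (mW)
cpu_pct = [5, 70, 75, 6, 5, 72]            # assumed Process CPU track samples (%)
print(f"correlation: {pearson(power_mw, cpu_pct):.3f}")
```

A value close to 1.0 means the tracks rise and fall together, which is exactly the visual matching shown in the talk; a heavy-GPU workload like a full-screen animation would pull the correlation down.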
Open Source for Sustainable and Long-Lasting Phones
Thank you. Thank you. Thank you. You can hear me? Yes. Good afternoon. Thank you very much for joining the Open Source for Sustainable and Long-Lasting Phones talk with Luca and Agnès. Thank you very much for being here. I will be here, helping you with your talk, should you need anything. The stage is yours. Thank you. Who knows Fairphone? Wow. Happy. So we are super happy to be here with Luca today to speak about how open source is helping us at Fairphone. We'll speak about software, but not only. This is not a super techy talk. We'll try to speak about all the stuff that we do at Fairphone, which kind of reports, which kind of data we openly publish to the public. We openly publish to push the industry to change. But first, let's introduce ourselves. I'm Agnès. I've been working at Fairphone for the last six years now, and I'm leading the IT and software longevity team. Software longevity means that we are maintaining the phone for longer. We are a small team, but the goal of this team is to make sure that we can have long-lasting phones from a software perspective. And I'm also involved in some collectives. I first founded a company in France called Ninja Squad, focused on open source, but more on the web layers. I'm also part of a collective called Duchess France, which promotes women in tech. I also founded a smaller conference, compared to FOSDEM, called MiXiT. And MiXiT is a conference in Lyon, in the center of France, focused on tech and ethics. And last but not least, I'm part of the Réseau Mutu. I launched a website in my city, Saint-Étienne, in the center of France, called Le Numéro Zéro. You also have an antenna of the Réseau Mutu in Brussels, stu.info. And those websites are an alternative to the mass media owned by billionaires. You can publish information and your view on the local news. And all those websites are built on open source software. So I invite you to have a look at that. Luca, over to you. Hello. So my name is Luca Weiss.
I work as an Android platform engineer at Fairphone. And on the side, in my free time, I also do a bunch of Linux kernel development. I maintain a project called OpenRazer, which is an open source driver for Razer peripherals. And I'm also one of the core maintainers of the postmarketOS project, which is a Linux distribution for mobile phones. A few words about Fairphone, even if a lot of people here know Fairphone. So Fairphone started as an awareness campaign on conflict minerals. I will come back on that just after. That was in 2010, and only in 2013 did we launch our own company, called Fairphone. And the overall angle of Fairphone is to push forward change in the electronics industry, to make it fairer. It's not an easy game, but we have been doing that for almost 10 years now. And how did we start? On the next slide, please. Thank you. So as I mentioned just before, we started as an awareness campaign on conflict minerals. And at that time, Fairphone was just a group of activists in between the DRC, the Democratic Republic of Congo, and Amsterdam. And this campaign was to more or less raise awareness about the social damages in the DRC linked to the mining industry. And at that time, a bit of context on that, a US law was passed called the Dodd-Frank Act. And this law, initially speaking, was good, right? It required US companies on the US stock exchange to disclose whether or not their products did contain some conflict minerals. But the immediate consequence of this law was the fact that some big players started to seek resources outside the DRC. And then the people in the DRC, the miners, were a bit stuck and started to go back to smuggling activities. So this campaign was really, really focusing on that, to show a bit what did happen behind the scenes. So after two or three years of campaigning, we decided, next slide, please, to change a bit the way of working.
So we decided to be part of the industry, to try to push for a change from the inside out. And to do that, we were incubated by a nice place. I'm not sure if you know this foundation. It is called the Waag Foundation, and the G is pronounced the Dutch way. So the Waag Foundation, if you see the tagline: making technology and society more open, fair and inclusive. So yeah, that's not a random incubator, like the French Tech in France, which I don't like that much. It's really, really focusing on fair technology. And this is in the center of Amsterdam, in the Red Light District. You have this castle, and the Waag Foundation is within this castle. So we were incubated in this incubator. And then Fairphone BV, so the company itself, was born in 2013. This is a social enterprise. Social enterprise means it's a bit like the SCOP in France: the financial profitability is just a means to achieve all the goals, environmental and social goals. And in 2013, when we launched this company, we started a crowdfunding campaign, which was quite cool, because in the end we managed to sell some phones, in total something like 50K phones, which is not that bad, without any marketing; the website was built by our own team, etc. And if you see the tagline on this screenshot, it says: a seriously cool smartphone that puts social values first. It shows again that Fairphone really, really started as a social activity. And we wanted to show how difficult it is in this industry to be a miner in the DRC, etc. So perhaps some of you have heard about how important it is to be more ecologically friendly in this industry, but we do think that the real issue is the issue that you have on the people side. So if we pretend to build an ethical phone, we think that we should have a decolonial approach. We think that we should focus on the people, not in the Western countries, but far away from us.
We think that we need to respect the miners in the DRC, and the people, the workers on the assembly line. But why a phone, Luca? Yeah, so why a phone? So as you, I think, are aware, the digital industry is causing a lot of greenhouse gas emissions, that's about 4% of the global greenhouse gas emissions, and of that, 84% is actually going into device production. And the remaining 16% goes into running it, like networks and data centers. And yeah, electronic waste is really the world's fastest growing waste stream. With over 50 million tons per year, it is a really big problem in the world. Most of it is not recycled. Some of the phones, for example, are just kept in a drawer, or thrown away to landfills. There are like 1.2 billion phones sold every year, which is a huge amount. And since most of them are only kept for 2 to 3 years, most are thrown away afterwards. And then 20% are recycled and the rest are just dumped somewhere. And also, in the world, millions of people are working to dispose of this electronic waste that is not recycled properly. And this of course can cause a lot of environmental and health problems to the people that actually work in this sector. So where Fairphone tries to make things better is in the materials, so in the mining sector, in the factories where the phones are produced and the components for the phones, longevity, so you can actually keep the phone for longer, and then also reuse and recycling. So what happens to the device after the user is done with it, let's say. So how we try at Fairphone to actually change something in the industry, to not just be another smartphone company, is to try to change things. So we try to raise awareness, to tell people about this issue, like in the talk we are doing today. We also want to set an example. So we want to show other companies and we want to show the industry that it's actually possible to do it differently and to do it better.
And by doing this we want to motivate them to actually also come along, because of course, if only Fairphone does the things that we are doing, not much happens in the world. But if bigger companies, like Samsung, actually also implement the same programs, I think we could change a lot in the world. So in the last 10 years that Fairphone has existed, we have launched 5 smartphones. We started with the Fairphone 1 in 2013 and then had a phone every 2 years, more or less. The Fairphone 3 and 3+ are also great examples, where if you have a Fairphone 3, you could actually just upgrade the cameras in the Fairphone 3 and have a Fairphone 3+, and then just have better camera quality. Starting with the Fairphone 2, we are also really focusing on software updates. So we had 7 years of software support for that device, starting with Android 5 and upgrading it all the way to Android 10. And for example, for our latest device, the Fairphone 5, which launched mid-last year, we are promising software updates until at least 2031, so 8 years of software updates, but hopefully also aiming for being able to provide updates for 10 years, until 2033. So now let's look at the hidden side: how difficult it is to build a long-lasting phone, how we can try to fight obsolescence. We will try to speak a bit about software and also about hardware. So why is it so important to reach longevity? Speaking about the stuff that we openly publish: for every device that we build, we openly publish what we call an LCA, a Life Cycle Assessment. This is a methodology to assess the environmental impacts linked to each stage of the life cycle of the product, so production, transportation, usage and end of life. And we hope that at one point all the manufacturers will be obliged to publish such a report. This one is done by an academic partner, the Fraunhofer institute in Germany.
And if you read this report, it shows that if you keep your phone longer than the average, the average being 3 years for Android phones, if you keep it 5 years, you can cut the global warming impact by 31%, and it goes to 44% if you keep your device 7 years. So this is key, this is really key, to keep your device longer. And how do we do that? First of all, I will come back on that just after, but when you see a Fairphone, you will see that you can easily open it and change some spare parts, because in the end, the obsolescence comes from the fact that it's super hard to repair your phone, right? If your display is broken, if you want to change the battery, sometimes it's screwed in, and you cannot do that easily. So that's the first fact, the hardware aspect; I will come back on that after. But on the software side, perhaps some of you have seen this message: the app is not compatible with your device. It means that apparently the app is not compatible with the OS running on your phone, and probably it means that the OS is not receiving any security updates. So if we go to the next slide, let's look at the device. You will see a Fairphone 3 in the middle of this slide. And on this Fairphone 3, you have several components, right? And as a manufacturer, we need to choose some components, so a fingerprint sensor, a camera, a SoC, a system on a chip. And those components are not built by the same supplier, and it's quite difficult to have access to the code, the firmware, running on those components. So as a manufacturer, you don't have a lot of options. You can either contract long-term support with those suppliers, right? But it's not always easy, because some of them are not at all willing to do that. For example, for the Fairphone 3, I put this device on purpose because we failed on the Fairphone 3 with the fingerprint sensor. We didn't manage to have a long-term support agreement with the fingerprint sensor manufacturer.
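The keep-your-phone-longer figures from the LCA (31% at 5 years, 44% at 7) fall out of simple amortization arithmetic: the one-off production footprint is spread over more years of use. A sketch where the split between production and use follows the talk, but the absolute kg-CO2e numbers are assumed illustrative values, not Fairphone's report data:

```python
# Amortize a phone's production footprint over its lifetime.
# PRODUCTION and USE figures below are illustrative assumptions.
PRODUCTION_KG_CO2E = 50.0    # one-off manufacturing footprint (assumed)
USE_KG_CO2E_PER_YEAR = 5.0   # yearly footprint of use (assumed)

def annual_footprint(years_of_use: float) -> float:
    """Average kg CO2e per year when the phone is kept for `years_of_use` years."""
    return PRODUCTION_KG_CO2E / years_of_use + USE_KG_CO2E_PER_YEAR

baseline = annual_footprint(3)  # the ~3-year Android average from the talk
saving_5y = 1 - annual_footprint(5) / baseline
saving_7y = 1 - annual_footprint(7) / baseline
print(f"5 years instead of 3: {saving_5y:.0%} lower annual footprint")
print(f"7 years instead of 3: {saving_7y:.0%} lower annual footprint")
```

With these assumed inputs the savings come out around 31% and 44%, in line with the report's figures; the real LCA of course works from measured data rather than round numbers.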
And when we wanted to upgrade to Android 13, it meant that some users were obliged to use a PIN code and not their fingerprint, right? So it was not super nice. So sometimes we fail, but it's also something that we are not ashamed to mention. So you can either contract a long-term support agreement with your suppliers, or you can try to have access to some code. And obviously the second part is the most interesting part that we want to do. And if I look at just the Android part, so Android is just running on the CPU. You see the SoC on the right; on the SoC you have plenty of subcomponents. You have the CPU, the GPU, the modem, etc. And if I look at just the Android part, if we go to the next slide, perhaps you have heard that Android is open source, but Android is not fully open source. Just AOSP, the Android Open Source Project, is open source, but the rest is not open source. So if you look at the orange layer or the purple layer, the hardware abstraction layer or the native daemons and libraries, this is not open source. So it comes with all the downsides that I mentioned before about the lack of longevity. But on the other layers, closer to the high-level layers, you can also find some closed source, and it could be even worse. Speaking about SDKs: if you are an app developer, perhaps you have the willingness to integrate some SDKs, because it could facilitate the development of your app. So you can integrate the Facebook SDK, for example, or whatsoever. And the second big black spot could be that those SDKs could be big data harvesters. So that's the second issue that you see with these kinds of SDKs. It's not only an issue of software obsolescence, it's also an issue for your own privacy. Yeah, I think we don't have a good slide, but that's okay. I wanted to skip this slide, but in the end I will speak about it.
Just, yeah, it's a bit of a sad story, about Facebook disclosing some private messages between a mom and her daughter about an abortion that this daughter wanted to have. And Facebook disclosed all the private messages to the justice system, the American justice system. So that's an example of why it's not great to have all your data at the big giants. If we go to the next slide, let's speak now about the software updates that we do at Fairphone. As Luca mentioned just before, for the Fairphone 2 we reached 7 years. And for the Fairphone 5, which is the last device that we launched, we hope to do 10 years, with a strong promise of doing 8. So I think I'm preaching a bit to the wrong audience to explain why we need software updates, but just to make sure: of course, any software that is out there has issues, and these issues need to be patched. There are always new security vulnerabilities found. So for example, in the Android world, Google is publishing a security bulletin every month with a bunch of new security issues that should be fixed on the devices. And of course, with new Android updates, with new Android versions, you can get new features which Google has implemented. So for example, better auto-fill or better permission management. So bringing Android to devices is quite a complex effort between multiple stakeholders. Any Android release starts out as AOSP by Google. This is taken by the SoC manufacturer, in our example Qualcomm, which then modifies the AOSP code to actually make it compatible with the given SoC. And after they're done, then finally the device manufacturer can get this code and can integrate their changes on top, which makes the software work on the specific device. So for example, adding support for the display or for the touchscreen that's on the specific phone. Then a bunch of changes still need to be implemented for operators.
So for example, to make voice over LTE work correctly, or make some settings be according to their requirements. And then the last step is that you actually need to get launch approval from both Google, which is done by running millions of tests in the Compatibility Test Suite and other test suites, and every single one of these tests needs to pass to actually be able to get launch approval, but also the operators, who test the software and make sure that it conforms to their standards. The process for security updates and normal updates, so not a major version upgrade but staying on the same Android version, looks a bit different. Every month, as I said, Google is providing security updates. The same is also happening from Qualcomm and some other parties. Network operators may have new requirements and, for example, new app updates that need to be integrated. And then the device manufacturer is responsible for actually pulling all of this together, making sure it still works correctly, and then going through the whole approval process again, so running many, many tests and making sure that everything works correctly. This process can be followed for about three years, which is how long Google is maintaining any given Android version. And after that, or hopefully already before that, the manufacturer needs to update to a new Android version, otherwise you're out of support. So, yeah, as was mentioned already, while Android itself is actually open source, with some modifications by Qualcomm, there are a lot of other proprietary components going into the system. On a modern SoC, there are a lot of coprocessors that handle a lot of different tasks, so for example audio or modem or GPU. And these run proprietary code: code where either only the device manufacturer can access it, or, for some, even only the SoC manufacturer. So what happens if this chain is actually broken?
When the SoC manufacturer doesn't provide support for the new Android version anymore, well, this is generally where support in the industry stops, where device manufacturers no longer can provide any updates. For this, we can look at the Fairphone 2, which was our device launched in 2015 with Android 5. In 2016 it got an Android 6 update, but already back then the SoC went end of life. So other devices with the same SoC, for example the Nexus 5, stopped receiving software updates. But still, in 2018 we managed to launch an Android 7 update, in 2021 an Android 9 upgrade, and in early 2022 an Android 10 upgrade. And to understand how we achieved this, we actually need to dig a little bit deeper. So we took over the role of Qualcomm a bit. We reused some proprietary parts from Android 6. And we looked at the Code Aurora Forum, CodeLinaro under its newer name, which is where Qualcomm is releasing their open source changes. And we looked at some of that code to give us a reference for how it could really work correctly. The kernel is also quite an important part that we needed to take a look at. For this we also looked at LineageOS, because they have a great reference of how the code could work together, and they also provided quite a lot of fixes for some components in the system. So to enable communities like LineageOS, we try to open source whatever code we can. Of course, all of our devices also have an unlockable bootloader. And we share all of our code on our platform, code.fairphone.com. Of course, open source is also great for a lot of other projects, like postmarketOS, which I'm also involved in, which is a real Linux distribution for phones and other mobile devices. You can still check out the stand in the AW building to learn a bit more.
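One concrete way to reason about the monthly security bulletin cadence mentioned earlier is to compare a device's Android security patch level with today's date. On a real device the patch level can be read with `adb shell getprop ro.build.version.security_patch`; the sketch below just hardcodes a sample value instead of shelling out to a device:

```python
# How many monthly security bulletins is a device behind?
# The sample patch level is an invented example, not from a real device.
from datetime import date

def months_behind(patch_level: str, today: date) -> int:
    """Whole months between the Android security patch level and `today`."""
    patched = date.fromisoformat(patch_level)
    return (today.year - patched.year) * 12 + (today.month - patched.month)

# Example: a device patched in November 2023, checked at FOSDEM 2024.
print(months_behind("2023-11-05", date(2024, 2, 3)))  # → 3
```

A device more than a couple of months behind has unfixed issues from the bulletins that followed its patch level, which is exactly why the per-month integration work described above matters.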
So for Android, as we said, AOSP itself is open source, but normally all the changes that the manufacturer does to it are not open source. So the legal minimum that any manufacturer needs to publish is the kernel sources, so the Linux kernel, which is licensed under the GPL license. But on top of that, we also try to publish the full Android sources wherever we can. So for example, for the Fairphone 3 and Fairphone 2, we had the complete Android source code with the proprietary components as pre-builts. This version was then without the proprietary Google services and DRM and a few things like this. But essentially you can just download all of the code, compile it yourself and then flash it on the device. And you have an essentially relatively similar build on your device to what we provide for regular users. For our newer devices, for the Fairphone 4, we currently only publish all of the Android source code that our manufacturing partner, our ODM, produces. Unfortunately, the way that Qualcomm has structured the source tree normally makes it not possible for regular users without the Qualcomm proprietary sources to compile this. But still, we think it's really important to have this public as a reference. Because for things like the audio HAL, people can still look at it and see what was changed for this device, and then take over some of these changes, for example for custom ROMs like LineageOS. And we also managed to get permission to make the kernel device tree sources public. This was quite a struggle, because by default they are part of the proprietary package. But we managed to convince Qualcomm to allow us to publish them, also because a bunch of other manufacturers publish them as well. So one problem that we also have with the software on our devices is that the chipset manufacturer provides us with a kernel version that is normally already multiple years old by the time the device launches.
And it's never really updated to any newer version, so the Linux kernel version that the device was launched with is basically the one that it's stuck with. This means that after a while, security patching can become quite tedious, because there are a lot of changes on top of the Linux kernel release as well. And of course, if the kernel release goes end of life upstream, then it becomes even more difficult to backport security fixes. Generally, being stuck on this Linux kernel version doesn't make too much difference to a user. But especially lately we've seen that new Android versions also require new features in the Linux kernel. And if we are not able to update the Linux kernel, yeah, we cannot really update to a new Android version then. So what we can try instead is to push the SoC and device-specific support to the mainline kernel, so upstream to kernel.org. With this done, in a perfect world you can take any recent kernel release, put it on the device and have everything working. Unfortunately, it's currently still far from feature complete, but it is really cool and can already be used for a lot of purposes. There were also lots of great talks yesterday in the FOSS on Mobile Devices devroom; you can watch the recordings later. So, some of the other things that we do. We try to provide Team Win Recovery Project builds, so TWRP builds, for the devices. For example, for the Fairphone 5 we managed to get the build public on day one, when the device was announced. We have factory packages on our support page. This is quite useful for third-party ROM developers, so they can just take the new build, extract the proprietary components from it and integrate them into their build. Also, where possible, we try to support third-party ROM developers, so we try to answer some questions and help them with some problems where possible. We think that the default OS is great for regular users, but some users prefer more privacy-oriented or security-oriented operating systems.
For example, ones devoid of Google services. We think that custom ROMs are really important for those users. Hopefully, soon, the app for our Fairbuds XL headphones will also be open source. A few words about repairability, on the hardware side. As I mentioned before, it's quite easy to repair a Fairphone. If you look at this screen you can see a Fairphone 5. You have 10 modules on the Fairphone 5, so if you break your screen, your display, you can easily change it. We also want to have an accessible, decent price for that, because one of the downsides of repairability is the fact that sometimes the repair cost can be super high. For a display, for example, the replacement cost is on average 44% of the original price. The result is that users prefer to buy a new phone and not just repair their display. The battery is the same. I don't know by heart the cost of our battery, but for the Fairphone 4 I think it is 20 or 29 euros. We really want to make sure that it's not super costly for one of our users to buy a new battery. Of course, the batteries are not glued. I personally think that it should be forbidden to have glue in a phone. So we really, really strive for having more modularity in the phone, and we also fight, or we try to do some lobbying, to push the other manufacturers to do the same. We were part of some discussions at the European level and the French government level to have modularity as a criterion in defining the coming repairability index. And last but not least, we also do an extension of the warranty from 2 years to 5 years, for free, to convince people to keep their phone longer. And we also publish the schematics. About those schematics: we just published the Fairphone 5 ones a few days ago. When you discuss with the competitors, they can mention some reasons why it's not possible to publish those schematics. And the reasons are, for example, some intellectual property issue or some security issue. And this is all bullshit, right? This is all bullshit.
They cannot pretend. So, intellectual property: this is our choice, right? If we want to publish those schematics, if we don't want to hold back on the intellectual property, it's possible. And as for the security reasons: this is not a good argument either, open source is not an issue in terms of security. But you have to know that when you speak with some people at the European level or the French government or whatsoever, some lobbying from the big tech companies could convince those people that security could be an issue with open sourcing. So yeah, that's something that we want to highlight today, to show that it's possible. Yeah, so let's talk a bit about the materials and the factories, where we also try to improve the situation. So a smartphone contains over 50 different materials. Of those, we selected 14 so-called focus materials, where we think improving things can have the biggest impact for now. We try to integrate these materials into the supply chain so they actually end up in the product. For different materials, we look a bit more into the recycling part, so trying to use recycled materials, and for some others, trying to get fairly sourced materials. We also try to map the journey of the materials, and we publish this on our website, so you can look at it. Why do we want to do this? Because we want to scale the fair sources, we want to get more of these into our products. But also, again, we want others to follow us. We want other companies to see what we're doing, to be able to look into exactly how we're doing this, and then hopefully they follow us and also implement this. For example, here you can see the map of some of the materials in the Fairphone 4, and if we click on one of them, for example tungsten, here you can see it is mined in a town in Rwanda.
Then it is processed in, I think, a smelter in Austria, and then goes to a different manufacturer in China, and then finally to the final assembly manufacturer in China, where the phone is actually being put together. We have a very long list of all of the suppliers, smelters and refiners in these documents as well, so you can see exactly what companies are involved here. What about the fair factories? Just about the list of suppliers: the ultimate end goal is to convince the competitors to use the same list, right? Because we have been doing this work of convincing the suppliers to act more responsibly, so we hope that the competitors could do the same. So, about the fair factories. I'm not sure if you have heard about Foxconn; that's a big factory owned by a Taiwanese company called Foxconn, and this is one of the largest employers in the world, almost one million employees, and this company is known for bad working conditions, bad wages, etc. So what we try to do is to collect the workers' voices. We don't want to pretend that we know better than them what good working conditions are, right? This is a decolonial approach as well, so we have Chinese employees working with the assembly line workers to make sure that we understand what the best working conditions are for them. In terms of open source, we also disclose a methodology for how you can implement a living wage in the factories. In this toolkit you have plenty of things to calculate this living wage, and you have some templates for the agreements for the workers, etc. And I'm speaking about this notion of living wage: if you speak with the workers, they will tell you that the most important thing for them is to have a living wage, a decent wage. If you look at the daily wage for an assembly line worker, it is approximately $13 per day. And if they want to have a decent wage, to avoid having to do extra hours, for example, they will tell you that they need double that, more or less, right? So $28.
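The wage-gap arithmetic just described is simple to sketch. The $13 daily wage and the roughly $28 living wage come from the talk; the phones-per-worker-day throughput below is an assumed illustrative number, not Fairphone data:

```python
# Living-wage gap sketch. Wage figures are from the talk;
# PHONES_PER_WORKER_DAY is an assumed throughput, not Fairphone data.
DAILY_WAGE_USD = 13.0         # typical assembly-line daily wage (from the talk)
LIVING_WAGE_USD = 28.0        # what workers say they need (from the talk)
PHONES_PER_WORKER_DAY = 10.0  # assumed assembly throughput per worker-day

gap_per_day = LIVING_WAGE_USD - DAILY_WAGE_USD           # $15 to bridge per worker-day
premium_per_phone = gap_per_day / PHONES_PER_WORKER_DAY  # spread over that day's output
print(f"living-wage premium: ${premium_per_phone:.2f} per phone")
```

With that assumed throughput the premium lands at $1.50 per phone, which shows how a per-day gap of $15 can still stay under $2 per device once it is spread over production.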
So you have a big gap, right, between the daily wage and the living wage. So what we have done at Fairphone: we have paid this gap, right, we have paid those $28 per day. And the ultimate consequence for us in terms of price was to dedicate even less than $2 per phone to be able to pay those people correctly. So in this lobbying that we have done about the living wage, this toolkit, of course we have tried to convince all the manufacturers to do the same. So far, this is a big failure. But we are still hoping that it will work at one point, because let's imagine that all the manufacturers could do that, right? It would be super nice for the people there. We can go to the next slide. Yeah, so a lot of companies have recently also put recycling very big on their front covers. So for example, Apple, Samsung or Nothing, they are very big on the recycling. Unfortunately, the way that recycling is being done is sometimes not great, but it's also just the way that the economy works. You cannot take a phone, recycle everything and then get 100% of the materials back to put into a new phone. There will always be a big chunk that actually goes to waste and which you can't recycle, because either it's made into alloys which you can't separate anymore, or it's in different components where it's not worth getting the tiny amounts of gold back out. So there will always be new mining needed, and somebody needs to look into this to actually make it better. But also, what we are doing with e-waste in Europe is sometimes shipping it to places in Africa. So here are some pictures from Agbogbloshie, which is in Accra, the capital of Ghana. And you can see teenagers burning copper cables, burning the plastic off to get the copper back, to actually recycle the copper. And yeah, of course, you can imagine that this is not healthy for anyone in the area.
And also, recently, I think Ghana has done a bit to clean up this area, but of course it's just going to happen somewhere else, and it's probably happening in a million other places. Okay, time for the conclusion. So yeah, the conclusion that we wanted to do with Luca was to speak about techno-discernment and social justice. We will use a quote from a person called Ivan Illich. I really like this guy, not only because he's Austrian like you, Luca. But yeah, I mean, I started to read this philosopher a few years ago, and he influenced my life as a software engineer. He wrote this book in the 70s called Tools for Conviviality. And if you look at this quote, he's saying more or less that the modern tools, so if you extrapolate a bit, think about your daily work, the modern tools should not be at the service of only a small group of experts, but at the service of all the people, a bigger group. Right? So, more or less, our goal today with Luca: we don't want to speak only about Fairphone, right? We want also to invite you to step aside from your daily work and from your expert position. Of course open source is great, right? We are at a super nice conference. I would like to thank FOSDEM, by the way, the organizers. That's super great. Open source is a super good lever for responsible projects. But open source doesn't automatically make a project responsible, right? So that's an invitation for you to think broader, right? From a broader perspective. It's always good to do open source, for sure. But where will the code run, right? When we think about the hardware, we spoke about the social damages behind cobalt, right? Behind extractivism in general. So we really need, as experts, to think about our products, and we really need to ask ourselves if this product will really empower people. And also, this is an invitation to think not only about your people, your community, but about people far away from you. People far away from our Western countries. Thanks.
Thank you. Do you guys have any questions? Here is one. Thank you for the great talk. You have told a lot about software upgrades and updates and support of your phones. But what about hardware updates? Because the most important part for the pollution is hardware waste, not software. And if, for example, you upgrade your phone, I don't mean the main parts like the CPU, but maybe displays or cameras or whatever on your current models, maybe it would reduce the electronic waste as well. Did you think about it? Yeah. So for reusing old phones, there's actually a project by a Belgian company ongoing where they are looking into how they can reuse old Fairphone 2s to actually do something with, and use them for example for some IoT use cases. Yeah, some of the problems still apply there. Of course, the software support is gone for the old devices. And also, yeah, the same applies to the old firmware on the device. So it's tricky to make a secure product out of the old phone, because all of the proprietary software support has just stopped. In terms of keeping modules between different phones, which I think was part of the question: it is definitely something we are thinking about, but we also currently can't really limit ourselves to keeping the exact same form factor, for example for the cameras, so that the camera modules are compatible between the different phones. Hi. I have a Fairphone here and it's five years old or so, and maybe I want to buy a new battery for it. It's still fine, so I don't need it now. But I think at some point I would want to buy a new battery. Would it be a battery that was produced last year or so? Or would it be a battery that has been lying on the shelf for many years? And I think keeping a battery lying around without charge for such a long time is not very good. So how do you ensure the quality of batteries is still okay after five years?
I mean, for Fairphone 3, for example, we have contracts with suppliers to produce batteries over four years. We are still producing the battery. Oh, by the way, we stopped a few months ago. So we make sure that we are not buying old stock, and we make sure that the supplier is still around. That's also quite critical for us for the software updates, because that's, to be honest, ultra tricky: normally speaking, in the industry, when you stop buying spare parts or whatsoever, the supplier is not willing to work with you anymore. So for us, you know, still buying batteries after years and years helps us also to keep the supplier around for the software part, especially for the Qualcomm proprietary components. For Fairphone 3, for example, we don't have access to the source code of those components. So we needed the supplier to do that, the assembly, what we call the ODM. So we are still buying batteries over time. This is what we try to do. We have time for one more question. Hi. Quickly, so one question was on the business model. How do you stay afloat in such a market? Is it coming from the premium price of the devices, which I was happy to pay? That could be one factor, but I was wondering if you have investors who are particularly interested in sustainability and stuff like that, or where does the money come from? And the other one was: how do you guarantee the 10 years of updates? Do you think you'll be able to force Qualcomm into giving you 10-year long-term support? Long-term support or? For the 10 years of support, the Fairphone 5 is actually using an IoT chipset, for which we actually have long-term support from Qualcomm for way more years than with a normal phone processor. It works very similarly. It's just a different product line from their side. What was the first part again? The business model. Of course, we want to attract new customers, of course. I think it's also...
I mean, we don't need a single customer buying a new phone every two years. We are also happy with them if they keep the current phone for six years or eight years, and then come back to us. And I think there's a lot of room for expansion in just making people keep the phone for longer. People are already keeping their phones for longer, not just because of longer support from the manufacturers, but also because the pace of improvement in the smartphone industry has slowed down a bit. So the phone that you buy now is not that different from one that you bought four years ago. That also helps in keeping them. We know it's just better for the environment, so we try to convince people to keep their phones for as long as possible. Thank you. Let's go. Thank you very much. The time is up. Thank you. Thank you. Thank you.
Take Your FOSS Project From Surviving To Thriving
Good morning everyone. We have Ryan here talking to us about open source, and yeah, take it away, Ryan. Thank you. One problem with having your talk first thing in the morning is there are a lot of empty seats, and I'm pretty sure that's because I saw almost everyone here out last night drinking really late. Louder? Is this better? No? I'm not sure I can make it louder. Is there a set? Is there a...? Okay. How about if I just talk like this for a little bit? That's going to be hard, because the thing you didn't hear was I said we're missing people because I saw them all out drinking last night, which included me, so talking at this volume the whole time is going to leave me without the ability to do anything the rest of today. I'm Ryan Sipes. I'm the managing director of product for the Thunderbird team at Mozilla. And that's a little weird to say, because Thunderbird exists in a different Mozilla than you guys all know. There's a foundation, which is a non-profit. Hey, that's perfect. And there's a corporation that develops Firefox, and then there's another corporation called MZLA that develops Thunderbird. And that's due to a unique history of not being sustainable, of Thunderbird not being sustainable. There was a great product, a great open source project that attracted a lot of users. There are multiple counts, and depending on which one you believe, I happen to believe we attracted as many as 30 million users at one point, and not at all monetizable. At least that's the telling that I got when I came on the project, which is: we haven't really found a way to make money off of this, so paying developers is hard, so we're giving it back to the community to steward. And so that takes me into this talk, where when I came on, it was a project that was run by a very active community of volunteers, and there were a couple of us, a few of us, who were able to work on it for a job. I came on part-time as community manager, and if I remember right, there were only two other people when I came on.
For a product that still had 25 million users, there were a lot of days that we couldn't even build the product. Literally, it would be a red tree every day, and we would have to work really hard just to get it into a buildable state. I'm speaking to a room of developers, and so I can say that a lot of that was because we were downstream from Firefox, which had a thousand people working on and developing the product every day, and so absorbing that, thousands of changes in a week, for a project that only had one single developer working full-time, was difficult. But that's the backdrop against which to kick off a story of success. Maybe one of the few stories of really incredible success around sustainability and open source. So why do I have credibility to give a talk on sustainability and success in open source? This is our revenue year over year. 2017 is when I came on, 2023 obviously last year. And when I came on, it wasn't bad. We had $700,000 in donation revenue. Last year, that number was $8.9 million in donation revenue. Going to an online percentage calculator, I determined that's a 1,108% increase over the last six years. When the bars in that graph start to get higher and higher, people start asking you why. And it may seem like an easy answer. You're like, there has to be a why. And yes, the whole time I knew there was a why. But describing that to someone else felt like: I need you to understand about 600 things, and then we can talk about why. So for you guys, I did the thing of distilling 600 things into three, which I'm going to try to impart to you. And why is this important? Well, I believe it's really important for every open source project to have a map to sustainability. I took a class a few years ago around product management. And one thing I was really struck by was the professor said every product, before you launch it, should have a business model.
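As an editorial aside, the growth figure above can be sanity-checked from the rounded revenue numbers as quoted; with $0.7M and $8.9M the standard percent-increase formula gives roughly 1,171%, so the 1,108% quoted in the talk presumably comes from the exact, unrounded revenue figures:

```python
# Percent increase in donation revenue, using the rounded figures
# quoted in the talk ($700k in 2017, $8.9M in 2023).
start, end = 700_000, 8_900_000
pct_increase = (end - start) / start * 100
print(f"{pct_increase:.0f}% increase")  # ~1171% with these rounded inputs
```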
Even if you're not going to deploy the business model to make money, if you're a startup, let's say, your intention is just to grab users, as many users as you can, and not monetize the product. It was this person's opinion that you should know when and how you're intending to monetize the product. In open source software, I would say that that's not always the case. You know, you don't always intend to monetize your open source software and in fact, you shouldn't always. But you should always have a plan for how you're going to sustain the project. And I don't need to, you know, originally I was going to find some articles of really big open source projects that had really sad episodes or endings. But then I decided you guys know many of them, which is, the story is pretty common, right? Like you have a person or a group of people working on a piece of software. It gets really popular. And then something happens to that person or this group of people. And it doesn't even have to be a sad episode. It could be that the person developing the software has a kid and loses time to work on the software. But it could also be sad, you know, they die. Okay, now the software is going to go stale. There's no one to maintain it. And so it's my strong opinion, just like that professor who said you need a business model, that you need a sustainability plan for your software. And this will come back to Thunderbird again. So I can tell the rest of that story. I'm going to drink some water. A sustainability plan is you just forecasting into the future and saying this project, or maybe you don't have to forecast in the future. Maybe you're already running a successful open source project. But you sit down, you say, how are we going to make sure that there's always someone or a group of people developing this software? Or you can say, which I don't recommend, but you could say, no, it's just me. And when I die, it dies. And that's okay too. Now, at least you've thought it through. 
But let's say that you want your software to be healthy and go into the future in a way that people can count on it. As I spill my water all over myself in front of like 100 people: you need to lay out a plan. And for Thunderbird, I think that plan should have been developed from the start. That plan would have said: we're going to try a couple of methods to monetize this, because we know we need at least a handful of developers in order to consistently develop this software. Simple, right? That's all it needs to be. You say: one person is not going to maintain this, so we're going to have to figure out how to monetize this project so that we can pay people to work on it in order to sustain it. And the fact that that didn't work out is why Thunderbird entered dire straits. And what we ultimately figured out, at least for now, and plans can change, is that our sustainability plan was to say: we know we need at least a handful of people working on Thunderbird in order for the software to work. I lived through the volunteer days of it, and it was not pleasant, it was not good, it was not a way to sustain the software. And so the answer had to be: no, we have to pay people to work on this, to do the crappy stuff, to maintain it. It's not always pretty work, but it has to be done, and therefore we need money, because people are not going to actually do some of the stuff that we need them to do unless we pay them. We know that from experience. I was talking to a bunch of people about what Thunderbird looked like when I came on. And to drive this point home, I've now heard a bunch of different ways to describe it, but the one that I thought of last night is: you go into a house, maybe you buy a house, and you walk in and you're like, man, this is a really nice house, it's big. A lot of people, you know, really like it, they come by, they check it out.
And then you're like, all I'm going to do is update the kitchen, because the kitchen looks a little dated. And you start pulling out the cabinets, and there are just termites everywhere, there's a pipe just shooting water off in one direction, and it's just not pretty. And every time you go to change something in that house, it's the same thing. You just open a closet, and there's like a clown in there, and you're just like, I'm just going to close the closet and not think about that for a little while. And so for us, we said, okay, you can't sustain this project with just random people off the internet working on whatever they want to do, whatever they want to add. In fact, that's really bad. That's like: the house has all these structural problems, and someone's putting a pool on the roof. And so we needed a plan, and that came in the form of trying to monetize it. So, to come to a sustainability plan, you kind of have to ask yourself some key questions. And this is probably not that crazy, but I wonder how many of you have actually asked yourself this about your open source project. If you're not working on a project that's like Red Hat, like, we know how Red Hat is sustained; you know how you're going to make money, you know how you're going to sustain the project, I hope, I don't know. But for me, something that I took stock of over the years at Thunderbird and tried to think about is: okay, how much effort does it take to do this project? In a perfect, maybe not a perfect world: what is the minimum viable level of effort that it takes to maintain this project? And then I thought, you know, who are the key stakeholders? Well, that one was really easy, because that was the tens of millions of users. They were definitely the number one stakeholder.
And then the third one, which I'm not sure anyone had really spent a lot of time thinking about, is: how do we communicate with those stakeholders? Businesses creating proprietary software are mulling this thought around in their heads right from the outset, which is: okay, someone downloads a product, and there's either something within the product that we're going to use as a mechanism to have a conversation with our users, or there are other channels by which we have these conversations with our users. And that can serve a variety of purposes, right? One, obviously, for a commercial product, is to convert people to paying users. Maybe it's a free product and then you pay for some kind of additional features or whatever. For us, we really hadn't developed these mechanisms, and the people that we had following us were following us in places where we were only able to speak to a fraction of those users. So whether that's an IRC channel, you know, obviously, you're not capturing all of the Thunderbird users there, or a Twitter handle, not capturing even remotely all of the users there. And then once you have answered the first one, you think about this fourth one, which is: aside from the effort, what else does the project need? For Thunderbird, it's infrastructure; building and distributing Thunderbird alone is a source of cost in and of itself. So I call that out because you have to take a holistic look at: what does it take to build and distribute and make this software available to everyone? Okay, so I've told you a lot of really basic things that you probably could think of if you spent 15 minutes thinking about projects, and why did I do that? I've been contributing to open source for probably 20 years in some capacity or another, and I'm losing a lot of people. We're losing them. Yeah, I'm gonna have to like do a cartwheel or something.
I call this out because I don't think most open source projects start from a place of any plan. I don't know if you guys agree, but oftentimes it's just like: I'm gonna do this cool thing. It's gonna scratch my itch, or it's gonna scratch the itch of people I know, and what happens happens. And in talking about a sustainability plan, I'm asking you, please, for the love of God, don't do that. Just take the extra, whatever it is, 10 minutes, to think through: okay, best case scenario, I create this software. How am I gonna think about these four things? Because what we don't need in the world is more unmaintained open source software that people rely on. Because that creates a bad ecosystem and a bad reputation for open source software, and we all know that. We all know people who we've turned on to open source software who essentially just say, oh yeah, well, I'll just use, and I'll use Thunderbird as the example because I don't want to trash any other project. Something you could imagine hearing about Thunderbird is someone saying: this is just crappy Outlook. I can't afford Outlook, or I can't use Outlook for some reason, so I'm using crappy Outlook. I don't believe that, but that is the result you get for not having a sustainable, maintained open source project. And then if you're like family members of mine, and you learn that this is an open source project, and it is crappy Outlook or crappy Photoshop, or insert the software of your choice here, you associate open source software with not as good. It's just not as good. And I don't think any of us want that to be the outcome of our work. One major challenge that I had in helping get Thunderbird to a place where it was sustainable was that I had a community of developers who, whether or not they'd admit to believing it, believed (this is what they believed, not what I believe) that money was bad.
Anytime I brought up the fact that we could do this to raise donations, they're like, we don't want to annoy the user. We should just make the software, and if they manage to stumble across a donation page somehow, randomly, that's okay. And the thought I kept having in these conversations, because I was saying, oh, okay, once a year, we could just put up a little thing, a little popup or something that just says: did you know Thunderbird runs on donations? That's how we pay folks, so please give. And I got so much pushback for that. And it's because, when you really talk to people, they thought that asking users for money in a direct way was somehow not an activity that an open source project should do. I don't know where that comes from, but I know it's true of a lot of open source projects, because when folks started asking us, how did you raise donations? and I told them, one mechanism is we hit all users with a full-page donation appeal every year, you just saw their faces drop, and you could just tell that they were like, I'm not going to do that. And you know, it's funny, because I also felt uncomfortable with that. I thought, this is going to look like spam. This is going to be annoying. Maybe users will leave, because it'll be like, I just want to do my email, and this is bothering me, and I never want to see a popup like this again. And I don't remember which piece of software it was, but a little later, after that thought, I was using some... oh, it was Evernote. I opened up Evernote. I don't actually use Evernote, but I had used it in the past, and I was like, I know I put a number in here that I need to remember. So I'm going to open it up and find that number. Before I could even look at a note, there were like three things in succession that I had to exit out of. And they were all like, you should pay us. And then it's like, no. And it's like, yeah, but you should pay us. And I'm like, no.
It's like, well, if you don't pay us, we're going to do X, Y, and Z. And I'm like, no, I just need my number. But after that, I was like, you know what? I bet that's happened a lot in a lot of different programs. And I don't remember it, because exiting out was not an activity that I committed to memory. It's just like, oh yeah, of course they're going to ask me for money, like, whatever. But that, my friends, was a eureka moment: I don't remember any software asking me for money, but I know they do it all the time. And so that's what we did. Popped up full screen once a year. Help keep Thunderbird alive. This is like the history of Thunderbird in one page. Did you know that less than 1% of Thunderbird users fund all of our work? That was especially true when this was displayed. Not too long ago, Thunderbird was on the verge of extinction. We don't show advertisements or sell your data. We're completely funded by donations from our community. That, my friends, is a $6.8 million appeal. Well, now it's more than that, because we've run it twice. So it's a $15 million appeal. That, just this. And, you know, when I ask people, our users: do you remember the bird, the end-of-year donation appeal? I've asked probably, I don't know, 100, 150 users at this point about this appeal. You know what most of them say? I don't know what you're talking about. The bird, the Thunderbird. I'm like, no, he's a bird holding a heart. He popped up over your whole page, you know, he took over your whole window, like a couple weeks back. No, I don't know. I don't remember. I'm sorry. This bird sounds really important to you, but I don't remember it. And some KDE guys asked me about, you know, raising donations. And I told them exactly what you might imagine came out of this, which was: pop something up a couple times a year, full screen. You know, we don't ask often. Give us a little money.
And they're like, people will revolt. They'll switch to, you know... I'm like, maybe, but honestly, it'll pop up in December, and you'll ask them a week later, and they'll have no idea what you're talking about. Because if they don't donate, they're just going to hit the X and just move on and not think about it. Maybe next year they're like, oh, yeah, didn't they do this last year? But probably not. Because the moment they go to Wikipedia and get that appeal from our good friend, Mr. Wales, you know, these things are all invisible. This year we tried displaying this a few times a year. Donations went up. Nobody remembered. Nobody remembers the bird. That's just not how human brains work. We're so inundated with incoming signals all the time. And that's the point. You're not that annoying, maybe to your friends, but these appeals are not that annoying. Because we live in an information environment where this is just something people expect and something that people have grown to not see anymore. And so that's the takeaway. But there is one other thing. And now we'll go back into Thunderbird. Because ultimately I'm up here to both tell you how to make a sustainable project and implant in your head that you're going to come out of this and you're going to install Thunderbird. So we asked our users for money. That's pretty simple. Because we knew, in order to be sustainable, that we really needed at least 10 engineers working on it just to make Thunderbird run, not to do a bunch of fancy stuff, just to go. And once we were able to set up this model, it became a lot easier to convince the other developers on the project. Those ones who were like, money is bad? Now they're like, money's not that bad, and it helps us sustain the project. And users understood that too. That appeal said, essentially: you get value from this, and without you, it doesn't work. And you know, I lied a little bit earlier. I did hear from users who do remember the appeal.
And especially after the first one. And I'm going to look at someone in the audience, because he may have seen a negative comment. But I never saw a negative comment about the full-screen takeover. In fact, I saw positivity, folks saying: I just assumed that you guys ran off of, like, Google money. If I had known that you were reliant on donations all these years, I would have been donating the whole time. And I got that hundreds of times, of just people saying: of course I want the tool that I'm using every day to manage my email to be sustainable, to be funded. But you never told me that you needed my help. You never told me that that was on me. And so I remember the strangest feeling of being thanked for telling people to give me their money. So there you have it. But this slide, which I haven't even talked about since I put it up, is the other piece, which is: you're all maintaining different software. Or maybe you're not. I don't know what you do. But I assume most of you are engaged with some piece of software that you develop. And it's not one size fits all. If your software isn't public-facing like Thunderbird and doesn't have 20 million users, this is still not irrelevant to you. You probably do have stakeholders. What does your conversation with them look like? If it's a big enterprise that's leveraging your software, are you talking to them? Have you talked to them? Have you said: hey, something really bad could happen, the software could go away? Don't be like the mafia about it, but maybe you should at least remind them that that could happen. Something really bad could happen. Somebody could get hit by a bus. That guy is me. But if you don't make that clear to the people who rely on your software, if you're not sharing with them the need, the pieces of the story that tell them how your software is sustainable, they're not going to know. And then you're going to have these... I remember, and this is the example I'll finally use.
What piece of software was it? Any of you could answer this, and I'm not going to pull it out of my head. But we've seen it, right? Just go to Ars Technica. Some open source project that's used by Google has a security vulnerability in it, probably today. At some point, that's going to be exploited. And the Ars Technica story is the same every time. It's like: major exploit found in this library, in some product we all use. And then they figure out, oh yeah, it was because they were using this library, and the maintainer stopped maintaining it 10 years ago. Google would have much preferred that that maintainer put, in the repo maybe, just a big thing at the top: I can't afford rent. I'm going to have to go work at Starbucks unless somebody gives me some money to work on this thing. I guarantee a lot of developers downstream would be like, oh, that's really bad. We need to give this guy some money. And that's what I'm talking about. Don't be that guy. That is a dick move. You created the thing. You're not maintaining it. And that's not the dick move. The dick move is: before you stop maintaining it, what did you do to let everybody know who relies on it? What was happening? Why is it going away? What can you do to prevent it from going away? And that's something we see all the time. You guys know it. I know it. So just think about it. Think about it. Today, tomorrow: how am I going to make my software sustainable for the future, for the people who rely on it? And you'll think of something. Try different things. Make that part of your day. Make that part of your development process. And I know that's the most annoying thing to say. But just, you know, like: I'm going to dedicate 3% of my time working on the project to just figuring out how to make it sustainable. Because I think that's good not just for you, but for me, for open source, for this software movement. And yeah, that's my talk. Thank you. Thank you.
And I'm happy to take some questions. You can ask me about Thunderbird. You can ask me about sustainability. You can ask me whatever you want. You can ask me how I get my hair like this. But we have another five minutes with each other. Thank you for your talk this morning. Are you helping your colleagues at Firefox? I get asked a lot of questions from those teams. It's true. At first it was an anomaly. You know, people didn't really know what to think when we started on this path to sustainability. And to be honest, at first it was met with maybe some snickers, like: oh, they're funded by donations. Well, let's see how long that lasts. Now, you know, six or seven years into it, I get a lot more serious questions, because it's kind of changed, right? It's like: oh, wait, folks are paying you for no value exchange, no immediate value exchange? Maybe we should figure out why they're doing that and whether or not we can apply that to other things. But I will say, though, it is scary too. It's scary to run off of just pure donations. And I don't think you should always choose a donation model for your open source project. If you have any other model available to you, use that, because, I do tell my team, one of my biggest worries is that in an economic downturn, for instance, donating to Thunderbird would probably be a much lower priority than buying food. And so, you know, donations are not necessarily always the best model for sustaining an open source project. And when I put up, like, "do it your way": you should use whatever mechanism, whatever mechanisms you have available, that are going to make the project most sustainable, in my opinion. Any other questions? Yes. What's the current size of the Thunderbird team? I'm trying to find who said that. Over here, down here. It's 32. We're hiring for an additional 13 roles this year. So it'll be 45 by the end of the year. And we expect to continue growing, barring an economic downturn. So, yeah.
There's one down here. Do you get a lot of contributions from people who you have not hired, like from random people on the internet? Or is it mostly the paid developers that are developing Thunderbird? I didn't hear all of that. Can I get it one more time? I was asking if you get a lot of contributions from people on the internet, as opposed to paid developers? We used to get more. We used to get more contributions out of necessity from our community, because they were maintaining the project. So from 2012 to 2017, maybe even 2018, I would say that we had a majority of contributions from the community. But as I said at the beginning, the problem was people were scratching their own itch and not addressing the actual product needs. That's right. And so that was a really weird time, because you may have a really bad, bad thing happening in the software, and we had a few, but if it was going to take two months of consistent development to fix it, volunteers were just like: yeah, I'm really just here to fix filters, and then I'm out. And so once again, sustainability: you have to have people around your project to work on the hard things. So you mentioned, you showed the full-page thingy. Did you show it to everybody the same day, or is it staged? Because I feel like everybody the same day might end up on Twitter, like: oh, what is this thing? And then snowballing. Yeah. Yeah. So the answer is we have not been very sophisticated. We hit everyone. We don't hit them all on the same day, but that's not because we're clever and developed a system for, you know, spreading that out and A/B testing it. It's because they see it when they update to the newest version. And not everyone updates the day of the release. In fact, no one... okay, I shouldn't say that. It's just spread out because people update on different days.
But that also gives us some data, like, hey, we performed better this week than we did last week with this group. In the future we'll have a more sophisticated mechanism, and I'll come back and tell you what worked and what didn't. Two quick questions. One, do you use a particular way of collecting the money? Is it a PayPal-driven thing? And the second one is, do you maintain some kind of a pad to try and smooth out that economic curve, like a year in the bank, so to speak? Yes, very good questions. On the first one, we used to have our homegrown donations stack from Mozilla, a stack that Mozilla created for the foundation. You know, there's this question in any endeavor: does this make your beer taste better? And we found the answer was no, maintaining a donation stack does not make our beer taste better. So we've moved to a platform called Fundraise Up, which supports way more ways of giving than we ever did. So I would recommend: if you don't have to create a donation stack, please don't. Use some tool or platform to do that for you. And then the second question, which I'm trying to remember... oh yes, we have a year in the bank, and that's how we hedge against potential donation shortfalls. So far donations have only gone up, which is good, but we're hedging against the scenario that I talked about, and I think that's wise. It definitely gives our developers, who know we're donation funded, a real sense of stability, which is good. Well, thanks for your talk. I see possible applications of this method not only for software, but also for Creative Commons, art projects or something like that, just to throw it in. Yeah, I think, I mean, it's super simple, right? And my time's done, so I'll answer this and then I'll be done.
The model is just this: even if you're creating something that's free, you should be thinking about how you're going to support that work. And especially if there are people in the world who find value in what you're doing, you should be communicating with them about what it takes for them to continue to receive this value. There just has to be some layer of support around it, which I think is straightforward, but I just don't think it's a process a lot of folks go through when they're creating something for free. They're always thinking best-case scenario, like, I'm going to be able to dedicate this amount of time to this in perpetuity for the rest of my life. And then something happens. Maybe you're like me and you have twins, and then everything you used to do, all your open source projects, in my case, they just stopped being developed if they were only developed by me. If I could go back to myself six years ago, I would smack myself and be like, you don't have time for this, you don't have time for maintaining all these different projects; you should be pulling in other people to help maintain them, whatever it is. But it was not a thought that I had. And that's what I would like everybody to think about before you start something, or if you're in the middle of it right now: how are you going to sustain it in the long term? Are you going to sustain it? Okay. Cool. Thank you so much.
NetBSD 10: Thirty years, still going strong!
Okay, finally we're set up. Many thanks for your patience. Let us kick off immediately with NetBSD 10. As you probably know, NetBSD turned 30 years old, or 30 years young, last year. There have been tremendous improvements in security and in support for CPUs, GPUs and the like. NetBSD is also quite well known for its package system. We have Benny here with us. Benny has been a NetBSD developer for more than 10 years, and he has been with us many times at the NetBSD developer room here at FOSDEM. So who better than him to talk about the 30 years of NetBSD? Ladies and gentlemen, please welcome Benny Siegert. Thank you for the kind words of introduction. So welcome to this talk today. I have mainly three topics for you: 30 years of BSD; the new NetBSD release, NetBSD 10; and, are we at 50 years of BSD yet? That's what I want to start with, and the answer is maybe, depending on how you count. So 1BSD, the first Berkeley Software Distribution, was released in 1978, so that's not 50 years. However, the work on BSD started in 1974, when the computer science department at the University of California, Berkeley got a PDP-11, installed Unix on it and started hacking on it. In fact, they didn't have sole possession of it. They shared it with the mathematicians, who were using a different OS, so twice a day they had to switch the stack of disks and reboot. So 1BSD, I don't know, isn't that interesting; it's mostly a collection of utilities. You already need Unix installed, and you can install some BSD stuff on top. 2BSD may be the interesting one. It came out in 1979, and it's kind of a full system. And what I find amazing is that 2BSD is still maintained today. There's a collection of some crazy folks on the internet, obviously, that are pushing out patches for 2BSD every once in a while; the last one was less than a year ago. Like 1BSD, 2BSD is only for the PDP-11, but you can emulate one.
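As an aside, emulating a PDP-11 for 2.11BSD takes only a few lines of SimH configuration. This is an illustrative sketch, not from the talk; the CPU model, memory size and disk image filename are assumptions, so adjust them to whatever your 2.11BSD disk image expects:

```
; pdp11.ini -- illustrative SimH configuration for booting 2.11BSD
set cpu 11/73, 4M       ; a CPU model that 2.11BSD supports
attach rq0 211bsd.dsk   ; MSCP disk image containing the installed system
boot rq0                ; boot from that disk
```

Running `pdp11 pdp11.ini` should then drop you at the 2.11BSD boot prompt on the simulated console.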
You can use SimH to run a PDP-11 emulator, run 2.11BSD and see what it's like. You can even go one step further, if you're willing, and buy this thing here, which is called a PiDP-11. It is the front panel of a PDP-11/70, except shrunk by a factor of three, so it's not quite as bulky. In the back you stick a Raspberry Pi, and it runs a PDP-11 emulator, and all the lights and all the switches, even the key switch, do the right thing. So you can do this and run any of the PDP-11 operating systems, including 2.11BSD, in 2024, if that's your thing. Then 3BSD made a major change: they ported it to the VAX. At the time it was still single-architecture, so PDP-11 support vanished and was replaced by the VAX, and 4BSD was kind of the same. By the way, this will be very abridged; I will not go into all the details. BSD history would be its own talk, and other people such as Marshall Kirk McKusick have done this much better than I could. But anyway, I want to highlight one release, which is a bit weird. Frankly, Berkeley is terrible at naming. 4.3BSD-Tahoe was named that because, in addition to the VAX, it supported the new Tahoe architecture. Now, you have probably not heard of the Tahoe architecture. That's because it was a colossal failure. This is what the workstations looked like. The only company manufacturing these workstations went out of business about two months later, so there are practically no surviving machines; I don't think anybody knows much about these. Apparently Pixar had one and used it for special effects, running BSD. What was special about this release is that it was for both the VAX and the Tahoe, so they separated out all the bits that are specific to one architecture. People took this and said, I don't care about Tahoe, I want to run it on something else, and ported it to all sorts of other architectures.
And this is sort of the origin of the multi-architecture nature of modern BSD operating systems. Again, I'm leaving out a few steps. Then it gets very messy; there was a lawsuit involved and so on. And at some point there was a release of BSD that ran on PCs. It was called 386BSD, and the team was, I think, two people. There was a lot of buzz around it, a lot of development and a lot of patch sets. But the development of 386BSD itself was kind of sluggish, so people started taking matters into their own hands, and that's where NetBSD comes from. So I found the announcement from 1993 of NetBSD 0.8; for whatever reason they called their first release 0.8. And it starts off a bit odd, like, you've been wondering what I've been up to, blah, blah, you know, I've built this new system, I'm calling it NetBSD. Essentially, realistically, it's the last 386BSD release, 0.1, with all the patches that were floating around the net, and that were okay, added on top. And that's also why it's called NetBSD: because in 1993 not that many people had internet access, but NetBSD from the start embraced the internet as a method of development, of getting patches, of distributing the OS and so on. As the announcement says, NetBSD, as the name implies, is a creation of the members of the network community, meaning the internet; without the net, it is likely that this release would not have come about. So this is 30 years, a bit more than 30 years; it's not quite 31, so I guess it still counts. I want to dwell on this announcement a bit more. By the way, there are four signatories at the bottom of the announcement. You see CGD is the one who sent it, but Theo de Raadt is also one of the signatories. He was one of the founding members of NetBSD, before they kicked him out and he ended up founding OpenBSD. But what is interesting is that it tells a little bit more about the purpose and the plans for NetBSD.
And it's interesting seeing these and comparing with what has happened since. So I've added a few highlights here. Why do this at all? It says: we consider this an escape from the political wars surrounding our wonderful operating system, and we want to do a stable, production-quality release. And also this bit: we intend to integrate free, positive changes from whoever will provide them, and we hope to create a stable, accessible system, and to be responsive to the needs and desires of the users. So here you can see the project values laid out in short form: no religious wars, stability, community, acceptance of patches if they're okay. And I think these have largely held up, honestly. Thirty years later, I think the NetBSD project is holding up these values, even though probably most developers of today haven't read this thing; I mean, I hadn't until I prepared this talk. That's quite nice. And then what ended up happening is sort of a Cambrian explosion, and one aspect of people contributing their stuff is that people contributed support for the machines they were using. And that is how NetBSD got this reputation of running on all the things, even including toasters. So this was sort of the peak of NetBSD porting. The year was 2005. This person here is Jeff Rizzo, a NetBSD developer, and the company Technologic Systems presents a toaster running NetBSD. Of course it runs NetBSD. You see there's an ARM board there, in the toaster, so it's sort of IoT in a sense: it had remote management, and it could actually manage the toasting function, so it could change the duration or the heat or whatever. It has a little display which you cannot really see at the front. So this was famous, you know: it runs on everything, even on the toaster. And that was 2005. And now I want to go slightly heretical here and ask, is any of this relevant today?
Because if you look at the list of architectures that NetBSD calls ports, there are, I think, 71, if I remember correctly. And they're in three tiers. Tier one is the good ones. Tier two is the ones that may have some problems but are sort of chugging along, maybe not the main focus of the project. And tier three is the ones that are on life support and basically dead. Anyway, if you look at the tier two architectures, they all seem kind of retro in a sense. Dreamcast, really? Amiga, the BeBox. Who here has a BeBox? I don't think any one of you has a BeBox. So I'm going to offer this: I think the portability argument is more or less dead, because there's no modern hardware, I think, that runs NetBSD but not Linux. If you look at, say, new RISC-V boards or something, they come with a Linux kernel; they don't usually come with a NetBSD kernel. By the way, this is the list of tier one ports. So the important focus areas are ARM, x86 in 32 and 64 bits, SPARC, Xen, MIPS and PowerPC. Anyway, so these are the tier one ports. It's a good list, but still: is portability the sales argument anymore? I don't think so. So what remains? Why would you want to use NetBSD? Going back to the values we heard about earlier, we have an accessible system, but it's still powerful and stable. And by accessible, what was meant then, and what I mean now, is: it's something you can understand from top to bottom. If you're starting out with Linux, let's say, and you try to figure out how a modern Ubuntu system works, with systemd and a hundred daemons running everywhere and things reconfiguring themselves, it's very hard. It's simple on the surface, but underneath hides a ton of complexity trying to make stuff work. NetBSD is different, I think. It's simple throughout, and that way you can understand it, all the layers and how they work together. There's also documentation there.
There's the NetBSD Guide, which is very long and complete and has a ton of stuff, so you have one ebook, if you will, that explains the system to you. You can read the man pages, which, unlike in Linux, are usually complete and well written. There's no systemd; maybe some of you like that. But I keep saying to people: why should you try a BSD operating system? I think it's a learning opportunity. Even if you're, say, a Linux user today, and you try BSD for a few months and you come back to Linux, maybe you have learned something about the system. I think that's good. But also, NetBSD has some crazy research things in it, while also being kind of old-school Unix in some sense. So I think it's a nice compromise between these two very different worlds. If you boot it today, you get a graphical console. You have graphics acceleration. You have ZFS, modern volume management. You can run all sorts of software: you can run GNOME on it, you can run Firefox on it, you can run Rust programs, Go programs. It's all there. And I think the main actual uses that people are making of NetBSD: one is on the server. It is a very solid, very reliable server OS for things like routers and firewalls. But also, it is surprisingly useful as a desktop OS. You might have to make some compromises here and there, but you can listen to music in Firefox or some other player, you can watch YouTube videos, you can use LibreOffice, whatever you want. It's all fine. Or you can do things like this here: this is a Dynabook running NetBSD with pen input. Again, coming back to the announcement email, I keep coming back to this, there's so much in it: the welcoming community is also an important thing. I think not all open source projects, not even all BSD projects, have welcoming communities. I think NetBSD does. This is the group photo from the 2019 pkgsrcCon in Cambridge.
I don't think we have done one since COVID hit, unfortunately. But people are generally nice and welcoming. I think that's very important, and a good thing to have. Changing gears a little bit, I want to talk to you about the new release. We've done 30 years of development, but what do we have to show for it? We have NetBSD 10, which, in fact, is not released. When I wrote the abstract for this talk in October, everybody was saying it's going to be released in a month, so I just put it as a given here: we have the new NetBSD 10 release, I'm going to talk about it. It's not there; we have release candidate number three. But it's okay, it's all in release candidate three. To understand where we are, I want to talk a little bit about the release timeline, and maybe also about the way NetBSD development is organized in general. So NetBSD has a core team, and a slightly larger group of developers who have joined the NetBSD Foundation and officially joined the project. They're the ones who can directly commit to the repositories; if you're not a NetBSD developer, you cannot directly commit to the NetBSD repo. And all the development of NetBSD happens on the HEAD branch, the mainline. Then, every once in a while, there's a branch for the numbered releases. And between the branch and the release of the .0 version, there can be quite a lot of time, because you find that there are problems that you didn't notice. Usually, once you have a release branch and you're in beta at that point, many, many more people start using it, and they find many more problems that you didn't know about. So the release of NetBSD 10 is probably going to be in February sometime, but the netbsd-10 branch was cut in December 2022, so it's already been branched for quite a bit.
And after the branch is done, there are no direct commits by random people into the branch; instead it all goes by tickets, and they're reviewed and discussed, so it's a bit more stable in that sense. So the basis of this development is already a bit old, already from 2022. If you have hardware that's newer than 2022 and it doesn't work on NetBSD 10, maybe current is actually better. But the point I also want to make is that the netbsd-9 branch was cut in July 2019, so you have three and a half years of trunk development that has also gone into this. I'm going to explain some things that have changed and are new, but there are a million other small changes that would be far too boring to talk about. So the one thing you might immediately notice is performance. Performance has increased a great deal, especially file system performance, which, to be fair, was not very good before. It's good now, I think. And the scheduler has improved a lot: if you have a system that has big and little cores, for example an ARM system, or an Intel CPU with performance cores and efficiency cores, the scheduler is aware of that, and depending on how much punch you need, it'll use one or the other. That's very nice. The graphics drivers have been updated to be on par with Linux 5.6, so you have accelerated support for AMD, for Intel, for NVIDIA. There's a new WireGuard driver, which may be interesting if you're using Tailscale or another WireGuard-based VPN; this is not the original WireGuard, but a re-implementation from the spec. And then there's a much improved ZFS, a newer version, and also much improved virtualization, for example for Xen. I don't know if a lot of you use Xen, but in the past Xen had two virtualization modes. There was the HVM mode, which did not require any collaboration from the OS, so you could run an unmodified Windows on it, but it was kind of slow.
And then there's the PV mode, the para-virtualized one, where the OS is aware that it is running on Xen; it's its own architecture, basically, where the kernel is directly written against the interface of the hypervisor instead of pretending to talk to hardware. The Xen folks have added three sort of in-between modes, and I think we can do them all. One thing you can do, if you're on HVM, is gain speed by using para-virtualized drivers for network and storage; these are called virtio, and NetBSD has those. There's a mode called PVHVM, where interrupts and timers also don't pretend to talk to some Intel interrupt controller, but talk to the hypervisor directly. We have that too. And the highest performance mode these days is called PVH, which is a para-virtualized system, so it uses the Xen kernel, not the, I don't know, amd64 kernel, but it uses hardware support for page tables and such. This is the highest performance mode; if you're using Xen, this is what you should be running. And the whole thing is more multiprocessor-aware: dom0, which is sort of the host system, can be multiprocessor, and the individual VMs can be multiprocessor. This is really nice. This graphic here, showing the different colors of Xen modes, comes from Brendan Gregg's blog, by the way. If you don't know Brendan Gregg, you should look him up; he does good stuff, and he has done amazing talks at FOSDEM before. Then, in terms of hardware, I think the biggest amount of work has gone into ARM. In general, I/O is a lot better on ARM: if you're running it on a Raspberry Pi, let's say, you'll notice you have more network throughput, more disk throughput. That's all really nice. There's support for the security features in modern ARM processors, in the newer ARMv8 instruction set revisions; many of them help against return-oriented-programming-style exploits.
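To make the PVH mode concrete, a guest definition for Xen's xl toolstack might look roughly like the sketch below. This is an illustration under stated assumptions, not from the talk: the file paths, guest name, bridge name and sizes are all made up, and the kernel must be a PVH-capable one.

```
# netbsd10.cfg -- illustrative xl guest configuration using PVH mode
type   = "pvh"                                # the para-virtualized, hardware-assisted mode
name   = "netbsd10"
kernel = "/var/xen/netbsd-GENERIC"            # a PVH-capable guest kernel (path is an assumption)
memory = 2048
vcpus  = 2                                    # guests can be multiprocessor now
disk   = [ "file:/var/xen/netbsd10.img,xvda,w" ]
vif    = [ "bridge=bridge0" ]
```

The guest would then be started from dom0 with `xl create netbsd10.cfg`.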
For instance, the kernel can authenticate pointers: pointers are tagged with a secret tag, and the CPU will check that the tag is there, otherwise the pointer cannot be dereferenced. So you can't just take random memory and say, this is now a pointer. There's branch target identification, where, at the instruction set level, you can say: here's a branch, but it's only allowed to jump to this address or that address, something like this. And there's a mode called Privileged Access Never, where user space can actually forbid the kernel from accessing a page. So while you're holding your key material, for example, in memory, you can mark it that way, and then nobody else has access to it. So that's great. There's crypto support, using crypto instructions if the CPU has them, and a lot of new systems. And there are three things that might bite you if you upgrade, so I want to mention them specifically. One: SSL root certificates are now in the base system. Before, you had to install a package called mozilla-rootcerts, and that was always annoying. Nowadays, when you install, SSL will just work, certificate validation will just work. It's nice. Two: entropy management, meaning that if you don't have entropy, if you don't have randomness, the kernel will not give you random data. In practice, what that means is, mainly if you're running in a VM and you don't have an entropy source at all, things might hang when they ask for random data, and that's not great. So there's special support for this in the config files. There's also a virtio-rng driver, where the host of your VM can hand randomness to the VMs; if you enable that, and you may have to enable it in your QEMU config or whatever you're using, then this is better. And finally, there's a new feature around POSIX ACLs and extended attributes, which are attributes on files.
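On the entropy point: if your guest runs under QEMU, handing host randomness to the VM is one extra device on the command line. The sketch below is illustrative, not from the talk; the disk image name and memory size are assumptions.

```
# Illustrative QEMU invocation exposing a virtio RNG device to the guest
qemu-system-x86_64 \
    -m 2048 \
    -drive file=netbsd10.img,if=virtio \
    -device virtio-rng-pci \
    -nographic
```

With `-device virtio-rng-pci` on the host side and the guest's virtio-rng driver enabled, the VM has a usable entropy source, so the hangs described above should not occur.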
POSIX ACLs and extended attributes need a new file system, or rather a variant of the existing file system. By default, NetBSD has FFS version 2, the fast file system, and there's now a variant that you can choose when you install, called FFSv2ea, which has extended attributes. If you do that, you should be aware that older NetBSD versions cannot read it. Also, if you installed current at some point and used that, there was a flag day where the format changed, so there's a complicated procedure to not lose your data. That, by the way, is one reason why NetBSD 10 is late: there was this file system problem. But FFSv2 with extended attributes is not the default, I think. If you need POSIX ACLs, you can choose it during installation, but the default is to leave it disabled. So now that I've gone through this laundry list of things, some of you might be bored, so I'm going to ask: what are you going to run it on? Maybe you say, you have convinced me, I want to try out this NetBSD 10 thing, what am I going to run it on? Here I have a little gallery of a couple of the ARM SBCs that are now supported in the new release that weren't there before. The Raspberry Pi 4 is very nice; I'm using one myself. The 5 is, I think, in current; the 3, the 2, et cetera, are all also there. Then you have the ODROID N2+, the Quartz64, the ASUS Tinker Board, the HummingBoard Pulse, various things. They all have their own specialties, different SoCs. The Orange Pi 5 I like a lot, because it has a pretty powerful processor and you can get it with up to 32 gigs of RAM. You see that slot at the top left? It's an NVMe slot; you can add an SSD there. So for less than 200 euros, you can have a really powerful ARM-based workstation with an Orange Pi 5. Highly recommended. If you're using NetBSD on ARM, it used to be annoying with bootloaders. ARM bootloaders are a bit weird in many ways.
This has become a lot better for the Orange Pi, the Raspberry Pi and a couple of others, thanks to this thing called TianoCore EDK2. You download it for your model and just unpack it onto your storage medium, whether it's an SSD or a memory card or whatever, and then it acts like a BIOS, basically. You get a UEFI shell, and then you can just use a generic NetBSD ARM image, or some other OS. So that's become very convenient. ARM SBCs are awesome. There's still the Pinebook Pro. It's a laptop that is, I think, more than $199 now, $229, but still very cheap. It has a nice ARM processor and display, and it's perfectly usable. NetBSD runs really, really well on it. This slide here talks about current, but now it's in 10: you have display backlight control, NVMe, USB. It all works. It's nice. So if you want a NetBSD laptop, maybe that's the best choice. Or how about this thing? I don't know if I can get the video to work. I feel like a boomer. Can I get the video to work? Where's my mouse pointer? So, this here is a Nintendo Wii. And then this happens. It turns out, I didn't know that, the Nintendo Wii has a PowerPC processor. So this runs the NetBSD evbppc port, EVB as in evaluation board; the Wii is treated like an SBC with a PowerPC processor. And there you are. This is pretty new; this support got added a few weeks ago. I'm not sure if it's present in 10, but it's there. Here's NetBSD. Did you ever want to run Postfix on your Wii? There you go. You can also run in the cloud. NetBSD is pretty much at home in the cloud. It runs on all sorts of virtualization things, and it itself includes several virtualizers. So I've talked about Xen. There's NVMM, NetBSD's own hypervisor, which integrates with QEMU. There was also HAXM; I'm not sure if it's still there, actually. But anyway, there are various acceleration technologies that work with QEMU. You can run it under bhyve if you have a FreeBSD host.
You can run it under KVM if you have a Linux host. You can run it on AWS; we have community AMIs available, AWS machine images for both Intel and ARM. The ARM ones in particular are nice; they're very cost-effective. There's also a project to build Google Cloud machine images with NetBSD, and images for Vultr, Oracle Cloud, many others. So that's also an option. Now that you have a machine, what are you going to run on it? This is where I slightly switch gears and talk about packages. This is my hobby horse, because I mostly work on packages. So NetBSD comes with a package collection called pkgsrc, for "package source". And it's actually not NetBSD-only: you can run pkgsrc on 18 or 20 different OSes. You have 35,000 packages, although not all of them will build on all OSes. Once a quarter, we do a stable branch with binary packages; in fact, the last two stable branches have offered binary packages for 10. So I think that's very nice: you can install NetBSD 10 somewhere, install the package manager, and get going immediately. If you're on a platform that does not have binary packages, you build from source. That's very easy: you just do "make package-install" in the right directory, and it downloads all the dependencies and builds them in order. If you're doing Firefox or LibreOffice, you may have to be a bit patient; on the Pinebook Pro, it's more of an overnight run. But yeah, it works. And you can also update from sources when a new branch comes out. If you're using pkgin, it's even easier: change the path to the binary packages, do "pkgin upgrade", and it does the thing. These 35,000 packages include all the stuff you expect, like nginx and Apache and whatnot, but also GNOME and, as I said, LibreOffice, Firefox, Thunderbird. They're all there. It's pretty complete.
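To make the pkgsrc workflow above concrete, a from-source build and a binary install/upgrade look roughly like this. The package names are just examples, and the repository path is whatever your platform's binary package mirror is:

```shell
# From source: go to the package's directory in the pkgsrc tree,
# then build and install it together with all its dependencies.
cd /usr/pkgsrc/www/nginx
make package-install

# With binary packages via pkgin: refresh the repository index,
# then install or upgrade.
pkgin update
pkgin install firefox
pkgin upgrade    # e.g. after pointing the repo URL at a new stable branch
```

These are illustrative session commands for a NetBSD machine with pkgsrc checked out under /usr/pkgsrc, not something to run verbatim elsewhere.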
And also, if you find something that you would like to be available as a package but it's not there, you can relatively easily get access to the wip (work in progress) repository and start creating your own package in there. The barrier is pretty low, and it's sort of our gateway drug, I guess, to becoming a full-time contributor. This wip thing is, I think, a superb idea, because it gives you a well-lit path to becoming a developer, at least in the packages space, just by starting to tinker there. And it comes without warranty: the wip packages may be completely broken, so as a user you don't quite know what you're getting. Maybe you want to stick to the main repo, but if you're tinkering, it's great. And one last thing that I want to mention: you can install a pkgsrc tree in an arbitrary prefix, and you can do it without root privileges. So if you have a machine, even a Linux or macOS machine, where you don't have root for some reason, but you want to run some software that is not on there, you can bootstrap pkgsrc into some tree and use that. And you can also use it the same way that Python developers would use virtualenv: you can have a tree that has your development environment in it, and then just copy it back and forth and use it on any machine you're working on. So that's also a very, very powerful thing. It's not just for Python, it's for all software. So I think it's great. So to conclude: we've spent, not me personally, but the NetBSD project has spent 30 years on NetBSD. It's still going strong. We have a new release, almost out; release candidate 3 is the current one. You should go check it out. Thank you very much. We're open for questions. Any questions? Yeah. Hello. Okay. So we talked about how NetBSD can run everywhere, so maybe a question here: what is the current state regarding RISC-V 64-bit integration? RISC-V, is that the question? Yes, exactly. Okay.
What's the current state of RISC-V? Not quite there yet, I think. There is a NetBSD RISC-V port; I've seen videos of it booting, but it's not as good as the ARM port yet, I think. Do we have any more questions? Yes. Hello, and thank you for your presentation. Is there something that can be provided to students, Bachelor of Science computer science students, to, let's say, get them more involved in the project, like easy hacks and stuff like that? To tell them: hey, do this and you can learn something. You know an OS, but this is an OS that you can start hacking on. Thank you. Okay, the question was about low-barrier contribution opportunities for students, like what you can do to get into hacking and so on. I think with pkgsrc-wip we've done a really good thing around packages. I'm not so sure there is something similar for NetBSD itself. We have Google Summer of Code, so you can become a GSoC student and do a programming project there. If I remember correctly, on NetBSD.org there's a list of projects where you could even get funding outside of GSoC; somebody has put bounties on things, but I'm not sure if they're beginner-friendly. And there's also a list of possible projects. But maybe we could do more there. Thanks for the suggestion. Hi, good talk. So I read a blog post, I think it was last year, about the status of Wayland in NetBSD, where there were sort of early attempts to get it running. Are we ever expecting Wayland to land properly in NetBSD, or is it going to be eight years, 10 years, the heat death of the universe? It's a good question. The question is about the state of Wayland support. There is some Wayland: if I remember correctly, you can run Sway as a Wayland compositor, and various applications like Firefox are built with Wayland support by default. But I think the vast majority of NetBSD users are still on X11.
I'm not sure what it would take to change that, other than more manpower. Wayland on NetBSD is more or less a one-woman show at the moment. So if you're interested in that, please contact us and contribute. Maybe to add one thing that makes it hard: Wayland has a lot of Linux-isms. For instance, the Wayland input API just takes the Linux input.h and wraps it, and that's it. So you get to reimplement parts of Linux APIs, which is annoying. More questions? Coming back to the question of whether NetBSD is suitable for student projects: I'd like to talk with the person who asked that question — I have more information. Hi. So I've been a Linux user for a while, and I'm interested in the BSD world. Why would I want to try NetBSD instead of FreeBSD, say, on my laptop? That's a difficult question. Why would you want to try NetBSD and not FreeBSD? FreeBSD has many of the things that I mentioned for NetBSD as well. FreeBSD has a much bigger community, I think more contributors, and so on. What's in it for you personally? I think for some, depending on what the hardware is, the support on NetBSD or FreeBSD might be better. Again, if you want to get involved in the community, the FreeBSD community is bigger, but the NetBSD community might be more accommodating. I don't know. I'm struggling a bit with an answer there. I think they're both good. You could try both and see which fits. Questions? Anyone? Who was it? For the last few years we have seen Linux take over many things — things we take for granted about how Linux systems in general work. There have been a lot of efforts to reimplement that functionality in the BSDs. I was wondering, is there any organized effort to counter that? Not all of these interfaces are the best designs. There's a lot of room for improvement there, but I have seen the BSDs lag behind.
There's an engagement issue when it comes to companies: it's a lot easier for people to justify contributing to Linux versus contributing to the BSDs just based on license, even though that's a bad argument anyway, because projects only get funded to a point, and making things open source of course lowers the cost for the companies. Is there anything going on to improve these situations? I think there were several questions. You said Linux has taken over mindshare: if you're at a company, it is easier to justify contributing to Linux than to BSD on impact arguments, which is true, I think. There was also the aspect of Linux folks re-implementing more and more classical things, with systemd and NetworkManager and all these things, or getting rid of the ifconfig command. I think the BSD community in general just doesn't follow along with these things. There's no systemd. There's no iproute2 or whatever it's called. The route command still works the same way it worked in 1982. As for the impact argument: I think if you're looking for impact in that sense, then, let's say, a contribution to FreeBSD might be more justifiable than to NetBSD. In a company it depends on what you use. If you build your service on NetBSD, then contributing to NetBSD makes a lot of sense. If you're building your service on Linux, then it might not. For example, Netflix: their highest-throughput streaming servers are on FreeBSD, not Linux, because FreeBSD achieves a higher throughput with the same hardware. It all depends. Maybe one other aspect, speaking for myself personally: my company does not use NetBSD in production, but I work on it precisely because it is not like my day job. So I guess you partially touched the answer to my question. I see that there is good support for Xen. So do you have any insights about applications and workloads which people run on top of NetBSD under those hypervisors?
The question was, I think, what applications are people running on NetBSD on Xen? Anything, really. You can get Xen hosting from a lot of places, so you get a virtual server that's a paravirtualized Xen VM. AWS used to be like that — I don't think they offer that anymore. So some of it is just cheap shell access for somebody: you have a home on the internet somewhere that you can SSH into. I've used it for web servers, mail servers, file servers, all sorts of things. Hi, thank you. I have two questions, if that's okay — two related questions. The first one is: the community looks really nice and welcoming, but do you think it's maybe too small, with a bus factor of one? And the second question: how reactive is the community at incorporating new changes? Okay, two questions. One of them is about the size of the community. I think it would be great if it were bigger. Honestly, NetBSD is, at least in some ways, I think, kind of minimally staffed, in a sense. We could use more hands on many things. Although it's not dead or in decline, which is good: I regularly see new people coming in. Maybe ten years ago there were a lot of really old-school folks who had been around since the beginning. There's been sort of a generational change, as is bound to happen after 20 years of running something. There are a lot of younger folks in here, which is nice. More people would be welcome, of course. What was the second part of the question? How reactive is the community? It depends, unfortunately. Sometimes very fast, sometimes very slow. I don't know — there's no SLA or anything. As Ryan said in the previous talk, if you have people from the community working on something together, they're going to scratch their own itch. If you submit a patch and somebody sees it and thinks, oh, this looks interesting, they might react to it immediately. Sometimes you might have to ask on IRC or ping the thread again on the mailing list to get somebody to react.
A lot of people are busy with their day jobs, unfortunately. Many, many thanks to all of you for attending, and for your questions. Many thanks, Benny.
Reproducible Builds: The First Ten Years
Our next talk is about reproducible builds — the first ten years — and I'll let Holger explain. Hello everybody. So, I'm not Lunar, I'm Holger — but that was the intro slide of Lunar's talk, ten years and two days ago. I'm based in Hamburg, I've worked on Debian for many years, and I've also been to FOSDEM for many years. Ten years ago we did the first setup with video in all the rooms, and then I stopped doing video work and, voilà, switched to reproducible builds. We're working mostly on Debian, but I try to help make all software — all free software — reproducible, and it's a pretty complex topic, so ask me anything, anytime during the next two days, because there's a lot of information in this talk. So, reproducible builds. This is not my talk, but the talk of the people working on this — these are over 150 people, and it's not my knowledge. We have them listed in Git, so if you are missing there, you can add yourself to this repository and be here. So, about you: who knows a bit about reproducible builds, why and how? Yay, that is awesome, I can go home now, thank you. Who has contributed to reproducible builds? Whee, thank you. Who knows that reproducible builds have been around for more than 30 years? Ten years? Because 30 years is really old — not as old as NetBSD, or maybe it is. Who knows about SBOMs? The industry. So an SBOM is a software bill of materials, and basically our .buildinfo files from 2014 already contain all of this: they describe the build environment, what's in there. It's the same concept, more or less. And we need you: reproducible builds are nothing that five or ten or even fifty people can do. It needs to be done in every software project, and it needs continuous testing that the software is reproducible. But I think we can do it, and there's still a lot of work to do. So, a short introduction. The problem: source code is freely available, which is not a problem — but most people use binaries, and that is a problem.
And no one really knows if a binary really comes from the source code. I'll get to that in a moment. And as a result, there are various supply chain attacks. So, a long time ago — more than ten years ago — there was a thread on a Debian mailing list in 2007, and the problem was known, but it seemed undoable. This email: "it would be really cool if Debian policy required that packages could be rebuilt bit-identical from source" — and then somebody replied "I see no benefit". Someone else replied "I for one think this is technically infeasible, but hey, I'm happy to be proved wrong." I'm happy to prove them wrong. So that was ten years ago, but the idea had already appeared in the year 2000, in another thread. And then in 2017 we learned that GCC had been reproducible in the early 90s, on several architectures — and not only GCC, but also binutils and the whole GNU toolchain. But that got lost, that bit-rotted, and people forgot. So fast forward to last year: there was a mail on the WireGuard mailing list — the VPN app for Android — saying that the builds are now reproducible. The release is identical on their website, the Google Play store, and F-Droid. And that was great — and they didn't even tell us. Yay, so great. We're super happy. This logo, by the way, we developed in 2016, so it's also eight years old. It was a logo by design committee, and in the end I think it turned out nice anyhow. So, our definition — when is a build reproducible? A build is reproducible if, given the same source code, build environment and build instructions, any party can recreate bit-by-bit identical copies of all specified artifacts. Pretty simple. Yeah, it gets more complicated, because you need to have everything: what is the source code, what is the environment, and so on. So our mission is to enable anyone to independently verify that a given source produces bit-by-bit identical results. And with that, we are an important building block in making supply chains more secure. Nothing more, nothing less: reproducible builds are not more secure than others.
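The "bit-by-bit identical" check described here is, in the end, just a hash comparison of the outputs. A minimal sketch in Python (the function names and file paths are made up for illustration):

```python
import hashlib


def sha256sum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproduced(artifact_a, artifact_b):
    """Two independent builds reproduced each other iff their
    output artifacts hash identically, byte for byte."""
    return sha256sum(artifact_a) == sha256sum(artifact_b)
```

The hard part is everything before this comparison — pinning down the source, the environment, and the instructions so that two parties actually produce the same bytes.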
They're just built in a verifiable way, and insecure software still remains insecure. But with reproducible builds you can at least be sure that you are running exactly that insecure software, and nothing more. And this, again, is from Lunar's talk from ten years ago — it's pretty much the same definition that we have now. And by 2024, reproducible builds are widely understood. We have resources, documentation, scientific publications; there's lots of material online. And even the White House has talked about us: in 2021 they made a government statement which requires a software bill of materials for governmental software, and at the moment only recommends reproducible builds. But hey, the White House recommends our work. Yay. How did we get there? Money. Snowden. Why money? Bitcoin. Ten years ago — actually twelve years ago, in 2011 or 2012 — the Bitcoin software was made reproducible, because all bitcoins together were worth four billion; I think it's very much more now. They were afraid that if there were a compromised Bitcoin client which stole the bitcoins, the developers would be accused of having put in a back door, and they didn't want that. So they basically made reproducible builds. And Snowden is kind of obvious: the Tor people made Tor Browser reproducible in 2013, because they were afraid that they would get backdoored. And Tor Browser is Firefox — one of the biggest software projects in the world, a 50 or 70 or whatever megabyte binary at the time — and every bit was the same. And that was: wow. So how did we really get there? A lot of work by many people over many years. At DebConf13 in 2013 there was a BoF, a small last-minute workshop with 30 attendees, and that kicked off the Debian effort for reproducible builds. And then Lunar had another BoF at that conference, and the talk here. And we had the first patches for packages sent to Debian package maintainers.
We sorted the files and created .buildinfo files — .buildinfo files are where we store the environment, the sources, and the product, the output binaries. And with that we can reproduce them. And that was all done in 2014 already. And in September 2014 I started systematically building Debian packages twice: first a handful of packages, and then all 25,000 at that time. And Mike Perry and Seth Schoen gave a presentation at CCC Congress in December 2014 showing my graphs. It was really nice sitting in the audience and suddenly seeing this graph I had made. And I have some of their slides. So this is the presentation — 2014 again. And "I want to believe" — that is really the problem: trusting somebody that this binary comes from that source. Whether you need to believe me, or Microsoft, or your government, or Debian — having to believe somebody is not scientific at all. And "I am the developer, I know it was built on my machine, and I'm upstanding and careful" — yeah, but we developers are also excellent targets. I just spoke with somebody who produces the GnuPG binaries for Windows; several nation states would have an interest in compromising those. And they showed two very nice proofs of concept. So: the most secure computer in the world is one that is not networked — but most computers are on networks, have USB access, can be compromised 24/7, especially if one computer gives you access to hundreds of millions of other computers, or lots of money, or state secrets, or whatever else an attacker wants. So they made a small back door. They used a CVE against SSH, where a "greater than" should have been "greater or equal" — that's the difference, and it gives you root, an exploit. And in the binary, the difference is one bit: 0x7E versus 0x7C is the difference between whether somebody can get root on your computer or not. And it's very hard to detect. And then they made another thing.
They made a Linux kernel module which modifies file reads, so that when the compiler looks at the source code, it compiles different code than what you see when you look at it with an editor. And then they made a proof of concept for that. So these attacks are not only feasible, they are practical, and are probably being done. And this was the graph they showed — no, they must have shown an earlier graph, because this one is from late 2014. But anyway: the green part is the percentage of Debian packages that are reproducible, the orange ones are unreproducible, and the red ones fail to build. But more than half the packages were reproducible quite early on. So, 2015: Lunar and myself gave a talk here, and that was the first talk where we spread from Debian to all free software projects. I think it was quite nicely received, because since then we have had lots of collaboration. And Lunar gave a presentation at CCCamp presenting many problems — we found many, many problems — and common ways to work against them. And we had the first Reproducible Builds summit in Athens; I think we were 25 people from 15 projects or something. And we wrote the SOURCE_DATE_EPOCH spec, which I'll explain in a second, and diffoscope. So, first: the most common reason for unreproducibility is timestamps. People leave timestamps everywhere. And they leave more timestamps. Really, every documentation tool has timestamps; compilers have timestamps. And then there are build paths — also very annoying. And the rest — the rest is about 400 different kinds of issues or something, but it's mostly timestamps and build paths. And for build paths, dealing with that is very easy: you just rebuild in the same build path, and nowadays, with namespaces, that's trivial to do. And for timestamps we came up with SOURCE_DATE_EPOCH. Who knows about SOURCE_DATE_EPOCH? Woo-hoo.
Build timestamps are largely meaningless. SOURCE_DATE_EPOCH describes the time of the last modification of the source, in seconds since the Unix epoch — because that is consistent, that is deterministic, it doesn't change. And it means: that is when the software last changed. And when you know the build environment, you also know all the libraries you're using, so there's no use in recording a build timestamp. And that's supported by a lot of software today: if this environment variable is set, then it's used — and there are a hundred tools. GCC is using it, Pandoc is using it, whatever is using it. And the specification is also really stable: we modified it once, in 2017, and that's it. So it's been working, and it's available on the internet. Diffoscope. Who knows about diffoscope? You should all know about diffoscope. I've met lawyers who know diffoscope. Who uses diffoscope? Nothing against lawyers. Let me explain it. Diffoscope tries to get to the bottom of what makes files or directories different. It will recursively unpack archives of many kinds and transform various binary formats into more human-readable forms in order to compare them. So you have a tar archive, and in the tar archive is a PDF, and in the PDF is an image, and the image has a varying timestamp — diffoscope will show this. And it handles lots of file and filesystem formats, so basically anything: APK files, DEX files, all filesystems, GPG keybox databases, HTML, JPEG, whatever. It compares two objects. And this is also why lawyers like to use it: they compare two PDFs, and then they see which text changed. And there are other use cases, but we mostly use it to find out why something is unreproducible — not whether it's unreproducible; that's easy, because the hashes don't match. But really why. And it falls back on hexdump.
It does fuzzy matching and many things; it does disassembling. You can also go to try.diffoscope.org, upload two objects, and it will show you the difference. Not sure if you can read this, but basically: there's an archive at the top, the bits that are different, and then here you see the actual diff between the two versions. And because I compared two different versions, 5.06 and 5.07, the version numbers of course change. But you can look at many kinds of differences with diffoscope. If you haven't taken a look at diffoscope, do. So, in the last ten years we filed almost 4,000 bugs with patches, and 3,500 of them have already been merged. So there are only 300 left, and there are still more coming. And in general, in Debian, we found over 32,000 bugs — most are "fails to build from source" bugs, because we constantly rebuild Debian, but there are also many other things. Reproducible Builds is also a huge QA effort. And yeah, we have a Git repository where we categorize the issues and note: okay, this package is also affected by this. And Lunar's talk is also a good reference for this. And because it's much easier to describe what makes a package unreproducible, we have created the "unreproducible package" package, which shows many, many problems and how to fix them — because it's hard to show why a package is reproducible, but very easy to show what makes one unreproducible. And some of the unexpected benefits: lower development costs and increased development speed through less wasted developer time. Google is one of the main users of this benefit, because builds are way faster — of course, you can cache way more. And it's also good for software development, to see whether a change really affects just the part you thought it would affect. And for license compliance: you can only be sure it's free software if you know the binary comes from that source.
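As a toy illustration of the recursive idea behind diffoscope — this is not diffoscope itself, just a sketch — you can walk two directory trees and report the first file whose bytes differ:

```python
import filecmp
import os


def first_difference(dir_a, dir_b):
    """Return the relative path of the first file that differs between
    two directory trees, or None if they match byte for byte.
    (A real tool like diffoscope goes much further: it unpacks archives
    recursively and renders binary formats in human-readable form.)"""
    for root, _dirs, files in os.walk(dir_a):
        rel = os.path.relpath(root, dir_a)
        for name in sorted(files):
            path_a = os.path.join(root, name)
            path_b = os.path.join(dir_b, rel, name)
            # A missing counterpart is a difference too.
            if not os.path.exists(path_b):
                return os.path.join(rel, name)
            # shallow=False forces a full content comparison.
            if not filecmp.cmp(path_a, path_b, shallow=False):
                return os.path.join(rel, name)
    return None
```

Knowing *which* file differs is the easy part; the value of diffoscope is in showing *what* inside that file differs, down through nested formats.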
If you're running some binary, maybe it matches the license, maybe it doesn't — who knows. Yeah, and you get reproducible, verified SBOMs: SBOMs on their own are just a statement; with reproducible builds you have verified SBOMs. So we've held these summits over the years, mostly in Europe, and we're going to have a summit this year as well. There were always around 50 people, but many, many projects: all the BSDs, F-Droid, GitHub, Microsoft, Red Hat, Apache Maven, whatever — many, many. And there was another spin-off benefit: bootstrappable.org. It began as a breakout session at the Reproducible Builds summit in Berlin. Who knows about bootstrappable.org? Some people. So my understanding is: you have a roughly 500-byte binary seed which can build a very small assembler, which builds another small assembler, which builds a tiny C compiler, which builds an old GCC, which builds a modern GCC. So you bootstrap from sources only. And they bootstrap Guix, which is another Linux distribution. It's pretty amazing. They have their own talk here. And that exists just because there was an idea to do this and people really tried it. And completely bootstrapping from source had not been done since the 60s or 70s — we all just use binaries, building binaries, building binaries. For the summit this year: we don't know where, we don't know when. We need a location for 50 people, we need some sponsors to cover the costs, and we need you to make it happen. Please talk to me after the event if you have an idea where to hold it. In general, we have some funding: we have been a Software Freedom Conservancy project since 2018. The funding is for Chris Lamb, Mattia Rizzolo, Vagrant Cascadian and myself. It supports our continuous work: development, community work, developing software, designing processes, the summits. Thanks to all our sponsors. So, a short overview of various projects, which is mostly about Debian. These are the CI results for Debian trixie.
We are now in the 95% range and it's pretty boring — the graph has become very boring. This one is a bit nicer: these are the past Debian releases. Bookworm is the current release, bullseye is from two years ago, buster from four years ago. And you can see the number of unreproducible packages has gone down steadily, but there are still over 600 unreproducible packages — and I really want to get these to zero. That is still the goal. So these are the CI results for Debian unstable for the last ten years, and you can see here on the right, at the end, in 2023, we stopped varying the build path; that's why we went from 85 to 95%. Debian is constantly growing, and we are constantly getting more reproducible. So in 2017 Debian changed policy, and it now says packages should build reproducibly. It's not a "must"; it's just "that would be really nice". And of course I want to get to the "must", but that is not so easy. So, as a next step, I want "reproducible packages must not regress" — in 2025, because that will be after the next Debian release. And also: new packages must build reproducibly. I don't want any new packages which are not reproducible. It's been ten years; it's over. And finally, in, whatever, 2027 — one or two more releases — I think all packages must build reproducibly to be allowed into testing. Unstable and experimental can stay as they are; you can experiment there. And really, 100% is the goal. And 100% reproducible is a political decision, not a technical one, because we can always say: okay, you're out. And that is political, not technical. So we need to change policy. And we can work around "must" offenders using whitelists in the beginning — at the moment GRUB and the Linux kernel are not reproducible, and I guess most people want to use them. The goal is still 100%; whitelists are just a way to eventually achieve that goal, because with them we can kick the others out. And then: Debian testing migration.
I'm not sure how many of you know the Debian workflow: packages get uploaded to unstable, then they migrate to testing, and eventually testing is declared stable. For this migration to testing, various penalties can be introduced. And since three months or so, it now shows whether a package is unreproducible — there's no penalty or anything yet, but I think for the next release there should be penalties for violating "must not regress" and "new packages must be reproducible". The framework is now there; it's just not activated yet. And the whitelisting part I already mentioned. So, and this is a bit of a segue to the next part: what I showed you before, those graphs, are just about continuous integration, where we build a package twice with maximum variation to see why a package is not reproducible. But what we really want is: Debian builds a package once, and we rebuild it to see if we can reproduce it. We don't want to find differences; we want to find the same thing. For this we have made another rebuilder, and it was already working two years ago and showed good results — but then it stopped working, because we need a working snapshot.debian.org service. Snapshot has all the Debian binaries ever released, and without snapshot we cannot recreate the same build environment. But snapshot is buggy, and has been buggy for five years. So this is broken. Sad. And fixing snapshot — it's 150 terabytes of data, with four pushes per day, gaining 70 gigabytes of data every day. One project is to fix it, so I got access to fix it. Yay. We need something soon. We still need to fix snapshot, so if somebody wants big data, we need this — please talk to me. But at the summit last year we also realized: we don't need the whole snapshot. We only want to reproduce 70,000 binary packages, and they depend on 30,000 packages. So 40,000 packages are never used as build dependencies.
And then you look at the .buildinfo files — because the .buildinfo file describes the environment — and those 30,000 packages are only used in 100,000 variations. So we only need 100,000 packages. That's just 100 gigabytes per architecture and suite, so that's nothing — it's just two terabytes or something; it fits on my laptop. So rebuilder-snapshot was born. It's a cache for snapshot.debian.org which only stores the packages used as build dependencies today. And when a new version arrives, the old build dependencies are not needed anymore. Because snapshot has its problems, seeding this still takes a week — but we've done that once, and then each architecture only takes hours to seed from another instance. We already run two instances of rebuilder-snapshot, and our goal is to allow many instances, so you can just have your snapshot cache in your institution and use it. And this is needed because debrebuild, which is used for rebuilding Debian packages, uses rebuilder-snapshot together with metasnap — because the packages alone don't carry the trust information, and metasnap has the metadata from snapshot.debian.org, so you have a trust path there. But because there are only five minutes left, I'll skip those details. And rebuilder-snapshot only has one open issue at the moment — we only started it in early December, and then there was Christmas and Congress and whatever, so we haven't fixed that issue yet. Lynxis and Josch have really helped with this. I've done some work, but the coding is mostly theirs; I've done the design work. I hope to have this working in a month, or at least two — I don't really care when; we've waited five years, so, yay. And so, the outlook: testing migration can enforce the policy, but we need real rebuilders, because we don't want to base testing migration just on CI results. And for the rebuilders we need a working snapshot, and we will keep the CI builds to still find the issues.
I can give a very short overview now, in the last three minutes, of other projects. Tails is easy: Tails was the first project that was reproducible — you can rebuild the Tails ISO and be sure it's the same ISO. Arch Linux has rebuilderd, a binary snapshot, and an active community; they really rock — they know more about this than I do. SUSE is one person, but maybe that person will be allowed to work on reproducible SUSE on company time this year, so I'm looking forward to SUSE in this regard. NixOS and Guix are reproducible by design, but they also still carry the unreproducible Linux software on top. Yocto has support for reproducible images, and F-Droid also has reproducible packages in its repository. Alpine has basic support. For the BSDs: the base system is reproducible, for all the BSDs — though we never tested OpenBSD, and we never tested the ports. Fedora, Red Hat, Ubuntu are not interested, it seems — but that is not really true anymore: Fedora has enabled a macro so that SOURCE_DATE_EPOCH is now used when building packages, so Fedora could have this easily. Ubuntu — I would really love a Debian fork which is reproducible: take the Debian sources, throw away the Debian build processes, do them anew, and make a Debian binary fork which is reproducible. So many projects support reproducible builds in one way or another, but it's mostly in theory, so it's unclear what that means and how users benefit. Tails is easy: Tails has one ISO and one checksum; you can recreate it. With Debian, with 60,000 packages — how do you verify them? That's still open. And this is a massive success: it was thought impossible ten years ago. This is, again, a ten-year-old slide: in theory, we are done. In practice, we have shown that reproducible builds can be done — in theory. We need rebuilders, we need to store the results, we need defined criteria for how tools should treat the data, and then we need to actually use these tools.
Because if you have several rebuilders — which you basically want — what do you do when the results don't match? And yeah, those last 5% — or maybe it's 2% now — we still need to fix that software. And we need project-level consensus and commitment to keep reproducible builds working in practice. Thank you. Thank you for the talk — I learned a lot, actually. I had vaguely heard about it, but this was a very good introduction. Does anyone have any questions? Hello. Most of the graphs you were showing were for amd64, I think. Do you have any information about the other architectures? Is there any difference between amd64 and arm64, for example? We also have graphs for arm64, i386 and armhf, and I just got an offer of riscv64 hardware. So when we do these rebuilders, we want to cover every Debian architecture. But, yeah, we'll get there. Any more questions? Actually, I have several questions. How about source releases? Because this is a problem with many projects: they don't have reproducible source archives. If you want to rebuild an old release, unless the original tarball is still around, you cannot easily recreate it from the repository. The releases of the source code, or the source release? The releases of the source code — for many projects it's still hard. Is there any effort around that? It depends mostly on the tools — whether gzip, or GitHub when it creates an archive, is reproducible. We have basically decided not to look at this, because if you do a release once, you have version X, and that version X stays. Being able to reproduce it would recreate the same version; we've just decided it's out of scope. But it's worthwhile — do it. Another thing is variants. Some distributions support a lot of variants of packages, and those distributions are more interested in causal analysis: how we arrived at certain versions, and why we didn't arrive at others. Because this is what enables us to fix the things that prevented a reproducible build.
And if you have 100 variants multiplied by 1,000 packages, that's a problem. I would not say that the number of packages matters — if they are reproducible, you can have 100,000 packages. Is there any effort in causal analysis of why a package is not reproducible? I can hardly understand you, because it's so loud. Is there any effort to build tools to do causal analysis of why packages are unreproducible? Yes, we've built this — that is the framework to analyze it. First you check whether the packages are reproducible by just comparing the hash: build it twice, and if the hash is the same, it is the same. And if not, you can use diffoscope to analyze why. Bernhard Wiedemann from SUSE has also made some scripts to analyze the build logs, to do a statistical analysis on this. So there is work on this as well — there is endless work in this area. We should talk about snapshot, Debian, and so on. We should, Julian. Your definition assumes that we are given the same build environment. What are the best practices to deliver the same build environment over the long term — are there any best practices for that? Well, recreating the build environment is different from distribution to distribution, and it's often a challenge. Could you repeat that? Recreating the build environment is often difficult, and how to do it depends very much on the distribution. But it can be done. So what are the best practices — what about containerization or something like that, or VMs or whatever? VMs help with that, yes, but you also need to record the build environment while you build, and then have some tools to recreate it. Okay, so are there any tools right now for defining the build environment? There are several tools, yes. Any keywords for that? There's debrebuild for Debian, there's repro for Arch Linux, and — I don't know how OpenWrt does it — depending on the distro there are different ones. Great talk.
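Recording the environment while you build, as the answer above suggests, can be as simple as emitting a deterministic list of package versions alongside the artifact — a toy version of the idea behind Debian's .buildinfo files (the function name and field layout here are made up, not the real .buildinfo format):

```python
def write_build_environment(packages, out_path):
    """Write a deterministic record of the build environment:
    one 'name version' line per package, sorted by name so the
    record itself is reproducible regardless of input order."""
    lines = [f"{name} {version}" for name, version in sorted(packages.items())]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

A rebuilder can later read such a record and install exactly those versions (which is where an archive of old binaries, like snapshot.debian.org, becomes essential).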
I was really fascinated by you describing how we can now potentially bootstrap from source code, from, you know, 500 bytes, and then through ancient versions of GCC and upwards. I think one of the reasons that is fascinating is because it addresses Ken Thompson's attack that he described in "Reflections on Trusting Trust", which has been a major weakness in the security of the whole industry. When can I do this with Debian? When can I build Debian from source completely, like back in the sixties? For that you would need to talk to the bootstrappable builds people, to bootstrap Debian really. I think you can probably do it today if you do the work. The sources are there. Somebody just needs to do it. And about this trusting trust attack from Ken Thompson that you mentioned: there's now, from David Wheeler, diverse double-compilation, where you build two compilers twice with each other, and if they produce the same results then you can be more, or really, sure that this trusting trust issue is resolved. There was a nice paper last year, in October, for the 30th or 40th anniversary of "Trusting Trust". Any more questions? It's not really a question, it's more a comment. For people who come from TPMs and attestation: attestation always works on binaries, and the question you always have is, what is the source code corresponding to the binary? And the only process we know for getting that is actually reproducible builds. And so there are a lot of people in the security community actually trying to advocate for reproducible builds, just so we can prove to our customers that this binary hash corresponds to this source code. And the comment is just that this is actually a very important use case that is rising enormously in importance with the attestation requirements of confidential computing and the like, and that you can actually use it to plead for funding from guys like the NSA or other people, because this is suddenly becoming really, really important.
I just have a total beginner question. So I have source code checked out somewhere, and I want to have a reproducible build. How do I get there? I can set SOURCE_DATE_EPOCH, but that doesn't really match, because each of those files has a different date. So is there an easy way to do this, something that will analyze the source dates and set the right epoch, or how do I get there? It depends. But first you set SOURCE_DATE_EPOCH just to the last modification of the whole source. So if you have 10 source files, just the latest timestamp. And then you just build it, and you build it twice, and you compare. And with that, if you just build it twice now, you already catch some variations, like randomization, or hashes not sorting the same, but maybe you don't catch the issue with the timestamp. So you build it once today and once tomorrow. And then you compare: either they are the same, or if not, then you compare them with diffoscope to see where the difference is. And according to this, you do whatever is needed to remove the difference. Okay, then is there an easy tool to give me the right timestamp to set for SOURCE_DATE_EPOCH? Or when I have a makefile, I may even be able to say: analyze the dates and give me the highest one as SOURCE_DATE_EPOCH. There is no right timestamp. Often the right timestamp is no timestamp. And some people just set it to January 1st, 1970, or just drop it. Because if the timestamp is just there to be a timestamp, if it's really meaningless, you can just drop it. The other thing is, you replace the build timestamp with the timestamp of SOURCE_DATE_EPOCH. Then you still have a timestamp, but what's the timestamp worth? Just adding that one thing I have done several times in projects, for the purpose of getting the same source tree each time, is to add a make target or something like that to basically touch all files with a timestamp derived from a file.
And this file will be some sort of manifest. So you can check that against the files and also use it as a source for all the timestamps. It's not perfect, because sometimes there are timestamps inside files, and you need to manually add something to edit that, but it's a good starting point. Thank you for the talk. What are the main challenges to make the Linux kernel reproducible? At the moment, the main challenges are signatures on kernel modules. We came up with the solution that if you rebuild something and you get the same bits, then you can just reapply the signature, because the signature will match again, but for that you still need to have the signature. So it's not impossible, but it's busy work. And because the signing process is also part of the secure boot chain, and there are time requirements to get the signing in, that is more problematic than the technical challenges. Okay, and that's time. Thank you very much for the talk.
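The SOURCE_DATE_EPOCH workflow described in the exchange above (set it to the newest modification time in the source tree, then build twice and compare) can be sketched roughly like this. The helper names are hypothetical, not a standard tool:

```python
# Sketch of deriving SOURCE_DATE_EPOCH from a source tree, as suggested
# in the Q&A above: use the newest modification time of any source file.
# `latest_mtime` and `export_source_date_epoch` are hypothetical helpers.
import os


def latest_mtime(source_dir: str) -> int:
    """Newest mtime (seconds since the epoch) of any file under source_dir."""
    newest = 0
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            newest = max(newest, int(os.stat(os.path.join(root, name)).st_mtime))
    return newest


def export_source_date_epoch(source_dir: str) -> int:
    """Set SOURCE_DATE_EPOCH for child build processes and return it."""
    epoch = latest_mtime(source_dir)
    os.environ["SOURCE_DATE_EPOCH"] = str(epoch)
    return epoch
```

Build tools that honor SOURCE_DATE_EPOCH, per the Reproducible Builds specification, will then use this value in place of the current time when they embed a timestamp.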
An engineer's guide to Linux Kernel upgrades
Thank you everyone for coming to my talk. My name is Ignat. I work for Cloudflare. Who here has heard about Cloudflare? Who's using Cloudflare? Should be more hands, by the way, because even if you haven't heard about Cloudflare, you're probably using Cloudflare one way or another. This is my first time at FOSDEM. So thank you very much for exchanging your lunchtime for my talk. I hope it will be really exciting. And today we're going to talk about Linux kernel upgrades and how you should do them, and most likely how you should not do them. So a little bit about myself. I do Linux at Cloudflare. I enjoy system security and performance, and I'm passionate about low-level programming. So the Linux kernel, drivers, bootloaders, and other stuff, written in unsafe programming languages. Okay, before we start, a little show of hands. What would you do in this case? Imagine you're working away on your laptop. You're doing stuff. And suddenly this pop-up comes in: oh, updates available. What would you do? Install now? Who's for install now? Oh, nice. Well, who's for remind me later? 50-50. So those people who raised their hands for install now: what if instead it wasn't your computer but a production system? Who would press install now? No, very few. But yeah, you probably like Bitcoin, right? Risky. Yeah, and usually it's something like that for a production system, right? So it's a difficult choice between remind me later and don't remind me at all. Please don't install. And this is natural, I think. Because it's connected to how we perceive software updates, especially for production systems, right? Well, we don't perceive them as really good, right? We perceive software updates as these monsters: they come in, they're nasty, they're bugging you, an update can break your stuff. With the traditional engineering motto, if it works, don't touch it, why do we need to install an update, right? Yeah.
But the thing is, with regular software updates, we perceive them as monsters, but they're not really scary. They're kind of annoying and pesky, but not that much. When it comes to Linux kernel upgrades, however, it's mostly like this big monster trying to destroy the universe, right? And why is that? Again, it's natural, because, well, we know how to deal with regular software updates. You have a service, it crashes once a week in production, how do we fix it? Well, if you use something like systemd, you just set a policy for it to restart, and yeah, job done. You can go home. Well, yeah, you'll be restarting a service once a week, your service will be in a slightly degraded state, but you'll buy yourself some time to investigate and fix it later. When the Linux kernel crashes, however... Well, technically, this is you, right? It's the end of the world, because you don't have any systemd to restart it. You don't have any metrics or understanding of why it happened. Your service is not reachable. No SSH to debug, nothing. It's indeed the end of the universe. And that's why we're usually scared of software updates, but when it comes to Linux kernel updates, we're scared even more. And this is why people avoid updating their Linux kernel for the most part, especially in production systems. But there are common risks if you don't apply software updates regularly, especially for the Linux kernel. The first one of them is: your bugs are not getting fixed. And here's some statistics. I will be talking about the Linux kernel release cycle a little later to introduce you. This is basically a snapshot of all bug fix releases of the stable kernel branch 6.1. The latest Linux LTS kernel is 6.6, but because it doesn't have as many releases yet, you don't get pretty graphs, so I decided to go with the previous one, 6.1.
And what this graph shows you is the number of commits per bug fix release on the 6.1 stable kernel. Again, I'll be talking about release types later in this talk, but at this point you should know that these bug fix releases happen roughly every week. And these bug fix releases are what the name says: they're only bug fixes. There are no new features, no subsystem rewrites, just bug and security fixes. And as you can see, so far the 6.1 stable kernel has had 76 releases, and out of those 76 releases, there are 50 releases with more than 100 commits in them. So it means 100 bug fixes every week, in almost every release really, like 80% or something, right, if I'm doing the math right. 20 releases, so 25-ish percent, every fourth release, so roughly every month, have more than 200 commits, and so maybe 200 potential bug fixes. And there are these five mega releases with more than 500 commits in them. Actually, if you look at the graph, it's seven, but the last two barely made it to 500. But yeah, these are mega releases with a lot of commits. So if you don't upgrade your kernel regularly, your system runs with all these potential bugs, and every week you delay, you're missing out on at least 100 bug fixes in your kernel. Second, what you'll be missing out on is potential performance improvements. This is a snapshot from Cloudflare production systems: we were using the 5.4 stable kernel at the time, and we started to evaluate the 5.10 kernel. And so we did a half-and-half deployment to a set of identical servers, one half with 5.4, one with 5.10. And this graph shows the average memory consumption per server, and you can see that on 5.10 we have much less memory consumption. And people were like, what did we break? What happened? And nothing bad happened, actually. So that was 5.4 versus 5.10.
So we saved something around 5 gigs of RAM per server. And at first we thought something broke, but when you dig into the mailing list later, you see that some other folks, in this case Facebook, now Meta, these nice people did some improvements in the kernel code and improved the memory management subsystem. And now you are consuming less memory for the same workload, with the same performance, right? So it's almost like downloading RAM from the Internet. And you basically get it for free if you just apply an update; it's open source, right? And in recent news, for example, the latest LTS kernel is 6.6, and it's rumored to have a new scheduler in it. And there is a Phoronix article that says that if you're using Nginx, with that scheduler it will be much, much more performant. So you'll potentially get that for free as well if you move to 6.6. I don't have any pretty graphs, because it didn't work better for us, but maybe for you it will. Yeah. And, looking a little bit forward to the next talk after mine, there will be some discussion, I hope, regarding some security improvements with TPMs and the Linux kernel, and it will probably involve some code, and you only get it if you upgrade. So let's look at the same data, but from the point of view of accumulating change delta. This is basically the same data, number of commits per release, but accumulated: it shows the number of commits since the initial release, right? And in this graph, you can easily calculate the change delta. For example, if you're on the 6.1.10 bug fix release and you want to upgrade to 6.1.20, the change delta is 1,762 commits, right?
And basically, if you assume, which would be natural, that the number of changes is proportional to risk, then these are 1,762 bug fixes you're missing, so the amount of risk you're taking by not upgrading is proportional to that number. Now let's say you wanted to upgrade, but for some reason you decided to delay. Maybe it's the end of the quarter, you had a big incident, your company got a big contract, so you decided not to change anything, to be more stable for the time being, and you postponed the upgrade. When you actually decide to upgrade now, you're upgrading from 6.1.10 to 6.1.30, which means you just extended your not-upgrading time twice. And you might naturally think that your risk grew 2x, but if you calculate the difference here, you may see that in some cases, with a 2x postponement, 2x time not upgrading, your risk can actually grow higher: here your risk grew 2.21x. So the risk of not upgrading systems and delaying may grow faster than the time you're not upgrading. So yeah, for a 2x delay of not upgrading, we get 2.21x more risk of hitting a bug. And if you're not upgrading, security vulnerabilities are not getting patched. This is a similar graph, but it now shows only publicly known CVEs patched per bug fix release. This data is actually crowdsourced, so it might be incomplete, but even from this you can see that out of the 71 releases for which data is available right now, 56 releases, again almost 80%, have at least one CVE patched. And there are 18 releases, again 25-ish percent, with more than five CVEs patched. So again, if you're not upgrading your kernel regularly, you're running not only with security vulnerabilities, you're running with known, publicly known security vulnerabilities, for which most likely an exploit is available somewhere on the internet.
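The change-delta arithmetic from the paragraphs above can be sketched with a few lines of Python. The cumulative commit counts below are illustrative assumptions, not real 6.1.x data, chosen only so that the two figures the talk quotes (1,762 commits between 6.1.10 and 6.1.20, and roughly 2.21x growth for a doubled delay) fall out:

```python
# Sketch of the change-delta arithmetic above: given cumulative commit
# counts per bug-fix release, the risk proxy for delaying an upgrade is
# the number of commits you skip. The cumulative figures are hypothetical,
# picked to match the 1,762-commit delta quoted in the talk.
cumulative_commits = {10: 1500, 20: 3262, 30: 5394}  # release level -> commits


def change_delta(frm: int, to: int) -> int:
    """Commits accumulated between two bug-fix releases of one stable series."""
    return cumulative_commits[to] - cumulative_commits[frm]


# Waiting twice as long does not just double the delta:
short_wait = change_delta(10, 20)   # 1,762 commits in this example
long_wait = change_delta(10, 30)
growth = long_wait / short_wait     # more than 2x, about 2.21x here
```

The point of the sketch is simply that commits do not accrue uniformly, so the skipped-commit count can grow faster than the delay itself.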
Not patching your security vulnerabilities also puts your compliance at risk. If your production systems are subject to some sort of compliance, you have a required time within which you should be patching these vulnerabilities. For example, PCI DSS compliance, as for most payment systems and such, says that critical or high security patches or updates should be applied within one month of release. So imagine there is a publicly known security vulnerability in the Linux kernel, and you have one month to fully roll the fix out to your production systems. Who here knows about Equifax? What happened to it? A few hands. It wasn't about the Linux kernel, but Equifax was running an old version of an Apache server, unpatched against known security vulnerabilities, and people used an exploit on their system and exfiltrated some data. And it was a big mess. It was really expensive for the company. It cost its reputation as well as a lot of money, compensation, a lot of lawsuits. Very, very, very bad. Which brings us to a not-so-fun fact. You remember in the old days, when you would go to admin forums in the 2000s, and people were boasting about how stable their servers were, posting their uptime. Like, my uptime is two years, three years. Well, since patching the Linux kernel requires a reboot, that's not cool anymore. If your uptime is more than 30 days, you're most likely vulnerable, and not compliant with something. So now let's talk about anti-patterns for Linux kernel releases. If you're managing a production system, for most software updates there is some kind of change management process, or well-understood practices, which sysadmins, SREs and engineers usually apply to manage change. But most of them unfortunately do not apply to the Linux kernel. So when you want to update your production system, oftentimes for a software update the change management process will ask you why.
Why do you want to update, and which things from the changelog of this new version are applicable to us? Are we really fixing bugs that are hitting us? Are we really fixing CVEs that are applicable to us? Well, that doesn't apply here, just because of this graph, right? Remember, these bug fix releases happen every week, and most of the releases have more than 100 commits, so it would mean that every week you should be going through all the commits and trying to understand if each particular fix is actually applicable to your system. This is very expensive. You need a huge team of really good Linux kernel experts to understand if, you know, this off-by-one thing in the memory management subsystem is actually triggerable by your workload. So if you do go this way, mostly you'll be doing something like this: you will just be continuously stamping releases, for no particular reason, with no real analysis. The same goes for security vulnerabilities. You say, we have five CVEs we need to patch due to compliance, and then somebody may ask the question: is the security vulnerability actually exploitable in our systems? Do we use that subsystem? Sometimes it's an easy answer; if it's in a driver for I2C and you're on a machine which doesn't have I2C, then you can say no. But most of the time it's much harder, and many successful exploits are not some kind of high-severity big vulnerability. Sometimes attackers manage to chain smaller vulnerabilities together to get an exploit. So going back to this question, if you really think about it, who can answer it? Technically this question can be answered by the attacker, because if the attacker has the list of the CVEs present in the system, they're highly motivated to break into the system, and this is their bread and butter. They spend like 24/7 designing and implementing successful exploits.
But unfortunately you're not asking this question of the attacker; you don't know who they are, right? You're asking a security patch reviewer. You go to some team of security people and ask, is this vulnerability applicable? And they're highly motivated to go home on time, right? And they need to review several patches a day, not only from the Linux kernel but from many other subsystems, and do other stuff, like security architecture, compliance, many things. So you're asking this person, and the quality of that answer will not be great. They will say, yeah, maybe yes, maybe no. So the best course of action is just not to ask this question, and to assume that every CVE is applicable to your workload, and patch it. Well, one of the traditional approaches to upgrading stuff, especially the Linux kernel, is soaking. Let's put it in a canary somewhere and soak it for one month to ensure we don't hit anything. Yeah, but then you come back to this: by soaking it in a subset of your production, you're not releasing it elsewhere, so you start accumulating change delta, and therefore your risk of not upgrading and hitting a potential bug grows. Same with security vulnerabilities: if you're soaking it somewhere, you're not patching CVEs in the rest of your production, and you have the risk of being hacked. And with a one-month soak time in a canary, you're probably already violating some compliance regime which dictates you have 30 days to roll out everywhere. But what does a high soak time mean in practice? It usually means we just don't know what we're looking for. What it translates to is: we don't have any success metrics or observability for how our kernel performs, whether it performs the same way after the upgrade as it was performing before. We also don't know our workload. My team gets the same question from many teams, right?
Will the kernel break my software? But for every team, the subsystem of interest is different. A database team is mostly focused on I/O and file system performance, but an image processing team mostly cares about CPU scheduling and CPU performance. The question should be: I'm interested in this particular subsystem; will it break my I/O-bound workload, or my CPU-bound workload, or I'm interested in some hardware, or networking, and so on. And it probably indicates a lack of sufficient production kernel testing. With the Linux kernel, you can also ensure that an update doesn't break someone's workload by writing a dedicated unit or integration test. The Linux kernel has this nice suite called kselftest, which is easily extendable. If you care about a particular feature in the Linux kernel, or a particular behavior, you can easily write a program which exercises that behavior and verifies that each upgrade keeps that behavior. Even though the kernel itself is written in C, you can write these tests in any programming language, and even as scripts. Sometimes you just get: yeah, whatever, the kernel is just too critical, let's have more approvals before we deploy. Regular software requires one approval, so the Linux kernel should require two or three approvals. And again, this is related to the fact that we perceive the kernel as a bad, scary monster which can destroy the universe. But what if I told you that kernel deploys are inherently safer than any other software? Would you believe me? Who believes? You're in the matrix, yes. We learned it the hard way, actually, at Cloudflare. So this is a map of Cloudflare data centers around the world. It's maybe even outdated, but the gist is, we have a lot of data centers around the world. And with regular software, how do the updates happen, from a 1,000-foot view?
Engineers update the software package and push it to our package registry. Then the config management picks it up and downloads the new package. The config management may also be configured to restart the service which uses the package. It can be graceful or non-graceful depending on the context; it doesn't matter. But the gist is: new code, bad or good, can propagate through this whole network without proper safeguards in minutes. And Cloudflare learned this the hard way. We had several bad outages where we didn't have proper safeguards for staged rollouts of some software, so we almost caused global network outages, and these are described in blog posts. On the contrary, how does a Linux kernel upgrade work? The gist is, it requires a reboot. So to reboot a server, what we do is drain traffic from the server, put it out of production, and actually reboot it. Then it comes up, it contacts our config management, and we wait for it to be reconfigured. We run some basic acceptance tests and put the server back into production. And we would be crazy if we rebooted everything at once, so we don't. We have automation rebooting servers one by one, or in batches. So what it means is, it's an inherently natural, slow-paced, gradual rollout with minimal impact if things go wrong. Did we release kernels with bugs? Yes. Some servers didn't come up properly, some servers started showing errors, but there were only a couple of such servers. So we reverted the release, and there was no visible impact. One reason why people are afraid of kernel releases is that they don't understand them, how the kernel release process works. Kernel versions are designated by three numbers separated by dots, for example 6.1.32. Who here knows about semantic versioning? Almost everyone. So the gist of this part of the talk: this is not a semantic versioning scheme.
Everyone confuses this with semantic versioning, and it's not. Instead, the first two numbers together mean the major version, not major and minor as in semantic versioning. And the rightmost number means bug and security fixes. When the rightmost number increments, you almost never get new features or major subsystem rewrites. So it's only bug fixes or security fixes, nothing else, no new functionality. So how are these releases created? The main bleeding-edge source code is stored in a git repository managed by this person. Who knows this person? We call him the benevolent dictator, right? So, yeah. The features are developed in subsystem branches. So for example, you have subsystems for drivers, memory management, and so on. And once in a while Linus pulls changes from these branches. This is probably where the term pull request came from; I don't know, don't quote me on that. But the original pull request was not like the fancy PRs that we have now; it was an email saying, hey Linus, can you pull from my branch? This was a pull request. And it still is, actually, in the Linux kernel. So Linus pulls all these changes from subsystem branches. And once in a while, he branches the main branch into stable branches, which designate a major stable kernel release. This happens roughly every nine to ten weeks. Eventually, when bug fixes have accumulated, you get a tagged version on a stable branch, which indicates a bug fix release. So for example, you get 6.2.1. But how do these bug fixes get there? If you have a bug, you do not submit a fix directly to a stable branch. Instead, you actually have to go through the respective subsystem maintainer, to ensure the bug is not only fixed in the stable branch, but in the main branch and all other branches. So you actually commit your bug fix to the particular subsystem where the bug is, and it will eventually get propagated to the main branch.
But once it's in the main branch, it's not just merged into the stable branch. These bug fix commits are specially marked, and the maintainers of the stable branches (the stable branches all have maintainers) basically cherry-pick these bug fixes. And when enough bug fixes have accumulated, they do another bug fix release, which happens roughly every week. So yeah, a new major stable kernel is released every nine to ten weeks, and there is a so-called merge window where new features get merged. There are usually only two weeks of merge window, and the remaining seven weeks are for testing and bug fixing. So even a major version receives a lot of bug fixing and testing in the first place. And what you have to remember is that the leftmost number means nothing. At Cloudflare we had this problem where, at some point, when we upgraded from 4.9 to 4.20, it was fine. But when we wanted to upgrade from 4.20 to 5.0, people were like, oh, the leftmost major version changes, it's probably really scary. No, it's not. It can even have fewer features than the previous major release. Linus himself says that he just increments the leftmost number when he runs out of fingers on his hands and toes. But for whatever reason, sometimes he increments when the middle number is 19, sometimes it's 21, and sometimes it's 20. So apparently he has a variable number of fingers. Yeah, and bug fix or patch releases come out roughly once a week. They are denoted by the rightmost version number. They're cherry-picked from the main Linux branch, and the rule is: no new features. Therefore, regressions are quite rare. They almost always contain critical security patches, and you almost always want to apply them. Well, the problem with major kernel upgrades is that a major stable branch is kept alive for around two to three months, and then it's abandoned. It's declared end of life, and no new bug fixes or security patches are backported to it.
And the assumption is that at this point you will have a new major stable version available, and you should just upgrade to that major version. But sometimes it's very costly to evaluate a major version, because you do get new features and potential regressions. For this, there are the so-called long-term stable releases, where bug and security fixes are backported for at least two years; it's usually the last stable release of the year. So an LTS release comes out once a year, and if you follow these, which we do, for example, it gives you enough time for a more rigid evaluation of the next long-term release. And surprisingly, the releases are quite well described on the kernel.org website, under releases. I was surprised how many people don't go beyond the main page of kernel.org to read stuff. So yeah, go and read it. It's quite interesting. Okay, so what do we do for safe and easy production kernel upgrades? First, don't create a dedicated deploy procedure for the Linux kernel, because kernel upgrades are usually less risky than other software, for whoever has been convinced of that today. Well, some hands, okay. A simple staged rollout is usually enough, and kernel upgrades are naturally slow-paced because they require a reboot. And because you probably won't reboot everything at once, there is a lot of headroom to abort the deploy if things look wrong. Do avoid having to justify bug fix kernel upgrades. Apply them with no questions asked. There is almost always something that is applicable to your workload, and they contain bug fixes and security fixes only. Also minimize canary soak times and prefer a metrics-driven approach. Soaking eats into that 30-day window you may have for running the new kernel everywhere in production. So if you require a high soak time, think about it: what metrics or observability would give you more confidence to roll out this kernel faster?
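The versioning rules laid out in the last few paragraphs, "X.Y" names the stable series and the third number counts weekly bug fix levels, with nothing semver about it, can be modeled with a small parser. This is a hypothetical sketch, not any real tool's API:

```python
# Sketch of the kernel versioning rules described above: "6.1.32" means
# stable series "6.1" at bug-fix level 32, and a bigger first number does
# NOT imply bigger changes, unlike semantic versioning. Hypothetical helpers.
def parse_kernel_version(version: str) -> tuple[str, int]:
    """Split a kernel version into (stable series, bug-fix level)."""
    parts = version.split(".")
    series = ".".join(parts[:2])                     # e.g. "6.1", the major release
    patch = int(parts[2]) if len(parts) > 2 else 0   # e.g. 32, weekly fix level
    return series, patch


def is_bugfix_only_upgrade(old: str, new: str) -> bool:
    """True when the upgrade stays within one stable series, i.e. it brings
    only cherry-picked bug and security fixes, never new features."""
    return parse_kernel_version(old)[0] == parse_kernel_version(new)[0]
```

For example, under this model an upgrade from "6.1.10" to "6.1.30" is bug-fix-only and can be applied with no questions asked, while "6.1.30" to "6.6.1" crosses into a new major series and deserves evaluation.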
Stay on the long-term branch if validating a major version is costly, if you have to do a lot of analysis and testing. You get at least two years of bug fixes and security patches, but don't wait out the two years, of course. Better, what we do, for example, is start evaluating the next long-term release early, as soon as it's available. Apart from just being proactive, it gives us new features early and, most of the time, better performance and resource utilization. And we also don't accumulate too much change delta, as I described before. If you don't have it, implement and improve production testing for major version validation. Basically, safely upgrading the kernel requires you to understand what your workload is. If you're a web server or a database, which specific kernel subsystems does your workload target? Because sometimes even a bug or an improvement in CPU scheduling does not apply to databases. Once you understand your workload, it's better to write tests which exercise the kernel subsystems and interfaces required by your workload. Having these tests also really helps with communicating issues to the upstream community, because at Cloudflare our team is quite small, and we're not experts in everything, and I would highly doubt that anyone, however experienced in the Linux kernel, including Linus himself, could be an expert in all the kernel subsystems. At one point we had a bug in KVM, and we knew nothing about KVM at that point, but we had a reproducible test which triggered the bug. We spent like two weeks trying to understand what was going on, and we couldn't, but since we had a reproducer, we just posted it to the upstream mailing list, and there's always a person saying, oh yeah, here's a fix, in 10 minutes. But you have to create this reproducible, self-contained test for people to actually help you. And yeah, make metric-driven decisions about whether to upgrade, not time-based decisions.
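A workload-specific kernel test of the kind just described can be a tiny self-contained program. Here is a hypothetical example written in the spirit of kselftest (which really does use exit code 4 for "skip"); the specific behavior pinned down, that a non-blocking read from an empty pipe fails with EAGAIN rather than blocking, is an arbitrary illustration of the idea, not something from the talk:

```python
# Hypothetical kselftest-style check: exercise one kernel behavior a
# workload might depend on, namely that a non-blocking read from an empty
# pipe raises EAGAIN (surfaced as BlockingIOError in Python) instead of
# blocking or returning data.
import os

KSFT_PASS, KSFT_FAIL, KSFT_SKIP = 0, 1, 4  # kselftest exit-code convention


def check_nonblocking_empty_pipe() -> int:
    read_end, write_end = os.pipe()
    os.set_blocking(read_end, False)
    try:
        os.read(read_end, 1)   # empty pipe: must not block or return data
        return KSFT_FAIL       # a successful read here would be a regression
    except BlockingIOError:
        return KSFT_PASS       # kernel behaved as the workload expects
    finally:
        os.close(read_end)
        os.close(write_end)


# A real selftest script would end with:
#     sys.exit(check_nonblocking_empty_pipe())
```

Running such a check on every candidate kernel turns "will it break my workload?" into a yes/no answer, and, as the talk notes, the same self-contained reproducer is exactly what makes an upstream bug report actionable.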
One thing that also helps, along with metrics, monitoring, and automating your kernel releases, is dealing with human risk perception. Sometimes when new people join your team, they still have this mentality that Linux kernel upgrades are very risky, and if you require a human to approve and perform these upgrades, they will always be reluctant to do it. Automation really helps here to remove the human risk-perception factor, because these days, especially at Cloudflare, many teams are not even aware that kernel upgrades are happening. They're happening under the hood automatically, and people don't notice, and you don't have to ask anyone whether you should upgrade, because you have this more or less, not perfect, but more or less data-driven approach. And I think that's everything I wanted to talk about today. So again: Linux kernel upgrades are not more risky than any other software. You need to patch early and patch often, your bug-fix kernel releases should be applied with no questions asked, and understanding your workload, metrics, monitoring, and automation will allow your system to stay patched and secure in the long run. Thank you very much. May I ask something? I know where that fear comes from. It's a fear that we all have, I guess. I can just tell one story: you have, like, a 5.4, and it's working fine, and you have some kind of special chipset, maybe, and the kernel doesn't support everything that chipset can offer, but it runs fine. So you upgrade to 5-point-something or 6, and it starts to crash. And then you roll back, and the next time you will really think twice about upgrading to the next version, which will offer you more support for that chipset, but you still don't know.
Then you wait for others to upgrade, to be sure that it's working fine now, and that's why you don't rush to upgrade really fast. You see that this one and that one did it and it's running fine, and then, you know, these things build fear. That's why it's always good to wait a bit more until five of them have done it, and then, okay, I can see they're running fine now, so I will do it now. Well, I mean, based on our experience, I have gotten this same question from our production engineering team many times: why do we rush to upgrade? Why don't we wait until all the bugs are fixed and then upgrade? And I guess it depends on your workload, but for us specifically, I sometimes describe Cloudflare as "Linux as a service", because many of our products are using Linux, are stretching the Linux kernel to its edge. If there is a new feature in the Linux kernel, like XDP or io_uring, people jump on it and adopt it almost immediately. And as a result, because we use these edgy features which many people don't use, there is no one else to fix these bugs for us; we're hitting them first. So we tried waiting, and while we were waiting, we were still hitting the bugs, because nobody else was using that feature in this way, and this is where you just can't wait. I guess it's the same with very specialized CPUs or hardware: if nobody else uses this hardware, you can't wait for the community or someone else to fix your bugs; you have to push them through yourself. Of course, when you see the bugs, it's always helpful to report them, and there will be people on the mailing list who within a moment will send you a one-liner patch to try out, and usually it works out. But generally, if your workload is specific enough, or your hardware is specific enough, you can't just wait for all the bugs to be fixed, because they're applicable only to you. Okay, good day.
I wanted just to emphasize your position that Linux is safer to upgrade than almost any other software, and to me the main reason is the strong commitment from this community to ensure that all the stable releases are safe to upgrade to. I know very few other projects that make this contract with their users, saying you can upgrade safely. And I think this is a major point, and the Linux community should be recognized for it, because it puts a lot of work into ensuring that we are safe to upgrade. That's something very important. More than the rollout practices you are leveraging, it's much more that there are such strong contracts ensuring that every stable release is safe to use. Yes, you mean you're referring to the "don't break user space" mentality? Or even "don't take a patch which is not already in mainline". I mean, if your patch gets into the stable tree, it's because it has been tested and proven to be safe, and so the sum of all these patches is also supposed to be safe. And this strong commitment is very important, I think, for the users. Yes, yes. We should praise their work. Yes, yes, yes. And many times when you submit patches, there are tons of external people or systems that will run your patch in a kind of CI and report if something broke. Yes, I guess you're right that we have to acknowledge that the community puts a lot of effort into these stable releases being actually stable. But the release process itself also goes a long way. Technically, again, you have only two weeks to merge new features and then you're stuck with seven weeks of bug fixing. So, yes, the emphasis on stability is a real win, I guess, for this community. And another thing: assessing security issues is not only about counting the CVEs. Greg made a great presentation around that. If there is a CVE, there's probably a security issue. But there are also fixes which are not tagged as CVEs which could well be security issues.
So, to evaluate the security risk of a given version, it's not just counting the CVEs; it's much more complex than that. Yeah, I agree with that. And this is what I partly mentioned: that data is crowdsourced and probably incomplete. It's kind of like the minimal baseline of risk. But there is more, of course. These are the publicly known vulnerabilities which have been tagged for this project; there are a lot of fixes with no CVE attached, as well as a lot of unknown security vulnerabilities hiding in the system. So, yeah, definitely. Anyone? Hi. Here. I don't see... I'm here. Oh, okay. Hi. I have a question about livepatch. Do you use it in your company? Livepatch? We don't use livepatch. And my personal view on this... I don't fully see livepatch technology covering all the use cases. I think it is useful for patching vulnerabilities really fast. Yeah, yeah. But only for particular types of vulnerability. Yes, yes. With livepatch, you're basically replacing a piece of code in the kernel with another, patched piece of code. But we have to remember that the in-kernel API is not stable, and basically you can only do that if your patch doesn't require changing some kind of structure. It may fall apart if you're required to add a mutex into a structure because you have a race condition. And this is where livepatch fails. Moreover, implementing livepatch is very complicated, and you can crash the system as well, because you're messing with the kernel's code. So, in my opinion, the effort is kind of not worth the return on investment. If you don't have a company, like an enterprise Linux distro, doing it for you, and you're doing it yourself, you're putting in a lot of effort, you can't patch all the security vulnerabilities with it, and you don't get much benefit.
If you instead just focus on building a system where you can reboot anything at any time, that gives you a much better long-term result, because you can just reboot with a new kernel and your system is resilient to that. And it takes about as much effort. Thank you. Hello. Thanks for your detailed explanations, and for outlining that the version numbering doesn't actually work the way we think it does. Now, I have questions. You mentioned that we usually install the rest of our software from some upstream source that we don't have control over. And actually, I do that for everything, even the kernel; I don't usually compile it myself. So the question is: should we be aware of particular tricks? Because this process is actually mediated by the distribution. Do the people who make the distributions know all the stuff you mentioned? Yes. And actually, the model which I described, following the LTS release and rolling out bug-fix releases regularly, is what most distributions actually do. You might not see it because, for example, Debian versions the package differently. So you think you're always on the same version, but you may notice, if you're doing a regular apt-get upgrade, that when a new Linux kernel is installed, it actually installs a new bug-fix version, which is hidden under the hood. So this is what most distributions do: they either follow LTS or they take a non-LTS branch and maintain it for longer. But when you upgrade your system, you just get the bug fixes and security patches from these bug-fix releases. Hello. I'm still not completely sure how the kernel process works. How about firmware that's just dropped into the kernel? Is that included in those bug fixes? And if so, how is that tested? How are you ensuring that those binary blobs don't change something that breaks everything?
So, in modern distributions, and within the Linux kernel upstream as well, the binary blobs are now managed separately. They're maintained in a separate git repository, and in distributions there is a separate package for them, usually called linux-firmware. So, basically, the kernel code and the binary blobs are upgraded at a different cadence and have different release procedures; the blobs are not included in the kernel upgrade these days. Hi. Over here. Yeah. So, you were talking about the fear of upgrading kernels, but to me, or when I'm looking at my team, sometimes it's more the tedium of having to reboot or to migrate the service, and then doing it over and over like Groundhog Day. Now, my question is: what would you consider a reasonable cadence for that task? Do you see a need for the whole system to align on a specific kernel, zeroing out the whole system, or just some routine monthly maintenance that jumps a few versions? What's your take on that? So, again, for bug-fix and security releases, my preferred cadence is weekly. They are released every week; you have to compile a kernel and roll it out. I mean, not roll it out everywhere at once, but start its rollout on some set of production, then more and more and more. And again, basically, the more you delay, the more change delta you accumulate, and the more risk you're taking on. So if you do it as regularly as possible, your change delta is small, and technically, within a couple of bug-fix releases, even if something breaks for your particular service, you can bisect it and understand what's happening much more easily than if you have to go through, you know, thousands and thousands of commits. So if it's hard, you have to think about how to make it easier and how to do it more often.
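The point about a small change delta making bisection easy can be demonstrated with git itself. This toy sketch builds a throwaway repository where the "kernel" is simulated by a marker file, one commit introduces a regression, and `git bisect run` locates it automatically; a real kernel bisect works the same way, just with a build-boot-test script as the run command.

```shell
#!/bin/sh
# Toy demonstration: four commits, one of which flips behavior.txt from
# "good" to "bad"; `git bisect run` finds the regressing commit for us.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

echo good > behavior.txt
echo "change 1" > log.txt
git add -A
git commit -qm "v1"
good_rev=$(git rev-parse HEAD)

for i in 2 3 4; do
    echo "change $i" >> log.txt
    if [ "$i" -ge 3 ]; then echo bad > behavior.txt; fi   # regression lands in v3
    git add -A
    git commit -qm "v$i"
done

git bisect start HEAD "$good_rev" >/dev/null
# bisect's test command: exit 0 on a good commit, non-zero on a bad one
git bisect run sh -c 'grep -q good behavior.txt' >/dev/null
culprit=$(git log -1 --format=%s bisect/bad)
git bisect reset >/dev/null 2>&1
echo "first bad commit: $culprit"
```

With only a handful of commits between a known-good and a known-bad kernel, this converges in a couple of steps; with thousands of accumulated commits, each step may cost a full build and reboot.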
The more often you do it, it's like the gym: you build that muscle, you build the tooling around it, you build the metrics and observability around it, and eventually you build your confidence, so that doing it becomes very fast and effortless. Yeah, my question is mainly about the time spent. My question is mainly about the time that you spend, you know, managing that as part of your day-to-day. Well, again, it's a basic calculation of return on investment, right? If a kernel upgrade is too costly, in the sense that you're spending a lot of time doing it, think about whether you can invest that time in building some kind of automation. And that's basically what we did. When I joined the company eight years ago, it was very manual and time-consuming, and it required a huge team of SREs to actually do a kernel upgrade, but now they're not even involved anymore. It just happens. Thank you for the interesting talk and the nice presentation. Thank you. Enjoy it. Thank you very much. Thank you.
The D Programming Language for Modern Open Source Development
Hello. All right. Great to see everybody. I see some familiar folks here. Just a quick show of hands: how many folks have heard of the D programming language? Oh, wow, awesome. Keep your hand up if you've used the D programming language or tried it out. Okay. Yeah, I see you there, Dennis. Yeah. A few other folks here. Great. This is perfect. You're in the right place. We're going to have a lot of fun today, and I'm going to give you an introduction to the D programming language. I'm not going to show you everything, because D is a really large programming language, but hopefully enough to get you excited, and ultimately to show you some open source projects where you can get some inspiration. So let's go ahead and get into it. So again, it's been six years since my last FOSDEM talk. I just want to thank the organizers for inviting me back and letting me talk again. The goal today is just to have fun. You can kind of sit back, relax, have a good time, and just learn about what I think is a really interesting programming language that's expanded my mind as far as how I think about programming. With that said, hopefully I'll come back sooner than every six years. A little bit about me: my primary role is teaching. I'm an associate teaching professor, so I love teaching stuff. I do teach the D programming language; I'll talk about that towards the end, or give you a reference for it. Otherwise, I'm really interested in performance and systems stuff. Again, you folks are my crowd, so I'm really excited to be here with you. And with that said, here's the abstract, of course, that you read and that led you here, again, to get you excited about the D programming language. And any code that I have for the talk will be linked here; if it isn't already, I'll post it shortly after this talk. All right.
So again, what I want to do today is get you curious about a really, really cool open source project. Now, that open source project happens to be the D compiler. In fact, all the D compilers, as we're going to find out, have their source code available. So how cool is it that you can actually look at a programming language that's been around for quite some time and see some really awesome work by some really smart engineers? At the very least, I hope it's exciting that you will have some place where you can look, or send other people to look, and see how optimizations are done or how code is written and organized. So again, I think that's in itself very interesting. And maybe one day you might find yourself contributing to this compiler or this ecosystem, or find inspiration elsewhere for using this programming language. And my secret dream for you, if I do a good job during this talk, is to get you excited enough to say, yeah, I'm going to contribute. There have been some awesome videos on how to do just that. Again, a lot of the open source projects that we've seen today and will see tomorrow have these resources, so I just want to point out that those are available as well. So again, it's really cool to look through the source code of the D compiler, which is a very, very, very fast compiler for the D programming language. Okay, so with that in mind, with my interest out there on what I want you to get out of this, or maybe get excited about, whether you're a student, a practitioner, or somebody in industry, we'll continue moving forward here. And as I'm talking about this, I do want you to know that I'm a bit of a programming language enthusiast myself. I love using different programming languages. This has been a problem for me since I started programming: always looking at and kind of moving around between different languages, seeing what was new, what kind of features. And honestly, I think there is some value in that.
You get to see how different languages approach things. Actually, we were just at a previous talk, on the Hector script, talking about the actor model and mutability, how parallel processes are organized. I think there's a lot of value in taking away some of those core concepts from different languages. So what I've been doing lately is, every few days now at this point, just turning on my camera for an hour and live streaming myself learning a programming language for the first hour or so. And you pick up interesting things from different languages. But just to be clear, the languages that I use professionally and teach most are C++ and the D programming language. I'm always kind of thinking in terms of, oh, you know, Go does it this way with its defer statement and D has scope guards, or, oh, there's message passing in this language and this is how you do it in D. So it's been a really interesting sort of experiment going through this process. And the language that you ultimately use, well, it kind of wires your brain a little bit sometimes. So that could be something kind of curious: again, looking at new languages, looking at languages that are popular, and looking at languages that are maybe not so popular, as far as the mainstream goes. At the end of the day, what I hope one of your other takeaways will be is, you know, as we know, sometimes it doesn't matter what the language is. It's going to be what gives you a competitive advantage, what is fun for you to build software in, what is the tool that you can use to create something. So my goal today is not going to be to convince you that one programming language is better than another. Even as I look at those programming languages, I try not to do that. I'm smarter than that. I think I am. We'll see if I slip today. You know, we sort of like our programming languages and get used to them, right? We have our favorites.
But again, I do want to share my enthusiasm for D, why it stands out, and why you might also have fun with it. So with that said, we're going to do that same little experiment that I've been doing: turning the camera on for an hour, looking at a programming language for the first time, and just investigating some interesting parts of it. I hope that will get you curious about the different parts of the D programming language and, again, get you excited. And maybe, just maybe, if I'm successful, and I looked around and saw everybody who raised their hand and who didn't, we'll see more hands raised... what was it? Six years from now, when I'm invited back. So anyways. All right. So I'll show you a few cool projects for inspiration. Most if not all are open source. The only ones that aren't are the scripts that I haven't put in my GitHub repo yet, so that will be true by the time you see this talk. And each has something that you can learn from, a specific feature. I'm a big proponent, again, my background being in teaching and some industry, of the idea that we need to read more code as we're learning, because there are lots of smart engineers, you folks, writing that code, and I want to learn from you. So with that said, we'll look at these projects, all in the D programming language. So let's go ahead and begin. I'm going to start with something cool made in D. Why not get some inspiration to start this talk off? And here it is: a project that's built in the D programming language, Tilix. How many folks have used this terminal emulator? Yeah, I'm seeing a few hands go up here. This is something I like to do: occasionally download and try out different ones. But to my surprise, I actually looked at the source code. One of my students actually told me Tilix is built in D. I didn't know that. So that was really cool, what you find sometimes in the wild. But again, oftentimes as a user you don't really care. It's just a cool piece of software as an end user.
But as a practitioner, you get to see some of the cool tricks they do. So along with just showing you some different tools that have been built in the D programming language, I think it's important to say: why do we care to look at this closer? With all these slides here, again, I'm not going to ask you to read them or click on all the links; the slides will be available. But what's interesting about this particular project, what you might be curious about, is that it's something very visual, and if you dig into the source code, it's using the GTK libraries. Those are C-based libraries. So how does D interface with C code? Well, the answer is that D actually does a really, really nice job interfacing with C code. So if you are C programmers, or have been using C, you can basically call your C functions directly from D. Easy as that. Now, of course, there are bindings and wrappers and other things that folks do with the D programming language. But that's nice: you get a head start by being able to use some of your C code, or even C++ and Objective-C; there are ways to squeeze stuff in. So I thought that was very neat, just looking at the main app file from this particular program to see the different libraries they were bringing in, and whether it was just straight C code or a library. Some other neat things, and I'm just going to trickle in some details about the D programming language as we go along here: there is something called ImportC, which is a really cool... well, it's effectively a C compiler built into D. So you can, on the command line, like you would with whatever your tool was, type in the compiler, DMD, your D source files, and your C source files as well. So that's kind of neat. Again, it gives you a head start if you're going to consider migrating to a different programming language, which is a big decision to make if you already have some open source project. All right, so that's Tilix. That's kind of a fun one.
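As a tiny illustration of the interop being described, this sketch generates a C file and a D file in which the C function is declared with `extern(C)` and called directly. The file names and contents are invented for the example, and the script only writes the sources and prints the build command, since a D compiler may not be installed where you run it.

```shell
#!/bin/sh
# Minimal sketch of D calling C directly. We only generate the two source
# files and print the build command; actually compiling requires dmd.
cat > clib.c <<'EOF'
/* plain C code, no wrapper needed */
int add(int a, int b) { return a + b; }
EOF

cat > app.d <<'EOF'
// Declare the C function; D links against it directly.
extern (C) int add(int a, int b);

void main()
{
    import std.stdio : writeln;
    writeln(add(2, 3));
}
EOF

# ImportC lets dmd consume the .c file right alongside the .d file:
echo "build with: dmd app.d clib.c"
```

The single `dmd app.d clib.c` invocation is the ImportC workflow mentioned in the talk: the C file is compiled by the C compiler built into dmd and linked into the D program.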
I'm learning about how D can also play with C code. Okay, so let's just get a first impression of the D language. Again, pretend you're doing this experiment that I'm doing. You go into Google, you type in Dlang, and you go to the homepage, dlang.org, and what do we see? We'll actually see something that looks like this. I'm going to give everybody a minute or so to just look at this piece of code. There's a sample program there, and then I'll ask for some participation; we can make this interactive in the afternoon. Just take a look at this and let me know what you think it does, or what's interesting. I'll take hands and get some volunteers here. I'll give everyone a minute to think about that. What's popping out there, folks? Raise a hand and shout something out. Yeah. So, a few things I see: on the first line, the import is local to main. On the second one, there is an object-style notation for a string. On the third one, there is an enum for an array, so that one I don't get. Then there is this immutable keyword, which is interesting, because it's doing a mutable operation on a, but then b is immutable, I guess. And msg is apparently a pragma that you send to the compiler, so I suspect it emits something at the end of compilation. Okay. How many did I get? Yeah, so we got a good stab at it. I saw other hands going up here. There was one actually right behind, if you wanted to share. Yeah, it could be the same thing, or to add on: it really looks like a C... like a C++++, three pluses, so I don't know why they gave it the name D; they could have just kept it with the pluses. In a way, it's really kind of easy to read. If you know anything of C or its family, you can easily jump in and just do it. Yeah. So, immediately, when we're looking at a programming language, just to recap: we see it's sort of a curly-brace, C-style, ALGOL-style language, right? So we can kind of read it if we know C or C++ or Objective-C, whatever. And it does look like a C++++ kind of language.
We'll talk about that in a second. There's another hand here. Is that program manipulating types as values at compile time, using comptime like you would do in the Zig programming language? So, the question is about whether it's manipulating types here. Something's certainly kind of interesting about the types. For instance, what's the type of b? What's it doing with the types there? Okay, it's static. We sort of know static from C and such, something about memory storage. Immutable, some sort of qualifier; it turns out it's stronger than const. But what's the actual type? Well, there actually are some types being inferred here for us, like auto in other languages. Now, I will let you know, and I'll repeat some of these details: D is statically typed, so at compile time, yeah, we do have to make a decision about what the actual type's going to be and what's returned. Yeah, this is great. I'm going to advance one slide forward here, and you'll see what the label on the program is on the D language homepage: it's "sort an array at compile time". And that's kind of cool. This is usually the first example that comes up here, and I've got a description of the stuff that you folks recapped very nicely. But let's actually run, or look at, a few pieces of code; I think we should at least look at this basic one. Let's make it a little bit bigger here. Just to get a feel, again, this is the same Hello World sort of program... well, this is maybe even a step after Hello World, I would say. But interesting enough. And let's just go ahead and compile it. So with DMD, again, I'm looking towards the bottom of my terminal here, I'm going to compile it. This program I called compiletime_sort.d, and the output file is going to be prog. And as soon as I hit enter... interesting here. It's finishing compilation here. And boy, I didn't run the program. I'll tell you, I didn't run it.
But while I was compiling it, yeah, there is something interesting going on here. It is called compile-time sort, so you might have guessed that. But interestingly, and this is one of the big "why should you care" things to look out for in languages that you care about: we can do computation at compile time. This is a really powerful feature of the D programming language, and of the D compilers specifically: we can take something like an enum, something that would maybe be a constant in another language, set some values, like an array, and then actually evaluate it with sort. But again, if you look at sort, this looks like a function that you might just call in your regular programming, right? So there's nothing really different between the compile-time sort and the run-time sort. That's probably what we want, right? To be able to execute as much as possible at compile time and save our work for when we're actually running, if we're aiming for performance. Of course, there are always trade-offs for that; you notice it might take a little longer to compile. Again, let's go ahead and compile it. Again, pretty fast. Actually, we're going to talk about how fast the D compiler is later. Now, if I actually run the program, prog, right, we just get "Hello, FOSDEM", because that's the actual run-time computation that's going on, okay? This part here is the only thing we're really doing at run time. Now, if we go on and later do something with b or print it out, we'll get our sorted array, but that's the point there. So, already kind of neat. This is kind of an attention-grabbing thing, and again, something that might be new depending on what programming languages you've looked at. And, again, one of the things that certainly caught my attention. All right? And I mean, there's some other interesting stuff here, like, I think it was mentioned here, the quoted string before writeln, the .writeln.
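For readers of the transcript, here is the homepage-style sample being discussed, paraphrased from memory (the exact listing on dlang.org may differ slightly). The script only writes the file and prints the compile command, since a D compiler may not be available where you run it.

```shell
#!/bin/sh
# Reconstruction (from memory; details may differ) of the dlang.org
# "sort an array at compile time" sample discussed above. Compiling it
# prints the sorted array DURING compilation via pragma(msg); running the
# resulting binary only prints the greeting.
cat > compiletime_sort.d <<'EOF'
void main()
{
    // imports can be local to a scope
    import std.algorithm, std.conv, std.stdio;

    "Hello, FOSDEM".writeln;          // UFCS: writeln called like a method

    enum a = [ 3, 1, 2, 4, 0 ];       // a compile-time constant array
    static immutable b = sort(a);     // CTFE: sorted while the program compiles

    pragma(msg, text("Sorted: ", b)); // emitted by the compiler, not at run time
}
EOF
echo "build with: dmd compiletime_sort.d -of=prog"
```

The point of the demo is exactly what the speaker describes: the `pragma(msg, ...)` output appears during compilation, and only the greeting is printed when the program actually runs.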
Okay, we'll talk about this. It's called universal function call syntax, but you know, some nice potential quality-of-life features for us. All right, so that was our pop quiz, the only pop quiz we have here. But I do invite folks to raise their hand high if they see something interesting as we move forward. All right, so again, the sample and why you might choose to care. Just to go back: we call this CTFE, or Compile-Time Function Execution, this idea that we can do work at compile time. As we know, there are a lot of other languages with templates, or some such mechanism to do this if you're coming from a C++ background, up to various extravagant levels of metaprogramming. Other languages might do this a little bit more explicitly; otherwise, that's the idea with the D compiler. So, a big win in my mind, and a big win in how clean this syntax is. Okay, so a little bit about this D programming language. Somebody mentioned it kind of looks like C plus plus plus plus. Yeah, so a little bit of history here. Walter Bright, who's highlighted there with the arrow, that's him at DConf a few years ago, two years ago now. He was the initial creator of the D programming language. It was called the Digital Mars compiler originally, but folks kept saying, hey, it looks like C++ plus plus, or whatever, and they just started calling it D, and that just sort of stuck. So that's what we've got here. A little bit about Walter: again, he's a compiler expert. He's worked on C compilers, hence why there's sort of a C compiler in the D language, and C++ compilers. And then, of course, he thought about it for a while and said, well, I'd like to make something new, something that's fun and efficient to program in as well, and that's where D sort of came about. And then a major collaborator was Andrei Alexandrescu, who joined around 2006 or so, and then for the next 10-plus years was a very active contributor in building what we now use as D2.
And we actually have other audience members who are contributors. I don't know if you want to out yourselves; you can raise your hand, but you don't have to. So anyway, there's a full history of the D programming language, and a really interesting article if you want to learn about the history and the origins, about how it evolved and the whys of doing things in the programming language. Again, that can be interesting sometimes: if you know the historical context, why things look a certain way, sometimes that helps you understand when or when not to use a feature. So anyways, that's just a little bit about the history of the D programming language. So again, what is the D programming language? Still on the front page: it's a general-purpose programming language with static typing. So whether or not you see those types, they can be inferred. It's a systems-level programming language, so you have low-level access to things like pointers, for instance, and you get the C-like syntax. So it's relatively familiar, again, if you've used C or C++, right? I imagine pretty much everyone who raised their hand had heard of it as, yeah, something like the next C or whatever. But the mantra with the D programming language, at least on the home page, is: write fast, read fast, and run fast. So we'll try to see if it holds up to those things, and again, why it might be a good choice for playing around with, or maybe for your next open source project. So, over the last 25 years now, there are three compilers for D. There's the DMD compiler; that's the main one that Walter has and works on. And that compiler is completely open source, so you can dig into it, you can make a fork of it and modify it and play around with that DMD compiler. And it's a very, very fast compiler as far as compiling your code. You can compile the actual D compiler, I want to say, in a matter of seconds: tens or hundreds of thousands of lines of code.
And that has in part to do with the module system, being able to do concurrent builds, and how many passes it does over the language. But it's very, very fast. Your edit-compile-run cycle is very quick as you're iterating and doing development, which I find important. There is also, equally as important, the GDC front end for the GCC compiler suite — I think it was around GCC 9 or 10 that it was added in officially. So you've got the front end there, with Iain Buclaw working on that, and LDC, worked on by Martin, which gives you all the LLVM infrastructure. So if you're trying to target lots of different platforms, for instance, LDC, the LLVM-based D compiler, is available for that. So you've got three compilers, which is great — you don't have to worry about the language disappearing anytime soon. And it is very common for D programmers to take advantage of the very fast edit-compile cycle with DMD, and then, when it comes time to build an optimized build and you want to take advantage of all your GCC toolsets and infrastructure, or your LLVM infrastructure and all the optimization passes, to use those compilers afterwards. As far as downloading the tools, I don't need to spend too much time on it, but if you're on one of these platforms, you probably have a way to get the D compiler built for that platform, or otherwise there is a zip file or something available on your package manager. And with the D programming language, you get a package manager called dub, which will help you manage dependencies, bring in packages, and these types of things. It's also sort of a lightweight build tool as well. There are other tools that you might expect, like Dformat for code formatting, which already exists and is being worked on, and Dscanner, which is like a linter. And if you're a VS Code user and want IntelliSense and these types of things, there's support for that, as well as for IntelliJ. Okay. So D — where is it being used right now?
Again, maybe we've heard of this language, or maybe we've used some applications without realizing they were written in D. Again, from the website, lots of different companies have used it internally, and folks like myself just use it for our own projects or research. But I think D has done a really nice job finding itself in various performance-based niches. From some of these various companies, there are different stories about how different tools were being used, which I'm happy to go into. So I want to go ahead and show a few. And this was another tool built in the D programming language — I'll try to pronounce it correctly, I think it's Eilmer — a compressible flow simulator. Okay, super cool. So they're doing computational simulation, something very expensive to do. This tool is now 10-plus years old, being used by various PhDs, postdocs, and researchers. But again, why should we care about this tool other than that it generates really pretty pictures? Their website has some really beautiful pictures — these are just the ones I sort of understand, so I could post them in case anyone asked a question. But again, it's a project that's been around for 10-plus years, most of the code is in D, and it's showing off high performance. And I thought this was a great message to share from their GitHub: their focus is on open source development, to give a simple access point for doing gas dynamics in research and teaching. So what a great place to start if you're in this area and want to look at some open source D software. Okay, so that's a nice tool. Getting back to some of the D language features. I've sort of already thrown out one of the main big ones here, compile-time function execution, which again we're starting to see in other modern languages, but it's sort of a staple of D and why I think it's really interesting. But the language itself has a lot of really nice quality of life features.
So these are things like: you get a bunch of built-in data structures without having to import anything — dynamic arrays, associative arrays (or maps, or dictionaries). They're bounds-checked, which you can enable or disable; there's always a path to performance here. You get things like your lambdas and delegates, the object-oriented and functional styles, generic programming, design by introspection, concurrency paradigms — all of that. Again, it's a really big toolbox; we can't cover all of it, but there's probably something interesting here for you, or a domain where you might expand. I personally found that I started doing more functional-style programming when I started using D, because it was very accessible in the standard library. The D language also, by default, is garbage collected. But you can turn that off if you want. You can malloc and free, you can do reference counting, you can implement your own strategy from scratch if you want. There is a question, and I'll repeat it. Yeah, so the question — I'll break it into two — is how granular this ability to turn off things like garbage collection is, if you do need performance in a certain section of code. It's as granular as putting an attribute on the function: you can put @nogc on it, and in practice no garbage collections will happen in that section. I think in the actual runtime you can also do something like GC.disable, which I think is similar to what Java and other languages have. So you get that granularity — it can be at a function level, saying this code, no GC, and being able to handle it. The array bounds checking, I know, is set as a compiler flag. For that one, I actually don't know the answer as to whether you could do it on a per-function level.
What I would say is, if you wanted an array that wasn't bounds checked: there is, I think, one of the standard array containers in the standard library that doesn't do allocations, so you don't have to worry about that container and garbage collections. But for the bounds checking, to be sure, you could just implement your own dynamic array, no problem, just like you would in C, if you want that granularity. I will also show — what will I show here? Yeah, so does that answer the question? GC is as granular as garbage collection per function, which you can enable or disable; and for the array bounds checking, you can always implement your own, but there is a compiler flag for on or off. And typically folks would use that for that last little performance gain — if they're building a video game or something and are super certain there aren't going to be any arrays indexed out of bounds, because typically you know the fixed-size allocation, you would just turn that off. Perfect. All right, questions, or features that look exciting here? There are lots and lots, and the point is you have control over what you need, which is really, really cool. And we're going to dive even a little further into this; there's some other cool stuff you can do if you only need a subset of these features. But let's continue getting inspired here. So we've got a standard library. Batteries included — like pretty much every other programming language these days, you have to have a standard library with containers and data structures, and various algorithms, right? We've already seen sort in the very first example, but there are things like map and filter and fold and so on. There are various concurrency primitives as well, and we'll take a look at some of those. So you have a pretty decent standard library here, and there's discussion about expanding and refactoring it and so on.
So, most of the common stuff you would need: handling JSON, CSV, files, and so on. That brings me to another tool built in D — why do we care? I'll get into my code here, so bear with me. So here's just the type of thing that got me started using the D programming language: writing these little scripts — 50-line, 100-line throwaway code — to automate some task that I'm doing at my desk. I found myself doing a lot of queries to YouTube to gather data about what videos have been published in a channel, or what videos are in my playlist, these types of things. So what was really nice was just to find that there is std.net.curl in the D standard library. I could just build a query string, effectively make a query, and retrieve my data from that curl request. And then there's std.json, so if I'm retrieving JSON data from some API — again, a common format — I can work with that data as needed here. And then you've got other sort of quality of life things like range-based loops, so you can go through the keys, or the keys and the values, if you want to iterate through them as well. So it's a nice little script; you end up writing a few of these. So there's one example with YouTube. I do this a lot for GitHub too — pulling repos, looking at them, pushing code to students. Again, it's the same sort of pattern I'm always using with any REST-based API where I'm pulling data in. One little interesting thing here, looking at line 53: we can start to see that if you want to set various event handlers — here's just a little example of a lambda function. You can have anonymous functions, you can have delegates, and these types of things in the D language. So, nice little quality of life things here. Okay, so this is kind of interesting. My little scripts — and I'm sure many of you folks have your shell scripts or Python scripts or whatever. And again, that's what happened to me.
I had a bunch of shell scripts; mostly I had scripts in Python. And then I just started translating them to D because, again, I liked it — it was a little bit less cognitive overhead for me. If I'm working in C++ and D, they're pretty similar in how I can think about the syntax. But it's sort of interesting that when I'm using D, I'm still effectively executing my scripts like I do in Python. Okay, so let me go ahead and explain what I mean by that. Yeah, question first. Let's see, line 54 — maybe a bug or something there, line 54? Sorry, I didn't hear. "Unrecieved" — oh, the E and the I backwards. Uh-oh, okay. I knew I shouldn't have put my code here. Good catch, I'll fix it in post. Gotta do some fixing tonight. But the good news is, right, we can iterate quickly. So I'm going to give you an even faster tool that I use to iterate and run these scripts. It's just a little helper tool called rdmd — "run dmd", basically — which does on-the-fly compilation. It compiles as fast as your D compiler, DMD, basically does, but then it'll execute your program immediately. And the advantage of this is that you can use D like a shell scripting language. If I can get a read down here — I'll try to highlight my cursor, I know it's a little bit small — you can just put the shebang line, #!/usr/bin/env rdmd, at the top, chmod it executable, and then you just run your program, just like a regular script. So again, that's a really nice way, if you need it, to transition your scripting to something that's statically typed — or you can just think in the D programming language rather than in multiple languages. I found that a nice quality of life improvement. I understand I'm the enthusiast here, but I found that a really big win for me. So rdmd is available.
With the LDC compiler, you also have this available as well, via ldmd2. I haven't actually checked the GDC one. So that was really cool. So, generally speaking, the effect for me, as somebody who writes little scripts, was — and you've got to be careful talking about performance — that with a compiled language you generally get better performance than with an interpreted scripting language. So again, a big win for me and my projects. But there is still more to this performance story beyond just switching to a compiled language. Because I started stumbling upon other really cool things in the D programming language that the community pointed me to, and I started doing this in my scripts here. So you'll see here, highlighted — let me draw your attention towards the top — .parallel. I just kind of stick that on the end of some collection or some array, and basically what I get is the equivalent of, for those of you who've done OpenMP, a parallel for loop, right? We're able to launch multiple threads here. That's a small change that you can make — if you don't have any dependencies between the data; you still have to think about it, certainly, to make sure you get correct code. But imagine just going through all of your range-based for loops and adding .parallel, and, where you're doing separate tasks, getting a performance boost. Use your CPU — you paid a lot of money for it, so put it to work. So again, a quality of life feature there. Now, does it make things faster? Again, you have to profile; you always have to check these things. So maybe a better use case: another open source project from a D conference, just a standard hello-world ray tracer project where I used standard parallelism.
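The .parallel the speaker describes comes from D's standard parallelism support. As a rough C++ sketch of what that one-word change hands you for free — spelled out by hand here with std::thread, all names hypothetical — the idea is splitting an independent index range across threads:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// What D's `.parallel` on a range gives you automatically, written out
// manually: split the index range across threads, each thread touching
// disjoint elements (no shared mutable state, so no locks needed).
void parallel_square(std::vector<int>& data, unsigned nthreads = 4) {
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&data, t, chunk] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= data[i];   // independent per-element work
        });
    }
    for (auto& th : pool) th.join();
}
```

The point of the comparison is the boilerplate: everything in this sketch collapses to appending `.parallel` to the range in D, provided the loop body really is data-independent.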
And again, if you're working per pixel or doing something graphical, you have a lot of pixels — however wide your resolution is, a thousand pixels, something of that nature. You can try .parallel on it and see if it speeds things up. And of course, my performance wizards out there are squinting — you're launching too many threads, or what's going on, you know. So, does it make things faster? I'll get to that in one slide. Because I also see something interesting that I've touched on but haven't explained: what's going on in this foreach loop? Foreach y and foreach x — okay, those must be the pixels going across and up and down. Okay, so there are a lot of them. But this next part's kind of interesting: I've got camera.getScreenHeight.iota — which is like a range — and then that .parallel. Well, this is an example of that uniform function call syntax: this idea that we can chain functions together with a dot. Maybe you've seen this in other programming languages; maybe you've implemented design patterns that allow you to do this. But it's a really nice quality of life feature if you just compare camera.getScreenHeight.iota.parallel versus trying to figure out how to nest these calls — parallel, okay, iota — counting your parentheses, or hoping your editor counts them correctly for you. Again, just a little quality of life thing, more readable code, and you can actually think and sometimes see, like, oh yeah, I see that's just a range there. Maybe I can parallelize it; maybe there is some data-independent thing there. So anyway, that's just following up on that.
And then as a little aside, and you can look a little bit more: there is a built-in profiler in the D compiler for seeing how many times a function executes and how much time you spend in it, and there's also a memory profiler so you can see how many garbage collections you're doing if you're using the garbage collector. Okay, so they're built into the compiler; you don't have to search for them. I do use other tools like perf — choose your favorite tools — but it's nice that it's there, okay? It's an easy tool that you could build into a continuous integration system or whatever. Okay, so, speaking of graphics projects — again, that's sort of one of my passions — it turns out that D is a great language for building graphics projects. So here's the obligatory pretty-picture slide, and there are actually games and physics if you click into this. The cool D language project Dagon here is a game engine — something sufficiently complex. Why do we care about this, though, other than that it's pretty in a slideshow? Very, very beautiful; lots of hard work there. But again, just to see a substantial project by engine and graphics developers: you can see how it's laid out, how the different core systems are laid out. It might be interesting for you to think about, if you're going to use D for building games, how you organize different components and game objects and these types of things. And you can look through the directory structure — D uses a directory structure for packages like Java or other languages — and that's kind of interesting. And there's also just a fun comparison to C++ here if you want to see the video. It's not really to say anything; both these applications are very GPU-bound, so that's sort of the point, right? Use the language you want, and if you're GPU-bound, that's all on the GPU anyway, so you can think about those tradeoffs.
So there's one game engine. Another one, Dash — this is cool; I think it started off as a student project, and then it gained some steam with several folks. So there's a little game they made. Why do you care about this? Well, I spent just a few minutes looking at the code to see how things were structured, and very interestingly, they were using this idea of mixins in their code. How many folks, just as a survey, have heard of a mixin — by a show of hands? Okay, we've got about 40% or so. So the idea is that you're literally just taking a string and pasting it into your code, and it should be valid D code that gets compiled. Sounds trivial — kind of, why would you do this? But it makes sense in use cases: if you've got graphics code, you can just import or paste in some shader code and do a mixin. Or you can use other compile-time techniques to build out a string at compile time and then generate code. It's a very simple idea with which you can compose and generate some really cool graphics things, and I think it tends to work well in the use case this game showed. Another, later project here: Hipreme Engine. So they've built some nice stuff. Why do you care about it? Why should we look at it? Well, Hipreme is very active in the community, so a good person to know, for one. But it's also a really interesting example of how to support multiple platforms: Hipreme can build a D project on PlayStation Vita, Xbox, Mac, iOS, Android, et cetera. Just to see that that's accessible — I think that's a project worth studying, to see how they got there. Okay. All right. So there are lots of other graphics resources. Mike Parker, who's a member of the community, has done a great job with common libraries and graphics stuff, sort of an FYI.
We're talking about open source today, so I'm going to sort of ignore the commercial game projects done in D, but there are a few interesting talks, again, if that's your domain. And, okay, so, talking about a few of the other D language things of interest: the paradigms, okay? Because again, I said when I started using D, I started doing things more functionally. I started thinking more about concurrency. I started thinking about object-oriented programming, I think, in the right way: at least, message passing is supposed to be one of those pillars of object-oriented programming that kind of gets forgotten sometimes. At least that's what I think of with object-oriented programming. But anyway, just a couple of examples you can take a peek at after the talk: I've got the range-based loop here, and then I've got the sort of mantra of "no raw loops" — get rid of those raw loops and just use functions like filter and these types of components. Again, very nice, often easy to substitute; and often you find instances where you can just do a .parallel much more easily. And on the right here is just a classic: you've got an interface, and you want to create a type of dog — a husky, a golden retriever, your favorite dog, a Belgian Shepherd, et cetera. Okay, and then I can't leave D without giving a hello world of metaprogramming, because that's really, again, one of the strengths here, right? We talked about stuff that you can do at compile time. So here's a sort of simple function called printData. I'll draw your attention towards line 38. T is the template parameter — there are no angle brackets; you just put the template parameters right after the function name. So T, whatever the data type would be, and I've got another T for whatever that type is, and then the struct. Okay, what is the struct and why do we care about it?
Well, we care about this struct only if it has members — attributes — called memory and elements. Okay, so memory might be a chunk of, I don't know, some attribute of the data, and elements is maybe, again, an array of the data. So what's sort of interesting is, one, you can think about this as a sort of template constraint, or a concept, depending on what language you're coming from, that has to be adhered to. So I can only use this templated function to print a struct's data if it has memory and elements. I think that's kind of a nice constraint to think about, or a nice ability to have. So that's kind of interesting here. If we have time at the end, I'll flash some of the examples that I'm going to put in the GitHub repository for other introspection things you can do. You've got a traits library, so you can see what member functions you have. Is this thing a unit test? Does it have some attribute on it, like @nogc or whatever? A question? And the question was: is there static if? There is static if. There's static foreach. The follow-up was why I didn't use static if here. I guess I could make this static — I don't know if it's implicit here, actually; I need to think about whether it is or not. Yeah, I guess we don't need it, because technically we wouldn't generate this template if the constraint wasn't valid, since that's happening at compile time. Okay. So, leading us towards the end: I know, I've gone through a lot, and I've tried not to make it a sales pitch, just to show you things that I'm excited about. But if you're not ready to try D, there are still other interesting things in the compiler. There's something called BetterC, which is a subset of the D language, and basically what it does is get rid of, or sort of remove, a lot of the language runtime.
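The constraint the speaker describes — a template that only accepts structs exposing memory and elements members — can be sketched in C++17 terms with a detection trait. This is only an analogy for D's `if (...)` template constraints; the Buffer type and all names here are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Detection trait: true only for types with .memory and .elements members,
// playing the role of the D template constraint / concept in the talk.
template <typename T, typename = void>
struct has_layout : std::false_type {};

template <typename T>
struct has_layout<T, std::void_t<decltype(std::declval<T>().memory),
                                 decltype(std::declval<T>().elements)>>
    : std::true_type {};

// The "printData"-style function only makes sense for conforming structs.
template <typename T>
std::size_t element_count(const T& s) {
    static_assert(has_layout<T>::value,
                  "T must have 'memory' and 'elements' members");
    return s.elements;
}

struct Buffer {                 // hypothetical struct matching the constraint
    void* memory = nullptr;
    std::size_t elements = 0;
};
```

In D the same intent is written directly in the template declaration's constraint clause, which is part of why the speaker calls it a nice quality of life feature: the requirement is visible at the signature, not buried in a static_assert.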
So, this is if you want to do some more bare-metal programming, for instance, and you don't want to carry the standard library, Phobos, or you don't need some of these other features. You get most of the quality of life things — the bounds checking with arrays, slices for working with them, delegates, lambdas, all those nice things, all the compile-time execution — but you can just use it as a better C language. Some of this is stuff you're starting to see in C23, for instance. And there's a really nice talk introducing that, on kernels and how folks are using BetterC for kernel development — so again, getting into low-level stuff here. As far as learning more about the language: again, there's a great tour on the website. The good news is, anybody who's written a book on the D programming language — and there are seven or eight, I think — they're all good books, right? They're all written by enthusiasts, reviewed by the community. These are the first two I'd recommend beginners take a look at. They're more for an audience that already knows how to program, and they'll get you started here. The forums and Discord, otherwise, are very active as well. YouTube — that's me. And then teaching the D language: you can hear it from my perspective again, but it's even better if you hear it from the students, right? Their unbiased thoughts on what the value was, whether it was useful for them. And the last sort of resource, as we're wrapping up here: again, from Andrei — he wrote this really nice piece called The Case for D. This was in 2009; I think a lot of it still holds, in a way. Basically, he summarizes it as a high-level systems language where you can be productive and enjoy coding. That's what I found. Maybe you'll find that too. Okay, again, that's up to you to decide. I hope I've shared some cool stuff for you to get excited about otherwise.
So, again, why do we care, maybe as open-source developers? You've got a readable, writable, performant language that hopefully gives you a lot of quality of life features, like fast iteration time. I think there's a competitive advantage here for any project. I found it with my students — again, that's something you'll have to test, but that's what I found: my students get further using D than other programming languages. And there are three compilers available, so you don't have to worry about it disappearing, or other stuff going on here. All right, what's next for me? Well, I talked a whole lot about graphics — that's my passion; that's what I've worked in. But I'm now working on learning a web framework called vibe.d, which is super cool. If you're more on the web side, there's a great book about it to get you started on building scalable and performant web applications. Alrighty, so we learned a bunch of things. Here's sort of a summary slide on some of our takeaways. I'm going to leave that wall of text for you, because I want you to leave excited and not tired from reading. I just want to go ahead and close by thanking you. I'm going to be around, so you can ask me any questions now or after as well. Thank you. Thank you. A question? The question was: why not Rust, and why D? That's a good question. I don't want to pit languages against each other, so it's "why Rust or why D". What I would say — because that's a hot question I get asked a lot — is that D's code is very plastic; the plasticity is high. Meaning I can mold it and change it, which I very much like. In a way that — again, I'm not as much a Rust expert; I've used it a little bit — but D's plasticity is very good. It writes how I want to write the code. It's got memory safety with the garbage collection itself. I find it very, very productive.
I find that if you're going to write an application — again, I'm in games and so on, where there's lots of mutable state — D's a perfect fit for that, for writing safe and maintainable code that I can change later. Yeah. So the comment was: coming from C, this was easier code to see and to read. Yep, yeah, that's the other thing — it reads well; it's easy to get into. Yeah. Another question? The question: the UFCS looks very cool, but how do I know if what I'm calling is a free function or a method of the object? Because it was all the same color in your Vim setup, and I was like, oh no. Yeah — so, a few nice things the D language does when you're doing the dot: when you're working with pointers and classes, if you're coming from C or C++, there's no arrow, so you don't have to worry about that; everything's a dot. But then the question of, is that dot a variable access or a function call? Parentheses are usually not required on function calls if they don't have any parameters — you can leave them off — but I usually just put the parentheses after. Otherwise, this is something that things like the Language Server Protocol in your text editor make easy enough; it's not usually a problem. Yeah. Alrighty, thank you.
First Aid Kit for C/C++ Server Performance
And today I will show you some of the most common performance issues which I have seen so far in my career, how to fix them, and the benchmarks which show the numbers — what kind of performance increase you can get when you fix this stuff, that it works. My talk will follow the plan on the slide. I will first present some issue categories where you typically lose most of the performance, at least in my experience. Then, for each category, we will go through specific topics: what you can optimize, how, and what kind of numbers you can get when you optimize. And then some sort of conclusions on the topics on the slide; we'll just go through them one by one. The QR code right now is not working; it will be working after the talk. Everything will be online and clickable, so you can walk through it again to repeat the recipes if you need. So, back-end performance: at least in my area of work, back end usually means one of three things — latency of your requests; CPU and memory usage on your machine; and your throughput, which is how many requests you can process per time frame, usually expressed per second, so requests per second, RPS. We want to improve this stuff. And there are those bad places where you can lose performance in these three categories, which are: inefficient, suboptimal heap usage; unnecessary, expensive thread contention on critical paths; and inefficient networking — inefficient network I/O, inefficient socket scheduling, and things around this stuff. Like I said, we'll just go through each and see specific cases. Starting with the heap: to understand what you can lose here, you have to understand how the heap works. It's enough to understand the basics; you don't need to know specific implementations. But the basics are that this heap thing is a data structure — some sort of tree, hash table, whatever — it's global in your process, and it is used by all the threads.
When you call new or malloc, they go into the heap, fetch a free block of the specified size, and return it to you, and you use it. When you call free or delete, the block is placed back into the heap. And this operation — finding a free block of the needed size, or placing a free block back — takes time. This lookup in the heap is not free, and it's not constant time; it's some lookup time which depends, for example, on how big the heap is. If we imagine it's a tree storing blocks sorted by size, then the lookup time will be something logarithmic or similar. It doesn't have to be a tree, but the point is: the bigger the heap, the more expensive the lookups in the heap are. Also, like I said, this is a global thing in the process, usually by default, which means you will get thread contention on it if you use it extensively. For example, if multiple threads are allocating blocks of the same size very frequently, you will have thread contention. The heap does have mutexes inside, and you can even see them sometimes in the flame graphs. To make it worse: if you are writing in C++ and happily using those nice, fancy containers — list, vector, queue, stack, the unordered containers, forward_list, all this nice, easy-to-use stuff — you have to realize that even if you don't use the heap explicitly, it is used inside those containers. vector is basically a dynamic array; map is basically a red-black tree where nodes are heap-allocated; list allocates a node for every item of the list; and so on. So you use the heap even if you don't do it explicitly. To make it even worse, you have to remember that allocations affect each other. The more allocations you do, the slower the next allocations and freeings will be: the heap becomes bigger, more fragmented, less optimal, and it gets more and more expensive to use. What can we do about this stuff?
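The point that the containers hit the heap behind your back can be made visible with a counting global operator new. A rough, single-threaded sketch — the exact number of reallocations during growth is implementation-dependent, hence only coarse assertions:

```cpp
#include <cassert>
#include <cstdlib>
#include <new>
#include <vector>

// Count every heap allocation in the process by replacing the global
// operator new. std::vector goes through this even though user code
// never writes `new` anywhere.
static std::size_t g_allocs = 0;

void* operator new(std::size_t n) {
    ++g_allocs;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// How many heap allocations does it take to push_back `items` ints?
std::size_t allocations_for(std::size_t items, bool reserve_first) {
    const std::size_t before = g_allocs;
    std::vector<int> v;
    if (reserve_first) v.reserve(items);   // one allocation up front
    for (std::size_t i = 0; i < items; ++i) v.push_back(static_cast<int>(i));
    return g_allocs - before;
}
```

Without reserve, the vector repeatedly reallocates and copies as it grows — several heap lookups instead of one — which is exactly the kind of hidden cost the talk warns about.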
Firstly, you can try not to use this stuff. You can just not use the heap when you don't need to. For example, when you can just allocate stuff on the stack: when some object or array is small enough and its size is known at compile time, just declare it on the stack and use it, if it doesn't have to be something long-living. Or another frequent case which I see is when we have a class or struct and we store something in there by pointer, and the lifetime of this object is equal to the class where it's stored, right? Then just store it by value. You will reduce the number of heap allocations that way. When you cannot get rid of a heap allocation and you have it in some critical path which is very frequently used on your server and you see it in the flame graphs, you can still do something about it, you can optimize it. And there are ways, some easy ways, how you can quickly regain some performance back. We will start with the object pooling thing, which is not as simple as it sounds. A typical, very widespread use case in the backend: we have a server, requests are coming to the server. Each request is read from the network, parsed, allocated into something like struct request or class request. It can be big — one kilobyte, five kilobytes of different data, different members, attributes. Then you place it into your business logic pipeline, it is processed, and in the end it is deleted. And this process is repeated again and again for every request. And if the request is big enough, like one kilobyte, and you do it frequently enough, like 100,000 times per second or a million times per second, then you will get heap issues here, because heap allocation and freeing will get expensive for such a big object of one kilobyte or more. And you can see it in your flame graphs sometimes, if you are building them at all. Example of the code: we have this class request with many members. Some of them can be indirect members. 
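The two tricks above — stack allocation for small short-lived data, and storing members by value when their lifetime matches the owner's — can be sketched like this. `handle_small_message` and `Connection` are made-up names just for illustration, not from the talk's actual code:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// When the size is known at compile time and the object is short-lived,
// declare it on the stack: no heap lookup, and it is freed for free on return.
size_t handle_small_message(const char* src, size_t len) {
    char buf[256];                       // stack allocation
    if (len >= sizeof(buf))
        return 0;                        // doesn't fit, caller can fall back
    std::memcpy(buf, src, len);
    buf[len] = '\0';
    return std::strlen(buf);             // ... parse buf here ...
}

// When a member's lifetime equals its owner's lifetime, store it by value:
// no separate new/delete, and no extra pointer dereference on access.
struct Connection {
    // std::string* peer_name;  // extra heap allocation per Connection
    std::string peer_name;      // allocated and freed together with the owner
};
```

The `Connection` variant saves one allocation per object and one dereference per access, which is exactly the kind of small win that adds up on a hot path.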
For example, we could inherit this from a base request, which inherits from another base request, and so on, and it can pile up. So in my current project, the size of this thing is two kilobytes, from those many, many small members. And then we have this business logic pipeline, like process request: it allocates a request object, fills it with data, and when the request is complete, asynchronously somewhere it is deleted. This thing, those two lines, will get expensive if done frequently enough and the request is big enough. The effects of the heap here can be mitigated quite easily: instead of using the heap all the time for allocating and freeing stuff, we will just allocate it once and then reuse it again and again. So we use the heap just once and then we don't use it, and we avoid the heap issues. This is called object pooling: you allocate stuff once, store it in some sort of pool, and then you take it from the pool and place it back, bypassing the heap. Even though the first time you do allocate it on the heap. What you get from this is, firstly, that you do not pay for the lookup time in the heap. If you remember, the heap is storing blocks of different sizes, sorted somehow, so it needs to be something like a tree or hash. But here all the objects are of the same size, so the pool can be just a list or a stack, right? Allocation and freeing can be done in constant time. We do not pay for lookup time anymore. Secondly, you can deal with concurrency in a more efficient way than the standard library. I mean, you can of course switch the heap to something like jemalloc or tcmalloc, right? We heard about it. It can make stuff actually faster. 
But if you do not want to, or you need more control in your code over those things, and you have this pooling thing, you can implement the concurrency yourself. And you have to agree that doing a concurrent stack or concurrent list is obviously much simpler than doing a concurrent tree or concurrent hash table or something, right? It can be done much simpler. Let's try. This is how I tried the first time. It's a good first try, right? Kind of. It's simple, that's why it's good. Sometimes it's even good enough, right? We don't need to over-engineer things. But in this case it makes not much sense, because if your code is very hot and you suffer from heap contention and you change it to this, then it will get even worse, because you will exchange heap contention for mutex contention. And secondly, you are still using the heap, because if you are storing in an STL container, any of them, you will be using the heap, and we don't want to use the heap. So it cannot be done this way. But it can be improved, it's not a dead end, right? This is how it can be improved. The alternative is to add local pooling. So instead of a single global pool for everything, we have one global pool and also, in each thread, a thread-local pool of limited size, in addition to the global pool. When threads allocate something with new or malloc or whatever, they take objects from the local pool, not from the global one. And when they free objects, they place them back into the local pool. And this local pool can be done very, very simply. It can be just a list, an intrusive list, and that's it. It doesn't need mutexes or anything, because each of those local pools is used exclusively by one thread. But when the pool inside some thread becomes empty and it wants to allocate more, it will take a batch of objects from the global storage and will reuse this batch until it also ends, and so on. 
On the other side, when they are freeing stuff and the local pool becomes too big — because it's limited in size, it cannot grow infinitely — they will move it back into the global storage, so other threads can reuse it. This way we get, firstly, that the heap is used rarely. It is used in bulk, when it is used, to allocate many objects at once — not four, but like 64 or 128. Secondly, it will not be used at all after some point, when all the pools get saturated. And thirdly, there is no contention on the single global pool. This global storage can be protected with a mutex, but it is used so rarely that this mutex contention will not be visible. It will be used at most every 64 allocations or so. So it's 64 times less contention, which means it will be basically almost zero, negligible. If the explanation was too bulky, I prepared an example of how it works, like a real-life example of how it could look. Imagine that we have those three threads and an empty global pool. All is empty in the beginning. The first thread wants to allocate something. It will take a look at the global storage. There is nothing, so it has to allocate a new batch. New batches are allocated on the heap. But then when it allocates objects, they will be taken from this batch. No more heap allocations — just one heap allocation, and then from the allocated batch we take objects one by one. Then the second thread, the same: it has its local pool empty, nothing in the global storage, so it had to allocate a second batch. They keep using the objects from the local pools. So far, we only did two heap allocations. But then something happens which happens very frequently in backend code: those objects, they migrate into another thread. It happens when you have dedicated threads for networking: they read data from the network, they parse it, create this struct request, push it into some sort of queue. And this queue is consumed by other threads doing business logic, and they will delete the request. 
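The scheme just described can be sketched in a single file like this. This is a minimal sketch under stated assumptions, not the speaker's actual library: `Request`, the batch size of 64, and the local limit of 128 are all illustrative, and the freed object's memory is reused to hold the free-list `next` link.

```cpp
#include <cstddef>
#include <mutex>

// Pooled object: the `next` pointer is only used while the object
// sits in a free list, so it costs nothing while the object is live.
struct Request {
    char payload[1024];
    Request* next = nullptr;
};

static constexpr size_t kBatch = 64;      // objects taken from global at once
static constexpr size_t kLocalMax = 128;  // local pool size limit

// Global storage: touched rarely (once per batch), so a mutex is fine here.
static std::mutex g_mutex;
static Request* g_free = nullptr;

struct LocalPool {
    Request* head = nullptr;
    size_t count = 0;

    Request* alloc() {
        if (!head)
            refill();
        Request* r = head;          // fast path: no locks, no heap
        head = r->next;
        --count;
        return r;
    }

    void free(Request* r) {
        r->next = head;             // fast path: push onto local free list
        head = r;
        if (++count > kLocalMax)
            spill();
    }

    void refill() {
        std::lock_guard<std::mutex> lock(g_mutex);
        for (size_t i = 0; i < kBatch; ++i) {
            Request* r;
            if (g_free) {           // reuse what other threads spilled
                r = g_free;
                g_free = r->next;
            } else {
                r = new Request();  // the heap is used only here, in bulk
            }
            r->next = head;
            head = r;
            ++count;
        }
    }

    void spill() {
        // Local pool overflowed: move half of it back to global storage.
        std::lock_guard<std::mutex> lock(g_mutex);
        while (count > kLocalMax / 2) {
            Request* r = head;
            head = r->next;
            r->next = g_free;
            g_free = r;
            --count;
        }
    }
};

thread_local LocalPool t_pool;      // one pool per thread: no locking usually

Request* request_new() { return t_pool.alloc(); }
void request_free(Request* r) { t_pool.free(r); }
```

The fast path (local pool non-empty, not overflowing) touches no mutex and no heap, which is where the speedup in the talk's benchmark comes from.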
So most often it happens that you have one thread allocating requests and other threads deleting requests. Objects will migrate. So here they migrated. And this other thread completed them somehow and tries to free them. They do not fit into its local pool — it is limited in our example to four, so the fifth item didn't fit. And to fit more, it will have to migrate this pool into the global storage. And then it can keep freeing stuff. Now a little bit more random work happens, some more migrations. And then we are in a situation where the second thread wants to allocate something, but it doesn't have anything locally, so it will go to the global pool. And this time we have a free batch there, so we take it and use this batch. So far, during this entire demonstration, we have used the heap only two times for all those allocations. And at some point, after some more work, we will not use the heap at all. It will be all saturated. Work continues like that. How could it look in the code? Yeah, visible, good. I have this benchmark — the link is on the slide, everything is already open source, you can reproduce it yourself. I have this value class whose size is configurable at compile time via templates in C++. And I'm testing it with sizes of one byte, half a kilobyte, and one kilobyte. And I also have the same value class, but thread-pooled by the algorithm which I just explained before. And in C++ — no matter how much we can argue whether it's good or bad, full of unnecessary stuff — templates are sometimes very nice. In this case, I implemented the pooling in templates just once. And all I have to do is simply inherit this magic thread-pooled class, and my class becomes thread-pooled. I can apply it in as many places as I need, and all the types will become thread-pooled with their own independent pools. So I'm comparing the value class versus the pooled value. The comparison itself is that I have many threads; each thread allocates many, many values, and then frees them in random order. 
And then again, and then again. And I am testing how fast this freeing and allocation is, depending on the number of threads and so on. And those are the results, which were surprising for me, to be frank: even for the single-byte case I got a speedup. Normally the heap is very fast for small allocations like those few bytes — the standard heap is actually extremely fast. But for some reason my pooled version was even faster than that in the single-byte case. And my most interesting, relevant case was twice as fast, which was good enough. And it can actually be quite visible in the final RPS. Of course, you have to benchmark everything. You shouldn't just blindly make everything thread-pooled, thinking this stuff will get faster. Probably it will not. You have to apply it case by case, measure performance, see how much it helps. I have seen in my experience that this can help and can be observable in the final RPS, this simple thing. What else can we do with the heap? Intrusive containers. To understand the problem, which again mostly comes from STL, from the STL containers — those list, map, unordered things, forward_list — the thing which unites them all is that they are not intrusive. And to show the point, let's have a look at the list. Lists are the most popular type of container, at least in my type of work, in the stuff which I'm coding; I very frequently use lists. And the problem with the list is that when you push something into the list, it will not be directly saved there. It will be copied and saved into a link container object, this gray cube. Even if you store pointers, this pointer, those eight bytes, will be copied. Not your object, but something will be copied, and it's unavoidable. And it will be copied into this link container thing, allocated on the heap every time you push into the list. And when you pop from the list, it will be deleted. So every operation with the list costs you heap operations. 
Secondly — which is not so obvious, but it also hurts performance — when you store pointers in an STL list, iteration of the list becomes slower. Because when you store pointers and you want to get into your object to dereference some member, for example, in your struct, you will first have to dereference the link container and then dereference your pointer to get to the member. You have two memory access operations, and they are not free. This arrow thing costs something. So we have an additional memory lookup simply because of how the STL list is implemented. What can be done about this? An intrusive list. The basic idea is that we add those links — next and previous pointers, which link the items together — into our object directly, like in the old C times. When you ask a student to implement a list, they do this. And probably they are doing it right, because we will not have heap usage here on every push and pop — we don't need intermediate link container objects to allocate and delete. And secondly, we don't have this additional memory lookup, because to get your data, you just dereference your pointer and directly get to the data. No intermediate objects. The only problem with those intrusive containers is that they are quite bulky, at least in C. This is a huge pain: maintaining those next and previous pointers, head and tail of the list, and you do this every time for every type that you want to store in a list. This looks quite bad; it's quite hard to reuse such code without C++ templates. With C++ templates, you can implement intrusive lists just once and then reuse them. On the slide, there are links to a forward list and a doubly-linked list implemented by me. On the left side, you can see how the API looks for the forward list, and on the right side, how it's used. So I have this object, something. 
I simply add this next pointer in any place of my object, and I can instantly use it in intrusive lists, with the intrusive list implemented just once using templates. And this member name, next, is customizable, so you can change the name of the member as well. What you can get in performance if you apply intrusiveness is shown in the benchmark linked on the slide, as usual. I'm comparing a list of pointers with an intrusive list. It's a list of pointers because usually, just like I said, in my code I prefer to manage the lifetime of my objects myself. When I have an object, I push it into the list — so I have the object before that. And when I pop it from the list, I usually keep using it for a while after that. So I don't want to copy the entire object just to store it in the list; that's why I usually store pointers. And an intrusive list stores pointers by design, so I'm comparing kind of similar cases. The benchmark measures the time of list population — how fast I push items into the list — and list walking — how much this additional memory lookup costs me. It's interesting, right, whether this small arrow thing is even visible in any measurements? This is what you get when you switch — at least in this benchmark, right? You might not get this speedup in your case, but in this benchmark, indeed, and in my experience it also sometimes does. I've got almost three times the speedup for list population, because I no longer allocate those link containers, firstly. Secondly, you see this walking speedup of 7% — very small, almost noise. But it's not noise, it's reproducible. Every time you run this benchmark, you will see this difference, which comes — it's not much — from this additional arrow thing. And it's not much, but it doesn't mean that you should just leave it, right? Why have this performance loss if you don't have to? Those small things pile up into something bigger. 
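A minimal sketch of the intrusive forward list idea, implemented once with templates. This is not the speaker's actual implementation (that is linked on the slide); here the link member is passed as a member-pointer template parameter, defaulting to a member called `next`:

```cpp
// Intrusive singly-linked list: the list stores no link containers,
// it just threads a `next` pointer that lives inside the user's object.
template <typename T, T* T::*Next = &T::next>
struct ForwardList {
    T* head = nullptr;
    T* tail = nullptr;

    void push_back(T* item) {       // no heap allocation on push
        item->*Next = nullptr;
        if (tail)
            tail->*Next = item;
        else
            head = item;
        tail = item;
    }

    T* pop_front() {                // no heap free on pop
        T* item = head;
        if (item) {
            head = item->*Next;
            if (!head)
                tail = nullptr;
        }
        return item;
    }
};

// Usage: just add a `next` pointer anywhere in your struct.
struct Something {
    int value = 0;
    Something* next = nullptr;      // the intrusive link
};
```

Walking the list is one dereference per item (`item->value`), not two, which is where the small but reproducible iteration speedup in the benchmark comes from.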
In my experience, this was all the easy stuff with the heap for which we have time. We can also have a look at thread contention things. What is thread contention? It appears when you have multiple threads which try to access certain critical sections at the same time, like mutex-protected data or something like that. And when this happens too frequently, it can cripple your performance, cripple the parallelism of your code. So your code will not be as parallel as it could be. And the result could be something like: you have a 64-core machine, you log into it, you type htop, and you see two cores used, right? It's not a good situation — paying so much money and then getting this. You are not utilizing all the resources when you have thread contention, or you are utilizing them on the contention itself, not on something useful. And what can we do about this quickly? Like it's a first aid kit, right? So it should be something easy and quick. First thing: false sharing. Let's start with the case when you think it's easy stuff: I know this, I am a master of contention, I don't have it, this is how I protect myself from contention. I placed a link on the slide with the benchmark, and the example is that I have this object with two members. One member is always accessed in one thread, the other member is always accessed in another thread. And it seems like I don't have contention, because I am not sharing any data between the threads. And I have this benchmark which does some amount of work for 10 seconds or so, which looks good enough. But if I do it like this, I get five times the speedup — by adding these 64 bytes of unused data between the members of this struct. What is the logic here — I increased the size of this struct and I got five times the speedup? Should I just make all my structs bigger, the bigger the better, and they will get faster? To understand the reasons behind this, you have to understand how the CPU works with memory. 
The thing is that the CPU cores in your CPU don't access the main memory, the RAM, the bus, directly. They do it through this proxy thing called the CPU cache, which is, to put it simply, basically one cache per core, right? Not to dive into too much detail. And this cache thing is accessing the main memory for the CPU, and the CPU is reading the cache transparently. The cache has those blocks of fixed size, which are copies of small, small parts of the main memory. And those small blocks of fixed size, 64 bytes or 128, we call cache lines. And all works fine and fast until we get the case when multiple CPU cores for some reason start reading and writing the same cache lines. For example, at the same address, one thread is doing writes, other threads are doing reads. Then we get contention, and the CPU has to perform this very expensive synchronization of the different cores, so they store the same data for the same address. Because it shouldn't happen that for the same address, different threads see different values, right? And this synchronization of the cores is very expensive. This is where the slowdown happens. And what could happen, and did happen in our case, is that data which was seemingly unrelated — different bytes — by bad luck just happened to be in the same cache line. And we got contention on the cache line on the hardware level, not on the application logic level. Simply because when you work with memory, you always work with at least the size of a single cache line. Even when you access a single bit, the entire cache line of 64 bytes containing this bit will be used by the CPU, by the cache. My fix was as simple as just adding padding to split my data into separate cache lines. And now I no longer have contention. This is how I got five times the speedup. This is measurable in the final RPS as well; it can be visible when you fix it. Just when you're fixing it, make sure that it makes sense. 
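The false-sharing setup and the fix can be sketched like this. A minimal sketch, assuming a 64-byte cache line; the struct and counter names are made up, and `alignas(64)` plays the role of the 64 bytes of padding from the slide:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Two counters, each touched by exactly one thread -- no logical sharing.
struct CountersShared {
    std::atomic<uint64_t> a{0};   // a and b land in the same cache line:
    std::atomic<uint64_t> b{0};   // the cores fight over it anyway
};

// Same counters, forced onto separate cache lines: no false sharing.
struct CountersPadded {
    alignas(64) std::atomic<uint64_t> a{0};
    alignas(64) std::atomic<uint64_t> b{0};
};

// Benchmark-style workload: one thread hammers a, the other hammers b.
template <typename C>
void hammer(C& c, uint64_t iters) {
    std::thread t1([&] { for (uint64_t i = 0; i < iters; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (uint64_t i = 0; i < iters; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}
```

Timing `hammer` on both structs is the experiment from the slide; the padded variant is the one that should win, and as the talk says, you should measure before scattering padding everywhere.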
Like I said, don't just add 64 bytes of padding everywhere you think you're sharing data. Add it, test if it makes sense; if it doesn't change anything, then just don't add it. It's as simple as that. What else can we do with thread contention? Have a look at memory ordering. If you have a highly loaded multi-threaded application, it's very, very likely that you also have those atomic operations in your code, like std::atomic in C++ and the __sync and __atomic builtins in C compilers, which all do the same thing, basically. And besides their arguments, they also take this mysterious memory order thing. There are plenty of those orders. What they do is regulate how much freedom the CPU has about executing this instruction and the instructions around this one. Without explicit ordering, the CPU can execute your instructions in any order it wants. Even if you turn off all of the compiler optimizations and your machine code looks absolutely linear, even if you have a single thread, those instructions inside a single thread can still be completed in a different order. It doesn't matter in which order you wrote them in C or C++ or whatever you're using. Example on the slide: we have those three variables, a, b, c, starting at zero. And we have one thread assigning them 1, 2, 3 in order a, b, c. And another thread reading them in the opposite order, c, b, a. It looks impossible by all logic, but it is in theory possible on some CPUs that you will get printed 3 for c and 0 for b. It looks impossible because if the second thread sees c equal to 3, it means it should also see b assigned, right? Because b was assigned before c. But it could happen that it will not see this, because, for example, the read of b in the second thread could be completed before the read of c. Or the write of b in thread one could be completed after the write of the variable c. 
We don't have any guarantees from the hardware when we are talking about observability of one thread's state from another thread's point of view. And if you think you are safe on x86, you just don't use ARM and ignore the problem — the bad news is that you still have reordering on x86. There is an example on the slide, by this link, which you can compile and run, and even on x86 it will demonstrate reordering. Some instructions, logically impossible, will complete in a different order. Without any tricks, it's completely predictable machine code, and it will happen even on x86. We will not dive into the details of each possible memory order — it's too much time — but I will demonstrate what kind of speedup you can get if you study memory ordering and use the correct ones. Benchmark on the slide, link as usual, and the benchmark is very simple. I have this loop, single thread — I'm not even using multiple threads here, it's just a single thread using atomics, to demonstrate the point. It has this loop where I'm using std::atomic, and it runs in two versions. First is the default std::atomic operation with sequential consistency order, on the right side: memory_order_seq_cst. It is the default when you use std::atomic and don't specify a memory order. It is the safest and strictest order; when you use it, the code works like it looks. This is why it is the default — otherwise people would have to bother when they don't care. But it is overkill in this case, it's too expensive. And in my case, relaxed order is enough. It is actually enough in most cases — like in shared pointers, relaxed order is enough for incrementing reference counters. And I'm just comparing this loop with relaxed and sequential orders. Just think of a number: what do you think the speedup would be? Probably you're thinking zero, because if you know x86, you will tell me that it will produce the same machine code. On x86, writes have the same guarantees whether it's sequential consistency or relaxed order, it doesn't matter — x86 is safe, right? 
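The two loop variants being compared can be sketched roughly like this; the exact benchmark is linked on the slide, so this is just an illustrative reconstruction of the shape of the experiment:

```cpp
#include <atomic>
#include <cstdint>

// Version 1: default operations -- every load and store is seq_cst,
// the strictest (and on x86 the most expensive) ordering.
uint64_t run_seq_cst(uint64_t iters) {
    std::atomic<uint64_t> counter{0};
    for (uint64_t i = 0; i < iters; ++i)
        counter.store(counter.load() + 1);   // defaults to memory_order_seq_cst
    return counter.load();
}

// Version 2: same loop, but relaxed ordering -- still atomic, but the CPU
// and compiler are free to use a plain load/store instruction.
uint64_t run_relaxed(uint64_t iters) {
    std::atomic<uint64_t> counter{0};
    for (uint64_t i = 0; i < iters; ++i)
        counter.store(counter.load(std::memory_order_relaxed) + 1,
                      std::memory_order_relaxed);
    return counter.load(std::memory_order_relaxed);
}
```

In a single thread both versions compute the same result; the entire difference is in which machine instructions the compiler is allowed to emit for the stores.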
But I got 16 times the speedup here when using relaxed order. It was x86, it was a modern compiler, the loop was not optimized out, it was -O3, top optimizations. And still I got a 16-times speedup of this loop. What happened here exactly? If I open the machine code, this assembly stuff, I will see that the relaxed order was compiled into a single mov operation, while sequential consistency was compiled into this xchg operation. The reason is that on x86 there is only one possible reordering, and the sequential consistency order protects from this type of reordering using this xchg operation, which gives more guarantees than a mov operation. And the problem is that in this case it wasn't needed. So I just requested too-strict guarantees where I didn't need them, and I paid a 16-times slowdown for it. And in fact, in my entire career, I have never seen a case where sequential order was needed. It is needed in such extreme, weird cases that I have only seen artificial examples. I have never seen it needed in actual production code. The only orders I ever needed were relaxed or acquire plus release, nothing else. So this is the kind of speedup you can get. For fun, go to Godbolt and try to compile the same code with clang. It will be even more interesting — the amount of machine code from clang for this simple loop simply didn't fit on the slide, that's why I didn't put it here. What else can we do with thread contention? Lock-free queues. In backend code, it's very, very frequent that you need some sort of queues, sometimes in multiple places of your application. And the usual use case is that you have multiple threads producing something for the queue, like requests: they read from the network, allocate a request, validate it, and push it into the queue. And other threads are doing, for example, business logic: they take objects from the queue, process them, and then delete them, like on the slide. 
How can we do this? We start simple again. If we don't have much load, then this solution is actually just fine: we have this queue, it's just a mutex-protected container from STL. It works fine. But if you have hundreds of thousands of RPS, or millions of RPS, on this queue, then you will get mutex contention here, guaranteed. What you can do about this is just get rid of the mutex. And there are solutions for how to do this, called lock-free queues, which allow you to have a queue without a mutex and without STL, and it will still be thread-safe. The problem with those queues is that there is no one major queue which is best for all cases. The implementation of a specific queue very much depends on what kind of queue you want exactly. There are those four types of queues, depending on how many threads are producing into the queue and how many threads are consuming from the queue. And you also have to know whether the queue should be limited in size in your case, and what happens when the size limit is reached. So when you understand your case, you can choose one of the queues, one of the implementations. There are many, many implementations; I just placed a few of them on the slide, for all the queue types. Two of them are mine. One of them is the very popular — according to GitHub stars — cameron314 concurrentqueue. And there is also this very nice website, 1024cores.net. Who knows it? It's a very nice website which not only contains source code of various queue types, but also actually explains them in simple language. So you can go there and educate yourself about how those queues work and why, what is lock-free, what is wait-free, what are all those memory ordering types. It's all explained on this site, very understandable stuff. And like I said, don't just use a multi-producer multi-consumer queue for everything. If you have, for instance, a single-producer single-consumer case, the queue can be made much faster than the former. 
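To make the single-producer single-consumer case concrete, here is a minimal sketch of a bounded SPSC ring buffer with no mutex — not one of the implementations linked on the slide, just an illustration of why this queue type can be so much simpler than a multi-producer one. It also shows acquire/release ordering in action, tying back to the previous section:

```cpp
#include <atomic>
#include <cstddef>

// Bounded lock-free queue for exactly one producer thread (push) and
// exactly one consumer thread (pop). One slot is kept empty to tell
// "full" apart from "empty", so Capacity-1 items fit.
template <typename T, size_t Capacity>
class SpscQueue {
    T buf_[Capacity];
    std::atomic<size_t> head_{0};   // written only by the consumer
    std::atomic<size_t> tail_{0};   // written only by the producer

public:
    bool push(const T& item) {      // producer thread only
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire))
            return false;           // full
        buf_[tail] = item;
        // release: the item write above becomes visible before the new tail
        tail_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& item) {             // consumer thread only
        size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return false;           // empty
        item = buf_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }
};
```

Note that relaxed plus acquire/release is all this needs — exactly the orders the talk says ever come up in practice — and that a multi-producer variant would need considerably more machinery.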
So just be careful what you choose. And this is the kind of speedup you can get when you simply switch from a mutex-protected to a lock-free queue. This is a benchmark of my two queues, with multiple producer and consumer threads, and those are the numbers. So this also can be visible in the final RPS of your application. Just make sure you test it before you apply it. All of this stuff I'm mentioning today, it makes sense to test it first to make sure you actually need it. What else can we optimize quickly, first-aid-kit style? Networking. In backend performance, very often like 90% of the performance consists of how efficient your networking is — how efficiently your data is received and sent. And in cases like one connection per request, this HTTP stuff, it also matters how quickly you can accept and close clients. So socket creation and closure also matter, how fast you can do this, in those types of scenarios. And quick stuff we can fix here is, for example, scatter-gather I/O — a link to the benchmark is on the slide. And the use case is this: imagine you have multiple buffers that you want to send. Each buffer can be a separate message, or each buffer can be part of a single message, like a chunked response or something. And you want to send multiple piled-up buffers into the socket. How do you do this? Do it in a simple way: you just run a loop where you call send on every buffer, right? It works, obviously. And in this benchmark, I got a speed of two and a half gigabytes per second on a local socket pair, without real networking. I was sending 16 one-kilobyte buffers every time, calling send for each; all works fine. But if I do it like this, I suddenly get two and a half times the speedup. And what I changed is that instead of a loop of send calls, I did a single sendmsg call. Even though the code on the left side looks bigger, it was this much faster. In practice, I have seen this switch make my code 10 times faster. 
It just depends on how many buffers you are trying to send at one time, and of which size. In this case, 16 buffers, each one kilobyte in size, local sockets — I got this speedup, but it can be better. And where is the speedup coming from? The thing is that on the right side I did 16 send calls, and on the left side I did a single sendmsg call. And send and sendmsg are in fact system calls. Very, very expensive: you switch into the kernel context when you call those things, and this is extremely expensive, basically. Every system call is always very expensive stuff, and you should avoid them, make as few of them as you can. In this case, the speedup comes exactly from this: I simply made fewer system calls. I sent multiple buffers into the kernel at once. And a single system call is many orders of magnitude more expensive than just filling this iovec array, even if it's something like 128 buffers. Sending more doesn't make sense — 128, as far as I remember, is the limit in the kernel anyway; it will not accept more. Funny thing: when you try to send more, sometimes the kernel can return errors, or it will just do a partial send. It can return an error like "too many buffers". Anyway, this is what I observed, at least. So the solution here: if you have multiple buffers to send, simply use sendmsg and recvmsg instead of looping send and recv calls. And of course, it only matters if you have more than one buffer. If you have just one buffer and you switch from send to sendmsg, absolutely nothing will change. Some people might already be thinking: why didn't I use the readv and writev calls? They look simpler — I don't need to fill in this message header object, I can just pass the array of iovecs directly, right? And they work at the same speed. 
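The sendmsg side of the comparison looks roughly like this. A minimal sketch: `send_buffers` is a made-up helper name, and for brevity it assumes at most 16 buffers, like in the talk's benchmark:

```cpp
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>

// Send several buffers with one sendmsg() call instead of a loop of
// send() calls: one syscall (one kernel context switch) instead of N.
ssize_t send_buffers(int sock, char** bufs, size_t* lens, size_t count) {
    struct iovec iov[16];               // assuming count <= 16 here
    for (size_t i = 0; i < count; ++i) {
        iov[i].iov_base = bufs[i];      // filling this array is cheap
        iov[i].iov_len = lens[i];       // compared to a syscall
    }
    struct msghdr msg;
    std::memset(&msg, 0, sizeof(msg));  // no address / control data needed
    msg.msg_iov = iov;
    msg.msg_iovlen = count;
    return sendmsg(sock, &msg, 0);      // single syscall for all buffers
}
```

The loop version would be `for (i...) send(sock, bufs[i], lens[i], 0);` — same bytes on the wire, N times the syscalls.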
The problem with those system calls, readv and writev, is that when you use them, they are accounted in the kernel as disk operations, even if you use them on a socket. I don't know why, but it is a fact. So when you are using read/write calls on a socket and you check this /proc/[pid]/io file, it will grow even if you call those functions on sockets; they will be accounted as disk operations. If you don't care about the statistics, then you can use those functions. But if you care, try to use sendmsg and recvmsg. They are portable, available on all the Unix-like systems. So, good stuff. What else can we optimize? Event queues. It of course depends on the application very heavily, but backend servers can often be quite loaded. We can easily have tens of thousands of clients, and the same number of sockets, in one server process. And all those sockets can generate events: sockets can become readable, writable, receive out-of-band data from TCP, or receive errors and stuff, or custom events. And we need to handle all those sockets somehow, at once. And there are three ways to do it — without ridiculous solutions like one thread per socket or one process per socket, which are not scalable at this scale. The solutions are periodic polling, reactive polling, and event queues. Those are made-up names, I just made them up myself; it's not like they can be found somewhere. And we will go through each. So, periodic polling is the simplest approach. As simple as: you just have a loop where you iterate through all the sockets, you try to read and write each, and then you sleep, and then you repeat. This way you don't spin in a busy loop, and you still handle all the sockets. The problem with this solution is, firstly, that we will have additional latency here, because imagine that socket number N becomes readable — to get to this socket, you first have to try to read the N minus one sockets before it. 
That will cost you time if you have thousands of sockets. Secondly, you lose latency here, because imagine a socket became readable just as you started a 100-millisecond sleep: you waste 100 milliseconds of latency for absolutely no reason. And thirdly, you waste CPU here, because you will be doing lots and lots of unnecessary system calls. If a socket is not readable and you call receive on it, you just wasted a system call, wasted some CPU time. This can be easily fixed with a couple of solutions, one of which I am presenting only for the sake of you not using it, because the select thing is deprecated. It gives undefined behavior on Linux if you have more than 1,024 sockets, or even if just one of them is bigger than 1,024 by value. Even the documentation advises you not to use it. So there is an alternative, poll, which works quite fine even these days. It takes an array of descriptors, with the events you want to listen for in the events field. When you call poll, it blocks until any event happens on any of those sockets, and when it returns, it fills in the revents field of all the descriptors with the events which are available for each socket. This is approximately how it looks in the code. You call poll on all your descriptors. When it returns, you have the events, and you scan all the sockets, checking which socket has which events. Then you don't do those unnecessary system calls: you only read when a socket is readable, and write when a socket is writable. And I have this benchmark, clickable on the slide, where I have 5,000 clients and they are sending 5 million messages in total. And only a part of the clients is active at a time, which is realistic; it's not like all the sockets are active all the time. This is the kind of speedup I get when I switch from periodic polling to poll.
I got a 25% speedup instantly, and I did zero system calls which ended up with EWOULDBLOCK, while periodic polling did 120 million of those system calls which were not needed. And thirdly, periodic polling wasted a huge amount of CPU time because it was spinning in a busy loop. I didn't even have sleeps in this case; if I added a sleep to periodic polling here, it would get even slower. Here I didn't have sleeps, and still it was slower, and it wasted a huge amount of time on those unnecessary system calls. This is not the end; we can optimize it further with one last optimization: event queues. The idea is that instead of having the socket array in user space, we can have it in kernel space. The kernel will monitor your sockets all the time for events and notify you when something happens. This is epoll on Linux, kqueue on Mac and BSD, and I/O completion ports on Windows. So the idea is that you create this event queue, you add sockets one by one into the queue for monitoring, specifying which events you want to monitor, and then you call this epoll_wait thing to fetch the events. When it returns, you handle the events. It is as simple as that. So instead of passing all 10,000 sockets into the kernel on every call, we just call epoll_wait and get the events. This is how it looks in the code. We call epoll_wait on our queue, we get some events in return, and we handle just those events, without scanning the entire socket array. For example, if you have 10,000 sockets and 10 of them got events, you will just iterate 10 times here, not 10,000 times. And the rest is the same as with poll: we read where readable, write where writable. As simple as that. If I apply this on top of poll, I get another 30% speedup, even with a single thread. You can of course optimize it further, but those were the simple optimizations.
This was all the stuff we had time for, but there is also some additional content with eight other small, simple things which you can apply in your code. They're all clickable; you can click on them after the talk, or ask about them as questions right now if you like, or ask me afterwards outside about those other optimizations. And this was the end. Thanks for your attention. If anyone has any questions, then I believe we have time to take a couple. Thank you. Amazing talk. Thanks. You mentioned flame graphs a few times for the cache sharing issue. What kind of tooling do you recommend to detect those? For cache misses? The cache sharing, variables sharing through caches? Yeah, I think perf, for example, on Linux is able to measure this stuff. Okay. Perf is also able to build nice flame graphs, by the way. Any other questions? I have a question about the first example, or I guess the second one, but still in the first chapter of your talk, about the intrusive list. It's my understanding that the standard list in C++ is also an intrusive list, so I don't think that change should do anything. std::list is intrusive? Yes. I checked it for the presentation: it's not intrusive. Okay. So for example, when you have this std::list, right, the sign of an intrusive list is that you have the links inside of your objects. For example, if you store pointers and you have a pointer to your object, you should be able to just unlink this object from the list directly, right? Just link the previous element with the next element instead of you. So you can pop the item from the list in constant time, right? That's when it's intrusive. With std::list, you first have to locate it: you have to iterate the list, find your pointer there, and erase it by the iterator. Standard forward list is also not intrusive? Are you certain about that? Unfortunately, yes. In the STL we don't have it. Maybe we have it in Boost, I don't know. What, what's in Boost? Okay. Good stuff. Hello, thanks.
What do you think about io_uring? I haven't tried it myself in a real-life use case yet. Okay. But I heard that it can be faster than epoll. So basically the io_uring idea, as far as I understand, is the same as I/O completion ports on Windows, right? You just directly send data from your buffers without copying. Yeah, I guess it is possible with io_uring to make even fewer system calls. Yeah, yeah, perhaps. Could be good, could be great. But the idea falls into the same family of event processing, so we don't do a full scan of the socket array or anything like that. Those are, by the way, obviously not cross-platform solutions in networking. I don't think we have anything cross-platform enough besides maybe poll, right? And the rest could be something like Boost.Asio; that stuff works everywhere. If there are no other questions, then that should be it. Thank you all very much for coming.
20 Years of Open Source building XWiki and CryptPad
Okay, so hello everybody and thank you for coming so early. And for those that were not there before, since you came so early, there are a few free t-shirts if you want to take them. So I'm going to talk about the story of our company, XWiki SAS, building the XWiki and CryptPad open source software over the last 20 years. So first a bit about myself. I discovered technology in 1984, using an Apple II, and then I moved to PCs, I even moved to Windows 95, and then I graduated from a good school. And actually in that good school, we had a speech at some point telling us: you're a soldier of economic war. That of course resonated with a young person, but later on it was like, why are we doing war? That doesn't make any sense. We're not fighting other countries; we should work together with other countries. And then in 1995, I was really very interested in the internet. I saw people using Mosaic browsers in the school, and I really just wanted to work on internet technology. So I took one job about the internet at Capgemini, but after a few months I was recruited by Netscape, because somebody from the team had left for Netscape, and I ended up working there three years. So who knows Netscape here? Okay, not so bad. And so I was a consultant, and I also became a Mozilla fan. I even wanted to work for Mozilla.org inside the company when they launched it. They didn't take me, so I ended up working in a French startup. I wanted to stay in Europe, and that startup raised money and went IPO. I actually even was a virtual millionaire, and then there was the internet bubble, it crashed, and I was not at all a millionaire anymore, just like any other IT guy.
And that company was bought by a US company, and in that company I used wikis, and that's how I found wikis amazing in terms of how they bring people together and help people share knowledge. And that's how in 2004 I created XWiki. I was a bit accustomed to open source through what Netscape was doing. Netscape had a highly transparent organization and a way of sharing things; it was really pushing internet protocols and standards, and then they made Mozilla open source. And then I was a user of open source in my company as a CTO, installing and using Apache Software Foundation code and so on. So I was really coming from the open source world, and it was very natural, when I wanted to create a company and create a software, to create it as open source. But I was not as aware of the political aspects of open source; I was really looking at open source from the technical point of view. So that's how I started XWiki, and I'm going to continue from that in this presentation. Our company is now a member of APELL, which is an organization of companies that do open source in Europe. In France, we have the CNLL, and we have the Hub Open Source, OW2. I'm also on the board of Open Food Facts, a great association working on open data for food. And I'm a small shareholder of Murena, which is doing an open source phone, so you're welcome to look at it; I find it amazing work. So what is XWiki SAS? XWiki SAS is a French and European independent company. It means we've been self-funded, and when I say independent, it means it's self-funded and I still own majority ownership, and the very large majority of the shares is owned by employees and some ex-employees. Yes, I should stop. Okay, great. It's on HDMI, you put HDMI. The slides are here. No, no, you're here. And you're not seeing them. That's too bad. I took this cable. Sorry. Okay. So it means we control the company, which is actually something that's not so easy to achieve in tech companies.
We are at around 4 million revenue in 2023. We did 50% growth, which has been really very nice. And we have 60 people, mostly in France and Romania, but also some people in Germany and even two people in Brazil. We make two open source software, XWiki and CryptPad: one is the one we started with, that I created, and then CryptPad, which was created inside the company in 2016. We have an international community, and we are very engaged for digital sovereignty. We think open source is very important for gaining control of software, both for states and for individuals. And we have a business model that allows us to have revenue for that software so that we can build it. This is done through services, support, training, like any software company does, but trying to do it in a way that allows us to fund the open source software. So we have employees in all these countries. And what we're trying to do is enable freedom, both with the code but also with the products that we make; I'll come back to that. So, two software. XWiki is about knowledge management, sharing information. What's really interesting with wikis is that they really allow people to share and make knowledge available. We all know Wikipedia, but we do it for organizations. In that area, we have competitors such as Confluence or Notion, or even the wikis of Microsoft Teams. And the competition, in the end, for any open source software, has a high impact on how you can actually fund your work; depending on how you compare to the competition, it can be more or less difficult to find money. We have more than 7,000 installs and more than 400 clients. And XWiki is now part of the openDesk project. If you don't know openDesk, think about looking at it, Google it. We have also been doing CryptPad since 2016. CryptPad is an end-to-end encrypted document editing platform. Who knows CryptPad here? Okay, good.
I'd say its competitors are Google Docs and the like, and the real point is that it protects people's privacy. So I'll go on: how did we start? The big question is why be an entrepreneur in the end? In this talk I will try to focus more on the open source aspect of what we did, but when you talk about your company, it's also about the entrepreneurship and the difficulties of just running a company. I had this wish to create things and make them happen, and that was a bit at the core of being an entrepreneur. But one really important thing was to try to do something that's useful for people and has some impact. I wanted to do it in Europe. I've been to Silicon Valley and I didn't feel I liked the mood. The technology was great, but I didn't feel as good about the fact that people were just talking about money and how they would become rich all the time. And that really made me think: okay, I don't want to spend my life in a place where that's the goal. I want to be more in a place where we're talking about culture or whatever. Another aspect was that as an employee in companies, you sometimes feel your managers are not doing what you want them to do, or they are not fair, or the company makes decisions that you don't understand. And in the end, you can complain, stay as is, and keep complaining about what people around you tell you. But my feeling was: instead of complaining, just try to do better. And that's also a reason to become a manager or own a company and be the one that has to take responsibility for what's happening in the group. A big aspect is believing in the product and in the purpose of the product. One of the really important things that has motivated us at XWiki for 20 years is that we feel our products are missing, or not present enough in the world, and they're useful; they serve an important purpose.
When I started XWiki, I was a big user of wikis and a big user of task management tools. And I said, okay, we could do task management tools, we could do wikis. And I directed myself towards wikis because of the fact that they help share knowledge. Task management tools are a lot about efficiency, being more efficient in companies, and I felt knowledge is what's missing more: we're missing more the fact that we spread knowledge and that we educate people. And in the end, this has stayed with us for 20 years. So we have a lot of wikis inside companies that help people get more knowledgeable about what they do and about the work in their own company. But we also have a few public wikis. We have the Dictionary of the History of Switzerland, which is a publicly funded Swiss project about the knowledge of Switzerland. We also have a wiki about rare diseases. And you don't want to look at that website too much, because what the parents of the kids that have these diseases live through is sometimes really hard to look at. But it's highly useful for this community of parents living with the disease of their kids. We also have a wiki for public service in France, and so on. From my point of view, if you want to stay motivated about software for 20 years, you also need to really believe that your software is useful. And in the end, in 2016, we created CryptPad. We created the technology, but we decided to make it a product because we really thought it was doing something that was missing: protecting people's privacy. Too many software expose the data, are not built to protect the data, and CryptPad is a product that is built to protect the data. So now the problem is: if you want to do good software, and you're interested in doing it as open source, then how do you fund it? There are different ways. You can just raise money. That works.
There is a lot of open source software that is built by companies that have raised money. Even now, the modern way of raising money is doing some crypto thing, launching a token and getting millions of dollars. So it can work; I'll come back to why I didn't feel it was a good approach for us. You can be an open source volunteer, and that's great. But what I tried to do in this graph is measure the sustainability of each approach and how much impact it can have, and on the other side how fast you can develop things, but also the comfort of doing it. Because in the end, if you want to do that for many years: are you doing this under stress? Are you doing this having a good day and being able to have a good life aside? So: open source volunteering, donations, being an independent professional like a freelancer and getting paid for doing services around open source. Those are really good ways. And bootstrapping a company, which is what we did. And I feel, and this is what I want to show in this presentation, that it's a good way. You have a decent level of comfort. You can have speed, because if you hire people, you do more. There is the saying that you can go fast alone, but if you want to go far, you need to go as a group. And that's what the company allows: being a group that is funded, that has some money and can go further together. And you can see in this presentation the acceleration that we had over time between the beginning and now. So, investors: why not? I want to take a little bit more time on investing. It took us a little time to realize it was not what we wanted. I came from a company that had raised VC money, and I saw that you can create momentum: you can have money, hire highly skilled people, and build things fast. But in the end, the real thing that you need to think about, the day you take VC money, is who is the real boss and who holds the keys to the decisions in the future.
Whenever I had discussions with investors, beyond the fact that they tend to like the salespeople or the business people more than the tech people, which might be a reason for them not to give us money, the problem for us was: okay, do we agree on where we want to go? In the end, investors are in it for a return on investment, making more money with the money that they put in. And as an entrepreneur, that's not what I was in it for. I was in it for the human relationship with the employees, running a project over the long term, and creating open source. And when you discussed open source in France at the time, it was also quite simple: they didn't understand open source, so you had to explain it. Today it might be better: oh, open source, great, let's do open source AI in France. They love it right now and they tell you, oh, it's great. But what is their goal with it? Do they want to sustain that open source AI, for example, or do they just want to make a play to take a piece of the market and then cash in at some point and close the work? This can still create good open source, and that's fine. But if you have a goal of being, for example, good to your community, you should not lie to them, not tell them that you're doing open source while having a hidden agenda about how you're going to make money. That's going to be difficult. And I felt as a CEO that if I raised money, I would start lying to my customers about what our real goal is with this open source project. Being independent allowed us not to be that. It's much slower. It was much slower. But in the end, it's more important to do it like that. In the end, money is a means, not a goal. That's really a thing to think about. So what was bootstrapping about? From 2003 to 2010, it took seven years to get to one million of revenue. It took a lot of time. It took almost three years of myself not getting paid; I found some other ways.
And then any time we made a little bit of money through services, we would put it into the product: we would use it for hiring more people, growing the product and making it better. One of the great things with open source is that you can build on other people's software, and that's magical. You can really reuse a lot. And that's actually what the proprietary software companies are doing: 90% of proprietary software is actually open source software, and they keep control of the last piece, trying to cash in or build some business model around our data. But 90% of it is open source, and that helps the open source companies too; we can build on that. The support of the community is huge. Services are a good way to start. Doing only services has problems over the long term, but it's a good way to begin because you sell time, you make money, so you don't take risk; services don't have that level of risk. Another aspect that allowed us to go from zero to one million is European research money and French research tax credits. In France you have a lot of help for research. If you do something innovative, you can bring it to the state and get some taxes back, so you will have less cost as a company. This is, for example, more difficult if you are an association: you can get subsidies, but you won't get social charges back for doing research; it's going to be more difficult. And then you have European research projects: you can group with other companies, and we had the chance in 2007 to join some other companies in projects and get some funding through that. In the end, over the 20 years of XWiki, I calculated that we received 10 million euros of European research grants, of projects in France, and so on. And that, in the end, was our VC. Getting 10 million from a VC is quite difficult. It took 20 years to get that, but it allowed us to fund the software.
Another thing that happened in that time is that we went to Romania in 2006. It was initially through the Google Summer of Code: we had a student that was in Romania, we gave him some projects, he applied and he was really great. At the time, I didn't have money to pay people a lot; it was difficult to hire a full-time employee in France. There was competition on cost. Romania was really an emerging country in the tech industry, with great scientific skills. And we hired some of the first people there. They all stayed: the first three that we hired are still working with XWiki today, and we have 25 people now in Romania. It was initially a cost-driven decision, with the opportunity to have people with skills. And over time, it has become a fully integrated team that also believes in open source. So we hope that we also had this little effect of bringing some open source to Romania, because we're one of the rare companies doing open source in the city we are in, alongside Amazon, Microsoft and so on. And we also have, thanks to Romania, a lot of women in the team. And it also happened that there were some couples created at XWiki. So as an entrepreneur, it makes you think about the impact you have. So that's just a graph of finances; I'm revealing our finances. I'm not going to detail them, but it can show you the split. The most important data here is that when we started, we had 0% recurring revenue, and after six years, 20% of recurring revenue from support. And in the end, that's the goal: the goal is to increase the support revenue. It takes time. You need a great product. You need to reach maturity in the product. But it grows over time. So it's all about the strategy to make that recurring revenue grow with the users and customers of the software, whatever the goals are, whatever the type of recurring revenue. So it took a lot of time, and it grew to 20%.
One thing that I really want people to think about is that there is no success in open source without a good product. There are a lot of people who think that the open source business model doesn't work, but in reality, the product is just not competitive. There is a huge number of products, including in the open source world. If you don't make a good product, it's not going to work. And so you also need to think about the strategy of directing revenue towards the product. When you do services, that's part of the problem: you might diverge from the product roadmap needed to make a great product, because you're going to follow what some customers say instead of following what all customers need. So you need to think about that. One of the things we learned over the 20 years is that it can be a good idea to make the services conditional on taking the support, which allows giving extra funding to the product and dedicating people to work on the product. And there are also some companies, for example Nextcloud: one of the things they do is that they don't sell you services. They make you pay a bit more for the product and the support, and they give you the services. That's also an interesting strategy, which is going to raise the product revenue and really make the company focus on the product; the services will then be used to make the product better. So we need to think about focusing the revenue towards the roadmap. That's the case, for example, with the research projects. Another aspect is that the community is super important. It's your marketing. It's also your insurance: customers will find it reassuring that you are open source. And it's also your recruitment tool: you'll find developers there. We have hired so many people that came through the community. And it's also very important, for the community, to be a good open source citizen. That's also how you look; you see whether companies are really true about open source.
Are they really working with the community? Is the community open? If a software doesn't take patches, doesn't take pull requests, is not discussing with people about how the software should be, you could question their motivation to really do open source. At XWiki, for example, we don't have that many contributors, because we're moving fast on our end, and it's not fully natural that people come and give you code; it doesn't happen like that. It's a challenge to make people give you code; it doesn't happen for all software. So the fact that the community is not huge around a product, from my point of view, doesn't necessarily say that it's not a good open source community, because it also depends on whether people want to come. At XWiki, we have a fully open development model, but we don't automatically get people coming. We're using Apache Software Foundation kind of rules for running the community. You can find our code, you can comment, discuss in the chat, and so on. Some companies bring their products to a foundation; that's also an approach. One thing is the relationship with the customer in open source. At the beginning, I realized that you talk about open source: oh look, it's great, open source, you're going to be more free, no lock-in, etc. And you talk to some large companies, and the thing is, they don't give a shit. They don't care about this. They just want the best product at the lowest price possible, efficiency. Some people do care in the end, but you have to find them in companies, find the people that can be sponsors of open source. Today you have OSPOs in very large companies, even in the European institutions, in public services. These are the sponsors, but the majority of buyers of software are looking for the best software at the lowest price. And that's why you need to be competitive, to show them that you also have the best software. And there is a difficulty with the marketing of the proprietary products.
There is so much marketing of the proprietary products that it clouds the vision of the customers. They get stuff for free; they don't look at the long-term price evolution of software. We lived it in our competition with Confluence. Confluence recently changed its prices, but for years customers were buying it. We knew it would happen. We knew that at some point they would cash in as much as they could on the proprietary nature of the software and the fact that they control people's data. And a good thing with open source is that open source validates your product. You can go and show customers: look, we have these users, it shows that the software is good. And that works very well. We also have progress in Europe today because there is an issue of digital sovereignty. It was foreseeable, given the dominance of American companies, but it's something that politicians and European or state organizations took time to act on; now there is a bit of action in this area. One thing that I also learned through creating XWiki is looking at FLOSS, free and libre open source software, as a goal. Initially it was: let's create good software, let's create a good company, let's have a good balance with employees. But what I discovered is that the goal of open source, of free software, is giving us freedom, giving us control over software. It's all the values that are described by the FSFE that are really interesting. I discovered that, and it motivates us even more in building what we do. In the end, we had to find a balance between all these things. And these are the values that we promoted internally in the company; this is what makes our company. We need to take care of our community, we need to take care of our customers, we need to make a great product. And the great product is about the domain in which we are, knowledge and privacy, the goal that we have for the software.
And we want people to be happy inside the company, and we want to do open source. So these are the values that we promote internally, and the challenge for the CEO and for the group is to find the balance between these five items. For example, we can see that these are the highest-ranked reasons why people at XWiki decided to join XWiki. This is recent data, not data from the past. And we can see that being open source is a key reason why people want to be there. But they also want to be there because they like the product that we're building. So one of the key things was building on support revenue. I mentioned the recurring revenue, and it was really important to make the support revenue accelerate to be able to gain sustainability. That's really the challenge for a company that wants to build open source over the long term. From 2010 to 2015, we moved from 1 million to 2 million revenue. But most importantly, we grew from 250K to 800K of recurring revenue. In the end, that's what I look at more: how much recurring revenue we're making, because that is what's funding the company. We failed at building partnerships. We hoped the product could be used to build some other products, but we found it very difficult to find the deals. And in the end, we found that we were better at creating a direct relationship with customers, explaining to them the open source model and what we were trying to do, and also explaining to them the value of our product. The relationship with direct customers is key in order to build the value of your software. We also tried to build a first version of SaaS; we called it XWiki Cloud. In the end, we focused on the main product and the main product's value. It also allows some simplification. And so that's the graph, you can look at it: recurring revenue grew to 35% in that time, and it's really great to have that.
It's not only about the percentage of recurring revenue, it's about the amount that sustains the team. Because even if you stay at a percentage around 50, the extra money that you're getting from the services and the research projects becomes a bonus once the recurring revenue is enough. So if you reach a certain amount of recurring revenue, the rest becomes a bonus. At the beginning, it doesn't work like that: you have close to zero recurring revenue, so you don't even manage to fund the team to continuously develop the product. And one thing to keep in mind is that closed-source competition is tough. Even if you're doing something innovative, if you launch something new, such as an enterprise wiki when we started, or an end-to-end encrypted tool, at some point, if there is a big market, closed-source competitors that are raising money are going to come. And they might grow faster than you for a while. They will educate the market, which is interesting for you, but they will also try to take the market and then cash in. But you can stick to and stay true to your goals and wait. I always tend to say: when you're number three and number one buys number two, you become number two. And when you're number two, you're the alternative to the number one. And all companies want and need an alternative, for competition. So I wish that open source would not just be the alternative but would be the leader; that's not always happening and not always easy. But being the alternative is also something that helps you grow. And after that, we had a challenging period. We were growing progressively, but at some point we flattened. And that was because of the competition. At the beginning, we were working mostly on innovation; we had customers interested in buying what we were doing through innovation. But at some point, we flattened, and we didn't have that innovation thing anymore.
We had stronger competition, SaaS coming in and speeding up deployment; people would just buy SaaS. Companies would be less interested in the open source aspect. So we were basically flat, with 35% recurring revenue. So what difficulties did we have? Competitiveness, SaaS competition. Our custom work was less in demand because there were more products doing things in a standard way. And we had to educate the market about the fact that open source is not completely free. You tend to put the priority on this when you're not making enough money; you tend to think it's just the business model. It is also the business model, but the main thing we changed in that period is that we stopped trying to think like an open source startup. We came back to thinking about what we had, what value we had, and it was our product. And what we did in the end was create a task force to transform the company, focus again on the product, making the product better, and trying to convince people again that the product was good. And it worked. We looked at what was missing, what was not so good in the UI, and really put effort into it. And the thing is, when you're doing two million in revenue, you do have money to try to fix problems, and that's nice. And so we relaunched a competitive offering, and we also changed a few things in the way we were selling to customers, to try to improve their understanding of open source so that they would give us more recurring revenue in order to fund more of the product. And one of the things we did was rewarding the customers paying for the product. So for example, we decided to build open source paying applications. This is quite unique. I don't believe in the open core model, where you're doing open source and proprietary on top of it, because it tends to push you towards doing more proprietary.
At XWiki, what we decided to do was paying extensions. That's similar to open core in the sense that you have to pay for them, but the code is open source; we just don't make the build available. So we have the XWiki core, completely free, and of course completely open source. And we have extensions that you get as extensions and that you pay for. In reality, the code is fully, 100% available on GitHub. If people wanted to use them for free, they could rebuild them. The thing is, people don't make the effort. It's a lot of effort for companies to do that. And by adding a little bit of friction for companies adopting these extensions, it motivates them to pay; it gives companies a reason to pay. The bad part of it is that we would like it to be completely free for individuals. But this means that you would need to find some other way to make that happen. The most important thing in this strategy is that the code itself is open source. That means that over the long term it's owned by everybody, not just by us. So we cannot be the only owners of that code over the long term. So this is a part where open source is not free, and it needs to be explained. We tend to think that everything needs to be free, but you cannot pay people if everything is free. So if you want to build it, you have this difficulty. In 2016, we launched CryptPad. And that was another experience, because we launched a second product inside the company. It had some useful aspects. It recreated innovation in the company. And it helped us gain other research projects, because research projects are highly linked to innovation. So we had a second batch of innovation inside the company. And it also helped the image of the company. It made us more known by individuals. And then people go, oh, you're doing XWiki also? Among people that know XWiki and CryptPad, how many didn't know it was the same company doing both? I don't know. Who knows XWiki?
Who knows CryptPad? Anyway, then 2020 happens: COVID. What happens? That's a crisis. One thing to think about is to always be ready, as an entrepreneur, for a crisis. It will happen if you stay long enough. We had the subprime crisis in 2009; in 2020 we had COVID. The thing is, we were more ready than a lot of companies because we were already remote friendly. Everybody was allowed to do two days of remote work in the company. We just moved it to: do whatever you want, just work. And everybody worked from home. We had the tools; everything was already adapted to work with remote tools. That's one of the magic things about open source tools and the open source development model. We had the knowledge, we had the knowledge tools. And it also gave a boost to CryptPad, because CryptPad was actually used by education. For example, in Germany, we had incredible usage of CryptPad over a couple of weeks during COVID. So it also gave a boost. But as a company, of course, it creates a bit of a scare: what will happen, will customers go away, will there be a worldwide financial crisis for years? In the end, we went through it. One of the things that COVID showed is the challenge of European digital sovereignty. Politicians realized that supply chains were a problem and that there were risks there. And this turned attention towards digital sovereignty and software. And for a few years now, we've seen that there is an interest in this area. But the most important thing that happened for us was Atlassian changing their business model, saying that people should move to their cloud and stop running the software inside their own companies, and closing the smaller offers. They decided that in November 2020. I don't know how they found that COVID was a good time to add some stress to their users, but they did, and their customers didn't like it, because we received a lot of mail asking, okay, what is the way to replace Atlassian Confluence with XWiki?
And so we spent time on improving our migrators. And we were not necessarily surprised; we were surprised by the extent of the change that they made and what they did to their customers, but from our point of view, it was something that would happen, at least progressively. When investor-backed companies want to cash in, that is the time for the SMEs and open source companies to really propose an alternative that is more sustainable over the long term. Open source is more sustainable for other people than proprietary software is. So for us, it brought some maturity. This raised our revenue to 3.3 million of sales in 2023, 50% growth, as I said at the beginning of the presentation. And in the end, 1.6 million of recurring revenue, with 30% growth on the recurring revenue. And that has been huge for the company, and for allowing us to build more software. So this is the graph; you can see the last three years, pretty nice. When you are in 2020, when everything's flat, you feel a bit depressed, and that's not going well. But then three great years follow. So it's never a given. You can always turn things around. Not only because of Atlassian, or thanks to Atlassian, but also because we won a project around digital sovereignty in France and Germany, and the software was recognized. And what about the future? Everybody talks about AI. For a knowledge company, it's a real question. So we need to think about it. One of the things AI is doing right now is questioning the aspect of open source again. We saw a lot of big companies, as I said, not caring about open source; politicians not caring about open source. With AI, it's the first time the president of France said the words open source, which we in the industry had wanted him to do for years, saying that it was important. And he said it for AI. Okay, what will it change? We'll see. But at least it raised the question of transparency again, of the control of code and of data.
And that's something that is positive for the future. But you also need to get prepared, because it changes a lot of things. The architecture of running AI is complicated. It's much harder to run it on premise. And so you need to find solutions for that. We're working on AI at XWiki. We have an extension. And we also gained a research project to do a search engine using AI. And I would like to point out the approach of Nextcloud with ethical AI. We are completely aligned with that aspect. You cannot do AI today without thinking about whether it's ethical, whether it's protecting data or not. One big aspect that I think is really important for the future is software modularity and integrations. We believe at XWiki that the future of open source software is allowing software to be assembled together, and making better reuse. I said at the beginning that when we started XWiki, we reused a lot of open source software. Well, if we want our software to survive in the open source world, we also need to make sure that it can be reused more. And this is why we've launched a new product. We call it XWiki Cristal. And it's going to be a new modular UI that will not only work with XWiki, but can work with other wikis and can be integrated into other tools. The other thing is that we're part of the openDesk project, which is a funded project in Germany to make an open source suite of collaborative products. And we're very happy to be part of it. And the other aspect is doing with CryptPad what we did with XWiki. I showed the financials of XWiki, the company, which include both XWiki and CryptPad. But what's really interesting, when you run a second product inside a company, is how the other product looks. And what's really interesting is that it looks a lot like the XWiki product at the beginning: only 20% of recurring revenue. And it's difficult to build that recurring revenue. So if you love CryptPad, we are very happy.
We've been able to double the size of the team, as you can see from the funding in the last two years. In 2023 we doubled the size of the team, and for 2024 too. But it's only 20% recurring revenue. And that means we don't have the sustainability yet. If you look, the blue and red parts are our recurring revenue: subscriptions to CryptPad.fr and donations. And you can actually help us build sustainable revenue by promoting CryptPad and helping us find more users and customers. But you can also help with donations. Any software needs to reach that sustainability through recurring revenue. That's really the challenge for it. Finally, giving back. First, we give our software, because it's open source. We give our code as a company. But we also think it's important to give back to the other open source projects we use. We wish large companies would do that. Large companies that use a lot of open source for free, or proprietary software companies today that are building on open source, should give back something to all the projects that they use. And we decided to create a fund of 1% of our recurring revenue to give back to the projects that we use. We have a three-year backlog, so we're going to give almost 30K to the different projects that we use. We're going to give, for example, to the Matrix Foundation. We're going to give to Mastodon and to lots of other tools that we're using. And we're going to continue to participate in industry organizations to help make this known. The conclusion is that none of this would have happened without the team itself. We have a team of 60 people, and more than 200 people have worked on XWiki over 20 years. And the kudos really goes to them, because you cannot do this without all the people that worked on it. At XWiki, we have seven people that have worked 15 years at XWiki, and 15 that have worked 10 years.
And that is not necessarily easy to achieve for a group of 60. We have the difficulty of funding all the time. If you want to join, we have jobs. And nothing of what we did would have been possible without the help of European projects, French projects, BPI, Europe, NLnet. If you don't know the NGI program, the funding you can get for open source from NLnet, go look at it. It can help you fund your project. That's it. If you have any questions, you're welcome; I'm available, or I can take any questions now. Any questions? No? I guess people are just settling in here. I have a question, if nobody has a question. So I was wondering, how was the ride between building a company and having a community, basically? Was there any conflict about what to put in the product, what not to put in the product — let's hide this away so that people pay for it, let's give this for free? How was this dynamic in the building of XWiki? Yeah, that's the difficult part: what do you make a paying module and what do you make free? Well, first, really keep an open community. That is very important, and really keep everything open source, even the paying stuff, so that people can look at the code and discuss it. For the choices of what the features are, well, we try to direct them as much as possible to the ones that the bigger companies would need most, not necessarily the individuals or the smaller companies. Because in the end, it's mostly the bigger companies that have the funding for enterprise software. And it sounds weird when the bigger companies are not paying for it. I think Matrix has a talk just after this, and I know they will talk about the fact that you have huge deployments of Matrix with zero money, and some smaller deployments that are giving significant money. The larger companies that are massively using open source need to participate in it.
And so we direct towards the specific features that they need; for example, audit logs for compliance reasons are something for big companies. But for example, LDAP authentication or SSO, it's a bit tougher not to give, because it's a security feature that's really important to make software more secure. So that has been a difficulty for us. We made Active Directory a paying application, but LDAP configuration is still available in XWiki, documented in the open source documentation. But if they want the simple configuration with Microsoft Active Directory, they pay for the application, and we sold a few of them. Hello, first of all, very nice talk. Thank you. What would you have done differently on the XWiki journey? What would I have done differently on the XWiki journey? Oh, that's a good question. Well, the little strategies to make people understand open source better — for example, making people pay more for services if they didn't take support — I would have done earlier. The paying applications maybe earlier; not so sure, because initially you need to build the community first, and you need to build competitiveness, so it's kind of difficult. That part, maybe not do the four products on top of XWiki with partners, but at the same time they gave us some money. So maybe do less service sometimes, more product. These are the things I would have done differently: basically, the playbook of how you can fund the product, or trying to do it earlier. And once we learned it, I had another presentation about the different methods we found over the years for how to fund open source software. And now I give way to curl, which also has great experience with how to fund the work on curl. Any other questions? Nope, okay. Thank you, Ludovic. Thank you.
You too could have made curl!
I'd better not touch anything anymore. Okay, nine minutes off. Okay, cool. Hi. Technical stuff. Right. Let's start this. I am Daniel. I work on curl all day. I work for wolfSSL. I do curl stuff all day. I am going to talk a lot about curl. I always talk a lot about curl, and today as well. I don't think I am going to present a lot of new things here. You are going to hear me reproduce and repeat things you already know. But clichés are clichés for a reason. I am just going to let you know that some of them are actually true, at least from my point of view. I have worked on curl for a long time. It runs in a few things these days. You can probably not walk very far without using curl, knowingly or not. It is in a lot of different devices, things, services. Since a few years back, on more than one planet as well. Right? A favorite slide of mine; I needed to squeeze it in, I am sorry. A few years ago I also got this gold medal from the Swedish king for my work on curl. But not a single gold medal since then. It is kind of a disappointment. But anyway, these days we estimate there are roughly 20 billion installations of curl. Quite a few. We don't actually know that it is 20 billion. It is open source, so we don't know. But there definitely cannot be many other open source programs, or software anywhere, that run in many more instances. I am pretty sure. A pretty decent thing, I think. But you know, everything really didn't start out like that. It has taken quite a while. Because our project, curl, of course started somewhere. And it was a long, slow effort, a long journey from something that was really not very good to what it is today, which could possibly be good. So in November 1996, a long time ago, I turned 26. Fun. And I started with a little project. It was more like a very silly toy: 160 lines of code, just a few screenfuls. And what do you do with that?
You start playing with it, you make it into something. You start fiddling with it. And you know, start small. Do what you want to do. Give it a lot of time and have fun. That's how you start an open source project. You have an itch, you start scratching. And as long as it is fun, why not work on it? In my case, I worked on it for about two years. I actually renamed it to curl then, in 1998. So it started with another name, but that's a long story. Anyway, two years later, December 1998, what an awesome success: 300 downloads of my software. I have this screenshot from the website that I had back then, because I think it's a cool reminder that actually getting 300 downloads of your software is pretty cool. That's way more than all your friends, all those who just did it because they know you. It actually started to reach out. And that's cool. It is cool with 300 downloads, even compared to 20 billion today. And I also want to emphasize that this was two years later, right? Two years later, 300 downloads. Yay, it's going somewhere. I mean, in 20 years, we could have 3,000. So yeah, keep working at it. And finding your goal, or finding a project to work on, of course, is a good thing, right? It's fun, so work on it. And maybe, sure, you want to make it easy for others to help. But you can be sure that, I mean, the world is drowning in open source projects and good ideas, right? It's not a problem to find good ideas. It's not a problem to find open source projects. But how do you actually get anyone else interested in your little project? Because you think it's fun and interesting and serves a purpose. Probably not. Probably you're just going to have to realize that it's you and your project for a while, until it's proven to be something. So as long as it's fun, why not keep at it, right? And spend the time, because it's not going to be an immediate success. Very few things are an immediate success. So yeah, spend time on it.
People often ask me what I've done in curl mostly. And I think what I've mostly done on curl is spend time, right? 1996, I started this. And also, learn to use the time. I told you, I was 26 when I started this. I didn't have any kids. I've had kids since then, and they have grown up pretty big since then too. We're all having lives, families, other things than just open source, right? But how do you actually get time to spend on your projects? In many cases, you maybe need to do a little less of something else, or get a little bit less sleep, or whatever. If you really want to get somewhere, as I say, you need to spend time on your project, so maybe you have to do a little bit less of something else, right? And people sometimes don't believe me when I say that I never, ever play computer games. That's just an easy thing to rip out of your life to save hours, and spend that on your open source project instead. I mean, you can cut down on sleep as well. And I do that too, but that has its downsides as well. Just accept the fact that for long periods of time, you might just be the only person, right? Of course, you make it easy for anyone to contribute, and you know, lower the bars and accept pull requests and everything, but, you know, there are many open source projects out there and we're all competing for the same developers, right? And all those developers, they also play computer games. They watch TV, they have families, they have other priorities in life before your open source project. But I can spend time on my project. I can control, at least to some degree, what I spend time on. So sure, just accept the fact that, yeah, I make pull requests in my own project, right? I put them up there, someone can comment on them, someone might review them, but if they don't, I go ahead and merge and continue with the next one.
Because in the end, it doesn't really matter. Looking back at your project, no one cares if I started my project 10 years ago, 15 years ago, or two years ago, as long as the project is good, it's there, it fulfills the purpose. So in a way, time doesn't really matter in the end. And of course, for reaching somewhere, accomplishing something with your project, there is really no silver bullet. There is just engineering. There's just the open source stuff that we all know how to do, that we've all been doing for a long time. There is just hard work and keeping at it. And of course, having fun. Because if you're not having fun when doing it, you probably won't endure. So the curl project, right, it started in 1996. The number of lines of code was basically zero. I actually started the project from someone else's code, so I didn't write those first 160 lines of code. And then I became the maintainer a few months later. And then we started the journey. And now we're at 160k something. And yes, a fascinatingly linear growth too. Kind of unbelievable. So yeah, I'm just saying that if you keep at it, things might develop. And making sure that others can contribute is, of course, crucially important. That's why it's open source. We want to enable others to contribute, even if many times maybe they don't, but there's still that opportunity, right, and availability. And if you're doing things right and you happen to be accepted by others, maybe someone will contribute. And now everyone is looking at that bump in 2005 and thinks, what happened? And it's quite boring. I actually just went back and filled in names that I had missed from the list before. So it's actually not supposed to be there; it's just my script counting the number of names in the list. So over time you might get a lot of help if you're successful enough. But success is obviously not a given, right? There are a lot of open source projects.
I mean, more are added every day, right? So there are hundreds of thousands. Just look at GitHub or whatever. We're drowning in open source projects. And yeah, it's certainly not a guarantee that whatever we do is a success and going to be popular or anything. But if you don't give it enough time, if you don't spend your efforts and really make sure — I get a lot of questions, or people say, yeah, I spent a lot of time on my project, I did it for several months and nobody used it. So sure, if you don't spend enough time, if you don't polish it enough, maybe it doesn't stick out among all the others, right? So maybe you actually have to spend more time to get somewhere. And it needs to be fun. But whatever you do, and whatever anyone does, there will be times when you just run into something that wasn't supposed to happen, like security problems or whatever. And it's bound to happen to anyone who's doing software, maybe more to some than to others. But still, everyone makes mistakes. It doesn't matter how long we've been doing this or how much we have done it. As long as we keep developing, we keep changing things, there will be mistakes, and mistakes will lead to security problems every once in a while. In curl, it looks like this. The green bars are when we fix security problems, the red ones are when we introduced them, because I tracked that down. So of course, we introduced them before we fixed them. But anyway, I'm just saying that, yeah, we work really hard, of course, to make sure that we don't introduce bugs, we don't introduce security problems, but you can be sure that they will creep in anyway, because it's tough. And you all know that, right? Nothing new here. But what do you do? You just own your mistakes, because they are going to happen, and try to learn from them, which I think is really, really hard, right?
Because every time you get a security problem, it feels like, this is a one-off, we should never have done this stupid thing. But try to learn from it, adapt, move on, add more tests, and make sure that we at least don't reproduce the exact same problem again in the future. And yeah, I've still done that several times actually; it's kind of stupid. And keep having fun, because if it's not fun, you're not going to spend all that time on it. And no one else is going to do it either. Of course, everyone makes mistakes. It's really a matter of how you handle the mistakes. It's not the amount of mistakes or how critical they are, but how you take care of them, how you take care of the people who actually made the mistakes. In my case, it's easy to take care of the people, because almost all of them were my mistakes. And there's no denying that it's soul-crushing when you have your software in 20 billion installations and you have one of these things that you know can end up really, really bad for the users. Yes, that can make it a little bit harder to go to sleep at night. But yeah, again, we all make mistakes. We try to learn from them and move on, right? And in our case, in pretty much everyone's case, we just have to do what we can do, right? Engineering. We write readable code. You should be able to understand the code, in any language. Whenever you read code, it should be understandable. If you can't understand the code, it's the wrong code, right? And you document everything, clearly and a lot. And another thing with working on stuff for a long time is that you have a long time to write the documentation as well, ideally, right? And a lot of tests too, because the more time, the more tests. And you analyze your code, of course; you throw every tool at it and make sure that the tools don't complain about your code.
And then, when you have fulfilled all these steps and you know it's pretty decent, you can throw fuzzing at it. In our case, I also like to offer a bug bounty, because I'm fortunate enough to have someone who pays for it. So we offer a lot of money to people who can point out security problems. And yes, then you get a lot of bogus crap as well — yeah, "there's a security problem". But you also get a lot of quality people spending a lot of time and effort actually trying to find security flaws. So in my experience, this works really well. It's a pretty cheap way to get a lot of help to find your most stupid mistakes. But, okay, there might be other people involved in open source sometimes. You're not alone all the time. And really, over time, you learn that code is easy, right? Code is easy, you can just debug it, try it again, write a new algorithm. But the people, they are never easy. People are where the challenges are. And the longer you work in an open source project, the more you maintain, you know that the challenge, what you face on a day-to-day basis, is the problem of communicating and talking to people from different areas of life, cultures, languages and everything. And you can be sure that they are going to be less than friendly at times. So over time, we do less and less coding and more and more interfacing with humans and other things, as maintainers of stuff, right? And, right, negative feedback is sort of the default. It's a little bit depressing, but you know, as long as things work — sure, 20 billion installations, no one says a single word; yeah, it works, cool. And someone finds a little bug somewhere, and you can be sure that that is what you are told about, especially if it appears stupid or silly or something, because then someone is very upset that surely it should have worked, since it has been around a long time and you've been working on it so much.
So that is, of course — and I know you all know this — the default. You basically never hear when things are good, because that's the default. Everyone assumes everything is good all the time. When something is bad, you get told about it. So people often ask me what the difference is in curl back in the days, with 2,000 lines of code and 300 users, compared to today with 20 billion installations. There's really no difference, because in the little development community, people raise their bugs, they complain, they have problems. All the ones who are successful, they shut up, they are somewhere else. So it doesn't really look different today. And there are a lot of lessons in what you do when you realize, over time, that contributors rarely stick around. In curl, I have lowered the bars and the friction for new contributors, I think, a lot. So we get a lot of contributors, even fixing a spelling error or typos in a comment somewhere. People contribute that. And I think any contribution is a good contribution. It doesn't really matter if you just fix a typo that made something hard to read; yes, it's an improvement, so I accept it. But do the contributors stick around? Today we have, I think, 1,240 authors who have written code commits in curl. That's an amazing number of people. Over 65% of them did it once and never again. And I don't think I'm unique in that, and I don't think it's special. I think it's more like that's how people work, right? They show up, they find a problem, they submit a fix, and they move on to something else. Because their primary interest is not helping my project; they just found a problem, fixed it and moved on. And sure, it's okay for them, and it's okay for me too. It's just the realization that most people who show up will show up a few times, maybe, if you're lucky, and then never again. And maybe every once in a while, of course, you get a new contributor who will actually stick around for a long time and contribute a lot.
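[Transcriber's note: the one-time-contributor share mentioned above is easy to compute for any repository. The following is a hedged Python sketch, not anything shown in the talk; the author list is invented stand-in data, where in practice it would come from something like `git log --format=%an`, one line per commit.]

```python
from collections import Counter

# Stand-in author list (hypothetical data); a real run would read
# the output of `git log --format=%an`, one author name per commit.
authors = ["alice", "bob", "alice", "carol", "alice", "dan", "bob"]

counts = Counter(authors)                       # commits per author
one_time = sum(1 for c in counts.values() if c == 1)
share = one_time / len(counts)                  # fraction with exactly one commit

print(f"{one_time} of {len(counts)} authors committed only once ({share:.0%})")
# → 2 of 4 authors committed only once (50%)
```

Run over real curl history, a script in this spirit would yield the roughly 65% of 1,240 authors figure quoted in the talk.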
And you will be happy for those. And of course, there is the reverse too, right? There are newcomers you've never heard of — you never saw this person before in your life — and they show up suddenly one day with an amazing patch, showing that they understand everything. And you can be amazed that someone just shows up on your doorstep one day with a perfect understanding of your architecture and design style and code style and everything. So open source is open and ready for surprises in every direction. And that's part of the fun, right? Less fun is perhaps that sometimes, when you're a little bit public about things, things can go in the other direction. This email, from soon three years ago now, was actually the first one that hit me like this. My email address is in the curl license, and the curl license appears in a lot of products. And this person quite clearly had been attacked in some way and saw some traces of curl in some leftovers somewhere. And that was obviously my fault. He had lost his family and his job and everything. A completely confused person, but it was all my fault. That was tough. But, okay, this fun thing with open source: the term was coined in 1998, right? Actually the month before I renamed the project to curl. So open source and curl have sort of been going hand in hand for a while now, and still it's just 25 years, right? We did open source before we called it open source too, because we worked exactly the same way. We just didn't use the term then; back then we mostly talked about free software, but there was a little bit more confusion about what it actually was. But anyway, today it is much easier to do open source, because everyone knows about open source. If you approach a developer today, working in any field, people actually know what open source is.
Back in 1998 or 1996, no one knew about open source in general. It was just a niche clique of weirdos. And today everyone is using open source, right? There's not a single project, single user, single developer who doesn't use open source at least to some extent, willingly or not. It's just going to be there. And we all work with open source in ways that we simply did not 25 years ago. There are so, so many more contributors to open source today than before, literally millions and millions of possible contributors. Back in 1998, there were not millions and millions of contributors; in 1998 the total internet population was, I think, estimated at something like 40 million. That's basically the number of open source developers today, right? And of course there are many, many more maintainers of open source today than there ever were before. So there are also a lot of equals among us. I can talk to you like this today, we who maintain open source, and you are a lot of open source maintainers; I don't even have to pretend.

So there are a lot of good things, and it is of course a much easier and much better place to do open source today than ever before. And I think it's going to be much easier and better going forward as well, because all of this is just going to improve. We're just going to do more open source, and it's way, way easier to do open source today too, thanks to infrastructure, tooling, funding, whatever. I think we're in for a bright future.

But anyway, I've worked on a single project for so long, and people ask me: don't you ever get bored? The same project for 27, 28 years? Yes, of course I get bored. Everyone gets bored every once in a while, right? Lack of motivation. How fun is it to work on the same thing all the time? Of course the motivation comes and goes. That goes for everyone, and it's just a natural part of life, right?
Whatever you do, there will be periods in your life when you don't feel that, yes, it's going to be great to work on this documentation today again. Sometimes you just have to do something else, spend more time with your family. In my case, I like to move around, do something silly in some less important part of the code, or do slightly less curl for a while. I've just come to realize that lack of motivation is a natural thing. It's an endless cycle; it goes away and comes back, and it doesn't really matter, as long as you let it play out and don't overload yourself.

That last one is brought up very commonly, right? If you are that single person and you feel that a lot of users are depending on your work, maybe you sometimes work a little more than you should. I think this is a real problem and it can affect us for real. But it is important to separate you from your project, of course. I'm not sure I always manage that, but I do try. And there's a little bit of this: if your code runs in a lot of places, can you really ever be sure, when you release a version, that it is not going to bring down half the internet? I don't know. You just have to deal with it. In my case, I think I'm actually pretty good with this, because I feel that we have enough tests, enough eyeballs, enough people involved that, crossing my fingers, it might not happen too often at least. So I think it works really well, in my case at least.

But I want to emphasize, and I think this is true for many people, that the thing about imposter syndrome is that it doesn't ever really go away. It doesn't matter if you have those 20 billion installations, you can still experience periods of it. Do I even know this? Who am I to tell them how things work? I mean, come on, this protocol doesn't actually work like this.
But one of my skills, I think, when it comes to doing open source, is to make sure to use the time slots you get. I have that a lot: you have a family, you have a life, you have friends, but sometimes you have 20 minutes for yourself. Can you spend those 20 minutes on your open source project? I've become very good at that: if I get 20 minutes here and 20 minutes there, that's actually 40 minutes. And I'm not complaining that I need an hour to get prepared first, because then I would never do anything at all. And I don't split my attention between a lot of other tiny things. Sure, I have a lot of other projects as well, but I give them much less attention.

And again, time might feel important sometimes, but it really isn't. In most cases it doesn't matter if you're done today or tomorrow or next week or the week after that. Who cares, right? Sure, it's not in this release, but you're going to do another release soon again anyway. And down the line it didn't matter if you were done last week or next week. So let it take some more time. And of course I'm a true believer in release early, release often, so that everyone has a chance to get your latest code as soon as possible, because it just makes maintaining everything easier, and contributors have a much easier time actually working on your latest code. So, reduce contributor friction to get people to help out, and have fun.

Of course, we need to remember that we're all different. I can stand here and say how I work, but I'm sure you all have objections and say, yeah, that doesn't work for me, it doesn't work in my case. Because, I mean, spare time: I'm talking about working on open source in your spare time. In my case I work on open source during work hours and spare time hours, so I sort of maximize it. But working on anything in your spare time is of course a luxury, right?
If you're working on something in your spare time, maybe someone else in your family is doing the laundry or the cooking or taking care of the kids or whatever. So of course that's a luxury. If you're in that position, it's a luxury; I don't deny that. And in many cases you don't have that luxury, and then of course it's much harder. There's an unequal privilege here, right? If you're rich enough to do this, you can do this. If you have to work two jobs and take care of the rest of your extended family, maybe you can't. You just have to be aware that it is a luxury.

We're all different, we're all unique. And of course: what is success? I considered 300 downloads a success in 1998. We all have a different way to measure success, right? You don't have to have 20 billion installations. It's fine if all your friends are happy with your tool and you just have fun. That's also success.

In my case, I have mentioned already that my email address is in the curl license. This gives me an excellent opportunity to learn about people's agonies in life. Like, if they don't know how to install the GPS in their car, they email me and ask. And you can imagine the amount of anger in this user. He couldn't install the GPS. He's been scrolling through that open source license screen in his car, found an email address: I'm going to email this person. So I get a lot of car questions. And then you learn: my email is apparently in a lot of cars, and people have problems with their cars. I have no idea. But not only cars, actually, so I learn about other things too. And usually this is the best way I have to actually learn about where people are using curl. Often I don't even understand the question; I have to Google it. What are you talking about? Oh, great, are they using curl too? It's confusing. I sort of stopped replying to them, because... well, in the beginning, when I started, I did reply.
You want to help out, you want to be friendly, someone asks you a question, obviously completely lost. No, sir, this is how it works. I just wrote a little component. No, no, no, that's not how it works. Just ask your friends and help me fix this car now.

So I have this example. This is a great one. It's a little bit convoluted, but I'll explain. I got an email from a woman. She said her Instagram account was hacked. Why are you asking me about that? Sad for you, okay. But she showed me the proof that I'm involved: Instagram, and my name. So I should just head over and talk to the guys there and tell them to help her with her account that had been hacked. And I told her: cool, they're using curl, that's my code in there, right? And I tried to explain the concept of open source. I never talked to these people; I didn't even know they used curl. For me it was like a revelation. Cool, Instagram, right? That's like a billion installs, suddenly. She didn't really see it the same way. So she emailed me back. She had found my name again in her phone. Exhibit two. It cannot be a coincidence; your name cannot be in my phone twice for any good reason, right? So she threatened to contact them and tell them that I'm hacking Instagram and Spotify. I don't know if she actually did, so maybe they don't know this yet. Now I'm exposing myself.

What I'm trying to say here is: I'm not special. I didn't do anything genius. I've just been working on this for a long time. I had an idea, I think it's fun, so this is what I do, and I think that's sort of the best you can do. I wanted a tool to do internet transfers. It does a little bit more these days than it did in the beginning. And I endured. I kept going at it because I didn't know anything else and didn't know better. I think it's fun. And keep polishing: if you spend a lot of time on something, it can actually become pretty good.
And make it possible for others to contribute if they want to, and then you can just hope and wish that they will contribute. In my case they did, to a pretty high degree. And this is really the most fun I can imagine. Yes, I'm living the dream: I work on my spare-time project full time and get paid for it. What else can you ask for? So, it's that easy. I think you can do it too. And that's pretty much what I wanted to tell you. I've written about these things a little bit before, in this book-like thing, if you want to read more about my thoughts on this topic. So, thank you. I'm done.

APPLAUSE

I think we have a few minutes for questions. If you have a question, raise your arm and someone will run to you with a mic. There's a question. The mic will come flying.

Hi, thanks for the talk. I have a question regarding... You mentioned that you have lots of contributors nowadays, and I was wondering how you deal with their PRs. Basically two questions. One is how nitpicky you are: based on your experience, how nitpicky can you be without discouraging people from contributing? Like being overly pedantic in comments and stuff like that?

I'm having a little bit of a hard time hearing your question.

It's regarding how nitpicky you are in your PR reviews. How pedantic can you be without discouraging people from contributing to such an important piece of software? Do you tend to just let things through, or are you very strict? And do you still get lots of contributors even though you're strict in your reviews? Because I guess when you get a diverse set of contributors, lots of people have different coding styles and different levels of detail that they go into, code comments and stuff.

I don't think I have any general rule there. I try to... Sometimes there are contributors who are clearly newcomers, maybe struggling with the language or the culture or everything.
And of course I try to be a little bit more welcoming then, maybe more forgiving, and help out. But it also depends a little bit on load and everything. Usually people, no matter the culture, language, anything, understand code, following code styles, making sure the test cases work and everything like that. So usually I don't have to consider that to any greater degree. Most people are developers; they understand this from the beginning.

Okay, that's interesting. The other bit of the question was similar, but regarding documentation, documentation in code comments. Have you seen code being over-documented, and has that helped you, or do you not do it?

Documentation is roughly the same. Over-documented, that's a rare thing. Well, "over" already means it's too much.

But you know, you could go overboard and you could...

You can, but in my experience that is very rare. Sure, we can have a discussion: you mention this in a comment, but then the code below says exactly the same thing. Assign A to 2. Maybe you don't have to say that in a comment, and then we just have a discussion. But I think that's very rare, actually. Usually it's in the other direction: maybe you could add a little comment here explaining why this is happening, and not just have a huge blob of code.

Right, right. I guess what I'm referring to is, when you have such a long history in your software and you want to leave traces of some design choices, why some things were implemented one way rather than the other, because other people, especially one-time contributors, are not going to have enough context. So I'm asking about your style: do you try to leave traces of context, like, this was done this way because of this reason, that reason, please do not change it, and so on?
Sometimes, but it's hard to leave traces like that for history, because things change. Leaving traces like that also risks leaving traces of your former design or former decisions, which maybe were not enlightened enough. So I don't make a concerted effort to do that, because everything is in git anyway. We can always go back and look at the history if we want to.

Was there any question left, or should I shut up?

I have a couple of questions. One is: how much time were you spending on the project before being able to work on it full time?

Sorry, can you repeat that a little louder?

Yeah, sorry. How much time were you spending on the project before you were working on it full time?

I have a long-standing tradition in my family: I spend every night on curl. When the rest of the family goes to bed, I stay up for another two hours working on curl. I've done that since 1996, basically. That's two hours every day, every week, every month, every year, for 27 years. Now I've just added my full-time work as well, so instead of two hours per day it's now 10 hours on work days.

Do you delegate maintenance?

Sorry, again?

Do you delegate maintenance of your project to someone else, or do you maintain everything yourself? Because there's a lot of maintenance overhead.

Well, I'm the lead developer here, but I'm not the sole maintainer. We're a whole team. There are a lot of people apart from me who can merge code, and who do. I just do the bulk of it, because I'm the only one who works on it full time. I do much more than they do, but if I'm at a conference the whole weekend, someone else can still merge code while I'm away, or if I'm just absent. So there's a whole team, actually.
OpenPrinting - We make printing just work!
Dear audience, please give a warm welcome to Till Kamppeter, the leader of the OpenPrinting project and an employee of Canonical, doing all the great work to make printing easy, not only on Linux but also on quite a number of other operating systems. Thank you. Have a nice talk.

Thanks. So I will tell you how OpenPrinting emerged, what we are doing, what the challenges are, and at the end I will even tell a little bit about printing under Windows, although I am not developing the printing part of Windows, and Windows does not use CUPS. So let's start.

The central part of our work is naturally CUPS: we develop CUPS and we integrate CUPS into the operating system. CUPS development is done by Michael Sweet, but it's a part of OpenPrinting; Michael Sweet has left Apple. We integrate CUPS into the operating system, for example with cups-filters. I am also contributing to Ghostscript, and I am coordinating with the desktops. And we are working together with the Printer Working Group, a consortium of the printer industry and the software industry. They develop standards, especially the Internet Printing Protocol, IPP, and we work together with them. We also have annual meetings, and we implement these standards in software. For example, CUPS is all IPP-based, and whatever changes or comes new in IPP, like driverless IPP printing, we update everything and implement it so that it can be used in our operating systems. And yes, we also naturally cooperate with printer manufacturers. I was in a lot of dialogue with printer manufacturers, especially longer ago, when one still needed printer drivers, to help them design drivers and do it the right way.
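Since CUPS is all IPP-based, it may help to see what IPP actually looks like on the wire. This is a hedged, hand-rolled sketch, not CUPS library code: it encodes the smallest useful IPP operation, Get-Printer-Attributes, following the RFC 8010/8011 binary encoding; the printer URI is made up for illustration.

```python
import struct

def ipp_attr(tag, name, value):
    """Encode one IPP attribute: value-tag, name-length, name,
    value-length, value (RFC 8010 binary encoding)."""
    name_b = name.encode("utf-8")
    value_b = value.encode("utf-8")
    return (bytes([tag])
            + struct.pack(">H", len(name_b)) + name_b
            + struct.pack(">H", len(value_b)) + value_b)

def get_printer_attributes_request(printer_uri, request_id=1):
    """Build a complete Get-Printer-Attributes request body."""
    msg = bytes([2, 0])                    # IPP version 2.0
    msg += struct.pack(">H", 0x000B)       # operation-id: Get-Printer-Attributes
    msg += struct.pack(">I", request_id)   # request-id
    msg += bytes([0x01])                   # operation-attributes group tag
    msg += ipp_attr(0x47, "attributes-charset", "utf-8")        # charset
    msg += ipp_attr(0x48, "attributes-natural-language", "en")  # naturalLanguage
    msg += ipp_attr(0x45, "printer-uri", printer_uri)           # uri
    msg += bytes([0x03])                   # end-of-attributes tag
    return msg

# In real use this body is sent as an HTTP POST with
# Content-Type: application/ipp to the printer's IPP endpoint.
req = get_printer_attributes_request("ipp://printer.local/ipp/print")
print(req[:8].hex())  # → 0200000b00000001
```

The whole request is a few dozen bytes over plain HTTP, which is part of why driverless IPP printing could become ubiquitous.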
So these are the principal tasks. Also part of the integration of printing is the packaging: I am not only doing the Debian packaging for Ubuntu, I am also doing the CUPS Snap, a Snap package of CUPS which one can easily download from the Snap Store, and with this we always have the newest upstream CUPS on any distribution, but also on Snap-only distributions like Ubuntu Core Desktop. But about this I will talk at 1:30 in the distributions devroom, in the talk about Snap and Ubuntu Core Desktop.

So how did it all begin? One thing is, I had a printing problem, like Richard Stallman also had a printing problem; he invented the principle of free software, and therefore we are all here. I was a system administrator in the late 90s, and there we also had printers, and they only worked more or less, because we were in the dark times of LPD. Then, at the beginning of 2000, CUPS appeared: Michael Sweet had released CUPS 1.0, and there was a Linux magazine article by Kurt Pfeifle about CUPS. I read it, and I saw that all of this would help us a lot. I installed it and everything worked better with it. But it was all command line, so I wrote a little print dialog and put it on Freshmeat; at that time, in 2000, Freshmeat was the place where one announced new free software. And then Kurt Pfeifle, the author of this article, saw it and invited me to LinuxTag, a big Linux conference, to show it at the booth of his company. So I showed it there. Most distros were not interested in CUPS and a print dialog, but MandrakeSoft was.
The show was at the beginning of July 2000, and on the first of August I was living in Paris, because MandrakeSoft had invited me to work with them to switch Mandrake Linux from LPD to CUPS. So I lived six years in Paris, but please don't try to speak French to me. Within a few months I had switched Mandrake Linux to CUPS, so the fall edition of Mandrake Linux in 2000 was the first Linux distribution with CUPS.

There were also some challenges. I not only had to package CUPS in RPM packages, which was the easiest part, I also had to take care that all the printers which worked before still worked after the switch-over. There were drivers, many built into Ghostscript, and many were small little filter programs which some student had written to get their printer working. But they all did not come with PPD files, because they were written with only LPD in mind, not with CUPS in mind, and I needed a way to get PPD files for them all, because CUPS needs PPD files.

And there was linuxprinting.org. Someone had created the website linuxprinting.org, and there was a database of printers: how well do they work, and with which driver. Especially, there was the PPD file generator, and the database for the PPD file generator needed much more information: options, paper sizes, resolutions and whatever. Most of these entries did not have that; some had it, so I knew that this works. So I asked the author to fill in the rest, and he did not have the time, so he gave me write access in 2000, and in 2001 even full control of the site, so that I continued maintaining it. I filled in this database, and this way I could make all the printers work with CUPS. Therefore the first Mandrake Linux with CUPS was working perfectly, and printing was much, much easier than in the old LPD times. I also gave talks at conferences and organized, every year at LinuxTag, an OpenPrinting booth, and
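PPD files come up again and again in this story, so here is a heavily trimmed, illustrative sketch of what such a file declares. The model name, filter name and option values are invented for illustration, not taken from any real driver, but the structure follows the Adobe PPD 4.3 format that CUPS consumed:

```
*PPD-Adobe: "4.3"
*ModelName: "Example Inkjet 1000"
*% Which CUPS filter converts the job to the printer's language
*cupsFilter: "application/vnd.cups-raster 0 rastertoexample"

*OpenUI *PageSize/Page Size: PickOne
*DefaultPageSize: A4
*PageSize A4/A4: "<</PageSize[595 842]>>setpagedevice"
*PageSize Letter/US Letter: "<</PageSize[612 792]>>setpagedevice"
*CloseUI: *PageSize

*OpenUI *Resolution/Resolution: PickOne
*DefaultResolution: 600dpi
*Resolution 300dpi/300 DPI: "<</HWResolution[300 300]>>setpagedevice"
*Resolution 600dpi/600 DPI: "<</HWResolution[600 600]>>setpagedevice"
*CloseUI: *Resolution
```

This is exactly the kind of option, paper-size and resolution data the linuxprinting.org database had to collect so that its generator could emit a PPD for each driver.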
so on. So I spread the news of CUPS, as Michael Sweet himself was stage-shy and did not go to conferences. And so the other distros saw it and switched over; in 2003 all the distros used CUPS. So I made CUPS the standard, and I also founded OpenPrinting in 2001, together with some people from the PWG. This was the start. And LPD and all the other approaches to make printing better, like PPR, PDQ or whatever they were called, all stopped being maintained, because everyone used CUPS and nobody wanted to use these other things.

Yes, I founded OpenPrinting in 2001. It was not what it is now. At first it was that I worked together with some people from the PWG on printing APIs. Every week I had a phone meeting, a classic phone meeting, not yet video, to talk about APIs and to develop APIs. And I naturally continued my work with CUPS and the linuxprinting.org database and so on.

One thing is, back in 2006 I organized my very first OpenPrinting Summit, in Atlanta, Georgia, in the US, at Ricoh; it was Lanier at that time, but probably now they are all called Ricoh. There I brought something like 40 people together from printing projects, desktop projects, printer manufacturers, printer driver projects and so on, to work together on the future of printing, on improving print dialogs and especially on improving drivers. At this meeting there was also Ian Murdock, the founder of Debian, hence the "ian" at the end of "Debian". I talked with him in a hallway session, and I told him that the linuxprinting.org server, the database server, was standing in the house of its original author, and it was already carrying official PPD files of manufacturers, and for security and reliability it should be in a data center, whether at Debian, which could perhaps host it, or at the Free Standards Group.
Ian Murdock was also engaged in the Free Standards Group, and OpenPrinting's API effort was also done as part of the Free Standards Group. So Ian told me he could host it at the Free Standards Group. But he did not only want to host the database server, he also wanted to host me: he invited me to work at the Free Standards Group full time to manage OpenPrinting, to join the linuxprinting.org work and the OpenPrinting work into one, called OpenPrinting. And this is the OpenPrinting of now. The Free Standards Group merged with OSDL, the Open Source Development Labs, one year later, in 2007, and founded the Linux Foundation. So OpenPrinting is part of the Linux Foundation, and I was working at the Linux Foundation at that time.

But in parallel, in 2006, where I was also organizing an OpenPrinting booth, I bumped into Mark Shuttleworth. He asked me whether I wanted to work at Canonical, and so I started part time at Canonical as well. So I was at Canonical and at the Linux Foundation, full time working for OpenPrinting, and I could leave Paris, because I worked from home.

Time went on, and the next step was that in 2008 I started to be the org admin at the Linux Foundation for the Google Summer of Code, every year up to now, and OpenPrinting is a part of it. So OpenPrinting was in the Google Summer of Code, and I was mentoring students, mentoring Google Summer of Code contributors for OpenPrinting. This was very important, because printing is not really sexy for volunteers to choose, and so by engaging heavily in it, we get contributors who contribute code, and some even stay and continue mentoring. The website we have now was also done by former Google Summer of Code contributors. And from 2006 on, I organized every year an OpenPrinting Summit, later on together with the PWG. And another
milestone, more recently: at the OpenPrinting Summits I met Aveek Basu from Lexmark in India. He has contacts to universities in India, and from 2015 on, every year, he was reaching out to the universities and finding contributors for us. We then did some selection, they fixed some issues for us and so on, to learn to work with OpenPrinting, and so we had five or six of them coding for us every year. They were mainly from Indian universities, mainly from IIT Mandi, the Indian Institute of Technology Mandi, and therefore I also organized a conference in Mandi last year and met them in person.

So what have we done? At first, we made CUPS the standard printing system, as I told you, and it is used in all POSIX-style operating systems, including macOS. Michael Sweet was at Apple for a long time, worked on CUPS, developed CUPS and integrated it into macOS. He left at the end of 2019 and continues to develop CUPS, and since then it is hosted at OpenPrinting. And I have made all the free printer drivers work with CUPS, as I told you.

Another thing: in former times, in LPD times, the standard print job format was PostScript, because the only printers at that time which could print graphical content and were used with Unix machines and in computing centers at universities were typically PostScript printers. So PostScript was the standard format for printing graphical content. But in 2006, as there were so many printers with so many different printing languages, and as PostScript is also rather awkward and not secure, it is a Turing-complete programming language, so one can inject malware with it, Michael Sweet and I decided to switch the standard print job format to PDF. So since 2006 the standard print job format is PDF. Not really in real life: it took some years until the
distributions were actually using PDF as the standard print job format; I think from 2010 or so PDF was always used.

I also did the Grand Unified Ghostscript. In the 2000s, when Mike wrote CUPS, he made a fork of Ghostscript, for the CUPS Raster format and to integrate some third-party printer drivers, because the original Ghostscript's top of the line, the newest version, was not free software; they released each version as free software only one year later. So this was also a reason why Mike did the fork. In 2008 the Ghostscript folks decided to release the top of the line of Ghostscript as free software as well, and this was for me the point to do the Grand Unified Ghostscript, the reunification of Ghostscript: to put Mike's Ghostscript, the original Ghostscript and third-party drivers all together, so that everything was there, but unified in one Ghostscript.

And system-config-printer is also maintained by OpenPrinting. It was originally the printer setup tool of Red Hat, and later on, when I had left MandrakeSoft and entered Canonical and Ubuntu needed a printer setup tool, I ended up taking system-config-printer for that and improved it a lot, especially the association of drivers and printers with each other, so that this works fully automatically and correctly.

Then in 2011, I think 2010 or 2011, Apple decided to not maintain the CUPS filters anymore, which Apple does not need because Apple uses their own proprietary filters, and I took over this code as the cups-filters project and have maintained it at OpenPrinting since then.

Another thing is the Common Print Dialog Backends. The print dialog communicates with CUPS for printing, and there are many print dialogs and many GUI projects: Mozilla, KDE, GNOME, LibreOffice and Chromium. All of them are big projects, with a lot of inertia, and it is difficult to find the right contact. So it was
difficult, on a CUPS change, to get all these projects to go along with the change and update their dialogs. So I came up with the idea that one can put an in-between layer there: CUPS, or any cloud printing technology (in 2017, when I started this, there was Google Cloud Print), is implemented in a backend, and the dialog communicates with this backend, and the backend with the print technology. This way we can maintain the backend for CUPS at OpenPrinting, and the dialogs always work correctly with CUPS. These are the Common Print Dialog Backends.

And I have snapped CUPS, as I told you earlier. Now all the free software printer drivers had to do a transition again, because CUPS 3, I will tell about it later, will not support PPD files and classic printer drivers anymore. So I had to put all the printer drivers into printer applications, which are emulations of driverless IPP printers, as the new CUPS will only support driverless IPP printers. This way we had another transition, and this I also have already done.

So, you see what we are actually doing. We do CUPS, we do cups-filters, the Common Print Dialog Backends, and pappl-retrofit, a library to put old printer drivers, classic CUPS drivers, into printer applications. Then we have printer compatibility databases: we still have the Foomatic/linuxprinting.org database from the good old times, but it is less important, because the printers which one can buy currently are all driverless. Therefore Mike Sweet created a list of all driverless printers, which is very well maintained and which we have at OpenPrinting. And we are collaborating, as I told you, with the Printer Working Group and with printer manufacturers, though the dialogue with printer manufacturers about printer drivers has more or less come to an end, because we have driverless IPP printing. And we take care of the whole printing stack, the whole printing architecture.
And integration in the desktops: I talk with desktop people, and I'm running Google Summer of Code projects for updating printer setup tools and print dialogs. Also the integration in distributions: I have to take account of immutable distributions. One is Ubuntu Core Desktop; I have the CUPS Snap for it. But I also plan to make Docker packages of CUPS and of the printer applications, because there are so many other immutable distributions which do not use Snap, and in these immutable distributions the only way to get system software in is to use Docker images, OCI containers, Podman and so on, because the desktop applications in those are usually added via Flatpak, and Flatpak does not support system utilities.

And now one important thing, where we are currently working very intensely: the so-called New Architecture. The New Architecture means that we do not support PPD files and classic drivers anymore; we go all-IPP. And all-IPP means that we support only driverless IPP printers. This means that the old legacy printers which need a driver would not work anymore. Therefore we have introduced the printer applications: emulations, software emulations, of driverless IPP printers. A printer application is a daemon which on one end looks like a driverless IPP printer, and on the other end communicates with the physical printer; internally it does the conversion to the printer's native language. So the driver is more or less encapsulated, and this way the old printers can live on, and we do not lose the support for them or require the users to throw them away.

So here's the scheme. The old CUPS 2.x, which you are currently using: the user application sends a job, and in the old CUPS there are different possibilities. An IPP Everywhere printer is a driverless IPP printer, and CUPS can directly talk with it.
So the current CUPS 2.x can already behave like the New Architecture: it supports the driverless printers directly. And CUPS 2.x supports PPD files and classic drivers, as any older CUPS did too, and through this it supports the older printers — a PostScript printer, any printer which needs a driver. But the printer applications, as they are just emulating a driverless IPP printer, also work with the old CUPS, because the old CUPS supports driverless IPP printers. So you could already switch over to a printer application. And so if you are writing a printer driver, don't write a classic printer driver anymore; write a printer application right away. It works with the old CUPS, it works with the new CUPS. And with the new CUPS, you see, you can only talk with driverless IPP printers — with the printer directly if it actually is one, and otherwise only via the software emulation, the printer applications. And we even do driverless scanning, because the scanners in driverless multi-function devices all understand eSCL, also a standardized communication protocol. So we can also make scanner applications which are emulations of eSCL scanners, and inside one has, for example, the SANE driver for the old scanner. And CUPS: this year, in a few months, we will release CUPS 2.5.x. This is not yet the New Architecture; it is only to get OAuth and some other features into CUPS without doing the big switch, to make it easier for the enterprise distributions, so that they can do this more lightweight switch to get the OAuth in. And at the end of the year we will release CUPS 3.x. This is really doing away with the PPD files. But it is also doing another big change: the CUPS daemon is replaced by two daemons — a local CUPS server, which is a user daemon, and a sharing server, which is a system daemon.
So the local CUPS server is only for getting the print jobs of local applications and passing them on to driverless IPP printers — either on the network, connected via USB, or a printer application for a legacy printer — and not for sharing printers to other users. It does not even use a network socket; it only uses a Unix socket file for communication. And with the sharing server, you can really share printers, and you can configure how to share the printers and so on. It's a system daemon, but you install it only optionally, when you really want to share printers and therefore run a print server; it also has ACLs, it has accounting and everything. And by default, for a desktop machine, you only use the local server. So this is the scheme. The very light blue is the sharing server, the medium blue is the local server, and the dark blue is all which is in common — this is in the CUPS library. So we will have three packages: the sharing server, the local server, and libcups3. And here you see the two servers and the scheme of how everything fits together. And now, with all this — we have already been working on it for more or less five years, planning it and telling everyone how this works, that we go all-IPP and we want to have printer applications and so on — it seems that Microsoft has heard it, and that they want to go away from their old printing system from the Windows 3 days, from the 90s. So they came with a new printing system, Windows Protected Print, which is already available in Windows 11 for testing, and in Windows 12 it will probably be the standard system. And this one is all-IPP, as it is with CUPS 3, but it is not CUPS. It is not that Microsoft has taken CUPS and put it onto Windows; the code is proprietary. And people who have tested it have already told me that it's still a little bit wonky and flaky; there's no access to the source code.
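The key design point above — a per-user local server that listens only on a Unix socket file, with no network socket at all — can be sketched with plain Java (17+) Unix-domain channels. This is only an illustrative model, not CUPS code; the class name, socket path, and "echo a job back" behavior are all my own assumptions:

```java
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalPrintSocket {
    // Round-trips a message through a "local print service" that is reachable
    // only via a Unix socket file on disk -- no TCP/UDP port is ever opened.
    public static String roundTrip(String msg) throws Exception {
        Path sock = Files.createTempDirectory("cups-sketch").resolve("local.sock");
        UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(sock);
        try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
            server.bind(addr);
            Thread echo = new Thread(() -> {
                try (SocketChannel peer = server.accept()) {
                    ByteBuffer buf = ByteBuffer.allocate(256);
                    peer.read(buf);
                    buf.flip();
                    peer.write(buf);   // echo the "job" back to the local client
                } catch (IOException ignored) { }
            });
            echo.start();
            try (SocketChannel client = SocketChannel.open(addr)) {
                client.write(ByteBuffer.wrap(msg.getBytes(StandardCharsets.UTF_8)));
                ByteBuffer in = ByteBuffer.allocate(256);
                int n = client.read(in);
                echo.join();
                return new String(in.array(), 0, n, StandardCharsets.UTF_8);
            }
        } finally {
            Files.deleteIfExists(sock);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("print-job"));
    }
}
```

Because the address is a filesystem path, access can be controlled by ordinary file permissions, which matches the idea of a per-user daemon that is invisible to the network.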
And Microsoft tells us that driverless IPP printers are all supported, but legacy printers are obsolete, you can throw them away. But I can tell you: if someone of you has to use Windows, or someone in your family uses Windows, they do not need to throw away the old printers. Under Windows we have WSL, and under WSL you can run printer applications. So every printer which works under Linux will also work under Windows in the future — and works under Windows currently too, because we have WSL and printer applications. And Microsoft wants to do this for security, because they want to get rid of the drivers. The drivers have vulnerabilities; it's old, often unmaintained code, and they want to do away with all these vulnerabilities, because these old drivers all run deep in the system, perhaps even at kernel level, and so crash the whole system when someone hits these bugs. Another thing Microsoft is talking about is Print Support Apps. These are not printer applications; they are not trying to support old printers. PSAs are add-ons to the driverless IPP printing to do something specific. So in some way they are stepping a little bit outside of pure driverless printing. So, if you want to get into this: we naturally need a lot of help, and therefore I am asking for volunteers. As in any other open source project, it would be great if you would step up to help us at OpenPrinting. You can also participate as a Google Summer of Code contributor; we are participating this year again, and we have a list of nice project ideas. If you download these slides and then click the links, you can get to all of this. We need people especially for desktop integration, for updating printer setup tools, for updating print dialogs. We also need people for maintaining the website, and people for CI testing, for creating CI testing scripts.
For OCI containerization, with Docker and so on. So we are in need of many people; documentation too — we need to write documentation for our libraries, and this is very, very important, here we already eagerly need volunteers. So I count on all of you: come up to me or contact me through these channels, step up and help us to make printing just work, and to make it work even better. As it seems — as Michael Tunelt told — there's no painless way to print under Windows, and that seems to go on. So we should have nice printing under Linux and other operating systems with OpenPrinting. And so, are there any questions? Yeah, hello, thank you for a nice talk. I have a question a bit more about the printer sharing server and accounting and things. Well, we work with schools, and all the individuals are actually happy that the printing just works. But the principals are not always happy that the pupils are printing pretty pictures and using a lot of paper. So this is about controlling who can print and how much they have printed. Can you go a little bit into what CUPS 2.x offers for the sharing server, and the transition to 3.x — kind of, what's the right way to do it? So the sharing server will have all the same possibilities which you had in CUPS 2.x, and it will also have the possibility to define profiles where you can filter which printers the users can see and cannot see. And you can also tell which printers to share to where. I don't know whether you can force option settings — which would be a possibility, for example, to force users to print in black and white and not in color, with only selected users allowed to print in color; I don't know whether that's possible — but you can tell in detail which users and which clients can use and see certain printers.
In CUPS 3 it will be in more detail. I think in CUPS 2 you can at least share printers to certain networks or to certain clients, and in CUPS 2 you can also include and exclude users who are allowed to print. In CUPS 3 you additionally have profiles with which you can filter which printers are visible and which are not. Hi, thanks for the talk. You mentioned scanning on a couple of slides. How far away are we from having an internet scanning protocol and all modern scanners supporting it? We already have standardized protocols for scanning. The most common one is eSCL, and if you have a driverless IPP multi-function device, it can usually scan with eSCL; some few others use another protocol, which is called WSD, which is from Microsoft. We have a SANE backend for these two, which is sane-airscan. It is not in the sane-backends core package — it's a separate package — but it's in all the distros currently, and one of these protocols is supported by any driverless IPP multi-function device, so the scanner in such a multi-function device also works. For further, more sophisticated support, there's also IPP Scan, but this is not yet picked up by the printer and scanner industry. The standard is there and completed, and we will perhaps use it later on at OpenPrinting for network scanning, to have more detailed permission and client-server handling with IPP Scan instead of eSCL. But for just scanning, we currently use eSCL. Thanks. Good morning. So, do you happen to know why Microsoft decided to adopt their own stuff instead of OpenPrinting? They decided to go with Mopria — why did they do that instead?
Yes — Microsoft is a classic commercial closed-source company, but they are starting with open source: they have hired Lennart Poettering, and this way the development of systemd happens at Microsoft. But somehow they never came up to me and said: we want to use OpenPrinting. There is also a consortium, Mopria, likewise of printer manufacturers and software companies, but all closed — not like the Printer Working Group, which is all open. And in Mopria they have also defined a specification for driverless IPP printing, which is very similar — it's practically the same specification as Apple AirPrint or IPP Everywhere, and therefore I always call it only "driverless IPP printing", because it's technically the same. So Mopria is also a consortium, and this group also writes code, puts the standard into code. They wrote, for example, an Android app, which is called Mopria, for printing on driverless IPP printers. And Microsoft seems to have worked with Mopria to get their code for driverless IPP printing into Windows Protected Print. They did not say themselves that they work together with Mopria, but people from the Printer Working Group have told this to me. And unfortunately they do it this way; it had been a dream of mine if Microsoft had adopted CUPS, so we would have a really universal standardization and one organization providing the printing code to everyone. But unfortunately this did not happen. Okay. Thank you to you, to Michael, and to all the printing open source heroes that help printing not suck as much in Linux. You're welcome. Thank you very much. Any more questions? Till, many thanks for your interesting talk. We've got a little present for you — let me have a look where it is. You're welcome.
Thanks for your talks.
SCION, hitting the future Internet road: Next-generation Internet ecosystem and burgeoning opportunities
Thank you. My pleasure to introduce the next speakers, Jordi and Tilmann, who are going to be speaking about SCION, the next-generation internet. Can we get a round of applause please? Thank you. Hello everyone. Thanks for attending the talk, and thanks to the host for having us here. I feel a little bit like a rockstar right now. We are Till and Jordi. We both come from ETH Zurich; we are part of the Network Security Group at ETH Zurich, and we are also part of the SCION open source implementation team. First question: who has heard about SCION before? Okay, some people — you can skip the introduction and the overview. So, for the rest, I will start by introducing what SCION is. SCION is a clean-slate design of an inter-domain architecture that considers security by design, to achieve security properties — mainly availability, but also transparency and control, reliability and scalability. I want to make clear here that SCION is about the inter-domain network, so it doesn't have anything to do with intra-domain protocols or higher-level protocols in that sense. The other thing I want to highlight on this slide is that SCION is an open source project. Here you have the GitHub repo in which you can find the reference implementation of SCION — Till will give more details afterwards — and here you can also find references to documentation and related material. So the second question is: why does SCION even exist? SCION comes as an alternative to our old friend, the BGP/IP Internet. This was created even before I was born, so imagine how much things have changed since then. SCION has the distinct aspect that it incorporates those security aspects I mentioned before from the very inception. Why do we need this? Because we need a network that provides availability even in the presence of malicious actors, because there are people interested in harming inter-domain routing.
We can find some recent examples — for instance an outage caused to a Spanish ISP due to a BGP attack. And we have several malicious actors on the Internet, unfortunately, from nation-state actors to cybercriminal groups, interested in causing harm for different reasons — from political motives to economic incentives, you name it. And the trusting nature of the current routing architecture sometimes doesn't make it clear enough where the trust boundaries are. So, probably some of you are hungry — maybe not hungry enough to just run away and grab lunch. I cannot offer you actual food, but I can offer you some yummy desserts: towards the end of the presentation we will present a couple of demos. One is a browsing demo using SCION, the second one is a SCION-aware first-person shooter, and finally we will walk you through steps and guidelines for developers who hopefully find this interesting and want to contribute, or just use what's there so far. But first, some overview of SCION. The whole SCION ecosystem includes different entities from different domains — from research institutions to ISPs, to vendors and integrators and users of the system. All this ecosystem is nurtured by the SCION Association, which is a non-profit organization responsible for the standardization of the SCION protocols. They have published three or four IETF drafts and are pushing them towards RFCs. They are also responsible for managing the open source implementation. So here I try to summarize SCION in five distinct aspects. The first one is that SCION is a path-aware internet architecture, meaning that end hosts are presented with path information about the network, and they can choose which path they use to send their traffic through. The second aspect is that SCION designs and implements a scalable trust infrastructure.
I will go into a little more detail on the next slide. It also designs and implements scalable path discovery, basically in the control plane, to achieve rapid global connectivity. Another aspect is its multi-path nature: as I said before, end hosts are presented with several paths that they can use, even simultaneously. And finally, another aspect I would like to highlight here is that there is already real-world deployment of SCION; this I will show you towards the middle of the presentation. So first, some terminology. My idea here is that you get the gist of it, so this doesn't stay completely abstract. The first term: SCION organizes itself into so-called isolation domains, or ISDs for short. These trust domains, as the name indicates, isolate trust: they are nothing else than groups of ASes, of autonomous systems, that share a common trust root configuration (TRC). So they basically agree on a set of routing policies that they want to use. The other term here is the core ASes, which are the ones in charge of managing — meaning updating — those TRCs and so on. They also provide peering with other ISDs. So basically they isolate trust. This is an important point I want to emphasize: they isolate trust, not any other kind of isolation. Then the other part of the overview is the control plane. Here I will explain briefly what the control plane and the path dissemination look like. Again, this is a coarse overview; it is full of details, and you will find them in the documentation and in the references to the books and the several papers that we have about this. So the routing information is disseminated through the network in so-called beacons, which are these colored squares here.
Those beacons are initiated at the core ASes that I mentioned before, and they are either propagated farther down the network in the local ISD, or propagated between the core ASes. These beacons are authenticated and extended at every hop, and every hop — meaning every AS on the path — decides how to extend these beacons, based only on its local policies. So on the slide you see that they are disseminated, and finally you have path information: those beacons have been disseminated, and at the very moment they reach an AS — here we can focus on the green ones, for example — they are already usable. There is no need for convergence in that sense; this piece of information is directly usable. This helps to achieve rapid path exploration and scalability. As I said before, this is just a quick overview, because I don't want to overrun you with details, but there is exhaustive evaluation of this scalability aspect of the control plane. One other aspect I mentioned before is the multi-path nature of SCION. From the end host perspective, end hosts retrieve path information from their local AS. They basically request this path information, and they retrieve several paths they can use simultaneously. The path servers in the local AS provide this information to the end host. So this is different from source routing: here the end host directly retrieves the paths from its local AS. Having many paths allows applications to optimize on different metrics: they might find some of those paths better in terms of latency, others better in terms of bandwidth, and they may hopefully find a point that better suits the needs of the application or end host.
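The beaconing described above can be reduced to a very small model: a beacon is an append-only record of the ASes it has traversed, each AS extends it with its own hop entry, and the result is immediately a usable path segment with no convergence step. The following Java sketch is a deliberately simplified illustration (the AS names and the `extend` helper are hypothetical, and real beacons are cryptographically signed per hop, which is omitted here):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BeaconSketch {
    // A beacon, modeled as the (append-only) list of ASes it has traversed so far.
    public static List<String> extend(List<String> beacon, String as) {
        List<String> out = new ArrayList<>(beacon);
        out.add(as); // each on-path AS appends (and, in reality, signs) its own hop entry
        return Collections.unmodifiableList(out);
    }

    public static void main(String[] args) {
        List<String> b = List.of("core-110"); // beacons originate at a core AS
        b = extend(b, "as-3303");
        b = extend(b, "as-559");
        // The moment this beacon arrives at as-559, it already describes a usable
        // path segment core-110 -> as-3303 -> as-559; no routing convergence needed.
        System.out.println(b);
    }
}
```

The point of the model is that "path discovery" is just accumulation: every AS that receives a beacon instantly holds a complete, verifiable segment back to a core AS.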
Just to put some numbers on it: in the current production network, the real fabric we are building, if you take two ASes you will find from dozens to even hundreds of paths that can be used to reach the other endpoint. The last slide of this overview is the control plane and data plane slide. The control plane is what I have just explained in the previous slides, and the data plane is what I will try to explain now. As I said before, end hosts retrieve those path segments from the local servers, and they combine them to create a path. You can find here two examples of paths: segments are combined into one path for packet one, and into a different path for packet two. Once the end host has encapsulated this information into the packet, it sends it out to the network. And routers forward packets based on the path information: they inspect this path information, which contains the next hop, and then routers can simply forward to that next hop. This allows for simple routers and stateless operation. As you can see here, those packets may belong to the same application: the end host in this case sends the packets over two different paths, which here are even disjoint. This can be useful if, for example, an application has a control channel — it can use a low-latency path for the control channel and a higher-bandwidth path for the actual application data. Okay, now I also want to convey the idea that there is already some tangible stuff, so it's not — as I said before — only a research project. Of course there is a lot of research in it, but there is real deployment and engineering in SCION right now. For that I will explain these two networks. The first one is the actual global SCION internet — the real production network, the real fabric.
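The stateless forwarding just described can be sketched in a few lines: the packet itself carries its full hop list plus a cursor, so a router only reads the next hop and advances the cursor, keeping no per-flow state. This is a toy model with invented hop names, not the actual SCION header layout:

```java
import java.util.List;

public class ForwardingSketch {
    // A packet carries its complete path (hop list) plus a cursor into it.
    public static final class Packet {
        final List<String> hops;
        int cursor;
        Packet(List<String> hops) { this.hops = hops; this.cursor = 0; }
    }

    // A "router": it only inspects the packet to learn the next hop,
    // then advances the cursor. No routing table, no per-flow state.
    public static String forward(Packet p) {
        String next = p.hops.get(p.cursor);
        p.cursor++;
        return next;
    }

    public static void main(String[] args) {
        // The end host chose this path and embedded it in the packet.
        Packet p = new Packet(List.of("br-110", "br-120", "br-130"));
        StringBuilder route = new StringBuilder();
        while (p.cursor < p.hops.size()) {
            route.append(forward(p)).append(' ');
        }
        System.out.println(route.toString().trim()); // hops visited in order
    }
}
```

Because all forwarding information is in the packet, two packets of the same application can follow disjoint paths simply by being built with different hop lists.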
I will introduce briefly some concrete ISDs — those colored bubbles that I showed at the beginning. And then I will also talk briefly about the SCIONLab testbed network, which is a completely separate network — in this case an overlay network — that anyone can use; I will give more details afterwards. So in general, this production network is not an overlay network; it's the real fabric, and it's BGP-free. It's currently deployed by several international ISPs — here you have some logos, you don't have to look at them all. Currently there are over 100 ASes, distributed in Switzerland — you find a few of them there — and also in the EU, in North America and in Asia. The other thing about the production network is that recently SCION cloud access has also been enabled; this is a commercial offering. But just so you know: if anyone happens to have cloud deployments, they can also access the production network. Okay, this is one of the examples, one ISD — one of those colored bubbles that I showed you at the beginning. This is the education and research ISD, SCIERA is the fancy name. Here you find universities — you have some of those here. This is a growing ISD, it's not closed, so more universities may come, and some are interested in joining. There are also other research institutions, and research and education networks that provide connectivity between those research entities. Here is a world map picture of how they are roughly distributed around the world. And then, very briefly, there are also industry use cases right now; in Switzerland there are the two that I'm going to introduce. The first one is the Secure Swiss Finance Network: they are basically using SCION, and they are going to phase out the Finance IPNet that they are currently using.
By June this year the network will have around 120 participants. And the other example, similar to the Secure Swiss Finance Network, is the Secure Swiss Healthcare Network, which provides a similar service for health professionals. So, that was the real production network, and this is SCIONLab, which is the research network. SCIONLab is a globally distributed testbed to conduct experiments and test deployments. Anyone can join — anyone in the audience can join this network just by downloading a virtual machine: with a Vagrantfile you run vagrant up, and then you have your node attached to one of these transit nodes. All of those shown are transit nodes; leaf nodes are not in here. The names are not that important, so they may be a little bit unreadable, but the different boxes are located in different parts of the world — for example Korea, North America, also the EU. Tilmann will give some pointers afterwards on where you can find the information for joining SCIONLab. We also have the awesome-scion list, a compilation of projects related to SCION. We have infrastructure projects — people implementing SCION on Tofino routers, also express routers; we also have firewalls using SCION and other infrastructure-related projects. We also have application projects — for example the SCION-enabled browser extension for Chromium, and a SCION-aware Quake 3 video game client. And of course we also have the libraries: pointers to the reference implementation again, to network APIs and client and host stacks in different languages. We have Go, then we have the client libraries for Java, about which Till will give more information and explanation in a moment. We also have a client and host stack in Rust, and bindings to other languages like C++ and Python.
This list also includes useful tools: SCION is integrated into the SEED emulator, so if you are using the SEED emulator you can also bring up your own SCION network. There are also Scapy libraries for packet generation and Wireshark plugins for packet capturing. So here is the first demo I want to show you. I will just switch to the video. Okay. So this is the SCION browser demo. First of all, this uses the production network, so you will see browsing on the production network. This extension is already part, as I just said, of these awesome projects using SCION. You load a SCION-enabled website — in this case, for example, the ETH website — and the extension provides some information about the resources and where they were loaded from. Here you see that the resources from the ETH domain were loaded via SCION: green indicates SCION, and red indicates that they fell back to BGP/IP. Of course this is configurable, and you could choose not to fall back at all if resources are not available via SCION. Here you have some path information. You see that we stay within the Swiss ISD — my client in this case is in the Swiss ISD, and we stay there because the server happens to be located in the same ISD. Then an example of navigating to another ISD: in this case we navigate to the Magdeburg university server, and here we see that the resources are loaded, all of them, via SCION, and we also have some path information. Here we see that we go from the Swiss ISD, where my client is, through the SCIERA education network that I presented before. Here you find the exact AS numbers that the traffic traverses. Then I type yet another example, and again we have more path information. This resource also happens to be located in the same ISD; you do find different ASes here, but that is not the important information.
So basically that is it. This again just shows that we fall back to BGP/IP, but that I already explained. The second demo I wanted to show is this Quake 3 demo. This demo uses the SCIONLab testbed network, not the production network. Our client is located on a node at ETH in Zurich, and we connect to the server, which is located in Magdeburg, Germany. So basically here you will see commands being typed. The piece of information I want to convey is this showpaths command, which is SCION-specific. The showpaths command shows all available paths from the client to the server. You get a bunch of paths, and we will see a bit more of them. The other thing is that, for demonstration purposes, we bind a key to the "next path" command. Then we can iterate and see how different paths provide different latencies. We have this keyboard shortcut, and while we are playing you will see it on the top left of the screen — not right now, but while we are playing. So this path, for example — if you saw before, this path had 100 milliseconds latency — and we start playing. And then, as I was saying, you now see in the top left corner that we have switched the path: we just press this keyboard shortcut and we iterate over the set of available paths. This is for demonstration purposes, to show that we can find paths with different latencies. We keep iterating, we see the changes in latency on the top left part of the screen, and we see these different latencies. Hopefully this will stop now on the last frame; for this specific path we get this latency. This is interactive, but of course you can program this and adapt your application to have a path selection algorithm that does it automatically and always takes the path with the best latency. And yes, that was it.
I will now hand the floor over to Till to explain the rest of the presentation. Okay, thank you, Jordi. So now let's imagine you found all this SCION stuff very interesting and you want to implement your own project. How would you go about that? The first step would be to go through this awesome-scion list that Jordi presented earlier, where you can find existing projects but also language libraries to connect to the SCION network. These are probably the most important ones for a new project. The first one here is the Go API; that's the reference implementation, the original implementation of SCION. It's the most comprehensive implementation; it contains everything, including border routers, control servers and everything you need to completely run SCION. It also comes with language bindings for C, C++ and Python. More recently we have a Rust API, 100% written in Rust, and, just released a few days ago, we have an alpha version of the Java API. That's actually what I'm going to talk about in the next few slides, because that's the project I'm involved in, the Java API. It's written in 100% pure Java. It is very similar to the DatagramChannel that people may know from Java, with a few exceptions. DatagramSocket is currently not implemented, but that's pretty much the next thing on our list, especially since I realized that a lot of existing older projects rely on DatagramSocket instead of DatagramChannel. The library also has an API for path inspection. This is pretty much what SCION is all about: you get a lot of paths from your AS and you select the best path for your purpose. So path inspection and selection is essential. It also supports the SCMP protocol — SCMP is like ICMP for SCION. So again you have echo (ping) and traceroute commands available. So let's look at a very basic Java client.
This is a DatagramChannel example, and basically there's nothing special to see here, because it looks exactly like using a normal DatagramChannel. The one thing to bear in mind is, for example, that the host name — ethz.ch here — needs to be a SCION-enabled host, otherwise you can't run that example; and also your local machine needs to be somehow connected to the SCION network. Let's look at a slightly more interesting example. There's an additional method — there are several, but this is one that may be interesting — called setPathPolicy. What you can do, of course, is just go through all the paths that you get from your local AS and then pick the one that you want to use. But it's much easier if you can just define a path policy — in this case, maximum bandwidth. You set that path policy on your channel, and the channel will always try to find a path that satisfies it. Now let's look at the server side. It's a little bit different from the native Java implementation in the sense that receive doesn't return an InetSocketAddress but a path object. The path object does contain the InetSocketAddress of the client that connected to the server, but it also contains the whole path that the packet actually took through the internet. And you can just use this path to send a response back to the client. The idea in SCION is that usually, if a client sends a packet to a server, the server sends the response back along exactly the same route. Technically you don't have to do that, but it makes it a lot easier for the server, because the server doesn't have to look up paths to reach the client — it's just much faster. I mentioned path policies before. The Java library comes with some predefined path policies. They are somewhat self-explanatory, probably, but one path policy is called "first": it just picks the first path that your AS gives you back when you ask it for a path to a certain destination.
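The server-side idea described above — receive hands you the path the packet took, and the reply simply travels that path in reverse, so the server never does a path lookup — can be modeled in a few lines. This is a self-contained sketch, not the actual SCION Java API; the `Path` record and `reversed` method are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReplyPathSketch {
    // What a server conceptually receives with each packet: the remote
    // address plus the full hop sequence the packet took to get here.
    public record Path(String remote, List<String> hops) {
        // Reverse the recorded hops to answer along the same route,
        // so the server needs no path lookup of its own.
        public Path reversed() {
            List<String> r = new ArrayList<>(hops);
            Collections.reverse(r);
            return new Path(remote, List.copyOf(r));
        }
    }

    public static void main(String[] args) {
        Path arrival = new Path("client:31000", List.of("as-110", "as-120", "as-130"));
        // The response is sent back over the reversed hop sequence.
        System.out.println(arrival.reversed().hops());
    }
}
```

The design choice this models is deliberate: reusing the arrival path keeps the server stateless with respect to routing and avoids a round trip to the path service for every client.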
That's a cheap way for the AS to recommend a path: they think this is the path you should use. The next one is min hops, which tries to find the shortest path, the one with the fewest hops, a hop being a border router or another AS you have to go through. Then there are min latency and max bandwidth, which also pretty much do what you would expect, except that these implementations are non-parameterized and static: they just rely on metadata. You ask your AS for a path and get metadata back that estimates the latency and gives you the allocated bandwidth for the links in the path. If you really want the best latency, you would need to implement a new filter (or we may also provide that in the future) that looks at all the paths, pings them, and then selects the one with the lowest latency. At the bottom of the list we have the ISD allow and ISD disallow filters, ISD being the isolation domain numbers that we saw previously, each one a whole set of ASes. Isolation domains can map to countries, for example, or to something like the university network that we saw earlier. So ISD allow and disallow can be used to implement something like geofencing: since ISDs can represent countries, you can decide that you don't want your packets to go through a certain country. So imagine you're in the bottom-left ISD, in one of the ASes of that ISD; your ISD is 110 and you want to send to ISD 130, and there are a lot of paths: some direct, some going via 99, some via 125 and 120. For some reason you don't like ISD 99, which could be a country or just some other organization. Then you can define your path policy like that. The exact syntax is a little different (I simplified it here), but it's pretty much that.
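The built-in policies can be pictured as simple filters over the list of paths the AS hands back. Here is a self-contained toy model of "min hops" and "ISD disallow" (a made-up `Path` record, not the real org.scion API; all names are hypothetical):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class PathPolicies {
    // Toy stand-in for a SCION path: the ISDs it crosses plus some metadata.
    record Path(List<Integer> isds, int hops, int latencyMs, int bandwidthMbps) {}

    // "min hops": the path with the fewest border routers / ASes to traverse.
    static Optional<Path> minHops(List<Path> paths) {
        return paths.stream().min(Comparator.comparingInt(Path::hops));
    }

    // "ISD disallow": drop every path touching a disallowed ISD (geofencing).
    static List<Path> disallowIsd(List<Path> paths, int isd) {
        return paths.stream().filter(p -> !p.isds().contains(isd)).toList();
    }

    public static void main(String[] args) {
        List<Path> paths = List.of(
            new Path(List.of(110, 99, 130), 4, 30, 100),     // shortest, but via ISD 99
            new Path(List.of(110, 125, 120, 130), 6, 45, 200),
            new Path(List.of(110, 130), 7, 60, 50));

        System.out.println(minHops(paths).get().hops());                  // 4
        // Exclude ISD 99, then pick the shortest remaining path.
        System.out.println(minHops(disallowIsd(paths, 99)).get().hops()); // 6
    }
}
```

Composing the two filters, as in the last line, mirrors the talk's example of sending from ISD 110 to 130 while refusing any path through ISD 99.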
So you can just exclude 99, and the filter will pick any path that doesn't go via 99. Once you've written your application, the next step is testing. The common way to test SCION is to run a local network on your machine, and you can do that using the reference implementation mentioned earlier, the scionproto reference implementation. What you do first is define a topology in a topology file, then you run this command. This will create a lot of configuration files for all the border routers, control servers, daemons and whatever else needs to be started on your machine. Then you can view the topology if you like. Here we have a very simple topology with three ASes (the three ellipses), one core AS at the top, and they all reside in the same ISD. That's just a simple example; there are also a number of topologies already in the repository, so you don't really need to write your own if you don't want to. Then you just run the topology, which starts up all the different processes for the different border routers and control services. They're all connected via loopback devices, and then you can just connect your local application to the network. In this case I just ran a ping to the core AS, and that's the result. There are other methods for testing. There's for example the SEED emulator that Jordi already mentioned, which does support SCION. Then there's SCIONLab, a worldwide network of SCION nodes. If you want to use it, you can go to the website, register, and allocate your own ASes if you want. Then you can, as mentioned previously, download an image for a virtual machine; the virtual machine acts like an AS that you run locally. You can even create several of those and build a network. Finally, you can test in the production network, but that requires you to actually have access to the production network.
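The topology file described above is a small YAML document. A minimal sketch with one core AS and one child AS, loosely modeled on the tiny example topologies shipped in the scionproto repository (field names and values here are approximate; check the repository's own topology examples before use):

```yaml
# Hypothetical minimal local SCION topology: ISD 1 with two ASes.
ASes:
  "1-ff00:0:110":            # core AS (the top ellipse in the diagram)
    core: true
    voting: true
    authoritative: true
    issuing: true
  "1-ff00:0:111":            # leaf AS, gets its certificates from the core
    cert_issuer: "1-ff00:0:110"
links:
  # A parent-child link between border-router interface 1 on each side.
  - {a: "1-ff00:0:110#1", b: "1-ff00:0:111#1", linkAtoB: CHILD}
```

From such a file, the generator command produces the per-process configuration for the border routers, control services and daemons that the local run then starts.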
So if you're lucky your ISP supports that, but there are currently not that many. AWS offers nodes with SCION access, so you can rent an AWS cloud instance. Or maybe your university has access to the SCIERA network. And finally, for debugging, there are a lot of command-line tools. I mentioned ping and traceroute before; there are also showpaths and several others. And there's a very neat Wireshark plugin, so you can just look at SCION packets, inspect the header, and look at the path that is associated with the header. If you want to contribute, there are tons of projects that could be done. You could start your own project: we are still missing native libraries for C and C++, for example, and there are no libraries at the moment for C# or Swift. You could think about embedded or mobile devices. Also network protocols: the Java implementation, for example, currently only supports UDP. We aim to support QUIC and probably TCP very soon, but there are many other protocols that would need support. Or you can take one of the big existing projects, web proxies, HTTP servers, video conferencing clients like Jitsi, or gaming libraries, and try to make them SCION-aware, so you can select paths, or automatically select good paths, in these projects. Finally, if you want help or support, there's a SCION Slack channel and a Matrix channel. And since last week we also have a SCION tag on Stack Overflow, so you can tag your question with it; some developers subscribe to the tag and will try to answer your questions. That's everything from my side. Thank you, and I'm looking forward to some questions. Thank you for a great presentation. My question regards security and protection against DoS attacks. You allow everybody to select a path for packets; how do you protect against someone doing that maliciously, for example sending packets back and forth between ASes to overload the network?
Yeah, so the question was how do you prevent DoS attacks, or how do you prevent people abusing paths, for example to create loops. These paths that you can see here are all signed by all the ASes on the way, so that makes it essentially impossible to create your own path. And that also makes it a bit easier to prevent DoS attacks, because you know where a packet came from; if you don't like that region, you can quite simply block everything that comes from it. Thanks. Yes. The next question is in a similar vein: how do you deal with the resilience of the network? The internet is very resilient because the routers can take independent decisions, but if you select a path as a user and a link goes down, that information has to disseminate all the way back to the user so they can select a new route and have their link up again, instead of that just happening transparently to the user. Okay, the point here is that normally you send the packet out to the network and the routers pick the next hop based only on the destination information. So when some link fails in the middle, the network needs to converge to a stable state: after the failure, what's the next router I have to send the packet to? This takes some time, and by then your packet may have timed out already, so you need to send the packet again. With SCION, you can already detect that the packet is taking long, or that you don't get feedback, and you can take, for example, a completely disjoint path. In that sense this failover mechanism is quite effective.
Normally, when a link is busy, you also get network feedback: you see that packets are taking longer, latency is increasing. So as a user you are interested in taking a healthier path, if I can say that, and you will automatically switch to that path. I have a question. You showed in your API example that when you create a connection you specify the route. But suppose a country changes and you no longer want your packet to go through it; then you have to change your software. Currently, when I make a connection, I only specify the destination and I don't care how it gets there. Now the information of how it gets there is mixed with where it should go. So when the route changes I need to change my client software, if I understand correctly. I hope I understood the question. The routers become very simple because they don't need to make any decisions anymore; the only thing they could do is verify whether the path is a signed path. So yes, you do have to update your clients. The clients all need new libraries. I have a quite high-level library with Java, but it could also be in the operating system, just another driver that sits underneath UDP or TCP and adds the SCION path transparently. But yes, that's the big work: we have to provide updates for the clients. Let me add my five cents. There are also transition mechanisms. We have, for example, the so-called SCION-IP Gateway. Traditional IP applications in a certain subnet send their traffic to this SCION-IP Gateway, and the gateway encapsulates the traffic into the SCION network. So with this transition mechanism you don't need to change your application. Of course, the application then doesn't get the best properties.
For example, you would not, as the application, be choosing your path and optimizing for all, or some, of those metrics. But you can still let the traffic go to this specific gateway, and the gateway will decide for you, depending maybe on local policies, where to send this traffic. I don't mean the one-time conversion; of course, if you switch to a new network protocol you have to change your software. It's more the dynamic stuff: now my provider decides that some AS is out of the loop, but I have to decide that as a client. Yes, but you can easily default to whatever paths your provider provides to you. You can always pick your default and just not care about the rest. But on top of that you have the choice, as a host, to decide where you want to send your traffic. If for your use case it's not important whether your traffic goes through, I don't know, any country you name, then you can just fall back to the default paths and you don't have to make this decision. Is path selection always from the client side, or can the server decide which paths are acceptable? Usually the client would contact the path server in its AS to get a selection of paths, and it would use those paths. Those paths are sent to the server, and the server could in theory look up a different path back to the client, but that feels a bit ineffective. So the server can just reverse the path; it's automatically reversed in this API when you send the packet back. Just adding something else: there are some projects that try to add negotiation, so the server can signal or indicate to the client which path is best to choose in its opinion, but this is separate from vanilla SCION; it's something you could put in place. I have a few questions, but they should be quick to answer.
I've read about this thing called the secure backbone autonomous system (SBAS), where you advertise better routes to existing BGP infrastructure. Is this in wide deployment, is it popular, is it being used? Also, have there been any experiments with Wi-Fi, does SCION work well in a wireless context? And in the Quake demo I didn't see an IP address; I saw some other sort of address. What is that address? It's like the ISD or something? So I got the first and the last questions; I will try to answer those first. The first was about SBAS. SBAS is also an incremental deployment architecture, basically a hybrid between BGP and SCION. The very basic idea (and I'm not the one involved in that project) is that you combine the two, so that BGP routes are announced close to this backbone, then you use this secure backbone, and then you go out to the internet again, hopefully close to the destination. The specific question was about the current deployment: it's under deployment. Some members of our team are making efforts, and this should soonish be usable in production; it's not there yet, but it's close. The last question was about the address format. Yes, the address that you saw is composed of the ISD (this colored bubble) and the AS, the individual AS of that ISD, so these two numbers, plus the end-host address. The end-host address has a scope within the autonomous system. So you could use basically any address you want, but the scope of this end-host address is specific to the autonomous system indicated by the ISD plus AS numbers. There was just one more question about wireless: has anyone done any experiments? We will now be starting projects for supporting SCION on Android, and there we are going to go deeper into that.
But of course, if you are interested in providing wireless support and optimizing SCION for the wireless use case, you are more than welcome. Thank you for the talk. My question is: you mentioned earlier in the presentation that there is a way to update which nodes can be trusted with the further routing. So how does that work? Who decides which nodes can act as ASes and which cannot, and what about those bigger bubbles, the ISDs? How do you decide which nodes act as ASes, and can you dynamically update it? If I understood the question correctly, it's about who decides which ASes are trustable or not. What SCION brings is this possibility for the sender, in this case the end host within the AS, as I said before. Your AS will provide you with a set of paths, and then, based on your local policies as an end host, you apply these policies to that set of paths and end up with a subset of paths that you consider good for your use case. This delegates some responsibility, but that is good. In the end, as I answered before, you can fall back to any default path and just be agnostic about where your packet goes; the main benefit is having a choice. ISDs basically represent jurisdictions. You can think of them as, for example, the Swiss ISD; we will have other countries' ISDs, or regions, or, in this case, this group of university institutions. So you might say: for my use case, I want my traffic to go only through research institutions, because I'm deploying this thing from, maybe, my home country. Then you can implement that policy, and apply it to the whole set of available paths.
Of course this will depend on your application; for doing certain things you are fine with other paths. So I don't know if I... The initial trust roots, they are agreed upon, so basically... Can we take this offline so we can get the next speaker, please? Yeah, I can answer you offline. The hallway track is a thing. Okay, thank you for the talk. Thank you.
Open Food Facts: Acting on the health and environmental impacts of the food system
Welcome everybody. We're going to start the next session now. It's my pleasure to introduce Pierre Slamich, who will be speaking on Open Food Facts: acting on the health and environmental impact of the food system. Hello everyone. I just have a quick question: have any of you in the room used Nutri-Score to choose food products, by a raise of hands? Okay. So you'll see that Open Food Facts has played a little part in getting Nutri-Score out. So let's start and dive right in; there's a lot on the menu. For those who don't know Open Food Facts, I'll briefly introduce it. I'll have a section on what's new in the project this year and what's cooking for next year, and we'll also be able to do Q&A, probably outdoors. Open Food Facts is a project that we started 10 years ago. It's an NGO, and it tries to answer: how do you choose the best product in the supermarket? There's a lot of information and it's not legible. I've never been able to understand the nutrition table; it's abstract to me. The ingredient lists are long as well. And yet food has a massive impact on public health. To give you an idea, obesity and overweight wipe out 3% of our GDP due to the cost of treating them. And the same goes for the planet: one quarter of carbon emissions comes from food. So the idea of Open Food Facts is to empower users and contributors to have an impact on their own health, on the environment, and on the health system at large. Our slogan, if you will, is: don't panic, but organize. Mobile crowdsourcing is a way to do that. If Wikipedia was able to build the largest encyclopedia on the planet, and OpenStreetMap the largest map, why not build the largest database of food products on the planet? Today, 10 years in, we have 3 million products from over 160 countries. The main sources are crowdsourcing, so you and me using mobile, but also the food industry, which has started to realize that transparency wins in the end.
So the Open Food Facts mobile app allows you to choose products that are good for you and the planet. You scan barcodes, you get Nutri-Score and Eco-Score. You also have a personal scan for those of you who have food allergies or want to go vegan; it will help you on that journey. It's of course privacy-preserving, privacy by design: we don't require any login. And if you don't have Nutri-Score in your country yet, you can get it on any product in a couple of seconds: you answer a few questions in the app and you get the scores instantaneously. So you can take your health to the next level with Nutri-Score, which is about nutritional quality, and NOVA, which is about food ultra-processing. Avoid NOVA 4 products as much as you possibly can. We also do additives and labels, and we make all of that simple to understand. Nutri-Score: we started computing it in 2015, when it was a scientific paper called the five-colour score. Now we compute it in every country, including Mexico and the United States. Everyone can get it, even if a producer doesn't want you to get it, because we recompute it. We created an ecosystem around it, and the nice thing is that, as you have all experienced, it's now in supermarkets in Europe. It's still not compulsory, though, and producers are beginning to improve their products. We also show Eco-Score, which is about the planet. Same principle: we use life cycle analyses, which are very precise analyses of food products, as an average, and then on top of the average we make the computation more precise for each product using specific data. With Eco-Score, the great news is that France will have an eco-score despite all the trouble you are seeing right now in France; it's in the law. That's the cool news. It's beginning to be experimented with in Belgium, at Colruyt, and it's also available in many European countries and the US. So we are having a more global discussion around it.
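Recomputing the score from a product's nutrition table ends with mapping a point total to a letter. A sketch of that last step for general foods (thresholds follow the commonly published original formula; treat them as illustrative, not authoritative, and note that beverages, fats and cheeses use different rules, omitted here):

```java
public class NutriGrade {
    // Maps a final Nutri-Score point total to a letter grade for general foods.
    // Assumed thresholds from the widely published original formula;
    // the new, stricter formula mentioned later in the talk shifts these.
    static char grade(int points) {
        if (points <= -1) return 'A';
        if (points <= 2)  return 'B';
        if (points <= 10) return 'C';
        if (points <= 18) return 'D';
        return 'E';
    }

    public static void main(String[] args) {
        System.out.println(grade(-3)); // A
        System.out.println(grade(12)); // D
    }
}
```

The points themselves are computed from negative factors (energy, sugars, saturated fat, sodium) minus positive factors (fruit/vegetable content, fibre, protein), which is what the app derives from the answers you give it.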
In terms of impact, Open Food Facts has quite a lot. Because we are open data, over 250 projects, applications and services reuse the data to inform users, from questions on pregnancy to allergies, etc. Even big corporations use it. The impact is a simple circle: we collect data using our mobile phones; people increasingly reuse that data to do many things, including scientific research; people get more educated, more mindful about what they eat; they start changing their behaviour, their purchase behaviour; and the whole industry actually starts to follow. The producers take notice and change their recipes as a result, everyone benefits, and the circle goes on and on. So from those kind of Photoshop or GIMP mock-ups that we did a couple of years ago, we went straight to this, where the Nutri-Score is everywhere. You go from Perl code to real-life impact, where basically all newly introduced products start to change for the better. What you can also see across Europe is, for instance, the differences in the food offer. We have taken photos across space and time for 10 years, and we found out that the Fanta recipe changes across Europe: Italy 12% fruit, Serbia 3% fruit, Portugal 8% fruit plus high-fructose corn syrup, and 0% fruit on the French island of Réunion. That's the kind of thing you can do with the data. We also have a giant map of food factories in Europe; that's Made Near Me. All the packaging codes you see on food products, we actually collect, and we can map them. You can do benchmarks if you like data: if you want to choose the perfect yogurt, you can. It's highly customizable; in 20 seconds you can do your own charts. We also have a platform for the food industry, to help them reformulate. We say: okay, here's an opportunity to reduce sugar a little bit, and then you will get a better Nutri-Score.
So we compute all of that, and brands have started playing the game. Some of the brands you consume every day are actually doing open data and sending open data to Open Food Facts, even the big ones like Unilever, even Ferrero of Nutella. They're starting to realize that consumer pressure is important. In terms of milestones: as I said, we launched Nutri-Score in 2015, ultra-processing in 2018, and Eco-Score more recently. So the project is a bit over 10 years old, and this year we crossed the three million products threshold, which is a nice milestone. We are now at 3.1 million monthly visitors on the websites and the app, and contributors together have made 28 million edits since 2018; it's still growing. The permanent team is growing. The community is much more engaged this year than it used to be: we were doing European meetups, we had our second Open Food Facts Days this fall, and we are also getting more people into coding. This year we also scaled app marketing to 40 languages, so that new users discover open data, open source and Open Food Facts. And we started getting into European events, trying to get a European community off the ground and not just be a French project. On the manufacturers side, we introduced a few new features as well, and manufacturers are getting on board. Even more important to us is scientific use and reuse: there were 30 scientific papers in nutrition and machine learning based on the data in 2023, and we have increased reuse a little as well. So what's cooking for 2024? It's going to be a big year, first and foremost because the Nutri-Score is going to change: the formula is going to become more strict. You know that Italy is trying to block it at the European level, while the scientists overwhelmingly support Nutri-Score. Seven countries have adopted it, and now the question is whether it will become the European score.
The new formula is going to be more stringent: about seven out of 10 products are going to lose a grade. It will be a two-year transition in real life, but as soon as we start deploying it on Open Food Facts, the new computation will apply to all products directly, even before producers do the transition. On mobile it's going to be a big year. I'm going to go very fast because there are only four minutes left. We did a lot of user interviews this fall, and so we are going to make the app more pedagogical and improve search. Here's a screenshot of all the ideas from the community. We are going to improve the onboarding so that people better understand the scores, make the personalization engine more intuitive, make all the information more legible, with guides to go even further for French users, and try to tackle the mineral water scandal. And we're improving search: thanks to the support of NGI Search, we are going to have live search in Open Food Facts. This year we are also going to go beyond food. The thing is, we have had an impact on food, but there are many objects, like this projector or this chair, which have a life cycle, and at one point the owner decides they're not worth keeping anymore. As a result we are surrounded by objects, some of which no longer serve us or please us, and they end up in the incinerator because we fail, collectively, to give them a second or third life, to repair them, to fix them. Open Products Facts is all about that: open data to power the circular economy. So this year we are going to merge Open Food Facts with Open Products Facts, Open Beauty Facts and Open Pet Food Facts, so that you can scan anything on the planet and get solutions for it. And yes, people have asked us for that for years. We are also getting into price collection this year.
We started Open Food Facts to answer: what's in my food? But people also want to know at what price. So we are starting Open Prices. Currently it's only a web app; it's only five weeks old, so it's still a very experimental project, even the logo is experimental. But basically, adding a price takes 20 seconds: you scan the barcode, you put in the price details, you put in the location, and it remembers the two or three locations you input previously. And then you start to notice weird stuff, like price variations within the same city for the same product in the same supermarket chain, and nobody is able to explain why. We are also thinking that we could kickstart a European price collection and build the first European Nutella price index. We already have a few prices in Europe, but you'd be very welcome to add the prices at your favorite shop nearby. We would also, and this is more experimental, like to help people free their data from receipts. So at this point you are asking: how can I get involved in my country? We already have broad European coverage, but there's still a lot of work to do. How can you contribute? Scan and add new products; that's the most basic, but the most vital way to contribute to Open Food Facts. Translations, word-spreading, taxonomies and design, where a lot of knowledge about food is required. And if you develop in any programming language, hacking and fixing is welcome; there are many programming languages you can contribute in. The mobile app is in Flutter; we have some machine learning, Robotoff, in Python; we're even experimenting with LLMs. Sixty seconds on the clock. Perl, Python, you name it, there's really something for you in there. That's the QR code: if you want to become a volunteer, you can scan it or go to openfoodfacts.org. Also, whether you're a student or an adult, we are going to apply to Google Summer of Code.
So if you want to become a mentor, or refer a mentor, feel free to do so. It's a nice way to have a large impact on food. We are independent from the food industry, by the way; we're not a startup or anything. We'd like to thank all the sponsors that are supporting parts of Open Food Facts, for enabling the infrastructure and everything. So let's get in touch. Eight seconds on the clock: you have the contact email, my personal email, and you can install the app right here. Thank you.
Observations on a DNSSEC incident: the Russian TLD
Welcome everybody. My name is David and I have the pleasure of introducing Stéphane Bortzmeyer, who will be speaking next on observations on a DNSSEC incident: the Russian TLD. Hello everyone. I work for AFNIC, which is the .fr domain name registry, so I know one or two things about the DNS. Let's see the problem first. This lightning talk appeared quite recently in the schedule, because everything happened on Tuesday this week. Many users noticed a problem with a lot of sites and services under the .ru TLD (TLD is top-level domain; .ru is for Russia). Many people reported it as "I cannot reach Yandex" or "I cannot reach VKontakte" or some other service, but actually it was a very general problem with .ru: everything with a name under .ru seemed to be down. But some people said: it still works, it works for me. On the internet, as a previous speaker said, the world is not coherent; it's perfectly possible that some users say a service is down while others say: hey, it works for me. In this case there was no apparent reason why it worked for some people, in Russia for instance, and not for others; outside of Russia it was the same thing. The problem lasted a few hours, three to four, which is a very common duration for an internet incident. Someone told me once that every internet incident is two hours of panic and five minutes to fix it. Now, a bit of analysis of the problem. I have something terrible to tell you: don't believe what you read on the web. A lot of bullshit. Many people don't know what they're talking about and don't rely on facts. In this case, for instance, a lot of things are observable on the internet: anyone can run a DNS client, run traceroutes, try with curl or other software. So it's possible to have actual, hard data. Yet some people prefer to immediately start writing anything on the social networks rather than collecting data.
If we collect data, we can see that the problem was not with one website or another. When people said Yandex was down, no, it was not specific to Yandex; the problem was with .ru as a whole. Many people immediately started to assume that it had something to do with the war, that it was an attack by the Ukrainians, or a problem caused by Russia. So the first problem is that many people talk on the social networks without first gathering data. But there is another problem: many people reacted to this event not based on facts but based on whether they were pro-Russian or anti-Russian. So they said it's the fault of Ukraine, the CIA, etc., or the opposite, it's the fault of Putin or Kadyrov or I don't know who. For instance, you can find in many articles published about this problem that it was because of Russian censorship, some censorship test that failed. There is no evidence supporting this. There is censorship in Russia, but in the specific case of Tuesday's incident there is absolutely zero evidence that it was an attack, and zero evidence that it had anything to do with Russian censorship. It was just a technical problem. To debug this sort of problem, let me spoil it immediately: it was a DNSSEC issue. But it was in the title, so you already know. The best tool to debug DNSSEC issues, if you don't know it, is DNSViz. DNSViz is one of those few programs that are loved both by hardcore hackers and by managers. Hackers love it because it's technically sound and produces correct diagnostics, and managers love it because there are pictures. Here you can see the chain of cryptographic keys that were used in .ru. At the top is what is called the key signing key, which is referenced from the DNS root. The key signing key signs two other keys, which are called the ZSKs, the zone signing keys. One was inactive at this time.
It was the old one, which was soon to be retired but still published, because, again, the world is not consistent, which means that different parts of the internet see different things, so you have to keep the information just in case. And on the new one, the active one, well, as you see, there is a problem. Red, not because of Russia, but because of a problem: in this case, invalid signatures for all these types of data. So this was at the heart of the problem. The zone was signed, cryptographically signed, but with invalid signatures. So the issue was at the .ru domain name registry, which is the organization in charge of the top-level domain .ru, unlike what many people said without any facts. It has nothing to do with the resolvers used by the internet access providers in Russia. The problem appeared for everyone. I had the problem at home, for instance, because the source of the problem, the root of the problem, was at the .ru domain name registry. Also, this registry, the same organization, is also in charge of two other top-level domains, which were unaffected, again unlike what you can read in many articles about the problem. So, DNSSEC is a security technology. The idea is to cryptographically sign the DNS data so the resolver at the other end can check that the data is pristine, is correct, and has not been modified. So in a way, and actually it was even in the official statement by the domain name registry, in a way DNSSEC worked, because the signatures were invalid, so the resolvers, rightly so, rejected them. Now, you cannot see immediately that the signatures were invalid. You can query the DNS with tools like dig, drill, etc. But of course, unless you can do RSA or ECDSA computations in your head, you will not see that a signature is invalid. You have to trust the software. So why did it work for some people? It's because not all DNS resolvers on Earth validate.
I didn't try the resolver used on the FOSDEM network, for instance. I assume it validates. But, for instance, many big internet access providers don't bother to validate, which means that if the signatures are incorrect, it doesn't matter, because they don't check anyway. Big public DNS resolvers like Google Public DNS validate, so they had the problem. Also, at home I have my own resolver, which validates, so I was also unable to see anything under .ru. But this can explain why some people said, hey, it works for me. Sure, because the DNS is decentralized, which means that every resolver on Earth does its own validation; some decided that no, it's broken, so you cannot access it, and some do not validate, so it worked, in a way. So, the lessons we can take from this incident. One is that the DNS is important; I can even say critical. Most activity on the internet starts with the DNS, so not having the DNS, for most people, is like having no internet. There have been some reports that, for instance, Russia was disconnected from the internet. Bullshit. It was easy to see that if you knew the IP address of a server, you could still reach it. But of course it's not really convenient. You cannot spend the day using ping and traceroute with IP addresses. So for most users it was exactly as if the internet in Russia was down, while it was only a DNS problem. So the DNS is critical. That's why the people who work to maintain the DNS should be paid much more, but that's another issue. Another important thing about the DNS is that domain names are organized in a tree, with a root. So you can create top-level domains like .fr, .be, .ru, and then second-level domains, yandex.ru, etc. And because of this organization in a tree, if you break one node in the tree, everything under it is down as well. So if you break something.com, every name under something.com disappears, and if you break a TLD, a top-level domain, big problem, because you break everything underneath.
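The point about the tree can be sketched in a few lines of Python. This is purely an illustration of the idea from the talk, not real DNS code: once one zone (here, the TLD "ru") is broken, a validating resolver rejects every name underneath it, while names in other subtrees keep resolving.

```python
# Toy model of the DNS name tree: breaking one zone breaks everything under it.
# Illustration only, not actual DNS resolution code.

def resolve(name, broken_zones):
    """Fail if the name, or any ancestor zone of the name, is broken."""
    labels = name.split(".")
    # Check every suffix of the name: "mail.yandex.ru" depends on
    # "yandex.ru", which depends on "ru".
    for i in range(len(labels)):
        zone = ".".join(labels[i:])
        if zone in broken_zones:
            return "SERVFAIL"  # a validating resolver rejects the whole subtree
    return "OK"

broken = {"ru"}  # invalid DNSSEC signatures at the TLD

print(resolve("yandex.ru", broken))       # SERVFAIL
print(resolve("mail.yandex.ru", broken))  # SERVFAIL
print(resolve("vk.com", broken))          # OK: a different subtree
```

The same mechanism explains why breaking something.com takes down every name under something.com: resolution of a name depends on every zone above it in the tree.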
That's why domain name registries are extremely important. Also, cryptography is hard. We know it. It's hard to do properly, it's hard to debug, and software has bugs. I'm sorry to inform you, again, that software has bugs. So there is still this problem today, that the internet could seem more robust if we could get rid of security measures, because every security technique can turn into a denial of service. In the case of .ru, many people said, oh, OK, because DNSSEC was broken and access was then denied, we should get rid of DNSSEC. OK, that's exactly as if, when you find an expired certificate on an HTTPS website, you decided that checking certificates is a bad idea. It's the same thing for every security technique. If you lock your door when you leave, and you then lose your keys, you cannot get back into your home. You have a denial of service. And yet people lock their doors, for good reasons. So it's the same here. It's true that in this case a problem in a security technique created a denial of service, but it doesn't mean that we should get rid of security. Again, it's a very general problem with every security technique. Also, one important lesson, but you already know it: free software is great, because in this case, without DNSViz, debugging such problems would be much harder. Of course, we could use tools like dig, drill, etc., but typically they don't make nice reports. It's not just the pleasure of a nice picture; it's also a good summary, and it allows you to see very quickly what was wrong. Some tools, like drill, for instance, and I use drill a lot, also reported the bad signature, but they report many other things as well, so it can be hard to pinpoint the problem. So DNSViz is really great. It can be used online, but it's also free software, so you can run it on your own machine if you want. Also, during the problem, I used the RIPE Atlas probes a lot. These are small probes running free software.
Volunteers install them all around the world, so you can make distributed measurements. Again, the world is not consistent. You can have things that work in one place and fail in another, so you also need distributed monitoring of the internet, distributed debugging. And this is exactly what the RIPE Atlas probes provide. The software on the probes is free software, though typically you don't mess with it. The server software is not, so it's not really free software everywhere, but it's quite open, because not only can anyone install RIPE Atlas probes, anyone can also ask for measurements from the probes. And they can do everything which is needed to debug DNS and DNSSEC issues. Thank you. I'll be around if you have questions or issues, or you can ask them in the Matrix room as well. Thank you.
A simple caching service for your CI
So, hello everyone. As said, my name is Rémi Duraffort, I'm a principal tech lead at Linaro. I've been working on open source projects for a long time now, and I've been at FOSDEM for many years; it's not my first FOSDEM presentation. I've worked on the VLC media player, on V8, the JavaScript engine, and I joined Linaro some years ago, working on LAVA and on automation and CI in general. So, today I wanted to speak a bit about a really tiny project that I created some years ago, which is called KissCache. And in order to present it, I have to explain why we are using KissCache at Linaro. At Linaro we contribute a lot to the Linux kernel, not only by developing new stuff, drivers and a lot of different things, but also by testing the Linux kernel. We have a project called LKFT, the Linux Kernel Functional Testing project. If you go to the website, it's written that the goal is to improve the Linux kernel quality on the ARM architecture, because we are mainly, but not only, about ARM, by performing regression testing and reporting on selected Linux kernel branches and the Android common kernel in real time. OK, that's what is written on the website. More or less, it's a project led by Linaro: an automated system to build and test a set of Linux kernel trees. We mainly care about LTS, obviously, mainline and next. And by contract, we have to provide a report within 48 hours: when there is an RC on an LTS tree, in 48 hours we have to provide a report, so it's quite tight. That's our SLA. So, if you look back at 2023, we built and tested 396 different RCs, and that's only the LTS kernels. As we also care about mainline and next, we built 2,443 different kernel commits. That's 1.1 million builds; 1.1 million kernels were built by the system, by LKFT. And we ran 297 million tests in just one year. And if you look at the Android part, the Android common kernel, that's 580 million tests.
The tests run both on virtual machines, so QEMU and FVP, where we have a specific system to instantiate many machines in the cloud for running QEMU and FVP, that's a TuxSuite service that we created, we will not speak about it today, and in a physical lab, with physical devices, in Cambridge, that is managed by a tool called LAVA. That's a tool that I'm maintaining inside Linaro. So, if you look at the LKFT architecture, really simplified, because obviously it's way more complex than that: as I said, we care about the LTS trees, mainline and next. We have GitLab repositories that are just mirroring the different trees that we care about. When there are changes, GitLab will pull them and we create a GitLab pipeline. The GitLab pipeline will send a set of instructions to our cloud service for building, called TuxBuild, that will run the builds. It will scale from zero machines to 5,000 machines in some seconds, do the builds, shut down the machines and then send the artifacts to an S3-like storage. The artifacts will be the kernel, the DTB, the root file system, the modules, etc. And then these artifacts will be pulled by our lab in Cambridge to be tested on real devices. In the lab in Cambridge, we have some hundreds of boards: Raspberry Pis, Dragonboards, HiKeys, X15s, etc., a lot of different boards. And at the same time, they will all pull the artifacts, deploy them on the hardware, depending on what kind of hardware you have, run the tests and then report back. And obviously, everything runs in parallel and pulls from the same storage. So, our CI system, as I said, will build and test artifacts: kernel, DTB, ramdisk, modules, etc. And each kernel, DTB and root file system will be used multiple times, because when we have one commit from the kernel, we build it for multiple architectures. We build it for x86, ARMv7, ARMv8, ARMv9, PPC, SH4, MIPS, etc. Then, for each architecture, we have multiple configurations.
I want to build with some virtio-specific configuration, I want to build in debug, in release, etc. And then for each configuration, for each commit, architecture and configuration, I will run a set of tests: kselftest, KUnit, libgpiod, LTP, etc. Considering that LTP, for example, is broken into 20 different test suites, that will be 20 different test jobs, because it takes a lot of time to run. So, the CI system will run a lot of different test jobs that will actually pull the same artifacts all the time, which means that on the network in the lab in Cambridge, we have a lot of network usage and a lot of duplication. We are re-downloading the same artifacts over and over. So, that's normally a really simple thing to solve: you just add caching. I'm really insisting on this point because it's really important: our CI system, the LAVA workers, will download the same artifacts multiple times, at the same time, in parallel. So, if you look for a caching proxy in the open-source community, you will obviously find that Squid is the main caching proxy, and it's a perfectly good one. It's really working well. So, we should just install that on our network, point all the workers to it, and it should work. The short answer is no, it's not working, just because of the two reasons above, and also for another reason, this one: all artifacts, as I said, are published in an S3-like bucket. They are somewhere in the cloud. So, obviously, if you want to download them, you will download over HTTPS. You will not download a random binary from the internet and run it in your local lab for testing; not something that you would do. So, we have to validate: we use HTTPS to be sure that what we're downloading is what we're expecting, at least as long as we trust the software. But when you add a Squid proxy in the connection, it will not work well with HTTPS. As written in the Squid documentation, you can make it work, but it's not easy.
The main problem is that, as an HTTP client, when you connect to a website over HTTPS, you're expecting to get a certificate, the connection will be encrypted with it, and you have to trust that certificate. When you add Squid in the middle, Squid will have to connect on your behalf to the server. The connection between Squid and the website is encrypted correctly; the certificate returned by the website is a legit one, so that part works. But Squid then has to decrypt the content to cache it, and re-encrypt it to send it back to you, and it does not have the website's private key, obviously. You don't have the private key of google.com on your machine, so you cannot re-encrypt the traffic. So, Squid needs its own certificate, and it will encrypt the traffic with its own self-signed certificate. And you will obviously not trust it. You will not trust your local Squid proxy to sign something from google.com or AWS or any website or the Linux Foundation. So, when the HTTP client receives the custom self-signed certificate, it will just say: no, I don't trust you. There is a workaround, and it's written in the Squid documentation, obviously, which is to create a wildcard certificate, a certificate that will be valid for absolutely every website on the planet, every DNS name, so it's kind of a dangerous certificate, and to install it on every one of your HTTP clients. It's possible, but it's really crappy, honestly. That's the first problem. The second problem, and there is no way to work around this one, is that when you try to download the same artifact multiple times through Squid, so, for example, you have two connections downloading the same root file system, Squid will download it twice and stream it back to the two clients at the same time. Only once the download is finished will a third connection get a cached version. As long as it's not cached locally, Squid will re-download it from the start.
And as I said before, our system by design runs everything in parallel, so it's often the case that we have multiple downloads of the same artifact at the exact same time. So when using Squid, it was just not caching anything. Sorry. So that's why we created KissCache. KISS stands for "keep it simple, stupid". It's a pretty simple and stupid caching service, but the main features it has are exactly what we need for our CI system. It allows caching HTTPS resources without any hacks or anything, and it downloads only once, even if you have multiple clients; they will all get the stream of data back. The reason why it works for both cases is that it's not a transparent proxy. It's not like the clients will know from an environment variable that they have to go through a proxy. Instead, you have to prefix your URLs. So if you want to access, for example, example.com/rootfs.ext4, you have to prefix it with your KissCache instance. So even if you're downloading over HTTPS, your client knows that it's talking to KissCache and not to example.com, so it's expecting a certificate from KissCache, not from the original website. That's the first point. And we also made KissCache so it knows how to stream the same content back to multiple clients. Fun thing, we also added a lot of automatic retries inside the KissCache backend. So if, for any reason, and it happens a lot, the connection between your network and the S3-like bucket breaks, and it often breaks, honestly, the KissCache backend will automatically retry. There is a list of HTTP status codes that we retry automatically. And when it's retrying, it retries up to 50 times over a period of two hours, because we have exponential backoff. So sometimes a download will actually take two hours and 50 retries, just because the S3-like bucket is sometimes a bit buggy to answer.
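The retry schedule described here can be sketched as a capped exponential backoff. The constants below are illustrative, chosen so that 50 retries span roughly two hours; they are not KissCache's actual values, and the set of retryable status codes is just a typical one, not the project's exact list.

```python
# Sketch of a capped exponential backoff schedule, in the spirit of what the
# talk describes (up to 50 retries spread over roughly two hours).
# All constants are illustrative, not KissCache's real configuration.

RETRYABLE = {408, 429, 500, 502, 503, 504}  # typical retryable HTTP codes

def backoff_delays(retries=50, base=1.0, cap=180.0):
    """Yield wait times in seconds: 1, 2, 4, 8, ... capped at `cap`."""
    for attempt in range(retries):
        yield min(cap, base * 2 ** attempt)

total = sum(backoff_delays())
print(f"total wait across 50 retries: {total / 3600:.1f} hours")
# -> total wait across 50 retries: 2.2 hours
```

Without the cap, doubling 50 times would produce astronomically long waits; the cap is what turns "exponential backoff" into a bounded, roughly two-hour window.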
We also added partial downloads: when we do a retry, if the HTTP server knows how to do that, we only download the remaining content, not everything from the start. And the good thing is that, with the automatic retries, the client will never see that there was a broken connection, because from the client to KissCache, the connection is kept alive; only the backend sees the network issues. So it has been in production for 3.5 years. It downloaded 32 terabytes of data from the internet and served 1.6 petabytes of data locally, for a really small, tiny piece of software, which is a ratio of 51. So we divided the network usage by 51 just by having a small caching proxy. It also improved stability a lot, thanks to the automatic retries, as I said, up to 50 retries, which is insane. And it also lowered the S3 egress costs a lot, because you have to pay for egress in the cloud, and for 1.6 petabytes of data, that's a lot of money. So yeah, we saved around 150K euros just by having a local proxy. Because I have just two minutes left, a quick look at the global architecture of the service: it's just a Django application with a Celery backend. You have a reverse proxy, nginx, though it can be any reverse proxy in fact, that receives an HTTP connection. It will hand that to gunicorn, which runs the Django application. Django will look at the database, Postgres, to know if the artifact has been downloaded already or not. If that's the case, it will then look at the file system and just hand the file back to nginx, saying: please send that to the client, and I'm done with it. If it's not already downloaded, it will send a message to Redis, which will spawn a Celery task that will actually do the download and the retries in the backend. And it's done only once. It then saves the content to the file system, appending to a file, byte by byte.
And at the same time, the Django process just reads the file on the file system and sends the bytes as they become available, waiting for the file to be finished. And if a second, a third, or many different users arrive for the same file, they will just reuse what is already available on the file system and wait for the download to finish. And that's all. It's pretty simple and efficient, it has been really useful for us, and it might be useful for your CI system. So if you have any questions, I will be here after the talk. Thanks a lot. Thank you.
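The serving idea described above, one writer appending the download to a file while readers stream whatever bytes are already on disk, can be sketched in a single process with two threads. This is a toy version under stated assumptions: the real KissCache coordinates separate processes (Django workers and a Celery task) around the same file, while here an `Event` stands in for the "download finished" signal stored in the database.

```python
import os
import tempfile
import threading
import time

# Toy sketch of the KissCache serving model: one writer appends the download
# to a file byte by byte, while a reader streams the bytes already on disk
# and waits for the rest. Illustration only, not KissCache's actual code.

def writer(path, payload, done):
    with open(path, "ab") as f:
        for byte in payload:       # simulate a slow backend download
            f.write(bytes([byte]))
            f.flush()
            time.sleep(0.001)
    done.set()                     # equivalent of marking the download complete

def reader(path, done, out):
    pos = 0
    with open(path, "rb") as f:
        while True:
            f.seek(pos)
            chunk = f.read()       # read whatever is available right now
            if chunk:
                out.extend(chunk)
                pos += len(chunk)
            elif done.is_set():
                break              # download finished and fully consumed
            else:
                time.sleep(0.001)  # wait for more bytes to appear

payload = b"kernel-image-bytes"
path = tempfile.mktemp()
open(path, "wb").close()
done = threading.Event()
out = bytearray()
t1 = threading.Thread(target=writer, args=(path, payload, done))
t2 = threading.Thread(target=reader, args=(path, done, out))
t1.start(); t2.start(); t1.join(); t2.join()
assert bytes(out) == payload
os.remove(path)
print("streamed", len(out), "bytes while the download was in progress")
```

A second or third reader would run the same loop against the same file, which is why the artifact is only ever downloaded once no matter how many clients ask for it in parallel.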
Reinventing database exploration with Azimutt
Welcome everybody, let's get started on the next session. My name is David and it is my pleasure to introduce Loïc Knuchel, who will be speaking on reinventing database exploration with Azimutt. Thanks a lot. Hi everyone, thanks a lot for coming to my talk. Indeed, I will talk about Azimutt and how we can explore a database with it. My name is Loïc Knuchel and I am a principal engineer at Doctolib. Basically, the whole talk is a story about how I started at Doctolib and ended up here, talking to you about Azimutt. Three years ago, I joined Doctolib. If you don't know Doctolib, it is a French company in healthcare, allowing patients to book appointments with doctors. It is built on a big monolith, in Ruby on Rails, backed by a PostgreSQL database. Basically, it is a huge monorepo and also a huge database, with 800 tables inside and several petabytes of data. As an architect, I joined Doctolib to work with the teams and help with architecture, improve the code and things like that. But for that, I have to understand what is inside the database: what the models are and what the relations are. The thing with Ruby on Rails is that you don't define the properties inside the models. You just define the relations, but often the models are quite long. They can be like 1,000 lines long, and sometimes a relation is buried 100 lines down or something like that. That is not really convenient, and I had to look inside the database a lot to understand what the things are and how they work. Basically, that is me working at Doctolib for the first month, and obviously, this is not very friendly. I had to find a tool. I looked at a lot of tools. They are called ERDs, entity-relationship diagrams, and they show tables and relations as nodes and edges. As you can see, this is not very friendly. Here there are 10 tables; imagine 800 and you will have some trouble. I tried quite a bunch, just for you to have a look at what they look like. Basically, none of them was adapted, for a few reasons.
The first one, and the most obvious: all of the tools I could find will show everything. When you show 800 tables, you don't understand anything. The second one is that most of them don't have an SQL or database import. The last one is that they are not private. Basically, I would have had to upload the schema to a service, and I don't want that for Doctolib. Basically, when we are developers and we are in this situation, we build another tool. That's what I did, with one big goal: make it easy for large databases, like 800 tables again, and you may have tables with a lot of columns, like 100 or something like that sometimes. Also, local: this is not only for us, but there is the possibility to stay local and just have it in your browser, not send any data to a service. And of course, open source. The first part was the schema exploration. When you load your schema into Azimutt, you don't see anything. You just see a search bar and an empty screen with some suggestions. The goal is to look for tables with the search and load just the tables you are interested in. Mostly, if you are working with a big database, you don't want to see everything. You just want to see one to ten tables around your scope, your feature, or something like that. You can make some nice diagrams like this, choosing the tables and the columns you want to show. Also, you can navigate from one table to another following the relations: obviously the foreign keys, the outgoing relations, but also the incoming relations, coming from the primary key being referenced by other tables. That's pretty nice to expand your diagram and explore what's around. Of course, as you don't see everything, you want many layouts: one per scope, discovery, team, or anything you want; several layouts of your database to understand it. The last thing is that sometimes in the database, you don't have foreign keys for all the relations: sometimes for performance reasons, sometimes for reliability.
There are a lot of reasons around there, but sometimes you don't have the relations as foreign keys. So, Azimutt can infer and suggest them directly inside the diagram. The last feature on the schema exploration is the find path. If you want to join data from one table to another and you don't really know all the tables in between, it can find a path for you. Basically, when developing this feature, I was very surprised at how many paths there are. You would be surprised too. So that's also a good reason to have a look at it. The second thing is, when people started using this, read-only, on the database schema, they wanted to draft new features on it, basically doing some more design for the database. So I made a DSL with an explicit goal to be very simple. Here is a bigger version if you want to read it: you just write the table name, and the column names with two spaces before. Then you can add some attributes, like the types, and primary key, unique, index, nullable and things like that. The goal of this DSL is to be very simple, very quick to write, to go as fast as your thoughts can flow and your fingers can type. When you do this kind of exploration, sometimes you make some discoveries, and you want to write them down somewhere, maybe for your colleagues but also for your future self, to avoid doing the exploration again. So there is a lot for documentation. Of course, the SQL comment on the table is loaded and accessible inside Azimutt. There are also notes on the table; this is the same idea as the SQL comment, but inside Azimutt, so you can edit and view it easily. There are also tags, to find things easily, and of course there is the same thing for the columns: the SQL comment, and notes you can add. The notes are in Markdown, so you can do formatting, with images if you want, links, lists and things like that. For the layouts, you have one layout for whatever you want, and you can document them with some memos inside.
Same with Markdown: you can put images, links, whatever you want to explain the whole schema or some part of the schema, and you can have a color behind. You can also have table groups, to show that tables belong together or are in the same context. That's how you can do documentation with Azimutt. The last part, which I did not long ago, is the data exploration. Basically, before, we were only on the structure, on the database model, but sometimes you want to go a bit deeper and understand the data inside the database: how it's working, what you can do. I think this is quite interesting. When you open the details sidebar for a table, you have all the details, but also all the columns, with a sample of the data inside. This is randomly picked data, not just one row with everything; I avoid nulls and things like that, so you have interesting data to show here. The same for a column: when you open the sidebar for a column, you have the most used values, the count of rows, the cardinality, the number of nulls and things like that, to know a bit what is inside this specific column. That's the quick access, but you can also run full queries. We have a visual editor for very simple queries, like a table with some filters, but you can also write any query you want and get the result. You have all the results on the right, in a list, so you can see different results, with some nice features to filter, to sort and things like that. The most interesting one is this small arrow here. You can click on it and see the related row in this panel. Here, I selected all the events, on this CFP database; the events are linked to a group. You can see in one click that it's the HumanTalks Paris group which is the linked row for this event. This also works in a nested way, so if you scroll down and see other relations in this sidebar, you can have multiple sidebars stacking, to navigate from one row to another.
Basically, this was quite interesting, but the very nice thing here is that you can add this specific row, so one row of data from a table, into the diagram. You can add it to the layout and see this row specifically. So this is not a table anymore, this is a row of data, with of course the table name and the table columns, but also the specific data for this specific row. You can refresh the query to get up-to-date data. And same as in the layout, you can navigate through the rows inside the data. If you click on the primary key again, you will have all the linked tables, and for each table, all the linked rows, with a maximum of 20, because sometimes it can be very expensive: for example events, or if you have some tracking things, you can have thousands of them. So you can easily see what the linked rows are, what the incoming links to this specific row are, and not just the tables in the schema. And then if you click on a specific one, you can show it. And the same goes for foreign keys: if you click on an outgoing relation, you can just show the related row. This allows you to make some nice diagrams with not only the tables of the schema but also actual data from your database; sometimes it's interesting to show that you have several rows from the same table, like here. And of course you can mix both on your layout, having your schema, so the tables above, and the rows below. This is very small, it's not intended to be read, but on the right you can see several different rows of the same table, in light blue. So I think that's a very interesting way to navigate in the data. If you want to try it out, it's available on azimutt.app, but there is also a nice CLI to load almost any database, so you can just do npx azimutt explore and then your database URL. It can of course be a remote URL, but also a local one: it will start a gateway on your machine, which is just a Node server to proxy the queries to your database.
So it also works with a local database, which makes it, I think, one of the only tools that can do that. So thanks a lot. You can try it on azimutt.app; it currently works with several databases, the major relational databases but also some document databases. And for relational databases, when you have a JSON field, a JSON column, it inspects the column, selecting 100 non-empty rows, and infers the schema from them, so you see directly the schema of your JSON column inside Azimutt. This project is fully open source. I've been working on it for a bit more than two years, and I intend to develop it a lot more in 2024, so if you are interested, there is a survey with a QR code, and I will be happy to have your feedback on what you thought about what I presented, but also on what your current problems with the database are, what you expect to see from a tool helping you interact with the database, and so on. Thank you all. There are still two minutes, so maybe I can take one or two questions. Is there any limitation when you explore a large or really large database? Yeah, there are several things. It's made for big databases; typically the databases have around 100 tables, and the biggest schema I've seen is 1,000, 1,500 tables, so there is no issue extracting the schema. There is more of an issue, and that's something I will address soon, when you explore the data: if you have a lot of data inside your database, the quick preview of the values in the tables and columns can be quite heavy to get. But after that, you just run some queries. So you will have performance issues if you run queries that touch a lot of data, but the queries run on your database, so that's not linked to Azimutt.
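The JSON schema inference mentioned above can be sketched in a few lines: sample some documents, union their keys, and track the value types seen for each key, marking keys that are absent in some documents as optional. This is a simplified illustration of the general technique, not Azimutt's actual algorithm.

```python
# Simplified sketch of inferring a schema from sampled JSON documents,
# in the spirit of what Azimutt does with JSON columns. Azimutt's real
# inference is more sophisticated; this just unions keys and types.

def infer_schema(docs):
    schema = {}
    for doc in docs:
        for key, value in doc.items():
            kind = "null" if value is None else type(value).__name__
            schema.setdefault(key, set()).add(kind)
    # Mark keys that are missing in some documents as optional.
    for key in schema:
        if any(key not in doc for doc in docs):
            schema[key].add("missing")
    return {key: sorted(kinds) for key, kinds in schema.items()}

sample = [
    {"id": 1, "name": "Alice", "tags": ["admin"]},
    {"id": 2, "name": None},
]
print(infer_schema(sample))
# -> {'id': ['int'], 'name': ['null', 'str'], 'tags': ['list', 'missing']}
```

Sampling only a bounded number of non-empty rows, as the talk describes, keeps this cheap even on a very large table, at the cost of possibly missing rare keys.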
Kùzu: A Graph Database Management System for Python Graph Data Science
Next we have Prashanth Rao with Kùzu, a graph database management system for Python graph data science. All right. Good afternoon, everyone. My name is Prashanth, and I'm an AI engineer at Kùzu. I'll be talking about graph databases today. Just a quick show of hands: how many people have worked with graph databases or heard of them? A fair number. OK, so you're in the right room today. I'll outline a bit of what I'm going to cover. I'll start with what graphs are, for those who are not familiar, and when you need graph modeling. I'll also cover some of the features of a competent graph database management system and what that means. And that leads into the vision that Kùzu has, both as a GDBMS, that is, a graph database management system, and as the go-to solution for graph data science. And I'll end with a walkthrough of how Kùzu makes graph data science workflows easier for the developer. So the first question we must ask is: what are graphs, or networks, as they're sometimes called? They are an abstract representation of entities and relationships. Essentially, an entity is represented as a node, and the way these are connected together is represented by an edge, which is the relationship shown in this figure. And as the figure at the bottom shows, these can get pretty complex and reveal really interesting structures in connected data. And that's exactly what we see in the real world. Graphs are actually one of the most natural ways to represent data. Social networks are of course something we are very familiar with, but graphs are very prevalent in many other domains, all the way from drug interactions to molecular networks to traffic networks. In the world of finance, you analyze transactions for things like fraud, and graphs are also very common in knowledge graphs that encode factual information about the world. Kùzu is a graph database management system, which is a class of database management systems.
So I'll start by giving a general overview of GDBMSs. You generally have three components to any database system: you have the data model, you have the query language, and you have the system implementation. From the data model perspective, graph data models differ from the conventional relational data model in the sense that you typically represent the data as nodes and edges, and you have key-value properties on these nodes or edges. In this example, with this triangle you see here, you have a cyclic relationship of transactions between people one, two, and three, where the nodes one, two, and three have property information on the name, and the edges have the amount of the transaction as a property. So this is called the property graph model of graphs, and it's very, very common and prevalent in the industry. But there's also another data model called RDF, the Resource Description Framework, which has a similar concept of subjects, predicates, and objects, which represent a triple. The triple is the basic unit of data in the graph, but it's the same idea as the property graph model, except the implementation is different. From a query language perspective, every graph database management system needs a high-level query language that's designed specifically with graph syntax. An example of this is shown here. This is the Cypher query language, which Kuzu implements. And incidentally, Cypher is the same language that was invented and popularized by Neo4j, if anybody's used that before. But what this example query snippet shows is you have node information, A and B, of type account, and you're matching on those nodes. And then you're running a join query in a way that's very declarative and very high-level, reminiscent of SQL.
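To make the property graph model concrete, here is a minimal sketch in plain Python (not Kuzu's actual API): nodes with key-value properties, edges carrying a transaction amount, and a small filter that plays the role of a Cypher MATCH ... WHERE ... RETURN. The names Alice, Bob, and Carol and the amounts are made up for illustration.

```python
# Hypothetical property-graph sketch in plain Python (not Kuzu's API).
# Nodes carry key-value properties; edges carry the transaction amount.
nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
}
# (source, target, properties) triples form the cyclic transfer 1 -> 2 -> 3 -> 1.
edges = [
    (1, 2, {"amount": 100}),
    (2, 3, {"amount": 250}),
    (3, 1, {"amount": 50}),
]

def transfers_over(threshold):
    """Return (source name, target name) pairs whose amount exceeds threshold,
    roughly what a declarative Cypher MATCH ... WHERE ... RETURN expresses."""
    return [
        (nodes[s]["name"], nodes[t]["name"])
        for s, t, props in edges
        if props["amount"] > threshold
    ]

print(transfers_over(75))  # -> [('Alice', 'Bob'), ('Bob', 'Carol')]
```

The point of a GDBMS is that this filtering (and much harder joins) is expressed declaratively and executed by the engine rather than hand-written like this.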
From a system implementation standpoint, universally, I think it's hard to come up with a statement that covers all graph systems, but in general, they implement storage structures, indices, and operators that are specific to graphs. One example is the shortest path operator. These are operators that are not prevalent in relational systems but are very common in graph systems. There are many reasons why you might need graph modeling, but I'll cover just a couple of them in these next two slides. For an example, let's take this query where we are trying to find direct or indirect possible sources of money flow into a person's account from a particular location. So in this example, the person is Alice, represented by node B in this Cypher query, and we are matching on the owner of that account, which is Alice, but also matching on account A, whose location is Canada. The key here is the middle portion, the transfer star: that star syntax is a high-level general syntax called the Kleene star. It's used to express indirect and recursive joins. As you can see, the query is quite concise; it's quite readable. You can do this sort of query in SQL, but it's a recursive query, and it's not as easy: it's going to be a lot more verbose and not that easy to read. One other example of this would be the shortest path query, which is a lot harder to do in recursive SQL, but in Cypher it's very, very straightforward; it's just an additional clause attached onto the previous query. Another case where you need graph modeling is heterogeneous data. This example here shows DBpedia, which is a structured version of Wikipedia, and we're taking the example of the location we're in right now, the Université Libre de Bruxelles. On the left is the way it's stored in structured form, where you have key-value properties, and each of these properties links to other properties. But on the right, we schematically represent that as a graph.
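Conceptually, the Kleene-star pattern and the shortest-path operator both boil down to graph traversal. The sketch below shows, in plain stdlib Python over a toy adjacency list (the graph data and node names are assumptions for illustration), what the engine does under the hood: a breadth-first search for "reachable via one or more edges" and for unweighted shortest paths.

```python
from collections import deque

def reachable(adj, start, max_hops=None):
    """All nodes reachable from start via one or more edges --
    conceptually what a Cypher variable-length pattern like -[:TRANSFER*]-> matches."""
    seen, frontier, hops = set(), deque([start]), 0
    while frontier and (max_hops is None or hops < max_hops):
        hops += 1
        for _ in range(len(frontier)):
            for nxt in adj.get(frontier.popleft(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen

def shortest_path(adj, start, goal):
    """Unweighted shortest path by BFS -- the graph-native operator the talk mentions."""
    queue, prev = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # goal not reachable

# Toy directed graph with a cycle A -> B -> C -> A and a tail C -> D.
adj = {"A": ["B"], "B": ["C"], "C": ["A", "D"]}
print(shortest_path(adj, "A", "D"))  # -> ['A', 'B', 'C', 'D']
```

The SQL equivalent needs a recursive common table expression; in Cypher both are one-line patterns, which is exactly the conciseness argument being made.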
As you can see, the university is linked to the city of Brussels. It's also linked to the country of Belgium and what affiliations it has, and each of these individual resources can be linked to other resources. This actually expresses the power of a graph model, because doing this with a tabular form of data would be almost impossible, because that's how Wikipedia is structured: it's a lot of connected information. This leads us to the question of what the feature set of a competent graph database is. We list a few of them here, but I think it's very difficult to go through each of them in the time we have. We do have a blog post that covers this in much more detail; it's called What Every Competent GDBMS Should Do, and it lays out the vision. But in a nutshell, every GDBMS has to support things like many-to-many growing joins and recursive joins on top of heterogeneous data sets, for example, knowledge graphs. Another thing that we can highlight here is the schema querying aspect, where in this last example, you have account information and transaction edges, and you're able to query on the type of the edge. Let's say you have two different kinds of transactions. You don't want each of those on either side of the middle node to be the same transaction type. You're able to apply a predicate on the edge to say that you don't want edges of a particular type. This is the sort of thing that you can't do in SQL; you can only do this in a graph model. The vision of Kuzu as a graph database management system is that it aims to represent the state of the art of how graphs should be stored, indexed, and queried. It does this by being highly scalable to several terabytes of data. It's very fast in terms of query speed. It supports the property graph model, which we described earlier. It also supports the RDF data model, which is coming in the next release. It does so via a high-level query language, Cypher. It's easy to use and uses an embeddable architecture.
We like to think of ourselves as like DuckDB or SQLite, but for graphs. If you've ever come across either of those two relational systems, Kuzu is like a graph analog to those systems. I should also note here that Kuzu is based on many years of research at the University of Waterloo. It's now being developed in an independent company called Kuzu Inc., which we're from. The other big vision that Kuzu has from a data science perspective, specifically a graph data science perspective, is to be the go-to back end for graph modeling and data science. Essentially the vision here is, if you look at the bottom half of this figure, you have a lot of data sitting in disparate sources like data lakes, warehouses, and relational databases, from Postgres to many others. There are a lot of interoperability challenges with these data sources. Even though you have structured data, in many cases working with it as a graph is quite challenging because of the movement of data across the systems into a powerful graph database back end. Kuzu aims to be a simpler way, sort of an interface to that. In the upper half, the aim is to make graph data science much more accessible in the sense that we provide zero-copy access to the data by writing out the format that is native to those libraries, for example PyTorch Geometric or NetworkX. These are popular graph data science and machine learning libraries in Python. By being well integrated with the Python data science ecosystem, we believe this makes it a lot more achievable. I'll quickly walk through an example of how Kuzu makes graph data science easier in terms of a workflow. Let's consider a simple, real-world toy example, where you have two different data sources. You have people and the movies that they watched. You also have people and their friends and where they live. These are two different data sets.
Your goal is to use the information from this data to build a movie recommender system where a person who has watched certain movies gets recommended other movies. There are many ways you can build such a recommendation system. The one we'll cover here is using a graph neural network, specifically using link prediction, where you're trying to predict a recommended edge between a person and a movie. This is a very simple example where you have data set one, which has the persons and the movies with some additional metadata, which could be age or any other attributes. Then data set two has persons, what friends those persons have, and where they live. For those who have not worked with graph machine learning before, there's a very high-level overview in this slide, where the goal of graph machine learning is to embed the nodes and their surrounding structure into a vector space. The benefit of this is that it incorporates the structure of the graph based on the nodes and their surrounding neighbors. The idea is that you perform a computation on the graph nodes and transform the features of that graph into a feature vector like the array shown there. The idea is very similar to the kind of vectors that you may have seen in other domains like computer vision or natural language processing, where the only difference is that in those domains you are considering the similarity between words in a sentence or pixels in an image, whereas in this case you're considering the similarity of the topology of the graph itself. All of that is great, but when you're working with the data, you're immediately faced with a problem. The data you have might exist in different sources. For example, the movies watched may exist in Postgres, and the person-friends data may exist in other structured data sources that you export to CSV or Parquet or something similar. You need to bring them together to form a graph. Conceptually, this is how a graph would look. You have nodes that represent the persons.
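The "embed nodes and their surroundings" idea can be sketched with a single, simplified message-passing step: each node's new feature vector is the mean of its own and its neighbors' vectors. This is a toy stand-in for what a graph neural network layer does (real GNN layers add learned weights and nonlinearities); the node names, features, and adjacency below are made up.

```python
# Toy sketch of one message-passing step: each node's new vector is the mean
# of its own and its neighbors' feature vectors. This is the core idea behind
# embedding nodes "and their surrounding structure" into a vector space.
features = {"alice": [1.0, 0.0], "bob": [0.0, 1.0], "carol": [1.0, 1.0]}
adj = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["bob"]}

def propagate(features, adj):
    out = {}
    for node, vec in features.items():
        # gather the neighbors' vectors plus the node's own vector
        neigh = [features[n] for n in adj[node]] + [vec]
        # column-wise mean over the gathered vectors
        out[node] = [sum(col) / len(neigh) for col in zip(*neigh)]
    return out

emb = propagate(features, adj)
print(emb["alice"])  # -> [0.5, 0.5]
```

Stacking such steps is what lets a node's embedding reflect its wider neighborhood topology, not just its own features.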
You have edges that represent the movies that they watched in the first graph, and then in the second one you have edges that represent friendship between people and what cities they live in. The moment you do that, you have another problem where you potentially have overlapping data or duplicate data between these two subgraphs. In one of them you have the persons and the cities, and in the other one you have persons and movies. Many of them might be the same people. There is some deduplication logic that's required, where you have to merge people with the same attributes, and there's some custom logic that needs to be put in in terms of how you decide whether something is a duplicate. Once you do that, you have the final result, where you have some nodes that are dangling, in the sense that they have no edges that attach them to other nodes. These don't actually inform the machine learning model and would need to be removed. Doing all of this is actually quite tedious if you were to write your own custom logic in your own language of choice. Where Kuzu comes in, and where it's very powerful, is the ability to just install an embeddable library using pip install kuzu in Python. Once you have that, you're very rapidly able to run query execution to create the tables, load the data in, and perform the deduplication logic and dangling node removal using a high-level query language like Cypher, in a way that scales to the size of the data that you have. In many ways, you don't have to worry about the scalability problem, because now you have a high-level query language supporting your operations in the middle stages. Once you have all of the data and the features loaded into a graph, you essentially can walk through this process where you not only have the data in the right form, but you're able to actually encode the features into the graph and store that on disk.
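The two cleanup steps described above, deduplication and dangling-node removal, can be sketched in plain Python. Everything here is hypothetical toy data with a deliberately naive dedup rule (case-insensitive names); the talk's point is that in Kuzu you would express the same logic declaratively in Cypher instead of hand-coding it.

```python
# Hypothetical sketch of the two cleanup steps, in plain Python.
# Data set 1: person -> movie watched; data set 2: person -> friend.
people_movies = [("alice", "Heat"), ("bob", "Alien")]
people_friends = [("Alice", "bob"), ("carol", "bob")]

def canon(name):
    # toy deduplication rule: treat person names case-insensitively
    return name.lower()

# merge both subgraphs under the canonical person ids
edges = [("person:" + canon(p), "movie:" + m) for p, m in people_movies]
edges += [("person:" + canon(a), "person:" + canon(b)) for a, b in people_friends]

# drop dangling nodes: keep only nodes that appear in at least one edge
connected = {n for e in edges for n in e}
all_nodes = connected | {"person:dave"}   # dave has no edges -> dangling
dangling = all_nodes - connected
print(sorted(dangling))                   # -> ['person:dave']
```

Real deduplication rules are usually fuzzier than lowercasing, which is exactly why the talk calls this step "custom logic"; the scalability argument is that a query engine handles it for you at any data size.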
One of the biggest limitations of PyTorch Geometric, if you've ever worked with it before, is that it's very memory intensive when you're working with large graphs. Kuzu helps a lot in this regard by persisting the features onto disk. That's exactly the point we want to highlight here. I think we are almost out of time, but I'll wrap up with the key points to take away. Kuzu is an in-process analytical graph database system, kind of like DuckDB is in the SQL world, but for graphs. It's highly scalable and optimized for multi-core parallelism, and very well integrated with the PyData ecosystem, including NumPy, PyArrow, NetworkX, PyTorch, and so on. It supports both the property graph model and the RDF graph model via Cypher, a high-level query language. It's embeddable and very easy to use from your application. It's also accessible with other language bindings, not just Python; if you come from other languages, those options exist as well. That's it from us. Kuzu is an open source project with a very permissive MIT license. I'd love for everyone to give it a try and reach out to us on Discord. We're always open to chatting more about graph use cases. Thank you.
Testing Containers with Python and pytest
Okay, next we have Dan Čermák with testing containers with Python and pytest. Wow, thanks. You haven't heard the talk yet, but thank you. So, first, the boring part: I'm Dan, I'm a software developer working for SUSE. I do other stuff, but since we only have 15 minutes, I'm just going to jump right into the meat, and that's why you should test containers. I'm not going to answer that: please test your containers if you deploy applications or anything else. And the first question usually people ask is, why don't you use shell scripts? Because, I mean, shell scripts are super portable, they run everywhere, and shell scripts are also pretty fast. And given that shell scripts run everywhere and they are so super-duper portable, everyone understands them. Apparently, I'm not everyone. Because in my opinion, shell scripts are very brittle, especially once I have to do string mangling; that's the point where I start to test my tests. And if I need to write tests to test my tests, I think I'm doing it wrong. You can disagree, but let me give you the short sales pitch for why you should use Python and pytest, and especially, since this is about a pytest plugin that I wrote, pytest_container. So, what this thing can do for you is, it handles all the boring plumbing part of a test suite for containers, like pulling images, building containers, launching everything, and cleaning up, and not leaving you with terabytes of stale data. It uses the Python testinfra module; in case you know Python and pytest, this is just another convenience module to access files, check whether there are open ports, stuff that you can do with the Python standard library, but it's just another convenience layer there. One part that took more time than I care to admit, but that I'm moderately proud of, is that the whole test suite is designed so that it supports parallel test runs.
So, if you use the pytest-xdist plugin, it allows you to execute all your tests in parallel, so assuming you have 500 cores, you can run 500 tests in parallel. And the whole thing also works if your container images expose ports, provided you don't open a thousand ports on each and run 500 tests in parallel; then you'll run out of free ports. But there's tools for that. If you're using Podman and not just Docker, it can also work with Podman pods. You can also create abstractions to create container volumes, and it will clean up after itself. Also, if you're more into the area of, I have an application and I want to check whether it works not just on my box, but also on Fedora, CentOS, Debian, Arch Linux, Alpine, and whatever else there is in the world: you can just define a set of tests, you tell the plugin which container images to execute them on, and it will do that for you. So, that allows you to have the same set of tests and run them on different containers, which would be more the area where you're looking to test an application. It works with Podman and Docker; you just switch by changing an environment variable. And if you happen to be in the lucky position of supporting enterprise-grade software that's very stable, hence very old, it still works with Python 3.6 and on all the important architectures, which also took more time than I care to admit. So, let's just take a look at a very simple example. This would be just a typical Python test file with your imports, then you define your container image. In this case, it's just the openSUSE Tumbleweed image. And then you define a very trivial test. And in this case, what you can see here is, this stuff here, that's really where testinfra shines: it just takes a look at the os-release there. Very simple test, but you could do more elaborate examples. So, what are possible use cases?
Of course, base images: you could just test those. You could do applications inside containers, and another possible use case would be you have an application and you want to check whether the application works on multiple OSes, but you don't need virtual machines for that. Then you could use pytest_container for that as well. So, I guess if you're in this talk, you might know a bit about pytest, and as the name suggests, it's a Python testing framework, otherwise the name would be very bad. All it really does is assemble tests, so it's like unittest on steroids: it executes all test functions. And one thing that pytest_container uses extensively is fixtures. If you're new to pytest, you probably know setup and teardown functions from other testing frameworks, and pytest fixtures are kind of that thing there. So, a fixture is really just a parameter for a test function, and it can return a certain value, and before that do some setup, give you something. In this very simple example, which is from the pytest docs, it would for instance create a mock SMTP connection, and for pytest_container, it gives you a connection to the already created container. And another cool thing that pytest has is test parameterization, where you can just define multiple parameters. So, in this case, you would have a test and you want to execute it for all combinations of those values, so it would run your test for the whole Cartesian product, so all combinations of 0, 1 and 2, 3. Let's just jump into a few usage examples. So, in case you want to build new containers, you just define yourself the base URL, you define yourself the Dockerfile, and you create the creatively named class DerivedContainer. And if you didn't already see, I shouldn't be in charge of naming things because I'm terrible at it, but I'm not very creative.
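The Cartesian-product behavior of stacked parametrization can be sketched without pytest at all: stacking two parametrize decorators runs the test once per combination, which is exactly `itertools.product` over the value lists. The values 0, 1 and 2, 3 are the ones from the talk's slide.

```python
from itertools import product

# pytest runs a test once per combination when parametrize decorators are
# stacked; this is the same Cartesian product, sketched with itertools.
xs = [0, 1]
ys = [2, 3]
cases = list(product(xs, ys))
print(cases)  # -> [(0, 2), (0, 3), (1, 2), (1, 3)]
```

In pytest this would be `@pytest.mark.parametrize("x", [0, 1])` stacked on `@pytest.mark.parametrize("y", [2, 3])`, producing four test invocations.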
But so, what will happen now if you pass this created class into your test function: the plugin will first pull the base image, it will build the container on top of that, launch it, pass it into the test function, and once the test is executed, it will get cleaned up. You can also pass in other already created containers into this as a base; that all works. I have an example for that later. As I mentioned, binding free ports: you might say, why don't you just add a parameter somewhere, okay, expose port 8000 on the host, and that works as long as you don't launch tests in parallel. So, if you for instance want to test this specific container five times in parallel, you can't bind all of them to the same port, and for that, there's a relatively simple abstraction. So, you just create this port forwarding class, pass it into the container, and then it will get exposed in the test. There, you will get the host port, and this is inferred automatically on launch of the test. If you want to test pods, so this is very Podman specific, it works rather like this. So, essentially a pod is just a combination of containers, and the only really interesting part that you want to use it for is again port forwarding, which works exactly the same as with containers. One little catch. So, so far I was claiming that your containers would be launched before the test and destroyed after the test, and that's not entirely true, because most tests don't modify the container, and then you can get away with creating your container before all tests and tearing it down after all tests, and you actually save a substantial amount of time. But if you decide to do tests like these, where you try whether rm -rf actually works, then any subsequent test will fail and start burning.
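The "host port inferred automatically" trick can be illustrated with the stdlib: binding a TCP socket to port 0 asks the operating system for an unused ephemeral port, which is roughly how a test harness can give each parallel test its own free host port. This is a generic sketch, not pytest_container's actual implementation.

```python
import socket

def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0 --
    roughly how a test plugin can pick a free host port per parallel test."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = find_free_port()
print(port)  # some ephemeral port chosen by the OS, e.g. 49231
```

Note the small race inherent to this approach: the port is released again before the container binds it, so another process could in principle grab it in between; real harnesses retry on collision.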
So, therefore there's a different fixture that's called container_per_test, and that will actually ensure that you get a fresh container for every test, but it costs extra. But then you can also rm -rf everything in your container, and the subsequent container will still work. For the case where you want to run a bunch of tests, but you don't want to do the whole pytest parameterization before that, you can just dump all your containers into a global variable that's called CONTAINER_IMAGES, and pytest will do the automatic parameterization. So, in this case, all the tests in the test module would get executed with all these containers, and that's for instance what we're doing in the Kiwi test functions, where you just want to ensure, okay, they work on CentOS Stream, Fedora, Debian, Arch Linux, etc. What I hinted at previously, that's dependencies between containers, which is essentially just: you want to build a container based on another one, based on another one, which would essentially be just splitting up your Dockerfile. This might sound like maybe a weird idea at first, but it can be relatively useful if you want to check different base images and then build stuff on top of them, or if you decide to modify your base image. So, we have used this relatively extensively in the BCI test suite, and you can simply create containers that derive from others, and those derive from others, and you can do this relatively extensively; just don't add loops. That will not work. In case you want to check whether, for instance, the environment in your container is what you expect, or some config, and you don't want to mess with the JSON that docker inspect or podman inspect gives you, that's also to a certain extent implemented.
So, you'll get a Python-usable version of the inspect of a container, where you can, for instance, check what the user is in there, what the CMD of this container is, if there's something in the environment, and other stuff. Since I'm nearly out of time, I'm going to jump over this, since it's not really that interesting actually. One important thing: if you create an application in your container, applications usually take time to launch, so please use health checks. Health checks are cool. I know they are not part of the OCI spec, that's a bummer, but please use health checks. Because if you start using a test suite and you yourself execute a test manually, you launch your container, you curl to check whether the application is there, and it all works; and then you automate it and it suddenly fails, because the machine is much faster than you are, and your application is simply not up yet. What pytest_container supports: if your container image has a health check defined, it will wait until the container is healthy. And as long as it's not healthy, it will not execute your test. If it becomes unhealthy, your test immediately fails. So, if you add a health check into your container file, or if it's just in the image, then it will wait and it will start executing the tests, and you can always be sure that your container will be healthy. And if you don't want that for whatever reason, then you can just say, okay, I don't care about health checks: you just define the timeout to be minus one, or you comment the health check out, but well, then your container might still be starting or it might be unhealthy. As I said, by default it will pick Podman, but you can just set use Docker. And what, as I said, I'm moderately proud of is that you can just run your tests in parallel. That also works with port forwardings.
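The wait-until-healthy behavior is, at its core, a polling loop with a deadline. Here is a generic sketch of that pattern (not pytest_container's actual code); `is_healthy` stands in for whatever probe the harness uses, such as reading the container's health status, and the toy probe below simply becomes healthy on its third poll.

```python
import time

def wait_until_healthy(is_healthy, timeout=30.0, interval=0.5):
    """Poll a health probe until it reports healthy or the timeout expires --
    the kind of wait a test harness performs before running container tests.
    is_healthy is any zero-argument callable returning True/False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_healthy():
            return True
        time.sleep(interval)
    return False

# toy probe that becomes healthy on the third poll
state = {"polls": 0}
def probe():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_until_healthy(probe, timeout=5.0, interval=0.01))  # -> True
```

This is exactly why manual testing "works" while automation fails: a human typing curl is the slow polling loop, and automation needs an explicit one.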
So, that can save you a lot of time, unless all your container builds themselves run in parallel; then you're not going to save a lot. The thing cleans up very well after itself. So, if you create containers, volumes, pods, or temporary directories, all that gets cleaned up; images and intermediate layers are retained, because otherwise it would just take forever and ever. There's a few people that use it. Most of them are those that I bullied into it, but maybe you'll find this useful and you will be on this list. And since I'm out of time, thank you.
Documenting and Fixing Non-Reproducible Builds due to Configuration Options
Good afternoon, everyone. So next we have Aaron, speaking about documenting and fixing non-reproducible builds due to configuration options. Thanks. So hello, everybody. My name is Aaron. I'm a PhD student at the University of Rennes, doing research in the DiverSE software engineering research team of Inria and IRISA in France. And today I'm going to talk about reproducible builds and software configurations. So what are reproducible builds? I took this definition from the paper "Reproducible Builds: Increasing the Integrity of Software Supply Chains". It says that the build process of a software product is reproducible when, given a specific version of the source code and all its dependencies, every build produces bit-by-bit identical artifacts, no matter the environment. And I think it's a really important point. So to achieve reproducible builds, there is a set of guidelines on the Reproducible Builds website, such as how to have deterministic build systems, what not to ship in the binary, or even how to distribute an environment, set some environment variables, and so on. So let's take an example. For Linux, I can go to the source tree that I've downloaded, and I just generate the configuration of the kernel. Here, in this case, I generated a tiny configuration; then I just build it. And once the build is done, I have a binary called vmlinux that I keep by putting it in /tmp, and then I clean everything up and I just reproduce the process. So tinyconfig run twice just produces the same configuration. And now, if I want to compare the products of these two builds by running diffoscope, which is a tool provided by the Reproducible Builds initiative, what happens? Just because I've built the two binaries a few seconds apart, I have two binaries that are different, not bit-by-bit identical. So, following the guidelines, I can set some values for environment variables of the build system, in this case Kbuild.
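The timestamp problem can be shown with a toy "build" that hashes its inputs plus a build time: two builds moments apart produce different artifacts, while pinning the timestamp (the role played by KBUILD_BUILD_TIMESTAMP, or SOURCE_DATE_EPOCH more generally) makes them bit-identical. The source string and the fixed epoch value are arbitrary illustrations.

```python
import hashlib
import time

def build(source, timestamp=None):
    """Toy 'build': the artifact embeds a build timestamp, as many build
    systems do by default. Passing a fixed timestamp plays the role of
    KBUILD_BUILD_TIMESTAMP / SOURCE_DATE_EPOCH in a real build."""
    ts = time.time_ns() if timestamp is None else timestamp
    artifact = f"{source}|built-at:{ts}".encode()
    return hashlib.sha256(artifact).hexdigest()

src = "int main(void){return 0;}"
unfixed_a, unfixed_b = build(src), build(src)    # built moments apart: typically differ
pinned_a = build(src, timestamp=1704067200)      # 1st of January 2024
pinned_b = build(src, timestamp=1704067200)
print(pinned_a == pinned_b)  # -> True: bit-identical once the timestamp is pinned
```

The same mechanism generalizes: any nondeterministic input baked into the artifact (paths, randomness, locale) breaks bit-by-bit identity, and the guidelines are mostly about pinning those inputs.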
So I can give a fixed date, for instance here the 1st of January of this year. And now I can have a bit-by-bit identical binary. The question is, in Linux, for instance, we have all different sets of configurations. We have the default configurations per architecture, allyesconfig, allmodconfig, and so on, and especially randconfig, which will set some configuration options randomly. So do I just need to fix all of the reproducibility issues for Linux with this trick? We can look in the documentation. The Kbuild trick is obviously written in the documentation. But the documentation also emphasizes configuration options; here we have six of them. Just as a reminder, in the kernel you can set some options to values, either yes, no, or module, to ship them or not. And so here we have a list of six configuration options. But is that all? As of the latest version of the kernel, I think there are more than 19,000 configuration options. So there are six configuration options documented as having an impact on the reproducibility of the kernel, among all these configuration options. So to answer this question, we basically have a kind of brute-force approach. We just generate a set of random configurations, as you can see here on the left, then we build them in the same environment. So we have a fixed Dockerfile, and for each build, we build in a newly built container. Then we compare the binaries. So we don't compare all of the intermediate files of the build; we just compare the final binary. Then you simply do a diff on the binary and get all the results, as you can see here. So there's a way to encode the configurations in a tabular representation. We just have a row with all the configuration options, where zero means no and one means yes, enabled, or module if it exists. Then we get all the data and put it in a classification algorithm, and we just get the outlier options that are responsible for the non-reproducibility.
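The tabular encoding plus outlier detection can be sketched with toy data. The talk uses a decision tree for the classification step; as a crude stand-in, the sketch below just ranks options by how much more often they are enabled in non-reproducible builds than in reproducible ones. The option names and the four rows are invented for illustration, not the paper's data.

```python
# Hypothetical sketch of the tabular encoding and a crude stand-in for the
# classification step (the real approach uses a decision tree).
builds = [
    # each row: configuration options (1 = yes/enabled, 0 = no) + outcome
    ({"MODULE_SIG": 1, "DEBUG_INFO": 1}, False),   # not reproducible
    ({"MODULE_SIG": 1, "DEBUG_INFO": 0}, False),   # not reproducible
    ({"MODULE_SIG": 0, "DEBUG_INFO": 1}, True),    # reproducible
    ({"MODULE_SIG": 0, "DEBUG_INFO": 0}, True),    # reproducible
]

def suspect_options(builds):
    options = builds[0][0].keys()
    def enable_rate(opt, reproducible):
        rows = [cfg for cfg, ok in builds if ok == reproducible]
        return sum(cfg[opt] for cfg in rows) / len(rows)
    # options enabled far more often in non-reproducible builds are suspects
    return sorted(options,
                  key=lambda o: enable_rate(o, False) - enable_rate(o, True),
                  reverse=True)

print(suspect_options(builds)[0])  # -> MODULE_SIG
```

In this toy data MODULE_SIG is enabled in every non-reproducible build and no reproducible one, so it surfaces as the outlier, while DEBUG_INFO shows no correlation.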
Then from the list, we have a phase that is an exploration phase, which I will explain a little bit later, where we enrich the list we got from the classification algorithm. Then we have a fix phase. And the idea is, if the options are indeed responsible for the non-reproducibility, to add them to the documentation. So this is the setup. We have 2,000 configurations for each system we study: the Linux kernel, but also BusyBox and Toybox. We generate random configurations. We have a preset for x86_64 for the kernel. And then for the environment, we just derive from the TuxMake image, and we set all of the environment variables so they don't vary during the build, like the timestamp and so on. So here's one of our first results: for Linux, 47% of the builds were non-reproducible. And for BusyBox, we have two cases here. We have the first case where we didn't vary the environment, so the build path, and there's a case where we did vary the build path. We wanted to showcase that there is an interaction between two layers, the configuration and the build path. And to solve it, you can choose either to fix the build path or to disable the debug configuration option; it's up to you. But if we enable the debug configuration option and we vary the build path between two builds, we have 49% non-reproducible builds. And for Toybox, it's 100% reproducible in our study. And so now, who is to blame? Now for the Linux case. Here we have an example of the decision tree we got from the process, and we have five configuration options here. What we do is we don't consider the tree structure, like: if I disable MODULE_SIG_SHA1, the next responsible is GCOV_PROFILE_FTRACE, and so on. Here we just flatten everything and we consider each configuration option as independent.
And so we have this list of five configuration options; MODULE_SIG_SHA1 is similar to a configuration option that is in the documentation, but the rest of them are not in the documentation of Linux. And now we have an exploration phase, where the main idea is that we want to identify all the options of the same kind. So in the documentation, we saw that we had some configuration options on module signing, like MODULE_SIG_ALL, MODULE_SIG_KEY, and so on. And so here the idea is just to identify the siblings of the options: if I disable one option, I have another alternative of the same kind, and we just explore all the alternatives. A great example here is MODULE_SIG_SHA1: if I disable it, I have to enable SHA224 or SHA256 and so on. And once we've got all of the siblings, we just use the naming convention in Kconfig to get the parent. So we know that if I want to disable this specific option, I have to disable its parent. And now, on to the fix of each configuration option. The idea is to remove all of the detected configuration options from the initial configuration. And it's kind of a hard task sometimes in the Linux kernel, because we have to get all of the dependencies of the configuration options. So to detect the dependencies of each of the configuration options you want to modify or change, we use a tool called ConfigFix, which is a SAT-based solver that is presented in detail in this paper here. And it gives a list of conditions to satisfy, which can be a configuration option and the value, in order to apply a change. And then once the change is applied, the change being just setting the option in the configuration to no, we build again and check for reproducibility. And from the list we got, we were able to make 96% of the non-reproducible builds reproducible.
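The dependency part of the fix step is essentially a transitive closure: to set one option to "no" you must also disable everything that (transitively) requires it. The sketch below illustrates that idea in plain Python over a toy reverse-dependency map; the option names are real kernel option names used only as labels here, the dependency edges are invented, and the real tool (ConfigFix) solves the much harder general Kconfig constraint problem.

```python
# Hypothetical sketch of the fix step: to disable an option you must also
# disable everything that (transitively) depends on it.
# 'reverse_deps' maps an option to the options that require it (toy data).
reverse_deps = {
    "MODULE_SIG": ["MODULE_SIG_ALL", "MODULE_SIG_SHA1"],
    "MODULE_SIG_ALL": [],
    "MODULE_SIG_SHA1": [],
}

def options_to_disable(target, reverse_deps):
    """Transitive closure of options that must be set to 'no' with target."""
    todo, result = [target], set()
    while todo:
        opt = todo.pop()
        if opt not in result:
            result.add(opt)
            todo.extend(reverse_deps.get(opt, []))
    return result

print(sorted(options_to_disable("MODULE_SIG", reverse_deps)))
# -> ['MODULE_SIG', 'MODULE_SIG_ALL', 'MODULE_SIG_SHA1']
```

Kconfig adds complications this sketch ignores, such as `select` statements, choice groups, and default values, which is why a SAT-based solver is needed in practice.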
We had 31 configurations, so 3.5%, that are still not reproducible due to some dependency we couldn't identify; that's one of the limits of the approach. And for less than 0.5%, the tool we used couldn't find a diagnosis. But compared to the first result I showed, we went from 47% non-reproducibility to 1%. So now the summary. One of the takeaways is that options matter: we should explore more the impact of configuration options on the reproducibility of builds. The second takeaway is that there can be interactions across variability layers, as I showed for BusyBox; we also need to detect them and to pinpoint and describe them precisely in the documentation. We have identified more configuration options that could be added to the documentation, so we'll send a patch soon. Overall, 96% of non-reproducible builds were made reproducible. If you want more detail on the whole approach, this will be presented at the Mining Software Repositories (MSR) conference, an academic conference that will happen in Portugal in April. Thank you for your attention.
Platform engineering for dummies
Great. So good afternoon everyone. Next we will have Donnie Berkholz introducing platform engineering for dummies. Thank you. Super excited to be here today. It's been a number of years for many of us since being at a FOSDEM in person, so welcome back. I was very happy to be here. I got myself a very nice Belgian beer as soon as I arrived, so I'm feeling great right now, all ready for my talk. Only one for now, just one; the rest will come later. And I'm assuming none of you are actually dummies, so thank you for coming to this talk. This is just for people who have heard the term platform engineering. It's getting increasingly popular; it's the only thing people talk about besides AI these days, and we're going to mostly skip that one. We're going to talk about what it is, how vendors are completely destroying the term, just like they do with everything, and then how to get started with it yourself, making it as easy as possible. You don't have to buy vendor solutions; you can use open source, off-the-shelf software. It doesn't even have to be custom and brand new. By the end of this talk, you'll have a really good sense of platform engineering, at least as good and as deep as you can get over the course of the next 12 or 13 minutes. You'll have a lot of good resources; I've got links in here on a couple of the slides as well, so you can go check those out afterwards. Because it's not just about technology: it's also about the people and the process. There are a lot of different pieces you have to get right. In fact, the technology in many cases is the easy part. But first, a very short story. A few years ago I worked as a technology leader leading a DevOps transformation, that's what we called it at the time; we would now probably call it platform engineering, at a travel tech company called Carlson Wagonlit Travel, CWT. It actually had an office here in Brussels; I visited there a few years back. Great place.
Lots of interesting development happening there. Since then, I have led product management at Docker and at Percona, around open source, containers, and databases. I've spent a long time in the platform space. Long story short, I know what I'm talking about; I've been doing platforms for 20-plus years at this point, as have many of you. I'm just sharing my own story and my own perspective here; I'm sure many of you have your own. When we think about platform engineering, or at least the way I look at it, there are really three key pillars: platform operations, platform as product, and self-service for developers. We're going to jump into each of those pillars and talk a little more about what it means. If you want to check this out afterwards, I have my own little independent-analyst blog post about it; feel free to read it at your leisure. So what does platform operations mean? There are a lot of companies today... in fact, how many of you come from a large enterprise? Do you have something called a platform team? Does it maintain maybe a Linux OS, maybe some other OSes that we won't talk about, things like that? It just got called the platform team at some point. It might have been the OS team; before that, maybe it was merged with the network team or something else like that. When we talk about platform operations, we really mean operating it as a holistic platform, regardless of how many servers, VMs, or containers might be underneath it.
It's the same thing we talked about ten years ago with cloud, and five years ago with DevOps: moving away from the pets mindset into the cattle mindset, away from the single server or single container named after our favorite characters or favorite TV shows, into the mindset that these things are fungible and disposable, that we operate them as applications and fleets of things, automatically created and deleted on demand. We're in the world of SRE now; we're moving more and more into things like SLOs, monitoring the user impact of the applications you're serving. In this case, platform engineering means building for developers, but even if you're serving a platform to internal developers, you still have to care about the quality of service you're giving them: your latency, your error rate, how much of your capacity you're using at any given moment. You have to treat those internal applications just as importantly as the ones you serve to external customers and users. A lot of companies don't do that. They'll have their tier-one, business-facing applications, which get major incidents, war rooms, and all that kind of thing when there's an outage, but if their CI pipeline goes down, they say, oh well, it'll be back eventually, it'll be fine, our developers can just do nothing for most of a day, no big deal. A lot of companies are still like that, but we have to apply this platform operations concept not just to external customer-facing applications; we have to treat developer productivity as business-critical in its own right, because developers are expensive, and a developer sitting there for a day, unable to ship software, is expensive. And so we went through exactly this journey at CWT.
One good example of this: we started by monitoring tens of thousands of different infrastructure metrics, the classic old-school world of monitoring, and we shifted to just a handful of user-facing impact metrics. But along the way we had to educate our developers and operations teams on how to debug things in a much more complicated way than they were used to, because with an infrastructure metric you could have a simple runbook: you see this thing, you push this button, done. Whereas if you have a metric saying my application is slow, there are a lot more potential causes, and a lot more to learn before you can jump in. So at the same time we made this technology transition, we also had to upskill a lot of our level-two operations teams, having them become SREs in their own right, learning how to automate things and how to debug things much more deeply. Now, the second piece is platform as product. What I mean is that for things like your internal CI pipelines, your container services, and whatever other internal developer tools and services you might have, you have to apply the methods of product management to them.
You don't have to have a full-time product manager; if you do, fantastic, you're lucky and fortunate, congratulations. But if you don't, there are a lot of different people who can pick up some of that load and learn how to do modern digital product management. Depending on how traditional your company is, you might even have people called service managers, who might use a framework called ITIL; those people still have the potential to modernize, move forward, get with the times, and apply modern product management approaches. That means talking to your internal stakeholders and understanding the problems they're trying to solve. In many cases they're providing a service; source code management, for example, is a service you provide to your developers, and if you're at a big company there's probably a team running it. Do those people talk to their own developers about what problems they're trying to solve and what their workflows look like? Chances are, probably not; they just shove stuff at them and say good luck. We're fortunate that we now have better tools than we used to, but there's a lot of opportunity for people on these central platform teams or central developer productivity teams to go talk to their own developers about the problems they're trying to solve every day, understand their pain points, and bring that back in. At the bottom I've shared a handful of links, at varying levels of depth, that are super good resources if you want to learn this or share it with other teams: an entire specialization on Coursera that will probably take somebody six months at an hour or a few hours a week, a great book by the same person who put together that series of courses, and a website you can just go read for free to start checking it out right now.
In every one of those cases, they aren't written for platform-as-product people, and they aren't written only for internal product management; they're written for anybody doing modern product management, so you have to do a little bit of extra work to think about what this means for you specifically. But you're all smart people; you can figure that out. Applying this platform-as-product approach is absolutely critical to doing platform engineering right, and nothing about it requires a specific piece of technology; nothing about it says proprietary versus open source. This is the people and process side of it, but you have to get it right, because if all you do is say, hey, we gave you a platform, now we've got platform engineering, you're wrong. What probably happened, especially if you're at a big enterprise, is you still have a ticketing system somewhere, and you're still requiring developers to file a ticket every time they want access to some new resource. If you're getting platform engineering right, you're moving away from that, because you've talked to your developers, you've understood their needs, and you've probably moved to something much more policy-driven. There might be an initial ticket, but the only thing it does is assign the developer a role: I'm working as a developer, or I'm working as a developer in a certain application area. Then they're granted policy-driven access and can get on with their lives, instead of filing a ticket every single time they need access to a new server, every time they need a VM created, every time they need additional memory provisioned to the VM. All these things are crazy, and in many cloud environments they've been partially solved, but a lot of us are still working on premises, with servers in data centers or in colos, or in clouds that feel like that. In every one of those cases this is an opportunity to make dramatic improvements in our own productivity as developers.

One example of this from my own experience at CWT: one of the teams that reported to me was the major incident commander team. Every time stuff got really, really bad, it was like the fire department; you'd call them in and they'd run the issue through to conclusion. That team had to send out a lot of different communications to a lot of different audiences: to our internal executives, to all the employees being affected, and some to our customers as well. Those communications hadn't really changed for a long time, and we had to get a lot better at them. All kinds of complaints would come in from these different audiences, because it was a one-size-fits-all approach, and things had evolved very organically; there wasn't a clear way to understand who should get what. So we applied these platform-as-product-style approaches to the communications going out from the incident commander team, and made dramatic improvements by doing things as simple as going out and regularly talking to the people who need to consume this stuff: understanding when they need it, what they need to understand so they can turn around and make the right decisions, do their jobs more effectively, or tell their own customers, the people who actually pay us as a company, what we need to do and what they need to do, how long they might need to wait, when to try back, and what their alternatives might be.

What was interesting, too, is that we did this in a very lightweight, prototype sort of fashion. Of course we had a technology solution for sending all this stuff out, but instead of using that and using our developer time to sit there and iterate and work through backlogs, we literally just wrote a heavily formatted email by hand and started sending it out, and used that as a tool to iterate on what the product should look like. We'd put together this email, send it to somebody, and say, hey, what is this, what do you think of this? Walk me through how you're interpreting it and what you're doing. By applying that really lightweight technique of doing things by hand, doing things the rough way before we had to put in the effort of software development, we dramatically sped up our ability to figure out the right thing, and then spent our development effort building the right thing instead of getting it wrong very slowly multiple times along the way.

And third, self-service for developers. This one is pretty self-explanatory, so I'm not going to spend a lot of time on it, but it's really the continuation of the consumerization-of-IT trend. The expectations for user experience on the enterprise side are very different now than they were five or ten years ago, and the same is true for developers: developers should not have to put up with really clunky, terrible interfaces on their internal tools anymore. It's been bad for a long, long time, but things are finally starting to get better; things had gone through very ticket-driven approaches. My own experience at CWT: we came in and did something called value stream mapping, which is a great technique for anybody interested in solving problems like this, where we worked through a very specific workflow. The one we picked was deploying a new application for the first time. We worked through every single team a request went to, every single team that had to touch it, and it ended up being something like 15 different teams, because there was a single silo team for everything you could imagine: a network team, and a security team, and a firewall team that wasn't the same as the security team, and the list just goes on and on in large companies like this. Every single one of them required a ticket. In some cases it was a ticket you had to file; in some cases one team filed a ticket to another team, and that team filed to a third team, and then somebody else would audit it, and somebody else would review it, and finally it would work its way through. But imagine getting all of those to a place where you can clearly define the policy once, get agreement on it from all these teams, and then use that policy to automate all of your governance going forward. That's what we're talking about. Out of a 45-day timeline to deploy a new app, we took 30 days out along the way, by making some simple process improvements and applying some automation.

Now let's look at some solutions over the course of the next minute. What do you need from a solution? You need a job runner, pretty simple, because you've got to do stuff. You need a web GUI so you can click some buttons. You might want it to have an API or CLI, but those aren't necessities. You need access controls, so that only the right developers can do the things you want them to do. And of course it needs to be FLOSS. There are a few different classes of these job runners: you might look at internal developer platforms, CI servers, workflow and data orchestration tools, or task schedulers. They're all good options when you're thinking about how to do platform engineering, and really the answer here is: use whatever you've got. Don't make this huge; start where you are. You can use GitOps, you can use Backstage, you can even use Jenkins, you can use workflow and data orchestration tools or task schedulers. So hopefully that's given you a sense, and I'd encourage you to refer back to the slides later to see that list, because I went through it pretty quickly, of what platform engineering is all about and what some of the different solutions are, and that you should start exactly where you are today, using the tools you have. Don't make this overcomplicated. Thank you.
Taming the Beast: Managing High-Growth Postgres Databases at CircleCI
Hold on. Hello everyone. Sorry? No, I think people are just using the arrow keys. Sorry. Less high tech. Hello everyone. So our next speaker is Bryce Kenta, introducing Taming the Beast: managing high-growth Postgres databases at CircleCI. Thank you. Hi everyone. My name is Bryce Kenta, and welcome to my talk on Taming the Beast, the CircleCI journey to managing high-growth Postgres databases. First, who am I? I'm a staff engineer at CircleCI, where I've been working for the last three years. I have over eight years of engineering experience spanning the full stack, backend and frontend. At CircleCI, I've been focusing on backend architecture and reliability. Over a period of hyper-growth, reliability became a big problem at CircleCI, to the point where our CTO started posting a monthly blog post to keep our customers updated about the improvements. A key part of those improvements was dealing with large databases, which is what I'll be talking about today. I'm very enthusiastic about the developer experience and making it better, which is why I love my work at CircleCI. And when I'm not in front of a computer, you can find me on the driving range, because Canada is very cold, and occasionally traveling the world with my wife. All right, let's get started. To give you a little bit of background about CircleCI: it's a global CI/CD platform with a wide range of customers. A bunch of open source projects build on CircleCI, such as React Native and Angular; any time you see a .circleci folder in a repo, it's typically building on CircleCI. The screenshot on the right is an example of a React Native workflow, which is currently just running some tests, so this should be familiar to any of you maintaining CI/CD pipelines. Our platform runs about 4 million of these workflows per week, and over 20 million jobs per week.
Each workflow that runs on our platform generates net new data to be stored: the workflow itself, the dependencies between jobs (the workflow graph), the job states, test outputs, and things like that. To handle all of this traffic, our infrastructure runs over 150 services and 70-plus Postgres databases. However, some of these databases were growing very rapidly, particularly the ones that support the platform's engine. The growth of such databases was directly correlated with the number of workflows and jobs created per second. As an example, one high-growth database that my team was responsible for had grown to 5 terabytes in size and was growing by 500 gigabytes per quarter. Write amplification on that database was a recurring cause of incidents. The nail in the coffin, though, was when we tried to upgrade that database from an end-of-life Postgres 9.5 RDS instance to a 12.5 instance. This took months to complete and incurred significant downtime because of incidents. The first attempt at migrating the RDS instance took a couple of hours and resulted in poor query performance, because the large tables required lengthy post-upgrade vacuum operations, which led to massively degraded performance. We considered using AWS Database Migration Service (DMS), but it would take too long to complete given the database size, because DMS uses logical replication, whose cost depends on the number of rows and the amount of bytes you're transferring. We were finally able to do the version upgrade using a form of home-brewed logical replication, taking advantage of application-level knowledge of the database, but this required significant engineering effort, with engineers working weekends. So that wasn't great. At the end of all this, it was clear to the business that operating these large databases is very risky and could cause a company-ending event. So we needed to tame this growth.
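To put those growth figures in perspective, a quick back-of-the-envelope calculation using the numbers from the talk (5 TB, growing 500 GB per quarter):

```python
# Simple growth projection from the talk's figures.
size_tb = 5.0                 # current database size, in terabytes
growth_tb_per_quarter = 0.5   # 500 GB of net new data per quarter

# Quarters until the database doubles to 10 TB at this linear rate.
quarters_to_double = size_tb / growth_tb_per_quarter
years_to_double = quarters_to_double / 4
print(f"doubles in {quarters_to_double:.0f} quarters (~{years_to_double:.1f} years)")
# doubles in 10 quarters (~2.5 years)
```

A linear projection like this understates the problem if usage itself grows, which is exactly what was happening during CircleCI's hyper-growth period.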
So now I'll take you on the journey that we took to taming this beast. First, I'll talk about storage reduction: the immediate savings we gained by deleting some of the low-hanging fruit. Next, I'll talk about the growth restrictions we put in place to make sure that data growth remained at manageable levels. And lastly, I'll talk about some of the optimizations we made to ensure long-term success. The first thing we did to reduce storage was to drop unused columns, tables, and indexes. Indexes in particular can grow large over time, so dropping them was a quick win. We leveraged a tool called pganalyze to identify indexes with zero scans, meaning they were not used. Dropping those indexes not only benefits the storage size, it also reduces write amplification, so writes to the database are actually faster. Next, we switched a bunch of B-tree indexes to BRIN indexes instead. BRIN indexes are designed for handling very large tables in which certain columns have a natural correlation with their physical location in the table. For example, if you have an orders table with a created_at column, earlier records will physically show up earlier in the table; BRIN indexes are optimized for that kind of data. From the screenshot, you can see we had a bunch of created_at indexes across multiple tables, but the thing to note is their size: they took over 400 gigabytes of storage in a single database. Dropping the ones that were unused, or switching the rest to BRIN, saved space immediately. The next step to reduce storage further was to offload static blob data to S3. S3 is much cheaper, and you can define object lifecycles to automatically delete the data. But migrating to S3 came with some drawbacks, such as additional latency, because we had to put a Redis cache in front of it.
And the other drawback was that it added more dependencies to our service, and the queries were no longer transactional: we had to add code to stitch together the responses from Postgres and S3, which added a bit of complexity. So at this point, we had freed up some storage to give us some runway, but we hadn't addressed the growth. Let's talk about that next. The first thing we did to slow down the growth of our databases was to put data retention policies in place. Our product management team collaborated with other parts of the business to identify retention periods, which differ based on the customer plan: for example, a free customer gets three months of data, and higher-plan customers get up to two years. We communicated these policies to all of our customers ahead of time, and gave them a quarter, so three months of leeway, before actually enforcing any restrictions. The next step after that was to implement data access restrictions at the API layer, before actually deleting any data. This meant customers no longer had access to data beyond their retention period, which enabled us to go to step three: safely deleting the data using background jobs, because customers no longer had access to it. I should point out that at this point we still have growth, mainly due to new customers or existing customers building more on the platform, but the growth is contained, because we don't retain data older than two years. But we ran into some issues. The first issue was that as we deleted data from the primary database, it caused degraded performance on the replicas as the deletions were replicated: we experienced spikes in IOPS and CPU usage, and needed to upsize the replicas. Another issue that we faced was index bloat.
Frequent background deletions without periodic maintenance of the indexes reduce the efficiency of those indexes over time, so a solution for regularly re-indexing the database was necessary to make deletions sustainable. This is something we're still figuring out; we haven't found a proper solution yet. Lastly, Postgres databases do not automatically reclaim disk space when a record is deleted; this is something we found out. There is a built-in vacuum operation to reclaim space, but it only frees space back to the table for reuse: once disk is allocated for a table, it may never be released until that table is dropped. The vacuum operation has a FULL option that builds a new table and swaps the old table for the new one, but it requires an exclusive lock, so it was not a viable solution for us because, again, it requires downtime. Instead we were able to use pg_repack, an open-source Postgres extension that allowed us to reclaim the space from dropped columns with minimal locking of the table. So that was great. And then the last step on our journey was to establish a long-term strategy. We needed a data archival process that could be applied to all of our high-growth databases, so we established a data reliability team with the mandate to own a single historical data store. That data store would support functional requirements such as high availability, horizontal scalability, and multiple query patterns, which are needed by the API and the UI to filter data. But this historical database is used to serve customer data only, nothing else: no ETL, nothing like that. Each service team would then implement a data archival process, similar to the diagram at the top: the service sends requests to the historical service to archive data; what data is archivable, and when, depends on that particular service's domain. A sweeper job makes sure that any missed archivable data is archived.
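The archive-then-delete flow just described can be sketched in miniature. This is a toy in-memory model, not CircleCI's implementation: the row shape, the batch size, and the 24-month cutoff are illustrative, and a real version would issue batched SQL DELETEs so replicas are not flooded with replicated deletions.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=730)  # roughly 24 months, as in the talk

def archivable(rows, now):
    """Rows older than the retention cutoff are candidates for archival."""
    cutoff = now - RETENTION
    return [r for r in rows if r["created_at"] < cutoff]

def sweep(rows, archive, now, batch_size=2):
    """Sweeper: archive any missed rows, then delete them from the
    primary in small batches to limit replication pressure."""
    todo = [r for r in archivable(rows, now) if r["id"] not in archive]
    for i in range(0, len(todo), batch_size):
        batch = todo[i : i + batch_size]
        archive.update(r["id"] for r in batch)          # copy to historical store
        rows[:] = [r for r in rows if r not in batch]   # delete from primary

now = datetime.now(timezone.utc)
rows = [{"id": i, "created_at": now - timedelta(days=d)}
        for i, d in enumerate([10, 800, 900])]
archive = set()
sweep(rows, archive, now)
print(len(rows), sorted(archive))  # 1 [1, 2]
```

The recent row (10 days old) stays on the primary; the two rows past retention end up only in the archive.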
And then there's a deletion job that continuously deletes archived data. Also, as product teams build new features that require net new tables to be created, we aim to partition those tables from the beginning. We use pg_partman, an open source partition manager, to create time-based partitions. pg_partman lets us configure retention periods and will automatically delete any old partition: as soon as a partition falls out of the retention period, in our case 24 months, it is automatically deleted, so we don't have to worry about it. And finally, now that I've taken you on the full journey from reducing our storage size to establishing long-term data archival processes, I'd like to take a moment to acknowledge some of the key learnings, because an initiative of this magnitude spanned almost two years and was non-trivial for us. The first learning is to implement a data retention policy as early as possible, ideally one that allows you to serve more data at your discretion, because that means you don't have to implement the code to delete the data until you really need to. That would have saved us hours of engineering effort and downtime dealing with massive databases. The second learning: rehearse any major database maintenance, things like major version upgrades, space reclamation, re-indexing, anything like that. Make a copy of your production database, validate your changes there, and compare query performance against the production database before actually running that maintenance in production. And finally, write down your learnings. This creates a knowledge base for everyone to learn from and helps other teams move faster; the extensive documentation my team put together throughout the last two years is what helped me a lot to come up with this presentation. And that is it from me. Thank you for listening; I hope this was helpful to you.
ε-serde / mem_dbg / sux / dsi-bitstream / webgraph: a Rust ecosystem for large graph processing
Hi everybody, we're just about to have our next talk, which will be Sebastiano Vigna, talking about a Rust ecosystem for large graph processing. Sebastiano? Thank you. Okay. How many Rust programmers here? Well, some. How many Rust programmers who handle large data structures, like tens of gigabytes? A few. Okay. The first group is reasonably interested, the second group is more interested, and the rest of the people can sleep; I'm not offended, you can use your computer, it will be very, very boring. So let me introduce why. What I'm doing is just announcing a few crates we are distributing that do very specific things related to large-scale data analytics. The origin of this is a framework for graph compression that has been around for around 20 years, and has been used by the community around the Web Conference (WWW), the largest academic conference on the web in general. For the last 20 years there have been many data sets distributed in this format that are utilized, and a lot of journal papers. And in 2011 it was used to measure the degrees of separation on Facebook, if you remember it; maybe you're too young. It was quite a feat at the time because, I mean, it was 15 years ago and Facebook was already rather large, but we were able to represent the entire Facebook graph in just 211 gigabytes, which made it possible to run some pretty nice algorithms to compute the distance distribution. Maybe in this community I should mention that I started to do free software in the late 80s, on the Amiga. Nobody remembers what that is, but I have some history with the free software movement as well. So at some point, we decided to move to Rust for the obvious reasons: it's a high-performance, safe language. But everything I described so far is in Java; it was written in Java, starting at the end of the 90s, and at that time it seemed a very good idea.
Then things happened, like: Java arrays can have at most two billion elements, and if you have graphs with 50 billion elements you cannot even index the nodes, which gets very, very annoying. And today, anything this size is done using memory mapping. I mean, if you go to Facebook, Google, whatever, all the large structures are there in memory, but usually they're just memory mapped, because you don't want the startup time. If you load into memory a graph that is half a terabyte, you wait minutes, whatever platform you are on; but if you can memory map it, this time is amortized along the visit of the graph, for instance. And we actually need to represent very large graphs. If you have ever used the memory mapping facilities from Java, I will not say the words, because they would not be proper in this particular situation. And there are really lousy iterators; if you have ever written an iterator in Java, you know what I mean. So to do this, we needed to port a number of ideas from the Java library and to develop a few new things. The first thing is ε-serde, weird name. It's a framework for ε-copy serialization and deserialization. You might know what zero-copy serialization and deserialization is: it means that you serialize something and then use the memory, exactly in the state it is, to represent the object internally. There is no deserialization; you don't build a new object; the piece of memory is used directly as it is. And this is how things work, as I said, in all these organizations that have large indices, Facebook, Amazon, whatever you want: the index is on disk, it's memory mapped as it is, it's not deserialized in any proper sense. There are a few frameworks, like Abomonation, that do this kind of thing in Rust, but they all had problems for us. The first and oldest one, by Frank McSherry, writes into the serialized object; so if you want to memory map a file, that's out of the question.
The second one — you might know it, it is from the people that do the internationalization library. Nice idea, but it has a huge impact on performance: it does some kind of runtime resolution of the access to vectors. And then there is rkyv, which you might be familiar with, which also does a kind of relative-memory dereferencing. And also, the structure you deserialize is completely different from the one you serialize, so you have to delegate all the methods, and each time you change one, you have to change the other. Not very practical. So what we did was develop this framework, which requires a little bit of collaboration from the underlying struct. But the basic idea is that you serialize something and then you ε-copy deserialize it. When you access it, you allocate a very small amount of memory, and then the rest comes directly from the disk without any intervention. And the way we do it is that we remap vectors, essentially. You build a structure with a vector, but when you deserialize it, it has a reference to a slice. In this way, we just have to allocate the actual struct that you want to deserialize, but then anything that is a pointer inside just points into the original memory. So, ε-copy: the idea is that it's not zero-copy, because we did a little bit of copying — ε-copy, a very small amount. But the advantage is that now you have exactly the structure that you serialized. It's exactly that structure, with all its methods. The only thing you have to do, if you have vectors, is that there must be a type parameter, and you must write the access methods as if against a slice. Of course, when writing, you write a vector, but when you read, you read it back as a slice. This is the collaboration you need. But then it's completely transparent: with basic types, you store a vector, then you memory-map it, and that's it. And what you get back is a reference to a slice — more precisely, something that derefs to a slice.
And again, you work essentially transparently with respect to the framework. Unlike the other cases, since there is nothing intervening, no resolving of pointers, no dynamic resolution — everything is done at deserialization time — there is zero impact on performance. The performance is exactly that of the original structure. We use this to map massive immutable data structures, like representations of sequences of sets and so on, that are tens of gigabytes, 100 gigabytes on disk, directly into memory, without any load time. So if you handle large immutable data structures, that could be for you. mem_dbg: that's a very small crate, but it solves a problem we had. It's a high-performance memory occupancy detector, which sounds ridiculous when you say it, because, well, all it does is measure the memory occupied. But it's not so easy: if you use the ones that are around on something like a large vector and a few other things — this is the amount of allocated memory, these are the three most common frameworks, sorry, crates, that do that, this is the amount of time they take, and this is the amount of time we take. The reason is that without some infrastructure similar to that of ε-serde, you have to iterate through collections to measure the space occupied. And if you iterate through a billion-element collection, it will take a lot of time. We routinely measure the space occupancy of things that are like 50 gigabytes, which would otherwise take eight minutes. So we developed this. If you need to measure the actual occupation in memory — not stack occupation, the actual occupation in memory — of something large, try mem_dbg. Also, as a nice extra, it gives you a printout of the structure with all the memory occupancies. This is important for us because we build these succinct data structures all the time that have various components, and we need to know their relative sizes. So this is only if you have very large data structures. If they are small, you can iterate, no problem.
Sux is an ongoing project — ongoing project, yeah, it's an ongoing project. I won't say an ongoing crate, but it's actually kind of an ongoing project. And it's a port of an existing C++ project and Java project about succinct data structures. You might know what they are. If you don't, no problem, you don't need this crate, but they're very fashionable now. There is at least one crate that does this, but we wanted to have something more sophisticated. So if you're interested in Elias–Fano representation of monotone sequences, ranking, selection, and so on, please have a look. This is really just coming into existence, but we'd like to have feedback. dsi-bitstream: very, very high-performance bit streams, with reads and writes by word, support for big- and little-endian files, and a lot of instantaneous codes — γ, δ, Golomb, and so on. These are the kinds of codes you'd find in MPEG and so on, but we use them to do graph compression, and we spent a lot of time optimizing every single shift and move. We also give you scripts to just run — we massively test all the parameters you can configure on your architecture — so you can choose how to optimize the speed of encoding and decoding specifically on your architecture: which word size to use to pick up stuff from memory, whether to use decoding tables or not, and so on. And this comes from quite a long experience in doing this with WebGraph. So if you're interested in writing these instantaneous codes for compression, you should have a look at dsi-bitstream. Just to tell you: a γ code is read in a couple of nanoseconds. So I think this is pretty nice. Okay, the last piece, which is probably the most specific, so you might be less interested: WebGraph. WebGraph is a framework to represent very large graphs in compressed form. Typically, snapshots of the web are represented in about one to two bits per link.
The Software Heritage graph, which is a graph with about half a trillion edges, is three bits per link; Wikipedia costs 10 bits per link; it depends on the structure of the graph. But usually, and in particular when the graph is redundant, you can represent data in 10, 20, even 50 times less space than you would with a redundant representation. It's a Rust port of the Java version, and of course we use dsi-bitstream for the instantaneous codes and sux for the pointers into the bitstream. Just to give you a very simple example: the Software Heritage graph is 34 billion nodes and a little bit more than half a trillion arcs, and you can do a BFS visit, single-threaded, in three hours. It's very nice — you have to notice, half a trillion edges. The ergonomics of the whole thing is incredibly better than Java. Just having real iterators completely changes the game, because it's much more natural than what we had. All the others are crates that you can download and use and are pretty stable; this one is still on GitHub, because it's a lot of code, a lot of optimization. We just merged into main the last big chunk of modifications, and the API should be stable by now. But this is very specialized. I mean, unless you have graphs with hundreds of billions, half a trillion arcs... For instance, some biologists built this huge dataset with a trillion protein-protein similarity edges, and they did it with WebGraph, because if you need a trillion edges and you need to distribute it and analyze it on standard hardware, not a massive supercomputer, you do it using compression. There is also support for labels on the edges, which you can enumerate, and it's much better in the new version than in the old one. And one thing that we had to fight a lot against is lenders. If you're not familiar with the lender idea: it's a general idea, and there are a number of crates for Rust. Lenders are iterators whose returned object depends on the iterator itself.
Iterators in Rust are things that give you values, and you can take the values and use them. But in all this kind of batch processing for graphs, you iterate over the graph and you cannot look at two nodes at the same time. There is a sequential iteration which goes through a file or a sorting of labels. So you need to be able to say: okay, this is the next batch of successors, use it, but I won't give you the next one until you're finished with this one. To do this, you essentially need generic associated types — well, not really that; we use higher-order trait bounds. But you need to impose that each call to next can be made only when the result of the previous one has gone out of scope, so you cannot do two calls to next in a row. And this is called a lender. There are a few crates that implement lenders now, which have, say, almost feature parity with iterators, but the fact is that presently they work because of a bug in the borrow checker. The borrow checker doesn't check certain things which, if fixed, would make all these lender crates not work. And at that point, we would be in really deep shit, because we have no idea how to do this other than the way we're doing it. In fact, we're even in a situation where we have a chain of an iterator returning iterators, and the final value depends on state in the initial thing. So there is a propagation of lifetime bounds that goes through two different types, and that gives me a headache each time I look at it. And in fact, I didn't even invent it. I asked on the Rust forum and said: I have this completely crazy situation, what can I do? And a very nice guy wrote a type like this, with 25 different implied type bounds, and now it works. Let's hope it continues to work. But this is just to say that we need a little bit more from borrowing in Rust than there is now to make this work properly, because it has been a bit of a pain to get something like an iterator in which the returned value depends on the iterating object.
One last thing: if anybody knows how to get one thing done — IndexGet. Since 2015, it's been sitting in the Rust issues to have an index trait that gives you a value, not a reference. Because Index gives you a reference. Now, Index giving you a reference is fine. But if you do compressed, succinct, any kind of implicit data structures, an index that gives you a reference is a pain in the ass, because you don't have the data; it is implicitly represented. You need a trait where the two nice square brackets give you a value, not a reference. And then you can enter the world of modern implicit data structures. So if you know anybody who can implement this, or can convince someone on the compiler team to get this done, please do it. I'm over. Thank you. Thank you.
Using elliptic curve cryptography for the purposes of identity
Hi everybody, the next talk is about to start. We'll have Yarmo Mackenbach talking about using elliptic curve cryptography for the purposes of online identity. Thank you. Shall I start the buzzer? Shall I? And we're off, apparently. Yeah. Alright, welcome. So I'm Yarmo. I work on this project called Keyoxide, which is about online identity, and we're going to talk about it in a minute. First, because of the previous talks, I wanted to set the scale: there will be no 5-terabyte databases here, no serialization of billions of nodes. We're just going to make a little script. It's a bit of a Bob Ross talk, I guess: we're going on a journey together, and we'll have fun and discover things. And before I really start, we're going to try something experimental: a little interactive demo at the end. We're going to write the script, but you're going to verify that the script we write actually works. So, for whoever wants to participate, you should consider downloading the Keyoxide mobile app. It's available in these locations, or you can just get the APK from the Codeberg repo. Alright, let's get started. So if someone makes a claim, how do we verify that? Well, quite simply, with a proof. What do I mean by that? For example, if Alice lost her luggage and Bob conveniently found it, and then Alice says it's hers, then Bob asks for proof, of course. And then Alice fiddles with the little dials and unlocks the luggage, and she has proven that the claim was indeed true: it is indeed her luggage. So now we want to know: is this also possible over the internet? Can we do this over the internet? Well, yes, we can. We can claim things over the internet, but humans travel rather poorly through ethernet cables, so we need to find a way to connect Alice and Bob in a different way, so that Alice can make her claim and Bob can verify that claim, each in their own space and time.
And for this, we're going to use cryptographic signatures. We could talk for a long time about cryptographic signatures; for the purpose of this talk, the important part is that it's basically just like a real signature, but digital. The big difference, I guess, is that it's really difficult to forge, so that's good. In short, we have a secret key, which we will use to sign documents — text documents — and a public key, which we will use to verify those signatures. Combine those two keys and you have a key pair, and each key pair is identified by a unique fingerprint. All right. So let's try to work out this process. Let's say that I write this text document, which just says that this is my account on the Fediverse, on Mastodon. Now I will sign it with a key, which conveniently has this fingerprint that starts with very familiar letters. The signature itself is just zeros and ones; we're not going to worry about that. So now I give this text document, my claim, together with the signature, to my friend, and my friend will use those two pieces of data. They will first verify that the signature indeed corresponds to this text document, and once that is done, they go to my actual Fediverse account, and they read in the bio: oh, this person indeed wrote in their bio that they have this key. So that is the proof with which I verify my claim that it is indeed my account. So now we're going to do that whole process. We're going to try to create an online identity with just 100 lines of Rust. I did need five dependencies. I tried to minimize it, but without these, it would be a lot more than 100 lines of code. So yeah, these will be it. So we're going to generate a key. This is where the elliptic curve part comes in.
Elliptic curves are a technique for creating cryptographic keys, and in this case we're specifically using the P-256 curve. But all this is just to say: we're using these two lines of code to create an entire cryptographic key. This includes a public key and a secret key. Now, as I said, every key pair has a fingerprint, and that's what this code does. It looks a bit complicated — this is the most complicated part of the script — but basically we get some data from the key, some parameters, and then we hash them, and that is how we get the actual fingerprint. Next, we're going to collect the identity data. We're going to create what we call a profile. A profile is just a name, some other metadata about the person, and claims — multiple claims. I'm just going to continue with the same example as before: I'm going to claim that that is my account on the internet. Now we need a way to encode all this data, because we need the text document and we need a signature. For this, we're going to use a JSON Web Token, which for the purposes of this talk is just a convenient way of combining a document and a signature. We'll need three parts: a header, a payload, and a signature. So let's make each of those. Oh yeah, some quick notes. Whenever you see ariadne.id, that is just the namespace that we use for the creation of the tokens. And sometimes you will see JWS instead of JWT. Those are different, but for the purposes of this talk, we'll just consider them the same. So let's create the header. The header is just a little bit of metadata about the key that is creating this profile. We'll set the fingerprint and we'll set the actual key — the public key, of course, not the secret key, because that one should stay secret. Then we create the payload. The payload is the actual profile itself.
So we're going to say: the type of this token is a profile. In line 10, we say what the name is, and we set the identity claims. Don't mind all the "payload set claim" calls — that's just to confuse you, because JWT also uses the term claim in a different way. Just to make it easy. Now that we have the header and the payload, we're going to sign the two. That's what we do here: in line three, we get the key that we generated earlier, and in lines four and five, we use it to sign the payload and the header. And with that, we are done. We have our profile. Now, if you would like to copy this, write this over... yeah, that's not convenient. So we need a second step. I need to get this from my computer to your phone, your device, whatever, so that you can verify for yourself that I do indeed have that account. So we need a way to transport documents, and preferably signed. You can guess where this is going: we're going to use another JSON Web Token. We're actually going to reuse the same header, because we're going to use the same key, so we'll just use the same metadata about the key. We're going to create a second payload, which will be very similar. This time, instead of being a profile, it will just be a request: we're going to ask the server to create this profile. And then in lines 14 and 15, we actually include the document that we created earlier; we just give it to the server. And this second, outer JSON Web Token we are going to upload to... sorry, we're going to sign it first, so we'll have a similar string, a piece of data, that we can then send to the server. We're going to send it to what we call an ASPE server, which we are working on, and which is basically a way of storing and exchanging these kinds of profiles.
And yeah, that is basically it. Those were the lines of code that you need to make an entire profile, make a claim, and make it so that people can verify it for themselves, with their own devices, with their own methods. So yeah, it is a fun script. You can try it at home — or, as I said, we could try it live on stage. That is what we're going to do right now. So I did prepare it somewhere. You'll see that, apart from some cosmetic changes, if it loads... yeah, that's the big risk of doing this on stage. We'll give it a second. Apart from some cosmetic changes, it is largely the same script, and you'll see that it fits neatly within 100 lines. And it might not load. We'll give it another second. And if it... Alright, well, maybe it won't do it. It would have been phenomenal, I can promise you. Alright, I'll reload it once and then... I do have a sort of backup. Alright, it's not playing along. So let's go back to the presentation. I think it's this one. I don't... wait. I have lost the presentation. That's a different presentation. What? That was not supposed to happen. Yeah, I don't know what's happening. But basically, we would have run the script and created a profile, and then it would have presented you with a QR code that you could have scanned with your phone, and it would actually have worked. You could have seen that the script created a profile that we built here on stage. Yeah — just with a couple of lines of code, we can work with cryptography, we can work with identity. And, yeah, thank you very much. Thank you.
Timestamping with OpenTimestamps
Alright folks, we're just ready to start our last talk, which will be timestamping with OpenTimestamps, by Timothy Redaelli. Okay, thank you. So, I'm a Red Hat employee; I work as a software engineer, but not on this stuff. So what is timestamping? Timestamping is needed to be sure a document or a file was made on a specific date. For example, in Italy — because I'm Italian — the law requires that dates be attested by a public officer, so you can't do that by yourself. So what about digital documents? Usually, digital timestamping is done in a third-party data center, so you must trust some other authority, usually a certification authority. So how can we do that without relying on a third-party authority? We could use the blockchain: you create the hash of a file or of some information, and you put this hash inside the blockchain, so you can demonstrate this hash was present at a specific time. Why the blockchain? It's safe, because it's backed by millions and millions of dollars. It's open — in our case we use Bitcoin. It's not cheap to create a new bitcoin, because mining is an expensive process, but it's quite cheap to use the chain. So why OpenTimestamps? The blockchain is open, anybody can write on it, so anybody could do the same thing directly without using OpenTimestamps or another framework. But OpenTimestamps is a standard way of doing this in a trustless way — trusting no one. It was proposed by Peter Todd, a Bitcoin Core developer. It's used by dozens of different companies, and it's almost infinitely scalable — almost, because in information technology we can't have infinite storage — because it uses a Merkle tree. So what is a Merkle tree?
A Merkle tree is a tree where you only put the top hash, the root of the Merkle tree, inside the blockchain, but you can demonstrate that your file or your information existed without the need to push every hash inside the blockchain — only the top hash, the root. So OpenTimestamps provides users multiple, easy ways to create and independently verify timestamps. The OpenTimestamps project on GitHub includes several different implementations. The first one was written in Python. Then somebody wrote one in Java, then in JavaScript, because that's easier to use in a browser or in Node.js. They also started to write a Rust OpenTimestamps, because Rust, as was said in a previous talk, is a good language: safe, fast, low memory usage, etc. There is also the opentimestamps.org website, which uses the JavaScript implementation. In these slides, I show an example of usage with the Python client, because it was the first one. If you want to use it, you just need the OTS stamp command. The stamp command creates the Merkle tree of the file and submits it to some remote servers — the servers that write the information to the Bitcoin blockchain every so often. So when you do a stamp, the operation creates the hash of the original file, concatenates it with a random nonce — for privacy, just to avoid having your file's hash on the Merkle tree directly — and recalculates the hash. So you have a double SHA-256 hash, and the value is sent to the calendar server. The calendar server adds the hash to its Merkle tree and returns the response to the client, in order to generate the OTS file — the file you will need to verify the timestamp later. Of course, at this point the file is incomplete, because it doesn't contain the record in the blockchain: you need to wait for the calendar server to send the record to the blockchain, and for the Bitcoin network to mine the block with the Merkle root, etc.
After some time has elapsed — some hours — the user reruns the OTS tool with the upgrade operation, and this updates the file with which block of the Bitcoin blockchain includes the hash. It's also possible to create a timestamp for several different files simultaneously. In fact, we did a test where we took all the hashes of all the files included in archive.org — not web.archive.org, the archive.org that holds petabytes of files. Of course, we didn't download all the files, but the archive.org API lets you ask for the hashes directly. So we took all the hashes from archive.org, and we were able to put all these millions of files inside only one Merkle root. So it's absolutely scalable, because you can put tons of files under only one Bitcoin transaction — which you don't even need to make yourself; the calendar server does it. So it's essentially free. The verification requires both the OTS file and the original file, or the original hash. And if you want to do it by yourself, without trusting anybody — which is what you want — you need an up-to-date Bitcoin node. You don't need a full node, since the attestation is in the block header, so a pruned node is enough: only a few gigabytes of data instead of almost one terabyte for a full node. If you do that, you are sure nobody can fake your check, because OTS asks the blockchain directly, and so you don't need to trust anybody, including the calendar servers that put your attestation on the blockchain. The OTS file includes three main sections: the hash with the nonce; the Merkle tree construction, because you need to know which other hashes are in the Merkle tree in order to be sure your file is in the tree under that root; and which Bitcoin block includes your hash. The timestamp is saved in a binary file, to save space and to avoid problems of interpretation, especially on Windows. The file has the .ots extension and it starts with this line.
If you use the OTS info command on the file, it prints lots of information. I can't show it all here, because it shows every single Merkle hash, but you can try it at home and see how the Merkle tree is created. These are some examples of OpenTimestamps usage: the website I presented at the start; ProofMode, an Android app by the Guardian Project that uses it to certify that a photo is valid, with GPS data, etc.; an example of how you can timestamp newspaper articles; and a website that lets you put a timestamp on a tweet. The end.
Compiler Options Hardening for C and C++
Okay, hello, good morning here at the lightning talks at FOSDEM in Brussels. I want to introduce Thomas Nyman, senior security technology specialist from Ericsson, who will give us an introduction to compiler options hardening guides for C and C++. Give him a warm welcome. Thank you very much. Start. Thank you very much. I work for the network platform and telecommunications company Ericsson, but today I'm here to talk about the Compiler Options Hardening Guide for C and C++. I am also in the Open Source Security Foundation, as the sub-initiative lead for the compiler hardening best practices initiative that has produced this guide, and we had an initial release in November last year. I hope many of you have heard about the Open Source Security Foundation, but for those who might not have: this is a community of software developers and security specialists who are working towards improving the security of the open-source ecosystem. This means both innovation in open-source software, as well as efforts to develop best practices and collaboration around security in open-source software. The background for the work I'm talking about today is the C and C++ hardening challenge. We all know that C and C++ are consistently the preferred languages for systems programming, embedded systems, and performance-critical applications. But C and C++ are also memory-unsafe, and that means they are susceptible to certain classes of programming defects that affect the memory integrity of software written in them. In unfortunate cases, these defects can lead to software vulnerabilities that can be used by malicious actors to exploit the software in different ways. Addressing these types of vulnerabilities in C and C++ at large scale presents several significant challenges. There are memory-safe alternatives to these languages, but there is also a lot of C and C++ code in the world today.
Rewriting all of this existing code in memory-safe languages is prohibitively expensive, both in monetary terms and from an opportunity-cost point of view. The alternatives often have unsafe dependencies, and these unsafe dependencies slow down the migration to the memory-safe alternatives. One example of this is Rust, which is a very promising language and provides memory-safety guarantees. But if you look at the dependencies — there are some references here to recent surveys — the conclusion was that over 70% of Rust crates, the Rust packages in the official package repository, have some form of dependency on either C or C++. This is not just a technological problem; it is also something that is gathering regulatory attention. In the US, something that has been very influential in shifting attitudes towards software security was the presidential executive order on improving the nation's cybersecurity in May 2021. Also in the EU, we've had the Cyber Resilience Act, which has been heavily discussed among the open source communities as well. And specifically related to memory safety, in the past two years we've seen a lot of guidance from cybersecurity authorities, including the US NSA and CISA, who have issued joint publications with other national cybersecurity authorities, the most recent being the December 2023 document on memory-safe roadmaps, where they urge organizations to publish explicit plans for how to shift away from memory-unsafe code. So what we are doing in this initiative is providing a guide for compiler options hardening, and currently this is specifically geared towards C and C++ code. The idea is that we have a guide that will help developers and packagers of software configure programming tools during development to reduce the attack surface of the produced software. You can think of this as something quite close to what is sometimes called product hardening.
A hardening document usually provides guidance to the parties deploying the software on configuring the operational parameters that let them deploy it securely in its operating environment; we are focusing on the corresponding parameters during development, which help everyone who later deploys the software. Modern C and C++ compilers provide many optional features that help improve the security of the produced software, but these must be explicitly enabled when compiling for the software to actually benefit from them. If you are consuming software from the major Linux distributions, then these features are usually enabled by default. But if you are consuming open-source software from source, then you are responsible for making sure that, when the software is built, these kinds of protections are enabled correctly. And of course, these also come with various challenges. I will not go into them in detail here, but these challenges can sometimes make it difficult to deploy the options in a correct manner, and we hope that this guide will help practitioners with some of them. The current situation, according to some academic surveys, is that things are actually much better on the desktop side, but embedded systems especially often ship without these protections enabled. Here is a publication from the Network and Distributed Systems Security Symposium from 2022, which shows that there is a radical difference in the deployment of specific hardening mechanisms between desktop and embedded systems. And of course, compiler options hardening is not a silver bullet, right?
So this is something that is necessary in combination with the adoption of memory-safe languages, secure coding standards, and security testing, but we hope that this is one way of addressing the C and C++ hardening challenge. If you look at the guide, you will find that it is divided into four main parts. We have a large section on the recommended compiler options that currently covers a wide range of different features in GCC and Clang/LLVM, and this includes both flags that warn developers about different software defects that are related to security vulnerabilities, and flags that will add instrumentation to the binaries that helps them be resilient at runtime against attacks trying to exploit possibly residual defects in the software. We also have a section on discouraged compiler options; these are compiler options that have some specific purpose, but if you use them inappropriately, they may impact the security posture of the produced software in one way or another. We also have a section on sanitizers; these are compiler-based tools designed to be used during development and testing to pinpoint memory safety issues and other defects. They provide a lot of valuable information for debugging and testing, but they often have more runtime overhead or memory overhead, which makes their deployment in production software more difficult; still, they are very valuable during the development and testing phases.
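To give a flavor of the kinds of options the guide covers, here is a sketch of a hardened build configuration. The individual GCC/Clang flags shown are real, commonly recommended hardening flags, but this is an illustrative fragment, not the guide's authoritative list; the right set depends on your compiler version, so check the guide itself.

```make
# Illustrative hardened build flags for GCC/Clang -- a sketch, not the
# OpenSSF guide's authoritative recommendations.
CFLAGS  = -O2 -Wall -Wextra -Wformat=2 \
          -D_FORTIFY_SOURCE=3 \
          -fstack-protector-strong -fstack-clash-protection \
          -fPIE
LDFLAGS = -pie -Wl,-z,relro -Wl,-z,now -Wl,-z,noexecstack

# For development and testing builds only: sanitizers pinpoint
# memory-safety defects at runtime, at the cost of significant overhead.
DEBUG_CFLAGS = $(CFLAGS) -g -fsanitize=address,undefined
```

Note that `-D_FORTIFY_SOURCE` requires optimization (`-O1` or higher) to take effect, and `_FORTIFY_SOURCE=3` needs a recent GCC/glibc; older toolchains use level 2.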
Then we have some information on managing debug data. This is something that can help make produced binaries more resistant to reverse engineering, where you have threat actors analyzing binaries specifically for ways to exploit them. Of course, in practice the decompiler tools used for this purpose can work without debugging information, so the security of the system should not depend on omitting this information, but there is some guidance in this respect. As I mentioned, we had the initial release of the guide in November 2023, and we have a lot of activity on the OpenSSF Best Practices Working Group GitHub pages where the development happens. For this year we are planning on documenting new features that are in upcoming versions of GCC and Clang; this is actually an area where the compiler communities are very active in providing new valuable features that are security relevant, and we hope that this guide will eventually cover all these new features as well. We also have some plans with partners to introduce information on new compilers, so we hope that this will also be possible during this year. Another effort is a separate guide on using compiler annotations in GCC and Clang; there is some work-in-progress material up on GitHub if you are interested in that. Everything is, of course, open source, and we welcome contributions also from people who are not necessarily security experts. We've had very valuable contributions on improving the readability and presentation of the guide, so if you think there is something that could be improved, I urge you to open an issue or a pull request against this material. Development happens on the OpenSSF Best Practices Working Group GitHub repository, and we have calls every other week on Zoom to discuss any open PRs and developments around the guide; these are also public.
This slide has some links, more links on how you can participate in the work that OpenSSF does if you're interested; the slides are available on the talk page on the FOSDEM site if you want to access the links. And lastly, I'll leave this slide open, so if you want to access the guide itself, you can do so at the URL here or by scanning the QR code. That's it from my side, and I want to thank the FOSDEM organizers for giving me the opportunity to present this work here. Thank you very much. Thank you.
A Lazy Developer’s Approach to Building Real-Time Web Applications
Markus Röntschler is next, and his talk is about a lazy developer's approach to building real-time web applications. Give him a warm welcome and applause. Thank you, the stage is yours. So, good morning. Today I want to tell you about my two hobbies. First, I'm a musician; I play the bass guitar. And my other hobby is being a cloud solution architect; that's my money-making hobby. The project I want to talk about today gave me the opportunity to combine these two hobbies into one little project, and I want to share the learnings from it with you. Okay, so the challenge. In the picture above you see Ralph. He is a friend of mine and he plays sing-alongs: he plays songs, people sing along. But we had one problem when we play somewhere, with venues from 100 to 1,000 people: songbooks don't scale. We had songbooks, but they got damaged. The venues were dark; people couldn't read. Songbooks even got stolen. One fact that was beneficial for the project: we have music stand software on our tablets. They are networked to each other, and this music stand software has an API. Terrible software, proprietary stuff; I don't want to talk about this software today, but about what we made from it, so that you are able to use it in your own projects. How to get lyrics to the people with minimum effort: that was the task we had to solve. And so that it doesn't get boring, I want to start with a demo so that you see the result, and later on I will show you how we accomplished it. So please use your mobile phones or your computers; both will work. If you call this web page, you will see... let me show it here. This side, the other side. There it is. You will see exactly that page which I have loaded here. It's waiting for lyrics. So the communication is established, and when I now start sending the lyrics, imagine somebody on the stage would change to the next song in the music stand software. Okay, let's do it. I had an AI write a few songs about open source and the like.
So if my talk is boring, just look at the songs, and I'm also open for collaborations in getting music to them. Okay, let's go on. What do we see here? The songs get updated, like I just said before. And we even have confetti when there's a new song. So, the title of the talk was "the lazy developer". Why does being lazy matter? If you are too eager, it can happen that you think of a structure for how to implement something and then stick to this structure, and you don't listen to the gut feeling that says it's too complicated, that it has to be easier. You create code duplicates and the like. If you're too lazy, on the other hand, you get nothing done. So you have to find the sweet spot: being lazy enough and being eager enough to get things done. And because I did this in my spare time, that was the only approach which could work. So I had to have something easy which allowed me to get the job done but also allowed it to scale to a venue of a thousand people, to a thousand people requesting this resource at the same time. And here's my technical approach to that. We have the people who want to sing along. We have the musician with the music stand with the, let me call it, REST-ish API; what I saw wasn't so good. A small VM at a cloud provider; everything should start with something like this, with a hostname which I set on it. After it was installed, I used Caddy as a static web server for the static pages. Great project, makes it easy to host things with sane default TLS configuration without any effort. So it was just spinning up the container, and it immediately served the web pages like I wanted to have them. Now we need a component which does the heavy lifting, which transports the data to the devices of the people. There are many solutions around, and since I'm working as a cloud solution architect with Kubernetes, you always look at the CNCF landscape. And as in my company, so for this project, I saw NATS as the solution.
We use it for microservice interaction in our projects, but NATS also has a WebSockets interface, which makes it possible that for the people getting the static web page in their browser, the JavaScript part connects through WebSockets to the NATS server. And then the musician needs to have a computer which polls the API, and as soon as there is a new song, the lyrics get sent to the message broker. When we talk about message brokers, there are a few patterns around for how messages are distributed. We have a classical fan-out pattern here: the message comes in, and the message broker distributes it among all of the subscribed devices. And what's really nice about the approach: it's just a few lines of code in the end. So let me show you. We have the project here; you will get access to the GitLab repository at the end of the talk, and it's also linked online. Okay, so we have the NATS server. The ports: 8443 is the WebSockets port over TLS, 4222 is the NATS native port, mapped to the outside. Then we tell NATS the hostname for the TLS mapping, and since Caddy takes care of the certificates, we map the certificates from the Caddy directory as a read-only file mount which we can access, in this Docker Compose setup I have here. And Caddy, the easiest thing: just the regular web server ports mapped to the outside. Caddy took care of getting the Let's Encrypt certificates automatically. I only had to set the HSTS headers and had an A+ on Qualys' SSL check; it's something I always want to try. Okay, and look at the application itself. This div does everything, so it's more meta text than real payload on the page. There's the div with the id "lyrics", and as soon as something new comes in, its content is replaced. The JavaScript part is also something very simple; you see, that's everything I did, lazy developer. The communications magic is here.
We include the NATS WebSockets library, and then we connect to the NATS server. We subscribe to the subject "lyrics", and as soon as something drops in, as we receive new lyrics, we hand them over to handleLyrics, which formats the first line in bold and shows the rest just like we received it from the NATS server. If you want to have a look at the NATS configuration, it's also not much. I have defined two permission sets: one default permission, so any user of the system has that set of permissions, and it just subscribes to "lyrics"; and we have the lyrics-publish security profile for the authenticated publisher. I defined the user with the hashed password, assigned the permission, and down below here you can see the WebSocket definition, where I also use the TLS certificate Caddy gets me from Let's Encrypt. In the next line I assign anonymous access to the default user, so that it works without a login. Okay, that's it in a nutshell. If you are interested in the topic of message brokers, I can highly recommend the book Enterprise Integration Patterns. It's a book from 2003; I'm showing an IT book from 2003, but the principles are still the same. Of course, there are a few new ones; if you go to the website, they also have new patterns listed, but I wish I had known this book 20 years earlier. Now I have it on my desk. Check out my repo, check out NATS, check out the Caddy server. And you don't have to use NATS for this: you can use an MQTT server; I did the same example with EMQX. RabbitMQ should work. Also Redis, which also has a WebSockets integration, so you could use that as well. My example was in NATS. If you are interested in NATS: I asked the folks of the project if they could send me some stickers. On this corner of the desk, as you leave the room here, you can grab a sticker. After all, we are at conferences for the stickers, aren't we? That way. Okay, that's about it. What did we learn? Let others do the heavy lifting.
Just be lazy enough to find the right ways and get things done, and concentrate on the things that really matter. Reach out to me if you have questions; I will be around. Don't forget the stickers. Have a great FOSDEM Sunday and a safe trip home. I'm Markus Röntschler. Thank you. Okay. Thank you for your talk.
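A NATS server configuration along the lines Markus describes might look like the sketch below. The structure (a `websocket` block with TLS, an `authorization` block with a subscribe-only default and an authenticated publisher, and `no_auth_user` for anonymous access) follows real nats-server config syntax, but the user names, paths, and password hash are placeholders, not the project's actual values.

```conf
# nats-server.conf -- illustrative sketch, not the project's real config

websocket {
  port: 8443
  tls {
    # Certificates obtained by Caddy from Let's Encrypt (placeholder paths)
    cert_file: "/certs/fullchain.pem"
    key_file:  "/certs/privkey.pem"
  }
}

authorization {
  users = [
    # Anonymous browsers may only subscribe to the lyrics subject
    { user: anonymous, permissions: { subscribe: "lyrics" } },
    # Only the authenticated publisher may publish lyrics
    { user: publisher, password: "$2a$11$REPLACE_WITH_BCRYPT_HASH",
      permissions: { publish: "lyrics" } }
  ]
}

# Map unauthenticated connections to the anonymous user, so the
# browser-side JavaScript works without a login.
no_auth_user: anonymous
```

The key design point is that the broker, not the application, enforces the fan-out direction: everyone can listen, only one identity can speak.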
System for Television Off-air Recording and Archiving, BFI National Television Archive
So, our next speaker is Joanna White, and she will talk about the System for Television Off-air Recording and Archiving at the BFI National Television Archive. Please welcome her. Thank you. Thank you. It's wonderful to be here today. Thank you for coming, and thank you to FOSDEM for letting us speak here. I am Joanna White, developer at the BFI National Archive in the Data and Digital Preservation Department. Today I'll be talking briefly about STORA, the System for Television Off-air Recording and Archiving. It's a project that we've built in-house. So the BFI, or the British Film Institute, promotes and preserves film and television across the UK, and the BFI National Archive, a department within the BFI, is also one of the largest archives in the world. We have nearly one million digitized moving image assets in our digital preservation infrastructure, or DPI as we call it. That means they've been ingested into our Spectra Logic tape libraries for long-term preservation, and they've also been catalogued in our collections information database, which we call CID. By far the largest collection of moving image materials in DPI is our off-air television recordings, with nearly 650,000 program files. You can see a selection of them displayed here; this is our staff DPI browser, it's internal. There's also a further 800,000 preserved off-air recordings waiting to be processed and ingested in a future project. So the BFI is the body designated by Ofcom as the National Television Archive. Under a provision in the Broadcasting Act 1990, the designation allows us to record, preserve and make accessible off-air TV under section 75 of the Copyright, Designs and Patents Act 1988 and later the Copyright and Rights in Performances Regulations 2014. Okay, that's the official bit. The BFI National Archive began recording off-air TV to one-inch reel videotapes, as you can see here, in 1985, with the permission of select UK broadcasters.
Programs were curatorially chosen and captured by teams who would work around the clock in shifts. In 2015, off-air TV recording became an automated process for us, when we started collecting live TV programs 24/7. To do this, the BBC agreed to provide us with a fork of their Redux off-air capture project, which you can see here. We worked with BBC developers to integrate it into our digital preservation infrastructure. The goal was to store MPEG-TS files in our Spectra Logic tape libraries for long-term storage. This is built on open-source technology: it runs on Linux-installed servers and uses open-source tools to record both television and radio programming for the BBC. At the BFI, we just use it for off-air television. Then in May 2022, BBC Redux was shut down. In anticipation, the head of my department, head of data and digital preservation, Stephen McConnachie, launched our own R&D project the year before. Along with two BFI engineers, John Daniel and Brian Fattarini, we built a software recording solution to emulate many features of Redux, with the aim not to disrupt our existing DPI ingest workflows during that changeover period. So, like Redux, STORA records satellite-streamed UK broadcasts. The channels are a mix of high-definition and standard-definition streams, many broadcasting 24 hours a day. One full day of off-air recording captures around 500 programs to our storage. That's roughly 850 gigabytes of data, and roughly 300 terabytes every year. We receive our signals from the Astra satellites, which broadcast mostly direct-to-home TV channels in Europe; it is nice to be considered still in Europe in this regard. They're received by our satellite dishes and passed through quattro low-noise blocks before passing through TBS PCIe TV receiver cards. The signals are routed through patch fields to a multiswitch, which selects band and polarization. We use three multiswitches for STORA, so we can have 24 potential multiplexes.
We've got a Cesbo application, which demuxes each channel's MPEG transport stream into a single-program transport stream, creating a unicast Real-time Transport Protocol, or RTP, stream and a unicast User Datagram Protocol, or UDP, stream. We need both for our recording method. If you'd like to know more about the hardware setup, I can put you in touch with my colleague; it's not my area, I'm afraid. For those of you who are familiar with BBC Redux, you may recognise the folder naming convention and the contents of the folders. As I said, we have automated ingest workflows that needed this structure to be maintained. The folder path comprises the recording date, channel name, and individual program broadcast time data in the name of the folder. We've also got the unique event ID for the program that's being shown, in this case 476. Within the folder, you'll find three files. First, the info CSV: this file contains program information, including channel, title, description, etc. Next, we have the stream MPEG-TS file: this is the recording captured from the broadcast. This is not an encoded stream; it's just dumped directly to storage, so it contains the packetised elementary streams, which wrap the main data streams, usually H.264 video, AC3 or MPEG audio, subtitles, and information tables. You can view all this data really nicely when you look at it in VLC. Finally, we have the subtitles file, which contains an extracted transcript of all the spoken words from the program. It's formatted as the Web Video Text Tracks format, or WebVTT. Making sure that we don't lose any of this information is really critical to our preservation goals. STORA's code has been made possible by a wonderful collection of open source tools, which you can see here. We have Linux Ubuntu operating systems, and we use Linux command line tools throughout the code. STORA is written in Python, with a few external libraries such as Tenacity and python-vlc.
python-vlc allows us to work easily in the code with the amazing software VLC from VideoLAN; you've probably seen them around FOSDEM in their hats. VLC relies on the outstanding FFmpeg libraries to operate. FFmpeg is kind of worshipped at the BFI and in many archives globally. libdvbtee parses service information in the UDP streams, and it's key to how the scripts record the programs. MediaInfo provides detailed technical metadata for analysis of the MPEG-TS files. CCExtractor extracts the subtitles from the MPEG-TS file, saving them to a separately formatted file, and Nagios Core provides a monitoring service for real-time alerts when streams fail or recordings stop on us. So I'll quickly talk you through how STORA uses these pieces of software. We'll look first at the recording script, which makes the file containing the MPEG transport stream. We used to have two recording methods in the STORA code base, but they've recently been merged into one script; I'll unpack that shortly. Both methods capture the MPEG transport stream using VLC, but they differ in how they start and stop the recordings. So the first script I wrote utilizes electronic programme guide (EPG) metadata, which you can see at the top. We get this from a commercial supplier, retrieved daily from their REST API. The EPG data is converted in Python into a JSON schedule for a single day's programs; one is created every day for every channel. Recordings are then prompted to start and stop from this JSON schedule. The script loops through every scheduled item before it exits at the end of the last program, usually just after midnight. And then we have shell restart scripts that run from crontab, which immediately restart the script again, and it picks up the next day's schedule and carries on. Quick shout-out here: I'm quite a new developer, and when I had this project placed on my plate, it was a little bit overwhelming, but I came across this script.
It was on ActiveState Code, written in 2015, weirdly also written by somebody named J. White, JWhite88. If anyone knows them, please thank them for me. Nobody knows them. I'm going to assume time travel is a thing by the time I'm 88, and I come back in time and give this to myself, which is a nice idea. So, on to the second and better method for recording the off-air streams. It monitors the User Datagram Protocol stream, the UDP stream, gets the service information data, and watches for changes in the event ID field for that broadcast stream; you can see that at the top. The event ID is the unique identifier for a program. The script stores the last five event IDs that have been broadcast, and if a new one turns up, then it knows that a new recording needs to be triggered. So it can potentially loop indefinitely, monitoring a UDP stream in this way, creating and placing TV shows into their own unique folder paths, which you've seen. And these event ID changes almost always fall right at the beginning of a new program as it starts to record, so it's a really very neat way to start and stop the recordings in the schedule. Another shout-out is needed here for the open-source project libdvbtee. I think it's a fork from a VLC library, I'm not sure, but it's by Michael Krufke. It's a stream parser and service information aggregator library for MPEG-2 transport streams. The recording script calls dvbtee from a Python subprocess-spawned shell and captures the libdvbtee JSON-formatted response. The command has a timeout flag, which usually ensures the information is returned to you within two to five seconds. This response is reformatted and exported into a Python dictionary, and this provides the trigger for the VLC record start/stop process. So, just to visualize how this method works: it does require us to have two streams, which is a little bit awkward, but it doesn't really cause us any problems.
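A sketch of how such a JSON response might be reformatted into a Python dictionary and used as a trigger. The field names (`events`, `running`, `event_id`) are assumptions for illustration only, not libdvbtee's actual output schema, and this is not STORA's real code.

```python
import json


def extract_event_id(raw_response: str):
    """Pull the current event ID out of a JSON-formatted service
    information response. Field names here are hypothetical."""
    try:
        info = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # a timed-out or truncated response yields no ID
    # Assume the parser reports the currently running event in "events"
    for event in info.get("events", []):
        if event.get("running"):
            return event.get("event_id")
    return None


# Hypothetical response shaped like DVB service information tables
sample = '{"events": [{"event_id": 476, "running": true, "title": "News"}]}'
print(extract_event_id(sample))  # -> 476
```

Returning `None` on a malformed or timed-out response lets the monitoring loop simply poll again rather than crash mid-recording.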
So here you can see that the script monitors the UDP stream, waiting for the event ID number in that stream to change, say from 264 to 265. When the event ID change is sensed, the current VLC recording on the RTP stream is stopped, and a new folder is created with the start time and duration of the next program. Into this folder the RTP stream is placed, captured by VLC. And this is the code used to start and stop the VLC recording. The Python binding needs to create a VLC instance from the Instance class in python-vlc and initiate a new media player object; both are called into the main script to start and stop the recordings. We use the demux dump command, which uses a VLC-unique codec from the demux library, a tool developed essentially for debugging, but it actually dumps the content directly to file without decoding it. I have the append flag in there too, so that if a recording breaks midway through a program and then starts again, it will append to the existing file and not overwrite it. If that happens, a restart warning text file is placed into the channel folder with the date and timestamps, so that we know there's potentially a break in the stream. This is pretty rare, though; it doesn't happen very often. We also rely on the MediaInfo software in the get stream info script. It uses the Python subprocess again to spawn a MediaInfo call, capturing the program start and duration metadata. This is all then dumped into a CSV file. And then, to extract the WebVTT files, we use the software CCExtractor. We launch the software from the Python script, again via subprocess; subprocess is so important to these processes. This is a simple command that flags the WebVTT output format and then creates the file that you can see here. We then import this data into our CID database, which is viewable and searchable and provides rich text metadata for the curatorial teams.
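The "remember the last five event IDs" trigger described above can be sketched as pure Python. This is a simplified illustration of the idea, not STORA's actual code; the class name and interface are invented for the example.

```python
from collections import deque


class EventIdMonitor:
    """Detect new programs by watching the event ID in the UDP stream.

    The monitor remembers the last few event IDs seen; an ID that is not
    among them means a new program has started, so the current recording
    should be stopped and a new one triggered. (Simplified sketch.)
    """

    def __init__(self, history: int = 5):
        # deque with maxlen drops the oldest ID automatically
        self.recent = deque(maxlen=history)

    def is_new_program(self, event_id: int) -> bool:
        if event_id in self.recent:
            return False  # same program still broadcasting
        self.recent.append(event_id)
        return True  # trigger: stop current recording, start a new one


monitor = EventIdMonitor()
print(monitor.is_new_program(264))  # True  -> start the first recording
print(monitor.is_new_program(264))  # False -> same program continues
print(monitor.is_new_program(265))  # True  -> stop old, start new folder
```

Keeping a short history rather than only the last ID guards against a briefly flapping stream retriggering a recording that is already in progress.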
Lastly, we have Nagios, which is an event monitoring system that issues alerts when problems are detected. We have separate channel alerts for recording failures, which are identified by comparing a checksum between the current stream MPEG-TS file and one from four seconds earlier. And then we also have a stream check, which looks in the Cesbo software for an "on air equals true" for every channel. If either of those fails, then we get a display that says critical, but we also get an email sent to us with the context for what the failure is. Okay, so that's a rough guide to STORA, and in particular how the code interacts with these open source projects. The open source repository contains all the STORA scripts, descriptions of the code base, dependencies, environmental variables, and crontab launch details. It has an MIT license. I hope it may be of some interest here. As a relatively new developer, I very much welcome kind feedback and advice. None of the team in the Data and Digital Preservation Department have computer science backgrounds; they're all archivists or TV people. I used to be a cameraman and an independent documentary maker. To be able to stand here and talk about a project like STORA, with just a few years' coding experience, is really mind-blowing for me, and particularly at a time when accurately recording our televised social history is just so critical. So this has really been made possible thanks to the open source tools we use and the developers we see in the room here. Thank you from the archiving world. And quickly: there's also a growing interest in audiovisual archives globally in trying to work more with open source software and standards. Many of us meet annually at a conference called the No Time to Wait conference, which happens here in Europe. We welcome new attendees, and developers, definitely.
This conference has been connected with the development of the FFV1 codec, which was originally an FFmpeg project picked up and expanded by archivists working as developers. This codec is critical to the BFI's long-term preservation of thousands of video and film assets, so the maintenance and upkeep of projects like FFmpeg is really very important to us. Traditionally, archives have relied on expensive proprietary hardware, software and codecs that are not scalable; they keep their information behind paywalls, and they're not likely to offer the kind of technical support we need far enough into the future for long-term preservation. So having open workflows and standards developed within our own community is incredibly empowering for us. And yeah, this is the community where it's happening most, I would say, at the moment, in the UK and in Europe. That's it. Thank you. Thank you. The next talk will be in five minutes.
Do you know YAML?
So, hello, good afternoon. We are going to start the next talk, with Tina Müller. And the topic is: do you know YAML? Quite an interesting topic. So Tina, this stage is yours. Thank you. Hello. Can everyone hear me as well in the back? OK. So, who of you knows YAML? OK, are you sure you know YAML? Something about me: I've been doing Perl since 1998, and I've also been intensively doing YAML since 2017. So I guess I just have a weakness for misunderstood languages. Yes, the topics: some introduction, some history, YAML usage, versions, new libraries, and the YAML test infrastructure. Oh, I got one extra minute because the timer wasn't started. So, YAML: it all started in 2001; I think 2004 was the first specification. It was invented by Oren Ben-Kiki, Clark Evans, and Ingy döt Net. And Ingy says hi. He's also the one who's still actively working on YAML and related things, and here's actually a mini talk that he sent me that he wants you to know about. So, there's YAMLScript. Many people try to do programming things in YAML, but YAML wasn't designed for that. Ingy has been working on a new YAML-based programming language. It's complete and general purpose, best when embedded in plain old YAML files. Excellent interpolation features; merge, filter, concatenate; any functions you can imagine; define your own functions; it solves most programming things that people want to do with YAML. So here you have a YAML file, people and places, and this is the YAMLScript. You can see the header. You load the YAML file, then you get the people from it and the list of places. Here you define a function with interpolation, and here you go over the arguments of the command line and shuffle, and then you iterate over the list. And the output is this. And it just works; it's fast. And it's really easy to try it out: just go to the web page, and there's a command which you execute in bash, and then you have it installed. And yeah, there's a link to it in the slides, and the slides are already online.
So have fun with it. And that's the end of the talk within my talk, and I'll go on. So what does YAML stand for? Now: YAML Ain't Markup Language. It's a data serialization language. It's a superset of JSON. It has block style and also flow style, which many people also call JSON style because it's similar. There are many ways to write a string, but they are all kind of useful in certain areas. It has aliases, like references or pointers, and comments. And a comma after the last item is allowed; hello, JSON. Multiple documents in one file. And really powerful tags for loading objects or doing customized loading. And I started this yaml.info page, which gives you the right words to actually talk about these things. For example, some documentation refers to YAML's "references", but it's called aliases and anchors, and I think it's good to have the right terminology, because then you can actually find the right documentation for it. So, the history: YAML 1.1 was implemented by PyYAML and libyaml, with some divergence from the spec. The decisions were made with good intent, but that had other problems, because if you diverge from the spec and others do not, then it's problematic. And many other libraries ported this or used libyaml as a binding. And version 1.2 was not widely adopted for a long while; many people just didn't know about it. There is a pull request for adding 1.2 to PyYAML; I created it some time ago, but there are some issues, so it can't be merged yet. And libyaml and PyYAML were even used in the NASA Mars helicopter mission; this is something you can say these days. Yeah, as mentioned, the 1.1 implementations were really widely adopted, and there was no clear changelog for 1.2. And there hadn't been a test suite until 2016. So before 2016, updating a library to 1.2 would have meant just sitting down, reading the new spec, and starting from scratch, mostly. So this is about the history.
And now from a different angle: how do people actually get in touch with YAML? Usually you're using an application that is using YAML, or some kind of YAML, starting with examples from the documentation. So here's a Salt Stack example. You have these funny curly braces here. And is this a YAML file? No, it's an SLS file. It's not valid YAML, and you cannot use a linter or anything on it, because first it has to run through Jinja templating, and then the result is YAML, hopefully. And many people think this syntax belongs to YAML, but it does not. The intro on their website doesn't even say which version it's using, which YAML version. And here we have an Ansible example. Here we also have this syntax, but inside of a string. That's also Jinja templating, but it happens after you load the YAML, so I think that's a better way; it has disadvantages and advantages, of course. But also here, many people think this is part of YAML, and they come to our YAML channel and talk about it. The website also doesn't say anything about the YAML version; or yeah, it has some links at the bottom. And the GitHub workflow uses this syntax. That's quite nice, because the dollar sign at the beginning is not special in YAML, so you don't actually need to quote it. And many people think this is part of YAML, but it's not. And also no YAML version information; they don't document it. I tested GitHub, and I think it's doing YAML 1.2. Also, "Learn YAML in Y minutes" is mentioned, but it's also not saying anything about YAML 1.2. So what are the actual changes? They can be divided into syntax and schema changes. The syntax changes are really probably not that important; there are also a few backward-incompatible changes, but those affect even fewer people. But the schema changes are important. The schema is about deciding if something is a Boolean, a number, null, or a string. And in 1.1, there are 22 values that are resolved as Booleans: on, off, yes, no, and so on.
And you probably all know the Norway problem: NO is the same as false. So if you have a list of country codes like ES, DE, and NO, then you will not get what you think. This is unexpected, and this has been fixed. So the 1.2 schema just has a lot fewer values, so a lot fewer unexpected things happen. The sexagesimal numbers, base 60, are also gone. Who knows what sexagesimal numbers are? Wow, like a handful. No underscores in numbers are allowed anymore, and base 2 is also gone. And you can also click on the link in the slides to see these differences. So only six values for booleans. And yeah, it's a lot cleaner. But still, of course, there is this problem: when is something a number or not? So here we have a number; that's a string, and that's also a string. The thing is, what do you want? Like, you don't have to quote, and that's actually nice in many cases. And you can't have everything. So we have to live with the problem that sometimes we don't know exactly if something is a number or not. But what you can do, and what you should do in many cases, is actually validate. So who is using JSON Schema or something like that for their YAML files? Come on. OK, you should think about it. And the same actually goes for JSON; you can make mistakes in JSON, too. And you don't just send out your JSON or YAML files and think that it will just work. We have tests, hopefully. So use a validator. And we're using that in openSUSE for openQA. It can also protect you from processing an unexpected data structure with a recursive tree of aliases, which is known as the Billion Laughs attack, which is actually not a real problem of YAML, because they're just aliases. But if you process it or dump it as JSON, it will be huge. Yeah, use the right tools. So who of you knows yamllint? OK, great. So that's a great tool, and it can tell you if you have unnecessary quotes. But the thing is, I hate typing. So if you have an extremely limited number of fingers, you really hate typing.
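The 1.1-versus-1.2 boolean rules can be sketched without any YAML library. The regular expressions below are transcribed from the two schemas (all 22 values in 1.1, six in 1.2's core schema); the function name is made up for illustration.

```python
import re

# YAML 1.1 resolves all 22 of these plain scalars to booleans.
BOOL_11 = re.compile(
    r"^(?:y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF)$"
)
# The YAML 1.2 core schema only resolves these six.
BOOL_12 = re.compile(r"^(?:true|True|TRUE|false|False|FALSE)$")

def resolves_to_bool(scalar: str, version: str = "1.1") -> bool:
    """Return True if a plain (unquoted) scalar would load as a boolean."""
    pattern = BOOL_11 if version == "1.1" else BOOL_12
    return bool(pattern.match(scalar))

# The Norway problem: a plain NO is a boolean under 1.1...
assert resolves_to_bool("NO", "1.1")
# ...but stays a string under 1.2, like the other country codes.
assert not resolves_to_bool("NO", "1.2")
assert not resolves_to_bool("ES", "1.1")
```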
And so I wrote yamltidy, which removes the quotes for me, so I don't have to do it manually. And you often use formatters for other languages, too, right? So here's a yamltidy configuration, and here you can say the default scalar style should be plain. Here's a YAML file with unnecessary quotes, and this is what it looks like after yamltidy. This would have been a number, so it stays quoted. And the curly brace here is problematic. So, OK. yamllint currently supports 1.1, and Adrien is working on it to actually support 1.2, and it will also still support 1.1. What else can we do to improve the situation? So there's the YAML test suite, started in 2016, like I said. And Felix implemented NimYAML and added a lot of test cases. I started with YAML::PP and added test cases. We have 400 test cases, and 12 libraries are using it. But I would like to specifically mention a couple of libraries. There's libfyaml, and it can be used as a replacement for libyaml. It passes all tests; that's really rare. It's fast, it's actively developed, it can round-trip YAML comments. It's still experimental, and bindings to several languages are planned. There's also a new JavaScript library. It passes, I think, all tests by now. It's actively developed. Sorry. It can round-trip YAML with comments and blank lines. And it supports 1.2 and 1.1 and the merge key. And it's really good. And yeah, just because it's by me: YAML::PP passes most of the tests, except some things that are not relevant in Perl anyway, like arrays as hash keys. It also supports both YAML versions and comes with a nice highlighter. So, YAML containers. I will go a bit faster through the last slides. So Ingy started to put things into YAML containers, and you can actually look at the YAML playground right now. So here you have to start a Docker container locally, and now you can live-edit YAML here. And now we just added something that is not valid, and there's one library which actually thinks it's valid. But OK.
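What such a tidier has to check before dropping quotes can be sketched roughly like this. The function and the (deliberately conservative) patterns are my own illustration of the idea, not yamltidy's actual code: quotes are unnecessary only if the plain form would still load as the same string.

```python
import re

# Characters/positions that make a plain scalar unsafe (conservative).
SPECIAL = re.compile(r'[{}\[\],#&*!|>%@`"\']|^[\s?:-]|\s$|: |\s#')
# Plain scalars that would resolve to a bool, null, or number (1.2 core-ish).
RESOLVED = re.compile(
    r"^(?:true|True|TRUE|false|False|FALSE|null|Null|NULL|~"
    r"|[-+]?(?:\d+|\d*\.\d+)(?:[eE][-+]?\d+)?)$"
)

def can_unquote(scalar: str) -> bool:
    """True if removing the quotes would leave the value unchanged."""
    if scalar == "" or RESOLVED.match(scalar) or SPECIAL.search(scalar):
        return False
    return True

assert can_unquote("hello world")   # plain string: quotes unnecessary
assert not can_unquote("42")        # would become a number
assert not can_unquote("{a: b}")    # would become a flow mapping
```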
The test matrix is this one, and it really looks very red. Don't be scared. The test suite contains many edge cases; that's why it's so red. And yeah, we're trying to actually make it better. And so you can also visit us on our Matrix channel. There is some kind of construction going on, because Ingy is moving the server, but if it's not there, then there's a fallback on IRC. So please contact us. We are really trying to improve YAML and everything around it. And thank you very much. Thank you.
Introduction to BlissLabs and Bliss OS
Okay, let's go to the next talk. Right now it's Jon West, and the talk is an introduction to BlissLabs and BlissOS. Thank you Jon, the stage is yours. Thank you. I represent BlissLabs. Thank you very much. I represent BlissLabs and BlissOS. So what BlissLabs is, is we're a volunteer-based 501(c)(3) non-profit organization that helps open source projects thrive, mostly Android-based, but we also do Linux projects as well. Our goal is to create and maintain various open source operating systems and other software that helps extend the life of devices in order to help with the world's e-waste problems. BlissLabs also maintains BlissOS and other open source software. We're not just a bunch of projects; we're a very mature open source org with a proper organizational structure. We work to mentor and teach future open source developers in all aspects. We form alliances with other projects that share in our visions. We develop tools to help minimize the complex development process of Android and Linux and aid in learning. And we also act as a global fiscal host for open source projects that are not able to monetize in their current region. So this allows for a much larger opportunity to work with others globally and increase their user base. Our region is global, so we have members of BlissLabs all over the world. The age group ranges from 13 to 60: students, professionals, retirees, the whole nine yards. Women, men, we're LGBT-friendly, with differently abled people in key positions like CFO, CPO, CTO. For education, we don't have any requirements, really. You can be a middle school student or you could have a PhD and be a professor. Our estimated download count from the year we started in 2014 is now up to over 6 million. That's across the entire suite of BlissLabs projects. And speaking of the projects, how many of you recognize any of these logos? Anybody? Okay. To go over some of these, we have BlissOS, which is a unified Android experience irrespective of hardware.
We use Android-x86 as the base OS; it works on Intel and AMD x86_64-v2 devices and greater. We also run Waydroid, Android integration with Linux; it's a lightweight containerized base project. We run Android Linux hybrid, which is a cross of Linux and Android running on bare metal hardware. We maintain Smart Dock, which is a desktop UX for Android. We maintain XtMapper, which is an on-screen keymapper for Android as well. We maintain a community called Supreme Gamers, which is centered on Android-x86 development. We also maintain BoringDroid, which is a complete open source desktop UI solution for AOSP. We're adding Bliss Base to the mix now, which is a production-ready example of Android on x86 based on Bliss, but geared towards, we'll say, commercial applications. Then we maintain Android Generic Project, which is an easy button for Android-x86 and BlissOS builds. We also maintain BlissRoms, which is Android for Android devices. Our matured process model includes building, testing, releasing, and documentation for both devs and users, and then post-release support. Popular open source projects have links with our projects. blendOS is one; they use our Waydroid project. Ubuntu Web also uses our Waydroid project. Ubuntu Touch as well. PrimeOS uses Android Generic Project to generate their images. Then Android-x86: we fully support that project by supplying team resources, build servers, development, et cetera. What we do to support sub-projects is we attend conferences like this. We supply hardware for testing, hardware for development, hardware for build and web servers, and communication servers like Slack, Discord, GC, Telegram, Matrix, et cetera. Then we provide software services like storage, source forges, GDrive, or development services like GitLab, GitHub, access to Jira, Confluence, Trello, and then servers for updates, OTA, GitHub, and CI/CD.
Which brings me to the next part of what we're doing here: we're introducing BlissOS, which is Android on x86 hardware. It's Android for your PC. Funny thing is, my Linux PC took a crap on the travel out here. I actually had to put this whole thing together on Android-x86, so on BlissOS. That's what we're running this whole thing from today. It's based on mainline Linux, using Android-x86 patches on top of the kernel in order to provide support for the Android subsystem. We have features of a desktop UI, changes for x86 hardware, custom HALs, et cetera. Generic builds, so one ISO runs everything. We have tons of customization options included. It's a low-resource base subsystem, so it's a very low overhead system to run Android on x86 hardware, a lot like an edge Linux system would be. Then we have added Linux tools integrated with Android, like Termux and networking tools. We bring in drm_hwcomposer, gralloc, Mesa, all from mainline Linux. Some of the diverse use cases of BlissOS and the likes are kiosks, mobile devices, PCs, gaming devices like the Steam Deck, automotive displays, POS and customer-facing displays, menu displays, ad displays, TV and large screen applications, IoT and industrial IoT, industrial displays, gaming and component displays. BlissOS is open source. It's based on Apache version 2, GPL version 2 and GPL version 3. If you're using BlissOS as is, it's free and open source to use anywhere. If you're using BlissOS with a modified source, it still comes as open source as long as you release the source code. It's open for anybody to use in their project. Coming to BlissLabs is not a requirement as long as you release that source code. Then there's a small per-device or perpetual licensing cost for those that are using a modified source and do not wish to release that source code. Some of our milestone achievements as of lately: EIDU. We've gotten into a pilot program in Kenya using their low-end hardware with our operating system.
Companies have been using it in products lately. We've been shortlisted by SourceForge multiple times as a featured project. We've been adding more and more devices every year. You can find demo videos on our website as of this week. You can also download an ISO for Android 11, 12, 13, and, as of today, Android 14, which is our announcement today. We will be making BlissOS version 17 available within the next couple of days for everybody to test and download. The source is already on GitHub. Initial features are ported from Android 13 to 14, including the desktop UI, dual ethernet, multi-display, et cetera. What you can expect from us in the near future: more device groups, so we'll be supporting Raspberry Pi, RISC-V, et cetera. More leaning towards the Linux hybrid side of things, where we'll have a larger Linux subsystem running Android on top. We'll be more independent and a complete FOSS ecosystem, so we'll be providing tools for companies and individuals to create their own app stores, pretty much. Then we're going to be supplying some edge IoT and IIoT examples as well in the code. Our process: we're going to be documenting even for newbies, so we're going to continue on that, dumbing it down for everybody. And automated installation support, so we're working on a couple of new installers, graphical installers as well as text mode installers for Linux. And then we're going to fine-tune our AI-enabled support bots that we use to help answer users' questions so we don't have to. Community engagement includes contests, Easter eggs we put into the operating system, goodies we often do giveaways of, stickers, t-shirts, et cetera, get points, and then Blissify videos in the link on our website. Opportunities available through BlissLabs and BlissOS are internships, mentorships, and open roles, paid soon. Those will be web development, server maintenance, project management, HR, finance, and developers.
Contributors can get mileage for the next move in their career or can be absorbed by BlissOS/BlissLabs for commercial opportunities. We have a very easy, streamlined way of joining. We cut out most of the crap, no ego, and we are healthy and drama-free. We are a full democracy on the team, so we have a flat structure: nobody's head of anything, nobody's overseer, nobody's the god of Bliss. And that brings us to where you can find us online. So if you scan the QR code, that'll take you to our Linktree, and that has all the links available for you to contact us and move forward in the future. If you have any questions, I will be available outside the room, and I have my device and a couple of other screens so I can demo if you guys are interested. Thank you very much.
Introducing the Open Podcast API
So, let's start the next talk: introducing the Open Podcast API, with Keunes and Ciarán Ainsworth. Sorry for the pronunciation. So the stage is yours. Thank you. All right. Thank you. Thank you, everyone. So, yeah. We are here to present the... Thank you. We're here to present the Open Podcast API, which is a new specification that allows users to actually synchronize their podcast listening data, like your subscriptions, episodes, where you started, where you want to continue, et cetera, et cetera. So why are we actually doing this new API? It's because we have a problem. The problem is that there is actually a de facto standard for synchronizing podcasting data between devices, but it has a couple of challenges, let's say. One of them being that it's no longer actively maintained for the moment. There is a draft for a new version of the API, which is good, but it has been still for a while. So that's one issue. But maybe a bigger problem is that there are some fundamental technical issues in the API and the way it's designed. One of them is about episode identification, which is basically based on the media URL in the RSS feed. And that thing is not always unique. RSS is a standard, but it's a Wild West at the same time. So we can't really rely too much on that. So that's a problem. And also, the software behind this standard has some issues with feed duplication, which occurs if a podcast changes the RSS feed: they change their URL, and then you get the same podcast twice in your subscriptions list. Oh, and we didn't say that yet: I'm with AntennaPod and Ciarán is with Funkwhale. And in AntennaPod, what we see is that the service, the software behind the de facto standard API, is actually used as a centralized service. So there's a lot of users, which is great, but it's also a strain on the servers.
And so that overload is actually causing end users in AntennaPod to see errors, and then they come and complain to us, and we're like, well, yeah, we don't have too much influence over that. So the solution to this set of problems is to build a new API standard, which is actually building on the existing standard, but being more extensible, more standards compliant, and easier to implement across different projects, so that we avoid the centralization aspect. So for users, that means that they can synchronize their subscriptions, listening progress, favorites, queues, et cetera, et cetera. The idea is that they can connect all their different devices. So whether you're on your desktop or mobile, or if you have a work mobile and a private mobile, all your listening progress, et cetera, moves from one to the other. And also, this integration with the different apps would allow you as a user to actually switch from AntennaPod to Kasts, if you don't like AntennaPod for some weird reason. And so, that's on the end user side, but we need developers to implement that API, of course. To make that as easy as possible, we want to have clear and comprehensive documentation about the features, but also about the behavior. So if I send this API call, then what is expected to happen? We want the specs to be reliable and easy to implement. And also, we want them to be feature complete, because different podcasting apps and servers and services all have different features. Some might have multiple queues that you can create. Some, like AntennaPod, only have one queue. So we need to make sure that the API covers all these different use cases. So the approach is to build a new API spec based on the existing standard, which, as I assume many of you might have guessed, is gpodder.net. Note that it's a great thing to start from. And there are some issues that we are trying to solve. So we're building on it, in a way. We're not building on it.
We're taking inspiration from it, I should say. But actually, compatibility with it is not our main focus. We also try to follow the OpenAPI standard specification, because that allows for easier integration into software. By respecting this standard, we can have CI create libraries which are always up to date with the latest specification. And that's our plan, also to do that for different languages. And an important aspect is also that RSS is our single source of truth, meaning that we don't want to synchronize, for example, episode titles, because that's already in the RSS feed. So why would we synchronize data that's already in the RSS feed? But at the same time, we also already have the GUID of an episode, the unique identifier of an episode in the RSS feed. That's unique, but not really, because of the RSS Wild West. So we do actually expect to create and synchronize a truly unique identifier for episodes. And then we're also trying to be Podcasting 2.0-ready, which refers specifically to the GUID at the podcast level. And there are some technical challenges. One is about episode identification. Like I said, there's a GUID in the feed to identify an episode, but that's not always globally unique. So why is it called a GUID anyway? You have links and you have enclosure URLs. And we thought, okay, to identify an episode in order to sync data between devices, we could do a hash of these three, but they're all changing. They're all optional in the RSS standard. So you might have none of these and then end up with no hash, I guess. So that doesn't really work. So yeah, our solution is that the first discoverer of an episode, whether that's a server, if it pulls the RSS feed, or a client, creates the GUID. And then, yeah, there are some things that we need to consider, like first pulling the new information from the server and then sending it back, to avoid race conditions, et cetera.
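The first-discoverer scheme described above can be sketched like this. The function, field names, and the use of random UUIDs are my own assumptions for illustration, not the specification's actual design:

```python
import uuid

def ensure_episode_guid(episode: dict, known: dict) -> str:
    """Return the synced identifier for an episode, minting one on first sight.

    `known` maps a locally derived key (the feed's GUID if present, else the
    enclosure URL) to identifiers that have already been minted. Whichever
    party (server or client) sees an episode first mints the identifier;
    everyone else adopts it instead of deriving their own.
    """
    key = episode.get("guid") or episode.get("enclosure_url")
    if key is not None and key in known:
        return known[key]            # someone already minted one: reuse it
    minted = str(uuid.uuid4())       # first discovery: mint a fresh identifier
    if key is not None:
        known[key] = minted
    return minted

known = {}
first = ensure_episode_guid({"guid": "ep-1"}, known)
again = ensure_episode_guid({"guid": "ep-1"}, known)
assert first == again  # a second discovery reuses the minted identifier
```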
And we also expect the client to do the deduplication of episodes. But if you're interested in more technical aspects, there's a link to the notes. Okay, thank you. So building on that quite specific example, there's the more general question of feature compatibility. So clients and servers need to agree, in a way, on what is compatible; we need to have a way of communicating that. We can't expect all apps and services to support every single endpoint, every single call, because different apps implement different things in different ways. So to get around this, what we've decided is that we should have essentially a core feature set, where we say that specific endpoints are considered core, and you must support them as a client or as a server in order to be considered Open Podcast compatible. There is, of course, then scope to optionally extend this and to add additional endpoints which give us more functionality, but are not considered core. So you can then negotiate that between your server and your client. These would then be documented in the specification: what is necessary for compliance, what is an optional extension. And then we can work with clients and servers to map that and say, what works for you and what do you need to implement? So what sort of endpoints are we looking to add? Well, we've got a few that we've already been working on. As Keunes has already mentioned, subscriptions is a big one. It's fetching and storing and syncing all of your feeds, all of your subscriptions, between devices, with the option to update them, change their URL or change their position or whatever it may be, delete them, and manage them across all devices. Versioning is an important one. If the specification changes and we decide to deprecate something, change an endpoint, we need to express what major versions are supported, so that clients are aware of what they are able to get from the server.
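The core-versus-optional negotiation described above can be sketched in a few lines; the endpoint names and the function are illustrative assumptions, not part of the specification:

```python
# Hypothetical core feature set: every compliant server must offer these.
CORE = {"subscriptions", "versions"}

def compatible(server_endpoints: set, client_needs: set) -> bool:
    """A client/server pairing works if the server implements every core
    endpoint plus whatever optional endpoints this client relies on."""
    required = CORE | set(client_needs)
    return required <= set(server_endpoints)

# A server with core plus an optional "episodes" endpoint satisfies a
# client that relies on episode sync...
assert compatible({"subscriptions", "versions", "episodes"}, {"episodes"})
# ...but a server missing a core endpoint is not compliant at all.
assert not compatible({"subscriptions"}, set())
```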
We are currently discussing episodes but, as Keunes has already alluded to, this is a very complicated thing. So we already have a pad full of information about how we will synchronize this, but the goal is to have that implemented to synchronize status and playback position, how long you've played it, that kind of thing, for all episodes across all different feeds. In future, we would also like to be able to synchronize settings, or give an optional endpoint for synchronizing settings, a search endpoint, discovery for discovering similar podcasts and features, and also ratings and reviews, which are becoming a big part of a lot of podcast stores. Who's involved? Currently, you've got myself and Keunes. We're from AntennaPod and Funkwhale respectively. We've also been in conversations with Kasts, Podfriend, GPodder Sync for Nextcloud, and Musicpod. The idea is to get as many projects on board as possible, from both the client side and the server side. Funkwhale acts as both, but we're steering more for the server side at the moment. If you are involved in a podcast-adjacent project, we would love to hear from you and get your buy-in and your advice and feedback. Just to mention on that last point: those are all open source, but the idea is that closed source projects could also use this if they wanted to. What are our next steps? As mentioned, we're still discussing episodes. It's a big thing, something we need to get right and something that we need to finalize before we can consider ourselves at a point where we have a core endpoint. We also need to discuss authentication. This is super important. You should not be able to query somebody else's status, and you should not be able to get a hold of anyone else's data. It must be locked down. We need to discuss how we want to do that. It will probably just be a case of OAuth. That is for someone who knows more about OAuth than me to decide. We're currently building a new website.
Currently we have a website which is built using Sphinx, but we found some limitations with Sphinx in terms of having dynamic content. We're going to be rebuilding that using Astro and Starlight. It's currently just in a pull request somewhere; we're just waiting for that to get deployed. We're mapping features across apps, so we need to get a greater understanding of what different features are available in different applications, how they present that information, and therefore how we can present that in our API specification. We want to get a beta implementation in a few applications, client applications specifically. We would like to have at least two, maybe more, supporting some of those core API endpoints so that we can show that it works. Which of course means we also want to have a reference server implementation, which the Funkwhale team will be working on, just so that people have something to test against, something that they can deploy themselves if they want to. Then we can check that our client implementations also work as expected and according to the specification. If you want to get in contact with us, contact details are up here. There's a QR code you can scan, but basically search for Open Podcast API; that's where we are. We're on Matrix. We have the website, which, as I mentioned, we'll be replacing soon. Obviously we have a GitHub organization, which is where all of the conversations are currently happening. Get in touch, especially if you are interested in podcasting or are currently involved in podcasting; we'd really like to hear from you. If you have questions, we will be outside, I guess, and we would love to hear from you. We're very friendly, I promise. Thank you very much for listening. I'll just put the contact details up again so that you can all take your time and scan the code. Thank you very much. It was lovely.
FOSS for DOCS
We have Dr. Candace Makeda Moore with FOSS for Docs. Thank you. I'm Candace Makeda Moore and I will be giving a presentation about FOSS for docs. First I'll tell you a little bit about myself, then I'll tell you what I mean by FOSS for docs. Then I will try to convince you to get involved in this, or if you're already involved in it, give you some tips about what you're doing. And I'll conclude with some deeper insights about how to be successful when you do this. So, about me: I have my bachelor's from Columbia. I got my MD at Technion. I went further in my medical training. I did an internship. I did further training in emergency and radiology. Now I'm a research software engineer at the Netherlands eScience Center. So, stopping my biography there, you can probably figure out what happened. I married a really awesome guy and I wanted to get out of the place I was. So he got a job in Europe and I said I'd follow. I've learned Dutch; you can't really work in the hospital when you don't speak the native language. So I sort of reverted back to something I did before I went into medicine, which was software engineering. These days, almost three years later, I do speak Dutch, and you can find me two days a week at Rotterdam's Erasmus Medical Center. So I think I know a little bit about this, because I've helped create a lot of what I call FOSS for docs, and by that I mean free and open source software meant to help medical staff accomplish medical research or treatment goals. So right now I work mostly on CVSL, which is for processing arterial spin labeled sequences and other sequences from brain MRIs. And that's really typical of what I do. Usually I'm working with radiological data, but not always. An example where I did it is ReSurfEMG. That was a project where I was the lead engineer, because my center gave a grant to a group at the University of Twente to work with respiratory surface electromyography.
The grant ended, I guess, about a year ago, but recently I realized that the scientists and engineers I worked with have actually had a couple of releases since I left the project, so it's still going strong across a couple of academic medical centers. So now I want to warn you a little bit if you're new to this area. If you get into this, you're going to be annoyed. One thing I want to warn you about is that in hospitals and health systems, on average, we don't have the best computer scientists and software engineers. That is not always true, but maybe you could see that as a positive, right? Because if you know anything, you're going to come out looking like a hero. But seriously, I'm here to try and get more people who are really enthusiastic and into open source to think about doing these kinds of projects. Unfortunately, when you do this, if you work in a hospital, you're going to be at best outside of a hierarchy. At worst, you'll be on the bottom and people will treat you like the gum on their shoe. Okay? Just deal with it. That's part of the culture of medicine. It's a very strong culture. One of the things that distinguishes it is the language. I can tell you from experience, if you're sort of a math nerd like me, nobody's going to speak your language. Just as an example, a long time ago, when I started doing quantitative image analysis on radiological images, I tried to talk to one of my colleagues, who was another doctor, about it. I was just sort of going off about this and the dot product. He was like, wait, wait, wait. The matrix is like the matrix of the movie, right? He wasn't kidding. I mean, that's kind of just what you'll have to deal with. I want to add a couple of final warnings. If you're truly hardcore into FOSS, you will just have to make peace with the fact that people in healthcare systems use all of these proprietary products when perfectly good FOSS is available. Part of that is trust issues.
Part of that I really blame on us as FOSS creators, because a lot of FOSS projects that are actually pretty good, if you bother to read the code, just have a lack of swag and swagger. What do I mean by that? I brought an example. Logos, a little merch: make your thing stick in people's minds. If you push past all of this and you're creating software, there's a final thing I want to warn you about if you're working in a hospital system. Unfortunately, within hospital bureaucracies and health system bureaucracies, there are some people with power and some pretty weird ideas about the possibilities and ways to make wealth through technology. At some point, like myself, you will run into people who tell you: no, you can't open source this, because otherwise you won't have any money and we won't have any money and that thing won't work. It's not that they're evil. It's just that they aren't aware of how these things can actually be viable. Just as one example, most hospital and health systems have some really kind of wackadoodle legacy systems that are all joined together in a weird way in the hospital. If you do something that needs to harvest and move around data, then you can make FOSS and also charge the hospitals just to customize what you made. This is just one model. I don't have time to get into all of them, but you have to tell people this, because otherwise you'll just hit a hospital lawyer who says, no, no, no, you can't open source that. Now that I've told you about all of these things, I want to tell you why to do this anyway. The simple answer is: it matters. I've seen so many bright minds, like literal PhDs in physics, go to startups where they do things that in my opinion don't matter as much, like use neural nets on fashion on the internet, whatever. I tried this in a room of doctors and only one person got one of these. I'm curious if anyone will even guess; I'll give you some merch. Can anyone identify either of these diseases? No. Okay.
I'll give away merch at random later. These are diseases where we've had phenomenal success in getting the life expectancy up: cystic fibrosis and sickle cell. Specifically, I can tell you in the case of sickle cell, or both of them obviously, that 100 years ago, computers and computer programmers were not part of the story. Today, especially in sickle cell, that curve is going up, and that is powered by software. I can tell you that because I work with people who work specifically on this. There's also the international humanitarian angle. In my first slide, you saw me on the coast of Greece, where I was part of an emergency volunteer crew. In those efforts, software actually plays a role, because we have to do things like track infectious diseases that are coming from people and going places. You'll fight a strong culture in medicine, but you can win and you can do great things. You just need to come prepared. The three things at the top there, I think, are just non-negotiable. You might not have the funding to get all sorts of swag immediately, but for crying out loud, at least get a logo. I've seen so many beautiful projects that don't have a logo, and they don't have the kind of person on them who will go out and speak about stuff, and they don't get any use. They're going to die. The second thing is: get a medical reader. Get an MD who doesn't code that much to read your documentation and give you honest thoughts about it. You may end up, like I do, essentially splitting your documentation so there's a side for engineers and programmers and there's a side for doctors. The third thing, not as obvious, but probably the most important: get your legal game going from day zero. I mean that for everyone, even if you don't touch a piece of patient data. If you touch patient data in Europe, yeah, GDPR and all of these things will come into play, but hospitals are large bureaucratic institutions, health systems, anything that touches health.
Things like even getting the right contract may take months, and if you don't sort this out, you will end up with problems. So those three things are not optional, in my opinion. As you move forward, get some videos. This is because, as doctors and other people of this type move higher in the hierarchy, they get less and less negative feedback; they want to appear in charge of everything, and they're not going to come to a meeting and tell you, "I don't understand." Videos are something they can play in the privacy of their own home and learn what you're trying to tell them. Another thing that I think is really important, especially because I do signal processing, is getting more than one institution on board early. You will discover that algorithms you might be using at one institution might not work so well at another, and it's better to discover that early. And of course, it's great if your team has a nice person you can send to meetings. And finally, once you've really built up what you're doing, please get a no-code interface, because a lot of physicians are not even going to want to type so much as two lines at a command line, and you will never convince them otherwise. So on a deep level, these things I'm talking about really have to do with culture. And when I think about culture, I prefer the metaphor of water and fish, which I think an American writer came up with: you don't know you're in it until you're out of it. And there are really different professional cultures between medicine and software engineering. One tiny example of that is how overloaded the terms are in computer science and software engineering, like "correctness". I mean, how many things does "Docker" mean? This is just painful for me, even though I'm kind of part of both worlds. 
In the past year, I've gone to a bunch of events about diversity, and I sort of left annoyed, but they talked about breaking the world up into F-cultures and G-cultures. They say F-cultures are hierarchical and conformist; they emphasize the group, and they're usually non-Western. These are cultures that people nowadays tend to think of as exotic. Well, I've worked in the Western medical system for many, many years, and I can tell you: that's medicine, OK? Now, there is a reason for that. We can't all just go our own way and do what we want, otherwise patients might start dying. So you have to learn to navigate our culture. And unfortunately, you have to learn how to navigate your place in this hierarchy. You have to be very respectful of those above you, and not make them feel threatened, so give them their learning in smaller doses. I mentioned videos. The other thing that is super effective is to actually go sit with people. Even if they are, like what we have in the Netherlands, technical physicians, they might not be so technical. Those people are supposed to be halfway between an engineer and a doctor. You may have to sit with them and show them something as simple as the command line that we're all very used to. But that helps, because you get a sense of what they will be capable of dealing with. You'll probably walk away and think, God, I just need to make a GUI, because there's no hope. But you also get a sense of when some of your nomenclature is unsettling for people. And it will be. And finally, please worry about your legal issues. And make something shiny, in the sense that it has a logo and it's well presented. So, some final thoughts. I want to emphasize that there's a lot of unevenness in how software is spreading across the world. I've worked in places like Haiti, and I've interacted with professionals in several African countries. Software is spreading. 
And unfortunately, it's often proprietary software. This is really terrible, because what you see is that when big companies, to give an example like Microsoft, move in, they often set up systems, intentionally or not, that make a sort of vendor lock-in inevitable: the health data becomes so fused into the system that the institutions, the hospitals, just can't get away from this stuff. It's sticky. So I think it would be great if the people who make FOSS got there first, and got their shine on in a way that builds trust. I hope I've convinced you either to think about this, or maybe to up your game if you're already in this area. If you have any questions, you can send them to me; my email at the Science Center is right on the bottom line. And that's it. Thank you.
Journey to an open source contribution
Next, we have Thierry with Journey to an open source contribution. Thank you. So thank you for coming. I'm Thierry Berger. I love open source, and I'm here with you today to tell you about a few open source fixes, or stories, I've done. So follow me; let's make things better. I don't know about you, but I have a dream. My dream is that players from different backgrounds, with different interests, could still be able to play together. So you can imagine an old grandmother playing her match-three game, you can see the three candies, and she will be able to share it with her grandchild. Hey, I'm a grandchild. And that grandchild will be able to share that candy into another game, like a pet's life simulator game, something. So even though they have different interests, they can play together. It's awesome, so I'm very motivated by that pitch. So I started a hobby project using the Bevy game engine, which is an open source game engine made in Rust. And the project was going smoothly until I hit a problem: I couldn't input an at sign. And it's a big problem, because I want my players authenticated, and every email address has an at sign. So, time to fix it. I have to tell you about my keyboard. I'm French; I'm using an AZERTY keyboard. That means I have to press AltGr+0 to input an at sign. And actually, behind the scenes on Windows, that is equivalent to Ctrl+Alt+0. That Ctrl mapping is pretty important, because Ctrl can have a lot of capabilities: it can scroll with the mouse wheel, it can copy, cut, paste, it can move the cursor with the arrow keys. Well, anyway, it can do a lot of things, open the task manager and other stuff. So I opened an issue on the library I'm using; it's bevy_egui, a bridge to egui, a UI library. Hold that thought. But yeah, I opened that issue. 
It's actually scrolling; it's a very long discussion. You can see it now because I'm using a PDF, but yep. Eventually we landed on a fix. It was a very long discussion, and I think it's pretty interesting when discussions are way longer than the actual fix, because it really shows that communication in software development is very important. So yeah, if you have a problem, just ask questions and eventually everything will progress. So now we've fixed our at sign, we can progress, right? That password field was my next difficulty. I have to tell you a bit more about my project. I want to support one-time authentication: when the user registers, the application sends an email to the user, and the user copies a code from their email client and pastes it into the application, into that password field. And then, web. It was working fine on native, but on web it's a bit complicated. bevy_egui, which I told you a little bit about, uses arboard, which is a library for clipboard support, but it's mainly focused on native clients. It's a synchronous API, and on web that's a problem, because we cannot really block the browser; it would freeze the entire browser, so it's just not allowed. And arboard wants to stay simple, so that means we cannot add web support to it. So bevy_egui implemented a local-only clipboard, which is handy to copy from inside our application into our application, but that's not enough for my use case, because I want to copy from outside my application. So, time to fix it. To fix it, first I checked what my options were and how other projects were doing it, mainly eframe, the official framework from egui. And I could quickly have something working by using the web-sys crate, a crate to interface with JavaScript. I had the copy, cut and paste events going through JavaScript directly from the browser, bypassing all the Bevy machinery, which is great. But then I had another problem. 
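The local-only clipboard limitation described here can be modeled as a small standalone sketch (hypothetical and much simplified; the real bevy_egui code is different): a clipboard that falls back to an in-app buffer when no synchronous system backend is available, which is exactly the situation on web.

```rust
// Hypothetical, simplified model of a "local-only" clipboard fallback.
// `system` stands in for a real OS clipboard backend (such as arboard on
// native targets); on web there is no synchronous system backend, so None.
struct Clipboard {
    local: String,          // in-app buffer: works only inside the app
    system: Option<String>, // stand-in for the OS clipboard on native
}

impl Clipboard {
    fn new(has_system_backend: bool) -> Self {
        Clipboard {
            local: String::new(),
            system: has_system_backend.then(String::new),
        }
    }

    fn copy(&mut self, text: &str) {
        self.local = text.to_string();
        if let Some(sys) = self.system.as_mut() {
            *sys = text.to_string(); // native: also reaches the OS clipboard
        }
    }

    fn paste(&self) -> String {
        // Native reads the OS clipboard; web only sees its own buffer, which
        // is why a code copied from an email client never arrives in the app.
        self.system.clone().unwrap_or_else(|| self.local.clone())
    }
}

fn main() {
    let mut web = Clipboard::new(false);
    web.copy("inside the app");
    assert_eq!(web.paste(), "inside the app"); // in-app copy/paste works
    // ...but text sitting in the OS clipboard is invisible to this model.
}
```

Routing the browser's copy/cut/paste events through JavaScript (via web-sys), as described above, is what bridges that gap on web.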
I noticed that on Mac web, the Ctrl/Command handling was not very well implemented, because Mac users don't use Ctrl+C or Ctrl+V; they use Cmd+C and Cmd+V. We don't want to correctly support Cmd+C and Cmd+V to copy and paste but then require Ctrl+A to select all text; that's inconsistent. So, time to fix it. I fixed that by using the user agent on web to detect which platform I was on, so eventually all my controls were consistent. So at this point, the state of my whole adventure is that I have a pull request fixing the clipboard, and it's on review. It can get complicated; we did see a lot of little devils in the details. So I let it sit. The main contributor of bevy_egui is in Ukraine, so you can guess he has a lot of other stuff to do. Anyway, I can just target my branch and continue on my journey, right? What is wrong again? Let's rewind a bit. We skipped over that first fix we did, about the at sign. The fix was mostly: if Ctrl is pressed, it's not text, but if we are on Windows, it might be text, so we refine the condition. And on web, it will not work, because it's a macro there, evaluated at compile time, and on web the target will not equal Windows; it's actually wasm32-unknown-unknown, for the tech-savvy. So it's not working. Now that I've studied the subject more, I could have done the same check I was doing before, using the user agent to detect the correct platform, and that would have fixed all my problems. But I did that in another pull request, to separate things and do things the correct way. And I was a bit confused, so I first tried to remove that check, and then I was like: oh, okay, what about Alt codes? If I remove that, can I still input Alt codes? Because I'm French, did I tell you that? I like to input é, à, and other weird characters. So I removed that, and then I was like: oh well, there is another if right there, maybe it would just fix my problem. 
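The two runtime checks described above — detecting the platform from the user agent, and treating Ctrl+Alt as possible text input on Windows — can be sketched roughly like this (hypothetical helper functions, not the actual bevy_egui code; the user-agent heuristics are illustrative only):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Platform { Windows, Mac, Other }

// Naive user-agent sniffing, done at runtime on web, where compile-time
// cfg!(target_os = "windows") is useless: the target is wasm32-unknown-unknown
// regardless of the OS the browser actually runs on.
fn platform_from_user_agent(ua: &str) -> Platform {
    if ua.contains("Windows") {
        Platform::Windows
    } else if ua.contains("Macintosh") || ua.contains("Mac OS X") {
        Platform::Mac
    } else {
        Platform::Other
    }
}

// A key event with Ctrl held is normally a shortcut, not text -- except on
// Windows, where AltGr arrives as Ctrl+Alt, so Ctrl+Alt+<key> may still
// produce text (e.g. @ via AltGr+0 on a French AZERTY layout).
fn may_be_text_input(platform: Platform, ctrl: bool, alt: bool) -> bool {
    if ctrl {
        platform == Platform::Windows && alt // AltGr special case
    } else {
        true
    }
}

fn main() {
    let win = platform_from_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
    assert_eq!(win, Platform::Windows);
    assert!(may_be_text_input(win, true, true));   // AltGr combo: may be text
    assert!(!may_be_text_input(win, true, false)); // plain Ctrl+C: a command
}
```

On Mac, the same platform value is what lets Cmd, rather than Ctrl, be mapped to the copy/paste shortcuts.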
I don't know what I was thinking. It was like the emoji with the exploding head. But anyway, I decided I would have to step back a little bit. Mistakes happen. I glossed over this in a previous slide, if you remember: Bevy, behind the scenes, uses winit, which is a backend library for handling windowing. It's basically the low-level stuff that sends raw inputs. I noticed they had a lot of fixes, related or not to my issues, and I was like: ah, will I have to do all my fixes again? I wasn't too confident. So yeah, mistakes happen, because I think I would have been able to fix that by using the user agent and calling it a day. But anyway, I like rabbit holes. So I went for the winit update. Yeah, why not? I knew it wouldn't be too easy to do, because I had to track multiple main branches with multiple unstable dependencies: the Bevy main branch and the winit main branch, which had multiple commits every day. So I had to have a stronger plan than improvising. First, update and make everything compile and work; then, after everything compiled and worked, I could move to the new winit goodies, the new winit API and good stuff. So, first: when doing a dependency update, check the documentation. But I was updating to main, so the documentation is not really great. That meant foraging through the source code, pull requests, changelogs, and occasionally chatting with relevant experts. winit uses Element, and I got responses from them, so yeah, thanks, Element. When I was ready, I rolled up my sleeves and dove into the code. The first thing I did was updating a lot of enum names, and I'm thankful that most IDEs support search and replace. Yeah, VS Code, sorry. Then another task was updating a lot of dependencies. As you can see, there was a bunch. And I'd like to focus on a particular one: raw-window-handle. 
It's a crate that provides a common interface to the window. Most of the dependencies had updated to a new version, version 0.6 actually. In Bevy, we use continuous integration testing and cargo-deny, which helps us prevent duplicate dependencies. So I had to have my whole stack targeting the same raw-window-handle version. And wgpu, which is another low-level crate, for graphics, wasn't updated to it, and I felt adding yet another main branch would be too much of a time sink. So I had to use version 0.5 of raw-window-handle. And it's quite interesting how that's supported by the whole raw-window-handle ecosystem: you can just enable a feature on most ecosystem crates to say "I want to support this particular version", and everything will be consistent. I had to do a few pull requests to the dependencies, but everyone was very responsive, and we eventually had something consistent. So now we can profit, right? And progress on my task? Not yet, because the winit update is pretty complex. It can impact a lot of architectures, platforms and stuff, and I don't have every platform to test, and I also have limited time. So I reached out to the Bevy Discord for help: hey, my pull request is nearing completion, can you help me review it and test it, please? We caught a few bugs that way, so I'm very thankful for everyone who chimed in. And eventually the winit update was merged. Yes! So now we can profit, right? When I'm doing anything, I like to focus on the objective at hand, and that meant taking a few shortcuts. I noted them all as faithfully as I could. If you check out the whole pull request and the winit follow-up, there is a lot there, but I didn't write it in one go. As I discovered things, I would write them down, for me, for reviewers, and for future readers. So now I think I will step back a little bit from all this and go back to my use case. Let's recap: we did a lot of things. 
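The version-bridging features mentioned here look roughly like this in a Cargo.toml (a sketch from memory; exact feature names vary per crate and release — winit 0.29, for instance, exposes rwh_04/rwh_05/rwh_06 features — so check each crate's documentation):

```toml
[dependencies]
# Ask winit to hand out raw-window-handle 0.5 handles, matching the rest of
# the stack (e.g. a wgpu release still on 0.5), so that cargo-deny does not
# flag duplicate raw-window-handle versions in the dependency tree.
winit = { version = "0.29", default-features = false, features = ["rwh_05"] }
raw-window-handle = "0.5"
```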
We implemented copy-paste via JavaScript events, we detected the platform using the user agent, and we even updated winit. So whoa, that's a lot of things. So does it work yet? Not yet. But I'm very confident that we have everything at our disposal to make it work. So next time we talk, I will tell you how everything works perfectly. Thank you for your time. Everybody can help, and if you want to help Bevy, come to our Discord chat, or just come talk to me afterwards; I have free Bevy stickers if you want. So yeah, just come and talk. Thank you.
Trusted Postgres Architect - Deploying Postgres with Infrastructure as Code
So, next we have Boriss Mejías with Trusted Postgres Architect: deploying Postgres with infrastructure as code. Right, so thank you very much. Thanks for coming. My name, as she says, is Boriss Mejías. I used to be a solutions architect at EDB, but I grew a little bit of white hair, so now it's senior solutions architect — though that title pisses off a lot of the people who design buildings. So actually my real title is holistic system software engineer, because I like to see things from the fundamental interconnectedness of all things. I've been a developer, an operations person, a DBA, a consultant, so I like to see the whole stack. That's why I see the value of the DevOps philosophy, because it treats deploying stuff reliably as one whole thing. Apart from that, I'm also an air guitar player, and I really love metal music, among other types of music. I'm going to talk about Trusted Postgres Architect. So: who here already uses Postgres? Nice, okay, that's very good. Okay, so who didn't raise their hand but wants to use Postgres, maybe? But I think everybody already raised their hand. That's good. Okay, there you go. Thank you. So this talk is for you, and for all the rest it's also an interesting problem, because it's about reliably deploying Postgres on multiple different infrastructures. Okay, so this is the use case. This is a developer. She is trying to develop a new project. She finally wants to use Postgres, because it's been one of the most popular databases in recent years, and she has this brilliant idea, but she doesn't want to be typing single commands all the time; she wants an environment where she can test, test, test, and when everything is finally working well, she can deploy exactly the same thing into different environments: test, pre-production, staging, whatever you call it. 
The typical thing people do is: I have a container, I'm going to put that specific container onto the server. This is not exactly that, but it can also rely on containers in order to emulate the final architecture. So let me explain a little bit more. You want to do it in a reliable way, and that's why the tool is called Trusted Postgres Architect: TPA. We like to call it TPA because otherwise people confuse it with TAP, which sounds like the tap where you get your favorite beverage at the bar. The first goal she has is to deploy one single instance running Postgres 16. This could also be: you are already running Postgres 14 or 15 and want to try the new features of Postgres 16. Who is already running Postgres 16 here in the room? Okay, so much fewer than the people who were already using Postgres. So this is probably one way for you to test the new version. I'm going to just show you code here, which is YAML. You might not like it, but it is the standard way of doing Ansible stuff, of doing deployment. In this whole screen, which is pretty large, I'm going to put all the code that you need in order to have this one single instance. First of all, in TPA, you have to specify your architecture. This is M1: master-one. I know we now call it primary, but "master" still sounds nice, because it reminds me of "master, master". So that's why it's called M1. And obviously the cluster is going to have a name; "fosdem" is the name of the cluster. Then you have cluster variables, plenty of stuff that you can ignore for the moment; I'm going to come back to one of them afterwards. But the most relevant here is this one: Postgres version 16. That's what you need. Okay? So this is the version you want, and it's going to deploy that version for you. So these are the cluster variables; I'm going to come back to them later. 
Then, because you want to be able to do deployments in multiple locations for fault tolerance and high availability, it's always good to specify a location. We are in Brussels, so we are going to call the location Brussels. And we are going to have an instance, obviously. Thank you. At the ULB. But first we're going to say which type of instance and which defaults we have. We are going to do it with a Debian image; it's a specific one tailored by TPA, but you can use whatever image you want here. And here, the platform says Docker. This is not cloud-native stuff. It's really an easy way to have something that I can connect to and that tries to behave as if it were a virtual machine: a container with everything that a virtual machine would have. And as you can see, TPA uses Ansible, so we are going to have this Ansible user for connecting to the machines. And here is the instance. You specify only these parameters: the name, the location, a node number within your cluster, and the role. And the role is to be a primary; here we use the modern way of referring to the primary node. That's it. That's all the code that you need for one single instance, of course. Then, because this is Ansible, you run tpaexec, the executable of Trusted Postgres Architect: "tpaexec provision" to provision your cluster, and then "tpaexec deploy". And then you've got it. Okay? So how do you connect to it? Well, I told you that the Docker containers behave like virtual machines, so we can SSH into the machine. We do SSH using the file that is generated during provisioning; ulb is the alias we gave to the instance. And then we do the typical thing: I become the user postgres, and I execute psql. Yeah? Now, that is not really nice, because it's using the superuser postgres, and you don't want that for applications. So let's get a new requirement. But this is how you connect, okay? 
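The single-instance config.yml described above looks roughly like this (a sketch based on TPA's documented format; key names and the image name are from memory and may differ in detail, so check the TPA docs):

```yaml
---
architecture: M1
cluster_name: fosdem

cluster_vars:
  postgres_version: '16'

locations:
- Name: brussels

instance_defaults:
  platform: docker
  image: tpa/debian:12      # image name is illustrative
  location: brussels
  vars:
    ansible_user: root

instances:
- Name: ulb
  node: 1
  role:
  - primary

# Bring it up with:
#   tpaexec provision <clusterdir>
#   tpaexec deploy <clusterdir>
# Then connect via the generated SSH config:
#   ssh -F <clusterdir>/ssh_config ulb
```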
You want to have an administrator which is not the postgres user. We are going to call it Slonik, because that's the name of the blue elephant. Yeah? And this is going to be the administrator. Then you have Ada Lovelace, who is going to be the application user. And we don't want to use the postgres database, so we are going to have a fosdem database, owned by the application user. This gives us a little bit more security already. So how do you change the previous code to fulfil this new request? In the cluster variables, we are going to keep these two variables: the failover manager, which I'm going to use later on, and the Postgres version. And then I'm going to add the users. Yeah? This is how you add a user: you give the username, I ask TPA to also generate a password for that one, and the roles of that particular user, in this case superuser. That's the administrator. You can also grant permissions and stuff like that, but in this case I want a role attribute. Then we have the developer doing the application part, Ada Lovelace; we just get a password generated for that one. Then we ask for the database: we give the name and the owner. And that's it. So I'm adding new stuff; it's not just for the first deployment, it's also for maintenance. Okay? You can do a git commit of the new version of the file, so you keep track of your infrastructure with different versions. If you want to revert this, you can also do it. Right? Then, of course, provision and deploy, and you can continue. Before, I showed you that you can connect to the database through SSH and then psql. Now we want to do it like an application would. So we are going to ask TPA: give me the password that you generated in this cluster for that particular user. The password is random stuff, which is not that random: it actually contains a reference to a metal band from Belgium. 
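The users and database additions might look like this in the cluster variables (again a sketch; to my recollection TPA's keys are postgres_users and postgres_databases, but verify against the documentation):

```yaml
cluster_vars:
  failover_manager: repmgr
  postgres_version: '16'
  postgres_users:
  - username: slonik
    generate_password: true
    role_attrs:
    - superuser             # the administrator
  - username: ada
    generate_password: true # the application user
  postgres_databases:
  - name: fosdemdb
    owner: ada
```

After another provision and deploy, something like "tpaexec show-password <clusterdir> ada" prints the generated password (command name from memory).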
If you can figure it out, I will buy you a drink. Then, using that password, you can connect with normal psql: you provide the host IP, the port, the user, and you add the minus capital W so that you can type the password there, if you don't want to put it in the .pgpass file, for instance. But now I'm not using the SSH stuff; now I'm really behaving like an application. Okay? You can take a picture and try to figure out the reference. So we have all this now with that little amount of code, but we don't have any fault tolerance yet. What happens if this thing crashes? Well, we want to have a replica: exactly the same version, physical replication, and that's the new thing we are going to do. So let's take the code again. We have those cluster variables: the failover manager, which so far did absolutely nothing, and running Postgres 16. TPA can do it with a tool called repmgr, which I like a lot, and also Patroni, which is also very, very good stuff. So you can choose; in this case, I'm choosing repmgr. And then, in the instances, if you remember, I have this primary one. The only thing that I need to do is to add another instance. This one is the VUB. So you see the French-speaking one, the Dutch-speaking one, but the city name is in English so that nobody complains about which one I picked. Now you have this one, with a different role: you see, this is a replica, and this is the primary. So I have to say who is the upstream of this replica, and it is the ULB. And I have a cluster with two nodes. Again, tpaexec provision, tpaexec deploy. Let's continue. I want to have more fault tolerance. What happens if there is an attack on Brussels and both universities get destroyed? We want to have a third replica, but we don't want it replicating from the primary; we want cascading replication. But if somebody deletes a table here by accident, it's also going to be deleted on all the nodes. 
So you need backup and restore, for point-in-time recovery. That's why you want to have, in another part of the country, your barman, because you trust your barman; Barman is for backup and recovery management. It's important that your backups can be recovered: if you have backups but never restore them, you don't have backups, basically. So this is what we are going to build now. Let's get back to what we have: the location Brussels and two instances. Let's add another location, Vlaanderen; this is in the north. And then we are going to add a replica in a very nice place called Achel. It used to be a Trappist beer; not anymore. It's still a very good beer, but it's no longer a Trappist. This is just for your general knowledge about Belgium. So then the location is Vlaanderen, I'm going to say that it's also a replica, and it has the VUB as upstream. This is how I build cascading replication. I could now do provision and deploy and have my other replica, but I also want backup and recovery. So what I do is add another location, which is Wallonie. And then I add a very nice place, which is still a Trappist, cheese but also beer; this is my favorite one, actually. And look at the role: it is barman. This is how you get backup and recovery management, just by adding this: an instance with the barman role. Now, where am I taking the backup from? Well, I need some space there; I didn't put it at the bottom, because otherwise you wouldn't be able to read it, so it goes here: you just put "backup: rochefort" on the primary, and that's how you build it. So this is all the code you need, and you already have a cluster with cascading replication and a backup and recovery tool. Good. You do provision and deploy, and then you're done, and you have built an architecture. This is all done with Docker containers; the idea is that you can take exactly the same file and deploy to virtual machines and other stuff. 
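Put together, the three-location cluster with cascading replication and Barman sketched above would look something like this (key names as in the earlier sketch; "rochefort" for the Barman node is my reading of the transcript):

```yaml
locations:
- Name: brussels
- Name: vlaanderen
- Name: wallonie

instances:
- Name: ulb
  location: brussels
  node: 1
  role:
  - primary
  backup: rochefort    # Barman takes its backups from the primary
- Name: vub
  location: brussels
  node: 2
  role:
  - replica
  upstream: ulb        # streaming replication from the primary
- Name: achel
  location: vlaanderen
  node: 3
  role:
  - replica
  upstream: vub        # cascading: replicates from the other replica
- Name: rochefort
  location: wallonie
  node: 4
  role:
  - barman             # backup and recovery management
```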
So if you don't remember how to write the configuration files, it's very easy: you can also use the command "tpaexec configure", with the cluster name, and say: I want to use this architecture, running PostgreSQL version 16, the platform is going to be Docker, my operating system is Debian, and the failover manager is repmgr. And you get something very similar, where you just need to change some names. Now, look at the Docker thing here. If I change it to bare, because I like bare metal, you just change that, and you get a different configuration file with some IP addresses that you just need to fill in. It's basically the same. But you can also deploy to AWS. TPA will also create your virtual machines at AWS, if you have the credentials, and it's going to manage all the network things. So you just have to do provision and deploy, and you get everything. It's super cool. Okay, so: configure, where you provide the architecture, the platform and the OS; then you do provision; and then you do deploy. And deploy also provides some hooks, like pre-deploy, pre-initdb, post-deploy, so you can have some stuff of your own enhancing your cluster. To summarize: you have an architecture here, and an executor of that architecture, which is the orchestrator, and it's going to deploy to some machines. These machines can be virtual machines in AWS, if you have the credentials, or bare metal, or Docker containers. When I see a ship like this with containers, I always think about the albatross and "The Rime of the Ancient Mariner". Okay, those of you who listen to the same kind of music know what I'm talking about. The cool thing is that it is exactly the same architecture here, and exactly the same way of doing the provision and deploy; it's just a different target. So instead of shipping your container somewhere else, you say: I'm going to deploy the same architecture somewhere else. 
So what we basically do, when we run a project with a customer who wants a certain architecture, is deploy it using the TPA definition, and then we pass exactly the same architecture, but in Docker containers, to the support team, which continues talking with the customer after we have finished the project. So whenever the people who have the project in production hit an issue and contact support, saying "I have an issue with my architecture", support can deploy a model of it with Docker containers and test the whole thing there. So it really gives you continuity for your project. It's not just the first deployment that is easy; it's also the maintenance and the documentation. You don't want to document everything in a PDF; you want to document in code. Your configuration file is the documentation of your architecture, because you are using infrastructure as code. That's the main advantage of using this kind of tool, and that's why I like it a lot. All right. So these are the platforms. If you want to have a look at it, it's on GitHub now, released under GPL version 3. It has recently been open sourced, but we have been using this tool for six years already, so it's quite mature and embeds our best practices. Everything is done with security in mind: SSL, host-based authentication, everything is done for you. And you have the documentation at enterprisedb.com. To conclude: it is infrastructure as code. We always put the files in Git so that we have different versions and know how our infrastructure is evolving. It is good not only for testing, because you can test your entire infrastructure, but you can also use it for deployment in production afterwards. It is a way of documenting your deployment, not as PDF but as code. And it's not just for the first deployment, but also for the maintenance of your stuff. And finally, we got it open source. It's been there for a while. 
We have been using it, we have been fighting to get it open sourced, and there you have it. So you are free to use it as much as you want. All the documentation is there, but you can also contact me via my personal email, my company email, or Mastodon. And that one is also for the other social media, the one full of haters. And hail Slonik! Thank you very much.
Switching the FOSDEM conference management system to pretalx
Okay, so next we have Johan Van de Wauw, speaking on switching the FOSDEM conference management system to pretalx. It's already time? It's not too early? Okay. Wow, thank you. Hello, everyone. I'm going to talk about, well, maybe not such a technical issue: how we migrated from Pentabarf, which is the logo on the left, to pretalx as our conference management system. Very briefly about me: I do scientific programming. I develop, together with my friend over there, fiber-based monitoring solutions. And apart from that, I've been on the FOSDEM team for quite some time. I visited for the first time in 2007; I did some research for this presentation. I've managed the geospatial dev room, since a few years I've been coordinating the dev rooms, and I'm part of the FOSDEM server team. I am not a web developer, and I'm also not good at slides, as you can see. That's important to know. So what is pretalx? Oh no: what is Pentabarf, and what is pretalx? What do we use them for? Pentabarf and pretalx are the tools where people submit their talks, where dev room managers or staff choose the talks, where we review, where we build a schedule, and where we finally publish it on the website. This is the tool we used, called Pentabarf. This is the new tool, pretalx, which we used this year for the first time. Why did we switch? Anyone here: who of you has actually submitted a talk this year? Okay. Who are the dev room managers in the room? Well, okay, at least one. Most of them are in their dev rooms, of course. I would love to get some feedback from them. So what was the main issue with Pentabarf? The main issue with Pentabarf is that it's Ruby on Rails. I tried to get it running on my computer for a few years, because we wanted to improve it, but it didn't work; I couldn't get it working. Actually, my next slide is maybe more interesting. This is a screenshot I made. 
So this was the state of Pentabarf. This is the upstream master repository — you can see it had been abandoned for quite some time. We made a fork with the nice name postgres9, which gives you an idea of when that happened. You can see we did some updates, but not much. You would get people making pull requests like this one: somebody found install instructions in the web archive and wanted to add them. I also found those install instructions, and we had some in our internal wiki, but even with those I could not get it running. That's why at some point I said no, I will not improve Pentabarf, and went looking at another project, one which is still in use for other conferences. So I had a look at pretalx. Pretalx is a Django application. I had been struggling with Pentabarf for a very, very long time, and at the end of an evening I said, well, let's try the other one. So I just ran the Docker Compose setup, had the thing running, and could import a schedule which had been generated before. After maybe one hour I almost had something that looked like a full conference system for FOSDEM ready. I was quite happy with that. Yeah. I was not the first one who had planned to move to pretalx, because what are the actual issues with Pentabarf? The main issue was that nobody could install it. It was still running, but we didn't know for how long, and if some strange bug occurred, I'm not sure anyone on our team would really be capable of fixing it. Well, if there's a really bad bug, people will step up and might fix it, but it had been unmaintained for such a long time. So there have been many plans to move, but they usually failed because people said, well, we need to have that feature, and that feature, and that feature. The other thing is that nobody works on FOSDEM until, let's say, September.
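For anyone who wants to reproduce that first experiment, the setup he describes boils down to a Docker Compose run. A minimal sketch, assuming the upstream pretalx-docker repository and its default compose layout (details may differ between versions):

```shell
# Clone the community Docker setup for pretalx and start it.
git clone https://github.com/pretalx/pretalx-docker.git
cd pretalx-docker
docker compose up -d
# Once the containers are up, pretalx is served locally and a schedule
# exported from the previous system can be imported for testing.
```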
In September we do a kickoff, then we open the call for devrooms, and then it's like: okay, we're too late, it's not ready yet, we cannot use it. And the next year, again nobody works on FOSDEM until September, and again it's not ready. There's also some resistance to change. People say: it works for us, so we don't need to change it. That's mostly the internal people, not the people submitting. We had people — kernel developers — sending videos of trying to log in to Pentabarf without it working. So I think that's quite a bad state. In order to avoid those things, there were a few things I wanted to have ready before the kickoff in September, and for me there were two: it should be possible to build the website from pretalx, and we needed an audit log. What is an audit log? This is an example from Pentabarf. It records everything which is entered into the system. It's actually interesting that the last entry I could find was somebody giving feedback on a talk a year ago, but it shows every change anywhere in the system. This is really useful, because we have had these discussions. This year, a devroom manager approved a talk in another devroom. That's not nice, because the speaker books his trip and thinks he can go, and then: oh, that was a mistake. We fixed that, by the way, so devroom managers can no longer approve each other's talks, but in the beginning it was possible. We had a presentation whose scope was completely changed after it was approved — also not very nice. With an audit log you can always go back and see the history. I would actually really recommend, if you build something, to use such a log — even for a normal database, but definitely for a conference management system it makes sense. So this was one of the two things I wanted to have ready by September. It didn't have to look nice.
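The core of such an audit log is small: record who changed which field of which object, from what value to what value, and keep the entries append-only. This is a minimal, hypothetical Python sketch of the idea — not the actual Pentabarf or pretalx implementation:

```python
import datetime

class AuditLog:
    """Append-only log of field-level changes, in the spirit of the
    Pentabarf change log described in the talk."""

    def __init__(self):
        self.entries = []

    def record(self, actor, obj_id, field, old, new):
        # Every change is stored with a timestamp and the acting user,
        # so surprises ("who approved this talk?") can be traced later.
        self.entries.append({
            "when": datetime.datetime.now(datetime.timezone.utc),
            "actor": actor,
            "object": obj_id,
            "field": field,
            "old": old,
            "new": new,
        })

    def history(self, obj_id):
        # Trace back how an object reached its current state.
        return [e for e in self.entries if e["object"] == obj_id]

log = AuditLog()
log.record("devroom-manager", "talk-42", "state", "submitted", "accepted")
log.record("staff", "talk-42", "track", "Python devroom", "Main track")
```

In a real Django deployment this would more likely be a database table filled by model signals or triggers, but the shape of the data is the same.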
The view on the left is very useful. The one on the right — well, you can do those things if you need them, but you will not get happy from it. But at least we had a way to find out, if strange things happened, how they happened. The same goes for template changes, or if we changed some configuration: at least we could trace back the history. The second big thing we needed, as I said, was to be able to build the website. Why? Because our website is used by most of our other integrations, which include Matrix, SReview — which is what people use to review their videos — and all the scheduling applications that you have on your phone. So that also had to be ready, at least in some form, before we could switch. Yeah, I forgot — oh yes, now I know again. A third thing I did, which only started after this initial session, during the actual organization of the event: I created a plugin for pretalx with some FOSDEM-specific settings. For example, devroom managers publish a call for papers on the website; they can enter it here. Well, actually, all of them sent it by mail before I had the system ready, but at least next year they will be able to enter it themselves. It's similar for most of the other boxes there. They can close their call for papers, so people stop submitting to their track — some tracks like to keep it open longer — but then at least they get a URL where people can still submit. So if you're really quick, you can get that code and submit to the main track, but I don't think anyone will accept it. For during the event — this is actually something I fixed only this morning — devroom managers will find some instructions. I hope this will grow a bit over the next year, so that they have only one place to look while running their devroom.
Yeah, so as I wrote here, most of these things came a bit late, but it was only in the run-up to the conference that there was enough vibe to build them and to realize that we needed them. If you just click around in the interface, it all looks fine; it's only when you start using it that you notice you need some extra tools — unless you're really good at testing, which I'm not. We had to make some changes to pretalx itself, mostly to prevent devroom managers from editing other people's things. I made some changes to the review system. As I said before, sometimes feedback comes too late. This was something I didn't understand — that nobody complained — because if reviewers clicked "next review", they would get a random proposal from another track. That doesn't really make sense, so I changed it: it now always stays within the same track. Then, showing a speaker's other submissions: this was not enabled by default, but we have people who submit a lot of talks. One person submitted 15 talks for this FOSDEM. If those are in different tracks, all of those devroom managers would spend time looking at the same proposal again. Now, if they see the list, at least they know: okay, he is already there. The last thing we had to change, which is a bit complicated and where I'm maybe not completely happy about the workflow, is the fact that we have parallel scheduling. Pretalx itself is really made for a setup where you have a large group of reviewers and a small group of people who actually build the schedule. That works for most conferences, but it doesn't work for FOSDEM, because we have a lot of people scheduling. So, well — that was actually the nice side. Some of the last things I want to mention are the annoyances, at least for some people, mostly from the staff. Pretalx is much less information-dense.
If you just look at the resolution, you can see that here all information is spread out over the screen, while here it's very close together. Pentabarf had a search box: if you started typing, you would get to any talk. Here you go to proposals, then click on talks, then tick the checkboxes, then search — it takes much more time. So this is one of the annoyances we had this year, which we hope to improve for next year. Things that went better than with Penta, even though it was a migration year: there were many more reviews, so we could reach a larger group, I think — because it was easier to use, or maybe because we promoted it a bit more. Devroom managers can now send mails from the system. Before, they had to export all the email addresses and run them through their own mail program. Now it's all in the system, and if you go back as a speaker, you can open those threads and find your mails yourself. So finally — I have only three minutes left — what is the roadmap? What are the ideas? First of all, the audit log which I showed you is a bit integrated — well, actually the code is quite separate, but as soon as pretalx makes another release, I want to make it a separate plugin which can be installed completely apart from FOSDEM, because I think it's interesting for all pretalx users who run on Postgres; they should just use it. It will always help you. The next part goes back to what I told you earlier: Pentabarf was so hard to install, and I don't want to create a new setup that is just as hard. Well, nowadays it is a bit hard. We have a demo site — that one, pretalx-test — and I want to turn it into something you can install easily. It will not be one click, but it will be quite easy to install, so that people who want to improve it at least can. Yes.
The other thing is that we made some custom changes, and I hope to get rid of them: either integrate them into pretalx, or make sure there are signals — hooks at the places in the interface where we change something — so that our changes can live in the plugin instead of in a forked codebase. The last thing is that we want more information about previous years' submissions, because it's interesting, especially for main track speakers, to see: has this person presented before, how was the feedback? Maybe he already gave the same presentation — then we will skip him; those kinds of things. Finally, my last slide: you can help. An obvious way is to help upstream with the project. Pretalx is used by a lot of other conferences — I believe about 100 conferences or so are organized every year with pretalx; maybe many more, sorry if so. We have our own repos, but especially the first one is useful: that's the pretalx integration repo, because that's where we really run the bug tracker. I put some questions there, and I intend to put a few more, also with questions for you — especially for the users, like the devroom managers. Which settings do you want for reviews? Do you want to score from 0 to 10, or from 0 to 100, or in different categories? I just made a random choice, and I should get some feedback. Then there are the two forked repos, which are actually only useful in combination with the first one; I just list them for completeness. That's my talk.
From OpenLLM-France to OpenLLM-Europe: Paving the way to sovereign and open source AI
So thank you for coming. It's quite early, 9:30, it's difficult to start, so I will try to push some energy into this session. Just before we get started, I would like to know more about you, with three very simple questions. First, who has ever locally run an LLM on their laptop using llama.cpp, vLLM or LM Studio? Please raise your hand. Okay, right. Second question: who has ever fine-tuned a model? Fine-tuned. Okay, let's say ten. And the last one: who, like me, dreams of having a 100% open-source model — not only an open-weight model, but a really open-source model? Raise your hand. Okay, you are in the right place. So we will do the job. Okay. Yes, my name is Michel-Marie Maudet. I co-founded a software company called Linagora. We started in 2001, so we are getting very close to our 25 years — that will be next year. Our mission with Linagora is to invent and develop good tech for good, what I can sum up as ethical open source. And for AI we do the same: we do ethical AI. To achieve this goal, we started a very large and vibrant community called OpenLLM France, in June 2023. We have two main goals. First, to build trusted, sovereign and really open-source generative AI technologies. The second goal is to build a strong ecosystem around LLMs and generative AI systems. For the second objective, I can say that we have had success, because the community right now has over 450 active members, with strong support from academic and public research in France. That's very important because, for example, through GENCI we can freely use supercomputers like Jean Zay, and that is very useful for us: it gives us free GPUs to train our models. And at the same time, we have a lot of corporates — private companies — who are using AI technology or want to build AI solutions with us.
So, about this talk for today. I think there are a lot of important things involved in building ethical AI systems, but I will talk with you about three topics. First, what we could consider as open-source AI — that is the first part of my talk. The second part will be related to diversity and the under-representation of our cultures and our languages in these models today. And the third part this morning will be related to data quality and the evaluation of these models. Okay. Right. So, to be very clear, and to start with the biggest problem: the most popular open models that you are using today are not open source. They are open-weight models. This afternoon, Stefano Maffulli from the OSI, the Open Source Initiative, will give a talk reporting on their progress towards a definition of what we could consider open-source AI. I'm very proud, because I'm part of the small group of external experts working with the OSI to try to arrive at this definition. It's important to clarify the situation because, as you know — and I'm not alone; Stefano and probably some of you have published posts about it — there is a problem with the misuse of the open source term today by some players in the ecosystem. I put on this slide the OSI definition of open source. To be very clear: if you have limitations on use in the license or the terms of use, or if you don't have the artifacts — the elements needed to train the model again or to make a derived work from it — you can't say that you are doing open source. This is very clear. And today, for most of the popular models, you don't have visibility into or access to the dataset used to train the model. For us, for this community, open-source AI means three things. First, that the model and everything around it is open source:
all the tooling used, for example, to train the model, to evaluate it, the evaluation pipelines — and it's not very easy to find this information for an open model today. The second point is related to the license: the license must not limit who may use the model or what they may do with it. And the most important, the third point, is related to the data assets: open corpus, open corpora. You know, it's very interesting, because if you follow the news related to AI, you saw during these past days some new models with datasets published under an open-source license. I think that's very important, and I think 2024 will be not only the year of open-source AI, but also of dataset publication under open-source licenses. I changed my presentation last night, just after the talk of Jos, the co-founder of Nextcloud, because he presented an ethical rating system, and I'm very glad to see that we share the same point of view. It's very simple for the Nextcloud community too: if all three conditions are met, you are in the green area. If only two conditions are met, you are in yellow; only one, orange. And if you are using, for example, ChatGPT from OpenAI, zero conditions are met, so you are in the red area. So if we have developers from this beautiful Nextcloud community here this morning: thanks for your work, it's amazing and we love it. For us, by the way, we are in the green area, and we try to do the job. The second topic I would like to underline this morning is the problem that generative AI models give a less and less representative picture of what we are in terms of culture, in terms of society, in terms of language. I think the figures speak for themselves.
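The traffic-light scheme he describes maps directly onto a count of conditions met. A tiny illustrative sketch — the function and parameter names are my own, not from the talk's slides:

```python
def openness_rating(open_tooling: bool, open_license: bool, open_data: bool) -> str:
    """Map the three conditions from the talk (open model/tooling code,
    a license without field-of-use limits, open training corpora) onto
    the Nextcloud-style traffic-light rating: 3 met -> green,
    2 -> yellow, 1 -> orange, 0 -> red."""
    met = sum([open_tooling, open_license, open_data])
    return {3: "green", 2: "yellow", 1: "orange", 0: "red"}[met]
```

So a fully open model scores "green", while a hosted service exposing neither code, license freedoms nor data scores "red".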
On the left, you can see that since 2018, less than 8% of LLMs have been created in Europe. And on the right, you can see the volume of each language used to train the Llama 2 model: 0.16% for French and 0.17% for German. I don't know what you think about that, but in my view we can say that our cultures and values are not really well represented in these models today. So we have a problem, I think, and as a community we try to solve it. The first thing we tried was to adopt a data-first, or rather quality-first, approach — because small is also beautiful. We try to prove that the quality of the dataset is more important than the quantity of data you have. To demonstrate this point, we released a first model in October called Claire — Claire, like the woman's first name in France. I have nothing against Albert or Alfred, but in our community we prefer to promote women, because it is our little contribution to having more women in the AI ecosystem and the community in general. I will not go deeply into Claire, because Julie — the real one, yes — will go deep and tell you all about Claire and how we built this model. Very briefly: we gave the proof that, with a modest amount of French tokens, we can produce a very good conversational model. Conversational means that Claire is able to understand dialogue between people, with diarization. And the second feature is that Claire is able to talk like you, to hold a human-like dialogue with disfluences and hesitations, because we trained Claire on conversational data. We have continued to collect a lot of data, and today we are at around 140 billion tokens in French. And I'm very glad and happy to announce that we have started the training phase of our new model, called Lucie.
The main goal of Lucie is to fix, or yes, to improve the under-representation of the French language in LLMs generally. But at the same time, we put into our dataset some other European languages — German, Spanish, Italian — and some source code, to give the model a capacity for reasoning. And we are trying to build some new features to make this model efficient not only in French but for other languages too. You will probably be interested in following this work, and probably our custom tokenizer and so on. But the most important thing I would like to share with you this morning is that we are not the only community involved in this goal of building sovereign LLMs in Europe. I'm sure this list is not exhaustive — if anyone knows of other initiatives, please come see me just after the presentation; I would be very excited to discuss them with you. The most important thing is that we strongly believe we have all the capacity, all the technology, all the GPUs in Europe to build our own models. And that's why I'm very delighted to announce that today, during this FOSDEM, OpenLLM France becomes OpenLLM Europe. You can use this QR code to onboard yourself onto our Discord server. All the content we produced during the past six months in French is still available, but we have created a channel for each European language. So please, welcome. And if someone wants to be part of the community management team, please contact us and we will be very pleased to onboard you into our initiative. That's my talk for today.
LinTO Studio as Your Ultimate Open Source AI-driven Media Management Solution
Okay, greetings everyone. Thanks for coming to discover LinTO. LinTO is your ultimate open-source AI-driven media management solution. I'm Damien Lainé; I'm head of R&D engineering for LinTO here at Linagora, and I'm proud of it. So what is LinTO? Essentially, LinTO is a set of voice technologies that brings you the best of the open-source side of voice tech. In LinTO you can find all the cognitive APIs you are craving, like transcription — live or batch. We have a set of NLP APIs that let you add punctuation, named entities, topic identification and so on. And we have also worked on speech synthesis. This is the first layer of LinTO technologies. Leveraging those technologies, we built a full-featured alternative to Alexa and Dialogflow for building smart agents, which includes chatbots, smart assistants and voicebots with custom wake words, working in the browser — that's very neat. And finally, over the past two years, leveraging our technologies further, we built a business-oriented solution called LinTO Studio, a media management platform that enables you to load media, run these cognitive APIs, and edit the transcription — in a nutshell, to turn routine recordings into a fully qualified data lake. There is a lot of closed-source software that offers the same kind of features, but more or less all of it uses the APIs from the big players — you know them. So the question here is always the same: what happens to my data when I use services like Otter, Happy Scribe and similar dictation services? In a nutshell, you just send your data to them.
So, LinTO Studio. I will show you a quick video of the platform, but here you have all the functionalities, and note the link currently displayed: you will find a link to our alpha version, which is online and free — you can just create your account and try it yourself right after this session — and a link to our GitHub pages to download and work with the source code. LinTO Studio lets you use the APIs I've been talking about to add automatic timestamps, with our modified runtime for Whisper — whisper-timestamped, not plain Whisper by OpenAI. We also do speaker and turn identification, and everything I mentioned before. Just note that the platform is a web platform where you can collaborate in real time, using organization roles, and share resources within the platform. It ships with a companion Android application that you can use to record. One final slide before I move to the quick video: as my colleagues showed you, we work on large language models, and of course we want to leverage those technologies within LinTO Studio as well, and add the kind of feature I'm drafting here in the picture — working with the documents loaded into the platform and asking things of large language models. Okay, so here I jump to the video, which I recorded yesterday. Here on the left, I'm currently recording something with live transcription.
Whenever I'm done, I just stop; I can navigate local files and listen back to what I recorded. But what I want to do is send this recording directly to the platform, which is of course the big window displayed on the right. I send it to the platform, I choose the language and the model I want to use, and then the media I uploaded just lands in the platform. Here I can see that the transcription includes capitalization and normalization. I can also explore the platform — as I told you, it's a media management solution, a multi-user platform where everyone can create accounts and use roles within an organization. Here I showcase the way you might invite users and assign roles within a given organization. Here I show the sharing mechanism, which is a total rip-off of the way Notion does things, and I'm proud of it — it works flawlessly. I can share with external users as well, and automatically send an email when I share a transcription with a user. Okay, here I jump to the editor, where you see me use AI insights, which are our NLP APIs: you just click on the one you want to use and start generation, identifying things in your text like named entities, locations, decisions and topics, and putting highlights. You can also manipulate the text and add manual highlights to annotate it. We also have another editor, which is also very neat, where you can basically build SRT or VTT subtitles: you work with the screens — the current screen in the center — you can arrange them and their timing, and of course correct the text, which lets you add the closed captions you want to burn onto the video. Here's how I navigate within the platform: I can just use tags to fetch the document I'm looking for, or full-text search and so on. And once again, back in this recording, I can show you that I can also add some corrections to the text and change speakers, which is a
real-time collaboration, with reconciliation of multiple users editing the text. And finally, as you saw, we can export the document. Okay, that's our platform demonstrated in a nutshell. I took ten minutes for this presentation, hoping for questions from you. Thank you for this presentation; I have two questions — one of them is technical and the other one is about money. I'll start with the money: this specific project, how is it sustained? Do you have revenue for it, and what's the business? And the second question: what kind of computing power do you need to run this, for a small organization maybe? Okay, so the goal for our business is very clear: as Linagora, we offer services for tuning models. This particular platform is also intended to become a SaaS service, where — at some point, when we have time to develop a subscription — users will be able to use our system as SaaS. But the source remains free, and it can be hosted on premise with the same features. Like always at Linagora, we have no premium plan or whatever; we just feel that it's convenient to also host the solution directly as a SaaS offer. The other question was about computing power. It requires quite a lot, but we batch the processing of the transcriptions and the large-language-model inferences. We provide the best default way of doing things, and if you dig into the code, you'll see that our runtime supports pretty much everything you can dream of. We can run on CPU — of course it will be a little bit clumsy; on CPU we work with Intel extensions for Transformers and so on — and we of course work on GPU if you want to process a large batch of transcriptions when hosting on premises. Any other questions? We've got time for one more. How do you handle a typically French language trait, which is irony — because of the keywords and so on — the typically French thing which is irony, meaning that the
speaker means exactly the opposite of what he says. He's asking how we deal with the irony of the French language. Of course: using, you know, the irony mark — you know, this one. Thank you, Damien. All right, we're going to start the next talk here in two minutes.
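As a side note on the subtitle editor shown in the demo: the SRT format it builds is simple enough to sketch. This hypothetical helper (not LinTO Studio's actual code) renders timed transcription segments as numbered SRT blocks:

```python
def to_srt(segments):
    """Render (start_seconds, end_seconds, text) tuples as an SRT string:
    a counter, a 'HH:MM:SS,mmm --> HH:MM:SS,mmm' time range, the text."""
    def ts(seconds):
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text}\n")
    return "\n".join(blocks)

srt = to_srt([(0.0, 2.5, "Hello everyone."), (2.5, 5.0, "Welcome to FOSDEM.")])
```

VTT output differs only in the header line and in using `.` instead of `,` in timestamps.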
LangChain From 0 To 1: Unveiling the Power of LLM Programming
Hi, y'all. I have the privilege of introducing you to Stefano. He is from Italy, from the middle of the Italian coast. He's been a Linux enthusiast for 20 years — got me on that one. His focus is on VoIP, interestingly enough. This is his 10th FOSDEM, and his favorite animal is himself after four beers. Very appropriate. Everyone, welcome Stefano. Thank you. One of my hobbies was caving. I spent 10 years going into caves, descending pitches with ropes, crawling into mud, and doing those awful things. The reason for doing that is that, the very few times I had the chance to be the first one in an unknown place, it was awesome. When you are in an unknown place, you face some dangers, but you also have infinite possibilities. Beyond the light of your headlamp there could be anything: a river, a beach, kilometers of unexplored passages, who knows. And I feel the same about AI today. And I'd really love to increase the power of your headlamp today. So I'm going to kick-start you into LangChain. This is the GitHub page for the talk, where you can find the proof-of-concept code and the presentation itself. It's better if you look at the code during the presentation. We'll explore LangChain using one of its notable use cases, that is retrieval-augmented generation. For doing that, we will look at some of its components and concepts: document loaders, text splitters, embeddings, vector stores, retrievers, prompts and templates for generating prompts, large language models of course, and finally we'll combine some of those together in a chain. Then I'll experience the adrenaline of a live demo, and maybe we will take a look at some other notable use cases. Let's talk about our main quest first, retrieval-augmented generation. This cutting-edge technique involves giving additional data to the LLM to enhance its responses.
It's interesting because when you give additional data to the LLM, the answers become more precise and relevant; it also allows the citation of sources, and it allows responding about data that is not in the training dataset — which could even be personal data or real-time data. It's a much-discussed topic, and it's an intriguing case for showcasing LangChain. This is the scheme of what we want to obtain. Multiple use cases exist for retrieval-augmented generation; we will look at the simple one, which is question answering over unstructured data. We will take some text — our unstructured data — and put it into a storage. Then we will ask a question and use the data from the storage to help the LLM answer the question. Let's look at it in more detail. We will take data from the transcript of a YouTube video and load it into a usable format. Then we will split it into smaller parts and compute a vector representation, also known as embeddings, of this data, and store it in a database. Then we will ask a question, compute the vector representation of the question, and use it to find similar documents. Then we will put the question and the retrieved documents into the prompt and give it to the large language model. If you're thinking that it's complex, I assure you that it's not, and it fits in a few lines of code. If you're thinking that it's trivial or worthless, I assure you that it's not the case either, because there are a lot of concepts behind it. Why use LangChain? LangChain is a framework for developing LLM-powered applications. It offers us a lot of ready-to-use, off-the-shelf components and building blocks that make our life easier. Should we take our code to production, it also has components that make that easier for us, and it has a lot of samples to copy. It's fun because it has an extreme speed of improvement, and something interesting comes out of its community continuously.
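The steps just described — index chunks, embed the question, retrieve similar chunks, stuff them into the prompt — can be sketched without any framework. This toy, dependency-free version uses a bag-of-words counter in place of a real embedding model, purely to make the retrieve-then-prompt flow concrete; an actual LangChain pipeline would swap in proper embeddings, a vector store, and an LLM call:

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": word counts. A real pipeline calls an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "LangChain is a framework for LLM-powered applications.",
    "Caving means descending pitches with ropes.",
]
store = [(d, embed(d)) for d in docs]          # 1. index the chunks

def retrieve(question, k=1):                   # 2. similarity search
    qv = embed(question)
    return [d for d, _ in sorted(store, key=lambda p: -cosine(qv, p[1]))[:k]]

def build_prompt(question):                    # 3. stuff context into the prompt
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What is LangChain?")
# The resulting prompt would then be sent to the LLM.
```

The point is that "RAG" is mostly plumbing around two operations: nearest-neighbour search over vectors, and prompt construction.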
On the other hand, it's very young, and breaking changes may happen, but we like risk. We are using Python. LangChain is also available in TypeScript, but that's not my cup of tea. We also have our main requirements, which are LangChain, of course; OpenAI, which we will use as the embeddings and LLM provider; and ChromaDB as the vector store. Since we're using OpenAI, we have to provide an API key. Okay. In this part, we prepare and store our data. We will use four components: a document loader to get our data and convert it into a usable format, a text splitter to divide the document into smaller meaningful units, an embedding function to compute the vector representations, and a vector store to store our vectors. The document loader is an object that takes data from various sources and transforms it into a usable format, that is, a document. Multiple sources are available: for instance, we can have files like PDFs or text files, web pages, cloud storage such as Amazon S3 or Google Drive, social media like Reddit, Twitter, or GitHub, papers, and of course YouTube transcripts. It's also very easy to write your own if you don't find something that fits what you need; you can just extend the base loader class. This is our document loader: we are using the YouTube loader from the LangChain community package, and it will take the transcript of our video and put it into the Document class. This is the Document class. It has a page content string that will hold the transcript of our video, and a metadata dictionary that will have a key "source" with the URL of our video. Now that we have our document, we want to split it into smaller meaningful units. Why do we want to split it? Well, for three reasons. The first one is that the input size of our LLM is limited, so we want to give it smaller pieces.
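The Document shape the talk describes can be sketched as a minimal dataclass. The field names mirror what the speaker says (a page content string plus a metadata dictionary with the source URL), but this is an illustration, not LangChain's actual class, and the loader and URL here are hypothetical:

```python
from dataclasses import dataclass, field

# Minimal sketch of the Document shape described in the talk:
# the transcript text plus a metadata dictionary holding the source URL.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

# A loader is anything that returns a list of Documents. A hypothetical
# transcript loader might look like this; the fetch step is stubbed out.
def load_transcript(url, fetch=lambda u: "transcript text here"):
    return [Document(page_content=fetch(url), metadata={"source": url})]

docs = load_transcript("https://www.youtube.com/watch?v=example")
```

Writing your own loader, as the talk notes, amounts to producing exactly this: a list of documents with content and metadata.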
The second one is that, like me, our LLM tends to be easily distracted, so we want to increase the signal-to-noise ratio as much as possible and avoid distracting it with useless information. So we will choose only the pieces that are important to answer the question. And the third reason is that usually we pay per token, so the more we give, the more we pay. We can think of five levels of text splitting, from simple to complex. The simplest one is splitting by just counting characters or tokens. This is simple and easy, but it has a problem, and the problem is that we will probably end up splitting in the middle of a word or a phrase. The second level addresses this problem, and this is recursive splitting. It recursively tries to split the text on special characters like newlines or punctuation, then combines those pieces together until the maximum length specified is reached. The third one looks at the document structure, which works for HTML files or Markdown or code. Then there is the semantic chunker, which is still experimental in LangChain, and it's very interesting because it combines phrases together only if they are similar, and it uses embeddings to compute similarity. The last one is asking an LLM to split our text. This is highly experimental and also very expensive. It probably makes sense only if you think that the cost per token is going to zero. We are using the recursive character text splitter, which is the second level, and it's a good default choice. We can specify the length of the chunks and whether we want some overlap. There's no golden rule about that, so you may want to try what works best for you. Okay, now we have our documents, and we want to compute the embeddings. The embeddings are a vector representation in a high-dimensional space. That means that we take our data and represent it as a vector. Each dimension of this vector will reflect an aspect of the context or meaning of our data. There are thousands of those dimensions.
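The level-two idea, try bigger separators first and only hard-cut as a last resort, can be sketched in a few lines. This mimics the spirit of LangChain's recursive character splitter, not its exact implementation:

```python
# Simplified sketch of "recursive" splitting: try to split on bigger
# separators first (paragraphs, then sentences, then words) and only fall
# back to a hard character cut when nothing else fits.

def split_text(text, chunk_size, separators=("\n\n", ". ", " ")):
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    for sep in separators:
        if sep in text:
            pieces, chunks, current = text.split(sep), [], ""
            for piece in pieces:
                candidate = (current + sep + piece) if current else piece
                if len(candidate) <= chunk_size:
                    # Greedily pack pieces until the chunk is full.
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(piece) > chunk_size:
                        # Recurse on pieces that are still too long.
                        chunks.extend(split_text(piece, chunk_size, separators))
                        current = ""
                    else:
                        current = piece
            if current:
                chunks.append(current)
            return chunks
    # Last resort: hard cut by character count.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

For example, `split_text("one two three four", 9)` splits on spaces and packs words greedily, so no chunk breaks mid-word; only separator-free text gets the blind character cut.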
If two pieces of text are similar, they are next to each other in the embedding space. That means we can compute the similarity of two pieces of text just by measuring the distance between those vectors. It seems complex, but for us it's very easy, because it's just a function that we pass in when we create the vector store. We are using an external provider here, that is OpenAI. A note about privacy: obviously, if you use an external provider to compute embeddings, you are sending your data to that external provider. We now have a vector representation of our data, and our data is split. We want to store it into a vector store. A vector store is a database that is tailored for storing and searching embeddings. We are using ChromaDB here. It is open source and very easy to set up. This is the initialization, and as we said before, we are passing the OpenAI embedding function to it when we initialize it. These are the most used vector stores from LangChain's State of AI 2023 report. ChromaDB is in first place; FAISS, which is also open source, is from Meta; and Pinecone is a very popular cloud vector store. Okay, we now have our data in the vector store, and we want to use it. We will use four main components here: a retriever to search for documents similar to our question, a prompt that will give the LLM the instructions and the output we want, the LLM itself, which is the heart and lungs and brain of our application, and finally we will combine those together in a chain. Okay, the retriever is an object that is responsible for searching for documents that are relevant to answering our question. The simple retriever does this by just computing the vector representation of our question and searching for documents that are near this vector in the embedding space. This is the simple retriever. LangChain also offers more advanced retrievers, like this one, the multi-query retriever.
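"Similar text sits nearby in the embedding space" is usually measured with cosine similarity: the smaller the angle between two vectors, the closer to 1 the score. Real embeddings have thousands of dimensions; these three-dimensional toy vectors only illustrate the arithmetic:

```python
import math

# Cosine similarity: dot product of the vectors divided by the product of
# their lengths. 1.0 means "same direction", values near 0 mean unrelated.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-d "embeddings" (invented for illustration):
cat = [1.0, 0.9, 0.1]
kitten = [0.9, 1.0, 0.2]
submarine = [0.0, 0.1, 1.0]

# "cat" should land closer to "kitten" than to "submarine":
assert cosine_similarity(cat, kitten) > cosine_similarity(cat, submarine)
```

This is the function that does the work when the vector store searches for documents "near" the question vector.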
It uses the LLM component to formulate variations of our question, and then uses the embeddings of those variations to search for similar documents, similar and hopefully relevant to answering our question. Now that we have similar documents, we can put them into the prompt, and the prompt goes to the LLM. This is the prompt that we are using, and the prompt is just a template with the instructions for our LLM and, in this case, two variables: the context, which will be our documents, and the question itself. I won't delve into details because it's just a template, and we can also take this prompt from the LangChain Hub. LangChain features a hub with all the prompts and other objects that we can use, all the off-the-shelf components. We have the prompt; we want to give it to the LLM. We are using OpenAI's LLM, and this is how we initialize it. I enabled streaming, the first parameter, because it really improves the user experience, and temperature zero means that we don't want creativity or hallucination; we just want precise answers. Maybe you can argue that I should have used a different LLM provider, but nobody gets fired for buying OpenAI, so I chose that. These are the most used LLM providers, again from LangChain's State of AI report. OpenAI is in first place, and I'd like to rant a bit here, because Claude, the third on that list, is available from almost everywhere in the world except from Europe. This week the Italian data protection authority is going after OpenAI over privacy issues again. I know that there are a lot of privacy advocates here, and I also care about user privacy, but I think that defending users' rights shouldn't mean waging war against them. That's my two cents. Those are the most used open-source providers. It's interesting because the first three have very different business models: the first one rents hardware, the second has a cost per token, and the third one is for self-hosting.
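The multi-query idea above is just control flow: rephrase, retrieve per variant, merge and deduplicate. Here is a sketch with both the "LLM" rephraser and the base retriever stubbed out; only the flow is real:

```python
# Sketch of the multi-query retriever idea: ask an LLM to rephrase the
# question, retrieve with every variant, and merge the results.
def multi_query_retrieve(question, rephrase, retrieve):
    variants = [question] + rephrase(question)
    seen, merged = set(), []
    for q in variants:
        for doc in retrieve(q):
            if doc not in seen:      # deduplicate across variants
                seen.add(doc)
                merged.append(doc)
    return merged

# Stub "LLM" and retriever, invented for demonstration:
rephrase = lambda q: [q.lower(), q.upper()]
corpus = {"what is a cave?": ["doc-a"], "WHAT IS A CAVE?": ["doc-b"]}
retrieve = lambda q: corpus.get(q, [])

docs = multi_query_retrieve("What is a cave?", rephrase, retrieve)
```

The payoff is visible even in the toy: the original phrasing retrieves nothing here, but the variants surface both documents.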
We have now gathered all the components, and we want to put them together. This is all the components called one after another. We have our question; we pass the question to the retriever and get a list of documents. The list of documents is joined together into the context variable, then the context variable is used in the template to generate the prompt, and the prompt is given to the LLM. It works, nice and easy, but we can do better, and that is putting everything together using a chain. A chain is a sequence of components that performs a function, and it's better than just calling the components one after another because it has several advantages: it offers sync and async support, which allows us to take our code directly to production without changing it; it has advantages in observability; and it integrates very well with the other LangChain components that are used to take code to production. This is the code put together using the LangChain Expression Language, LCEL, which is a new way of writing those chains. It's an acquired taste, and it's quite new, it's from September, but I find it very useful once you get used to it. Okay, let's see how this works. This is our code, and there are two examples: one uses the chain, one doesn't. This is the one that doesn't use it, and it's just a few lines of code. It's very easy. Okay, I forgot the OpenAI key. Of course it doesn't work. I'm not connected, you're right. Okay, I have a backup video. By the way, it's just to give you an idea of the pace of calling the various components, and the part that takes the most time is computing the embeddings. And this is the streaming output. Okay, I have prepared some questions; those went by too fast, sorry. I gave the question to the LLM, and this is the output of the LLM.
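The chain concept, components composed in sequence, which LCEL spells with the `|` operator, can be sketched with a tiny wrapper class. This is a conceptual stand-in, not LangChain's implementation:

```python
# A chain is just components composed in sequence; overloading | lets us
# write it the way LCEL does: retrieve | prompt | llm.
class Runnable:
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, x):
        return self.fn(x)
    def __or__(self, other):          # allows: step1 | step2 | step3
        return Runnable(lambda x: other(self(x)))

# Stub components (invented for illustration):
retrieve = Runnable(lambda q: {"context": "caves are dark", "question": q})
prompt = Runnable(lambda d: f"Context: {d['context']}\nQ: {d['question']}")
llm = Runnable(lambda p: f"(answer based on: {p!r})")

chain = retrieve | prompt | llm
result = chain("Why bring a lamp?")
```

Calling the chain runs each stage left to right; wrapping the stages in one object is also what lets a framework add async support and observability without touching the user's code.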
Also, okay, this is nice, because for this one the retriever wasn't able to find the answer to the question, so it wasn't able to give us a response, and the LLM told us, "I don't know." I'm not sure if I can move forward. Maybe I also have it for the LCEL version. The LCEL version uses the multi-query retriever, so you will see now that it will ask multiple questions; each question is transformed into multiple questions. This is slow, I'm sorry. Okay, those are the questions, and this is the answer that came out. Okay. There are also other interesting use cases for LangChain. We looked at the simple one, that is question answering over unstructured data. Question answering over structured data is also very interesting: it uses the LLM component to convert our question into a SQL query, which is executed, and the result of the query is used to improve the answer of our LLM. It's very interesting. Another one is data extraction. You just have to provide a JSON schema and some unstructured text, and the JSON schema is automatically filled in with the data from the unstructured text. The LLM understands what to put into the JSON schema. It's interesting because there are people paid to do that work. Summarization is very useful, and it has a lot of, let's say, problems; it's an open problem. It's very interesting and useful. Then there is synthetic data generation, which is useful if you want to fine-tune a model, or maybe if you want to anonymize some data. It works like data extraction backwards: you have a JSON schema, and the LLM generates unstructured text that contains data that would fit into the JSON schema. Finally, there are agents, which are a key concept in LangChain, and they're very fun. With agents, the LLM takes charge of choosing what action to take. It's worth studying. It's very interesting. Okay, that's it. So, thank you. Do you have any questions? I saw his hand first. Thank you. Very interesting. My question is: how does this scale?
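The data-extraction use case boils down to a prompt that pairs a JSON schema with unstructured text and asks the model to fill the schema in. A minimal sketch, with the actual LLM call left out (in practice the model returns the filled-in JSON); the schema and text are invented examples:

```python
import json

# Sketch of the data-extraction use case: pair a JSON schema with
# unstructured text and instruct the LLM to fill the schema from the text.
schema = {"name": "string", "city": "string"}
text = "Stefano lives on the Italian coast, in a small city."

def extraction_prompt(schema, text):
    return (
        "Fill this JSON schema using only the text below.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Text: {text}\n"
        "Return valid JSON only."
    )

prompt = extraction_prompt(schema, text)
```

Synthetic data generation, as the talk says, is the same trick backwards: the schema goes into the prompt and the model is asked to invent text whose facts would fill it.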
You showed an example in which we have just one transcript. What if we had billions of transcripts? I didn't see any mention of the ranking of the retrieved chunks. If you can elaborate a little bit on that, it would be very good. Thanks. Okay, LangChain helps to take this to production. This was a proof of concept, so you can take this to production, but it's out of the scope of this talk. This was LangChain from zero to one; scaling it is going from one to a hundred. You can find a lot of examples of how to take this to production. If you take a look at the GitHub repository, there is also a link to how the LangChain people use this in production, with a chatbot that helps with searching the LangChain documentation. You can find the code, and it's very interesting. If you want to take it to production, it's worth copying that code; it's best practice. Did I answer your question? I'm sure you saw this one coming. If I have some money to spend on hardware and I want to run an LLM myself: there are a lot of proprietary services that you use, like the embeddings in particular, and also the other part on the query side at the end of the chain. How difficult is it to do this without using OpenAI? It's really easy, because LangChain allows you to swap those components. I used OpenAI here because it's the easy way to get a result. But if you, for instance, use Ollama, you can self-host the LLM and ask questions to it, or maybe with Hugging Face you can rent hardware and run your open-source model on their hardware. So it's easy, because those components are swappable. All right, y'all. Let's give Stefano one more round of applause.
ML-Guided Optimizations in LLVM
Can you hear me okay? Cool. Okay. So if the sound doesn't work for the rest of the presentation, this is basically the key of it, right? I'm a compiler engineer, not an ML specialist, so kind of like a heads-up: if I say something wrong about ML, that's why. You can use ML in an industrial compiler, which is LLVM. Actually, show of hands: has anyone heard about LLVM, Clang? Cool. Okay. About half. I have a slide about that too. So out of the box, actually, as of Clang 17 (it's not very well documented, because it's still work in progress) you can actually connect to Clang and train models. So there's an interface just for training. It's a gym kind of interface; I think that means something to the ML community, and if not, tell me. And this is not vaporware, in the sense that we actually use it for real. So I mean, you can read what's there, but we've been using it for about almost four years now, and we have some experience with it. And most of the talk is actually about trying to get to point three there, which is what we've learned. The rest of it is setup. Okay. So LLVM, for those that did not raise their hand, is an open-source project. It's a compiler. Actually, LLVM itself is a library. It defines an intermediate representation; that's what IR stands for. It contains state-of-the-art optimizations. It also knows how to lower to x86 or ARM or other targets. And then Clang is something that compiles C or C++ down to LLVM IR, so basically Clang is built on top of LLVM. And so is Swift; there's a Rust compiler, there's a Fortran compiler as well. And the LLVM project is bigger than this: there's a full toolchain there, with a debugger, linker, all of that. Actually, shameless plug for the LLVM community, which I'm part of: there's a devroom this afternoon, here somewhere. To us at Google (so, I work at Google), C and C++ is very important.
Basically, anything that is performance-critical, which is basically anything, is written in C or C++. When we say C and C++, I really mean LLVM. And when I talk about LLVM, I mean LLVM at the tip of the tree on GitHub. So we don't have a special fork or anything like that, and we really chase the head, give or take usually two weeks, so we're very close to the head all the time. We have a release team that keeps it basically in sync. And even small performance improvements matter, because a 1% saving across the fleet really means that much less hardware you have to buy, or power you have to consume, et cetera. And we keep doing this. All the performance improvements that we make are small, but they're constant, and it's like interest: it compounds. Our binaries, no shocker, serve RPC requests. No surprise there. The key thing is that to optimize these things, there are many things you can do, but as compiler engineers we're primarily occupied with how we make the RPC request complete quickly. And the RPC request traverses a lot of code. Most of it is actually not the code that you want to execute: there are things like the networking stack, serialization, deserialization, security, blah, blah, blah. And all of those things are reusable code, and they try to be generic, which is the exact opposite of what I want for performance. Because for performance, I want the code to be as specialized as possible to what I'm actually doing; I don't want it to be generic, right? And for that reason, actually, the biggest levers that we have for performance are: we collect profiles that tell us where the program is actually spending time, and then we reoptimize it, meaning we recompile it with those profiles; and link-time optimizations, where we can basically look at the whole program and, based on that understanding, try to make the right decisions.
So things are big: lots of data, lots of instructions to execute, nothing fits in any cache. I'm not being ambiguous there; I'm being actually precise. No cache fits the data that we're talking about, either the instructions or the actual data being processed. So that's why optimizations like inlining are very impactful, because they contextualize; they specialize things down to what you actually really have to execute. And then you end up with large functions, which means that optimizations like register allocation have a big problem to solve. What am I doing? Okay. Here we go. Okay. Which kind of gets us to why we want to do ML. So we want to do ML because we're looking at problems that are, sorry, sequential decision making. So inlining is about: hey, is this call site worth inlining? Sure. Okay. Fine. Well, the program just changed now, right? So what about this other call site? Is it still worth inlining? Maybe not. So as you go along, the state of the problem that you're trying to optimize changes, and we don't have an oracle that tells us what's the perfect optimization decision, especially at the scale that we're talking about. I'm kind of getting at reinforcement learning here, probably no surprise to an ML community. Because otherwise what we do is we have heuristics that can only operate on local information, because those are the ones that we can actually make sense of. And we have evidence that they're not good enough, in the sense that we know that if we play a bit with them, we can find headroom in optimization. But we cannot constantly twiddle with them; we want something a bit more systematic. So that's why we are interested in ML. We are also scared of ML, because the compiler is about everything that ML is not. The compiler must be correct. I don't think that's a surprise to anyone, but it's non-negotiable.
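The "sequential decision" point can be made concrete with a toy model (entirely invented, no relation to LLVM's actual cost model): inlining a call site consumes part of a size budget, and a fixed per-call-site threshold can spend that budget on low-value sites while a high-value one sits just over the line:

```python
# Toy illustration of why local heuristics leave headroom: each inlining
# decision consumes budget, so per-site threshold checks can be jointly bad.
def greedy_inline(call_sites, budget, threshold):
    # Heuristic: inline anything whose size is under a fixed threshold,
    # as long as the total size budget is not exceeded.
    used, inlined = 0, []
    for name, size, benefit in call_sites:
        if size <= threshold and used + size <= budget:
            used += size
            inlined.append(name)
    return inlined

# (name, size, benefit) -- invented numbers:
sites = [("a", 5, 1), ("b", 5, 1), ("c", 6, 10)]

# The high-benefit site "c" is just over the threshold, so the heuristic
# spends the whole budget on the two low-benefit sites instead:
choice = greedy_inline(sites, budget=10, threshold=5)
```

Here the greedy policy picks `a` and `b` (total benefit 2) when inlining only `c` (benefit 10) fits the same budget; a policy that looks at the whole sequence of decisions, which is what reinforcement learning optimizes for, can do better than any fixed local threshold.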
The compiler must be deterministic, again because otherwise you get something you cannot live with, or it takes forever to compile things because we cannot do incremental builds. And ML, at least naively to us, felt like something more analog, more, well, fuzzy, and that's not what we are about, right? So how did we go about it? Well, first, we're not asking ML to deal with correctness. In the compiler code that I'm talking about, the code that makes decisions like inlining and register allocation and things like this, we already kind of had a separation between what's correct and what's a choice. So there are certain things that are illegal to do, so we don't do them; we don't even wonder whether they would be valuable. What we did here is we stressed that boundary even more. We created a very clear interface between ML questions, that is, heuristic or policy questions, and correctness issues. The correctness stuff is written in normal imperative C/C++ code that we can all look at and agree is actually correct, modulo bugs, as always. But then, out of the choices that are equally correct, we go and ask ML which one we should make. To the end user, we don't want to expose any of this, not because it's a shame or anything, but because the more different the compiler looks, the more difficult it would be to adopt it. So how about we make it look the same as it is today, which means no new dependencies, nothing extra, just additional flags, right? That's something that is fine. Which really means that when we give the compiler to the user, we need to embed the models inside and not expose any sort of dependency on some sort of inference engine or anything like that. But for training, it's totally different.
For training, we're totally cool with depending on TensorFlow and whatever, random generators in the weights, all of that is fine, because that's just training. And actually we're fine with compiling a different compiler just for training, because that's not for everybody, right? It's just for whoever does the training activity, which we also want to be rare, because we don't want to keep training as you're trying to ship a product. We give you the compiler, and then hopefully the models are good enough, just like heuristics today, to resist the changes that people make to their code. So basically, there are two types of interfaces that we ended up having. One is between compiler and policy, and that one is domain-specific. What I mean is, there's a different question you ask as an inlining pass from the one you ask as a register allocator, from the one you ask as an instruction selector, or something like that. But then the ML abstraction, the way we interact with ML, is common, because fundamentally, ML to us looks like a function that we pass a bunch of tensors to, and it comes back with an answer. How it's implemented is irrelevant from the perspective of the interface, and the implementations that we have are either ahead-of-time compiled, like I mentioned, or an interpreter using TFLite, like people do in embedded, or, for the gym case, we're actually doing IPC over pipes. So the state in LLVM today: if you go to GitHub and you pull LLVM down, you basically have everything that you need to add ML to a pass, if you're a compiler engineer. It's TensorFlow-centric, no surprise there, but it doesn't have to be.
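The interface shape described above can be sketched in a few lines: to the compiler, ML is a function from named feature values to a decision, and any backend (AOT-compiled model, TFLite interpreter, IPC to a trainer) can sit behind it. All names and features here are illustrative, not LLVM's actual ones:

```python
# Sketch of the two-layer interface: correctness stays in plain imperative
# code, and only the policy question is delegated to a swappable backend.
class AdvisorBackend:
    def advise(self, features: dict) -> bool:
        raise NotImplementedError

class ThresholdBackend(AdvisorBackend):
    # A "model" that happens to be a hand-written heuristic. The caller
    # cannot tell the difference, which is exactly the point.
    def advise(self, features):
        return features["callee_size"] < 42

def should_inline(backend, callee_size, call_count):
    # Correctness/legality checks come first and never go to the backend:
    if call_count == 0:          # dead call site: never worth inlining
        return False
    # Out of the legal choices, ask the policy which one to make:
    return backend.advise({"callee_size": callee_size,
                           "call_count": call_count})

backend = ThresholdBackend()
```

Swapping `ThresholdBackend` for a learned model changes only what sits behind `advise`; the pass code and its legality checks stay untouched.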
The abstraction that I mentioned earlier can be, I mean, you could plug in PyTorch or anything like that. We made a pipe-based protocol work over that abstraction, so it's clearly not TensorFlow-specific. Then there are tools that are generic, other utilities, like how you collect a corpus for training, because that's a problem too. Those are also in LLVM. We used to have them in a different repository, also open source, but they make more sense going into LLVM. The training tools that we use, so for example the Fuchsia operating system that I had on an earlier slide trains using those tools, are available there too, as a reference. But if you are a researcher, you probably want to use something like CompilerGym, which is more research-friendly. So there are different concerns in these tools. And then, also using the tooling that I mentioned, there's another body of work that produced a large corpus of IR that you can use for whatever you want, training for these purposes, or maybe doing LLM training, or anything like that. There are links there. In fact, all the links are in the slides; when you go to the FOSDEM site and you see the talk, they're there. Okay, what we learned; that's what I wanted to get to, and I'm doing well on time. Okay, so the "it works" thing. There's been work doing ML with compilers in academia, but there's a big difference between that and actually shipping a product, shipping a compiler for production teams. The key thing is that, at least for the size problem, we have evidence from the Fuchsia team that it can work completely, meaning they periodically, about every month, pull LLVM and retrain a model on their code base, all on vanilla buildbots, so normal CPU-based machines.
They train for about a day or so, and they produce a compiler at the end of that that optimizes for size, because that's what they care about. There are links down there, I think, like an example of such a buildbot. So this can all be done completely openly. And the key thing also is that it works turnkey, meaning you don't need someone to go and pay attention to it; it just works, repeatedly. And it's been working like this for almost four years now, which is good: we have a signal that we can have an industrial process that produces an optimized compiler on a cadence, right? Okay, here's what didn't work. So: performance is hard. You are ML experts, so you are not surprised by the statement that for reinforcement learning, the quality of the reward is very important. We understood that too; okay, it makes sense. However, for performance, the problem is a bit tricky, and it goes like this: you cannot just say, oh, let's run programs and see how well they run, because it takes time to build a program, and it takes time to run it. So either you do it very quickly, which means that you're doing it on small little benchmarks, which are completely irrelevant to what we're doing; then basically you learn on something that has feature-value distributions that have no match in what we're actually going to use it for, and we don't want to do that. Or you cannot do it at all, because it just takes too much time. So we were like, hold on a second, but we have profile information, like I talked about earlier. We collect this profile information that tells us where the program spends time and how many iterations loops take and all of that. So can't we do something based on that that kind of guesstimates, at least as a trend?
We don't care about absolute values, but we want at least something that allows us to compare the results of applying a new policy to a baseline. And we thought we could, and it kind of worked for register allocation. But we ended up having to select a winning model out of a set of models that we trained with this synthetic reward, and we're not very happy with that. How to put this: we're missing the explanatory part, the "why". If I train, for how long do I have to do it? And what do I have to look at, when I look at the TensorFlow rewards and all of that, to know that it's time to take the models out and, sorry, compare them on actual running benchmarks? It's basically a bit of whack-a-mole, and that's not engineering; that's whack-a-mole, right? So this is basically the main challenge for performance, and for scaling this effort to more performance problems. And, well, know that there are efforts on that, of course. Okay: ML model evaluation costs. In the big scheme of things, when we did inlining for size, or register allocation, we measured, with micro-measurements, how much it takes to evaluate the model. But in the big scheme of the entire compilation of a module, of a C++ file basically, it kind of goes into the noise; it was more like a few percent variation, and that's fine. But it's not going to be that funny if the methodology gains traction and there are lots of these things, each taking a lot of time. Also, the size of the model, which is really the weights, was kind of surprising to us.
Initially we had a small one, and then, working with some researchers in other teams at Google, they managed to produce a much, much larger model, kind of accidentally, which took us by surprise: it was suddenly 11 megs, out of nowhere. And it's kind of funny when you're trying to optimize something to reduce the size of a binary, and LLVM itself blows up, right? I think these are more like things that caught us by surprise, and to our understanding, talking to ML experts, there are ways to mitigate this. But we kind of learned that we look a lot more like an embedded scenario than we imagined, basically. So here's kind of an interesting research topic, interesting at least to us as compiler engineers, but it's a research topic for the ML community, rather: how would we know, without having to actually compare the results, that a policy loses power, if you will? Like I was saying, people like Fuchsia, for example, train a policy, and then they just decided, well, we'll just retrain one automatically whenever we produce a new toolchain. But is that overly aggressive? Or was it about time to do that anyway? It'd be great to have a signal that tells you: hey, hypothetically, maybe the feature-value distribution changed, and it's out of the domain that the model was actually trained on, so, hint hint, nudge nudge, maybe it's time to retrain. But we don't know if that's actually the right indicator. So that's what I'm saying: I think it's an interesting topic that would be valuable to us, because it would give us an early indicator purely based on compiling. We can run the compiler and just see these values as we compile; you don't have to do benchmarking for that. Oh, so, in retrospect. This is the honest truth: the first statement is true. We really thought that; we were convinced that ML is magical.
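One naive version of the retrain signal wished for above: record the feature-value ranges seen during training, and flag compilations whose features fall outside them. This is purely illustrative, a sketch of the idea, not anything LLVM actually does:

```python
# Naive distribution-drift monitor: remember the min/max of each feature
# seen during training, then report features that fall outside that range
# at compile time -- a cheap "maybe it's time to retrain" signal.
class DriftMonitor:
    def __init__(self, training_features):
        # training_features: list of dicts mapping feature name -> value.
        keys = training_features[0].keys()
        self.ranges = {
            k: (min(f[k] for f in training_features),
                max(f[k] for f in training_features))
            for k in keys
        }

    def out_of_domain(self, features):
        # Return the names of features outside the training range.
        return [k for k, (lo, hi) in self.ranges.items()
                if not lo <= features[k] <= hi]

# Invented feature values for illustration:
monitor = DriftMonitor([{"callee_size": 10}, {"callee_size": 50}])
flagged = monitor.out_of_domain({"callee_size": 500})
```

A real signal would need something statistically sounder than min/max bounds, but even this shape has the property the talk asks for: it runs purely at compile time, with no benchmarking.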
And we thought we would get these policies that are awesome, at least not regressing anything, improving things, with no regressions, and things would be great. And then we saw that all of them have the typical pattern that we also have in manually written heuristics, which is: some things regress, some things improve. So that's how things are, I suppose. And maybe we can do something better than that with additional policies that select the right one. But that was a bit of a surprise to us. Okay, performance. Like I was saying, performance has some issues, but we went ahead and looked at where the trained model finds opportunities for additional savings. Taking a step back: what do I do as a compiler engineer in these sorts of cases? I look, with the Linux perf tool, at runtime information, and I see where it's red, so where there are hotspots. Then I think really hard, look at the compiler and why it made those decisions, and I go and fix that. And then the red turns gray or green, and sweet, right? And then I have to do it again and again, until I make sure that there are no regressions in other parts of the code base. That is basically what you do in that case. So we looked at functions where we had indicators, both in the reward signal, as poor as it was, it was indicating that the model is doing better, and empirically, and yeah, they were doing better. And we were like, well, why? So we looked at the code, and we couldn't tell why. We looked with Linux perf, and there was nothing shining. I mean, the code was different; we could tell that, like a pure line-by-line diff, it was different, but nothing was popping. And then we did a bit more investigation.
And it turns out that the ML, or rather the reinforcement learning algorithm, was finding opportunities in lukewarm parts of the code. So these end up being a kind of peanut butter effect: nothing in particular is bad, or is improved categorically, but in aggregate you get a spread effect that actually amounts to something. Great, but it's possible that that something is actually just noise, right? And today we don't have a way of capturing that. We just say, hey, here's the profile that we got by collecting it from a running binary, and the ML says, great, here I found an opportunity, when actually that's purely noise. So this is the part where I had a bit of trouble deciding how to title the slide, so what I ended up doing is just saying what I wanted to say. As a compiler engineer, as a developer in open source, as an LLVM compiler engineer: if this pans out, if you get more passes and ML is actually delivering more and more value to us, what's going to happen? Well, on the plus side, I'd spend less time tuning and fiddling with thresholds and other flags that I have today in the compiler, because I can use an automatic, feedback-driven, self-improving methodology: reinforcement learning. I think that's great, because I can actually focus on understanding what actually matters for driving that performance, like what features are important, stuff like that. The barrier to entry, though, might change. Today you can use a cheap machine, not this one, but a cheap machine, and compile the compiler and look at performance optimization problems, and it's all fine.
And ML, at least in my view, has this risk of quickly skidding into "oh, you need a farm of computers." Today that's not the case; like I was saying, with what we've been doing, the models are small, so we didn't hit that problem. But it's a consideration: is it going to be harder for the aspiring compiler engineer of the future to enter the field, or what? The mental model is kind of different; I was hinting at that before. You don't think of the problem the way you did before, where you look at Linux perf and you find hotspots and stuff like that. But that's fine. Different just means different; it means we can adapt. This is my pet peeve: when you look, as a compiler engineer, at the ML frameworks, they are scary, because they're very low-level and they talk about things that I don't understand, and they don't talk about the things that I want them to talk about. We're not sure yet where that interface is, and I think part of the goal of the project is to figure out what that interface is. But today it's like that. Like I was saying, all the links are actually in the deck. And that's the end of my presentation. Yeah, questions. So the optimizations that you find using machine learning, can they also be put into LLVM itself without using machine learning? Or can they only be learned using machine learning, because it is using the data? So, let me repeat the question to make sure I have it: the types of optimizations that we learned, could we just do them as normal imperative code back in LLVM? Some yes, some no.
Especially when we looked at the type of optimizations that the size optimizer was doing, it seems some decisions are unexplainable. It would do the wrong thing early on, just because it had learned, statistically, that taking that path would turn out all right later. That's kind of hard to translate into imperative code, I think. But some might be. What I'm saying is that so far the evidence is that it's hard to do. We only have time for one more question; one more question after this. Hi, thanks for your great talk. You've been talking about applying these techniques to Clang and traditional compilers targeting, well, executables in the usual sense. What about machine learning compilers? I'm thinking of applying ML to ML; I know there is some research in that. Do your techniques connect to that? Yes. So, applying ML to ML compilers. MLIR, for example, is part of the LLVM project, and I think there is work trying to do that too. And the infrastructure would be the same, because it's all the same code. I'm not an ML-for-ML-compilers compiler engineer, the word compiler appears way too many times there, but we work with those people, so I don't see a reason they cannot apply this. The domain, though, has its own idiosyncrasies, so you cannot just take exactly what is here and apply it over, but the tooling would be the same. Does that make sense? Okay. One more question. All the way up there, really? Hi. I saw during the slides that one of the problems is that you are not really aware whether choosing one tree, one representational tree of the semantics you are trying to compile, is going to be better or worse than another tree that you did not choose. And I was wondering, are we using operations research theory?
I mean, all the mixed integer linear programming theory that gives you a model of reality and helps you understand how far you are from the optimal value of a certain representation. So, I'm not sure I understood the question; let me try to say it back to you. Okay, yeah. I'm saying that machine learning basically relies on a loss, on how far you are from a certain optimal value. And there's a branch of mathematics called operations research, which tries to describe the world in an idealized manner: you describe what a certain decision costs with respect to your objective value, compared to another decision, and you get a math formula. And there's the simplex algorithm that helps you traverse those. Yeah, and I was wondering, are we trying to integrate those two fields of mathematics? So let me give the answer, because it's also time, and if the answer doesn't make sense, let's talk. I think the key problem is understanding what that gap is, actually measuring it. And it goes back to the reward signal thing. Should we apply what you said? Probably. Again, I'm not an expert in that, so if you think it's worth doing, great. But the problem you'll hit very quickly is that the reward, the signal, that we give is bad. So then probably the rest of it falls apart, right? We need to fix that first before we can apply these things. But yeah, absolutely, we should try all sorts of methodologies; that's the whole point. Did I make sense, or did I miss it? Okay, let's talk more. All right, everyone give Mircea another round of applause, please. All right, we're starting in about two more minutes, so please stick around. Don't forget, the desks are very loud; please hold them down, don't slam them. And we have the Matrix room up and running again.
Can you help me try to figure out how to make both mics work? Can you hold it, and can you talk into that? And unmute it in a second. This, this. Yeah, yeah. Can you start? How about now? Hello? Can someone give me a thumbs up? No. Someone got a thumbs up? Hey, thanks Marty. One second. Huh. At all? Nothing at all? Nothing? Okay, yeah, this is not working at all.
Using Haystack to Build Custom Functionality for LLM Applications
We'll be starting our next talk here, from Tuana Çelik. She lives in Amsterdam, but she's from Istanbul. She loves historical fiction and free diving. Surprisingly, she has dived 25 meters down to save her GoPro before, which got a bunch of gasps this year. This is pretty crazy. So thank you so much, and take it away. Alright, thank you. Let me know if I need to eat the mic. Alright, so this particular talk is a bit of an outlier compared to the talks I usually give, because it's nearly fully a showcase of a very simple project, actually, that I built with some community members of Haystack, and of some functionalities of Haystack that made this project possible. And then I'll end by showing a few other projects that we built together. A quick side note on the way it usually goes with the Haystack community: I work for deepset, which is the company behind the open source project Haystack, and we have a Discord server, and from time to time this is basically what happens. One evening I say, I'm a bit bored, I want to do something, and I go and join a voice channel on our Discord. And there's one particular community member that I have to give a shout-out to for this particular talk, Rec, because oftentimes Rec will come up with a random idea, and either myself, or the two of us together, will just share screens, do some pair coding, and hack something together. And this particular project is exactly something that happened like that. I'm pretty sure a lot of you know this page: it's Hacker News. There's a lot on it, and it changes a lot. So the idea that Rec came up with was: why don't we build something very simple that gives you a TL;DR of the top K Hacker News articles? So we built that, and then just recently, and when I say recently I mean two days ago, it became a Hugging Face Space that you can actually get to at this QR code. And we've vamped it up a bit, and you can now pick between two models.
You can use Mixtral or an OpenAI GPT-4 model, and then provide a number for the top however-many. I made it go up to five, because no one's made of money and you're possibly going to be making API calls to OpenAI, so it goes up to five. And this is literally what I got when I ran it yesterday, and the funny thing was that at the time, the second top article was actually the FOSDEM livestreams. You get a short summary of what the top three articles are at this point, with a URL to get to the full article itself. So my whole talk is based on how this was made possible, how we actually built this project, and we built it with Haystack. Haystack is a fully open source large language model framework, all written in Python. The main idea behind Haystack is providing tooling for developers, so nothing is really plug-and-play; you're building it all yourself. The two main structures in Haystack that make this possible are called pipelines and components. A pipeline is made up of a few components attached to each other, where every component forwards some data to the next one. And I'm not going to get into what RAG is, I'm pretty sure a lot of you know what RAG is at this point, but a retrieval-augmented generation pipeline might look a bit like this. You have a query, and then an embedder component creates an embedding for that query; then a retriever component retrieves the most relevant context for your LLM to actually use; then it forwards that to what in Haystack world we call a prompt builder, so that context gets embedded into your prompt itself; and then you use the generator, and that can be any model, an open source model off Hugging Face, or OpenAI, etc. And then you get an answer. This is a pipeline, but what a pipeline does is really dictated by what components it's comprised of. You might also create a pipeline that indexes documents.
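The RAG flow just described (query, embedder, retriever, prompt builder, generator) can be illustrated with a toy, dependency-free sketch. To keep it runnable, bag-of-words counts stand in for a real embedding model and a fake context splice stands in for the generator; none of this is Haystack's actual API, just the shape of the data flow:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedder": bag-of-words counts stand in for a neural embedding.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, documents: list, top_k: int = 1) -> list:
    # Toy "retriever": rank documents by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, context: list) -> str:
    # Toy "prompt builder": splice the retrieved context into the LLM prompt.
    ctx = "\n".join(context)
    return f"Answer using this context:\n{ctx}\n\nQuestion: {query}"

docs = [
    "Haystack pipelines are built from components.",
    "Free diving requires holding your breath.",
]
query = "what are pipelines built from"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)
```

In a real pipeline the resulting prompt would then be handed to a generator component wrapping an actual model.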
I'm not really going to get into this here; you're just basically fetching the contents of a URL and then writing that into one of the available document stores that we have. But this is all made possible because of the structure called a component. A component in Haystack is something that could have, for example, in this case, two inputs and an output, but you don't necessarily have to have a defined output. Haystack doesn't really make assumptions about what a component has to be. It can also be something that has two inputs and two outputs, and the idea is that you attach those components to each other, and you can be very, very specific here. You can say: I want output one to be forwarded to input two of the next component. You can be very precise. And maybe you can already start to see that this looks quite like a graph. So how do we build these components? There are only a few requirements for something to be a component in Haystack world. We provide a bunch of ready-made components: if you go to the Haystack documentation, you'll see a bunch of sections there, generators, converters, embedders, etc. These are all components built exactly like this that we just provide in the package itself. But what you can do is build your own. What you need is a class; here I've got a very, very well-named MyCustomComponent class. I need a run function. And the other things you need are these decorators. The first one is telling Haystack that this class is a component. The second one goes around the run function, and it's actually used for pipeline validation down the line; it's telling the Haystack pipeline what outputs it should expect from this component. In this scenario, I've got a MyCustomComponent that's expecting a query, which is supposed to be a string, and it's returning documents. In this case, it's just hard-coded.
It returns "hi" and "bye". So we know that, whatever query this gets, it's going to return two documents, "hi" and "bye". And this has led to quite a bunch of components that aren't served through the Haystack framework itself, not all of them are, but that you can just install as separate packages. It's meant that community members have gone ahead and built components for their very specific custom needs and made them available to the rest of the Haystack community. So let's come to our Hacker News TL;DR project, if you will. The idea was that we wanted a component that would take top K, which could be a number, and return articles. And again, this is a Colab that you can use; it should be running. This is very much pseudocode; later, if we have time, I'll show you the actual code. But we built this component called Hacker News Fetcher. It takes top K, it queries the Hacker News API, and it gets the top however many we've decided. The other thing I wanted to show here is, at the end, I don't know how well you can see it, but we've also added some meta information, because down the line we can use meta information in our prompt: you also get the titles of the Hacker News articles, and you also get the URLs, which is great for referencing down the line too. So we return full documents that have the content, the title, and the URL of each Hacker News article that we fetched. And at the end of the day, we're going to be building a pipeline that looks like this, and everything you see in green is already provided with Haystack; it came with pip install haystack-ai anyway. The orange is what we've just built for ourselves, and it just slots into the rest of the Haystack pipeline ecosystem.
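The component pattern described above might look roughly like the following. In real Haystack 2.x you would use the decorators from the haystack package (`@component` on the class, `@component.output_types(...)` on run); here, so the example runs without any dependencies, minimal stand-in decorators are defined inline, and the fetcher returns canned items instead of calling the real Hacker News API:

```python
# Stdlib-only stand-ins for Haystack's decorators, defined here purely so the
# sketch is self-contained. They just tag the class and the run function.
def component(cls):
    cls._is_component = True          # marks the class as a component
    return cls

def output_types(**types):
    def wrap(run):
        run._output_types = types     # tells the pipeline what outputs to expect
        return run
    return wrap

@component
class MyCustomComponent:
    @output_types(documents=list)
    def run(self, query: str):
        # Hard-coded, as in the talk: whatever the query, return two "documents".
        return {"documents": ["hi", "bye"]}

@component
class HackerNewsFetcher:
    @output_types(articles=list)
    def run(self, top_k: int):
        # Sketch only: a real fetcher would query the Hacker News API here and
        # attach the title and URL as metadata for use in the prompt later.
        fake_items = [
            {"title": f"Story {i}", "url": f"https://example.com/{i}",
             "text": f"Body of story {i}"} for i in range(top_k)
        ]
        return {"articles": [
            {"content": it["text"],
             "meta": {"title": it["title"], "url": it["url"]}}
            for it in fake_items
        ]}

print(MyCustomComponent().run("anything")["documents"])   # ['hi', 'bye']
print(len(HackerNewsFetcher().run(top_k=3)["articles"]))  # 3
```

The point of the pattern is that anything shaped like this, a class with a decorated run function and declared output types, can be slotted into a pipeline next to the ready-made components.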
For this Colab that I've shared with everyone here, I decided I'd go ahead and use Mixtral; I've tried this with OpenAI models a lot, so why not try something new? And the last thing I want to highlight about this particular pipeline is how the prompt is being built. Prompt templating happens in Haystack world with a component called the prompt builder, and templates use Jinja templating. What's really important here is: okay, we have an instruction, "you'll be provided with one or more Hacker News articles, please provide summaries." But if you look at this closely, we actually have a for loop. So this prompt builder automatically knows that it should expect an input called articles, that it can loop through those articles, and that it can access the contents of each article object individually at every step of that for loop. And that's how we're embedding the URL here as well. And this is the final product. At the end of the day, we were able to build a pipeline where, given top three, we could run it and get the TL;DR summary, and the URLs where you can find the full articles, of the current Hacker News top articles. So with that, I want to show a few other projects that this custom component building functionality has enabled. The next one is slightly questionable; please take it with a pinch of salt. I put that warning everywhere on the Hugging Face Space too. The idea came from Twitter, which was very different at the time. The idea was: could we build a Twitter fetcher that, given a username, could give you, this is really bad, a vibe check of the account? We called it "should I follow?", and it gets the last, I think, 40 posts of that user. Obviously after that, Twitter changed, so I went ahead and built a Mastodon fetcher. You can also find that on the Haystack integrations page.
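Going back to the Hacker News pipeline for a moment: the Jinja-templated prompt described above might look roughly like this fragment. This is an illustration of the shape of such a template, not the talk's verbatim prompt, and the field names (articles, meta.title, meta.url, content) are assumptions about how the fetcher's documents are structured:

```jinja
You will be provided with one or more Hacker News articles.
Please provide a short summary of each.
{% for article in articles %}
Title: {{ article.meta.title }}
URL: {{ article.meta.url }}
Content: {{ article.content }}
{% endfor %}
```

Because the template references an `articles` variable, the prompt builder knows to expect an input with that name and loops over it, one article per iteration.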
And the best way I like to showcase this is actually using my boyfriend's Mastodon account, because every time it tells me something a bit funny about his account: once it called him pessimistic, this time it called him "sarcastic when discussing personal opinions." So that's also open; I think I linked to it in the notes as well, so you can go ahead and try that out. You just need to provide the full Mastodon username, without the @ at the front; that's a bug I haven't fixed yet. Another thing this enabled uses not only the Haystack custom component functionality but also, I don't know if you remember when I showed the components earlier with the two outputs and the two inputs, et cetera, you can already start to imagine that you can actually have these pipelines loop too. So the idea was: what if we have some meeting notes, and we have our own GitHub repository? Anyone who's used GitHub repositories knows that you can create issue labels that are very specific to that repository. Could we build a system that, given meeting notes, generates a list of GitHub issues specifically for the repository that you're discussing in that meeting? And could we then use those generated structured outputs to query GitHub and actually create those issues? Now, this is great, and our experience has been that a lot of large language models are actually great at generating structured output, but not necessarily in the structure that you need. So it's going to be JSON, but is it going to abide by what you need that JSON object to look like? So the idea here was: okay, why don't we create an output validator component, and we use Pydantic for that. This is all based on a tutorial that's up on the Haystack website right now, and what we did for the GitHub demo was basically modify this tutorial just a bit.
In the tutorial, we provide a Pydantic model, and we say we need the output to be CitiesData, where CitiesData has cities, and each city has a name, a country, and a population. And then we used a GPT model, and we saw that initially, on the first round, we did get structured output, but it's not valid JSON, or it doesn't abide by what we need that object to look like. So the idea is: what if we provide back to the LLM the output that it just gave us, along with an error message from Pydantic explaining why it's wrong, why it doesn't abide by the Pydantic model we just provided? So the resulting pipeline looks a bit like this for our GitHub issues demo. We provide meeting notes, and we provide a schema. We give those to the prompt builder; the prompt builder exists in Haystack world. Then that whole prompt is given to a generator that makes a first attempt at generating some structured output, which is then validated by our output validator, which doesn't exist in Haystack world, so this is a custom component. And either you're all good, done; or, if it's not good, we go back to the prompt builder with the invalid reply that was produced, plus the error messages. For our use case, where we were trying to build this for Haystack, this is not accurate, by the way, our labels are not exactly that, but just for demonstration purposes, we went ahead and built a Pydantic model called Issues, and we had to be very specific about what our labels were, because you can't make a query with a new label that doesn't belong to that repository. And then we used our output validator. And this is where things start to look a bit complicated, but the Jinja templating is very useful here. Earlier, for the Hacker News articles, you saw a for loop. Here, instead, we have an if statement.
So if we have an error message and invalid replies coming in from any component in our pipeline, then this little section here, "you already created the following output," yada yada, is appended to the full prompt. Again, at the end of the day, we ended up with a pipeline that looks a bit like this. So do I have time? I do have time, right? Four minutes. Okay, so the last thing I wanted to show is how these pipeline connections are actually defined in Haystack. Oh, great, okay, I have plenty of time. All right, so ignore the corgis running around. Can everyone see this, or should I make it bigger? Okay. All right, so I told you before that the Hacker News Fetcher component was very much pseudocode; this is kind of boring, we're basically making requests to the API and getting the articles. Here, we're going to be using Mixtral through Hugging Face TGI. Hugging Face TGI is free, but it is rate limited, and you need to provide an API key to use it. So you can go ahead and use this Colab, but you do have to provide an API key. And then you see the prompt template you saw before, and here's what's going on in the Haystack pipeline itself. We've got our prompt builder. We've got our LLM, so Mixtral via Hugging Face TGI. We've just created the Hacker News Fetcher. What we do is simply add those to the Haystack pipeline. And then this is where Haystack can be quite verbose, but it also means that you can create very custom pipelines, and it can get a bit crazy: you can have pipelines that branch out, loop back in, et cetera. We're being very specific that the Hacker News Fetcher's articles output is being provided to the prompt builder's articles input, which is going in here. And then, finally, the only thing missing to actually run this is that the Hacker News Fetcher is the only component here that is missing an input. All the rest have been provided inputs through the pipeline itself.
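The self-correcting loop described a moment ago (generate, validate, feed the error back into the prompt) can be sketched without any dependencies. The real demo used a Pydantic model and an actual LLM; here a hand-rolled validator and a fake generator are invented stand-ins so the pattern is runnable as-is:

```python
import json

def validate_issues(raw: str) -> list:
    """Raise ValueError with a helpful message unless `raw` is a JSON list of
    objects each having 'title' and 'label' keys. (The real demo used Pydantic
    for this; a hand-rolled check keeps the sketch dependency-free.)"""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of issues")
    for issue in data:
        for key in ("title", "label"):
            if key not in issue:
                raise ValueError(f"issue missing required key: {key!r}")
    return data

def generate_with_retries(llm, prompt: str, max_loops: int = 3) -> list:
    for _ in range(max_loops):
        reply = llm(prompt)
        try:
            return validate_issues(reply)
        except ValueError as err:
            # Loop back: append the invalid reply and the error message,
            # just like the branching pipeline described in the talk.
            prompt += (f"\nYou already created the following output: {reply}"
                       f"\nIt was invalid because: {err}. Please try again.")
    raise RuntimeError("could not obtain valid structured output")

# Fake LLM: fails the first attempt, succeeds once it "sees" the error feedback.
def fake_llm(prompt: str) -> str:
    if "invalid because" not in prompt:
        return '[{"title": "Fix docs"}]'            # missing 'label'
    return '[{"title": "Fix docs", "label": "bug"}]'

issues = generate_with_retries(fake_llm, "Turn these meeting notes into issues: ...")
print(issues)  # [{'title': 'Fix docs', 'label': 'bug'}]
```

Capping the number of loops matters in practice: a model that never produces valid output would otherwise cycle forever.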
So I can then define what the input of the Hacker News Fetcher is when I do pipeline.run, or pipe.run. And then, optionally, you can also give more inputs that are not strictly necessary; for example, here I'm using Mixtral, and I wanted to up the max tokens at the end, so I can provide that at runtime as well. And that's it. Thank you very much. You can also access the GitHub issues pipelines here, but I'm happy to take questions if there are any. Thank you. Hi. So in the Hacker News article summarizer, you're passing the URL to the LLM and asking it to both summarize the article and print back the URL. That appears a bit risky to me, because it might change the URL. Do you consider it best practice to pass the URL through some other way, or do you find it fine to always ask the LLM to do that? I love this question, because if you try that Hugging Face Space a few times, especially with Mixtral, sometimes you just won't get the URL. Yes, and there are a few ways to make this a lot better, because the Hacker News Fetcher component itself is just making an API call to Hacker News, and you have the URL right there. So probably the best practice here would be to have the LLM only produce summaries, and have the other component provide an output of the URLs that were used to produce those summaries. Because, yes, my experience is that a lot of the OpenAI GPT models do a great job of following the instruction "reference this specific URL," but this is very much LLM-dependent, down to how that particular large language model expects to be prompted. Not every instruction works the same way with every model. Any other questions? Oh, this one. Thanks for the presentation. I have a question on the prompt: I saw a for and an if; is that specific to Haystack? Not at all. We use Jinja for the templating language.
Actually, I will add a link to the Jinja documentation in the speaker notes of this slide deck, which you'll find on the FOSDEM page, but that's all Jinja syntax, which comes in very handy, because you get for loops and if statements, and you can actually start defining your own custom functions for that Jinja templating as well. All right. Give her a round of applause, please.
Using code generated by AI: issues, misconceptions and solutions
We also have the Matrix room up and running, which is great. And I have the pleasure of introducing you to Andrew. You live near Oxford, which I believe is pretty cool; I'm a Texan, I wouldn't know. You love the Oxford music scene, you once played croquet for Cambridge University against Oxford, and you have a great joke for us. What is brown and sticky? A stick. Take it away, Andrew. My kids like that one. Happy FOSDEM, everyone. It's great to see everyone here. I am a lawyer, and I've been advising on AI for quite a long time. Can you hear me? That's better. Clearly, the emergence of large language models has meant that there's a lot more analysis going on in the legal context behind that. So I thought I'd spend half an hour or so going through some of my thought processes when I'm analysing copyright law, mainly, as it relates to AI, machine learning, and large language models. You may have heard a number of myths. The first one is that models are essentially data; data are facts, and facts can't be covered by copyright. So that's statement number one. Statement number two: you may be aware that in various jurisdictions throughout the world there are different exemptions in copyright law that enable you to gather and use data, to ingest data, for machine learning and data mining purposes. So you may hear that gathering training data within one of these copyright exemptions solves any copyright problems you may have, so you really don't have to worry about any copyright issues. The third thing you may hear is that output generated by generative AI cannot be subject to copyright. And the fourth is that, as a result, AI-generated code cannot trigger copyleft obligations. So let's look at some of these statements in slightly greater depth. Now, I'm going to do this in a slightly strange way, because I've only got half an hour or so; I'm not able to go through all of my thought processes in great depth.
What I propose to do is give you the conclusions of my thinking first, and then give you a flavour of some of the thought processes that led to them. So you will, by necessity, only be getting part of the picture; I do apologize for that. First of all, my belief is that a large language model is capable of "containing," and I put containing in inverted commas there because what that means is subject to a certain amount of interpretation, copyright works, or derivative works of copyright works. Secondly, I believe that if the training data was ingested under a copyright exemption, that doesn't automatically mean that any output is exempt from copyright as well. Third, the generated output is likely to be subject to copyright. This is a statement that will vary significantly from jurisdiction to jurisdiction; I'm licensed to practice in England and Wales, and it's certainly true there. It may not be quite as true in other jurisdictions, but certainly from my jurisdiction's perspective, the generated output is likely to be subject to copyright. And also, under the English and Welsh jurisdiction, the output generated by a prompt you have entered belongs to you, or possibly to your employer. But even having said that, it means that AI-generated output can be infringing, and similarly it can also trigger copyleft effects. So what I'm going to talk about now is a few things that you need to bear in mind about copyright and about the process of generating models and generating outputs from generative AI. Again, I'm going through these fairly quickly, and I'm not necessarily going to link them together in detail. But when analysing this, the thing to bear in mind from a copyright perspective is that copyright can potentially impinge at three points. First of all, the point when the training data is ingested and the model is created in the first place.
Secondly, and this is the one people tend to forget about, the point when the whole model is transferred or distributed from one place to another; this is particularly relevant when the model is distributed across jurisdictional boundaries. And the third is the point at which the results are output; there's potential for copyright to impinge at that point as well. Now, there are a few things that you need to understand about copyright. All the developers I've met have a pretty good grasp of copyright in theory, but there are always some areas around the edges that they're potentially a little unsure about; many lawyers are unsure about these as well. And a lot of the arguments we employ when talking about copyright analysis of large language models do tend to involve these edge cases. So there are a few things you need to bear in mind when considering the application of copyright to AI, and these are just characteristics of copyright in general. First of all, it's possible for more than one copyright to exist in a work simultaneously. For example, if I write an opera based on The Lord of the Rings, then my creative input has gone into writing that opera, and I can tell you now, it wouldn't be a very good one. That opera will therefore be both a copyright work that I've created and a derivative work of The Lord of the Rings. So if you want to perform that opera, you're going to need a license from Middle-earth Enterprises, which is the organization that holds the relevant performing rights in The Lord of the Rings, but you're also going to need a license from me. So there are at least two copyrights held in this opera simultaneously, and you're going to need licenses from both of those copyright holders in order to perform the work.
And the classic example here is the Linux kernel, which has many thousands, possibly tens of thousands, of different copyright holders simultaneously, which is the main reason we will never see it re-licensed under any license other than GPL version 2 only. If you wanted to re-license it, you would basically need permission from all of those copyright holders, or you would need to extract their copyright works from the kernel. That's never going to happen. So, stepping back a little: software is covered by copyright in just about the same way as literary works are. This is the legal fiction that was established back in the day, when legal systems were thinking about how software should be protected under copyright, indeed whether it should be protected at all. And one very key characteristic, one key piece of the philosophy behind copyright, is the distinction between an idea and an expression. This distinction is laid out very clearly in US copyright law. It's also laid out in European copyright law, though not quite so clearly; but the software directive does make explicit reference to this distinction between the ideas and the facts on the one hand, and the expression of those ideas on the other. So basically, the facts themselves cannot be subject to copyright, but the expression of those facts can. That's one thing to bear in mind. Another thing to bear in mind is that copyright infringement is not subject to intent. It doesn't matter whether you intended to infringe copyright or not. If you do an act which causes copyright to be breached, if you copy something believing that it was yours to copy, or believing that you had a license to copy it, then from a civil law perspective, and we're not talking about the criminal law here, that still counts as copyright infringement.
And the third thing to realize is that if someone independently produces a work which is the same as an existing copyright work, then each copyright owner will retain their own copyright in that work, and there's no infringement. So if I write a little melody, and then somebody else independently comes up with an absolutely identical melody, having had no opportunity to listen to mine, they haven't copied me, and they hold the copyright in their melody just as I hold it in mine. So those are three things that tend to be a little bit counterintuitive about copyright, and they're concepts that I draw on in the rest of my analysis here. OK, so let's look at one issue. We take the premise that a model is essentially a set of statistical facts about the information, the training material. Does that mean that it is incapable of containing derivative works? Well, let's change the subject completely and look at a WAV file. You could argue that a WAV file is just a set of facts about how far a speaker cone is from a fixed point at a particular point in time. And we know, obviously, that a WAV file can be infringing. A WAV file can capture a piece of copyrighted music, and the number of lawsuits covering copyrighted music encapsulated in electronic file formats is obviously huge. There's absolutely no doubt at all that music encapsulated in a WAV file, which you can argue is just a selection of facts, can be copyrighted. So, turning over to the concept of a model: people will say that you can't easily reverse engineer a model to find out what is within it; it is just a set of information about statistical relationships. But just because you can't reverse engineer the model to determine its contents doesn't mean that it doesn't potentially contain derivative works. And I'll go into that in a little bit more detail later.
But if we go back to the audio example: if you look at a more complex audio file format like Ogg Vorbis, for example, and you're just given that file, you're going to find it impossible, I would say, to reverse engineer it and get the music back out again unless you actually know how it was encoded in the first place. A WAV file is sufficiently simple that you probably could reverse engineer it and figure out what music was in there. But with an Ogg Vorbis file, you're just not going to be able to do that unless you know the encoding scheme in the first place. Yet nobody's going to argue that a Vorbis-encoded file cannot be infringing. And there is, of course, a way that you can get AI models to reveal whether they contain any derivative works: if the model is part of a generative AI, you can simply ask it. And I will give a few examples of that shortly. So some of you may be familiar with this poem, Jabberwocky, which was written by Lewis Carroll. I'll just read the first lines: "'Twas brillig, and the slithy toves did gyre and gimble in the wabe." Now, for those of you who aren't native English speakers, the words that I've placed in yellow up here are just nonsense words. They were made up specifically for this poem. It's important to realize that Jabberwocky is no longer in copyright, which is partially why it's easy for me to talk about it here, because I don't have to worry about that. But because these are nonsense words, and they only exist in this particular poem or in derivatives of the poem that have been ingested later, it turns out to be a great test of whether an AI has had access to this particular work as part of its ingestion process, if you can get it to disgorge these words later in some way. So I did a few experiments with ChatGPT, and I asked it to write a poem entitled Jabberwocky, and it wrote Jabberwocky. The result was verbatim. It didn't even try to change anything at all.
So we know that ChatGPT has ingested Jabberwocky. The chances of this being produced independently are infinitesimal. Now, we know that it's out of copyright, and I'm not suggesting that there's any copyright infringement going on here, clearly, because Jabberwocky is out of copyright anyway. It may be that in choosing the training materials great care was taken to make sure that no copyright materials, or materials which didn't have an appropriate license, were being ingested. So we did quite a few other tests on this basis, and we found a number of works that actually are in copyright contained within various large language models. I have to stress, again, that this doesn't mean that OpenAI is necessarily infringing; they might have obtained a license to these particular works. But it does demonstrate that it's possible for an LLM to contain copyright works. The argument that the LLM just contains facts doesn't really hold a great deal of water when you analyze it along these lines. And indeed, there are plenty of studies now showing that copyright works do exist in various LLMs. There's one on Copilot itself: some research showing that from time to time Copilot will disgorge verbatim copyright works. So the conclusion here is that AI can't be used to essentially launder copyright works. You can't take a copyright work, feed it into an LLM, and then claim that because the same copyright work has come out the other side, it's no longer subject to copyright of some sort. So is it possible to have AI extract the ideas and leave the expression? Remember that ideas themselves don't attract copyright, but the expression does. Now, there's a great video, and I've put the QR code up there for it, which shows the similarity of two songs: one called My Sweet Lord by George Harrison, which you may be familiar with.
And the other one called He's So Fine by The Chiffons, which you may not be quite so familiar with. In a nutshell, in a case some time ago, George Harrison was sued over releasing My Sweet Lord, which does, to my untrained musical ear, sound very, very similar to He's So Fine. And the crux of the case was that although it was never established that Harrison had consciously copied He's So Fine (and if you recall, we said earlier that intent has no role to play here, so whether he'd meant to or not was by the by), the fact is that if he had copied, then infringement would have occurred. The judge said that Harrison had the opportunity to hear He's So Fine because it was a quite popular song at the time. It would have been played in shops, on the radio, and so on and so forth. So there's a high probability that he would have heard it, that somehow subconsciously it would have entered his mind, and that it would have become part of his thought process when he wrote My Sweet Lord. There's a reference to the case there on the slide. So it seems to me quite logical that the courts are going to follow a similar reasoning with AI: if a generative AI produces some material which appears to be copyright infringing, and it can be demonstrated that that material was part of the training data, then the courts are likely to come to the conclusion that infringement is happening, notwithstanding that we can't really work out how exactly it's encoded, if that's the correct word to use, within the AI model in the first place. So let's look at a different case now, which sort of strains this idea/expression distinction. This is a photograph that was pretty popular in London a few years ago. It was in a lot of tourist places where you could buy souvenirs and so on, and it's a pretty striking picture of a red London bus crossing Westminster Bridge. The picture that I've just shown you, which was on a drinks coaster, is reproduced here on the left.
And on the right: another company, called New English Teas Limited, decided that it would be nice to have a similar sort of image, but they didn't want to pay a license fee to the holders of the rights in the first image. So they asked a photographer to go and take a picture from roughly the same location of a London bus, and then they got somebody to retouch it so that the London bus was red and everything else was in monochrome. And you would imagine that that is about as clear an example of an idea versus an expression as possible: the expressions of these two photographs differ, but the idea is basically the same, a red double-decker London bus crossing Westminster Bridge with everything else in monochrome. And it's not a particularly strong case, because this was only a first-instance decision at a lower court, so it's not particularly binding. But nonetheless, it was determined that there was potential infringement going on here. So I went onto DALL-E and I used this as a prompt, which to me seems to be a fairly reasonable distillation of the image into an idea. And you'll see that DALL-E, using that prompt, has produced something that under that particular legal doctrine is almost certainly infringing. But hopefully not in Belgium. So that's an example of where court cases do not help this analysis a great deal. So what can help us in circumstances where we are using generative AI and we're trying to avoid infringement situations? There are different ways to do this. First of all, you can filter what is ingested. If you limit your training data to things that are no longer in copyright, or things that you have a specific license for that is sufficiently broad to allow them to be used to generate materials using an AI, or things that are subject to an exemption that, again, is broad enough to enable you to do that, then that may assist you.
But the trouble is that this is going to fairly dramatically reduce the pool of material that you're able to use to generate the model in the first place, which of course means that your model is not going to be as good as it potentially could be. But that's something to bear in mind. The second thing is to review the algorithm. If it turns out that your algorithm produces a model that's only two megabytes in size, then there's not going to be much space to fit a whole bunch of copyright works inside there, derivatives or otherwise. So that's going to be a pretty strong argument that what comes out of it is unlikely to be infringing, because it would be pretty difficult to actually fit anything inside there that could potentially be a derivative work in the first place. And the third option is clearly to filter what's output. You look at the output of the AI, and at that point you determine whether it is potentially a derivative work and block its onward transmission. There's potentially some infringement happening at the point where it's produced, but if you're not distributing it any further, that kind of limits the issues there. And there are a number of technologies available that could potentially help with doing that: YouTube's Content ID, for example, and Getty Images has some software for plagiarism detection, and so on. So, without going into detail, it's very easy for me to say this, but it's possible those sorts of techniques can be used to determine whether the output is potentially infringing. And indeed, this is already being done in certain cases, including Copilot at the moment: Copilot has a duplication detection feature that intends to do exactly that.
Those of you who are involved in open source compliance are probably familiar with snippet-matching services like Black Duck or FossID or ScanOSS, which use a database of existing code and have quite sophisticated algorithms to make sure that you can't defeat them simply by obfuscating the code. How effective those algorithms are is obviously variable, but I think it's likely that specialist products are going to be developed. I happened to get in touch with the founder of one of these scanning software companies a couple of weeks ago, and I asked him whether, to his knowledge, there were developments afoot to help with this situation of filtering the output of generative AI to see whether it was potentially infringing. And he basically said that he couldn't tell me any more unless I signed an NDA. So you can take that as you will. So, a few final thoughts. I don't believe that a permissively licensed knowledge base, a permissively licensed corpus of materials used to ingest and create the model, is the answer. There are a number of models available that say they are only using permissively licensed code. That doesn't mean that there are no compliance obligations: pretty much all of the licenses in question will have attribution requirements, so how do you follow those through onto the output? You're really taking a risk-based approach, saying that somebody who has licensed their code under Apache is less likely to get unhappy than somebody who has licensed it under GPL. But that's not a legal analysis, it's a risk-based analysis, so you've still got to be careful about that. Not a magic bullet. The other thing to be aware of is that different jurisdictions have very different rules about whether machine-generated works are subject to copyright. We touched on this earlier.
There's a specific clause in the UK Copyright, Designs and Patents Act that says that computer-generated works are subject to copyright, and that copyright will be owned by the person who made the arrangements for the creation of the work. Now, it's a bit difficult to determine what "made the arrangements" means. Is it the person who created the model, or is it the person who created the software that uses the model to produce the output? Or is it the person who put the prompt in? My gut feel is that it probably means the person who put the prompt in to generate the output, but it's not been determined judicially. And there is one case, which has got nothing to do with AI but has to do with image generation, that suggests that the person who wrote the software, in this case some game software, is the person who made the arrangements, not the person who was playing the game. So that's a little bit problematic, but as I say, it's only one case, and it didn't go to the appeal courts. One other thing to bear in mind: quite often, if you're looking at two pieces of copyright work, and they're quite long and extensive, and they potentially have mistakes in them, and they are identical including the mistakes, then you're going to make the assumption that the only way that work B came into existence with all of those mistakes is that work A was copied. Up until now, that's been a pretty reasonable assumption to make, but of course it's entirely possible that, using generative AI, two people could put in very similar or identical prompts, and those prompts would generate identical output works. Therefore we can't automatically assume that a long and complex work that has mistakes in it is only going to be owned by one person from a copyright perspective. So that's just a cautionary word. One thing that really does worry me here is the potential that AI can be used to automate a clean-room rewrite.
Again, I've done some analysis on this, but I won't share the details now because I don't have time. But if you take a piece of code and you ask an AI to analyze the code and reduce it to a functional description, and then you take that functional description, insert it into another AI, and say, please write code to this functional description, does that mean that, because you've been through an automated process that has basically stripped out the expression, taking it down to a functional description, which is purely an idea and which we know does not attract copyright, and then reproduced a piece of software from that, we've somehow developed an automated way of copyright washing? An awful lot, I think several billion if not trillion dollars, says that is not going to be allowed to happen, so we just need to be aware of that as a possibility. So that's a sort of whistle-stop tour through my various thoughts on the topic. Thank you very much for taking the time to listen to me. Do we have time for a couple of questions, potentially? Two questions right over there. Fantastic. So for one of the questions, I'm going to have to move up; you might as well have a look at the whole thing. One question we had was: if the model outputs a copy of an image, who is infringing, the AI machine or the human who asked the prompt? Good question. People infringe. Potentially it can be a legal person who infringes, so it could be a company as well. But ultimately it's going to be whoever was doing the act of copying. So it's almost certainly going to be a human, but if the human is employed by an organisation, then it could be the organisation that would be infringing as well. While we've still got time, another question we had: there seems to have been some confusion about whether this was under US copyright law or the law of England and Wales. Which one were you talking about here?
So what I was saying was under the copyright law of England and Wales, which is the same as copyright law in the rest of the UK. But the references that I made from time to time, about the idea/expression dichotomy for example, are to something that's much clearer under US law than it is under English law, but it still subsists there. Thank you.
Open Source AI at TechWorks, the UK trade body for Electronic Systems Engineering
Okay, so our final talk today is by Jeremy Bennett here. I have some notes. So you live in Southampton, which is also in England, by the New Forest, which is almost a thousand years old. You spent some time in Paris and Nuremberg, which is great. You adore compilers, from what it seems like from reading this. And you have acted with Hugh Grant? Wow, that's an interesting story. Alright, sir. Jeremy, take it away. Thank you very much. You can ask me later where and when I acted with Hugh Grant. Okay, this is our last talk. It's only a short talk, and it's a bit of a long story. I want to talk to you about the work I do in my spare time, and which William works on as well, with TechWorks. Anyone here heard of TechWorks? It's the trade body for electronic systems in the UK. And just in case you think that's not relevant: it's worth about 100 billion a year to the UK economy, there are about a million people working in that industry, and it's 8% of the entire British economy. There's a reason why the minister turns up to the annual meeting and listens. So it's a powerful body, and you will certainly know the members: IBM, ARM, Cadence, Mentor, Siemens and the like. So it's a big body, and it covers a lot of things. It was originally the National Microelectronics Institute, and that's the one on the top right there that looks after silicon chip design. Going round, you've got the Power Electronics Group. You've got the UK Electronics Skills Foundation, which is the educational charity arm that oversees student internships going into universities across the country. There's TechNES, which is the embedded software group. There's AESIN, the automotive expert group that looks after the automotive industry. And lastly, there's the Internet of Things Security Foundation; I'll come back to that. Now, what are they doing here? Because they're not an open source organization, anything but. But part of our role as open source engineers is to educate the wider world on the merits of openness.
And I want to draw your attention to the Internet of Things Security Foundation, and that's what it says on their front page. Okay? The material is published as a contribution from industry, and you can download it for free. It's freely available to you, and indeed there's an example of one, and when we say free, we mean a Creative Commons Attribution license. And that's a perfectly valid open license for what is, fundamentally, documentation. And so even though this organisation has some of the biggest proprietary players amongst its members, they have chosen to make their standardization work, their best practice work, their guides for the engineers in the industry, fully open. And they were put together by an open process; one of my open source engineers, you'll find his name in that document, because he wrote a big chunk of it. And that's where the open philosophy is something you sell to them. I was one of the group that sold the idea of doing this in the open, and I'm a founder member of the Internet of Things Security Foundation. So how does that apply to AI? Well, William and I have been heavily involved with AI at TechWorks. For the last year or two I have been co-chair, with Mike Bartley, of the AI initiative we've had going along under the hood. Most of our members are experienced professional engineers, and I think we heard a lot earlier from Stefania about the importance of education, but I'm particularly interested in the education of people who are already experienced. We've got lots of experienced engineers; how do you bring those people into a new industry? They've got their marketing guys telling them our new product's got to have AI, and that's probably about the level of detail they get in their product spec, and they've got to implement it. So what TechWorks is trying to do is fill a gap in the market by making guidance available to those professional members it has.
And the initial thing we're going to start on is guidance on trustable AI, because that's seen as one of the barriers in our industry. And quite honestly, if you've got companies that are making jet engine controllers, you really want to be able to trust any AI they put into them. And more generally, for the professional engineer: what William and I have been working on, and you can join the meetings if you want to, the next thing that we've been doing, is a best practices guide. We're not trying to tell you how to do AI. We're giving you the pointers so you can do it. We're not duplicating what other people are doing. We're trying to provide you with the set of questions, a Q&A you can go to, to ask: should I even be using AI in this product? If I should be using AI, what sort of AI? What are the questions and risks I need to address? And the idea is that if you're an engineer but you don't know AI, it'll help you make a good job of your first project and subsequent projects. And hot news: this is, I think, the first public meeting at which this has been announced. TechWorks has announced its new AI Innovation cross working group. It's a cross working group because it doesn't fit in any one of those subsidiary organisations. So we'll work with Automotive, we'll work with Power Electronics, we'll work with the Electronics Skills Foundation, we'll work with the Internet of Things Security Foundation. It was announced on Thursday; there will be a launch event in London, and then there will be more public events. The launch event, quite honestly, is to get the key influencers in there to understand. So it'll be aimed at government, both the civil service and the politicians, and it'll be aimed at senior managers in industry across the UK. And then we'll propagate it down, and there'll be lots of events for the ordinary working engineer. But the good thing about it is that the work we'll be doing will, just like the Internet of Things Security Foundation's, be in the open.
And there wasn't even a question about doing that this time; it was taken as given, because the success of that approach had already been seen. So really my talk is just an appeal to you: don't just engage with the open source community; engage with the wider engineering community, and try to bring them on board with using open source. And I'm hoping next year we'll come back, there'll be lots of feedback, and this group will have fed into the other groups you've heard around here, will have drawn on what they've done, and will be a useful addition to what's there. As I say, you can get involved with the best practice group; just send William an email and he'll hear about it. So, I'm the last speaker today, so I've got the last slide, which is nothing to do with TechWorks: it's some thank yous. So, thank yous to those here. I'd like to thank Will Jones, who's been in overall charge of organizing this room. I'd like to thank JJ for chairing all day; JJ hasn't taken a break. I tried to make him take a break, but he's indestructible, so he's gone through the whole day. Michelle from the Nagara, Jonathan and Stefania, who I think has had to rush off, for all their work from the European network on AI safety. And those four people: I should say there were four submissions to do an AI dev room, and we've put all four submissions together. So you've got the best of four possible dev rooms you could have had, all rolled into one. But the most important people making this a success are all of you. We've had tremendous interaction. I've not been in all the talks, but when I have, it's been great to have that. So thank you very much, and of course we'll see you all next year. Thank you.
Introduction to OpenAPI
Good morning everybody. Thank you for being so patient. I don't think I've ever had a full room with 24 minutes to go before the start of my talk before, so that is a very special experience; thank you for sharing it with me. I am unmuted, but thank you for checking. So I am going to talk to you today about OpenAPI. I'm going to try to give you something new that you could maybe take back and try, whether you haven't seen this before or whether you're just looking to level up your game a little bit. My name is Lorna. I work for Redocly; I'm VP of Developer Experience there. I love APIs. My background is in software engineering. I've been a developer for most of my career. I've built APIs, integrated with APIs, worked for API producers, done API consultancy. Now I build the API tooling. It's, yeah, look, it's a thing that I enjoy, and I'm happy that you are all here to share it with me. So let's start by talking about OpenAPI. I know a lot of people raised their hands, but maybe it's new to some people. OpenAPI is an open standard: a way of describing your HTTP APIs in a format that aims to be both human and machine readable. What's nice about that is that when we use a standard format, everybody uses the same format. And when it's an open format, it's developed in the open; you can be part of that development process, and I'll talk a little bit more about the OpenAPI community at the end. You can see what's coming. You can join the meetings. You can follow the issues on GitHub. If you are using OpenAPI as a producer or as a consumer, or if you make tooling for OpenAPI, there are no surprises: you know what's coming, and you can be part of it. So it really improves our confidence in working with it. I think the most difficult thing about working with OpenAPI is that it's just very verbose. It takes a lot of lines to describe what can be quite a simple thing.
So I'm going to start by talking a bit about the structure of OpenAPI, because I think when you can find your way around, when you understand the map, it's much easier to work with. So this is a representation of the things that you will find at the top level of an OpenAPI description. OpenAPI: which version of OpenAPI is this? Info: a bit of metadata about the API that this description describes. Here you'll find the title, probably some license information, some contact information, and the version that we're on; all of that is in the info block. External docs: it's very easy. You publish a nice developer website, and you link to your API reference docs. If the user arrives on the reference docs, maybe from a search engine, is there a link back to that nice developer website that you made them? Check, because I feel like I've had to put this right on everything I've ever worked on. There is a security section, and that will describe the authorization and authentication schemes that are used by the different endpoints in your API. We've got a servers section: where is this API published? Tags allow you to attach metadata to individual endpoints. They're listed at the top level, and then you can just use them where you need them. The paths section is where the real API documentation actually happens. This is what we think of as API docs: we have an entry for each endpoint describing what it does, the parameters that it accepts, how to shape the request, and the response or responses that you can expect back. You'll also find webhooks here. Where you have an API that, as well as receiving requests and returning responses, reacts when something happens and sends a request out to you, you can describe those with webhooks. They're a little bit different to the request-response feature. Those were added in 3.1, which, although it is the newest version of OpenAPI, is three years old, so I wouldn't describe it as cutting edge. We also have here the components section.
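The top-level layout described here could be sketched in YAML roughly as follows. This is only an illustrative skeleton, not taken from the talk's slides; the title, URLs, and security scheme are invented examples.

```yaml
# Minimal illustrative OpenAPI 3.1 skeleton (names and URLs are invented)
openapi: 3.1.0
info:
  title: Example Widgets API
  summary: Short plain-text summary, shown in listings
  description: |
    Longer **CommonMark** description, shown on the detail view.
  version: 1.0.0
  license:
    name: Apache 2.0
    identifier: Apache-2.0
externalDocs:
  description: Back to the developer portal
  url: https://example.com/docs
servers:
  - url: https://api.example.com/v1
tags:
  - name: widgets
    description: Operations on widgets
security:
  - apiKey: []
paths: {}        # the real API documentation lives here
webhooks: {}     # outbound request descriptions, added in OpenAPI 3.1
components:
  securitySchemes:
    apiKey:
      type: apiKey
      in: header
      name: X-API-Key
```

The sections can appear in any order, but keeping a conventional order like this makes a long description easier to navigate.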
The components section allows us to describe things that we're going to use multiple times. If you use the same filtering, pagination, or date formats, if those are common patterns across your API (and if they're not, we need to talk), you can define them in the components section and reuse them. The sections can go in any order, but knowing where you are, and where the other things are that you might need, can make these very long documents navigable. OpenAPI descriptions are often thousands or tens of thousands of lines of code. My favorite test API description to use is the GitHub one; it's a quarter of a million lines of YAML. Like, yeah, you need to know where you're going. Your tools can help you, but it's like carrying something that's not exactly heavy, just a bit unwieldy. So let's drill into some of the detail. Here is, basically, the top part of your OpenAPI description. We have a version; it's not very exciting. We have an info section. We've got a title: give your API a unique and meaningful title. We have summary and description. A lot of OpenAPI elements have these two texty fields, the summary and the description. The difference: the summary is just text, short format, usually shown in a listing. The description supports Markdown, specifically CommonMark, and is usually shown when we're looking at the detail. So if your API is shown in a catalog or in a list, it'll use the summary, and if you are viewing the API reference documentation, you'll probably see the whole description. And don't be afraid to use the Markdown features for links, and to really enrich what you do within your OpenAPI file. There's an info version field, and I think this is one thing that I see people getting confused by frequently. The info version is the version of the API description. So if you change this description document, you should change this version field.
Does your API info version need to match your API version? I don't really care. But if you change your description a lot, can you please bump the info version so that I know I don't have the latest version of this document? Lock it to your API version if that helps, or don't. Maybe you haven't made any API changes, but you did add great descriptions, better examples, or something else that changes the OpenAPI description of your API. Bump the version so I know I need to get the new one. And please add a license. Yeah. So this is some nice fluffy rendering; I made this with Blockly, and I hope that you like it. I think it's just easier to look at than the real thing. This is the YAML version, and I could do 10 screens of YAML and I would be having a nice time, but I don't know if you would be having a nice time, so I brought you some pictures. But this is kind of the equivalent of seeing it in YAML. Now imagine another 20,000 lines, and you're starting to visualize how this thing looks. Okay, let's look a little bit at the paths. Within the YAML paths section, we have one block for each combination of URL and verb or method. So I have one that is an item endpoint, and it's got a get operation. I've got another one (I'm really good at naming things) called another URL, which has both get and post. Those are different operations, and they get their own descriptions. If we drill into one, we have an operation ID. Fun fact: operationId is optional in OpenAPI. It's technically optional. Honestly, you need it, and it needs to be unique; just get your linting to put that in. There are very few APIs where this isn't a useful thing to have, and it's not like it's painful to do. We've got a description; you probably would have a summary as well, but it won't all fit. I have added some tags to my endpoint: this is related to user and accounts. We might have user and orders or some other combination of tags here. You can have multiple tags.
If there were request body requirements or parameters, those would be described here as well. And then we've got the responses. I've only got the 200 response here, which is very bad. You should always describe your 4xx error responses too. My 200 response here is application/json and it's just got a couple of fields in it. I'm going to drill into that in more detail. It's the same endpoint, in more detail, shuffled down a little bit. In my response, you can see, well, maybe you can't see, actually, because the font is quite small, this schema has a message and an event ID. I've got data types. I've got descriptions. And I've crucially got examples here. The examples are the magic, because they let the user know what kind of data this will be. You can tell me it's a string, but if your example is, I don't know, a UUID, I'm like, oh yeah, I know what that is. If you show me it's my username or you show me it's an ID, okay, I am just instinctively going to put the right thing in when I'm using those tools. If you use the same fields in other places, and it's becoming increasingly standard even if you're not reusing them, you'll often use the OpenAPI reference syntax to refer to them being stored somewhere else. So instead of defining each of the objects or elements of the response payload inline, you use a reference, dollar ref, to refer to that description and put the detail in the components section. So your path entry looks like this, and then we have that detail down in the components section under schemas. This gives you very powerful reuse. The key to API experience is consistency, and the reuse helps us to, without thinking, get it right, get it the same, get it consistent, and avoid having similarly named fields that might take different timestamp formats, or look identical but validate differently, because our back-end application didn't understand that they were the same thing.
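A hedged sketch of this kind of path entry with a `$ref` into components (endpoint path, tags, and field names are invented for illustration):

```yaml
paths:
  /things:
    get:
      operationId: list-things        # optional in the spec, but you really want it
      summary: List all the things
      tags:
        - user
        - account
      responses:
        "200":
          description: Successful response.
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/ThingResponse"
components:
  schemas:
    ThingResponse:                    # defined once, reusable from many operations
      type: object
      properties:
        message:
          type: string
          description: Human-readable status message.
          examples:
            - Event accepted
        event_id:
          type: string
          description: Identifier of the created event.
          examples:
            - evt_01htx9
```

Any other operation returning the same shape can point at the same `$ref`, which is the consistency-by-reuse point above.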
So that's the structure of OpenAPI, but I really felt when I created those slides that I was missing the magic: the thing that brings me to this and makes me believe in OpenAPI as the powerhouse of our modern application development. When I think about OpenAPI, I think about the things that I do with it and the things that it enables. Think about the way that you design your API, giving meaningful operation IDs for each endpoint; these can be used by the tools that consume your API description. Having great descriptions, naming things in such a way that developers don't need to come and read your documentation, because they will know from the operation ID what it's going to do and it's very consistent. They feel at home. Describe your error responses. Even if I never publish my OpenAPI description, the fact that I wrote down the error responses makes my API better, because I thought about what I wanted to do if something went wrong. I can validate my API and make sure that my OpenAPI is valid, is at the standard that I want, and I can have my own linting rules as well. Operation ID is optional. Why? Not in my APIs. So I write my own rules. I say we use kebab case here. We use plurals here. We always define an error response. We make sure that our examples match our media types. These are the things that you can add with additional linting rules. We can create documentation. That's great. You have an API; you should probably have some docs for it. We can also allow other people to pull the OpenAPI description and generate their own docs, and keep it locally for reference. I have some accessibility needs. If your API's web-based documentation isn't accessible to me, I can just generate something that works for me from your OpenAPI locally. It's ideal. Beyond this sort of entry level, there are some more things that I think we are not doing enough of with OpenAPI. You have an API. You describe it with OpenAPI. You lint it. You generate some docs.
This is great. Please do these things. You are all awesome. The next level is how you deal with very complex API setups. If you work in a large organization with many microservices, how does that pipeline look? How do you keep them all meeting the same standards? How do you bring them together to publish to the user as if you knew what you were doing? I don't mind if you do or not, but you need to look like you do. How do you bundle those things together? If you have one enormous OpenAPI description, how do you collaborate on that when you are making changes, whether you are an API experience specialist, product owner, engineer, or tech writer? To give you a clue, that GitHub file is not maintained as a single quarter-of-a-million-line YAML file. Look at how you manage your files. What do you do with references? How do you split across manageable file chunks? Then how do you bring that together to ship downstream? Finally, what do those downstream tools look like? A lot of organizations come into OpenAPI because they want documentation. This is the beginning. We don't want to write a whole load of words; we just want to describe once with OpenAPI, and then we can generate some documentation, and we can generate it in different ways. Then, for free, you start being able to get all these other benefits. You can generate some client SDKs. You can even generate your server stubs if you want. Lots of tools will automatically integrate with your API if you have a good standard OpenAPI description, so your API gateways and other integration platforms will just take it. But you can also start to automatically look at how you describe sequences of API calls. How do you test your API? What does a mock server look like? Because you've described this API in so much detail that a tool can pretend to be it very easily. So there are a lot of pieces here that make up the ecosystem. OpenAPI is kind of the seed from which the rest of the tree grows.
For me, this is the magic. It's the interoperability. It's the way that we come back to: maybe we generate some OpenAPI. It's terrible. So then we use overlays or decorators to add all the descriptions and examples. And maybe not all of these endpoints are public yet, so we just filter out the public ones to make the final OpenAPI and generate some docs. Maybe only some of them are available in the SDK, so we filter differently, make a new OpenAPI file, and pass that down to the SDK generation. Maybe the next generation of your client SDK has some new functionality. Well then, you start with the same source file or files and bring that together. So it's not just "write code, generate docs"; it's about how you create your OpenAPI. I don't have time for my design-first rant, so I'm going to try and hold that in. However your OpenAPI comes into the picture, how do you maintain and manage it successfully? How do you ensure its quality? How do you transform it and get it ready for all the outputs that you choose? There's just so much in this picture. Let's talk about some tools. Now, I've just linked openapi.tools here. I'm not making any specific tool recommendations, and that's for two reasons. One, this is a really hot area. There are new tools every week. There are different tools for different tech stacks. When you are ready for a new tool, on that day and no sooner, you should go and look at the list and pick something. The second reason is I work for a tools vendor. I work there because I use their tools. I cannot possibly give you an impartial recommendation. I went to Redocly because they know me and I know them. I really don't know the other tools that well as a result. So don't listen to me for specific tools. I work on the Redocly stuff and I love it. You need an editor. There are basically two ways to go. You can use a programmer's editor, something like VS Code. Please add some plugins to help yourself. Redocly makes an OpenAPI plugin.
Even if you just have some syntax highlighting for YAML, the kind that makes the indentation levels different colors helps me a lot in YAML. Find something that works for you. There are some graphical editors, and if that's your thing, then go find one of those. You don't need to pick the same as your team, because it's an interop format. Use whatever you want to collaborate. Try really hard not to lock your team into tools. Again, accessibility needs: I need to do it in Vim, and of course I can. That's part of the magic. OpenAPI governance, which is clearly not a tool, but let's skate over that. Your API standards do not exist until you write them down. They are not standards until they exist somewhere that somebody else can look at them and they are consistently enforced. We have a lot of really good linting that can help you, but the humans are always going to be in this review process. Find your most wise and thoughtful humans and invite them to be part of the review process. Naming is the thing that the machines genuinely cannot do for us, and so is the joined-up thinking of being able to see things next to each other. As you introduce API standards, start small. Do not be tempted by other people's recommended rule sets, not even ours. Pick what works for you. Look at the recommended rule set, but then pick the things that you aspire to and can adhere to today, and commit to reviewing every six months and building up the quality of your API. If you're retrofitting standards to an existing API, there will be things you cannot change now, and that's okay, but you can set those rules for the new versions. If you don't know where to start on this, I am going to recommend Zalando, who have some brilliant public API standards, and you could do worse than starting there. Okay, they have a lot; start small and just pick your favourites out of theirs. It's a great place to start, and your organisation will evolve as it goes along. Please put some linting in.
The machines are genuinely good at this. They can help keep you straight. Is your OpenAPI valid? Does it have descriptions? Does it have examples? I've got one team that I work with where we have a whole API where the description for the success response is "OK" with a full stop, because it turns out we enforced sentences, so it has to be at least one word and at least one full stop. Yeah, we did some work with them on that. Get some case conventions, some naming conventions, and be really picky about what you include. I do this with Redocly CLI, so if you are using that, feel free to send me questions. If you use something else, I can't answer your questions, but good luck. OpenAPI documentation: read the docs for your docs tools. I see a lot of implementations where the functionality exists in the tooling that you've used, but you haven't really dug into what it can do or looked at how you can extend or configure it. API reference documentation is evolving very quickly, in a good way. There are a lot of new entrants in this market. I'm not sure if I'm supposed to be saying that we have a new product coming out later in the year that does this. It's beautiful. But you have lots and lots of options. Whatever you've picked, make sure you're making the most of it. And if you have something that, well, I don't want to malign any other tool families, but something which isn't a specialist docs tool but can render documentation, that's a great way to start. Because you have the OpenAPI format, you can use one tool set for one thing, something else for docs, something else for your SDK gen: lots and lots of options. When you publish documentation, your documentation is part of the product. You should be deploying it often. It should be easy to deploy and redeploy. And make sure that you're treating it like a web product. Get some metrics, have a look at what's happening, see what people run into.
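To give a flavour of this kind of lint configuration, a `redocly.yaml` file can turn individual rules on or off. The rule names below are from my memory of Redocly's built-in rule set, so treat them as assumptions and check the current documentation:

```yaml
# redocly.yaml - lint configuration sketch (rule names are assumptions; verify against the docs)
extends:
  - recommended                  # start from a base rule set, then be picky
rules:
  operation-operationId: error   # every operation must have an operationId
  operation-description: error   # every operation needs a description
  operation-4xx-response: error  # always describe an error response
  no-unused-components: warn     # flag components that nothing references
```

The idea is exactly the "start small" advice above: begin with a few rules you can actually adhere to today, and tighten them over time.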
If you have interactive docs, are people calling the same endpoint all the time? Is it super popular, or is it super confusing? Why is everyone testing this thing? Have a look at those metrics, because they can really help you understand your product. I want to talk a little bit about the OpenAPI community. This is something that I don't always include in my technical OpenAPI talks, but at FOSDEM it feels appropriate. It's an open standard. It's part of the Linux Foundation. You can learn more about it on openapis.org. The GitHub repository is public; everything happens there. We have a Slack group. It's very active, and also public to sign up for. And there's a weekly technical meeting. I will confess, it's not super friendly for Europe. I think it's 6 p.m. Central European Time, 5 p.m. for me in the UK. Yeah. I'm trying to get to a critical mass of EU-based maintainers, and then we need to start mixing that up. If it's unfriendly for Europe, it's sort of dinnertime, so there's no hope at all for anyone east of here. We need to fix that. But the OpenAPI community is currently growing its maintainer set and working on some new stuff. This is a good time to get involved. We've also spun up some special interest groups, so just to tease some of the headline activities within the OpenAPI project: the Workflows special interest group describes a sequence of API calls. This has come from the travel industry, where you need to find the flights, find the seats, ask the user, book a seat. None of those make sense by themselves. Workflows aims to give an extra level of description for that. Overlays is a special interest group that describes repeatable modifications to an OpenAPI document. So if you have a generated OpenAPI that is just thin, because you don't maintain good examples and good descriptions when you're generating from code, and lots of organizations struggle to get away from that Javadoc workflow.
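To make the overlay idea concrete, here is a sketch of an Overlay document based on my reading of the draft OpenAPI Overlay specification. The target paths and description text are invented, and the exact field names should be checked against the current spec:

```yaml
overlay: 1.0.0
info:
  title: Add descriptions and hide internal endpoints   # invented, for illustration
  version: 1.0.0
actions:
  # Enrich a thin, code-generated description with a human-written one
  - target: $.paths['/things'].get        # JSONPath to the node to change
    update:
      description: Returns the list of things for the current account.
  # Remove an endpoint that is not public yet
  - target: $.paths['/internal-admin']
    remove: true
```

Applying the same overlay on every regeneration is what makes the modifications repeatable.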
Overlays can help for now: you can take your OpenAPI and make the same changes every time, to make the descriptions better, add examples, hide things, whatever. OpenAPI 4.0, code name Moonwalk. Why? Don't ask. Don't let engineers name things. OpenAPI Project Moonwalk is committed to doing some sort of release this calendar year, so that is just starting. The high-level goals are to give you a really simple upgrade from 3.1 upwards (from 3.0 you might want to go to 3.1 first), and to include a wider range of HTTP APIs. OpenAPI is amazing for RESTful APIs, and okay for some other HTTP-ish, RESTful-ish ones. Moonwalk will include RPCs and a wider family. So if you've struggled with OpenAPI, have another look in about a year. Yeah: OpenAPI, an open standard for API descriptions. If you're not using it, I hope you will now, or at least feel like it's a thing that you can approach. If you are, maybe I've given you some ideas to go back and look at what you might change in your current workflow. I'm going to leave you with some resources and say thank you very much for your time. Okay, I'm allowed to take two questions. Would anyone like to ask a question? Yes. This is a really good question: how do I feel about generating OpenAPI from code, or code from OpenAPI, both ways? Let's start at the beginning. A lot of organizations generate OpenAPI from their back-end server-side code. I don't like it, and the reason I don't like it is that I think when you go code first, you're missing a design step. When you design first, you're thinking about it in the context of the rest of the API. You're more likely to get the naming right the first time, because the design is not done by an engineer by themselves. So ideally you design APIs first. You propose the change to your OpenAPI with a pull request, to your wise people and your amazing linting. Go a few iterations to get it perfect. Then we build it. That's my ideal, and that's why I prefer it.
The other question: generating code from OpenAPI? Yes, go for it. We have this machine-readable description, and there's a lot of boilerplate, so we can go quite a long way towards things like client SDKs from OpenAPI. When I talk about the transform step, where you have an OpenAPI and you make it better: for docs, you're going to add examples and descriptions. For API gateways, SDK code gen, that sort of thing, you're going to add metadata. You're going to give the type hints that specific programming languages and tech stacks need, and you're going to give extra information. You might not have that at design time, but if you think of it as a pipeline that splits off, you might want to add some extra magic to your standard OpenAPI to enhance it before you generate code from it. But generating code is typically fine. It will only be as good as your description is, and lots of those fields are optional. So, cool. I am out of time. Thank you so much, everyone. I hope to see you during the event.
Deploy Fast, Without Breaking Things: Level Up APIOps With OpenTelemetry
with the topic. It is a very big mouthful of a topic today, but I'm hoping that we're going to break it down for you and that you're actually going to learn something that you can take home to implement yourselves. I'm here just to talk about the OpenTelemetry part. Sonya is actually the brains of this operation. She's basically been planning this whole thing, set everything up, and just invited me at the end because, yeah, because I'm pretty. That's basically all that I'm contributing today. So I am hopeful that a lot of you have had some kind of touch with OpenTelemetry and observability in general, but also that you know the basic DevOps principles and how they are going to be connected with APIOps. Just an introduction for both myself and Sonya: I am Adnan. I do developer relations, as you obviously might have already figured out. And Sonya here is a product manager at Tyk. I would like to hand over the microphone. Yeah, hi. I'm a product manager at Tyk. We do API management. We have an open-source API gateway. If you were in the session before this, you have seen it on the screen. It's an API gateway that's written in Go. It's really fast and has lots of capabilities, so do check it out. And now we are happy to talk about the topic. Cool. Just a quick rundown of the agenda for today. We have four main topics on the agenda. First and foremost, we're going to talk about APIOps: what it is and how you can get started. Then from there, we're going to take a closer look into how to do APIOps hands-on. We're going to start with a Kubernetes cluster. We'll walk you through how to use Argo CD and Tyk for your API gateway and basically just enable very fast flows and very fast deployments and release cycles for your APIs. From there, we're going to move into the production environment. We're going to ask: okay, what do I need to do to get observability, to get insight into my production APIs?
And from there, we're going to shift left even more and figure out how to integrate observability into the release cycles, with integration testing as well. So we're shifting left even more, using the production data, the observability data, for testing too. That's going to be, I'm going to say, my most favorite part, because I'm here from Tracetest and we do that. But for right now, let's do the APIOps portion first. Yes, so what is APIOps? Thank you. You might be familiar with API management, and I find that sometimes in API management we have too many manual operations. And as you all know, manual operations are a cause for disaster, a cause for error, a cause for security problems, and we need to speed things up. So, my interpretation of what APIOps is: you might have heard about APIOps, and some vendors will try to push their ideas of what APIOps is. Some would say it's about deploying your API fast. I'd like to bring back the cultural side of DevOps a bit and say that APIOps is the offspring of DevOps and API management. It's applying the culture of DevOps to your API management lifecycle. And why? Because you want to deliver value fast without disrupting your users. If we think back to the DevOps culture, the DevOps principles that originally came about before we started to have lots of vendors trying to sell things with a DevOps label applied, it's about fast flow. I want to be able to commit and have it used by users, to have feedback, to have that culture of feedback loops. And it's also about enabling a culture of learning. I want to understand what's going on. I want to learn fast, fail fast, and be able to provide value to my users. And we're here today to tell you that we think observability is a key enabler for all of that in API management, or APIOps. So let's take a look at how to implement APIOps in modern Kubernetes environments to get fast flow.
So typically you will have a developer that's building a service. You will have things like an OpenAPI specification along the way. We had a talk in this room earlier about OpenAPI, so I'm not going to go into more detail, but it's definitely a part that you have to take into your CI, into your continuous integration, making it all automated. Today we're going to talk a little bit more about the deployment side, that's why we haven't added it here, but of course things like linting and generating documentation should all be part of your process. So once the developer commits something, it goes to CI, continuous integration, and the result might be a Docker container. It gets published, and now we want to deploy that new version of the service. We want to deploy it with an API specification. And for that, in Kubernetes, the new way of doing continuous deployment is to use GitOps. There are projects like Argo CD or Flux that are able to do GitOps. What does GitOps mean? You're lucky you're really pretty. Okay. So the main thing about GitOps is that you don't have a continuous pipeline that pushes the things and deploys them to your server. Instead, the Kubernetes cluster, with something like Argo, pulls the information and deploys it itself. So how does it look? At the end of your CI pipeline, you make a change to your deployment repository. You have code artifacts for all your changes, all the configuration, and you might have a new version that is placed into staging. Argo CD on your Kubernetes cluster can be configured to automatically pick it up and deploy it. All automated. Now, another thing that you need to expose an API is an API gateway. In this example, we are using the Tyk API gateway for authentication, verification, and monitoring. So we add an open-source API gateway to the picture, and that's going to be interesting for the observability part later as well.
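A minimal sketch of such an Argo CD Application with automated, pull-based sync (the repo URL, path, and namespaces are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-api                  # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/deployment-config.git   # your deployment config repo
    targetRevision: main
    path: staging/my-api
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:                  # pull-based: Argo CD applies changes as they land in Git
      prune: true               # remove resources deleted from the repo
      selfHeal: true            # revert manual drift back to what Git says
```

With `automated` sync, a merge into the deployment repository is all it takes for the cluster to converge on the new version.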
So an API gateway helps you to centrally manage your APIs, with authentication, authorization, rate limiting, all the capabilities that you need in operations. How do you add that the Kubernetes and GitOps way? Typically we use custom resource definitions, as is the way in Kubernetes. So you can add things, and in a very simple one you can say which protocol it uses. You could define things like rate limiting, security policies, and which service it is proxying on your cluster. And again, it's configuration as code, so it lives in a central repository, and when you make changes to it in your deployment configuration repository, something like Argo CD will track it and apply it automatically. What you see at the end in your Argo CD application view is: okay, all my application definitions, all my applications are synchronized automatically with whatever I put into my Git repository. So now we have the first step, right? We have automation for fast flow. We are preventing configuration drift. We have enhanced security. Everything is automated, with no manual error, so we are more efficient. We also have an audit trail, so we see exactly what was changed in the deployment of our APIs. And we have better collaboration and visibility on what's happening. Wonderful. And obviously, as the slide says, that is not enough. So we're getting the automation part down. What do we do next? Step three in the whole process is to get additional feedback into your feedback loops so you can connect both ops and dev correctly. What this means is that the ops team needs to enable the dev team to fix issues by knowing exactly what the issue is, so that the dev team doesn't need to spend useless cycles trying to figure out what the problem is. And we do that by using OpenTelemetry and Jaeger, which are observability tools, within our APIOps pipelines. Now, this is exactly what we don't want.
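The API definition custom resource mentioned above might look roughly like this. This is based on my understanding of the Tyk Operator's ApiDefinition CRD; the service name and listen path are placeholders, and field names should be verified against Tyk's documentation:

```yaml
apiVersion: tyk.tyk.io/v1alpha1
kind: ApiDefinition
metadata:
  name: httpbin-api             # placeholder name
  namespace: staging
spec:
  name: httpbin-api
  protocol: http                # which protocol the API uses
  active: true
  proxy:
    listen_path: /httpbin       # path the gateway listens on
    target_url: http://httpbin.staging.svc:8000   # the upstream service it proxies
    strip_listen_path: true
```

Because this is just another YAML file in the deployment repository, rate limiting and security policies attached to it go through the same Git review and Argo CD sync flow as everything else.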
We don't want to see gears turning and hope it's all fine, because it's not really fine. You don't know what your users are seeing, so we don't really know if our users are happy. We just kind of know it works, and then you kind of do prayer-driven development, as I like saying. That's not really what we want. We want to use observability to infer the internal state of our system, by getting telemetry out of our system, to understand what's actually happening. And then we can figure out whether our users are happy. This is something that we can see by using observability with distributed tracing. When our API exposes telemetry, we can actually see: oh, okay, obviously something is wrong, because we have breaking APIs. So it's pretty obvious that our users are unhappy, because we can see things breaking for them. And this is a particular view that you get by using Jaeger. Now, let's get to the fun part of actually showing you how it all works and how you can set it up yourself. The way you do it is you use CNCF observability tooling, tooling from the CNCF tracing landscape, more specifically OpenTelemetry and Jaeger. OpenTelemetry is an incubating project; Jaeger is a graduated project. So they're both fully open source and supported by the CNCF. The specifics are that you use OpenTelemetry as the open standard (we're very focused on open standards in this dev room today). So once again, it's an open standard to generate, collect, and export your telemetry. Remember that part: it's a bunch of libraries and APIs that help you generate, collect, and export telemetry. Now, where do you export it to? Well, you export it to Jaeger, which is a tracing backend, basically a data store for your distributed tracing. And then you use Jaeger for all of your production monitoring, troubleshooting, and whatever else you need to do in your production environment.
Now, one of the bigger issues is that OpenTelemetry is quite hard to implement if you're new to it. So some vendors like to bake it into their systems. One such vendor is... there was a lot of suspense, right? Yeah. So one thing that we did in Tyk is to add native support for OpenTelemetry, because we know that people who work in the API space use the gateway to proxy multiple services, and the developers might not yet have implemented OpenTelemetry. But we know they need somewhere to report the data on all the APIs, to really have visibility on what's happening. So we added native support for OpenTelemetry in Tyk to enable our users to export this data and to capture it automatically for all the APIs. That just needs a couple of settings. These are the settings for our Helm charts. What do you need to enable in Tyk? You need to say where you want to send the data: to an OpenTelemetry collector, or it could also be directly to an observability backend. And this is what you get. For every API request, you get a distributed trace of what's happening from the gateway through to the upstream service. So first of all, you can see any error that's happening already at the API gateway level: authentication errors, rate limiting. We sometimes see that people only monitor what's happening in the service, and they don't realize they're already missing a lot of people having issues with authorization, authentication, and rate limiting. And then you see what's happening in the upstream. So you can very quickly catch errors and understand not only the timing and the HTTP response code, but really what's happening: if there's an error, if something is slow, where is it happening? Is it in the API gateway, or is it in the upstream service? What are the details of the transaction that enable a team to better troubleshoot the issue? And with that, we have now achieved feedback from production.
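As a sketch of the "couple of settings" being described, a Helm values fragment for the gateway might look like this. The exact keys are assumptions based on my recollection of Tyk's native OpenTelemetry support, so verify them against the current Tyk docs:

```yaml
# values.yaml fragment (keys are assumptions; check Tyk's OpenTelemetry documentation)
gateway:
  opentelemetry:
    enabled: true
    exporter: grpc                                      # OTLP over gRPC
    endpoint: otel-collector.observability.svc:4317     # an OTel Collector, or a backend directly
```

The collector can then fan the traces out to Jaeger or any other OTLP-compatible backend, which keeps the gateway config unchanged when you swap backends.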
So we have a healthy development lifecycle with a feedback loop between Dev and Ops. If there's an issue, then the Ops team can report it and take a look. It's not only an error on a metric that goes up; it's really a trace where you understand where the problem is and know which team needs to act on it. And it enables you to provide a better user experience and fix the issues earlier. Again, what we have achieved is feedback from production. We are no longer relying on users reporting failures, no longer on somebody calling support and saying, oh, I have a problem, something is down. No, you see it, you see it all, so you can be proactive. You understand the API performance, you understand what's really happening and where the error is happening, and you can solve issues faster. And with that suspenseful mic switch: again, it's not enough. We need to introduce another layer of protection. Because right now we're only stopping bugs after our users have seen them. We know exactly that a user saw a problem that broke our API, and then we go back to fix it. We need to be more proactive and figure out how to stop the bugs before they even reach our users. So this is a shift-left-even-more approach: we want to add observability to our release cycles as well, not just our production systems. The way we're going to do that is by adding this little squiggly in between as well. This basically means that you need to implement something called trace-based testing, which is also called observability-driven development. If you follow Honeycomb and their CTO, it's a term that they coined. Okay.
Anyway, the way that you use trace-based testing is that you are quite literally using the distributed tracing that your observability, like OpenTelemetry, exposes, and then you're running tests on those actual data points from your infrastructure. That means that even though we can see that we have our gears turning, that's awesome, and my initial connection to that API gateway is returning 200, how do I know this is not broken? How do I know if this is on fire or not? This is an external service. I don't manage it. So this is something that easily breaks and that you don't really have a lot of control over. Now, let me show you how you can actually get to the state where you can do your testing against the distributed trace itself. This is a screenshot from Tracetest, which is also a CNCF tracing landscape tool. You can build your test by getting the trace itself from Jaeger, and then you're writing your test specs directly against trace data. So you're not using any mocking, you're not using any faking, or whatever the word the kids use nowadays is, I don't even know. You're literally getting the actual data back and running your tests against that data. Now, the magical part here is that you can quite literally test against anything that's exposing telemetry. It can be an API gateway like Tyk, it can be databases like Postgres, it can be caches like Redis, it can be pretty much anything that you have instrumented to export traces. Now, this is a really cool use case for authentication, but also for GraphQL. For authentication, you have a very good example: something like an auth flow, where you have multiple services talking to each other to handle the request. That's one of the really cool, useful examples. And something that I've noticed as well is GraphQL. One thing about GraphQL is that it often returns a 200 even though it's failing, because the actual error is within the response.
So you don't really know; it's quite intricate to test. With trace-based testing you can drill down to the actual middleware in your API gateway that handles it, find the exact error that happened, and run your test spec on that exact value. With all of this we get step one, which is functional testing: we can functionally validate the behavior of the system using all the telemetry we implemented in the prior step to make the production environment reliable. But it doesn't stop there. We also have step two, which is performance testing, because every span has a duration. You can quite literally go in and say: I want the duration of this span to be less than, say, 200 milliseconds. That means that if you have external services, upstream APIs that you're not in charge of, and their performance is bad, you can validate that and know exactly which part of your system is misbehaving. So you get two things from one exercise, let's call it. Now, the way you do it: you do this shifting left with Tracetest, which, as I said, is open source and part of the CNCF tracing landscape. What it does is quite literally show you the distributed system architecture by looking at the trace data. You both get an overview of what your system is doing, and you can run tests against exactly what's happening in it. Those are two powerful things, because as an engineer it's very hard to know what a highly distributed system with a lot of microservices is doing, especially if you're new on a team; it's just a pain. But with Tracetest, I want to show you how you can implement these integration tests in your Argo CD, like right here.
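A hook of the kind described next can be sketched as a Kubernetes Job annotated for Argo CD. This is an illustrative sketch, not the exact manifest from the slides; the image name, command, and test file path are assumptions:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: integration-tests
  annotations:
    # Argo CD runs this Job only after the sync (deploy) has completed.
    argocd.argoproj.io/hook: PostSync
    # Delete the previous run's Job before creating a new one.
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: tracetest
          image: kubeshop/tracetest:latest          # assumed image
          command: ["tracetest", "run", "test", "--file", "/tests/api-test.yaml"]
```

If the Job's tests fail, the sync is marked as failed, so a broken release is visible directly in Argo CD.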
So this is what an integration test in a post-sync hook looks like. You have an API that you're deploying, and your integration tests run as a Kubernetes job triggered from the Argo CD sync hook. If they fail, awesome, you know they're failing; if they pass, even better, you see that they're passing. But it doesn't stop there. For every test that fails, you get a URL to that particular test run where you can see precisely which part of the transaction failed within your API and its microservices. I really like that part, because it's not just "hey, this failed" — it's "this failed, and here's exactly how, where, and what happened". With that, we get to a stage where we validate production, but we also reuse the effort we put into production reliability to validate pre-production. You get exactly the same overview graph that Sonja just showed you, but instead of relying on your end users, you run tests with Tracetest against the API gateway, you get the traces back from Jaeger or Grafana or whatever you're using, and that information goes back to the API developer, who can then fix the issues that were found. Let me wrap up what we learned in this last section: we got functional testing and we got performance testing. You can validate the behavior of your system — all upstream and downstream services and API transactions, both the ones you manage and the ones you don't. You can test database performance, you can test caches, you can test the size of an HTTP request and response, and you can do very fine-grained performance testing by validating the duration of every part of your API.
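A trace-based test combining both kinds of assertion might look roughly like this, loosely following Tracetest's test-definition YAML; the endpoint, selectors, and thresholds are illustrative:

```yaml
type: Test
spec:
  name: Checkout API returns 200 and stays fast
  trigger:
    type: http
    httpRequest:
      method: GET
      url: http://api-gateway:8080/checkout   # assumed endpoint
  specs:
    # Functional: the gateway's HTTP span must report a 200 status.
    - selector: span[tracetest.span.type="http" name="GET /checkout"]
      assertions:
        - attr:http.status_code = 200
    # Performance: every database span must finish within 200 ms.
    - selector: span[tracetest.span.type="database"]
      assertions:
        - attr:tracetest.span.duration < 200ms
```

The selectors pick spans out of the real trace, so the same definition covers services you own and upstream dependencies you don't.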
And with that — we have a saying where I'm from: you're swatting two flies with one swing, which I think is friendlier than killing birds with stones. So yeah, I think this is the closest we can get to being bounty hunters, because we're bug hunters. That was very lame. Anyway, that's a "See you, space cowboy" reference, if somebody caught it. So, just before we close: if this is a topic that interests you, we're running an online API observability conference in February. It's called LEAP, because it's going to be on the leap day. If that's interesting to you, make sure to register; lots of people from the API and observability space will be there. We also have a GitHub project with everything from the screenshots we showed you today; we've been working on it as a GitHub example. We don't have a link for it yet, but if you're interested, just reach out to us — those are our LinkedIn handles; I don't like Twitter anymore. Send a connect and we're happy to send you the link to the GitHub project, so you can try this combination of open source projects yourself. Thank you so much. — So we have some time for questions. Yeah, there is one over there. Go ahead. — Okay, I have to repeat the question for the video. The question is: if I have a service that can be accessed by multiple customers, do I want to send the data to different places, splitting it per customer, or do I want just one OpenTelemetry pipeline? As always, it depends. On what does it depend? It depends on whether you want to give your customers access to that data somewhere, and whether you have strict regulations on customer data where you may need to split it by location. Yeah, that's a very, very good question.
So the question is: how do I monitor the service level for every customer? Typically every customer is authenticated, so you have something like a token. When they come to you, you can put a tag, a piece of information, on the trace — and Tyk will do it automatically if you're using the authorization or authentication from Tyk, the API gateway. So on the traces we put information about who is calling the API, and with OpenTelemetry you can then use that data to create your own reports based on that information. We add that information to the API call so you can reuse it for your reporting; it's directly exposed. That's a very good question. It's really important to monitor per customer, because customers have different usage and different patterns, and you want to make sure every one of them is happy — not just look at an average where you don't really see the problems. — Next, the question is whether Tracetest notifies on errors. No, Tracetest is just a testing tool. You then need something to automate the tests, like Argo, and something to alert on failures as well, and you can pick whatever alerting tool you want. You can automate it within your CI — you can build your CI within Argo, or use Tekton, basically whatever CI tool you're using — and then send alerts from that. Think of it as just integration testing: you get works / doesn't work, and then you do whatever else you want with it. — Another question, about observability data and privacy — I can take that one. The question is: how do you deal with data privacy? Because a lot of things that could be considered private data can land in your observability data.
So first, you have to be very aware that observability data can contain data that, under your country's regulations, could have an impact. OpenTelemetry has a lot of tooling for that. In the OpenTelemetry Collector there are processors, which you configure in YAML, where you can say: that attribute, that thing, I want to filter out, I don't want to store it. So you're very flexible in your observability pipeline, but it's something you have to take care of, to make sure your developers haven't added something you don't want to store. — [Audience question: when Tyk sends data to OpenTelemetry, is it only the HTTP status — like a 200 or 500 response — or is there a way to analyze the response of the request?] — So the question is: what do we track, what kind of data does Tyk expose? When the Tyk gateway is called, the traces it exports using OpenTelemetry will contain all the data, all the steps — the traces we saw in Jaeger — not just the status. And you can also extend them: there's a plugin mechanism you can load to add even more data and extend your OpenTelemetry traces. — The question is: where does the effort sit? Tyk makes it easier for you because it captures everything from the start of the request up to the call to the upstream service, and it tells you how long that took. But if you want even more detail about what happens after that, that's where you need to instrument your services with OpenTelemetry. And the beauty of it is that when all the services speak the same observability language and all send data to the same place, you have the full picture — that's kind of the operational dream. Thank you. — [Q:] You suggest running that on production? Right. Correct.
So, you wouldn't use Tracetest in this sense for production; you would use it in pre-production, where you need sampling to be at 100%. — Yeah, we can also just stand over there, so come by and chat with us, because we don't have time for follow-up questions. We'll be here. Cool. Thank you.
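On the privacy question raised in the Q&A: the OpenTelemetry Collector's processors, configured in YAML, can drop or hash attributes before anything is stored. A minimal sketch — the attribute keys are examples, not a recommended list:

```yaml
processors:
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete            # never store credentials
      - key: user.email
        action: hash              # keep cardinality, drop the raw value

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/scrub]
      exporters: [otlp]
```

Putting the scrubbing in the pipeline, rather than in each service, gives one central place to enforce what developers may or may not export.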
Public calendars aggregation using Linkal
Hello everyone. Is everyone hearing me correctly? Yeah? Great. My name is Julien Malka. I'm a PhD student at Télécom Paris working on software supply chain security. I'm also a NixOS developer, but what I'm going to talk to you about today has nothing to do with that. I'm going to talk about a weekend project of mine called Linkal, and about deficiencies I see in the public calendar ecosystem. I'm running with a pretty adversarial screen resolution, so if at some point the slides are completely broken, I'll try to describe what you're supposed to see. Right. So today I'll talk about what I think is problematic in public calendars and the calendar ecosystem for collaboration. I'll explain the motivating situation that made me do this weekend project, and then the two pieces of software we came up with to solve it. I think public calendars — or calendars in general — are sometimes a bit painful to interact with. The problem I saw when I started thinking about this: when there's a public calendar you want to follow in your calendar client, clients can do different things. A client may have the capacity to import ICS files, even files in bulk, but then it won't do anything more with those files than display the events to you; it won't subscribe to updates of those events and won't continue to fetch new events as they appear. There's an intermediate tier of clients that will fetch updates — so if an event gets updated, a changed location or something like that, some calendar clients will update it. And then there are the clients that do everything you want, which also means fetching new events as they arrive into your calendar.
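For reference, the ICS files mentioned here are plain-text iCalendar (RFC 5545) data; a subscribing client re-fetches a feed like this to pick up changes (contents illustrative):

```text
BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//example.org//Events//EN
BEGIN:VEVENT
UID:20240203-001@example.org
DTSTAMP:20240201T120000Z
DTSTART:20240203T090000Z
DTEND:20240203T093000Z
SUMMARY:FOSDEM talk on public calendars
LOCATION:Brussels
END:VEVENT
END:VCALENDAR
```

A bulk import reads this once and stops; a proper subscription keys on each event's UID to apply later updates.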
The other big problem, I think, is that calendar providers are not always nice about letting you export your calendars as public calendars that others can follow. Sometimes they make it very complicated to find the actual option to export a public calendar, and complicated for people using other calendar software or providers to actually subscribe to your calendars. And I think the calendar ecosystem is also lacking some nice-to-have features that would make life easier. Public calendars are not easily composable: it's not easy to take a few public calendars and merge them into one collection, which is something you might want to do when, for example, you want to follow all the events about, let's say, NixOS — because I'm a NixOS developer — in your region, and several entities organize these events. They all have a calendar, and what you'd like to do is aggregate these calendars and offer a collection that other NixOS users might want to follow to get all the events in one place. This is not easily done. The other thing I think would be really nice is filtering of events in calendars. Just as you can easily filter emails, why not be able to filter, from the calendars you follow, the events that are relevant to you — for example events happening in a certain geographic area, or at a certain date or hour? That could be really nice, and it's also very complicated to do today, I think. All this thinking came from a concrete situation: at my school there are a lot of different associations that all organize their own stuff, and they all maintain some kind of place where they put the events they organize. Sometimes it's a calendar, sometimes it's just a plain web page that you cannot do much with, sometimes they just send emails.
But there was no central place where you could see everything that gets organized on campus and be informed that way. So we had a first iteration of a solution for this problem. The first software, developed in-house at my school, was called Mitis. Mitis is a web service with some kind of interface — I don't know if you can see it correctly, but it shows all the events from all the calendars. It's really nice, and it was a first step in the right direction. What you can do with this interface is ask it to export an ICS file, so you can import all these events into your calendar client. But what you cannot do is ask it to act as a CalDAV server, add it to your calendar client, and have all the events on your phone or computer updated in real time — following all of this without any action on your part. When I saw that, I thought: I kind of want this to be a CalDAV server. So I created Linkal. Linkal is a weekend project and it does exactly that: it takes this idea and implements it as a CalDAV server. My design goals for Linkal: I wanted a CalDAV server that presents several calendars, coming from different places, as one collection. To the client it looks like one collection of calendars that you're importing, but actually all these calendars are hosted elsewhere. The other design goal was to be able to do some processing locally — to have Linkal process the events in one way or another, so that at some point we can have the filtering features I was talking about. Okay, so the first iteration: when I was trying to implement this, my first idea was, okay, I'm going to implement this in Rust, because why not.
And actually I wanted to learn Rust at the time. It was going to be simple: I'd use some Rust libraries that act as CalDAV clients — we have minicaldav and kitchen-fridge — and these libraries would perform the requests to the underlying calendars. That part is logical and easy, but the problem is that you also have to implement the whole WebDAV/CalDAV specification on the other side. You have to implement an HTTP server that provides all the endpoints of the WebDAV/CalDAV specification, and then take all the incoming calls and rewrite them in terms of function calls to these libraries. The problem: that's a bit too painful, because the CalDAV/WebDAV specification is very big and a bit complicated, and it was a lot of work for a weekend project. So I thought: this is too complicated, too painful, there has to be something else. Second iteration: this time I wanted to implement as little as possible of the WebDAV/CalDAV specification and still get something working. The idea is to rely on the clients: CalDAV clients know how to format requests correctly, and the underlying CalDAV servers we're trying to aggregate know how to answer those requests. So somebody already did the job for me, and what I need to do is only forward the appropriate request with the appropriate body to the underlying calendars, get the answer, and maybe do some kind of modification to the answers — but we try to keep that to a minimum. So what we have is: the client connects to Linkal.
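The discovery machinery he describes next is standard WebDAV: roughly, a client's first request is a PROPFIND asking for the calendar home. A sketch per the CalDAV spec (RFC 4791), using the talk's /principals/linkal path; element details are illustrative:

```xml
<!-- PROPFIND /principals/linkal/  (Depth: 0) -->
<d:propfind xmlns:d="DAV:" xmlns:c="urn:ietf:params:xml:ns:caldav">
  <d:prop>
    <c:calendar-home-set/>
  </d:prop>
</d:propfind>

<!-- The server's 207 Multi-Status answer points the client at /cal/ -->
<d:multistatus xmlns:d="DAV:" xmlns:c="urn:ietf:params:xml:ns:caldav">
  <d:response>
    <d:href>/principals/linkal/</d:href>
    <d:propstat>
      <d:prop>
        <c:calendar-home-set><d:href>/cal/</d:href></c:calendar-home-set>
      </d:prop>
      <d:status>HTTP/1.1 200 OK</d:status>
    </d:propstat>
  </d:response>
</d:multistatus>
```

This discovery exchange is the part Linkal has to answer itself; most everything after it can be proxied.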
Linkal then forwards the request to the underlying calendars, the answers come back, and we forward the answer back to the client — so we act as a proxy. At this point some processing of the request can happen, some filtering, and some minimal modification needs to be done. Okay, so going a bit deeper: there are two kinds of requests we need to handle. The first kind is the requests the client sends us to discover the calendars inside the collection. This is the part we have to implement ourselves, because we cannot forward these requests to the underlying calendars — that would make no sense. The second kind is the requests the client makes once it has acquired the list of all the calendars in the collection we're giving it: it can then query the individual calendars, and those requests we can just forward to the underlying calendars, doing practically nothing ourselves. So let me give you an insight into how this works. In a CalDAV client, you write down the URL of the server plus a username and password, and the client queries the WebDAV server to ask what the calendar home is for this user, this principal. So we implement one endpoint, /principals/linkal — linkal being the username you should provide to your CalDAV client — and the client queries this path with what is called a PROPFIND request, a property-find request, asking for the calendar-home-set property. It asks for a lot of different properties we don't really care about, but at some point it asks for this one, and when it does, we answer that it should go and look at the path /cal. When the client behaves correctly, that's what it does next: it goes to this path, because it now knows this is the collection root for calendars, and tries to find out which calendars are in the collection. So it queries this path, and here I first tried to implement the endpoint myself, by guessing which properties we should send back to the client. But that was also too painful, so I took another direction: instead, I forward the request the client sends me to all the underlying calendars, they all answer, I aggregate the answers, and that's what I send back to the client. Now the client knows all the calendars in the collection. We do have to do some hijacking of the answers, modifying some of the fields. There are a lot of cosmetic fields you can change, but the most important one is the URL of each calendar: each underlying server, when it answers, says "find this specific calendar at this specific URL", giving its own URL. We have to change that so it points to where we can answer requests for that calendar, so we rewrite each calendar's URL to /cal/ followed by the name of the calendar. The CalDAV client now has a list of URLs, one per calendar, and it queries these URLs to fetch the events. And this is the part where we just shamelessly forward the requests to the underlying servers, acting as a man in the middle. Again, when the responses come back, we can do some small modifications — cosmetic ones, like changing the color of a calendar as it should appear in your client. It's possible that you aggregate several calendars that have the same color, so you want to adjust that in Linkal so that when the collection appears in the client, the calendars all have different, nice colors. So, a little working example. Let's say I want to offer a NixOS calendar that aggregates several calendars offered by different entities, and I have three entities: an association that offers NixOS meetups; a school that offers NixOS courses; and some Nix parties — let's say very real things organized by Nix people — in a third calendar. So I have three different calendars with three different hosts. The way it works is that I create a JSON file stating which calendars I want to integrate into my aggregated collection — I just list them like so. Then I run Linkal with this calendars JSON file, and it gives me a Linkal server. So if you want to try it at some point during the day and tell me that it doesn't work on your specific client — or that it does work — the server is currently live. What you get, if you're using macOS or iOS like I was when I worked on this project, is: you add the CalDAV collection, you specify the URL I gave you and the user "linkal", and you get one collection containing these three calendars, displaying the events that are in them. And whenever the underlying entities add new events to these calendars, it updates and the events become available in your client directly. It's also working on Thunderbird, and I don't really know about other clients. Now let's talk about what I'd like to do in the future. As I told you, one of the goals of this project is to have some kind of filtering feature, where you can say: I'm only interested in events happening in this city, or on Tuesday nights, or whatever. Currently, the way Linkal is implemented, you could do that by going into the Rust code base and implementing the filters yourself, which is admittedly not a great user experience. So what I think I want to do, if I ever get some time, is devise a domain-specific language where you can write filtering expressions for your calendars — you'd have the expressivity to express the kinds of filters and rules I just told you about — and then you would upload this expression to Linkal, and it would do the filtering before the events reach your calendar client. The other thing I want to improve: Linkal is currently only able to serve one calendar collection, and I'd like to make it multi-tenant, so it could host as many calendar collections as needed, with some kind of web interface where you could upload these expressions in the domain-specific language to define a new calendar collection. And the last thing I want to say is that maybe this kind of filtering idea could in the future also be accepted by CalDAV servers, and so maybe enter some standardization. Thank you for your attention. Linkal is available on GitHub — the URL is here — and if you have any questions, I'd be happy to answer them. — [Q:] Hello. First, as someone who has dealt with a lot of calendar hell, I appreciate the effort you're putting into this project.
And my question is: is there any sort of write functionality? Maybe you covered this earlier, but if you're just passing things through, proxying them, and you have the appropriate credentials, could you add events to these collective calendars, or is it a read-only setup? — You mean, can you add events through Linkal? Yes, there's no limitation that you couldn't. But what kind of events would you — I mean, what's the use case? You're managing the collection and you want one more event to appear for the people following it? — [Q:] Yeah, or maybe the people who are subscribing, the people receiving these events, say: hey, I want to have an unofficial after-party, I'm adding it after this main event, and other people can see it. — The immediate answer I can give is: if, as a collection manager, you really want to add some events, you could add your own calendar that you manage to the collection and add the events to that underlying calendar, and it will just work. There's no real limitation that would prevent doing it directly through Linkal, but in terms of user experience there's no real interface where you could do this easily. OK. Thank you. — [Q:] Thank you for the talk. Have you considered aggregating from social media like Facebook or similar? Would this also work? — Sorry, I didn't hear very well — ah. I have not considered this yet, but it could totally be an option. Linkal is currently a very rough prototype, and what I want to do is add some other ways to integrate events that don't come directly from CalDAV servers. The priority is adding events from endpoints that just serve ICS files, which I know some people have asked for.
But adding events from sources that aggregate events, like social media, is also interesting, and I will consider that. Other questions? OK. Thank you.
Indico: an event management system
Okay. Thank you very much. Hi everyone, I'm really happy to be here. I'm Pedro Ferreira, a software engineer at CERN. I'll be talking to you about Indico together with Dom, who's going to do the second half of the presentation. First of all, it's a pleasure to be here — it's our first time at FOSDEM and it's really nice to see such interest. Thank you. So, as the title of the talk says, we'll be talking about Indico, an event management system, as you may have realized by now. Like all of the things being presented here today, it's a collaborative effort and an open source project, under the MIT license, developed mainly at CERN with contributions from the United Nations and the Max Planck Institute for Physics. It counts contributions from more than 70 developers over the last roughly 20 years. Indico is probably the most popular event management system you have never heard of: there are something like 300 servers around the world, most belonging to educational, research, and scientific institutions, serving more than 350,000 users. It started out in the research world — CERN is, as you know, a research laboratory — but then spread to different environments, and there are a few examples of organizations from different domains already using it. A little bit of history. In 1999, the physicists working on the Large Hadron Collider — which back then didn't exist yet; they were still designing and building it — needed an application they could use to manage their meetings. What would normally happen is that you'd have a meeting, you'd exchange a few emails with the slides and so on, and then this would get lost at some point, because it would be spread across the mailboxes and disks of different people.
So they wanted an application they could use as a focal point for this sort of event, and as an archival platform as well. The first attempt at that was CDS Agenda, back then. Then in 2002 the opportunity came up with a European project focused on building a conferencing platform; they put the two ideas together, and that's when Indico was born. It went into production in 2004. In 2007 we added a room booking system; in 2008, a full interface overhaul. 2013 brought the first workshop, and word of mouth started spreading; in 2015 the United Nations adopted it, and we started a really nice, fruitful collaboration that goes on to this day. In 2017 we did a full rewrite of the application — we were on an aging software stack, and we even changed the database system, moving to Postgres. In 2021 we moved to Python 3 with Indico 3.0. In 2023, last year, we surpassed 1 million events at CERN alone, and in 2024, this year, we celebrate our 20th anniversary. So, you may have heard about CERN: the big tunnel we have underground, the LHC, the detectors, and everything that happens a hundred and some meters underground. A less known facet of the organization — well, maybe not for you, because you're all tech people — is that the World Wide Web was invented by Tim Berners-Lee at the organization in the late 80s and early 90s. And CERN actually produces a lot of open source — it uses a lot too, but it's really a net contributor to society when it comes to open source. Open science is really at the core of our mission, and we have a series of software products that are used around the world to this day, developed mostly at the organization in collaboration with several labs: Invenio, Zenodo, and also ROOT, White Rabbit, and a few other things.
There's also the CERN Open Hardware Licence, which goes to show how the laboratory was a bit of a pioneer in the whole open hardware movement. And last year we also set up our own open source program office. As I said, we're also using a lot of open source software; many of those projects are represented here today at the stands — so thanks, everyone, for your help. A little bit of publicity: there are three other talks from CERN at this conference, so if you're interested in storage, or in research data management with InvenioRDM, you're invited to pop by. Coming back to CERN: we have around 17,000 people on campus at any time, around 230 meeting rooms, and we organize more than 100,000 events a year — meetings, lectures, conferences, all sorts of stuff. Many of these meetings are highly distributed. When Indico came up, the objective was to solve exactly this problem: how do we get super big collaborations of thousands of physicists to work together in a distributed environment, and how do we reconcile that with the organization's physical presence? This, by the way, is the Science Gateway, a pretty recent addition to the laboratory — a super fancy project by the same architect who was responsible for the Centre Pompidou in Paris. Just a disclaimer: we don't work in this building; we obviously work in the Brutalist buildings back there, where the IT department is. But you should really visit it, it's a really nice place. At CERN, Indico became quite popular very quickly, and we've been growing year after year — this is the number of new events per year, so we're still accelerating. And these are just examples of a few events, meetings, and conferences currently hosted on CERN's Indico server. There are basically two types of events.
There are the conferences, which follow the more traditional workflow, where you have a call for abstracts and paper reviewing. You have workflows which allow people to interact and do, you know, the reviewing of papers, refereeing and so on. And then there are the meetings, which are a bit of a simplified view, in which you can upload your slides and share them with other people, and you have a common shared schedule. And now, I'll switch over to Dom. All right. People call me Dominic or Dom, I don't really care. So, this is Room Booking. It's a module which is part of Indico. As you can see in this nice screenshot, you've got the Leaflet-based map on the right, which shows you rooms. On the left, you've got a timeline of the rooms which have been booked. Very, very simple stuff. But it's not just that. So, we're going to go into the technical aspects of Indico. At its core, it's very general purpose. Just because we use it at CERN to handle our conferences and meetings and everything else, it's not set in stone what you can use it for. You can use it for almost anything, pretty much, within that realm anyway. You can also extend it through plugins. And you can customize it with, you know, standard CSS or what have you. So, under the hood, yes, it is a Python application, specifically Flask-based. That handles our back end. For the database, PostgreSQL, I believe they have a booth here. Then we have other stuff as well, such as Celery, which is handling our background tasks, and SQLAlchemy, which is essentially the ORM for PostgreSQL. Again, that is Python-based. And then for the UI, well, the front end, we could say, there's React, and Semantic UI, which is just the styling of this. And we've got a lot more services on top. Okay, so, as I said, plugins, extensions, so yes, Indico has them. You might be interested.
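The stack just described (a Flask back end, with data ultimately living in PostgreSQL via SQLAlchemy) can be illustrated with a minimal sketch. The route names, fields, and in-memory event list below are made up for illustration; this is not Indico's actual code.

```python
# A minimal sketch of a Flask-based back end like the one described.
# Routes and fields are illustrative only, not Indico's real API.
from flask import Flask, jsonify, request

app = Flask(__name__)
events = []  # in a real deployment this would live in PostgreSQL via SQLAlchemy


@app.route("/api/events", methods=["GET"])
def list_events():
    # return all events as JSON
    return jsonify(events)


@app.route("/api/events", methods=["POST"])
def create_event():
    # create a new event from the posted JSON payload
    data = request.get_json()
    event = {"id": len(events) + 1, "title": data["title"]}
    events.append(event)
    return jsonify(event), 201
```

In Indico itself, background work (emails, reminders) would then be offloaded to Celery tasks rather than done inside the request.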
So, yeah, these are just a couple of our plugins; I'll get into a lot more. But yeah: video conferencing, payments, conversion to PDF, search via Elasticsearch, storage and URL shortening and, you know, a lot more stuff which Indico handles under the hood for CERN. So, for example, we've got a nice one-click Zoom join plugin here, as you can see there. Payments: so yes, CERN does handle payments for the conferences. Apologies. Apologies. So, CERN does handle payments for the conferences via its own plugin. So, you can see there, we can handle payments via the PostFinance plugin, but also, for people running their own instances, there is a third-party integration out there for collecting payments via Stripe, and a PayPal one also. Workflows. So, when you come to CERN, you might go to a conference. We have our own internal workflow for handling your access and other stuff that relates to it. And also, yeah, this is a bit more on the access side: yes, Indico can also handle printing of your badges and also your actual access onto the site. Recording of events. Again, this goes back a little bit to Zoom, but Indico handles the entire life cycle of conferences and events. So, here's just a quick screenshot: you can record an event, and on our side at CERN, the event will go to our CDS archive, so it can be played back on demand, and that is the archive for our events. Okay, so you probably saw a little bit about Room Booking. This is our internal spinoff called Burotel. So, Room Booking, as it says on the tin, is for rooms; Burotel, bureau, is for desks. So, at CERN, we do provide a modified version of Indico which only has this specific module, which has been modified, and that is via a plugin. Again, going back to what I said earlier, you can also customize it. So, here is my screenshot of the International Linear Collider Indico instance, which is hosted at CERN.
And, yeah, it has its own look and feel. And it's not just the front page; you can also customize your meetings with the same CSS rules. And here is one more, the conference page for Higgs 2020. Now, one more thing: we have a nice check-in application. Previously this was a React Native application, but I think around last year we rewrote it from scratch to be a PWA, a progressive web application. So, basically, like at any other conference, you might have someone at a door scanning your badges, scanning your tickets, what have you. So, it's just an application where you can use your smartphone. And then, yeah, it gives you all the functionality that you would expect from a badge scanner, so a QR code reader. It also lets you bring up details of who's attending, you can check them in, and there are other bits and pieces on top. Okay. One last thing, I guess. So, it's a very accessible event management system. It's open source and we have a pretty nice and thriving community. Here's a screenshot of our forum, where everyone is welcome. And, yeah, so, I guess if you have any questions... Be sure to follow us, as a shout-out, I guess. But, yeah, that's all. Thank you. Thank you. I was wondering if you also had some kind of back end for budgeting. Like, when I organize a conference, I want to make sure that all the money that we receive from it then pays for the things that I'm going to spend on for the conference. So, we should repeat the question, right? Yeah, so the question is whether we have some sort of back end for budgeting, to kind of budget different aspects of the conference. And the answer is no. I mean, you have a customizable registration form where you can kind of assign prices to items; I don't know if that's what you need. But in terms of doing, you know, financial data analysis and so on, we don't have anything like that.
But, yeah, you can extract everything basically to Excel and do that stuff in a spreadsheet. Okay. The question: I think there is some space for integration with Jitsi or BigBlueButton for conferencing. And is there a way to manage Wi-Fi password distribution for participants? You... And tokens, discounts for social events in the evening. Repeat the question. Can you repeat the question? Yeah. Yeah, yeah, you have to repeat the question. Well, yeah, so the question is if there is some sort of way to distribute Wi-Fi passwords to participants. That's it? Yeah. Wi-Fi passwords or tokens for social events? Not built-in, but you could probably implement it through a plugin, right? That could be... I mean, this would be plugin-based. So, yeah, you would probably have to write something yourself, or hire someone to write it. Sorry? So neither for tokens nor for Wi-Fi passwords? You have to do plugins. Yeah, no, there's nothing built-in for that, no. Yes. Is the time of attendance registered for participants? So, the question is whether the time of attendance per participant is registered. Well, not the attendance as such, because I think we don't have any mechanism where people say, you know, I'm attending this talk and so on. But we have the check-in time, yeah: with the app that Dom presented before, if you check a person in, the time is registered and you have a log of who checked in at the event. But that's more for the reception part of the event. Is there also check-out, or only check-in? Only check-in, yeah. So, it's like Hotel California, if you want. Yes. Are there plans to have a progressive web app for participants, not for the organizers, for example, to see the schedule of what is happening?
So, the question is whether there are plans for a PWA which targets the participants' side of the event, not so much the organization like here. The answer is yes. We are planning on getting started this year. There are some funding issues to be addressed, as you probably know very well, as is often the case. But yeah, it's on the plan for this year. Yes. What priority does accessibility in the UI have, as you showed? It's a very good question. So, in terms of accessibility in the UI, the code is currently going through a phase where, in collaboration with the UN... The UN has hired a developer to contribute back to Indico some improvements to accessibility. So, it is a work in progress at the moment. And there are some features which are going to be released soon, or are already available in minor releases. Many of those have already been merged into our main branch and will be included in the next release. But yeah, there's a lot of work currently being done in making sure that we pass the WCAG guidelines. Yeah. Yes. What about developer documentation? Is it well documented, so people can easily access and contribute to the project? So, regarding the question on developer documentation and how someone can contribute: yes, there is documentation out there. So, on... Change of slides. Yeah. So, if you go to getindico.io, we do have a couple of pages on how you can contribute back to the project. And also, we've got a pretty good README and some Read the Docs pages on how to contribute back. It also covers stuff like how to set up your own development instance and everything from, you know, how to write a half-decent commit, or a PR. So, yeah. There's also some API documentation: Sphinx documentation based on the code documentation.
And it's not as complete as we'd like, but yeah, it's a work in progress. Any other questions? No one? No. Well, thank you very much. Thank you.
OpenTalk - Video conferencing secure and GDPR compliant
I have Stefan here to support me with the in-depth technical details, because he is more proficient than I am in these areas. This is a very high-level overview of the project. We are not going to go deep into the details, but if you have questions that go deeper, just ask them at the end. If you want the product-side view or the customer view, you can always use the official contact channels and you will get an answer there. So, a little bit of background about OpenTalk. There is a company behind OpenTalk. It was founded in 2021, in the middle of the pandemic, by the Heinlein group, which has been doing consulting and training for Linux and mail operations and hosting for more than 30 years, and which is also the provider of the well-known mail operator mailbox.org. The OpenTalk company right now has around 20 employees, so it is growing slowly but steadily. So, who are we? I am Wolfgang. I joined OpenTalk roughly one and a half years ago and became the backend team tech expert, so more or less the technical lead, in July last year. I have a master's degree in embedded systems design, but I am much more on the software side than on the hardware side. I have been doing Rust since 2015 and I am still in the honeymoon phase; of all the languages I have used, this is the longest honeymoon phase I have ever had. And I am the co-founder and organizer of a Linux user group, and you can find me on the Fediverse. So, Stefan. Yeah. I have been two and something years with OpenTalk now, and I am mainly on the media team, which is our team for all the real-time stuff: audio, video, recording, streaming, WebRTC. It is kind of in between front end and back end. And yeah, I have also been at university before, for a long time, doing parallel programming, operating system stuff and also some real-time things and software-defined radio. So if you are interested in that, just talk to me later on, I guess. Okay. Some information about the project in general.
So the project is written, or the front end is written in TypeScript, the back end is written in Rust. It is free software under the copy left EUPL 1.2 license. You can find technical documentation online under this domain docs.opentalk.eu. There is also a FEDIVERS account called OpenTalk Meeting. You will find it by that. And there is a Matrix channel as well, hosted on matrix.org. This is, yeah, the Matrix channel is where some of the devs are hanging around and answering technical questions but it is not an official support channel in that regard. Okay. So the user interface, this is what the video conferencing software looks like. So it is roughly similar to what you know from other programs. It was important to make a nice design that looks good and is, yeah, comfortable to use. We also have what we call the dashboard. This is where you can create meetings. You can add start and end date. You can add meeting series and you can also get an email or maybe that's on the next slide. You get an email when you are invited to a meeting or when a meeting is canceled and also the creator of the meeting gets the invite so they can put it into their own calendar. Okay. So short list of the features. We have a lobby with a mic and camera check so you can check that everything is working. We have some interesting moderation tools, one of them being the coffee break which we will show in the next slide, a timer so you can assign tasks to people and say, okay, you have 10 minutes for this and if you want then report when you are ready and when everybody is ready the timer ends or when the timeout is approached. Meeting participants, we have a poll feature and breakout rooms. Screen share, yeah, that's well known for conferencing software. One important information here is that multiple people can share the screen at the same time which comes in handy for peer programming. 
Yeah, you have the speaker view, where you always see the large picture of the person who is currently speaking. You can call in from mobile or landline phones via SIP, and we have integrations for a shared editor, which in this case is Etherpad, and a whiteboard, which is Spacedeck currently. Yeah, I already talked about the invitations. Right now we are in the course of finishing recording and streaming, so you can record meetings and you can live stream them as well, and the idea is to also allow streaming to multiple platforms at the same time, so you can have YouTube, Twitch and Owncast streams at the same time if you want. If you are interested in that, talk to Daniel over there, he did a lot of the work. Yeah. Okay, so here you see a screenshot of the coffee break, that's what it looks like. Everybody gets this full screen as soon as the coffee break is started, but you can go back into the conference anytime you like, for chit-chatting up front before everybody is back, just like in real life. And this is another nice feature we have, we call it the shared folder feature. In the dashboard, when you create the meeting, you can enable this shared folder switch. It must be configured for the OpenTalk instance, but then the system will create a shared folder on a Nextcloud instance. This is the part that needs to be configured. It will create two shares to this folder, one of them being read-write and the other one being read-only. The moderators of the conference receive the read-write link, so they can put their material into this folder up front, while all the other people have access to it either by clicking on the link in the invitation mail or by opening it through the icon during the conference. Okay, so this is the more technical part, I'll give the word to Stefan here. So, that's what it looks like from a rough perspective of the developer or the administrator of the system.
So, it's not just one big service; we tried as much as we can to use existing components. What we built mainly is the dark-colored parts, and the other services are more or less what you get from the different projects. So, we use Janus and RabbitMQ for communication, with Janus as the media gateway, but we manage all the video rooms on our own using our controller backend. As said, there is a web front end written in TypeScript and React here, but it's kind of symmetric to what happens on the other side with the, I like to call them, backend clients for streaming, call-in and all that stuff. They just have another way of starting the whole process, but actually they do the RTC and signaling just as the front end would do, and by now they also have a way to authenticate services against Keycloak via service authentication. So that's also, we can see that later, a way to extend our system in that part. It's meant to be scalable, so you can have multiple instances and they just share their data, where Redis and so forth holds the session stuff, and the persistent data, like which rooms we have and which users are invited to which rooms, is stored in a normal relational database. And we do a lot of integration on that OpenID Connect / Keycloak side with other user systems or databases that people tend to have already on site. Okay, so this is a sneak peek of Rust code. It's currently not ready yet, but we are approaching this.
We are right now working on extracting the protocol data types into a separate library, which was not the case when I started working with OpenTalk, and the idea is to publish the client library to crates.io, which is the default publication platform for Rust code. And yeah, it should be as easy as this. I mean, the authentication is usually a little more involved than these two lines, but you basically connect to the instance and can do things with the client. This is the web API for managing appointments and so on: here we create an event with a title that we set, and then we invite a user with the email address eliza@example.com and the role of a regular user. You could do the same for a moderator as well. So the idea is to allow automation and integration in a very easy and approachable way, if you're familiar with Rust code. This is also what we will be using for the recorder, which connects to the meeting session, for the call-in via a landline or telephone, and for other, maybe future, services.
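The real client library described here is written in Rust and not published yet, so as a language-neutral illustration of the same flow (connect with a token, create an event, invite a participant with a role), here is a toy Python mock. Every class and method name below is hypothetical and does not correspond to OpenTalk's actual API.

```python
# Toy mock of the automation flow described for the planned Rust client:
# connect, create an event, invite a user with a role. All names here are
# hypothetical, not OpenTalk's real API.
from dataclasses import dataclass, field


@dataclass
class Event:
    title: str
    invitees: list = field(default_factory=list)


class FakeOpenTalkClient:
    def __init__(self, base_url: str, token: str):
        # real authentication is more involved than passing a single token
        self.base_url = base_url
        self.token = token
        self.events = []

    def create_event(self, title: str) -> Event:
        # in reality this would be an HTTP call to the controller's web API
        event = Event(title=title)
        self.events.append(event)
        return event

    def invite(self, event: Event, email: str, role: str = "user"):
        # invite a participant; role could also be "moderator"
        event.invitees.append((email, role))
```

The point is only to show how short such automation can be once a client library exists; the actual calls and types will be whatever the published Rust crate defines.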
So, yeah, talking about these kinds of services, that's actually the flow you have there. You build your new backend service, which will act like a client to the conference. It first needs to authenticate and get some access token, however you set it up, and then you usually just go to the backend and say: hi, that's me, and that's the token I got, so I'm authenticated, and I would like to join this room over there, which has this ID. By that you open a web socket where all the signaling happens, and you see the publications of media streams. The backend will announce when new users arrive, and will also announce what screen share and camera stream they have, so you can then start the WebRTC connection with Janus, and on that signaling channel you just exchange SDP and other credentials to get the right streams set up. Here, in our case, we usually use GStreamer as the media stack, which is then set up to get all the streams and, for instance, do video mixing. And when you're done with your recording, so somebody tells you on the signaling channel, okay, stop the recording now, you just upload the video file it produced to the controller again, which puts it into an S3 storage. For development purposes we currently use MinIO, but you can use whatever S3 you would like. It then also becomes available on the dashboard, which would also work with other artifacts, so a whiteboard export or, yeah, meeting minutes would be the same thing, just another document format, right? And what I missed out is the other way around: when you don't initiate the action yourself, there's also the RabbitMQ service, where you can just attach and listen for the controller to call you and say: hey, you, service, you should do something, like start a recording for instance, and then you just start the signaling session. Right, that's basically it. Yeah, okay, that's also your part. Yeah, so we talked about, or we've seen, a lot of components which are open
source and which we integrate with. And as we are a company, there have also been other companies and software developers which we integrated with. So I guess that's one of the main things and themes: that we and other people have projects and try to integrate with each other. There is a UCS integration, where they basically have their Keycloak and a user management part, and we just connect there. And there is Innovaphone, which does mainly SIP and also has a platform, and we try to integrate there, also via OpenID Connect, and made some adjustments to our SIP stack so that we are compatible with them. And with MGM it's like we just started, I guess; they talked about how we could do components where you would just have the video, not the whole front end, but it's in the starting phase. And yeah, as many people use it right now and there has been high demand, we did an Outlook plugin, but there has also been some talk, I know, of a Thunderbird plugin; it's just not yet on the way, I guess. So yeah, if you have some questions or needs, or want to do something on your own, just talk to us and we'd be happy to tell you what's going on and to support it as far as we can. Okay, yeah, that's it, more or less. We tried to keep it short, so if there are specific questions and details, yeah, just go ahead. You haven't mentioned end-to-end encryption at all yet, and I know that Jitsi already has some support for end-to-end encryption, and also Matrix is now getting into the real-time communication business, and I was wondering what your strategy is here. Yeah, I can say a word, I guess. It's not so easy, is the starting point. The thing is, if you want to do end-to-end encryption, you basically don't trust the backend, that's the deal. And we're talking about a web application right now, which is a problem, because in the first place you would load your application from the
server you don't want to trust. So we are looking into how we can ensure that you can really maintain the integrity of all your personal keys and all that stuff, and that's pretty hard to do in a browser environment. And yeah, of course we could encrypt media connections, but that's just half of the deal. So basically we're in the process; that's also a goal for certain projects we're working on, but it's not yet at a point where I can say: okay, that's the route we're going to take right now, and here are the details. So we didn't put it on the slides yet. If there are questions on that topic, maybe we can have a discussion in detail later on, or if you have specific needs in that direction, also let me know. I'm interested in what you consider are still very important features or properties which are not yet in any open source video conferencing solution, and which you are working on but don't have yet. What are the important pieces still to come? So, yeah, as mentioned, there is the whole streaming and recording part, which right now is one of the main things, so we can support bigger conferences with a feeling of being in a room. For now we're just finishing the low-level streaming part and the first UI part to enable streaming, but we're thinking a lot about how to integrate a mode where you have a stage and an audience: the stage would be a normal WebRTC conference, and the audience would get to see the live stream and get a chat interface, but it would all happen in our user interface. That's something to come, I guess, but we have no time frame for that right now. And the other part we are on, from the project side, is all the telephony stuff, like SIP and H.323, I guess, which is the old video conferencing standard on telephony networks. Yeah, I guess there's much more, but there was another question. So, our organization is
100 people, but once in a while we host conferences for a few thousand, and now I wonder: should we then have a very large Janus media gateway just for this one event per month, or is there a way to easily scale the resources down and up? Because I've heard of federation of media servers in the Matrix context, and I think this is a very interesting concept when organizations have joint conferences. So, yeah, we also thought about that long and hard. There is a limit: if you don't cascade Janus instances, there's a limit on how many subscribers there can be for a single publisher, so the speaker in the room, and in our experience that's in the, say, three to four hundred range, depending on how you configure load balancing and all that stuff. And instead of doing cascading and all that, we are right now looking more into the streaming direction than into having it cascading and real-time for all, because usually the audience will not interact heavily, and you would have to invest a lot into getting all of those people in there fast. It might be a thing, and we are also looking into how Matrix does it; underneath they use LiveKit, as far as I know. But yeah, we are exploring the other direction, which is having it on streaming and moving people in and out of the room, so into the WebRTC conference or back into the stream view. That would be my take on that, because then you can have it more resource efficient: have a small meeting, which is easily manageable, and also have a streaming setup, which can easily scale to lots of people. Thank you. So, the question is: is there support for isolated audio, as in, in a large meeting, two people talking to each other alongside the orator without interfering with the others? This has been on the roadmap for quite some time already. The idea is to lower the main room audio volume and have a private talk with a subgroup of the conference, but it has not been implemented yet. I guess
we already have a specification for it, but not the time to build it yet.
Securely collaborate with CryptPad
Okay, so hello everyone. I joined CryptPad last year and I'm here to present the product and the future directions. Yeah, so it's called Securely Collaborate with CryptPad. CryptPad has already been presented last year, but I will start again by showing you what it's all about. Yes, no? No? Okay, thank you. Okay, so CryptPad, what it is: it's an end-to-end encrypted collaborative office suite. So you have a lot of different applications inside CryptPad. It all started as a project, I mean, as a name, and the name is a bit confusing because you may think it's only for pads. And actually it started like this, so there were only pads. But then, if you think about it, all files are just like text files, and that's how we managed to be able to produce a lot of different applications from that. So some of them are homemade, like the Kanban one and the Form one, which have really been made with our own little hands, while some others, like the rich text one or the Whiteboard, reuse existing software, for instance draw.io, and those three, presentation, spreadsheet and document, are OnlyOffice. So we either build a full application or try to use other applications and plug them onto the CryptPad layer, which is basically the encryption part. And the goal of all of this is to have both collaboration and privacy. Because when you think about it, when you are collaborating, you want to share some data. But then you don't want to share it with everyone; you want to share it with your collaborators first, and then maybe once you have a full document, you want to share it with others. And moreover, you may not want the service provider, so in our case it will be CryptPad, but if you are using some proprietary stuff, you don't want it to know what you are working on sometimes. Even if sometimes, when you are a company, you are working with Google for business and then you just give them all your data.
But anyway, what we are advocating here is that, yeah, we can have both collaboration and also privacy for our end users, which may be you. Anyway, I should have closed Thunderbird. Take, take, take, good. And so one example of that is Disha Ravi, an Indian climate activist who was arrested in India, near Bangalore, because she was working on the farmers' toolkit. As you may know, for instance right now in France there are farmer protests; in India there was one as well, in 2020 and 2021. And actually, the thing Disha Ravi was arrested for was helping edit the farmers' toolkit, which is a document helping the farmers to cooperate and get together, because India is a very multicultural subcontinent, so it was a big help for them. And actually, the document was published on CryptPad in the end. But at first it was made on Google Docs, and Google helped the Indian police. I mean, it can be understandable in some sense, because it is a big market. But yeah, I'm not here to judge; at least we cannot sell your data to anyone. And how does it work? So now let's get into it. Basically, we have a model where we have a central server, which will deliver the files, but the files are all stored encrypted. So this picture is actually not really accurate, because here what you see is the first Pingu writing hello, sending it to the server in an encrypted form, which is then broadcast in this encrypted form. And then, as it's symmetric encryption and everyone has the key, they can all decrypt. Actually, it doesn't work exactly like this. It will be like saying: oh, I wrote an H, and then it will send "I wrote an H"; then you say I added an E, and so on. It will just be differences between versions; it will be sent as patches. But here we just simplify things.
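The relay model just described (the server broadcasts messages it cannot read, and only clients holding the shared symmetric key can decrypt them) can be sketched as a toy in a few lines. XOR with a shared pad stands in for a real cipher purely for illustration; this is not CryptPad's actual cryptography, and reusing a pad like this would be insecure in practice.

```python
# Toy model of the relay described above: clients share a symmetric key and
# the server only ever forwards opaque ciphertext it cannot read. XOR with a
# shared pad stands in for a real cipher; NOT CryptPad's actual crypto.

def encrypt(key: bytes, patch: bytes) -> bytes:
    # XOR each patch byte with the corresponding key byte
    return bytes(k ^ p for k, p in zip(key, patch))

decrypt = encrypt  # XOR is its own inverse


class RelayServer:
    """Relays opaque blobs to every connected client without decrypting."""

    def __init__(self, clients):
        self.clients = clients

    def broadcast(self, blob: bytes):
        for client in self.clients:
            client.receive(blob)


class Client:
    def __init__(self, key: bytes):
        self.key = key
        self.doc = b""

    def receive(self, blob: bytes):
        # each broadcast blob is a small patch appended to the document
        self.doc += decrypt(self.key, blob)
```

The design point is the one made in the talk: since the server already sits in the middle to deliver files, it can also order and fan out the (encrypted) patches, which is what makes synchronization much easier than in a peer-to-peer setup.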
And so we decided to keep this centralized part, because in the end we could have imagined something like peer-to-peer, but then it would be hard to synchronize; we would have issues like a message arriving before another. We would have other issues too, but as we already have a server that delivers the files, we can also use it to coordinate communications. And that's how we managed to achieve our goal of having end-to-end encrypted collaborative editing. Sorry. And right now, so that was the presentation of CryptPad. We recently, we are mostly done with it, but we had an NLnet project, called Blueprints, where the goal was to analyze the security of CryptPad and try to find new directions to prepare for the future. I wrote it here. And so there are many different improvements that have been identified by this analysis. There are many things actually in CryptPad, because it's basically something that may be called cryptography-driven, where our design really relies on cryptography. For instance, when you log in, your password plus login is used to derive your different keys: signing keys, asymmetric encryption and everything. And so everything is based around this. And, for instance, we don't store a hash of your password and then try to match it. So, for instance, it makes password recovery a big hassle. We just can't do it. When you subscribe on CryptPad, there is a big warning saying: don't forget your password. But of course people forget it. As well, as we are mostly working with the document keys that we are sharing with people, we don't have any ratcheting or any key rotation, and then, for instance, revocation; also no forward secrecy, but I won't talk about that. So revocation is also hard to get. And another hot topic right now in the cryptographic community is post-quantum crypto.
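Coming back to the key derivation mentioned above, the idea can be sketched with the Python standard library: the login serves as the salt, one master secret comes out of a password hash, and labelled subkeys (signing, encryption) are derived from it. The scrypt parameters and HMAC labels here are illustrative, not CryptPad's actual scheme.

```python
# Sketch of deriving per-user keys from login + password, as described.
# Parameters and labels are illustrative, not CryptPad's actual values.
import hashlib
import hmac


def derive_keys(login: str, password: str):
    # the login acts as the salt, so the derivation is fully deterministic:
    # the same credentials always yield the same keys, and nothing needs to
    # be stored server-side -- which is also why a lost password is lost
    master = hashlib.scrypt(password.encode(), salt=login.encode(),
                            n=2**14, r=8, p=1, dklen=32,
                            maxmem=64 * 1024 * 1024)
    # expand the master secret into independent, labelled subkeys
    signing_key = hmac.new(master, b"signing", hashlib.sha256).digest()
    encryption_key = hmac.new(master, b"encryption", hashlib.sha256).digest()
    return signing_key, encryption_key
```

Because no password hash is stored for comparison, there is nothing the server could use to reset the account, which is exactly the password-recovery hassle described in the talk.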
So there is NIST, which started its post-quantum candidates evaluation around 2016, and in 2022, almost last year because we are sort of in a new year, the first new standards were selected: Falcon, Dilithium, Kyber and another one. Anyway, the thing is that, as I said earlier, CryptPad started as just a small project in the company and then it expanded. But the core is still there, and the core is not really easy to work around. So there is also a lot of refactoring to do on the cryptographic layer in order to move toward cryptographic agility. And among all these different improvements, I will talk about password recovery, because it's something that users may want. So let's talk about it. So I said CryptPad is cryptography-driven, and I said that users are identified by their signature public key. The thing is that this relation is only one way: if you know your password and login, then you can get this public key, but you cannot go the other way around. So we have something which has become a hassle to solve because of cryptography, and one solution is just: hey, let's add more cryptography. So we'll add something which is called secret sharing, sometimes Shamir's secret sharing, which is the idea that you want to be able to share a secret between multiple parties, such that a subset of them above a certain threshold can reconstruct it. It can be a more complex access structure, but let's keep it simple. So if you have more than a certain threshold, for instance you split your key into five and you want at least three people, say a majority, to collaborate to reconstruct it, then you can get the key back at the end.
So what we'll use is social secret sharing, something akin to the web of trust, where you share your keys between different participants. And then you have to trust them, of course, because if enough of them collude, then they know your password. And here we can put a weird reference, which is Reed-Solomon codes, because it's basically the same idea of splitting things. I come from the cryptographic community, and we like the coding-theory people because they are always telling us that the coding people invented everything before us. And at least this time it was true; not always, but sometimes. And what is social secret sharing? As I said, you have a secret; for instance, here it will be your password, or even directly some keys, or a key to a file which contains part of the material, in order to be able to have some kind of revocation. So you have your secret and you share it between your friends, and they all keep a shard, a part of it. Individually they cannot do anything with it. But then if you ask a threshold of them, for instance here I will take a majority vote, so if you ask two of them, then you can get your secret back. And hopefully they won't keep it to themselves. And once you have your secret back, maybe you can change your password, perhaps in some other convoluted way, but ideally you can. So in some sense you cannot lose your password, because it's always somewhere out there. Obviously it's not completely sound, because some people may not be connected; you also have to think about UI and UX, so how will it work? If they have to click on a button, they have to be online for that, and you have to contact them. But it's still way better than losing your password for good. And this raises a lot of different questions.
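The threshold scheme described above (split into five shards, any three reconstruct) can be sketched with textbook Shamir secret sharing over a prime field; the prime and the numbers below are arbitrary:

```python
import random

P = 2**127 - 1  # a Mersenne prime, large enough for a small secret

def split(secret: int, n: int, k: int):
    # Random polynomial of degree k-1 whose constant term is the secret;
    # each shard is a point (x, f(x)) on that polynomial.
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation evaluated at x = 0 recovers f(0) = secret.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

secret = 123456789
shares = split(secret, n=5, k=3)
assert reconstruct(shares[:3]) == secret   # any 3 of the 5 shards suffice
assert reconstruct(shares[2:]) == secret
```

Fewer than three shards reveal nothing about the secret, which is exactly the "individually they cannot do anything with it" property from the talk.
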
For instance, as I said, how will we make it understandable for the user? We don't want them to end up with just some gibberish string that they have to send you back. You want it to be stored properly, maybe directly on CryptPad. But then there is risk explanation: you have to tell them what is sensitive, what is not, what you can do, what you can't. It's an unusual system for users, maybe for good reason; I don't know a lot of systems where you have this kind of thing implemented. And, something I didn't say about CryptPad: one of the aims is also user-friendliness. We don't want something like PGP, for instance. PGP is a very nice tool, but it's hard to get around it when you are starting out and you just want to do something simple; you have to read the documentation and see exactly what you want to do. Even the OpenSSH client is a really powerful tool, but not very user-friendly. But for us, well, a lot of our users just want to use CryptPad because it's open source and it's an office suite, not because it's end-to-end encrypted. So we also want to keep this user base, and we think it's important to make cryptography available for everyone. And then, in the end, it's also just a displacement of the issue, because before, we could just lose everything; now we may not lose everything, but if your friends are not very trustworthy, they can collude and compute back your secret. And as I said, if they are not available, then you cannot do much either. No? So, to conclude, I'll come back to everything I said beforehand. CryptPad is an end-to-end encrypted collaborative office suite, and everything in this sentence is important. It's collaborative: you have most of the tools you want. And it's also secure, as in end-to-end encrypted.
As I said in the previous talk, we also have other issues: as of now, we cannot guarantee that the code you are executing, the JavaScript running in your browser, is indeed the real one. So there are parts where we can still do better, where we can still improve. And this is one very sensitive thing: it's end-to-end encrypted, but there are also some cryptographic solutions for that. They can be quite expensive, but we are thinking about how to go in this direction. I forgot to tell you about this, but as a full office suite, you also have other collaboration tools available, like calendars. Unfortunately, we can't synchronize them directly using CalDAV, because everything is encrypted on the server side, so the server cannot serve it directly, and we don't want to send the server the keys. We also have teams, a way to share your documents and calendars within a team. For instance, right now in CryptPad, and in XWiki in general, we are working using CryptPad. And we have different teams, like one for the support team, one for the CryptPad team, to organize things, where we can find every document that we need. And one of the very important points is that it aims at being user-friendly. So for the future, we want to go toward post-quantum and crypto agility: making the code more modular so that we can switch algorithms more easily, to move toward post-quantum secure collaboration, which will be a much stronger security guarantee than what we have nowadays. Even right now, as the symmetric part is sturdier than the asymmetric part, the data stored on the server is kind of okay even if we imagine a quantum adversary existing today.
The risk would be more that someone could impersonate you, which is still a big issue, but if you just get the data on the server, you cannot do much more than that; you need extra information, even if you have a quantum computer. There is also revocation, which I didn't talk about at all, but it's an interesting issue to handle, because it may help us move toward forward secrecy, which is nice to have: it means that if you get access to a document at some point in time, you don't know what happened before, and if you are revoked, you won't know what happens afterwards on this document. We can also imagine other ways to resolve conflicts. Right now I was mostly talking about having a central server, but we could also use conflict-free replicated data types, CRDTs, to try to resolve conflicts, because right now it's something very, very naive, which works in the end because in text you don't get that many weird conflicts. And yeah, as I said, there is still the code-execution question. So now, as a last word, I will just present the CryptPad team, and thank you for your attention; if you have any questions, I'll be glad to answer. I have two questions. The first one: I use CryptPad only for document writing, something like Google Docs, and there is only one little problem for me: it doesn't have a full-screen document view; there is information about CryptPad around the document, not like Google Docs, for instance. So I was put off at the beginning by this kind of thing, but maybe it can be resolved with a full-screen mode or something like that, when I go to CryptPad the first time. The second one is interfacing between different types of documents: worksheets, databases, also text with tables and so on, because between Google Docs and Google Sheets there is not good interfacing, and I will be happy if it's good in CryptPad.
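CRDTs, mentioned above as an alternative conflict-resolution approach, can be illustrated with the simplest one, a grow-only counter: merge is commutative and idempotent (elementwise max), so replicas converge no matter in which order updates arrive. This is a generic textbook sketch, not anything from CryptPad's code:

```python
class GCounter:
    # Grow-only counter CRDT: one slot per replica, merge = elementwise max.
    def __init__(self, replica_id: str):
        self.id = replica_id
        self.counts = {}

    def increment(self):
        self.counts[self.id] = self.counts.get(self.id, 0) + 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Taking the max per replica makes merging safe to repeat
        # and independent of message ordering.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()   # two edits on replica a
b.increment()                  # one concurrent edit on replica b
a.merge(b); b.merge(a)         # exchange state in either order...
assert a.value() == b.value() == 3   # ...and both converge
```
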
So let me rephrase to be sure. You said that the interfacing between spreadsheets and documents is not that good, right? Yes. Yeah, so far, I mean, we are right now depending on OnlyOffice, which we interfaced with the CryptPad server, but it will be kind of hard to get; we are always trying to improve things, and I know we have work ongoing mostly on this part, so maybe it will improve, but we'll keep it in mind. It's a good use of the tool, the interface. Yeah, we are working a lot on the user interface. Actually the project lead is a designer, so he's really giving us feedback about how to make things fit nicely. Any other questions? Hi, thank you for your talk. I've been helping package CryptPad for NixOS, for nixpkgs, and one thing that came to my attention was that the whole thing is like 800 megabytes, and I was like, whoa, what's going on? And then I noticed that it's the integration with OnlyOffice that takes a lot of space. I just wanted to know, are you keeping that in mind for the future? Will you keep it? Because it's quite big; if you compare it to a WordPress release, for example, the size is huge. The thing is that we don't have the original version of OnlyOffice: we only keep the OnlyOffice client, and the server part we emulate with CryptPad, basically. So we need to have this hacked OnlyOffice in our repositories, and every time we need to make an update, it's a mess. We are aware of this issue, and we are trying to find a solution, but we also have other issues too. And that's it. Thanks for your feedback. Hi, a question. So what exactly are the technical limitations at this point? Because you showed the secret sharing.
I imagine, I'm not a cryptographer, but based on the theory, the more people who are collaborating, the harder it would be to manage. So what are the other technical limitations that you see at this point? Technical limitations for what? You showed the example of secret sharing, where a secret is shared between different users. The more you scale the number of users on a document, the more complicated it might get. So basically the question was: what are the technical limitations of CryptPad, in this context of secret sharing, for instance; there may be scalability issues. And actually the thing is that for this, there would only be small sharing islands. As for scalability in general, we don't have any issues with the number of users growing in terms of collaboration, because in the end, on a single document, at a single point in time, not that many people will be working on it, and the server is only there for communication; there is really not much processing on the server, because it cannot do anything with encrypted data. So it's all spread across the clients, within their browsers, which makes it a bit of an issue on mobile devices, for instance. But for secret sharing, this won't be a technical limitation. As I said, the main bottleneck with the use of cryptography is that it makes some functionality harder to implement, because everything is hidden and you don't have access to it. But as far as I understand, for secret sharing, it should not be an issue; the main issue is that you have to coordinate with the other parties, but that's all. Yes. This is Ludovic, also from the team, to answer the earlier question.
One of the big reasons it takes a lot of space is that we have multiple versions of OnlyOffice, and the reason is that, because of real-time, we store documents in the native format of OnlyOffice; we're not storing the XLSX version. So for compatibility reasons, when we upgrade CryptPad, we need to be able to upgrade the pads, so we need the older versions of OnlyOffice so that we can upgrade a pad to the newer version. And there is a plan to make the installation of the OnlyOffice modules optional, and basically say which ones you want in your CryptPad, so that you're not carrying very old versions of OnlyOffice code in CryptPad. This is why it's so big. We're sorry about that, but there is a technical reason. Yeah, unfortunately. Thank you. So, yeah, I think we are done with this talk. Thank you, Fabrice. Thank you, everyone.
Collabora Online usability optimization
Okay, so thank you for joining. The next talk is still about Collabora, the second talk of the day about Collabora, which is about Collabora Online usability optimization. And we still have Caolán, who was in the previous talk, and also Michael, who is joining us. Thank you, Caolán. Fantastic. This is Caolán, this is Michael. Good. This is what I'm going to say; you'll see it as we get there. And yes, fantastic, Caolán did a very good spiel earlier on how this thing works. So if you were in the previous talk, you saw something similar to this: you have your browser, and then you have a WebSocket talking to a server on the back end, in C++. And this talks to LibreOfficeKit over a Unix domain socket, which does all sorts of beautiful interoperability, rendering, tiled goodness. And yes, this fetches data from an ownCloud, a Nextcloud, lots of things; any kind of WOPI host, SharePoint I think we can even use. Yeah, for the good guys, right? And yes, so anyway, this gets the file, pushes it in here, it renders it, and it comes back out to the browser. And we do all sorts of things to try and cache that. So JavaScript here, good stuff over there. Anything else on there? Nope, nope. Seems pretty simple. And I just want to talk a little bit about latencies. This is an interactive presentation; I'm not going to ask you to put your hands up just yet. But here are some timings, and the one I want to time is the human eye blink: 100 milliseconds for a human eye blink, okay? Right, so here we are. How good are you at blinking? Are you ready? Okay? So I'm going to press a button and we'll start blinking, and when you see red, stop. But you need to count at the same time, okay? You ready? Silently. Silently. Yeah, yeah, here we go. Ready? Are you ready? Go. How many? How many did you get? Do you want to try again? Yeah? Okay, so here is reciprocation for beginners, okay? This is an advanced topic in maths, if you need help.
Anyway, if you're a falcon, you've got like 7.7 milliseconds, so that's pretty good. Me, I'm more around here; I don't know about you. Six, seven, eight. How many did you get? Do you want to try again? Okay, we're going to try again. You've got the idea now, right? Okay, ready? Not completely, okay. So I'm going to click and it's going to go green. Start blinking, and count the blinks you're doing. Blink as fast as you can, as many as you can. I want to get a high score here: we're going for the peregrine falcon, 153 in a second, right? Okay, ready? Three, two... you've not started yet, have you? Three, two, one, blink. Okay, that was a second. How many did you get? Five, six, seven, eight. Yeah, okay, fair enough. So this tells you your score. And interestingly, in the UK they say a blink takes between 100 and 150 milliseconds; at Harvard, between 100 and 400, which tells you something about Americans. Maybe. I don't know. A slower pace of life is good for people generally. Anyway, sorry. So here we are. The very interesting thing is that when you start looking at some of these numbers, now on a log scale so they're a bit more friendly, blinking is really quite slow. You can go from Frankfurt to the US east coast and back again in the same time, so that's pretty good. The 60 hertz frame time, 16 milliseconds, is also quite long: Frankfurt to Milan, or Frankfurt to London, is a similar time to the time it takes to get something onto the screen, particularly when you add the monitor latency. So blink and you miss it. Lots of people are very worried about latency, and they don't have a good feeling for how long things take, so it's quite interesting to see some of these numbers. Also, in terms of typing, the average typist is supposed to do about three characters a second, a pro 6.6. Yeah, the human eye blink is quicker.
But even me typing, not very accurately, it's quite quick. And if you mash the keyboard, it turns out you're massively faster, like 10 times faster than the average typist. It's not good for the document, but anyway. So yes, there we go. I'm going to hand over to Caolán, unless you have anything to add? No, nothing to add on blinking. But yeah, the fundamental point is that networking is really, really fast: stuff comes from one end to the other and back in a very, very short period of time, which is great. So you generally don't have to worry too much about that part of things. So what we do is that we have a bunch of demo servers that are generally publicly accessible. And what we started doing recently is to use perf to sample once a second and record for an entire week what happens on the public servers. And at the end of the week, we generate a single flame graph from all of that, to see where our time is spent over the week generally. That's the demo servers. For multi-user testing, we have a call once a week; some of the people present in the room join us, along with people from other organizations and community members. And we just get a general feel for what it feels like in that little 10, 15, 20 person call: whether the applications are still responsive, and whatever issues arise in testing can be checked at that point. And that is also profiled and a flame graph generated, typically one for Writer and one for Calc in recent tests, which are all stuck up on GitHub so you can look at them yourselves, if you're interested to see the change over time in what we're looking at. We use it internally at Collabora, of course, with the deployment that is used daily there, and the same week-long profile that I mentioned for the demo servers is run on the internal one now as well. Yeah, so that's the tooling that we're looking at there.
And then there's interactive debugging, which you can do yourself in Collabora Online: you just go to Help, About, and triple-click on the dialog there. That'll bring up this debugging display that we're looking at here. There's loads of information in it. On the far right are tick boxes; as you check them, certain ones will display things in the bottom left corner to tell you things. But maybe more interesting is the one we're calling the tile overlays. When you type in the document, you'll get these flashing areas, and that's the part of the document that has been required to be redrawn because of your interaction. What you're really hoping to see, when people are typing, is a small rectangle around the area of change that they're actually making. If the entire screen starts flashing, it means that a whole pile of other things have been redrawn, or invalidated to be painted later on, and you want to avoid that. These are the kind of flame graphs that we look at over the week. Just for the purposes of reading these things: the colors don't matter in these flame graphs, or most flame graphs. What matters is the width of the bar: the wider the bar, the more time has proportionally been spent there. So you take a quick look at it, you see which is the widest bar, and you see whether you can make the wider bars narrower. There's nothing more to the profiling, really: just make the wide ones narrow. Yeah, so in this particular one, the widest bar is this whole gigantic pile of Boost Spirit Classic, whatever, which is all being used to detect if the PDF that people are opening is a particular type of PDF: the hybrid PDF from LibreOffice, where you can embed the LibreOffice document inside the PDF, so when you open the PDF, you also have the original document.
It just takes a ludicrous amount of time, especially over the course of a week, to collect that information, when it can be done in many orders of magnitude less. Yes. So it's good to see that sort of stuff disappear off the profile. You should never optimize before profiling, obviously. Cool. Thanks, Caolán. Storing previous tiles. Yeah, so we've done a whole lot of work to improve our tile rendering performance. We store previous tiles that have been rendered so we can see what the difference is and send just the difference. That saves a lot of bandwidth and reduces latency too. And we've completely rewritten how this is done in the last six months to a year. So we keep them compressed, with just a simple run-length encoding. Because we're extremely modern, instead of doing stupid stuff like using byte lengths and that kind of thing, we use bit masks, and you'll see why in a second. So the bit mask essentially says: is this pixel the same as the previous pixel? So you end up with a bit mask. We have 256-pixel-square tiles, so in four 64-bit numbers we can have the whole bit mask for a row. And yeah, it's pretty easy. This removes a whole load of things. Previously, we stored tiles uncompressed and compared them uncompressed; that turns out to be massively slower, it touches much more memory and uses much more space. We also did clever things like hashing each row while we were copying, but it turns out it's far better just to use the bit mask and some of that stuff. And Caolán and I did this fun thing with AVX2, why not? You hear about these processor-accelerated things, and after shrinking our inner loop down to almost nothing, it's still not as quick as it could be on the CPU. So this is how we do it. We load eight pixels into a single AVX register, which is just kind of nice, right? Eight pixels at a time. And the problem is we need to compare it with the previous pixels. So we shift a bit off the end.
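The bit-mask run-length idea described above ("is this pixel the same as the previous pixel?", four 64-bit words per 256-pixel row) might be modeled scalar-wise like this; the pixel format is arbitrary and this is a sketch, not Collabora's actual code:

```python
def rle_mask_row(row):
    # For a 256-pixel tile row, build four 64-bit words where bit i says
    # "pixel i equals the pixel before it": run-length encoding expressed
    # as a bit mask instead of byte run lengths.
    assert len(row) == 256
    words = []
    for w in range(4):
        bits = 0
        for i in range(64):
            idx = w * 64 + i
            if idx > 0 and row[idx] == row[idx - 1]:
                bits |= 1 << i
        words.append(bits)
    return words

def rle_compress_row(row):
    # Store the mask plus only the pixels that start a new run.
    mask = rle_mask_row(row)
    pixels = [p for i, p in enumerate(row) if i == 0 or p != row[i - 1]]
    return mask, pixels

def rle_decompress_row(mask, pixels):
    row, it = [], iter(pixels)
    for idx in range(256):
        same = (mask[idx // 64] >> (idx % 64)) & 1
        row.append(row[-1] if same else next(it))
    return row

row = [0xFF0000] * 100 + [0x00FF00] * 156   # two runs of solid color
mask, pixels = rle_compress_row(row)
assert pixels == [0xFF0000, 0x00FF00]       # only two literal pixels stored
assert rle_decompress_row(mask, pixels) == row
```

The win over byte run lengths is that the mask has a fixed, tiny size per row and, as the talk goes on to show, can be produced branch-free with SIMD compares.
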
We shove the previous one in; we shift it along, although actually it's really a crossbar switch here, a permute, to move things, because there is no shift across AVX registers that does that. And then we just compare these guys, and that gives you lanes that are either all ones or all zeroes. And then comes Caolán's magic trick. Well, yeah, there's AVX2, which is practically available; AVX-512, which is not practically available, has a particular instruction that will compare the two things for you and give you that bit mask, but it's not available in AVX2. If you look at what is available, though: if it were done on floats, then the operation is basically there for you. So you cast to floats, and this movemask thing brings the top bits in and gives you what you were hoping for in the first place: an individual bit result for each individual pixel that you've compared, whether they're equal or not. So you can pull the bits you're looking for out in no time, which is pretty awesome. You convert this into a floating-point vector, and you get the sign bits out of it, and that's your RLE bit mask. The nice thing about this is there's no branch, there's no compare, there's nothing: it's a simple flat loop with about five instructions. At the end of that, we then have to work out how many pixels to copy, because it's all very well saying these are the same, but you need individual copies of the different pixels one after another. So a bit of popcount will count the bits in the mask, and then with a clever lookup table, we can use these shuffling instructions to shuffle in the things that we need, copy them out, stack them up. Bingo: twice as fast, which is nice. And hopefully AVX-512 will make it even faster, if you believe that; you'll believe anything.
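The movemask trick can be modeled lane by lane: a SIMD compare leaves each 32-bit lane all-ones or all-zeros, and the float movemask then gathers each lane's top (sign) bit into one byte. A scalar Python emulation, purely illustrative, of what the intrinsics compute:

```python
def simd_compare_8(prev_pixels, cur_pixels):
    # Step 1: the 8-lane compare. Each 32-bit lane becomes 0xFFFFFFFF if
    # the pixels match, else 0x00000000 (what an AVX2 compare produces).
    lanes = [0xFFFFFFFF if p == c else 0x00000000
             for p, c in zip(prev_pixels, cur_pixels)]
    # Step 2: the movemask. Collect the top (sign) bit of each lane into
    # one 8-bit integer, one bit per pixel compared.
    mask = 0
    for i, lane in enumerate(lanes):
        mask |= ((lane >> 31) & 1) << i
    return mask

prev = [1, 2, 3, 4, 5, 6, 7, 8]
cur  = [1, 2, 9, 4, 5, 6, 9, 8]           # pixels 2 and 6 changed
assert simd_compare_8(prev, cur) == 0b10111011
```

In the real code those two steps are a vector compare plus a float movemask intrinsic, which is why the inner loop stays branch-free.
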
So yes, here we go. So this is a real problem here, and if only we could find the idiot responsible for it... We don't need to say. Yeah. What's sometimes interesting is that, while I said earlier narrower was better, sometimes wider is better, in the sense that when you look at the flame graph, individual threads should all be positioned separately; they shouldn't be combined with the main thread. So if you're not seeing work that you expect to see happening in a thread, on the left-hand side of your flame graph basically, then it means the threading isn't being used. So it becomes apparent that while there's this code that attempts to do this threading for the previous-delta stuff, the threads don't actually exist, and there's a flaw that needs to be sorted. So when you fix the flaw in the threading and bring it back in, you then see, on the far left-hand side, because it's rooted in the threading area, all that work placed separately in the flame graph. And while it's wider, it now means it's operating in a separate thread, and you've made progress. So it's nice to get twice as fast and then four times as fast on top of it. That's the right sort of approach. Yeah, I think we're going to skip through some of these because we're running out of time. But: working out where to do the work, either in the browser or not; and alpha pre-multiplying, and the stupidity of the web in having an RGBA, un-premultiplied-alpha API, when it's almost certainly going to be premultiplied underneath the hood. Underneath the hood, all the hardware, everything is doing premultiplied alpha because it's so much quicker. You can see the complaints online from people pushing RGBA into the canvas and getting something out that isn't the same, because it's been premultiplied and then un-premultiplied. Anyway, there you go. The web APIs are awesome. What else?
What should be on your profile? Well, it's very hard to know. This could be okay. Here's a whole lot of un-premultiplication here; it's a very old profile. But hey, there's a lot of rendering on the profile. Not very much painting, lots of delta-ing, so we fixed that. But actually, it's very hard to know if this is good or bad just by looking at it. With lots of bogus invalidations, you start to see lots of rendering, and that's not what you want. So everything should shrink, and you'll end up with a profile that looks the same, but everything feels much quicker. So we've done lots of work to shrink, I guess. Do you want to pick a couple of these now? Yeah, as you mentioned, with the multi-user document tests, we basically monitor what's happening. People are joining documents, and we got that full-document invalidation we mentioned happening. Clicking in headers and footers was causing the same thing. I think, fundamentally, because invalidations and redrawing on the desktop have become so cheap, while in the very distant past we might have been pretty good at keeping invalidations down, we've become slack in recent decades and treated it as cheap, and that has affected things. So let's have a look at that again and bring things down to smaller rendering areas and fewer invalidations. Yeah, and the good news is that improves LibreOffice, of course, as well; it's more efficient and clean on your PC underneath too. So, good. We've done lots of better latency hiding, in terms of more aggressive prefetching, so the next slide is there before you switch to it, so it's absolutely instant. Hiding latency in those ways is quite fun: enlarging the area around the view and maintaining that as tiles, and storing and managing much more compressed tile data in the clients, which we manage much better now. This is a fun one, but we don't have much time for it.
Yeah, well, classically, std::list in C++ was a linked list, and if you wanted to get the size of it, you had to walk the entire list from start to finish. That was sorted out decades ago. But for whatever reason, for compatibility purposes, if you use a particular Red Hat developer toolchain, then you seem to get the classic behavior of std::list back again. So where we were assuming that it was cheap and cheerful to get the length of a std::list, it turns out not to be the case in this particular setup. So you have to go back to a different approach, and it appears in your profile like that. But again, it looks normal: it should take some time to draw things, and it's normal to have a cache to speed that up. But if the cache has got 20,000 items in it and you're just walking this list... pointer-chasing, anyway. So, gone. Oh, fun stuff. Like, why not have a massive virtual device in the background that you re-render the whole document to every time you do something? Not great. Or another one: why not run a benchmark every time you load a document, to see how fast rendering is, allocating a whole load of memory and dirtying it, you know? Great. Yeah, trying to cache images: we didn't bother caching compressed images because they're compressed, right? So why bother? They're small, they're fine to have in memory. Except TIFFs are not so much compressed, so you eventually have a whole massive chunk of memory there. Using glibc trimming functions on idle to reduce memory usage. Yeah, trying to get better measurements of various things. Yeah, this is a fun one. This is the smaps one. Yes, yes, we're reading /proc smaps to see how much memory we're using. And the classic smaps has got multiple entries in it for the many, many parts of your process, so you have to read many multiple lines. So there's a relatively new one that has it all pre-added-up for you.
/proc smaps_rollup, which is exactly what we want, and the same code that read the old one should work with the new one. Then apparently we're running out of memory, or it's being reported that we're running out of memory, and it's all very, very bizarre. You can cat /proc smaps_rollup yourself and the numbers are good. Something's very odd, but it turns out that if you seek back to the beginning and then read it again, the numbers double every time you do it. There's an actual bug in the original implementation. It's not there in my version 6 kernel, but it is there on the 4.18 or 4.16 kernels the servers were running, so you have to be on just the right version for it to appear. So Linus fixed it, thank God. Quillholt found it. Well, it was fixed before we found it. But it's always nice to know you have to check your kernel is a quality kernel before you start asking it how much memory it's using. Yeah, hunspell in a loop was almost entirely dominated, not by actually spell-checking things, but by looking at the time. That's a little bit unfortunate, so yeah, some improvements there. And lots of other things, graphs showing speedups. We've got to get to usability in the last minute, so let me whizz through this. Here we go: accessibility, dark mode, pretty pictures. This is going to be fast. Keyboard accelerators. This is all of the good stuff for people: screen reading and all sorts of nice things, videos of that. Better page navigators at the side so you can see where you're going. And lots of little bits of usability polish: nice font previews. Was the page number thing yours? I forget who did that. Making it easier to insert page numbers so people can see what's going on easily, better change tracking and showing changes, AI, DeepL, stuff, and hey. The good news is there's more opportunity for performance improvement.
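Reading the rolled-up memory accounting the speakers describe can be sketched as below. The field names (Rss, Pss, and so on) are real /proc smaps fields, but the sample text and numbers here are invented for illustration; in production you would read /proc/self/smaps_rollup, and, given the seek-and-reread doubling bug they hit on older kernels, re-opening the file for each sample is the safer pattern than seeking back to the start.

```python
# Invented sample in the shape of /proc/<pid>/smaps_rollup output.
SAMPLE = """\
5601dead0000-7ffedbeef000 ---p 00000000 00:00 0          [rollup]
Rss:              123456 kB
Pss:               61728 kB
Shared_Clean:       1024 kB
Private_Dirty:      2048 kB
"""

def parse_rollup(text: str) -> dict[str, int]:
    """Return the kB value for each field in an smaps_rollup-style dump."""
    stats = {}
    for line in text.splitlines():
        if line.endswith(" kB"):
            key, value, _unit = line.split()
            stats[key.rstrip(":")] = int(value)
    return stats

stats = parse_rollup(SAMPLE)
print(stats["Rss"])  # total resident set size in kB, per the sample above
```

The convenience over classic smaps is exactly what the talk says: the kernel has already summed the per-mapping entries, so one short read replaces walking hundreds of mapping records.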
So we're still having fun. Hey, come join us, there are some cool profiles to read. Right. Well, yes. At the moment in Calc, when you're typing, the entire row invalidates beyond the right-hand side of where you're actually typing. So we brought that down to the cell in the most generic case, but it's not done for Writer. In the Writer case, if you're typing, we are invalidating all the way to the right-hand side of the screen, so we'll shrink that back down again. We have some new metrics in that debugging overlay thing that give you an indication of how many of the updates coming through are the same data as before the update came through, and the numbers are staggeringly high. So there's plenty of room for improvement: invalidate less, send less data down. So that's something we have left to fix. Yeah. Something that's always been troublesome in LibreOffice is the treatment of the alpha layer. We picked the wrong direction from everybody else: everybody else picks transparency, we picked opacity, or vice versa. So we have the opposite direction, and everybody who wants to actually output something in the real world that handles transparency, we have to reverse our transparency. So that's problematic. That's now fixed. That one is fixed. But we've also kept our transparency layer in a separate buffer rather than in the actual bitmap, and if we put them together someday, that would make things a lot easier, I believe. Yeah, it's the Win16 API decisions that are still with us. But anyway, we're getting rid of them quickly. That's great. Um, yeah, performance regression testing with Valgrind, and pipelined loading. So at the moment... oh, we've got five minutes. Oh, look at that. Fantastic. I went too quickly. No, you're doing fine. Okay, right. Fine. Excellent. I think we're nearly at the end. Um, so, pipelined loading.
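The alpha-direction mismatch the speakers describe is, at the level of a single channel value, just an inversion. This is a minimal sketch of the conversion, with an invented function name; the actual LibreOffice fix is of course far more involved than one helper.

```python
def transparency_to_alpha(t: int) -> int:
    """Convert a 0-255 transparency value to conventional alpha (opacity).

    Fully transparent (transparency 255) becomes alpha 0;
    fully opaque (transparency 0) becomes alpha 255.
    """
    if not 0 <= t <= 255:
        raise ValueError("transparency must be in 0..255")
    return 255 - t

print(transparency_to_alpha(0))    # fully opaque in transparency terms
print(transparency_to_alpha(255))  # fully transparent
```

Cheap per pixel, but having the opposite convention from every downstream consumer means doing this reversal at every boundary, which is why unifying the direction (and eventually merging the separate alpha buffer into the bitmap) simplifies things.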
So at the moment we essentially fetch a web page, parse all the credentials we need, check them ourselves, load lots of JavaScript, open a WebSocket, and only then do we actually see if we can load the document and start checking who the user is. This is really foolish. Instead, from the first fetch we can be checking the user, downloading the document, even loading the document, ready, when we get the WebSocket, to have a pre-rendered version. So this very substantially reduces startup time; it makes it incredibly quick. You already have a huge advantage in that you have a real server at the back end, and you're not having to JIT millions of lines of code in your browser, from JavaScript or WebAssembly into something. So it should be just amazingly fast, and this is a great way to speed that up even further. And with a real server, you may be sharing it, but when you arrive, your server is probably not doing much. In fact, the CPU load on most of our servers is extremely low. So there are suddenly all these threads ready to render your document and get stuff to you quickly. And Valgrind: we've done a whole lot of work to get it to run nicely under Valgrind. With our privilege model and container model that's a bit of a problem, so we have some code now that turns everything into one process, so you can load and collaborate on one document, automate that, and run it in Valgrind. And why would you want to do performance profiling in Valgrind? It seems like a retro choice, right? But the beautiful thing about Valgrind is the simulated CPU: anybody can run the same workload on their machine, and between two runs it's the same thing. And Valgrind luckily doesn't have a simulated thermal-management system that randomly throttles your CPU performance.
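The pipelining idea above, kicking off the credential check, document download and pre-render concurrently instead of strictly one after another, can be sketched like this. The three step functions are invented stand-ins that just sleep to model latency; the real system overlaps network fetches and server-side work, not thread-pool toys.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real startup steps.
def check_credentials():
    time.sleep(0.05)
    return "user-ok"

def download_document():
    time.sleep(0.05)
    return "doc-bytes"

def prerender_first_view():
    time.sleep(0.05)
    return "tile-png"

STEPS = (check_credentials, download_document, prerender_first_view)

def serial():
    # The old shape: each step waits for the previous one.
    return [step() for step in STEPS]

def pipelined():
    # Overlap the independent steps; total time ~ the slowest one.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(step) for step in STEPS]
        return [f.result() for f in futures]

t0 = time.perf_counter()
serial()
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
pipelined()
t_par = time.perf_counter() - t0
print(f"serial {t_serial:.2f}s, pipelined {t_par:.2f}s")
```

Both paths produce the same results; the win is purely latency, since the serial version pays the sum of the step times while the pipelined one pays roughly the maximum.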
And it luckily doesn't have people screwing with your cache memory, running cron jobs in the background, thermally recalibrating your disk, and all this other stuff. So what you discover is that between two identical commits you get small fractions of a percent difference in the Valgrind numbers, which is beautiful, because performance tends not to go away in big jumps. It can go in big jumps, but it tends to go slowly downhill, and if the noise is bigger than the slow downhill, you've no idea where the problem is. So it's much better to have a little series of steps going down half a percent at a time and go: hey, we can get rid of that and that. So this is really vital. LibreOffice uses this in its perf automation, which has beautiful web pages with graphs, and we'll be applying it to Collabora Online to try and avoid regressions. Yeah, someday soon. Someday soon. Yeah, Neil, we think, probably. Anyway, anything else? No, I think we've covered plenty. Well, and yes, of course, we can't do anything without our partners and customers that pay for it all; blah, blah, blah, commercial plug. Good. Yes. That's good, job done. And conclusions. Yes. So, computers are unbelievably fast. This is something you should take home. The quarter of a nanosecond that a cycle of your four-gigahertz processor takes is just unbelievable on the scale of the hundred-plus milliseconds it takes you to blink your eye. It's fantastically speedy in a way you can't explain. And the network latency to almost anywhere: you can go London to Frankfurt and back three times in the time it takes you to blink, right? It's unbelievably fast. In fact, you can go Frankfurt to Milan faster than your monitor can refresh. So it's quite amazing when you start looking at the times of things.
Architecture is really a bet on CPUs and networks getting faster and cheaper. Has anyone noticed a trend there? I think there might be something in that. And we're basically racing the hardware guys. I mean, we do stupid stuff, obviously, and then we remove it later, but the hardware people are also trying to beat us by running stupid stuff quicker. That's their mission. And it is extremely smooth; don't get the feeling that it's bad. Try it. Most of these problems you'll only start to see when you have twenty-plus people collaboratively editing a document. So yeah, it's kind of cool. Give it a try, try the latest version, give us some feedback, get involved. There's lots of fun to get involved with. Yeah, I'd like to plug two things. As I mentioned earlier, the profiles that we have for Calc and Writer are uploaded to GitHub once a week: a generic Calc performance profile and a generic Writer performance profile. Search in the online GitHub issues and you can see all of the changes we've mentioned there in the past, and you can even see the progress, and the occasional blip during a call where things go horrifically wrong and get sorted out in the next one. So yeah, plenty to see of what we're doing. There are links in the slides, which you can't see, to the profiles, and to getting involved in LibreOffice technology. Thank you. That's it. You've been very patient. Thank you.
Document collaboration made simpler: Revealing the concept of rooms in ONLYOFFICE DocSpace
So, we're going to be presented with the rooms in ONLYOFFICE DocSpace, presented by Alex. Hello, hello everyone. My name is Alex, I'm with ONLYOFFICE, and this is my second time here at FOSDEM in Brussels. Thank you a lot, guys; everything is brilliant. Today I'm going to take you through the challenges of document collaboration and how ONLYOFFICE can help you to overcome all these challenges. I will also be introducing you to some existing features in ONLYOFFICE and some updates. So let's get started. In today's world it is very important to work online, and if teams have a lot of documents, things can get messy very fast. So for us developers it is important to create, and for users to pick, a good solution for organizing online document collaboration. We at ONLYOFFICE have more than 200 unique integrations, and we have a long, long list of requirements from the people who are trying to integrate our solutions into their services; that's why we are absolutely sure that we know almost everything about integration. So, wondering what the most interesting points are? Here they are. First, we want to make sure that we save your time and effort by automating the everyday routine. Having a lot of features is very important, and it should be easy to add new features if needed. The next thing to consider is support for all popular file formats. If a software product is built on up-to-date technologies, it will most likely be reliable, suit user needs and even have some killer features, like we do. And of course security will always remain one of the most important questions for everyone who works on their documents online.
Cross-platform apps give us the ability to work on any document, in any browser, from any device, and remote workers need to be able to work together. All these challenges can be difficult to overcome without using the right tools, but ONLYOFFICE can ensure your effective teamwork. Talking about usability, we want to provide a great end-user experience: if a software product is easy to use, it boosts your productivity. And accessibility: how easy it is for people of all abilities to use your product. If software gets lots of bug fixes and updates, it indicates that it is well maintained, and here a big thank you to our community, which plays a significant role in sending us information about bugs and troubles, and of course feedback. And last but definitely not least is availability: we are constantly increasing the number of distribution forms, or builds, of our products. Considering all these factors, we at ONLYOFFICE decided to create a new product, ONLYOFFICE DocSpace, a product for organizing secure online document collaboration. Actually, today is not the first time we are talking about this product: I was talking about it at FOSSASIA in 2023, but back then only a beta version was available, and now we have a ready-to-go product that can be integrated, and is already integrated, into many well-known services. Before we dive into each factor, I would like to share the history behind the idea. The journey started in 2021, when we decided to completely rewrite our productivity platform. When I'm talking about that platform, I mean the package with CRM, project management, mail client and many, many other features. The main idea was to implement infinite scalability, and in that same year we released the free-for-personal-use App Server that made it faster, more stable and more functional, and the idea shifted slightly: we decided not only to rewrite the architecture but also to change the mechanisms of working with documents, and in 2023 we released ONLYOFFICE DocSpace.
The main point for everyone who tries to integrate our solutions into their services is that they can benefit from our extended experience in integration. Many office solutions and productivity platforms, when working with files, create a mess of unstructured files, folders and subfolders, but with ONLYOFFICE DocSpace you are able to create rooms, DocSpace rooms, which allow you to clearly structure your files depending on your requirements and project goals. And when you have a long to-do list every day, it is smart not to waste much time on everyday routine like creating or sharing files. In ONLYOFFICE there is no need to work with each file individually: you can create a room, and all files within that room will be available according to the access level of the room. There are a few types of rooms at the moment. Let's start with collaboration rooms. These rooms are perfect for those who want to work on documents together, to co-edit their documents. In these rooms you can make use of all the beautiful co-editing features of ONLYOFFICE software: commenting, mentioning, track changes, revision control and many, many other features. We have built-in chat and Telegram plugins right within the editors to communicate, and we also allow making audio and video calls using plugins for Zoom, Jitsi and Rainbow. When inviting a user into your room, you are able to set the access level: it may be administrator with full rights, power user with extended rights, or editor or viewer. The next type is public rooms. You can invite anyone using public links, and, what's very important, there is no need to register anywhere, and you can generate multiple links with different access rights. You are also able to apply a password for all files in the room, or, for example, restrict copying of the file content, or disable downloading and printing.
Yeah, and there are also custom rooms that allow you to apply your own settings for any custom purpose you have in mind. Again, everything depends on your use case: you can create a room for requesting form filling, for requesting commenting, or for document review. ONLYOFFICE DocSpace includes different viewers and editors for all file types. Let's start with digital forms. Here you are able to work with your forms and form templates in DOCXF or PDF format. These PDF forms can be filled in, can be shared with anyone for filling, or you can work with files created by alternative applications, of course. You can easily view, create or edit text documents. I hope you are aware that ONLYOFFICE works with almost all text file types: Office Open XML files are supported, but if you'd like, you're welcome to work with Open Document Format, RTF, TXT, HTML or anything else. The same for spreadsheets, where you can work with your sheets and use more than 400 different formulas and functions. You are also able to create slides using a variety of animations, transitions and objects. And now to PDF again. PDF is widely used in document workflows, from meeting brochures to contracts for signing, and now you have a PDF editor: you are able to annotate your PDF files using the ONLYOFFICE editors. You can convert your Office Open XML files to PDF and vice versa, converting PDF files to Office Open XML to edit them. Additionally, you are able to work with your electronic books, which can be converted. The next feature is integrated media players for working with images, video and audio files. The functionality of the described solution can be extended by using plugins and AI integration in the form of a ChatGPT plugin: you can work with your text, make simple requests, generate keywords or images.
Everything depends on your ChatGPT license: if you have a paid ChatGPT license, you are able to work in the editors with the paid functions; if not, just work with the free version. ONLYOFFICE DocSpace is created using up-to-date technologies: we use .NET Core to ensure a reliable backend, and for the frontend we use React, to make sure that everything is mobile-friendly. ONLYOFFICE DocSpace is a safe way to handle your documents. We follow all GDPR and HIPAA rules, treating your personal information very carefully. With flexible permissions and JWT you are able to have complete control over your files, and you are also able to add password protection or watermark everything you have in your room. For data in transit we use HTTPS, of course, and for data at rest we use industry-leading AES-256. Moreover, administrators can apply additional settings like trusted mail domain configuration, session lifetime configuration, or, for example, two-factor authentication or single sign-on to have control over the login procedure. And of course backups and recovery are also here. I'm glad to inform you that in the middle of 2023 we included ONLYOFFICE DocSpace in our main HackerOne program. We received a few reports, and all these reports were fixed in a timely manner, so thanks to the ethical hackers who work with us on HackerOne. ONLYOFFICE DocSpace is primarily designed for web-based operations, and we understand the importance of using it on mobile devices. ONLYOFFICE DocSpace offers a user-friendly interface with intuitive navigation between rooms and settings, for example. We conducted several usability tests with more than 200 people from different countries and different industries, and according to their feedback, ONLYOFFICE has an overall usability score of 4.1, with the main advantages being simplicity, clarity and a modern interface. Of course, you can customize the product.
You can change the space name and the URL of the DocSpace portal, you can change the color scheme, and to support your corporate style you are able to use your own logo or change the welcome page. As for accessibility, ONLYOFFICE DocSpace and the ONLYOFFICE editors are designed to accommodate users with special needs. There are a few options like screen readers and hotkeys, but we also support different plugins, like voice-to-text, text-to-voice or translation, for example; there are a lot of different plugins. For developers and integrators, ONLYOFFICE provides the ability to extend the functionality, and here you can find information about our plugin SDK. You are welcome to create your own plugins, and we have a few plugin samples on our GitHub page. For example, the PDF converter allows you to convert your PDF files to Office Open XML and vice versa, as I said already. The next one is the draw.io plugin, for working with professional-looking diagrams. There is a plugin available that converts your audio and video files to text, which you can of course store in your rooms. The OpenAPI documentation shows how to integrate ONLYOFFICE DocSpace rooms into your product and give the visitors of your website the ability to view and interact with documents right on your web page. DocSpace rooms can be integrated into your service as an iframe, the same as we already have with ONLYOFFICE Docs, just an iframe, and of course the data display settings can be configured. And now the main point: there are a lot of services without document management functionality, without document editing functionality, or without any cloud storage functionality, and ONLYOFFICE DocSpace allows you to add everything you want here. All these features are available, and it can be integrated into a CRM, into a CMS, into any messenger. And the next example is one of the most popular collaboration solutions on today's market.
Just an example. I'm glad to say that we have the ONLYOFFICE DocSpace for Zoom integration. Just go to the Zoom marketplace and look for ONLYOFFICE DocSpace; you will be able to install DocSpace for working on your documents right within a Zoom meeting. No additional actions like registration are required. I think everyone can remember somebody sharing a document and saying, okay, let's write it down together, in a Zoom session. But with ONLYOFFICE there is no need to share your document with anyone or to give someone access to your screen to work on documents together: just use the ONLYOFFICE DocSpace application. The same for WordPress: ONLYOFFICE DocSpace can be integrated into your WordPress pages. These are just two examples that show that our product can be integrated into any service. In 2023 we released ONLYOFFICE DocSpace 1 and ONLYOFFICE DocSpace 2 with more than 50 new features: for example, public rooms are available now, a right-to-left interface in beta, and system plugins are supported. And for the online editors, as part of the DocSpace platform, we delivered three major updates with more than 200 bug fixes and about 200 new features. And the latest version is available now, released just a few days ago. In the latest release, ONLYOFFICE Docs 8, we have a few very important features: again, fillable PDF forms, we have the PDF editor right now; the interface for plugins has been updated; and we have the long-awaited right-to-left interface. That's very important, and we understand that we have a long, long way to go with that functionality, I mean right-to-left; this is why we are looking forward to your feedback about it, and we really need feedback from our clients and integrators. The next point is our performance optimizations. Here you can see some numbers: we have moved some portions of the service to the client side.
And I think this will definitely add some more points to the ONLYOFFICE editors. And thanks to our partners from ownCloud, we now have load testing results for 100,000 simultaneous connections. Having 100,000 simultaneous connections, I mean that all these connections are active, sending information from the client to the server. You can see the details of the infrastructure: everything is in Kubernetes, twelve big machines for the document server and two big machines for k6, just to generate that huge traffic. ONLYOFFICE DocSpace will soon include private rooms. We are also going to implement electronic signatures, and there are more features that we plan to add. ONLYOFFICE DocSpace can be used as a cloud solution, just look for ONLYOFFICE DocSpace, and if you'd like, you are welcome to install it on-premises, in Kubernetes or any type of deployment. Try it in the cloud, or download the server version. So, thanks a lot for your attention. If you have any questions, we'll be happy to assist; we are here, the two guys in the ONLYOFFICE t-shirts. Yeah. Thank you. Thank you very much. Are there any questions? Yes. Just a question about interfacing between the types of documents, worksheets and so on, because documents in database formats and Writer and so on have problems; with Google, from Google Docs to Google Sheets, it doesn't work. So, do you have this kind of functionality? And also it was very interesting for me, to gain a lot of time, converting speech to a document. So the first question is about whether there is an integration between sheets and documents, and the second one is about converting speech to a document. On the first question: yes, we do. Sure. Yeah. We do have plugins to add that functionality right within the editors; you can work with that, but you need to install an extra plugin. And what about the integration between the two types?
Yeah, between the two types of editors. Yeah, I see. But as far as I understand, you are working with XWiki right now? And no. No. Yeah, it's a free question. And no, I mean the product. Maybe you just have one of the previous versions of ONLYOFFICE, for example, and there was your question about the interface. Yeah. And regarding your earlier question, you will be able to find a solution in one of the next versions of ONLYOFFICE. And what about the integration between these two types of editors? Yes, we have that again in the latest versions of ONLYOFFICE; for example, try working with ONLYOFFICE Docs 8. Great. Any other questions? There was somebody there, I think. No? Yes? No? Okay, thank you very much.
openDesk - The Open Source collaborative suite
Okay, so welcome to the talk about openDesk, the open source collaborative suite, presented by Clément Aubert from XWiki and Wieland Lindenthal from OpenProject. Enjoy. Hello. Hello, everybody. Thanks for coming. The funny thing is: we are not openDesk. We're just vendors, we're just contributing to openDesk, but we'll come to this later. Yeah, openDesk, what is that? It is an idea of building an alternative to Microsoft Office 365 or to Google's apps for the public sector. They are used everywhere, and the public sector wants to have an alternative to them. So if you really want to go after the big elephant that is not in the room: we're trying to create an alternative to that. It's probably the biggest opportunity for open source software right now, at least in the realm of collaboration and working together. openDesk is a powerful initiative of the German government with the goal of providing a serious alternative to the proprietary Big Tech establishment. It unites independent open source software vendors to create a sovereign workplace tailored for the public sector. We two here, we are just nerds, just software developers. I work for and co-founded OpenProject; that is one of the parts of the solution that we will talk about. But I'm a software engineer, I'm not openDesk. And this is Clément. Hello. Well, you may have seen me in this room during the talks this morning. My name is Clément Aubert. I'm an XWiki committer, and I also work at XWiki SAS, doing sales mainly. Okay. So let's start by discussing a little bit of the story of openDesk and how we got there, essentially. The issue with collaborative suites goes back a long way, right? Since 2015, especially in the western EU; when I'm talking about the western EU, I mainly mean the French and German governments, because from our point of view that's where we have the most information, let's say.
Since 2015, what we see is that there are growing concerns about the US-based cloud offerings that exist for collaborative suites. These concerns mainly regard the fact that you don't necessarily have control over your data: you don't know exactly where it is being stored or how it is being processed. There are, especially, privacy risks: if you are putting sensitive data in there, it could be accessed in various ways. In particular, since 2018 there is an extraterritorial law in the US that allows the government to ask a company for access to customer data, even though the customer may not be a US citizen, usually to get more information about that customer. And then there is the big question of lock-in: when you migrate your data, how easy is it to move it back? Are there open standards that allow you to go the other way? These concerns exist, and so, since the end of the 2010s and the beginning of the 2020s, France and Germany have started to create rules for the handling of critical data in their governments. In France in particular, there are initiatives called Cloud Nubo and Cloud Pi, which are essentially two cloud specifications used for public administrations: one for, let's say, conventional data and another one for more sensitive data. Germany also started an initiative, the Deutsche Verwaltungscloud-Strategie, which is kind of the same: it creates a standard to protect the data used by public administrations from external actors. To be clear, this is essentially infrastructure, or definitions of infrastructure, that should be implemented by the state, so that in the long run states have the capability to host data securely themselves.
In the meantime, there is the question of security certifications: basically making sure that the different vendors that provide a specific service for you have the necessary level of security to provide that service. In the same spirit, two standards were created over the past years. In France we have SecNumCloud. SecNumCloud was created by ANSSI and is basically derived from another security standard, ISO 27001; if you work in security you may know it, because it's well known. What SecNumCloud does is take most of the rules from ISO 27001 but add controls on the nationality and location of the people who are allowed to process the data. The whole goal of SecNumCloud is to protect against extraterritorial laws, in particular the US CLOUD Act that I mentioned. In Germany there is also another certification, called BSI C5, and we will talk about it afterwards because it's quite important. The goal is basically to be able to qualify a specific application, or the cloud offering you want to deploy it on, for use by public institutions. The C5 certification is a little bit different in the sense that it's not about extraterritoriality at this point; it is more about making sure that the application is correctly developed: there is a good standard of quality for the changes you apply to your application, you validate its compliance, and so on. In the long term, for those residing in the EU, there is a vision to create one unified standard, mainly based on SecNumCloud and C5, which could be called EUCS, and which would encapsulate these and basically allow any vendor to qualify against one standard in order to be deployed in the different governmental public administrations in the EU. So that's very nice.
We are actually introducing new laws that allow us to control what can be used by public administrations, but in the meantime we may not be creating the solutions themselves. That's another issue that has been tackled by France and Germany over the past few years. In France, around 2021-2022, a project was started, pushed by the DGE, the Direction générale des Entreprises, a branch of the finance ministry. The goal was to create different consortiums that would lead the creation of an alternative suite to things such as Office 365 or Google Workspace, the ones that Wieland talked about. It's a project that started in 2023; the total invested by the state for the three consortiums is around 23 million, and the idea is to have results by 2026. We will not talk about them much, because they are actually not fully based on open source software, so it's not really in the scope of this talk. The one we are interested in here is the project from Germany, which is openDesk. The idea of openDesk is a really different approach. In Germany, the Ministry of the Interior (I will not say the name in German, because it's going to be a nightmare) decided in 2022 to create a consortium made of different actors; we'll see them afterwards. Essentially Dataport, which is a big service provider for public administrations in Germany, as well as a couple of software vendors providing open source software. The Ministry decided to group them together to create a platform which is coherent, fully open source, and can basically last longer and be maintained over time. To give you an idea: in France, the financing for the three consortiums I mentioned is around 23 million, knowing that part of it is a loan, so you have to reimburse it.
And then in Germany, it's basically orders, not a loan, and the budget in 2023 was 23 million, so a little bit more budget. So if we go into the details of OpenDesk: the project was initially started by the German Ministry of the Interior. Today it has been handed over to ZenDiS, which is also a public organization that was created to handle open source projects created by the federal state. And so, as I mentioned, the project is currently co-managed by multiple actors: there is the Bundesministerium des Innern, the Ministry of the Interior, the BMI; PwC, which is helping to find use cases, basically the correct user stories, the issues that we are trying to solve with our collaborative workplace; Dataport, which I mentioned, which is providing hosting for the project and is also doing a lot of work organizing the different vendors, all working together to create a unified workplace and a product that works. And then we also have Bechtle, which is present to help with the financing of the project. As for the vendors: today in the OpenDesk project we have a little bit more than 500 people working, from the different vendors plus PwC, the BMI, Dataport and Bechtle. So that's quite a large project in the end. I will get to the names of the vendors afterwards, but this is just a quick view of what we're trying to achieve: basically one solution, one workplace that fills different needs.
We want to have email management, of course: you need emails, you also want to create events, so you will have calendar modules, contacts, task management. We also want the file management part, where we can create new files and collaborate on them. Then you may want to carry out projects within your organization, so for that we have a project management tool and a knowledge base tool. And you also want to communicate with your co-workers, so there are modules for chat and video conferencing. For all of these, the idea of OpenDesk is essentially that these modules should be made of solutions that can be switched easily. The ideal vision of OpenDesk would be that you have a piece of software providing the email functionality, but let's say that tomorrow you want to switch it, you don't want to use the default software that is provided, then you should be able to make the switch fairly easily. In practice, today we are providing some sort of default implementation, meaning that we have one piece of software corresponding to each of these features, and we don't really have two options for, say, file management. Yes, that's a very important part, because the German state doesn't want to get back into vendor lock-in; that's the reason why we talk about mail and not about a certain vendor. Exactly. So if we look a little bit more at the details: when it comes to everything related to email, agenda, contact management, calendar and tasks, Open-Xchange is handling that big part of the project today.
When it comes to file management, it's mainly Nextcloud, and within Nextcloud, when you want to collaborate on different files, two external tools have been integrated: Collabora, for which we had a talk a couple of minutes ago, and also CryptPad, which we had a talk about this morning. They are used to edit office files, and CryptPad is used today for editing diagrams. When it comes to communication, we have Element, based on Matrix, which is handling everything related to chats between teams, and there is a clever integration between Element and Jitsi, provided by Nordeck, to allow video conferencing: basically rooms within Element where you can start calls and chat on a specific subject. Then for the project management capabilities, there is us, OpenProject, as the project management tool, and XWiki for knowledge management. I would say that OpenProject and XWiki are kind of the latest arrivals in the project: at least for XWiki, we arrived at the end of 2022, so not that long ago. Finally, this whole portal that you see here is managed by Univention, which is another solution that allows you to create user portals and also handles access management, authentication, the user list, et cetera. Single sign-on. Yeah, single sign-on also. Today, the hosting of everything related to the development of the project is managed by Dataport, so thanks to them. At some point we'll try to do a demo if we have internet, but first let's go a little bit more into the technical details.
So today the architecture of OpenDesk is mainly based on Kubernetes, mainly because we are integrating a lot of different components, and at some point it was decided to use Kubernetes because it was the easiest way to integrate all this complexity into one big package. So we are using Helm charts and GitLab CI to deploy to a Kubernetes cluster, and then basically each component of this cluster is one application managed by one vendor. Each vendor can provide either a generic Docker image that needs a little bit of configuration to work in the context of OpenDesk, or sometimes a tailored custom Docker image that meets specific requirements; and sometimes there are also other ways to deliver the applications. A quick look at the features I may not have fully covered: of course it comes with a user directory managed by Univention. All the applications are connected to each other through OpenID Connect, which is easy in some ways and very standard, and we also have unified navigation across the different applications; that's something I want to show you in the demo. Finally, the goal of the project is really to make sure that all the components we are integrating are fully connected together. That's going to be very important. I would say it's a work in progress, but we'll see in the later part of the talk that we have, for example, integrations between OpenProject and Nextcloud, which are very exciting to see. A little note also on the distribution of OpenDesk: of course it's made of open source software, but the whole build itself, the whole project itself, is open source. You can actually access it on Open CoDE, which is a GitLab instance used by the German state to publish its open source projects.
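To make the OpenID Connect point above concrete, here is a minimal sketch of what the shared login looks like from an application's perspective: each app receives an ID token (a JWT) from the identity provider and reads identity claims such as `sub` and `email` from it. The claim names and values below are illustrative examples, not taken from OpenDesk itself, and a real deployment must verify the token signature against the provider's keys before trusting any claim.

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload of a JWT such as an OIDC ID token.

    WARNING: this skips signature verification on purpose, to show only
    the claim structure the integrated apps share. Production code must
    verify the signature against the provider's JWKS first.
    """
    payload_b64 = token.split(".")[1]
    # base64url requires its padding to be restored before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a toy token (header.payload.signature) to illustrate the format
claims = {"sub": "alice", "email": "alice@example.org", "iss": "https://idp.example.org"}
b64 = lambda d: base64.urlsafe_b64encode(json.dumps(d).encode()).decode().rstrip("=")
token = f'{b64({"alg": "RS256"})}.{b64(claims)}.signature'

print(decode_jwt_payload(token)["sub"])  # → alice
```

Because every application reads the same claims from the same provider, the user directory and single sign-on described above stay consistent across all the integrated components.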
There you should find a mirror of all the source code used by all the components of OpenDesk, but you will also find dedicated repositories with all the Helm charts needed to deploy the platform. Another thing: in order to provide some security and compliance, every release is signed, we create software bills of materials, and we also audit the licenses of the different components being integrated into the workplace, to make sure that we are really completely free and open source. The idea is essentially to have something that will work for the German administration, and working for the German administration means being BSI C5 compliant. So as part of the project, we also do a bit of work to help each component match the certification, in terms of quality of development, for example. Finally, and maybe we'll talk a little bit about it later on, there is a big concern for accessibility. It's one of the most common complaints about open source software: it's nice, it has a lot of features, but maybe sometimes it's not fully accessible. As part of this project, we have to meet certain accessibility guidelines in order to be usable in the public sector. So this is also part of the final thing that we get from the project, along with security. On the offering side, the long-term goal for the project is to create offers for the public administration in Germany, mainly through two entities. One is ZenDiS, which we talked about, which is currently coordinating the project at a very high level from the federal government perspective.
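The license audit mentioned above can be pictured as a simple allow-list check over the software bill of materials. The sketch below walks an SPDX-style SBOM and flags components whose declared license is not on the allow-list; both the allow-list and the SBOM are made-up examples for illustration, not the actual list used on Open CoDE.

```python
# Hypothetical allow-list; the real list of licenses authorized by the
# German administration on Open CoDE differs and is longer.
ALLOWED = {"MIT", "Apache-2.0", "GPL-3.0-or-later", "AGPL-3.0-or-later"}

def audit_sbom(sbom: dict) -> list:
    """Return the names of packages whose declared license is not allowed."""
    return [
        pkg["name"]
        for pkg in sbom.get("packages", [])
        if pkg.get("licenseDeclared") not in ALLOWED
    ]

# Toy SBOM resembling the Perl-library surprise described later in the Q&A
sbom = {
    "packages": [
        {"name": "nextcloud", "licenseDeclared": "AGPL-3.0-or-later"},
        {"name": "some-perl-lib", "licenseDeclared": "NonStandard-Header"},
    ]
}
print(audit_sbom(sbom))  # → ['some-perl-lib']
```

A check like this, run in CI against every published image, is how a stray dependency with an odd license header gets caught before release rather than after.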
And also Dataport, which is participating in the project as part of the project management, but Dataport also has its own suite, which is basically a fork of OpenDesk with some extra components, not necessarily open source, that have been added or modified to answer some specific needs. So, yeah, today it's mainly an offering for the German market, not really for the rest of Europe; that's going to be a challenge for later. But there's interest from all over Europe in the product: the French government is also interested. Yeah, Austria, I think, and Sweden. So it's a big thing in Europe, and people are looking at that project from all sides. Okay, so let's try to do a quick demo. Let me see. Okay, so let's do it very quickly. Whoops. So this is an OpenDesk instance that we are using for review. I just want to show you very quickly the different applications that are available. Let's say that I want to go to my emails. We are fully dependent on the FOSDEM network, so I hope it's going to load. Okay, great. So for email, we are running on Open-Xchange. If you know Open-Xchange, you probably already recognize this interface. What you see is that all the applications have been customized so that they have a unified color theme, in order to have a nice user experience. I can use a user directory based on the users registered in my OpenDesk instance. In my email, I can attach files, and when I'm selecting files, I have the choice of uploading a file from my computer, but I can also link a file from my Nextcloud account. If I do that, it will create a share automatically and make sure that whoever receives the email has the necessary access to see the file. This one, no. Okay, this one, thank you. Okay, so apart from that, in the email application, you will see that we have this little button here, and that's the transversal menu.
It allows us to switch from one application to another, and it will change depending on your access rights. So we can look at, for example, Nextcloud, integrated for file management. Here I can create my files, I can create spreadsheets, and in that case it will create a document that will be opened within Collabora, so I can edit it. We also have files that can be diagrams, which we can edit directly with CryptPad; we integrated CryptPad within OpenDesk for that one specific functionality, and here we are actually using draw.io within CryptPad. We can also look at chats, which is a managed instance of Matrix with Element as the frontend, where I can have discussions with the other members of my OpenDesk instance, or potentially other members of the Matrix federation. Here you see that I'm actually part of a room which is used for a specific meeting within Matrix. These rooms can be created automatically: when I'm in the agenda of Open-Xchange, I create a new event and say that I want to have a conference in Matrix, and it will create a link that will take me to this room, where I have video conferencing. I knew that was a bad idea. Let's leave. And I have a whiteboard and I can also chat. Finally, we also have project management with OpenProject and knowledge management with XWiki. Here I can create my new project, I can create my work packages, create some milestones and link them together. I won't go too much into the details because I don't want to spoil it. And here we also have a customized XWiki instance. Today we are synchronizing users and rights; we don't have particular integrations with the other applications yet. So that's a very quick demo. And by the way, this is released, so you can download it, try it and play around with it. It's on Open CoDE. It's open source.
So about the roadmap of OpenDesk: essentially the goal is to have a stable version this month. As you said, it's already released. The main issue when it comes to deployment, if you want to try it out, is that today a good part of the documentation is still only in German, and if you don't speak German, it can be a little bit difficult. On the longer run, in 2024... We are trying very hard. Yeah, yeah. So a few contributions for translation are welcome, I guess. Exactly. You can do a pull request. And so the idea in 2024 is to have more improvements in order to improve the BSI C5 compliance. Remember, the goal is to deploy this within the German federal administration and also within some German Länder, so compliance with any standard that exists for the public sector is really important for the project. And it's good, because it also helps improve the open source projects behind it, the ones being bundled in the platform. Yeah. So, I mean, we are software vendors, like OpenProject or XWiki. For us, the perspective on this whole OpenDesk project is a little bit different, because we already have a product, we already have clients, we already have a roadmap. And then suddenly someone says: hey, we want to integrate you, but you should have the same look and feel, we want single sign-on, we want you to finally come together and create deep integrations. That is challenging for us, because usually we tend to stay in our own soup: it's easier to build stuff in our own software. Integrations are complex. You need to organize, you need collaboration, meetings with others, you need to align roadmaps and priorities. That's difficult. And now suddenly someone from the outside comes and says: no, we want you guys to work together. And we will pay you, actually. Yes.
So we are multiple vendors, each building very deeply specialized applications. By integrating them, we multiply the value, instead of everyone brewing their own soup. It's also a huge chance for us, because if, let's say, XWiki is integrating with us, then maybe their clients, who are also likely to use OpenProject, would also book OpenProject, like the professional services. So for us, it's a huge, huge opportunity. And with OpenDesk, the German government also wants it to be easy to procure, so that not every little city needs to go through a tender. Tender processes. Tender processes. So it will be much easier for a small city to book services from us, right? And with that, we can build better software and we can integrate better. So maybe some challenges, before we dive into how we create integrations. One challenge that we see today is integrating the UI and the UX across the products. Of course it's difficult: not all software is created equal when it comes to the capacity to customize it, because sometimes that has not been thought out from the beginning. So there is a big challenge on UI and UX. There is also a question of overlapping features. Sometimes, as vendors, we create features that collide: say a task management feature in XWiki that collides with OpenProject, or a wiki in OpenProject that collides with XWiki. We have to find solutions for that, but usually we are civilized, so it's okay. One issue is also maintaining all these customizations that we are creating outside of the core of our products. Basically, we create an overlay that makes our application compatible with OpenDesk, but then how do we get the financing for that? How do we maintain it?
And if it's really difficult to maintain across new versions, how do we do it? So far, we still need to find long-term solutions for this. Talking about integrations: of course, these systems don't exist only in OpenDesk. We exist outside OpenDesk as well, and there the integrations also make sense, and those people might not have the whole OpenDesk infrastructure. So when we build integrations, we always want to build them in a way that is also suitable for other environments, where the software can run separately. Exactly. Exactly. And the last thing is the creation of offerings. We mentioned the fact that other EU countries are interested, so apart from Germany, it's going to be a long-term challenge to find ways to provide OpenDesk in the public sector, maybe for other actors, or even for the private sector. Yep. So. Yeah. To also go a little bit into an example: I always like to talk about integrations, because I think this is where the power of collaboration lies. I want to go into one example that I was working on with my team: the Nextcloud and OpenProject integration. This was already presented here last year, but I want to give a different point of view on it. So quickly: Nextcloud, for us here today, is mainly a file storage platform. And OpenProject is something like, let's say, Jira: you create and organize your work in work packages, issues, whatever you call them, and you can have them organized in Gantt charts or boards or whatever you need. Okay. So we have a file management environment and we have a project management environment. And the outside perspective, let's say the public sector, has a different view on that. They are asking: where are the files for my task? Two things in one sentence, right? Does everyone in my team have access? So we are in a project management system, organizing our work.
The files are in a different system. Access management, okay. The third problem is: I want to do the same thing over and over again, the same processes, the same steps organized in projects, task by task by task. And I also want to have the same template files and the same folder structure, again and again. And these both need to go together. What they don't want is that we, as OpenProject, build our own file management system, because Nextcloud is pretty good at that, and it's also integrated into the desktop and so on. People want to work on files in the Nextcloud experience. But also, Nextcloud might not be the best choice for organizing complex projects. Okay, it has Deck, right? But if you really want to go a bit more professional, OpenProject is probably a good idea. So the public sector doesn't want these tiny separate solutions; they want integrated solutions. And it's not only OpenDesk: other clients, like the City of Cologne, or the University of Duisburg-Essen, or Deutsche Bahn, want the integration. They don't want separate solutions. For us, if we integrate, it's much easier to focus on project management, while, for example, Nextcloud can benefit from focusing on file management, and so on. So this integration creates great value in the combination. And it's also interesting, as already mentioned: Nextcloud and us are one example, but if we all work together, then sales also becomes easier, because we all have clients that the others don't have yet. By joining forces in integration and in sales, together we can capture a bigger market and get more money to build more open source software. Okay, a little example of how this looks. This is OpenProject. You have a work package.
And on the right-hand side, you see files that are related to that task, which is baking pizza. I love baking pizza. The interesting thing is that you can see the files that are necessary for this baking of pizza on the right-hand side, but the files are not in OpenProject, they are in Nextcloud. And in Nextcloud, when their name changes, the name will change here as well. If their location changes, these links will still work. This deeply integrated referential integrity is what you need in order to get away from chaos. This is what organizations actually want: they want to get rid of chaos, they want to have control over this stuff. Access control: for projects in OpenProject, you can have something that's called a project folder. We, OpenProject, create folders in Nextcloud for which we manage the access. Members of a project, that is the scope of a team, right? They need to have access to the stuff that they should have access to. So we say: okay, here in Nextcloud, these people have access to it, fully automatically managed. That helps people keep the data where it belongs, the files where they belong. We are working on this project, here are the files, put them there in that folder, okay, don't put them anywhere else. And if you leave the company, they're still there, right? They stay within the organization. Then on the Nextcloud side, also deeply integrated, we can show you which tasks or work packages are actually relevant for a file, or where the file is used. Let's say you have a template file for an employment contract: where is it used, in which contracts is that file used? You can find them on the right-hand side, jump directly into the work package in OpenProject and find the processes there. So the bottom line is: integrated, we are much, much stronger. Exactly. Exactly. Okay. Thank you. Thank you. I think we have some time for questions. So... Yes, I'll do it.
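The automatically managed project-folder access described above boils down to a reconciliation step: compare who is a member of the project with who currently has access to the folder, then grant and revoke accordingly. The sketch below shows that core logic only; the function name and data shapes are illustrative, not the actual API of OpenProject or Nextcloud, where the resulting grants and revokes would be applied through their respective APIs.

```python
def reconcile_access(project_members: set, folder_users: set):
    """Return (users_to_grant, users_to_revoke) for a project folder.

    Grants go to project members who lack folder access; revokes apply
    to folder users who are no longer members of the project.
    """
    to_grant = project_members - folder_users
    to_revoke = folder_users - project_members
    return to_grant, to_revoke

# Example: dave left the project, alice and carol joined
members = {"alice", "bob", "carol"}
folder = {"bob", "dave"}
grant, revoke = reconcile_access(members, folder)
print(sorted(grant), sorted(revoke))  # → ['alice', 'carol'] ['dave']
```

Running this on every membership change is what keeps folder access in lockstep with project membership, including the "if you leave the company, the files stay" behavior mentioned above.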
Are we going to have to figure out the way to answer that? Yeah. Oh, you're going to give him a question. Thanks for the talk. Are there any license requirements in order to integrate into the OpenDesk infrastructure? And the second question is: which vendor, so to speak, is kind of the product owner of the dashboard and the top bar we saw, and what are the requirements for them, which all apps share? So, for the first question, on the license requirements: there are some requirements. Basically anything that you commit on Open CoDE needs to match the list of licenses authorized by the German administration, by the admins of Open CoDE. The list is not fully compatible with the one provided by the Open Source Initiative; it's a little bit shorter. We found that out the hard way: for example, you have a piece of software, you package it as a Docker image, and when you upload the Docker image on Open CoDE you have to provide an SBOM for it with the licenses, and then you find out that in the base Docker image you depend on, there is a Perl library with a weird license header. That creates an exception, and you get three months of review to make sure that it's okay to have it in OpenDesk. So it's a little bit of a mess. There is a list available on Open CoDE for the licenses. When it comes to product ownership, I'm not 100% sure. I believe the design of the navigation bar is managed by ZenDiS, which is handling the project at a high level and has been helped by PwC, which does consulting and usability tests on top of OpenDesk. And the same goes for the portal, I believe. Thank you. Any other questions? And technically, the widget portal that you saw, the answer is Univention. Yeah, thank you again for the talk. I have two questions. One is very specific.
I'm from Lassen, Germany, and I've heard of Project Phoenix. You wrote Project, or dPhoenix. Is there some difference, or is it just that Project Phoenix? And the second question is... Is there someone from Dataport over there? Okay, can I just rephrase the second question? The second question is: in the context of a company which does not have its own IT, but would like to keep to open source software where you can switch vendors, are there vendors just providing this setup, where you can get an account and use it for your company? That's the idea of the job of Dataport, being one of the potential hosts of that dPhoenix suite. So then you could get that product from there just by renting it. But I think they only offer the services to public administrations. I'll try to answer. I'm part of Project Phoenix and also have a little insight into OpenDesk. The thing is, Phoenix is a branch of this OpenDesk, and what was your question exactly? They just wrote dPhoenix. Yeah, it's the same. There were some name changes on the way to the product. The D means Dataport; they dropped it, and now it's Phoenix, so it's basically the same product. Is that like the second generation? No, no, no, it's just a renaming, you know? And is there a possibility just to rent this somewhere, for small companies who don't run their own IT? Not to my knowledge, but if there is high demand for that, it will be possible. They already do it for some customers, so maybe it's a question of strategy and how much this is asked for. It's Helm charts, right? So the idea is that it's easy for any host to host it. And then just to protect... Ah, there's Markus. Do you have a mic? I'm going to take the questions in order. But the idea is that it's easy to host it, right? It shall be easy for any organization in the public sector to simply say: I have a data center, or I just rented one somewhere, and I just pull up the Helm charts and off we go.
Okay, you mentioned that the German strategy was to put these 23 million in in the year 2023. So my question is: how does it go forward on the funding side? It depends on the German parliament. Is there kind of a guaranteed maintenance for this code, or who is taking care of the boring stuff, the security patches and all? I don't know. I don't know. So the budget allocated to the project depends on what's being voted in the parliament. There are budget cuts nowadays. I think so. I'm not 100% sure, but globally there is still budget for the project. It's about half of what we had last year. The budget repartition is another issue, right? It's basically around 30 million dedicated to OpenProject in 2024, to be validated. OpenDesk, sorry. And sorry, what was the second question again? Who's handling the long-term maintenance, the security patches? So the long-term goal is essentially to find a business model so that whenever you are deploying OpenDesk, there is a team managing the packaging of OpenDesk, making sure that the Helm charts are up to date, et cetera. This team needs to find some funding, and the idea would be that if you take professional support, the team would get a part of that funding. And then the idea is also to redistribute this funding to the vendors themselves in some way. The specifics of this distribution are not fully defined yet, basically because we are right at the point where this default implementation of OpenDesk is just going out, and now there is a second step of finding the first clients and making sure that it deploys properly. Okay, any other questions? I'm going to take some people that haven't spoken yet, just for distribution. Thank you. First of all, thanks for the talk. That's a really interesting project.
And I just wanted to ask if there is interest, at some point, in adding any repository tools or, for instance, pipelining, CI/CD tools, to the platform? I will repeat the question so that it's recorded. I'm so sorry. So that it's recorded: whether there's any integration of repository or pipelining tools? Right now, no. I personally think it makes perfect sense; I would very much welcome that. And I guess it's just like knocking at the vendors' door and saying: hey, we want to have this. Hello. Hello. Okay. Yeah. So you said there is unified procurement, so you can buy licenses for all the different software if you want, like professional support and stuff. Is there also a single point of contact for support? If I want to self-host this and have some issues with any of the software in the suite, who do I ask if I have problems? Is there someone who can help me, and do I have to know which software has the problem? And also, a second question: what's your favorite pizza? Thank you. Okay. Good question. So the question is: is there central support for the whole product? And actually, I don't know that much. I think it doesn't exist yet. Yeah. Something that needs to be developed. It's part of what you said about the whole package; it's part of the discussion on the business model, basically. But I think it's more important to first build the software, integrate it, and make it open source and available for everyone for free. And the second question, the favorite pizza? Oh yeah. There are many. Okay. Maybe one last question, because there are about 40 seconds left and then we have to go to the next talk. Let me go back there. Not a question, but a remark. Hi, I'm Renee. I'm with ZenDiS, for two days now, and I'll be happy to take any questions or feedback on OpenDesk with me. I'll be around to talk. Thank you. Okay. Okay. Really cool. Sure.
So my question is: there are huge parts of the software stack that are still vendor-tied, outside of just office software, I guess. So is there a way to get those parts of the software stack to be open source? For example, GitHub is one of the biggest ones; there is an alternative in GitLab. And operating systems, BIOS, hardware. I mean, I physically have problems understanding the question. So what's the question? Whether other software is going to be integrated, or? There is a huge part of the computer science ecosystem, going from hardware all the way up to operating systems, the different layers. Different layers. Is there, you know, a movement to free those? Yeah. So OpenDesk focuses on the desk, the working desk: the tools that you need on your machine in order to work together. That's the current scope. Controlling the hardware or the operating system is not in scope; that's a different story. Thank you. Thank you very much. Thank you.
Another approach to AI
Thank you all for joining us. It's not working? Why is it not working? It is? Okay, good. So thank you all for joining us, and next up, AI. Exactly. Another approach to AI, with Jos Poortvliet. Okay. Does this work well enough? Can you hear it in the back and all that? Okay, I see thumbs up. That's wonderful. Yeah. Well, I'm Jos Poortvliet, I do communications and I'm a co-founder at Nextcloud. So, the thing we do at Nextcloud is collaboration; that's of course what this room is all about. Now, there's this AI thing coming, and I'm hoping to make this conversation a little bit interactive. I mean, there are other people here from XWiki and other projects who are working on open source collaboration tools as well. And, you know, this AI thing: there have been AI-ish tools in use for a long time, but a lot of them are also still quite new. So I'm kind of hoping that we can also have a bit of a conversation about it, because, well, there are pros and cons; we'll get to all that stuff. Of course, the big thing here is that we have these big companies, right? They all want our data, and AI is for them another thing to use that data for. So, yeah, AI. I don't know how deep I want to go into what it is, because I think all of us know it a little bit. But we don't want to live in a world where there are five companies who run all our data, and that's kind of a little bit the case right now. I think that if Trump, in his next presidency, tells Microsoft to shut down their services in Europe, then basically you cannot get a new passport, nobody can work at a government here, for example, right? This is, I think, a bit of an issue.
And, I mean, Nextcloud is one of the projects that's working on solving that issue, essentially trying to give companies, individuals, but also hopefully governments, back the control over their data. We've built a collaboration platform. Let me guess: how many of you are not familiar with Nextcloud? Yeah, okay, it's like six people. Google it. I will then not go into that, sorry. Or DuckDuckGo it, that would be better, obviously. So, as a company, we build an alternative to Microsoft 365, in very quick, simple terms. And with alternative, we mean that as a government or a company, we think it's important that you have a choice. It's totally fine if you're happy that your data is at an American company and that U.S. spy agencies have access to it. If you're good with that, if that's not a threat to your business, that's fine. For a government, I think it's by definition a threat, but that's their choice. But we think there should be a choice. There should be an alternative. And an alternative is only an alternative if it does what the other product does, obviously, in a safe way, and has enough capability to be used by a serious company. So that's what we're building and have built. That's why German and French and most European governments are in places already using Nextcloud, be it cities, be it at a state level or federal level. So, as a company, we care a lot. Nextcloud for us is a mission, a goal. It's our way to try and make the world a tiny little bit better. And we want to work in an open, collaborative way. Therefore, we're very happy, of course, that it's used by thousands of governments and universities, et cetera, et cetera. And, of course, we're building this completely in the open. And again, that will be relevant, because I think the future for AI had better be open; otherwise we are just as stuck as we are with collaboration platforms, honestly.
And we have a wonderful community working with us and all this stuff, which is awesome. Also, as a company, we try to be open and transparent, not depending on venture capital, et cetera, but self-owned. Anyhow, AI. So, we've already introduced a ton of AI things over the years, little things, and I will show some of them, but of course, with the latest LLMs and stuff, it's getting really complicated. I mean, there are tons of problems with it. It has a lot of potential, right? AI can help us make repetitive tasks easier, quicker, et cetera, but at the same time, Big Tech is basically loving it. They have all the data to be able to build the AIs. It costs tens or hundreds of millions right now to really train the proper LLMs, so they really have a bit of a monopoly here. And, yeah, the rest of us will have to just accept that they're using all our data to do it. And a lot of companies are already realizing this is a problem for them, right? Take Citigroup and Goldman Sachs: they are actually not allowing their employees to use tools like ChatGPT. I mean, if you're BMW and you're working on a new car and you're using an AI to generate some ideas or summarize some proposals, and you discover six months later that Tesla, while designing their car, suddenly got some of your ideas coming into their AI planning, then there's a bit of an issue here. And, of course, this kind of stuff is happening. A while ago, Twitter and Zoom changed their terms of service to allow for training on user data. And, yeah, this is really an issue for business as well as, obviously, all of our society. And then I'm not even talking about data biases in these models, or the carbon footprint. I mean, I think most of you are aware of all the issues with AI. So, honestly, I don't think the question is AI or no AI, because there are too many benefits. The opportunities are really big, I think.
I've been trying to make a bit of a list of that, but I was still changing it while standing in line outside, so this is definitely not complete. So I'm just going to put it all on the screen and ask what's missing. I mean, I think there are some basics, you know: text to speech, speech to text, recognizing faces in photos and recognizing objects in them, et cetera. Nextcloud has been shipping these kinds of things for three or four years already. It's just one model that you download, and it does this stuff; translation is another one. It's fairly, I mean, it's not simple, it's technically complicated stuff, but it works. And there are no huge risks. You don't need to send your data to Google anymore if you want text to speech, or if you want image recognition and being able to search for a dog and find all the pictures of your favorite pet. So this is already there, and it's not terribly complicated for a person to use. But of course, you now have all these new language models. I think there's really a big benefit in dealing with information overload. You have tons of emails coming in. You have, I don't know, papers to read, et cetera. And these LLMs, I know they create a lot of fake content and hallucinate stuff, but the thing they're pretty reliable at is summarizing. And this is really quite important. I don't know how many emails you get, but I get a ton, and I would love to be able to summarize them or get help selecting, you know, the useful emails, et cetera. And this stuff is really possible, or, I don't know, meeting notes. So this is, I think, where these models can be super helpful. And you have text generation, of course; they can help out with this. You also have image analysis of various things.
There have been some demos from Microsoft and Google already, about a year ago, where they basically were showing that you have a spreadsheet, and you select something in it, and then you type a question about it, and then it makes a graph that answers the question. This kind of stuff is also pretty magical. And there are tons of people in offices all over the world who would benefit a lot from having this stuff. So I think, yeah, the benefits are really there. Another thing is automation; I was just talking about it with a colleague. This is also a next step, you know, if you can say to the LLM: hey, make an appointment with another person, and then it tries chat, and if that doesn't work, it tries email. These kinds of things would be really helpful, I think, in day-to-day work. So, yeah, I don't know. If there are other ideas or things that are missing, I'd love to hear them, actually, and make my list a little bit more complete, but we'll get to that, I think. So I just wanted to show a couple of examples. We have this thread summary feature now, which makes a summary of your emails. Another example is in Nextcloud Text: you can just select some text and say, hey, summarize it, create a headline. It's all quite simple to use. And image generation, of course. I mean, this is a horrible image, but, you know, you can make things that look good. And then you have the data analysis, and you have automation, all these other features we have ideas on. I'll share some things a bit later on. So I think we need to do AI in our collaboration platforms. Like XWiki, you guys need to have a plan. I know OnlyOffice integrated just ChatGPT; I think we need a little bit more than just that, because, well, you're losing the on-prem capabilities, right? It's not competitive if you're just integrating ChatGPT; then the data is sent to the U.S. anyway. So, yeah, that's not really a good solution.
So the question is: how can we get this without the problems? And I think I'm in a room with open source people, so the answer for most of you is obvious, and to me at least: transparency and being open, yeah? And this is kind of the thing that we've been working on at Nextcloud. We made some rules for ourselves. We had been doing AI-ish things already, but when the whole ChatGPT thing came out (actually that was at the FOSDEM two years ago), we talked to people and to each other, and we have some fairly smart people on board, also from the research community, and we tried to come up with how we can handle this. Because we want to add more AI features; we don't want to be left behind, and we need to be an alternative, as I said earlier, and you can only be an alternative if you offer similar features, otherwise who's going to use your product? But then how can you do that in an okay-ish way? So the idea we came up with was to at least create transparency, and of course choice; I'll get to that next, but first the transparency. So we came up with the idea of creating a rating that is basically red, orange, yellow, or green, and we would rate each of the integrations of AI features in Nextcloud with this rating. The criteria: is the code open source? Is the model freely available? And is the training data available? If a model has all three, it's green; if it has two of them, it's yellow; if it has one of them, it's orange; and if it has none of them, it's red. So the ChatGPT integration: red. A completely on-prem model that is open and has the training data available, for example for speech to text, can be green, and you have everything in between, of course. And the second thing is choice. For us, it's really important that you can choose, right?
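The rating logic described here is simple enough to sketch as a small function. This is my own illustration of the scheme from the talk, not Nextcloud's actual code, and the function name is made up:

```python
def ethical_ai_rating(open_source: bool, model_available: bool,
                      training_data_available: bool) -> str:
    """Map the three openness criteria to a color, as described in
    the talk: all three met -> green, two -> yellow, one -> orange,
    none -> red."""
    met = sum([open_source, model_available, training_data_available])
    return {3: "green", 2: "yellow", 1: "orange", 0: "red"}[met]

# ChatGPT integration: closed code, closed model, closed training data
print(ethical_ai_rating(False, False, False))  # red
# An open local model missing only open training data
print(ethical_ai_rating(True, True, False))    # yellow
```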
I mean, there are, again, legitimate uses for something like ChatGPT, and they're throwing so many billions at this problem that you can hardly argue that open source can really keep up with the latest stuff they're doing, and sometimes you just need it, fine. So in our user interface, you have these choices: you can have, for example, Opus, that's a translation model, so this would be a fully green one, and then there's, well, we all know, ChatGPT. So we try to make sure that for the various features you can choose between these different models, on-prem, et cetera. For us, of course, most of the work goes into on-prem, open source, locally running AI features, because that fits with our values as a company and with our ethical AI rating, but the others are available. So at the moment, I made a list, but I'm sure there are many more models like these you can use in Nextcloud, for the various features. Well, actually, I'm showing examples right now. So this is just a bunch of the features we have. There are more, but suspicious login detection is something we developed a really, really long time ago. It's basically a neural network that gets trained on your login data. It runs completely locally every time you log in. If you work nine to five from the Berlin office, let's say, and suddenly somebody logs in to your account at 3 a.m. from China, maybe there's something wrong; the model will detect that and give you a warning. Very simple, and we have had this since, I don't know, 2020, so quite a while. And it's green, right? It runs fully locally. There's nothing special about it, no data sent anywhere. We do a very similar thing with our mail app, where we basically train a neural network on subjects, senders, email recipients, et cetera. And it creates a smart inbox, trying to put important emails on top and the rest not, and again, no data is sent anywhere, because it just runs on premise.
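As a rough illustration of the idea behind suspicious login detection (the real Nextcloud feature trains a small neural network on login data; this set-based version is only a sketch of the concept, with made-up names):

```python
class SuspiciousLoginDetector:
    """Toy sketch: learn which (hour-of-day, country) combinations
    are normal for an account, flag logins outside that profile."""

    def __init__(self):
        self.seen = set()

    def train(self, logins):
        # Bucket hours into 4-hour windows so nearby times match.
        for hour, country in logins:
            self.seen.add((hour // 4, country))

    def is_suspicious(self, hour, country):
        return (hour // 4, country) not in self.seen

detector = SuspiciousLoginDetector()
detector.train([(9, "DE"), (11, "DE"), (16, "DE")])  # nine-to-five from Berlin
print(detector.is_suspicious(3, "CN"))   # True: 3 a.m. from China
print(detector.is_suspicious(10, "DE"))  # False: normal working hours
```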
I already mentioned face recognition and such, which we did in 2022, I think. But the problem with this already is that you need to download a multi-gigabyte file, which has all the weights needed for the neural network to recognize stuff. So we already had to re-architect a lot of the way Nextcloud works just to be able to download this big blob without creating all kinds of complexity for the users. And obviously, this problem gets bigger and bigger when you get to modern AIs. We even have music genre recognition using machine learning. It's yellow because it's trained on all the music on Spotify, which means the training data is actually copyrighted and therefore not open. And, yeah, we had a pre-trained model to do call transcripts. We introduced that last year. That is nice: you have a call, and the recording then gets transcribed, speech to text, so that you get the text of the recording. Again, this model runs fully locally, so that's cool. And text to speech the other way around. Background blur is just a JavaScript thing that runs in the browser, very simple. Translations: first we made them with DeepL, which is not cool, so then we made one using the Opus corpus. You saw it earlier, and that runs fully locally, so that's much better. So these are still mostly basic features, I think, today, and yet already pretty complicated. You need to keep an eye on where the data is being sent, like with translation. But, of course, the big thing is LLMs, like the text operations. What we've been doing is to create, basically, Nextcloud Assistant. It uses large language models, but open source, on-prem ones that you can host yourself. It's this little thing at the top. When you click on it, you get a dialogue. You can give a free prompt, or you can give it a text to summarize, and some other things. And it just runs this through one of the models that is supported by Nextcloud.
And, again, you can put ChatGPT here as the model, or as the back-end, but you can also run your own LLM and connect that to Nextcloud, and then it can do all this stuff on-premise. So it's fairly simple. When it's running, it'll get the results, and after a while you get the output. You can copy it into a document, et cetera. And, again, if you take a local model that is trained on public data, then it can be a fully green solution. So that's really cool. In places like Nextcloud Text, I already showed that you can select some text and then run this. Mail, I already showed as well. In Talk, our video calling and chat solution, you can select a message and then choose translate, insert images, and other stuff. We even made a little bot. This isn't the smartest bot; it's a very small model, but hey, it's fast. And you can ask it questions. Honestly, it doesn't say such smart things. It's fairly shitty, I've noticed. But still, it works on your own server. That's kind of nice. So a lot is possible. One of the newer things we're working on is more of these services, because there are now companies like Amazon running LLMs as a service. And other companies are doing this purely in Europe: you have Aleph Alpha, and in France there's also a company building local AI, Mistral, I think. So we're trying to support these, because, you know, not everybody can run these AIs themselves; you need a lot of heavy GPUs, it's a lot of compute. So you can use it as a service such that it at least stays in Europe, or at a company that you trust. Though I wouldn't recommend Amazon, per se, perhaps. For this, we also made it possible to put in some limits, because otherwise users get a little creative and start to cost you a lot of money. And we worked on the interaction with this. I'll skip through this. A thing we're working on now is also making all of this even smarter.
A newer thing is the ability to take the documents that you have into account. So Context Chat is a feature of the Assistant that has access to your documents, your emails, everything you have; that gets indexed. Let me see. It's indexed into a vector database. This runs as a separate service next to Nextcloud. And then when you ask the Assistant a question, it can actually answer using your documents, your company documentation, your emails, et cetera. So you can really do stuff like: can you give me an idea of how we organize events? And rather than answering in general, it can look at your documentation and then tell you: oh, you know, at your company, you organize events this way. Or you can say: hey, can you give me a summary of the different requests that a colleague emailed to me last week? And hopefully it'll give you all the to-dos that you got from that colleague in the last week. So this, yeah, has the context of what you are doing as a user at hand. It's, I think, really an important step forward in making this useful, because otherwise you're just getting the generic info that's in the LLM. As I said, they hallucinate stuff all the time. They're much better at taking information and summarizing it, and that's, of course, what this does. I think it's much more reliable that way, you know, the vacation process, et cetera, et cetera. So that's a couple of things we've been doing lately on this. So that's our approach to AI. I would really like to hear thoughts on that, and I don't know what other projects are planning with this; one of them will be giving a talk after mine. But, yeah, any feedback, questions, thoughts, fears, and anxieties? Okay, so is this working? Can anybody confirm in the back? Great, thank you very much. So any thoughts, questions? Let's start here. So you said your screenshot showed that...
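The Context Chat flow described here (index documents, retrieve the ones most relevant to a question, hand them to the LLM as context) can be sketched with a toy retriever. A real setup uses neural embeddings and a dedicated vector database; the bag-of-words "embedding" below is only there to make the retrieval step concrete:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words Counter. A real context-chat
    setup would use a neural embedding model instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Company policy: events are organized by the office team, budget 500 euro.",
    "Vacation process: file a request in the HR portal two weeks ahead.",
]
question = "how do we organize events?"
context = retrieve(question, docs)[0]
# The retrieved text is prepended to the question before it goes to the LLM.
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(context)
```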
Yes, the screenshot showed that I need to double-check the information that the Assistant gave me. I noticed that it doesn't give me the reference to the emails that it was quoting from. Is there a possibility to get that? Currently not, but thinking of the way this works (I have one developer here, they can interject), I think that should actually be quite doable, because the way it works is it looks in this vector database and gives that information to the LLM to then summarize and give you the answer. And, well, in the vector database, I guess it knows where it came from, and therefore it can say what information was used to produce that answer. So I would think this is possible, but I don't know. Yeah, I see a thumbs up. Excellent, okay. Any other questions, ideas? Okay. Yes, for me, I am an AI skeptic. There are some examples. It's good when the user is at the end and can correct what is said by the AI. An example for translation: I have a word in Dutch, and the right translation in French is not "universitaire" but "personne issue des milieux académiques" (roughly, "a person from academic circles"). And a machine translator doesn't give this response. Yeah, but... So you need control, let's say, for the user, also for the citizen in general, because the user has no power over the system. You can't check a human translator either, though, unless you know the language, at which point you didn't need them in the first place. So, yeah, you have to use this stuff in a skeptical way, but then... yeah. Yes, and the other thing is about energy consumption. There was a program on RTBF about the energy consumption of ChatGPT. It was huge. Yeah, so the amount of energy that these models use is big.
That's, by the way, one of the reasons I think they should be open source: so that the researchers who do the stuff companies aren't interested in can try to optimize them and make them run with less energy. Yeah. Hi. I think another good use case for all these... if we combine these features, that would mean that we could have a super accessible environment, because if someone is blind or nearly blind, they could use all this text to speech; if someone has autism, ADHD, whatever, you could try to find a shorter, easier, more understandable version of a text or whatever, and combining this would help, I think. That is awesome. I'm going to add that to my slides right now, though I am completely making the laptop slow now. That's a really good point. Accessibility is a really important benefit. Actually, hint to the developer: bring it up in the team; maybe we can already work on that. Yeah, any more? Yeah, just a question on the RAG approach that you were describing before. Do you have any figures that you can share on how well the vector retrieval works? Sorry, I did not hear the question. So when you were describing the RAG, the retrieval-augmented... The green, the colors? Yes. No, the RAG, when you're retrieving the vectors from the vector DB. Right. So can you give us some figures on how well that retrieval works in your experience? Talk to somebody, not me, in one of these rooms, who knows the technical part there, and I'm not even sure we have somebody right here at the moment with that. Sorry. Okay. And we're out of time, I'm afraid. So this is probably it then. Somebody wants their microphone back. Alright, thank you all. Thank you very much, Jos.
Using Generative AI and Content Service Platforms together
Very much. So our next speaker is Angel Borroy from Hyland, who is going to talk to us about using generative AI and content service platforms together. Thanks. I was on... I was checking the microphone. Okay, yep. Welcome, everyone. So this is another view on the same topic. We are going to the technical side now. It's not a final feature for a product, but a framework to help you build all the features that we were seeing before, in the context of a content service platform or a document repository. Okay, so we are going to review a GenAI stack, which also includes LLMs on premise; we are going to review all the options. We are also going to describe the features we can build with this stack. And then we are going to review how to integrate that with your platform; in our case, because I'm working for Hyland, we are building an open source product with the name of Alfresco, which is related to content management. So we are going to see how to integrate that with that content management platform, and also take a little look at the future. And obviously I need to include some AI picture, because it is what it is. Anyway, so this GenAI stack that we are using includes mainly three components. The first one is Ollama. Ollama is a service that provides an API to interact with different LLMs. We are going to see the whole list later, but you can download your LLM on premise, and this layer provides the interaction with the LLM. You can even interact with different LLMs at the same time. The second one is Neo4j. Neo4j is the vector database: when you are using RAG, retrieval-augmented generation and so on, you need to provide context information for the LLM, and you store all this information in this database. And finally we are using LangChain. LangChain is a framework to connect all these different elements.
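To make the Ollama layer concrete: Ollama serves an HTTP API, by default on localhost port 11434, and plain text generation goes through its /api/generate endpoint. The sketch below only builds the JSON request body; actually sending it requires a running Ollama instance with the model pulled, so that part is shown as a comment:

```python
import json

# Ollama's default local endpoint for text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> str:
    """Build the JSON body for Ollama's /api/generate endpoint.
    stream=False asks for one complete JSON response instead of chunks."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

body = build_generate_request("mistral", "Summarize: Ollama serves local LLMs over HTTP.")
print(body)
# To send it, with Ollama running and the model pulled (ollama pull mistral):
#   curl http://localhost:11434/api/generate -d "$body"
```

In the stack from the talk, LangChain wraps this API behind its own model classes, so application code never builds these payloads by hand.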
This framework is in Python, but if you are not comfortable with Python, there are many other languages that include this kind of piece. Okay, so mainly, what we have: if someone doesn't like Docker, there is no problem, you can still deploy all of that without it, but it is oriented to services. You have Ollama, which is the one providing the services for the LLM, and it can be used with a GPU or not. We are going to run this without the GPU, just using the regular CPU on my computer. This is slower; I recommend you use a GPU, but you can do it. And we are pulling all the models that we need, so we can use more than one model. With that we can enrich the information for the operation with that Neo4j database, and we can develop an API with this framework. Okay, so these are the pieces. You have the Docker GenAI Stack project; mainly these pieces are a sample. This sample is oriented to prompting: you ask it to reply to questions. We are going to do something a bit different from that, but the first sample you can try is this one. Okay, so these are all the LLMs that Ollama is able to manage today. This list grows every day, so there are likely more, but this was as of last week. Okay, so this is what you need to understand: obviously, the larger the better, but you need to take into account your resources. So these are very small: you need 4 gigabytes of RAM and 2 gigabytes of storage, so you can run them on a laptop. And then if you want to use something that is better, it is also larger in resources. And you can even use LLMs that require, I don't know, many different computers at once, okay? So today we are going to use the small kind of LLM. It's also relevant to look at the license. Jos was talking before about licenses, and this is also relevant: if you want to build something commercial, or something open source, or whatever, you need to take care of the license.
You can also see that there are some weird licenses there, because you have this Llama 2 Community License Agreement, which some people say is open source and some people say is not. So it's something different; if you don't see an Apache license or something that you can recognize, better to check the conditions. So you have a lot of them to choose from. We are going to work today, in the demo, with Mistral 7B. Mistral is a French company producing this kind of LLM, with more or less the same performance as GPT-3.5, so it's good enough. And on what is open: the LLM is free to download and to use, but the training data is not free, and likely it has some copyrighted material in it; we don't know, because it's not free. So on the Nextcloud ethical AI rating we have, sorry, yellow. I thought it was orange, but it's yellow. Okay. It's more or less fine; we are only missing one condition. That was for text. For pictures, we need an LLM with a visual encoder on it, so for that part we are going to use LLaVA. And LLaVA really is meeting all the different requirements, so we are using a green LLM for this other sample. Okay. Perfect. So the whole demo is running on my computer while I'm giving the presentation. I have everything running inside; it's 32 gigabytes of RAM and an ARM64 architecture, so it's not AMD64; it's a MacBook Pro from two years ago, something like that. Okay. As we were also reviewing before: before this GenAI momentum we also had data extraction, text recognition, text classification, content analysis. Is anyone using content analysis for a real use case? Okay, nobody; me neither. So it exists, but you see. But we had all these things, right? Some kind of automation. But now, with GenAI, we have more powerful classification. We could classify in the past, but now we can classify better. We can also, and when I say translate, we are going to see the demo later: obviously, we can translate.
But we can also interact with the LLM in one language and get the response in another language, right? That is the difference. We can also summarize a text, which is the most common use case, and we can describe a picture. Prompting: obviously we can use prompting as well. So we have some new features that we can use on our documents, and we are going to see some of them implemented. Okay. So what is this project about? It's not yet... okay, the project link is at some point in the slides; if not, I will give you the link. So in this project, what is created is an API, using all this infrastructure, in order to provide different services. We are using LLM embeddings, so we are trying to avoid hallucinations by giving some additional information from the document to the database. We are working with a document, right? We are not going into search; we are not going into some other applications of GenAI. We are focused on features for a document. So we are adding all that information so we can get a better response, one more suitable to the document we are dealing with. And for that we are using Mistral. And if we are talking about a picture, then we can use the other LLM, which was LLaVA, in order to, for instance, describe or classify the picture. We can also choose the LLM: if you want to choose some other LLM than Mistral for text, you can do that, and you can choose some other LLM with a vision encoder enabled, like LLaVA or others on the list. And we can also choose the language; we are going to see that later. We can just drop a document in Japanese and get the summary in English, or the other way around, right? And you can also choose some numbers, like the summary size and so on. So these are parameters. Okay. So this is the API: pretty simple invocations. But let's see it live; as always, live is better. Can you see it? Better? Okay. Okay.
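The document operations described above essentially wrap the document text in different prompt templates before handing it to the model. The templates below are my own illustration of that idea, not the project's actual prompts:

```python
def summarize_prompt(document: str, language: str = "English", size: int = 100) -> str:
    # Language and summary size are the kinds of parameters the API exposes.
    return (f"Summarize the following document in {language}, "
            f"in about {size} words:\n\n{document}")

def classify_prompt(document: str, terms: list[str]) -> str:
    # Ask the model to pick exactly one term from a closed list,
    # as in the Japanese/Spanish/Vietnamese demo.
    return (f"Classify the following document. Answer with exactly one of "
            f"these terms: {', '.join(terms)}.\n\n{document}")

def question_prompt(document: str, question: str) -> str:
    return (f"Using only this document, answer the question.\n\n"
            f"Document:\n{document}\n\nQuestion: {question}")

p = classify_prompt("(Japanese text here)", ["Japanese", "Spanish", "Vietnamese"])
print(p.splitlines()[0])
```

The resulting string is what gets sent to Mistral (or, for pictures, a vision-enabled model like LLaVA) through the stack.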
So, for instance, I'm going to work with this document. I could be using an English document, but that would be too easy for the AI, so we are using this one. And I'm also going to use this picture, for your reference. Okay. Perfect. So for this document, we are going to ask for a summary: give me a summary of this document that is in Japanese. So with that, if I'm able to... okay. So this is running on my computer: I have this GenAI stack running in this Docker deployment, and it's getting the request. Okay. And with that, I'm getting the answer. Okay. So the text: this is about a problem with kindergartens in Japan, blah, blah, blah. Okay, that's fine. So I'm giving it something in Japanese, and I'm getting the summary in English. The second one... come on, not this one. There it is. Okay. The second one is to classify: classify a document by picking a term from a list of terms. So I want you to classify this document as Japanese, Spanish, or Vietnamese. Again, it's an easy example, right? But you can choose whatever list of values. So if I say: classify this document into one of these three categories, the term is Japanese, because the document is in Japanese. Okay. This is also relevant for classification. And finally, we can also run a prompt on the document: what is the name of the zone in this Japanese document? The name of the zone is Musoku. Okay. So those are three different features that we can use on a document. You can build more; again, it's a Python program with these three specific features, but you can grow it to include something else. And if we move to the pictures: that was for text, but for pictures, we can describe this picture. We could also extract some things before, like: this is a person. But describing is the new thing that GenAI is providing for us.
This is a bit slower, but in the end it says: some man posing for the camera, he's wearing a green beanie, glasses, a black hoodie, and the lanyard says "air fraked". Well, no, it's Alfresco, but more or less. Okay, the picture was not big enough, but it's fine. It's something that is useful. And it's not consuming that many resources, because it's running on my machine, so it's fair enough. Okay. Once we have all these features, let me show you a bit of the Python side. So this is the project, right? You have aborroy/alfresco-genai, and you have the GenAI stack, and mainly it's a Python program with all these endpoints: describe, classify, prompt, and summarize. Okay, it's no more than that. If we go back to the original goal, it is to integrate these kinds of operations with our product, which in our case is Alfresco. Alfresco we can also deploy in Docker, or however you want, and we have two different APIs. The first one is the classic REST API, and the second one is a messages API, so synchronous and asynchronous. If we have existing content in the repository (say, a folder with 100 pictures that you want to describe), you can use the REST API to get the document, apply the operation, and update the document. And that's fine, because you can make a batch with that; you have all the operations available. And if you want to do it more dynamically (when people drop a document, then perform the action), you have the messages API, the asynchronous API. You can listen to the event: okay, there is a new picture, and this picture needs to be described; I'm going to describe the picture, and it gets updated. Okay. So these are the two different patterns we can apply. What we are going to see now, again live, everything running on my laptop, just believe me, is something that allows us to classify a document.
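The two integration patterns just described can be sketched schematically. Everything below is reduced to plain Python for illustration: classify() stands in for a call to the GenAI service, and the Alfresco REST and event APIs are replaced by a list and a callback, so none of the names here are real Alfresco APIs:

```python
# Placeholder for the GenAI service call: pick the first listed term
# that appears in the document text.
def classify(doc_text: str, terms: list[str]) -> str:
    for term in terms:
        if term.lower() in doc_text.lower():
            return term
    return terms[0]

TERMS = ["Japanese", "Vietnamese", "English"]

# Pattern 1: batch over existing content (REST-style: get each
# document, apply the operation, update its metadata).
repository = [{"name": "a.pdf", "text": "... Vietnamese text ..."},
              {"name": "b.pdf", "text": "... English text ..."}]
for doc in repository:
    doc["language"] = classify(doc["text"], TERMS)

# Pattern 2: event-driven (messages API-style: react to each
# newly created document as it arrives).
def on_document_created(doc):
    doc["language"] = classify(doc["text"], TERMS)
    return doc

new_doc = on_document_created({"name": "c.pdf", "text": "... Japanese text ..."})
print([d["language"] for d in repository], new_doc["language"])
```

The batch pattern suits backfilling an existing repository; the event pattern suits rules like "when a document lands in this folder, classify it", which is what the demo shows next.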
So we are going to upload a document. We are creating this rule, the same kind of thing as before, so you can see the similarity with what came earlier. We have a list of languages, Japanese, Vietnamese, English, whatever, and we are creating a rule to move the document to the right folder: you drop a document, and the document is moved to the right folder. So let's do that. Let's open Alfresco. There is a folder, and this folder has a rule that classifies the documents I drop on it. So, for classifying things, we are going to try with a Vietnamese document; you have to be a bit creative. At this point Alfresco is listening for this new document and classifying it: it just selects a term from the list of terms, and the document is updated, so it has been classified. If I refresh, what I find is that the document is in the Vietnamese folder, and you can do that with invoices, with whatever you want. And we can track that it was Mistral, the LLM, that created this classification. Pretty easy, right? You can integrate all the other operations the same way to get some automation. I guess I was running out of time, but no problem, we have more time for questions. So again, this is a simple framework. You can deploy it on premise. You can choose your LLM. You have an initial REST API for the operations. Pull requests are welcome. And then you need to integrate it with your product, your organization, or whatever. There is also an interesting hackathon with more use cases: I presented some use cases, but you'll find more of them from that hackathon. The slides are available on the FOSDEM site. Also, I'm using Ollama, but there are many other alternatives; you don't need to choose Ollama. You have GPT4All locally.
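The asynchronous pattern behind the folder rule (listen for a new document, ask the LLM for a term, move the document accordingly) can be sketched as below. The folder names and handler signature are hypothetical; in the real setup the events arrive through Alfresco's messages API:

```python
from pathlib import PurePosixPath

LANGUAGE_FOLDERS = ["Japanese", "Vietnamese", "English"]  # the rule's term list


def target_folder(term: str, base: str = "/Languages") -> str:
    """Map the LLM's chosen term to the destination folder of the rule."""
    if term not in LANGUAGE_FOLDERS:
        raise ValueError(f"unexpected term: {term!r}")
    return str(PurePosixPath(base) / term)


def on_document_created(doc_id: str, classify) -> str:
    """Handler for the 'new document' event: classify, then return the move target.

    `classify` is whatever callable asks the LLM for a term; it is injected
    here so the routing flow can be exercised without a running model.
    """
    term = classify(doc_id)
    return target_folder(term)
```

Injecting the classifier as a callable keeps the routing logic testable on its own; for example, `on_document_created("doc-1", lambda _id: "Vietnamese")` yields the Vietnamese folder path.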
That solution is the one used by Nextcloud; there's also Second State, and Hugging Face is probably the best known. But again, this is an initial framework: take it as it is and try some things with GenAI. That was all. Thanks. Thank you very much, Angel. Are there any questions? I'll take them in order. Thank you, Angel. It seems to me all these operations are on one picture or one document; are you also considering asking a question across all my documents? No. This sample is only for a single document or a single picture. But that is as easy as this: you have the database, the Neo4j database, and you can include as much information as you want for a single document or a single query. What I'm doing in the source code is removing the previous information, so the context covers only a single document. But you can modify that in order to add more than one document to one query. In the sample, though, it's one document or one picture. While summarizing the Japanese PDF, why did you need to provide the picture as context? Sorry? You showed the summarization of the Japanese PDF, and then you provided the picture as context. No, no, the picture was for the last operation. The first three operations, summarize, classify and prompt, were related to the document in Japanese. I could use some other document, but I love that document because I've been using it for testing for something like 15 years; it's like my precious document. And the picture was there for the last one: the description of that picture, which is more or less like this one. Thank you. Similar to the previous question, but for a single document: what about summarization for very large documents? Yeah, the problem is that, again, I'm running on my laptop, so I cannot use a very large document. But I was trying to summarize, for instance, books.
Do you know Project Gutenberg? On Project Gutenberg you have all the classics, Alice in Wonderland and so on. I was trying to do that with those kinds of documents, and it is able to do it, but it takes a while: minutes on my machine. Again, that's on a regular CPU; if you use a GPU, it might be 100 times faster, something like that. I don't know, I need to run serious tests on that. But with the right infrastructure, I guess the performance is enough. It's not instantaneous, but you can work with it. Thank you very much. Any other questions? Yes. Hi, a follow-up on the previous question: was it the insertion into the vector database that took a lot of time, or the actual query to the LLM? Because the insertion into the vector database has to be done once, whereas the query can be done multiple times once you've vectorized the document, right? Yeah, again, I was not trying to deliver a session on how to develop AI; it was just about creating a framework. The AI track can answer that better than me. But obviously you can reuse the database. I'm only using the database as context for a single document, but you can create categories, add more than one document, add links to the responses and so on. Sorry, maybe you misunderstood my question. My question was: when you added the Alice in Wonderland book, was it the vectorization that took time, or the query to the LLM? No, no, it was the vectorization, the vectorization of the chunks of the document. Okay, sorry, that was the one question. I'm not an expert, but I know a bit. Any other questions? Okay. One more question, the last one. I'll be around afterwards if someone wants to catch me.
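The vectorization step being discussed splits the document into overlapping chunks before embedding each one; that one-time cost is what takes minutes for a whole book, while later questions only pay for the query. A minimal chunking helper, with arbitrary sizes (the project's actual chunking may differ):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a long document into overlapping windows before embedding.

    Each chunk is embedded once and stored, so the slow vectorization
    step is paid a single time per document; later questions only cost
    an LLM query against the stored vectors.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    # Stop once the remaining tail is covered by the previous chunk.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap keeps sentences that straddle a boundary visible to at least one chunk, at the cost of a few percent more embedding work.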
Can you say a bit more about the biggest use cases you see, and whether there are any open source setups of this out there for us to look at? In my opinion, the main use case for this is searching. That is a different world, with different beasts, but for searching, AI is really quite relevant. So again, this is just a framework, and then it's up to your imagination. Thank you very much, Angel. Thanks. Thank you.
Web Accessibility and Environmental Sustainability with Popular CMSs
Before we actually start, I just want to quickly note that the last two sessions were about AI and their rooms were packed, completely packed, standing room only. Now we come down to a conversation about accessibility, which is ultimately about humans, right? About us, all of us. And sustainability, which is about the fucking planet we live on, and the room is half full. So, you know, AI is an exciting thing and all, but really our priorities are a little bit askew here. This is not what the proportions should be for things that matter in our world. So I just wanted to add that little hint. No pressure. Exactly. Okay, so we can start: Web Accessibility and Environmental Sustainability with Popular CMSs, by Mike. Thank you very much. So thank you all for coming. I wanted to give this talk because there are two issues I care a great deal about, and I see a lot of overlap between them. On accessibility, we've made a lot of inroads; it's really nice to see the kinds of changes that have happened in the last few years around digital accessibility. And we're only just starting to think about sustainability. So I wanted to have this session here so we can talk about them. The last two sessions had more technical information. I'm not going to show you any technical information; you will not see a terminal here. I'm here to talk about issues. There will be spaces and times to get into the terminal and into the technology, but generally I find that these kinds of presentations are not the best place for that. So first, about me. Mike Gifford, a senior strategist at CivicActions. I've been a Drupal developer for 18 years. I began working on accessibility in 2009. My assumption at that point was that it wouldn't take me that long to fix the accessibility problems in Drupal. Just to let you know: we have not fixed all the accessibility problems in Drupal.
This is a big problem, much bigger than I had thought. That said, Drupal is still one of the most accessible content management systems out there. I've also been involved in promoting open source for a long time. This is only my second time at FOSDEM, but I've been actively involved in promoting open source for at least 20 years now, and I'm happy to be able to do this full time with CivicActions. I've been able to focus more on accessibility and sustainability at CivicActions because I don't have to worry about HR and planning and all of that sort of stuff. So I've got deep roots in the Drupal community, and that's just one CMS that people here are using. We also had a good experience two years ago with Funka, which is an accessibility organization in Sweden. With them, what we tried to do was build a cross-CMS study of the authoring environment. How do we make sure that authoring tools have accessibility built in? How do we support authors in creating more accessible content? How do we structure that? Looking at the European Accessibility Act and the Web Accessibility Directive, and realizing the huge challenge facing the European Union in trying to meet those accessibility goals, there was an effort by Funka to bring people together, to set some best practices for supporting authors, and to run usability studies with authors to find out what we were doing. And I think it's the first, probably the only, study looking at and engaging authors to ask: how do we support authors in producing accessible content?
And that's strange, given how many billions of dollars are spent every year on accessibility and how little money is actually spent on fixing the problems upstream, either in open source projects like Drupal or in supporting the authors who are creating most of the accessibility issues right now. So it's an interesting challenge. It was funded by the European Commission, and it was a great project to be part of. It allowed me to engage with people from Plone and... was TYPO3 there? No, TYPO3 was not, but Joomla was, and Umbraco. And again, to see the collaboration across different open source content management systems, and some non-open-source content management systems as well. I think a similar process should be happening with web sustainability, partly because we are in a climate crisis. This is something we all need to act on. It has to be something we're thinking about in how we manage our technology, because although we talk about "the cloud", all of this stuff has real-world impacts. We have to realize that there are atoms behind all of the bits we are driving, and unless we start being more conscious about that, our sector is going to continue to grow exponentially, with a lot of environmentally negative consequences. So I wanted to touch on a couple of different platforms, CMSs, and what they're doing around sustainability, because this is something I think every open source project should do, not just the ones that are content management systems. I'm only really familiar with the content management system world, because that's the world I've been living in.
So for Drupal, we have a page that talks about sustainability. It is really important to have something on project websites saying that this is something that matters to our community, that it's a value of our community. I've been writing and talking about it since 2016. But again, people need to see this information. If you don't hear it, see it reinforced, and see action happening, people aren't necessarily sure how to plug in, or how their actions today affect sustainability in the future. There's a Drupal Slack channel on sustainability. There's also a sustainability statement up on the Drupal.org website. It's not a very bold one, but it's a start, a starting point for us to say that this is something that matters to the Drupal community. People need that starting point in order to ask: how do we get involved, take the next step, and recognize that our community is having an impact in the world, both positive and negative? And we should try to own as much of that responsibility as we can. Are there any WordPress people here? Excellent. I like the WordPress sustainability folks; I like what is being done in that space. They've got a strong community. There are a lot of sustainability people engaged both in the Slack channel and in broader thought leadership, whether it's Tim Frick with Mightybytes in Chicago or Tom Greenwood with Wholegrain Digital in London. There are some great people in the WordPress space pushing sustainability and helping people engage with it.
So yeah, they've got a WordPress sustainability initiative, and they meet, as WordPress teams do, on a regular basis through the Slack channel, and engage on issues that matter, helping WordPress users make building sustainable websites easier for everyone. How much would it matter if WordPress were able to reduce its energy use by 20%? I mean, they're something like 60% of the web, right? So how much energy is that? It would be significant if we could reduce that demand side. Think about the data centers: the knock-on effects of having that large a share of the web be 20% smaller, 20% faster, would be enormous, if we could think about how we prioritize our structure. Wagtail, another really interesting CMS, is doing some great stuff. Anybody from Wagtail? Okay. Excellent. I think that of all the CMSs out there, Wagtail is doing the most to calculate what its impact is. There's a phrase: if you don't measure it, it doesn't matter, right? So they're able to say that Wagtail accounts for 8,240 tonnes of CO2 per year. And that's going to grow year after year, I hope, because we want the Wagtail community to grow. But that's still around 8,000 tonnes of carbon per year that Wagtail is producing. We hope the sites are doing good things, and if they weren't using Wagtail, they'd be using something else. But it's so good to start measuring this information. And then they've got good documentation, and there's a roadmap in place.
It's lovely to see that there is support for authors, so that authors can have the help they need to create more sustainable products, with built-in aids. This is a case where the defaults matter, right? Because authors are lazy, developers are lazy, people are lazy: we do what is there by default. Whatever the outcome is, if it's the default, it's more likely to happen. If you have to go that extra step to do something, most people aren't going to bother; they barely have the time to do the default, if that. So taking the effort, as creators of tools, to implement these things as defaults is really important. That's something we've done in Drupal for accessibility, and it has worked quite well for our community, so we definitely encourage that for both sustainability and accessibility. How many people here are familiar with WCAG, the Web Content Accessibility Guidelines? Has anyone read it end to end? Okay, we've got two, three people. It's the most boring document. It's written by committee, and oh my gosh, is it ever a dreadful document, because it was written by committee. How many people here have read the Web Sustainability Guidelines? Fewer, because it's a much newer document; it was only released in September by the W3C. But the Web Sustainability Guidelines have been written to be human-consumable.
And it's written to follow the WCAG structure and framework, because we ultimately want to create a standard that governments and other organizations can sign on to and say: our organization is going to embrace these best practices for sustainability, and we want to measure ourselves against these criteria, so we can put on that stamp of approval that says we are meeting these best practices. From an accessibility point of view, the principles are: is it perceivable, operable, understandable and robust? Okay, those are fairly generic concepts, but they've been broken down and explained enough times that many people understand how that's structured. Is it written in the most usable, easiest language? Well, no, but it's a fairly useful set of structures. The Web Sustainability Guidelines are based on the Sustainable Web Manifesto, and the principles there are to be clean, efficient, open, honest, regenerative and resilient. Again, those are broad goals, but I think they help us understand the North Stars: where digital products should be going. Because this is an open source event, I did want to highlight that "open" right now in the manifesto is framed as products and services that are accessible, that allow for the open exchange of information, and that allow users to control their data. That's not quite the same definition used by the OSI or the Free Software Foundation or others, but it's a starting point. Hopefully, as more people jump on board, and if our community is able to embrace this, the manifesto may be modified so that there's stronger language about supporting open source, because we need to be able to collaborate with each other.
We need to learn from each other's best practices, and we need to use the innovative capacity of open source to make the changes we need faster. We don't have the time for all of us to make the same mistakes over and over again. We need to quickly find out what works, and share those ideas and experiences about what works around sustainability with others in our community and with other communities, right? We need to learn and test our ideas. We need to be evidence-driven, not "my community thinks this and your community thinks that". No: let's get out and find evidence to determine the best approach. I would love to have the numbers that Wagtail has, but for Drupal, to be able to say: we think this is the CO2 impact of Drupal. It would be much bigger than Wagtail's. Likewise, WordPress's would be much bigger than Drupal's. But having those numbers would be useful, to be able to say: this is actually what our community is responsible for, and this is why removing this megabyte, or even this byte, of data matters. Because that byte of data is being loaded and transferred across websites, you know, a billion times a day. That's huge when you think about load times. It's a millisecond; what does it matter? On your site, it matters nothing. But when you scale it up to a global level and it's replicated a million times, like it is in the Drupal world, that millisecond matters. I'm not going to do the math, but it's still quite a few seconds. Any questions so far about either the Web Sustainability Guidelines or the accessibility guidelines? No? Any other CMSs, any opportunity to jump in, contradict me, tell me what I've missed so far?
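The math he declines to do on stage is quick to do offline. With illustrative round numbers (not Drupal measurements):

```python
# Scaling a "negligible" per-page cost to fleet scale. One extra
# millisecond, replicated across a million page loads, accumulates to
# around a quarter of an hour; a billion transfers of one stray
# kilobyte is about a terabyte on the wire. Illustrative numbers only.
ms_saved = 1          # milliseconds shaved off one page load
loads = 1_000_000     # times that page is served

total_seconds = ms_saved * loads / 1000
print(total_seconds)  # 1000 seconds, i.e. roughly 17 minutes

bytes_removed = 1024            # one stray kilobyte
transfers = 1_000_000_000       # a billion transfers a day
terabytes = bytes_removed * transfers / 1e12
print(terabytes)      # about 1 TB per day
```

The point survives any reasonable choice of constants: per-request costs that are invisible locally become hours and terabytes once multiplied by a popular platform's traffic.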
How far are they from being an official standard? So, how far are the Web Sustainability Guidelines from becoming an official W3C standard? Thank you. Right now there's a community group that was set up and created the draft. Two people, basically, Alexander and Tim, did most of the work to create it, and it's really quite good and quite readable. But that's not how any other W3C specification is done. So right now it's a good enough document for people to start punching at, but we need to set up a sustainability guidelines working group. We're in the process of doing that: creating a charter, creating the oversight, inviting more people into that working group. Once the working group has a charter, there's the long process of getting adoption for it. And as soon as we have two use cases, say the UK government and the US government say "yes, we're going to endorse this, this is the direction we see going ahead" (it doesn't have to be the US; it could be the French government and the UK government deciding this is useful), then there are two implementations of the guideline, and basically the W3C can make it a recommendation. That's the path forward, and like so many community organizing efforts, it's a bit of a sausage-making factory. So it's great to have people involved, and we'd love to have people get involved in the Web Sustainability Guidelines and help champion the formation of this, because it is really important to have documents that are understandable and executable, with a path ahead to implement them. So, I wanted to highlight some other innovations from the Web Sustainability Guidelines. One is that there's a real effort to be usable.
So there's the W3C document, and many people who've looked at other W3C documents, whether on HTML or CSS or whatnot, know that structure: your eyes get weary and start to close. Well, the committee has also built a JSON file for the W3C standard. So it's available in the W3C standard format, in JSON format, and also on a website whose name I'm blanking on; I should have it here in my notes and I do not. But it's available in three different formats, one of which is designed to be more human-readable and a shareable reference. It's also structured around different areas: user experience design, web development, hosting and infrastructure, and business strategy and product management. So it thinks about this as holistically as possible. There are some elements of the Web Sustainability Guidelines that touch on issues of accessibility, because if people are not able to access your services, and they're navigating around trying to reach your content, that's a performance issue too. It costs CPU cycles, because they're constantly trying to get at it. Also, I think a lot of people have looked at the data centers, and the hosting environment is one piece of it. But if you look at the overall energy footprint, the hosting piece is actually a small part of the puzzle.
You know, most hosting companies are very aware of the cost of electricity and are doing their best to minimize that cost so they can increase their revenues. From an electricity and CO2 perspective, a lot more of the impact comes from our own devices, right? We plug in our devices, we maintain our devices. There are also elements of embodied energy. I don't know how much CO2 was involved in creating this phone (it's not an iPhone, it's a Samsung), but looking at that embodied energy, and the energy it will take across the whole life cycle of the product, is something we need to start thinking about. So, I mentioned that the JSON file is part of the innovation. Also, the WCAG format doesn't really structure the impact or the level of effort, so it's nice to see that the Web Sustainability Guidelines have tried to pull those apart. You can get a sense of: what is the easiest thing for me to do that will have the biggest bang for the buck? If you can highlight and structure your information around that ROI, that's something you can more easily scale up, and you know where to invest your efforts. Because we need to get people started on this. If you give people a checklist of 100,000 things they need to change on their website, well, good luck. Maybe they'll hit the right ones; maybe they'll just ignore it and walk away, right? But if you can say "here's the top issue right now that you can address, this is how you can start having a bigger impact in your work", then that can motivate people to get started and to feel they can actually make a difference.
Because nobody wants to take on a wall of errors and add it to their existing issue queue. So, I'm a big fan of automated tests. How many other people here like automated tests? Get the machines to do it. And there's good reason for that: people are good at some things, but we're generally not good at doing the same thing over and over and over again. Sisyphus really should have been a bot; he wouldn't have minded at all. In terms of automated tools, I want to point out both Google Lighthouse and Unlighthouse. Has anyone tried Unlighthouse before? So, unlighthouse.dev basically scans your whole website, more or less, with Google Lighthouse and gives you a Lighthouse score. From a sustainability point of view, that comes with all of the stuff that comes with Lighthouse: you get the performance results, and you also get the accessibility score. The accessibility score is less than what axe gives you; you're not getting as many of the features as you do with axe, but it is a good, solid starting point. And if you look at the performance numbers, that's a great place to start looking for sustainability. There's lots of overlap between sustainability and performance, and it's so useful to find allies, people who are interested in the same issues. Performance people are absolutely in the same camp as sustainability. They're sort of like security and privacy: not the same, but very, very related. It's also useful to point out sitespeed.io and CO2.js; you can tie axe scores into this as well. Who here knows sitespeed.io? It's similar to what Google Lighthouse or Unlighthouse can do: it does a site-wide crawl, it provides recommendations, and there's a little coach that gives you some direction. It is open source, which is good.
Not that Google Lighthouse isn't open source as well, but there you're dealing with Google. The CO2.js module is a tool implemented by the Web Sustainability... sorry, the Green Software Foundation... Green Web Foundation, thank you very much, Chris. So the Green Web Foundation is the organization behind CO2.js. And it's a great module, particularly for quickly getting a snapshot of what's on your website and what its estimated CO2 impact is. There's a lot more work that needs to be done on tools like CO2.js. The default right now is a byte model, which more or less looks at how many bytes are transferred and focuses on that side. There's so much more research needed to actually get to a verifiable amount of CO2. We know that a megabyte of images is going to have a lot less CO2 impact than a megabyte of JavaScript: JavaScript is much more intensive in terms of CPU processing, whereas rendering HTML and images is really not a big deal for browsers. But JavaScript is something that takes a lot of processing power. Still, it's a really great tool, and you can integrate it with sitespeed.io and have that information available. I also want to point out EcoGrader and websitecarbon.com. Both are good tools for getting a sense of how heavy your website is. Take all this information with a grain of salt: it doesn't give you an exact number, it gives you an estimate, something you can work from and improve on. But just like with accessibility tools such as the WAVE toolbar or Microsoft's open source Accessibility Insights, just because your accessibility tool reports no errors doesn't mean your website is accessible. It means no automated errors have been found by your tools. So it's about understanding what the automated tools can give you. They can't tell you that there are no accessibility errors.
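The byte model he mentions is simple enough to sketch in a few lines. This is a toy approximation in the spirit of the Sustainable Web Design figures (energy per gigabyte times grid carbon intensity); the constants are illustrative assumptions, and CO2.js itself applies more refined per-segment values:

```python
# Toy version of the default byte model: estimate operational CO2 from
# bytes transferred. Both constants below are assumed round figures,
# not authoritative values; treat the output as an order-of-magnitude
# estimate, as the talk itself cautions.
KWH_PER_GB = 0.81          # assumed energy per gigabyte transferred
GRAMS_CO2_PER_KWH = 442    # assumed average grid carbon intensity


def grams_co2_per_transfer(num_bytes: int) -> float:
    """Estimate grams of CO2 for transferring `num_bytes` once."""
    gigabytes = num_bytes / 1e9
    return gigabytes * KWH_PER_GB * GRAMS_CO2_PER_KWH


# A 2 MB page comes out at roughly 0.7 g of CO2 per load under these
# assumptions:
print(round(grams_co2_per_transfer(2_000_000), 3))
```

Note what the model cannot see: it charges a megabyte of JavaScript and a megabyte of images identically, even though the JavaScript costs far more CPU on the client, which is exactly the refinement the talk says still needs research.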
They can only tell you that there are no errors that the automated tools can identify. And the same thing is going to be the case with sustainability: we're going to have to have human input in the mix. Manual review means humans asking: do we have this page? Do we need this page? Is this something we can yank out? Is there an easier path to get people where they need to go, to achieve what we're trying to achieve? That's a very human thing, trying to understand how humans act; we can't expect bots to do that. Same with accessibility: does this alt text actually make sense? It may or may not, but a bot isn't going to be able to tell. It can tell you whether it's "image123.jpg"; that we can get a machine to catch. Most don't, but it's possible. We need to think about the limits of what the machines can do as we think about these processes. And again with JavaScript: is this JavaScript file actually needed? Could somebody navigate this information without a mouse? What are the comparisons between the accessibility world and the sustainability world? What can we learn from one that we need to bring to the other as we scale up and start addressing sustainability in our sites? Is the content fulfilling user needs? Does it work with assistive technology? Those are all related questions across the two disciplines. Any questions at this point? I'm going through a lot of stuff, and I know how hot this room is and how overwhelming FOSDEM is, so I totally understand if people just want to disengage. But do feel free to stop me if you have questions. So, I wanted to touch a bit more on open source tools for sustainability, because this is an open source event.
So we've got — and Chris, please jump in if you see things that are missing — there's Cloud Carbon Footprint; Scaphandre — how do you spell that? Yeah, that one. There's another one too — it's such a hard time naming things, why is it so difficult? Kube Green, Kepler, Green Metrics Tool, CO2.js. And CO2.js is built into Firefox. I learned about that last year at FOSDEM, and it's really wonderful that that's been brought into a popular browser. Hopefully Chrome will be shamed into doing that as well, because we do need to build these tools in. I also mentioned sitespeed.io previously. In terms of websites to learn more, there's the Awesome Green Software list, where you can find all kinds of information about green software and the open source tools that are available. Also opensustain.tech and climatetriage.com. So there's so much information out there, and a lot of it is free. This is stuff people are learning and sharing because there's not a lot of awareness: most people still believe that as long as you don't print out your web pages, you're being environmentally friendly. We're not thinking about the overall impact of our digital devices and the actual weight they carry on the planet. Any questions about that? Any tools missing? Chris, anything big that I might have missed? No, the only thing to add is that there's a talk tomorrow where it's the Firefox Profiler folks talking about sustainability. Excellent. And there is also a talk on Scaphandre at 6 o'clock tonight as well. So tonight at 6 o'clock there's a talk on Scaphandre — I'm probably going to mispronounce that — taking place in the other room at 6 p.m. That's great, I definitely want to learn more about that. That's the energy room, is that right? Yes.
And also — sorry, tomorrow's talk about the Firefox Profiler, is that in the energy room as well? No, I think that's another room, but if you look up sustainability and Firefox, it will show up. Wonderful, thank you. I will definitely share that out after the talk as well. Sorry, in terms of the question — did I repeat enough of it that it's understandable? Okay, excellent. So yeah, just like accessibility, we need to bring these things in as early as possible. So how do we tie this into our development process? How do we start looking as early as possible in the process, so that we catch where we're starting to add bloat? Where does the page start to slow down? How do we make sure that every sprint we're a little bit faster and a little lighter than we were previously? We want to catch bugs before they ever get to production, so it has to be part of the CI/CD process. If it gets to production, it's too late. Not that you can't fix it later — we probably won't, we're developers. Setting page budgets is quite useful as well. With accessibility, I like to aim for zero axe errors — they call that "axe-clean" in the Deque world. For web pages, you're going to need to set your own budget. It'd be lovely if people could reach 200 grams of CO2 per page — most are much, much more than that. I don't know how many sites out there are meeting that 200 grams per page, but let's set a goal and try to see if we can improve it over time. And measure our pages now, so we know where we are and can set achievable goals over time. Again, think about sustainability and accessibility bugs. We can't think about these as features: if we leave them as features, they're not going to be addressed. They have to be seen as bugs and treated like bugs, right? So that they're more likely to be fixed and addressed early in the process.
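A page budget only helps if something enforces it. Here is a minimal sketch of a budget gate you could run in CI, comparing measured values against limits; the metric names and thresholds are made-up examples, not a real tool's configuration.

```javascript
// Hypothetical page budget: fail the build when a page exceeds its limits.
const budget = { bytes: 500_000, gramsCO2: 0.5 };

function checkBudget(measurements, budget) {
  const violations = [];
  for (const [metric, limit] of Object.entries(budget)) {
    if (measurements[metric] > limit) {
      violations.push(`${metric}: ${measurements[metric]} over budget ${limit}`);
    }
  }
  return violations;
}

const result = checkBudget({ bytes: 740_000, gramsCO2: 0.4 }, budget);
if (result.length > 0) {
  console.error("Page budget exceeded:", result.join("; "));
  // In a CI job you would exit non-zero here, e.g. process.exitCode = 1;
}
```

Tools like sitespeed.io support budget files natively; the point of the sketch is just that the check runs on every build, so regressions surface per sprint rather than after launch.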
And even if you're looking at minor bugs, if you're repeating the same minor bug a million times, it becomes a major issue. So again, think about the cycle and how these things scale up in our tools. Look at these tools and try to find ways to get a multi-layered process for quality. How do we make sure that we're building this into our CI/CD, that we're measuring support for authors, that we're scanning the environment for errors, that we're doing randomized tests? We don't want to be scanning every page every week, but we should have some sort of process, a plan for how we're going to provide automated and manual scans of the information. Are there ways we can structure our manual testing? How do we make sure we have a thorough ROT process — redundant, outdated, trivial — to remove content that we don't need? Are we doing annual reviews and deeper audits? Are we encouraging people to get certifications or to learn more about this? There are some good courses out there from the Linux Foundation and others about the sustainability impacts of digital tools. We've got some ideas about what a robust approach to accessibility looks like, and it's very similar. We've got to check for errors in our process. Use tools like Editoria11y and Sa11y to evaluate it. Has anyone here used Purple HATS as an accessibility tool? Purple HATS is a great tool for crawling for accessibility errors. Singapore's Government Digital Services created it using axe. It's useful to think about ways to build processes and to have a belt-and-suspenders approach, just like you do with security.
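The randomized-scan idea — audit a rotating sample instead of every page every week, so the whole site is eventually covered — can be sketched like this. Everything here is illustrative: the page list, the sample size, and the function names are made up.

```javascript
// Pick a random sample of pages for this week's audit, without duplicates.
// The rng parameter is injectable so the sampling is reproducible in tests.
function samplePages(pages, n, rng = Math.random) {
  const pool = [...pages];
  const picked = [];
  while (picked.length < n && pool.length > 0) {
    const i = Math.floor(rng() * pool.length);
    picked.push(pool.splice(i, 1)[0]); // remove so a page is picked at most once
  }
  return picked;
}

const site = ["/", "/about", "/blog", "/contact", "/pricing"];
console.log("This week's audit:", samplePages(site, 2));
```

The sampled URLs would then be fed to whatever scanner you use (axe, Pa11y, a CO2 estimator), with the manual review rotating over the same sample.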
There's also a tool called WCAG-EM, the WCAG Evaluation Methodology, which is useful for a structured approach to evaluating websites for accessibility, so that you can compare two websites with some confidence that you'll get similar results or similar kinds of comparisons, and you're not dealing with an apples-and-oranges situation. I also want to highlight that, yeah, we've got information around CO2.js and incorporating that into our pages, and there are tools like sitespeed.io and the Firefox Profiler. There are also efforts to try to have — I don't know what the sustainability equivalent of WCAG-EM, the accessibility evaluation methodology, would be; I don't know how you would create that tool. The Web Sustainability Guidelines are trying to do that: to have some way to compare two websites and get a sense of how sustainable both of them are. But this is stuff that needs to be developed; there are no tools for that yet. The Firefox Profiler is one tool. We need others, because Firefox isn't the only browser out there. It may be the best browser out there, but it's not the only one, so we need to think about this in terms of how people are engaging with the tools. I do want to encourage people to learn more about sustainability and think about their next steps. Here's the sustainablewebdesign.org website — that's the URL I forgot about earlier. This is the human-readable approach. Does everyone here use Slack? More or less, even if only because you have to.
climateaction.tech has a really great community for learning more about this, and I think it's a wonderful place for people to ask questions and share their ideas — if you've written blog posts about how you're engaging with sustainability in your open source project and want to share that with others, that's a great place to do it. I think that's where I learned about the work Wagtail is doing and tried to bring that over. There's also a whole bunch of interesting books, with more and more coming out all the time: there's Green Code, there's Sustainable Web Design, there's Building Green Software. Whether you're a designer or a developer, back-end or front-end, there's material out there geared to you. Take a look at what's available and see if there's material you can read and learn from. Are there any project managers here in the group? Okay, two project managers, excellent. Three. So there's now actually a course, which I don't have listed here, on sustainability for project managers: what they need to know about digital sustainability. There are also some excellent podcasts, including Environment Variables, where you can hear our very own Mr. Chris Adams much of the time — not all the time, but quite a lot of the time — and Green I.O., which is another great podcast; there are others as well. There are a lot of places to learn about what's available here and engage with it. But I really want to encourage people here to test their code and their websites using tools like websitecarbon.com or EcoGrader or co2.js. Test your stuff, see how it looks, see what you can learn from it, and share that with others. Let's start talking about it so that we're encouraging other people to think about the impact of the digital tools they use.
And with that, I can be contacted here, and if people have any questions, I'm happy to answer them or engage with people here at FOSDEM. Thank you, Mike. Any questions? I'm glad I didn't anticipate all of everyone's thoughts. Hi, thank you. Thank you for this. I'm not familiar with these topics, especially sustainability. You're talking about accessibility and sustainability as if they are somehow related to each other. They're both elements of quality. The way I see it, accessibility is more about individuals, users; on the other hand, sustainability is about the general audience, the general environmental issues. So how do they correlate to each other? I mean, if your website is sustainable in terms of electricity and money saving, it doesn't mean that it's accessible for a user who has issues using your website. They're mostly combined in my head because I work in both areas. So to some extent they are different areas, but there's also a real effort to see the development of — you know about human-centered design? There's an effort to create planet-friendly design. We are the only species engaging with the web at this point, that we know of. And so we are the users in this case. And we only live here, plus a few people on the space station. So having a planet-friendly focus is... Of course, but the way I see it, companies will be happy to talk about sustainability because they will save money. Right. I'm not sure a company will be happy to talk about accessibility, because it's not directly giving them money. Is what I'm saying correct, or is it maybe my wrong perception of everything? There are definitely different incentives. And partly it comes down to new legislation that is coming into place, and fines that will be in place. And certainly in the U.S.,
if you're trying to sell to a university or a government agency, there's an effort to meet Section 508, which is now more or less the WCAG 2.0 standards. So there are some financial incentives around that. And when people look at sustainability, they think about the cost of producing or buying new hardware, but also about electricity savings, and they see it as a cost-savings initiative. But there are other elements of digital sustainability that do actually cost money too. To do it properly, you need time for developers to focus on building systems, optimizing them, and finding ways to cut down on the amount of cruft. How many websites have redundant JavaScript libraries running? Or multiple instances of analytics on their site, or other tools the marketing team wanted to install but nobody's looked at for months? We know this happens in our industry all the time. And we are paying a price for that — just not necessarily a price the companies themselves are paying. Again, from an accessibility point of view, a lot of people see it as a lot cheaper to add an overlay to their website and hope that paying $50 a month will wish away all of their accessibility problems, so they have something they can justifiably say: we've got that covered for $50 a month — as opposed to the thousands or hundreds of thousands of dollars to pay people to actually fix the problems. But that just pushes it to the fringe; it doesn't actually solve the problem. Chris. Hi there. Can you hear me all right? You can hear me okay, cool. I'm curious if there is a role you think public sector organizations or early-adopter organizations could play, like we saw with accessibility. Right.
To mainstream this, or make it easier for people to adopt, like we saw with, say, the public sector and Microsoft and so on. Are there any similarities there that you would draw people's attention to? Absolutely. Ultimately we've got to look at incentives, right? Follow the money. That's starting to happen with accessibility; with sustainability, people are still not aware of it. The UK government has done some great work talking about sustainability in digital, measuring it, and trying to be aware of it, and their site is probably the most sustainable in the world. So it's really wonderful to see that. But also, I think the government sector is, in most countries, the single biggest procurer of technology. So if we can get government to commit to buying green software, it will make a huge difference in our industry, because if people want the contracts, they're going to have to be able to follow the Web Sustainability Guidelines and say that this complies, right? Even if they're doing a half-hearted effort, it will make a huge difference if government steps up and says this matters — even just to make it an issue of public discussion. Thank you. Any question? Thank you, Mike. You mentioned earlier the AI hype. Do you think we can harness that for either sustainability or accessibility? Yeah, absolutely — it can do wonderful things. But there's also an environmental cost to AI, and that's something that is not being measured. So yes, we can solve some problems with AI, but we have to be very careful about how we're doing that and whether or not we're being responsible in our use of AI, because we can't just throw AI at a problem and hope that it goes away.
We need to have humans involved in the process to see that this actually makes things better. And, as with accessibility, we need to make the people with lived experience of disabilities the experts. They're ultimately the ones who need to be heard, to know that they're able to access these sites. Not the standards, not the experts, but the people with lived experience of disability need to be involved. And the same goes for sustainability. We have to be measuring what the overall life-cycle impact is. That includes people coming to conferences like this. It also includes the cost of invoking thousands of generative AI models to evaluate the performance of websites. We need to be cautious about it, because I think it can be useful, but it has a huge cost. Thank you, Mike. No other question? Okay. Thank you very much. Thank you very much. It was green the whole time. Perfect. Thank you.
Cristal: a new Wiki UI to rule them all
Okay, so thank you everyone for joining. We'll soon start the talk about Cristal — one wiki UI to rule them all — with Ludovic and Manuel. So I'm going to give you the mic. Okay, we're good. So hello everybody, welcome to this talk. We're going to talk a bit about the new project we have at XWiki. We're going to present the team first — who we are — then the product vision we have for this product. And since it's a new product, it's not ready; it's not something that's usable today. It's something that we're going to build with a lot of energy. Then we're going to show the design proposals that we have for the UI, which from our point of view is very important, then the technical architecture, which is another part that we believe is very important, and then the current status and the roadmap. We call this project Cristal because we want it to be both beautiful on one side, but also, like a chemical crystal structure, to have a very nice and well-done architecture. So first, who are we? I'm Ludovic Dubost, the CEO and founder of XWiki — we're going to talk about XWiki just after. Manuel is the tech lead of Cristal; he's going to talk to you about the architecture. The project also has Vincent Massol, our CTO, and Thiago, our designer on this project, and we have the support of the whole XWiki team. XWiki SAS is a company established in 2004. For 20 years we've been working on wikis, so it's quite a long time, and we've made friends in this endeavour of wikis. We believe wikis are very important — hence our tagline, "knowledge is power". We are a self-funded company, and we have reached 4 million euros revenue. We're also building the CryptPad software. We did 50% growth in 2023. We have 60 employees, in France and Romania, but also Germany and elsewhere. As I said, we have two software products, XWiki and CryptPad. We engage in digital sovereignty, and we believe open source is really important for that.
And we have a business model to really try to fund the software — we believe it's very important to find a way to fund open source software; open source software cannot be done without it. The microphone is not working. Is it okay now? Okay, so I have to be careful, it turns around. So, the product vision. First, there are lots of technological shifts that have happened in recent years. For example, JavaScript is getting more and more mature, with better development methodologies coming to JavaScript. We come from the Java world, so we're very keen on great development methodology, and for a long time JavaScript went a bit in every direction. Now we see it getting more organized, with better development tools and better frameworks for developing JavaScript applications. Standards have evolved: you have web components and JavaScript modules that are working much better, and technologies such as JSON-LD or Solid that bring new capabilities. There are also new paradigms. Real time is becoming something that any application should have, and we believe offline is important too — and the technologies now allow it. We also see a convergence in the field of wikis. The features of wikis are getting similar — there's a better understanding of what the feature set of a wiki is — but we also see a convergence between wikis and, for example, drives; they're getting closer. So there are questions about how they could be similar applications. For example, we have always had attachments in wikis; maybe you could consider attachments or documents as wiki documents. So there is convergence in this area. We believe there is a model for the future here. Jitsi's founder, Emil, mentioned last year at the end of his "20 years of Jitsi" talk that, in open source, building in layers is an approach that is going to matter more, and it enables tremendous innovation.
So if you look at Jitsi, you have a Jitsi library that provides the video conferencing module, and you have the Jitsi application. I really believe in this: open source can reach all of its power if you can do a lot of reuse of anything, and for that you need lots of modularity. We had reached a lot of modularity in XWiki, with everything being components in XWiki and Java. But now applications are much more client-side, so you need the same level of modularity on the client side. We also need integrations between open source tools. There was a talk about openDesk, in which we are partners, where we need to bring open source applications together so that the whole suite of open source applications can replace Microsoft or other proprietary applications. We need to be able to integrate tools much more tightly, and for this you need, again, modularity. We also got an opportunity to fund that work. We have won, with other companies as part of consortiums, actually three projects, and two of these projects include the work on Cristal — we included building this new UI in those projects. So we had this opportunity to fund it; we're able to get money for that. We also have the opportunity to collaborate with partners. The partners in this project — unfortunately they're not open source — would also be users of that Cristal module for their applications, storing data in their own systems. We'll come to that when we talk about the vision for the product. So what's the vision of the product? It's actually one UI — one wiki UI, a modern one — that brings all the features that you have in wikis today and that can support multiple back ends.
So you would have an application that is web, desktop, and mobile. This application would be extensible and very modular, but it would have a common data model behind it which supports offline and real time, and it would be able to connect to different systems as back ends. Of course it would be able to connect to XWiki: we built XWiki, and we want it to connect to XWiki and support all the features XWiki has, even the most advanced ones. But we also want it to do a basic wiki based on a file system stored locally on your computer. We also want it to work as a nice wiki with a Nextcloud back end, using WebDAV or Git. And we also want it to support a wiki storing data in an end-to-end encrypted system such as CryptPad, which we also build at XWiki. And this application as a whole — where you can activate and deactivate modules, decide you don't want certain features, change modules, replace modules in this modular application — would also be embeddable. That means you could put it in a Nextcloud server and serve it from the Nextcloud server, put it in the XWiki server and serve it from there to access XWiki data, or put it in any other application. That's the vision of the product, Cristal. The key concepts are a slick UI with a modern editor and slash commands, and multiple back ends, as I mentioned. Slick UI means it needs to be as good as what Notion does today in the world of wikis in terms of UI, or the Notion competitors that we see coming in. We believe the Notion competitors are nice because they support a lot of nice UI features, but they don't support the modularity that Cristal will have. It's going to be offline and real time. It's going to have accessibility by default, support web components, and also be sustainable.
There was a very nice talk before about sustainability of software and measuring the consumption of software. We want to try to do that with Cristal too: we want Cristal to be a UI built in a way that consumes less. It's going to be available as web and desktop, later mobile. It's going to be extensible and configurable, and it's going to have a strong editor. I'm not going to go into the details of what a strong editor is: it's going to support Markdown, but also the XWiki syntax, with a state-of-the-art UI. Lately in XWiki we implemented slash commands; Cristal will have slash commands too. It's also going to support structured data. That's one of the big advantages of XWiki that our customers and users have loved compared to other wikis — we have a whole system around structured applications and structured data — and we're going to support that in Cristal. Some use cases: we want it to be a UI for simple storage, Markdown, so it should work as a simple wiki. The idea is that it can be a local note-taking app that you use offline with local storage — that would be really interesting. It's going to be a modernization for XWiki, because XWiki has a UI that's quite old now; we have done a lot of things with this UI, but we want Cristal to be the modernization of the wiki UI in XWiki. It's going to be embeddable, as I mentioned, and we want it to be a wiki view on all your wikis, so that as an individual user you would have multiple wikis in your Cristal UI. You could even create a wiki of wikis: your own tree of pages, navigating different pages across different wikis in the back end, and locally of course. It can be an end-to-end encrypted wiki for CryptPad, which is a feature we would love to have. So we can summarize it as: a new wiki UI to rule them all. For the design proposal, I hand over to Manuel. Is it working?
Here we share the results of the work by Thiago, the UX engineer we hired a few months ago. Since we have experience with XWiki, we are able to start from a blank slate while using the experience we already have to design a cleaner, more modern UI for a wiki. That's one example, but we have documentation online where you can find other wireframes. Of course, everything is community-based, so you can come to the forum where we are openly discussing design ideas and contributions. One important aspect we want to work on: since we want Cristal to be embeddable, it can't only come with its own style; it needs to look like the application where it's integrated. What we want is that, as a developer, when you design a part of Cristal, you design it with abstract UI components, and then by configuration you can say: I want to use Shoelace for the actual rendering; and later, without much code, say: okay, for this application I want to use Vuetify instead. We want to make it easy for developers to switch from one design system to another. It can even be convenient for, for instance, the French government, which has its own design system: if you want a knowledge base for the French government, it should be possible, via an extension, to define a new concrete design system and use it for their own needs. We can imagine other use cases: in Nextcloud, they have their own set of components, and if you want Cristal inside Nextcloud, you want it to look like Nextcloud and be seamlessly integrated in the ecosystem. So, a few notes on our technical view for the future. Starting Cristal was a very good opportunity to try new things, so for a few months I've been studying a lot of libraries we could use for Cristal. That's a snapshot of the things we have settled on for now. I went to the JavaScript room this morning, and now I have dozens of new technologies to check.
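The abstract design system idea — code asks for an abstract component, configuration picks the concrete implementation — can be sketched very simply. This is an illustration of the concept only, not Cristal's actual API; the mapping tables and function names are invented, though `sl-button` and `v-btn` are the real Shoelace and Vuetify tag names.

```javascript
// Hypothetical registry mapping abstract component names to concrete
// implementations in different design systems.
const designSystems = {
  shoelace: { button: "sl-button", input: "sl-input" },
  vuetify:  { button: "v-btn",    input: "v-text-field" },
};

// Application code only ever asks for the abstract name; switching design
// systems is a one-line configuration change.
function resolveComponent(system, abstractName) {
  const mapping = designSystems[system];
  if (!mapping || !mapping[abstractName]) {
    throw new Error(`No "${abstractName}" in design system "${system}"`);
  }
  return mapping[abstractName];
}

console.log(resolveComponent("shoelace", "button")); // sl-button
```

An extension (say, for the French government's design system) would just register one more entry in the table instead of touching any calling code.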
We have a page where all the choices we made are listed, and we maintain it over time. In terms of architecture, we're starting simply, with two main targets: the web one and the Electron one, with the others to come later. The area where we have the most work to do in the future, because it's the most challenging, is the integration of Cristal with XWiki: it's a 20-year-old project, so, as you can expect, there are a lot of features to be compatible with. Rich editing is very challenging: we need to choose a new technology for the editor which is compatible with offline editing and with real-time editing, so that's a lot of work ahead, but we have plenty for our next roadmap. The key aspects we have in mind are the ones we want to preserve from what we already have in XWiki and that we deem important: accessibility and sustainability. That comes with watching artifact size, of course, measuring performance, and making Cristal usable locally; being modular with inversion of control; being based on standards as much as we can — for instance JSON-LD and web components; and keeping documentation for users and for developers. To give a broad idea of the artifacts we want to publish: the abstract design system library, for others to develop design systems on top of Cristal; a set of connectors to different sources, as we said; a JavaScript renderer for the XWiki syntax, to have offline editing with a rich experience; a software development kit to be able to develop extensions; and a set of components — we're considering web components in particular, because that's independent from any particular framework, which I believe is better for the long-term future of the project. On the user side, we have the Electron application for desktop note-taking, and a replacement for the XWiki front end. So, I'll hand back. And so, the tricky part now: what's the status, and where are we today? The first thing is that we have a prototype of the extensible architecture using IoC, inversion of control.
That's actually a very important part of the way we've designed the application. People coming from the Java world understand what components in Java are and what inversion of control is, and this is something that is not used that much in the JavaScript world. It's used by frameworks — Vue.js or Angular are frameworks that do inversion of control — but when it comes to JavaScript libraries, it's not something that is used that much. It's the key feature that is really important for extensibility and modularity: if you want to be able to replace one piece of the system because you want to change the way it behaves, you need to be able to replace any module for which you have defined an API, and inversion of control is a key method for doing that. In the prototype we did, we've been able to dynamically load, by configuration, a module coming from the Internet. You say in the configuration that you want this module instead of the other one, and from a static build — a standard Cristal delivery — you can add an extension that will replace one of the modules of the system. And this is key. We have designed the basic architecture of plugins, skins, and user interface extensions. In XWiki there is a great feature called Skins and UIX. A skin is a way to replace the UI; a UIX, a user interface extension, is a way to add an item somewhere in the UI. If you want to add a feature to the product via an extension, you need extension points, and UIX in XWiki is the way we do it. We have replicated these mechanisms in the Cristal prototype, so that you can add things via an extension, and we'll also replicate the ability to replace the skin. So, in addition to what Manuel explained about the abstract design system, which allows reimplementing the basic views and the basic components used in the whole application, we can also replace pieces of the user interface.
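The replace-any-module-behind-an-API idea can be shown with a toy inversion-of-control container. This is a minimal sketch of the pattern, not Cristal's actual container; the binding names and the storage objects are invented for illustration.

```javascript
// Minimal IoC container: modules are registered by name behind an API,
// and an extension can rebind a name to swap the implementation without
// touching any of the code that consumes it.
class Container {
  #factories = new Map();
  bind(name, factory)   { this.#factories.set(name, factory); } // register a module
  rebind(name, factory) { this.#factories.set(name, factory); } // replace a module
  get(name) {
    const factory = this.#factories.get(name);
    if (!factory) throw new Error(`No binding for "${name}"`);
    return factory(this);
  }
}

const container = new Container();
container.bind("storage", () => ({ load: (page) => `markdown:${page}` }));

// A dynamically loaded extension swaps the back end; callers are unchanged:
container.rebind("storage", () => ({ load: (page) => `xwiki:${page}` }));
console.log(container.get("storage").load("Home")); // xwiki:Home
```

Real JavaScript IoC libraries (Inversify, for example) add typing, scopes, and constructor injection on top of this same core mechanism.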
We have implemented XWiki and Markdown renderers. One difficulty was to bring a JavaScript renderer for the XWiki syntax: if we want to be compatible with XWiki, we want Markdown to be a first-class citizen in Cristal, because that's the standard today, but we also need to support our customers who are using the XWiki syntax with XWiki. We've also done prototypes of client-side macros — rendering a macro in Vue.js; so, new macros. We've made our choices of design system libraries; the first ones we want to spend time on are Shoelace and Vuetify. One thing from the previous slide that Manuel didn't mention: Shoelace won one of our performance tests — actually twice as fast as Vuetify — and Shoelace is a web component library; we were quite impressed by that. Vuetify is a pure Vue library, while Shoelace is a cross-framework component library, supporting React or Angular, etc. Really interesting work. We have done design work, we have a prototype UI for the basic views, we have a first test of the editor UI with Markdown and Tiptap, and we have the project infrastructure. You can check the code at the link I gave: Cristal in the xwiki-contrib organization on GitHub. Basically, what we want to achieve in 2024 is a first version for basic wikis: you can browse and actually take notes in Markdown with the Electron app, you can access Git on the other side, and you can access a basic XWiki — not all the advanced features, maybe about 50% of XWiki's current features. During 2025, we will aim for 75% of XWiki's features, including structured data. We want to bundle it in XWiki by 2026. We also want a plugin repository — we'll probably have that earlier, but we want to start having more plugins — along with more plugin development and a CryptPad release. We probably want it as the default UI for XWiki too, if we have done our work properly. That's it. You can look at our website, cristal.xwiki.org.
There is also very interesting information there for anybody building an advanced JavaScript application. We're not necessarily the biggest JavaScript experts; we come from the Java world, as I said, but we have done a lot of studies of what the good technologies are, because we have a lot of experience in choosing libraries right. We really try to make comparison tables; we have tables about libraries, about technologies, and so on. Don't hesitate to look at them. XWiki is also hiring; if you find this project interesting, you can join. If you're interested in what XWiki is about, I have a talk at 9am tomorrow, if you like to wake up early, in K. We also have a party; you can scan the QR code if you want to join our party tonight. There's no room left? You can still try. It doesn't matter; there's a risk. Thank you. Questions? APPLAUSE Any questions? Do you have an example of an extension you're imagining or planning? First, any macros are extensions: if you want to add macros to your wiki, they are going to be extensions. If you look at XWiki, we have 650 extensions, and we have at least 50 high-quality extensions that we're not bundling with XWiki. Lots of them are macros. Macros can be extensions, but an extension can also just add a feature. Structured data would be an extension. We would not bundle it in the basic Cristal if you are not using XWiki as a back end, because the back end wouldn't support anything; everything structured is going to be what XWiki supports. For us, anything will be an extension. The difference is that some will be bundled and some won't be, but the storage system is an extension. Access to XWiki can be an extension, access to GitHub is one, access to Git, access to the file system, they are all extensions. Thank you. Another question? No? OK. Thank you very much. Last second. Do you have a specific library for JSON-LD? Which one do you want to use? Can I repeat the question?
Is there a specific library for JSON-LD that we want to use? First, when we look at storage, there are two ways to do the abstract storage. One way is to hope that the server application will support JSON-LD by default. We'll actually do that for XWiki, to try to make XWiki give you JSON-LD by default. We believe that will be better, because we'll do the conversion between XWiki structured data and JSON-LD on the server. That will be very interesting. In the Java world, we have found a Java JSON-LD library that is widely available. On the JavaScript side, at this point, we didn't feel we needed a library; it's just JSON that we can manipulate. We haven't seen the need for a library because we're just storing the JSON-LD data offline right away. Sorry, I forgot to say: the second way is to do the conversion to JSON-LD on the client. That means the storage module will use the standard API of the backend and then transform things to JSON-LD to give them to the other Cristal modules, which understand JSON-LD. The conversion would be on the client, and then you store the result of that conversion offline so that you can do anything in the application. We didn't see the need for a library yet, but we're not there yet. We did some tests of how XWiki structured data in a page would display once converted to JSON-LD, and we've been able to replicate things we do in XWiki on the client side in a similar way. We're not there yet. For now, we're focusing on the editing experience, which is the most important part for the beginning. Thank you. Thank you very much. Another question? Sorry, we'll take it outside.
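The client-side conversion just described can be sketched without any dedicated library, since JSON-LD is just JSON with `@`-keywords. The record shape, the schema.org vocabulary, and the field names below are invented for the example; they are not Cristal's actual data model.

```typescript
// A plain record as a backend API might return it (illustrative shape).
interface PageRecord {
  id: string;
  title: string;
  author: string;
}

// Hand-rolled conversion to JSON-LD: attach a context, an @id, and a
// @type, and map the plain fields onto vocabulary terms. No JSON-LD
// library is needed for a simple mapping like this.
function toJsonLd(rec: PageRecord): Record<string, unknown> {
  return {
    "@context": { "@vocab": "https://schema.org/" },
    "@id": rec.id,
    "@type": "Article",
    name: rec.title,
    author: { "@type": "Person", name: rec.author },
  };
}

const doc = toJsonLd({ id: "wiki:Main.WebHome", title: "Home", author: "Ada" });
console.log(JSON.stringify(doc, null, 2));
```

The resulting object can be stored offline as-is and handed to any module that understands JSON-LD, which is the flow the answer describes.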
Pushing Tiki to its limits
Hello. Tiki provides a very powerful and flexible database abstraction layer. Through a concrete project which extended over three years, we have learned a lot. As we start a similar project, we have time to reflect on lessons learned and pitfalls to avoid, so why not share everything with you. First I'll describe the context, what the project was about, how we did it, what the challenges were, and what we learned, as a summary. I'm Jean-Marc Kipps. I discovered free software last century. I've been in the Tiki community since 2006, and I live in Strasbourg. I'm alone in front of you, but I don't want you to believe that I did all that alone. It was a team project, headed by Evoludata, and a lot of people helped; some of them are in the room. The customer was the Peak team from the Institut national de santé publique du Québec. The end users are medical testing laboratories. So that's the website. As you can see, everything is in French, but I'll translate as much as I can, and I translated before I did the screenshots; it's a way of cheating. This is what the team does: it's quality control, actually. Every year they produce medical samples and ship them to registered labs. Peak ships, and the labs have to register, because not all the labs do the same analyses; it depends on the machines they own. They have to be certified for all the analyses they can do, and so they have to choose them. Then they do the tests and send the results, and Peak analyses the results and sends reports and recommendations. This is what they call one campaign, and there are many campaigns which are grouped together in a program, et cetera. That's one of the processes. They used to do all that using faxes. So at first you think, hey, how hard is it to be better than fax? It turns out faxes are hugely flexible. For example, different medical disciplines did things in different ways, for totally valid reasons.
So we had to adapt, but they were also clever people, so they also used the project to streamline and improve their processes. We met in the middle; everybody improved. And of course there are other processes, but I don't have time to explain everything; that's just an example. Every year they also have to draft, review, approve, and publish the programs, so that people can register afterwards, and manage those registrations. In general, that's the website. If you don't have an account and you're not involved in a process, there is not much in it for you, even if you understand French. What's in it is, for example, what I mentioned: the management of a program. They have all the interface they need. They can edit it, they can view it, they can go and edit the campaigns which are linked to the subprogram, they can go to other pages. There's a lot of it. Every table, as you can guess, is actually data linked to this program, but stored in other tables, and sometimes in yet other tables. So it's not simple. This one isn't too hard: that's the process where they approve. They discuss it, that's just comments, and then they validate it when they agree together. This is actually the same subprogram, but that's the end user view. So we have that flexibility, and that's also where they actually click when they want to register, as I said they would. But there is a lot of variety. Here is another program where you have plenty more campaigns; you can't click on them because it's not the time of the year when they register. So as I said, it's rather complex. How we did it: Tiki, in case you don't know, has plenty of features, and you have to choose the ones you want for each project. Basically, we use the wiki pages in Tiki to embed widgets, which we call plugins, in the wiki pages, and that's where the logic is. You can also use them for documentation. We have file galleries.
We don't use them a lot, but there are some documents to share. Trackers are the huge thing. Trackers is the Tiki name for the database abstraction layer: it started as a bug tracker which grew and grew and grew, and now it's a full-fledged database abstraction, but it's hard to rename things afterwards. In fact, each tracker item still has a status, open, pending, et cetera, and we use it. The categories are useful because that's what we use for the permission system. The scheduler, I'll get back to it; it will be simpler. As for the performance-related features, the main one is that, when you have a lot of data, the important thing is how you search and index it. The default is MySQL full text, but you really need to install Elasticsearch instead. We really had to, because there are too many limitations, especially in the number of fields that MySQL can handle. For the rest, we basically had to raise all the limits, and it's easy to do because we do it within Tiki; it's just configurable. So over time we doubled some memory limits, et cetera. So, trackers. You can think of trackers as tables; each tracker is a table. We have 86 of them so far and still growing. This is the tracker admin view, which end users don't see, but the customers love it because they feel empowered. They can see what's going on and they can edit stuff. We have activated inline editing, so when you see that, you can click on any of those little widgets and edit what's there, correct a typo, filter on what you want to see, sort on every column, et cetera. That allows doing a lot of things without bothering to set up a whole workflow, and it's really useful. So I said that trackers are like tables, and tables have fields. There are plenty of kinds of fields, and you can just add them, et cetera.
Among the useful ones, the auto-increment field is really practical because it allows accessing and displaying the item ID of each tracker item. The item link is super powerful. If you're familiar with SQL, think about foreign keys: the item link links to another tracker. When you edit an item, you have a selection of items from the other tracker, and you can link tracker items from one tracker to tracker items in another tracker. Once you manage to link those items, it is super useful because the linked data follows. For example, as we said, these are the campaigns. Each campaign is linked to one subprogram, and the subprogram has a year, so the campaign just gets the year from the associated subprogram. You don't have to do a double entry, et cetera. You get those data, it's all indexed together, and when you display the campaign, you have all these values from other trackers. So when you start to link trackers, as you can guess, it starts to look like a database schematic. That's a schematic I did, by the way, in a Tiki wiki page with a draw widget. I needed it for a workflow because otherwise I couldn't figure out what to do. I linked all the item links and the item lists, and I put colors, because the colors are about the fact that when you link tracker items and you delete or change the status of a tracker item, you may want the related tracker items to be also deleted, or their status to change, or not. That's configurable, and that's why I wanted to keep track of it. And that's still not all 86 trackers. So, how we dealt with source management: we had three, not four, environments. We set up a dedicated private GitLab repo. We had our branches, and we stuck dev and test on those branches, so every commit would instantly update the site.
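The item-link mechanism described above can be sketched in TypeScript. The tracker and field names mirror the talk's campaign/subprogram example, but all types and data are illustrative, not Tiki's actual API: the point is that a linked item's fields (here the year) come along for free at display time.

```typescript
// Two "trackers" modeled as typed tables; subprogramId acts like a
// foreign key, which is what Tiki's item-link field provides.
interface Subprogram { id: number; name: string; year: number; }
interface Campaign { id: number; name: string; subprogramId: number; }

const subprograms: Subprogram[] = [
  { id: 1, name: "Biochemistry", year: 2023 },
];
const campaigns: Campaign[] = [
  { id: 10, name: "Campaign A", subprogramId: 1 },
];

// Displaying a campaign pulls the year from the linked subprogram,
// so the year is never entered twice on the campaign itself.
function displayCampaign(c: Campaign): string {
  const sp = subprograms.find(s => s.id === c.subprogramId);
  if (!sp) throw new Error(`campaign ${c.id} links to a missing subprogram`);
  return `${c.name} (${sp.name}, ${sp.year})`;
}

console.log(displayCampaign(campaigns[0])); // Campaign A (Biochemistry, 2023)
```

The configurable cascade the speaker color-coded (delete or change status of linked items together, or not) would sit on top of exactly this kind of link.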
We get from one to the other by merging and tagging, and production is not edited directly. When the staging environment we called test is approved, we create a tag with the date and we run that in production. That means we have auditability: our versioning system tells us what we were running at what time, how it evolved, and what we had in production at a given time, if we want to recreate production at a former date when you hit a bug and you try to figure out, is this a regression, or is it something we missed last year and it was already there? All our commits were very careful: we do not edit the core Tiki files, as much as we can avoid it. We add our templates in our theme, or in the custom translations. That means that when we do a merge and we want to get the novelties from the Tiki community code and the security improvements, we do not get merge conflicts. The database management is just the opposite flow, because the reference is in production; that's where we have the real data. Some of this data has been entered by end users. Some of it is those wiki pages we edit; I'll show you later why we have code in our wiki pages. The nice thing is that we can try things: we synchronize test and dev from production, then we do experiments, then we get that validated. If it's okay, that's the approved edit, and then we synchronize. Tiki takes care of keeping a history of changes in the wiki pages and in the tracker items; there's an activity log, and that's how we get our auditability for that part. I just said that all our environments are running the same database, and you may guess how this is an issue. What we do is that one single file here is not versioned. This one is specific to each Tiki, because it's the one which has the database credentials, and it also has a link to a configuration file, which can be versioned because we have one section per environment in the configuration file.
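The per-environment override idea just described can be sketched as shared defaults plus one section per environment, merged at startup. Tiki's real mechanism is an ini-style configuration file, but the merge logic has this shape; the preference names below are invented for illustration.

```typescript
type Prefs = Record<string, string | boolean>;

// Shared defaults: the safe choice is that no environment sends mail.
const defaults: Prefs = {
  browserTitle: "Peak",
  allowEmailNotifications: false,
};

// One section per environment, overriding only what differs.
const sections: Record<string, Prefs> = {
  dev:  { browserTitle: "Peak [DEV]" },
  test: { browserTitle: "Peak [TEST]" },
  prod: { browserTitle: "Peak", allowEmailNotifications: true },
};

// Later keys win, so only prod ends up allowed to send notifications,
// and each environment gets its own browser title.
function effectivePrefs(env: string): Prefs {
  return { ...defaults, ...(sections[env] ?? {}) };
}

console.log(effectivePrefs("dev").browserTitle);            // Peak [DEV]
console.log(effectivePrefs("dev").allowEmailNotifications); // false
```

Keeping security-sensitive preferences in this file, outside the admin panel, gives exactly the two advantages the talk lists next.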
That means that in the same configuration file, each environment uses a different section. In this section, we can override any Tiki preference. This has two very big advantages. The first one is that all the security preferences and others can be set in that file and cannot be accidentally modified through the Tiki admin panel. The other is that we can have different things in different sections. That allows us, for example, to ensure that only the production server can send email notifications; you do not want your end users to get notifications from a test server or a dev server that they're not supposed to know about. What else? Yes, you can change the browser title and the theme options, and end up having your browser tabs in different colors when you are working in production, in staging, or in dev. That avoids big mistakes when you are editing a site: you want to be sure that you're not editing prod when you want to do stuff in dev. So there's still the part about how you do all that. Tiki has a no-code, low-code approach, but at some stage you just have to accept that the project is really complex and go beyond that. The great thing is that there are options for doing really complicated stuff. These are basically the list widgets, which we call plugins. The list widget is super useful because it's what allows you to display stuff which comes from anywhere in Tiki, but here we are only interested in the tracker items. List execute is very similar to list, but it's not for displaying: it's for listing stuff and doing things on a whole bunch of tracker items at the same time, like deleting them or changing their status. Custom search is also closely related, and it is for allowing people to do searches and to filter; the end users have control in this case. So that's a list widget example. You are not going to understand how it works; we don't have the time. We ourselves have that documentation page. We spent a lot of time on it.
It has plenty of info; everything is there, there are examples and all that. We spent a lot of time on it. Basically, the general idea is that this is something we can put in a wiki page. There is a section which says what filters we apply, what we are going to display. There is a section about how we want to output it. There are predefined templates, but if we want full control, we just give a Smarty .tpl file and then we can code whatever we like. You can even change the formatting before it gets to the template. And if your filter doesn't match anything, there is an ALTERNATE section. So that allows you to do all the pages you saw before. You have to realize that when I say you can do whatever you like in the template, one of the things you can do in the template is call another list plugin. The syntax is slightly different from the wiki page, and that allows you to collect information from trackers which are linked to other trackers, etc. And you can go on and on and on if you like; there are no limits at this point. So that's basically what we used nearly all the time, for all the pages, for all the workflows. The scheduler is also really useful, because sometimes some processes are just too complex; there are too many special cases and all that. For example, for the scoring system, we just wrote a script which was directly doing the calculation and updating the values in the database. And the scheduler is our way of ensuring that things can run whenever we like, for example by night, because luckily neither our customers nor the end users really wanted to work outside of working hours. So we can run everything we like during the night, especially nightly scripts for calculating scores, or index rebuilds, whatever. So what were the challenges? One of the challenges we had was that page, because that page was awesome: we had lots of information.
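Based on the Tiki documentation, a list plugin call on a wiki page has roughly the shape below. Treat this as a hedged sketch rather than a copy-paste recipe: the tracker id and template filename are invented for illustration, and the exact parameters vary by Tiki version.

```
{LIST()}
  {filter type="trackeritem"}
  {filter field="tracker_id" content="5"}
  {OUTPUT(template="campaign_table.tpl")}{OUTPUT}
  {ALTERNATE()}No campaign found.{ALTERNATE}
{LIST}
```

The filter lines select which tracker items to fetch, OUTPUT hands the result to a Smarty template for full control over rendering, and ALTERNATE is what appears when nothing matches, which corresponds to the three sections described above.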
It doesn't show here, but those columns actually have related information which lives in different trackers. So that's one of the cases where you have those templates which call another list plugin, which calls another list plugin, which calls another list plugin. Obviously, the first year everything was great and you had everything here. We were using Tablesorter: you can sort on any column, you can filter in these places, and you can move the pagination around. It was all client side, meaning all the data was in the page, and after the third year it started to get Cloudflare timeouts. So we rewrote the templates to optimize, to do some caching ourselves in the code. And then we had to raise the memory limit, because, you know, trade-offs. But that's not a solution which is going to last forever. It's solved for this year, and they want to have five years of data, I understand. We'll see. So basically this will need to be rewritten using custom search, and just paginate. Also, they have the download button because they want all that information in CSV so that they can do more data mining. So we will rewrite that, but let them download subsets of the data, and that should solve it. And, I'll talk about the CSV extract here. That was another issue we had. Every link here generates a CSV file which again gets data from plenty of trackers. For the big labs we did have some timeouts, and we were about to do the same thing, you know, rewrite and optimize the TPL, etc. But luckily we had another idea, which was to talk with the customers, who explained that those data hardly ever change. So the solution is not to calculate them when you click on the button: we are just going to use some caching and have a nightly mechanism, or, you know, generate the caches at the right time, and just link to a file. Yeah.
So, our main lessons and improvements, given that, as I said, we have a similar project which is about to start. We wanted to see what we could do better. Essentially it worked, the customers were happy, but we can still improve things. What we are going to improve: use the more sophisticated Tiki permission mechanism which is called templated groups. That's for the permissions about what people are allowed to see depending on the groups they are in. We just used a simple way, and then we had to add another layer of security in the Smarty templates; we want to avoid that in the new project. Make sure all the layers of data are present in the design. Well, that's always hard, because it's always hard to realize that there is a missing table or tracker, and it makes a lot of extra work to discover that too late. Then again, I'm totally convinced that it would be even worse if we were working in raw SQL. The other lesson is, as I said, that Tablesorter is not a tool for data mining huge data sets; that's the summary of it. So you have to get your customers to accept that sometimes they have to use pagination and not have everything available; there are technical limits. Same thing for generating huge CSVs. And we have also taken advantage of this: we are going to improve the list plugin, which will be expanded with sublist sections, which will basically allow us to do joins without having to do that in TPL files. And that's about it. Thank you, Jean-Marc.
How to get rid of Confluence: Comparing Open Source Knowledge Management Systems
So, hello, I'm Markus Feilner. Some of you may know me; I've been around the Linux and open source world since 1994. I started really early with Linux and had three operating systems on my computer at the time. Since the early 2000s, I've been an open source journalist working for Linux Magazine, and also for heise's iX, a big German tech magazine. And super, thank you. Wonderful, perfect. So, I've done a lot of things; I was also team lead of the documentation team at SUSE. Having done lots of things, and since we don't have that much time, I'd better be fast. Within this talk, I have a lot of links and hints for you on where to go to get much more information. Because all of you that are here are probably only here because you heard the term Confluence and you like it a lot, like everybody does. We started six minutes late, but we have 15 minutes left, I guess. Good. Okay, you ping me five minutes before we're done. Thanks. So, this is a presentation that is kind of typical for me: I'm going to rush through a lot of topics, but there are a lot of links inside, and you can go and find a lot of things. If you're not a German native speaker, you will find some articles that I wrote that you will have to translate. But the best thing, something we did in December or November last year, is a large tabular chart of lots of open source alternatives to Atlassian. I'm not going to dive deep into the things happening with Atlassian and the 15th of February coming up; I'm sure you're all aware of that, that the support is running out. That's a different thing. I'm going to talk a little bit about knowledge management and about a concept that we found at SUSE in the documentation team that I called agile recursive documentation. And generally, the problem in knowledge management, and that is what Atlassian is actually about, is that we all sort of "misunderestimate" the problem; I'm doing this Bush reference on purpose.
But it's like an iceberg, and this is probably the iceberg that hit the Titanic, an original photo that I found. I've been using it in knowledge management presentations because, like icebergs, there is implicit and explicit knowledge in companies. You have a lot of knowledge that is documented, that is fine, that everybody knows, but there is a lot of, Rumsfeld reference, unknown unknowns. About 80% of the knowledge in companies today is assumed to be implicit: knowledge that is there, but nobody knows that it's there, and people just act on it. And just announcing that you're doing knowledge management now will not help against that, and will not mitigate it in any way, because you have to take the people with you. Oh, I forgot something. The implicit knowledge also refers to, for example, people that go on a longer holiday, or that retire, and then are gone. There are stories about that; I can tell a lot of stories about what happens when the people are gone and you have to find out what they actually did and how they did it. Some people that I know had a Perl programmer in the company, and the Perl programmer had named all the processes and scripts of his whole big setup after characters from The Simpsons. So they found a process that was called Apu, and they were like, what does it do? And then they found out: oh, it forks a lot. And only once they knew this could they figure out what the whole Perl thing does. There are many stories like that. But in the end, you have to inspire and motivate your people to follow you. So you need solutions that work within the processes, and software that works and that the people in your company like. As many as you are here, there will be lots and lots of different solutions in the end. Because documentation and knowledge management is teamwork.
That's one article I wrote for Linux Magazine together with my former colleagues from SUSE. It's always teamwork; you always have to work together, and you have lots and lots of different people in the company. When I was a consultant, I once heard that we are the pathfinders, the mountain guides: we are the ones that find the trails, and then we tell others how to walk them. And as we have different people in the mountains, you have the locals, you have us, the pathfinders, you have the tourists, and other beings. So you have to make sure that everybody understands what you're talking about when you give them a description of something. And for that, we have the engineering part, the scientific background. Every one of these is such a huge topic that you can do university studies on some of them: knowledge management itself, organizing knowledge, process management, quality assurance, and then all of that basically combined into things like knowledge process quality management, KPQM. And at the end, there has to be a presentation layer; that's what the people see, basically the editor with which we work. But in the background, you have to do lots of ordering and indexing with metadata. Anybody here who knows the term RDF? Still, the semantic web and all of that stuff; yes, that's the background. And then you have taxonomies, terms, terminologies, registers, tables of contents, notations, catalogs. It's a huge scientific realm, and you can read books on each of those. But for a company, it's important that you do what is needed, not everything. And the representation, showing it to the readers, the customers, is actually the mapping of the information: how do you do this with models, glossaries, how-tos, encyclopedias, documentation? And you see, the type, the form of how you present things, is already coming in with glossaries.
And well, of course, what I'm going to tell you is the way, the right way. I found this yesterday here in Brussels, and I really like it. Maybe some of you remember Magritte: this is not a pipe. Somebody painted this on his garage door: this is the way. And I still have to dive into the apple; I think the apple is also a Magritte thing. But the cloud, I really like that. So, of course, what I'm presenting to you is the way to do it. No, it's just a suggestion. Because if you don't have time for all that scientific research, and that is the usual case, then you're not alone. It's usual that companies say: we don't have time to do the documentation, we don't have time for all that. And then, at SUSE, we were in the situation that we had five people in the documentation team, and we were told to grow to 10 people fast. And the people in the documentation team said: we don't have time to teach the new ones. So what we did was create this agile recursive onboarding. That means we have one new guy in the team, and this new guy will be, for example, the mentor for the next new guy or girl. And we designated a mentor who is in charge of teaching the new guy. I described this, for example, in an article in Linux Magazine. Why we call it agile: only because we're using agile tools and methods, like a Kanban board for it. So, this is what the new guy saw. Not exactly this one, but it's structured like this; this is an easy Kanban tool in Nextcloud that you can use, for example, for the start. This is the task the new guy had. Or: these are the tasks the mentor had to do before the first day, these are the tasks the new guy would have to do on the first day, this is the first week, and so on. You know Kanban; very easy. And out of this came the first documentation of the job, the description of what he's doing.
And this is an individual board for a new team member. This new person was from the start involved in making the team better, making the documentation and everything better, and yada, yada, yada. And it was recursive because the next one would again start at the same point, but with the improvements from the last one. And that is exactly what you can do with documentation and knowledge management in a company: you can start documenting things and have people be a part of it. That is the most important thing: take them with you. And there are other things that are really cool, but usually only larger companies can apply them. There are things that Stack Overflow and Reddit, those companies, do really well: you can have your customers, your readers, do the signposting, a triage by user demand, so the important documentation items come first. What nobody needs will disappear from the list, and topics that are interesting will go up in the lists and in the documentation. So if you're writing something nobody ever reads, maybe you could have invested your time better. But I'm jumping over to the tools, because that is what we are talking about. In decision making, it is also very clear: you have the important and regular items versus the unimportant ones, and thereby you decide what to document, on this scale. But it's probably not you here at FOSDEM who decides that; it's usually the management, because there's cost and risk involved. This is the stuff that we need to document, because if this guy is run over by a car, we won't know how to do this process, and that will cost a lot of money and customers will be angry at us. The team has the expertise, the knowledge or documentation team has the expertise, but the management knows about money and risk. So that is why they have to actually define what to document. And the tool? The tool that you're using should be the last decision that you take.
Otherwise you end up in technology-driven development or design. I don't know if you've heard that term before. That is, for example, happening with AI a lot: we want to do something with AI, or we want to do something in Rust, or in this new framework that we bought, or in Go. It's not uncommon that a company buys some development framework and then thinks: okay, what can we do with it? It should be the other way around. As in project management, you should see what you want, what you need, all the needs, then the risk and the money involved, and then in the end look at which product matches your requirements best. Well, with Atlassian, that's usually not the case, also because it's just simply there, and for a long time it has been there without competition. And now, in the last years, Atlassian did these moves to the cloud, this increase of license cost. And, of course, they have a good product, a viable product, and it's highly integrated with ticketing, and you remember Trello. I like to call them something like the Microsoft of knowledge, because everybody in the development world uses it and it's hard to get around it. And most people are not very happy about it, especially since I learned that the price increase that comes now for small companies can be up to 1,000%, because the small bundles start at 500 users or something like that. And it's not the usual increase; the usual increase is something like 10 times more than before, so not that bad. And they did also an article on that, which I don't have a screenshot of in here, about them forcing the users into the cloud. They say no, we don't, because customers can buy a data center license; that's the one that's like 100 times more expensive than before. They reacted to that too, but there are other issues as well.
They are an Australian company and thus part of the Five Eyes, the five countries that work very closely together on the NSA kind of surveillance. So there are GDPR issues. They even told their customers: in 2018, when we were working with Trello at SUSE, a mail came in that told us we shouldn't put business secrets into Trello, because that might not comply with data protection rules in Europe, already back then. We were like, what the fuck are you doing? A knowledge management tool, and you're telling us not to put business knowledge in there? And then recently there have been more issues. There have also been security issues, but okay, every software has those. But then there is also the fact that they are more focused on a global market than, for example, on a German or European market. So we had the situation at a company I was working with that we had severe issues with umlauts, the German umlauts like ä. We opened a ticket, and the answer was: oh yeah, most of our customers are using English, we are very sorry about that. We were like, okay, but that's just basics. So last year, and this is actually the core part of this talk, together with Tim Schürmann I wrote two articles in the IT-Administrator where we took a deep dive into the open source alternatives to Atlassian. Each of them is five pages, plus a large comparison chart that I have here; that's two pages. That wasn't in the printed edition, it's only online, and this is the link to it. It's in the presentation that will be on the FOSDEM website. Nice, take your photos, tell me when you're done, then I'll go on. Good. So we came to the conclusion that there are a lot of alternatives, and a lot of them are also facing a boom.
Some of them say: we have so many calls right now from people that want to get rid of Atlassian that we can't handle them all. One of the companies is run by friends of mine from Regensburg, where I come from in Bavaria, and they say most of them are customers that don't want to use Atlassian anymore, they don't want to do the move, and now they have this deadline coming up on the 15th of February, when support for their old product ends. So many customers are turning away from Atlassian; the priorities have shifted. As you see in this table, I'm going to show you a picture of each of those, and hopefully, if my brain is good enough, I will say a few words about each of them in five minutes. Great. So in this list we have, just name dropping for now, and it's alphabetical: BlueSpice, BookStack, DokuWiki, Foswiki, MediaWiki, OpenKM, Outline, PmWiki, Wiki.js and XWiki. Those are the ones we compared in this list. I'm sure there are more out there, but we tried to address those that are open source and have enterprise support. The first one: the biggest knowledge management software is of course Wikipedia. But Wikipedia, with the MediaWiki software, has one big flaw. Well, it scales obviously very well; it can run the seventh-largest website, in Germany I think. But it has only one use case, and that is not the enterprise. Clearly, its use case is making Wikipedia run and work, forever and really well. You all know Wikipedia, and what you see here is already the visual editor. I don't know if you've been to Wikipedia recently; last year or so they integrated this editor, so you don't have to work in wikitext or whatever it was anymore, you can really type inside the text. And derived from it there is an enterprise distribution called BlueSpice; that's the guys from Regensburg that I've been
talking about. Disclaimer: they're friends of mine and I blog for their website, so make your own image of them. They have a lot of things that enterprises need, for example rights administration, so they add usability and enterprise features on top. And they're open source, based on MediaWiki, and they have several editions, up to cloud, farm and SaaS offerings. Then we have XWiki; there was just a talk, so I hope I'm not saying anything that's wrong. XWiki is also very old, and in my opinion they are really interesting, because they build a lot of innovative features that go way beyond wikis, like their CryptPad, which is sort of an end-to-end encrypted, browser-based collaborative office space that, if I understand it right, works without the server being able to read your data. And then there is DokuWiki. DokuWiki is something that is often found in the scientific or educational realm. Okay, I've been working with BlueSpice, I've been working with XWiki, I've been working with DokuWiki. DokuWiki I found at a company I was consulting for that is connected to the German aerospace world. They have a lot of people that are experts, and that's where, I would say, the expert systems start. I think both XWiki and BlueSpice are something you don't need much expert knowledge to work with, but with DokuWiki it starts. But DokuWiki is actually really cool, because it has some features the others don't have. For example, there's a shortcut, and then the page you're working on becomes a presentation, just like that; that's really cool. And then we have TWiki, which we just heard of, I think, in the previous talk, or was it XWiki? TWiki was the old project, and Foswiki and Q.wiki are forks
of it. They have a lot of extensions and a lot of features as well, but I haven't really seen them that much in companies, and you have to see for yourself if they work for you. For me they are not a valid Atlassian replacement, because you need to be an expert to use them. And there are a lot more; I found these four interesting, and I think I'm good on time. There is, for example, BookStack, which is another very interesting project that works with books and shelves. BookStack has this imagery of knowledge being stored in chapters, in books, on bookshelves, so they always use those metaphors from books: they have pages, chapters, books and shelves. And this is the page for the access rules...
The Challenges of Creating a FOSS Fact-Checking Platform for the Brazilian Community
Okay, so thank you very much for staying for the last talk of the day, and the last talk of the year for the collaboration devroom. So thanks a lot; I see people that stayed a lot during the day, so thank you. The last presentation, by Matheus, will be about the challenges of creating a FOSS fact-checking platform for the Brazilian community. Thank you. Thank you. Yeah, can you hear me okay? Yeah. And thanks, I appreciate everyone here; I thought there would be fewer people, so I really appreciate that. It's my first time at FOSDEM as well. Let me introduce myself. I am a senior product manager at the Wikimedia Foundation working on MediaWiki, but this project is a different hat that I wear: I'm a volunteer at an NGO in Brazil, and we are trying to combat fake news and make something that is open in software, in data and in knowledge. So what I want to talk about here is the mission of this project, and I'm going to go over the challenges of trying to do something against misinformation in Brazil. Brazil is a very fertile soil for misinformation and disinformation, especially recently. The reason this arose, for me, was basically seeing how my family was sharing and spreading misinformation; that's how it all started. So I'm a co-founder, I'm not the only volunteer on this project, but we started imagining a society where everyone can freely access and engage with true and reliable information, with autonomy. And I kind of stole a little bit of the Wikimedia Foundation mission here, because we share similar values. So the mission of this project is to encourage educommunication. I'm not going to talk much about that term, because the other co-founder, who is a journalist, is the one that kind of coined, or uses, that term. Basically it means that we can only achieve our mission if we actually educate people, and we need to communicate that.
So the platform, the product and everything it entails all focuses on specific pillars. The idea is that, with the values of accessibility, credibility and autonomy, we're creating autonomous individuals in Brazil who are able to access information, or at least question information, without losing credibility. When we started, there was this study from Kaspersky: more than 70% of Brazilians with internet access have believed in fake news, and 62% of Brazilians failed to recognize false news. This was the study that, at the time, motivated us to keep going. And the challenges were immense, because they forced us to tweak planning, change and pivot a lot during the foundational years, which is what I call this timeline. I'm even excluding here all of the technical exploration that I did since 2018. In 2020, when I thought, okay, we have the technology and maybe we're ready to proceed, we signed up for the Mozilla Open Lab. We participated, received product mentorship, and prepared an ideation which came to be AletheiaFact. That exploration with Mozilla was very interesting, because we came in as dreamers: we want to do something multilingual, for everyone, we want to fix the world, because the motto of the program we participated in was fixing the internet. And then we learned that it wouldn't be like that. So we would have a focus group; we would only focus on Brazil. We would look into people that are already engaging with fact-checking, or at least reading about fact-checking somehow: active readers, independent fact-checkers, even professional fact-checkers. And then we would even look into specific demographics, from the age of 18 to 29, which would represent about 16% of Brazil.
And we set a goal: if we get 0.1% of that, that's 35,000 potential independent fact-checkers, which is an increase of 7,000% over the professional fact-checkers in Brazil. And I'm going to talk about this a little bit. So we did this exploration at the Mozilla Open Lab, then we started working on the infrastructure, launched it, and tried to experiment more. We participated in TTO, the Truth and Trust Online conference, where we introduced the concept of the democratization of fact-checking in Brazil, which is something we are looking forward to. And in the same year, we started a residency at Projeto Comprova, which is a group of news outlets and independent organizations that combat fake news in Brazil. By participating in that, we actually engaged with our personas, the professional fact-checkers. From that point, we started exploring a platform focused on professional fact-checkers only: we wanted to do something to speed up their process, make sure the process was optimized, so they could actually chase fake news and combat it. With that in mind, we thought we were ready to formalize, and then we started understanding what that would entail. So with these learnings, we defined that with process transparency and didactic representation, we could enable a fact-checking manual, an operational guideline, and then we could replicate that and create the autonomy of the individuals that we wanted. So the methodology should be accessible and understandable, and that's the requirement that forced us to align with the Creative Commons license for our data; everything we create, from courses to workshops, is open and available. We also had, in the last year, and I'll talk about that too, multiple workshops and partnerships with universities in Brazil, and all of that was free and based on the knowledge-sharing proposal.
And the platform, or the product that I mainly worked on, would be just a facilitator, a place where people could engage. Our main goal was, and we were here in 2021, that we wanted to reach the Brazilian elections with something that could be used for good. So then we formalized, and then we participated in the Global Fact 9 conference, which is organized by the IFCN, the International Fact-Checking Network. When we got there, we went to validate some of the use cases we had built, and it was like a bucket of ice on us, because we understood that there is a different dynamic happening in the fact-checking community. One of the things is that they are very worried about being open, mostly because that can be weaponized by bad actors and create more problems for them. So the software that they write is not always open source. Licenses: because they are mostly tied to news outlets, they don't always follow the Creative Commons license. So everything that we had built to create a shared space, a public digital space for people, was not going to work with professional fact-checkers. We understood that, we pivoted, but we kept the same model and the same values, went forward and launched a platform with a few people, about 15 volunteers. On the platform, we created a process where we would listen to the debates during the Brazilian elections and do live fact-checking. And it was good. This is just a screenshot of some parts of the functionality: here, if something is highlighted, it means that it was fact-checked by someone. And the experiment went very well. Very small as well: if you look into our views, you're going to see that we only had about a thousand views. But the impressions from the people, and what we were able to achieve, were good data to proceed forward.
But because this is a very small project, a very small organization, and we were dreaming big for a presidential election, trying to have an impact, and of course the stretch goal is there, what we learned is: there is a use case, we can do this, but we need to begin small. So we took a step back and looked into what we can really do with our resourcing. From that, we decided that we would have three product pillars: fact-checker productivity, access to credible information, and reducing obstacles to participation. Talking about the last one first: because we were very small, it doesn't work like our reference, Wikipedia. You cannot just go anonymously and do a fact-check; you cannot go and create an account and start doing it, because we don't have a governance model, or we didn't have one at the time. We had a lot of obstacles; we put obstacles there on purpose so we could test with only a few people and understand what we need to do. And now we need to remove those obstacles, because we are becoming more confident that this can be used by everyone, and we are going to have a procedure, a code of conduct and a governance model that will help. Access to credible information is the one we are not focusing on too much right now, because we believe that from the model that we have, plus the productivity, we are going to create the credible information; but access to it is a little bit different. We need to make sure that the audiences we serve are actually able to access this: they can access the platform, and we also provide good SEO using the ClaimReview schema, to be searchable without people having to go to the platform.
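The talk doesn't show the actual markup the platform emits, but ClaimReview is a real schema.org type that search engines read from JSON-LD embedded in a page. A minimal sketch of building such an object could look like this; all names, URLs and the verdict below are hypothetical examples, not taken from the AletheiaFact codebase:

```javascript
// Sketch: building a schema.org ClaimReview object for a fact-check page.
// The field names follow the public schema.org vocabulary; the values
// are invented for illustration.
function buildClaimReview({ claim, claimant, verdict, reviewUrl }) {
  return {
    "@context": "https://schema.org",
    "@type": "ClaimReview",
    url: reviewUrl,           // page where the full review lives
    claimReviewed: claim,     // the sentence being checked
    itemReviewed: {
      "@type": "Claim",
      author: { "@type": "Person", name: claimant },
    },
    reviewRating: {
      "@type": "Rating",
      alternateName: verdict, // e.g. "False", "Misleading"
    },
  };
}

// Embedded in the page as <script type="application/ld+json">...</script>,
// this is what makes the verdict show up in search results.
const markup = buildClaimReview({
  claim: "Example claim text",
  claimant: "Example Politician",
  verdict: "False",
  reviewUrl: "https://example.org/reviews/123",
});
console.log(JSON.stringify(markup, null, 2));
```

This is what lets the fact-check be "searchable without people having to go to the platform": the search engine surfaces the verdict directly.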
However, in Brazil there is an inequality, a disparity of resources, that includes internet access, in the sense that people can have data plans that only work with WhatsApp but cannot access the rest of the internet; they can access Instagram, but they cannot access Wikipedia. So the access level is a totally different game that we are still studying how to approach; but because it's not our focus, we should not lose sight of it either, and that's why it's a pillar. And the productivity pillar came from an opportunity we had: in 2023 we started a partnership with the University of the State of São Paulo, which I actually come from and did my bachelor's at, and we had four interns. And this is the other co-founder, our CEO; I like to call her my president. These four students were our focus group, in the sense that they were very fresh, first-year journalism students: they know nothing about life, more or less, maybe, I don't know, maybe they learn from TikTok now, but the idea is that they should have no prior knowledge at all and just be willing to work on it. And we had a very smooth process, and it was very refreshing, because they finished four months of internship having the same level of productivity as news outlets when delivering fact-checking material. Of course, these are different levels of comparison, but in any case, as I mentioned, the platform was launched a while ago; 2022 is when we did the experiments with the debates, and since then we see that, at least on the platform, we are increasing engagement with the functionality. We still have pretty much the same unique visitors, but people are using the platform more, because now we actually have productivity features for the team. So these are just a few screenshots of the platform; the code is on GitHub, and you can access the link at the end of the presentation.
And this is an example of a fact-checking report that is available after the fact. As I mentioned, we focus a lot on the productivity of the fact-checkers, so we started putting in place tools tied to a specific, but flexible, workflow. Technically speaking, we are using state machines here to control everything, and we adjusted the processes, adding different steps depending on what we learned with the team. The idea is to have visibility into productivity and also be able to collect data and see: is this actually improving? Because if we actually reach the goal I mentioned before, 35,000 people checking facts in Brazil, it's going to be a very different thing to administrate and keep running smoothly. So yeah, these are the learnings. A few things I would like to mention from the experiment: in Brazil the open source community is very spread out, not so well organized, so the whole period of creating and testing the software only captured six volunteers actively working on it. That may look pretty good, but I forced most of my friends to go there: hey, come on, you know how to QA things, can you help me QA this; you are a DevOps engineer, can you create a pipeline for me. So this was kind of a best effort from a community. But after all of this, what happened was that a lot of entry-level engineers were starting to look for something to work on, and because we had partnerships with universities, we started having people just coming in, trying to learn with the software, which was a very good experience and something that I would like to explore moving forward. It requires more management on the technical side, being able to actually provide good feedback to them, but we now have two or three active volunteers that are 20 years old and just learned how to code, and now
they actually provide good development for the platform. Of course, there is a skill gap to consider. So yeah, there was this challenge of the time frame: when I look at this project, five years is a lot for the stage we are at, but there were multiple factors. It's a product for a very specific area, it's a problem that no one has solved yet, and the only way forward that I see is doubling down on the educational effort toward the actual goal, which is being able to provide credible information and stop the spread of misinformation. There is no other way around it: there is no other software, there is no AI, there is nothing that can help other than humans understanding what they are reading and having something that is accessible. I also put "multi-generational" here, because in Brazil the disparity in misinformation spread is age-based as well: the older you are, the bigger the chance of being a victim of misinformation. And what we are also going to be looking at is generative AI, and I write here, with all the truth in my heart, that I have no clue what we are going to do about it. The reason I put it here is not that we need to use generative AI, but that we need to defend against it, which I think is a very different perspective. Of course, maybe we need to make peace with the devil and use the same tools to fight at the same level, but the concerns are different, because we are already seeing, in multiple elections around the world, the usage of deepfakes and generative AI to manipulate public speech. So this is going to be a very difficult thing to do; however, because we are losing the battle, we need to consider it. And that's it. The code is open source; it works specifically for an audience in Brazil, it has been tailor-made for that, but since the beginning we were concerned about serving multiple audiences, so it allows internationalization. And the stack
that we chose was Node.js and React, because of the ability to find more people to join the effort. But of course we are now considering whether it makes sense to keep the same stack or rewrite some of the stuff, because if we want to be lean and optimize for some use cases, we might also have to consider performance and other things that the platform doesn't provide right now. And I forgot to mention something very important: everything that we do is integrated with Wikidata right now, and we have efforts to integrate with the whole public data infrastructure. The idea is that we only keep what is needed for fact-checking; information about personalities, more information on Wikipedia, all of that should be included by integrating with the ecosystem. And in the end, we encourage other people to build their own communities and be part of the movement: fork it, change it, test the same things that we did. I think it's something very, very important, and it's going to change a lot in the next few years; I believe we should double down on the effort as much as we can. That's it. It was supposed to be a GIF, but it's a PDF. So thank you, thank you very much for the attention. Thank you, Matheus. Any questions? Hello, so I was looking at the website and I see the personalities and the declarations, and I see the reviewers; but then how do you define, or who puts in the new declarations for the fact-checkers to actually check?
Yeah, so one of the things that we learned from the fact-checking procedure is that monitoring has a specific operational guideline. One example a fact-checker might face: they receive a piece of information, and they look into how it is spreading on Twitter/X, how it is spreading on Facebook, how it is spreading on WhatsApp. Is this a big effort, does it even make sense to check it? Because there is a thing called strategic silence: if someone checks it, it becomes more public and it spreads more. So the decision on what to put in there belongs to the volunteers that have the capability to operate the platform. Right now, in order to operate the platform, you need to go through the training, understand how our code of conduct works, sign that you understand it and will vouch for it, and once you understand the whole process, you are able to operate the platform and you will be responsible for monitoring. So these volunteers are the ones that also select what is put in there. But we receive suggestions from the community in Brazil, from people that follow us, and we take those into consideration when deciding whether we are going to put them on the platform or not. And because it's a small group, it's going to be small data for now, but the idea is to streamline this process and grow, and possibly the monitoring will need to evolve based on that. Does that make sense? Thank you. The last one: from the UK, we have a few fact-checking organizations there; do you connect, or are you intending to connect, with similar organizations around the world? Yeah, so we did. When I talked about being part of Global Fact 9, which is run by the International Fact-Checking Network, we learned a lot and met a lot of them. I think Full Fact is one of the biggest ones, and one that we got in touch with, and now we are in the process of becoming part of this network. There are a lot of criteria, and
if we do it, we are going to be the first open project that actually enters, so we are having some trouble matching some of the criteria that they ask for. But we have some connections with India, we have some connections with the Latin American network as well, and recently we joined the network that covers only Brazil, so all the news outlets in Brazil are also part of this network, and we are connecting with them as well. Yeah, we need that. Okay, thank you. Other questions? No? Okay, thank you very much. And, well, that was the collaboration devroom for 2024, so thank you for staying until the end.
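Earlier the speaker mentioned that, technically, the platform's fact-checking workflow is controlled by state machines. A minimal sketch of that idea follows; the state names and transitions here are hypothetical, since the talk doesn't show the real AletheiaFact workflow:

```javascript
// Minimal fact-checking workflow as a state machine (a sketch; the real
// platform's states and steps are not named in the talk).
const transitions = {
  submitted: ["assigned"],                  // a monitored claim enters the queue
  assigned: ["in_review"],                  // a trained volunteer picks it up
  in_review: ["cross_checked", "assigned"], // can be handed back
  cross_checked: ["published", "in_review"],
  published: [],                            // terminal: the report is public
};

class FactCheckWorkflow {
  constructor() {
    this.state = "submitted";
  }
  // Only moves along an allowed edge; anything else is rejected,
  // which keeps the process auditable and lets you measure how long
  // each claim spends in each step.
  advance(next) {
    if (!transitions[this.state].includes(next)) {
      throw new Error(`illegal transition ${this.state} -> ${next}`);
    }
    this.state = next;
    return this.state;
  }
}

const wf = new FactCheckWorkflow();
wf.advance("assigned");
wf.advance("in_review");
wf.advance("cross_checked");
wf.advance("published");
```

The appeal of this design for the productivity pillar is that steps can be added or reordered by editing the transition table, and every state change is a data point for the productivity metrics the speaker describes.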
How do you change the governance model of an established open source project?
All right, awesome. Thanks everyone, thanks for joining us today. So my name is Ruth Cheesey, and I'm going to be talking a bit about how we went through the process of changing the governance model in the Mautic project. If you haven't come across Mautic before, it's open source marketing automation; we've been around for about 10 years. I'm not going to talk much about what Mautic does, but we've got a stand in building H, so if you want to come and chat, some of the community will be over there. So yeah, I'm project lead of Mautic, and I'm also co-founder of the Women in Open Source community. You can connect with me on LinkedIn by scanning that QR code, but the slides should also be on the FOSDEM website afterwards, so if you need to check something, you can check all the links that I mention; everything is up on the FOSDEM website. So let's start off by talking about what we actually mean by governance. In open source, for me, governance can be something as simple as a few paragraphs on a piece of paper, or, in a bigger project, it can be a lot more complicated. But ultimately it's about how power structures operate within your project, how decisions are made, how interaction happens within the project, and ultimately about steering the human collaboration and software evolution in your project. So where did we come from as a project? Well, we were originally what I call a corporate-backed open source project, and what I mean by that is one company backing the project, with all of the governance built around that one company. We were founded in 2014, GPL3, and in 2016 the founder created a SaaS company providing the software to enterprises. In 2018 we had our first ever community-led release; it was the first time someone who wasn't an employee of the SaaS company led a release.
In 2019 the SaaS product was acquired by a company and rolled into their marketing suite, along with the brand, the trademark and everything to do with the project and community. And then in 2020, soon after that, we created a governance model to make it clear what the company involvement actually was, what the community involvement was, and how we made decisions collaboratively. And this is what that first model looked like. You can see at the top here: the pale blue roles must be filled by a member of the company, the dark blue ones must be a member of the community, and the gray ones here can be anyone, company or community. So there was quite a lot of corporate involvement in there, mainly because the company wanted to steer the project and support the project. This was developed in collaboration with the community, but very much designed by the company to make sure that they still had a say in the project. So the key decision-making structures we had here were these. The company: it owned the trademark and gave the community the ability to use those trademarks; it employed the project lead, which was me at the time; and it chose the company representatives on the council. The project lead was hired by the company, and the job was to steer the project in the right direction, to organize the community, to remove any roadblocks, but also to be the bridge between the company and the community. And then we also had a community council, which I showed there: four people from the company, four people from the community, dealing with issues that cut across the project. So not issues to do with one particular team, but things that were slightly more complex, or maybe needed a bit more thought before they were acted on. But for all intents and purposes, those community representatives were the team leads when we first started, because we didn't have enough people active.
We just kind of said: if you show up, then you can be the team lead, really. So in April last year, the company informed us that they weren't actually able to support us at the same level they had been supporting us up to that point, and so things needed to change, basically. Because of that, we needed to find a way forward that wasn't going to involve being backed by just one company. The first thing we needed to decide was: what is the fiscal structure going to look like for the project? How are we actually going to organize ourselves? How are we going to manage the governance, things like that? The way we made this decision was, initially, going away and doing an awful lot of research: looking at what other open source projects are doing, which projects have changed their governance models over time, and how that worked out for them, and bringing that all together into a proposal that I would take to the council. And at this point it was only me who knew what was happening with the company. So one of the options was joining a foundation or an umbrella organization that could support us. What was important here is that we would still be autonomous: that we still had the ability to decide how we did things, what tools we used, and so forth. So there were pros and cons to that approach. Another option in front of us: we were at that point using Open Collective to manage finances, so if we ran an event, we had somewhere for the money to go, but we were only using it for finances. So there was also the option of expanding what we used them for, to provide some of the services that the company had given us: holding trademarks, holding our assets, employing the project lead, providing legal support. So that was another option open to us. And then there was also creating our own nonprofit organization.
Creating something ourselves, maybe a 501(c)(3) or a nonprofit CIC in the UK, that would deliver all of those things that I just talked about, and we would be able to do that for our open source project. So, some of the resources that I found useful in this process are up here. Governing Open is a really great starting point if you're having to think about governance; it's got lots of resources and links off there that can get you going. There's also a really great one from the Python community, PEP 8002, which explains the governance models of lots of different open source projects: how they've changed over time, what went well, what went wrong, what was difficult. They're not all the same kind of projects as us, but they were encountering similar kinds of problems, so that was a great source. And FOSS Governance: if you need any kind of document, whether it's a code of conduct, a privacy policy, what donations we accept, a governance model, there are absolutely loads of awesome resources there, and you can also upload your own resources to share. It's a to-do for me to actually upload our new governance model there. And also, if you're thinking about going through this, don't underestimate the power of the network. There were just so many people who took my calls when I was like, I need to speak to people about this to get some ideas, who gave me some good contacts and pointed me towards specific things that would help in this process. So if any of you are those people who I spoke to, thank you so much, because it really did help. So once I'd come up with, well, those are the three things we could go for, and as project lead, these are the pros and cons, I think, for those things, I shared it with our council and then later shared it with our team leads and assistant team leads as well.
So there were about 10 of us at this point tossing around the ideas of what we were going to do and what we thought was going to work for the project. The challenge, of course, with anything in open source is reaching a consensus. People had views on what was going to be best for now and what was going to be best for the long term, but ultimately we were able to come to a consensus together. And that consensus was that we wanted to become an independent open source project, to use Open Source Collective more, and to refactor our governance model accordingly. So that news was shared in April; you can read the independence blog post there. And actually it was one of those moments where you hit publish and you're not quite sure what the response is going to be, because you all believe in it, but you're really hoping everyone else is going to too. And it was a really positive response. So some of the things that we learned from this: language really matters. We're a massive international community, and we invited people who we trusted from our main communities to translate that important announcement, so that people in the local communities could understand what it actually meant, in their local language. And they really valued the fact that we'd taken the time to do that. So for major communications, that was really helpful. We also had a lot of people who either did not care at all, which I couldn't really understand, but some people don't care about governance, they just want to use your product; some people at the other end of the spectrum who really cared a lot and were extremely passionate; and then some people in the middle. So I guess I'd say the lesson learned is that you've got to be prepared for all of them, not just the positive, but also the negative criticism that comes with that. And also being available. So at this stage it was really helpful to have opportunities to engage.
We had webinars with a translator for our Brazilian community and for our German-speaking community, where people could actually hear what the changes were and what they meant for them, and then they had the chance to ask questions. It was also really helpful to have open-door office hours, where people could literally just drop into a Zoom call and talk with me or with the team leads directly about whatever they wanted to talk about. Okay, so one of the things we had to think about when we were actually creating this governance model, once we'd decided what the structure was going to look like, was: do we actually need a hierarchy at all? Someone in the community was saying, actually, I think we should have a completely flat organizational structure; I don't think we need to have leadership and councils and things like that. We did a lot of research on that. We couldn't actually find any larger open source projects that had that structure, and we didn't think it was going to be practical for us over the long term not to have some kind of hierarchical organizational structure. So we did investigate it, but we decided, yeah, we do think we still want to have structure. But we also decided that some of the structure we already had was actually working all right. The teams and the working groups were working all right. The council was working okay, but it wasn't democratically elected, it was chosen, and so we wanted to change that so that it was actually chosen by the community. We also didn't have a step in between the council and the teams where the community got to discuss and debate changes, which then go to the council to be enacted. So that's what we introduced with the general assembly, which is a collaboration of members who can debate and decide, and then things go to the council to be enacted. So that was the structure that we came up with for the project.
But the next step was: if we vote in a council, how do we make sure they don't all disappear at the same time? Because we were going to be doing this at a specific moment in time. And for this, we took inspiration from the Python Software Foundation. So we did an election, we had people voting, and then we ranked the candidates. The top three people got three-year terms, the next two people got two-year terms, and the next two people got one-year terms. That worked really well; the community found it really positive. We did have two people right on the border who got the same number of votes, so we just had a conversation about who wants to do three years and who wants to do two years. But that seemed like a really good way of making sure that we have fresh blood coming into the council as well. And then, who actually manages the project lead? Because they were employed by the company, and now they're employed by the community, so who manages that? Ultimately, we decided that would be handled by the council, so the project lead would be reporting into the council, basically. One of the things we also had to think about was: how do we make decisions? Because although obviously we'd been making decisions all along, it wasn't really explicitly clear how long we give for different types of decisions and what methods we use. This was also a subject that we did lots of research on. We needed to find a way to do voting, to make the voting fair, and to make it a system where we could easily roll out a vote on anything, basically. So we ended up using an open source tool called Decidim, which we've implemented at community.mautic.org, which gives you a voting system. It also lets you run events and meetings with transparent notes using Etherpad. And that's actually worked really well. So that's the tooling we implemented to do the practical voting. And then once you have voting, it's like, well, who gets to vote?
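The staggered-term assignment just described can be sketched roughly as follows. This is only an illustration: the function and candidate names are hypothetical, not the project's actual tooling, and as mentioned, a tie on a boundary was settled by conversation rather than automatically.

```python
# Illustrative sketch of staggered council terms, inspired by the
# Python Software Foundation model: rank candidates by votes, then
# the top 3 serve 3 years, the next 2 serve 2 years, the next 2 serve 1 year.

def assign_terms(results):
    """results: list of (candidate, votes) pairs. Returns {candidate: years}."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    terms = [3, 3, 3, 2, 2, 1, 1]
    # Ties on a boundary are resolved here by sort order only; in practice
    # the tied candidates simply agreed between themselves.
    return {name: years for (name, _), years in zip(ranked, terms)}
```

For example, `assign_terms([("ana", 50), ("ben", 41), ("cay", 37), ("dee", 30), ("eli", 22), ("fay", 15), ("gus", 9)])` gives ana, ben and cay three-year terms, dee and eli two years, and fay and gus one year.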
And this, again, was quite a contentious subject. What we decided was that we would have different ways of being eligible to vote. One is financial: you pay some money, you get to vote. It's $100 a year, or you can use the Big Mac index, which proportionately reduces the amount based on the comparative cost of a Big Mac where you live. You can Google it; it's by The Economist. We already used that in other places in the project and people find it helpful, so we just used the same system we were already using. Then there's contribution-based: approximately five hours a month, consistently over three months, and you can apply to be a contribution-based member. Corporate, where we have tiers from $1,200 a year up to $30,000 a year. And an honorary membership for people who've made extraordinary contributions to the project. So those are the membership types that we decided on. Once you've got the types and what have you, people then started saying, but I do more contribution than him and I want to have more say. So, here be dragons: this is a really difficult thing to get your head around. It can get very complex very quickly, and it can be exploited very easily. So we just decided: one member, one vote. Whether that's an individual human member or a corporate, they get one vote. And that works because they have one account on our community portal and one entry in our membership list, and the membership list is who has the ability to vote. So that kind of simplified it. People wanted to get really complicated, but we had to start somewhere. And then, how are decisions made? For trivial decisions, we decided we don't want to wrap red tape around them: if it's trivial, it's not going to impact many people and it's reversible, just make the decision. Talk about it amongst yourselves, make the decision. If it's non-trivial, like how many tracks should we run at a conference or who should we invite as a speaker?
Or if there's a code situation where there are a few different options, but they don't have major impact whichever one you take and it can be reversed, then we say that's a 36-hour time box, taking into account holidays and things like that, but generally 36 hours. And if it's a significant decision, which impacts several teams or the whole project, or has financial impact, or is not easy to reverse without significant consequences: at least a two-week time box. And those decisions happen on the portal, so that everybody who's on the portal sees things happening; they see the discussions and they can be involved in the decision-making process. And then ultimately we try to get to a point where we come to a consensus. We default to lazy consensus: if nobody has given an opinion and the time box elapses, the decision is made. If they have, we try to find a way to bring their feedback in, so everyone feels like they're on board, or they can at least disagree and commit, you know, which is the next best thing. So how did we come to the final version of the governance model? Discussions happen very, very fast. We had a channel on Slack for the governance discussions; I could go in there in the morning and there'd be like 250 more messages in a thread, and you're just like, how on earth can I keep up with this? If you come in completely fresh, it's really hard. So we tried to summarise this in a Google Doc, and each day someone would take it on to write down who had given which views and what the discussions were, so it was easier for someone coming in to actually get an overview of where we were at. When we got to the point where there was a first draft, I posted it up on the forums. I explained that this is a first draft of a governance model; anyone else is welcome to submit another one, but this is the one that we've been working on.
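The decision tiers and lazy-consensus rule described above could be sketched like this. The names and the exact classification conditions are illustrative assumptions; the project's real process is applied by people, not code.

```python
from datetime import timedelta

# Rough sketch of the three decision tiers and the lazy-consensus default.

TIMEBOX = {
    "trivial": timedelta(0),             # just talk it over and decide
    "non-trivial": timedelta(hours=36),  # e.g. how many conference tracks
    "significant": timedelta(weeks=2),   # "at least two weeks"
}

def classify(impacts_many, reversible, cross_team_or_financial):
    """Pick a decision tier from the criteria mentioned in the talk."""
    if cross_team_or_financial or not reversible:
        return "significant"
    return "non-trivial" if impacts_many else "trivial"

def lazy_consensus(objections, timebox_elapsed):
    """If the time box elapses with no objections raised, the decision
    is made; otherwise the feedback gets worked in first."""
    if not timebox_elapsed:
        return "open"
    return "decided" if not objections else "needs more discussion"
```

So a reversible, low-impact question defaults to a 36-hour window at most, while anything cross-team, financial, or hard to undo gets the two-week treatment on the portal.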
And the important bit is, as you can see down here, we chunked each section of the governance model, which was quite lengthy, into separate forum topics. So you could go and discuss the membership section, or you could go and discuss the council section, and provide your feedback there. And then, based on the feedback and suggestions, we could update that initial thread and people could see where we were at. And then we collated all of those back. This was time-boxed; we did actually have to extend it by two weeks, because people said there was too much to discuss and too many decisions to make in two weeks, so we extended it to four weeks. And once that was done, the positive thing about having it on the forums is that our community are predominantly marketers, so on Slack they won't be following it; but when they go to say, my Mautic instance is broken, or I can't send this email, they're going to the forums. So they're coming past this post in the forums, and we actually got more people involved who wouldn't normally be involved in these discussions. Then we posted the final version, basically for two weeks, for people to review the whole thing, and if there were still things they were worried about, they could respond to this thread. I highlighted all the bits that had changed from the first draft and why they had changed. Some had changed from the forum; some had changed from a panel that we did at our conference; but it was easy for people to check. So at this stage: long live the time box. I think it was Angie Byron (webchick) who told me to time-box everything when I first started as a community manager, and that's so true. Give people a fixed window and say, we will make a decision at the end of this time box. Delegating the research as well: if somebody's really interested in something, ask them to go and research it and bring it back, and then you haven't got to do it yourself.
So we've had some people who are super passionate about decision making, and they went and did all of the research on that. I am the worst person for complicating things, so: keep it simple. With governance, it can easily get really complicated, but we kept on saying, what's the core of what we're trying to achieve with this, and how can we get rid of some of the fluff that doesn't need to be there? And also this one: go where people are. In as many places as you can, talk about this governance stuff that you're trying to do: social media, sending emails, talking at conferences, talking in person. We actually had some code of conduct infringements during this, because people got so emotive about something that they really believed in. Caring doesn't mean you don't have to obey the code of conduct. And I think modelling the behaviour you want to see is really important. So when someone was disagreeing with something, one of the most useful things I learned to say was: you know what, I'm about six out of 10 that we keep this, because x, y, z. Or: I'm two out of 10 about this; I think it's kind of nice, but I'm not too worried. And then people have the language to understand and to communicate how passionately they feel about this thing and why, so you can get into a dialogue. And yeah, draft early, iterate often, be ready to chuck it in the bin, but get something on paper, because otherwise it just turns into this big nebulous discussion that never actually becomes anything, and that can be very frustrating. So where are we at now? It's been a longer process than I would have hoped for, mainly because of the community engagement. It takes time to get people to engage, to get people to give you their thoughts, and then to go through that process. But actually we've done all right. We published the final draft at the end of July, and we launched our membership model, where people could become a member, in August.
In October, the community portal came out of beta. It had been in beta for about a month, with a couple of teams using it. And then in December we had our extraordinary general meeting, where we inaugurated the council, who had been voted in through the nominations process, and we formally adopted the new governance model. So far, about 150-ish people have joined the portal. We've had 44 financially contributing members, actually it's more like 48 now, and 14 practically contributing members, who have joined via the practical contribution route. We've also got people who've paid even though they'd be eligible via the practical route, but hey, if they want to pay, great. We had the voting on the portal, which was really successful. And all of our meetings, so team meetings, working group meetings, everything, happen on the portal. People can join on the portal, they get the link, the notes are taken there, so people can see the notes from the meeting when it finishes. And it's been really good, actually; it's really been like a central place for all things community. So going forward for us as an open source project, what's next is financial stability. This is the biggest thing we're working on right now, because we don't have the backing of a big corporate anymore; we need to do this all ourselves. So we're exploring lots of different revenue streams: membership, but also having a trial system where people can try the software for two weeks, and if they wish to continue, they go into a contract with a provider, but we get a 40% revenue share for the first year, 30% for the second, and so forth; it decays down. So we're trying to be creative in exploring ways that we can offer value and also bring in money. We're very much focusing on product adoption. Our adoption is going like this, which is great to see, but we need to continue. It is a competitive sector in the proprietary world.
There's not much competition in the open source world, but we're still moving forwards. Also on the product development side: we're 10 years in, and we're dealing with an immense amount of technical debt, so it's also about making the product more stable and introducing many more features. And then finally, what we're really trying to move towards is transparent by default. We do that quite well, and we have done since 2019, but basically every time a leadership role expires, it's voted on through the portal. Every time we have to make a decision, let's take that debate to the portal instead of having it in Slack, on GitHub, wherever; have it on the portal, and then it's centralised. And also making use of voting: any time we need to actually have a vote on something, we now have a system we can do it through. So that's me done, and I think I'm just in time. Hooray! Yeah. Thank you. We have a stand, as I said, in the H block, so if you want to know anything about Mautic, come and chat. Questions? Any questions? I'll come back up. Oh, Lord, you're going to get your steps in today. So thank you for your talk. I would like to ask you: how do you manage liability, legally? And who decides the salaries, like the salary levels and things like that? So, one of the biggest expenses we've had in this whole process is legal. Open Source Collective, who are our fiscal host, have legal experts who are specialists in open source, so we used their services to get the right contracts for transferring the trademarks and all of that stuff. And they also review all of the contracts that we sign, because they have to be signed by the fiscal host, not by us. In terms of, what was the other question? Salaries. Salaries, okay, yeah. How do we deal with salaries? Yeah, thorny subject, because I'm paid by the community and I set the salary.
And I did that like three years ago, not knowing this was going to happen. What we did: we did lots of research at that point about what open source projects paid as an hourly rate, and also compared that with what we could actually afford to pay. It was when we were migrating from Symfony 4 to 5; we had big projects that needed a contractor, because we couldn't find people in the community. And we just set an hourly rate, and it's very low compared to big projects, it's $40 an hour, but that's what everyone gets paid in the project. We want to use a sliding scale at some point; there's a proposal being put to the council soon to investigate that. But with that comes a warning, because I live in the UK and that would probably end up costing the project a lot more. So do we really want to do that? But that's how we've done it. So yeah, anyone else? Hello, thank you for your presentation. I was wondering, what is the emotional impact of going through a process like that? And do you have any tips or tricks for how to navigate it? The emotional impact? Yeah, because I'm guessing you will have had to have some difficult talks. Yeah. Because I think you care about having fair governance. I think you need to have your own house in order if you're going through this kind of thing, in terms of you need to know yourself well, because it does get emotive, especially if you are the founder or if you are closely involved. In terms of dealing with other people, in dialogue with other people, a lot of it is that people are very passionate. So it's trying to understand what it is that they are getting emotional about and why they're passionate about it, and how we can find a way for that, if it's not constructive, to come into some constructive form, taking the bits that are really helpful. But yeah, just trying to be mindful of your own stuff and not projecting that onto other people when they're coming to you with ideas you don't agree with.
I don't know if that helps; it's kind of a non-answer, but yeah, sorry. We've got one more question. I'm fascinated by the voting system that you have. Projects have problems with people coming in and leaving quickly. You said one person, one vote: how do you make sure they stick around? Can you speak more to the voting process? Because projects always have a problem with that type of system. Yeah, so part of what we've done is that for voting you need to be a member, and that is linked to a benefit to the project, because you need to either pay or contribute, so the project's benefiting. Do you mean in terms of, how do you get people to care enough to vote? Yeah, I mean, people put money into it, but sometimes it's like, I don't really care, but you need their voice to say yes or no. Yeah, so people have said they've joined, but then they don't really care enough to vote. A lot of it is to do with one-to-one, or not one-to-one but one-to-small-group engagement, making sure that people are aware why it's important to vote on that thing, and you've got to accept that some people won't care. But I think it's about using that emotive language and trying to explain: this is your opportunity to have a say in this thing. We actually had probably 20 people become individual members because they wanted to vote for their favourite candidates in our election, for example. Another thing we're going to do is a template contest, where you have to be a member to upload a template, for example. So we're trying to do things like that to get people into the system and understanding how it works, so it's very easy to use. So thank you very much, we really appreciate that. And if you've got any further questions, and I know you gentlemen do, they'll have to be outside in the hallway afterwards, as we need to swap over the room. So thank you very much.
Please Make It Make Sense: Product Management Methods to Make Your Project's Purpose Clear
I think, are we good? Okay, so next up we have some product management content from Loria. Hi. Yeah, I've been a little bit of an AV disaster, so I'm going to have to look at my slides because I can't see them here. But here's the title of my talk today. My goal is to help you get more structure around your open source projects, hopefully save time, and ideally do less. Okay, so about me: I'm an American living in Germany since 2015, and I mention this because I came to Germany with a very live-to-work mindset and now I have a very work-to-live mindset. And you're going to see that mindset shift in my talk, in the messaging I share with you. Among my many open source activities has been contributing to Kubernetes, particularly SIG Release, and also, more recently, the OpenSSF Security Scorecard project. I have this link here, which I thought I'd highlight because you can find a lot of management and leadership guidance there. It's a collection of resources: blog posts, videos, templates, things like that, including some things I'll show you today. I've worked in lots of places; I'm not working now, my company shut down at the turn of the year. So if you like what I have to say and think I could be helpful to your organization, let's talk, and there's my LinkedIn in the meantime. I'll cover basically two branches in this talk. First is some observations from my time in open source; I'll sprinkle some helpful hints and examples along the way. And then I will focus on some tried-and-true traditional product management methods that work in a company setting. You've probably encountered them in your day jobs, but they also work in open source with a little bit of creativity. So, some of those observations. I see contributors taking on so much work: lots of issues, many times even multiple leadership roles, and it just seems like a surefire way to burn them out. Because they're so overstretched, they don't have a lot of time to do research and gather data.
Also, that's a skill set that not everybody has, and not everybody needs to have. But the end result is often that a lot of development is based on assumptions instead of data. Another thing I've noticed is that what exists today in a project isn't well defined or documented or mutually understood by the project team. This represents a pitfall, because you maybe don't have a shared understanding of what your project is and does and should be. And lastly, there's oftentimes a vague strategy, or even none at all. I would say that the most acute manifestation of this issue is that the boundary between what goes in a project and what stays out is often lacking. This can lead to a lot of work being done, and that work just kind of expanding. So if you take away anything from me today, it would be this message, which is: I really encourage and invite you to do less if you can. I know your manager may not want you to do less; there are always very specific conditions around that relationship, speaking from experience. So I'm happy to talk to any of you after the talk if you would like a sounding board on ways you can manage your manager's expectations around what you can do in open source with your limited time and availability. But if you are the pressure source, telling yourself to do all of the things, then I invite you to first ask yourself: does anybody even want this? I mean, maybe they do, but if you're the only person, or you don't have a very clear sense of how many people might find value in your project, maybe stop and collect more data before you move on. Also, keep your personal backlog light. I know some people really enjoy working with a full backlog, but they take on so much work that they end up becoming the blocker for other people to make progress. And you don't really want to do that, right? You don't want to impede your fellow project contributors' efforts because you're the decision maker on 10 different things.
So that leads to delegating. Delegating not just to reduce your workload, but also to empower others to gain skills that you have. And I know that's rather time consuming, but oftentimes what I've seen in open source is that a little bit of upfront onboarding and knowledge exchange saves everybody time in the later stages, because you have multiple people who can work on something at once. And the last tip is something I've used over the years, because I would just take on work too. I love it, like, let's be busy. And then I would find that the work I took on actually involved a lot more than I bargained for. So I highly encourage you to unpack a task before you say yes to doing it, because you may find that it's going to take you a significant amount of time. Here's an example of that. This is a project board that I created with collaborators from SIG Release in Kubernetes. The initial idea was to rewrite a tool from scratch. I heard that and I was like, you know, we may not want to do that, because that sounds really, really intensive. So what we did, over a couple of sessions, was figure out how much we didn't actually know about this particular tool that we were talking about rewriting. And what we had was a lot of questions, like: what is it, what does it do, what do users want? So you may not be able to see all this text, but just the TL;DR for you: there are a lot of spikes and decision-making and documentation tasks, like proposals to write to get community feedback, before even starting to write code. So this is what I mentioned earlier, the assumptions that we often take into our development plans. We had a lot of assumptions that we just had to rewrite this tool because it's just too broken and, you know, we'd just do it over. That's often not the case. And I just want to point out that I didn't come up with the idea of assumption-driven development.
I found a term that someone else created, and in my search to find out exactly who, I came upon this blog post, which I found really interesting. It's a developer who basically described his own failure trajectory because he was operating with assumption-driven development. What he did was decide to just take on a lot of work on his own. He didn't talk to anybody around him. He also didn't understand what he was working with day to day, like the tooling and all of the different tooling relationships, and also the knock-on effects of making changes. And he went in like, I'm going to do this, and it's going to be done. That also didn't turn out to be true; there was a lot more work involved than he had expected and planned for. So I thought it was a really great summary, from the developer's perspective, of why assumption-driven development is often not the best method to use. (I'm going to keep giving the talk, and you can ask questions after. Thanks.) So basically what I'm suggesting here, a way to conquer assumptions, is oftentimes just listening to your environment. And that starts with the people around you. There's this thing called active listening, and I found a nice resource from the Center for Creative Leadership, which gives you some behaviours that you can adopt to start listening more actively to your colleagues or co-collaborators and others you work with. They say, first of all, pay attention. We take this as a given, but in our world of smartphones and lots of distractions and multitasking, we often don't really fully pay attention to each other. One way this shows up is that we sometimes can't wait for the person to finish what they're saying before we want to get our own point out, and then we end up missing the latter half of their sentence, because we're too focused on our own sentence and what we want to say.
So active listening means that you don't do that: you actually let somebody finish, and then you ask. You can also do things like clarify what the person is telling you by asking them questions: I think I heard you say this, is that correct? Or, can you tell me more about what you're trying to say to me? And then, together, it starts to become a collaboration, because you're inviting them to also clarify their ideas for themselves. And you're getting higher quality information, because A, you're taking it in, and B, you're engaging with it in a team context to work out new ideas. In addition to listening to your colleagues and the people around you, you should also listen to your code. So I mentioned a few slides ago this idea to rewrite a tool from scratch. But if you don't really listen to your own code from the beginning, you may end up doing a lot of work that you could have avoided by just optimizing and selectively choosing what to work on. So having artifacts like docs and diagrams will help you to better reason about the work you truly should do, optimize, find the points where you can make things better, and plan accordingly. Here's another example from SIG Release where we applied this principle. We had this tool, right? And we were going to rewrite it. But I said, first of all, let's actually document the flow that the user follows to use this tool, achieve a job, go from point A to point B. And so an engineer in SIG Release did this, and then we gathered around his workflow as a group and talked through every step, figuring out what was really hard, what was taking a lot of time, what wasn't working. And as you can see from the results, the first line there is the overall flow, and then I blew up this section toward the end, where you see a lot of angry faces, and then there's this little clock, which means it was really time consuming.
You could then see, in the full landscape of this project's flow, where the pain points truly were. We were also able to use these notes to document exactly where the code existed that was executing these steps. What we walked away with was a much more focused plan for what we needed to do. We can start there, and then decide, after collecting a lot of information about these weaker points, what we should do next. Maybe we rewrite parts of this instead of the whole thing. When you have a workflow like that in place, it puts you in better control of your project. Now, if you have nothing like this in place, that's fine too. What we're going to cover next are some tools that you can apply as you start working on a new project, but you can also introduce them even if you have something that's several years old. It doesn't matter. It's never too late to understand your work and then organize yourself to do the highest value work in the future. So I'm going to cover: having a strategy, with a doc template; doing user research and surveys, including an example of a survey, which is the NPS; making a roadmap, with a template you can use; and then prioritizing and refining your backlog, with some methods and tools you can apply for those activities. Here's a strategy doc template that I just worked through with the Security Scorecard team to actually fill out. I know these little lines here are small and you can't see them; I'll get to that in the next slide. But it basically introduces the concept of the 5Ws that journalists typically use to write a news story, where the reader needs to know the facts of the story right away, and then if the reader wants more detailed information they can read on. It answers who, what, when, where, and why, as well as how.
The goal here is that you have an asynchronous tool, so you don't have to have a meeting around this, although I advise one, because you'll find that more information comes out when you actually discuss your strategy. But you can at least start with a template like this, and people can then contribute their comments and ideas to it. This is Miro, by the way. When you have this template filled out and you've gone through it with your team, you can dump it into a doc, refine it a bit more, and then publish it in your repository for the public to look at. And of course you can continuously revise it as your project develops and you discover new information. Those small questions in that template are here, basically. Not all of them, but some key questions that are quite useful for getting a sense of where you're going with your work. So, who are the users, as well as the contributors and the maintainers? But really, who are the users? Who are the people deriving value from your project today? And who should derive value in the future? What does your project do today? On the flip side, what does it not do? I mentioned earlier that boundary about what goes in a project and what stays out. When you can clearly explain what a project is, what it is not, and what it shouldn't be, you get a clearer sense of where that boundary lies. You can also think here about what the UX is like and what quality concerns and constraints you have. It's really just: what is your project, essentially? When is your project useful? What are the conditions that trigger a user actually coming to you to have their problem solved? Another way to look at when is: how long does a particular stage of your project's workflow take to complete? Where does your project fit in the ecosystem? I'm not going to go over the ins and outs of doing a competitor analysis here.
There are lots of templates online that you can look at to do one. But I highly recommend it, because when you take a look at other projects in the space that are solving a similar problem, you can assess the resources behind those projects. Maybe there are even products, so maybe there's a company doing what you want to do; they have a lot of money and can work quickly, and then you can consider what you actually have in your time budget to pursue. You can also see what those projects' and products' strengths and weaknesses are, and use that information to distinguish and differentiate what you want to provide. Maybe it's a niche that you want to really get a handle on and provide a really clear, good solution for, that no one else is providing. Maybe it's just that your project is community-based and the other projects and products out there are commercial, so you're going to be able to serve the community whereas those alternatives will not. Thinking about where your project fits in that landscape is really quite helpful. That leads into why your project exists in the first place. What value does it deliver? That puts you in the seat of the user who is actually trying to use your project and solve the problems they face. Another question I like to ask around why is the cost of delay. If we don't develop this project now, or if we don't iterate on it and provide these features or functionality, what bad things happen? What bad things happen to our goals? What bad things happen for users who continue facing this problem without any solution? What happens to innovation in general? There are really a lot of interesting conversations you can have around cost of delay. Then finally, how does it work now? This question is also a really nice hook for you to think about the future and where you want to be in 12 or 24 months. How do you want to build this to provide different features?
Maybe you redesign the architecture to be simpler. How do you want it to be? How is a good frame for that. As I pointed out earlier, we're going to cover some more tools and methods. The next one is user research and surveys. Having as much data as you possibly can really pulls you out of your own biases, which is what the developer with the assumption-driven development blog post was describing: I only listened to me, and it didn't work. If you're listening to your prospective users, your current users, other project leaders, you start to get all these different perspectives that can ultimately help you develop the right, most valuable thing, and not develop a lot of other things that will take up a lot of effort but maybe won't have such a payoff for you or for anyone else. Surveys should be kept quick and easy. I tend to use Google Forms. I know it's not open source, but it works. I don't ask people to write a lot, because you probably don't have time to read lots and lots of survey responses, and the respondents probably don't have a lot of time to fill out lots of forms. Use checkboxes, multiple choice, rating options from zero to five or whatever you want to set as your endpoints. Then you have numeric data that you can quickly turn into charts like this one, which was from a Google survey; it's just easy to make a chart out of the results. Another thing I like to remind people of is: please abide by GDPR. Be careful about how you're collecting the data of the people who are filling out your survey. Make sure you offer them a chance to give consent for the usage of their data before they move on. Another great way to collect user data is through discussions. On GitHub, you can post a question and see people respond to it. That can be a little more time consuming, because you're going to have to read through all of those answers.
But it can be quite useful too, because you get broader context. If you're in a hurry and you just want to say, hey, community, I want to know if you want us to do this thing or not, you can send out an issue and have them give it a plus one or not. You can use emoji reactions as votes. There are other tools out there that product managers use all the time, like Aha!, that offer this kind of voting functionality for feature ideas. And finally, interviews, which really can be quite time consuming. But if you have the time to do them, even just a few, you can learn so much about your own project. You can sit and watch somebody try to use it, see where they get stuck, see what's confusing to them, and collect all of that data and think of ways to optimize and improve. Oh, and this is a really important point: be careful with the results. A lot of times when people fill out surveys, it's numbers, so it looks all scientific. But it often isn't, because our users may be giving their feedback from a limited set of data points themselves. They may not be aware of all the alternatives, all the directions that your project can take. They may not have a full understanding of the functionality, because they don't have time, or maybe you didn't explain it well. So always be aware that when somebody tells you what they want, they may not actually want that thing. It may be the best guess they have for what would solve their problem, but in the broader context of other types of users, it wouldn't solve the problem in the best way. Just keep in mind that data can also be a little bit of a trap if not used carefully. I want to give this example of a survey that you can run very quickly. If you don't have time to set up a form yourself with lots of questions, you can still do an NPS survey. This is used by lots of companies, but it's quite useful in our context because it just consists of two questions.
Basically: would you recommend my project (in this case) to a friend or colleague? And then: can you please explain why you gave that score? The number part is very easy. You put it into some kind of NPS calculator, so I gave you a link to one; it's also the image source. You basically put in all that data and then you come up with your NPS. There are different analyses online of what a good score is; usually it's around 20, and when you're at 50 to 80, you're doing really well. That follows from the way the score is calculated. It's a pretty low-overhead way to collect feedback: are we on the right track or not? The next type of tool I want to show you is explained with this roadmap template, which you can adapt to your own needs if you'd like. It covers some of the who, what, when, where, why questions that I covered with the strategy doc template. But the roadmap is more short term: what would you like to do in your next, say, three to six months? It takes a slice of your strategy and gets you more focused on what you want to develop now. My strong recommendation is to keep it to a page or less, so that people can actually remember it. Keep the number of deliverables and goals low, like one to three max, using a metric to justify why each is necessary. If you don't have a metric, like a baseline to say we're doing this deliverable because X number of users want it, then you can think about the metric you want to apply to measure the success of your feature later. I always like to include risks, what is known and what is unknown, in a roadmap, just so that you can plan for the unknowns possibly taking time away from future development. It might be a bit of a distraction, but at least you're aware of it and you can work it out as you go. And then technical goals. This is to make sure that quality, observability and testing don't fall by the wayside.
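The NPS arithmetic mentioned here is simple enough to sketch in a few lines. This is a minimal illustration, not a particular calculator's implementation; the sample scores are invented, while the 9-10 promoter / 0-6 detractor thresholds are the standard NPS convention:

```python
def nps(scores):
    """Net Promoter Score: percentage of promoters (9-10)
    minus percentage of detractors (0-6); 7-8 count as passives."""
    if not scores:
        raise ValueError("need at least one response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))

# Hypothetical survey: 5 promoters, 3 passives, 2 detractors
print(nps([10, 9, 9, 10, 9, 8, 7, 8, 5, 6]))  # → 30
```

Since the result is promoters minus detractors as percentages, the score ranges from -100 to 100, which is why a modest-sounding 20 already counts as good.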
I see this happening in a lot of projects, and products as well, where all the stuff that actually makes the thing run gets pushed to the end, and then the engineering team is stuck with a very patchy, problematic system that they really want to fix, but nobody gives them the time to do so. The last couple of slides cover prioritization. This is a matrix that I like to use because it allows teams to take a stack of issues and plot them on it. The matrix asks them to assess tasks and ideas based on the amount of effort involved, along with the value they expect to provide for the user once they do the thing. This lets the team see that if they have a lot of things that are high value but also high effort, they either need to focus on one of those, because they're not going to do ten high-impact, high-effort items at once, or break them down into smaller bits so that they can go into the "do it now" column, which is really where your quick wins and your low-hanging fruit should go. It's really important to plan for those quick wins early on so that you can collect momentum, and the team doesn't feel like they're in some long slog where they're never going to see the results of their work. If you have a quick turnaround for the impact provided, that's nice, because they can celebrate those wins early and keep going. There's also this nice box, my favorite box, the "don't do it" box, because that's where you just close the issue and forget. Here's this matrix in action. This is also from Security Scorecard, recently. We haven't done this exercise yet, but I'm really hoping we do it soon. This is basically all the bugs in the backlog, put into specific buckets; some of them weren't bugs, so it was really categorizing what's a bug and what isn't.
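The effort/value quadrants just described can be expressed as a tiny triage helper. This is only an illustrative sketch; the issue names and the 1-5 scoring scale are hypothetical, not from the talk:

```python
def bucket(value, effort, threshold=3):
    """Map an issue scored 1-5 on user value and effort
    onto a quadrant of the prioritization matrix."""
    if value >= threshold:
        # High value: do cheap things now, break down expensive ones
        return "do it now" if effort < threshold else "break it down"
    # Low value: cheap things can wait, expensive ones get closed
    return "maybe later" if effort < threshold else "don't do it"

# Hypothetical backlog: (value, effort) per issue
backlog = {"flaky test fix": (4, 2), "full rewrite": (5, 5),
           "rename a variable": (2, 1), "port to new UI kit": (2, 5)}
for name, (v, e) in backlog.items():
    print(f"{name}: {bucket(v, e)}")
```

The point is not the code but the forcing function: every issue lands in exactly one quadrant, including the "don't do it" one where it simply gets closed.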
Then the goal is that the team will plot the bugs on this graph, and we might find out that some of the bugs are already solved and some are still relevant now. It's really about kicking stuff out of the backlog and then focusing on what is really important, what's really valuable, what people are really being hurt by right now that we should fix right away. Those are basically the steps for how you would apply such a matrix. I also encourage using a scoring model. There are a lot of different scoring models, and you can find them on Google or Ecosia (my favorite search engine, personally, is Ecosia). You can see what scoring models can do to help you assess things like reach, impact, excitement and effort, and have a weighted scoring option so you can stack-rank your backlog items and do the top items first, because you've decided through data and analysis that they're the most valuable ones. This is another template for your strategy. I just found it on Miro; it's by Lou Coleman. Basically, if you're rolling out an MVP for a new project for the first time, your center of focus is obviously the tree trunk, so you make the purpose of that really strong and solid, and then over time you build on your tree trunk. This format allows you to plot your plans on different bands. So maybe the future band might be something that's high impact and high effort, but it's just going to take a lot of time, so you don't project that you're going to have it done right away. I just thought it was a nice visual. I like trees too. The last slide is probably something that's very familiar to you. It's a standard kanban project board, but this really helps with asynchronous collaboration, because if you're running your board really well, you'll only have high-value work in it, and then your contributors don't have to have a meeting to figure out what to do.
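Going back to the scoring models mentioned a moment ago: a weighted, RICE-style stack ranking can be sketched like this. The criteria, weights and backlog items below are invented for illustration; real scoring models define their own:

```python
def weighted_score(item, weights):
    """Sum each criterion times its weight, then divide by effort
    so that cheaper items rank higher (RICE-style)."""
    total = sum(item[criterion] * w for criterion, w in weights.items())
    return total / item["effort"]

# Hypothetical weights and backlog, each criterion scored 1-5
weights = {"reach": 2.0, "impact": 3.0, "excitement": 1.0}
backlog = [
    {"name": "dark mode",   "reach": 3, "impact": 2, "excitement": 5, "effort": 2},
    {"name": "fix OOM bug", "reach": 5, "impact": 5, "excitement": 1, "effort": 3},
]
ranked = sorted(backlog, key=lambda i: weighted_score(i, weights), reverse=True)
print([i["name"] for i in ranked])  # → ['fix OOM bug', 'dark mode']
```

The weights encode the "data and analysis" part: the team argues once about what matters, and the ranking of individual items then falls out mechanically.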
They just pull work off the board, knowing that you've clearly vetted it through the tools that I've shown you, so they know that what they're going to deliver is ready to go and is going to make a difference. In my experience, people are really motivated by purpose. They don't want to just do busy work; they actually want to know they're making a change. So with your really nicely refined backlog, you can help your contributors along by giving them valuable work to do. I suggest making a triage working group, or having some mechanism in your team, but just make sure that issues are triaged regularly so they don't pile up; that's also a really good way to get non-code contributors involved, doing valuable, high-purpose work. Hopefully I have helped clear your path and helped you clarify your purpose. This is a nice trail in Amsterdam. It's quiet and friendly and inviting, so hopefully your open source development can achieve some similar aesthetics. And that's it, and those are the links to the resources that I've shared earlier. Now, a question. So this goes back to the assumption-driven development. It made me wonder, especially since you pointed out to scope the work first so you know what you're getting yourself into: if I had done that, then I would have never started any effort at any time, because I would have been too intimidated had I known what I was getting myself into. So what do I do to still get stuff done? I think it depends on a number of factors. If you have a lot of time to build something out and really focus on it.
Single-vendor is the new proprietary
Thank you everyone for joining so late on the first day of FOSDEM. A quick introduction before we start. My name is Thierry Carrez. I'm the general manager for the Open Infrastructure Foundation, which was previously known as the OpenStack Foundation. I was also elected to the board of the Open Source Initiative, and I'm serving as its vice chair right now. As part of those activities, I've been working on the draft response from the OSI on the release of a new license called the Functional Source License. And as part of that, I really reflected back on some of the relicensing that has been happening over the past few years, most recently at HashiCorp, which you're probably familiar with: a very well-known, previously open source company that decided to switch licensing for products like Terraform or Vault, thereby creating some tension in the ecosystem. Looking at those critically, it just occurred to me that single-vendor open source is the new proprietary. This talk will get into the details of why I think that way. I realize this might be a controversial opinion. I realize that some people will disagree with me. I realize that I will probably make some enemies out of this. But I think it's an important way of looking at it, even if it might be a bit of an extreme characterization. The rant will be very short, so that we have plenty of time at the end for you to engage in an open discussion and hopefully prove me wrong before I do it again. So, all this relicensing that we've been hearing about recently is built on the same narrative around open source. And the story is very well known. It starts like this. At the dawn of the computer age, software was not considered very valuable. It was all about the hardware.
The people using those machines would actually develop the software that ran on that hardware as a commons, and share it relatively freely. It's the advent of the 80s and the rise of the PC that made hardware much more of a commodity, and with it made the software much more valuable. That's when software companies like Microsoft were created, and with them the proprietary software approach. The proprietary software approach is when a single entity owns the software that is produced and intends to capture all value from it, thanks to restrictive licensing conditions. The 90s after that really led to a lot of excesses, especially as Microsoft decided to exploit the dominant position it had. Openly developed open source really grew in the 90s in reaction to this evil proprietary approach. It obviously predates that period, but that's really when it caught on. In that model, software is produced as a commons by a community of participants, organizations and individuals, openly collaborating, and the value is shared across the participants in that ecosystem. This is all made possible thanks to free and open source licenses, which guarantee a number of freedoms, including the freedom to build on it without asking for permission and the freedom to use it for any purpose, including making money. In the next 20 years, open source got overwhelmingly popular and unleashed a software revolution. Those who have been around for that time can measure how dramatic that change was. A recent study estimated that the demand-side value of open source software today is nearly $9 trillion. It's estimated that 96% of the software that is run contains an open source component, and it would be very hard to develop new software today without using open source. And so, like everyone else, the companies that produce software massively adopted open source.
They would develop in-house but release the end product under an open source license. We call that single-vendor open source. With the internet becoming more ubiquitous, some turned to a software-as-a-service model, and we saw the rise of the cloud, and with it the rise of the hyperscaler clouds. Some of those hyperscalers would run open source software at scale, which was seen as unfair competition by those open source software companies that were using the software-as-a-service model. That brings us to today, where those companies say that while open source is great to get that initial visibility, it's bad for monetization, it's bad for business. And if it's not business friendly, we need to invent new licenses, you know, to continue defending open source, especially against this evil proprietary software. And you know, with those licenses, we continue to give you access to the code for free, so what's not to love? In some cases, the license will even revert to an open source license after some time. Why do you hate us, Thierry? I'll explain why. I think this narrative is built on three misconceptions, especially the last part, which this talk is going to deconstruct. The first one is that open source is great because you don't have to pay for it. The second one is that single-vendor open source is the reasonable way to do open source. And the third one, interestingly, is that proprietary software is evil. So let's go one by one. The first one: open source is great because you don't have to pay for it. I mean, we are the ones writing the software and we continue to give it to you for free, so why are you not happy with that? We just need to preserve our business interests, you know. Well, the problem is, open source is not great because you don't have to pay for it. Open source is great because everyone is free to use it. And that's a subtle distinction, I realize that.
I mean, cost is a factor, but this goes way beyond monetary concerns and monetary barriers. What matters is not having to ask for permission. Just use it: anyone, anywhere, for anything. Not just the ones with deep pockets, not just the ones in certain geographies. And it's really this permissionless innovation that enabled a ton of valuable software, itself often released as open source, which fed into that virtuous cycle. Those non-compete licenses that they propose restrict you from doing anything with the software that the company disagrees with or considers competition. They use pretty vague and untested legal terms, and the end result is that it ends this permissionless innovation. You can no longer just use it. The second misconception is that single-vendor open source is the reasonable way to do open source and resist evil proprietary software. I mean, we are the self-proclaimed commercial open source companies, we are the business-conscious open source folks, you should follow our model, et cetera. Well, let's go back to the definition of proprietary that I used earlier: a single entity owns the software that is produced and intends to capture all value derived from it, thanks to restrictive licensing conditions. If you take that definition, single-vendor open source companies are still doing what is essentially proprietary software. They will disagree, obviously. But they still consider the software being produced as their exclusive property, and they intend to capture all the value that derives from it. They aggregate copyright assignments so that they can change the license anytime they want. So it's still proprietary. They just choose, for now, to release their software under an open source license. Single-vendor open source is not the reasonable way to do open source and fight evil proprietary software. It's actually just another way to do proprietary software. It's just a relicensing time bomb.
And sure enough, a lot of those exploded over the past year. So the proprietary development model is moving back to restrictive licensing now, in a very predictable attempt to capture incrementally more value. That was predictable, if only we had seen single-vendor open source as the temporary tactic of proprietary development that it is, and that it always was. The third misconception: proprietary is evil. Well, this whole story would not hold if we did not demonize proprietary software in the first place and oppose it to open source software. But as we've seen, you can be proprietary, have a proprietary development model, and do open source as a temporary tactic. So it's not open source versus proprietary. We need to shift that framing; it's actually more complex than that. You can represent it as a quadrant. On one axis you have open source licensing versus non-open source licensing. That's pretty clear cut: the Open Source Initiative defines it, it comes with a bunch of freedoms, and it ultimately enables that permissionless innovation I talked about. Why do we have those freedoms? Because they enable the permissionless innovation model that we all benefit from today. On the other axis you have the development model: the software is either openly developed by a community that will share the value of the work, or it is developed by a single entity that will own it. If you look at traditional proprietary software, that's what I call restricted software: using a non-open source license to impose licensing conditions, especially to preserve your business model or gain some other benefit. If you look at the open source side, depending on whether it's developed by a group of organizations as a commons, or by a single entity that retains all the copyright aggregation, it's either openly developed open source or single-vendor open source.
The issue here is that we're seeing movement from single-vendor open source back to restricted software. And they hope that by doing that, they will retain enough aura from their open source days to hide the fact that it's just restricted software, and pretend to continue to be on the good guys' side and fight against the evil proprietary software. But proprietary software is not evil. The abuse of a dominant position in the 90s was evil, for sure. But the proprietary model itself is not evil. In my opinion, it's just inferior. If you truly think that software developed by a diverse set of actors working in open collaboration is not better, you should definitely do proprietary development. That's fine. Just be honest about it. What's evil is really the lies and hypocrisy that we are seeing here. Doing proprietary while pretending to be open, that's evil. That's what we call open washing. Trying to dilute the meaning of open source by creating deceptively named licenses like the Commons Clause, or the Server Side Public License, or the Business Source License, that's evil. Switching licenses from under your community after having promised to be forever open source, like HashiCorp just did, that's evil. Benefiting from open source freedoms to build your software in the first place, and then denying that those freedoms actually have value, that's evil. So yeah, as a summary, and I thought I would leave a lot of time for engagement from the crowd, so I want to make sure we have time, I want to leave you with three takeaways, three actions. First, I think it's time for us to remind everyone that the permissionless innovation that we currently benefit from should not be taken for granted. It is a direct consequence of the prevalence of open source licensing as defined by the Open Source Initiative, and it requires all of the open source freedoms, including the freedom to use the software for any purpose.
The second takeaway is that I think it's time for us to describe what a world where they win looks like. Because if their vision wins, if everyone adopted their approach, all the innovation that those open source freedoms allow us to unleash would come to a halt. And we would quickly be back in the 80s. I've lived through the 80s; you don't want to be there. Imagine a world where you have to ask your lawyers for permission before you use any library, any programming language. They will say that after some time it reverts to an open source license: after two, three, four years, the license automatically transforms into an open source license. But that's a trap too. Imagine a world where you have to run a buggy two-year-old version of the software, with known vulnerabilities, because that's the one that is open source. That's just not practical. Finally, takeaway number three: I think it's time for us to reassert the value of software developed in open collaboration. Everything else is proprietary. Everything else is a relicensing time bomb. So beware of CLAs when they are not held by an openly governed non-profit. Beware of single-vendor open source software, because it's just a proprietary model that happens to temporarily use open source licensing. They have lots of money and lots of resources to spread their very confusing message around openness, and we're clearly disorganized. So I think, as a conclusion, that it's time for us all to clearly say that single vendor is the new proprietary. Thank you. Ah, objections? No, actually my questions are answered in your notes. So I'm interested in having your notes, if that's possible, along with the slides. So, the short story about this talk is that it's actually the text that I wrote for the OSI to answer the Functional Source License. It was deemed too extreme to be representative of the organization, and so we toned it down and changed it.
But that's actually what gave me the idea that I should turn it into a rant that I would present, and FOSDEM is clearly the right crowd to try it on. So you would publish the notes? Yes, basically I'll make a blog post that's basically the same speech. OK, I have a question. What if the code is developed by one company, but under the auspices of the Linux Foundation? That's happened many times; you can notice it's quite common at this moment. I wouldn't call that proprietary. What makes the proprietary approach is not just that you have a single participant; it's that it's closed to others joining. And I'm pretty sure that in the case of a project under the Linux Foundation where they have a major vendor, they would be happy to have someone else. And I'm pretty sure that they would not be able to unilaterally change the licensing conditions, because the trademarks and copyright aggregation would belong to the Linux Foundation and not to that single company. So I think it protects you. If by design it's a single entity, but I don't think they have a lot of projects like that, then yes, there is a problem. If you are prevented from participating as an equal in an open collaboration, yes, then there is a problem. So, a provocative question. The GPL doesn't allow you to do just anything; you cannot choose not to follow it, so it's invasive in that sense. In your definition, is it still free software in the most direct sense, or is it something in between? I mean, a GPL? No, the GPL. The GPL itself? No, it's totally embedding those freedoms. To me, it's clearly an open source license. Some would say the one, the open source license. The main difference between the GPL and the permissive licenses is how much of a forcing function you want to have back into the contribution cycle. That's really what makes it slightly different.
And it depends on how much you think you will get contributions without it or with it. All the big projects have that moment where they have to choose between permissive and copyleft licenses, and it's all a bet on the future. If you think that your ecosystem is so big that you will get contributions anyway without forcing people to give back, it's actually better to have a wider funnel to get into your system. If you think your project is never going to be super big and you can use all the contributions you get, I think the GPL approach is better. So you said a couple of things that I found sort of interesting. One is, I think the objection, or the observation, is that if you leave the control to commercial entities, they are going to be continually tempted to re-license it, de-license it, change the licensing. So are you advocating that one should try and get the licensing and, well, the copyright transferred to a notionally neutral entity? Because for me, it doesn't seem that having, like, the GPL on the side of a license helps; if the copyright belongs to one of these companies, they can just say, okay, fine, we'll leave that on the side, but we'll keep doing stuff over here. So then the GPL community has to fork it, and everybody has to maintain it with their own resources. So it doesn't seem that the GPL provides the protection that you're suggesting. So I think it's more about the ownership that you're pointing at. Is that correct? Yes. I would say copyright aggregation is just one of the assets that you need to have in a neutral asset lock in an open collaboration. Trademark is another one. If one of the companies has the trademark, it's more difficult to weaponize, I guess, than copyright aggregation, but you can still pull the project identity away from the project, and so that can create some tension.
So, yeah, clearly being able to put all of those assets that make that project initially possible and give it some stability, so the name, the ability to change the license, under some kind of an asset lock. And I'm not necessarily saying go for an open source foundation like the one I work for. There are other ways today to actually create those open collaboration fields without necessarily going for a foundation. I think today foundations really bring value to make that open collaboration successful, not just possible. But yes, I think it's part of how you would fix the problem. The problem is really that it's a single entity; it's software that is developed by a single entity. They will try to hide it. They will say, well, we take contributions from the community. I mean, that works for some, but clearly there is a difference between the contributors that are on the inside and the contributors that are on the outside. So it's free labor; it's not contribution, it's not open, it's not a commons. With a commons, you have to make sure that the future participants will benefit from it. Here it's the pure ownership of one single company, and they take free labor when it's available, and then they change the license under you when some VC tells them to, so it's bad. I'm not a professional developer, so I know this is a rookie question; I don't know the answer. How would an entirely new thing come into being if no one person can have that thing? Doesn't every idea start with one person? So at that moment it is a proprietary thing. And it may often live for some time before it becomes something else. Your rant clearly doesn't cover somebody having a good idea. So how does something completely new come into being? So it starts as one person, but that person makes a choice at that point.
They decide either to put it on some software forge, not to name names, and have a proto-open governance around it that says, well, I'm the maintainer, but I would consider adding more maintainers, and they want to create an open collaboration ground around it. Then they take the road of openly developed open source. Or at that initial stage they decide, wow, that's very interesting, I could build a company around it and monetize the heck out of it, and so I need to make sure I keep control over it. And when the second contributor comes, I want to make sure they assign copyright to me or my organization so that I can do whatever I want tomorrow. That's them going the way of proprietary software. And it's really a thinking model. You either want to monetize something that goes beyond those commons that you create with others, or you think that software is going to be the real value that you create, and you want to make sure you capture all of it. Thank you very much for your... I'm here at the top. Thank you very much. I'd like to have your view on what happened to the Matrix Element ecosystem, where the server part was re-licensed in the last few years from a permissive license to an AGPL license. So this is re-licensing in the opposite way you're describing, so kind of more open, but there are still lots of discussions, because they are forking their own community with their own software, and the whole community is not really keen on it. What is your view on that? So yes, it's going in the right direction, but it's still proprietary software. If they can actually do that, they can do it again. And you never know exactly in which direction. So I would argue that it's a proprietary holdout in the middle. One that is probably well-intentioned right now, and a lot of those companies are actually well-intentioned when they start.
It's when money comes in and they get some pressure about return on investment that suddenly you need to extract a bit more juice from that lemon, and the only lever you have is re-licensing. And I don't think HashiCorp planned it from day zero. Although the VCs, some VC firms, actually have a playbook for it. So it's a published tactic. It's not as if it was a secret or a surprise. In the end, it's all a calculated move around their approach of how we build software today. They can't really get around it, so they adopt it and try to make it do what they need. I'm not sure if it's just the VCs. I like to blame the VCs and investors too. But one of the things that I think is that people that are paying for software don't value open development and open collaboration as much. They just want the vendor to be around. They don't really value the history. They could have been around for 10 years of open source development, but at the end of the day, they're still willing to pay for the software, the cloud offering, whatever it is. But they don't value the open development as much as the rest of us do. So I would disagree with that. It depends on the software. Obviously, if you run Firefox and you suddenly decide to use Chromium instead, because you're weird, it's possible. But in some cases... the project that I've been mostly working on over the past 12 years is OpenStack. And you make an investment in an open source technology because you think open source is the way to build it, or you think that the software, the way it's built, is not going to be changed on you. There will not be new licensing that will force you to pay for support from one single company, one of those things. Having the guarantee that it's not going to change two years, five years, ten years from now is an important guarantee, because you make a pretty significant investment in that infrastructure.
So there are other open source solutions for providing infrastructure than OpenStack, but all the others are single vendor. And so that means, if you choose them, they might decide to do something else with the software, and they might put your investment at risk, because they might decide that you need to pay them for support per seat, per server, per whatever, some condition that you can't really accept, especially if you're a nonprofit. I really like the idea of the kind of frictionless way you can use open source. What I think I'm hearing is that we really should be thinking about the fact that even if you can use particular open source projects, you might not want to if they're like this. That's what I'm hearing. And I'm just wondering then, practically speaking, if we have to check whether or not they're like that, does that not interrupt the value of the frictionless use? Because I just don't understand how we flag those, or how do you know, or how do you avoid this, basically, without interrupting this idea of just being able to use things that have open source software licenses? It's an excellent point. You can't just look at the license. That's actually another thing that I've been speaking about a few times. We need to change the way we look at the software; we need to go beyond just the open source license, because it's not going to give you this certainty that I think you need. And yes, it's a problem. There are some organizations where, when they put out a project, you're pretty sure that it's under an open governance and it will be there forever, et cetera, et cetera. But there is no label. There is no brand. I'm trying to say "openly developed open source", as much of a mouthful as it is. It might not be the right term; I don't want to say good open source, bad open source.
But yes, we need some way to say this sounds like safe open source and this sounds like potentially restrictive-in-two-years open source. And how do we differentiate that? I don't have a good answer. We have an issue. The talk here is more of a wake-up call, where I want people to realize how much we benefit from that permissionless innovation that open source licensing enables. I want people to realize that not all open source is aligned with that permissionless innovation. A lot of it is actually saying open source should not be allowed to be run by anyone for any purpose. And so we are going back, you know, we've been through this cycle from the 60s to 2020, going back to the age of computers where we did not have this nine-trillion-line body of code that we can easily pull from and that we are free to build on. And there's nothing that guarantees that's going to continue. Like, 10 years from now, why would open source still be around? It will be because we hold the line on open source licensing. If we all hold the line, if the Open Source Initiative continues to guard the open source definition, makes sure that all the freedoms are in there, then we don't remove one freedom and see what happens to permissionless innovation. And I think that's the general idea. And yes, look under the hood and see how the open source software that you're being sold is actually built. I'm sorry, it's 6 o'clock, folks, so the livestream has ended. But continue your questions outside if you have any. Everyone agree?
The Many Hats of a Maintainer: Organizational Design That Helps Reduce Them
Thank you so much to the organizers and everybody here today. This is such a dream. Before I get into things, though, I wanted to dedicate just the next 30 seconds to my best friend, who passed in August. Many of you knew her: Kris Nova. She was a prolific open source engineer, alpinist, hacker and past FOSDEM speaker. What you're seeing on the screen right now is a photo of hers from when she summited Mount Rainier several years ago. What I just wanted to say is that I hope we can always keep her memory alive here. With that said, I'm Paris Pittman. I'm a recovering Kubernetes maintainer. These days I hang out with the Swift programming language community and I sit on the Swift Core Team. If you hear the twang in my o's, that's my hometown of Baltimore, Maryland. I live in Seattle these days, so hello from Pacific time, at least as far as my brain is concerned. I've been focused on community management and governance of open source projects for quite a long time, and I'm so happy to be here. The word cloud on the screen is a key part of my talk and my biography. It represents examples of roles, titles, groups and project organization that make up our open source communities today. I'm not going to fib: I've had a lot of very undefined hats in my life. These are some of the defined ones. Our maintainers today, if you sat in the talk that Kara just did, have tons of hats that they have to wear. At the end of the day, how can we help reduce these? Organizational design could definitely play a part, but of course it doesn't solve all of our woes. I'm not going to touch on anything funding related. It's quite funny how we were paired up on the schedule today: Kara does all cash, I'm doing all humans. So welcome to the human piece. So in this open source world, if you have any kind of participatory goals with a project, there are elements that you need to plan for and around when architecting roles, groups and processes for sustainability.
The secret sauce to community lies in how you interpret and implement the elements. Let's go through those. Goals: you have them. Collaboration. Distributed decision making. Transparency. And community engagement: pull requests accepted. But now life has presented us with new elements that we need to design for and around. We have decades of open source community stories that we can look to to formulate new elements that can help support the maintainer. I'm sure you've heard a few of these, right? First one: you know a couple of maintainers that are probably masquerading as moderators. You probably know some who are tied up with code of conduct incidents. You've heard of the infamous toxic community. What about that open source project that you know that has amazing engineering and no documentation? Or what if you know the engineering project where the maintainers are really trying to be the best documentarians that they can be, but they just can't? Right? Same thing goes for website and branding. At the end of the day, in order to market a project, you need to have a website and branding. So that means you as a maintainer are also going to need to put a hat on for website dev and designer. Another famous story: you're a victim of your own success. Hooray! We've heard so many of those this week. Or weekend, rather. It feels like a week, right? Hooray! But boo, because now the workload is absolutely not manageable for you. And yes, the next one: the never-ending quest for contributors to help out, or for you to turn your users into steady maintainers. And that all really falls into this bucket that I call contributor turnover. And speaking of contributor turnover, in a white paper that I read from Carnegie Mellon, titled "Why Do People Stop Flossing?", that is, why do people stop contributing to open source?
Quote: "prior work has shown that the turnover rate of a project profoundly affects its survival probability and code quality." And in the same paper, another quote: "80% of projects fail due to contributor turnover." So community management is clearly the missing element in organizational design today, if contributor turnover is that much of a factor in project failure. You've heard it themed in the stories too. Someone needs to do this work. We can't be everything to everyone. So how can we delegate this via roles, groups, and processes? When I have these conversations with maintainers, and I do a lot, because that's my job, my question to them is: do you want to do this forever? Is this what you always want to do? "This" being 15-plus hats, or things that you don't want to do, or even things that you don't have skills in. Hilariously enough, this screenshot from Mastodon literally came to me yesterday. It's such a great summary. This individual is saying that they need social media skills in order to promote their thing, and they just don't have them. And what is "this", when I say, do you really want to do this forever? That's creating, maintaining, and moderating mailing lists, chat platforms, and forums; recruiting and onboarding new contributors, with their documentation, their workflows, their processes; also keeping your current contributors and maintainers; also GitHub administration, website creation, mentoring. Holy moly, y'all, that's a lot. So, first thing: maybe you should define a role that could be successful for you, which is community manager. This does not need to be someone who does community management as a profession. I know tons of engineers who wear a community manager hat. They love it. They have the skills and the passion for it. Because at the end of the day, that's all you need to shift the weight off the maintainer. And there's implementation in the wild, which is what I'm talking about.
Dapr, a distributed application runtime, has a community manager role description posted in their community repo. It starts with writing down some responsibilities and posting them somewhere to be seen: your mailing list, your social media accounts, your issue backlog. The role can be iterated on like code. That's not something that I hear a lot from maintainers. Maintainers are really pretty fearful about this. But ironically enough, you're not fearful about putting half-baked code out there, yet you're okay with just going without. And let's not go without. You can iterate on this. It doesn't have to be the end state. So we talked about that. I just let loose with this one role that you should have, right? This community manager role. What are the other roles in open source, though? This is it. That's what we've got. Contributor and maintainer are the two most common forms of organizational design in open source, no matter the size of the project. Two of the most common words that imply all of the work that you do. And that distributes trust. Even if you're the smallest of projects, you should at least have these two roles clearly defined, including how to get to be a maintainer. You've probably heard this term in the last few years: a contributor ladder, if you will. It helps people understand what you do, why you do it, and how to get there. What about the contributor ladder, though? Is it that easy a jump to go from contributor to maintainer? It's not. That's why we need things like mentoring and other community management types of activities. We added one in the middle and made it an actual ladder. Kubernetes has this, for instance, and it's the introduction of something called a reviewer. So you've got your contributor, reviewer, and maintainer. That reviewer role gives new contributors another rung on that ladder to help build trust, grow skills, and get practical mentoring experience versus a lofty one. Okay. But why stop there?
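To make the "iterate on roles like code" idea concrete, here is a minimal hypothetical sketch of a three-rung contributor ladder kept as a small machine-readable structure in a community repo. The role names follow the Kubernetes-style ladder described above; the responsibilities and promotion criteria text are invented for illustration, not any project's real policy.

```python
# Hypothetical sketch: a contributor ladder kept "as code", so the roles and
# their criteria can be proposed, reviewed, and iterated like any other change.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Role:
    name: str
    responsibilities: List[str]   # what this hat actually involves
    promotion_criteria: str       # illustrative text only, not real policy

LADDER = [
    Role("contributor", ["file issues", "send pull requests"],
         "consistent contributions earn a reviewer nomination"),
    Role("reviewer", ["review pull requests", "mentor newcomers"],
         "sustained, trusted reviews earn a maintainer nomination"),
    Role("maintainer", ["approve and merge", "set technical direction"],
         "top of the ladder, until an emeritus role is added"),
]

def next_role(current: str) -> Optional[str]:
    """Return the next rung above `current`, or None at the top."""
    names = [r.name for r in LADDER]
    i = names.index(current)
    return names[i + 1] if i + 1 < len(names) else None

print(next_role("contributor"))  # reviewer
```

The point of the sketch is the workflow, not the data structure: because the ladder lives in the repo, adding a rung or changing a criterion is just a pull request, which anyone on the ladder can propose.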
Why do we have to stop with just those two roles, or just community manager? If a project has needs, create roles for them. Again, it's just like your code: create it, and if it doesn't work, sunset it. According to the same Carnegie Mellon paper that I quoted earlier, role identity plays a strong role in contributor turnover. How about that? So while you're creating other roles, you're also building belonging and incentive to contribute. So what are some of those other roles that you could create that would solve some of the problems that we heard in the stories earlier? A release manager. A security lead. A communications lead. A social media manager. The list goes on, and honestly, it's kind of endless. Think about the things that you need and build for them. Speaking of endless, though: if a thriving community is your goal, build an emeritus role. It's one of the most forgotten parts, I think, of open source organizational design: how to exit, off-board, be done with it and hang up your hat. And that should be celebrated. It should be kind of like retirement from your day job, where they throw you a party at the end and celebrate you. Project Jupyter actually calls their folks distinguished contributors. Isn't that cute? I think this is something that we really should try to normalize and include in your contributor ladders. So now you have four: contributor, reviewer, maintainer, and emeritus. So we have roles as one approach. Next, groups; they're probably my personal favorite. And that is groups of humans. Groups allow people to drive work or interests in a space. And a group can be as small as two people. I hear a lot of naysaying from my maintainer friends sometimes. They'll say groups are so heavyweight; we're not Kubernetes. But Kubernetes had groups before it was Kubernetes. Why do we all think Kubernetes is Kubernetes? Kubernetes is Kubernetes because of these groups that were formulated in the early days.
Because at the end of the day, what we've learned from groups is that they're great at bringing people in and guiding them to the work. One of the most common phrases that even I use a lot of the time when we're talking about contributing to open source is "jump right in, just send a PR." That's not helpful when you're trying to scale a community. It's helpful to the people you're talking to who already know what you're talking about, but that's about it. So it's all about what you want out of it. Do you want experts in areas where you aren't one? Do you want to distribute the load? Do you want a way to drive the work? Applying this to community management, because now we know community management is so hella important: several related, successful groups have taken shape over the years. And they are all targeting this community management question of how we can collectively share the burden of community management. I recently created a contributor experience group with a hard focus on mentoring, bringing in over 60 new folks to the project each cycle. Kubernetes also has one, which I led, that supports 80,000 contributors. Y'all: 80,000. And all of these groups have different levels of decision making, approvals, charters, duties. But the one commonality that they all have is to support the maintainer. There are some really cool examples of this in the wild that I've seen from various sizes and types of projects. Again, because I think a lot of people assume that you have to be the Kubernetes of the world in order to do these cool, fun things. But you don't. For example, OpenTelemetry has an end user working group that helps maintainers reduce their product management hats. Because I know a lot of you have to toggle between what's important. This group helps them toggle between what's important. And not only that, it then tries to get them, the end users, to participate.
There is also advocacy and outreach with Jenkins. Debian has a security team, and many projects have security teams these days. And even in the Swift community, we are just about to launch an ecosystem work group to help our ecosystem maintainers. So again, this is endless, and this is about your needs. But if you're still in the audience right now and you're still kind of like, I don't know, I don't know if I can do all this... Or if you have a community manager now, but your project is scaling up so much that it's even too much for them. Or what if you have tried, but this just isn't a thing for you? Well, there is the process component here. This process piece really comes from the emergence of strong CI/CD systems and the overarching infrastructure-as-code movement. And I've seen some really creative things in how engineers are organizing their projects through configuration files. Imagine that you're an overburdened maintainer and you're really trying hard to onboard new contributors. That means helping them with forming their groups and giving out permissions to things like your Twitter or your Slack or your GitHub keys. And then you're also needing to update the documents in a million places, training folks on code of conduct stuff. The cool thing is, in Kubernetes we set something up that takes care of all of that. It's this practice I call "infrastructure as community", and it could be a way forward for you. This practice covers testing, contributor management (like bots that welcome first-time contributors), artifacts, governance, and policy and procedure scaling. Kubernetes no longer has a full-time community manager. I bet a lot of people think that it does. And it hasn't for a long time. The reason why is that it's held up by this infrastructure as community, by its governance.
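As a minimal sketch of one "infrastructure as community" rule, bots that welcome first-time contributors: the decision logic can be a tiny pure function that CI automation calls on each new pull request. Everything here (names, message text, how contributors are tracked) is invented for illustration; Kubernetes' actual setup is much richer and driven by config files in its community repos.

```python
# Minimal sketch of a first-time-contributor welcome rule. A real bot would
# be wired into CI and would fetch the contributor list from the project's
# history or a config file; here it is just passed in as a set.
from typing import Optional, Set

def welcome_message(pr_author: str, known_contributors: Set[str]) -> Optional[str]:
    """Return a welcome comment for a first-time contributor, else None."""
    if pr_author in known_contributors:
        return None  # returning contributor: no welcome needed
    return (f"Welcome, @{pr_author}, and thanks for your first pull request! "
            "A reviewer will be along shortly; CONTRIBUTING.md is a good place to start.")

known = {"alice", "bob"}
print(welcome_message("carol", known))  # carol is new: gets a welcome comment
print(welcome_message("alice", known))  # alice is known: None
```

The design point is that the welcome is automated and consistent, so no maintainer has to wear the greeter hat on every single pull request.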
And by all of those elements that I mentioned earlier, with the delegated groups that have decentralized decision making, like the contributor experience special interest group, the testing special interest group, and the steering committee. So we've reviewed roles, we've reviewed some groups at a high level, and I've also shown you some process stuff. Good project organization and governance create environments where projects thrive, and the humans do too. I was talking with Zach Sailer, a distinguished contributor (again, I love that title) in the Jupyter project. And he said that it took them two years, two years, to rearchitect their project organization. But he said, Paris, and this is a quote, "it was so worth it." And I said, why? Because roles, groups and processes, at the end of the day, attract and bring new people to your project. It's scalable mentoring. One-on-one doesn't scale in communities of size. It just doesn't. And this way, roles bring in shadows. You can easily have shadow programs like SIG Release in Kubernetes. An amazing shadow program, y'all, just truly top notch. You should absolutely take a look at it. And the third, in that same kind of bucket: inviting other skills in increases your chance of, A, survival, and B, having a more diverse project. So while we sit in these talks on how to get more diversity, look to your project organization. And then the last thing, the one thing that I need to take a drink for because it gets me so excited. The last thing that not a lot of people recognize here, and this also goes back to Kara's talk a little bit, with the funding: how do we get more people that have day jobs contributing more, and also being supported to do so? You know how we can do that? By giving them roles and titles. Companies love roles and titles, and it makes sense, because you know what happens when you go to your manager at work and you say, hey, I want to contribute to XYZ project. They say, that's nice. Cool.
But what's in it for us? And then you're like, yeah... that's usually how the convo goes. It's good, you know, we should do it. But imagine if you say: I would like to work my way up to being a security lead for Kubernetes, to bring industry experience back to my day job. That's a game changer, y'all. And that's one of the reasons why Kubernetes has grown, and one of the reasons why you've seen Kubernetes so, in a way, well staffed. Of course, yes, trust me, I think the 300-plus maintainers would probably throw tomatoes at me when I say "well staffed". So I'm, like, ducking right now. But it's 300 maintainers, y'all. It's a lot. So to wrap up, remember the slide: let's rally the industry around community management as an element going forward, in addition to engagement, you know, "pull requests accepted". Because at the end of the day, building robust, sustainable communities is more than accepting pull requests and taking issues. These two words, y'all, can go so far to help our maintainers. Thank you. Thank you.
Kickstarting an Open Source Culture: A Guide for Mentors
Welcome folks, we're on the final leg of FOSDEM, so hopefully we keep you awake. I'm delighted to be chatting here today with a good friend of mine, Phil, about something that we're both quite passionate about: our experiences out in open source, you know, when we first got involved and as we went along, working with the community, collaborating with folks, and then realizing how we can bring that into our companies and bring that culture in to help things get done better, and for a better world. Okay, so I'm Martin and I'm a developer over at IBM. For the last eight to ten years I've been in the cloud native space, and lately I've started getting involved in AI as well. Yeah, and I'm Phil, also a long-timer; we're both old guys now. A long time working as software developers, but, you know, I've done a lot in, again, the cloud native space. Even though Martin and I worked at IBM together, open source is really where we connected. Now I'm at AWS, still focused on open source, and we thought we'd start with just how we got our start. Again, we started our careers a very long time ago, and we were not involved in open source for a good long part of our careers doing software development. For me, it was around 2014 that I joined a new group in IBM that was focused more on open source, doing some work in OpenStack and Cloud Foundry, and this new thing called Docker came out. I was asked to go check out this new technology and see if we could get involved; I became a contributor, and in essence I got hooked. I loved open source. I'd been a long-time Linux user, but this was really my first experience contributing, making pull requests, reviewing code, helping others in the community, and that's led to the last 10 years of working in the OCI and CNCF and the containerd project, where I'm a maintainer as well.
So yeah, similar to Phil, I was working on a cloud orchestration product built on top of OpenStack around 2013, so we were downstream, we were building on it. And then I got an opportunity to extend Horizon, which was the dashboard at the time for OpenStack, and probably still is. I remember the first conference I went to was over in Atlanta. I think someone fell off the bus, to be honest, because my manager came to me on a Friday evening of a long weekend to say, do you want to go to Atlanta to a conference the following week, do you want to head over? So I went, and I was just blown away, I think by the whole collaboration of folks and all that. And then I went into the community and started contributing to Neutron, which was networking, and if you've ever worked with networking folks, they really are into the black arts, and they take networking seriously. And I always felt like I was going to get found out there, because I didn't care that much about IPv4 or IPv6. But it was a great experience, and they really made me feel welcome. I remember there were meet-ups at the time, over in Rochester, Minnesota, and about the work I'd done, the maintainers came up to me and said, we really liked what you've done, we really liked the fact that you took it on the chin when responses came back to you; you didn't get upset, you just moved on, made the changes and went again. And then fast forward a few years to getting involved in Kubernetes and in the Helm community, and being really welcomed in there, being part of the Helm 3 release going out, then becoming a maintainer in the community, and getting to actually speak at San Diego, which was a fabulous experience. So yeah, it's been great, you know. This is where his fancy clicker does the work. Yeah, the clicker's being unresponsive, sorry. Use the buttons.
So why do companies need to cultivate a culture of open source? I suppose the key one here, and we bang on about it in the community a whole lot, is that nearly every company today producing software is probably building on top of an open source stack. So if you're consuming it, you're really involved in using communities, and you need to look at how you feed back into the community as well, because if you're using the stacks and something goes wrong in the stack and you haven't been helping in those communities, then you don't really have a leg to stand on. The other part of that is that when you're building on these stacks, you're building on the shoulders of people who put in hundreds and hundreds of hours. So you're getting a real lot of value here to build your product on top of, where you can concentrate on your product and drive it forward, and you may not have all the people yourselves to do the good work that you're getting from the community. As you can see up here, there is so much open source out there, and that's coming from Linux, over the last few decades, definitely from the Linux community. Prior to that, in the 90s, it was a bit more niche, the number of people that were involved in open source, but the Linux Foundation community, on up through OpenStack, has really opened the door for people to contribute to communities, and it's created a real momentum and shift. If anything came out of Log4j, it's that we realized that open source software is in every product out there, and we need to be aware of that. The final one, and this is very important for your customers, is that most customers don't want vendor lock-in anymore, and OpenTelemetry is a great example of that.
It took a long time to come up with — there have been multiple standards in the telemetry space — but OpenTelemetry has been probably the fourth standard, the one where the different vendors have bought in and decided to work together. And a lot of clients know they want to be able to write their telemetry generation — their generation of data — once and use whatever backend they want. They don't want to be coming back again, having to change code and so forth to do observability and maintainability. They want to be able to have that vendor-neutral path, and then use the particular backend they want after that. Yeah, so coming into the community dev room, it feels maybe like we're preaching to the choir. Many of you here fully agree with the why — why do we do open source, why do companies need to do open source — but I thought one extra data point on top of what Martin was just talking about is a report that came out a year and a half ago that had this amazing stat: 82% of companies are looking for a vendor who's not just — like Martin said, and like we all know, everyone's consuming open source — but 82% said they'd like to select a vendor who's actually participating upstream in an open source community. And then there were a bunch of responses about why: oh, because they're familiar with open source processes, or they're helping sustain a community of something that I'm depending on. And we definitely have experienced that, working on containerd for myself, which was used in IBM's Kubernetes engine. It's used in several AWS container compute offerings. And AWS and IBM want people who are active in that community so that we can fix problems. Like this last response from the survey: 46% said, I'm choosing a vendor that contributes because I know when I hit a problem, I can depend on that vendor because they have people in the open source community.
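That write-once, swap-the-backend idea can be illustrated with a small sketch. To be clear, this is not the real OpenTelemetry API — the class and method names here are invented for illustration — it just shows why a vendor-neutral telemetry interface avoids lock-in: application code depends only on the neutral interface, and only the exporter object changes when you change vendors.

```python
# Hypothetical sketch (NOT the OpenTelemetry API) of vendor-neutral telemetry:
# instrumentation is written once against a neutral interface, and the
# backend is chosen by plugging in a different exporter.
from typing import Protocol


class Exporter(Protocol):
    def export(self, name: str, value: float) -> None: ...


class InMemoryExporter:
    """Stand-in backend; a real setup would export to some vendor's service."""
    def __init__(self) -> None:
        self.records: list[tuple[str, float]] = []

    def export(self, name: str, value: float) -> None:
        self.records.append((name, value))


class Meter:
    """Application code talks only to this neutral interface."""
    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    def record(self, name: str, value: float) -> None:
        self._exporter.export(name, value)


# Instrumentation is written once...
meter = Meter(InMemoryExporter())
meter.record("requests_served", 1.0)
# ...and switching vendors means swapping only the exporter object,
# not rewriting the instrumentation code.
```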
And I think Martin, you had an example of that. Yeah, I have a little example I can touch on, because I didn't want to go near the stats that Phil was throwing out there — was it 46% of people didn't want it, or they did want it? Sorry, I was a bit confused. No, on a serious note, the final point there is very telling. About a year and a half ago, I was working with a partner, and they were getting involved with us at the time. And they were really, really technical. They knew their stuff, and they were a dream to work with, being a technical person, where they told me what they were looking for, and I helped them along. But one evening anyway, they were using the Operator SDK from the Operator Framework, which is in the CNCF. And they found a bug. The engineer was on North American time, so I'd gone home to bed. He raised the issue, and I came in the next morning, and it was one of those lovely mornings — I hadn't even had my coffee, and I was like, oh my God. But I thought it was brilliant that they put it out there. So I went off, I worked away, and it took me maybe two days to narrow down the bug and get the fix in. But for that partner, I was able to jump on it and make a fix. It wasn't a big issue — the big issue was just finding where the thing was, as always. Once you find it, usually the solution isn't so bad. And then just working with the community to get it in. And I think they really appreciated that fact. And you know, most of our customers and clients and so forth out there, they're very, very technical. They know their stuff. So they're not going to be hoodwinked. Yep. So yeah, Martin, you're going to take this one — we've talked a little about companies, but why do employees care about involvement in open source? And this is a lovely thing, and it's from my experiences.
And when I talk in a while about the jumpstart program we have to help people get involved in open source — it's an amazing kind of, you know, I hate to use the word organic and just throw it around, but it's a great way for somebody to get opportunities that they may not get within their own company. Because sometimes, in companies or in teams, things are rigid or done certain ways, or maybe it's a bit like a public service, where people feel, I've been there 10 years, so I'm entitled to do this or whatever. But for me, just the ability to get opportunities — either to speak at conferences, to meet people on different topics, to suddenly be involved in conversations that you thought were for somebody way more experienced than you — is just amazing. It also gives people the ability to work on — you know, you may work on certain technology in your company, but then all of a sudden you're exposed to these technologies that are out there. And Phil said it there: when we first go out into the communities — everyone's on GitHub now, there's no problem with that, but when you first go to GitHub, you're playing around with it, or you'd go out onto IRC, though it's all Slack now, whatever — it's a big challenge when you first start out there and you're trying to engage and so forth. But it really gives you an opportunity to learn how to collaborate with people and work with people, because it's not always about the technology. It's not always about contributions. It's collaboration as well. Because at the end of the day, in your own company, Bob or Mary beside you, they're paid to work with you. When you're in the communities, people will only work with you if you're a decent person to work with. So you get those opportunities.
And the funny thing I'd say is just the friendships you make. As Phil said there, we worked in the same company, we met each other at OpenStack. And over the years, we never worked together internally, but we'd meet each other at KubeCon somewhere or some other conference, or like FOSDEM here again, and we'd get a chance to talk together. So I think that's lovely. Yep. And we usually take an old man selfie together. You make me do that. Yeah, he just wants it because he's still got hair and I don't. All right, so we've talked a bit about the why. Just a few points: what does it mean that a company has an open source culture, some kind of way that they're doing things to encourage open source involvement? One is just the simple fact that you're contributing back in some way. You have employees who have a pathway. And I know there are probably a bunch of amazing OSPO leaders here, or people who have been active in this room — you're making policies and capabilities, making it possible for people to do that in a clear way. You may create open source projects. I've had the pleasure at AWS to be involved in creating two new open source projects that we've shared. We've gotten other people to collaborate. We're continuing to build those. And then there's the whole aspect of not just that you're allowing it, but that there's some kind of encouragement — some way that employees who do open source don't feel like they're stuck on a different track than everyone else. Like a promotion is harder because I'm mostly doing open source, and I'm not providing for the bottom line. And really that connects to there being some value, some incentive, so that employees think choosing to work on open source is just as valuable as being on a product team or working on a service.
And then, you know, I think one of the cool things I've seen both at IBM — you mentioned the partner story — we have a group at AWS focused on actually collaborating with other vendors and customers and partners, trying to not just do things between ourselves, but say, hey, join us in this community, and let's work on this together. And so, again, there's probably a lot more, but these are some of the keys that you would look for when asking what it even means to have an open source culture. Just going back on that last point to finish: generally, partners are really, really on the ball technically, and they've really got their ear to the ground with their customers. They want to give the customers exactly what they want. And for them, open source is always the easier path to do that, and it's the way they want to do it. So it is to your benefit to be able to engage like that. So how can you do this? I've been very lucky in that a number of years ago, two great colleagues, Matt Rizowski and Anne Graham, came up with the idea for a jumpstart program. And the idea was that early professionals would get a chance to do a course for about nine weeks, where there'd be an intro for the first two weeks — we'd tell them about open source, how to contribute to open source and how to use open source — and then particular projects, and they'd pick one. And the goal was to get a PR pushed out there. Now, if you've been in open source a while, you go, a PR, no big deal, but you've forgotten about the very first time you tried to get that PR merged. Especially if it took a while, you were probably looking at your GitHub on your phone going, come on, review it, get it in there, you know.
And it is still like that — we've all had that experience, wishing they'd get it in there, and then you get to a stage where you're like, sure, if they take it, they take it, and if they don't, I'm okay. But it's about giving people the confidence. And as I say, we started with early professionals, and now we've extended it to experienced folks, because we realized they want the chance too. Especially — I don't know, maybe if you're as old as I feel, you know what I mean — you may have got caught in a rut at work, or you might not have got opportunities, and I've seen people come and see this and go, I wish I'd seen this years ago. They see the potential, they see the opportunities, they see, you know what, I can take off here, and it gives people that go. So the biggest thing is informing your company: tell them about open source and the benefits of it. The next part then is introducing the tools and practices into your company, because things don't work in open source if the practices, the way people work, and the tools they're using are clunky or awkward. You have to remember, it's people all over the world and from all different companies. You have to find a common ground and a common way of working. And in a lot of companies now you hear innersource coming up all over the place, and sometimes you'd swear innersource was something that just fell from the sky, whereas to be honest, all you're doing is taking the first word of open source and changing it. You know what I mean? So the value that's been seen here by companies is the collaboration, I think, more than anything. And if your teams have been struggling, or they've been finding it hard to get stuff out the door, when they really start buying into this, they realize, look, a rising tide lifts all boats.
It's not about the individual, it's about the greater good. So I think that's important. The last two here: educating folks. Like I said about the jumpstart — and I always say this when we do the jumpstart — we have weekly stand-ups for half an hour or an hour, and I say to folks, look, this is not like school. If you don't have the stuff done or you haven't made progress, please attend anyway, and we'll help get you unblocked. And I always tell that story. Some people come in and have a PR pushed literally in the first week, and someone else is struggling because their kids are sick, or they've gone on holidays, or work has been really busy — but we give them the opportunity all the same. And I always use the story of the hare and the tortoise: everyone gets there in their own time. And the last bit then — Phil touched on it — you really need to give people a pat on the back in your company when they contribute to open source, because they're doing serious work out there. It's not someone out there having parties — even though people do go to parties; I saw John Willicky up there at the OpenSSF party last night. But on a serious note, you need to be able to recognize it and say to people, you're doing really good work here, well done. Yeah, and just one thing to add to that. Martin talked about the jumpstart program at IBM. We have an open source talent task force at AWS that just kicked off in the last year. We have an amazing OSPO and Nithya Ruff — many of you know her. And we're just trying to think about how to actually include HR in these discussions: what does it mean to have open source maintainers on your staff? How do you treat them differently than other parts of the company? How do you incentivize them in the same way that maybe other employees are incentivized? And then, yeah, just a lot of the practical education parts.
Is there a way for open source newbies, so to speak, to get mentored? I do a lot of mentoring — we've built a small container runtime team at AWS, where I mentor some of the younger engineers. And with them, we've created an open source hour — actually, I think it's two or three hours now — where there's an open video call, and the guy that's three weeks into the job, who's just created his GitHub ID and is like, I don't even know what to do, can join that call. And there are others on the team who are like, here's an issue, go read the issue; let's help you figure out how to get your Git set up and clone the repository. And so these are the practical nuts and bolts of how to get people involved, how to get them educated, how to get them incentivized. And again, I'm sure there are ways your companies are doing that, and I think this is an area where it'd be awesome to see more sharing of practices: what are you doing in your company to incentivize and educate for open source? And just one little thing on that: one size doesn't fit all. The way I always laugh at that is I saw one once that said one size fits most, I think. But no — I was mentoring a person at work a couple of years ago, just one to one. And he noticed I was in Helm, and he said, right, I want to get involved in Helm. So in the very first meeting we had, he said, right, I want to get into Helm or whatever. And I said to him, do you know what you'll do? Pick five projects in order of your preference and come back to me. And I'd say he was a bit stunned. He told me afterwards that at the time he thought I was the worst mentor he'd ever got: he comes in, and I tell him, come up with five things, goodbye, I'll meet you next week.
So off he went. And he came back with the five things. And lo and behold, Helm was not in the list of five. He had interest in other stuff. All right. But I kind of knew I wanted that person to know what they wanted to get involved in. So he went away and got involved in Tekton. And he made a couple of contributions and was doing a bit of work. But as the months went on, I didn't see him moving toward committer, you know, becoming more of a serial contributor, reviewing more and more. And I eventually said to him, about six or seven months in, look, what's the story? And he said to me, I was afraid to tell you, but I don't like Tekton. Now, that's nothing against Tekton — and if you're in Tekton, do not attack me on the way home this evening. All right. But he was honest. And we found out afterwards he was more interested in Knative. And once he got into Knative, because he wanted to do it, he flourished. And he's doing unbelievably well. And every now and again he meets me and says thanks for helping out. And you know what? That's down to giving someone help and listening to them, not telling them what to do. Yeah, Phil, you do it. Sorry. No problem. Yeah, we've got a couple of minutes left, so we thought we'd connect back — we talked about how we got involved in open source initially — to the where are we now. For me, I've been now 10 years in, spending almost the bulk of my time focused on open source: as a project maintainer, as a technical oversight board member in the OCI, a CNCF ambassador, and then taking all the things I've learned and trying to help others at AWS, similar to what I did when I was at IBM — being a subject matter expert, helping other teams figure out, hey, we have an open source project we'd like to launch, can you help us think through what that looks like?
So it's kind of an exciting point for me to feel like I'm almost more focused on helping others now than on trying to get involved in open source myself. Yeah, a bit like Phil, I didn't put the specifics in — you know, left hand, right hand doing different things as we go along. But for me, I think it's been about people believing in me, helping me in the communities, and now a chance to help other folks do it. That gives me great joy. When we do the jumpstart and I come in on a Monday and I'm pissed off for whatever reason — because it's a Monday, maybe — it's just the joy of helping folks, and then also being able to help teams internally if they need a hand with open source. So to just finish out: there are no free dinners in life, as my dad used to say. And he's right. If you're going to consume something, give back, because it's the best way of driving things forward and knowing what's going on. What we've learned from working in open source, and for me definitely, is collaboration: the ability to work together, no matter where we're from, who we are — it doesn't matter. As long as you're a decent person and you're willing to work away, you will get things done. And that's what teamwork is about. All the best teams, especially sports teams — I'm not going to land on Lorna, don't worry — especially sports teams, work the best when everybody is willing to do the job they need to do, and they don't have to be the heroes. And finally, it's a great place for people to grow in their careers and their lives. And if you're a senior leader or someone in the community who's done really, really great stuff, please help other people, because that's what life is about. Great, Margot. Awesome, thank you. Q&A — I will run this back and forth. Any questions? There'll be a few jelly babies in it for you. Yes.
What is the biggest community lesson you learned from OpenStack, and how have you seen that applied in open source projects that have gotten large since — like, for example, Kubernetes? Well, you're handing that to me. Well, you spent more time in OpenStack than I did. I feel like I didn't. I actually can't answer that question. No, no, no, no. I suppose, from my experience, OpenStack — I had really great experiences from it. I thought the collaboration was really, really good there. And I think that was brought forward into the cloud native communities afterwards, like Kubernetes, etc. A lot of folks went and worked in the Kubernetes communities with new people that came in. But I think the key at all times was that people understood that collaboration, and being decent to each other, and that you're trying to work towards the bigger thing. We don't need heroes, in other words. Thank you. Anyone else? No one? Either we did really well or people are bored out of their lives. Yeah, very possible. Okay. Thanks very much, folks, and thank you.
Strategies for Building Healthy Open Source Communities
So, I'm going to talk today about strategies for building healthy open source communities. I wanted to start by quickly thanking the Alfred P. Sloan Foundation — they fund the CHAOSS Data Science Initiative, which pays me — and also thanks to the Linux Foundation and the Ford Foundation, which also provide support for the projects. I have been in the technology industry for well over 20 years, working mostly on open source projects with a focus on community strategy, metrics, and growing your contributor base. And I can tell you that it is really, really hard to build a strong open source community for a project. Most of us struggle with finding enough humans to sustain our projects. So, let's start by talking just a little bit about the problem and why it can be so hard to achieve sustainable communities for open source projects. Like I said, the problem is hard. I like to start my community talks with a quote from an alien life form on Star Trek: The Next Generation, who described humans as ugly bags of mostly water. Now, I don't think we're ugly, so I think they missed on that part. But we're super squishy, right? And not just in the physical sense. We can be unpredictable. We can be irrational, especially when we're stressed out, overworked, burnt out. And the reality is, we're not robots. We're not mindless automatons. We have feelings. We have bad days. We have other commitments, and we have personal challenges in our lives that are often completely invisible to other contributors. And they can get in the way of our contributions to open source projects. But you can't have an open source project without having human beings to maintain it. So you need to be able to encourage people to participate in ways that are sustainable over the long term, both for your project and also for those people. And it helps to be proactive and ask people to participate in specific ways, and in ways that match the work you need to do within your project.
Now, many projects struggle to find people who will actively participate and continue to participate over the long term. If it were easy, you'd already have all the people you need to maintain your project, we wouldn't need this dev room, and none of you would be here watching this talk. But I think a common theme throughout all of the presentations in this dev room so far has been that we're in a situation now where there are a lot of open source projects and not enough contributors and not enough resources to maintain those projects. So maintainers are burning out, and they're in desperate need of help. And sometimes it can be really difficult to get people to contribute to your project. Unfortunately, there's no magic, no one-size-fits-all solution. So throughout this talk I'll focus on some things you can do to increase the chances of building a community and growing contributors for your project. Now that we've talked about the problem and some of the challenges, I'll shift into talking about strategies for building healthy communities. So next I'll talk about those strategies, and after that I'll talk about taking a strategic, goals-based approach to metrics. Then finally I'll talk about some metrics you can use to measure project sustainability and grow your community, along with some resources and some final thoughts at the end. So, as promised, let's start by talking about developing and executing on a long-term strategy for building a healthy community, including motivation, project governance, new contributor onboarding, roadmaps, contributor ladders — which you might have heard about before from some of the talks — and leadership. Now, people's motivations for contributing to your project vary widely. Some people are contributing as part of their job, while others might contribute to gain experience or maybe learn about a particular language or a particular technology.
Regardless of why they showed up, there are some things you can do to motivate them to stick around. Clear communication, working in the open, and reducing friction are key to helping people stick around. And I'll talk more in the upcoming slides about the importance of explicit and clearly communicated project governance, along with onboarding docs and fostering a welcoming community. There are also other things you can do to motivate people to contribute. Having good first issue or help wanted labels is an excellent place to start, because these help those humans find something they can work on while they learn more about your project. Good first issue and help wanted labels are passive requests for help, so I also encourage people to be proactive and specific about ways that people can help. Asking someone specific to review a PR, or answer a question, or respond to an issue demonstrates that you recognize their unique expertise and that you want their help with it. Knowing that we're wanted and appreciated makes us squishy humans feel good, right? Which can be a strong motivator to contribute to an open source project, or to continue contributing over time. People can also be more motivated to contribute when all of the project work is done in the open, where they can participate as equals. When some of the work is done within the walls of a company, or maybe inside a close-knit group of maintainers, it can leave the rest of us feeling left out and demotivated. A lot of people like to hate on project governance: it's just extra paperwork, it's busy work, it's politicking, it gets in the way of doing the real work on the project. But this isn't true of good governance, which is really just about setting expectations and getting all of the various humans within your community collaborating together.
Ultimately the focus of project governance is on people: the roles we play, our responsibilities, how we make decisions, and what we should expect from each other as part of participating in the community. The goal should be to make the processes for participation as obvious as possible, even for people who are brand new to the community. Having clear rules about how collaboration occurs, how decisions are made, and what types of contributions are in or out of scope helps community members make contributions that are likely to be accepted and embraced by the project. This helps avoid wasting people's time with contributions that maybe just aren't aligned with the project for whatever reason. A healthy project with clear governance makes the humans happy, and it sets your project up for future growth and long-term success. The good news is you don't have to start from scratch. The link we have here has some good templates with instructions that apply to most projects, if you want to quickly and easily build out some basic governance for your project. It's a lot more difficult to participate in a community if you don't know anything about the role you might play, the expectations, the key players, or any of the rules for participating. Explicit, documented project governance gives both new and existing contributors a clear path to guide them through your project. Spending a bit of time documenting that governance up front can save you a lot of time later, with fewer questions about how things work, and it gives you a document that you can point those other humans to if they have questions. When I start contributing to an open source project, I want to know how decisions are made, who makes those decisions, and where the discussions about those decisions happen, which helps me understand whether those decisions are made fairly and out in the open.
The bottom line is that if the processes for collaboration and decision making are not clearly documented as part of the project governance, this introduces uncertainty into the mix and uncertainty makes the humans nervous. It increases the barrier to contribution and it jeopardizes the health and viability of your project. Good documentation is how we scale the things that take up precious time for the already overworked human beings, like answering the same onboarding questions over and over and over and over. I see so many open source projects with contributing guides that don't actually provide any useful information for people who are contributing. At a minimum, a new contributor needs to understand how to spin up an environment where they can do their development, the expectations for testing, how to run tests, and any processes or other expectations that you have for pull requests and then instructions for any other requirements you might have. If this is all well documented, new contributors can get started with a minimal amount of help from the existing maintainers, which can save you a lot of time in the long run. When a project doesn't have good onboarding docs, those poor, squishy, burnt out maintainers can get frustrated by the amount of time they spend on new contributor questions, which can make it hard for contributors to feel welcome. It'll take a longer time for them to become productive. This is how the humans get discouraged and then just drift away from your project. This does not mean that you need to spend weeks and months writing the perfect onboarding documentation. At this point, anything is better than nothing. If you start with a few things that help people actually get started quickly, then new contributors can help make those onboarding documents better by adding more details and maybe some additional instructions for something that they found confusing or that they struggled with. 
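As a hypothetical illustration of that minimum — how to spin up a dev environment, how to run tests, and what's expected of pull requests — a starter contributing guide can be very short. The commands and tool names below are placeholders, not from the talk; swap in whatever your project actually uses:

```markdown
# Contributing

## Development environment
- Install the toolchain (placeholder: Go 1.21+, or whatever your project needs).
- Run `make setup` to fetch dependencies and configure hooks.

## Testing
- `make test` runs the unit tests; all tests must pass before review.
- Add tests for any new behavior.

## Pull requests
- Open an issue first for anything non-trivial.
- Keep PRs small and focused; describe what changed and why.
- Sign off commits if the project requires a DCO (`git commit -s`).
```

Even a sketch like this lets a new contributor get started without pinging a maintainer, and they can improve it as they hit gaps.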
Then after onboarding, people need to be able to find something to work on. Having public roadmaps is a great way to do your planning in the open, while helping people find something to work on that aligns with the direction of the project. If you were here yesterday for Lori Apples' talk, she talked a bit about roadmaps as well. Roadmaps provide some crucial functions within open source projects, including setting the direction of the project, prioritizing tasks, organizing the work, and attracting and retaining contributors, and also providing transparency into where the project is heading. I think a lot of people underestimate the impact that a well-defined and up-to-date roadmap can have when building community around a project. They can help guide everyone toward achieving common goals and having a shared vision about the future of a project to help contributors work on activities that are aligned with that vision. The document linked on the slide has loads of detailed information about building a roadmap for your open source project. One of the most important things to think about is how you'll maintain that roadmap over time and actually keep it up-to-date. It can help to use tools that are already part of your development or your community processes, like GitHub project boards, for example, if you use GitHub, so that people don't need to use yet another tool. If you have community or developer meetings, it can help to have someone walk through the roadmap every couple of weeks just to talk about the things that are blocked or need help. Maybe set aside some focus time once or twice a year to think about the future of the project, and then you can incorporate that back into the roadmap. Bonus points if you can find a really good project manager who can help with the process. Your project should also be designed to keep diversity, equity, and inclusion top of mind. 
Building a diverse community where all of these humans feel welcome and included doesn't just happen. It requires putting work and thought into it. But this time is well spent, right? Providing an environment where everyone, including people from marginalized populations, feels safe is the first step toward building a diverse community around your project. Ideally having programs that give people opportunities for shadowing, mentoring, and sponsoring new potential leaders can help you grow a diverse set of people into new leaders for your project. Paris talked a bit about this. The Kubernetes Contributor Experience special interest group is a really great place to see some examples of how to implement programs for things like shadowing and mentoring. And projects that make a concerted effort to actually bring in new people from a variety of backgrounds and have programs in place to move them into leadership positions are more likely to benefit from increased innovation and just have a healthier contributor community. And by having a diverse and welcoming community, you have the advantage of getting those humans who might not feel welcome in some other projects. Now Paris and Bill both talked about contributor ladders. Defining the roles and responsibilities for contributors, reviewers, and maintainers can really help with recruiting new humans into these roles. It can help to think about this as a ladder where contributors can climb up to become reviewers and those reviewers can become maintainers. But what's important is to document and make sure that people understand how they can climb that ladder and how they can gain more responsibilities within your project. A contributor ladder usually outlines the different contributor roles within the project, along with the responsibilities and privileges that come with them. 
Having a contributor ladder helps set expectations for the roles and it encourages people to think about how they might take on areas of increasing responsibility within the project. And as you get more of the humans moving into maintainer roles, you can reduce the load on the existing maintainers. And the good news is, again, there's a template that you can use to avoid building this from scratch. This one was based on Kubernetes, so it probably has more roles than you need, but you can simplify it, customize it, make it work for whatever your project needs. Paris talked a little bit about emeritus as well, so I feel like I'm just dovetailing on all the things Paris said. But humans like to think of ourselves as irreplaceable. We are not. We move on to other jobs. We burn out. We retire. And let's face it, unlike the robots, humans are mortal and we do not live forever. You should think about what you might want to do next and how you can prepare someone else to take over after you move on. I encourage projects to have an option for people to move into emeritus roles, which recognizes the hard work that they've put into a project and gives others a point of contact if they have any questions about what came before, while also allowing you to step away from the day-to-day responsibilities of the project. And I encourage you to think of stepping into an emeritus role as a successful way of just sort of handing off your duties to the next generation of maintainers for a project. Now, I've talked a lot about things you can improve. Metrics can help you decide where you need to improve your community and measure your progress after making improvements. But quite a few people seem to take what feels to me like a random approach to metrics by measuring the things that they see other people measuring or gathering the metrics maybe that are easiest to collect. And maybe this even provides something useful. 
But I encourage you to think about your goals and take maybe a less random and more strategic approach by focusing on those goals. And when I say start with the goals, I don't actually mean start with your goals for the community. I actually think you need to take a few steps back and start at the very top. What's important for your organization or what's important for your project as a whole? In my case this has often been a company, but it could be an organization like a foundation. It could just be the project instead of an organization. But you should start by looking at what that organization or project hopes to accomplish and what its goals and objectives are. And then you can take this down a level and figure out what your goals are as a community. And your roadmap can be one input into this whole process. And the most important part of putting together the strategies and plans for your open source contributions is then aligning them with the overall goals of your project. If your goals for the community don't support the overall goals for the project, you aren't likely to be successful. So it's worth the time to figure out what you want to do and how it supports the rest of the project or how it supports what your organization is trying to achieve. Once you figure out what you want to do as a community and then can tie it back to the bigger organization or the project, then you can start looking at using metrics to measure your progress. So people often ask me, for example, about which projects have the best metrics. But I really just don't think that's a good approach. What you measure depends on your goals and what you're trying to achieve, which may be completely different for other projects. So I prefer to encourage people to start by defining their goals. And ultimately, you need to look at your strategies and plans and come up with criteria that will help you measure whether or not you are successful. 
For example, if you want to improve the performance of a particular piece of open source software, measuring commits is not going to get you that. You actually need to have success criteria and measurement based on the types of performance you're trying to improve. If you want to gain influence within an open source project, maybe you work at a company, maybe you measure increases in contributions or the number of employees who are moving into positions of leadership. And as with any good strategies and plans, hopefully the outcome and results should be measurable so that you can tell whether your efforts are successful. And this is where your metrics come in handy. Once you decide on your success criteria, you need to make sure that you can get the data required to measure it and maybe start measuring it now to get a good baseline of data. And there are loads of tools available to measure data about open source projects. Some of the commonly used tools can be found in the Linux Foundation's CHAOSS project, where I work. But there are also loads of other tools, and lots of big projects already have dashboards using either the CHAOSS tools or, in the CNCF's case, DevStats. There are loads of tools available for doing this. Since this is a presentation about building community, I encourage you to focus on your goals while also thinking about where your time would be best spent on community activities. I've given a lot of suggestions so far in this presentation and you should not try to do everything at once. So I recommend that you think strategically about where you should start while keeping your goals top of mind. If you know you've had people interested in contributing but they've given up when they couldn't get started, maybe you should start with onboarding docs. If you have a lot of casual contributors, maybe you focus on the contributor ladder and governance to help move some of those other humans up to take on more leadership positions. 
An excellent way to free up time for maintainers is by getting help with different types of contributions that take up valuable time and are actually required to make an open source project successful. Things like documentation, marketing, community management, project management, and many more roles. For projects with complex code bases especially, it can sometimes be easier to onboard people into these roles first to free up some time to onboard other contributors later. This also has the advantage of bringing people in to help with things that can have a big impact on growing your community like roadmaps, governance, and other documentation. Time is precious. So it is important to identify the problem areas within your community where you can focus on the right things while avoiding wasting time on areas that are already working well. However, metrics do need to be interpreted in light of your goals, how you operate as a community, and all of the other things happening within your project. There's no one-size-fits-all interpretation of metrics. So in this next section, I'll use some example graphs from some of our CHAOSS metrics and talk about what some trends might indicate and how to think about addressing potential issues. One key area to look at for your project is responsiveness. This is a backlog graph from the CHAOSS GrimoireLab tool. In this project, you can see that there are times where they've got a lot of PRs in the backlog that need to be merged or closed. Now, if these PRs are coming from several regular contributors who aren't maintainers, it might be a good idea to look at how you can promote some of those humans to become reviewers or maintainers to help out with the workload. But as with any metrics, you need to interpret them in light of your project. 
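The backlog idea above can be made concrete with a toy calculation. This is a hedged sketch, not GrimoireLab's actual implementation: it assumes you already have each PR's created and closed dates, and it simply counts how many PRs were open on any given day, which is the number the backlog graph plots over time.

```python
from datetime import date

# Toy PR records: (created, closed) dates; closed=None means still open.
# This flat layout is an illustrative assumption, not any real tool's schema.
prs = [
    (date(2024, 1, 5), date(2024, 1, 20)),
    (date(2024, 1, 10), None),
    (date(2024, 2, 1), date(2024, 4, 15)),
    (date(2024, 2, 20), None),
    (date(2024, 3, 3), date(2024, 3, 30)),
]

def backlog_on(day, prs):
    """Count PRs that were open (created but not yet closed) on a given day."""
    return sum(
        1
        for created, closed in prs
        if created <= day and (closed is None or closed > day)
    )

# Sample the backlog near the end of each month to see the trend.
for month in (1, 2, 3, 4):
    day = date(2024, month, 28)
    print(day, backlog_on(day, prs))
```

With real data you would feed this the PRs exported from your forge's API and plot the samples; a rising curve is the signal the talk describes, and the same caveat applies, since releases and vacations can also push it up.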
There are other things that can cause an increase in the backlog, like everyone preparing for a big release or maybe a big conference or just vacation season, that might not be resolved by moving more people into leadership. Again, these graphs come from GrimoireLab. Other metrics for looking at responsiveness focus on the amount of time it takes for maintainers to close issues and PRs. Looking at trends for these metrics is particularly important. In this example, you can see that it's taking a lot longer for maintainers to close issues or PRs. It might be a good idea to look at how you can promote some more humans to become reviewers or maintainers to help with the workload. Again, you need to interpret this in light of your project. There are other things that can cause an increase in time to close, like the project becoming more complex or becoming larger, which can just increase the time required for things like testing and other activities that would happen in the process of reviewing and closing PRs. It can also help to look at the types of contributors that you have. In this case, casual contributors are those drive-through contributors who make a small handful of contributions and then disappear, possibly forever. Regular contributors are the ones who make some contributions and then they stick around and continue to make contributions over a period of time. Core contributors are usually the maintainers who are there for the long term. You can really learn a lot from this graph. If you have a very small number of casual and regular contributors, this can mean that people don't have the information needed to become productive and to contribute. In some cases, onboarding docs can help solve these issues. Another thing this graph can indicate is whether there may be some fundamental issues within the project that are driving the humans away from your project. 
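The casual/regular/core breakdown can be approximated from a commit log. Here is a minimal sketch; the thresholds are arbitrary cutoffs chosen for illustration, not CHAOSS definitions, and a real analysis would also consider how long each person has been active, not just their volume.

```python
from collections import Counter

# Toy commit log: one author name per commit.
commits = ["ana"] * 120 + ["bo"] * 15 + ["cy"] * 2 + ["dee"] * 1 + ["eli"] * 40

def classify(commit_counts, casual_max=4, regular_max=49):
    """Bucket contributors by commit volume: casual, regular, or core.

    The cutoffs are illustrative assumptions; tune them for your project.
    """
    buckets = {"casual": [], "regular": [], "core": []}
    for author, n in commit_counts.items():
        if n <= casual_max:
            buckets["casual"].append(author)
        elif n <= regular_max:
            buckets["regular"].append(author)
        else:
            buckets["core"].append(author)
    return buckets

counts = Counter(commits)
print(classify(counts))
```

Watching how the sizes of these buckets change release over release gives you the trend the graph in the talk shows: shrinking casual and regular buckets are the early warning sign.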
If you see the total number of contributors declining or the number of regular contributors declining, this can indicate some deeper issues, maybe toxic community members or an unwelcoming environment, and that probably needs to be resolved before you do anything else. Or it can mean there are other issues with things like lack of responsiveness. This metric is often called the bus factor or lottery factor, based on the idea that if one person was making all of the contributions and disappeared after winning the lottery, then the project would probably be in trouble. This graph uses data from CHAOSS's Augur software. I recommend measuring this because there are a few things it can tell you. First of all, how big of an issue is your current contributor situation? If it's like this one, you really should focus on getting some additional contributors and maintainers. You also might find that there are people who are contributing more than you realized, which is the other reason this is a good metric. This can help you think about who you can encourage to contribute more or maybe find someone who can move up the ladder into a leadership role. So you might look at some of those people who are a little bit lower down on the graph and see if you can promote them up into being a maintainer. The catch here, and with so many metrics, is that we don't want to just think about the people who are making commits. This is a good start, right? It's a start, but you should also be thinking about how to move people into leadership positions to be responsible for things that might not show up in GitHub, like documentation, community management, marketing, mentorship, lots of other important roles. And metrics are not something that you look at once and never revisit. It's important to think about metrics gathering as an ongoing process of measuring, improving, and monitoring. So you think about your goals and what you want to achieve. You pick some metrics. 
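One simple way to make the lottery-factor idea concrete: the smallest number of people who together account for at least half of all contributions. The 50%-of-commits definition used here is an assumption for the sketch; Augur's actual metric is richer, and as the talk notes, commits are only one kind of contribution.

```python
def lottery_factor(commit_counts, threshold=0.5):
    """Smallest number of contributors covering `threshold` of all commits.

    A lower number means the project depends more heavily on fewer people.
    """
    total = sum(commit_counts.values())
    covered, factor = 0, 0
    for n in sorted(commit_counts.values(), reverse=True):
        covered += n
        factor += 1
        if covered >= total * threshold:
            return factor
    return factor

counts = {"ana": 120, "bo": 15, "cy": 2, "dee": 1, "eli": 40}
print(lottery_factor(counts))  # ana alone has 120 of 178 commits, so this is 1
```

A result of 1, as in this toy data, is exactly the "focus on getting additional contributors and maintainers" situation described above.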
You make improvements, and then you monitor that over time. And before I wrap up the talk, here are just a few resources that you might find useful. There's some great stuff there from the CNCF Contributor Strategy TAG, including how-to guides and templates. The Open Source Way guidebook is just another one of my favorite community resources. And then the CHAOSS metrics. We also have a Slack channel. You're welcome to join us. Anyone can participate in the CHAOSS project. Maintaining an open source project is so much work. And there are so many maintainers who are overworked, exhausted, and burning out. The best way to address this challenge is by finding more humans and growing your contributor community. But it's hard work, right? And it takes time away from the day-to-day activities now, which can be super hard to justify if you feel like you're barely keeping up as it is. In the longer term, spending at least a little time on things that can help you recruit and keep new contributors will be worth it. And as I mentioned before, you don't need to do everything at once. Think about your goals. Use your metrics to help you figure out where your time would be best spent. So this is what I'm asking you to do. If you're a contributor to an open source project, carve out maybe an hour a week to improve your onboarding docs, your contributing guide, your project governance, your metrics, or just spend that time helping another human learn to do something new in the community. With that, thank you. And I'll, I think we have another two minutes for questions. Yes? Thanks for the presentation. It seems that some of the ideas that you presented, the contributor ladder, the... Sorry, can you speak up a little bit? Thanks. Thanks for the presentation. It looks like some of the ideas that you presented, like the contributor ladder, can be maybe at odds with a project that is really owned by a company or where there is a strong presence of the company. 
Do you believe that there is a way to resolve this? Yes, I do think that there's a way to resolve that. I do think that the governance and the contributor ladders sometimes work a little bit differently when you're talking about projects that are owned by companies. I think that the best thing that the company can do is to be honest about what roles are really open to people from the community and which ones might not be. And that might not be something that your company wants to be transparent about, but I think if you're really trying to build a community around it, I do think you have to be transparent about that. And I think that the people that will stick around in your community will at least respect that transparency, even if maybe it's not the answer that they wanted to hear. So I think there's definitely room for that, but it will look a little bit different and you will have to have that balance between the company and the community. Hey, thank you, Dawn. That's all we have time for. Thanks. Thanks. Thank you.
Building Communities with Science!
I would like to introduce our last two speakers of the day, last but not least. We've got Stephen and Mike. Check. Yes, there we go. Hi. Hi there. I'm Steve Jacobs. I am the director of Open at RIT, though I have been teaching students and faculty and staff about open source since 2009, when we started making educational games for OLPC. This is my first FOSDEM. So thank you very much. And hi, I'm Mike Nolan. I'm the associate director of Open at RIT. I do not have a career tenure as long as Steve. But I have been doing open source for around a decade now across many different sectors and industries, from the humanitarian aid sector and now working in academia. And today we want to talk to you about using science to build open source communities. So I'm sure if you've been sitting in the devroom all day, you'll probably have noticed some themes around community building and different methods. And you'll probably see some of those themes in ours, but in particular what I'm interested in conveying with this talk is how we use evidence-based practices to really figure out what sort of things are necessary for building communities in various types of open work projects. So just as a quick note, you know, while Steve and I are up here, we have a long and storied history of hiring team members and students and very talented people to work on some of the things that we're going to talk about today. Two of those students are Django and Daechon, who are fantastic designers. I think both of them might be looking for work, design work in open source. So if you're interested, please reach out. I'd be happy to connect you with them. They're both amazing and talented. So before we get into the nitty gritty details, Steve, do you want to talk a little bit about who we are and what we do? Sure. So what is now Open at RIT has its roots in the education program we started in 2009, when we wanted students to make educational games for One Laptop per Child. 
That went from a seminar to a standard course to multiple courses and a co-op paid internship program that led to the only academic minor in free and open source software and free culture on the planet. Our alums include folks like Justin Flory, who is running the Fedora community these days, and Remy DeCausemaker, who is the head of the first open source programs office in the federal government, at the Centers for Medicare and Medicaid. So we built things like Fedora badges. We laid the groundwork for the UNICEF Venture Fund teams' Roadmap and Milestone program for their fellowship people. So we've been around doing a lot of things. We've learned a lot from a lot of people, including, shout out, Elizabeth Barron up there in the back, who is the CHAOSS community manager, God bless her, everyone. And several years ago, when universities started talking about having OSPOs within universities, we decided to spin one up, the second one in the U.S. anyway, to be an open source programs office. Except we don't call ourselves that. We call ourselves an open programs office and we talk about the fact that we support all academic open work. We don't want to send the designers, the artists, the people who don't feel like they do programming or they do research or they do formal academia away. We want everybody to do open stuff and we're there to support them to do that. God bless the Alfred P. Sloan Foundation, as they gave us some funding because they came to us and said the stuff you used to do externally for UNICEF, we want you to do for your own faculty members, and that's what we're going to be talking about a lot today. Did I miss anything? If you want you can talk about the pillars of the services. Okay, so formal academic education for students. We run a list for faculty, students and staff to look at to learn more about open. We do policy work internally for the university. 
We just released a position paper on the reasons we feel that federal institutions should fund the peer review for open science. Since open science is being pushed by most of our governments to try to fix pieces of science that are broken or inaccessible, and peer review is one of those. We do work and research into digital infrastructure for community building. I was one of the first Ford Foundation digital infrastructure fellows many years ago, and what we looked at was community issues within PyPI, and the first types of things that Dawn was talking about, her comments about knowing what your community is about, what you're doing. You can look up the research and the fully open qualitative research we did, which means that, uncharacteristically, the people we interviewed signed away their privacy rights and allowed us to use their names and their transcripts from their interviews. If you look up conceptual mismatches, you'll find access to all that data. But to go to Dawn's point, one of the things that PyPI ran into was roadblocking themselves, when the culture of the program made the people who got hired to do outreach, community management, onboarding, and governance end up feeling guilty because they weren't pushing code. So we do that kind of infrastructure and policy work, community building work, and we do this fellowship work which we pioneered, for us anyway, working with UNICEF, extended to our own faculty with the Sloan Foundation support, and now do for anyone who would like to work with us. And so going into this fellowship work that we're talking about here, this is going to be the main subject of our services that we want to sort of be describing today, because I think it's quite interesting for community building. This is the approach that we landed on. So to give a quick overview of what we mean when we say community building or fellowship, when we talk about the people we're trying to provide services to, these people are maintainers of open work. 
I'm sure many people here might be maintainers of open software, others might be academics creating data sets, or people publishing journals and stuff like that of open work, right? And they come to us because they're really interested in either building a community around a piece of work that they built, maybe growing an existing community, making it more inclusive, or bringing in a new type of potential contributor or user. Or maybe they're just super burnt out and they're like, this is an entirely unsustainable way of maintaining this piece of work and I want to figure out what I can do or what sort of things are needed to make it more feasible to maintain this work. And so, what we have slowly built over the last, I don't know, 10 years of working with different clients is this process where we provide a team of developers, designers, and community managers to tackle this specific problem, right? Not necessarily making direct contributions to them, but to tackle this problem of figuring out these community issues, what they are, and what resources are needed to overcome them. And so, what does this team actually do? Well, we kind of act as maybe a project accelerator, which is a pretty common term, and you see startup accelerators do this, where you have a directed resource specifically for figuring out what your market is and what you're going to do. We do this with projects. So, a project maintainer will come to us and we'll help them launch or convert a project to being open, or we'll help them grow it or maintain it so they don't have to burn out. And so, as I said earlier, we do this with all kinds of projects. We have this link all over our slides quite often, this open work definition, so we don't just work with open software maintainers, but open scientists, open data repositories, OERs, and many other things as well. 
And kind of like the smaller cases, but things that often happen is, because we're based in a university, we've had faculty come to us for help. And: I use this project's stuff all the time. I have all these teaching materials. They need teaching materials. How do I contribute it back? I don't understand what to do. So, we've helped faculty do that, and as a result, we've also kind of done an analysis of that project's contributor pipeline and said, you know, if you kind of smooth things out here and here, you'd probably get more people putting stuff in. So, it kind of works both ways. You don't have to be a maintainer. And so, before we get into the services, I think it's good to understand the background of what existing resources and knowledge bases we're pulling upon when we think about how we're going to be building communities. And so, a big one is the Mozilla Open Leadership Training Series, which, you know, I can't speak for these communities, but from my perspective has really influenced programs by Code for Science and Society or the Turing Way or Open Life Sciences and many other sorts of community building services offered by other institutions. And this training series was an amazing thing that was created probably a decade ago at this point that really set out the foundations of how you think about contributors. And, you know, Dawn was talking about the contributor ladder, right, which I think is kind of a good allegory to the contributor personas and pathways that are talked about by Mozilla. We also pulled from concepts from Nadia Eghbal's book, Working in Public, thinking about taxonomizing communities and understanding that, depending on the type of community, they may face different issues. We also use design thinking in our process of ideating and creating solutions and iteratively testing them. 
And then, you know, obviously, all of this came together in large part when we were working with the UNICEF Venture Fund, doing sort of consulting with the various projects that they were funding to build open source communities. Some of this knowledge has been stored in UNICEF's own sort of open source inventory. It's kind of taken a different shape at this point, but I encourage you to check it out if you're interested in the approach that they're using right now. And this also backfills the undergraduate education we do. The degree, the minor courses, are open to students all across campus, not just computing students. It's, you know, as everyone has said before, contribution is docs, it's onboarding, it's graphics, it's websites. I think kids build logos for the projects that they want to contribute to as part of the course. And roughly 40% of the class is learning how to do analysis based on the type of graphs that Dawn was showing, both for themselves, since you're going to have to make contributions by the end of the semester. Does this look like something you want to try to contribute to? Are they responsive? Are they supportive? What do the flows look like, right? When your projects are due by the end of the semester, does that line up with a point when they're not really taking contributions? So is that a good thing, right? So they learn to apply all that stuff. So when we start working with a project, in particular a first engagement with a project, we try to divide it into these three sections of objectives that we're trying to work on with our client. The first is, you know, kind of slowly learning more and more about the project and more and more about what's going on. And through that you can figure out what the actual specific things are that need to be created. 
So, you know, first and foremost we have to understand what the project is, what its goals are, like why did you create this, why is openness, you know, purportedly something that's important and useful to the project, and what do they hope to get out of that. You know, then we try to figure out who are the potential stakeholders and how they would potentially get involved. And then from that we begin working backwards and say, okay, well, what are the actual roadblocks that people are encountering when they try to get involved? What are the things that are missing? So, you know, I think just to give a little bit of additional detail for each of these stages, I kind of want to pose a few questions that we're oftentimes asking our clients when we're working with them. Right? Like, what are you actually trying to do? What's the point of this project and resource that you're creating? Why is open source the thing that's important to it? Is it about, you know, getting people from other areas that you traditionally don't work with? Is it about inclusivity? And then, do you have a community or people currently contributing? And in what ways are they contributing? Maybe you have some software contributors but you don't have contributors from a different place. Or maybe you have a lot of non-technical people involved in a software project but you don't actually have a lot of technical people. You know, that's super common in things like humanitarian software, right? And then, what sort of community are you trying to create? And then after that, once we kind of understand the objectives of the project, we can start trying to figure out who the different archetypes of stakeholders are. So in Mozilla or in the Turing Way, they talk about contributor personas, right? 
So these archetypal sort of contributors: maybe in our project we might have a persona that's focused on researchers. But then you might have another type of contributor that's maybe coming from the private sector or something like that. And these people each have their own incentives and sort of reasons for getting involved and ways of engaging. Then, you know, we begin theorizing what's the ideal way of getting involved, and this creates the pathway, or maybe something akin to the contributor ladder. And so this stuff gets applied no matter what type of project it is, right? So when we did 25 different academic projects over two years running people through this thing, we had everybody from computational astrophysicists to grape vineyard DNA people to deaf educators working on early childhood learning of languages and teaching international sign languages, to my favorite acronym of any project we've ever worked on, the Victorian Autobiographical Information Network, or VAIN, so that they could share their data on Victorian autobiography that was much broader than what they could put in their book, right? So all of this stuff works no matter what corner of the universe they're coming from, right? And, you know, collecting all this information on the contributor types and how they get involved, this is meant to create an idealized version of what you hope your community to be, right? Who are the people that you actually want to be involved, and what is the end goal? And then from this end goal, we begin documenting where these shortcomings are, right? Where are the things that might be missing or preventing people from getting involved? And this stuff can be pretty simple in the end, right? 
It can be things like project documentation that's maybe missing, or a lack of marketing materials, or just no outreach in specific, you know, either geographic or online areas where people aren't finding things, or a lack of governance that is preventing people from getting past the stage of being maybe a first-time, you know, drive-through contributor and into a thing where they're maybe approaching something more akin to a leadership role, right? And so it's this very specific process that, regardless of the type of project, we really try to stick to, because by going through this process we're gathering the evidence of what's actually needed and what the materials are. And, you know, it took us some time to get to this because there's a real tendency of just being like, well, I just want a README template that will make my community grow, right? Just give me the best practices, right? But, you know, oftentimes when you hear people who do community building, they're like, well, yeah, there are best practices for sure. Like, you know, make sure you have a license and a code of conduct. These are good. But the realities of what it takes to build a community are very specific to your community. And so what we've worked really hard on is trying to come up with a specific process of finding out what that is so you can know for sure. And so to talk about maybe the scientific methodology that we're using, we think about our engagements in two different ways. The first is developing solutions, which is kind of what I've gone through here, right? And the way that we do this, beyond just asking the maintainer these questions, is we're conducting qualitative semi-structured interview studies with various types of... yeah, thank you, big words. I just got my masters, so, you know, what can I say? 
We conduct these qualitative semi-structured interview studies with people to generate evidence on what sort of interventions are needed. So we're heavy touch in this case, because what we do is, after coming up with all these personas and pathways, we talk to all these different people in the community that we feel are representing each type. We talk to people who were really successful in getting involved. We try to talk to people who were unsuccessful in getting involved. And through these interviews, we're generating legitimate evidence as to what has worked and what hasn't worked. And through that, we can begin trying to derive potential solutions to these problems, right? So if we know that people keep on having trouble with, "yeah, but I couldn't get the project up and running on my own computer, so I just kind of ditched it and tried to use something else," then you know for sure that this is something that is worth putting work into. And then once we develop these solutions — maybe we write some documentation, right? — you deploy it, and then you want to be able to see if it was effective, if things have changed, and if new problems have popped up. And this is kind of the next stage after that, where we have a preference for taking a more mixed-methods approach that involves using more quantitative data. Don, I think, did a really great job talking about some of the different ways that you can use various software from CHAOSS to evaluate against these goals that you've developed in the first stage, as well as continuing to do qualitative work and confirming the conclusions you're generating from the data with community members. So following up and talking with people and showing them: hey, it seems like the number of contributors has gone up here. 
Like, maybe the bus factor has lowered — and seeing if there are new problems coming up. And so I think it's important to note that the interview work and this qualitative work that we do in our community building is very heavy touch. And it's kind of expensive, right? You've got to have people going out and doing half-hour interviews with random people over and over, collecting all the data, doing the transcription, and then doing the coding. It takes a lot of work. But we find that it's really important, because this work is often overshadowed — it's often not thought about. And Steve, maybe this would be a good opportunity for you to talk about your research with conceptual mismatches and what you were actually finding out. So, yeah, the conceptual mismatches effort. PyPI, the Python Package Index, had originally come to life as many open source projects do. Somebody had an itch, they had to scratch it, they built it, and it wasn't created for tens of thousands of people using it on a daily basis, right? So it needed to be re-scoped, re-architected, refactored — all those things, right? And they hit a wall. And they hit a wall even after they got funding to do that work. They hit a wall because of different perceptions of what being a community member, being a contributor, what the goals of the project were — all those things. And so we did multiple rounds of interviews, and this is a process developed by Dr. Mel Chua, which she originally used in her PhD thesis: a set of round-robin interviews where we got X number of maintainers and X number of active community members to respond not only to a series of questions, but then respond to each other's answers, right? So we did three cycles and distilled all that down. And because they were willing — because they were open source people, yay, right? 
When you do a qualitative interview as a social scientist, generally it's just a given that people never reveal who they are, right? Because often they might get fired if they trash their boss as part of this process, right? So you are never, never, never supposed to actually out people or tie them to their work. And you have to go through an IRB, an Institutional Review Board, where you prove that not only are you not electroshocking your students, but you are not revealing data. And we had to work with our IRB to create something — a process by which we had those folks not only sign off on the IRB paperwork, but give up their copyrights to the transcripts of their own interviews. We did let them review them first, but all that paperwork, all those forms, all that process is in the same repository as the report. So if you search on Ford and critical digital infrastructure, if you search on conceptual mismatches, you will find a one-pager first, and there will be a link to the repository. You can go into the repository and pull all that stuff out and reuse it. So if that kind of social science work is of interest to you and you want to try to convince your own powers that be that no one will go to jail or be sued if everybody signs off on this stuff, that's there for you. And as I highlighted briefly, this happens, right? You have a core set of maintainers who are focused on how many lines of code go out a day, right? And did we hit refactoring this piece of the project by this time? And there was so much pressure there that people felt bad about doing their onboarding or their documenting or their other types of jobs, and that stuff fell behind, even though it was funded as part of this big refactor, right? So culture and misperceptions about what our goals are and who we are and what we want to accomplish actually got in the way of doing what people wanted to do. 
So I'm going to breeze through this just so we have a little bit of time for Q&A at the end, but I thought it might help to give a down-to-earth example of a simple project that we worked with on a limited basis. One example is we worked with an RIT professor, Professor Rastogi. She is a professor of computer science, and she was working on developing datasets and investigating different ways that you can potentially cause — or ideally prevent — issues around self-driving vehicles getting into crashes due to harmful information in their datasets. So obviously there are a lot of potential stakeholders that would be interested in this data and how it can be distributed, and a lot of other open source simulation systems that could use it. So we worked with her to develop specific personas for understanding who these different types of stakeholders were, right? Because there were stakeholders in the private sector developing AV systems. There were other researchers who were interested in her work and might be able to continue generating further data. And then there were also these potential partner projects, or integrations with simulation systems, right? And so from there we began to try to understand the goals of this project, which was a very early-stage but potentially high-impact project that could be integrated into all these different systems. And we found largely it was an issue of discoverability, and also of understanding the open source aspect of this. Through many of our interviews we realized that there just wasn't a funnel from finding the research to understanding that this is a project that can continually live and evolve and get contributions from other researchers and so on. So we needed to find a way to create a narrative around this. 
And an obvious, very easy first step to this was developing a simple landing page website that describes the project, the research, what the main contribution asks are, and then showcasing those different pathways of getting involved — which oftentimes is just linking to a well-documented GitHub repository and examples of work, right? So we said, okay, we're going to make a website; and we went and made a website, prototyped some copy, got some review on it, did the whole usual process of making a website, and deployed it, and it was somewhat successful. I'm not going to go beyond this point because it was a fairly limited engagement, but when thinking about applying this process, in particular if it's a small project, it can be quite a simple engagement, right? But coming back to what's the point of this engagement, and doing all this work to find out all this stuff — there are a few things, right? As a community becomes larger, developing that community becomes more complex, because you have more stakeholders involved, you have more potential conflicts between those stakeholders, and more resources are needed to coordinate all that. And community development skills are not necessarily often found in this peer-to-peer production of open work, right? So for our team, we had many designers and social scientists that do this work and understand how to gather data from interviews and create this evidence. And Dr. Rastogi was super talented, but not necessarily in these fields. And then, finally, if you want to justify allocating additional resources to growing your community, you need to have evidence to do that, right? 
In particular if you want to get funding for anything, it's very, very difficult — and we're finding this now particularly in the private sector — to just say: well, we might go away if you don't give us money; we might go away if you do give us money, but it's less likely, right? I've spent the last four years of my life writing a lot of grants, and I can promise you that, at least in those four years of experience, that has never worked once, although I've tried it a few times. So community development can get complex, particularly as you migrate out of a very simple governance structure into something that has many different tiers and many different types of people involved. And this takes work to do, right? And that work requires new skills — or maybe not new skills, but different types of skill sets. And that skill set is important for doing this work. And then finally — and this has certainly been probably one of the biggest points that has opened the eyes of a lot of people that we've worked with — if you want to really get additional resources for your project, particularly if you're a burnt-out maintainer, or you feel like your project is plateauing but you want to grow it and you know that it can, you just don't have what you need to do it. So getting the evidence to showcase what exactly is needed is one of the best tools that you have at your disposal for finding ways to potentially fund this, or just to give yourself the mind space to begin organizing around it. And then you can also get the data to be able to do that. And then, yeah, we have three minutes and 45 seconds left. 
So thank you, and questions. Ask away. Any questions, folks? I'll bring the mic up. It's okay. So, across all of the projects that you've worked on, you do these personas. And I have to tell you that that is definitely a trigger word for me, having worked in a number of companies where they would do personas, and the people who ended up doing the personas knew absolutely nothing about the actual users. — We do them right, so you're not triggered. — My question is: have you looked, across all of these projects where people have done the personas and then later succeeded in the project, at how far off the personas were at the beginning? Did they evolve over time? Did they get closer? Can we get better at doing personas? Great question. There's a lot of bad persona work out there. So I think part of it is coming from a design perspective. Personas are meant to be living documents that are tested out. They aren't just meant to sit there. It's like corporate values — a thing I hate; I hate corporate values, because they're these arbitrary words like "we're here for goodness" and whatever. Personas can sometimes be that: we have our customer persona, and they use our thing because they want to give us money because we're valuable. That's useless. With personas, what we're trying to find out is almost like documentation of our studies, right? We're trying to figure out: who are the types of contributors that are useful to our community? Why do they want to contribute? What does that process look like? We may make assumptions, or collect invalid data, about who that person is and why they want to contribute. And so when thinking about our process — why there are these two stages of developing solutions and then testing them — this is where the evolution of a persona comes in. So, say we think our personas are just having issues discovering us, right? 
So they're really interested, and they have the time to contribute, and so we're going to have a marketing campaign. But from that campaign, we may find that they actually have another problem, and then we can evolve that persona. So it's really meant to be kind of like documentation of that data. Let me also weigh in on the answer. I mean, I started as, and still am, a video game design professor. I've gone horribly wrong in so many ways. But what we tell our students straight up, first thing, when they're about ready to make a game, is: you're not your player. You think you're your player, but you're not your player. And so you need to figure out who your player really is — just because they like the same games as you doesn't mean you're necessarily them. Any other questions? Well, thank you very much for our final talk of the day. Thank you so much. And thank you all for a great first day of FOSDEM. Thank you. Thanks, folks. Thanks for sticking around.
Intel TDX Deep Dive
Perfect, I guess then let's just start a bit earlier — that potentially leaves a bit more time for questions. Hi, my name is Benny Fuhry. I work in the Confidential Computing Enabling Team at Intel. My main job is to — I can try, but I noticed when I stood there that it's not really loud; the speaker is not loud there. Okay, I will try to speak closer to the microphone. So yes, Intel Confidential Computing Enabling: we work together with academics, companies, partners, whoever wants to use our technology, and we help them to do that. That's my job. Today I will talk about Intel TDX. It's called a deep dive, but I will start with an overview and then go deep in a few slides. So, overview first. I don't want to speak too much about that, because it was just covered in the talk we just had. Without confidential computing, or if you don't use any protection mechanism, everything is in what we call the trust boundary — everything can access your confidential data. With our first technology, Intel SGX, which was just mentioned, only the application is protected; with Intel TDX, the topic of today, a virtual machine is protected. All of that was just mentioned, but I want to mention it again, because we have the options, right — use whatever you want. In general, you could say Intel SGX is the more secure technology and Intel TDX the more usable technology, but that's up for debate if you want. Today we will concentrate on Intel TDX only. Here you see an overview. This is what a regular system looks like: we have the platform with cores, caches and so on, the memory, and a regular hypervisor, a virtual machine monitor. And with normal VMs, this hypervisor starts the virtual machines, and it is also this hypervisor that isolates the virtual machines from each other and isolates the virtual machines from the hypervisor itself. 
In the main memory, everything is plain text, which means that every person and every program with the necessary privileges can access the data. This is different with Intel TDX. With Intel TDX, we introduce what we call the Intel TDX module. The hypervisor has to be adjusted as well — it now says here it's TDX-enlightened — because the hypervisor is still responsible for resource management, but instead of starting the virtual machines itself, it now has to go to the TDX module and say: here, please start these TDX-protected virtual machines for me. And this is what the TDX module does: it starts the protected virtual machines, which we call trust domains. Intel TDX stands for Intel Trust Domain Extensions; Intel TDX-protected virtual machines we call trust domains. Inside those trust domains, or TDs, the guest OS running there has to be enlightened as well — it at least has to have some changes, because it now has to handle accesses to private memory and shared memory, it has to handle exceptions, and it has to block certain calls that were possible before. The applications inside the TD do not have to be adjusted — that's the main advantage of Intel TDX and comparable technologies. The main memory belonging to the TD is encrypted with an ephemeral key that is dedicated and hardware-managed. As you see on the slide, it says encrypted with key one and key two, because every trust domain is encrypted with a different key. Inside the CPU, the data belonging to the TD is plain text — that's what confidential computing does: inside the CPU, data is plain text, but the CPU takes care that only the trust domain to which the data belongs has access to the data. 
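The per-TD key scheme described above can be sketched as a toy model. This is not the real hardware interface — the names (`ToyMemory`, `keystream`) are invented, and a SHA-256-based XOR stream stands in for the real AES-XTS engine — but it illustrates the core property: software only ever selects a key ID, and reading a page with anything other than the owning TD's key yields only ciphertext-like garbage.

```python
import hashlib
import secrets

def keystream(key: bytes, addr: int, length: int) -> bytes:
    """Toy per-(key, address) keystream -- a stand-in for AES-XTS."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(
            key + addr.to_bytes(8, "big") + counter.to_bytes(4, "big")
        ).digest()
        counter += 1
    return out[:length]

class ToyMemory:
    """Models multi-key memory encryption: each page is encrypted under
    the ephemeral key selected by a key ID; software never sees the key."""
    def __init__(self):
        self.keys = {}    # key ID -> ephemeral key (hardware-held in reality)
        self.pages = {}   # address -> ciphertext as stored in DRAM

    def new_td_key(self, key_id: int):
        # Generated fresh per TD; never exposed to any software.
        self.keys[key_id] = secrets.token_bytes(32)

    def write(self, key_id: int, addr: int, data: bytes):
        ks = keystream(self.keys[key_id], addr, len(data))
        self.pages[addr] = bytes(a ^ b for a, b in zip(data, ks))

    def read(self, key_id: int, addr: int) -> bytes:
        ct = self.pages[addr]
        ks = keystream(self.keys[key_id], addr, len(ct))
        return bytes(a ^ b for a, b in zip(ct, ks))

mem = ToyMemory()
mem.new_td_key(1)   # key for trust domain 1
mem.new_td_key(2)   # key for trust domain 2
mem.write(1, 0x1000, b"td1 secret data!")
assert mem.read(1, 0x1000) == b"td1 secret data!"   # owning TD sees plaintext
assert mem.read(2, 0x1000) != b"td1 secret data!"   # any other key: garbage
```

The point of the sketch is the indirection: callers pass a key ID, and only the (modeled) hardware ever touches the key bytes, matching the talk's claim that no software — not even the TDX module — accesses the actual encryption keys.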
Combining the main memory encryption and access control, Intel TDX enforces the isolation of the different TDs by using the Intel TDX module, and on top of that, attestation proves that this is the case — we will talk about attestation a bit later. This slide is about the Intel TDX enabling in Linux. It contains a lot of details; I don't want to go into every one, I only want to highlight three things. First, VM isolation requires enabling a lot of parts of the software stack, and Intel has done that — we have put the work in, and basically everything is open source; even the pieces in gray are only gray because they are a reference implementation, but they are also open source. And most of the pieces are already upstreamed, but not everything — that's the current situation of Intel TDX, but this will change soon, hopefully. One last slide of the overview is the availability of Intel TDX. Intel TDX was introduced at the beginning of 2023 with the fourth generation of Intel Xeon Scalable processors, but back then only at the four leading cloud service providers you see on the right. Everybody else buying these CPUs did not have Intel TDX enabled. Previews at those cloud service providers started already in Q1 2023, and general availability there is supposed to come soon this year. Intel TDX became generally available with the fifth generation of Xeon Scalable processors, which was introduced at the end of last year in December — meaning if you now go to your favorite hardware vendor, you should be able to get such CPUs, or at least soon. Good. Now to selected technical details of the technology. First, the CPU state is kept confidential by managing it in CPU-protected memory, and that's the responsibility of the TDX module. For example, on a TDX exit, the CPU state is saved by the TDX module in a protected memory region, and this memory region is encrypted. 
And all memory confidentiality and integrity provided by Intel TDX is provided by what we call the TME-MK engine — the Total Memory Encryption, Multi-Key engine with integrity. This is used to encrypt all the main memory belonging to a TD, to prevent untrusted software from observing the TD's memory. It uses AES-XTS with 128-bit keys, and each TD, as mentioned before, has its own key. The memory integrity feature detects TD private memory corruption by software and by direct memory access. The TDX module is responsible for managing all the keys used to encrypt the different TDs, but the TDX module itself still does not have access to the keys. This is done by the TME-MK hardware that manages the keys; the TDX module only references key IDs. No piece of software has access at any point to the keys that are actually used for the main memory encryption. I will skip remote attestation for now, because I will explain details later. But a bit about IO compatibility: by default, no direct connection to external devices is possible, because those external devices are untrusted. Such IO can be emulated by software, but this has performance overhead. At the end of the talk, I will talk a bit more about these aspects and how the situation should change in the future. With Intel TDX, performance monitoring and debug facilities run inside the TD. This is a difference compared to Intel SGX, because it means you can debug your application handling sensitive data: even during debugging, you are protected, you are inside the trust domain. Sure, the person that does the debugging now has access, but the infrastructure provider still doesn't see it. One final aspect here: the page table management happens inside the trust domain now, to address remapping attacks. This was also different with SGX, where it was the responsibility of the operating system, which was untrusted. Now a few more details about the TDX module and what we call the Secure Arbitration Mode. 
The TDX module is provided by Intel and the code is open source — since about two weeks ago, it's on GitHub. The SEAM loader verifies the signature of the Intel TDX module when the system boots and loads it into a special memory region, which we call the SEAMRR, the SEAM range register. Only software in the SEAMRR itself is able to access other memory in the SEAMRR, in effect hindering everything but the TDX module from doing anything with it. All other software access and DMA access to this memory is completely blocked. The confidentiality and integrity of the SEAMRR is again protected with AES-XTS 128-bit. The Intel TDX module runs in what we call the Secure Arbitration Mode, or SEAM for short — to be more precise, in SEAM VMX root mode. With the introduction of Intel TDX, the ISA was extended by four instructions to enable the communication between the host, the hypervisor, and the hardware. These four instructions are: SEAMCALL, for interactions between the hypervisor and the TDX module — start the TD, stop the TD, things like that; SEAMRET, to return execution control back to the hypervisor; TDCALL, for calls from the TD to the TDX module; and SEAMOPS, for calls from the TDX module to the hardware. Certain security-critical ISA instructions are denied in SEAM to provide the protection guarantees we want. Now to TDX remote attestation. TDX remote attestation — you all know this, you have all heard of it with SGX or other technologies — uses quotes. Quotes are created by hardware, and the quotes are used to prove something. In this case, the TD can prove at least four attributes with this quote. First, the booted image is exactly as expected: during the loading of the image, it's measured — so it's hashed — and this hash is stored in what we call the MRTD. This is part of the quote. Second, measurements created or extended during runtime. 
Intel TDX has what we call runtime measurement registers, or RTMRs, and they can be extended at runtime. It's not done automatically — it's a "can". It's a subtle topic if you're interested in more, but that's what we have. Number three, the TD is executed on an Intel TDX-enabled platform. It's obvious that that's important: nobody should be able to just simulate that something is Intel TDX hardware. Number four, the Intel TDX platform is fully patched. As you know, in the past there were problems with the different technologies, including Intel SGX, but then we provide a patch — and we have the ability to prove at what patch level your platform is. Then, as it says here in the next line, whoever it is — the relying party — can look at the quote and then decide if it trusts the TD or not. Some might decide even an older patch level is fine, some say only the newest one is fine, some say the MRTD has to have a certain value, RTMRs have to be or don't have to be used — all of that's possible. A bit more about the process of remote attestation, which should look very, very familiar to people that have seen the SGX remote attestation. It all starts with a relying party triggering the trust domain: here, please prove to me the things I just mentioned. The TD will reach out to the TDX module, and the TDX module will reach out to the hardware. The hardware will then generate what we call a TD report, and this report contains the measurements I mentioned before, but it also has, for example, the security version number of the TDX module, the measurement of the TDX module, the measurements of the TD, and all other aspects that are in the trusted computing base — and it's signed by the hardware at this point. The TD report is then routed back to the TD, back to the hypervisor, and then to what we call the TD quoting enclave. And as the name enclave already suggests, it's an Intel SGX enclave, right? So we use Intel SGX for remote attestation of TDX. 
The TD quoting enclave checks if the report was signed on the same platform, and if that's the case, it will sign it with the attestation key. I will come to what this means in a second and why it matters. But now we have a TD quote that's signed by the attestation key, and this TD quote is passed back to the relying party, who can do quote verification. But the important question is: what just happened? The TD quote was signed with an attestation key — what does that mean, why should we trust that? A key piece I skipped before is that the TD quoting enclave has randomly generated the attestation key before the process even starts, without Intel being involved at all; this happens on the platform. But that still doesn't help much. What also happens at startup is that the so-called provisioning certification enclave, which is also provided by Intel, will do local attestation with the TD quoting enclave. It will see: yes, okay, we both run on the same machine, it's the TD quoting enclave that I expect, and it just provided me an attestation key. And then it will use the provisioning certification key to sign a certificate. So then we have, as you see on the right side, an attestation key certificate that's signed by the PCK. But again, why does this matter? The important piece now is that Intel is able to create PCK certificates that are then rooted in an Intel CA. And this closes the trust chain: the attestation key is generated on the hardware, but it links back to an Intel CA. And during quote verification, whoever does it, wherever this is done, can reach out to what we call the provisioning certification service to get all the collateral that's needed to check this chain. That's the process of remote attestation. And as said before, Intel TDX attestation uses Intel SGX. All the sets of collateral we had before — PCK certificate distribution, caching services — supported only Intel SGX in the past. Now they support both. 
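The chain just described — an on-platform attestation key, vouched for by the PCK, rooted in an Intel CA — can be modeled in a few lines. This is a deliberately toy sketch: symmetric HMACs stand in for the real ECDSA signatures and X.509 certificates, and every name here is invented; it only shows the structure a relying party walks during quote verification, from the quote signature back to the pre-trusted anchor.

```python
import hashlib
import hmac
import secrets

def sign(secret: bytes, msg: bytes) -> bytes:
    # Toy MAC-as-signature; real TDX quotes use asymmetric signatures.
    return hmac.new(secret, msg, hashlib.sha256).digest()

def verify(secret: bytes, msg: bytes, sig: bytes) -> bool:
    return hmac.compare_digest(sign(secret, msg), sig)

# Key material: only the CA root is pre-trusted by the relying party.
intel_ca = secrets.token_bytes(32)   # trust anchor (Intel CA)
pck = secrets.token_bytes(32)        # provisioning certification key
ak = secrets.token_bytes(32)         # attestation key, generated on-platform

# Each layer vouches for the one below it.
pck_cert = sign(intel_ca, pck)       # Intel CA vouches for the PCK
ak_cert = sign(pck, ak)              # PCK (via the PCE) vouches for the AK

td_report = b"MRTD || RTMRs || TDX module SVN"   # placeholder report body
quote_sig = sign(ak, td_report)      # quoting enclave signs the TD report

def verify_quote(report, sig, ak, ak_cert, pck, pck_cert, trust_anchor):
    """Relying party: walk the chain from the quote back to the anchor."""
    return (verify(trust_anchor, pck, pck_cert)
            and verify(pck, ak, ak_cert)
            and verify(ak, report, sig))

assert verify_quote(td_report, quote_sig, ak, ak_cert, pck, pck_cert, intel_ca)
```

Breaking any single link — a tampered report, a forged certificate, a different root — makes `verify_quote` return `False`, which is the whole point of rooting the on-platform attestation key in a CA the relying party already trusts.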
This also means that it's required to enable SGX on the machine when you want to use TDX. Just quickly, a few words about how you can do the verification. There are basically four options. You can use a service by the cloud service provider. You can use a service by the vendor of your application. You can potentially use the Intel Trust Authority, an independent software-as-a-service offering by Intel, to do the verification for you and alleviate the process. Or you can build your own with the open source Intel libraries we provide. A few points of differentiation between the services: if you want a separation of responsibility between the infrastructure provider and the verification party, then you should not use the cloud service provider, obviously; in all the other cases, it's fine. If you want to support both SGX and TDX, then it's up to the cloud service provider and the application provider whether they support both variants; the Intel options definitely support both. If you want consistency across your applications and across the environments you have — on-prem, hybrid, whatever — then obviously cloud service providers cannot be used, the application vendor potentially can, and the others will do it. From a development perspective, the effort is low in the first three cases and, I would say, medium in the last case. Now quickly, very quickly, two upcoming features of Intel TDX, so that we have at least a little bit of time for Q&A. First, TD migration. TD migration will allow live-migrating a TD from one platform to another. It uses a service TD called the migration TD to do that. All the data is obviously encrypted — I'm skipping a few details now, but everything is encrypted. Everything goes over step by step; after a short break, the TD on the left side goes down and the one on the right side comes up, which guarantees that a TD lives only once at a time. You should not have two different TDs with the same content. 
And one last feature: Intel TDX Connect. I mentioned that before — it's a bit problematic at the moment to connect trust domains to a device. It is possible, but what happens at the moment is that everything in the private memory of the trust domain is encrypted, and the TD can't directly write to the device. What it can do is put data in shared memory, and the device can take the data from that shared memory — what we call a bounce buffer. So this is a bit slow. Still, it can be done securely: if a secure session key is established between device and trust domain, the data can be encrypted, put there, and read by the device, and it stays encrypted in between. So even today, this solution is there. You can connect an Intel TDX trust domain to an NVIDIA GPU with their confidential computing technology and have it end-to-end secure. That's possible. But it's a bit slow — it has some overhead because of this bounce buffer. And this will change when Intel TDX Connect comes along. With Intel TDX Connect, the idea is that a trusted device is put in, let's say, the trust boundary of a trust domain. They are able to write into each other's memory directly after they trust each other, which will make the whole thing more efficient with lower overhead. Nothing I mentioned today is any secret: all of that is open, here on this page we have documentation. Knock yourself out — it's like thousands of pages you can read in the PDFs to get all the details you want. If not, feel free to reach out to me at any point after this talk or any point later. If you have interest in, for example, bare-metal access to a machine, I'm also your guy — for whatever experiments, at a university, as an organization, whatever — because at the cloud service providers you normally don't get that. You get a trust domain. That's it. 
Might be enough in many cases, but not in all. So reach out to me, and thank you for your attention. Can we repeat the questions? Yeah, so, yeah? I have to repeat the question. The question was — or I'll rephrase, correct me if I'm doing a bad job there. You said it's possible to run a legacy application in a trust domain. Yeah, that's what you said. The question is, how is the integrity of such an application maintained, considering the fact that this application is legacy and doesn't know about this environment? Okay, yeah. So the question was, again, in my words: how is the process then protected, given that the application wasn't written, right, for this environment? And the answer is, it depends, right? Meaning, if you have an in-memory-only application, then you don't have to do anything, right? Because the main memory is encrypted and you're done. As soon as your application writes to disk, it's a different story, right? Because if you write plain text data to disk, then it's plain text and everybody will see it. One thing you can do is have your application encrypt data beforehand — but then it is a change to the application, right? Another variant is you activate, for example, full disk encryption in your operating system. Then you have to manage the key, right? That's another question then, but that's what you can do. And exactly the same for network connections, right? If you, again, send plain text data out, yeah, plain text data is out. But if you use TLS, you can do it — you just put your TLS endpoint in the trust domain now and you're good. Yeah? Thank you for a very nice talk. So I had a question about the state of software support. Thank you very much. So I had a question related to the sort of status of the software support on the guest side, right?
So with some of these comparable technologies today, you still need some components in the middle on the guest side — basically firmware inside the guest, or paravisor functionality that hides some of this communication with the underlying layer. So how is it with TDX today? Can you take a stock Linux kernel and run this? Or do you still need some components there which are not yet fully open source? So at the moment, as I said briefly before, not everything is upstreamed, right? The basic enabling should be there, I guess, by the middle of the year. So at the moment, it's not fully there. But what we have is what we call a TDX early preview. We collaborate with three operating system distribution vendors to provide specific distribution versions. And that's Canonical, Red Hat and SUSE. And all of this is online. You just go to GitHub — and I just did it yesterday night, right? It's really like: you start up an Ubuntu 23.10, for example, you clone their repository, click install, done. You go into the BIOS and activate TDX. Then they have another script to create a guest image. It doesn't take more than like 15 minutes to create, most of that just download time and all of that stuff. You start your trust domain and you're done. So that's pretty easy already. Yeah, thank you for the talk. I have kind of an obvious question. Is there a latency cost within one trusted domain for memory access, given that it's encrypted and so on? So performance, you mean, right? Okay. Yes, obviously there has to be, right? Encryption can't be for free. But how high the overhead is depends highly on your workload. If it's a processor-only workload, it's basically free. I don't have concrete numbers, but let's say one, two percent, right? So really, really low. If it's really disk-I/O sensitive, it's a different question, right? Because of this bounce buffer and all of that stuff. Again, don't nail me on it, but let's say it might go to 10% or even more, right?
It's really, really dependent on your workload. I guess I have to stop now, but you can just come to me later, right?
SEV-Step: A Single-Stepping Framework for AMD-SEV
So, the next speaker is Luca Wilke from the University of Lübeck, and he will talk about some recent work he has been doing — attack research, actually. I'm very excited that the dev room from the start has had a consistent attack research line as well, which I think is very important for this new type of technology. So Luca, enlighten us. Yeah, thank you very much for the kind introduction. I will be talking about SEV-Step, which is a single-stepping framework for AMD SEV, and it's open source and available on GitHub, so feel free to check it out. And this was created as part of an academic paper, which is joint work with these great people down here. Okay, just a quick recap of where we are in the trusted execution environment landscape. So as the name suggests, SEV-Step is about AMD SEV. So we are in this confidential VM area here. However, single stepping is something that basically affects all TEEs that are out there right now, so keep that in mind. Okay, with that out of the way, we can jump right in and explore what single-stepping attacks actually are. So we start with a quite high-level picture. What we want to do here is take some kind of snapshot or observation of our protected application, and we use this for our attack. Now, if our TEE runs normally, then it runs basically at full speed, and if we take these snapshots, we don't have any synchronization with this TEE process, and thus the observation and the data that we get is very blurry. But now if we start to interrupt the enclave at certain points, then we have these synchronous points in time where we can start to take our snapshots. So it's not running in parallel anymore, but the enclave is paused when we take our snapshots. And thus we already get a little bit more information. And now if we take this to the maximum resolution and we are able to interrupt the enclave after every single instruction reliably, then we get a pretty clear picture of what's going on.
So I hope that already gave you a good intuition. And now we go into what single-stepping attacks have actually been used for, mostly in academia. And these are all examples that have been done with SGX; that's what really made this popular in academia, because it made single stepping very accessible. So the first attack avenue here is something called interrupt latency, and there you basically measure how long it takes from when you started this attack to when you get the callback that the enclave has now been interrupted or exited. And it has been shown that this timing actually reveals something about the kind of instruction that's running in the enclave. And for some instructions, like divide instructions, you can even learn something about the operands. So dividing by certain numbers takes longer than dividing by other numbers. And thus you can really fingerprint the instruction and maybe even the operand with these attacks. Then the second major attack avenue here is called interrupt counting or instruction counting. And here the idea is that certain algorithms and applications have secret-dependent control flow — especially true for cryptographic algorithms. We have some secret key, and then I do some large integer multiplication or division, and the code that does that executes a different number of instructions depending on the secret data. And now when I do these single-stepping attacks, I can simply count the number of steps that I take. And then if I know which code page I'm currently on, I can learn something about the secret data just by observing the number of instructions. So in this tiny example here with a conditional jump: in one case we skip over this move here, and in the other we don't. So here we get two instructions executed, here three. And by knowing the code that's currently running, we can infer the value of the secret bit here. Then the third really popular attack avenue is not directly single stepping, but closely related.
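The instruction-counting idea above can be made concrete with a tiny simulation. This is a sketch, not code from the paper: the victim's instruction trace and the step counts are invented to mirror the conditional-jump example from the slide.

```python
def run_victim(secret_bit: int):
    """Simulated victim: a conditional jump skips a MOV when the bit is 0."""
    trace = ["cmp", "jz"]          # executed either way
    if secret_bit:
        trace.append("mov")        # only executed when the bit is 1
    trace.append("ret")
    return trace

def count_steps(trace) -> int:
    # The attacker cannot read the instructions themselves — only count how
    # many single-step interrupts fired while this code page was executing.
    return len(trace)

def recover_bit(steps: int) -> int:
    # Knowing the code layout: 4 steps means the MOV ran, 3 means it was skipped.
    return 1 if steps == 4 else 0

for bit in (0, 1):
    assert recover_bit(count_steps(run_victim(bit))) == bit
```

The attack never sees the secret directly; the leak is purely the number of instruction boundaries observed on a known code page, which is why secret-dependent control flow in crypto code is so dangerous under single stepping.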
It's called zero stepping. And here the idea is that we interrupt the enclave even more frequently — so before it is able to actually execute a single instruction. So it doesn't make any progress on an architectural level, but on a micro-architectural level, the first instruction already starts to execute, then gets aborted and rolled back. So in the micro-architectural state, there's actually already stuff going on. And these attacks are able to measure this. And what we can do then is basically take an infinite number of measurements while only running the enclave once. And this allows you to measure really, really tiny effects. And then the third column here is kind of the miscellaneous catch-all column. So as you can imagine, just by increasing this temporal resolution, you can improve basically any side-channel attack. So it has been used in many of these MDS attacks here, for example. Okay, so now that we know what single stepping basically is and why it's really dangerous, we come to the main question of the talk here. Can a SEV VM be single-stepped? And if so, how? So let's take a look at the basic setup here. So this is a very boiled-down version of the control loop that's going on in the hypervisor, where we enter the VM here, then we execute some instructions, and then at some point we exit. So for single stepping, the obvious question is: when does the VM exit here — this is what we want to control in our attack. And there are multiple reasons why this can happen. So we can configure certain instructions to be intercepted. And we can also use page faults, by removing access rights in these nested page tables. However, neither of these two methods gives us the amount of control that we want, because they are not instruction-granular. However, we can also use external interrupts to force an exit from our VM. And this is actually what will allow us to achieve this instruction granularity.
And for this, the attacker uses something that's called the APIC timer. It's a common timer on x86, used by the operating system. And by injecting this timer interrupt, we will force exits from the VM. So let's zoom in a little bit. This is a typical attack sequence here. In red, we have the code that runs in the hypervisor; it's controlled by the attacker. And on the right here, in blue, that's the three instructions from the VM that you just saw. So what do we need to do now to achieve single stepping? Well, intuitively, you would think that you would need to hit this tiny window between these two instructions here to single-step. However, luckily on x86, it's already sufficient if our interrupt hits somewhere during the execution of this instruction. Because then it will be held pending and will basically be recognized at the instruction boundary. Okay, but if we just naively implement this and try to do this, then we are not quite there yet. And we will see that sometimes we will overshoot here, and then we will execute two or more instructions. And this, of course, decreases our resolution, because now we cannot guarantee that we do something after every instruction. Maybe we have bad luck and skip over very important memory access instructions, and so on. So this is really bad, this multi-stepping. And on the other side, we might undershoot a little bit and zero-step. And this is not really dangerous, because then we simply repeat — we don't miss out on any instructions. We just try again, and it's a little bit less efficient. So why is this the case? There have been some really nice papers on SGX, and they show that this APIC timer has quite some jitter. So it's not cycle-accurate. So it kind of makes sense that we see this behavior here. So what do we do about this? And the kind of obvious idea is: okay, we need to make this window larger, because our timer doesn't have a high enough resolution. So we need to enlarge the window in which our timer can hit.
And for this, we look at what's actually going on when we execute an instruction here. So first we have to fetch the instruction from memory, from the code page, and then the CPU can decode it, issue it to the pipeline and eventually retire it. So for the attack, the idea here is now that we make sure that this fetch takes a long time, and we achieve this by simply flushing the page translation from the TLB. So we flush the VM's TLB, and when we enter it again, we need to do a page table walk, which will take some time, and this effectively prolongs this window here that is required to execute the first instruction. And now, although our timer still has this jitter, this window is large enough so that we can actually rely on the single step. And SEV-Step, at the time of publishing, was the first framework that did this; shortly afterwards there were also some papers that did something similar. And it's open source, so we hope that other people will reuse it. Okay, so now let's take a little closer look at the SEV-Step framework itself. So besides reliably single stepping, we wanted to achieve two other goals. And this is reusability and interactivity for the attacks. And I will go over these two goals now in more detail. So for reusability, let's again look at our setup here. Since we want to program this APIC timer, we want to manipulate these page tables and maybe do some cache priming and probing — all of these things would benefit from being really close to entering and leaving the VM, because this is the point where we have the lowest noise. However, this also means that we need to manipulate or change the kernel code, and developing kernel code is quite hard. It's hard to debug. You're limited to C. You don't have any external libraries. So it's not the nicest programming environment. And also it makes reusing this for different attacks or for different papers quite hard, because this environment is not so nice.
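The effect of enlarging the first-instruction window can be illustrated with a small Monte Carlo sketch. All the numbers here are made up for illustration; the point is only the qualitative relationship between window size, timer jitter, and single-step reliability.

```python
import random

def step_outcome(window_cycles, target, jitter, rng):
    """One single-stepping attempt: the interrupt should arrive while the first
    instruction executes (0 .. window_cycles), but the APIC timer adds jitter.
    Arriving too early = zero step, too late = multi step."""
    arrival = target + rng.randint(-jitter, jitter)
    if arrival < 0:
        return "zero-step"        # undershoot: harmless, just retry
    if arrival < window_cycles:
        return "single-step"      # interrupt held pending until the boundary
    return "multi-step"           # overshoot: we missed instructions

def success_rate(window_cycles, jitter, tries=10_000, seed=0):
    rng = random.Random(seed)
    target = window_cycles // 2   # aim at the middle of the window
    hits = sum(step_outcome(window_cycles, target, jitter, rng) == "single-step"
               for _ in range(tries))
    return hits / tries

# With a short window (cached fetch), the jitter dominates and we often
# multi-step. Flushing the TLB forces a page-table walk, enlarging the window
# so the same jitter almost always lands inside it.
assert success_rate(window_cycles=30, jitter=100) < success_rate(window_cycles=500, jitter=100)
```

This is why the TLB flush matters: it doesn't make the timer more accurate, it makes the target window so large that the timer's inaccuracy stops mattering.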
And your attack logic is basically mixed together with these attack primitives. So instead, what we want to do here is only implement these bare primitives inside the kernel — like programming the timer, manipulating these page tables, and cache priming and probing. And all of the other stuff is then moved out to user space. And we use an IOCTL API to trigger this behavior from user space. So then here we have this much nicer programming environment, and other people can simply link against this library and write their attack code with it. And one tiny note is that this execution loop of the VM is asynchronous from our IOCTL API. So changes only take effect the next time the VM exits. So we have some shared state variables here for communication, but this is something you kind of need to keep in mind when you program these attacks. Okay, so we achieved this goal of reusability. Let's move on to the second goal, interactivity. And to understand this a little bit better, I will go into more detail on how I envision this programming environment here in the user-space library. And there we basically want to have some kind of event loop. Initially we set up some configuration — like, I want to get a page fault once this page is accessed. And then we basically want to wait until this event happens. And when this event happens, we want to react to it. We usually have in these attacks some kind of page fault sequence that will tell us when the VM is about to execute a certain function. And then maybe at this point we want to enable single stepping and do some steps, do a cache attack, this kind of stuff. So this is basically the process-event and update-config part here. And the really important thing is that once we got this event, we also want the VM here to basically wait for us to process this event. Because if we would allow it to resume, then we would again lose this precise control we wanted to have, to manipulate the environment after every instruction.
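The event loop described above can be sketched like this. All class and method names are invented stand-ins for the SEV-Step user-space library, not its actual API; the `Framework` class fakes the kernel side so the loop is runnable.

```python
class Framework:
    """Stand-in for the SEV-Step user-space library (names are invented)."""
    def __init__(self, events):
        self._events = iter(events)
        self.single_stepping = False
        self.steps = 0
    def track_page(self, gpa):        # would issue an IOCTL to the kernel module
        pass
    def wait_for_event(self):         # would block on the shared-memory page
        return next(self._events, None)
    def ack(self):                    # acknowledgment that lets the VM resume
        pass

def attack(fw, target_page):
    fw.track_page(target_page)                      # initial configuration
    while (ev := fw.wait_for_event()) is not None:  # wait for the next event
        if ev == ("page-fault", target_page):
            fw.single_stepping = True               # victim reached the target code
        elif ev == ("step",) and fw.single_stepping:
            fw.steps += 1                           # e.g. do a cache attack here
        fw.ack()                                    # only now may the VM continue
    return fw.steps

fw = Framework([("page-fault", 0x1000), ("step",), ("step",), ("step",)])
assert attack(fw, 0x1000) == 3
```

The key property is that `ack()` is called only after the event is fully processed — the VM stays paused in between, which is exactly the interactivity requirement from the talk.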
So we now also need a way to basically communicate from the kernel side to the user-space library, to be able to send these events and wait for these acknowledgments. And for this we opted for a shared memory protocol. So the library and the kernel code here simply agree on a shared memory page and then use a simple protocol with some spin locks to basically implement this. While this is maybe not the most elegant, it is very low latency, because it's just memory communication — you don't have any user space/kernel space context switch as with the IOCTLs here — and it's also reasonably easy to implement. Okay, and this is how we achieve this interactivity goal. This is basically the current state of the framework. But to close up, I also want to give an overview of ongoing and future work. So one thing I've been working on a little bit already, and would really like to continue on, is to improve this API, this programming environment. Because right now you basically have these start-tracking, stop-tracking commands, and when you start to write your attack code, as I've experienced myself, this can get quite messy and quite long really quickly. So it would be cool to have some higher-level abstractions for this. For example, a component that could track a certain page fault sequence for you and restart the tracking if you get some unexpected access, and so on. And then some kind of mechanism or protocol to chain together these components so that you can structure your attack better, and also make it easier for people to get started by reusing these building blocks. And thinking about this even more, this is totally independent of the actual TEE underneath. So this is maybe something where the existing SGX-Step community could come together and build these libraries at a higher level, and then SGX-Step and SEV-Step — and I think the TrustZone one is called Load Step — could basically be integrated as drivers underneath, so that everyone could profit from this. Okay.
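The shared-memory handshake described above can be simulated with two threads and a shared dictionary standing in for the shared page. This is a minimal sketch of the idea, not the framework's actual protocol: the "kernel" publishes an event and spins until user space acknowledges it, keeping the VM paused in between.

```python
import threading
import time

shm = {"event": None, "ack": False}  # stands in for the shared memory page

def kernel_side(events, out):
    for ev in events:
        shm["ack"] = False
        shm["event"] = ev            # publish the event; the VM stays paused here
        while not shm["ack"]:        # spin until user space has processed it
            time.sleep(0)
        out.append(f"resumed after {ev}")

def user_side(n, handled):
    for _ in range(n):
        while shm["event"] is None:  # spin-wait for the next event
            time.sleep(0)
        handled.append(shm["event"]) # process the event (attack logic goes here)
        shm["event"] = None
        shm["ack"] = True            # acknowledge: kernel may resume the VM

out, handled = [], []
t = threading.Thread(target=kernel_side, args=(["step-1", "step-2"], out))
u = threading.Thread(target=user_side, args=(2, handled))
t.start(); u.start(); t.join(); u.join()
assert handled == ["step-1", "step-2"]
```

Spinning burns CPU, but as the talk notes, it avoids the user/kernel context switch of an IOCTL round trip, which is what keeps the per-event latency low.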
And this is more or less it. You can again find the links for SEV-Step and also for SGX-Step, which I mentioned here. They are both open source and on GitHub. Feel free to check them out. Send me a pull request if you want to change something, create an issue if something's broken. And yeah, thank you so much. And I'm happy to answer questions now. Yeah. Yeah, thank you for the very interesting talk. A new side-channel attack for me. And now that you've shown how to break things — do you have some ideas how this kind of attack could possibly be mitigated? Yeah, so it's a really good question. So for SGX, there recently has been a paper which is called AEX-Notify. And basically the idea is to make the SGX enclave interrupt-aware and then execute some special handler that will prefetch this first instruction that I showed, so that you can't do this flush-the-TLB-and-make-everything-really-slow approach, but ensure that the first instruction always executes really fast, and this then mitigates this attack. And for TDX, which we just talked about, there's also some mitigation built into the TDX module. And for SEV, we are currently looking into ideas for how we could protect SEV VMs against this. Thank you. Thank you, Luca. Yes, we're back. So can you elaborate a bit on how much of this is SEV-specific and how much of it is actually, let's say, KVM-step? Let's say, if you don't have a mitigation as in TDX, can you just launch this as-is on any kind of VM, or is this specific to SEV in any way? Thank you. So I don't think it's really specific to SEV, because this ability to flush the TLB should also be available with VMX, with the hardware acceleration from Intel. I think the basic primitive should apply. I also know that there has been an internal prototype called TDX-Step that's on one of the Intel pages. So they basically built something similar for this.
So I guess in principle, this should apply to all VM-based systems where the VM can be forced to exit by external interrupts. There's one more question. Can you repeat it? — whether you also have plans for TDX. It's definitely really interesting. The question was whether we also have plans for TDX, and as I've said, TDX has a built-in countermeasure, but I guess it would of course be interesting to try to figure out how exactly that works, and whether you can do something there.
The ups and downs of running enclaves in production
All right guys, so back to the matter of the day. The next speaker is Kian, who works at Evervault, and I think it's quite exciting to have a bit of a complementary perspective in, let's say, this exciting new field, where we talk a lot about new technologies — but you will actually talk about how to use them in production. So take it away. Thanks. So I work for Evervault, and I will talk about Evervault to begin with, just so you know why we use enclaves in production and not just traditional computing. So — I don't know how loud that is; if I'm too quiet, tell me so I can speak louder. So we offer tooling to allow customers to guarantee data security in different forms, like encryption before data ever hits your system, or ways to process said encrypted data in secure environments, and so on and so forth. At the core of all of this are enclaves. We're running on AWS, so we're using Nitro Enclaves, which as far as I can tell aren't as open source as Intel SGX or any of that stuff. But we've been doing this for a couple of years now, and when we started, that was the best we could find for doing VMs that guaranteed the security model that we required. So, like I said, encryption: we're running in fully isolated VMs where we can see basically nothing of what's happening inside the VM without a lot of effort on our part, which is mainly so we can protect our users' data. So just to give the context: Relay is our main product, is what I would say. It's an encryption proxy — you put it in front of your service and you define some rules, and before the requests ever hit your service, the rules are applied and all your data is encrypted. Sorry, I lost my mouse. So yeah, it's very much focused on web services, but it's mainly for people who want to de-scope their environment so they can be more easily PCI compliant, or protect HIPAA data, and stuff like that.
Relay doesn't run in an enclave, mainly due to performance reasons, because it's processing lots of network requests and we want it to be quick — encryption is slow, and we don't want to add overhead for our users. So we store all of our keys inside a KMS that is accessed from a secure enclave. With that service, we have no access to the keys. On startup it establishes connections to the KMS, pulls down user keys, decrypts them, and then we are able to process the users' requests; outside of that environment we can't decrypt anything. This started, though, when more users joined us and we started to scale. At first we just had a lot of automation. That was stuff like: how do you run Docker containers in enclaves, and how do you make sure that you can scale up or scale down? AWS Nitro Enclaves are guest VMs on top of EC2 nodes. There's not much automation around actually running what's in there, so we had to build it all ourselves and get all that running and actually serving requests for our users. So after we got all that running, we had issues with the libraries in general. The parts of AWS Nitro Enclaves that are open source are all the interface libraries for connecting to them, but we found that there are many, many edge cases that were just very poorly documented, or not documented at all — how do you interact with it, how do you work with the proxies. So for reference, for those that haven't used it: you need to run a vsock. There is a vsock on your host for communicating with the secure enclave, and this is the only I/O you have in and out of the VM. You then need to manage all the connections yourself, and how you transfer data in and out and communicate. We ran into some really fun problems, though, trying to use this and talking to the AWS guys about using their library. The funnest one, I think, was that we had file descriptor leakage.
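Because the vsock is a raw byte stream and the only I/O channel, you end up framing messages yourself. A minimal length-prefixed framing might look like the sketch below — a common pattern, not Evervault's actual protocol. A `socketpair` stands in for the vsock so the sketch runs anywhere; the real host side would use `socket.AF_VSOCK` with the enclave's CID.

```python
import socket
import struct

def send_msg(sock, payload: bytes):
    # 4-byte big-endian length prefix, then the payload.
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def recv_msg(sock) -> bytes:
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# socketpair stands in for the host<->enclave vsock here.
host, enclave = socket.socketpair()
send_msg(host, b'{"action": "encrypt", "fields": ["ssn"]}')
assert recv_msg(enclave) == b'{"action": "encrypt", "fields": ["ssn"]}'
host.close(); enclave.close()
```

The `recv_exact` loop is the part that's easy to get wrong: `recv(n)` may return fewer than `n` bytes, and assuming otherwise is exactly the kind of "it worked in Rust" library mismatch the talk describes.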
Our guest VMs were dying because we just couldn't connect to them anymore — we ran out of file descriptors on them, which I had not seen in a long time, aside from just breaking my own machine, which was fun. Turned out we had just made some assumptions about how stuff worked, because we thought, oh, this is how it works in Rust — and no, that wasn't how it worked in the library, and we were just not reading the code. But we needed to read the code, which was unfortunate, because I would have liked it to be in the docs. But yeah, it really showed that there were no metrics or observability for these enclaves. We weren't able to know what's happening inside them or how to interact with them. So we started trying to monitor them. This was interesting. Like I said: no metrics, no nothing. I realize you probably can't see a lot of those graphs, but these were our load tests. We started to try to get metrics out of them, because there's limited I/O — we didn't want to just put a metrics collector inside them and shoot all the metrics out to Datadog or AWS. We started instrumenting the clients that we were talking to it with, and we started sending load data and trying out different workloads. So, a lot of black-box testing. This was several weeks of just staring at graphs — I may have gone a little insane during it, but we're here now, and it worked. So once we got through it all, we were able to find different bottlenecks in the code, based on guesses and automation changes, and we were able to go from — I don't know if you can see that, but — about 1,500 encryptions per second inside the enclave to about 5,000 encryptions per second, just by switching our default curve, which we hadn't ever considered, because we let our users set the curve. But it made massive improvements for us.
But we had no idea that the encryptions themselves were the bottleneck, because we couldn't see what was happening inside our enclaves, inside the VMs, and know where our workloads were slowing down. So once we started doing the observability, we really went in on it. So we did this black-box testing and we found the limit pretty quickly. We had to guess where the bottlenecks were, and there was a whiteboard in the office of, like, here are the ideas we have to try in different configurations. We just worked our way through, ticking each box off and turning things on and off, until we were able to actually get some improvements from it. We then started working on a level of — so, AWS does have a concept of debug logs, but the moment you turn it on, your enclave isn't actually attestable anymore. The attestation measurements all just turn to zero and you're not able to attest your connection. And like I mentioned before, we need to be able to attest the connection to the KMS to even load keys into the enclave, so we couldn't run in debug mode at all. We had to figure it out. So we had to basically reimplement a level of tracing — if anyone is familiar with OpenTelemetry and stuff, we had to come up with a way of doing trace requests inside of it. We couldn't use OpenTelemetry, because it had no understanding of how to communicate outside of the VMs. We had to take the concepts, reimplement them, and come up with a way of batching requests, sending them out, and limiting the amount of I/O overhead we added while doing that. We eventually got there and we were able to monitor our boxes. That's when we started to notice more problems. So we basically had these two processes in the enclave talking to each other, and we expected the green line there. The yellow line would be perfect — that was our local dev environment. But the green line is what we wanted to see in production. The blue line is what we were seeing in production. I've lost —
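The batching idea — collect spans in memory and flush them over the single I/O channel in bulk to bound the per-request overhead — can be sketched like this. Class names, thresholds, and the span format are all invented for illustration; `flush_fn` stands in for whatever writes one message over the vsock.

```python
import json
import time

class SpanBatcher:
    """Toy trace batcher: one outbound write per batch, not per span."""
    def __init__(self, flush_fn, max_batch=100, max_age_s=5.0):
        self.flush_fn = flush_fn      # e.g. sends one framed message over the vsock
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self._spans, self._oldest = [], None

    def record(self, name, duration_ms):
        if self._oldest is None:
            self._oldest = time.monotonic()
        self._spans.append({"name": name, "ms": duration_ms})
        # Flush when the batch is full or the oldest span is getting stale.
        if (len(self._spans) >= self.max_batch
                or time.monotonic() - self._oldest >= self.max_age_s):
            self.flush()

    def flush(self):
        if self._spans:
            self.flush_fn(json.dumps(self._spans).encode())  # one I/O per batch
            self._spans, self._oldest = [], None

sent = []
b = SpanBatcher(sent.append, max_batch=3)
for _ in range(7):
    b.record("encrypt_field", 0.2)
b.flush()                  # drain the remainder
assert len(sent) == 3      # 3 + 3 + 1 spans went out in only three writes
```

The age-based flush matters as much as the size cap: without it, a quiet enclave would sit on its last few spans indefinitely.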
I wasn't allowed to put the numbers on the lines to be specific here, but that was about a 20x slowdown, I think, which was insane. We're still debugging this one. We're not 100% sure where the bottlenecks are. We're fairly certain the virtualization of the network layer inside the containers is just insanely slow. So what we're looking at is how we can short-circuit that. There are some things like sockmap — you can re-route sockets. But effectively, you can't just take a process, or take two processes, throw them into the VM and think that will work. It works on my machine; it does not just magically work there. You need to really tune the system to actually be able to talk effectively. We're still tuning it. We're hoping to have some stuff to share soon about ways to speed it up with sockmap and different improvements. Like I said, it's seemingly either the VM or the user-space networking. The fun one — which I think a lot of people who have worked with enclaves go, duh, of course you had time slippage — there's no NTP in an enclave. You can mount the PTP clock of the hypervisor, but again, that invalidates our security model for PCI. So we had to actually synchronize with NTP, which meant we needed to add another layer of periodic work that needs to be done by the guest box to ensure that the VM could actually know what the hell time it was. We noticed that we were losing a second a day, which is quite a lot, and that was based on traffic volume as well — more traffic, more time we lost. But if we did nothing, it was still one second a day. That really bit us when we had to do anything that was time-sensitive, such as token validation. So auth effectively broke if a VM was running for more than three days, which led us to a cron job that just cycled VMs every three days for a little while, until we re-implemented NTP through the vsock. Fun. These are a lot of, like, yeah.
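To put the "one second a day" in perspective: given two clock-offset samples you can estimate the drift rate and work out how often the enclave needs to resync to stay inside an error budget. The 50 ms budget below is an invented illustration, not a figure from the talk.

```python
def drift_rate(offset_a, offset_b, elapsed_s):
    """Seconds of drift per second of wall time, from two NTP offset samples."""
    return (offset_b - offset_a) / elapsed_s

def resync_interval(rate, budget_s):
    """How often to resync so the clock error never exceeds budget_s."""
    return budget_s / abs(rate)

# ~1 second lost per day, as described in the talk:
rate = drift_rate(0.0, -1.0, 86_400)
assert abs(rate) == 1 / 86_400

# To keep, say, auth-token validation within a 50 ms error budget,
# you'd need to resync roughly every 72 minutes:
assert round(resync_interval(rate, 0.05)) == 4320  # seconds
```

A drift this steady also explains why auth broke after about three days: three seconds of skew is past the clock-tolerance window of many token schemes, hence the stopgap cron job that recycled VMs.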
So we kept running into issues, and we kind of said, why is this so painful? It should be easy to just deploy a service into an enclave and give other people the ability to say, yeah, that person who hosts my cloud computing definitely can't see the data being processed, and I can guarantee it. Really useful for health data or financial data, which are our main customers. So we put it all together and have a product called Enclaves, if you want an easy way to do hosted enclaves. So — no, we don't give you anything, actually. We give you a CLI, and you build your Docker container, with a Dockerfile, into a secure enclave. You are given PCRs, so it's fully attested. You give us your secure enclave and we run it for you. We push our data plane into the enclave, and it talks to the control plane that we run, so you can leverage it; all of that is open source, so you can reproduce the build yourself, validate that all the attestations are the same, and ensure that everything is communicating effectively and that there's no — well, that me or my team aren't, like, messing with your code or changing it or anything like that. So it's just regular Docker containers. The connection is fully attestable and you can connect to it. I see 10 minutes, and I probably don't need that long. So, but yeah, we're working on this. We're taking everything we learned from building our own services and putting it into our Evervault Enclaves, and it's on our GitHub. If you want to have a look and go through it — we want people to be able to look at it, see that we're not doing anything wrong, try it out, and hopefully have a better experience getting onboarded with confidential computing than we had. Because it was a lot of throwing stuff at the wall, seeing what broke, where it broke, and trying to figure it out. I'm going to go for questions then. You said you had problems with curves — presumably you're using ECC.
Do you have any idea why the curves might have been a problem? Are you hitting page boundaries, packet boundaries, or any ideas? Yeah, so what we were seeing was that it was in the CPU. There were optimizations that we hadn't accounted for. So by default, the boxes we were developing on — ARM Macs — were highly optimized for the curve we were using by default, which led us to say, great, look at the performance here on our local machines; then we deployed to production and performance crashed. It turns out the AWS boxes we were running on were optimized for the secp256k1 curve, or the r1 curve — I can't remember which one it is now — but basically the other curve. And even in the enclave, those optimizations still hold true. So we were able to get, I think, a 20x performance gain from that. Anyone else? Can you elaborate a bit on the nature of the payload, or whatever you're executing there? Because, I mean, we saw there pretty much encryption transactions — but what was exactly running there? So, what do we run in the enclave? So the benchmark was basically fuzzing, was what we were doing. As I mentioned, in the enclaves we have all our customer keys, so we had one of our keys in there, and we would have 20,000 fields to encrypt. And we'd say: for each of these fields, we're going to iterate through this dictionary and encrypt it. So we'd send just a generic JSON blob — for the purposes of encryption, each field could just be a Boolean or a string or whatever — and just send it in. And we then would iterate through that JSON blob, and it would say, I am this user or application, which would then cause the service to choose the right key, and these are the fields inside the JSON blob to find and encrypt. So it was a JSON blob, an ID, and fields to encrypt. A very simple payload, but it was just iterative work.
And because of how the encryption is implemented, it's all blocking work, so we'd have to farm out the work differently. This is not directly related to enclaves, but when we did the load testing, we determined that we were blocking and dropping connections in the service. What was happening was: we'd schedule the work on the enclave, and then the connection from the upstream service would die. We wouldn't propagate that connection dying downstream, so the enclave would do the work, try to send the encrypted result back, and then go, oh, no one wants this work, and stall. So we had to put some keep-alives on the connections. But these are again the things we missed, because we were having to reimplement what would just be generic TCP or HTTP for talking over the vsock into the enclave. So, you mentioned that the architecture you're using made you adapt your cryptographic parameters. How would that scale into the future? I mean, crypto agility — any words on that? I don't know. I'm the SRE who's meant to make this scale, but that's actually outside of my domain. We have people in the company who understand cryptography a lot better than me who would be able to answer that question. I can give you an email address if you want to talk about it, but I can't speak myself on that. Thanks a lot for the great talk. I wanted to go back a little bit to the use case you presented in the beginning. I might have missed something, but it sort of sounds to me like the use case here was not really protection at runtime, but kind of long-term protection of the keys — not while they are used by the proxy, but where they are stored. So did you consider other solutions for this, like HSMs, and do you have any insight into why you ended up choosing Nitro Enclaves for this particular use case? I'll be honest, that predates me at the company, so I'm not sure why it is.
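The keep-alive fix mentioned above — detecting a dead upstream peer instead of letting the enclave finish work nobody wants — looks, on the plain-TCP side, roughly like this. A sketch only: the speaker's actual fix lived in their vsock proxy, and the tuning values here are arbitrary:

```python
import socket

def enable_keepalive(sock: socket.socket, idle: int = 30,
                     interval: int = 10, count: int = 3) -> None:
    """Turn on TCP keep-alive so a dead peer is detected and the
    disconnect can be propagated to the enclave, instead of the
    proxy silently holding the connection open."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs; guarded because they don't exist everywhere.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
s.close()
```

Over vsock (`AF_VSOCK`) there is no such built-in mechanism, which is why the team had to implement application-level keep-alives themselves.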
I would say that we did a level of evaluation that was probably not too deep. We were a startup finding our feet at the time, and we had implemented a level of encryption just inside a process. Then, when we attempted to secure it and build on it, enclaves seemed like an easy solution. I think we've since proven they were not an easy solution. What we validated were just ways to do encryption that would guarantee we didn't have access to users' keys and couldn't decrypt any of their data. And yeah, enclaves seemed easy; in reality, not so easy. There's one online question: can you explain the attested TLS protocol that you use? Is the protocol specified somewhere, and has it been formally verified? So, we actually had to reimplement it. I can't remember which one we based it on, but we looked at one done by the Confidential Computing Consortium — or rather the paper that was published on it — and attestation inside the TLS connection was our original implementation. I can't remember the specifics, so I'll have to refer you to our git history on this. We deployed it, but to run it in production, people had to add our root CA to their root CA store, because you couldn't extend TLS for customers in the way that was specified in the RFC. So we eventually had to switch to a new attestation scheme, which unfortunately I'm not the expert on. But it is available; it's written in Rust, and it's linked on the talk page, under "attestation bindings". So anyone can look at the protocol we use for attestation. Effectively, we leverage the PCRs that are provided by the underlying Nitro Enclave, and then we have an attestation protocol that we use to attest the TLS connection.
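The PCR check at the heart of that handshake boils down to comparing expected digests against the ones in the enclave's attestation document. A sketch under stated assumptions — the dictionary layout here is illustrative, not Evervault's actual bindings API, and a real verifier must first check the document's signature chain back to the AWS Nitro root:

```python
import hashlib

def verify_pcrs(attestation_doc: dict, expected: dict) -> bool:
    """Compare the PCRs reported in an (already signature-verified)
    attestation document against values from a reproducible build.
    Any mismatch means the enclave is not running the expected image."""
    actual = attestation_doc["pcrs"]
    return all(actual.get(idx) == digest for idx, digest in expected.items())

# Illustrative values: real Nitro PCRs are SHA-384 digests of the
# enclave image file, kernel, and application, produced at build time.
expected = {0: hashlib.sha384(b"enclave-image").hexdigest()}
doc_ok = {"pcrs": {0: expected[0], 8: "unchecked"}}
doc_bad = {"pcrs": {0: "0" * 96}}
print(verify_pcrs(doc_ok, expected))   # True
print(verify_pcrs(doc_bad, expected))  # False
```

Because the build is reproducible, anyone can recompute the expected PCRs themselves before trusting the connection.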
The underlying Nitro Enclave, and then we have an attestation protocol: on connection, we do a TLS handshake that then performs the attestation, and the client must supply the attestation bindings. We have client-side implementations in Go, Rust, Node, Ruby, Python — actually not Ruby, just Python and Node and Go. Oh, and Swift and Kotlin. I will ask it like this, because of the interference with the microphone, and then you can — yeah, sure. So there was also a bit of discussion in the chat here about Nitro Enclaves and how far you can go in calling them TEEs. I know this is an endless debate, and we even had an extensive debate last year. Can you briefly react to that, and maybe also say a bit about the infrastructure you built — how tied it is to Nitro? Yeah, so — sorry, repeating the question: it's the debate about Nitro Enclaves versus other TEEs. As I said, they're not as open source, because it's mainly the client side that's open source rather than the server side, and it's mainly just white papers, I believe, or just documentation, that specify how Nitro Enclaves operate. And the other part of the question was how specific the tooling we built is to Nitro. So, we did evaluate other cloud providers to see if we could move to them. This was done a year and a half ago. We looked at Azure for doing it. Azure didn't have the new Intel SGX — sorry, TDX — at the time, so we concluded it couldn't fit our model of secure computing. We probably need to reevaluate now, but the tooling is very AWS-focused right now, and Nitro Enclave-focused, because it was about trying to make Nitro Enclaves easier for us to use. Conceptually, though, the control plane and data plane aren't specific to that.
So far, they could be reimplemented for anything that wants to do TCP over a network connection from inside the enclave to outside the enclave.
Securing Embedded Systems with fTPM implemented as Trusted Application in TEE
Thank you. So yeah, I'm Tymek, and I'm going to be talking to you about fTPMs and how they can be implemented. That's me: I'm currently wrapping up my bachelor's in automation and robotics and working full-time at 3mdeb as an embedded systems developer — a junior embedded systems developer, please keep that in mind, so if I say something wrong, know that I did my best. And what is 3mdeb? We are a company based in Gdańsk, Poland, and our expertise is in firmware and embedded systems development. And we're kind of cool. You may know us from our main product, Dasharo, which is a coreboot distribution. So that's the agenda: I'm going to first give some information about TPMs, then about ARM TrustZone, and then about how it translates to practice when implementing it on embedded systems. I guess most of you know what a TPM is, but I'm still going to give a brief overview. Usually it's a separate piece of hardware, a chip, that runs cryptographic operations like encryption or generating random numbers, so the system becomes more secure: not everything is visible to user space, and thus the attack surface is lessened. There are a few kinds of TPMs. Oh yeah — these are more of the things that are cool about TPMs: they can also verify the integrity of the system and of the boot process and detect any alteration to it, and the secure random number generation is also a really important part. So there are a few ways you can implement TPMs in your system. The most basic, best-known way, and the one that was shown earlier, is the discrete TPM: a separate physical chip that's completely separate from the CPU. It's on the motherboard, but it communicates with the CPU. The difference will be more visible when I show you the integrated TPM, which is the cheaper and more space-saving option: a TPM that is integrated into another chip.
The danger of that is that if that chip is somehow corrupted or attacked, it has access to the integrated TPM, so it's less secure. The next one is the least secure, but it's still something that is used: the software TPM. It's usually an emulation made just for tests and prototyping. The main topic of this talk is the last one, the firmware TPM: a software TPM that runs in a trusted execution environment and is separated from the normal OS, from user space, by the trusted execution environment. The plus of it is that it's cheap, and it can be implemented on devices that are already provisioned — via an update or something. On embedded devices, the trusted execution environment is made possible by ARM TrustZone, and ARM TrustZone creates a hardware separation. It creates two distinct "worlds", as they're called. We have a normal world, where we have the normal user space — in the documentation it's called the rich OS — and there we run our applications, the kernel, and user-space apps. And we have a secure world that holds trusted applications, which can be stuff you don't want user space to have access to, or to have only limited access to. One such application can be an fTPM, and it can run operations like encryption, decryption, creation of keys — and random number generation, which I'm particularly fond of, because it's kind of funny. The secure world also makes it so that only the trusted OS can access certain parts of the hardware, for example memory. So there are parts that are reserved for operations of the trusted OS, and there are those that the rich OS is allowed to access. And this, exactly, is made possible by the secure monitor. ARM TrustZone specifies exception levels, and as you can see, the secure monitor mediates between them — for example, it tells the hypervisor what memory addresses it can use and which are reserved for the secure partition manager that is part of the TEE.
And so the threat model here is, of course, that if we have an app that's infected — a virus or something — it doesn't get to the bottom layers. We can look at it this way: if our hypervisor is corrupted, the trusted applications and the trusted OS are still valid. But if the secure monitor is somehow corrupted, then we have a problem. I'll get to that part later. All of that was for the ARM Cortex-A series. That's an important distinction, because on ARM Cortex-M, TrustZone works completely differently: it works through interrupts. It's kind of a funny topic, because you could theoretically implement some sort of — I wouldn't say fTPM, because Cortex-M doesn't really allow you to run operating systems; there are some products that do that, they're on the border of black magic, and they're awesome — but an fTPM on Cortex-M is a bit of a weird concept. Okay, so there are some problems with fTPMs, because you could, as I said, update a device over the air to add an fTPM to it. But as the slide says, the best-protected systems have dedicated security from the beginning. ARM TrustZone and a TEE aren't magical things you can just throw on a device to make it more secure. They will make it more secure, but not as secure as it would be if you had thought about these things from the beginning, because ARM TrustZone doesn't in itself add a lot of the important parts that make an embedded device secure. For example, there's no secure storage in ARM TrustZone specifically. You can use an eMMC (its RPMB partition) to achieve that, but if you don't have that on the device you're updating, you have to find some workarounds. The same happens with a secure counter or secure clock, which can prevent rollback attacks: if you don't have those, you're not really protected from them. The secure source of entropy is a really fun one, because there's been a workaround for this.
Actually, the workaround was specified in a presentation that's linked at the end of the slides. The secure source of entropy is a fun one, because they managed to achieve a secure source of entropy via write-once fuses that hold a random seed. They're written once, when the device is manufactured, and they can't be written again. And they act as a seed for random number generation. Fun. And an fTPM also has its own problems, because the secrets are written to memory. It's not safe on its own from, for example, cold boot attacks: when the device is suddenly shut down, you can see the state of the memory as it was at the end of the device's last runtime. The same goes for bus sniffing, where you just physically peek at the electrons that travel on the bus. And also, you can just plug a JTAG probe into some processors. And there's one small caveat: the normal and secure worlds can't run in parallel. Only one runs at a time, and they take turns. So if you have an embedded device that requires real-time operation, you're in trouble. There are workarounds, of course, but I would like to hear them, because it's a problem. So, imagine you're a junior developer and you're told: OK, go do an fTPM in practice. You're me, basically. That's how I approached the problem, and that's how it can be approached. Let's say you have some embedded device. There are a few implementations of a TEE that you could use. Most of them are proprietary; OP-TEE is not. OP-TEE is open source, it's awesome, and it has documentation. Once we have that, we need to build fTPM as a trusted application for the TEE — in this case, for OP-TEE. And at the last step, we add some user-space support so we can actually call the TPM. So let's focus on the second part, because it's fun. No, sorry — let's focus on the last part first, because I didn't arrange the slides as I thought I did.
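The fuse-seeded entropy trick described above can be sketched as follows. This is a minimal illustration, not the referenced implementation: `read_fuse_seed()` is a hypothetical stand-in for reading the one-time-programmable fuses, and a production design would feed the seed into a proper DRBG rather than a bare hash:

```python
import hashlib

def read_fuse_seed() -> bytes:
    # Hypothetical: on real hardware this would read the write-once
    # OTP fuses burned with a random seed at manufacturing time.
    return bytes(range(32))  # placeholder value for illustration

def next_random(seed: bytes, counter: int, n: int = 32) -> bytes:
    # Derive output number `counter` from the fixed fuse seed.
    # The counter must be persisted/monotonic so the same output is
    # never handed out twice -- the fuses themselves never change.
    return hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()[:n]

seed = read_fuse_seed()
r0, r1 = next_random(seed, 0), next_random(seed, 1)
print(r0 != r1)  # distinct outputs from the same immutable seed
```

The point of the construction is exactly what the talk highlights: the fuses give you a secret that TrustZone alone doesn't provide, and everything after that is derivation.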
So yeah, there's a kernel module in the Linux upstream currently that supports access to the fTPM — it allows the system to mount the fTPM as a TPM. I'm not going to walk you through the code; I don't understand half of it myself, to be honest. But as you can see, it's made by Microsoft, and Microsoft provides a description — the white paper that was written on fTPMs — which is also cool. And they provide a reference implementation. Great. And it's for ARM. Great. So half of the work is done, right? Oh yeah — as I said, I didn't arrange the slides as intended. So that's how the kernel driver works, visually: it mounts the fTPM so it's seen by user space as a TPM device. OK, so there's a problem with the Microsoft implementation: it's not maintained at all. It's provided as-is. It's cool that it exists — kudos to them — but it doesn't work currently as it is. That's what I've been fighting with for the last few weeks. So this requires tweaking. The amazing folks at Linaro — shout-out to them — were kind enough to create a fork of the OP-TEE manifests, which we used for building, and it allows you to build the fTPM. I have a few minutes left, so I think I won't be able to show you a demo of it, because it's there on this laptop, and I also didn't have time in the last few days to create a pull request. I hope that by the time this video is up, it will already be on GitHub, and I hope it will be merged. But yeah, if you want to build an fTPM on QEMU, that's currently the best repository available to fork. And Yocto also provides a BitBake recipe for building OP-TEE with fTPM as a trusted application, but it currently works only for ARM — I mean, it only works as a test for ARM. To add support for your own board, you have to append some recipes and do some magic to make it work. I haven't tested it as thoroughly as we would like to. So yeah.
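Once the kernel driver registers the fTPM, user space just sees an ordinary TPM character device. A small sketch of how a tool might locate it — the device paths are the standard Linux ones, and the probing helper is illustrative, not part of any of the projects mentioned:

```python
import os

def find_tpm_device(candidates=("/dev/tpmrm0", "/dev/tpm0")):
    """Return the first available TPM device node, or None.
    /dev/tpmrm0 is the kernel's resource-managed interface and is
    generally preferred; /dev/tpm0 is the raw device."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

dev = find_tpm_device()
print(dev if dev else "no TPM exposed; is the fTPM driver loaded?")
```

This is why, as shown later in the demo, stock user-space TPM tooling works unchanged against the fTPM.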
And all of this was made to work on our own operating system for embedded devices, which focuses on security and on being as adaptable as possible to the needs of your embedded device. So yeah. These are the other resources I used — they're all awesome. I highly recommend this book; it's not as boring as it may sound, it's really well written. So yeah, that's all, and if we have time for questions, then we can do questions. Just a request: can you go back to the references page? OK, yeah, sure, I'll go back. OK, I was shown a card to repeat the question: I was asked to go back to that slide. The slides are also online, so if anyone wants them, they're available. If we have a few minutes — oh, yeah, sure. Did you use the OP-TEE recipes — the example ones? So the question was, did I use the OP-TEE example recipes to build it. I didn't see any related to fTPM in the examples. The examples are also cool, but they're kind of complicated, so it took a lot of reading and trying to make sense of those makefiles. I tried, but the thing that worked was the patching of the Linaro fork, because it also has them. It was last updated, I think, a few years ago, so it uses a lot of outdated stuff — there's Python 2 syntax in it somewhere. So yeah, that's the one I'll be providing, and a pull request, hopefully soon too. Are there any more questions? There's also time, Tymek, if you want to do a demo; I'm sure people would like that. OK, sure — maybe in three minutes, and then we still have five minutes. Yeah, awesome. So, this is the QEMU image made from the forked Linaro repository. As you can see, we have a normal and a secure world. Currently it's not started, so I can start it. There's some output.
Yeah, the secure world doesn't really provide any way of communicating with it besides user space. Oh, sorry. In this particular example there's — I'll show you, but I have only one hand free right now, so it's kind of hard. That's all. The Linaro folks provided some aliases to load the utilities that are on the host system; they are not built in by default in this example, and they load up not exactly the kernel module that's currently in the upstream — it's a slightly different one. That's also why I didn't want to call it a live demo, because it's more of a live Frankenstein that is currently working. Maybe at the next FOSDEM I'll have something better to show you. And if I run this alias that uses all of those commands, we can run some tests with the IBM implementation of the TPM utilities. So I think it will output some randomly generated bytes. Oh, I also have a cheat sheet here, because I couldn't remember the exact syntax for encryption — the TPM tools also work, because, as I said, the fTPM is mounted as a TPM, so every user-space utility that can use a TPM works. So yeah, there we go. So that's the demo, I guess. I think we're done. Thank you, it's been a pleasure. See you all somewhere.
Integrity Protect Workloads with Mushroom
All right. So the next speaker is Tom Dohrmann, and Tom is a real hacker. I first met Tom last year at the CCC event, where he talked about an attack he did on NX, which I knew a bit about. I think it's very inspiring that in the dev room we have these great company talks, but it's also really nice, in the real — let's say free software — ethos, that Tom does some of this work in his free time, as pure hobby projects. He'll talk a bit about the work he's been doing on AMD SEV. Sure — thank you for the introduction. Today my talk will be on integrity-protecting Linux workloads with Mushroom. Okay, so what am I going to talk about? Well, first up, we'll talk about some of the goals of Mushroom, and then I'll give a short demo to show you how it actually works. Then we'll talk about the higher-level architecture and some of the parts in particular: the supervisor, the kernel, and the VMM on the host. Then we'll also talk about some of the things we don't want to do — some non-goals. And finally, we'll briefly touch on how attestation works with Mushroom. But before that, a brief thing about me. My name is Tom Dohrmann. I mostly do development and security research, and my day job is also reverse engineering. Here are some of my links. One thing about me is that I really love Rust, so all of the code that you may see here today is written in Rust. Okay, so what do we want to do? The main goal is to run Linux programs securely. In particular, we want to run programs that just transform an input file, or maybe multiple input files, into an output file, or potentially multiple output files. And while doing that, we want to prevent any tampering during the process, so that we can make sure the output files are authentic. So for example, one use case would be that you have some untrusted server that you want to compile code on.
And ideally, you don't want to trust that server, but still be assured somehow that a backdoor hasn't been injected somewhere in your code — you just want assurance that the code has been compiled without any tampering. So yeah, I'll give a brief demo of that. Okay. So, I already talked about workloads. Mushroom is completely generic in what kind of workload you want to run: it has to be a Linux binary, but that's basically it. For this example, I chose a simple Docker image, just because it's easy to set up. In this case, it's an Alpine image which has GCC and musl installed, and it will run this init script, which just copies the input file that we want to transform to another file on the file system, then runs GCC on that, and in the end takes the output and copies it to a special output device. And the file that we want to compile is this one right here — just a simple hello world, a proof of concept. Okay, so beforehand I already set up some environment variables for some of the components, but the important thing is this one right here. What we'll do is run this command, which, as you might already notice, contains some information like the input file that I just showed you. It also specifies the output, and it specifies where to put the attestation report — because that is, in the end, how we really know that the process hasn't been tampered with: the attestation report. So we'll run that. In this case, it will actually take a bit longer than usual, because the Docker image is fairly large — it's like a hundred-and-six megabytes or something — and just loading that is a fairly slow process. But any second now, the workload will start running. Okay, now it's started running. And now it's finished. Okay, so let's take a look at the output file: just `file test`.
And we can already see that it's a 64-bit ELF binary, which is of course expected, because we compiled a C program. But before we actually run the executable, let's verify that it hasn't been tampered with. We can do that by using the same command that we used previously, but instead of saying run, we use verify — with the exact same configuration parameters. This takes very little time, and it says okay, so we know that the process hasn't been tampered with. As the last step, let's actually make it executable and run it. Yeah, you can see that also works. Okay, so now that we saw what it's supposed to be doing, let's talk about some of the details of how it's implemented. The first thing to note here is that it's implemented using SEV-SNP. So in this case, we have hardware virtualization. The workload is of course supplied by the user, which in this case was GCC. Around that, we have a completely custom kernel, which we'll also talk about later. And around that, we have the so-called supervisor, which is a concept I came up with, and which is basically just responsible for communicating between the kernel and the host. The important thing to note here is that most of the logic is actually in the kernel, and this will probably grow a lot in the future as well. The supervisor is fairly small and will probably not grow a lot in the future — it might even shrink. Even in this configuration, there's some code that's disabled at compile time, because it's only there for debug features. Okay. So, about the kernel: it's completely written in Rust. It implements the Linux syscall interface, so that we can run unmodified Linux programs. It currently implements 83 syscalls, more or less — because some syscalls have a lot of flags, and we don't implement all of those.
But still, it's enough for some applications at least. Apart from that, we also support 32-bit and 64-bit binaries. The reason we have this custom kernel is that usually you have a lot of bloat, a lot of stuff that you just don't need; with our own kernel we can just throw things away and only implement the things that we need. And we'll also need it for some things that we'll talk about shortly. Okay. So, the really interesting thing about Mushroom, I think, is the supervisor. I already said that it handles communication between the host and the kernel. What does that mean? Well, the first thing the supervisor does is actually load the input. The input is not part of the initial measurement; the reason for that is that we don't want the measurement to change every time the input changes, because then we couldn't sign it — or at least not in a way that really makes sense. The other thing is memory hotplug. Initially, Mushroom starts out with a very small amount of static memory, and after that we use memory hotplug to bring in more dynamic memory once it's needed. And lastly, the thing we do during runtime is scheduling: if the kernel wants to run on another CPU, it somehow has to tell the host about that, and that's also a responsibility of the supervisor. The interesting thing here is that this communication is not just a convention — it's not just that the kernel chooses to talk to the host through the supervisor. It's actually impossible for the host to talk to the kernel directly. The reason for that is that we want isolation there: we don't want the host to be able to send potentially malicious input to the kernel, and we want to prevent vulnerabilities by just not having an interface there. And this is implemented using a couple of hardware features.
So for example, one of them is virtual top of memory, which basically makes it so that the kernel can't access shared memory — which would of course be needed to have shared access with the host. Another feature is #VC reflection: in some cases you need the hypervisor, and instead of using the hypervisor, we can offload that responsibility to the supervisor. That way, the kernel doesn't even really have to be aware of being run in an SEV VM. Lastly, the separation between the kernel and the supervisor, which is of course also important, is done using virtual machine privilege levels, which basically make it so that the supervisor is allowed to access all memory, but the kernel is not. For example, the supervisor has some secret keys that it uses for attestation, and the kernel is of course not allowed to access those secret keys. The important thing here, though, is that the supervisor is the only security-critical component. The kernel can have as many bugs as it wants: the host will never be able to talk to the kernel directly, so it doesn't really matter if there are security bugs in there. This is of course really nice for auditing, because the only thing we have to audit, and make sure actually works, is the supervisor, which is, once again, a fairly small piece of code. Yeah. So, for the VMM, we don't use QEMU or anything. The reason being that we have this fairly custom memory hotplug and so on — all those interfaces for getting the data in and out. So instead of using something that already existed, which maybe has abstractions that are not ideal for us, we just implemented it ourselves. It's not actually that complicated, because, once again, we don't have that much host-guest communication, so this VMM doesn't really have to implement a lot.
And as of a couple of weeks ago, it also supports running the kernel outside of an SEV-SNP VM, which is really useful for debugging and profiling — and maybe not everyone has an EPYC CPU that can actually run those VMs. Okay, so we've talked a lot about things that we want to do, but there are also things that we don't want to do. One of the important ones is that we don't want to do I/O at runtime. If I want to run GCC, I don't need network; I will never need that. That's just not a thing that we need. And the thing is, by not having network, we can reduce the attack surface drastically, and once again reduce complexity in the supervisor and the kernel and mitigate vulnerabilities by just not implementing interfaces. Of course, there are a lot of use cases where you do need network, but in those cases you can just use standard Linux — you can just use other projects. The point is that for a lot of workloads, you don't need the extra complexity, and by just not implementing it you can lower the potential for vulnerabilities. The same logic goes for persistent storage. Every time Mushroom boots up, you boot into a tmpfs with all those files that you supplied during initialization, but once the VM actually exits, all that memory is destroyed, because for a lot of use cases you don't need a persistent disk. By not having that, you can once again lower complexity. Similarly, we also have fairly low complexity in the supervisor, which, once again, is the one part that's actually security-critical. One of the things you might have noticed is that none of the things the supervisor is doing are really CPU-bound or performance-critical in any way.
And so, for example, we can get away with just not implementing multithreading, because in reality there's nothing that requires that amount of performance — nothing that could potentially get a performance boost from multithreading. And by not implementing multithreading, we can once again eliminate a whole class of concurrency bugs, because those just can't happen if you don't have multithreading. Similarly, the supervisor is fairly simple and doesn't actually need a heap — and once again, you just can't have heap bugs if you don't have a heap. So I think those non-goals are also really important, because they constrain the things we want to do and, that way, increase security by setting clear goals. Okay. So lastly, let's talk about attestation. First, the measurement: in this case, this covers all of the binaries that we load up — the supervisor, the kernel, and the init binary. Those could be signed in the future; currently, we just compare the raw hash. The SEV firmware, when you load in the image, hashes all the memory, chains it together, and produces a hash that someone could sign, but we don't currently. The host data field is a field that's supplied when the VM is booted up, and this field contains a hash of the input file. The first thing the supervisor does when it boots up is load the input file and verify that that hash is correct. It doesn't even really look at the data, it just hashes it — so there's hopefully no way for the input file to be malicious and influence the supervisor before it's been verified to be the one we want to see. And lastly, of course, we also want to attest the output, and this is put in the report data field.
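The three-field scheme just described can be sketched from the verifier's side. A rough illustration only — field and function names here are made up, not Mushroom's actual API, and a real verifier must also validate the AMD signature chain over the report:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_report(report: dict, expected_measurement: bytes,
                  input_file: bytes, output_file: bytes) -> bool:
    # 1. The launch measurement covers supervisor + kernel + init,
    #    pinning the exact code that ran.
    # 2. host_data pins the input the supervisor hashed at boot.
    # 3. report_data is the only guest-influenced field and carries
    #    the hash of the produced output.
    return (report["measurement"] == expected_measurement
            and report["host_data"] == sha256(input_file)
            and report["report_data"] == sha256(output_file))

inp, out = b"int main(){return 0;}", b"\x7fELF..."
report = {"measurement": b"m" * 48,
          "host_data": sha256(inp),
          "report_data": sha256(out)}
print(verify_report(report, b"m" * 48, inp, out))          # True
print(verify_report(report, b"m" * 48, b"tampered", out))  # False
```

Because only report_data is guest-controlled, even a malicious workload can't make the report claim a different code image or a different input.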
The report data field is also interesting because it's actually the only field that the guest can influence at runtime. Both the measurement and the host data field are set by the SEV firmware, and even if you have some malicious input file or malicious init binary, you can only modify the report data field. This is really important, because assuming you have some untrusted input, you will never be able to forge an attestation report in such a way that it pretends to come from another host data value, from another input file. By making this simple abstraction choice, we can hopefully reduce the potential for any vulnerabilities there. This is another area where mushroom is comparatively simple compared to other projects, because we only do attestation at the end of the process. We don't have any certificates during runtime, and because we don't have any I/O at runtime, we just don't need the certificates that would usually be required to interface with other services. I can see why there are a lot of problems there, like input sanitization, but that's just one of the things this model doesn't really need, and similarly for the disk encryption case. So the attestation model for mushroom is really, really simple, and hopefully made in such a way that it's actually easy to audit for external people if they want to do that. Okay, so do we have any questions? Thanks a lot for a very interesting talk. I particularly liked the demo that you showed, because this use case where you actually run a compiler inside the CVM is a very desirable property in build environments, where you want this notion of hermeticity, where you actually record the entire toolchain that you used to produce the software.
So related to this, I had a question about the trust assumptions here. You talked about the supervisor being the only security-critical component, but that basically only applies to the communication between the outside world and the kernel. You later talked about how you can still have attacks via the input itself. So for instance, if I have malicious code that targets some vulnerability in GCC, let's say, that's still possible, right? But on the other hand, that gets somehow recorded as part of the attestation. Can you elaborate a little bit on these aspects? Yeah, thank you, great question. So if you have a malicious input, that would show up in the attestation report. And ideally, if you have a scenario where you want to have a code cache, where you compile code once, you will only supply inputs that are not malicious. So as long as you don't request malicious inputs, you will not get malicious outputs. In theory, there could be attacks from the inside, but that's not really a problem, because those always show up in the attestation report, and a normal user will not request that. Yes. So the question was whether or not this is auditable, and the answer to that is yes, everything shows up in the attestation report, so hopefully that's not a threat. Any other questions? Thank you, this was awesome. This is not a question, it's a feature request: if you could spit out an SBOM from the compilation, that would be fantastic. Well, the thing about that is that mushroom is not necessarily only meant for compilation processes. But if you want to do that, that's great.
One of the things I've been toying around with was running Nix builds in this. And of course, the way Nix works, all the inputs are already specified in the build hashes. So in that scenario, you would more or less have an SBOM, or at least something traceable to some input. But that's independent from mushroom, although of course that's also a use case I intended. Okay. So first of all, very awesome work. I really like that you show that these confidential-VM-based solutions can also be used with very tiny trusted computing bases. That's nice. And I mostly agree with your design choice of the non-goals. But you said that you don't support multithreading. Wouldn't that be somewhat important for compilation, to be able to run on multiple cores, since it's kind of CPU-consuming? Yeah, sure. So this thing about multithreading only applies to the supervisor. The actual kernel can run on as many cores as it wants. Technically, there's currently a limit of 128, but that could be changed, and it's probably enough. Yeah. Maybe a question also moving forward: you mentioned SEV-SNP, but is your design tied to it? I'm thinking about the VMPL support. Okay, so the question was whether or not my design is tied to SEV-SNP, or whether it could also apply to something like TDX. Currently, the supervisor is highly specific to SEV-SNP, but I don't see a reason right now why it couldn't be implemented for something like Intel TDX. That should probably be possible. I mean, the VMPLs are SEV-SNP specific, but I think with TDX there's something like TD partitioning; maybe that could work. I'm not sure, I haven't looked into that. Yeah.
Reproducible builds for confidential computing: Why remote attestation is worthless without it
All right. Let's get going. Our next speakers are Paul and Malte from Edgeless Systems, and they're here to talk about remote attestation and reproducible builds. Yeah, thanks. I will start with some motivation. The topic of the talk is reproducible builds for confidential computing and why we need them. So first, the motivation. What is the situation with confidential computing? We have trust issues, especially when we're running in the public cloud. So, first of all, we trust no one. Well, that's not entirely true: we need some hardware we can trust, so we have to trust the hardware manufacturer. And for all the other components that we are using, we have to establish trust before we can rely on them. We're doing this using remote attestation. So, a quick overview of remote attestation based on the RATS RFC. Here we have our three entities: the attester, the verifier, and the relying party. The goal of the remote attestation procedure is that the relying party can place trust in the attester system. How are we doing this? Inside the attester, there's an attesting environment and a target environment. The attesting environment takes measurements of the target environment and then hands out some evidence that is checked by the verifier. The verifier uses two kinds of resources to verify the evidence: first, the endorsements, which usually provide guarantees about the authenticity and integrity of the evidence, and then some reference values that are compared to the claims inside the evidence. The verifier performs its verification and produces an attestation result, and that attestation result is consumed by the relying party. Using this attestation result, the relying party can place trust in the attester system. The aspect of this remote attestation procedure we want to talk about here is the reference values.
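The verifier's appraisal step just described — check the endorsement, then compare claims against reference values, then emit an attestation result — can be modeled very roughly as follows (a toy model, not any real RATS implementation; the HMAC stands in for a hardware-rooted signature, and the claim encoding is made up):

```python
import hmac

def verify_evidence(evidence: dict, endorsement_key: bytes,
                    reference_values: dict) -> dict:
    # Endorsement check: is the evidence authentic? (Here a shared-key
    # HMAC stands in for the hardware-backed signature.)
    mac = hmac.new(endorsement_key, evidence["claims_blob"], "sha256").hexdigest()
    if not hmac.compare_digest(mac, evidence["mac"]):
        return {"trusted": False, "reason": "bad endorsement/signature"}
    # Reference value check: do the measured claims match expectations?
    claims = dict(pair.split("=")
                  for pair in evidence["claims_blob"].decode().split(","))
    for name, expected in reference_values.items():
        if claims.get(name) != expected:
            return {"trusted": False, "reason": f"mismatch on {name}"}
    # The attestation result is what the relying party consumes.
    return {"trusted": True, "reason": "all claims match"}
```

Note that both inputs to the comparison matter: tampered evidence fails the endorsement check, while authentic evidence from an unexpected software stack fails the reference value check.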
As I already said, we use the reference values to check the claims inside the evidence. Some of these reference values represent the code identity of what we are actually running inside of our TEE, and often these values are hashes over what we are executing. As we all may know, hashes are one-way functions, so it is really difficult to go back from a hash to what was actually hashed. Many questions arise from this. Where do these hash values come from? Who produces them? So who's our reference value provider? What do these hashes stand for? And how can we establish trust in them? Often the answer is: we just can't. In this talk, we want to present a way to establish trust in, and attach meaning to, those reference values. So why might this be a difficult task? The main scenario we are talking about here is CVMs, and these CVMs have quite large TCBs. We need to cover all of our software components with these reference values, and there are quite a lot of components: firmware, bootloader, kernel, user space. We need all that stuff. That can be quite a lot of lines of code, not just a few lines of code like in mushroom. But the more interesting question is: who is part of our trusted computing base? Software vendors, usually, and usually quite a lot of them. There are different ways we include people in our trust boundary. Maybe the simplest one is that we consume code from other people. Well, that's quite usual, and it's also okay: we can audit the code before we include it, and ideally our language ecosystems provide us with some mechanism to pin the dependencies that we use by some hash. So that's okay. The second mechanism is more problematic: we could consume binary artifacts, and going back from a binary to source is expensive. Typically, this is when we install packages using a package manager, or when we use prebuilt VM images.
Even if those binaries are signed, if we rely on the signature, we include the signer in our trust domain. And then there's the third case, which is even worse: the situations where we cannot choose what is actually running inside of our TCB. This is, for example, the case when we have some hardware compatibility layer running below our guest OS in the CVM, or when we are not able to run customer-provided firmware in the public cloud. Okay, so talking a bit about the consequences here. Every software vendor we include in our trust boundary could potentially run an attack on us, for example by delivering malicious reference values, meaning reference values for a malicious binary. It's just really difficult for us to check what these values stand for, and in the end, we have no insight into what is actually running in our system. A simple solution could be: we build everything from source, right? Source is good; we can audit the source. But usually we are not the consumers of the things we build. We're not the end users. As a consequence, there's one remaining software vendor in the trust boundary, and that is us. So that's not good either. The actual goal here is to provide attestable systems for the end user, and reproducible builds can help us do this. Malte will now continue and tell you about reproducible builds. So, thank you. Let's quickly talk about what reproducible builds actually are. The basic idea is that you follow software development practices whereby third parties, or anyone, can take the same inputs and produce the same binary output. This part about being independently verifiable is really important to us. Let's take a small step back and look at our perspective. We are building a lot of software that is supposed to run inside of enclaves. For example, we're building a full Kubernetes distribution with OS images and containers.
And we really don't want people to have to trust us just because we are reputable. We want people to take the stuff we build, look at the source code, verify it, and rebuild the binaries. Only if they can rebuild the same binaries can they also get to the same measurements, and then they know that they can trust us. In a perfect world, this is what we would like to have: we just take the source code, put it into a function, and get out the reference values. But as you will see, this is sadly not the reality today. Looking at this more closely: you have the source code, then you have some kind of build process, and what you get out is binary artifacts, like the firmware, the kernel, and anything that goes into the user space applications. From these, you derive the hashes or other reference values used for remote attestation. In reality, this is already where you start running into problems, because sometimes the software itself is not open and you cannot rebuild it, since the source code is not public. Sometimes this is where you just have to stop. But if you're lucky, the source code is actually available. That's when you run into a whole different set of problems, because if you want to build the same firmware and the same kernel and the same user space and everything else, you notice that your build doesn't actually just depend on the source code. It also depends on timestamps and randomness and inputs that you didn't know you had. And it depends on tools and specific versions of them. So let's say you actually manage to get all of this under control. Then you can still run into the situation where you get the same firmware and everything else, the whole stack, the whole TCB is the same, and you boot it in a trusted execution environment, and still the evidence that you extract is different.
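A tiny illustration of the timestamp problem mentioned above: gzip embeds a modification time in its header, so compressing identical input with different timestamps yields different bytes, while pinning the timestamp (the idea behind `SOURCE_DATE_EPOCH`-aware builds) makes the output a pure function of the input again:

```python
import gzip

data = b"the same source input"

# Different embedded mtimes -> different compressed artifacts,
# even though the content being compressed is identical.
a = gzip.compress(data, mtime=1)
b = gzip.compress(data, mtime=2)
assert a != b

# Pinning the timestamp restores reproducibility: two builds of the
# same input now produce byte-identical output.
c = gzip.compress(data, mtime=0)
d = gzip.compress(data, mtime=0)
assert c == d
```

The same pattern shows up in archives, file system images, and compiler outputs: any "input you didn't know you had" must be either eliminated or pinned.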
This is often the case if you include anything in the measurement that is not part of the code but is actually dynamic, like a timestamp at boot or the instance ID of your virtual machine. In this case, you basically have to run a policy engine on the other side. This can be solved, but it's also really annoying. Next, we will quickly look at who's already doing good work in this field. First is the AWS UEFI firmware, which is used today to run AMD SEV-SNP virtual machines. It's really nice: it's just EDK2 OVMF firmware with some patches, but they also provide the full build system, so you can just download it, rebuild it from source, and actually get to the same measurements. Another example is Constellation. This is the stuff that we build: we provide every container image, every tool, the whole operating system, and anything can be rebuilt from source, and it's all reproducible. Then there's also the Confidential Containers cloud-api-adaptor with its peer-pods images. They now have an option to build images with mkosi that are also mostly reproducible. We also have a GitHub repository where we basically wrote down all of the steps needed to take a general-purpose Linux distro and get reproducible builds for it. It's documented, and we show you all of the steps that we took, so you can play around with it, which I think is a good starting point. That's the repository if you want to have a look. So, now some concrete help if you actually want to do this, for building OS images in particular. First of all, you need to pin your build tools. If you don't do that, tomorrow you will have a newer version of a tool and you will get a different result. What we noticed is that if we use something like Nix, we can pin all of the build dependencies in a very nice way.
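The pinning advice boils down to something like the following minimal lock-file check (the file name and hash here are made up for illustration; real tools like Nix, Bazel, or language package managers do this for you):

```python
import hashlib

# A minimal lock file: every dependency is pinned to a known SHA-256.
# The entry below is hypothetical (its digest is sha256 of b"foo").
LOCK = {
    "libfoo-1.2.tar.gz":
        "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae",
}

def verify_pinned(name: str, content: bytes) -> None:
    # Refuse to build if a fetched artifact does not match its pin.
    digest = hashlib.sha256(content).hexdigest()
    if digest != LOCK[name]:
        raise ValueError(f"{name}: hash mismatch, refusing to build")
```

Pinning alone is not enough, as the talk notes next: the pinned artifacts also have to stay archived and available, and the lock file needs a regular update mechanism so pins don't freeze in vulnerable versions.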
And we were also able to patch a lot of the tools in Nix so that they actually become reproducible. For example, there was a tool, mkfs for FAT partitions, that was not reproducible; we could make sure that the version in Nix actually creates reproducible outputs. The second thing is about anything you depend on: the libraries of the software you're building, or binary packages if you have to include them in your image. First of all, you want to pin them, so you know in advance the hashes of everything that will be a dependency. And you're not done with just that: you also have to make sure that they are available in the future, so you have to archive them and make them available. You also need a mechanism to actually update your lock files, because if you just pin them, you will accumulate a lot of security vulnerabilities in the future. And then it goes on: you really want to build every piece of software in a sandbox, because otherwise you don't actually know whether your build is reproducible; it could depend on something that is not actually there in the future. So use a build system that does this: there's mkosi for building OS images, there's Nix and NixOS, which are really great, and there's Bazel, which also uses sandboxes. It will eliminate a whole class of issues. And then you really want to restrict build actions, install actions, or any other kind of logic to only perform deterministic steps. For example, I think the CoCo project was using HashiCorp Packer, which has the issue that it can run arbitrary steps, meaning it could, for example, run apt-get install, and then you basically have no idea what version of something will be installed. The same applies to Dockerfiles. So just use something that only does what you want. This was our talk. There are some important things we want you to take away: learn about reproducible builds.
We want you to provide an open software stack for CC, and we want to enable the community to reproduce the reference values that we put out into the world, so we can remove ourselves from the trust boundary. Thank you. Thanks a lot. So I have a bit of a philosophical question, related to the relationship between reproducible builds and build provenance. In the last talk, there was a question about SBOMs, and this is of course something of increasing importance because of the focus on supply chain security in general. There are also people working on build provenance, where you have build hermeticity and a record of how the software was built. That also gives you some guarantees about how you end up with a certain set of reference values, even if the build is not fully reproducible, right? Because you know from the provenance what goes into the recipe. Do you have any thoughts on the pros and cons, or reflections on these two related topics? Yes, definitely. First of all, if you're able to have reproducible builds, you already basically have an SBOM, because it must be the source code and anything that's locked in there. So the SBOM is effectively already there. And then, if you have an SBOM, how do you trust the SBOM? If someone just gives you an SBOM and it's not signed, it could be fake. And if someone does create a trustworthy SBOM, it probably needs to be created in a confidential VM or something like that, and then you've made the whole problem a lot more complicated. Whereas if you can just use reproducible builds, the problem is simply fixed. The next question was about whether this also solves the problem of pinning the toolchain. Yes, so the question is whether you can trust the toolchain that bootstraps the whole system.
And yeah, I think you can bootstrap yourself from nothing. I believe the Nix project has some kind of bootstrapping where they do exactly that. So that's it.
Increasing Trust and Preserving Privacy: Advancing Remote Attestation
Our next speakers are Thomas and Ionuț from Arm, and I think it's going to be a great end of the day, so looking forward to it. Well, hi everyone. This is a talk about remote attestation, because we think remote attestation is at an inflection point: it's becoming increasingly available and used, and with any new technology, when it comes to the fore, you have to consider different aspects, societal as well as technical. So we're here to talk about this. Possibly interesting things. My name is Thomas, as Fritz said. This is Ionuț. The ghost of Hannes is here with us; he couldn't come to Brussels, but he's here in spirit. Yep, that's us. Okay, so I wanted to start with this timeline that tries to capture some of the more relevant events in the history of remote attestation, starting from the theoretical underpinnings with the DDSA paper from the fine people at PARC in 1983. Then you have to wait some 15 years before the research trickles down into industry. At the end of the century, the first industrial consortium is formed to actually define what a trusted computing architecture is, in terms of behavior and in terms of the interfaces it needs to expose. So we have TCPA formed, which then morphs into TCG, the Trusted Computing Group. These are the folks responsible for producing the TPM 1.2 and TPM 2.0 specs, among other things. So the first decade of the 2000s is driven by trusted computing use cases, because TPM has a strong attestation story, bound to the idea of using the TPM as a root of trust for reporting. Then enter the second decade, and you have AMD SEV and Intel SGX cropping up. This starts the confidential-computing-driven decade, with the first, second, and finally the third iterations of the architectures, which culminate in SEV-SNP, Intel TDX, and Arm CCA.
You have a few other interesting events in that period. You have the RIoT paper from the Microsoft folks, which fully articulates the ideas that were in the DDSA paper. So thirty-odd years later, you finally have the DICE ideas on paper, and not just on paper but in code. You also have PSA attestation from Arm, which is an attestation scheme targeting IoT platforms, like RIoT as well. So attestation primitives start to enter that space too. And then you get into the 2020s, and here is where we see some kind of maturity in terms of the standards that are actually coming to the fore. Not just standards in terms of data formats — RATS, which was mentioned before, is one example — but also software standards. The configfs-tsm ABI that the Linux kernel has just upstreamed is one very concrete example of standardization in the software space. So here we are; as I said, we're probably at an inflection point. The primitive is increasingly available, not just in the confidential computing space, although CC is a very prominent area that drives this. You also have use cases in IoT, and use cases in TCB remediation. It's also cropping up in consumer devices, with interesting societal fallout. Basically, the idea is that, as Dave Thaler said, every authentication use case is also an attestation use case. Wherever you have the need for authentication, attestation, which is effectively a stronger authentication primitive, a stronger identification primitive, is something that could be used to either reinforce or supplant your previous mechanism. So that's where we are.
So I think, as I said, when you have these new technologies, you need to look at the bigger picture and try to understand what implications the use of these technologies has on the wider ecosystem. One of the interesting things here is the centralization risks involved with attestation. Another one is privacy. Let's start with centralization. If you looked at the RATS architecture picture in the previous talk, you saw that the verifier is at the very center of the image. And it's not just visual bias: it really is central to the architecture, a choke point where all the message flows are intercepted. It's also where the decisions are made, because the verifier box has a verifier owner attached to it, and the verifier owner is the one with the power to decide who talks to whom, which attester has the right to talk to which relying party. So the verifier owner is gating the information flow, and is therefore a very powerful entity. The risk here is associated with monopoly. There are situations where, if you don't look carefully at your design and your architecture, you slip into these potential centralization risks, which we have seen happen, in a way. I don't know whether you followed it: Web Environment Integrity is something that exploded last summer, and it's the cautionary tale, the perfect story of vertical integration, where you have a monopolist actor that takes care of the whole thing and, well, it creates problems. The point here is that centralization can be tackled, we think. The RATS architecture carves out the roles in a way that lets you recompose them along tussle boundaries.
So you can remodel the roles in a way that, for example, moves the verifier function towards the user, in a user-centric way. But it is not possible to do this rearrangement of roles for all use cases, because sometimes you would end up in a conflict-of-interest situation or something like that. So maybe one idea is to run the verifier as a neutral, multi-stakeholder entity. Like Let's Encrypt: that's what they did when they democratized the X.509 world by creating a multi-stakeholder consortium that runs the Let's Encrypt function, which is another example of this kind of centralization risk being addressed. Privacy is another aspect. All the message flows go through the verifier, and the verifier has to see the claims to do the reference value matching; therefore, it sees everything. The potential for abusing this position is great, because PII may not be in the evidence directly, but it can be obtained indirectly from it. So this is a risk. There are things in the toolbox; there are basically two ways to deal with this. One is to inflate your anonymity set, either through cryptographic primitives like group signatures, or through anonymization in the hardware, for example by creating a batch of identical devices, like FIDO does, and like Arm CCA does in certain configurations. The other is to reduce the claim set, so what you need to expose to the outside world, via claim reduction and other patterns like selective disclosure, et cetera. So the tools are there. These were the societal aspects; now for the technical aspects. We are transitioning from a situation where the solutions were experimental, where we were mostly in research mode.
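One of the claim-reduction patterns just mentioned, selective disclosure, can be sketched with salted hashes. This is a toy version of the idea (roughly what schemes like SD-JWT formalize), with all names illustrative: the signed evidence carries only salted digests, and the attester later reveals just the claims it chooses.

```python
import hashlib, os

def commit(claims: dict) -> tuple[dict, dict]:
    # The evidence carries only salted digests of the claims; the salts
    # stay with the attester. (In a real scheme the digests are signed.)
    salts = {k: os.urandom(16).hex() for k in claims}
    digests = {k: hashlib.sha256((salts[k] + str(v)).encode()).hexdigest()
               for k, v in claims.items()}
    return digests, salts

def disclose(claims: dict, salts: dict, names: list) -> dict:
    # Reveal only the selected claims, together with their salts.
    return {k: (salts[k], claims[k]) for k in names}

def check(digests: dict, disclosed: dict) -> bool:
    # The verifier recomputes each revealed claim's digest and compares.
    return all(
        hashlib.sha256((salt + str(v)).encode()).hexdigest() == digests[k]
        for k, (salt, v) in disclosed.items())
```

The salts prevent the verifier from brute-forcing undisclosed low-entropy claims from their digests, which is exactly the privacy property claim reduction is after.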
Now we need to move to a different, more engineering-oriented, more structured approach. We think we have some suggestions to make, and I'll let Ionuț... sorry for taking so long. Hey. Okay. So I want to talk to you a bit about the IETF and why we think it's a good venue for standardizing things related to remote attestation. First off, let's look at some of the IETF principles that form the core of its mission, and why we think these are relevant to the hacker crowd here at FOSDEM. The first is openness: an open process, so everyone can get involved and can read the standards that are being worked on. This includes not just technical folks, but also members of, let's say, civil society who have things to say about what is being standardized or drafted. The second is technical competence, meaning that the IETF only works on things it has the competence to speak to, and it will listen to technically competent input from whatever source. The third principle is that of a practical ethos: rough consensus and running code, so trying to base all the standards on engineering judgment and real-world experience. More pragmatically, it means that standards should come accompanied by some code for verification, and hopefully multiple implementations that are interoperable. So let's look at attestation in the IETF. I think the RATS working group has already been mentioned, along with the major milestone achieved about a year ago: the Remote Attestation Procedures architecture document, from which this diagram is taken, shows the roles involved in making remote attestation usable. The RATS working group is there to standardize around this diagram: the roles, mechanisms, and data formats inherent in it.
But if you want to look at remote attestation as an authentication mechanism, then we need to go beyond RATS and this diagram, and look at cases where the attester and the relying party are trying to interact over different protocols like OAuth, TLS, EST, and so on. So let's start by looking at credential issuance, and by this I mean, for example, X.509 certificates. The Enrollment over Secure Transport and certificate management protocols are central to public key infrastructure, and they allow an entity to request that a registration or certification authority generate a certificate. A recent requirement from the CA/Browser Forum has put in place a need for the RA or CA to have the entity prove the security state of the key being certified. That's why we're trying to integrate remote attestation, to make this happen. The way remote attestation works here is: the verifier sends a nonce to the entity, the entity uses that to generate evidence and packages it up in the CSR, and then the RA/CA can get an attestation result back and decide whether it wants to trust the entity and issue the certificate. The identifiers there point to the places where you can find more information about how this all works. If we look at ACME, it's again certificate issuance, and as you can see, the diagram looks pretty much the same. The only difference is that the evidence is carried in a different format, defined by the W3C: the WebAuthn format. This just highlights that we're pretty open and pragmatic about what we use; if there's something ready, we can just use it. The second type of credential we care about is, for example, OAuth, where a client might want to get an identifier and perhaps some credentials from an authorization server. Again, pretty much the same diagram.
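The nonce flow described above reduces to a simple freshness pattern (toy code; in reality the evidence is a hardware-signed structure and the key never leaves the device — the HMAC and all names here are stand-ins):

```python
import hmac, os

ATTESTATION_KEY = os.urandom(32)  # stand-in for a hardware-protected key

def verifier_make_nonce() -> bytes:
    # Fresh random challenge, sent to the attesting entity.
    return os.urandom(16)

def attester_make_evidence(nonce: bytes, claims: bytes) -> dict:
    # Binding the nonce into the signed evidence proves freshness:
    # a replayed report would carry a stale nonce.
    body = nonce + claims
    return {"body": body,
            "sig": hmac.new(ATTESTATION_KEY, body, "sha256").digest()}

def verifier_check(nonce: bytes, evidence: dict) -> bool:
    ok_sig = hmac.compare_digest(
        evidence["sig"],
        hmac.new(ATTESTATION_KEY, evidence["body"], "sha256").digest())
    return ok_sig and evidence["body"].startswith(nonce)
```

In the EST/ACME integrations, this evidence rides inside the CSR (or the WebAuthn-formatted payload); the protocol around it changes, but the challenge-bind-verify core stays the same.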
And then, if we move on to secure channel establishment protocols like TLS, these are quite different because of their symmetry compared to credential issuance, and we've tried to preserve that. In the diagram here, you can see one type of flow, where the server is the one attesting itself, but you can have the same on both sides: both the client and the server can attest themselves. They can use either attestation results or evidence as credentials, and they can use these credentials instead of PKI or alongside PKI. Obviously, we're dealing with some sensitive stuff here, and we want to make sure that our specifications are as secure as possible. The way we do this is, obviously, to use our experience with these protocols to make sure they're secure, and to use implementations to drive testing and catch any bugs. But we can't just rely on that, because testing can never be fully thorough. So recently we've been integrating formal verification into our work, trying to prove that the security properties we care about are upheld by our designs. And actually, in the IETF, we have a new Usable Formal Methods proposed research group to take care of this more broadly. So I want to leave you with one message, which is: please join us. Please join us in drafting these standards, implementing them, and making sure they work properly in the real world. We tend to lurk around in the RATS working group and the CCC Attestation SIG. Thank you. Okay, I'll repeat the question: the question was, for the ACME integration of remote attestation, whether there is example code or a reference implementation. I think there probably is. I think I've seen a demo from the person who was drafting this. But yeah, we can get in touch.
Forensic container checkpointing and analysis
Thank you. Yeah, thank you. So welcome to my session on forensic container checkpointing and analysis. My name is Adrian Reber. I've worked at Red Hat since 2015, and I've been involved in process migration, which is the basis for container checkpointing, for I guess 13 years now. Everything I'm talking about today is based on CRIU, Checkpoint/Restore In Userspace, a low-level tool. I've been involved there for a long time. And I've been focusing on container migration since 2015, and forensic container analysis is one use case of the overall container migration topic. So this talk will look something like this: I will give a bit of background about the tools, who uses checkpoint/restore currently, who uses CRIU, how it is used, the use cases. I will go through a couple of them. Then I will talk about the title of the talk, forensic container analysis. This is basically just a demo, so maybe it fails. And then I will talk a bit about the future of checkpoint/restore, especially with a focus on Kubernetes. Okay, so Checkpoint/Restore In Userspace, CRIU, is the tool we're using today to do the checkpointing and create the images for the analysis. The reason it's called Checkpoint/Restore In Userspace is that checkpoint/restore is a technology which has existed on operating systems, and on Linux, for a long time. Previous approaches were either in the kernel (that's why this one is called "in userspace"), or they required some preloading. So you would do an LD_PRELOAD, and then some library would intercept everything you do, and later, on restore, something would try to recreate the steps you did before. CRIU is different. CRIU is what you would call a completely transparent checkpoint/restore utility. It doesn't require any preparation of the process. You can just point it at any process and you can checkpoint it, as long as the process is not using any resources CRIU cannot handle. And then you can restore it on the same or on another machine.
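A minimal sketch of that transparent flow, assuming root and the criu binary are available; the PID and the image directory are placeholders:

```shell
# Checkpoint an arbitrary process tree by PID into an image directory.
# --shell-job is needed for processes attached to a terminal;
# --leave-running keeps the original process alive after the dump.
sudo criu dump --tree 1234 --images-dir /tmp/ckpt \
     --shell-job --leave-running

# Later, on the same machine or another one (after copying /tmp/ckpt),
# restore the process from those images:
sudo criu restore --images-dir /tmp/ckpt --shell-job
```

No preparation of the target process is needed, which is exactly the "completely transparent" property described above.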
CRIU was developed with the goal of using existing kernel interfaces as much as possible. Over the years, additional kernel interfaces were introduced to support CRIU. None of these interfaces are specific to checkpoint/restore, so there are always multiple different users of those new interfaces. So most of the changes CRIU made to the kernel are not checkpoint/restore specific; most of the time it's just about how to get more information about a running process out of the kernel. There are multiple integrations of checkpoint/restore in different projects: container runtimes, container engines, container orchestration. The first I have to mention here is OpenVZ. It's something I never used personally, but that's the group behind CRIU; they developed CRIU to be able to live-migrate their containers. They were doing containers before they were called containers, so it's something which has existed for a very long time. And at some point, I'm not sure about the history exactly, they came up with CRIU to have a Linux tool which works for everybody and not just for them. Another interesting integration of CRIU is in Borg. This is Google's container engine, which they use in-house to run all their tasks. And although the upstream CRIU developers don't have direct contact with Google, we know from conferences how Google uses it. Basically, they can migrate containers, and they mostly do it for low-priority containers. So if you have a node with something running on it, and it needs more resources: before CRIU, they just killed the low-priority container and restarted the work somewhere else from the beginning. With the integration of CRIU, now they can just move it from one host to another host. And as far as we know, they've been using it at least since 2017; I think that's when we saw the first presentations from Google on how they use CRIU. Then there's been an integration for a long time in LXC, and I probably have to mention Incus today.
It's also integrated there. It's also been integrated in Docker for a very long time, I don't know, maybe since 2016, something like that. I've worked for a couple of years to integrate checkpoint/restore support in Podman, so using Podman you can also checkpoint and restore containers and migrate them from one host to another host. And then there's the thing which I'm currently working on. People talk to me about how they intend to use container migration and container checkpointing, and the simplest use case is maybe reboot into a saved state. So you have your system running with a container on it, and it has the blue kernel there, and it has some problem, and you want to update the kernel. But your container takes a long time to start, so you're not really happy doing a reboot, because your application would be down for a long time. With CRIU, you can update the kernel, then create a checkpoint, basically a stateful image of your container, write it to disk, reboot your host, and it comes up with the new kernel; this time it's green. You restore the container, and it's running pretty fast, much faster than waiting for all the initialization. So you can quickly reboot your systems using checkpoint/restore. Another one is similar to the first one, and people have been talking to me about this as well, so this is also used in production. You have a container which takes a long time to start; the one I've been told about takes like 10 minutes until everything is initialized. And they have a service which they want to sell to customers, and they want to give the customers fast access to the containers; they don't want them to wait for 10 minutes. So what they do is they initialize the container once, create a checkpoint, write it to disk, and then they can immediately start services from this pre-initialized container in a matter of seconds, and their customers don't have to wait 10 minutes. It's just 10, 20 seconds, something like that.
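The pre-initialized-container pattern can be sketched with Podman's checkpoint support. The image name, container name and archive path below are my own placeholders, not from the talk:

```shell
# Initialize once (this is the part that takes minutes):
podman run -d --name slow-app my-slow-image

# Freeze the fully initialized state into a portable archive:
podman container checkpoint --export /tmp/slow-app.tar.gz slow-app

# After a reboot, or for each new customer, restore in seconds.
# --name lets you create several independent copies from one checkpoint:
podman container restore --import /tmp/slow-app.tar.gz --name customer1
```

The same export/import pair is also what makes the reboot-into-new-kernel and host-to-host migration use cases work: the archive is just a file you can keep across a reboot or copy to another machine.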
The combination of those two use cases is container live migration. So we have two hosts, and we have the container on one host. And it's hopefully stateful, because if the container is not stateful, the whole container migration thing doesn't make much sense in the end. For the forensic use case, it can be a stateless container as well, because you can still analyze it. So the steps are the same again: we create a copy of the container, write it to disk, and then we can create one or multiple copies on the destination system. And the original container can keep on running or not; this is really up to you, how you want to use checkpoint/restore. Another interesting thing people are talking about is spot instances. Spot instances are usually something which is cheap, but they go away: those VMs give you, I don't know, a two-minute warning, and people are using checkpoint/restore there in combination with CRIU. So you get a signal that your VM is going down, you create a checkpoint, write it somewhere, and then you can continue to run your workload on another system without losing any work and without having to do any restart or long downtimes or whatever you would like to avoid. And something which came up recently is that people are interested in using it for AI training. So you have your AI training running somewhere with a GPU, and for some reason it's aborted, or you have to make space on the node, and with checkpoint/restore you can create a checkpoint of your container. In this case it's less about migration; it's just about creating a copy of your state somewhere so you can continue to run it later, or even migrate. It really depends on what you want to do there. The interesting thing here is, I mentioned previously that CRIU cannot handle all resources, and GPUs are exactly the kind of resource which CRIU cannot always handle.
We are lucky that AMD came to us, and they actually implemented support to migrate, or checkpoint and restore, applications which are running on the host CPU and at the same time on an AMD GPU. For NVIDIA, we don't know if that exists. We have heard people talking about it; I think Microsoft mentioned at some point that they might have been using CRIU in combination with NVIDIA, but nobody has talked to the CRIU upstream project, at least. So we are not aware of people doing it, but we kind of expect that people are using CRIU in combination with NVIDIA GPUs. So the next part is forensic container analysis and my demo. My demo is based on a container; I am using OpenHPC as a base. The container is a stateful container: it is calculating pi in memory, which we can hopefully later find in the checkpoint. So to create a checkpoint, there is a complicated way to do it. Currently, checkpoint/restore in Kubernetes is only a kubelet interface. Officially, the reason is that checkpoint/restore writes your container, every memory page, to disk. There is the potential risk that you now have private keys, random numbers, passwords all written to disk. The checkpoint is only readable by root, so the situation doesn't really change, because if you are root on a machine, you could also extract the memory. But for now, because it's not clear how to handle this, or how we want to continue in the Kubernetes community with this feature, it's a kubelet-only interface, and it looks like this. I've also written a kubectl interface, and it looks like this. It also creates the checkpoint archive; it's basically doing the same thing, just wiring all the calls through kubectl instead of going directly to the kubelet. So now we have a checkpoint, and there's a tool called checkpointctl, which was mainly developed by Google Summer of Code students this year. So we're very happy for the help they provided.
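The kubelet-only interface from the forensic checkpointing KEP looks roughly like this. The namespace, pod and container names match the demo, but the client certificate path is an example and varies per cluster setup:

```shell
# POST to the kubelet's checkpoint endpoint on the node running the pod:
#   /checkpoint/{namespace}/{pod}/{container}
# Authentication here reuses the node's kubelet client certificate (example path).
curl -sk -X POST \
  --cert /var/lib/kubelet/pki/kubelet-client-current.pem \
  --key  /var/lib/kubelet/pki/kubelet-client-current.pem \
  "https://localhost:10250/checkpoint/default/counters/counter"

# On success the kubelet writes the checkpoint archive to its checkpoints
# directory, typically somewhere under /var/lib/kubelet/checkpoints/
```

That tar archive, readable only by root, is the artifact the rest of the demo analyzes.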
In its simplest form, checkpointctl will give you, I'm just going to make the font a little bit smaller for a short time here, just some basic information about the container. I see the container's name, counter; it's based on that image; the ID, the runtime, when it was created, the engine, CRI-O. The checkpoint size is basically the size of all memory pages, and the root fs diff size is the size of all files which have changed compared to the base image. So let's unpack the checkpoint archive to see some details. It's just a tar archive, so it's easy to unpack. I'm just going to move this to the top again. There are a couple of files which were created by the container engine. So we have bind mounts: this is just some information that is necessary for restore, because we need to restore all the mounts from the outside of the container to the inside, and we need to know if each one is a file or a directory. The container engine doesn't want to remember whether it's a directory or a file, but we need it for the restore. config.dump has some information. dump.log has what CRIU tells us; in this case it doesn't matter, because it works. Then we have the rootfs-diff tar file: these are all the files which have changed compared to the base image we saw previously. The checkpoint directory is the one created by CRIU, so that has the actual process information. If we go there, this is the normal output of CRIU: most of these are protobuf files generated by CRIU. And CRIU comes with a tool called crit, the CRIU image tool, and it has a parameter, show, with which we can have a look at one of those files. Let's look at the UTS namespace information here. It basically just tells us the UTS namespace has the hostname "counter". But we can also look at a file called pstree. This is the process tree. With this one it starts to get difficult to understand what's going on, so I have a couple of commands prepared.
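The unpack-and-inspect steps can be sketched like this, assuming a checkpoint archive named checkpoint.tar; the exact CRIU image file names vary per checkpoint:

```shell
# The checkpoint archive is a plain tar file:
mkdir unpacked
tar xf checkpoint.tar -C unpacked

# Engine-level files: bind mounts info, config.dump, dump.log,
# rootfs-diff.tar, and the CRIU images in checkpoint/
ls unpacked
ls unpacked/checkpoint

# crit, the CRIU image tool, pretty-prints the protobuf images:
crit show unpacked/checkpoint/utsns-*.img   # UTS namespace: hostname etc.
crit show unpacked/checkpoint/pstree.img    # process tree, in-namespace PIDs
```

Everything below in the demo is just walking through those files with crit and standard tools.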
So with this one, I see we have four PIDs running in our container: 1, 41, 42, 43. It's important to know this is the view from inside of the PID namespace; CRIU always remembers the PIDs from within the PID namespace and tries to recreate those PIDs later. If I look at my process, which is hopefully still running, I can see here, it's not easy to read, but those are the four... where's my mouse? I don't know. Oh, there it is. You see, this is PID 1 of the container, and these are probably 41, 42, 43, I guess, and you can see they have other PIDs on the outside, because that's the view from outside of the PID namespace. So it's important, if you ever do an analysis of your checkpoint: it's always the PIDs from within the PID namespace. There's also, for each process, a file called core, with the core information about the process. Let's have a quick look at this one. It basically has the values of all the registers, floating point and much more, and at the end you see the policies and the name of the process. Using the name of the process, I can get a list of what processes are running inside of my container and what they do, and you see the first one is called bash login wrapper, then bash, the pi calculation and tee. And if I compare this with what's currently running, I don't know, that's the wrong command, with this here, again I see the bash login wrapper, bash, the Python code and the tee command. So looking at these files, I can find out everything about the processes. There's a lot of information in here, and if you're looking for something specific it might be difficult to find, but the information is there. There are additional files, for example the tmpfs-dev files; those are maybe also interesting. Let's have a look at those. Something like this is probably the right one.
And you see, this is the content of a tmpfs. Every tmpfs which is not bind-mounted from the host, which is native to the container, CRIU puts into the image; it's basically just a tar, so every tmpfs which was in your container is now also here, and you can find all the information here. This one looks like it was /dev. What else do we have here? Let's have a look. Yeah, I think that's okay. Earlier, I also wrote some secret data of mine into the memory pages, and I can actually find this information again here, in the pages files. The pages images are not protobuf files; they are raw dumps of the memory. This is all the memory which was written to disk, and I can find the information I wrote to memory here. So if I know what I'm looking for, it's easy; if I'm looking for a password, then I have to parse it all through and maybe find a useful string in there. But this is just to show you: you have access to all memory pages, and they are now all on disk, and they can be easily analyzed, or at least looked at. Okay. I also wrote a couple of files into my container; I mentioned this here, the rootfs-diff tar. Let's unpack that one. So now, this contains three files. These are all files which have changed compared to the base image of the container. And it's really simple: each file I created just contains the name of the file itself. But it's just to show you, if you want to look at content which has changed in the container, you will find it here in this rootfs-diff tar, which contains all the changed files. And if you think this is all too much work: I already mentioned checkpointctl before, and it has even more capabilities than what I've shown you. Most of the things I've done here manually, the tool, thanks to our Google Summer of Code students, can do at this point.
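Because the pages images are raw memory, ordinary binary-search tools are enough. Here is a small self-contained stand-in: a synthetic pages-1.img built from random bytes plus a planted secret, not a real CRIU image, just to show the technique:

```shell
# Build a fake "memory pages" file: binary noise around a known secret.
head -c 4096 /dev/urandom  > pages-1.img
printf 'MY_SECRET=hunter2\n' >> pages-1.img
head -c 4096 /dev/urandom >> pages-1.img

# If you know what you are looking for, grep the raw dump directly
# (-a forces text mode on binary data, -o prints only the match):
grep -a -o 'MY_SECRET=[^[:space:]]*' pages-1.img
# -> MY_SECRET=hunter2

# If you do not, dump every printable string and sift through them:
strings pages-1.img | head
```

With a real checkpoint you would point the same commands at the pages-*.img files in the unpacked checkpoint directory.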
So let's have a look at checkpointctl inspect; the $CP variable is basically pointing to the tar archive, so the tool is now unpacking the tar archive and giving us all the information. What we see here now is the information we saw before: some basic information about the image, where it was, how big the checkpoint size is. Then we see the CRIU dump statistics; this is basically the time CRIU needed to write the checkpoint to disk. You see how many memory pages were scanned to decide if they should be written to disk, and how many memory pages were actually written to disk. Then we see the full process command line. We see all the environment variables of all processes running in our container, and the next one has even more variables, and more and more, and at some point, I think it even contains the open files. Too many variables here. You see, now we see the open files: one has /dev/null open, and then two pipes, and then the working directory, and open sockets; you can even see that that's the socket I've been talking to. And then we get to the process tree, and then we see all the mounts we need; this is also important for restoring the process later. So I guess that's the end of my demo. checkpointctl was the tool I was using; I was using crit, the CRIU image tool, to have a look at the content of the images; and I was using grep to find my secret key in the memory pages.
One thing I didn't show: there's a tool in CRIU which converts the checkpoint images to core dump files, and then you can use gdb to look at them. It's basically the same, you see the registers and the call stack and things like this, and it might also be interesting for a couple of people. As to what's next, especially with a focus on Kubernetes: I've shown that I have a kubectl checkpoint command kind of working. That's an open pull request; it's not being actively discussed at this point, but it's there, so if somebody needs it, it can be easily used. Maybe the next step would be to integrate checkpointing of complete pods. I implemented this a couple of years ago, and it's pretty simple: we just loop over all containers in a pod, create some metadata for the pod, and then we can recreate it. So this is not a technical challenge; at this point it's mostly about how to get it into Kubernetes in a way which is sustainable and makes sense. And then maybe we'll have something like kubectl migrate, so we don't have to do it manually; maybe at some point the scheduler will decide, let's move this pod somewhere else. One more thing: the image format I'm using is currently just a tar file I came up with, but it's not a standard. containerd uses something else. I looked at the containerd format, and it's applicable to what I was looking at, but the problem was they were using internal protobuf structures, which I didn't think made sense in a public checkpoint format. In theory, checkpointing in containerd and restoring in CRI-O should not be a problem, but at this point we don't have a common image standard. I tried to start a discussion here, but unfortunately it also didn't continue. So with this I'm at the end. I showed you that CRIU can checkpoint containers; I haven't shown the restore part, but it works. It's integrated in different container runtimes, and it's used in production by different companies at this point. Use cases are things like reboot
into a new kernel with saved state, pre-initialized containers, multiple copies, container migration, spot instances, and AI training with support for GPUs there. And this is all available in Kubernetes under the forensic container checkpointing KEP, KEP-2008. So, I'm at the end, thank you. Any questions? Thank you. Oh, sorry. Sorry, please be quiet, we cannot hear the questions. You mentioned GPUs as something you can't handle; what are the other big resources that fail? So basically, CRIU cannot handle anything that's external to the kernel. InfiniBand is one which always comes up in high-performance computing. So everywhere you have state in additional hardware, you need some way to extract that state so you can later restore it. And if you try to checkpoint such a process, is that something that fails? Exactly, it fails. So currently, the people I've talked to are just interested in finding out if there has been an attack, or if an attack is ongoing, things like this. And then maybe at some point, if you have a couple of checkpoints, you could figure out, okay, this looks like an attack pattern, and maybe detect it automatically using checkpointing. That would maybe be something for the future, but finding a possible attack is one of the main motivations for the forensic use case. Thank you.
Introducing Incus
Hello. So, yeah, I'm Stéphane Graber. I'm the project leader for Linux Containers. And I'm just switching to the right screen here. There we go. And I'm one of the Incus maintainers. I was also the former project leader for LXD when I was working at Canonical. So, I'm going to go through a tiny bit of history first, and then get into what Incus is and what you can do with it. The LXC project itself was created way back in August 2008 by IBM. That's the original Linux containers runtime, and it has been used kind of everywhere, including in the original version of Docker and some other places. The Linux Containers organization was created back in September 2014, and the LXD project was announced by Canonical in November 2014. Then LXD went on for a while, until a lot of things happened in 2023. On July 4th, Canonical announced that LXD was going to be moved out of the Linux Containers community project and into the Canonical organization itself. The next day, we noticed that all non-Canonical maintainers had lost all privileges on the repository, so only Canonical employees were left maintaining it at that point. A few days later I left Canonical, so that happened. Then on August 1st, Aleksa Sarai, who was the openSUSE packager for LXD, decided to go ahead and fork LXD as a new community project called Incus. A few days after that, we made the decision to include Incus as part of the Linux Containers project, effectively giving it the spot that LXD once had. Incus 0.1 was released on October 7th, and we've had another four releases since then. Lastly, just as a bit of an early Christmas present, Canonical decided to re-license LXD to AGPL, as well as require everyone to sign a CLA to contribute to LXD. The consequence of that for us, as an Apache-2.0 project, is that we cannot look at anything happening in LXD anymore.
We can't take any changes from LXD anymore, so Incus is effectively a hard fork at this point. So, that's the history. Now, back to what this thing is actually all about. Incus is a system container and virtual machine manager. It's image-based, so you've got a pretty large selection of distros; there's going to be a whole slide about that a bit later. And it lets you, kind of cloud-like, immediately create instances from any of those images. The system container part means that we run full Linux distributions. We don't run application containers, we don't run OCI right now, we don't do any other kind of stuff. The containers are really a full Linux system that you then install packages into in the normal way. Everything is built around a REST API, with a pretty decent CLI tool. That REST API also has other clients; we'll go through that in a tiny bit. Incus has great support for resource limits, so you can pretty easily limit CPU, memory, disk and network I/O, whatever you want. It's also got extremely good device pass-through to both containers and virtual machines, so you can do things like passing through GPUs, attaching virtual TPMs, sharing your home directory, or doing a whole bunch of other kinds of sharing and passing devices into containers and virtual machines. It also supports all of the expected stuff: it does snapshots, it does backups, it's got a variety of networking options, a bunch of storage options, all of that. It can also create projects as a way to group a bunch of instances together. For authentication, we support OpenID Connect, which has kind of become the standard these days. And for authorization, we support OpenFGA, the open fine-grained access control project. That gets you, as the name implies, pretty fine-grained access control. There are also a number of web interfaces you can use on top of that.
So here you've got one of those, which is actually the LXD web interface; it runs perfectly fine on top of Incus. And yeah, that's one of the options there. As far as what you can run, well, there are a few options you can see up there. Incus is indeed all based around images. We build images for pretty much all of the major Linux distros, and even some of the not-so-major ones. And we build everything for both x86 and ARM. The vast majority of them are available as both containers and VMs; we've got a number of them that are just for containers. And then, because we do normal VMs, you can also run Windows, FreeBSD, whatever else you want inside of a virtual machine. All right. So let's do a first quick demo of the standalone Incus experience. If I switch over there, the first thing we'll do is just launch an Arch Linux container. There we go. So we've got that. Then let's do another one for Alpine, the Edge release. So just do that. And this is obviously at risk of blowing up at any point, because I'm on the FOSDEM Wi-Fi. Then for Ubuntu, I was planning on doing a VM, so let's do a VM instead of a container. You just tell it you want a VM instead; that's pretty much all there is to it. And with that running, we can see that the two containers have already started and got their IPs and everything. The VM is still booting up, so it hasn't got its IP yet. It does now. If you want to get into any of them, you can just exec any command. You can get a shell into Alpine. You can get a full bash inside of Arch. And you can do the exact same thing with the virtual machine, so you don't need to get a console and log in and everything. There's an agent automatically in our virtual machines, so you get to just immediately access them as if they were containers. That works really well. You can create snapshots: if you want a snapshot, you just do snapshot create on the Arch one. If you don't give it a name, it just picks one for you.
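The commands from this part of the demo look roughly like this; the instance names are my own choices, and the image aliases are examples of what the images: remote publishes:

```shell
# Launch two system containers and one VM; --vm is the only difference.
incus launch images:archlinux arch
incus launch images:alpine/edge alpine
incus launch images:ubuntu/22.04 u1 --vm

# List instances with their state and IPs:
incus list

# Exec works the same for containers and (via the built-in agent) VMs:
incus exec alpine -- sh
incus exec arch -- bash
incus exec u1 -- bash

# Snapshot an instance; with no name given, Incus picks one (snap0, snap1, ...):
incus snapshot create arch
```

The point of the agent is exactly what the talk describes: no console login step, a VM behaves like a container from the CLI's perspective.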
So we can see there's now a snapshot that we can restore or just keep around. There's also the ability to do automatic snapshots with a cron-type pattern, with automatic snapshot expiry; you can do all that kind of stuff. Now let's create a custom storage volume. So we'll just do storage volume create, default, and let's call it demo. And then we're going to add that as a device to, let's say, Arch. So just call it demo; it's a disk; it comes from the default storage pool; and the volume is called demo. Configure this. There. And I forgot to type add. There, this time with add. Now if we go inside of that VM, we see there's a new entry there, an empty volume. And then, hey, that's my home directory. So that's very nice and easy. It's automatically doing virtiofs, 9p, all that kind of stuff; it talks to the agent to trigger the mounts. Our goal is for virtual machines to feel like containers, as much as we can, and having that agent in there really makes that super easy. And for the last party trick of this demo, let's launch the openSUSE Tumbleweed desktop KDE image, and also tell it that I want to see the VGA console as soon as it starts. So when I do that, it actually gets me a second window, which I need to drag over here. And let's try to full-screen that thing. Maybe. Yeah, full screen doesn't work. Okay. But we can see it boot, and it's going to get us eventually into a KDE session. Not sure why the resize didn't work. Oh, okay. Maybe on the desktop? I saw a mouse pointer that was about the right size. Nope. Okay. So it is starting KDE there. So we even have some desktop images: we've got an Arch desktop image with GNOME, we've got Ubuntu with GNOME, and we've got openSUSE with KDE. We're not building too many more of them, mostly because they're actually very expensive to build in terms of resources: the build time, and distributing pretty large images. But it shows that this works.
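The custom-volume part of the demo can be sketched like this. The instance names, mount paths and device names are placeholders of mine:

```shell
# Create a custom volume named "demo" in the "default" storage pool:
incus storage volume create default demo

# Attach it to an instance as a disk device, mounted at a path of your choice:
incus config device add arch demo disk pool=default source=demo path=/mnt/demo

# Sharing a host directory into a VM or container works the same way,
# with source= pointing at the directory instead of a pool volume:
incus config device add u1 homedir disk source=$HOME path=/srv/host-home
```

For VMs, Incus handles the virtiofs/9p plumbing and asks the in-guest agent to perform the mount, which is why it shows up immediately without any manual steps inside the guest.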
And if you want to run your own, you can totally do that. All right. So let's go back to the slides. Come on. There we go. Other things you can do: this thing is effectively your own local tiny cloud, and it's all built on a REST API, which makes it very easy to integrate with other things. And other things here means some of the pretty usual tools you might be dealing with. So Terraform, or OpenTofu: you can integrate with that very easily, and we've got a provider we maintain ourselves that you get to use. Ansible has got a connection plugin that you can use to run any of your playbooks directly against virtual machines or containers. And if you want to build your own images as derivatives of ours, you can use Packer as a very easy way to take our images and inject whatever you want in there. There are a bunch of other tools. LXD especially had a lot of third-party tools that could integrate with it, and a bunch of those are now migrating over to Incus or supporting both, so that's a list that's growing very rapidly. Other things you can do: Incus exposes an OpenMetrics endpoint with details like the resource consumption and usage of all of the instances running on it, so you can integrate that with Prometheus to scrape that data and keep it on the side. It also supports streaming logging and audit events to Grafana Loki, so you effectively get your events and your metrics in the same spot, at which point you can use the dashboard that we've got in the Grafana store and get something like this up and running. So that's pretty convenient as well. If you don't like typing the name of your remote every time, you can switch to a remote: you just do a remote switch, at which point, if I do a list, it goes straight to that remote, and you don't need to type it every single time. That cluster is actually using a mix of local storage and remote storage.
So it's got Ceph for HDDs and SSDs, and it's got a local ZFS storage pool as well. And on the network side, it uses OVN. So it actually has all of that stuff in place. And if we look at the remote list from earlier, we can see that it uses OIDC for login, so it's also using the authentication bits I mentioned. Now, if you wanted to launch, say, a Debian 12 instance on that thing, you can do it in the perfectly normal way, and that's just going to instruct the cluster to go and do it. In this case, thankfully, it's running back home with very fast internet, so I don't need to wait for the FOSDEM Wi-Fi to download stuff for me. It's actually downloading the image and unpacking it, creating the volume, on Ceph in this case, and then starting the instance. I didn't even tell it where I wanted it, so it just picked wherever made sense. Which is actually funny, because if you use an image and you don't specify what architecture you want, you're going to get one of the architectures. In this case, I didn't tell it whether I wanted ARM or Intel; there was more capacity on ARM, so I got an ARM instance. We can go and check that easily. I know that the server it picked in that list is an ARM server, so if I go in here and look at the architecture, it's aarch64. All right. Let's just look at things here. I also wanted to show the dashboard. I'm just going to drag that particular window over. Where is it? It is here. I had it open; I've got way too many windows open on my laptop. Okay. So it's Grafana. It's loading. It's loading. And in this dashboard... okay, I'm just making sure it looks at the right cluster before I show it to you. So there we go. Yeah. So this is actually the dashboard for the cluster I was talking about, the one I was showing. It's looking at the demo project. So we can see the top offenders as far as resource usage and that kind of stuff. We can look at graphs for network, for storage.
And we can even drill down to specific instances and see what they've been doing. So you can expand an instance and go look at its usage. It also gets all of the events from Loki, so we can see the instance creation and any commands like that — that shell I got is actually right here — and any errors and such are all captured right there as well. So that's the metrics side of things. All right, so where do you get to run this thing? Well, quite a few distros have packages now for Incus, and, as I've mentioned, Debian and Ubuntu will have packages in their next stable release. We're also looking at doing a long-term support release for Incus itself. Right now you might see version numbers like 0.4, 0.5 and be a bit scared about that. You need to remember that this is a derivative of LXD, so one of our 0.x releases is just as stable, if not more stable, than a 5.x on the LXD side. We've just not gone past zero because we're waiting for the LTS of our other projects within Linux Containers, which we will do in March, and that's going to be the LTS of LXC, LXCFS and Incus all at the same time. We usually try to line up versions, so Incus is probably going to jump straight from 0.6 to 6.0 — that's what's going to happen with the LTS. As far as other features we're looking at adding: with the release of Linux 6.7 we now have bcachefs in the Linux kernel, and it's pretty interesting for us on the Incus side because it's very close to what ZFS or Btrfs do, which we already support. So we're looking at adding a bcachefs storage driver for people who want to start using that. On the cluster side, I mentioned that we support Ceph right now, which is a pretty good option but also a bit heavyweight. A bunch of people could instead do something different, whether that's using a shared NVMe-over-fabrics drive or some old Fibre Channel SAN they might have gotten on eBay or something like that.
So we're looking at adding distributed LVM as a storage driver, which effectively means that if you have multiple systems that can all see the exact same block device somehow, then you can use LVM on top of that, with a distributed lock manager on top, so that all of the different machines in the cluster get to use it. That kind of solves the "how do I use my old SAN at work" issue, but it can also work in some other cases — I think someone is looking at using it with DRBD, for example. We are also looking at adding OCI application container support. That's potentially a bit of a surprise for some folks, but we feel that these days the application container space has stabilized enough, and we've got enough users who, for some reason, are literally just running Docker inside of Incus to run a few specific applications, that this particular use case is something we could support natively. So we're not looking at competing with Kubernetes, with all of the service mesh and auto-distribution things — that's crazy stuff, they get to do that. But we would like it to be possible for you to run two or three small containers for your IoT software or whatever. That's what we're looking at doing there. And on the networking side, we're using OVN for distributed networking, which works pretty well, but we're also working on another feature of OVN called Interconnect, which allows for having multiple clusters and interconnecting them at the network level. So you can have instances on multiple networks, on multiple clusters, and then connect those together and interconnect them. And if you just want to try it, there's an online demo: you've got 30 minutes with Incus pre-installed in there to just take it for a ride, play with it for a bit, see if it's something that's interesting to you, and if it is, then you can go and install it for yourself. And that's it. We can try some questions. We've seen it's a bit difficult.
So please, everyone remain quiet if there are any questions, so we can try and hear them. Is there anything? Oh, you have it there. Okay. [Question about the differences compared to another product.] Compared to what, sorry? I didn't catch that part. Oh, VMware. Okay. Well, it's a lot cheaper. For anyone who's using VMware professionally and has followed the news recently — let's say your VMware bill is not great right now. So this is a viable alternative in many cases. It doesn't have all 50,000 components around it and all that kind of stuff, but if you are primarily using it as a way to get a cluster, create a bunch of VMs, maybe create some containers, and run whatever OS you want on there, it will do that just fine. So it's definitely an option there — at that point it's kind of in the same vein as a Proxmox or some of those other options; it will work just fine. As for the fact that it's not a distribution: you can install it on any system you want, it's obviously all open source, and yes, it is a pretty viable alternative. We do have a lot of people who are using VMware that are looking very closely at this as a potential way out of VMware right now. [Question, for a better understanding of terminology: what is the difference between a system container and an application container?] Yeah, so, a system container will run a full Linux distro. It will run systemd, it's going to have udev running, you'll be able to SSH into it, install packages, reboot it. It's really designed to be a stateful, long-running type of thing.
Whereas an application container is usually — ideally — a single process, or a process and some of its children. It's really more designed around delivering a specific application, and most often it's going to be quite stateless, with the idea that you can just nuke the thing and replace it at any point. They're two different concepts. Some people like the idea of having a system where they actually get to select what packages are installed, the exact config and so on, and some people prefer not to care about any of that and just have something pre-installed — and that's what an application container gets you. That's why having the ability to run some application containers directly on Incus alongside the system containers will, I think, be quite interesting: if, for a specific application, it's easier to just get the pre-made thing, you'll be able to do that while still being able to run everything else. Yep, so we do have a bash completion profile. I absolutely hate shell completion for some reason, so I don't have it on my machine and can't show you. [Question about whether application container runtimes can provide system containers.] Yeah, I mean, it is possible to get application container runtimes to give you a full system container — nothing prevents you from deciding that the application you run in the container is an init system. That's definitely possible. It's just not what they were really meant for, so it ends up feeling less polished, because that wasn't their goal. Things like being able to dynamically pass new files in, dynamically attach devices, get whatever number of shells you want, or interact with the outside world through a Unix socket inside of there — those kinds of things don't make much sense for application containers as originally conceived, and so some of those features will probably be lacking on that side.
I tend to — I mean, I was going to say, I usually like having one tool for the job, picking the right tool for the job. Effectively, if you really care about running a bunch of application containers, use one of the application container runtimes, whether that's Podman, Docker or one of the others. One thing that's actually interesting is that you can totally run Docker or Podman inside of an Incus container. So that works. You can run your normal Ubuntu, Debian or whatever distro inside of an Incus container, then install Docker or Podman in there and run some containers alongside whatever else you might be doing in that container. So that's something that works fine. I think we're probably out of time at this point. So thanks a lot, everyone. I'll probably be just outside for a tiny bit if anyone has more questions and things. But yeah, thanks a lot.
Using chroots in a single Linux Container as an alternative to docker-compose
All right. So next up we're going to have Aiden, who is going to be talking to us about multiple images in one container. All right. Ready? Okay. Hi, everyone. I'm Aiden McClelland. I work for a company called Start9. This project here is a little bit of a work in progress, but it is something we are trying out, because we have a somewhat less common use case for our containers and we decided to try something a little different. So first, some background. We develop an operating system called StartOS. The purpose of this operating system is to allow end users without technical expertise to run their own home servers. The idea is to bring the desktop experience to home server administration, so that we can bring a lot of these self-hosted applications to a wider variety of people, on their own hardware, without them having to learn everything you need to learn about Docker and the hosting tools that we're all familiar with. As part of this, we have a somewhat different use case than is generally intended for things like Kubernetes or Ansible or a lot of these tools that are designed for deploying corporate infrastructure at scale. We're really looking at a single host machine that the user wants to be very low-touch. They don't want to spend a lot of time configuring their applications at a granular level. So, you know, a lot of these applications come with these docker-compose setups, right? You have a main image that has your application code, and then you have things like databases and reverse proxies, etc. Commonly this is deployed as a docker-compose file, and what that does is create a bunch of containers that now have to be managed by the OS — and, by proxy, by the user. So what we've always tried to do with StartOS is maintain this idea of one container, one service.
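The docker-compose pattern being described — one main application image plus sidecar databases and reverse proxies — typically looks something like the fragment below. This is purely illustrative (generic image names, not a real StartOS or upstream artifact); it's the multi-container shape that StartOS collapses into a single container.

```yaml
# A typical upstream self-hosted app: three separately managed containers
services:
  app:
    image: example/app:latest     # main application code
    depends_on: [db]
  db:
    image: postgres:16            # off-the-shelf database sidecar
    volumes: ["db-data:/var/lib/postgresql/data"]
  proxy:
    image: nginx:stable           # off-the-shelf reverse proxy sidecar
    ports: ["443:443"]
volumes:
  db-data:
```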
And what this allows us to do is reduce a lot of the complexity of managing a bunch of different containers, and it also provides a single IP address and virtual interface on which the application is running. So when you're doing all of your network mapping, all of that can be mapped to a single virtual IP address, which can then be reached either from within the subnet on the device or exported through the host. This also means that you can define resource limits on a single-container basis, as opposed to having to manage a group of containers as a group — a cgroup with subgroups, right? A final reason we did this is our package maintainer scripts, which we prefer to run inside the contained environment; these package maintainer scripts are written in JavaScript. So we run a service manager in the container that reads the package maintainer scripts and is then able to set up all of our subcontainers — our sub-filesystems — from there, and execute our actual binaries. Okay, so the question is: why do people want multiple containers at all? Oftentimes you can take a single application image and install all of the software you might need, but in practice this is not as easy for the service developer. A lot of times we have people coming to us saying: hey, I want to be able to use an off-the-shelf Postgres image, I want to use an off-the-shelf Nginx image — I don't want to have to use the package manager of my container's distribution to install and manage that. So that's the number one use case we have. It also allows you to run applications from different distros — say you have one in Debian, one in Alpine — all together. And then the other reason you might want multiple containers is that you can isolate the subcomponents of an application from each other, and also set resource limits on individual application subcomponents.
If anybody has additional reasons why you might want separate containers as opposed to a single container for an application, I would love to hear them, but these are the reasons we came up with. So, our solution: we cover the first use case using chroots. Number two, as far as we can tell, works for the most part, but that remains to be teased out. This does not allow us to isolate the subcomponents of our application from each other, or to create resource limits on individual application subcomponents as easily; those will have to be managed by manual tuning of resource limits within the confines of the container. So, yeah, we've ultimately decided that those last two items aren't really necessary for our use case. Ultimately, a single application is where we define our sandbox. Sandboxing separate parts of an application from each other has some security benefit, but we've decided it isn't worth the complexity. So we decided to do this with LXC. Why LXC as opposed to something like Docker or Podman? LXC is a lot more composable. It allows us to pop the hood on a lot of the subcomponents of container technology and manage them more manually. We can, for example, easily manipulate the container rootfs at runtime: even with an unprivileged container, that container can communicate with the host and have its root filesystem modified very easily. We use rshared mount propagation for our rootfs, which allows the host operating system to easily manipulate that filesystem. And, unlike with some other container tools, you can perform operations like chroot and mount from inside an unprivileged container, which is not allowed by a lot of other technologies. So, to put together a service — an application — we effectively have a single rootfs image that all of our applications share.
This rootfs image is just a base image that we use for all of our containers — we use Alpine right now — and it loads a Node.js application that runs the package maintainer scripts and then launches the various actual daemons inside their chroots. It communicates with the host using a JSON-RPC API over a Unix domain socket. So there's bidirectional communication between the host and the service manager in the container, and it can then run the actual application code inside the chroots. The host API: what it does for the container is perform some manipulation of the container's root filesystem, and this allows creating overlaid images in much the same way you might create a container. All we do is create a rootfs image with an overlay filesystem and attach it to the container in a way that it can chroot into. We also have a bunch of other APIs these packages can interact with, mostly for integration with the end-user experience, and for integration with other services and applications on the host in a way the user might want to mediate. And then we have a set of APIs designed for hassle-free networking. If you have some application bound to a port, you can attach that port to a Tor address, to a clearnet address, or to just a LAN address so that it can be accessed from your local area network. The host OS manages all of the certificate handling, either through Let's Encrypt or through a host root CA for the LAN communication — because obviously you can't get a Let's Encrypt certificate for a .local address. Okay, so then the service itself runs a very basic API that receives commands from the host. When the application is running, it can receive an initialization command, it can start or stop the service, and it can shut down the service entirely in order to kill the container.
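The host-to-container channel described above — JSON-RPC over a Unix domain socket — can be sketched in a few lines of stdlib Python. The method name (`mount-overlay`), the newline-delimited framing, and the reply shape are all made up for illustration; this is not the actual StartOS API, just the transport pattern.

```python
import json
import os
import socket
import tempfile
import threading

SOCK = os.path.join(tempfile.mkdtemp(), "rpc.sock")

# "Host" side: answer one JSON-RPC request on a Unix domain socket.
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(SOCK)
srv.listen(1)

def serve_one():
    conn, _ = srv.accept()
    req = json.loads(conn.makefile().readline())
    # Pretend to mount an overlay image and return the attach path
    result = {"path": "/media/images/" + req["params"]["image-id"]}
    conn.sendall((json.dumps({"jsonrpc": "2.0", "id": req["id"],
                              "result": result}) + "\n").encode())
    conn.close()
    srv.close()

t = threading.Thread(target=serve_one)
t.start()

# "Container" side: ask the host to attach an overlay for an image.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(SOCK)
cli.sendall((json.dumps({"jsonrpc": "2.0", "id": 1,
                         "method": "mount-overlay",
                         "params": {"image-id": "postgres"}}) + "\n").encode())
reply = json.loads(cli.makefile().readline())
cli.close()
t.join()
print(reply["result"]["path"])   # prints "/media/images/postgres"
```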
And then it also invokes all of the various package maintainer scripts, such as editing user configuration, installing the service, or updating the service — all of those run various package maintainer scripts that get called from the host. Okay, so when we actually launch a binary, the package developer defines in some JavaScript — we have some well-typed TypeScript APIs for this — what binaries to launch, what image to launch each binary in, and where to mount its persistence volumes. So we have a series of persistence volumes that are mounted to the container and can be attached to any path within these sub-filesystems, and then it defines any environment variables or arguments — any standard way that you would launch a program. And then for each command you have — similar to how you would define a systemd service file — you can define all of these arguments, plus any dependencies or health checks associated with your service. For each of these commands, the in-container service manager will mount an overlaid image for the requested image ID into the container. It will then take our special directories — /proc, /sys, /dev, and /run — and bind-mount them inside the chroot, so all of the chroots share the same /proc, /sys, /dev, and /run. And then it will run the command in the chroot. Okay, so here is an example of a package maintainer script. I don't know if that's actually visible to everyone. Are you able to see that? Okay. Well, I suppose I can just talk about it. Effectively, you have a fairly simple JSON configuration where you define your image ID, your command, your arguments, and then some health checks defining when this thing is ready, as well as some dependencies. So if you don't want to launch a particular daemon until another service is ready, you can just specify that, and it won't launch until the other service's health check passes.
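A configuration in the shape just described might look roughly like the fragment below. All field names and values here are illustrative guesses reconstructed from the spoken description, not the actual StartOS schema — check the linked repository for the real format.

```json
{
  "postgres": {
    "image": "postgres",
    "command": ["postgres", "-D", "/var/lib/postgresql/data"],
    "mounts": { "db-data": "/var/lib/postgresql/data" },
    "env": { "POSTGRES_USER": "nextcloud" },
    "health-checks": [{ "name": "ready", "command": ["pg_isready"] }]
  },
  "main": {
    "image": "nextcloud",
    "command": ["nginx", "-g", "daemon off;"],
    "depends-on": ["postgres"]
  }
}
```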
So all of this is available on GitHub if you want to check it out. This particular example is in GitHub's start9labs/hello-world-startos. There should be a link on the talk page. So, time to do a little demo of what I have working so far. Let's see if I can get my shells over here. All right. So here I have an instance running — hold on. There we go. Here I have an instance running StartOS. I've already installed a package; in this case it's Nextcloud. This Nextcloud package contains two images. It's got the Nextcloud base image, which also contains the Nginx server, because it's running the PHP for Nextcloud. And then we have Postgres, which is our database persistence layer for Nextcloud. So what we're going to do: we've attached into this container, and I'm going to go ahead and basically run a REPL inside the JavaScript engine here. And I'm going to do my imports here as well. What this has done is connect us to our JSON-RPC APIs, both from the host into the container and from the container into the host. Then we're going to create a couple of overlay images. First we'll do our Postgres image. What this is going to do is tell the host: hey, I want to mount this Postgres image into the container. It says: okay, here you go, here's the path at which I have attached it. I'll do the same thing for the main image. And there we are. I'll go ahead and define a couple of environment variables. Okay. So I have a set of temporary hacks that I've put in, which will later be managed by the actual container service manager, mainly around permissions of the container — I still need to get shiftfs working properly. Because what LXC does is map the UIDs within the unprivileged container to UIDs on the host, and so when we mount stuff into the container, we also need to perform that same mapping.
So we're not doing that yet, but I have a set of ownership changes that will manage it. And then all we have to do is launch our application. So I'll launch Postgres first. And here we go: we have Postgres running inside a chroot, inside the container. And it looks like it's ready. And now I can also launch — next slide. So here we have both of these applications running within the same process namespace, the same cgroup, the same container — but they're running from completely separate images. And that's all I have to show you. I think we can open up for Q&A. Thank you. [Question about nesting containers inside the container for sandboxing.] So, we have considered the idea. Right now we actually haven't found it necessary — the chroot seems to be sufficient for the sandboxing we need to do. As far as we can tell, the technology is at a point where it wouldn't be too difficult to do containers-in-containers, but realistically we haven't found it necessary. That's all. [Question about distribution.] So I think you're asking, as a package developer, how you distribute your application. If you have a service that you want to distribute to our users — to people running StartOS — the company Start9 runs a marketplace, but we just have a very standardized package format, and a package in this format you could host on any website. If you want to charge for it, you can charge for it. Ultimately the APIs are generic enough that you can run your own marketplace to offer whatever services you want, using whatever protocols you'd like to gate access to those s9pks. As a service developer, in general, if you're publishing to our official registry, that means you have a free and open source project that you're looking to distribute for free — but that does not stop you from running your own paid marketplace. One more question. I'm sorry, I couldn't hear that. How are resources for our application managed?
Yeah, so the resources are managed at the scale of the entire application, using the configuration of the outer LXC container that everything runs inside of. So you can just modify that LXC config — well, we modify that LXC config automatically based off of the host APIs. Thank you.
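Limits on the outer LXC container like the ones described in that answer would be ordinary LXC config keys. A sketch, assuming cgroup v2 and LXC 4+ (values are illustrative; see lxc.container.conf(5) for the authoritative key names):

```
# In the container's LXC config — limits apply to the whole application,
# i.e. every chrooted subcomponent inside the one container
lxc.cgroup2.memory.max = 2G
lxc.cgroup2.cpu.max = 200000 100000   # quota/period: at most 2 CPUs' worth
lxc.cgroup2.pids.max = 4096
```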
Soft Reboot: keep your containers running while your image-based Linux host gets updated
Welcome everyone to our next session. Thank you very much. Hello. Good afternoon. My name is Luca. By day I work as a software engineer in the Linux Systems Group at Microsoft, where I am responsible for the operating system that runs on the Azure infrastructure. By night I am involved in various open source projects: I'm a maintainer in systemd, a Debian developer, the DPDK LTS maintainer, and a bunch of other stuff that I consistently forget about. So I'm going to talk to you about this new feature we added in systemd in the middle of last year, called soft-reboot. And yes, it's a new type of reboot. We're going to look at how it's implemented first, and in the second part of the talk we're going to look at two demos showing it running and how it can work with containers. If you were at All Systems Go!, you probably saw the first half of the talk, while the second half is new. So first of all: why? Why do we want a new type of reboot — don't we have enough already? The answer, of course, is performance. Rebooting means that if you have some services running on your system providing some functionality, during that window of time they are interrupted, and people don't like interruptions. That is the main motivation for this. I also know that there are some update systems that require double reboots — I've been told, for example, that DNF offline upgrades require double reboots — so by shortening the time this takes we can save something there as well. But the main use case is the first one: avoiding interruptions. When you go from a reboot to a kexec, you save time because you cut away the time it takes to reset the firmware and the hardware. The next obvious step was to cut away the kernel's time as well: if the kernel is not being updated, you don't need to reboot it and redo all the device initialization and everything else. So we came up with the idea of soft-reboot, and this is what it does.
It just reboots the userspace portion of your Linux system. Again, the goal is to minimize disruption as much as possible. This pairs very well with image-based Linux — we've been talking about image-based Linux systems for a couple of years now. It works very well because in such a system you have a single root filesystem, which is usually read-only, and then you have a UKI, where your kernel and initrd are, and these are distinct components that are usually updated independently. So with a soft-reboot, when you don't update your kernel, you can update just your rootfs. This also pairs very nicely with kernel live patching: on production systems you can fix bugs in your kernel without rebooting by using kernel live patching, and soft-reboot complements that, because you can use it to update the userspace portion of your image when you have bugs or security problems or whatever. Again, we are replacing the entire userspace atomically and moving into a new root filesystem. It's not only for image-based systems, though. This can be used for package-based OSes too, because, for example, you cannot restart the D-Bus daemon — or broker — on a Linux system; your system will explode if you do that. So by doing a soft-reboot you can save some time when your D-Bus has some security problem that needs to be fixed or whatnot. So let's look at how it is implemented. As far as the kernel is concerned, nothing is happening. Everything is business as usual; it doesn't see anything — it's all the same session, the same boot. So, for example, we still have some problems to solve, some papercuts: if you do journalctl --boot=-1 you will not see the previous soft-reboot, you see the previous full reboot. We have ideas to fix this on the to-do list, but it's one of the few papercuts left to solve. Now, as far as userspace is concerned, everything goes away. It's a normal shutdown, so systemd goes through the usual phases.
It starts a shutdown target — a soft-reboot target that conflicts with everything else — so all the services get stopped. And then, instead of giving control back to the kernel with a reboot syscall, systemd just re-executes itself into the new root filesystem, bypassing the full reboot. You can do this in place, so your soft-reboot happens in the same root filesystem, or you can prepare the new root filesystem ahead of time, in /run/nextroot. We allow this because preparing the new root filesystem, positioning all the mounts and whatnot, usually takes some time. So you can do it ahead of time, without having to interrupt all the services by doing it inline: you prepare your next rootfs in /run/nextroot and then call the soft-reboot, so that you transition very quickly to the next rootfs. And you can also prepare any additional storage you have — if you have an encrypted partition for /var, for example, you can set it up ahead of time so you don't need to redo the decryption steps, which again take some time and maybe require user interaction, maybe access to the TPM or whatnot. And again, the kernel stays the same, so no configuration changes there. So in systemd 254 we added a new verb, systemctl soft-reboot, to do this, plus the equivalent D-Bus API, and in the next version we also added some new signals that tell you that a shutdown is happening and that it's of type soft-reboot. So we are cutting time away from the reboot — is that all we can do with this? Not quite; we can go further. Given that systemd doesn't exit — it re-executes itself — you can carry over any state you want across the soft-reboot. For example, the file descriptor store, if you're not aware of it, is a way to store file descriptors inside PID 1, which then gives them back to your service when it starts. And by the way, all the links on the slides point to documentation; I will put the slides online.
But basically your service can say: hey, I have an active TCP connection, take the fd for me and keep it there. Then your service goes down, the soft-reboot happens, you come back, and you get the TCP connection back and can pick up from where you left off. Because the kernel just stays running, the connection is not interrupted — it's just buffered. There's some delay, of course, but it doesn't have to be re-established, for example. And it's not just sockets: you can use this with a memfd, for example — any buffer, any state that is expensive to calculate, you can store in a memfd and get back immediately. You can do this for the network stack too: in networkd we have options so that when it goes down, it leaves the interfaces configured, and when you come back after the soft-reboot, in the new filesystem, you don't have to reconfigure your network interfaces, which again can be a bit slow. And finally, we transition /run — which is a tmpfs — across, so that if services have state in /run, they find it again when they come back. This is not recursive, though, and /tmp is reset completely, because that's a scratch area. So by doing all this we can accelerate the time that services need to get back to fully functional after a soft-reboot. But is that all we can do? And what the hell does any of this have to do with containers — this is the containers devroom, after all. So here's an idea: some payloads are completely independent of your rootfs — containers, for example, but also portable services. If you don't know what a portable service is, I suggest you check them out; they're awesome. They're a way to attach a system service to your OS that runs from a different root filesystem — it comes with its own image, but it's fully integrated with your system services. It's quite cool. And this applies to those, but not only those: these services, these containers, these payloads are independent of the root filesystem.
So, can we let them keep running during this soft-reboot process? The answer is: well, yes, why not. The configuration for that is a bit complex — it's linked there; I won't show it here, we'll see it in a demo later. Basically you can configure a system service so that systemd will not kill it or stop it when the soft-reboot happens, so the service keeps running while the rootfs is updated under it. The network stays accessible — we keep it up, the kernel doesn't go away — and the same goes for the devices and the disks. So for this kind of payload we go from some interruption to zero interruption, which is quite nice. Of course there's a catch — there's always a catch: these payloads really need to have nothing to do with the root filesystem, because if you keep any file descriptor open to the old root filesystem, you will keep those resources pinned and they will never be freed, so you use more memory or whatever else. So you need to make sure they are disconnected. Also, other parts of the OS are going away — for example, D-Bus. The documentation there shows that you need to change the way you use the bus — via the sd-bus library, for example — to automatically reconnect when it comes back up. That's usually not done, because the bus normally never goes away, but if you have one of these payloads that survives the soft-reboot, you need to change how you use it. It's very simple, and it's described in the documentation there. Now, one thing I will look at in the near future is whether, when we have actual bind mounts from the host into the services, we can automatically refresh them after a soft-reboot. I'm halfway through that; it's not all done yet. So let's see this happening with Podman. Now, because I am a coward, I'm not doing a live demo — I'm showing a recording. This is a Debian image — Debian testing — and it's running Podman, some version. And Podman has this thing called Quadlet, where it generates
systemd services for your containers. Now, this is not exactly what Podman generates; it's a bit different, as is most stuff here, and we'll see what that is in a moment. You can see down here it runs a very important production use case of sleep infinity; that's a typical production use case, everybody uses it. But to show what the actual difference is, because this is a demo I put together, I am not a Podman developer or user; I thought it was cool to make it work and I hacked it together a bit. So Podman gives you some systemd service, I changed it, and I'll show you the diff here. These settings up here are necessary to make the container service survive the soft reboot. This is a bit of a hack, and if this were supported by Podman natively it would have to be solved in a better way, but basically these lines tie the container to the root file system, to the /var directory. So I have to comment that out, so that they are not tied together and it doesn't get shut down. And then there are four more things down here that are suspicious, and we'll see what they are in a moment. Now, which is simple to explain: if I start this container, this sleep service, it takes a second because it downloads the image in the background, and there are some complaints that we don't care about. Now, the way Podman works when you run it as part of a systemd service, it correctly creates some sub-cgroups: there is the payload cgroup node, and then there is an additional sidecar control service, conmon, that runs as part of the same unit, in a cgroup dedicated to Podman. Now, the reason for these settings here is that this conmon binary comes from the root file system. So we need to make sure... if we just do this, it will keep the root file system pinned, and we don't want that. So my hack to make the demo work is that the service actually runs on a different root image. So it's another Debian image with Podman inside.
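The unit settings referred to above, the ones that keep a service alive across the soft reboot, look roughly like the following drop-in. This is a sketch based on the recipe in the systemd-soft-reboot(8) documentation, not the exact file from the demo; check the options against your systemd version (SurviveFinalKillSignal= needs v255+).

```ini
[Unit]
# Don't pull in the usual shutdown ordering and dependencies:
DefaultDependencies=no
# Don't stop the unit when the manager isolates to another target:
IgnoreOnIsolate=yes
# Still stop cleanly on a *real* reboot or poweroff:
Conflicts=reboot.target kexec.target poweroff.target halt.target
Before=shutdown.target
# Spare the unit's processes from the final SIGTERM/SIGKILL spree:
SurviveFinalKillSignal=yes
```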
So this binary, the Podman binary that runs, comes from this image, not from the host system; that way they are independent and not tied together. And then we disconnect a couple of things. So now we have that prepared, and there's another thing: you saw the two cgroups there. Now, the way systemd marks a cgroup for survival of the soft reboot is by setting this extended attribute here. But because Podman gets a delegation for this cgroup, which is the right thing to do, we don't touch the children: we do not set the extended attribute automatically for these two payloads, and if Podman wanted to support this natively, it would have to do that when it sets up the cgroups. Now, of course, again, this is hacked together, so I'm doing that by hand, just setting the extended attribute on the cgroups there, so that systemd won't kill these processes while they are running. And now we can finally type soft-reboot, and we see all the userspace going away. And then shortly thereafter we come back and we get a shell, and then we check; there are some errors in the SSH session that we don't care about, so just ignore them, I was too lazy to hide them. And then we can see that the sleep is still running, and the control monitor as well, and it's the same PID, the same processes. The containers kept running while we shut down all this stuff. All the system services have been shut down and restarted, but the container just kept going without interruption. So yeah, again, this was very quickly put together. I am not a Podman developer; if the Podman developers are interested in supporting this, or maybe the LXD developers, I'm happy to help them. But this is a hacked-together demo; I have another one which I think is a bit more interesting. So Azure Boost, if you're not familiar with it, is an offload card that is installed in every Azure node: the Azure nodes that run your virtual machines have this arm64 offloading card that runs the operating system that I work on.
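Marking the delegated sub-cgroups by hand, as in the demo, comes down to setting an extended attribute on the cgroup directories. A sketch with an invented unit name; the attribute name below is what systemd's SurviveFinalKillSignal= support uses internally, so verify it against your systemd version before relying on it, and note that the sub-cgroup names Podman creates will differ on your system.

```shell
# Resolve a system unit's cgroup directory (assumes cgroup2 mounted at
# /sys/fs/cgroup and a unit placed in the default system slice):
unit_cgroup() {
    printf '/sys/fs/cgroup/system.slice/%s' "$1"
}

# Then mark each delegated sub-cgroup (payload and conmon) so PID 1 spares
# its processes in the final kill spree, e.g. (hypothetical child names):
#   setfattr -n user.survive_final_kill_signal -v 1 "$(unit_cgroup mysleep.service)/payload"
#   setfattr -n user.survive_final_kill_signal -v 1 "$(unit_cgroup mysleep.service)/supervisor"
```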
It's called Azure Boost, and I'm showing here a demo of this, recorded in production on an Azure Boost node; it pauses for a second now. We recorded this; my colleague Maya, to whom my thanks go, recorded it a month ago for some executives, and then I asked: hey, can I show this in public at a conference? This has never been shown before, it was internal-only Microsoft stuff, super secret, and surprisingly they said yes. So now I have to do it. I had to unfortunately blank out the host names, because this is a real node somewhere in the fleet in the US, and I couldn't show the host name which identifies the node, so you will see these blanked-out bits; I apologize for that, but I had to hide those. So, what we are showing here, let's start it going again: in Azure we are running this Microsoft operating system; it is just a machine, it's arm, some version of the kernel, 5.10. We have what we call agents; these are containers running as portable services. Some of these are critical for the actual customer VMs: if they go away, the network is interrupted, you cannot make new connections. The netagent is the critical one: if it goes away, the network goes away. The other one is a local agent that does some local service, so it doesn't matter. So we configure the first one, that portable service, to survive the soft reboot, and the second one we just let go away and disappear. Now we attach the update agent that does the soft reboot; you can see the portable service is just attached as a new image, so we are moving to a new image here. In the background there you can see the serial console going away. Now we switch to a new SSH connection, because of course SSH is not a critical service, so it went away; it will come back up in a second.
And we reconnect, and we check and compare the OS versions before and after, the kernel version before and after, and check on the status of these containers, and see that they are actually running again. So yes, the version ends in 03, and it was 01 before, so we did update the rootfs. It's always a read-only, dm-verity rootfs, so we updated it as one block. The kernel is the same; I didn't cheat and do a real reboot there; the kernel is exactly the same, same build and everything. So let's check on how these containers are doing. And we can see, this is the critical one, the netagent, and we compare the PIDs before and after: they are the same, the same processes, the same PIDs. It kept running through the soft reboot while we changed the dm-verity rootfs image underneath it. The other one was restarted, because it's just a non-critical service, so we let it be restarted. So yes, this is it for the demo; I hope this was interesting, this sneak peek at the Azure production machines running down in the fleet. And we have five minutes for questions. Any questions? ... So, checkpoint/restore: we don't, and that's a very different thing, right? Checkpoint/restore gives you an interruption of service; this doesn't. You checkpoint and then you come back to the same state of the process, but you still have an interruption while you do your update. This is different: this aims to let us update the root file system with zero interruption for these payloads, so it's a bit different, and we don't have plans for that at the moment. These are quite complex payloads, so we'd have to have a look into CRIU and all that, I think. Any other questions? ... No questions, everything clear? I don't believe that... there you go, there we go. I know that guy. So: excellent question. The demo was recorded in production with a custom image loaded.
Thank you. The demo we showed was on a production node, with a custom image with the new feature. We are deploying this sometime this year, so it's not yet deployed at scale; we will see, I'm sure it will explode in horrible ways. But for now, the main thing we found was D-Bus: reconnecting to D-Bus was the main thing that broke the services, but it's easy to fix; that was the main thing so far. Other questions? ... I can't hear, shout. Shout, or use the microphone. Yes, so that is on the local system, I showed it before; you need to prepare them ahead of time. ... It can work, it can work. Thank you.
Juggling with UIDs and GIDs: rootless container deployment with Ansible
This will be a really short talk, just a few minutes of your attention, not much. And it's going to be an easy thing that probably some people already do, but I'm not, like, the container guy when it comes to containers. So I think this is kind of an interesting thing for all the people who, like me, try to play with containers in a home setup, mostly. So the motivation for this talk is basically that I have a server and I am experimenting with rootless containers at home. I am automating it with Ansible, which is not overkill; I even copy my dotfiles with Ansible, just because, you know. And I like to learn by breaking stuff. So why not break stuff at home? So this talk is about, again, my personal setup. I will share all the code later on, but just to introduce my home server: it's called Morla. Does anybody know who Morla is? 3, 2, 1: Morla is this beautiful creature. And it really resembles my... (There's a problem with your slides. Oh my God. Maybe you can just reconnect it. Let's go.) It's not that I only have Morla. Yeah. And it remembers everything; it's like a NAS, I store all the things there. It's heavy and dusty like my home, and it's a turtle, like my home server, which is not a turtle, but whatever. So, first setup: I used Portainer and Docker Compose. It's really convenient, but it had some issues that I'm discussing right now. If you use Portainer and Docker Compose, you're my friend, but I just don't use it anymore. And why is that? Well, it's easy to install on some machines, like on OpenMediaVault, which I was running before. Is it rootless? Well, I don't know if it's rootless right now, but it wasn't at the time. It only supported Docker, and I didn't like giving root privileges around, and it's heavily dependent on a GUI. So these are the three things that I will try to resolve by just using Podman and Ansible. And so this is Portainer, if you don't know it.
Again, very convenient, but I don't need this GUI. And LinuxServer.io images, if you're familiar with them, are really simple, and they all have this kind of workaround for running services that you're not supposed to run as root inside the container, because, I mean, you're never really sure that things can't get out of the container. To work around this, they implement this feature where you just specify your UID and your GID for inside the container, so that when you run it as root, you're good to go, because outside you will have that user. But that's not the case with Podman, right? So yeah, this is what happens with Podman. Basically, if you run that configuration, this one is for Piwigo, it's for sharing pictures: I want to be able to mount my volumes in the container using the same users as outside. I don't want some weird namespace mapping, because then it will be inconvenient if I just drop stuff in and out. So again, this would be really easy to solve if you just ran stuff inside as root, but, again, that's not the case for all the images. So what you could do, of course: if you want to touch a file in some volume that you configured, it will give you permission denied. So you just do podman unshare, you do what you have to do, and then you have your file there. For me, again, it's inconvenient, because I store media files there and I just want to drag them in and out quickly, and I don't want to do unshare all the time. So again, this is what we are facing. Basically, the red part is what we care about. When we have a non-privileged user outside and a non-privileged user inside, to make it clear, when you run any command, the unprivileged user inside will be remapped. I'm not sure that... like, I am not familiar enough to talk about all the details. So this was just to explain to myself how things work.
So, to make it even more clear: user inside and user outside, both unprivileged. When you look at the user's process, you will see that outside it's somebody like 10098; I don't know who you are. Well, if you don't have to manage stuff, it works, but if you want to take stuff in and out, then you have to deal with that. So what do you do? This is what you do, and this is why I call it juggling. Because basically, the way Podman handles root and non-root mapping, it allows you to remap the user IDs and the group IDs, so it works both for groups and users, inside and outside the container. And it's a kind of complicated-ish syntax. But what basically happens is that you take the UID that you are running in the container, which could be whatever UID, a fake UID they even call it, and you remap it on the host. So now comes the reason why I did this presentation, which is juggling. And as you see, you have to run this mapping three times, because you're dealing with a user ID inside the container, a fake user ID, and a user ID outside the container, and you really want to map them to one another the correct way. So there we go, let's run the first command. OK. Don't get lost; this is faster than the eye. You can bet now on where the UID is. You couldn't have guessed. OK, but anyway, back to serious stuff. The real example is this one: I am running all my LinuxServer.io images with a fake user ID, which is 911, because, help me in case of emergency, I guess. 911 is the fake user ID that goes inside, and I actually want to fake it as if root were running inside; that's the same result I want to obtain, so that outside I actually have my normal user ID. If you actually check the uid_map inside the container, which is where the UIDs are defined, how they are mapped inside and outside, you will see that it's taking my ranges, as I'm specifying. Please don't ask me for many details about it.
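The three-way juggling can be written down as arithmetic: for a fake in-container UID F and a subordinate-UID allocation of size N, you emit three ranges. A sketch (the 65536 range size is the common /etc/subuid default, not something Podman mandates):

```shell
fake_uid=911     # UID the image runs as inside the container
range=65536      # size of your /etc/subuid allocation

# container UIDs 0..910     -> subuid offsets 1..911
map1="0:1:${fake_uid}"
# container UID 911         -> intermediate UID 0, i.e. YOUR host user
map2="${fake_uid}:0:1"
# container UIDs 912..65535 -> subuid offsets 912.. (everything left over)
map3="$((fake_uid + 1)):$((fake_uid + 1)):$((range - fake_uid - 1))"

echo "podman run --uidmap $map1 --uidmap $map2 --uidmap $map3 ..."
```

The same three ranges are repeated with `--gidmap` for the group side, which is where the "run it three times" juggling in the talk comes from.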
We don't have time, but we can talk about it later. And then the result is that when you mount stuff, you have it configured for your user. And there's another convenient option, but I don't have it on my server: if you just use one option it's easier, but I don't use it on my server, so you can either juggle or just use this one-liner, and it will do just fine. OK. And then, about Ansible. Running this command all the time is quite inconvenient; if you have many, many containers, it's hard to maintain, or it's a lot of scripting. And plus, I think that Ansible goes really well with Podman containers, because, first, the module is great, and second, it really gives you much more control over, for example, config files and templates that you need to put in the container when you boot it, and so on. And what I do with Ansible, and I advise you to take a look if you've never tried it, is kind of cool: you can have your configuration and just store everything in one main configuration file that holds all the necessary variables: ports to expose, volumes to mount, all the users that you want to juggle with, whatever. And then you can basically copy-paste what is a generic container configuration or setup, if you do things right. Let's take a look. For example, this is, again, the same configuration. OK, I just define what the main name of the container is, what I want the display name to be, where to pull the image from, ports, a database, for example, because I'm running pods with that. And it's basically all just copy-paste, as you were doing with Docker Compose files, but this way you're actually controlling much more of what's happening, because you can also control configuration files, setup, mounting stuff here and there, and so on, you name it. And of course, well, here I was highlighting random stuff in case you're interested. Yeah, this is the volume configuration.
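As a concrete illustration of the variable-driven setup described above, here is roughly what one such task could look like with the containers.podman collection. Everything here is my own sketch, not the speaker's actual code: the variable names are invented, and the module options should be double-checked against the containers.podman.podman_container documentation.

```yaml
- name: "Run {{ app_name }} rootless with remapped IDs"
  containers.podman.podman_container:
    name: "{{ app_name }}"
    image: "{{ app_image }}"
    state: started
    publish: "{{ app_ports }}"            # e.g. ["8080:80"]
    volume:
      - "{{ app_data }}:/config:Z"        # :Z for SELinux labeling
    uidmap:                               # the same juggling as on the CLI
      - "0:1:{{ fake_uid }}"
      - "{{ fake_uid }}:0:1"
      - "{{ fake_uid + 1 }}:{{ fake_uid + 1 }}:{{ 65536 - fake_uid - 1 }}"
```

Because the task is driven entirely by variables, adding another container is a matter of defining a new variable block and reusing the same role, which is the copy-paste convenience the talk describes.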
Don't forget to put the capital Z if you are in an SELinux environment, of course. And let's go to the setup now. So this is my control panel, by the way; it's really convenient. You can just specify variables to say: run this, this, this, and that container, and it will fill in the names, and it will just work. So it's just simpler, in my opinion. And then, if I'm not lost... oh yeah, I wanted to show you the setup as well. So, no more scripting, right? If I need to create a config directory, I can just create it with Ansible, right? And it's really nice that also in the container creation, this is the same juggling that we did before: the fake ID is juggled with ID 0 and range 1, et cetera, et cetera. And this is all predefined in the setup files that Ansible can provide you. So I'll just finish up, because I have a few minutes left. Oh, this is also very convenient for containers: if you're booting many, many containers with Ansible, it's good, because it will just say: OK, things are good; this one I don't need to touch, because it's already there. Again, it was a big thing for me; you may be familiar with that, but if you're not, try it out. And in the end, no mistakes. Of course, I'm also not doing a demo, because it would break. And very quickly: tags are also great for managing containers; I think they go very well together. You can just say: set up this, this, and that, and tag all the files that you need, and then everything just goes up. Takeaways: go rootless, automate the stuff, and try to overcomplicate things at all times, in order to oversimplify things. Thank you. That is my presentation.
What's new in Containerd 2.0!
Alright, let's get started. I am unmuted, yes. So yeah, this will be fairly quick, just an update on containerd. You're either here because you're interested in containerd, or because it's too hard to change dev rooms and so you're just going to sit here and hear about containerd; hopefully you're somewhat interested. I was having a bit of FOSDEM nostalgia, like 2018, talking about the first year and a half of containerd, getting to 1.0. So now we're on the cusp of our 2.0 release, our first major version bump since we started the project. First, just a few stats in case you're unaware: containerd adoption has been growing a lot. Some of that's probably due to the Docker shim deprecation in Kubernetes. This is from Datadog's annual report; the CNCF and Sysdig also put out reports. They all come out with different numbers, so believe whichever one; this one was positive for containerd, so I used it. You can probably find another one. Maybe more important to the project is actual community growth: people actually contributing, getting involved in the project, becoming maintainers. This is a crazy eye chart from the CNCF; you can see Kubernetes way up there at the top. Again, there's some magic math being done here about how many PRs and issues are flowing through your project and how many people are contributing, and it comes out to containerd being in the top 15 or so projects. One of the cool things is that we've had a lot of, I think this captures the last nine months, new maintainers, reviewers, and committers from many different companies, some independents; so that's awesome to see as well. The cloud providers you might be using use containerd underneath their Kubernetes services, and some other projects do as well. The thing I wanted to focus on is one of the reasons I think containerd continues to grow as a project: we've built in extensibility in different directions.
I'll talk about three main directions in which containerd is extensible, or how you can build around it. One is on the client end, and one of the newest representatives of that is nerdctl, written by one of our maintainers, Akihiro Suda, who you've probably heard of because he's written a hundred different projects in the container space, and any time you use rootless containers, it's probably because Akihiro started that work many years ago. So nerdctl now gives you kind of a Docker command line for containerd. The other way we're extensible is in snapshotters; if you remember Docker's graph drivers, these are the way your container's file system is actually stored, and overlay is obviously a very common one that many of the container runtimes use. But we've actually made this extensible: we have built-in ones, which I'll talk about, but you're also able to extend that with a remote snapshotter, and that's an area where we see a lot of growth, where people are writing their own snapshotters for their own unique use cases. Then, directly below containerd, is this layer we call the shim layer, which drives an actual OS-level runtime. Obviously many of you have heard of runc or crun; that's the common Linux adapter, if you will, that drives the set of syscalls you need to namespace your process. But the containerd shim API, again, is extensible, and there are many different shims available, and we'll talk through those. So those are three directions. There are also some other pluggable interfaces that I don't have time to get into today, but these are all ways that, as we go into 2.0, we continue to see people extending containerd. I'll spend the least amount of time on clients. We've had this simple tool in the project since the beginning called ctr.
It was never really meant to be a production client for containerd, but just an easy way to poke at the API, get a list of images, a list of processes. nerdctl is much more recent and has its own set of maintainers, who are marching along with new releases that either bring better alignment with the Docker command set, so all the flags, all the features, or add features that they can reach because they're built directly on containerd, like some of the lazy-loading snapshotters, image encryption, and container image signing; all those are built into nerdctl. crictl is from the Kubernetes community and drives the CRI API, of which containerd has an implementation; obviously CRI-O and others have implementations of that API as well. And then, of course, the Docker project is also built on containerd. There are some interesting developer platforms built around these clients: Rancher Desktop and Colima let you drive the Docker Engine or containerd, and we have a team at Amazon who built Finch, which is just built on nerdctl, BuildKit, and containerd; that lets you do macOS, and I forgot to add Windows here, because we just launched Windows support this past week. But again, these are ways that people are extending the capability by building new clients around containerd. The other area I mentioned was snapshotters. There are a bunch of built-in ones; many of you will recognize things like overlay, device mapper, btrfs. But this pluggability of having proxy plugins to a remote snapshotter means two things: you're not tied to containerd's release life cycle, and you don't have to get your snapshotter merged into the containerd code base. You can write your own, run it as a separate process with a gRPC listener, and containerd will call you for the snapshotter API: prepare, diff, unpack, and the other operations required of a snapshotter. There are three main ones, and all three of these have now been donated into the containerd GitHub organization.
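Hooking a remote snapshotter into containerd is a small piece of daemon configuration: you declare a proxy plugin pointing at the snapshotter's gRPC socket. The socket path below is the stargz snapshotter's conventional one; adjust it for whichever snapshotter you actually run.

```toml
# /etc/containerd/config.toml
[proxy_plugins]
  [proxy_plugins.stargz]
    type = "snapshot"
    address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"
```

A client can then opt in per operation, for example `ctr image pull --snapshotter stargz …`, or the CRI plugin can be pointed at it as its default snapshotter in the same config file.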
They were started as external projects and have now been donated. They're all related to lazy-loading file systems: if you've played around with being able to run a container without having to pull the entire image, say it's a 10-gigabyte image with scientific data sets or some complicated ML model, these lazy-loading snapshotters will only pull the files that are needed to start the container. Stargz, overlaybd, and Nydus are all in that family. And then there are two more: there's SOCI, which was built by one of our teams at Amazon and is Seekable OCI, so again a lazy-loading snapshotter, and that's open source; and GKE also has a feature called Image Streaming, built around the same ideas of lazy loading, but, at least to my understanding, that's not an open source project today. So again, these are ways that people are extending containerd by having their own snapshot technology and plugging it into containerd. As mentioned, there are also shims, so OCI runtimes; there are several options there. We have runc built in; you can also use crun, and we test that in our test suite for containerd; and there are also some experimental Rust and FreeBSD runtimes. But then, again, you can have your own shim outside of the containerd core project, such as the one for Windows maintained by Microsoft, hcsshim. runwasi is one of the more active projects in the containerd namespace, where again this is a shim where you can drive containerd with the same API and clients, but actually run Wasm workloads instead of a traditional Linux container. And then there are micro-VM based shims, trusted execution environments, and Kuasar, I think that's how you pronounce it, a shim that deals with a new feature of containerd 2.0 called sandboxing, which we'll talk about in a minute.
So again, those are three ways that I think have benefited containerd's growth: being able to plug in and enable features that don't have to be part of the main containerd code base, which allows people to expand for use cases that maybe we don't even know about. This is the picture of where we currently are in the containerd life cycle: 1.5 is now end of life; we created 1.6 as a long-term support release, and until 2.0 is released we don't have an official end date for it, but it will go at least another few years; 1.7 is in an active release cycle right now; and 2.0 should release in a month or two, based on our current set of betas and release candidates. So that's where we are as far as releases. As I mentioned, this isn't new news, but 1.6 is our first LTS release, supported, as it says here, at least until February 2025. And of course it's always a trick to maintain some integrity about how things get into the LTS, and one of the reasons that's tricky is that Kubernetes may add features to the CRI; we need to implement that CRI endpoint, so it sort of looks like a new feature. So we're doing our best to maintain compatibility with Kubernetes without opening up 1.6 to a lot of new features, so that it stays stable: mostly just backports of fixes, and obviously anything security related. And we have this idea that late this year we'll make the backport criteria a little bit stricter, so that people can rely on a long stable release without a lot of changes to its feature set.
1.7, therefore, is the end of our 1.x release cycle, and what you'll see here is that we basically merged a lot of new features into 1.7 before we released it, but marked them all experimental so that people could start to try them out; in 2.0, all of those become supported features. I already mentioned the sandbox service and the API around it: again, we had this extensibility at the shim layer, but with micro VMs and other ideas about how you treat the sandbox and how you configure it, several of our contributors came up with the sandbox service, and there's a whole API around it; you can read a lot more about it via the PRs or the documentation that's been merged. It was a preview in 1.7, but it will be automatically turned on in 2.0. In 1.7 there was a split where we actually had two implementations of the CRI, one based on the sandbox service and one on our legacy code; that will go away in 2.0, where there will just be the default sandbox implementation. NRI is very interesting: if you've ever played around with OCI hooks and the ability to modify the spec, say I want to insert a device before my container starts, the Node Resource Interface is our chosen implementation for doing that safely, with a way to have NRI plugins that the administrator of your cluster can enable and give the proper permissions to. NRI was experimental in 1.7 and, again, will be fully supported in 2.0. And then the transfer service: if you think about commands like save or export an image, pull an image, push an image, in all our previous releases of containerd that was a client-side API, so your containerd client was actually doing those registry interactions. In 1.7, and then of course in 2.0, this is now a service within the daemon, and for some use cases that was very important: the daemon handles credentials, the daemon handles the network connectivity to registries, and it also gives us a lot more tools for pluggability of
source and sink: say I'm trying to copy an image from one place to another, the transfer service gives you all of that in a configurable way. We also added user namespace support, which was a new feature coming down the pipe: containerd core had user namespace support, but the CRI side needed Kubernetes to add new API to the CRI, and those are now plumbed through, implemented, and supported in containerd. And then we had a lightweight RPC mechanism for shims, and we've now added full gRPC support, which was important, again, for certain use cases that people wanted. So, as I said, we're in the midst of our 2.0 release plan right now; we are just about to, I guess I didn't move that line over far enough, because it's February now, put out our first release candidate. So we're possibly a little bit delayed from our original thinking, but again, 2.0 will be final sometime this spring, and, like I said, all these new capabilities that were in 1.7 will be final and supported in containerd 2.0. It was also our first chance to finally deprecate things: we've been insistent on keeping a very stable API, so that people aren't surprised that the latest containerd release removed something. You can see that over the years we've deprecated a lot of features, or at least marked them deprecated; 2.0 will be the chance for us to finally remove those and provide recommendations. One of our contributors added a nice feature: you can actually turn on deprecation warnings, and you can run containerd 1.7, or even 1.6 LTS, and get notified of all the features you're using that are deprecated, to help you prepare for 2.0.
One of the things we were going to remove was support for our oldest configuration version, but then someone wrote a converter that automatically converts your configuration, so we won't actually have to deprecate that, in the sense that you're not going to have to rewrite your config unless you'd like to; it'll be done automatically for you. There are still a lot of things we'd like to do that we're still working on. I mentioned this new transfer service: the CRI is a plugin implementation within containerd that uses containerd's APIs to do the work, so when the CRI says "pull an image", the CRI implementation calls into containerd to do that; one of the things we're trying to do is migrate that to use the new transfer service. There's also work in development to allow pluggability for shims themselves. And then there are two kinds of API-layer enhancements we're thinking about. If you think about Docker, Docker gives you this higher-level API, again HTTP based; if you've ever built a tool that uses the Docker API, it's at least nice in that you can say "run container", give it all the configuration information, and it just does it. When people come to containerd, they're like: hey, you don't have the Docker API, what can I use that's similar? And we really don't: I have to create a container resource, I have to create a task, I have to start the task. So we're thinking about creating some of these abstractions, so that when people move to containerd they have a higher-level image service and container service. Those are things that, if you have ideas, if you have concepts, we're open to them; these aren't things we've built yet, but we're planning to as we go into the 2.x time frame. If you're interested in contributing or getting involved, there are a couple of channels in the CNCF Slack that we hang out in, where we talk about new features or people ask us questions. We also have a live community meeting on
Zoom twice a month, on the second and fourth Thursdays. If that's bad for your time zone, let us know; obviously that's always a tricky thing to handle with time zones. And again, go to the repo, open issues, give us your ideas, send pull requests. And that's all I have. Thank you.
Lift and shift: Modernising a legacy LAMP application with systemd-nspawn
So, next up is going to be Martin, who is going to be talking to us about lift and shift: modernizing a legacy LAMP application with systemd-nspawn. Hi, everybody. Welcome. So the last time I spoke at this conference, a few years ago, it was in the microkernel dev room. It was a very small room. So the bigger the kernel, the bigger the room, I guess. So I'm going to start with a little bit of backstory. One evening about a year ago, I got a phone call from a friend, a principal at a school, saying: Martin, I need help with something. Our sole IT person, who's worked here for 20 years, has decided that they're just going to go off to the mountains and leave, and they're off in about a month. And I have no idea what state our systems are in. I know nothing about that. I need someone I can trust who can step in and help. So I originally came in there as a consultant to look at what systems they had and figure out what the next steps were. I'm still there. It's still temporary. And I'm going to tell you a little bit about what I did over the last year there, concentrating on the containers. So they weren't kidding when they said it was in a bad state. The critical application that the school ran on was running on one single server, along with a whole bunch of other stuff, pretty much everything else. And you can see here that that server basically dates back to 2009. Someone at some point tried to upgrade it from Debian Etch to Debian Lenny. They failed, or they gave up, partly because from Etch to Lenny you had the transition from PHP 4 to PHP 5. I did a quick naive sloccount of what's in /var/www/html: there's 200-something thousand lines of PHP. It turns out that this person did not use source control, so there's a hell of a lot of duplication in there. And it's also very much a typical CRUD app, as you would design it 20 years ago. So it's all just very basic PHP with HTML mixed in, the worst possible thing you could have.
But at the same time, it's very simple as an application, which turns out helped us later. So my naive plan for how to salvage this: try and extract as much business and technical knowledge from the author before they leave and never come back. Then virtualize all the things, secure all the obvious attack surfaces. I mean, this was still running TLS version 1. It had Apache 1.3 exposed to the internet, worst possible cases. Then split off the business-critical system from all the other things that were running on that server. Do that in a way that's as future-proof and maintainable as I can. All while keeping it running and not getting killed by 550 students and 100-odd employees during the school year. The first two steps were pretty obvious. They had some new hardware lying around, so I spun up a hypervisor and had a bunch of VMs. So: put the physical server into VMs, start splitting chunks off it. That turned out to be hard. So I eventually decided that I needed a way of reproducing this 15-year-old environment, reproducing it in a way that I could then develop with and maintain with modern tools, source control and so on. The nice thing here is I found that the Debian community have developed something called Debian EOL, which are basically Docker images of end-of-life Debian releases, all of them, going way, way, way back. You can use these images to run Docker containers, or to do whatever else you want with them. The nice thing about them also is that they're actually integrated into the modern infrastructure, pointing at archive.debian.org, so you can, as you'll see, install additional software and so on. I could have probably done this with Docker, but it doesn't really fit the bill, because this application, I mean, it's never going to be a 12-factor app with a bunch of microservices. I needed something that's more like FreeBSD jails or Solaris zones. And I've previously used systemd-nspawn.
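The image-to-rootfs workflow he goes on to describe can be sketched roughly like this. The skopeo and oci-image-tool steps are commented out, since they need those tools and network access; a minimal fake rootfs stands in so the copy-on-write step can be shown (image names and paths are illustrative):

```shell
#!/bin/sh -e
# 1) Fetch the EOL image from Docker Hub as an OCI layout (needs skopeo):
#      skopeo copy docker://debian/eol:lenny oci:lenny-oci:latest
# 2) Flatten it into a plain root file system (needs oci-image-tool):
#      oci-image-tool unpack --ref name=latest lenny-oci lenny-rootfs
# Here we just fake a minimal rootfs to demonstrate step 3:
mkdir -p lenny-rootfs/etc
echo "lenny/sid" > lenny-rootfs/etc/debian_version

# 3) Make a cheap copy-on-write clone of the tree to work on.
#    --reflink=auto shares data blocks on btrfs/XFS and silently
#    falls back to a normal copy on other file systems:
cp -a --reflink=auto lenny-rootfs lenny-clone
cat lenny-clone/etc/debian_version   # → lenny/sid
```

The clone only starts consuming extra space once files in it are modified, which is what makes throwing away and recreating experiment containers cheap.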
I use it, in fact, today to run a bunch of my own infrastructure, which was originally a bunch of Xen PV VMs and has now been happily running for many years as systemd-nspawn containers. So you want something that can do full system containers, that's available, lightweight and flexible. So how do we get Debian Lenny from 2009 running, using these Debian EOL images, with systemd-nspawn? We need a couple of tools, skopeo and oci-image-tool, to get the images off the Docker registry and flatten the OCI image; you basically end up with a root file system. The reason I'm emphasizing reflink here, I didn't know about that, is that it's basically copy-on-write: you can use it to create a lightweight copy of an entire directory tree, which only takes up more space if you actually change things in it. So, you try and run this naively with systemd-nspawn, and you find: bang, segfault. Thankfully, we actually get a helpful message from the kernel saying, ooh, you tried to do vsyscalls, but no, we don't do that anymore. We can fix that, that's fairly easy, and we can see that, oh look, we have Debian Lenny running in a systemd-nspawn container. Okay, that's great, and if that was all I was going to tell you today, then that probably wouldn't be very interesting. If all we want is /bin/sh and that, that's fine, but I want the full system, where I basically want to run the full SysV init inside the container, to manage all the original LAMP stack services, to run the application. I want to integrate the container's networking with the host system's systemd-networkd, get a /dev/log in it, use user namespacing, and start and stop the container as part of the normal host system boot process. So I made a script for this; I extracted it out of my build scripts so that you don't have to. There's a link to it also in the resources for this talk. Please take a look.
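The vsyscall error he mentions is commonly fixed by booting the host kernel with legacy vsyscall emulation enabled; a sketch, assuming a GRUB-based host and a kernel built with legacy vsyscall support (whether this is the exact fix from his script is my assumption):

```sh
# /etc/default/grub on the host, followed by update-grub and a reboot.
# The old glibc in the Lenny userland still issues vsyscalls, which
# modern kernels disable by default:
GRUB_CMDLINE_LINUX="vsyscall=emulate"
```

With emulation on, the ancient glibc's gettimeofday/time vsyscalls are trapped and emulated by the kernel instead of segfaulting the process.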
So this script basically gives you a Debian Lenny root file system that has all the things applied to it to let you do the steps that are described here. I spent quite a bit of time working that out, so I hope people will find it useful. With the root file system you get out of that script, you can then boot the result like this. The important parts there are --private-users=pick, which turns on user namespacing, so your container root automatically gets a special user ID range mapped to it, which systemd-nspawn will pick when that particular root file system is started. And you get a veth network talking to the host. --kill-signal=SIGINT we want so that when the host system, if you run this container via a unit file, tries to stop it, the SIGINT gets sent to the SysV init inside the container, which will actually interpret that as a system shutdown and shut down cleanly. So if you run that, you can log in on the console, and you'll see that yes, we can shut down the container with Ctrl-C. So there's a bunch of gotchas. Networking: systemd-networkd, you want this, since it integrates very well, bar some problems. Obviously your host needs IP forwarding enabled. As I found out, or remembered, while making these slides at the hotel earlier today, if you're doing anything at all in your FORWARD chain, as I was, then you need to make sure that forwarding is actually being accepted from and to the container interfaces. Another really interesting one: I'm still running a DHCP client inside the container, so that the container integrates with systemd-networkd and gets a network address assigned to it when it spins up. It turns out that old DHCP clients are actually picky about getting proper checksums back in their responses.
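The usual host-side fix for this DHCP checksum problem is an iptables CHECKSUM rule in the mangle table; here it is as an iptables-restore fragment (the `ve-+` interface glob, matching systemd-nspawn's veth interfaces, is my assumption about his setup):

```
*mangle
# Old DHCP clients discard replies whose UDP checksum was left for
# hardware offload to fill in; compute it in software instead:
-A POSTROUTING -o ve-+ -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill
COMMIT
```

This is the same trick libvirt applies on its bridge interfaces for the same reason.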
So if you don't add that particular mangle rule, what will happen is your networking will appear to work, and then mysteriously stop when the DHCP lease expires and the client tries to renew it and gets upset, and you just see it renewing and renewing and nothing happens. Next, systemd-journald has a nice namespacing mechanism. It basically lets you spin up separate instances of systemd-journald, which have their own namespace, because you don't really want the container logs, or the logs of the different instances, mixing with the host logs. It works, but I had to actually read the source code of the systemd main loop to figure out why it would, just after you start it, mysteriously say: oh, no clients, I'm going away now. The way to fix that, not described anywhere, is you add a drop-in and set your retention time to something high, and then it will just wait around until something connects to /dev/log. /dev/log you can then bind-mount into the container; that's fairly straightforward. Start-up and shutdown integration: systemd-nspawn comes with a default unit file, and you can then customize that. There are some useful things you can do there, like adding a dependency on your journald namespace service so that everything nicely starts up and shuts down, and there's an example of what you can put in ExecStart= if you want to use this particular arrangement. So I actually did this, or the bulk of it, during the school holidays last summer. The application has been running fine since then; I was quite surprised. I could talk a lot more about PHP and MySQL 5, but that would mostly just be ranting. One thing that I didn't mention is that the application is actually running all in CP1250, and not only that, but originally the databases were all still running with MyISAM. So I ended up basically exporting the lot into SQL text files.
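The unit-file customization he mentions might look like this as a drop-in for systemd's shipped systemd-nspawn@.service template; the machine name, rootfs path, and journald namespace unit name are all illustrative:

```ini
# /etc/systemd/system/systemd-nspawn@lenny.service.d/override.conf (sketch)
[Unit]
# Tie the container to its namespaced journald instance:
After=systemd-journald@lenny.service
Requires=systemd-journald@lenny.service

[Service]
ExecStart=
ExecStart=systemd-nspawn --boot --directory=/var/lib/machines/lenny \
        --private-users=pick --network-veth --kill-signal=SIGINT
```

The journald half of the fix, per the talk, is a drop-in raising the namespaced instance's retention time so it waits for clients instead of exiting; `MaxRetentionSec=` in its journald configuration is my reading of which option he means, so verify against his published script.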
Then I discovered that MySQL and PHP at this time didn't really understand character sets, so the database thought that everything was Latin-1 when it in fact wasn't. Well, the way to fix that is, again, you export it to a text file, making sure that the database, or anything else, doesn't try to convert any of the data. Then you do a sed on the text file and just say: replace MyISAM everywhere with InnoDB, replace latin1 with cp1250. And it actually worked. It's still there; no data got corrupted. And it's 64-bit now, so it won't fall over in 2038. So yeah, and I'll end this with a quote from a conversation I had in the autumn with my long-time friend Martin Sústrik, who was asking: so, you spent the last few years before that working on OS research, with unikernels and Docker and the University of Cambridge and so on. So what was more complicated, all this OS research that you were doing, or the work you've been doing at the school over the last six months? And I said: well, definitely the work at the school over the last six months. And I still have 10 minutes. So, in fact, I guess: questions. It was quicker than I thought. Yes sir. This man here? Sorry? The -n option? Oh, ah yes. Okay, so the reason you can't do that, in fact this is important and I sort of glossed over it here: that will only work, the journald integration will only work, if the distribution that's running inside the container is new enough. The Debian Lenny from 2009 does not have journald, does not have systemd; it predates them. So this is all running good old SysV /sbin/init. So you get none of the integration that you'd expect, the fancy stuff that you get today with systemd-nspawn, with machinectl, if you use the full interface. If you run a systemd distribution inside the container, then your logging will just transparently get integrated with the host journal. Likewise you'll get things like machinectl login, which will get you a TTY, a console that you can use to log into the container.
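The dump-and-rewrite database conversion he described can be sketched like this; the table and dump contents are made up, and the mysqldump flags shown in the comment are the usual way to export without charset conversion:

```shell
#!/bin/sh -e
# The real dump would come from something like:
#   mysqldump --default-character-set=latin1 --skip-set-charset school > dump.sql
# Here we fabricate a one-line dump to show the rewrite:
cat > dump.sql <<'EOF'
CREATE TABLE pupils (name VARCHAR(100)) ENGINE=MyISAM DEFAULT CHARSET=latin1;
EOF

# Swap the storage engine and declare the charset the data really uses:
sed -i 's/MyISAM/InnoDB/g; s/latin1/cp1250/g' dump.sql
cat dump.sql
```

Because the bytes in the dump are never reinterpreted, only the labels change; reimporting then gives InnoDB tables whose declared charset finally matches the data.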
We don't have that here, because there is no systemd; all of this relies on there being systemd inside the container as well as on the host. It is exposed to the internet, but not directly. That's the first thing I did, way back before I started on all of this. Right, number two here, secure the most obvious attack surfaces: I stuck a modern reverse proxy in front of it.
vscode-container-wasm: An Extension of VSCode on Browser for Running Containers Within Your Browser
So, our next talk is going to be about... Hello, I'm Kohei Tokunaga from NTT Corporation. I'm a reviewer of containerd and a maintainer of BuildKit. And today I'm going to talk about an extension of VS Code on browser for running containers within the browser. So, this is the summary of this talk. On-browser VS Code lacks a Linux terminal running completely inside the browser, and the vscode-container-wasm extension enables running Linux-based containers and their terminal inside the browser. And there are two options available for distributing containers to browsers: the first one is pre-converting containers to Wasm images and distributing them, and the second option is distributing OCI container images to browsers. So, there are several on-browser VS Code implementations in the community, but there is a limitation to their functionality, which is the lack of a Linux terminal running completely inside the browser. So, users can edit code inside the browser but cannot run it inside the browser, and Linux-based development tools like compilers are also unavailable in the browser. And one of the root causes of this issue is that browsers don't provide a Linux-compatible system, so Linux-based applications need to be ported to the browser. If the application is written in a language other than JavaScript, WebAssembly, or Wasm, will also be used for running them in the browser. But actually, porting apps to WebAssembly is not easy: Wasm lacks compatibility with the Linux system. For example, the binary format is completely different from the existing common binary formats like x86 ELF, and the app might need to be redesigned for the Harvard architecture of Wasm; this might include eliminating fork- and exec-related calls from the application. And some of the issues can be mitigated by compilers' Wasm target support, but they still don't provide full compatibility with Linux. So, can we run an unmodified Linux terminal and dev environment inside the browser? So, here the vscode-container-wasm
extension can be used. This is an experimental VS Code extension for running containers inside the browser. So, the container and its terminal are available in VS Code on browser without preparing remote SSH servers or anything. And this is implemented leveraging CPU emulators compiled to Wasm; we will discuss that later. And the workspace of the editor is also mounted at the /workspace path, so the container can refer to the contents of the workspace; for example, it can compile code stored in the workspace. And HTTP or HTTPS networking is also available. The container runs inside the browser, so the networking functionality is also restricted by the browser; for example, the set of sites accessible from the container is limited by CORS. So, how can container images be distributed to browsers? There are two options. Option A is pre-converting containers to Wasm images, and option B is distributing OCI container images to browsers. So, the first option for distributing containers to browsers is pre-converting containers to Wasm images, and the container2wasm converter provides this ability. container2wasm is an experimental converter of container images to Wasm images: it receives an arbitrary Linux-based container as the input, and it outputs a Wasm image that runs the container on Wasm. So, we can run the containers in Wasm-enabled environments like browsers. As shown in the right figure, the converted Wasm image can be uploaded to any HTTP server accessible from the browser. To use them in VS Code on browser, you can configure the workspace using the .vscode/settings.json file: you need to add the image location URL to that configuration file, so that the extension can launch the specified container in the browser. And the pros of this approach is that once the container image is converted to Wasm, it can run in any Wasm-enabled environment, not limited to browsers.
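As a rough illustration, such a workspace configuration might look like this. The setting key shown is how I recall the extension's README documenting it, so treat the key name as an assumption and check the vscode-container-wasm repository for the real one:

```jsonc
// .vscode/settings.json (sketch; the setting key name is unverified)
{
    // URL of the container image pre-converted with container2wasm:
    "container.imageLocation": "https://example.github.io/debian-wasm/out.wasm"
}
```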
For example, the container can run on WASI runtimes like Wasmtime as well. And the cons of this approach is that pre-conversion is needed for each container: if you want to run many kinds of containers in the browser, all of them need to be pre-converted to Wasm, so it may add extra cost at development time. And the second option for distributing containers to browsers is directly distributing OCI-compatible container images to browsers. If you use a container registry, that registry needs to allow CORS access, because it's accessed from the browser. But unfortunately, as of now, well-known public registries don't allow CORS, so you need to try it on a localhost registry with the CORS headers configured. Alternatively, you can also use a CORS-enabled HTTP or HTTPS server. In this case, the container image needs to be formatted as an OCI image layout. This is the specification of the layout of image content to be stored on the file system; for example, you can get a tar archive of this format using the docker save command newer than v25. And vscode-container-wasm supports fetching an image formatted with this spec over HTTP. In either case, the image location needs to be written to the workspace's .vscode/settings.json file, so that the extension can launch the specified container in the browser. The pros of this approach is that it doesn't require a pre-conversion of the image, and an unmodified container image can be distributed to browsers. And the cons of this approach is that, as mentioned, existing public container registries don't allow CORS as of now. So if you don't use the OCI layout approach, you need to prepare a CORS-enabled container registry, or users need to use a proxy or something to access the registries. And this is an example of running a container on github.dev: github.dev is an on-browser VS Code that allows us to edit code of GitHub repos in the browser.
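For reference, an OCI image layout served over plain HTTP is just a directory tree like the following (the file names are fixed by the OCI image-layout spec; the top-level directory name is arbitrary):

```
oci-dir/
├── oci-layout           # {"imageLayoutVersion": "1.0.0"}
├── index.json           # top-level index pointing at the image manifest
└── blobs/
    └── sha256/          # content-addressed manifests, config and layers
        ├── <manifest digest>
        ├── <config digest>
        └── <layer digests...>
```

Since every blob is addressed by its digest, a static file server is enough; the extension can resolve index.json and fetch each blob by path.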
This slide shows an example of running a Debian container with GCC installed inside the browser; the workspace is mounted at /workspace, and HTTP or HTTPS networking is also available. And so this is a demo of this extension, and we use github.dev here. And that... Okay, so here, this is the container-wasm extension, and it is available on the Marketplace. And this is the settings.json file in this repo, and this config file points to the URL of the Debian container converted to Wasm using the container2wasm converter. This is served on GitHub Pages, and we use that image in this workspace. And this is the terminal of the Debian container running inside the browser. And this is the script we are going to use in this demo. And currently, yeah, by executing a command of this extension, the extension loads the container image stored on GitHub Pages into this browser, and it just booted the Linux kernel and the container inside the browser, with CPU emulation. And we can currently see the Debian shell in the browser. And by executing the uname -a command, you can see this is an x86_64 Linux environment inside the browser. And the workspace of this repo is mounted at /workspace/, so you can see the files of this repo inside the browser, mounted on the workspace directory. And in this container we have the GCC compiler, and we have a pretty simple hello-world C-language source code, so we can compile that C code inside the browser using GCC. Then we can run the compiled binary in the browser. So the entire compiling and running steps are done inside the browser in this demo. So, how does this extension work? The container depends on Linux to run, so this project runs both the container and Linux inside a Wasm VM in the browser. And to enable running existing architectures' binaries inside the Wasm VM, we use CPU emulators compiled to Wasm: we use the Bochs emulator for x86_64 containers and TinyEMU for RISC-V containers.
So this extension launches all of them, the emulator, the Linux kernel and the container, inside a Wasm VM in the browser. And we also use microsoft/vscode-wasm for the Wasm and WASI host in the browser. This is a Wasm host integrated into VS Code, so it allows the Wasm VM to access the terminal in VS Code and the workspace directory over WASI-compatible APIs like the fd APIs. And how does mounting workspaces into containers work? So, as mentioned on the previous slide, we use vscode-wasm for the Wasm host, and it provides access to the workspace directory to the Wasm VM over WASI-compatible APIs. And the emulator running inside the Wasm VM recognizes the workspace directory via the WASI APIs, then it shares that directory into the guest Linux via virtio-9p. And that workspace is mounted at the container's /workspace/ directory, so the container can access the workspace at that file system path. And the container can perform HTTP or HTTPS networking, with restrictions by the browser. This is implemented by running the entire networking stack inside the browser, so an additional proxy outside of the browser is not needed. And this networking stack supports forwarding HTTP and HTTPS connections to the outside of the browser using the browser's Fetch API. An HTTPS connection is terminated at the networking stack in the browser with its own certificate, and the connection is re-encrypted by the Fetch API. So the container can access the outside of the browser via the HTTP/HTTPS proxy running inside the browser. And there are actually some important restrictions from the Fetch API, including that accessible sites are limited by the browser, so CORS restrictions apply, and some headers are actually uncontrollable from the container, because they are entirely controlled by the browser. And vscode-container-wasm allows fetching a container image directly from a remote location without pre-conversion to Wasm. This is implemented by fetching and unpacking the container image in the browser.
The unpacked root file system of the container is mounted into the guest Linux via virtio-9p. And not limited to on-browser IDEs, we believe there are some expected or possible use cases of running containers on Wasm or in the browser. So the first one is interactive on-browser Linux-based demos. And the second one is on-browser development and testing, like this extension. And also a sandboxed execution environment for containers, and an application debugger runnable in the browser, with record-and-replay debugging. There are some existing approaches for running unmodified applications on Wasm, and I listed some of them here. The first one is v86. This is an x86-compatible on-browser CPU emulator by Fabian Hemmer, and it supports a wide variety of guest OSes, including Windows, but it doesn't support x86_64 for now. And TinyEMU is a RISC-V and x86 emulator by Fabrice Bellard. It can run in the browser, and the container2wasm converter actually uses this for RISC-V emulation, but it doesn't support x86_64. And this project is still at a very early stage, so we expect further improvements. The first one is performance analysis and improvement: we heavily rely on CPU emulation, so I think we need to analyze the overhead, and I think we need some improvements there. And a possible integration will be with elfconv. This is an AOT compiler of Linux/AArch64 ELF to Wasm by Masashi Yoshimura, my colleague from NTT Corporation. So tomorrow, in the LLVM dev room, my colleague Masashi also has a talk about this AOT compiler, so please check it out. And the integration of the container ecosystem with browsers is also needed. As I mentioned, CORS is still an issue for containers: currently, accessing OS package repos from the browser is not possible. And also, in terms of container registries, as far as I know, public container registries don't allow CORS access.
So in this field, your help is really needed: if you know some technologies, or repos, or registries that allow CORS access, please let us know. And graphics support is also on our milestone. So this is the summary of this talk. On-browser VS Code lacks a Linux terminal running completely inside the browser. And the vscode-container-wasm experimental extension enables running Linux-based containers and their terminal inside the browser. And there are two options for distributing containers to browsers: the first one is pre-converting containers to Wasm images, and the second one is distributing OCI container images to browsers. And that's all of my talk. Thank you very much. Do you have any questions? Yes. Yes, please. Can you run Firefox inside the container? Okay, so the question was Firefox inside the container. So, Firefox inside the container, inside Firefox. All right. Yeah, I haven't tested it yet, but I believe it's possible. I don't know of any practical use case for this, but I think it's possible. Yes, of course. Yes. Sorry. QEMU? Thank you for the question. The question was about using QEMU as an alternative to Bochs and TinyEMU. Yeah, I think this is a very good question. And actually, on the container2wasm repo, we have an experimental branch that integrates QEMU TCI into this extension. In terms of TCG, we haven't integrated that yet: with TCG we need a way to run the generated code, and that is not obvious in a Wasm environment. But yeah, we are seeking a way to integrate QEMU into container2wasm, so this is definitely on our milestone. Yeah. Thank you very much. Thank you very much. Thank you.
Debug your stage-1 systemd with GDB and the NixOS test framework
So, my name is Julien, and this is Ryan and Linus, and we are three NixOS developers. And today we are going to talk to you about a situation we had during a sprint, where we found ourselves in need of debugging our systemd in the initrd. So, I'm going to talk about, let me just... it's because I know them. I'm going to talk about why we were actually in this situation. And then Ryan is going to talk about what the NixOS test framework is, and test frameworks in general. And then we are going to showcase how we did this specific fun debugging. So basically, I'll motivate the situation we were in a little bit. Basically, we wanted to work with encrypted secrets in the initrd. As you may or may not know, the initrd, or initramfs, is the initial file system loaded in RAM as part of the boot process. It supposedly contains everything that is necessary in terms of drivers and executables to mount your root partition, which is its main goal: be able to mount your root partition and continue the boot process. But in some cases, especially when your root partition is encrypted, it also needs to acquire the key to mount and decrypt it. And this can be done by displaying a user prompt where you input your password, but it can also be done, if necessary, by starting an SSH server where you connect and then put your password in, and then it mounts your root partition. And for that purpose, you sometimes need to have secrets stored in this initrd, for example an SSH key. The problem is that if you have an encrypted system, you kind of have to start from something unencrypted, and this initrd image is not encrypted. So if it has secrets, and you just put the secrets in this image, then anybody reading your boot partition can have access to the secrets. So as NixOS developers, we wanted to have an option where you could actually have the secrets be encrypted.
Currently in NixOS, the secrets are just put plainly in the boot partition, and suffer the drawback that I was describing before. And so we wanted to find a solution, and the solution is: we have an option to use systemd as the init in stage one, instead of a scripted init script. And what we can do with systemd is use something called systemd credentials, which is basically a systemd executable whose main role is encrypting and decrypting secrets. And you can do this using your TPM. So basically, what you can do is use the same TPM in your initrd, and this way you have secrets that were encrypted on your system, that systemd in stage one is now able to decrypt during your boot process. So why all this? Where am I coming from? I started to try to implement this in NixOS, and what we found out, I don't know if you can read this particularly well, but this is the log of the boot process, and you see that there is systemd running in the initrd, it says here "running in initrd". And then it says it loaded the credentials that I tried to pass to it, and then it caught an assertion in some function and says: okay, I'm bailing out early, goodbye. It's crashing. So the question is: how can we debug this kind of thing? And one of the things we considered at the beginning was to use the NixOS test framework, because it allows us to get into a very constrained situation, where we could maybe find the bug more easily. And now Ryan is going to talk to you about the NixOS test framework. So the screenshot you just saw earlier was a screenshot of the NixOS test framework. You can see that it's a VM test, and we can repeat that VM test very easily. But what I'm getting at is: in NixOS, as NixOS developers, we have this test framework that we use a lot, and I'm showing a screenshot of another test framework, openQA, used by other distributions.
But basically, what is interesting with debugging is that when you debug, you want to debug a situation, a particular situation where you are hitting the bug. And in our context, using the NixOS test framework, the fact of writing a test first, is a way for us to automate entering certain particular situations, including the ones we are interested in debugging. So for us, the NixOS test framework is a way to facilitate debugging sessions, a way to write code that enables us to explore various scenarios, and to triage and bisect very easily any sort of dependencies. In the distribution context, we really care about system-wide testing. So I will just do a very quick intro on that. There are two components I will define. There is the driver: the code you write to assert the invariants that you care about. Taking the example of the systemd credentials, you want to assert that the credential that you decrypt contains the contents that you are expecting; that's an invariant. You also have the setup. The setup is: how do you bring the system to the state that you care about? So we need to prepare an image that contains a systemd credential containing the contents that we will be expecting, and that's the setup code. And both of them are usually written in some sort of domain-specific language: that could be a bash script, that could be C, that could be Python. And I made a very simple state-of-the-art table, which is not exhaustive, but I find it very interesting to compare. For example, another project that needs a complicated integration-testing framework is the kernel, and they do have solutions to test file systems and various things.
And you can see they all have their own DSL, whether it's bash or any ELF program or executable that you can run on the system, and they use some sort of emulator to give you environments: to give you full system emulation, to give you network, to give you VLANs, so that you can reproduce any sort of environment. And I find it interesting that I'm not aware of any other operating-system-wide integration testing framework except openQA and the NixOS test framework, which is just a bunch of bash scripts and Python scripts cobbled together using the Nix domain-specific language, using the Nix machinery. And I find it interesting that the biggest difference between the NixOS test framework and the others, which enables us to do some interesting stuff, is that usually you have one language for the domain-specific language, so you have Python or shell or something, but in the case of the NixOS test framework you can use both. You can use Python and Nix together, so you can interpolate Nix code inside of Python code, and you have two levels of DSL that enable you to reason at build time but also at run time. And that's why I do the funny thing of saying Python-in-Nix for the driver and Nix-in-Python for the setup, because you think about run time and build time differently at this point. So, to give you an overview, the NixOS test framework can offer you, like openQA, OCR machinery: you can run a VM, you can spawn a Chromium instance, and you can use the OCR to read the window title, for example in a GNOME desktop environment, and verify that it is indeed the window title you were expecting. And all of those tests are running in our CI automatically for every what we call channel bump, that is, a roll-up of a lot of commits in the Nixpkgs repository, basically. What I think is very interesting in our case, and enabled us to debug this problem very quickly, is that there is a secret sauce to our test framework, which comes from the fact that we use the Nix DSL here.
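For flavor, a minimal NixOS VM test might look like this. The `nodes`/`testScript` structure is the framework's real shape; the module option and the asserted command loosely mirror the systemd-in-stage-1 scenario and are illustrative:

```nix
# Sketch of a NixOS test: one VM with systemd as the stage-1 init,
# plus a Python test script (the driver) asserting on the booted machine.
{ pkgs, ... }: {
  name = "stage-1-systemd-credentials";
  nodes.machine = { ... }: {
    boot.initrd.systemd.enable = true;   # systemd in stage one
  };
  testScript = ''
    machine.start()
    machine.wait_for_unit("multi-user.target")
    # Illustrative invariant: the decrypted credential has the
    # contents we expect (path and contents are made up):
    machine.succeed("grep -q secret /run/credentials-test")
  '';
}
```

The Nix part above is the setup, evaluated at build time; the Python string is the driver, executed at run time against the booted VM.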
So the Nix DSL gives us a way to describe packages, systemd units, and various things, and it's a functional programming language. That means you can write functions that abstract a certain test scenario and then write more code to do more advanced assertions on that environment. For example, I have a very bad screenshot, and I'm sorry, so I will describe it. We have ZFS in NixOS, and ZFS is very complicated to maintain; I'm a maintainer of ZFS, unfortunately. It's complicated because it's an out-of-tree kernel package that often has ABI breakages with the kernel, for many complicated technical and legal reasons. And to keep the burden realistic for maintainers you need strong testing. So we are able to do matrix testing over multiple versions of ZFS, multiple versions of the kernel, and even stable versus unstable, and we even have a variant for the systemd stage 1, because NixOS has both: it has a scripted stage 1, as Julien described, and experimentally a systemd-based stage 1. So we are able to test all those scenarios and understand what is going on in not a lot of lines. And here I will pass it on. We tried a lot of things. We tried to isolate the problem with the NixOS test framework; we are able to patch things easily. But even so, we were not able to find the root cause, so we moved on to more powerful tools. Thank you. Yeah. So there we were, trying to work out how exactly systemd was crashing. It was dumping its core to a file in a temporary file system and promptly exiting, causing the kernel to panic, and since that's not a persistent file system we had no way of recovering that core file. So we decided to try to run GDB in the initramfs, but we quickly abandoned that idea because GDB is big and doesn't fit into an initrd that well.
Thankfully we have gdbserver, which anyone familiar with GDB might already know about. With gdbserver we can either launch a process as a child of the gdbserver, which listens on a TCP port so we can attach to it with a separate GDB client process. That doesn't quite work if you want to debug your PID 1, because PID 1 can't be the child of another process. Thankfully it also has a mode where you can attach to a running process. So in this example we launch `sleep infinity` in the background, run gdbserver to attach to it, and likewise attach to that gdbserver using a GDB client. Now, how do we do that for PID 1? We have to put gdbserver in our initramfs and have it target the PID 1 inside the initramfs. The tricky part is that we want to debug systemd, but because systemd is crashing, we can't use systemd to launch gdbserver. So we go back to having a shell script as our init, and that shell script launches the gdbserver, has the gdbserver attach to the script itself, and then execs systemd. First thing we do is launch that gdbserver and have it attach to `$$`, which in this case is going to be 1, the PID of the shell script, and background it, because otherwise bash is going to wait for gdbserver to exit, and gdbserver isn't going to exit. Then we `sleep 1`, because the gdbserver needs a moment to start up and actually attach, and then we exec systemd to actually do our debugging. That ended up actually letting us debug it, and Julien has a recording of what that looked like. Thank you. So let me try to put this demo on. So basically what we did; I'll try to comment as it goes. Oh, this is not right. Yes, it's not doing what I want. I think it's... And you can exit full-screen mode and then full-screen it again. No, you didn't exit. Yes, yes, trying to do it. Did I... Yes. You have your time. Yeah, okay.
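Before the demo: the init script just described might look roughly like this. It's a sketch under assumptions; the port `:1234` and the path `/init-systemd` are invented stand-ins for whatever the real setup used. Here the script is only written out and syntax-checked, not booted:

```shell
# Write out a hypothetical /init for the initramfs and syntax-check it.
cat > /tmp/demo-init <<'EOF'
#!/bin/sh
# $$ expands to our own PID, which is 1 inside the initramfs.
gdbserver --attach :1234 $$ &   # background it: gdbserver will not exit on its own
sleep 1                         # give gdbserver a moment to start up and attach
exec /init-systemd              # exec keeps PID 1; stands in for the real systemd binary
EOF
sh -n /tmp/demo-init && echo "init script parses"
```

The `exec` is the key trick: the shell replaces itself with systemd without forking, so the process gdbserver attached to becomes the systemd being debugged, still as PID 1.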
So on the left side we are running our test framework virtual machine, and you see the virtual machine is not starting yet, because it's waiting for us to attach with GDB, which we do on the right side. You'll see that as soon as we attach through this socket, which is called hello, the virtual machine starts and GDB loads the symbols, yes, and then when we do continue, the virtual machine starts. So this first virtual machine, as you see on the left, is the installer virtual machine. It's going to install NixOS on a disk, populate the boot partition and everything, put the credential in it, and then we restart it and we will hit the bug with systemd. So what you see here is just the log of NixOS installing itself, and this first GDB instance will not do anything purposeful, because we had to change the init script in both the installer VM and the installed VM, so we are only doing the first part, which is not really the part we are interested in. But it should not take too much time; I can fill the time. What is interesting here is that you can see we have a very complicated, well, complicated setup to initialize systemd, initialize the installation, and all that stuff. And this is the second VM booting now. All of this is automated. So we are reattaching with GDB, and the VM is now booting and stuck waiting for GDB to attach. When I do this it doesn't work, but when I properly attach, it actually reads the symbols, and now when I do continue, I will hit the bug we were trying to debug. We are hitting it now, and we can now see a backtrace. So yeah, that's it. By reading this backtrace we found the bug we were looking for and we were able to open a PR to systemd and fix it. And that's it. Do you have any questions? Do we have time for questions, actually? Yes. Oh, that's good.
You said that you couldn't have systemd be the child of another process, so you couldn't have GDB start and run it. Why not? Yes. Do you want to answer this question? Yes, so the question was why we can't have systemd not be PID 1. It's because our bash script won't reap zombie processes, which only PID 1 can do, and because there are various bits in systemd which require it to be PID 1, especially if you are running it in the initramfs, because it needs to actually switch into the final root file system, which you can't do as just any process. I don't understand how and when ownership moves from gdbserver to systemd, because you attach gdbserver to itself and then you hit continue. The question was: when does control go from gdbserver to systemd? The init in this case was a shell script which launched gdbserver in the background, and then the shell script replaced itself with systemd; the gdbserver was attached to the shell script. Any other questions? Yeah, just a matter of curiosity: why do you say it's a problem to put the whole GDB binary into the initramfs? So the question was why it's a problem to put all of GDB in the initramfs. It's fairly big, and a big initramfs can be a problem, especially with boot partitions of limited size. Also, in the initrd we might not have the terminal control bits and pieces necessary to make actually using GDB enjoyable, whereas with gdbserver we can even attach a graphical front end to GDB, or something similar, to the target. And the debug symbols and the sources? Yes, exactly: GDB needs to access the debug symbols and the sources. Good point. The question was: if we are using a TPM anyway to store the disk encryption keys, why would we need to store more secrets in the boot partition to do anything else? I think there are many use cases here. For example, imagine you run an SSH server in early boot to obtain another part of the key.
So you store one part of the key in the TPM2 and another part on a server, and the server asks you to prove your identity or something. Then you need to have your own identity somewhere, because otherwise the server doesn't know if you're the true machine asking for the other part of the key, and that means you need private SSH host keys to be stored somewhere. So to confirm: in general, if you haven't configured something like an SSH server and explicitly put a secret in your initrd, you're not going to get one. If that's part of your setup, or you want to split the key up and get it from different places, for example, this can help you do that. So again, to repeat what you just said, and I agree with it: this sort of approach is useful when you have more secrets than just the disk encryption secret in the TPM2, when you have identity attestation or more parts of the secret somewhere else, doing Shamir's secret sharing schemes and whatnot. It makes sense in those use cases. We still have three minutes. Yeah. Is this already upstream, with the TPM and the secrets in the initrd? Do you want to answer? Can you repeat, sorry? Is this already upstream in Nixpkgs, with the TPM2? Yeah, so, do you want to answer? Yeah, okay. Repeat the question. Sorry, yeah: the question is, is this way of storing secrets in the initrd already upstream? The answer is no. We have a few dependencies necessary. One of them is booting from systemd-stub, because systemd-stub can measure the credentials you're passing. So there are PRs open; if you are a NixOS developer, do review them, please. But it will come soon, I think, with systemd-boot, and there is also work being done in lanzaboote for the same features. So both are going to be available soon, I guess. Related: is this one of the things that's kind of on the road for lanzaboote? I'm the maintainer of lanzaboote.
So the question was: is this part of the work to upstream lanzaboote, which is a secure boot component for NixOS? It's a bit special to NixOS because we have so many generations. The answer is: this is in the ecosystem of those sorts of things, so yes, basically. Thank you.
Help us improve time manipulation with GDB
Which is wizards and warlocks. Welcome to my talk on manipulating time with GDB. Well, that's what I would have said if it wasn't for the RR talk that came right before me, which taught everything you're supposed to know that I was going to say. So instead, let's talk about how you can help us make manipulating time with GDB even better, right? Let me give a quick summary of what I'm going to be talking about. First, some introduction, in case you didn't catch the previous talk, or you were sleeping through it or something; I don't know how you would have, it was a pretty good talk. I'm then going to go into the technical details of how it works, and as I explain each little bit, I'll also explain why that little bit might be buggy. Then I'll give you a couple of links, a couple of QR codes, to where you can see the list of bugs we have open, so you can pick your favorite, and I'll give you a little request to help us fix them, and some contact information if you're not comfortable just throwing an email into the void of the mailing list and would like to talk to someone who you think is a person. Right, so let's go from the start. What am I talking about? Or first, who am I? Hello, I'm Guinevere. I was hired by Red Hat to work on GDB. I've been doing it for almost three years, and just recently I was appointed one of the maintainers of GDB, for the specific area that does reverse debugging. One of the things I like to do is help people get into contributing to open source. I always wanted to contribute to open source when I was in university, but it always felt like an impossible task; I thought I would need to be some sort of genius to do it. Then, as I started doing it professionally, I realized that there are people who aren't geniuses, like me, doing it, and I wanted to spread that around. And what is this GDB that I keep mentioning?
In case you don't know, GDB is a very famous debugger for C and C++; it's been around for 30-something years. Not sure how many; more than 30. Basically, if you're a time wizard, it's your best friend. It can slow down your program, it can stop it altogether, and, as you just learned, it can also make it run backwards. I call it time travel debugging because it's much more fun than reverse debugging, let's be honest. It lets you undo instructions and full statements, and maybe sometimes even go back to the very start of the program. It's very useful for a wide range of things, from race conditions to logic problems to just understanding code you don't understand. All of this was mentioned in the previous talk; if you didn't manage to catch it, I'll give a quick run-over, and then you can use what I teach here to go back in time and see the previous talk. Since a lot of people were here for the previous talk, not many of you will be asking how this is possible... sorry, many of you might be asking how this is possible, but not many of you would be saying that's impossible, which ruins my joke. So I'm just going to explain: how is this possible? Because the CPU is not meant to execute backwards; it doesn't have a way to just undo things. Let's go with a simple instruction. This is an x86 instruction, just adding one to a region of memory. This one sounds like it would be very easy to undo, right? You just need to subtract one from the memory; you can undo it arithmetically. Sounds like it, but it's not quite that easy, because whenever you use the arithmetic unit in an x86 CPU, it overwrites some state. So you cannot just undo things logically. The best way is instead to remember: hey, I'm looking at this address of memory, it is this long, and it has this value; and then you save that in your program. But, as I said, this is the arithmetic unit.
So we also need to save the flags that were there before, because they are going to get overwritten. And every single instruction that executes will also increment the instruction pointer, or program counter; I use those interchangeably here. So we need to remember that too. If you basically built this log inside your program, and added some markers here and there to say, okay, between these marks is a single instruction, you would get exactly what GDB's `record full` does. This is the area I'm most familiar with and the area I maintain. It does exactly what I showed you and nothing more. There are good things and bad things about this version. A good thing is that it just comes with GDB; you don't need anything extra. It can fully reconstruct the program state at any previous point, which is not something every approach can do. The bad thing is that it is really, really slow. If you think the twice-as-slow figure mentioned in the RR talk is bad, try 20 or 40 times, or even more; I never stopped to measure. It's just unusably slow at this point. But it's really nice, and we should make it better. It's also a little harder to support, because we need to teach GDB every single instruction that we want to support, for every single architecture. Nothing says it only works on certain architectures, other than people putting their time into teaching the GDB disassembler. And, as you can imagine, there are a lot of possible things that can go wrong. One of them, like I said, is that we need to teach every single instruction to the disassembler. This is a QR code for a couple of bugs that have been filed for a missing instruction here, a missing instruction there, missing instructions for this architecture.
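The log entry being described (old memory value, old flags, old program counter) can be modeled as a toy in shell. This is not GDB code, just an illustration of the save-then-restore idea; all the values are made up:

```shell
# Toy model of one `record full` log entry for the x86 "add one to memory" example.
mem=41; flags="CF=1"; pc=4096             # made-up machine state before the instruction

old_mem=$mem; old_flags=$flags; old_pc=$pc   # record: save everything the instruction clobbers

mem=$((mem + 1))                          # execute: the add itself
flags="CF=0 ZF=0"                         # the arithmetic unit rewrites the flags
pc=$((pc + 1))                            # every instruction advances the program counter

mem=$old_mem; flags=$old_flags; pc=$old_pc   # reverse: restoring the saved copies undoes it
echo "after undo: mem=$mem flags=$flags pc=$pc"
```

Reversing never recomputes anything arithmetically; it only restores the recorded state, which is why the log, not the instruction semantics, is what matters.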
Also, if you like making code neater, if you enjoy making it more readable: the disassembly code for x86 is a complete mess. There's a single function with over 3,000 lines, and unreadable other functions, and variables that are just a single letter. Please help. But that was just a single way to do it, and a very small example, so let's look at a slightly longer one. Let's say we have an instruction here at this program counter, and then your program goes to this instruction and this instruction, so you can see this was a jump, and then it continues executing everything. You can see exactly where your program went, right? If we saved exactly this information, plus just a couple of bits more, like how long each instruction is and what kind of instruction it is, we could have a very good idea of what path your program took through the code. We could maybe not recreate everything, but we could understand: hey, the bug is happening because some logic is wrong at this point, which is making us take a wrong branch somewhere. This also exists in GDB; it's the btrace recording. It relies on a feature of x86, I think Intel-only, but don't quote me on that, which saves the whole path in a region of memory called the BTS. Whenever the inferior, which is the program being debugged, is stopped, GDB looks at that region and derives all that information: it's this big, and it was this kind of instruction. It is again good because it comes with the stock tool, and compared to the other version it's pretty fast; I don't think there are any big slowdowns, maybe 2x or 3x, which, when we're talking about recording the whole execution, is kind of all right. But, as I said, you cannot reconstruct everything, and it's hardware-dependent; it needs to be in the hardware, and it's not like we can do anything to improve that.
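As an aside, the btrace mode just described is driven by real GDB commands; the session below is only printed as a sketch here, not executed against a live GDB:

```shell
# Printed sketch of a btrace session; `record btrace bts`,
# `record function-call-history`, and `record instruction-history`
# are real GDB commands.
session=$(cat <<'EOF'
(gdb) record btrace bts               # hardware branch trace via the BTS buffer
(gdb) continue                        # run; the CPU logs the path taken
(gdb) record function-call-history    # which functions did we go through?
(gdb) record instruction-history      # which instructions, exactly?
EOF
)
printf '%s\n' "$session"
```

Unlike `record full`, these history commands only reconstruct the path, not the full machine state, which is the trade-off described above.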
It has a couple of issues with test suite regressions, you can hit some assertion errors, and there are some usability problems, like it not being very clear when you can or cannot do something, but it's not an area I looked at much more than I needed to for this talk, so I'm not familiar with the problems. If anyone finds it interesting, we can still chat. Well, I've been talking about looking at one instruction at a time. What if instead you made a whole checkpoint of everything happening in your system at the very start of the program, then kept going, and when you reached a certain point, created a new checkpoint? Then you can fully recreate whatever was happening at an earlier stage and keep going. You cannot step a single instruction back, but you can step back a lot and go forward some. This is what RR does, and this is why I got confused in the previous talk, because it has to work like this in theory: you create a checkpoint, you go forward, you create a checkpoint; to go back, you restore a checkpoint and step forward. At least, this is how I think RR works. There's also a tool called UDB, which I've been told does the same. It is proprietary; I have no idea how it works, and I'm not all that interested in it. And then what RR does, as you have all seen, is create a way for GDB to control the inferior. It does that by creating a GDB server; I'll talk a little more about that later. So those are the three main ways I know of to do reverse debugging. But once we have recorded the thing, how do we use it? You, using the GDB front end, which is the part that handles your commands and everything, can do it in two ways: using reverse-next, reverse-step, and all those commands that were explained in the RR talk...
...or you can actually just tell GDB, hey, I'm going to be going backwards, using `set exec-direction reverse`, and then just say next, step, or whatever, and it's going to understand what you want. Actually, behind the scenes, if you say reverse-next, it is doing just that: setting the direction to reverse, executing the command, and then setting it back to forward. So it does exactly the same thing. When we handle a command, we try to reuse as much of the forward-execution logic as possible, and only where we know, okay, this part has to be different when going backwards, do we add a specific case: if going backwards, then do this. Given that, and assuming that everything works until proven otherwise, what could possibly be buggy? RR, like I said, does a very smart thing: it tries to do as little as possible. It creates a GDB server which can control the inferior, the program you're debugging, and just that. It opens that GDB server and accepts commands from a client, another GDB somewhere. All the command handling and understanding, saying, okay, we're going to move this many instructions or whatever, all of that is handled by GDB. All RR does is reconstruct the state of the program. So, what could possibly go wrong with this kind of setup? So many things. The fact that there had to be two whole talks explaining why this feature is nice and exists should tell you that this is not a very well-known feature; it's not something you see many people using. And yet there are over 30 bugs filed for it. For a feature that almost no one is using, that's kind of crazy to me, because if people were using it, there would be just so many more. And along with things actually going wrong, there are also confusing things and plain user experience problems. So let's go over a couple of them.
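For reference, the equivalence just described looks like this at the prompt. The commands (`reverse-next`, `set exec-direction`) are real GDB commands; the session is only printed here as a sketch, not run:

```shell
# reverse-next is the same as flipping the execution direction around a plain next.
session=$(cat <<'EOF'
(gdb) reverse-next                  # one command...
(gdb) set exec-direction reverse    # ...or the same thing spelled out:
(gdb) next
(gdb) set exec-direction forward
EOF
)
printf '%s\n' "$session"
```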
A command that's very, very useful if you're used to GDB is `until`; you can tell it to go until a loop ends, for instance. It just does not work in reverse mode. Or, well, if you say reverse until, it just does not work; and if you set the execution direction to reverse, it works, just wrongly. So yeah. And there are some commands, for instance `record instruction-history` and `record function-call-history`, that sound like they should work for all recording methods, right? But no, they're only available for btrace, and there's no way to tell as a user. There's nothing in the help text, nothing in the name of the command. There's no bug open for it, but there's a Stack Overflow question asking why. So that is part of the UX problem. And another UX problem: if you're used to GDB, ignore the last 30 seconds. If you're not, at the start here, this is a GDB session that says we are right before calling the function `setup`. So when you're going forward and you say step, you want to step into the function `setup`. What this execution log is showing is that if you say reverse-step, you do not step into the `setup` function: you step through the previous line that was not printed, because, yes, that makes so much sense. It's something we've talked about on the mailing list before; it's not a trivial problem to solve, but it is a real problem. And there's another problem in this very execution log. I say continue to move forward, and GDB says "no more execution history". My very scientific testing of asking one friend has revealed that this makes it sound like you cannot execute forward any more and have to start again. You can: it's just not going to be simulating any more, it's going to be really running. From the audience reaction, I think more people are confused by this. So we have a couple of user experience problems. And if you like a challenge, if you don't want something easy to start with, we have really hard issues.
We have a problem with multiple inferiors, because GDB can open multiple programs to be debugged, and there's no way for the recording to know which program is actually being recorded. There are lots of problems with handling signals and things like that, because this was introduced before GDB could do that, and no one ever looked back. GDB's recording itself has a problem with multi-threaded programs: I showed you all the information, the region of memory and the value; where do I put the thread information there? Yeah, we don't record multi-thread stuff. So that's one reason to use RR, until we fix that. So please help us fix that; I want people to use GDB. And, as I said at the start, it is just unusably slow. We need some profiling to be done, to figure out why it is so slow and how it can be made faster, so that it gets used more and people find more bugs for me to keep working on. And then a question some people might be asking is: where do I come in? Why am I giving you this talk? I said at the start that I like reverse debugging, and I like getting people interested. So if anything I said here sounds like an interesting problem, or an interesting thing you would like to see how it works and how it can get fixed, just hit me up, and we'll chat and see where it goes. Does anyone have any questions I haven't answered yet? Okay, yeah. Yeah, in the previous talk, on RR, you were supposed to enable some kernel flag, something perf-related. Would you know why it is necessary, what in RR's internals needs it? Yeah, and it was also said it was for security reasons, and I don't know if the person who knows that... oh yeah, he's still here. Right, so the reason we need the perf flag is because, as far as I understand it (again, I didn't look at RR), whenever a perf event happens, we get a checkpoint.
So if a perf event happens, we dump everything; and to read perf events and get that kind of internal information about another program, you need that flag, so you would need it for RR as well. Since I can't answer fully: do you know, can you provide some examples of perf events, when they happen? I'm sorry, I can't, because, again, it's not my area; I look at similar stuff, sorry. I think you were first. So, as I understand it, and I may be wrong, you record whenever there is a write to memory? No, when perf events happen. No, I'm not talking about RR, I'm talking about GDB. It records every instruction. Yeah, every instruction, but for reverse execution, yeah. Do watchpoints work with that? So the question is: can you use watchpoints with GDB recording? And the answer is yes, you can. Most of GDB has no idea that a recording is happening. We sort of separate what handles commands, what deals with threads, what deals with the CPU itself, and somewhere along that stack there is a part that goes: oh wait, you're trying to execute in reverse; I'm not going to send that to the CPU, I'm going to do it myself. That facility has no information about watchpoints and everything, and conversely the watchpoint code has no idea that this is happening; it just checks later whether the watched value changed. So everything that works forward works backwards, except for changing the state of your program, because we're simulating based on what happened before. I think that answers the question, yeah. How does it work with system calls? Does it work there somehow, is it able to record kernel space or something like that? I'm sorry, I don't know; I've never tried seeing what happens.
You won't be able to record kernel space, because whenever you step over a syscall instruction, you are never stepping into kernel space. If you want to debug kernel space, you basically need to debug the Linux kernel through QEMU, for example; then you can debug both user space and kernel space, but otherwise no. So it's not going to be able to handle the side effects of a syscall, but it knows that a syscall has happened and does everything else, basically, I think. I think Mark was first. So, the multi-threaded case: can that ever work? Yes, I have a couple of ideas how. You could have multiple separate histories, one for each thread; or you could have an extra field on each entry that says this is for thread X or thread Y; or you could order things as a single instruction per thread. There are a couple of ways that I have not tested at all, and I don't know if any of them work, but I don't see a theoretical reason why it would just be impossible. The thing is, one part is the history, your log, what we record. The other part is that we would also need to serialize execution, because if you have two threads, even if you serialize the log, you don't know which one changed memory. If two threads are just poking at memory, changing memory, how would you know which one it was, which instruction caused the side effect you're seeing? So we would need to serialize, meaning the way this works is basically single-stepping each instruction: single-step thread one, then single-step thread two, and so on, in a round-robin fashion. Which would make it, yeah, even slower, but it would also make race conditions more likely to show up, so maybe it's a good thing, I don't know. Doesn't that kind of just put it back in all-stop mode?
Yeah, this is, like I said, a complex issue. If you want a challenge, let's talk on the mailing list. You would need to guarantee both forward progress and everything else; that's a real mess. Let's talk on the mailing list; it's a little complicated for right now, I'm sorry. If you have multiple threads where you are trying to find a race condition, and you know which thread is using which memory, because you recorded that, then you can tell the user: hey, these threads are, at that time, competing for the same thing. Yes? I think you would also need to track all the mutexes and things like that, because if you don't, you don't know if they are really racing or not. Okay, I'm just going to repeat that in case anyone is watching from afar. The comment is that if we know all the threads that are trying to access the same memory at the same time, we can tell the user that a race condition is happening. In theory, yes, if we keep track of the mutexes and everything; but the problem, again, is that the recording part is very far away from everything else in GDB. So unless you manage to do this recording, and later also create a command that does that kind of querying into the data, there's no easy way to make that information available to the user. We're not set up to get this kind of low-level stuff out easily. Yeah? I have a question; it's not a thread-based question. Thank you. I work a lot with microcontrollers, for example with four megabytes of flash memory, something like that. And I'm wondering: how hard would it be to make GDB time travel work on such microcontrollers? I guess the memory space is kind of an issue. Yeah. First off, if you're running GDB on the microcontroller itself, which I don't think you would be, because GDB is big, then memory becomes an issue. If you're not, I don't know if gdbserver is set up to do that.
And if it were, it would have the same memory issue. So we would need a facility to get the disassembly information into GDB itself and then send it back to the gdbserver. The problem is also that you have scalability, and then you have distribution, and you need to... So yeah, that's a complex use case. This backend, the... record full? Yeah. That is all inside GDB. So if you're remote debugging, you don't need to teach the server anything at all; it's all being recorded on the GDB side. Oh, really? Yeah. Huh. So you can use a Linux gdbserver with this and it works. Okay, I'm surprised. You could maybe use OpenOCD in this case, but you need some kind of... So what you need is to teach GDB's reverse debugging engine about that instruction set. It only supports x86 and... I think it does ARM and something, Power or S390; there are a couple of architectures that are at least partially supported. You basically need to create your own disassembler from scratch, unfortunately. There is a disassembly engine inside GDB, but it only produces text, and I tried to reuse it, but... So right now, creating your own disassembler from scratch is easier. Yeah. Oh yeah, sorry, we're out of time, so we can talk more in the hallway track, or probably tomorrow, because I'm going to be managing everything here. But thank you for coming. If anyone would like to contact me, this is my contact information. Yeah, thank you.
GDB on Windows: status & plans
Should I start over? No. Hello — checking, sound check. Sorry, everyone at home. Alright, so: not asynchronous. So you move this to a separate thread, and then there's a way for one thread to communicate the event: something happened. We made this change in GDB fairly recently, in GDB 13. Before that, the debugger really blocked: you continued the execution and couldn't do anything else until the inferior stopped. So that was something that was improved in GDB 13. Now — skipping a few slides — in GDB 13 this matters mostly to IDEs: you can press the continue button, the inferior is now executing, and the IDE at the same time can execute GDB commands like disassemble, install new breakpoints, search symbols, things like that, while the inferior is running. Before, it couldn't; the IDE would have to stop the whole program first. Going back: this other function, the counterpart of waiting for an event, is the one you call to continue after the event. It has this parameter where as argument you can pass either of these two macros. And this is basically like in GDB when you get a signal and you can decide whether to pass the signal to the inferior or not — you can suppress it or pass it. When you pass it, it calls the signal handler in the inferior. There's something like that on Windows; not that important, but similar enough — except they call them exceptions, not signals. With this function you decide whether to suppress the exception or not: will the inferior continue processing the exception, or will it be suppressed? And it's important to note that you make this decision when you call this function. I mentioned this already — keep that in mind. So this is, very basically, how the debugger works internally. All-stop mode is the default mode in GDB; everyone knows how it works. So here we have five threads. This is time, time period one.
Everything is running, runnable. And then T3 is about to hit an exception. It hits the exception, and you're calling WaitForDebugEvent. It returns saying an event happened, and the kernel freezes everything in the process — all the threads are frozen — and this one got an exception. At this point the user is inspecting the program, debugging the actual bug, reading memory, backtracing, blah blah. And finally they decide to resume execution. That's when GDB calls ContinueDebugEvent and passes that decision of whether to suppress the exception or not. So the decision comes late — it's here. And then all threads go back to being runnable again. That is, if you want everything running, then everything stopped, then everything running again. There are times where you'll want to resume only one thread and leave everything else suspended, frozen. Internally, GDB needs to do this, for example to step over breakpoints. But the user may also want to focus on a particular thread, leaving every other thread frozen. The user interface for that is to enable a setting. This doesn't currently work upstream, even though internally the machinery is all there — because GDB needs to know how to step over a breakpoint — but it's never been exposed to the user; nobody wired this up to the backend. So I made a little change in my work and it actually works. So it's the same as before: the exception triggers, the user inspects the program, and then decides to resume T1 instead of T3. GDB keeps everything else suspended and calls ContinueDebugEvent for T3, because that's where the event came from. And now T1 is runnable. But what if you want to do the converse: instead of running one thread and stopping everything else, you want to stop one thread but leave everything else running? That's what's called non-stop mode. And this is what I wanted to make possible on Windows, because it's been supported on Linux since 2008 — I know because I worked on it.
So a long time by now, and it's also supported on remote targets, meaning gdbserver for Linux, but also some other embedded systems out there support this mode as well. But native Windows debugging does not. So non-stop mode means only the thread that got the event — the breakpoint — reports a stop to the user, and everything else continues running. This is interesting, again, mostly for IDEs: you can imagine a big list of threads and only one of them reports an event. But it's also interesting because maybe one of the threads is important to keep running — maybe it's a watchdog, or something that needs to ping a server, and if it stops pinging, the program doesn't work; there's something on the other end that needs to see activity while you inspect whatever some thread's breakpoint triggered. And the reason I thought, all these years, that this wouldn't work on Windows is that we have this problem: WaitForDebugEvent, that magic function that reports the event, suspends everything already. The kernel just does this. There's no way to tell the kernel to suspend every thread except the one that got the event — and we want to leave them running. So, naively, I thought: maybe just immediately suspend — block, freeze — the thread that you care about, and call ContinueDebugEvent, right? But you can't, because that's too early. We just got the event; the user hasn't yet decided whether to pass the exception or not. That only happens afterwards. Then I was looking this up last year, and I noticed on the Microsoft website describing these APIs that they introduced a new flag for continuing debug events. I read it and I was like, really? It's like they wrote this just for me. Hey, that's awesome. Well, it's not the ideal thing that I would like — I would like a way for the kernel to not freeze everything, and it still freezes everything. But what you're saying when you pass this flag is: I got the event. Okay, cool.
But I don't want to handle it right now. So I call continue, and I'm asking the kernel to report the event again as soon as the thread becomes runnable. So what I do is call SuspendThread on the thread that got the event, so it's no longer runnable, and then call ContinueDebugEvent with this flag, saying: give me back the same event again once I make the thread runnable. That's what it's saying here, in other words. How does this actually work in practice, using the same diagram as before? I prototyped this quickly with a hack and it worked — amazing. Now I just need to make it clean. And of course that's — oh, sorry. So, same as before: everything is runnable, and T3 is about to raise an exception. It raises the exception, the kernel freezes everything — there's nothing I can do to control this. Then I freeze the thread that got the event, and I call the function with this new magic macro, and GDB remembers that T3 will get a repeated event later. Now the user is inspecting the thread, but everything else is running again — the kernel paused all the threads, but I immediately told it to resume everything else. So there will be a small freeze, some increased jitter caused by the debugger, but most of the time all the other threads will be running. Later, the user decides to re-resume T3, and the debugger just calls ResumeThread to unfreeze it. And remember, because the thread is now runnable, the kernel is going to re-report the event. Because we recorded earlier that we would get a repeated event, the debugger knows: okay, it's a repeated event, and now I need to call ContinueDebugEvent with the proper flag, saying whether to suppress the exception or not. Yeah. And a colleague of mine wondered: does this work when multiple threads hit the breakpoint before you decide to resume? Yes, it does work. Same thing as before.
And here you are looking at this thread and this one raises an exception — everything works. You can look at this offline if you want to. Yeah, there's a lot more to this. Once the hacky version worked, I needed to make it clean, and I stumbled on a lot of things that I don't have time to go over right now. I'm going to touch a little bit on the test suite. How much time do I have? Three minutes. Plus five. Yeah, okay. All right, so I put this in the abstract. The reason is that when I talk about the test suite, I need to make a distinction: when I say GDB on Windows, there are actually two ports for Windows. There's GDB compiled as a Cygwin program, and there's GDB compiled with the MinGW toolchain, which means it's a native Windows program. Cygwin, for those who don't know, gives you a POSIX environment. It's a collection of tools, but it's also a runtime — a DLL that every tool is linked with — and this runtime provides POSIX things like signals, PTYs, and a bunch of other stuff. The C runtime used is not the one that normally comes with Windows; it's based on newlib. It tries to be as close to a Linux environment as possible, so that you can recompile a Linux application with minimal changes, quote unquote. It works. So it's not an emulator — you have to recompile your program. Right. So the core of GDB has two ports: the event loop, for example, is based on select/poll for most Unix ports, and Cygwin is one of those. But the native version of GDB for Windows, based on MinGW, has a separate event loop based on the WaitForMultipleObjects function, which is the Microsoft version of select. Right. But the backend — the code that talks to the debug API, those functions I mentioned before — is shared between both ports. It's the same code, except that for Cygwin there's extra magic to make some Cygwin-specific things work.
And this is where I get to the test suite. Part of making this work and upstreamable was getting to a point where I was sure I wasn't breaking things — because making this work involved revamping the backend very substantially. So I wanted to make sure I wasn't breaking things. So, run the test suite, right? Except running the test suite on Windows is a major pain in the... The GDB test suite is built on DejaGnu. DejaGnu is an infrastructure built on Expect, and Expect itself is built on Tcl, which is a programming language. And DejaGnu assumes a Unix-like environment, which you don't normally have on Windows: it assumes POSIX shells and utilities — kill, cp, mv — and there is no native Expect port. There was a company, ActiveState, that had something like that, but they killed that project some years ago. So you have to use something Unix-like to run DejaGnu. If you test GDB in a Cygwin environment, you just run make check and it does work. It's super slow, not stable, but it does work. But if you want to test the native Windows GDB, that's not the same thing — it's a proxy, but it's not the same thing. Remember, I said the core of GDB has different code paths. So I would want to be able to test that one as well, the MinGW GDB. So how about we run the test suite — DejaGnu — under Cygwin, but make it spawn the Windows GDB? Yeah, that's a potential idea. But the problem is, it's a Cygwin Expect spawning a Windows process, and the input and output are going to be connected to a PTY on the Cygwin side, but what the Windows GDB sees is just a pipe. Because that's how Cygwin PTYs work under the hood — it's a pipe — GDB is connected to a pipe, so it is not a tty according to isatty, so it disables everything interactive, and the test suite completely falls down.
And something else: DejaGnu expects the inferior to be run under a PTY, so there will be terminal mode controls — time's up. But I have the five? Because... I'll tell you, if you want one minute, you can do it. I'll give you one minute. I'm almost done, just 30 more slides — no, just one more. Right, so there are some ideas to get this working. There are also path mapping issues, because what Expect sees path-wise — a /cygdrive/c/... style path — is not what GDB sees; because it's a native program, it sees a drive-letter path with a colon. And another problem is that when the GDB test suite wants to test multi-threaded things, the tests are all written with pthreads, which is not something native to Windows — though MinGW-w64 does have the winpthreads library, so maybe we could use that. I have some ideas to try to make this work, but I haven't had the time to actually experiment much with this. I tried other things that I thought would be interesting, but they didn't work. Right, so about compilation, in case anyone here is motivated by this talk and wants to help: compiling GDB on Cygwin is super slow, so the way I got around it is to cross-compile, and there are some things here you can do. So I can cross-compile to Cygwin, but to run the test suite I need to run it inside Windows — I can't avoid that. But I can point the test suite inside Windows at the GDB that I built on Linux. Whew! All right, so maybe I should skip ahead. So: test suite bad, need to fix a lot of things — that's the summary. And this is the thing for the future: make it possible for GDB to debug programs compiled with Visual Studio. That is something that is missing, it's making people not use GDB on Windows, and I would prefer people not to have to think about using other tools — you know, staying in the lane.
So at some point I would like to work on this, but, you know, no time for that. I'll just leave this on the screen in case people have questions — maybe one question? Nothing. All right. Thank you. So. Okay, actually there is one minute left. Is there one quick question? Yeah. — Okay, so here's my question. — Oh no. — Have you tried using Python to run the test suite, having Python execute GDB and such? — I have. That would be writing a new test suite. — Yeah, that's right. — I know there are actually some people that do that, some companies, but I wanted to find a way to run the existing tests before giving up completely. Okay.
Online Debugging and ABI Data Services
Why are you all here? This is a boring topic. I was not expecting so many folks, but I'm glad to see you. My name is Frank Eigler, I'm a Red Hat engineer. I don't have a bio slide because I'm not that interesting, but I've been in free software for a couple of decades — almost three, quite a while. So this talk is about debugging information, and about another type of information that we hope to popularize storing online for occasional uses. Now, many of you know debuggers already — all good. The other subject is a little bit more esoteric, but we can still talk about it. Is the mic coming in okay? The mic is just for the recording, I know. By the way, who's the next speaker after me? Is that person in the room? Good. You might be able to sell me some time. I'll keep this pretty short. Well, I offered you ten bucks, but too late now. So, this is boring: we all write software. A binary comes out. Someone packages the binaries into a distribution. The distribution goes out to people. People run the binaries, and everyone is happy ever after. That's all that ever happens. Right? Right? So, debugging is near and dear to me. I worked on the GDB debugger a little bit here and there, back in prehistoric times, and I've worked on debugger-like tools ever since. Despite all my efforts to try to make debuggers irrelevant, we still have bugs in our software and you still need these silly things. So, unfortunately, here we are, and debugging is not so easy. So I have two parts to my presentation. The first part is about debugging information that's online. The second part will be about something else that's online — tangential, but you'll see the connection pretty shortly. So, many debuggers... is everyone familiar with how debuggers work, roughly speaking? Not you, Pedro.
So one of the main challenges of a debugger is that it has to operate at the machine level — at the register level, at the memory bits-and-bytes level — in order to understand the operation of the program you're trying to debug. And this is despite the compiler doing its darnedest to erase any remnant of how the original source code looked. It's doing its very best at nuking every unnecessary variable access, maybe tightening up the data structures, shuffling things in and out of registers all the time, just to make things go damn fast. Compilers are built that way — and they're great — but if you need to debug, you need a sophisticated way of telling the debugger where to find all that stuff. Long introduction, but: we need good, high-quality debug info, which basically gives metadata about where every piece of the source-level constructs lives at runtime in the actual machine — which registers, which memory spots, how each complicated data structure is laid out. All those things have to be saved by the compiler, put somewhere, and then ultimately made available to the debugger. So, does the word DWARF mean anything — do you know what that stuff is? Yeah. Okay. Can you give me a few adjectives about DWARF? From the heart. Say again — did you say short? Liar. Liar. Yeah, so DWARF is a very compact, amazing little, almost graph-database kind of thing. It is absolutely anything but short. It can be an order of magnitude larger than the actual binary. And because it is that large, distributions tend not to ship the thing to normal users, because, like I said here, users just run things, right? You never debug. So they don't get the debug info normally. But say you do run into a problem you do want to debug — well, then you need this information, right?
So either you can be the developer who already had this, or — for the last 20-ish years — various distros have made available the original debug data that the compiler generated, but it's not installed. It's somewhere off in a separate repository you sometimes have to enable, then go root and download, and if you're lucky you get the corresponding debug data for the binary you're trying to work on. So the brand-new 2019 thing — which I remember my team talked about here two years ago, when it was younger — is this gadget that we, a community, built, called debuginfod, which automates the distribution of debug info and other such precious things. The whole idea is to make it as easy as possible for people — not just developers, but ordinary users — to automatically, without special privileges, get all this stuff for as much of the system as possible, without having to go root, without having to activate channel rhel-debug-blah-blah-blah. Okay. So that's our little baby there. The first URL points you to a website that describes the current situation. As I said, the project is now getting to its third or fourth year, so I cannot call it a prototype in any sort of honest way, but work is still ongoing quite a bit. It's a small server that ships with the elfutils tool set, which is related to ELF and DWARF decoding and processing and such — there are a lot of low-level, machine-level tools in there. So debuginfod ships with elfutils, and it's shipped on all the major distros that I know of. All right. I forgot to mention this, but all it is, is that it allows a debugger-type tool to request debug info, as well as source code, for any binary, based on the hexadecimal unique build ID that's inside the binary.
So this is a kind of hash code that's been in binaries for almost 20 years, thanks to Roland McGrath and a bunch of other people who made it happen way back in the early aughts. So it's an HTTP server — just an ordinary, boring HTTP server as far as the clients are concerned. It's very cacheable, very lightweight, very, very simple — no XML API, blah blah blah. It's just HTTP. Because it tries to be really simple, we found that over the course of a few months to a year, most major debugging-type tools grew the capability to use this API, this web system, to fetch this stuff. So obviously GDB is one of them — it was one of the first — but SystemTap is another tool of this kind that's close to me. Practically all the debuggers and tracing tools and profiling tools we know of can do this now. So the clients are well dispersed across the ecosystem. The servers are also in really good shape. Over the last few years, a whole bunch of distros came online running their own debuginfod server. Fedora was one of the first, and CentOS is up there, Debian and Ubuntu and other smaller distros — they're all running this server now, whereby their own distro is fully debuggable through this system. So that's cool. We're not quite finished with it. There's a piece of extension work we're still doing, and one part that's particularly cool is cryptographic signature preservation for individual files. As you may know, archives as a whole can be signed by distros, and then a client can verify that the archives haven't been modified — that's cool. But if you don't want to download the whole RPM, because it's too large or for various other reasons, and you just want to extract the one source file you want, or one little DWARF file, you still want to be assured somehow that that file is what the distro initially packaged, right? You want to make sure it wasn't adulterated somewhere in the middle.
It's kind of security-critical. So we're bringing into this web protocol the propagation of the signatures that may have been applied by the distro at the build-system level. It's not easy, and not many distros do that level of signing yet, but Fedora and very modern RHEL do, and we hope others come online too. But what's nice is that each individual file has its own crypto signature, which can be passed down through debuginfod all the way to the clients, so they can be assured they get the correct, 100%, grade-A certified file. Alrighty. Psh. I couldn't bring myself to try a demo here — I was just too chicken — but the whole idea with the debuginfod client stuff is that it is really automated and integrated, and you don't have to do anything special. On the distros where this is enabled, you don't even have to do the first line; it'll be done for you in the /etc/profile for all your shells. And you just run GDB on any random binary, or your own binary, and it'll pull in the debug info for any shared libraries that you're using, any source files you're stepping into — it'll just pull each piece down one by one as necessary. And it just becomes a non-problem. So there's almost nothing to see, because it's just so smooth and automated. Parts of it can be slow for hilarious reasons, but I'll explain why if someone asks me that question. Anyway, it is nice, it is out there in many of the distros, and I hope you enjoy it and that it makes your lives a little easier — that is, if you ever encounter bugs. All right, all right, all right. So, switching over to the other half of the topic. Does everyone know what ABI means? Is there someone who does not know, so I can justify talking about it? No? Thank you. I'll just be brief — as brief as possible. So, it's interesting.
There's a lot of interest, especially from ISVs who want to build a piece of software once and then distribute it, letting people run it on multiple distros. But even normal projects might want to build a binary of their own releases and then ship that to various other distributions so that it can be used unmodified. Sometimes they have other problems, like wanting to match different generations of shared libraries, which might have had little evolutions of their own ABI — a function signature got changed, or a type got changed, something that's not the same at the binary level as it used to be — which means that linking against them is no longer safe. Some shared-library projects are exquisitely careful about this, and they take incredible measures to prevent this kind of breakage: when they update their shared library, it stays backward compatible to decades ago, through a lot of hard work. glibc is one of the best in this regard. But some libraries are less good at that. So if you want to ship a binary that will work with multiple shared library versions, you either need to kind of ignore the problem and hope it doesn't happen, or you need a tool to check whether this will work with that. It is a bit esoteric. But there are several solutions that try to work around this whole problem by just bundling one particular version of a shared library from some particular distro, packaging it all together into a container image or a Flatpak or whatever, and plopping the whole thing onto your system; they've done the integration checking, so they know it'll work. It's legitimate — it's just very space-obnoxious, and some of them still kind of intermingle the bundled libraries and the host libraries, doing version checking and hoping that the host's libGL will work with their version of libXt or whatever. So even this is a bit fuzzy.
So anyway, what we're proposing is that projects that deal with multiple versions of shared libraries, and that are concerned about ABI compatibility checking, consider the gadget I'm going to talk about. Okay — maybe we'll just skip this one; everyone knew. Is there still a person I didn't tell what an ABI is? It turns out to be exactly the same metadata the debugger uses to find variables at runtime. This is exactly the same data; it just happens to be useful to examine even at compile time. So even with just the libraries sitting dead on disk, by parsing and processing the exact same debug info, you can tell whether a shared library provides the same binary guarantees that a given program requires. Sorry if I belabor the obvious, guys. Okay. So our team at Red Hat — one of the tools they work on is this gadget called libabigail. I'm not sure who works on that... that guy there. Yeah. And it's awesome. It's a suite of binary tools that compare shared library versus shared library by extracting their debug info and parsing it piece by piece — function by function, type by type — making sure they're all compatible with each other. It can also match a binary against a variety of shared libraries and see whether they still meet each other's needs. Like a good marriage, maybe. One limitation, though, is that to do this work it needs to have all the files you want to compare right there on your local disk. So if you want to compare your binary against, say, a RHEL 6 version of libc or libGL and an Ubuntu version, you need to somehow get hold of those files first; you can't really do it otherwise. So our gadget — the new gadget we're adding to libabigail — is a way of not requiring you to download all these shared libraries, and all their corresponding debug info, for all these versions of distros that you might not even have or want, that you're just curious about.
And a key to that is to realize that abigail can take not just DWARF files but also an XML representation of the DWARF. The XML is just a conversion — a subset and a conversion. So — that's my four-minute warning; we're doing okay. And because it's XML, it's large, but it's textual and it's compressible. And with the one-track mind that I have: how can we store a large amount of XML for all this shared-library data for a large distro? It's text, it's large, you want to share it. Well, how — oh no, that's not the next slide. I'm going to leave that a mystery for 20 seconds. Oops, one moment. Two moments. It's pretty soon, don't worry, it's good. Ha ha ha. Okay, we'll just skip over here. So we're writing a little tool that is really just a thin wrapper around the existing abigail tooling to extract this XML version of the ABI — and jam it into Git, because we love Git. It's a great way to store text files, a great way to ship them, a great way to compress the heck out of them and let versions coexist in some nice way. So we can extract XML from a large corpus of files: we can give it a whole boatload of RPMs or Debian packages or whatever, it will automatically extract all the shared libraries, it'll download all the debuginfo files automatically — via debuginfod, if necessary — and it will generate a Git tree which has all this XML stuff nicely structured, which can then be used by the tool itself to later do a compatibility check. That way you don't have to install the foreign distributions anymore. Anyone can do you the honor, or the favor, of collecting this ABI XML stuff, sharing it in Git, putting it up publicly, and then anyone who wants to compatibility-check against that version of the OS no longer has to worry about this. This is a crowdsourceable enterprise. So, I tried this at home — no demo, because no demo. But it is really not hard to use. All the prep work is just getting the software.
But the thing is that the data — the one crowdsourced version of the data is now a couple of gigabytes. It has a big section of RHEL 8 — all of RHEL 8 in there as ABI stuff — plus a few Ubuntu releases just randomly in there, and we plan to expand it to have as many distros in there as people are willing to give us. To submit new information, it looks like these command lines. This one just demonstrates that you can build your own shared library at your own institution and generate your own database. This version shows that it can mass-import whole RPMs, and it'll do the right thing — decompress and aggregate all the information. And at the top here is how you check a random binary against the entire set of shared libraries that that binary needs. There are a few bits of cleverness in there — small; it's not very clever, just a little clever. For example, as you know, libraries get updated every now and then, and we want to make sure we can store more than one version of the same shared library in the database. There's not just one glibc but ten, one per update. So they all have to have a naming convention that lets them coexist — and we do that, but those are internal details. The basic thing is: you can submit to the database this way, you can check against it that way, and it tries to be that simple. And that's my conclusion page right there. All the code is open source, obviously, and all the servers are extremely low-tech on purpose: the first one is a very thin HTTP server, and the second one is literally just a Git server that happens to have structured data inside it. So easy that even I can do it. Very, very straightforward, baby technology. And thank goodness, that's it. Can we have entertainment — questions? Yeah, we have minus five minutes. Minus five minutes, my God. Okay, any zero questions? Thank you.
Poke all the microcontrollers!
So hello everybody, welcome to this talk. The title is "Poke all the microcontrollers", but the story is really GNU poke inside GDB. So we'll talk about poke and GDB more than microcontrollers — sorry for that part. But let's get to it. First of all, what's GNU poke? It's — you can read it here — the extensible editor for structured binary data. So what's binary data? It's data encoded in sequences of bits — binary digits, 0 and 1 — like this. And there are structures there, meaning there are relationships between the different bits, okay, like here, grouped in four bits, with labels. And we can assign meaning to part of this structure — these eight bits as a whole can be the number 67, for example, as a signed 8-bit integer, or we can assign it the meaning of the character 'C', as ASCII does. So that's the structure part. And then you can have more complicated structures — like, this part is the length and this is the table, and so on. And the "editor" part is that you have the CLI, with which you can view the content and change it — hence the name poke. And it's immediate, interactive — it's for when you're exploring data, you're debugging, you're designing a data structure for encoding data; that's where it's best. And it's extensible: there is a DSL to describe these relationships between the bits, okay? And bits really means bits — you can address each bit. So inside GNU poke we have this architecture: there is libpoke, the library which has these three major components. The first is the PKL — the Poke programming language incremental compiler — where incremental means you can add definitions and declarations.
It's statically typed, and you add stuff to the namespace; you can redefine things. Then the PVM: it's a compiled language which compiles to the PVM, the Poke Virtual Machine, supported by GNU Jitter, written by Luca Saiu. And all the magic around the bits lives here, in the IO space. So — I don't know what's going on. No, please. Okay, I can go here. Okay, it's not easy to see, my God. So you can run poke, which is a program, a command-line interface — that's the poke part of the story. And there is poked, which is a daemon, so you can send Poke code through a Unix socket; you can build interfaces and things like that on top of it. There is a new component, pokefmt, which goes through source code looking for special tokens where you can put Poke code. It's useful when you're generating test cases: you write Poke, and the result in the text is, say, an instruction you assembled in Poke, which ends up as a uint32 number in hex that you or other tools can work with. And it's easy to debug, because when you're writing the test, the Poke code stays readable. This is also useful when you're working with hardware and have a bunch of registers: you describe which bits you want to set, generate a config or C file, include it, and you're done. You don't need to be writing C functions — GPIO init, clock setup, that kind of thing. You can write it all in Poke, generate the final numbers, and just write the number to the register. So: GNU poke in GDB. WTF — I cannot say the word. Because GDB is good at debugging; we are not. And if you want to be, it's not easy — you only get good at it after some years.
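As a rough illustration of the "generate the final numbers" workflow described above — not GNU poke itself, just the idea in Python, with made-up field names and widths standing in for a real register layout from a data sheet:

```python
# Hypothetical register layout: (field name, bit offset, width).
FIELDS = [
    ("enable",  0, 1),
    ("mode",    1, 3),
    ("divider", 4, 8),
]

def pack(values: dict) -> int:
    # Pack named bit fields into one 32-bit register value,
    # the way a Poke type describing the register would.
    reg = 0
    for name, offset, width in FIELDS:
        v = values[name]
        assert v < (1 << width), f"{name} out of range"
        reg |= v << offset
    return reg

word = pack({"enable": 1, "mode": 0b101, "divider": 42})
print(hex(word))  # 0x2ab
```

The output is the single number you'd write to the register, instead of a pile of init functions.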
But GDB, while good at debugging, is maybe not as good as GNU poke is at poking binary data. So this should be a happy marriage, we hope. And the obvious question is: we already have Python integration in GDB, so why do we need another language? And the answer is — right, that's correct: Python is a general-purpose language, you can do whatever you want in it, of course. But there's a but: it's a general-purpose language. Poke — with an uppercase P, that's the name of the language — is a DSL specifically designed to describe and poke binary data. That's why we think it's a good combination. So what's this talk about? My initial plan was more ambitious, with a lot of live hardware hacking — you know how it goes when hardware is involved. So the plan shrank a bit, but I have hardware here. It's not disconnected — it's here, it's real. So the abstract is partially right, but not quite; I was too ambitious. My fault, not a limitation of the tools. It's a demo showing the integration of libpoke inside GDB, using this hardware, which I showed you. It's an ESP32-C3 module, a RISC-V based microcontroller — a 32-bit RISC-V thingy. In this demo — here, if you can see, I connected these two pins together to prevent the chip from going into the wrong boot state, so it always boots up correctly. The LED part I copy-pasted from an image; I have the link at the end. And since it's RISC-V, these are the compiler flags if you want to compile for this target. And the idea is: you want to do board bring-up.
So this is the idea, the whole thing. The first step in board bring-up is to check the hardware: things that should be connected are connected, and things that should not be connected are not. This seems obvious, but it's very important. Then you connect it to the power supply with a current limit and check that it doesn't draw too much. And then comes the next part. Classically, you go to the C compiler, you write things, and you gradually add more stuff — GPIO, LEDs — starting from something small and then adding more complicated things on top. But what I'm proposing is: you have GDB, you have JTAG — a command-line interface, it's alive, it feels like a shell — and you have the superpower of Poke. Then you're well equipped to experiment with different ICs, writing to registers and timers and such, right? So why this hardware? Because it provides JTAG debugging over USB — you don't need any external probe. That's great. It's also cheap. But we have to compile GDB ourselves, because this integration is not upstream. So I used this fork of GDB from Espressif, the vendor of this chip, on this branch. And then you need libpoke — both of those things work together, and you can find it here. The patch for the integration is old and not updated; I back-ported it to this branch of binutils-gdb and ported it to a newer version of poke in order to be able to show something. So let's poke together. We use OpenOCD to create a GDB server, and the next step is to run the GDB we compiled. Okay — nobody has questions yet, I know. So this is the .gdbinit; you have to limit the number of hardware breakpoints and things, blah, blah, blah.
And this is the other part of the story. For the people who want to play with this, there is this repo here. The official SDK is huge — I hate it — so this is a simple thing; in that branch you have three files, everything you need, and you can play with that. And then there is this data sheet, which is awesome. So have fun. Okay, let's go to the next part. So yeah, this is Poke. You can see that we can describe numbers with weird widths — this is an unsigned integer with six bits. It should be fast, yeah. It's a programming language — a good one. You should be careful when asking for things, you know. Everything is good, all is good, as the Germans say. We can also have aliases for types, so you can have uint<7>, uint-whatever. Okay, it's not important anymore. So this is the OpenOCD part, which you can see I already did — I hope it still works. And then here we have this GDB thing with all the hex stuff; I put things together, so it's not clean, I did not show you everything. And then — yeah, I have to write it here: it was riscv32-esp-elf-gdb, you have to have that, with the gdbinit. So we are here — please work. It's reading flash, it's doing that. Good, great. And now GDB complains — no, I know what I'm doing, it's okay. Because there is no ELF file or anything, it has no idea what's going on. You can see with layout next: we have a jump and then some weird stuff somewhere. We can step to the next instruction; it's somewhere, okay? So now, poke. You have this poke subcommand in GDB, so you can ask poke: read the 32-bit unsigned integer at this offset — what's the address of that thing?
0x41231e9c, for example. And is it correct? I hope it's correct — you should see the same number. Okay, I cannot verify that... oh, we can, we can, like this. So it's the content of this — there's my mouse. Please work... doesn't work. Why? It's your fault, you know. It's 1e9c. You should get the same — we're not getting it because of the endianness. You have to do .set endian big, I guess — or little, or, I don't know. Okay. Still not... so it was little. Yeah. So — please work... finally, good. I'm happy about that. So you have everything: you can define variables here, you can print stuff with printf — please work, don't crash — it works, and doesn't crash. So you see, you have all the old CLI capabilities of poke here. And then — you saw this thing, it's a module; we call them pickles. So we load this one, it's part of the distribution: riscv.pk. So we say pk load riscv and — good. You load the module and it gives you a bunch of definitions. What I'm interested in is this one here: an instruction of this RISC-V. Please work. Okay. So you have these many variants — either it's in R format, I, S, B, whatever. So we want to decode the integer we had here as an RV32 instruction. Okay, layout next, more next, TUI disable, please. Okay, great. Thank you, Petru. So now you have all of this — the immediate part and so on — because, if you remember, it was... we can disassemble that also, if we do this. Okay, thank you. Disassemble from here to — from here to what the hell... So C80, let's go for — no, it should be... no, nine, yeah, A0, yeah. Yeah, this. So here we had this thingy, and now it's a poke variable.
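The endianness confusion in the demo can be reproduced in a few lines — the byte values here are illustrative, not the actual flash contents, and the point is just that the same four bytes give two different 32-bit integers:

```python
import struct

# Four bytes as they sit in memory. The ESP32-C3 is little-endian,
# so "<I" is the interpretation that matches what the CPU sees.
raw = bytes([0x9C, 0x1E, 0x23, 0x41])

little = struct.unpack("<I", raw)[0]
big = struct.unpack(">I", raw)[0]

print(hex(little))  # 0x41231e9c
print(hex(big))     # 0x9c1e2341
```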
We can call methods on that poke variable. And this — please unmask, it's not the time for a syntax error. Okay, so you see, we are getting the same thing the disassembler gives us. This is the magic of poke. And we have more — here, yeah: we have the data sheet, there are registers, you can configure things this way. So, questions — you're happy now? Thank you. For example, can you change the t0 on the fly to some other register, so you could patch it? If the function is in RAM, yeah, definitely you can do that. I don't have the courage to do that now, but you have to trust me. More questions, please. For microcontrollers, it's like a scripting language for registers? Yeah. It's this one, this one. This is SVD, so you don't need to read the whole data sheet to understand the registers. You can have libraries — this is, sorry Jose, a Python library — which take this description of all the registers and generate the Poke types for you, and then you can load whatever types you want and poke them. So if I use something normal, not an ESP32 — is it already upstream in GDB? No, no, I told you it's not. It is false, you know — you can blame him, he doesn't care. On a serious note, the problem with upstreaming is that libpoke uses the Boehm GC, and GDB also uses the Boehm GC for Guile, so there's a conflict there. We would have to change the GC — please, Luca, you can ask him to give us a new GC, then we can upstream. That's the real answer; sorry for joking. Yeah. Next one. So, I told you nobody has any questions, you know. Thank you. Thank you. Wow.
Yet another event sourcing library
Yeah, this is better. So I'll talk about the history, how we made the decisions we made — some things regarding Lambda pushed the project to the point where we started to do most of the stuff on our own. Then I'll go over the patterns that influenced the library, CQRS and event sourcing; I'll briefly show how the whole thing works with architecture diagrams; and then I'll explain why we actually decided to open source it. The project started in 2019. Everyone wanted to do serverless — it was the fancy thing to do at the time — and we wanted everything to be managed by Amazon: we didn't want to monitor containers or keep servers running, we just wanted to give our code to Amazon and have it run, and serverless was perfect for that. We also had to keep the business logic vendor-independent — that's a regulatory requirement — so we take the view that our business logic is the most valuable thing and isolate it from the infrastructure. The infrastructure part we can always rewrite, but the business logic we want to reuse. We wanted a simple API — we'd had all these query-path and header discussions about API design, and we wanted to drop all of that. And we wanted to keep the data portable, so we could rewrite the library, move to another language, and keep using the same data — so binary-serialized messages in Kafka queues were not an option for us. With Lambda, the big problem is startup time. We wanted to use Clojure because we had lots of data work to take care of, so the biggest problem was of course startup. GraalVM was pretty new at the time, and basically most of the stuff didn't compile. We tried the AWS SDK — it was a mess inside; it pulls in half of the main repository when you use it.
We also had another library we had to fork, because some of the code it used didn't compile either — even Logback didn't compile with this until about a year ago. So we started to build something on our own to keep it simple. We created our own AWS SDK, because everything the official, magical SDK does is, in the end, a POST request to AWS — so it turned out to be super easy to do. The first pattern we chose was CQRS, the command/query responsibility segregation pattern. The idea is that you have one place where you send commands, where you mutate data, and one place where you query. This influenced our implementation: on the HTTP side we have just two endpoints, commands and queries. You send everything you want the system to do in the body — which also means you can take the same body and send it to a queue, or store a batch of commands in an S3 bucket. That was great, because we could take the commands from the POST request, put them on a queue, or store them in an S3 bucket as a list of commands — super practical. The query side is also very simple: just the query endpoint. We implemented our own front-end client for it — 300 lines of code including mocking, retries, deduplication, everything. Having this simplicity on the HTTP side is what made that possible.
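A minimal sketch of the two-endpoint idea — every handler name and payload shape here is illustrative, not the library's actual API. The useful property is visible in miniature: because the whole request is the body, the same dict could just as well be put on a queue or batched into S3:

```python
commands = {}
queries = {}

def command(name):
    # Register a mutation handler under an action name.
    def register(fn):
        commands[name] = fn
        return fn
    return register

def handle(endpoint: str, body: dict):
    # The only routing decision: /command mutates, /query reads.
    table = commands if endpoint == "/command" else queries
    return table[body["action"]](body)

state = {"items": []}

@command("add-item")
def add_item(body):
    state["items"].append(body["item"])
    return {"ok": True}

queries["list-items"] = lambda body: state["items"]

handle("/command", {"action": "add-item", "item": "book"})
print(handle("/query", {"action": "list-items"}))  # ['book']
```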
Together with CQRS comes event sourcing. The idea of event sourcing is that we don't store the current state of the system; we store the events that happened. It's a pattern from the 1970s, basically, but back then they didn't have the resources to do it, so the industry settled on the relational model, where you just store the current state. With event sourcing — take a shopping cart as an example — instead of storing the current cart, you store "item added", "item removed", "item added", and when the client asks "what is my shopping cart?", you replay the events and figure out the current state of the cart. The nice advantage is that everything is stored. For us that's very important: the audit logs come naturally with event sourcing, everything is recorded, and the database itself is immutable — we're only ever appending. So it's quite easy to handle from a security and information perspective. For our implementation we chose Postgres: we store our events in a JSONB field with some metadata around it. Super simple. We have transactions; because it's append-only it scales very well — we have around one terabyte of data and we don't even think twice about adding new stuff there. We use optimistic locking: on the client side we just attach a sequence number to every event, and a unique constraint in Postgres gives us the optimistic locking. Super easy to do. So yeah, this is a simple diagram of how things look from the client's perspective. A command comes into the system and hits our service, which uses the core implementation. The core does four things: it takes a snapshot from the view store, does the processing — whatever needs to be done — stores the result in the event store, and sends to the router
all the events and effects that were created. Events, as I said, are what we store as the changes; effects are the things that need to be distributed to the other services. So if service A wants to call service B, it never calls it directly — the things to send to the other service are stored in the database and then distributed by the router. The router also sends events back to the service that needs to update its aggregate; the aggregate update goes to the view store, and then we go to the next cycle. And a query is just a simple query: it goes to the view store and returns data to the client. One more important diagram is how the core works internally. It does a couple of things. At the beginning we validate the request, and — importantly — we check whether this request was already processed: we have a command response log where we look it up. If not, we log the request in the request log, so all the commands entering the system are stored there; if we need to debug something later, everything is collected. And since everything is a body, it's super easy to store, whether it comes from a queue, a POST request, whatever. Then comes processing the request — the business logic part — and only then do we start the transaction, at the very end of the request, which is quite nice from a performance perspective: we store the events, store the effects — that is, the commands to the other services — and then mark the request as completed, so that we get deduplication afterwards.
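The shopping-cart replay described earlier, as a minimal fold over events — just the pattern, not the library's actual event shapes:

```python
# The store holds events, never the cart itself.
events = [
    {"type": "item-added", "item": "milk"},
    {"type": "item-added", "item": "bread"},
    {"type": "item-removed", "item": "milk"},
    {"type": "item-added", "item": "eggs"},
]

def apply(cart, event):
    # One step of the fold: current state + event -> new state.
    if event["type"] == "item-added":
        return cart + [event["item"]]
    if event["type"] == "item-removed":
        return [i for i in cart if i != event["item"]]
    return cart

cart = []
for e in events:
    cart = apply(cart, e)
print(cart)  # ['bread', 'eggs']
```

Because the event list is append-only, the full history — the audit log — falls out for free.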
Well, basically that's it. We started developing this internally; it was only meant as an internal library, and there was no open-sourcing process in the company either — this project was kind of the trigger to start that process as well. There was also no alternative implementation, because it was tied to a fixed infrastructure, so we used open sourcing as an opportunity to expand the library. We mostly started using it for hobby and side projects, and added DynamoDB support for the event store, for example. This helped clean up the project: we did a big round of cleanup with proper abstractions, then started adding different implementations, and we contributed the changes back to the internal project. We fixed a huge number of bugs outside, which helped get them fixed internally too, and so on. And we set up the open-sourcing process, so now any team in the whole company can open source what they want if they just follow these steps.
Yeah, we've had a very positive experience with this library. We're now almost one year in production. We store everything, and it pays off on a daily basis. We even had the business side mess up hundreds to thousands of records, and we could recover them quite easily just by recreating the data from the database — everything is stored there. Audit was super happy because we store everything; I even ticked off a lot of audit points just by saying "we store everything". And for most issues — for example, we had a production bug that clogged up the queues; we could clean up the queues, and five minutes later just select what had happened and put it back on the queue. We didn't have to dig through the dead-letter queue wondering what was useful and what wasn't. And because of the deduplication, we didn't have to worry about sending some messages again. We have about one disaster a week to recover from, and it's super easy for us to do. That's it from my side — questions? Excellent. So, tell us a bit more about getting open source accepted in your company. Yes — so the question was about the experience of setting up the open-sourcing process in the company, and this was actually a very painful experience. It took six months of negotiation: first with security, to get them to understand what we wanted to do; then explaining why we wanted to do it; then talking to management and telling them why this is beneficial. But once we had figured out all the rules we needed to follow, it was quite straightforward — we documented everything. So, a six-month process to get there. Next question: why did the architecture use Lambdas in the first place?
So — the question is why we decided to use Lambda functions. One side was bursts: in the beginning we had bursts of data. For example, in the morning we'd get a bunch of data to process, and the rest of the day the system would handle maybe three requests per hour. Lambda was nice because it scales quite fast. The other motivation was that it forces you to keep things clean: there's no caching, you really have to think about what you're doing, so it pushes developers toward making things clean and not depending on something being stored in memory somewhere. And the third thing: it was a cool thing to do — nice presentation and marketing material for the project as well. Next: you mentioned you use optimistic locking — why did you decide on it? Was it because of Lambda? So, the question is why we use optimistic locking. We used Postgres from the beginning, but we chose optimistic locking because we didn't want to open the transaction until we were done: we declare all the dependencies we have, fetch them, process the data, and only then, when we have everything we need to store, do we open the transaction. That means we fetch the aggregate at, say, version 72, process everything, and say: okay, now it will be version 73. If some other version 73 happened in between, Postgres nicely tells us there's a concurrency conflict. We didn't want to lock anything in the database — we just wanted to keep it simple, and this was super easy to implement.
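A sketch of that optimistic-locking scheme — using SQLite in place of Postgres so it runs anywhere, and a made-up schema; the mechanism is the same: each event carries the aggregate's next sequence number, and a unique constraint rejects the losing concurrent writer:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    aggregate_id TEXT, seq INTEGER, body TEXT,
    UNIQUE (aggregate_id, seq))""")

def append(aggregate_id: str, seq: int, body: str) -> bool:
    # Try to write version `seq`; the unique constraint is the lock.
    try:
        db.execute("INSERT INTO events VALUES (?, ?, ?)",
                   (aggregate_id, seq, body))
        return True
    except sqlite3.IntegrityError:
        return False  # somebody else already wrote version `seq`

print(append("cart-1", 73, "item-added"))  # True
print(append("cart-1", 73, "item-added"))  # False: concurrency conflict
```

No row or table locks are ever taken during processing; the conflict only surfaces at commit time, which fits the "open the transaction at the very end" design.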
I have a comment on that: our database uses optimistic concurrency control, and it actually scales much better than traditional locking methods — it's more robust and more secure. We can have a separate discussion about this later. Yes, let's — that will be an interesting discussion.
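The command-response-log deduplication described in the architecture section can be sketched like this — all names are illustrative and the real library's API surely differs; the point is just that replaying a request is safe because the stored response is returned and no new events are written:

```python
response_log = {}  # request_id -> cached response (the dedup log)
event_store = []   # committed events

def handle_command(request_id, command, process):
    if request_id in response_log:       # 1. already processed?
        return response_log[request_id]  #    replay the stored response
    events, response = process(command)  # 2. business logic, no tx yet
    # 3. "transaction" at the very end: commit events and mark complete
    event_store.extend(events)
    response_log[request_id] = response
    return response

def process(cmd):
    return [{"type": "item-added", "item": cmd["item"]}], {"ok": True}

handle_command("req-1", {"item": "milk"}, process)
handle_command("req-1", {"item": "milk"}, process)  # duplicate delivery
print(len(event_store))  # 1 — the retry produced no second event
```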
How to create the universal operating system
Welcome everyone. My name is Erotra. I'm glad my somewhat pretentious title has lured you all inside; I hope not to disappoint — hence the disclaimer that I have way too many disclaimers to elaborate on. What is an operating system? I had to look it up on Wikipedia. I did have some idea of what it could be, having used one for a few years, but it turns out — slightly paraphrased, because their definition is too long for me — it's a software platform providing access to resources and services to run computer programs. Okay, great. I knew that; that's what I use it for. Excellent. The title is about the universal operating system, and universal, to me, implies more generalisation. I've always felt that the computer, or computing, should evolve, and I hope that we can move towards freely sharing, using, combining, and understanding whatever we do with the computer. From my personal perspective — from my day job — one aspect is safety. I've added security, but I'm definitely not a security expert — in all the automation we make computers do these days, and hopefully in the near future. Because apparently we are dealing with a few crises here and there, and I believe we have ideas to address those using information technology. I hope to learn from Jonathan, at the level of representing information, how this could be used in the future; but I'm sticking to this bit. I'm more operationally oriented — imperatively, you could say — and I know this is the declarative and minimalistic computing room, so I'll try to bridge that. The ingredients that I hope the future universal operating system might incorporate: definitely the microkernel. Richard Stallman proposed, for the GNU system a few years back, that it could have a microkernel; I would still love to see that happen. Work is being done on that in the community, and I hope to start contributing to it.
From a software point of view, I believe everything should be modular: small pieces, because I'm just a human — my head is limited and my understanding and time are short. Things should definitely be decentralized, and client/server is a natural way for things to interact. But I want to focus on language semantics that might help us move towards such a universal operating system, because if we add all of these ingredients, we're going to incur enormous complexity, and I'm not sure that if we keep doing software development the way we do, it will scale to the level we need our information technology and our operating system to scale. And I'm going to do that using a very silly example: the controls of the cruise controller in my car. The picture comes from the internet, of course. I'm guessing most of you have heard of a cruise controller: basically it's an electronic device in a car, and it runs a bit of software. I want to use it as a metaphor to talk about small modular things that can work in a larger environment with other small modular things — and as you add and combine them, complexity goes up, and we need to figure out a way to deal with that. So what is a cruise controller? I'll just read this out, because I haven't memorized it. Basically, when it's not enabled — hence disabled — the throttle, which is the thing under the hood that is normally controlled with your gas pedal, is fully controlled by the driver pushing the gas pedal. When it is enabled and you press the set button, it captures the current velocity and maintains that velocity over the course of time. There are exceptions: in my car, if I go uphill and I've set it at the lower limit, the velocity will drop and it will just drop the cruise control. And there are other reasons for it to fall back to human control instead of acting automatically.
One of them: if you press the brake pedal or the clutch pedal, it has to stop — it would be very annoying if your car kept going when you don't want it to. And of course, as a human, you can cancel it. Okay, this is all really boring. But put declaratively, we just want the damn thing to control our velocity. Done. Very abstract. And I think that is the way to do future automation: declaratively. But I've just been listing all these pointless details, which are still quite abstract — if you look at the car in greater detail, there's a lot of imperative, stateful stuff going on, and that's what we're trying to figure out a solution for. So we've been working on a language which we call — also pretentiously — Dezyne: "design", spelled incorrectly, because if you search the internet for the normal word "design", you'll find everything but us; at least the alternate spelling helps search engines. Our language consists of interfaces and components, basically. Interfaces are behavioral specifications: they record the protocol — I'll show you an example in a minute — and these protocols are actually contracts of interaction between two components. Our components are of course modular, completely isolated from the world by their interfaces, and composable: you can stick them together and know that, as long as they maintain the protocol, they can cooperate properly. We have a formal definition of our semantics, meaning we actually express it in a formal process algebra — I'll get back to that. We can simulate behavior at the interface level and the component level, we can produce running code through code generation, and we can automatically verify a bunch of aspects of interfaces and components. I'll try to show that by example. So let's start with an interface. Something like this.
The picture of the buttons on my steering wheel that I showed you is captured in syntax here: there's the enable/disable button; the set, which selects the current velocity to be maintained; a resume button — or resume function — and a cancel. And on the dashboard there's an LED indicating whether it's active or not. The human is expected to interact in a specific way with the rest of the car, and that is captured in this behavior section. Our language takes an imperative approach, so we define state — I just scrolled past it: we have two pieces of state, kept in the variables state and setpoint. And now we describe, behaviorally, the interaction of the human with the cruise controller in the car. In other words: if the initial state is disabled, we accept an enable and become enabled. To dig into this further, I'll show you a picture of what this looks like. Let's see. If I show you the state diagram of the text I was just showing you — this is generated from that definition — this is what it looks like. Slightly more human-readable, and we have an intuition for this sort of thing, I think. Now let me make it slightly more complicated and look at the component, the cruise-controller component itself. It is specified similarly — almost there — yeah: we use the same language, the same concepts. We define the behavior of the component itself, but now it receives its messages through ports, and the cruise controller interacts with the different actors in the system: the human behind the human-machine interface, the pedals, the throttle, and a timer, which I will not go into. I won't walk through this behavior in full detail; I just want to show you the following. Sorry, I have to give one more example. The thing I really want to add — which we have recently done — is an extension. Let me start over.
This is what it looks like — this is the formal semantics of that behavior, which we can feed to a model checker to check properties. And let me feed it to the model checker: it checks all of our default properties and the user-defined properties, which I'll show you now. So what we have just checked is that the component adheres to all of the interface contracts, and that it actually adheres to the invariant predicates. One of the invariant predicates — you may have heard there are cruise controllers that accelerate unintentionally — I've tried to encode in terms of the state of the environment the cruise controller is controlling. In this case: if the human has not activated the cruise controller, it should never actively control the throttle. That's what's recorded here. And I can actually make it fail by commenting out a throttle reset; then that property helps us find a sequence of events that would lead to this illegal, unwanted behavior. Okay, this was very detailed; I'll try to wrap it up. Oops. So I have to make the link to the universal operating system. I foresee that we will build a modular operating system, and because of the modularity and the distribution, the cooperative complexity goes up — and I think we've figured out a way to leverage model checking to help us there. In the near future I'm looking forward to adding that to Hurd development. For the coming year we had already planned to extend the scope of verification to include data contracts. If you want to know more, come and find us online — here are the details. Excellent. Thank you. So this system is GPL, it's out in the open? Yes, you can find us on Savannah. It runs on Guix: guix install dezyne. Can you tell us a little more about the automatic verification of the model? Right, that's the magic part. Yes.
We actually transform the model into mCRL2, a process algebra with a modal mu-calculus that allows you to specify formal properties and capture the formal behavior. So what we effectively do: the execution semantics of the code that we generate is modeled in mCRL2. We verify the entire state space of that code, which is more efficient than trying to test all the code. And we have a compositional guarantee. So when it finds nothing wrong, there is really nothing wrong. And it's not a matter of, we just didn't have the time to find something yet. Exactly. But there are always aspects that you cannot represent, which are also important. You're welcome. More questions? To verify a property of the model, does it cover the whole solution space? At the component level, yes. You should repeat the question. Sorry. Your question was: to verify all of the properties, does it expand the entire solution space? Exactly what we do, at the component level. So the interfaces allow a certain behavior, and you want to expand that entire behavior, synthesize it, and go through it and figure out if there are any problems hiding there. That's what we do. Final question: is it used in production? It's used in production. Oh, yes. Our biggest customer currently is Thermo Fisher Scientific. They make these huge electron microscopes. And I believe they've got about 1.2 million lines of our code running. Another question? Yes. Thank you. Is it also possible to create distributed systems with Dezyne? Currently, no. But I hope to integrate with what Christine will be talking about very soon. And that will solve that bit. Great. Thank you.
How much math can you fit in 700K?
So during these two minutes, I'm going to ask a few questions. I think the sound is better like this. Can you hear me? Yeah. So I heard a comment that color was not allowed, so I hope that you won't mind if I use 3D instead. But the screen and the actual device I'm going to talk about is black and white. Who uses a calculator from time to time? Who uses a calculator on the smartphone or whatever? Yeah, it's the majority. Who uses HP-style calculators? Not that many. Who uses calculators for binary computations? Okay. Complex numbers, matrices, graphing. Okay. Just checking. So I don't think that the camera can zoom that far, right? So I can't show that, I suspect. Yeah, it'll be hard. But this is the device I'm talking about. You're going to speak, I'll hold up a sign, five minutes, for question time. Yep. It's the dots. Is it also? For me it is. I don't know what's wrong with my timer. It's Android. Okay. So I'm Christophe de Dinechin. I'm working as a senior principal software engineer at Red Hat, working on confidential computing. I'm giving a talk on that topic this afternoon. But today I'm talking about a pet project of mine called DB48X, which is an open source HP48-style calculator for modern ARM hardware. So I talked about this last year, and I'm going to show how much progress we made since then. I'll start with a reminder of what DB48X is. We are going to review last year's future plans to see how well we did. I'm going to talk from one engineer to another; that's why I asked the questions at the beginning, to see why we need all this math in a calculator. I'm going to extol the virtues of 1980s-era efficiency, when there were only keyboards, no touchscreen, no fancy mouse, all that stuff. I'm going to explain how using much bigger numbers led to much less memory usage. And we are going to see a number of bells, whistles, and engineering units along the way. So I hope you enjoy it. Strap in. What is DB48X?
The idea is really to revive Hewlett-Packard's iconic Reverse Polish Lisp on modern ARM hardware. So that's what the original box looked like. And a quick primer on the project. We want to, simply put, reinvent the best calculators in the world. Nothing more, nothing less. It's designed to run on existing hardware from a company in Switzerland called SwissMicros that makes these kinds of devices. So you see the DM32 on the right and the DM42 on the left. The specs for the project are the HP manuals, and there are dozens of them. Unfortunately, they contradict one another because the various calculators do not do exactly the same thing. So it's implemented in a language called Reverse Polish Lisp, or RPL, which is a stack-based language, very powerful. It's based on a command line and menus that you activate with the function keys below the screen. It has many data types and mathematical operations; I'm going to talk about this later. And many enhancements in the project compared to what HP did. Now, is this still minimalist? Well, you bet, because that machine has 70K of free RAM and 700K total for the program space, hence the title of the talk. So it's a low-power Cortex-M4 at 80 MHz. The battery life is up to three years on this kind of battery, and one of the things that is nice is that the screen is passive, so when you switch off the calculator, it displays a picture, and the picture stays there forever. So that's where I have pictures of my wife and my calculator. The machine has only 96K of RAM, and if you remove the bitmap, which is a high-res bitmap, and what the operating system needs, then you get to the 70K I was talking about. So 96K is 1.5 times 64K, for the old-timers among us. It has only 2 megabytes of flash. It has 8 megs in the chip, but 6 are for a flash disk, and so there are 700K remaining for your program. That's less than a Macintosh floppy disk; they were 800K. The project did hit these limits quite hard.
I'm going to explain how we worked around that. So last year I explained that I had to restart from scratch from a project called newRPL because we hit these limits. This year around Christmas, I hit the limits again, so I had to restart from scratch, at least as far as the decimal computations are concerned. So I'm going to explain that. So let's review last year's future plans. I think there is a problem with this one. Is this one okay, or is it... Yeah, okay. So I said, you know, back in 2023, I was young and naive, and I said a lot remains to be done. So I was talking about adding complex numbers, vector and matrix arithmetic, about 1500 functions that were left to implement, and key features like plotting and graphing. So what did we do? Well, a lot of this was done. Complex numbers are available, and they are actually much better than the original. For instance, you can have polar and rectangular. You have the usual notations, you have stuff like that. We have vector and matrix arithmetic fully implemented, and we have algebra, but also with exact computations like fractions inside matrices. So you never get a rounding error, unlike on the HP calculators. That's the test suite. So the test suite runs on a simulator on Linux or macOS, and it currently runs about 2,200 tests. Not everything is tested. That, for instance, is implemented but not tested yet. And we have plotting and graphing, at least the basic features, like drawing stuff, etc., with some nice enhancements compared to what HP did. Like, for instance, we can have plots with various line sizes and plot patterns, so I'm going to show that in a moment. And that lets you draw multiple things on the same screen and see what the different pieces are. It just went by very fast on the screen here. So how did we get to use only 70K? It's a story of ultimate over-engineering. It's C++ with garbage collection and ubiquitous bit packing all over the place. Let me explain what I mean by that.
A C++ object typically looks like this. You have a class, and the way this is represented in memory is you have a virtual table pointer, and then you have the value for the object, so in that case, for the integer, you'd have an integer value. And then there's some overhead for malloc; it's allocator-dependent, you have, for instance, a linked list or a free list or something like that. So overall, for your object representing an integer value, you typically use 12 bytes. 12 bytes, that's on a 32-bit CPU. That lets you represent all values up to 4 billion, and it's fixed size. You can't move it in memory. Not good. Let's do better. So the representation we used looks something like that. We use LEB128, which is an encoding that is used, for instance, in DWARF all over the place. And that lets us encode the ID that is used to identify the type of object as one byte for integers; we have 128 types that we can represent with one byte. And the value, if it's less than 128, is also one byte. So that means that I use only two bytes of memory; that's a 6x factor compared to the other representation for all values below 128. And I can go up to infinity, because LEB128 is a variable-size encoding, so I can essentially have numbers that are as big as I want. It's now a variable-size object, and I can move it. So it's a vast improvement. That lets me have a memory organization where I have at the bottom of memory all the global variables, the global objects that I keep. It's essentially a name, a value, a name, a value. And so they are all packed together. And then on top of that, I have temporaries, with a temporary pointer that moves as you allocate objects. And then there is an editor, a scratch pad, and the transient stuff on top of that. Because it's all contiguous, the way to reach the next object is to skip it, by reading the ID and computing the size to get to the next object.
So on top of memory, you have root pointers that point back to, like, the stack, the local variables, that kind of stuff, that point back to this memory area at the bottom. And the root pointers can point inside objects. That's a very important property for performance. For instance, if you follow the one link, you'll see that it points just behind, I think, the curly braces. It means it's part of a list, and I can put the value that is inside the list directly on the stack. So I can do the computations faster that way. And there is also a series of smart pointer classes, whose names end in underscore g in the source code, that let me have garbage-collected smart pointers. The allocation is super cheap, because essentially I'm moving the pointer at the top of the scratch space, like this. So it's just one addition and one comparison, and the comparison is to see, okay, am I out of memory, do I need to garbage-collect? So a very, very cheap allocation. The garbage collection itself: as your memory grows and you allocate more and more stuff, at some point memory gets low. The unreferenced temporaries, you no longer need them, so what you do is you copy the referenced objects down and you adjust the pointers, and then you move the editing part of the scratch pad down, and you reclaim your free space that way. So the good point of this approach is that there is no memory overhead at all. There is not a single byte that is used for metadata or linked lists or anything like that. The sub-objects, so pointers to objects inside a list, for instance, don't cost anything extra either. If you know something about garbage collectors and you think of a mark-and-sweep garbage collector, for instance, it needs some metadata about sub-objects, and so that means you have extra costs for objects inside objects. And it's a single-pass garbage collector, so it's simple code, easy to maintain, but the downside is that it's slow.
It's essentially quadratic behavior, number of stack objects times number of objects, instead of the linear or close to linear that you could get otherwise. So it's the usual trade-off of space for speed. So why use C++ at all? Well, it's because of template metaprogramming, and let me explain why this matters. So the guy that you see in the photo there is a guy named David Vandevoorde, and he's a Belgian guy who initiated me to C++ metaprogramming back in 1998 when we were in the HP C++ compiler team. So the guys you see in the background are the HP compiler team back in 1998, and that guy is super, super smart and initiated me to template metaprogramming before it was even possible, so we were dreaming about doing these things. But now you can, and let me explain why it matters. I'm going to represent code as data using metaprogramming, not because we can, just for the sake of it, but because I have to. So let me talk about bug number 12 in our project. You compute 1.2 plus 3.4, and it hangs on battery power. So how do you reproduce this bug? You don't use the technique shown on the right. Instead, you simply type 1.2, 3.4, plus, and the calculator sits there, not doing the computation. And your users call you and say, did you even test the thing? So you scratch your head, how did I miss that? Well, the fact is it hangs only on battery power, and as soon as you plug in the USB cable, the computation resumes and you get the result. You can guess that I did my testing with the USB cable on. So what is this bug? This one was a bit hard to find. It turns out that the chip has an execute-in-place feature that is supposed to work on the external chip, something called the QSPI interface, except it just lacks juice when it's on battery power. And so essentially it sits there waiting for the cycle to complete, and it completes it when you plug in the power.
Okay, so that means I have to move as much of my mathematics as I can into data that I can read from the QSPI, as opposed to code that I cannot put there. That's why I only have 700K, otherwise I'd have two megs. So how do I use C++ metaprogramming to do that? Let's see a description of an interesting math rule, and that's how you expand polynomials. So you know the rule, you see the first rule, for instance, X plus Y times Z, you turn that into X times Z plus Y times Z, and you see that's exactly what you see in the code. So the code contains essentially the mathematical formula as you're applying it. That's neat, right? Now, here's a guess. How many bytes of code does that generate? Give me a guess. Nobody wants to guess. Okay, that's the assembly code. 12 bytes. So that code generates 12 bytes of code, but it generates tons of read-only data, which is good because I can move that to my QSPI. So the magic is this ugly metaprogramming code that generates constant arrays, and I taught the C++ compiler how to generate RPL objects from C++ expressions. Isn't that cool? And so that's how you get 12 bytes of code, tons of data that I don't care about, because I have plenty of that data space free, and no execute-in-place needed. So in the end, how much math in 700K? Well, it turns out that for another reason, I'm now back under 500K, so I'm within the limit that we all heard about, the 640K that ought to be enough for everybody, right? So from one engineer to another, what do we have? We have base numbers; for engineers in the computer field, that's really fancy. Any base: I can compute in base 17 or 34 if you want, or 3. Any size: you can compute on 13 bits or 512 bits if you want. We have complex numbers, that's useful for electrical engineering, and phases are dealt with with exact results when we can, so like exact fractions and stuff like that. We have linear algebra, here too with exact results when we can. Statistics, which is useful for field science.
Degrees, minutes, seconds support, so if you're doing, you know, maritime navigation or stuff like that, that's really handy; you have a really nice shortcut for that. Unit conversions, if you want to land something on Mars without crashing it, because some guy in the US is using really ridiculous units. And symbolic processing, which is useful for math geeks. About 1980s-era efficiency: I have this magic menu, it's the key at the top, next to the A symbol, and essentially it selects the right menu depending on the type of the object on the stack. So very few keystrokes to get exactly the functions that are most useful for what I'm working on. Equation data entry: I use a single key to enter the symbols that delimit expressions, the quotes in RPL, but once I'm inside an expression, I no longer need these quotes, so I hit the same key and I get parentheses instead. And same thing with the equal sign that you see at the bottom: it evaluates an expression, so it's the eval function of RPL, but if you're inside an equation, then it says, well, I'm inserting an equal sign because I'm creating an equation, and if I'm inside parentheses, it's inserting a semicolon instead to separate function arguments. Base number data entry, that's for you geeks: when you type a hash sign, the cursor changes to a B, that's for base numbers, and now the ABCD keys, you don't need to shift them or anything, you just get ABCD. DMS data entry, dot dot dot, and yep, a one-key function. Okay. Just my conclusion: I cannot answer the question, because I still have 200K to go, so see you next year, guys. Thank you. So the next speaker can set up. Is there any time for questions? Yes, there's five minutes for questions. We'll leave one for the next speaker. Yeah. But I don't see the next speaker. No questions, seriously? Who wants to help with this project? I'll just give my laptop. You know, it's 20.
Okay. We'll just give my laptop. Does that work? Does the calculator have a beeper? Yes. That's a good question. So let me... I'll use the voice. Oh. So here we go. Okay.
RISC-V Bootstrapping in Guix and Live-Bootstrap
or say Guix integration or something like that. But in general, those APIs won't be used by Goblins programs for the most part. But we will provide the compatibility. Because we already started the next door. All right, very good. Thank you very much. Hi, can you hear me? Yeah, right? Okay. How many people here are aware of the bootstrapping problem? Raise hands. Okay, that's good. That's better than I expected. That's fine. So, first of all, this is a disclaimer. I wrote everything I'm going to talk about in my blog, and also I gave a talk last year. So if you really want the nitty-gritty details about the bootstrapping process, go there. This is not going to be a very technical talk, okay? It's going to be just an explanation of what we did in the RISC-V world in the bootstrapping process in Guix and live-bootstrap. So, this is me, right? I'm a telecommunication engineer and a freelance programmer, and I work a lot on Guix. So maybe you remember me from last year; I gave this talk. There we explain the bootstrapping problem, if you have more interest in that. There are more slides on that and quite a long explanation of what we are doing and why. So this is the context. I work with NLnet. Last year, they paid me, literally, to do some work on the bootstrapping process. I backported support for RISC-V to an older GCC, the 4.6. And also I backported support to TinyCC boot, which is a fork we are maintaining in order to be able to bootstrap the compilers. I'm going to talk a little bit more about this later. So this was explained last year, so that's nice. So this year, I decided to continue with this project, but I was completely burnt out, and I needed help, because people always help, right? So I added more people to the project. These two are the ones that took on the most work in this port, and they literally gave me the energy to continue, right?
So Andrius is very interested in the project because he works on live-bootstrap and stage0, which are projects that are very related to this. We are going to see them later. And Janneke is the author of Mes and also the maintainer of TinyCC boot. We are going to talk about that just now. So let's see it in pictures, right? There are some colors, but I'm going to point. So if anyone has problems with the colors, no worries. So this is what we had before my project, right? We have stage0-posix, which is source code, right? Then we use that to build Mes, and with Mes, we try to build a bootstrappable TinyCC, which is a fork of TinyCC that is easier, right? The C code it uses is simpler, to be able to build it. Then we try to build TinyCC, then we go for a very, very old GCC from the 90s, right? And then we go for a modern GCC, maybe with many steps in the middle in all of the parts, and then we try to compile the world with GCC. So now the colors. All this is the current bootstrapping process that is in live-bootstrap. We have it in Guix too. So this just works, but only on x86. So I'm working on the RISC-V port of all this. The status of the RISC-V part: these two parts at the top already had some RISC-V support. It was working pretty fine, okay? The bootstrappable TinyCC had zero RISC-V support. TinyCC was supposed to have some RISC-V support, but it was worse than we thought. The old GCC didn't have it, because these are very old GCCs. They were written before RISC-V was invented, so no support there. And the modern GCC that supports RISC-V is the 7.5 version. Then the world: some things support RISC-V, some others don't, but that's not my problem. I'm only working from here to the top, so don't worry about that.
So after my previous effort, I took the support from this GCC, which is kind of a modern GCC — well, 7.5 is not at all — and took that to GCC 4.6. There's a note here: this one is written in C, this one is written in C++, ha ha, I had a lot of fun there. And also, I took the support from here and I moved it to this one, right? So there is, I think, like a 10-year difference between these two, so the APIs, the internal APIs, changed; many things are very difficult. GCC is horrible to read. Maybe the maintainer is here, I'm sorry, but it's really hard to read this project, I'm sorry. So at the time, we didn't know that this was orange, that this is not fully supported on RISC-V; we thought it was completely green, fantastic. No, it's not, so problems there. And this one, I finished this backport and I thought I was going to have issues with it, but it happened to be pretty much okay, so nowadays this is way greener than we thought at the beginning. So this is before what we did this year, right? Starting in June, we started working on this with the people I already mentioned, and now we got to this point, and this is already in live-bootstrap, and we have it in Guix, in the core-updates branch; this is already upstream in Guix. So up to here, everything works, so thank you very much. Good. So this part we already tested in a QEMU machine; this part we tested in a QEMU machine and on real hardware, a RISC-V board we have, and it also works, this GCC 4.6 compiling stuff for RISC-V. So a compiler that was written before this architecture was invented is compiling for it, so that's also very nice. We have it, yeah? So this is more or less what we did. There are problems, though: the arrows here are still red, I don't like that. So why are they still red? Why? So TinyCC requires some changes in the C library we have here, so we need to change those to make them work, right? Also, the old GCC requires make, which I managed to compile the other day.
And it requires some other stuff, right? It requires patch, we also need gzip, which I didn't have time to compile, and some other things. Also, this jump is going to be kind of complex, because GCC has a really complex build system; maybe you've tried, it's a really complicated thing, right? So it should just work, but it probably won't. So, questions now, and I have some extra slides for later, but does anyone have any question? No? No, okay, extra slides. So we had some limitations in the backport we did, and this is what we have been playing with since June this year. When I made the backports, I was working only using a cross compiler. So if you're working in a cross compiler setup from x86, compiling stuff for RISC-V, you are going to have a lot of problems. Why? Well, first of all, you have the bootstrapping problem we're going to show in the next slide. And also, I was using glibc, which is a very powerful libc, and we don't have that in the bootstrapping process. There's no libc, so we need to play around with the stuff we have, like Mes libc, which is written by us, so it's probably not going to be great. We're not that good, after all. Also, there's the RISC-V assembly issue. In TinyCC, the RISC-V assembler it has doesn't use the same syntax as gas does. So our library was expecting gas syntax, and this doesn't provide that. And also, it doesn't support extended asm. So we can't really mix C code with assembly code very well, and we need to play around with all the variables, protect them, and do all those things by hand, and that's a problem. So this is how TCC is built. The graph I showed you before is just a lie, but it's a good lie. So this is how it works. We first build Mes libc, we take some part of the code of TinyCC boot, and with that we build this one, and with that we build this one, and we change the flags of the code so we add more features. With that one, we build another one.
We take the code again, we build another one. We do this six times, and then, of course, all these steps need to work. There is a lot of bash glue code in the middle to make all this happen, and you have to fix that too. And fixing the very old bash code we had for this kind of thing is even harder than reading the compiler, but anyway. So then we check that this one and this one are the same. At the binary level, they have to be exactly the same. That means the compiler is not adding new stuff, so we have settled, and we can just continue with those. My colleague Andrius already tested that it already settles at the fourth iteration, but we do six because we did six and we don't want to change. In live-bootstrap, they only do four, right? So, problems with GCC. I only tested it, again, as a cross compiler last year, because I only wanted to see that it was able to compile things for RISC-V. And again, I wasn't doing the bootstrapping process of GCC. GCC does a similar thing internally: when you build GCC by hand, they take the whole code base of GCC, they create a first GCC, then they take the code, they compile it again with the GCC they created, and then again, and then they compare. So I wasn't able to do that. And I didn't work on the C++ support either. So, the work we did: we started with TinyCC boot, and we started working on top of it. We spent many nights debugging crazy things, also because Andrius has a real job, not like me, so we needed to coordinate to do these kinds of things. It was really hard. Also, we don't have debug symbols, because our compiler is very simple, and implementing that takes a lot of time, and it's difficult. So we do all of this with, like, one hand. It's very hard to do with one hand, and also blindfolded. But we managed to do it. I wouldn't have had the energy to do this without Andrius, so thank you, Andrius.
Also, well, these are some errors; I explained them in my blog, and later you can come and ask me about them. This one is a lot of fun, because the body was never executed for any x; it didn't matter. This happens a lot in TCC, and in our backend, it exploded. Why? Because this is undefined behavior, and a lot of the compiler was based on this. So they used these shifts to clear bits, and we needed to check all the occurrences of this and fix them all. So, funny stuff. Yeah, and we found many other things. You can read about them there; there's a very long explanation about all of that. So we finally managed to build it; we have it, we have a recipe in live-bootstrap and in Guix. Yeah. So, about Mes. We also had stuff to do in Mes, because it was affected by our work on TCC boot, so we started fixing stuff. Why were there errors in Mes? Obviously because we are not perfect. Janneke almost is, but still. We had some issues because the bootstrapping process on i386 didn't use all the C constructs that appear on RISC-V, so we started fixing many things; like the switch cases, they were wrong. Structures were being initialized to 22, I don't know why. These kinds of things. And we are almost there. Well, TCC is the same. We finally managed to compile it on a different machine, with C++ support, all of that. Okay, fantastic. So, last words. People are important. If you're alone, you don't work well. I had issues, I was, like, completely depressed, burnt out. So bringing in people, giving me energy, the knowledge I lack, and emotional support: good stuff. Also, money is important. You all know this, but if you're getting paid, you work better, you don't feel stressed, you are not just trying to eat the next day; you just get paid, do your work, that's fine, that's good. You can focus. So, thank you to Andrius, to Janneke, and also NLnet for the money. And you, for listening. Thank you. Any questions?
I don't know if we are in time. We have time for questions. Okay. Questions? Regarding both the people and the money, will you be continuing your work? Yeah, so regarding the money, the people and all that, will I continue with the project? We have funding and stuff to do still; I think the project finishes in one year. We started in June, so until June, we're going to continue. I'm still working on it. Most of the budget is not spent, because we need to finally compile up to GCC. So until June, we're going to go. Yeah. More? No? Yeah? I was listening to a talk about the Zig project. It used an interesting approach to this, where they use WebAssembly. So how it would work in your project is: you use the latest GCC to compile GCC to WebAssembly. And then your problem on RISC-V is you just need to bootstrap a WebAssembly runtime, which is very small, to run GCC on RISC-V. Do you think this kind of approach might work in your environment, or is that just very specific to Zig's problem? So the question is about how the people at Zig resolved their issue with the bootstrapping, and they are using a WebAssembly environment, and whether we can do the same, or whether that makes sense here. So our idea is that we want to build everything from source on your machine. Why? Because if you get a Linux distribution, you download a Debian or whatever from the internet, you are getting a lot of binary blobs. So the idea is to start just from source. And that's not very compatible with the approach you are proposing, because you won't get sources; you will get some kind of WASM thing, and that's not easy to inspect.
So what we have here is that you can inspect everything, starting from a very small binary that is written with comments, so you can read the comments on the binary that bootstraps everything. So the idea is philosophically different. And I'm a little bit upset about this problem with Zig, because I really like the language, and now adding this WASM thing in the middle makes it very difficult for us to add Zig to Guix, because we would have this kind of binary in there. And we don't really like that, because we want everything to be built from source. But yeah, the idea is good; philosophically, it just doesn't match what we are doing. Any time for another one? Pjotr? Yeah. No more? OK. Thank you, guys. You have one, Pjotr. OK, you have one. Sorry. What about the ARM port? The ARM one. I don't know. I'm not sure. Maybe you should ask other people here, like Danny. But everything we are doing, the RISC-V port we are doing, is 64-bit. It's going to affect all the other 64-bit architectures. So we are making advances on x86-64 and ARM and all the others. So yeah. Shoot. All yours. Yeah. So for the ARM port, we got as far as compiling TinyCC, and that one compiles an old GCC. And that old GCC has a lot of problems that are well known; Nokia had a lot of fun back then with these bugs. And so we are waiting for you to update GCC, and hopefully that fixes everything. Yeah.
Self-hosting and autonomy using guix-forge
So, good morning everyone. This is a talk about guix-forge. So, first let me explain what guix-forge is about. guix-forge is a Guix channel that has services that will allow you to run a complete GitHub-like software forge, but fully on free software and using existing free software components like cgit and Git, of course, the Laminar continuous integration system, something like public-inbox and so on. So, usually when we try to build GitHub alternatives, we have monolithic systems like GitLab or Gitea, Gogs and so on. What guix-forge tries to do differently is use old and existing very stable components like cgit and assemble it all together into a system that resembles a software forge. And it is assembled together using Guix. So, you have a nice declarative configuration that you can just deploy practically anywhere. So, in a sense, it's like Mail-in-a-Box, if you have heard of that project; Mail-in-a-Box sets up a complete mail server on a system by integrating many different components. It's like that, but for software forges and using Guix. So, first I'll start with a quick demo of Guix system containers. Guix is quite widely used as a package manager, but as a means to deploy a full operating system and operating system containers, it's not so widely used. So, I just want to quickly show you a demo of how it works. So, this is a really simple operating-system configuration. It just has an nginx service that listens on 8080 and serves a static directory. So, let me build that. So, the static directory has a simple HTML file that I just wrote up. So, first let's build the container. You build it using guix system container. And the hyphen capital N is to enable network access. And the container is completely stateless; it's not something like Docker where you have attached storage somehow. So, you have to mount all storage, all state, into the container. And that's why we have the expose here. So, you have this script that has been returned.
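The operating-system configuration in this first demo might look roughly like the following sketch. The service and field names follow Guix's `(gnu services web)` conventions, but the host name, paths and omitted boilerplate are assumptions, not the speaker's actual file:

```scheme
;; demo.scm -- a sketch of the minimal configuration from the demo
;; (host name and paths are assumptions)
(use-modules (gnu))
(use-service-modules web)

(operating-system
  (host-name "demo")
  ;; bootloader and file-systems elided; a container ignores most of them
  (services
   (cons (service nginx-service-type
                  (nginx-configuration
                   (server-blocks
                    (list (nginx-server-configuration
                           (listen '("8080"))       ; plain HTTP on 8080
                           (root "/srv/static")))))) ; the static directory
         %base-services)))
```

Building it with `guix system container -N demo.scm` prints the path of a launcher script in the store; running that script with sudo, with the static directory shared into the container, serves the page on localhost:8080.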
So, if you open it, it's really just a Guile script that sets up the container and has all the dependencies built into the store itself. So, let me now run it. So, sudo... Yeah. It says that my Guix is too old, older than 30 days. So, I have started up the container. Let's just go to localhost 8080. And it works. So, this is just the static HTML page. Now, let's try to set up a container that actually uses the guix-forge channel. So, this is a more complicated operating-system configuration. Here, I want to show you the cgit service that guix-forge provides. So, it's really simple and it just takes a server name, which is the domain name, and then the repository directory where all the Git repositories are stored. And then you have something called a forge nginx service, which is similar to the basic nginx service that you have in Guix upstream, but it automatically handles things like HTTPS: acquiring a TLS certificate, setting up a cron job to periodically renew the certificate, automatic redirection from HTTP to HTTPS and so on. So, it does a lot of things in a very turnkey, fully automated way. You just push the button and you get it, essentially. And this is the ACME service configuration. So, ACME is the protocol behind Let's Encrypt. You have to register an email ID for that. So, that's my email ID. So, in this configuration, I'm currently using the staging URL. It's good for testing because you won't run into any rate limits. So, I'll actually take the risk and delete that. We'll try to build with the real ACME server. So, here again, I'll build a container and run it. I'm mounting a couple of state directories, the ACME directory and the Git repository directory. So, there it is. It started. So, I'll go to git.demo.systemreboot.net. So, initially, the container is set up with a self-signed certificate. So, it doesn't work. So, let's actually get real certificates. So, I find the shepherd process. The PID of the container is 19262.
I drop into a shell and source the profile. So, guix-forge sets up a script under /usr/bin. Yeah, I'm inside the container. Yeah. So, I run the script. And the script has been automatically configured with all the domain names that need certificates. And now it is actually getting certificates from Let's Encrypt. If you can see the logs, it's telling you what it's doing. Yeah, it has a certificate and it has restarted the nginx service as well. Now, if I reload this, it should work with proper certificates. Let's try. Yeah, there you go. So, this is cgit. And you can browse some repositories that I put in there. So, cgit is really simple, but it doesn't come with all features properly enabled by default, and you have to do a lot of manual tinkering to get it to work. For example, by default, it only serves the dumb HTTP transport protocol for Git. But the cgit in guix-forge is set up with the smart HTTP protocol. That's one. And then you have things like... So, this cgit can render Org mode README files, which the basic setup can't do. So, this is actually an Org mode README file in this repo. Then you have things like syntax highlighting that is automatically set up again. So, let's just look at the Makefile maybe. Yeah, so you see the syntax highlighting. For that it uses Python Pygments. So, my point is that guix-forge tries to do all this for you and doesn't expose all this complexity to the administrator. And all you're really saying here in this configuration is the domain name and the directory where the repositories are. So, it handles a lot of things with very sensible defaults behind your back. So, that's that. Yeah. How much time do I have? Okay. Okay. So, the philosophy behind guix-forge is that it has to be really minimalistic. I don't want to be running a full database server just to publish a few Git repositories and run a small project. And it should be as stateless as possible.
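The forge configuration summarized above, with just a domain name and a repository directory, might be sketched like this. The channel's actual module names, service names and fields may differ; everything here is an assumption based on the description in the talk:

```scheme
;; forge.scm -- a sketch of the guix-forge setup from the demo
;; (module names, service names and field names are assumptions)
(use-modules (gnu)
             (forge cgit)    ; hypothetical guix-forge modules
             (forge nginx)
             (forge acme))

(operating-system
  (host-name "forge")
  ;; bootloader and file-systems elided for brevity
  (services
   (cons* ;; cgit with smart HTTP, Org READMEs, Pygments highlighting
          (service cgit-service-type
                   (cgit-configuration
                    (server-name "git.example.net")     ; the domain name
                    (repository-directory "/srv/git"))) ; where repos live
          ;; nginx with automatic HTTPS: certificate acquisition,
          ;; a cron job for renewal, HTTP-to-HTTPS redirection
          (service forge-nginx-service-type)
          ;; ACME (the protocol behind Let's Encrypt) account details
          (service acme-service-type
                   (acme-configuration
                    (email "admin@example.net")))
          %base-services)))
```

At run time the two pieces of state, the ACME directory and the repository directory, are mounted into the otherwise stateless container, in keeping with the stateless approach just described.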
Of course, you need a little bit of state if you need a mailing list, or to back up your Git repos, of course. But it should not have hard-to-backup state like a database, which takes a lot of cognitive overhead to keep working successfully. Also, it should be as turnkey as possible, but you should still be able to inspect it and fit it in your head. It should not be something that is so complex that you cannot hold it in your head. And effectively, what guix-forge and the guix-forge channel is doing is crowdsourcing server management, in some sense. With a regular server, for which you have to mutate configuration files, you are the only one who's in charge of the server. But when you have guix-forge doing a lot of things for you, you're essentially getting a community to help you with managing your server. And so hopefully that will reduce configuration errors and let you run a polished server setup without putting in too much work. So that's it. Thank you. Nobody complains when the speaker is too quick, right? Is this a replacement for GitHub? Yeah, it's meant to be. What about the patch submission process and reviews and these things? Can we support them with guix-forge? Do you mean the email workflow? Yeah. Yeah, so I do mean to support public-inbox based mailing lists, instead of a pull request based model. I think that's easy to set up using existing tools, and personally I think it's better than the pull request based model. Questions? Yeah. So I think you mentioned it's in a separate channel. Yeah. And are you planning to upstream it, and what would be needed for that? So, can we repeat the question? Yes. Yes. Yes. Sorry. So, am I planning to upstream it into Guix upstream instead of having a separate channel? So certainly there are some parts that can be upstreamed. For example, the automatic HTTPS that I demoed certainly should be upstreamed.
But for all the other services, I'm not really sure. So I'm not sure how much of this fits into Guix upstream itself. We already have a cgit service in Guix upstream that doesn't do as much as the cgit service in guix-forge. So upstreaming this will essentially break the old service. Maybe it should be called something else. So that's a difficult conversation to have. Could you have a meta-service? Sorry? A service with all your special services? I do have a forge service. It's not fully integrated, but it aims to be a full meta-service. Yeah. Can you show Laminar? Oh yes. I can show it in the browser. So this is Laminar, which is a continuous integration system. So this is a system that we are already running. It's not running on this laptop; it's running on a different server. And it's a really simple continuous integration system that is very easy to set up. Most continuous integration systems are so complex that they are really very enterprise-y projects, not meant for a single person to set up. But Laminar is really easy, and you should have a look at the documentation itself. It's just a single page of documentation, and you can set it up. So we use that in guix-forge. And it fits in with the philosophy of using very minimal tools. We also have klaus in guix-forge. klaus is another Git viewer, which is written in Python. So you even have a choice... If you don't like cgit you can use klaus, and maybe we can support Gitile and other Git viewers too. Sure. So these are the Git logs. Maybe... Yeah, the Makefile again. So klaus is just a Git viewer. It doesn't do anything else. Yeah, it supports the smart HTTP protocol. Yeah. So you mentioned that the TLS stuff is automated as well. But with the demo there was something that seemed kind of manual? Oh yeah. So the manual step that I showed you is only the first time. And after that, that same script is run as a cron job.
I need to get rid of the first manual step, but I think I need to patch something in Guix upstream for it to happen. So yeah. Question. Would it be easy to use this process to set up your own channel and then auto-build your packages and then deliver that as a substitute? Yeah. Yeah. An end-to-end flow? Yeah. So we already do that on my guix-forge instance. And we also have the... So that is the guix-bioinformatics channel, which Pjotr runs. And we already do that for all the packages in guix-bioinformatics. For example, here you see names of many packages. Some of them build, some of them fail. And I think using Laminar and guix-forge is simpler than something as complicated as Guix's Cuirass CI. And I really don't want to be running Postgres just to provide substitutes for my channel. So we have a replacement here for many things, right? Yeah. Including GitHub CI. We don't use GitHub CI anymore. Yeah, we don't use GitHub CI anymore. Alright. Thank you.
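To illustrate how minimal Laminar is: a job is defined simply by an executable script in its configuration directory, so a substitute-building job for a channel could be sketched like this (the job name, channel path and package are hypothetical; the directory layout follows Laminar's documented conventions):

```shell
#!/bin/bash
# /var/lib/laminar/cfg/jobs/build-channel.run
# Laminar executes this script for every queued run of "build-channel".
set -e
# Build a package from a local channel checkout (hypothetical paths)
guix build -L /srv/my-channel my-package
```

A run is then queued with `laminarc queue build-channel`, and the web interface shows which packages build and which fail, much like the page shown in the talk.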
Open source leadership at scale, how 1300+ people improved Drupal’s multilingual features
Keeping on, keeping on schedule and with one microphone per person. Gábor Hojtsy, very old friend of mine and open source original. Drupal core maintainer several times. And now talking about a really powerful contribution project in the Drupal community. Roughly right? Thank you, Jim. Yeah. Great. Hi everybody. Thanks for coming. I think it's going to be interesting for everyone, hopefully, to some degree. Because this is about open source leadership at scale. As Jim said, I'm Gábor Hojtsy. And... one slide? No, it's done. So I'm Gábor Hojtsy. My own made-up title is Full Stack Community Organizer. Which means that I can put on an event for you. I can manage social media, design graphics, do a keynote, build developer tools, write marketing copy, and do basically everything in between. So whatever is needed at the time. So I've also been working with Drupal since 2003. Much like Mathias with TYPO3 since 2003; just picked a different system. But it's around the same time, and I'm a Drupal core committer and did a bunch of stuff that helped in getting here where I am now. But I'm more interested right now in where you are coming from. Who's using Drupal for anything in the room? Alright. Some of you have no idea what Drupal is, just here for the title because it was nice. Okay, so I won't explain Drupal that much. That's great. Who of you consider yourselves primarily developers? Okay, great. Nice. So those were the main questions that I wanted to have so I can direct the talk properly for the audience. So I got into open source from open content. In the 1990s I went to high school. And the high school got dial-in modems and we got on the internet. And I was really interested in how we can publish stuff on the internet. How we can put something on the internet. And I decided to be the lazy teacher that reads five pages ahead and then teaches everybody else what they learned.
So I started looking at how this is done, and started to go into documentation, and started to translate the W3C standards into Hungarian. And then the PHP documentation into Hungarian. And then distributing that, and then starting to look for news and articles, and summarize and translate those into Hungarian and publish them for the Hungarian community. And so it basically turned out to be a thing where I needed to set up a website to publish all of these things. And I got together with a person in Vienna, Sándor, who I've never met ever in my life since; we work together very well online. And we created this website called Weblabor that was hosting these things. And I went on a side quest with the PHP community: I became the lead of the PHP documentation and the lead of the PHP.net website in the beginning of the 2000s. And growing this Hungarian community website as well. But the Hungarian website grew so much that we needed some kind of system to manage the community, to publish these things, manage the forums that we had and the meetups that we had. And so we needed to have some system that managed this. And that's where I found Drupal. And Drupal was tiny and nice. And this was the whole Drupal conference in 2005, all the attendees. There's a certain person sitting there. So it was a tiny community that was very tight-knit. And we would get together. The software was managed through a mailing list where you would post a patch to the mailing list. And it was reviewed on the mailing list and then committed to CVS. So it was very tight-knit and everything was reviewed by these few people. And so it was very easy to join. And I needed it for a Hungarian website. So my main problems that I would go in and fix were usually about translatability into Hungarian. Or I wanted to have the path aliases in Hungarian. I wanted to have everything in Hungarian. And I always bumped into bugs and I submitted bugs and they got fixed.
So I fixed them, then they got committed. So that was basically the natural way to get into the community. Small was easy to approach, and they would receive those fixes very well. Fast forward ten years later: this was the Drupal conference ten years later. There's people up there as well; they're hard to notice. So this is DrupalCon Denver. So it's kind of hard to get started in this community. Like, who do you walk up to, that I have this bug and please work with me on fixing this bug? It doesn't work. Like, there's no way to do that. It's like walking up to people on the street and trying to convince them of something. When I got in here, all of the buses... I think bus 71 was always full. I waited for two or three buses; all of them were full. I decided to call an Uber yesterday. And I was by myself in the Uber. So I walked up to the people waiting for the bus: hey, do you want to come with me? And they were like, no. Who are you? That's the kind of feeling: you walk up to someone and, like, no, who are you? Why are you approaching me? So I was alone in the Uber. But when you have this tight small community, it's much easier to work with them. So when we got to this point, it started to get very hard to manage what people are working on, and organize that and motivate that. We went pretty long without more structural organization. But around this time, the project lead, Dries Buytaert, decided to set up initiatives. The idea was that initiatives could get back to this tight-knit small feeling, and they could sit together and know each other and have a sense of community at this much smaller scale. And they could work together very well. And when this started, I was approached to work on the multilingual initiative, because even 10 years after I joined, multilingual was still a problem space that had a lot of problems to be solved. And so I was happy to accept that.
And so I started working on the multilingual initiative. And everything was rosy and happy, and I started working on things. And then a bit later, something bad happened, for me at least; I considered it super bad at the time: another initiative was announced, the Views in Core initiative. So multilingual was, especially in Europe, pretty important. But Views in Drupal is basically... so if you don't know Drupal in that detail, Views is basically a query builder based on Drupal's very rich structured data. And it's also an output generator based on the query. And you can choose how the output is generated. And it can generate APIs and REST endpoints and lists and sliders and all kinds of things. So basically it's a query builder and an output generator. And Views was four times more popular than any of the multilingual modules. And they got funding from Sony and other companies. So they had money. They were four times more popular. And they started to steal some of the people that were working on the multilingual initiative. So I felt totally betrayed and I was super angry. And I wrote this email to project leadership and core committers that this will jeopardize my initiative. It will make my work super hard to do, because they're going to steal the thunder, and it will be very hard to do this going forward. And now I'm here talking about how successful it was. So we sort of resolved this. But I was super angry and very jealous, and also felt betrayed. And I think what was interesting is I didn't get responses to my feelings there. I did get responses to the facts that I stated, and they were refuted. But my feelings were not contested. And I think what I realized after a while, after I had time to think about this, is that the problem I had is I was thinking of Drupal as this small pie that we are eating from.
And if everybody's eating from the same pie, then it's kind of over after a while and you don't really have more to eat. So if you steal my people, then I don't have people and I'm not going to have people. And so I think that was the key understanding that I had: I need to think about how to grow this pie. And even though we had all of those thousands of people at the conference, I think we still didn't have a good grasp on how to involve new contributors very well and how to make them successful, which was even more important. And my other problem was I didn't have money. And this realization that I need to grow the pie didn't make me have more money. Some of the companies that were involved in the multilingual initiative had money and they were investing into sponsoring some of their people. But I didn't have money on the scale of 1,300 people. Like, that was not possible to achieve. So I needed to figure out something else. And so what I started to look at is how to make people happy. Because they would come here if there's something in it for them. They would join us if there's something in it for them. And I read a bunch of stuff, and some of this clicked together afterwards, and it provides a great structure for this talk. But some of this I basically figured out on the go. And so I think the best structure for this talk is these three words. This is from Dan Pink's book called Drive, which is one of the three books that I suggest you read on this topic. So, Dan Pink, Drive. And he highlights that people like working on things when they have autonomy. So they decide for themselves. They decide how they solve problems, who they solve problems with, how they move forward, etc. People thrive if they have mastery, so they can get better at things. They improve. They can try new things and improve in them and get challenged. And they thrive when there's a purpose to what they are working on.
And so if we can figure out, if we can crack the code on those three things, then it works really well. And I think we cracked the code in the multilingual initiative, and this is how we did it. So I think the purpose is sort of easy, at least for the people that were involved in my initiative. They were primarily in Europe, but also somewhat in Canada and somewhat in the northern US. And they had personal needs for multilingual. So obviously they had the purpose of solving their own problems. But there was also some higher purpose. Like, if you just look at where Drupal is used: UNESCO uses multilingual Drupal to help with education and children and refugees and stuff like that. CERN uses multilingual Drupal to advance science. Tesla is using multilingual Drupal to promote their technology. And you can configure your car through Drupal on the Tesla.com website. Rathetti is using Drupal extensively and they invest money back in open source as well. While we are in Brussels, it's hard to avoid the European Commission. It's using Drupal super extensively. This is in Hungarian, europa.eu. But they have 300 websites that are in Drupal. Most of them are multilingual, obviously; in Europe it's hard to do anything without it. And they have more than 100 developers on staff that are working on their Drupal websites. So it's super extensive. But I mean, these companies can pay their way to solve their problems. If you have 100 developers, even if multilingual is hard, you can solve that. If you're Tesla, multilingual is hard, you can solve that. So that's not really what gave me purpose. What gave me purpose is that my high school's website, where I started working on open content, is running on Drupal. Totally accidentally, as I was not involved. Totally randomly. So this is the high school I went to. So they could make a Hungarian Drupal website that's fully Hungarian and works very nicely. It's not multilingual, but it's Hungarian. So that gives me purpose.
If we can make it work in a way that the little websites can do it very easily, then we succeeded. So that was my purpose here. The autonomy part, I think, is much harder to solve if you come from a traditional open source developer background. Because I think many people that start open source projects are great developers. They have this idea of what they want to do. They have this architecture in mind; they know how to get there and the steps to get there, and they are building it. And they want to have people along for the ride, but they don't want people to tell them what the architecture should be and what the steps should be to implement it, et cetera. So to give autonomy, you need to understand that you need to agree on the high-level goals and get rid of your idea of micromanaging anything below that. So you need to be comfortable with the idea that you define these high-level goals and it's up to the team to figure out the rest. And maybe it's not the same architecture that you wanted. Maybe it's not going to be exactly on the timeline that you expected, or the steps that you expected, but other people will implement it. If you share the same goal, there's going to be shared ownership and they will implement it. So this is one of the things I've been trying to mentor initiative leads on in the Drupal community ever since, because it's very hard to come from a developer background with an idea of how this should be done, and then give up that idea and work on organizing the whole thing instead of implementing the whole thing. But to achieve some scale, you need to give that up. The next one, especially in the Drupal community, is to set up space. Because when you have these big thousands of people at the conference, there is no identity, there's no space, there's no feeling of community for the team that you have, unless you set up the space.
So for example, what we did here is we have a chat room. This used to be IRC; now it's Slack. That is shared in the team. We use chat meetings and threads so that it's easy to get involved with multiple language backgrounds. It's much harder to follow live audio meetings and video meetings when it's not your native language and it's very fast. Chat meetings are much easier to follow. We had this identity that was created by one of the sponsor companies: the logo of the multilingual initiative. We had stickers of this, we had t-shirts of this, etc. When we went to events, we had tables where we set up a big sign that this is the multilingual initiative, so people came in, they would recognize us, they would join us. We were always there in the morning, by the way. That's a good trick, so that we were the default choice at the contribution room. When people came in in the morning, they were like, oh, multilingual initiative, great. So that allowed us to have this sense of small community that we needed within this big community, to have a sense of belonging and a sense of connection, so that people stay and have those personal connections that otherwise are not possible in this big community. We also had our own website, which you may or may not need, but it was nice to have our goals set out there. And we basically pulled issues from the issue queue and used tagging and labeling on issues to prioritize them and then display them nicely, so we didn't need to do a bunch of work manually on the website itself. Now, then you have people. I think the next important thing is to have buddies, set up buddies for things. At least in the Drupal community, there are always at least three people that you need for an issue to be committed. There's somebody that works on the fix, there's somebody that reviews the fix, and then somebody that commits the fix.
So if you need three people to work on an issue, you need to set up those three people to be successful; that's not going to accidentally happen. Like, if you walk into this keynote room saying, I have an issue, nobody is going to listen to you. So what we did is, when new people came in and they were like, I want to help, we always assigned them to something that somebody else was already working on, because then they had a buddy that was already invested in the issue that they came in to help with, so there was already shared understanding between those two people that they want to solve this problem. And once we had that, we had these buddies, so that if one of them went away, we still had a solution for how do we move this along. There was still one person left that could serve as a successor and introduce the problem to the next person. So it was pretty useful to keep things going, because stuff happens to people. Like, all the main people that I had in the initiative, something happened to them, and it was always useful to have buddies that shared the same goals. And that was basically the only way to get stuff done in the Drupal community anyway. So I think that was pretty much a key to our success. And the next thing that I realized is we need to praise the smallest of results, because people don't really recognize that they are going towards a goal and achieving something towards the goal unless you point that out. And often people forget, after a week or so, that they did something great, so it's good to get back to that. And in the meetings, we always had a section where we were praising the results from the previous meeting and figuring out who did those things and calling them out as well. And the other thing that's super important, I think, is to praise the people that go away, because when they go away, they probably already burnt out two or three months before; they just didn't realize it.
And it's good that they went away. It's good for them because they need the break. And it's good for you because they're not going to be here and maybe have negative effects on the team. And it's good for you because if you are praising that they need this break now, it shows the team that they don't need to overwork here and they don't need to kill themselves for this project. We'll figure this out. And the person that you celebrated for taking the break may actually come back after they took the break, if you've been kind through this process. So there's no other option, I think; it's the win-win-win-win-win to praise people that go away, because it's the best for everybody. So if you do these things, you have those buddies, you have a small tight-knit community, even within the bigger community. You have this space. You give them autonomy to work in their own ways. You just share the high-level goals. And then you have this shared ownership of things. And maybe it's not going to be implemented the way you wanted, maybe it's not going to be implemented by the same people you started with or on the timeline you wanted, but it's going to have shared ownership. And that was kind of useful for me when I had a problem. So a couple of years into this initiative, and it was a long initiative, I had breakfast with my wife, and she started having very strong stomach pain that didn't end. And so we stopped our breakfast there and we went to the emergency room. And they figured out that her blood results were getting worse and worse, but there was no blood to be seen anywhere. So they figured out that it was internal bleeding, and she was about to die in a couple of hours if not operated on immediately. And so they assembled a team of doctors that would operate on her that night, and they saved her life. But she lost one ovary on the way. But she lives on, and we still remember this day. And at the same time, DrupalCon Austin was happening.
I was supposed to be there and do all of this magic with the multilingual initiative. And I was obviously not going to travel to DrupalCon Austin when my wife was recovering from a life-saving operation. So because we had this shared ownership and shared understanding of the initiative, all the stuff that we were planning for the multilingual initiative happened in Austin. They sent us flowers and cards and well wishes, and they sent us this photo of some of the people on the Contribution Day to wish us well. But this was because we built this initiative to do it together. And so mastery is the final one, which is probably the most interesting thing, I think. Because people want to get better, and you want to have people on your open source project. And so the question is: what is it that people may want to get better at, and what do you need that they may want to get better at? So that's what we are looking for. So one of the things that I've been doing at events is therapy sessions, because multilingual used to be very painful in Drupal and people had pain. And so I set up a multilingual therapy BoF, is what it was called, on the schedule. And I would sit back and I was like, do you want to talk about it? And they wanted to talk about it. And what this was great for is, A, I got in the users that had pain about multilingual, so I could have a requirements list of what I want to solve in the multilingual initiative. They got to talk about their pain, so they felt heard. The people that were contributing on the initiative came in to the BoF and they felt like they are the experts, because they could give advice to the people that had the pain. So I was basically sitting there; I didn't do anything at this BoF. I said, do you want to talk about this? And the experts came in from the initiative naturally, and the people with the pain came in, and I just sat there and I enjoyed it. So that's the investment.
So the experts, basically the people working on the initiative, came in and they gave advice to people with the pain. And we got to show at the BoF: this is what we're working on, this is how it's going to make your life easier. We feel your pain. Yes, it is something that's hard right now, but this is how we are solving it. And so we could build that feedback into what we were working on. We could review the solutions that we had with the people with the pain. Does this solve your pain or not? So it was very good to get direct feedback, it was very good to have them feel listened to, it was very good to provide visibility to the people that were contributing and get professional recognition for them, sometimes business, because they were giving advice to clients that were showing up in the room and they may get a business relationship after the BoF. So it was great. So I think it was important to acknowledge that multilingual was a pain and provide this space in person as well. The next thing we did was radical openness about how we organized this initiative. So we created an open source slideshow, for example, that anybody could present anywhere. And they translated this slideshow into multiple languages, presented it in Japan and Poland and France, and brought it to companies and a lot of places. So we just gave this slideshow away and we didn't ask for anything in return. And this brought the news of the initiative far and wide across the globe, everywhere, that this is happening, and made people excited. And it also gave people the opportunity to deliver sessions who had not done it before; they hadn't built a slide deck that was compelling or anything. And this was useful for them as well. We made a Drupal distribution which had a demo of how this multilingual thing would work. It had demo content and demo menus and a bunch of features set up so that people could try out how they can do it. And they could try out how this would work, and they could test it out, and we could get feedback. 
We created a two-hour workshop with a 23-page handout that detailed the steps of how you build this distribution, basically. How you build out a multilingual menu, how you build out a multilingual content structure, etc. Super detailed. That's why it was 23 pages. It was like: click here, write this, click here, write this, in detail. So this was very useful for people to do these workshops and teach people how to use the multilingual Drupal system before it was even done. Like, we were already training people on multilingual Drupal before we were done. And with the help of Acquia we created a user testing script that could be crowdsourced, so that people could do user testing at their meetups, at their local events, and record them and publish their results, and we could aggregate the results and use that to inform how we are doing and where we need to improve the user interface or the flows or how all the things are connected. And I've been doing a bunch of research and reading in the meantime and read a bunch of interesting tricks on how to involve more people. So this is one of them: car wash loyalty. There was one interesting story about car wash companies. They want to have people come back to wash their cars. And so they did an experiment where they had a car wash loyalty card with eight slots that you could stamp and then get a free wash at the end. And they did another card that had ten slots, but two were already stamped in. It's the same eight slots, but there were two more that were already stamped in. How much better did this one work for people? What do you think? It worked twice as well. So the ten slots with two already stamped in worked twice as well at getting people to ten stamps as the eight empty slots. They had the same exact number of empty slots. But this one told you that you are already on your way to achieving your goal. So it was like you had already started. 
You had two stamps even though you didn't do any car wash. It was just two stamps, and the first stamp you earned was the third stamp on the card. And the people that had this card got there faster as well. Not just twice as many people got there, but they got there faster. So I had to translate this to open source contribution. So one thing that I did is I wrote blog posts about how Drupal multilingual was going. And I broke down the initiative into, I think, 18 posts or so. Like: this is what we do for multilingual installation. This is what we do for interface translation, etc. And at the end of the post I had a section of: by the way, this doesn't exactly work well yet, and these are the issues that you can be involved with. So people read about the exciting thing coming up. They got informed. And at the end they got roped into helping with solving the problems, because they already felt like we are getting this great solution and it's almost there. I just need to help with this one. That helped a lot. So at the end we got 1,300 people involved, and that included people from companies like NBC Universal and Pfizer and Carrefour and the University of Waterloo, University of Iowa, Biologist, Genetic Information Management, McGill University, Johnson & Johnson, Ticketmaster, Google Summer of Code, Google Code-in, you name it. So all of those sources had people that were involved. So this is the list of people. Too fast? Wanted to spot yourself? So there's a lot of people. And basically all it took was for me to understand that this is not a fixed pie, that we need to look at how we grow this pie. We need to figure out what's in it for people to come in here and grow and be involved. And for me to figure out that I need to give people the autonomy in this project to figure out how they're going to solve this problem. 
So they have a shared ownership of solving this issue; for them to have ways to get better at things, for them to master their craft, for them to improve on their own terms; and for us to have a shared purpose on why we are doing this. So if you want to read a lot more about all of these things, these are the three top books that I would suggest in this area. David Marquet's Turn the Ship Around! is great for handing off autonomy. He's a nuclear submarine captain who was training for one type of submarine for two years and then got reassigned on a week's notice to another type of submarine that he had no idea how to operate. And so he needed to figure out how to give autonomy to the crew. It's a great book. Daniel Pink's Drive is about this whole structure of autonomy, mastery, and purpose. And Switch from Chip and Dan Heath has a lot of great stories and solutions and tips about how you make people do things that they probably wanted, but you need to convince them. So the car wash story comes from there. There are no software stories in there, by the way, nothing. But there are a lot of stories about what people do about glove ordering in a hospital or kids' cancer treatment or a bunch of other things. There's a lot of great stories in there that you can apply in one way or another to open source as well. So that was my talk. Any questions? You've left them speechless. All right. When does your book come out? I don't have one of my own. Thank you. All right. Oh, there we go. Yes. So the question was that Drupal has this challenge in all kinds of other areas as well, whether this 10x or 100x applies to all kinds of other topics. I think so, yes. So I think we've seen some of the recent initiatives that people were really driven to implement that had similar approaches; the single directory components initiative, for example, had a very similar approach on a smaller scale. 
So I think we could apply a lot of these to other initiatives. We've been trying to mentor initiative leads on these ideas. And we've been successful in some of these ways, like how we involve people from events in initiatives. That's been a track that we've been really successful in working through. But there's definitely a lot more that could be applied from here. Yes. Yes. Please submit a proposal to the TYPO3 Developer Days next. All right, please. The suggestion was to submit proposals for the TYPO3 Developer Days. When and where is it? It's the first, second and third of August, in Karlsruhe in Germany. The first three days of August in Karlsruhe in Germany. That's for the camera. TYPO3 Developer Days. You. Yeah, me too, but the listeners as well. Great. Yeah, thank you everyone. Have a nice day. Thank you.
Breaking Barriers: Content Management Systems and Accessibility
So everyone, welcome back to the Open Web Alliance Dev Room. Oh man. Cut that bit. Cut that bit. Welcome back to the Open Website Alliance. The Open Website Alliance launched today, so I'm allowed to practice saying it three more times before it has to be free of mistakes. So in this Open Website Alliance dev room, we're now going to proceed with two friends from the TYPO3 community and from the Mittwald hosting company talking about accessibility and content management systems. And they are going to be so kind and work their way in over the next 40 seconds to get started talking. Please, Martin and Lucas. All right, thank you so much. Thanks for the intro. Good morning. Can we still say good morning? It's before 12. Yeah, we... All right. All right, good morning everyone. We're going to be talking about accessibility and content management systems. We'll get back to that bit in a second. I'm Martin. I work at Mittwald, responsible for developer relations over there. In case you don't know Mittwald, we're a hosting company from Germany. We specialize in providing support for agencies, mostly in the open source CMS space. I also do some lecturing on computer science. And just a little disclaimer: I only started learning about web accessibility recently, and most of what I know I actually learned from Lucas, who's standing right next to me. Yeah, I'm Lucas, and accessibility is quite a personal topic for me, as you might have noticed already. So I have personal experience with a lot of issues you run into if you have some accessibility needs. And I'm a product developer at Mittwald; I think I started at Mittwald four years ago. And for about ten years I've been a freelance web developer and developed themes, plugins, custom solutions and stuff like this. Yes, and I have advocated for this topic in our company for a long time now, and now it's starting to get some action. And so we are here, talking about accessibility today. 
So I think, or I might speculate, that one reason this topic has been gaining traction in the past few months and years is that there's new legislation coming up. For example, the European Accessibility Act comes into effect, I think, next year. There are also other laws that go in the same direction, like in the US there's the Americans with Disabilities Act, which has been around for a while, I think since the 90s actually. So I think this might be a reason that some people start seeing this now. But in the next 30 minutes we actually plan to convince you that fear of legislation should not be the actual reason to consider accessibility as something that's a good idea. Because in actuality it's about enabling access for everyone, and not getting sued should not be your primary motivation for this. In a perfect world we shouldn't actually need to have these laws; it should just make natural sense to be as accessible and inclusive to everyone as possible. Sorry, I'm getting confused. To understand what we are talking about today, we first need to define some things, and mostly there are different kinds of barriers you can run into if you have some accessibility needs. And we've listed some here on the slide. First of all, there are perceptual difficulties. For example, if you have some visual disabilities; a lot of you are wearing glasses, and this counts as well. And all of you have heard about color blindness by now, contrast issues and stuff like this. These are the most obvious things a lot of people think about. But there are motor issues as well, as you can see. I mainly use my right hand, so a lot of keyboard shortcuts are quite difficult for me, and stuff like this. Or maybe you have Parkinson's disease and you can't do small movements or keep your hands still. This is another problem. Mental and cognitive issues: I think all of you have heard about ADHD. 
It's difficult to concentrate for people with ADHD if something is moving on your website or even on slides. Yesterday there was a slide where an animated GIF was running all the time, and it was really hard to concentrate even for me. And someone with ADHD would have really, really much trouble concentrating on such stuff. And videos or animations on websites, it's the same issue. So think about this when doing such stuff. Remembering stuff, especially in our short-term memory, might be quite difficult for some. In marketing there's the rule to put at most eight items in your menu. This is one of the points, so think about this as well. And then there are two more topics we want to talk about, because there are also technical and economic aspects to accessibility. Maybe not all people want to or can buy new devices, a new smartphone every year, or because of environmental reasons they just don't want to. So technical accessibility is a problem as well. And maybe your TV: we have some TVs in our company which have very low contrast ratios. We were actually secretly hoping for one of those typical university beamers here, just to bring the point across better, but it actually turned out to be all right. But you can see it in the top right corner: the Mittwald logo is quite difficult to see, because it's white text on a white background. Not a good example. And maybe you have one of the famous slow internet connections on the famous Deutsche Bahn. Then you will run into trouble as well. Yeah, and of course low-end and old devices, as I just talked about. And you have to keep in mind that sometimes financial problems correlate with physical disabilities, because people already have to spend much money on accessibility devices like wheelchairs and stuff like this, and don't have that much money to put into technical devices. I think disability also correlates with unemployment a little bit. And even if it's not a disability like mine, you might also have a disability sometime. 
Maybe later your eyes get worse, or you have a temporary disability. A colleague of mine tore his Achilles tendon last year, I guess, and suddenly he noticed how hard it is to get up the stairs at the office. So everyone could be affected sometime. Or even if you're just holding your coffee cup in the kitchen, suddenly you only have one hand to use your smartphone, and such stuff. And a lot of you might be the kind of keyboard nerd who likes to use the terminal for all their stuff. And such people will be happy to use your website only by keyboard as well. So accessibility needs can also just be a personal preference. Yeah, so as you can see, there are a lot of aspects you have to think about. Some of you might think this is really expensive, this must be really hard work. But in fact it's not that difficult. So I think many people think about adding accessibility after the fact, which I can imagine does get painful, because there is a development effort in that. So obviously the most obvious recommendation would be to consider accessibility from the start if you're starting on a green field. Then it's actually not that difficult if you think about it from the get-go and just think about it as a quality measure, like you would also think about code quality. Who's a developer here, actually? So most people. I would guess most of you wouldn't think about skipping testing. That would be insane. And really irresponsible, wouldn't it? I see some people struggling. Like, who would do that? And even if you're arguing about money: if you're building an inaccessible product, you're actively excluding users from your product. And I think there are estimations that it's potentially about 15 to 27 percent of users that you're excluding. So if you're taking measures to include these users, that's going to pay for itself in profit. And even if you're not starting on a green field, there are also synergy effects with other quality goals. 
For example, we've talked about sensory issues. For example, if you have limited vision or limited hearing and you provide audio content like podcasts or video streams, one measure that you could take is to provide a transcript for your audio content. Now, if you have a transcript, that's text, and a search engine can crawl it. Boom! Instant SEO. We've also talked about accessibility issues on lower-end devices. Because not everyone wants to buy a new iPhone 15 for 1500 bucks every year. I don't want to. So if you're targeting lower-end devices, you need to start thinking about limiting resource consumption and being more efficient; you need to start thinking about performance optimizations. Which in itself is also an important quality goal. And what I'm just thinking about now is that this also has a sustainability impact, because you're enabling users to keep older devices longer before they need to be replaced, before they become obsolete. You also minimize resource consumption, and you also minimize battery drain on mobile devices, for example. So that's also an important synergy. There are some quality goals where it gets a little bit more complicated. For example, when we talk about security, there might be some trade-offs that you need to make concerning accessibility. I think a general rule of thumb might be: the higher your security requirements are, the more you need to think about accessibility so it doesn't become a problem. For example, if you're enforcing multi-factor authentication, you need to think about a way to make that accessible. Or if you're in even higher security areas, like if you want users to use TLS client certificates or something like that, that's a very high cognitive load that you're placing on users. Last week I used my German eID card for the first time to log into a service. That's not for everyone. I honestly have no idea if there's an accessible way to actually do that. This is not an impossible problem. 
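The transcript point above can be shown declaratively with the standard HTML `<track>` element; a minimal sketch, where the file names are purely illustrative:

```html
<!-- Captions make the content accessible to users with limited
     hearing, and the same text can be published on the page as a
     crawlable transcript. File names are illustrative. -->
<video controls src="/media/episode-12.mp4">
  <track kind="captions" src="/media/episode-12.en.vtt"
         srclang="en" label="English" default>
</video>
```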
There are guidelines for this, and I'm just skipping a little bit ahead to the actual solutions. There are guidelines on accessible authentication. Most of these boil down to reducing the cognitive load of authentication. The recommendation is that your authentication process should not depend on a cognitive function test. Who remembers those? This is... Can any of you solve this? I can't. This is just ridiculous. Luckily these have gotten a little bit out of fashion lately, probably because at the moment AIs can solve them better than we can. There are ways around that. It starts with simple things: if you have a username and password, remembering a password is also a cognitive function test, but you can use a password manager. You can copy and paste the password into a password input field. Don't prevent that. You remember those password forms where you can't copy and paste into them? Yeah, that sucks. If you're requiring multi-factor authentication, I think there are new standards coming up that we can use to make it more accessible, like WebAuthn, like passkeys. All of those reduce the cognitive load of the authentication process. So those are things we need to start thinking about, and we need to start thinking about implementing them. And users without disabilities also benefit from passkeys, for example, because it's just a matter of comfort for them. Absolutely. Now let's talk a bit about content management systems, because this is why we are here. And in content management systems you have two sides of the same coin, I think. On the one side you want to provide accessible content to the end user, so to speak. So one important part is that our editors have the option, the possibility, to create awesome and accessible content in the first place. And of course the editing experience itself, so the backend, should be accessible as well, so that editors with accessibility needs can edit the content themselves. 
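A minimal sketch of what a paste-friendly, low-cognitive-load login form can look like (field names and the action URL are illustrative): the `autocomplete` attributes let a password manager do the remembering, and nothing intercepts paste events.

```html
<!-- Sketch: labels are programmatically tied to inputs, and
     autocomplete hints let password managers fill the fields. -->
<form method="post" action="/login">
  <label for="user">Username</label>
  <input id="user" name="user" type="text"
         autocomplete="username" required>

  <label for="pass">Password</label>
  <!-- No onpaste="return false" here: blocking paste locks out
       password-manager users and raises cognitive load for everyone. -->
  <input id="pass" name="pass" type="password"
         autocomplete="current-password" required>

  <button type="submit">Log in</button>
</form>
```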
And think of a blind user who is trying to create his or her own blog and share their experiences with the world. They need to use a content management system. Yes. For the frontend side, so to speak, there are the Web Content Accessibility Guidelines. This is the most basic stuff; I hope most of you have heard of this by now. It's things like alt text for images, using anchor tags for links, using semantic HTML; I will talk about this in a second. And then for the editing side, there are the Authoring Tool Accessibility Guidelines. This is especially important for CMS systems. We won't go into detail on this, because everything you need to know about this you can read in the documentation. But we want to highlight some of the most important things. One of the most important things is using semantic HTML. And I would hope this wouldn't be necessary to say these days, but a lot of people still get this wrong today. Use semantic HTML: on the screen you can see the header and nav elements that HTML5 introduced for this. But it starts with the basics. Use list tags in your HTML if you mean it to be a list. In the wild you still see a lot of paragraphs all starting with a dash. This is garbage for screen reader users. They can't use this. So please use semantic HTML, and most importantly, make it easy for the users of the content management system to use this semantic HTML. Maybe provide some automatic functions: if a paragraph starts with a dash, automatically convert it to an unordered list, so you can skip to the next thing. And provide such automatic features. A lot of messengers you might use do this by now, and our CMSs should do this as well. I think a good example of how this really works out is this screenshot from the WordPress backend, where you have this semantic outline view where you can actually see the hierarchy of your headings. And the system also points out where you built something inconsistently. 
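To illustrate the point about lists (a generic sketch, not taken from the slides): the two fragments below can be styled to look identical, but only the second one announces itself to a screen reader as a list with three items.

```html
<!-- Anti-pattern: visually a list, semantically three paragraphs.
     A screen reader reads each dash aloud with no list context. -->
<p>- Apples</p>
<p>- Pears</p>
<p>- Plums</p>

<!-- Semantic version: screen readers announce "list, 3 items"
     and let users jump over it or step through item by item. -->
<ul>
  <li>Apples</li>
  <li>Pears</li>
  <li>Plums</li>
</ul>
```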
And this is really important, to enable editors to build consistent content structures, because screen readers actually build on those. I think JAWS is one of the most prominent screen readers; it actually offers navigation options based on the hierarchy of headings. So if you mess that up, you're also going to mess up the screen reader. So I think what CMSs in general should do is discourage users from placing certain kinds of headings just for aesthetic reasons. Because people do that. I see it in the nodding, yes. So there should be ways to discourage that, even if it's just pointing out where things went wrong. And we also need options to be able to configure the CMS to prevent users from doing that. This is a screenshot from the TYPO3 backend, in which, as an integrator, you really need to know how to configure the system to prevent users from doing certain things. At the top there is the content element heading, which, I think, in the default distribution is rendered as an H2. But nothing prevents an editor from inserting their own H1 into the body of the content element itself, which will mess up the hierarchy of headings. You can disable that as an integrator, and you should, but you need to think about it and you need to remember it. For example, the Gutenberg editor in WordPress also has the option to set a custom H1 heading. And for the websites I develop, I mostly disable this option deliberately, so the users who edit the pages cannot make this mistake by accident. I think in general there are some parallels to search engine optimization, aren't there? Because whether it's a screen reader or a search engine, you need to build your site in a way that a machine can make sense of it. And that's why it's so important to adhere to... basically it all boils down to adhering to the standard. Another thing that we'd like to point out is that users are going to have certain expectations around how a system works. 
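The kind of consistency check described above can be sketched as a small pure function (a hypothetical helper, not code from any CMS): given the sequence of heading levels in document order, it reports every place where a level is skipped.

```javascript
// Sketch: flag headings that skip levels (e.g. an H4 directly
// after an H2), the kind of inconsistency an outline view highlights.
function findHeadingSkips(levels) {
  const problems = [];
  let previous = 0; // 0 = nothing seen yet, so an H1 is always fine
  for (let i = 0; i < levels.length; i++) {
    const level = levels[i];
    // A heading may go at most one level deeper than the previous one.
    if (level > previous + 1) {
      problems.push({ index: i, expected: previous + 1, found: level });
    }
    previous = level;
  }
  return problems;
}

// Example: H1, H2, H4 skips H3.
console.log(findHeadingSkips([1, 2, 4]));
```

Going back up (H3 after an H2 section, say) is fine; only downward jumps break the outline that screen readers navigate by.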
I read a case study a while ago on the TYPO3 blog. I think CPS did it. They tested the TYPO3 backend with two or three blind users. And I think the gist of it was that you can already use the TYPO3 backend right now with a screen reader, if you have received appropriate training in doing so. So if you're in that situation, you're relying on the system working the way you were trained and expecting it to. So the most important thing is to not break the expectations that users place upon the system. This goes in very, very many different directions. One example is conventions around the menu structure. Each and every CMS backend has some kind of menu structure, and there are conventions and expectations around where things are placed in that menu. This gets especially important when we're talking about plugins or extensions or modules or whatever; many names, same thing. Because we're not only talking about the CMS core, we're also talking about the ecosystems around the CMS, the third-party extensions that extend those systems. So what we thought of as important is that extension authors should be encouraged to adhere to the conventions that apply to the navigation structure in the CMS backend, for example. Another thing is UI components, for example. I think there are some WordPress plugins that completely roll their own UI library that behaves entirely differently from the rest of the CMS. If you're relying on your expectations around how the system works being met, then this would really confuse you. One other thing is clutter. I think this is a screenshot from a fairly representative WordPress backend. This is the plugin list. Let's have a look at this. That's a feedback prompt. That's another feedback prompt. That's an ad. That's another ad. That's a maintenance message. We talked about ADHD, didn't we? I also don't know what would happen if you piped that through a screen reader. 
I honestly don't know. Nothing good, probably. This is another screenshot. This is the WordPress... Was it the same site? Yes, it was the same site. The dashboard itself, same site. The same feedback prompt. Another feedback prompt. Another maintenance message. That's an ad. That's completely useless. I want to point out this news section is from the podcast plugin in this installation. The news in this widget are about digital cameras and YouTube videos. It's nothing about podcasts. You can disable all of those as a user, but you're still being confronted with the cognitive load of it; you need to make sense of it at least once before you can disable it. This is the perfect handover to my next talking point. It's about giving the user the choice and respecting choices the user has already made. As you just saw in the WordPress dashboard, users can disable the widgets in the dashboard, but there are a lot more choices the user has already made when using your system. All of the operating systems nowadays provide some kind of dark mode, for example. It's really easy to honor the dark mode setting in CSS. You can click, I think. We messed up the order in this list. Another point is opening links in new tabs. I often see that suddenly a new tab opens. For me, this is quite normal because I know the tab system; I'm used to this. But when giving courses to my clients, I can see that they often get confused. Once it was a worker at a school, I think; she tried to do something in the WordPress backend, and a new tab opened. For me, this was obvious. She tried to use the back button of her browser, and she didn't get back to where she was before. Her solution was closing the browser window completely and starting from scratch. This was a pain, not good usability. I think my mom would do that too. Of course, a lot of people do this because they don't use the PC or the browser as often as we do. Think about these users. Leave the choice of opening links in new tabs to the user. 
All of us could just hold the Control or Command key, and other users might use right click, open in new tab. Give the user this choice. He or she knows when he or she wants it. Now I have to click twice, so I can get to my next point. The user can select a font size, a minimal font size, in the browser settings. This is very important for people wearing glasses, or maybe later if your vision gets worse. We have a colleague whose IDE is set to 48 points. It's really, really large, as you might imagine, but for him it's the best way to code. The IDE respects this choice, and so should you on your website. Just the other evening we were reminiscing about building websites in the old days. Do you remember when websites had their own buttons to adjust the font size on the site? You remember that? That's completely superfluous, because the browser does it for you, if you don't work against the browser. The next talking point I already talked about before: reduced motion and video autoplay. Just don't. I think every one of us is annoyed by autoplaying videos on news pages. It's the worst stuff. Reduced motion is a CSS media query. Maybe you can click through so we can see the CSS snippets. The browsers these days implement these choices; you can easily opt in and reduce smooth scrolling or animations if the user has chosen to. This is minimal effort, but a lot of users will benefit from it. Increased contrast and color vision: I've already talked about this. Just think about the fact that some colors might have the same gray value, so to speak. Once during my college lessons there was another student who asked my professor what this large circle on the slide was about. It was a pie chart, and all of the slices of this pie chart had different colors. It was obvious for us, but for him it was just a large gray circle. Because, yes, he can only see grayscale. No colors at all. We didn't know this until that moment, but for him it was really difficult. 
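The user preferences discussed above (dark mode, reduced motion, the browser's font size setting) can all be honored with a few generic lines of CSS; this is a minimal sketch, with illustrative colors and selectors:

```css
/* Size text in rem so the browser's font-size setting is respected:
   1rem tracks the user's chosen base size, not a fixed pixel value. */
html { font-size: 100%; }
body { font-size: 1rem; line-height: 1.5; }

/* Follow the operating system's dark mode preference. */
@media (prefers-color-scheme: dark) {
  body { background: #1b1b1b; color: #e6e6e6; }
}

/* Switch off non-essential animation and smooth scrolling
   for users who asked for reduced motion. */
@media (prefers-reduced-motion: reduce) {
  * { animation: none; transition: none; }
  html { scroll-behavior: auto; }
}
```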
One thing I always run into when trying to play games with my friends is customizing keyboard bindings. Some games allow you to change keyboard bindings; some games don't, and some of those games I just cannot play, because I can only use one hand when gaming. You still beat me every time in each of these games. Yeah, sorry. So maybe give the user the option to set their own keyboard bindings, even in your CMS. This could be customizable. Think about such stuff; it really, really helps some people. Yeah. This is an argument I also hear quite often: thank you. An image with text in it is not readable by the search functionality of your wiki either. We didn't go much into the technical details of accessibility; our goal was basically to make you think about why accessibility is important and what it's actually supposed to solve. I'm seeing the animation order on the next slide is going to be messed up; apologies for that. I would encourage you to think about accessibility not just as a feature that your PO can throw into your backlog, where you tell yourself that you're going to build it when you have the time, even though you already know that you're not. Please also don't think about accessibility just as risk avoidance, so that you don't get sued. If you want to think about accessibility as anything, think about it as a human right that everyone should have and enjoy. It's just about making your product as accessible and as inclusive as you can, for everyone. And there's the last animation step; I said the order was messed up. All the technical stuff, all the accessibility guidelines: you can test this. There are automated test suites; you can pipe it through Lighthouse, and Lighthouse gives you a green check mark, for example. But don't add accessibility just to have that green check mark; you need to understand what you're actually doing it for. Thank you. I didn't drop it. Wonderful. Congratulations. Thank you very much. 
In my experience, thinking about accessibility, two things really surprised me on my path. The first was the realization that there are temporary moments where we're all less abled. Like, walk out into the bright Brussels sunshine and look at your phone screen, or carry anything, or have a child, you know. So I found the idea that we're all in that category on a sliding scale all the time really interesting. And then, as a marketer actually, setting aside for a second whether your motives are pure, moral and ethical, your total addressable market for any given thing goes up by including more people, right? And the more semantic, machine-readable data you have, the more accessible it is, which is better for SEO, better for making sales, right? So there's no cogent business argument in my mind for not doing accessibility, right? Yeah. So, any questions from the room before we break for lunch? Yes, please. You should never mention lunch. Unless your question is what is for lunch, because I don't know.
So the question was whether it's a bad thing to provide a dark mode switch for the user on the site itself. It can be a struggle to implement dark mode, for example, but the default option should be the option the user has chosen. Yes. All right, so, thank you very much. Thank you.
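The answer's principle, that the default should be whatever the user has already chosen, can be sketched with the standard `prefers-color-scheme` media query (the talk didn't show code, so this is an illustrative fragment, not the speakers' own):

```css
/* Light theme as the base; custom properties keep the
   theme switchable from one place. */
:root {
  --bg: #ffffff;
  --fg: #1a1a1a;
}

/* Follow the user's system-level dark mode choice by default.
   An explicit on-site toggle can still override this later. */
@media (prefers-color-scheme: dark) {
  :root {
    --bg: #1a1a1a;
    --fg: #f0f0f0;
  }
}

body {
  background: var(--bg);
  color: var(--fg);
}
```

An on-site switch then becomes an override on top of this default, rather than a replacement for the browser setting.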
Wrestling giants: How can free open source CMSes remain competitive with enterprise clients?
Yet another old friend and open source acquaintance of mine, Owen Lansbury, is with us. Owen volunteers for the Drupal Association and has been doing open source, I'm sure he's going to tell you, for a very long time now. And touching on some of the topics that Matthias and I opened the day with, like how do we really compete on a business level with the big proprietary companies, and what's the thinking going on in Owen's head and in Drupal-land about that. And a special round of applause for Owen and his presentation shirt. So, oh sure, can we do this? Is that the speakers John Tick? I wasn't sure whether there'd be five people or five hundred for this talk, so thank you for having me. Jam's given the introduction about what the talk is about; I'm speaking from the Drupal perspective about the challenges that we're facing. I have been involved in the Drupal project since around 2007, I think, was my introduction. Since then I've run a Drupal agency in Australia called PreviousNext, and I've been volunteering on the Drupal Association Board for about the past five years or so. While I am representing the Drupal Association here, the content of my talk is mainly my own opinions; I'm not really reflecting official Drupal Association policy. What I'm going to do is tell you a little bit about Drupal; some of it has been covered elsewhere, so I'll skip through what's already been talked about. I wanted to look back on the evolution of free and open source CMSs to understand where we might be heading with them. And then I'm going to dive into talking about how we wrestle the giants that we're competing against in the enterprise sector. At the end I'll talk a little bit about how we can help each other as open source CMSs, and then open it up for discussion and questions if you have any. So, a quick bit about Drupal. We have a saying in the Drupal world: come for the software and stay for the community.
And that community was founded back in 2001, probably just up the road here, in a college dorm room by Dries Buytaert. And we just had our 23rd birthday as a project only a couple of weeks ago. We have a very active community. There are about 8,000 people that actively contribute code to the project, but then there are probably hundreds of thousands of people that use Drupal every day in their jobs as content editors, as developers, et cetera. We're currently at version 10.2.2, is that correct? And that has almost 8,000 extension modules; if you're from the WordPress world, what we call modules you call plugins. And if you look at all the versions of Drupal, there have been about 50,000 modules written for Drupal over the years to extend its functionality. Importantly, unlike other open source projects, we don't have a commercial module ecosystem or a commercial theme ecosystem. Everything that you do with Drupal is completely free. And that was a very conscious decision by Dries Buytaert when he started the project. So I do sit on the Drupal Association Board, the not-for-profit organization that primarily manages the infrastructure around Drupal.org and how people actually contribute code to the project. We've had a big project to move to GitLab in the past couple of years, which has been a huge success for us. I think our build speeds have increased tenfold as a result. And then historically the Drupal Association has run DrupalCon in North America, which is our big flagship developer conference. We've had close involvement with the European event, and then various programs that drive the project forward. And then outside of the Drupal Association, the community itself is very self-managed and self-sufficient. So there are hundreds of camps and meetups and other little country associations, or big country associations in some cases, that the Drupal Association has little to no involvement with in many cases.
Now Drupal has always been at the forefront of open source and the open web. And this has been brought into focus recently where we have been officially recognized as a digital public good that in turn supports the United Nations Sustainable Development Goals. And as with many open source projects, the Drupal community is highly motivated by being able to build world-class software that anyone can download and use. And as has been talked about with TYPO3, the impact that that's now having in Africa is significant. And one of my kind of foundation stories with Drupal is having someone from Africa come to me and say, hey, we're using Drupal in Burkina Faso. Can you help me? And I said, well, sure, we've got this prepackaged solution that you can download and use tomorrow. And here are all the training materials. Off you go. So that is a really big driver for us. And the Drupal Association's mission statement is to drive the innovation and adoption of Drupal as this high-impact digital public good, hand-in-hand with our open source community. And we recently wrote a manifesto that defines our commitment to the open web. And I think that's been incorporated into our open web alliance, website alliance. What are we called again? So yeah, if you're interested in reading that, have a look. Now, in order to fulfill our mission of supporting the open web, we do need to be successful as a product in the open market. And I often don't like showing this slide, because what it's implying is that Drupal usage peaked in 2018, and it's been on this kind of downhill slide ever since. But there is a very different story here. So in 2015, we released Drupal 8, which was a significant architectural shift away from Drupal 7 and previous versions of Drupal. And prior to Drupal 8, Drupal had tried to be all things to all people. You could build your personal blog with it, or you could run the NASA website.
But what's happened since Drupal 8 is that we've clearly positioned Drupal as being for ambitious digital experiences. And these slides are from a keynote that Dries gave back in 2017, where he outlined that vision of how we're moving away from being suitable for smaller sites and moving into these larger scale sites. And of course, larger scale sites typically means enterprise-style customers. Now, this coincided with the rise of SaaS platforms like Wix and Squarespace and, of course, WordPress.com. And as they became more popular, what we've seen is that a lot of smaller sites have moved off Drupal to platforms like that, or other platforms that are a better fit for purpose. And I think with that previous slide showing the downturn in the number of installs, that's not really reflecting the true story of what's happening with Drupal. And so what we've been trying to do in the Drupal Association for the past couple of years is, rather than fixate on the number of installs, look at the health of our ecosystem and our Drupal economy, for lack of a better word. And if you look at the health of that ecosystem, it does tell a very positive story. So we have a listing of Drupal services companies on Drupal.org. For the top 100 of those companies, we estimate their combined annual revenues are about $1.5 billion US. And then once you've factored in all the other Drupal projects that might be built by internal teams, other agencies, etc., our guesstimate, and that's very much a guesstimate, is that the total market value annually for Drupal projects is about $3 billion US. So that is a big pie. I think Gabor talked about pies. I'm talking about pies too, because it's a big pie and there's a lot of competition for a slice of that pie. Now, unlike other open source projects, Drupal doesn't have a single company that's responsible for the majority of the code contribution or the finances that run through the community.
And the challenge that we've had with the Drupal Association is that our annual budget has historically been about $3 million to $3.5 million. So if we're talking about a $3 billion economy around Drupal, the Drupal Association is only extracting about 0.1% of that market value. So that has been a big challenge. And prior to COVID, the majority of that revenue came from running DrupalCon events. And of course, COVID hit, no one could go to in-person events, and we had to have an abrupt rethink about what the role of the Drupal Association is. And we had to refocus on how Drupal could be both sustainable and successful into the future. And so we've recently launched a new strategy that sees the Drupal Association play a closer role in both Drupal product innovation and also product marketing, two words that evoke quite a lot of emotion from people at times. And you might think, how is the organization that's at the center of this huge economy not involved in product management and marketing? But it is a big shift for the Drupal community and a big shift for the Drupal Association itself. So, as I've often mentioned to people concerned about this change: if our open source product isn't successful in the open market, the community around that product is going to shift to where the action is. And it's definitely a case of, if we can sustain a strong product and ensure a strong community, then we're going to be able to keep fulfilling our goals of providing a digital public good to the world. So all of these things are closely interlinked. So, in order to see where we can go in the future, I'm just going to very quickly rewind. I'm sure most of you know these stories. But this story goes back 25 years to when TYPO3 was released, and then in the five or so years after that, most of the products that we know and love came into being in some form. And I don't need to emphasize that a quarter of a century in technology years is like a millennium.
But for any of these projects to still be running successfully is an incredible achievement. And I think if you rewind to 25 years ago, when some of us were actually working in the industry, we were doing things like building our own custom CMSs, and then having to maintain them ourselves and hope they didn't explode. And then big clients were only really able to use products like Interwoven or Oracle CMS, which literally cost millions and millions of dollars to install. So as our open source CMSs came into that market, they were filling a really core need for reusable software that could be maintained across a broad network of people at a very reasonable price. And so as the mid-noughties turned into the 2010s, that really was a golden age for most of our projects. Most of us did very, very well. But then over the past decade, like I said, we have seen that shift towards SaaS platforms. We've got headless CMSs. Now we've got static site generators. There's a lot of options out there. And I was looking on BuiltWith.com for statistics for this talk, and they have 242 CMS products that people are still using right now, which is, oh no, 424. Sorry, I misread that. And I think through this past decade, what we've also seen has been the rise of the digital experience platform, which everyone from Adobe to Sitecore and every other brand has jumped on board with, primarily because enterprise customers do want to have one platform that they can manage all of their content and customer interactions through. Now, I personally consider DXP to be quite a clever marketing term. Often it's a mishmash of technologies that may or may not work that well together. But the key thing for those of us who are flying the content management flag is that we're getting shut out of those conversations when big clients are looking for a solution.
So within the Drupal Association, we've taken that very seriously, and we've started trying to formulate language around Drupal being at the core of an open DXP, and however you structure that open DXP is up to you. But we still have a huge amount of work to do in that regard. In the meantime, I think Jam had a slide showing Adobe's push into the government sector in Australia, and they very cleverly marketed a GovDXP platform to government clients in direct competition with, as Jam said, a government managed and run GovCMS platform that's based on Drupal. And they're winning huge contracts off the back of that, tens if not hundreds of millions of dollars. And this focus on DXP by proprietary companies is just driven by the scale of the opportunities. So Adobe themselves, they think it's worth $110 billion annually, and the amount that they invest to stay competitive in that market is around $2.7 billion in product development, sales and marketing every year. And then they make about $4 billion in total revenue. So do we as open source CMSs have any chance of competing at that scale? And the good news is we do quite well already. WordPress is obviously the elephant in the room. They run 40% of all websites, not just big websites. And in this category, this is the top 10,000 websites off BuiltWith.com, they're running a quarter of all of those sites. So open source is definitely winning through WordPress there. But as we go down the list here, you can see we've only got that little 1.4% gap between Drupal and Adobe. And I can guarantee you there are whole teams of people at Adobe looking at how they can kill Drupal to get that leap in market share. And I fully understand why we in the open source world often recoil at talk of total market value and market share.
But when we look at open source from the perspective of its philosophical underpinnings of being free, we do need to recognize that we still build products, and products need to be relevant in the market, because if they're not, there's only one path, and that's down. So in order to maintain a healthy ecosystem for our open source CMSs, enterprise customers are key, because if they're choosing your product, it means you're still relevant. It means that they can see a pathway for the next decade ahead, that your product's going to evolve and be supported. And it's a huge stamp of approval. Because enterprise customers are going to be pushing money through your communities, they're going to be driving feature requests, and the budgets that they're working with are just orders of magnitude bigger than anything in any other sector. So there are $100 million website rebuilds, but even at a much smaller scale, that enterprise investment flows through to the agencies building, maintaining and hosting these types of projects, and in turn, that provides consistent, repeat income that ultimately flows into our open source communities in the form of stable jobs, good salaries, community funding, and sponsored code contribution. So with that scene set, I'll jump into the meat of the talk, which is how we can actually survive and remain competitive when we're dealing with giants like Adobe. The first thing I want to dive into here, beyond a whole lot of bad stock photos, is understanding the buying process that these large organizations go through when they're selecting new technologies. As I said on the previous slide, they're often making these decisions for at least a decade into the future, if not longer. And you'll often hear that the responsibility for making those decisions lies with the CIOs, and increasingly the CMOs, the chief marketing officers. All of these people are very risk averse.
They don't want to make the wrong technology decision that backfires on their multi-billion dollar company, and they're all quite hard people to influence directly, sitting in their ivory towers. So these people will probably be the ones that are reading the Gartner and the Forrester reports. And as you can see on the slides here, there's only one recognizable open source brand there, and that's WordPress VIP. Now, full credit to Automattic, who run WordPress VIP. It's very much a for-profit service. So they've worked out that to get listed here, you need to have a single company with a certain amount of revenue that the researchers can analyze and then do an apples-to-apples comparison. And having the WordPress brand in these listings just gives huge legitimacy to WordPress for anyone that's using it. Also credit here to Acquia. So Acquia has been mentioned a few times. It was the company co-founded back in 2007 by Dries, who started the Drupal project. They do talk extensively about Acquia Drupal in the context of their broader DXP offerings. The Drupal story is kind of getting through there, but again, we don't have the Drupal brand name there. Every other product on here is proprietary. There was Squiz here, which started off as a pseudo open source product, but even they're totally closed source now. While I don't think it's that quantifiable how important these surveys are, just not having the Drupal, TYPO3 or Joomla brand names in them hinders their recognition significantly in this enterprise market. So I think there is an opportunity here. It would be a big process to go through to change the thinking of these analysts to start including open source projects. But if we have an open website alliance that can lobby them, and it is about lobbying, then maybe we have an opportunity to make that happen somehow. Now, the other way these C-level decision makers can be targeted is through events.
Unlike Adobe, we probably can't afford to pay Ryan Reynolds to come and do a keynote, as Adobe do. But in the Drupal world, again, Acquia have done a very good job with their Acquia Engage events, where they showcase customer success stories on their DXP platform, and through that the Drupal story gets told. And at the Drupal Association, we have tried to run some C-level decision maker events at DrupalCon, but we've had mixed results, because they're kind of mashed together into a bigger developer conference. And why would C-level people want to be at a developer conference? It goes against a lot of our open source principles, but they want to have exclusivity, they want to feel like they're special, they want to be networking with a very select group of peers, and they want to have strategic insights into technology that gives them competitive advantages. So like I said, that goes against so many of our principles in the open source world, but there's a formula there that we can definitely replicate in terms of targeting those types of people. And while our not-for-profit organizations that govern our communities might not have direct relationships with these C-level decision makers, our larger partner agencies definitely do. So the role that we can play as the community organizations is to give as much assistance as possible to those agencies to help them win or retain enterprise clients, just through playbook-style information. So does your open source project have a playbook that compares your product against a Sitecore or an Adobe, and has a whole range of answers as to why yours is a superior product? Do you have a pre-packaged demo that's consistently updated with new features and functionality, that can highlight your technology in the same way that a slick demo from Adobe would? This is an area we fall well short on in the Drupal world.
Most of our agencies are off replicating effort every single time they go and pitch to a new client, but this is a relatively easy issue to solve if it's given the right attention. And something that's related to this is being able to focus on the strength and scale of our global sales team. We don't have anyone that's responsible for sales at the Drupal Association, but every day we've got thousands of people out there pitching new projects to clients, telling slightly different stories, but selling Drupal to these larger clients. So, as I noted, we can play a role as the association by providing those salespeople with the tools to help them win those projects. We have started to address this within the Drupal community with a certified partner program, and I recognize that other open source projects have similar types of programs. And what we're doing with that is positioning our agency partners using the same language that a proprietary platform would use. Again, we're kind of compromising our core values by playing that game, but to play in the enterprise space, you do need to play a certain type of game. And then the really core group of people that you need to be convincing are the people who'll actually be using the products themselves: the developers, the content editors, the DevOps engineers, et cetera. Any C-level person will lean heavily on this group to give them evaluations and recommendations about what technology they should be moving towards. And where do they get their information from? They get it from their prior experience of having used different platforms. They might have used Drupal 7. What impression do they have of Drupal 7 compared to what it is today? They talk to their colleagues at other organizations who are using it. Hey, what's it like having a Drupal site? Can you find developers for it? And then, of course, the internet.
There's a whole range of challenges here, and we could spend an entire presentation on each issue, but the core of each of them is purely about perception. So does your 25-year-old open source product look and function like a contemporary piece of software? WordPress, despite the slides that we saw earlier, is often held up as the gold standard in terms of content editor experience, and we have paid a huge amount of attention to that within the Drupal community, updating our editor experience and our administration UI. And we've even got a project at the moment that's looking at integrating the Gutenberg editor into Drupal, which WordPress helped fund. Thank you. Another thing would be: is it obvious that your product can fulfill contemporary requirements, like a headless front end or integrating with a popular marketing automation platform? And is it easy to find qualified developers? This is a much bigger thing that I'll talk about in a moment, but for us as not-for-profit associations, we should be at the center of those initiatives, whether it's as simple as having a job board available or running a full certification program. Can a developer quickly download and install a demo of your product? That's something we struggle heavily with in the Drupal world. I went on the download Drupal page the other day, and the first thing it said was: you have to install Composer. I can't do that; I'm not a developer. So there are these big hurdles that we have to get through. First impressions around that are absolutely key. If it doesn't work the first time, you're probably not going to look at that technology again. And then, are there demos, case studies, and white papers that target specific industry verticals, that you can easily collate, put into a presentation, and then give that presentation to your C-level decision maker to convince them that your open source product is the best one?
Bearing in mind that any proprietary platform that's talking to that customer is going to be in there with a very slick demo. They're going to have their global digital agency partners saying yes to every feature requirement and yes to every question. So we do need to be playing that game in terms of convincing people, giving them confidence that moving to open source is the right recommendation. And then the final group that C-level executives will lean on is their incumbent agencies and consultants. So, are they recommending your open source platform to their big clients? Now, in the Drupal community, we've always had a core group of companies that both support the project and champion Drupal to their clients. And like I said, we've recently launched a certified partner program. But the key here is, with your agency network, how easy is it for new agencies to both upskill and become part of that certified partner program? And agencies are going to be attracted to technology they know they can sell to new clients and build a business practice around. So the key to these big global clients is being able to have big global agencies as part of those networks. Again, Acquia have done very well with that in their own partner network, but it's not something that we've been able to replicate with the Drupal Association at this point. And I think it's generally hard for these big global firms to get their heads around open source. It just doesn't mesh with how they do business. And a pattern that we've seen in the Drupal world is that clients will demand that their agencies provide them with Drupal services. They say, yes, sure, we'll do that. They'll do a project for them; more often than not it gets outsourced to someone else, done to varying degrees of success, but they'll quickly slip back into their comfortable pattern of partnering with big proprietary firms.
The other part of that is that, given the way they structure and run projects, their ability or willingness to contribute back to open source projects in terms of code, or to have any connection with the communities, is generally quite limited. So again, this is a hindrance. It's a solvable problem. It probably takes a lot of focus to get over that hurdle, but again, something that we might be able to grapple with. Now, in terms of where we do clearly win: rapid innovation is something that we do incredibly well in the open source world. Like in the Drupal world, ChatGPT gets released, and then a month later we've got a working module that you can start integrating into your Drupal sites. And I'm sure that was a similar case with most other open source CMSs. But as in my question after Gabor's talk, maintaining the speed and the scale of that innovation is something that becomes harder and harder as your project gets bigger and more complex, and as both the software and the community have matured. And so in the Drupal world now, we have this very carefully planned release cycle. And we need to make sure that each release is rock solid. We can't be taking risks by putting new functionality in there that may or may not work, especially with so many big customers. So there's a certain level of conservatism that we now have to adopt, because we do have these big customers. And another philosophical hurdle we have is this notion that the Drupal Association themselves should be directing actual budget towards innovation projects, when there's this notion that contribution is free. But contribution has never been free, no matter how you look at it. The cost in personal time or wages is always borne by someone, whether it's the individual contributing their time instead of doing paid work, or the agency that's sponsoring their team to contribute.
And I think the recognition we have in the Drupal world is that other open source projects have no issue with that whatsoever. The Linux Foundation, for example, has $160 million that they direct towards strategic projects each year. So we've started getting our heads around that in the Drupal world. We ran this Pitch-burgh contest last year. It was run at DrupalCon Pittsburgh, if the name needs an explanation. And this was a competition where we had $100,000 in funding to drive a few strategic things forward. And one of those was the Gutenberg project that WordPress actually contributed some funding to. And so, as soon as there's some money in the equation, the agencies that are working on those things can easily prioritize them because, hey, they're getting paid for it. Now, being able to scale that model up is the hard part. Like I said, with a $3 billion economy around Drupal, how can the Drupal Association capture some of that value? Even if we just captured 1%, that would be a $30 million innovation budget that we might be able to work with. And I think someone's doing a talk towards the end of the day about how you've tackled that in the WordPress world. Looking forward to hearing about that. Similarly, the idea that Drupal would be marketed as a product by the Drupal Association has been this big wall to get over. And there's a legacy there in terms of being structured as a nonprofit association in the USA, where legally, funds that come to the Drupal Association are for the advancement of a charitable cause. So there has been a sense that we're not allowed to market Drupal as a commercial product as a result. As I noted at the beginning of the talk, our charitable cause in the Drupal world is ensuring that Drupal remains a digital public good that supports things like the United Nations Sustainable Development Goals. So we do have a core underpinning there.
And I think the important thing for us in the Drupal world is that if we don't have that positive product awareness, then we can't actually fulfill that digital public good role in the first place. So whether we call it marketing or advocacy, we do need to drive a positive image of Drupal as a product in comparison to what's on offer from proprietary platforms. We've had a volunteer group within the Drupal community called Promote Drupal that's done a range of initiatives. But what we're starting to do is bring that inside the Drupal Association. We've just commissioned a go-to-market strategy for how to position Drupal as a product in the open market. And we'll have a range of initiatives that will roll out through 2024. One initiative that we tried recently, which was a complete experiment, was to have the very first Drupal product-branded booth at a big tech conference, at Web Summit in Lisbon. And again, this was this kind of radical thing that had never been done before. And it is a very expensive undertaking to do a booth like that at a conference like that. And so we partnered with a range of bigger Drupal services companies who helped co-fund it. And then, of course, they get the leads that come through from having a presence there. So these are early days, but we have good anecdotal results from having tried it, and we'll see whether we can replicate that long term. I think the important thing, as I said to Boris earlier, is that we're actually getting out of our bubble and putting Drupal out there as a product in the open market. And then likewise, having people tell the Drupal story at non-Drupal events is really key, and something that we've been historically quite bad at. So this photo is our former Drupal Association chair, Baddý Breidert, doing a keynote at Web Summit off the back of having a booth there. And there are just so many events around the world where we could be doing that type of thing to get the Drupal story out there.
Now, something that doesn't cost a huge amount of money is good press. For this talk, I did a Google search on best CMS for enterprise. And amazingly, on the first page, we had Drupal come up as best for enterprise. Now, was this coordinated by a clever PR person at the Drupal Association? No, there's no one who has that role at the Drupal Association, but it's something we should probably start paying a bit of attention to. Because for every good review, there's going to be a negative review about an open source security vulnerability, or someone moaning on an internet forum about how bad the user experience of Drupal 7 is, even though Drupal 7 came out 15 years ago. Big firms, big companies, they're really good at managing those narratives, and there's nothing stopping us from doing the same with the right attention. And again, I'll just bring this example up. I think you had a version of this article about the way that Adobe is running the playbook of fear, uncertainty and doubt in the Australian market at the moment, to try and lure big customers away from open source. This article is in Australia's version of the Financial Times, where they just regurgitated a press release from Adobe. Adobe had run a global survey, and, surprisingly, the MyGov site they talk about here is now 20% better to use because it was built with Adobe Experience Manager. So we in the open source world need to be able to have counter-narratives to this. And again, it's a playbook; it's a game to know how to play. Just to finish up here: the biggest strength that we have in our open source communities is the depth and expertise of our developer pools, and there's huge value in being able to market that in a way that enterprise customers understand.
I think Jam had a version of this slide in one of his talks a while ago: if you talk about Adobe, hey, they might have a thousand people working on Adobe Experience Manager, but in the Drupal world, we've got 20,000. So again, being able to develop and grow that developer network through robust outreach, training, mentoring and career pathway programs is something that we as nonprofit organizations should be at the center of. It's a big, time-intensive exercise, but it's a solvable thing. I will finish very quickly with a couple of slides. As we've talked about a little bit today, we in the open source CMS world are really at the forefront of championing and sustaining the open web. And it's not just us in the open source world who really care about the open web; it's something of huge concern to governments and large organizations around the world. So whatever we can do to collectively maintain the focus on and protect our open source technologies is incredibly important. The work that's been done on the Cyber Resilience Act is a good example of that. And similarly, let's look at ways that we can collectively promote positive open source and open web narratives in the enterprise market. That might be as simple as ensuring that we've got consistent things that we all talk about, or engaging a PR person to manage those narratives on our collective behalf. So I'll leave it there. If there are questions, I can repeat them. Thank you so much. I'll just talk loud and hope the mic catches it. One of the things that came out of Matthias' work, which has come to initial fruition with the Open Website Alliance, is that in open source we have 100,000, a million developers, we don't know, a huge number. And all of our lives are touched by it every day, and you know someone who works with it.
But you have people who come and say: oh, I tried open source once, it didn't work for me, so I'm never going to do open source. And we are often worried about WordPress or Joomla or Drupal, or very obscure issues, for people who aren't at our level of experience. So part of this idea could also be that, while we don't have the marketing budget that Apple or Adobe or somebody has, trying to figure out how to leverage that mass, that force for good, at scale, and make these experiences and these values visible at this collective level seems like a really exciting part of what we're doing here. And Drupal, having found the key into the enterprise market and into the government space very effectively, is one of the players that I think has a lot of really great examples to follow, and I really hope that we can come to each other's conferences and interact more through channels like this. Great. Anyone have any questions for Owen? No.
UKIs, TPMs, immutable initrds and full disk encryption – What Distributions Should Keep in Mind when Hopping onto the System Integrity Train
I'm happy to introduce our first speaker in the morning, who you can already see is all set up here. Well, I'm going to hand it over to Lennart to open us up and kick off the distributions devroom for the day. Take it over from here. I have this. Does it actually work? It works, right? Hi. Good morning. And thank you for waking up so early for me. Much appreciated. It was hard for me. It was probably hard for you as well. Today I'm going to talk about TPMs and UKIs and immutable initrds. I'll give a second talk later today in the boot and init track. The topics are kind of related, but there I want to talk more about the early boot stuff, and here I want to focus on what it actually all means for distributions. So: UKIs, TPMs, immutable initrds and full disk encryption. I think this is where we should be going in the Linux distribution world. But of course, I am not the Linux distribution world. So in this talk I want to explain what I think might be next steps for distributions that actually want to adopt all this. Yeah. To start out with, this is a fairly technical talk. I'm pretty sure some of you at least have some rough idea about what I'm going to talk about, but just to get everyone to a level where you have a chance to follow, let's go through some very basic vocabulary. The first thing: Secure Boot. Many of you have probably come into contact with that. It's the thing where, during boot, all the various binaries that are part of the boot process are signed cryptographically, and the firmware from early on makes sure that only properly signed stuff is run. The signing keys for that are kept by Microsoft. So it's a centralized authority kind of thing. At this point, because they'll sign relatively a lot of stuff, it's probably more of a deny list of bad stuff than an allow list of good stuff.
And yeah, there's certainly criticism to be had about the centralized nature of this. There's another thing called measured boot. Measured boot is not as, let's say, accepted and well known in the Linux world yet. Measured boot is something where, rather than Secure Boot's approach of disallowing bad components from even running, you allow everything to run. But before you run it, you make a measurement, which is basically taking a cryptographic hash of what you're going to start next, and you write that into a certain register in a TPM, in an irreversible way. So afterwards you can cryptographically verify that everything that was started so far is actually what you think it is. The good thing about this is that it's more democratic in a way, because it doesn't restrict anyone from running anything, but you can later use these measurements to protect your secrets. And that's what we're going to talk about later. So there's no centralized authority, because no restriction is made on what's booted, but it's up to you to say: only if this software is run during my boot process shall my disk encryption secrets be released. It gives you, I think, a more specific, more focused kind of security than the Secure Boot stuff gives you. TPM, of course, I already mentioned the word: it's this little chip. I mean, it used to be a little chip; it's pretty much in all the laptops, and in one form or another it's also in all the cell phones. They call it Secure Enclave and things like this, but conceptually it's always the same thing: you have this isolated security environment where you can keep your keys and that maintains access policies on keys and things like that. It's pretty common; pretty much all the laptops sold in the last 15 years already had a TPM.
On Linux, well, it is automatically used in the sense that measurements are made into the TPM, but actively used by the distributions it generally is not. It doesn't mean that you can't use it, but so far it's typically been left to hackers who have an interest in TPMs to enable it. Regular people do not run this, which is completely different from how it is on Windows and these other operating systems, where it was always used by default, basically, like BitLocker: if you don't do anything, it just locks the disk to the TPM. One specific part of the TPM is the PCR registers. I already referenced them earlier, just didn't call them PCRs. Those are the registers you can write these hash values to. They do one relatively simple cryptographic operation: they take the old value and the new value and hash them together, basically. This means that only if the exact same stuff gets measured into it during boot will the final value of the register be as you expect. You cannot reverse this, I mentioned this: once something is measured into it, it's measured into it, and the only way to get the thing back to zero is to reboot. All the registers start at zero, and you typically have 24 of them; half of them are basically used by the firmware, the other half is for the operating system. Once you have these PCR values you can bind security to them, like the locking of disk secrets, and thus you can say that my disk secrets shall only be released if the operating system is in a good state. How that actually works, let's go into that detail a little bit later. The next term is UKI. By the way, I'm talking a lot and I have lots of slides, but I'd much prefer if we have a discussion here rather than me just talking. If any one of you has questions, please interrupt me. Let's talk about things right away and not move the questions to the end, because I'm pretty sure half of you will have forgotten your questions by then.
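The extend operation described above can be sketched in a few lines of Python. This is a toy model of the chaining idea only; real TPM event logs layer the hashing slightly differently depending on the event type:

```python
import hashlib

def pcr_extend(pcr: bytes, data: bytes) -> bytes:
    # TPM-style extend: new value = SHA-256(old value || digest of data).
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()

# All PCRs start at zeroes after a reboot (32 zero bytes for a SHA-256 bank).
pcr = bytes(32)
for component in [b"bootloader", b"kernel", b"initrd"]:
    pcr = pcr_extend(pcr, component)

# The final value depends on every measurement and their order; there is
# no operation that takes a measurement back out. Only a reboot resets it.
```

Because the chain is order-dependent, swapping any two components, or slipping an extra one in, yields a completely different final value, which is what later lets you tie secrets to exactly this boot path.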
Anyway, so feel invited to interrupt me. So, UKIs; this is actually what the other talks are going to be about: it's a unified kernel image. It's not a radically new approach, but it's certainly different from how most of the distributions used to manage their kernel images. UKIs are basically: you take a kernel image, you take an initrd, you take a couple of other things like the kernel command line, boot splash, device tree or something like this, you glue it all together, turn it into a UEFI binary, a PE binary, and you sign it as a whole, and during boot it gets measured as a whole. UKIs are awesome because they make things very, very predictable, because once you deploy a UKI it's one file; you drop it in the ESP, the EFI system partition, which is where the firmware starts from. You can update it as one file, which is awesome because it's extremely robust: you do not risk having half-updated your kernels or something like this, because you always either have the new file or the old file. That's fantastic. So it's great from a robustness point of view, and it's also great for other reasons: for example, it's always the same, so you can test it better, and you have a greater chance of knowing that if you deploy it on lots of different machines it will probably work equally well everywhere, or equally badly, but hopefully equally well. Anyway, so much about the vocabulary, just so that we all know at least the basics of what's coming next. Now that we're through the vocabulary, I want to explain a little bit the goals of what I'm actually doing here. The general goal is to tighten security and provide code integrity on Linux, right? This is mostly about traditional Linux, and traditional Linux means distribution-based Linux, right?
Like, I do not mean Android or Chrome OS by this; I mean distributions like Fedora, Debian and these kinds of things, which have, I would say, a certain democratic approach, where everybody can participate. It's not this over-the-wall open source but actual open source. So I want to make sure that these traditional Linux distributions catch up to the level of security that the other ones actually provide you with: that Windows has provided you for a long time, that macOS provides you, and that Android and Chrome OS provide these days; they all have these code integrity protections. The general goal, if you want to talk about threat models, is usually evil maid stuff: you leave your laptop in your hotel room and you want to be sure that when you come back it's still your laptop with your software on it and it's not backdoored, because right now it's very easy to backdoor. So the focus is generic distributions, Fedora, Debian and so on, and the goal is to make things just work, right? I want to move this stuff out of this area where it's a specialist thing that TPM-loving hackers enable; I'd rather have this be stuff that just works and defaults to being enabled in distributions, rather than something you actually have to opt in and do work to get to. That is of course a big ask, but I think it's necessary, because nowadays everybody knows the value of IT security, and it's really sad that Linux has very little in this area by default: it's laughably easy to backdoor a laptop right now, even if it uses full disk encryption, because initrds and things like that are not protected at all. I already mentioned the word democratic a couple of times. My own focus is much more on measured boot than on Secure Boot, right? Secure Boot is established; all the big distributions sign their kernels with the Microsoft key and things like that.
I actually work for Microsoft, as you might know, but still, I don't want to sign my kernels with the Microsoft key. So I think measured boot is actually a much more interesting technology, because it allows you to define your local policies yourself. You can sign your kernels yourself, you can define the policies for your secrets yourself, and you can just say: I don't want to allow my machine to run Chrome OS or Windows or whatever else, I just want it to run my choice of kernels, my choice of initrd, my choice of Linux operating system. So I think it fits nicely into how Linux distributions are traditionally organized, because they are in a way democratic too. To be more technical, what are the specific goals? I want measured boot to be done by default, and that means not only up to the kernel, which happens anyway by default and has been happening for the last 10 years or something, but continuing into the rest of the boot process, and actually during the runtime of the operating system later as well. I also want Secure Boot to cover the whole boot process. Right now we are in this really weird situation where it only covers the basic kernel and not the initrd, and I find that kind of laughable. But again, measured boot is my main focus. We should do the two together if we can, and then you get the best results, but they are two different protections, and I find the protection that measured boot provides much more interesting than the one Secure Boot provides. I want all the measurements that we make during the boot process, all the stuff that gets hashed, to be predictable. Predictable basically means that even before you boot, if you know the components involved, you know what these PCR values are going to be.
This actually matters because you bind the security of your full disk encryption keys to these PCR values, and if you cannot predict them, you cannot do that. Only if you know that running the Fedora kernel from this version on is going to result in these hashes being measured into the PCR values can you say: only unlock my keys if the PCRs have this value. So predictability means a lot. Why do I even mention this? For example, in GRUB and things like that, measurements are not so predictable, because they don't measure the actual code so much as the selected path through the code, which basically means there are lots of variables in what actually ends up in the PCR values; depending on, I don't know, whether you move up in your menu, it might end up in different measurements. One of the goals is specifically also disk encryption easy by default, and this particularly also means servers. Right now I'm pretty sure most of the people in the room probably use disk encryption on their laptop, and my assumption is also that most of you use it interactively, with keyboard unlock: you boot up the machine, you type in your password. That's great, but we can do so much better, and it's not something you could ever do on servers; on servers there's usually nobody in the server room to unlock this stuff. So what TPMs give you is the ability to do disk encryption non-interactively, because the TPM keeps the secret for you. You do the PCR dance and you basically tell it to unlock the secrets only when the operating system, your version of RHEL or your version of Mariner or whatever else, is booted up, and nothing else.
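The prediction idea can be sketched: if you know which digests will be measured, and in what order, you can replay the extend chain offline before ever booting. This is a toy model; the digest value is a hypothetical placeholder, and real tooling derives the digests from the actual files:

```python
import hashlib

def predict_pcr(event_digests):
    # Replay the TPM extend chain offline over the known event digests.
    pcr = bytes(32)
    for digest in event_digests:
        pcr = hashlib.sha256(pcr + digest).digest()
    return pcr

# Hypothetical digest for a distro-built kernel image.
events = [hashlib.sha256(b"fedora-kernel-6.7").digest()]
expected = predict_pcr(events)

# A disk encryption policy can then require the live PCR to equal
# `expected`: the key is only released if exactly this kernel booted.
tampered = predict_pcr([hashlib.sha256(b"backdoored-kernel").digest()])
assert tampered != expected
```

This is exactly why GRUB-style "measure the path taken through the code" breaks the scheme: the inputs to the chain are no longer known ahead of time, so `expected` cannot be computed offline.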
And this also means, I think, that we should actually get to a mode where distributions enable disk encryption by default, right, even if people didn't ask for it, and without necessarily even asking for a passphrase or something like this during install time, simply because by default it should be locked to the TPM; and then if people want to enroll a manual key or a FIDO key or whatever else, that would be on top of the TPM and not what you start out with. I mean, this is the goal, eventually; we're not there yet, we don't even have the infrastructure for this, but this is basically what Chromebooks and all these things generally do, and I think we should catch up and try to make this something that also works on Linux that way. Another goal is that the boot process is testable, I already mentioned it: if you have everything strictly predictable and uniform, then on all the installations you have the same set of software, maybe in slightly different versions, because one person already updated their machine to today's version and another one didn't, but still, it should be a small set of different versions. By the way, again, questions? Yeah. So, when you were talking about measured boot being local, do you mean local in terms of the hardware vendor, or local based on the distribution, or local based on the owner of the machine? Because at the moment with Secure Boot you have buy-in from lots of parties: Microsoft for signing, firmware vendors, and then the distribution that follows the whole process. Ultimately, I mean you, right? On your laptop, you should be in power. But the thing is, of course, that's a big ask: if I install a Linux laptop for my mother, she's not going to be capable of that.
So ultimately, and we'll come to this later hopefully, given the time, my assumption is that by default you get kernels and the OS provided to you, signed and protected for you by the distro vendor, but I certainly want to enable you to say, basically: forget this, I'm going to enroll my own stuff. And we want to make this easy, so that it's robust and you can actually do it, right? So that you can be even more restrictive: you can say not just that it's okay for Fedora to get access to my disk encryption, but even something like only Fedora, in the version that I picked, on the architecture that I picked, and so on. You can make it much more focused, because you know your machine better; you know, for example, that you don't boot from iSCSI and things like that, which Fedora doesn't know, right? So the goal is definitely to democratize it, to put people in control if they want to be, while knowing that this is not something everyone will do. So basically you're saying "you", but instead of you it's your TPM, so you don't even have to know about it? Sorry? Your TPM, so you don't even have to know about it, so it's easy for everybody to use. Yeah, right. That's okay. Okay, let's talk a little bit about the status quo, how it is right now. Most of the big distributions currently provide what I call minimal Secure Boot, because it only really covers the boot loader and the kernel, and it doesn't cover the initrd, which I find really embarrassing, given that in 2024 you can just go to the ESP or boot partition and modify the initrd any way you want, and it will just boot, and nobody takes notice. But the initrd is just a file system, right? Why isn't it handled the same way as the kernel for measuring?
I mean, the kernel could authenticate it if it wanted to. It could; it just doesn't, that's the thing that I'm saying. There is no authentication of the initrd right now, not in the generic distributions at least. And that's rooted in the fact that initrds in the traditional Linux world are always generated locally on the system, so they ultimately are different on every single system. They import not only code but also local configuration, and that basically means you cannot sign them on vendor systems, right? If you are a customer of, I don't know, SUSE, and they give you a kernel and initrd, then they cannot sign the initrd for you, because that initrd only exists on your one specific system. So I think it's a really bad situation, right? Because it basically means that any evil maid can go into my hotel room, take the hard disk out of the laptop, go to the initrd, change any file they want, in particular the password prompt for my LUKS stuff, and send everything to some central server if they want, and I will not be able to notice this. That is a situation that I think is really stupid all around. So, I've already mentioned this: the initrds are locally built, they are not protected by Secure Boot, and there are very few measurements actually being done. The kernel now does a couple of them on its own, of the initrd basically, but in general the ones that are made by GRUB, I already mentioned this, are not predictable, and it all stops the moment the kernel actually does anything, right? The measurements the kernel does, it still does in UEFI mode, and then user space traditionally doesn't do anything anymore. I think that's bad, because what I think makes a ton of sense for root disk encryption is that the key for the root disk encryption is only released by the TPM to the system in the initrd phase, but never later, right?
That is a really nice property: as you boot, you drop any chance to recover the disk encryption key. I mean, the kernel will always have it somewhere in memory, because it actually needs to do the encryption, but via the PCR mechanism you can relatively easily ensure, and we now have the infrastructure in place to do this, that later on you can talk to the TPM as much as you want, but you will not be able to recover the disk encryption key from it anymore, because we basically blew a fuse. But anyway, this requires that we make measurements during the boot process and during runtime, so that policies like this are actually expressible. On the status quo of the TPM-based stuff again: there are two stacks, even, of TPM software on Linux, but except in hacker circles nobody uses them, I would say. You can script things together, there are many how-tos on the internet, but nobody does it; 15 people in the world do it. And I already mentioned this as well: the LUKS password prompt is implemented in the initrd, and the initrd is not protected in any way, so that's trivial to backdoor. It's a terrible thing. I would call this, in summary, pretty weak security, and you could use words like laughable in comparison to other operating systems. So what's the vision? Primarily, we want kernels to be shipped as UKIs by distributions, so that everything is Secure Boot protected, including the initrd, and measured as one unit, and fully predictable. This means that the kernels and initrds need to be pre-built, right, not on the local system. The kernels traditionally weren't built locally anyway, except if you run Gentoo, but the move would be to pre-build the initrds centrally as well. If you do all this, then you get stable hashes in the PCRs, you can bind the disk encryption to them, and you get universal predictability, because the software doesn't deviate between systems; it's always the same software.
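The "blow a fuse" idea above can be made concrete with a toy seal/unseal model. All names and values here are illustrative assumptions; a real TPM enforces the policy internally rather than via a Python comparison:

```python
import hashlib

def extend(pcr: bytes, data: bytes) -> bytes:
    # Same extend primitive as before: irreversible hash chaining.
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()

def unseal(secret: bytes, sealed_to: bytes, live_pcr: bytes):
    # Toy TPM: release the secret only while the live PCR matches the policy.
    return secret if live_pcr == sealed_to else None

pcr = extend(bytes(32), b"signed UKI")   # state measured during early boot
policy = pcr                             # disk key sealed against this state

# In the initrd phase the key is still retrievable:
key = unseal(b"disk-key", policy, pcr)

# Before handing over to the real root file system, measure a barrier event:
pcr = extend(pcr, b"leaving initrd")
# "Fuse blown": for the rest of this boot the TPM refuses to unseal.
late = unseal(b"disk-key", policy, pcr)
```

Since the extend is one-way and only a reboot resets the PCR, no amount of talking to the TPM after the barrier measurement gets the key back, which is exactly the property the talk describes.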
You have robust updates, as I mentioned already, because the kernels can be updated as one file, and you can test the combinations very well. A secondary goal I have: what I just described is again a central authority in some way, because it's the distributions that do this. I think it's also important to keep people who actually want to sign their own stuff in the picture as well; that was basically your question earlier. If you want to generate a key pair and sign your own stuff, we should help you with this. In this model you will probably still use a pre-built kernel from your distribution; you might, however, combine it with a locally built initrd and then sign it with your key instead of the distribution key. The benefit of course is maximum flexibility, but you also need to know your shit. The advantage is that the PCRs remain predictable, but they only remain predictable within your local scope, because only you know what you're actually going to build into the initrds and how you're going to combine things. It's a larger installation footprint, because you suddenly need the build tools installed to do this. It might not be worse than the current situation with dracut and things like that, but in some ways it is, because you now need signing tools and things like this. But certainly both of these models are in focus of what we should do, I think. So the ultimate vision there is that distributions, in their installers, figure out: is there a local TPM? Not all systems have TPMs; in particular, ARM-based systems often have other stuff but not TPMs, and VMs sometimes have them and sometimes do not, so we always have to work with the fact that a TPM might or might not be there. But the goal is certainly that if one is there, we should lock to it by default.
Locking to it by default doesn't mean non-interactive unlocking exclusively; it means we can do non-interactive unlocking, but you can also still combine it with a PIN. A PIN is the exact same thing as a passphrase, except that TPM people call it a PIN; it doesn't imply a number or anything. So the goal is to always encrypt the data when it's at rest, and to validate the boot process when we unlock things, so that we make sure the right software runs at the right time, and other conditions hold. And the goal is to install things by default that way. And then I want measurements to be made for all facets of the system: not just for the boot code, but also for the OS itself, for the applications, for the configuration itself, right? Those, for example, are measurements that are inherently local, because configuration is always kind of a local thing; even if my mother used her machine, she would probably configure a different background than somebody else. Background color is a shitty example, because you probably don't need to measure that, but you get the idea. System identity, by which I mean things like the hostname and machine ID, should probably also be measured, so that you can use it in policies and say: I want this secret to be released only on that machine and on no others.
I want these basic building blocks, the PCRs but also the policies generated out of them, to be automatically managed by the OS, because this is not entirely trivial: every time you update the OS, any component, a boot loader, a UKI or things like that, you have to re-predict what the PCRs are going to be on the next boot and then do something about that, because you still want the disk encryption to be released when the system boots up next, but not under other conditions. So there's some extra work: when you update something, you need to predict the PCRs and do something with the prediction. We'll talk about this hopefully later; let's see how much time we have. The result of all this, of course: comprehensive code integrity, the initrd gap is closed, and we are ready for remote attestation. That's also a goal, that remote attestation works. I mean, it's good for some cases, if you actually run more than one system; I'm pretty sure it's not so interesting for regular people themselves, but we should at least be ready for it. And the goal is that we have the building blocks ready so that people can use the TPM in any way they want, and we already give them building blocks for defining their own policies on their own encrypted objects based on the state of the operating system, because right now they're kind of lost in this. And the result is that it's somewhat democratic, because people can just do this on their own laptop and get a high level of code integrity without necessarily getting their keys signed by Microsoft. So, before we make all this happen: any questions at this point? Does the vision cover kexec? That's a very specific and good question. So kexec is a big problem, and in the project I work on at Microsoft it's also a big problem.
I have ideas for how to deal with it, but frankly, we have so many problems to fix before we can fix that one too that I don't think it's going to be fixed anytime soon. But I have a pretty good idea of what we probably should do with kexec. For those who don't know, kexec is a thing where you basically boot one operating system, and then while the operating system is running, you decide you want to run another operating system, usually a new version of it, so you execute the new kernel. Now, suddenly, you didn't reboot, so the TPM didn't get reset, so all the PCR registers still have all the measurements from the first operating system; then the second operating system starts, and its measurements just get added on top, and then all your policies fall flat, because they were predicted assuming that you started from zero. So this creates a problem, but I think we can deal with it by having a handover of secrets, predicted at the moment you're about to kexec. But let's not talk about that; it's highly specific, and we have way too much material before we start talking about kexec. Any other questions at this point? Next question. So, from my understanding, if you had your computer that you've predicted all the values on: if I were to take that drive and put it into another machine, say in an enterprise where I bought 100 of these laptops, is there some kind of unique seed per machine, or would it go: oh, this is functionally the same machine, it has the same device tree, it has the same hardware, I'm going to unlock? The TPM generally contains an encryption key that's specific to the TPM, so no, you cannot unlock the encryption key that you prepared for machine A on machine B; unless the TPMs have the same seed keys, but then everything's out of control anyway, and you don't have a TPM, you have bullshit on your hands. Okay, let's continue.
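The per-TPM seed answer can be sketched with a toy wrapping scheme. This is a simplified assumption-laden model (HMAC key derivation and XOR in place of the TPM's real hierarchy and encryption), just to show why identical PCR values on a different TPM don't help:

```python
import hashlib, hmac, os

def wrap(seed: bytes, pcr_policy: bytes, secret: bytes) -> bytes:
    # Toy model: derive the wrapping key from the TPM's device-unique seed
    # plus the PCR policy, then XOR-encrypt the secret with it.
    key = hmac.new(seed, pcr_policy, hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(secret, key))

def unwrap(seed: bytes, live_pcr: bytes, blob: bytes) -> bytes:
    key = hmac.new(seed, live_pcr, hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(blob, key))

tpm_a, tpm_b = os.urandom(32), os.urandom(32)   # two laptops, two seeds
pcrs = hashlib.sha256(b"identical fleet image").digest()
blob = wrap(tpm_a, pcrs, b"disk-key")

# Same disk image, same PCR values, but a different TPM seed: unwrap
# produces garbage, so the 100 identical enterprise laptops don't unlock
# each other's drives.
```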
So, to make this all reality — I'm the systemd guy, so yeah, what I'm talking about is all systemd stuff — we added different components, and these components have shown up in the various distributions. Interestingly, I find that different distributions adopted different parts of this big tool set first, so I think at this point there are very few distributions that adopted them all, but there's at least one distribution that adopted each one of them individually. So I want to start with systemd-boot. We call it a boot loader; it's actually not a boot loader, it's a boot menu. It's just a UEFI program that allows you to select from a set of kernels and then chainloads those kernels. It doesn't do anything fancy, it doesn't have any understanding of how to load a kernel into memory and prepare it, it doesn't do cryptography or anything like this, it's just a dumb menu that chainloads other stuff. But it has nice properties, because it takes inspiration from how Linux does drop-in directories: with RPM and dpkg there's this established pattern that you can extend other RPMs and debs via drop-in files and directories. So we took this idea and said, okay, new boot menu items are simply files that you drop into directories, and as you install a new kernel you just drop a file into a directory and that makes one new menu item show up. This is inherently different from how GRUB works, because in GRUB you always have these boot scripts that need to be generated based on whatever you find, and things like this. This is much, much simpler, because there's just one file per kernel; you find it, and that's a boot menu item, there you go. So that's one thing. There's also systemd-stub. Systemd-stub is a UEFI boot stub: basically a little UEFI program that you glue in front of a Linux kernel. It runs in UEFI mode, does a couple of preparatory steps, and then jumps into the actual kernel proper.
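The drop-in idea behind systemd-boot described above can be sketched in a few lines: one entry file per kernel, scanned at boot time, no generated scripts. This is a rough illustration in the spirit of the Boot Loader Specification's type #1 entries, not sd-boot's actual C parser; the file names and field values here are just examples:

```python
import tempfile, pathlib

def load_boot_entries(entries_dir: pathlib.Path) -> list[dict]:
    """One .conf file per kernel: each file is one menu item, nothing is generated."""
    menu = []
    for conf in sorted(entries_dir.glob("*.conf"), reverse=True):
        entry = {"id": conf.stem}
        for line in conf.read_text().splitlines():
            if line.strip() and not line.startswith("#"):
                key, _, value = line.partition(" ")
                entry[key] = value.strip()
        menu.append(entry)
    return menu

# Installing a kernel is just dropping a file in; removing the file removes the item.
with tempfile.TemporaryDirectory() as d:
    entries = pathlib.Path(d)
    (entries / "6.7.conf").write_text("title Fedora 6.7\nlinux /EFI/Linux/vmlinuz-6.7\n")
    (entries / "6.6.conf").write_text("title Fedora 6.6\nlinux /EFI/Linux/vmlinuz-6.6\n")
    print([e["title"] for e in load_boot_entries(entries)])  # newest sorts first
```

The contrast with generated boot scripts is that the menu is a pure function of what files exist, so adding or removing an entry is a single atomic file operation.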
These preparatory steps we'll discuss a little bit later, but it's measurements and finding certain sidecars, if you want them. So usually, in my perfect model where you use all these components, the boot process is basically that the firmware invokes systemd-boot, and then in systemd-boot you pick one kernel, or automatically the newest kernel is picked, and that then gives control to the stub inside of the kernel image. That thing does a couple of things and gives control to the kernel inside of it, which already has the initrd loaded, and then you jump into the initrd. So much about the boot path. Ukify — or I don't know how we pronounce that, we haven't really agreed on the pronunciation yet — is basically a tool that allows you to build UKIs. It takes a couple of different components, glues them together, can sign them for Secure Boot, can do PCR predictions, and spits out one EFI binary which you can then drop into your ESP. There's a tool called systemd-measure; probably by this time you don't have to interact with it anymore, because ukify does it for you. All it does is that PCR prediction step for all the stuff that is contained in a UKI. You can run it and it basically tells you: if you boot that UKI, PCR 11 is going to be this value, and then you can use that for policy. But usually you don't have to interact with it anymore, because ukify is probably the tool you should be using, and that calls it in the background, so you don't have to bother. There's a thing called kernel-install in systemd. It used to be a shell script, but nowadays it's actually a proper program. Fedora has been using it for a while; other distributions are catching up, I guess. The idea is basically that if the package manager drops its files in /usr — and /usr is package manager territory — then kernel-install will take these and copy the kernel itself into the ESP to make the system bootable.
So basically the OS vendor resources are managed by the package manager in /usr, and if you copy things into anything else, like the ESP, which is a shared location — it's not owned by the OS vendor, it's owned by the system, if you will, and OSes just get the privilege to all drop something in there — then kernel-install is the tool that does this. The reason why you need something better than cp is usually that you want to do a couple of things when you do this: create boot loader entries, I don't know, run depmod, do a couple of other things. We even have support to generate the UKI at that step, so that you install a traditional kernel on the system, but locally it gets converted to a UKI as you go, and signed, and things like that. So you can basically keep the old workflow in place, how distributions generated initrds and things like that, but you end up in the new world with a UKI that is signed by your local key automatically, without you even thinking about it. Other components: there is mkosi-initrd — ah, a question. On the previous slide you mentioned systemd-boot; the stub measures the UKI. What measures systemd-stub, because you have... The firmware. So the stuff that I'm talking about, systemd-boot, systemd-stub, they are ultimately UEFI binaries, and the firmware measures everything, right? So there's a full chain; the firmware does that part. And actually, you know, because systemd-stub is just glued in front of the kernel to make the UKI, which is a PE binary, the stuff that is in the UKI is already measured anyway by the firmware. The reason why we also measure the stuff ourselves a second time, which sounds redundant, is simply that we have multiple PCRs and we want some separation of the stuff that...
So there's one PCR, basically — nine, I think — where all the firmware stuff gets measured into, and our stuff too. So there's going to be stuff that is specific to the local machine, as well as the stuff that we as the OS vendor — or whatever you want to call distributions — control, measured into the same PCR, and that basically makes the whole thing unpredictable. So we measure, a second time, just the stuff from the OS vendor into another PCR, and that's what we bind the policy to. So that's why you have the double measurement; it's two different PCRs. So, mkosi-initrd — there's going to be another talk about this — is basically a tool for building predictable, reproducible initrds from generic Linux distributions and then making them ready for use in UKIs. Systemd-cryptsetup is basically just a wrapper around libcryptsetup, and it does a couple of these integrations — TPMs, FIDO and these kinds of things — and policy management and things like this. Systemd-cryptenroll is the other side, which allows you to enroll the TPM and enroll the FIDO thing locally. Systemd-creds is something — if we have the time, we'll talk about it a little more later. Basically, you know, if you have this vendor-built UKI, you might still want to be able to parameterize it, right? There's a reason why initrd generators the traditional way mix code from the OS plus configuration into one CPIO initrd image: it's because people want to parameterize things. But parameterization is problematic, right? Because it means things are not predictable anymore. Also, you need to authenticate it again, right? That's what we want to come to. So the concept we came up with to fix this is called systemd credentials. Systemd credentials are ultimately a way to pass secrets into systemd services. They originally had nothing to do with the boot process.
It's supposed to be like, you know — all these cloud people, they love passing secrets in environment variables. I think that's a terrible idea, because those get inherited down the process tree. So this is supposed to be something better in that regard. One of the nice things that systemd credentials actually have is that they can be encrypted, right? You can encrypt them and bind them to a TPM and local policy and things like that. This is extremely useful, because it basically means that you can put these credentials on untrusted territory, meaning the UEFI ESP, which has no authentication itself. It's an unprotected VFAT file system where basically the rule is: the stuff that you read from the ESP, you need to authenticate before you use it. So you can just drop these credentials in there and be reasonably safe that their contents cannot be read. What's the use case for things like that? For example, if you have a UKI with an initrd and you actually want to open it up so that you can log into the initrd with a root password to debug things, you can stick that in a systemd credential and put it in the ESP next to the UKI — how that actually works, we'll hopefully still find the time later to look into — and be sure that this thing, because it's bound to the local TPM, is not accessible; the root password is not accessible to anything but that specific system. So systemd-creds is kind of an approach for local parameterization. It's an option, though. I would assume that in most of the consumer kinds of setups you would never use this, but it needs to be there because some people want something like this. There's no restriction on what you actually encode with this. It could also be, I don't know, iSCSI server data, or HTTPS things like X.509 certificates, or something like this. Another thing is systemd-sysext, right?
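The credential story above can be illustrated with a deliberately simplified toy. This is not the real systemd-creds format (which seals an AES-256-GCM key with the TPM); it only shows the shape of the idea: the decryption key is derived from a device-local secret plus the expected policy, so a blob dropped on the unauthenticated ESP is useless on any other machine or under any other policy. All names and the XOR-stream cipher here are illustrative:

```python
import hashlib, hmac, itertools

def _keystream(key: bytes):
    # toy stream cipher: SHA-256 in counter mode (illustration only, not for real use)
    for counter in itertools.count():
        yield from hashlib.sha256(key + counter.to_bytes(8, "big")).digest()

def seal(secret: bytes, tpm_seed: bytes, pcr_policy: bytes) -> bytes:
    # the key only exists if you know the device-local seed AND the expected PCR state
    key = hmac.new(tpm_seed, pcr_policy, hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(secret, _keystream(key)))

unseal = seal  # an XOR stream cipher is its own inverse

policy = hashlib.sha256(b"expected PCR 11 value").digest()
blob = seal(b"root password", b"seed-of-machine-A", policy)
assert unseal(blob, b"seed-of-machine-A", policy) == b"root password"
assert unseal(blob, b"seed-of-machine-B", policy) != b"root password"  # wrong TPM
```

The same property holds if the policy digest differs, which is why the credential on the ESP stays local both to the machine and to the measured OS state.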
If you have predictable initrds, this of course means that they will come by default with a very clearly defined set of kernel modules built in. This is restrictive, right? Because people nowadays have NVIDIA drivers, which are hundreds of megabytes. If you want the system to work well with all the current consumer hardware, you will have a massive initrd. That might be something people want to avoid. So on one hand we kind of push everybody to say, put everything in one file and the world will be a better place, but on the other hand we also know that this is probably not doable for all environments, because these files will get massively huge. They will work perfectly if you know your system. For example, if you just focus on Azure cloud stuff, then you know exactly the drivers you need. You can build a tiny UKI, it's all good. It's going to be entirely generic for Azure; you could probably even cover multiple clouds in one UKI and it's still going to be small, all great. But once you get into the wide world, where all kinds of shit exists, it might be too limiting. So we thought about this. When we came up with sysext we actually had a different use case — originally it was mostly focused on the host system — but we can use it nicely for modularizing the initrd to some extent. So the idea: a sysext is a system extension. It's basically a disk image, a GPT disk image, that contains a traditional Linux file system, usually something like squashfs or EROFS, plus a signature for the Verity partition. Verity, for those who don't know — dm-verity — is a kernel concept for adding integrity protection to immutable file systems. The first user of this was Chromebooks, back in the day; I mean, it's old now. It basically says that on every sector access of the file system, you make sure that it's actually authentic.
It's a fantastic technology, and we can use it to have these disk images that are, when you enable them, overlaid on top of /usr. So suddenly you get a certain level of modularity, where the base initrd has /usr populated with lots of stuff, but you can add a couple of other things into it by adding a couple of sysexts to the system, which are just overlaid. Overlaying is basically overlayfs. It's really nice because it's atomic, it's cheap to do, and ultimately there's nothing new about it, it's just regular GPT disk images. So that was sysext; confext is actually the same idea, but about overlaying things on top of /etc instead of /usr, also with all the integrity and cryptography things like this. It's really nice because, in contrast to the credentials, which focus on individual bits of secrets, confexts focus on combinations of stuff. You can drop 55 configuration files into one of these confexts, and these confexts either are applied or they're not applied. They're never half applied, and hence you cannot use them out of context: either all of these files appear in /etc, or none of them. Honestly, confext, in my point of view, is actually the perfect configuration management tool, and everybody should just use that and stop using all the weird Ansible and Chef things, because those do not have these nice security or atomicity properties, and the security and atomicity properties are just awesome. Another component: systemd-pcrlock. So I talked a lot about the predictability of the PCRs. The way you actually lock disk secrets to PCRs is basically you say: this PCR has to have that value, that PCR has to have that value, and if that's all the case, you tell the TPM, you will release your encrypted secret to the OS so that full disk encryption can work. But now you need some infrastructure to do the prediction for this.
Systemd-pcrlock is that infrastructure that we added to do this prediction. Basically, it manages a set of components that you assume are part of the boot. Then it does some magic, figures out if that actually matches reality so far, and then calculates a TPM policy, as it's called, from that. If you use that policy to lock down secrets, it basically says: you have to have this firmware component in this version, you have to have this boot loader in this version, and this UKI in this version, and a couple of other components that might be part of the boot. And it allows alternatives, because usually if you update a kernel, you do not want to say the new kernel is now the only one you can boot; you want to still allow the old kernel, the preceding kernel, to boot. You want this concept of alternative options for every step. For firmware updates it's the same thing: if you prepare a firmware update and it fails, you have to boot up with the old firmware in place. If your policy says no way, then you have a problem. So you always need this kind of alternatives system. That's what systemd-pcrlock does. All the other operating systems — Windows, Chromebooks — have prediction engines like this. We have the luxury that we come 15 years later than anybody else, so we can actually rely on newer types of TPM functionality, because we can start from zero now, instead of having to be compatible with the original TPM2 stuff that is really old by now. So we can actually do nicer things. We can actually store these policies in the TPM itself. The traditional way BitLocker on Windows does it, for example, is that they store these policies in the BitLocker superblock on disk.
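The alternatives concept described above can be sketched as a policy that, for each boot step, accepts a set of component digests rather than exactly one, so the preceding kernel or firmware remains bootable. This is a minimal sketch of the idea, not systemd-pcrlock's actual policy format; the component names are made up:

```python
import hashlib
from itertools import product

def digest(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

# hypothetical policy: per boot step, the set of acceptable component digests
policy = [
    {digest(b"firmware-v1"), digest(b"firmware-v2")},  # old firmware stays allowed
    {digest(b"sd-boot-252")},
    {digest(b"uki-6.6"), digest(b"uki-6.7")},          # rollback kernel stays allowed
]

def boot_allowed(components: list[bytes]) -> bool:
    return len(components) == len(policy) and all(
        digest(c) in allowed for c, allowed in zip(components, policy))

assert boot_allowed([b"firmware-v2", b"sd-boot-252", b"uki-6.6"])    # rollback: ok
assert not boot_allowed([b"firmware-v1", b"grub-2.12", b"uki-6.7"])  # unknown loader
print(len(list(product(*policy))))  # number of concrete boot paths the policy admits
```

Each step's set of alternatives multiplies out, which is why a policy engine rather than a single fixed PCR value is needed once updates and rollbacks enter the picture.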
Storing this stuff in the TPM is much nicer, because it basically means that you can have 500 different disks, and when you do your PCR predictions you do not have to touch them. You do not have to go through every single disk and rewrite the superblock; it's entirely sufficient to store a slightly different value in the TPM. That's a fundamental improvement over what Windows can do, because we have the luxury that we are so late to the party. Any questions at this point? I've only got about 10 minutes left, so if you have questions, this is the time to start asking. If you do not have questions, I'll continue with parameterization and modularization, which we actually kind of covered already. No one has questions. So, yeah, I mentioned already that pre-built UKIs and initrds are problematic, because they are generic and that makes them large, and you can't parameterize them anymore. So there's optional parameterization of the UKIs that breaks up the fact that they are one big thing. One way I already mentioned is systemd credentials — systemd-creds — encrypted, individual bits of information. Then there are confexts, which is the overlay thing on /etc, which is combinations of configuration. And the third one, which I have not talked about yet — there was a talk yesterday in the VM mini-conf about this — is kernel command line add-ons, right? Because one of the fundamental ways you configure your Linux system is by making additions to the kernel command line. Now, in all the stuff that I was talking about, the idea is: yeah, you don't get to do that, right? Because it's the most powerful thing in the world — you can set init= and do whatever you want, right? So we lock that down. If you're in Secure Boot mode and you use this kind of stuff, yeah, you don't get to edit that, because the security policy doesn't allow it. That, of course, doesn't necessarily fly with everybody.
People hate that, right? People want to be able to do this, but they want to have controls on it. One of the things that we came up with — this guy over there came up with — is kernel command line add-ons. Add-ons is what we call it: basically, you build a UKI, but actually leave the kernel out and the initrd out and everything else out; you just put the kernel command line in there. So you have a UEFI PE binary that looks exactly like a UEFI PE binary, but you can't actually boot it, because it doesn't actually contain any code. What it contains is a kernel command line. Why would you do such a thing? Because you can authenticate them and measure them like any other kind of binary that UEFI deals with. Or actually, not you do this, but the firmware will do it for you, because you can just tell the firmware: oh, I'm going to work with this binary now, please load and authenticate it, and then it will do that for you — the whole measuring dance, all in the background, you don't have to care. Because, after all, sd-boot and things like that are just a stupid boot menu with no understanding of loading and authenticating anything. And that's how it should be, right? We want our boot path to be stupid and not replicate, like with shim and these kinds of things, all the authentication over and over again. So add-ons are basically a way you can sign a little kernel command line, and then extend the one that is built into the UKI, and modulate away, so that you can have one UKI and a couple of these add-ons that extend it, with proper authentication. Modularization — I mentioned this already with systemd-sysext. Because of NVIDIA drivers in particular, because they're massive, and firmware, we have to do something. So, I mentioned these things: add-ons, system extensions, credentials, and config extensions — confexts — that stuff. We call them sidecars, right?
Because you have the unified kernel, but then it's not so unified, because you have these things next to it as well. How do you manage those? The general idea is to extend the drop-in concept, so that you have the UKI and you put next to it a directory where you put all these add-ons. So how does it actually look? In the ESP, you put the UKI in the directory EFI/Linux, and next to it you have a subdirectory named exactly like the UKI with the suffix .extra.d, and there you put .cred files for the credentials, or confext.raw — that's the suffix we picked for confext disk images — or sysext.raw, and addon.efi — those are the PE add-ons. So it's relatively simple. You lose some of the extreme sexiness of the approach, because updates and things like that are not a single file anymore, but that's on you, I guess, if you actually make use of this functionality. This is all optional. I think if you know your hardware, if you know the environment you want to run your stuff in, don't bother, right? Just focus on the UKI as one kernel, and everything is simple and robust and idiot-proof and things like that. We've got about five minutes left, let's focus more on questions. Alright, so, taking a scenario like you just said, where you don't know what the hardware is, what's your vision of how all these sidecars get selected and put in there? I imagine it's not something where you want an RPM distro to just be dumping stuff in there, but what should we do? That's a really good question, and there are actually two items about this on the systemd TODO list. I'm not sure how many people have seen it, but, you know, in udev — in systemd, in udev — we already have this concept of how to automatically determine which kernel drivers to load on which machine, right? It's called modalias.
It's basically, for PCI devices and USB devices, the vendor and product IDs turned into a string, and then having a mapping database that maps that to the actual kernel module to load. And nowadays there are all kinds of modaliases, for SMBIOS and things like that. So we have this already, right? There is infrastructure to have a database where you use these strings as input and you get kernel module information as output. And there's another database, the hardware database (hwdb), where you use these strings as input and you get udev properties as output. So, to me, that's what you should just use. A distribution that figures out how to split things up would have one sysext for NVIDIA drivers, one for AMD drivers, and things like that, and then you would just maintain this in hwdb, basically, where you match against vendor and product IDs and then specify the thing. And then we should have some tool that helps you figure that out, and probably turn it into RPM command lines if you are an RPM-based distribution, or something equivalent — that's distribution material, yeah. But I think just using the modalias stuff is the perfect solution for this. It solves exactly that problem, except that now it's not just one kernel module, it's a sysext that you pick up. Okay, so, just for me to understand something: would systemd-boot be able to parse the add-ons and show you a similar menu? How can you choose something from the add-ons, because I assume that an add-on could have multiple boot options, for example? Okay, so the whole command line stuff is still being worked on; probably we'll have more stuff later that hooks that up with the menu. But right now, the way it works is basically: you drop in one kernel and you put the add-ons next to it, and it's not sd-boot that has any understanding of this. Sd-boot only finds the main UKI, turns it into an entry, and then boots it eventually. But it's sd-stub, right?
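Returning briefly to the modalias answer above: the matching can be sketched as glob patterns over modalias strings, the way hwdb matches vendor/product IDs, except the lookup result is a sysext image rather than a single kernel module. The database entries and sysext file names below are hypothetical:

```python
import fnmatch

# hypothetical mapping from modalias glob patterns to the sysext carrying the drivers
SYSEXT_DB = {
    "pci:v000010DE*": "nvidia-drivers.sysext.raw",  # NVIDIA PCI vendor ID 0x10de
    "pci:v00001002*": "amd-drivers.sysext.raw",     # AMD PCI vendor ID 0x1002
}

def sysexts_for(modaliases: list[str]) -> set[str]:
    """Collect every sysext whose pattern matches one of the machine's modaliases."""
    return {ext for alias in modaliases
                for pattern, ext in SYSEXT_DB.items()
                if fnmatch.fnmatch(alias, pattern)}

found = sysexts_for(["pci:v000010DEd00002204sv00001043sd00008708bc03sc00i00",
                     "usb:v046DpC52Bd1201"])
print(found)  # only the NVIDIA sysext matches this machine
```

The appeal of the approach is that the machine's modalias strings already exist in sysfs, so no new hardware-identification mechanism is needed; only the mapping database is.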
This early code that is glued in front of the UKI then sees: okay, I got invoked; let's see in which directory I got invoked; let's see if it has the subdirectory, and then loads everything that's in there. So right now, you pick the UKI, and that pins basically all the stuff next to it. What you're asking for, basically, is that it shows up in the boot menu. We have been discussing this for a while, and everybody agrees we should do it, we just haven't done it yet, and, well, we don't actually know precisely how it will look. But the idea is that sooner or later we want to be able to embed not a single kernel command line into your UKI, but a choice of them. One is going to be the default if nobody picks anything, and this would then basically mean that if sd-boot finds one of these UKIs in the directory, it generates one menu item per kernel command line, so that you have one UKI where you can select the factory reset choice or the debug choice or the regular choice. So everybody agrees that's the way to go; nobody does it. It's really high on my to-do list. Any last question? How much — do we still have a minute or something? Are these, like, system extensions, credential extensions only... — sorry, I don't understand. Sorry, are these extensions only useful in the case when you haven't enrolled your own machine owner key? Or is there still an advantage — obviously these are going to be signed upstream — but if you've enrolled your own machine owner key, is the better approach now just to build your own UKIs locally and take advantage of the Secure Boot there, so that in fact you have an authenticated initrd? So, I'm not sure I understood the full question, but I'll answer it. It's about the machine owner key, the shim thingy. All these components that I just described have individual ways in which they are authenticated, right?
The add-ons, because they're PE UEFI binaries — okay, my time's over. But basically, let me finish the question. That's okay. So they are authenticated by Secure Boot means, right? And that also means shim, right? That's where the MOK comes into play. The other ones are preferably authenticated by the kernel keyring stuff; we ask the kernel keyring to authenticate them. Now, populating the kernel keyring is a mess, because you can do it via the MOK stuff — that works — but I think it's a mess that this is how it has to go. Ideally, I would have a way to upload a couple of additional keys from user space, so your local one, and then basically blow a fuse so that later nobody can do this anymore, because that would be the democratic thing, right? So I would take a Fedora kernel, and then in the early boot phase I can install an additional key, and then nobody else can. For me, that's the perfect security. Well, we don't live in that world. But I added this concept that we can do the authentication in user space instead. Depending on security policies the kernel might say no, though, but on distribution kernels they say yes. You can use MOK — anyway, the talk is done. Yeah. Anyway, so, yeah. Thank you very much. Thank you very much.
mkosi-initrd: Building initrds out of distribution packages
Hello everyone. Let's talk about building initrds out of distribution packages. So, a little bit about us. I'm Daan. I work on Linux userspace at Meta. I'm a systemd and mkosi maintainer. I'm Zbyszek. I work at Red Hat. I work on Fedora, I'm in FESCo, and I work on systemd, mostly. So let's start by talking about initrds and why we need them. The general boot flow when you boot a kernel is: you start with a boot loader, the boot loader goes to the kernel, and the kernel is then responsible for finding the root file system. In the early days of Linux this was pretty easy and the kernel could do it itself, but these days finding the root file system is a lot more complicated. So the kernel really said: we're not going to solve that problem, we'll just leave it to user space. How does it do that? Well, you give a file system to the kernel, which is called the initramfs. You do that via a CPIO archive. The kernel will unpack that and then start user space in that temporary file system, which is unpacked in memory. Once you get into that, the initramfs is responsible for finding the actual root file system and then doing a switch-root operation into it, and then you end up in the final file system. The initramfs can do what it wants, really, but generally these days it's done in one of two ways. The first one is that some bespoke bash script gets invoked, which is generated by your initramfs generator. These are tools like dracut, initramfs-tools on Debian, or mkinitcpio on Arch Linux and derivatives. The other way you can do it these days is by using systemd. Systemd has supported running in the initramfs for a very long time already, and it has all the tools and services you need to find the root file system and switch-root into it. Some of the initramfs generation tools are actually configurable, so you can either use the bash script or choose to use systemd in the initramfs.
So I'll add to this that the amount of stuff that needs to happen for the root file system to be available is growing more complex all the time. We have encryption, we have RAID, device mapper, possibly dm-verity. And in the theme of the previous talk, we might for example at some point ask the user for a password, but the user might not be using a keyboard; they might be using a braille device, or they might need a screen reader to know that the password prompt is up. All of this stuff will sooner or later need to be available very early in the boot, before the real system is ready. Yeah, and I'll add to that that your root file system might not even be on the system yet; it might have to come from the network. So you would need all the tools to set up a network connection and everything in your initramfs. So it can get pretty complicated. So, what's the status quo? Like I said, we have the initramfs generation tools: dracut, mkinitcpio, initramfs-tools. The way these tools work is they basically go look at your host file system, at what's on there, and start picking out specific files and use those to make the initramfs. Which files to pick is bespoke logic in each initramfs generator. The thing you need to know is that if you say "include this binary in the initramfs", that's not going to work, because that binary has library dependencies. So you also need to go get all the libraries, and of course those libraries can depend on more libraries, and so forth and so forth. So you need logic to make sure all those things get picked up correctly. Now, luckily, for ELF binaries you can actually do that in a pretty hacky way, by going to look at the ELF binary, where all the library dependencies are recorded, and trying to figure out stuff that way. But then you get into stuff like dlopen, where the library might not actually be listed in the ELF binary.
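The host-scanning approach just described amounts to computing a transitive closure over each binary's recorded library dependencies (the DT_NEEDED entries in the ELF header). A toy version, with a made-up dependency graph standing in for what a generator would read out of the actual ELF files:

```python
# toy stand-in for the DT_NEEDED entries a generator reads from ELF headers
NEEDED = {
    "/usr/bin/mount": ["libmount.so.1", "libc.so.6"],
    "libmount.so.1": ["libblkid.so.1", "libc.so.6"],
    "libblkid.so.1": ["libc.so.6"],
    "libc.so.6": [],
}

def closure(binary: str) -> set[str]:
    """What a host-scanning generator must compute for every file it copies in."""
    seen: set[str] = set()
    todo = [binary]
    while todo:
        item = todo.pop()
        if item not in seen:
            seen.add(item)
            todo.extend(NEEDED.get(item, []))
    return seen

print(sorted(closure("/usr/bin/mount")))
# Note: a plugin pulled in via dlopen() has no DT_NEEDED entry, so this walk
# silently misses it; that is exactly the failure mode described above.
```

The same blind spot applies to configuration files and data files, which carry no recorded dependencies at all, which is the motivation for letting the package manager do this job instead.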
Or you get into stuff like configuration files, or other kinds of plugins, or anything you can think of: there are no direct dependencies listed in the file system that you can use to figure out all the things that need to be included. So you can run into quite a few issues. This leads to a situation where, when a new piece of software that is used in the initramfs is released, the regular packaging — the PKGBUILD or the deb or the RPM spec — gets updated, and then on top of that there is initramfs-specific packaging that has to be updated as well, for example in dracut. A very good example of this is when we introduced systemd-executor in systemd, which is now required to launch services. This was a new binary, so when we released the new version, all the spec files were updated, and then we also had to update every initramfs generation tool to make sure it includes that binary in the initramfs. This leads to quite a few bugs. It also means that it becomes very unclear where bugs should be reported: it could be a bug in the upstream project, or it could be the initramfs generation tool not correctly picking up all the dependencies required to run the tool. So it becomes very hard to assign bugs, and it requires a lot of triaging to get them to the right project. It's also hard to customize: if you want to include something, it's up to you to figure out all the dependencies and specify them in the initramfs generation tool. And of course it's also quite slow, because every time the initramfs is updated, it has to be done locally, and all the dependencies have to be figured out. Anyone that's ever used dracut without host-only mode probably knows what I'm talking about, because it takes forever. So, what do we want to do instead? We want to reuse all the work that the distributions are already doing with their packaging: the Arch PKGBUILDs, the RPM specs, everything.
We want to reuse all the work that goes into those and use it to build the initramfs. So instead of going to look at the host file system, we just install packages (RPMs, debs, Arch Linux packages) into the initramfs, and we get it that way. This has a few advantages. It turns out that package managers are very good at installing packages, so it just works. They're also good at managing dependencies: all these systems have, depending on the package manager, very extensive or at least very sane dependency resolution. All the dependencies are listed, and the package manager takes care of figuring out all the extra stuff that is needed and makes sure that gets installed as well. You don't need to go parsing ELF binaries anymore to figure out the dependencies of a specific package. You don't need to learn another system: you don't need to learn the initramfs generation tool, and you don't need to manually list the dependencies of the tool you want to include. You just install the RPM, the deb, whatever you want, and the package manager takes care of the rest. The ownership of bugs becomes clearer, because the initramfs generation tool is just installing packages. It's pretty simple, so the surface area for bugs is a lot smaller, and generally when bugs appear, they can be assigned to the upstream project instead of to the initramfs generation tool. Any improvements made to the packaging automatically end up in the initramfs as well. And finally, with this approach the initramfs is no longer tied to the root file system or the host file system. So you can also start building the initrd off-host on a distribution builder and distribute it as a package: you can just download an initrd instead of generating one locally.
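Stripped to its essence, the package-based approach is two steps: install packages into a scratch root with the distro package manager, then pack that tree as a cpio archive, which is all an initramfs is. The sketch below only prints the commands rather than running them, since the real thing needs root privileges and a dnf-based host; the package list and paths are illustrative.

```shell
#!/usr/bin/env bash
# Print the two conceptual steps of a package-based initramfs build.
# Shown rather than executed: the real commands need root and a
# dnf-based system. Package list and output path are illustrative.
root=/tmp/initrd-root
pkgs="systemd udev util-linux kmod"
cmds=$(cat <<EOF
dnf --installroot=$root install -y $pkgs
(cd $root && find . | cpio -o -H newc | zstd) > /boot/initrd.cpio.zst
EOF
)
echo "$cmds"
```

Everything an initramfs generator otherwise reimplements (dependency closure, file selection) is delegated to the package manager's resolver here.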
Assuming that the initrd includes all the necessary pieces, this allows you to have an initrd that works for 99% of use cases without every user having to spend CPU power building an initrd themselves. There are some requirements for building the initramfs out of packages. Specifically, the packaging has to be done a little carefully to make sure that the initramfs does not become too big. For example gcc-libs: GCC ships a bunch of libraries which software generally depends on, at least the C runtime library. But GCC also supports the Go programming language, it supports Fortran, it supports D, and it includes standard libraries for all of those. If those are all put in the same package, especially the Go standard library, it's absolutely huge, and for an initramfs that's a problem. So ideally there are separate sub-packages for each standard library, so that you can install only the necessary one in the initramfs. Arch Linux, for example, doesn't do this, so there you have to start removing stuff manually; but we don't want to do that, right? We want to rely on the packages. So ideally the distributions take a little care that the core packages are split sufficiently, so that you only install the necessary stuff in the initramfs. Another good one is that the kernel module packages generally depend on the kernel itself. So if you install the kernel modules in the initramfs, the kernel gets pulled in as well, but you don't need the kernel image in the initramfs. That's another place where a little care should be taken to make this possible. And finally, locales: Fedora, and CentOS and derivatives, have a glibc-minimal-langpack package that only includes the C.UTF-8 locale instead of all of them. Stuff like that helps to reduce the size. So how do we propose to build this initramfs out of packages?
Well, we suggest using mkosi, which is systemd's image building tool. Our idea is that the initramfs really isn't any different from a regular Linux image; it's just packaged differently. Instead of putting it in a disk image with a GPT partition table, you package it with cpio and you get your initramfs. And an initramfs isn't really any different from a regular Linux system, except that it includes less software and it has two extra symlinks, and that's all you need. So you can build it using the regular image building tools; you don't need anything different. mkosi is a tool that builds these images. It does a whole bunch of things: it installs packages, and it can also build you something other than an initrd. It can install bootloaders, it can build an initramfs for a regular disk image, it can do unified kernel images, and it can run a whole bunch of tools that systemd provides to configure system images. It also allows you to test the thing by booting it in QEMU or in a systemd-nspawn container. So how do you get started with mkosi? Well, this is an example to build Arch, install systemd and the kernel, enable autologin, and then start it in QEMU. This gets you something like the following. mkosi supports all the popular distributions, I guess: CentOS, Debian, Ubuntu, openSUSE, Arch, Fedora, and some derivatives of those. RHEL? RHEL. And RHEL UBI. One interesting thing is that you do not need root privileges as your user to run mkosi. We use the newuidmap and newgidmap tools to be able to do everything without needing to enter your password. We also use systemd-repart from systemd to build disk images without needing root privileges or loop devices. So you can run all of this as your regular user to build an image. We have configuration files, so instead of having to specify everything on the command line, you can use the regular systemd-style INI files, which everyone knows from unit files. So what is mkosi-initrd?
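The getting-started example described above (build Arch, install systemd and the kernel, enable autologin, boot in QEMU) isn't visible in the transcript. A configuration along the following lines reproduces it; the section and option names follow mkosi's INI format, but exact spellings vary between mkosi versions, so treat this as a sketch.

```shell
#!/usr/bin/env bash
# Write a minimal mkosi configuration and, if mkosi is available,
# build the image and boot it in QEMU. Option names follow mkosi's
# documented INI format but may differ between versions.
mkdir -p mkosi-demo
cat > mkosi-demo/mkosi.conf <<'EOF'
[Distribution]
Distribution=arch

[Content]
Packages=systemd
         linux
Autologin=yes
EOF

if command -v mkosi >/dev/null 2>&1; then
  (cd mkosi-demo && mkosi build && mkosi qemu)  # unprivileged build, then boot test
else
  echo "mkosi not installed; configuration written to mkosi-demo/mkosi.conf"
fi
```

Note that both the build and the QEMU boot run as a regular user, which is the point made about newuidmap/newgidmap and systemd-repart above.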
Well, it is a mkosi configuration to build initramfs images. It used to be a standalone project, but we recently merged it into mkosi itself. It is already used to build the default initramfs for all images that mkosi builds: if you use mkosi to build a disk image and you do not specify your own initramfs, it will use mkosi-initrd to build one and use that. So every time you boot a mkosi disk image, you are generally already using this. And we make sure this is tested on all the supported distributions. It initially started out as a Fedora-only thing, but when we merged it into mkosi, we implemented support for all the distributions. So you can build an initramfs out of Arch packages, Ubuntu packages, Debian packages, openSUSE packages, CentOS packages, or Fedora packages. We also ship a kernel-install plugin. kernel-install is systemd tooling for taking a kernel from your /usr directory, where it is usually installed by the package manager, moving it to the ESP, and doing a bunch of extra required work, like building an initramfs. On Fedora at least, dracut ships its own kernel-install plugin, but mkosi does as well. So you can basically configure kernel-install to use mkosi-initrd instead of dracut to build the initramfs, and dracut will automatically disable itself if another initramfs generator is enabled. It also reuses all the package manager caches from the host file system, so you're not downloading unnecessary packages; it reuses the same RPMs or debs that you already installed on your host file system. And finally, it can be completely customized: the mkosi configuration supports drop-ins.
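A drop-in of the kind described might look like the following. The /etc/mkosi-initrd location matches what is mentioned next, but the exact directory layout and option names should be checked against your mkosi version; the sketch writes to a temporary directory so it runs unprivileged, and the nfs-utils package is an illustrative choice.

```shell
#!/usr/bin/env bash
# Sketch of a mkosi-initrd drop-in that adds NFS support to the
# generated initramfs. In real use this would live under
# /etc/mkosi-initrd/; a temp dir stands in so no root is needed.
dropin_dir=$(mktemp -d)
cat > "$dropin_dir/20-nfs.conf" <<'EOF'
[Content]
Packages=nfs-utils
EOF
echo "drop-in written to $dropin_dir/20-nfs.conf"
```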
So you can add a few of those in /usr/lib/mkosi-initrd or /etc/mkosi-initrd to add more packages to the initrd, to remove some extra stuff, or really anything you can think of that is supported by mkosi, and make sure it gets applied to the initramfs produced by the kernel-install plugin. It can also be used as a standalone thing, so you don't need the kernel-install plugin; this is how you would use it to build your own initramfs, which will then appear in the working directory you invoke it in. One interesting thing here is that because the kernel module packages aren't set up completely correctly yet and pull in too many dependencies, we do the practical thing and copy the kernel modules from the host. Using the kernel-modules include and exclude settings, we can do the same thing that dracut and the other tools do, where we only include the kernel modules that are loaded on the host file system; because if we included all of them and all of their firmware dependencies in the initramfs, it would grow to tremendous proportions. So make sure to only include what's needed. We cover a lot of this with integration tests. Specifically, we make sure that booting from LUKS works, so with an encrypted root file system; we make sure that LVM works; and we make sure that the combination of the two works, and that we can boot off all of these. We support the systemd-gpt-auto-generator stuff, doing everything with fstab, whatever you can think of, really; we try to make sure it works. There are some more niche technologies like RAID, NFS, and iSCSI that we haven't had the time to write integration tests for, so we can't say for sure that those will work, but we're working on making more of what is already possible with the existing tools actually work. That was everything I had to say. So this is a link to the configuration files of mkosi-initrd.
There you can go and take a look at how the initrds are structured, which packages are included, and what files are removed. Specifically, there are a lot of files that we have to remove, depending on the distribution. So if you're a distribution packager, go look at that, see what we have to remove manually, and improve your packaging so that we don't have to do that. Thank you for listening.

Before the questions, I want to make one clarification. Since we're developing this, we get into the mindset of thinking about the low-level details, and I think this might be a bit confusing: on the one hand we talk about building the initrd in a predictable way, somewhere on central infrastructure, and signing it, and on the other hand we talk about including local modules. A lot of this stuff is for development and for now. In the long term, we want to have the centralized thing where we build the initrd, glue it together with the kernel, and sign the pair together, building a unified kernel image, which Lennart Poettering was talking about earlier today. So yeah, just to clear this up.

Awesome, thank you. What questions do we have? One over here, one over there.

Okay, so you mentioned that currently you use local modules. Doesn't that mean that all the complexity from dracut for selecting kernel modules still remains here as well?

Yes, but it turns out the complexity for selecting kernel modules is not all that much, because the kernel modules list their dependencies properly. But yes, we do support it. And we hope, like Zbyszek said, that eventually we won't have to use that part anymore; that we can have a proper set of default modules, all properly sub-packaged in the distributions, so that we can install distribution packages to get the kernel modules instead of doing the extra work of selecting them locally.

You spoke about integration testing on multiple distributions.
Did you test only on, let's say, the usual latest distributions, or did you also try somewhat older ones? And do you have a plan to maintain testing with new distributions as they come out?

At the moment, our integration tests run on the default versions of all the supported distributions, which is generally the latest; it's Debian testing, not Debian stable. But we could definitely add more. It's just running in GitHub Actions, so it's just a matter of defining the necessary configuration and then we can run tests for everything.

Were there more questions? No more questions. All right, thank you two. This was great.
Desktop Linux, as easy as a smartphone! Just in a Snap!
So, I'm Till Kamppeter, leader of OpenPrinting, and by making a snap package of CUPS I've learned snapping and gained a lot of experience with snap. So I became a snap enthusiast, and I'm also working at Canonical, and this way I also came to giving workshops and talks about snap. Here, as a first step, I want to tell you what snap is and how it works, and second, about an all-snap Linux distribution, Ubuntu Core and Ubuntu Core Desktop, and to show that snap gives you something a little bit like how smartphones work, which makes it very easy for the end user to maintain their system. So, what the hell are snaps and why should I use them? If you have an open source project, usually they develop some application, and this application is published as source code. For most users it is much too difficult to download the source code and compile it; usually they do not even have the compilers installed and do not know what they would have to install to get them. Fortunately, the distributions make distro packages, but naturally there are so many applications that the distributions do not cover everything, so you cannot be sure that you actually get a distro package. Also, the distributions make and update their distro packages only until they release a distro version; after that, they do not make new versions of that package for that distro version. This is a little bit frustrating for users, and it can easily turn users away. So this is a nightmare. A little bit, yes. And there is a solution. You probably have a smartphone, and there you can easily download and install applications via the Google Play Store or the Apple App Store, independently of whether you have a Samsung, a Pixel, or something else, all with somewhat different Android versions. It is the same applications which you get from the Google Play Store.
And you know, Canonical also created an operating system, Ubuntu Touch, a smartphone operating system. They are not doing this anymore, but we have a UBports booth, because the community continued. And they learned from that; they did not throw the experience away. Starting from the ideas of its Click package format, they developed a package format for computers, for embedded and servers in the beginning but later also for desktop, and that is snap. We have a Snap Store, and we can install applications on different distros and different distro versions. We have a form of distro-independent packaging for computers running Linux, whether IoT, server, or desktop. And by the way, snap is 10 years old; it started in 2014. So, we have sandboxed packaging, which means every application is in a security capsule: one application cannot access the space of another and cannot access the system. And it is operating-system and distribution independent, because snaps bring all their dependencies, the libraries they need, so you do not rely on the distribution where you install them, and they install on many different distributions, like Ubuntu, Debian, SUSE, Fedora, or whatever. And as I said, for the sandboxed packaging we have a security shell. Every application is in a capsule made of AppArmor, seccomp, and namespaces, which prevents it from accessing the space of other applications, of other snaps, or of the operating system. For intercommunication you need well-defined interfaces: you must define, when you create a snap, which interfaces you want to use for outside communication, and only through these interfaces can your snap communicate; for example network, CUPS, or DNS-SD (Avahi) and so on. So it is very well defined how the application in the snap can communicate with the outside world.
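In a snap's packaging metadata these interfaces are declared as plugs. The fragment below is a hypothetical snapcraft.yaml excerpt: the app name is invented, while `network` and `avahi-observe` are real snapd interface names, though you should check the snapd interface list for your actual needs. It is written out via shell so the sketch stays self-contained.

```shell
#!/usr/bin/env bash
# Write a hypothetical snapcraft.yaml fragment declaring the interfaces
# (plugs) a snapped app may use. The app name is a placeholder; only
# the declared plugs can ever be connected for this app.
cat > snapcraft-fragment.yaml <<'EOF'
apps:
  myprintapp:
    command: bin/myprintapp
    plugs:
      - network        # safe interface: auto-connected on install
      - avahi-observe  # DNS-SD service discovery via Avahi
EOF
echo "wrote snapcraft-fragment.yaml"
```

For a "dangerous" interface the user would connect it manually after install, e.g. `snap connect myprintapp:removable-media` (`snap connect` is the real snapd command; the snap and plug names here are stand-ins).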
So this gives you security and privacy. And if there are interfaces which are dangerous, which can for example modify the outside system or read data which could be private, then these interfaces are considered dangerous, and when you put a snap which connects dangerous interfaces into the Snap Store, the user has to connect them by hand, or you need a special permission from the Snap Store for them to be connected automatically. And with this we can trust third-party apps. We no longer need, as a distribution, to trust only our own distro maintainers and have everything packaged by them. We can trust third-party packages, and so we can access a lot of different applications, like with the Google Play Store on a smartphone. And snap also has some special features which other sandboxed packaging methods do not necessarily have. The first thing is: don't fear the daemons, we are snapping them too. Snap allows packaging daemons and system applications, and even, as we will see later, a kernel, a boot system, a desktop environment. You can snap everything. One thing is also that packaging can now move from distros to upstream, which means that instead of ten distros all reinventing the wheel by packaging for themselves, the upstream can package and test once and all distros can use it. And so employees of distro vendors can concentrate on the core distro. And we will also see how this goes into immutable distros, all-snap distros, in our case Ubuntu Core. A snap itself can a little bit be considered an immutable application, because the file system of the snap is also read-only. And this we will see now: now we look into how snaps work.
A snap's file system, the application's file system, is a compressed, GPG-signed, read-only SquashFS image which we simply mount; we do not even uncompress it. Therefore we save a lot of storage and also memory. It also includes the metadata of the snap, and when we install the snap, a writable area in the file system inside the capsule of the snap is defined, so that the application can write somewhere. And we have five types of snaps: we have apps; we have core snaps, which is the operating system core; we have gadget snaps, which is the boot system; we have kernel snaps; and we have desktop session snaps, like GNOME or KDE. When they are updated, we can handle binary diffs, so downloads are much quicker. And they are available for most distros; you can install snapd on many distros, and snapd has been included by default in Ubuntu since 14.04, ten years. And as I said, security: we have the GPG-signed, read-only file system for the application, so it cannot be modified by any malware. We have the confinement (AppArmor, seccomp, and namespaces), and the executables are run through snapd's snap-confine, so that the security is enforced. And snaps are root-safe: due to the encapsulation, we can run an application as root, but it cannot modify anything in the environment, because it's in the capsule. For daemon snaps this is good, and we do not need any special users and special groups. They are storage-efficient, because the immutable file system of a snap is mounted without actually being uncompressed. And we have additional tricks, the so-called content snaps. First there is the core snap, which contains the core operating system: glibc, GLib, and all the standard libraries.
This core snap is mounted into the capsule of each snap, so that the essential parts of the operating system are available to the snap. And we have content provider snaps, for example for GNOME with all the GNOME libraries, or for KDE with all the KDE libraries, so that these can be shared by snaps too. As I said, there are safe interfaces and dangerous interfaces; for an interface between a snap and the core system, the snapd snap is the providing snap. We have slots and plugs for connecting two snaps, or one snap and the system itself. A plug is connected to a slot, and this gives a defined connection for communication to the outside of the snap. With a safe interface, when you upload your snap to the Snap Store and someone installs it, the interface is connected automatically. With a dangerous interface, when the user downloads a snap that uses it, the user usually has to connect it by hand, or, if the Snap Store has given special permission, it is also connected automatically. And now for updating: when you update a snap, the new snap is downloaded and mounted, but the old snap is not deleted. The new snap is in a new immutable file system, and it is installed and put in use; and if you have any problems with it, because the previous snap is not deleted, you can easily step back. So it's not a big problem when a new version introduces a bug and does not work: you can easily step back to the previous one. Only the one before the previous one actually gets deleted. And snap started as a part of Ubuntu Core, the Ubuntu Core operating system for IoT.
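The rollback behaviour described a moment ago maps to a single snapd command. The sketch below only prints the commands, since running them needs snapd and root; `snap list`, `snap refresh`, and `snap revert` are real snapd commands, while the snap name is a placeholder.

```shell
#!/usr/bin/env bash
# Print the update/rollback workflow described above. snapd keeps the
# previous revision on disk, so reverting is one command. "myapp" is
# a placeholder snap name.
cmds=$(cat <<'EOF'
snap list myapp          # show the revision currently in use
sudo snap refresh myapp  # update: new revision mounted, old one kept
sudo snap revert myapp   # problems? switch back to the old revision
EOF
)
echo "$cmds"
```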
At Canonical we wanted to have an IoT operating system, Ubuntu Core, and we created this back in 2014 as an immutable operating system with snap as the packaging format. This is where snap started. And this immutable operating system does not have only one single core, as most others have; it has modules. The kernel is one snap; the gadget, which is the boot system and the definition of partitioning and so on, is also a snap; and the core operating system, the base libraries I mentioned already (GLib, libc and whatever), is a third snap. These three snaps give you an operating system which you can boot. They come in one image, but once installed, you can update and replace them separately, for example replacing the kernel with a gaming kernel or so. And onto these three you install application snaps, the applications packaged as snaps which I mentioned. So this is the Ubuntu Core operating system, and updates are also modular, like with the application snaps: when you update a snap of the Ubuntu Core operating system, for example the core snap, the new core snap is loaded and activated, but the old one is not deleted, so you can step back. And if you update the kernel, it can even step back automatically if the kernel does not succeed in booting. When it hangs somewhere during boot or gives a kernel panic, it automatically steps back and reboots. So if you update into a bad kernel, the boot simply takes longer and you are back in the old kernel. And now this Ubuntu Core from 2014 was extended in 2023 to Ubuntu Core Desktop: we take Ubuntu Core and onto it we put a desktop via an additional snap, which is the Ubuntu desktop session snap.
At Canonical it is Wayland with GNOME, but as it's an exchangeable snap, Kubuntu, for example, can snap KDE, and all the other flavors can also contribute a desktop session snap, and this way we get flavors of Ubuntu Core Desktop. And we have the applications, which this time can be desktop applications. It is also distributed as an image, usually with the base (gadget, kernel, and core snap), but the image also contains the desktop session and some initial apps, so that the user can start with a complete desktop. It's all in one image, but once the image is installed, as usual, you can update and replace everything separately. A little bit like Lego pieces, or like a Framework laptop, but in software. Now you might think: how do I do development on a system where everything is encapsulated and separated? What we do is use LXD containers and do the development in an LXD container. We take an LXD container of the operating system we want to develop under, and inside it we compile, we have all the tools, and we test, so we do not need to snap the application we are developing all the time just to be able to test it. For this we have a graphical front end named Workshops, where you can easily choose which operating system, not only Ubuntu but also Fedora, SUSE, and whatever, and so you can develop. But it still needs some work, for example so that one can have a snapped IDE running natively but working on the containers.
And what still has to be done to make the system perfect and complete: for gaming, we need NVIDIA proprietary driver support; this is still in the works and not yet ready. For productivity, we need to make the printer setup tools work with the new CUPS 3.x, which I mentioned in my first talk this morning, because in Ubuntu Core Desktop, when CUPS is encapsulated, it cannot access classic CUPS drivers outside of CUPS, only IPP printers, and so we need to change the printer setup tools. We also need, for productivity, to introduce scanner applications, so that we can add scanner drivers to Ubuntu Core Desktop. And we need to improve the development part, so that we can have a snap of an IDE where the snap can access the files and do its operations in the LXD containers in which we are developing. What is also still missing: TPM full-disk encryption still needs to be done, and remote management of fleets with Canonical Landscape. The secure and modular Ubuntu Core Desktop is also an ideal distribution for companies which have many computers and want easy maintenance of their systems, so remote maintenance is an important part, and then Active Directory login for the enterprise desktop, and the infrastructure to make it available as a distro: that we make the ISOs, that we have testing plans and test scripting, CI, stable release tracks, documentation, and so on. But this is all planned for the next months at Canonical. So I think, yes, that was it. You can also visit snapcraft.io; there you find everything about snap, we also have a forum for questions, and here are some links. Are there any questions? We actually don't have time for questions, that was it, but thank you so much for your talk. Thank you. And
now we have a demo of Ubuntu Core Desktop here behind that door, and if you have questions, there is me, and there is Philipp Kewisch, community manager at Canonical, and there we can talk more about snap and Ubuntu Core Desktop. Fantastic, thank you so much. Thank you.
Upstream and downstream, best friends forever?
to introduce the speaker. Thank you. Hi, folks. Good afternoon. Welcome to the evening sessions. I'm going to turn it over to our next speaker, František. Just a couple of housekeeping rules: please make sure phones and so on are on silent, and when you're taking a seat, the seats can be loud, so put them down gently, and try to keep the talking to a minimum. Thank you.

Hello, everyone. I'm František Lachman, product owner of the Packit project. I will use this project as an example during the talk. Thanks, everyone, for coming. And I would also like to hear things from you, so don't sneak out through the doors behind you. When I was thinking about this talk, I figured that if people come here, they have maybe already had issues like mine and have already been thinking about this. So let's use their ideas as well, and not just talk for half an hour; let them show and share too. So I would like you to connect to this URL, or just use menti.com with this number, to connect to the slides, so you can also provide some feedback for me and for others. I hope it will not break the Wi-Fi or disappear or something; we'll see how it goes. And this is an example question. Thank you for putting the answers in there. That's not only to test it, but also so we know where you are coming from and what your background is. So let's give you a couple more seconds. Wow, so many, so many. Okay. And a positive thing is that if you don't see the slides correctly, you can watch them on the screen; and, yeah, sorry to those who wanted to fix some bugs in the meantime or check the next session, since you need the phone or laptop for this. Okay, so we'll move on. In the title, "stream" was mentioned two times, but what do I mean by that? I mean a stream of code, of the program, that comes from up, from the developers, down, down, down to the users. That's the stream I have in mind.
And you can have various pieces on the way: anything that goes up towards the developer is an upstream, and what goes down towards the user is a downstream. So for example, Fedora is a downstream when looking from the developer's point of view, or from GitHub, but it can also be an upstream of CentOS Stream; CentOS Stream is a downstream of Fedora but an upstream of RHEL. So it always depends on the place we are looking from. For this talk, when I mention upstream, I mean a Git forge: GitHub, GitLab. By downstream, I mean some Linux distribution, for example Fedora in my case. I tried to say "upstream developers" and "downstream maintainers" to make it really clear, but just so you know. So, just to check, try to show the others where you belong: are you more of a downstream maintainer, or are you more of an upstream developer, maybe curious how you can get into a distribution? Let's show the others where we stand and whom I'm talking to. So, mostly maintainers; and if you are both an upstream developer and a downstream maintainer, you are somewhere in the middle. Okay, it's not moving much, so I'll continue. So, back to the Packit project I mentioned on the title slide. Something around five years ago, with a few people around, we were thinking that we would create a new project, and as a goal we said: hey, let's bring upstream and downstream closer together. Let's provide some downstream feedback to the development, and for downstream maintainers, let's provide a connection to the upstream; for example, when a new version is released upstream, let's get it automatically to downstream. And it was like: yeah, that will be awesome, everyone will be happy. So we started work on that, and a few months in, we came to the upstream developers and said: yeah, we have this Fedora integration for your project, and it's really easy, and you will have new functionality for your project and can be sure that your code will run.
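For reference, the integration being offered here is configured through a `.packit.yaml` file in the upstream repository. The fragment below is a hypothetical minimal example: the keys (`specfile_path`, `jobs`, `copr_build`, `propose_downstream`) come from Packit's documentation, but the concrete values are invented for illustration.

```shell
#!/usr/bin/env bash
# Write a hypothetical minimal .packit.yaml: build RPMs in Copr for
# every pull request, and open a Fedora dist-git update on each
# upstream release. Key names follow Packit's docs; values are
# illustrative.
cat > .packit.yaml <<'EOF'
specfile_path: mytool.spec
jobs:
  - job: copr_build
    trigger: pull_request
    targets:
      - fedora-rawhide
  - job: propose_downstream
    trigger: release
    dist_git_branches:
      - fedora-rawhide
EOF
echo "wrote .packit.yaml"
```

The first job gives upstream developers the downstream build feedback on each pull request; the second automates the release-to-distribution hand-off described above.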
The feedback wasn't so positive, and we were really surprised, because, yeah, we are trying to help you! So what do you think: why might developers care about downstream? Why should they even bother? Why shouldn't they just live on their GitHub or GitLab website, live their awesome life, and not care about any distribution at all? So, yeah, hard question. I hope you are typing. Yeah: availability, software adoption. Wow, okay, yeah, many, many reasons. Yeah: without distribution, they might have no users. A lot of obvious things. Just to note, after the session I'll share the results with you, and maybe I'll also write a blog post, so you will have it all attached. Wow, so many things. It looks like it makes sense to care about downstream. So just a couple more seconds. Yeah: people, shitty tools, yeah, revenue maybe also. Yeah, sometimes there is a middleman so that you don't need to deal with the users directly. Sometimes you just don't want users, maybe. Okay, so let's move on. Then we asked the maintainers: hey, maintainers, we have this nice service for you that will automatically send upstream releases to your distribution. And again, we were very positive that we were helping people. But the response was: I don't care if they produce new code; it's definitely new bugs and more work for me, so I'm not sure I want this service. So, the same question: why do you think maintainers should care about upstream? Why shouldn't there just be an upstream that doesn't produce anything, so I can happily live in peace, just rebuilding my package every half a year or so, when there is a new version of the Linux distribution? Yeah: users want new releases, yeah. New features, that's related. Missing updates, similar stuff. Yeah, bug fixes. Writing the code is hard, and I really don't want to do that, and all the patching, yeah. Can there be a maintainer without an upstream? Looks like not; but yeah, there are a lot of maintainers with a dead upstream.
But availability, security fixes, stability — so, yeah, we have at least 17 reasons to do that. So I think it makes sense to care about upstream. Yeah, and if there is no upstream project, then there is no downstream project, so that makes sense. Okay. So we really wanted to help people, and it was quite a surprise, because we were honest about it: our goal really was to bring upstream and downstream closer together. There was nothing hidden, just a really clear goal to help people. Along the way, after this feedback, we also got some positive responses, and there were people who were both upstream and downstream. After all, we got some users — users who provided feedback — and after these four-ish years I can say that we are saving people time and helping them, and there are great projects using ours. So it looks like it makes some sense. But it wasn't easy, we are definitely not done, and we've collected various feedback and complaints on the way. So let's pick a few typical sentences we've heard during those years and take a look at what we can do to help in those situations. The first one: "When things go wrong, I don't want to look into the logs. I don't understand the downstream logs. There is some build failure and I don't understand it." So what would you do in this situation? You are providing a build-system integration — maybe running RPM builds for the upstream pull requests, or testing on Ubuntu, or anything like that — just downstream feedback for any upstream change, and people don't want to wrestle with the downstream logs. When things go wrong, what would you do? How do we help with that? Yeah, a reliable mechanism for filing bugs — definitely, if the problem is a packaging problem. Anything else we can do? Okay, so be transparent: if we need to give them the logs, let's let them suffer as well. Yes: "I have to dig through 20 logs to find the one relevant."
Yeah, so help with somehow combing through those. Cool logging libraries. I'm still missing one crucial answer. Yeah, you can Snap or Flatpak it, yes — but you can get a failure from creating the snap too, so we can treat Snap or Flatpak like just another distribution, maybe at either end, but still. Okay, I'm still missing the one crucial, obvious answer. You know it is not possible — probably that's why the response is not here: better logs. Yeah, it's usually not so easily possible. Sometimes we can do something about it, but with these systems — if you were at the talk an hour ago about all the Fedora systems we have in place — we are trying to integrate with all of them, and, for example, Copr has multiple logs, and all the systems have different logs and use different tools. So this is layered, and we don't have power over all the logs. But maybe we can, as someone correctly mentioned, be good at aggregation or some visualization. Or we can use AI — okay, just kidding, but a few colleagues and I are actually working on something like that: collecting various logs with failures and getting human input on what's going on and how to fix it. So if you are interested in that, check this out. I really hope this will happen and will produce a really nice data set that helps us offer a way to not wrestle with hundreds of lines of logs that I don't need. It's just at the beginning, but I'm really looking forward to it. The next thing that would help is to provide nice notifications and — by that I also mean — to connect the people who can help. And it relates to the last point: we need to set really clear expectations. Who is responsible for what? Who should take a look?
Yes, sometimes it's not clear. Sometimes it's a really valid bug in the code that is caught quite soon, and that's really nice; but sometimes it's a downstream issue, and sometimes it's something in the middle. So when we are introducing these two sides to each other, it's really important to set clear expectations — who will look at what, maybe also time-based — and with the notifications we can ping the people who can help. Okay, next one: "Just a single distribution? Why can't you support all of them?" So, what would you do? There are people from various distributions, so this is common: if you want to introduce some CI or anything like that — yeah, but if I enable this for Fedora, I also want Debian, SUSE, and everything. Can we help somehow? What would you do? Yeah, we can use build systems that support multiple targets, like Copr or OBS. Yeah, Snap, Flatpak — we can sidestep the distribution, but as we discussed at the very beginning, we might still want to care about the distributions we work in. Yes, a lot of distributions are really, really different, and it's hard to compare them somehow — maybe some abstraction. Yeah, I'm lost in all the good suggestions; I'll probably read them later. So, yeah, it's probably not possible — definitely not all of them; there are a lot of distributions. But as someone suggested, we can try. We can also be open source so that someone can contribute — maybe an architecture that can combine various backends so we can share the work. If you are open source, then, for example, someone will come from SUSE and say: hey, let's also support OBS in Packit — and we can collaborate. So that's also possible.
The tricky one: we are used to distro-specific terms, and we need to be really careful about those, because when we mention various scratch builds and patches, metadata, bugs, and all the weird terminology, it might be a reason why developers are scared and don't want to hear about it. So describe those terms, be careful, and you can also hide them somehow: we don't need to speak about Copr projects, but we can maybe say "RPM builds" and things like that. What also helps us is that we support various types of functionality, and that we provide easy and reliable testing infrastructure — something they can rely on to run their tests. We use, for example, the Testing Farm project for that, so we don't need to build it ourselves. And easy onboarding — I'll probably mention this multiple times, but it's crucial, because if they hit the very first problem on the way, they leave; with those distribution things it's not easy, and we've messed it up multiple times, but it's important. Next: "Sorry, I don't want to have more files in the repository." So, yet another config file generated for yet another upstream CI system. "Don't be lazy," yeah. Yeah, there are thousands of files in the repository and you don't want yet another one. A few interesting things in the answers — anything interesting? Okay, so yeah, we can add more complaints. Very interesting things. I'll probably move on and read those later. So yeah, we might want to stick with one line if possible, one file if possible — and still, better one file than multiple files. Also, not sure why, but people would rather put a shell script inline into the JSON or YAML instead of providing a shell script and specifying its name in the YAML file — so we can help them do that.
Yeah, and if there is more content, we can let them link it. We can also enable custom locations — a custom file name or some subdirectory — so they can hide it a bit. And, for example, the Zuul project uses global configuration: if people really don't want to put anything in their git repository, you provide a separate git repository where they can create a pull request and enable it there. It's tricky from the developer's point of view — how to do that, how all the messaging should work — but yeah. Next: "I have my own automation, it works well." So: I have my script or whatever and I'm happy with it; why change? Yeah. Okay. Good for you. Yeah, standard protocols. I'll move on. So, this one is generic: when we want someone to start using something else, even with comparable tools, we need some killer feature. Having just the same feature set is not enough; they need clear motivation — something extra — to move. Easy onboarding, I've mentioned, is crucial; we are, for example, trying to run online workshops and various fun things to help them. And the killer feature can be that when things break, the users do not need to take a look and fix it themselves — that can save a lot of time. With us maintaining this automation we can save them a lot of time, and we need to clearly communicate that — but we also need to actually do it: if things break, we should take a look, not just ignore it. And work on the right things: listen to the community, listen to the people, and don't just assume you are building the right features — ask, and get something out there. Next: "Your automation can break some rules, for example packaging rules. How do you tackle these things?"
And yes — sometimes, when those packaging guidelines were created, automation wasn't such a thing, or whoever wrote them expected that humans would interact with the packages. Maybe standardization — whose rules, yeah. We can also tweak the rules; they are not set in stone, so we can discuss. "I trust the bot more than the human" — yeah, that's a positive thing about automation: it usually doesn't do the human-like things, and it stays polite, and I really like that. So yeah: be open to suggestions, communicate, don't ignore the issues. Just talk and see what others think, because sometimes it's really valuable feedback, and maybe people already have a suggestion for how to fix it or how you should behave. Sometimes you can also let the user decide. For example, we discussed whether we should upload archives to the lookaside cache before the change is merged into Fedora dist-git, and we were not sure — some people want the automation, some want the safety — so let them decide. And for us it helped that we try, as much as possible, to have the same permissions as a regular user, so we are not special in any way — with one slight exception — and to get the least permissions that we need. So. The next one is similar to the previous one; you can continue with the voting, but I'll skip to the two points I had. When people think you should behave differently, we can provide config options, but usually: wait. Maybe people will realize they don't need it, or maybe a new user will come with a similar config option or feature request, so you can combine them. For us, user-defined actions helped a lot, because everyone has a different workflow; it was really hard to do securely and well, but it was huge for us.
Yeah, and responding to the first issues and questions is crucial — even if it's a little UX thing, you are showing how you treat your users. So that's all from me. This is the project page and account, and we have maybe two minutes for questions if you have any... We don't have any. Okay, sorry about that, but I think you've shared your opinions along the way. So thanks a lot, everyone. Thank you.
Supporting architecture psABIs with GNU Guix
Okay, so hi, I am Efraim. I've been working on GNU Guix for about eight years now. So: supporting the psABIs with Guix. The psABIs are — say, hey, it's been 20-ish years since we first got x86_64; it would be nice if we compiled for something a little newer. So the psABIs are a nice way — let's see if I can see both screens — to support both the older machines and the newer machines at the same time. This is the output I got from my computer running --help on ld.so. As you can see, my machine supports x86-64-v3 and v2 and, of course, the original, just regular x86_64. So that's how we actually figure out what is supported and what isn't. Something I wasn't able to find by looking through everything is: where are these directories? I know the libraries go in /lib, but it's not really clear where the alternate ones go. So I did what any normal person would do: I went to my local checkout of glibc and searched for "glibc-hwcaps". From the test suite we got $L/glibc-hwcaps/ plus our little string and our library, and we also got that it's supported on three architectures. So we can see that x86_64 isn't the only one looking for faster libraries. That gives us, on x86_64, these four paths where ld.so will actually search for supported libraries, more or less: /lib itself and then the ones after it. For the sake of completeness, for PowerPC64 little-endian we have power9 and power10, and for s390x we have these other ones — I've never seen an s390x machine; I just assume they exist. So that works well on a regular distribution, where all of the libraries go into /lib or /usr/lib. But in Guix, everything has its own path that it gets installed into. And the fancy phrase at the top there is "directed acyclic graph" — individually I know what all of those words mean, but in general it's like, the arrows have...
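The search order the talk describes can be sketched as a small Python model (this is a model of ld.so's behavior, not glibc's actual code; the subdirectory names follow glibc's `glibc-hwcaps` layout, highest psABI first, with the plain directory as the final fallback):

```python
import os

# glibc-hwcaps subdirectories per architecture, highest priority first.
# (x86_64 and powerpc64le lists as described in the talk.)
HWCAPS_DIRS = {
    "x86_64": ["x86-64-v4", "x86-64-v3", "x86-64-v2"],
    "powerpc64le": ["power10", "power9"],
}

def search_paths(libdir, arch, library):
    """Return the candidate paths ld.so probes for `library`, in order."""
    paths = [os.path.join(libdir, "glibc-hwcaps", sub, library)
             for sub in HWCAPS_DIRS.get(arch, [])]
    paths.append(os.path.join(libdir, library))  # plain fallback, always last
    return paths

for p in search_paths("/lib", "x86_64", "libgsl.so"):
    print(p)
```

This also makes the Guix problem visible: the hwcaps directories are probed relative to the *same* libdir the plain library came from — e.g. `/gnu/store/<hash>-gsl/lib` — so optimized variants built into a *different* store path are never found.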
I'm trying to remember if this one goes up or down, depending on which part of the stack you're working on; sometimes we end up with arrows in different directions, so I'm assuming we're going down on this one. So expat depends directly on gettext-minimal, which is different from our regular gettext, which would have other inputs in it. And the acyclic part is what it sounds like: there are no circles, no repetitions. Once you build the package, that's it — it goes into its designated folder and nothing else gets installed there. So, back to our library outputs: Guix doesn't have a /lib folder, and it doesn't have these other folders either. So we still need to convince glibc to actually look in all of these places so that we can find the libraries. The other thing is that, reading through the bits, it turns out we're not just checking all of these directories: we look specifically in /lib, and then, if your hardware supports it, we also check the other library locations. So, to take your favorite library, readline: you'd have /lib/libreadline.so, and you could also have it in your glibc-hwcaps directories. But with Guix, the first one is going to be at its full path under /gnu/store/<big hash>, and the other ones will be in other paths. So while you would have libraries in the other paths too, they don't actually end up in the same spot. ld.so says: okay, here's my regular readline library; I'm going to search for the glibc-hwcaps variants along the same path — and it's not going to find anything, even though you've already built them. So, the question that kept coming up while I was looking at this: is it worth it? Does this actually make a difference when you're running the programs?
How much difference do you really get from having libraries optimized for your computer? And the answer, a little bit, is: does it matter? I mean, the options are there; they wouldn't be there if they weren't going to do something. And the other part is: people want it. Users want it. It might make a difference. So, whether or not it matters, we're still doing it. To some extent I wonder if it's like -funroll-loops — I always read that as "fun roll loops" — where it's: does unrolling the loops actually matter? How much benefit are you getting from it? So, yeah. One of the programs we were experimenting with — oh, I got cut off a little bit — was the new ncdu, written in Zig. Up here I have transcribed output from diffoscope; I actually went and compared the two binaries. Zig inherits its optimizations from the underlying LLVM. So I compared an ncdu built against baseline x86_64 and one built for x86-64-v3, which would run on my desktop. Other than seeing that more than 99% of the code was the same, this was the part with the largest amount of difference in the generated assembly. I don't know if vzeroupper is faster than the other options here, but I also noticed that it ends up with the same number of instructions. So, for a lot of this, we really are getting into very minimal benefits. But anyway, like we said, we're doing it anyway; the options are there; I'm not going to take no for an answer. Okay, that didn't get cut off at the bottom. One of the libraries we've already looked at and said "this one actually benefits" is GSL, one of the math libraries. What you're looking at here is Scheme code. It inherits from the actual package definition for GSL — you can see this one is missing a version string and the source location; it's missing a couple of things because they're just inherited.
But basically, a package definition defines the name and version; it has the source — where to get it; it says what kind of build system to use; whether there are any arguments — in the case of GSL, mostly we skip a bunch of tests; and then some other metadata that goes with it: home page, synopsis, description, license. So for this one we say: okay, we're going to inherit from GSL, and I'm going to change the name to append the psABI so we can keep them separate. In the make flags, we pass CFLAGS and CXXFLAGS saying that we're building for the specific psABI. We tell it that the library dir of the output — the per-package directory where this library will get installed — is not output/lib but lib/glibc-hwcaps/<psABI>, the directory we saw before. And then, after the installation, we delete a couple of extra bits we don't need: the binary, because we're just using the original GSL binary; the headers in include; anything in the share directory; and the pkg-config files — we just delete all of that. In the properties, we hide it from the CLI so people can't just install this one on its own, and we mark it as not tunable, because we don't want someone to say "build this specific library for this specific sub-architecture, but actually tune it for my machine" — that's not going to help anybody. So then, when we have the actual library, similar to before, we say: go ahead and run through everything like normal, using all the normal package arguments and build arguments, and at the end, after install, copy the actual libraries into place — because we can't say "here's regular GSL" and install the other libraries into its folder.
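The recipe the talk describes — build once per psABI, point the libdir at the matching `glibc-hwcaps` subdirectory — can be sketched in Python. The real Guix code is Scheme; this is just a hedged model that computes the `./configure` arguments for one variant (the function name is mine; the `-march=` spellings are the GCC psABI level names):

```python
import os

def hwcaps_configure_args(out_dir, psabi):
    """Build the ./configure arguments for one psABI variant of a
    library, installing its shared objects under
    <out_dir>/lib/glibc-hwcaps/<psabi> instead of <out_dir>/lib.
    Sketch of the Guix recipe from the talk, not the actual code."""
    libdir = os.path.join(out_dir, "lib", "glibc-hwcaps", psabi)
    flags = f"-O2 -march={psabi}"  # e.g. -march=x86-64-v3
    return [
        f"--prefix={out_dir}",
        f"--libdir={libdir}",
        f"CFLAGS={flags}",
        f"CXXFLAGS={flags}",
    ]

for arg in hwcaps_configure_args("/gnu/store/abc-gsl-hwcaps", "x86-64-v3"):
    print(arg)
```

After `make install`, the talk's recipe then deletes the bin, include, share, and pkg-config trees, since only the optimized `.so` files differ from the baseline package.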
We say we've built a new one, and we copy the optimized libraries in. Same thing for PowerPC; I just didn't put it there. So, in the end, this is the regular one — just the generic one; I shortened some of the directories so it would fit. We have, at the top, the full path of the output for GSL. And this is everything together: the full path of the output for GSL with all of the hardware capabilities, all the optimized libraries. You see we have the one set of binaries and the headers; I've collapsed the lib dir so it doesn't take up all the space. Inside the lib dir, at the bottom there, we have the plain libgsl, and then v2, v3, and v4, which I closed so it would all fit — and just the one pkg-config file. Through all the testing I did on various machines, using this as an input for everything else, it would link against the regular libgsl, and then at runtime it would actually use the optimized library, depending on which machine it was running on. So, Guix being a functional package manager, we end up with functions for things. Here, for some of the bioinformatics programs I was working on — the idea was PGGB, which is commonly distributed as a Docker image. So instead of compiling everything for baseline, or saying "hey, we've made 500 different images based on what machine you're actually running on", we said: okay, we have a list of — in this case — five libraries; go ahead and replace all of their occurrences in the graph with these ones. And then when you go to run it, you get all the benefits of the faster libraries. So, back to "is it worth it?" — this one I hadn't really planned on getting into so much; this was a blog post from last month.
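The "replace all of their occurrences in the graph" step can be illustrated with a toy Python model of dependency rewriting (Guix does this in Scheme with input-rewriting machinery; the dict shape and names below are made up for illustration):

```python
def rewrite_inputs(pkg, replacements):
    """Return a copy of a package DAG (dicts with 'name' and 'inputs')
    where every dependency whose name is in `replacements` is swapped
    for its optimized variant. Toy model, not the real Guix API."""
    if pkg["name"] in replacements:
        return replacements[pkg["name"]]
    return {"name": pkg["name"],
            "inputs": [rewrite_inputs(d, replacements) for d in pkg["inputs"]]}

# Hypothetical graph: pggb -> tool -> gsl
gsl = {"name": "gsl", "inputs": []}
pggb = {"name": "pggb", "inputs": [{"name": "tool", "inputs": [gsl]}]}

fast = rewrite_inputs(pggb, {"gsl": {"name": "gsl-hwcaps", "inputs": []}})
print(fast["inputs"][0]["inputs"][0]["name"])
```

Because the replacement is applied to the whole graph rather than one package, a single rewritten root is enough — there is no need to publish hundreds of per-machine images.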
Somebody had rebuilt parts of Arch Linux for x86-64-v3, and the claim was a 10% performance improvement. The other part — yeah, not quite cut off — is that the rebuilt v3 packages were also built with the -O3 flag, versus the -O2 flag that Arch Linux uses. And then they went through a couple of programs to see: is it actually faster? What kind of speed benefit can you get? Negative times are faster, positive times are slower. In this case: compressing the kernel was faster, decompressing was slower; FLAC was faster in all cases; gawk was a toss-up; gzip was slightly faster, but that might just be -O3; LZ4 was slower; Python was slower; R was the same; Vorbis was faster; and xz was basically the same, with decompression faster. So in general you're still left with a couple of packages here and there that actually benefit from faster libraries. It's not going to keep people from saying "I want everything to be compiled faster" — going back to the fun-roll-loops thing, you have to eke out that extra little bit of speed. But from the distro side, some of it becomes: how much time do I want to spend — maybe not specifically on supporting the different options, because I just send it through and it gets built and it's done — but how much time do I want to spend building four copies of everything so that I can mush them all together and expand the size of the final library? Um, what did I have over here... no, I thought I had a thing right there. So I guess the other part we had with this was that — well, first of all, this works well for GSL.
I could change the function name from gsl-hwabi to something more general and start passing it the name of the actual library to inherit from, and all of that. But the other part is that this assumes that passing just CFLAGS and CXXFLAGS will actually produce the optimized files you're looking for — and that's not always the case. Sometimes you end up with packages that need extra flags, or you need to add them in manually anyway, or they're hard-coded and need to be substituted out. And, going back to "do you actually want to support every single package" — it becomes: do you want to go through the entire archive of all the packages for something that may or may not actually make a difference for all of the libraries? Oops, that was too far. So, yes — are there any questions? Any comments? Okay. Thanks. "How far have you gotten in implementing this? Is it just something you've been experimenting with, or something that's actually working?" I've mostly been experimenting with it. Part of it is that I don't actually want to build everything multiple times. But the size increase — that's the part I thought I had — on GSL it went from, I think, 5.5 megabytes to about 18 megabytes by adding in the four different copies of all the libraries. So it really becomes: for Vorbis, or for specific libraries that we know are going to make a difference, it really makes a difference; for other ones, you look at it and say — okay, libxul for Firefox, that's 100 megabytes and a long build process; maybe we won't do that one. Okay. "You said there is support for PowerPC, POWER9 and POWER10. What about the older variants, like POWER7?" They are all 64-bit PowerPC variants.
They are all the 64-bit PowerPC variants from the actual glibc source code — I think this was accurate as of 2.38 — and these are the only directories that are currently searched for additional libraries. I think Guix right now targets POWER8 as the baseline, so a backport would be needed for POWER7 and older chips, and you'd have to get it into glibc to have the special directories. I mean, if you compiled the distribution for POWER7, then you would have support for POWER7 there; but in terms of having the special directories, currently there's no support in glibc, although I suppose it could be added. "I might investigate later if I have hardware, or buy hardware, because I'm interested in the PowerPC notebook project. They have much more modern hardware than anything Apple used, but it's still older than POWER8. Have you tried some benchmarks yourself with some of the applications? Because what you showed with diffoscope was mostly SIMD instructions being optimized, so I think everything which uses those can profit from knowing that there's a different kind of vector extension." I ran a couple of benchmarks. Most of them were inconclusive. The one where I actually noticed the biggest change was LZ4. I also compiled one for x32 — x86_64 with 32-bit pointers — and I think the claim in general is that it's supposed to be up to 40% faster, but I found that the LZ4 benchmarks were 5% slower. So, other than being quite surprised by that, a lot of it really seemed to fall into: is it just a hot cache? Is there something else running in the background? Is it actually a big enough change to be worth it? I don't know if that answers it, but thank you very much for your time.
Releasing a Linux based OS: an overview of Flatcar release cycle
All right, everyone. Welcome to the next session. Just the usual housekeeping: if you're leaving a little bit early, these rows of chairs are fairly long, so try not to do exactly that. There are going to be some good sessions here, and we'll have some time for questions at the end. So let's get started. Right. Thank you. Hi, everyone. I'm super excited to be here with you today to talk about Flatcar — to talk about releasing a Linux-based OS. I hope you will learn new things and discover things, and if you have any questions, I'll be around for the rest of the day and available at the end of this presentation. Before going further, I'll quickly introduce myself. My name is Mathieu; I work as a software engineer at Microsoft. I'm mainly and principally involved in Flatcar development and every feature regarding Flatcar: for example, I'm involved in the Cluster API field, in testing the operating system, in building the operating system, and — what matters today — in releasing the operating system. If you are at this talk, I assume it's because you have maybe some knowledge of Flatcar, you're already a user of Flatcar, or you just want to discover it and you're curious about this operating system. So let's have a quick look at what Flatcar is. Flatcar is a Linux-based operating system designed to run containers — you only have the bare minimum in the operating system to run containers. The idea is that the fewer packages you ship in the operating system, the smaller its attack surface. The operating system benefits from automatic updates, which means that once you've deployed your instance of Flatcar, it will get automatic updates from the release server, and a release is done approximately every two weeks. So you can be sure to have a new version of Flatcar every two weeks.
And finally, this system is immutable, which means /usr is mounted read-only: you can't write anything to /usr, and you can't install any package — there is no package manager, no APT or whatever. That's a big difference from a day-to-day operating system: Flatcar is really designed to run containers and nothing more. So, just to show you inside the box: I tried to write something to /usr — it doesn't work, because it's read-only; that's normal, even with sudo. And if you try to use a package manager, the command is not found — for each one of them — because that's the goal. The idea is that you have to trust the maintainers and what they ship inside /usr, and if you need something more, you can ask the maintainers or the community, or you can try to find a way to install those things yourself. How do we maintain the system? Because you can't update packages yourself and you can't install any package, you have to trust the maintainers and the community. On GitHub — this is the QR code for the GitHub repository — there is this list of packages. Basically, we are security-driven, which means that each time there is a new CVE, a new issue with one of the packages shipped in Flatcar, we track it in this repository and we update the package. For example, last week we got the runc and Docker CVEs that were made public; they are already tracked, and when we release the next Flatcar — I hope this week — you should get this update. So the packages are updated on a security-driven basis, and also on a community-driven basis, which means that if one of you wants to add a new package to Flatcar, you can just open an issue: hey, I'd like to have this package in Flatcar — is that possible? And if it's relevant for the community — if people are okay with having this new package in Flatcar — there is some chance that this package will be included in the next Flatcar release.
Most of the time we try to challenge people: can you use a Docker image instead of this package, or can you just download the binary at instance boot to get your software? We always challenge with the same goal: the fewer packages in the operating system, the fewer vulnerabilities in the operating system. If you want to know what's going on in the next Flatcar release, you can join the office hours — they are held publicly every month; the next one is in February. During the office hours we go through the Flatcar release board and check which new packages will be included in the next release, so you can give your opinion and your input on which packages should be prioritized. It's always a great time for the maintainers and the community to discuss the content of the next release. That's the release board, available publicly on GitHub. And of course we ship new packages and package updates, but also bug fixes, changes, and new features in the operating system. Now we are ready to release. But before releasing, I would like to demystify the Flatcar release number a bit, because we've seen quite a few times that people get confused by the Flatcar version number. So this is a Flatcar version number, and the idea is like SemVer versioning — but not really. The first number, the 3760, is the number of days since the first CoreOS release, because Flatcar is originally a friendly fork of CoreOS — 3,000-something days is almost 10 years, so it's been about 10 years since that first release. The second number is the promotion level: are we talking about an alpha release, a beta release, a stable release, or an LTS? And finally we have the patch, or maintenance, level — the last number.
So if you have a zero there, it means it's a new major release, because no patch has been done yet for this release number. Based on this, we can play a small game and try to identify who is who. For example, the first one, 3760.2.0, is a new major stable: the zero at the end means it's a major release, the two means stable, and the first number shows how many days since the first CoreOS release. Now, who can tell what the third one, 3850.0.0, is? Is it an alpha release? Yeah. Is it a patch release? No — it's a new major alpha. And the last one, 3033.3.18, what does it mean? LTS. LTS, yeah. And it's a really old LTS, because there are a bunch of patch releases. Patch releases basically mean kernel updates: each time there is a kernel update for the LTS, for example, we just update the kernel, the CA certificates, and critical security issues like OpenSSL, but that's it. For the LTS, most of the time it's just kernel patch releases, and that's why you get a big number there. So I mentioned alpha, beta, stable — what does it look like over time? We have a new major alpha that is cut every two weeks. Then, from time to time, we decide to promote an alpha to a beta — that's what happens once in this example. After some time, that beta version becomes a stable one, and eventually it will become an LTS one. That's quite interesting, because as a user, if you run a stable Flatcar release, it means it has already been in beta for a few months before landing in stable. That's also why we encourage people to run beta nodes in their workloads, so they can identify any issues with their workload before it reaches stable. So that's the release cycle. Now, what's the release process and how does it work? Most of the time it's done in four days.
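The versioning scheme described here can be sketched as a small decoder. This is a rough illustration, not an official tool; the promotion-to-channel mapping is inferred from the examples given in the talk.

```python
# Rough sketch of decoding a Flatcar version string, per the scheme in the
# talk: <days since first CoreOS release>.<promotion level>.<patch level>.
# Channel mapping inferred from the talk's examples (0=alpha ... 3=LTS).
CHANNELS = {0: "alpha", 1: "beta", 2: "stable", 3: "lts"}

def decode(version: str) -> dict:
    days, promotion, patch = (int(part) for part in version.split("."))
    return {
        "days_since_first_coreos_release": days,
        "channel": CHANNELS[promotion],
        "patch_level": patch,
        "is_new_major": patch == 0,  # a trailing zero means no patch yet
    }

print(decode("3760.2.0"))   # new major stable
print(decode("3850.0.0"))   # new major alpha
print(decode("3033.3.18"))  # an older LTS, 18 patch releases in
```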
We never release on a Friday — a well-known rule that we don't want to break at Flatcar either. Basically, on Monday we start building the new Flatcar releases: we kick off the builds for the new alpha, new beta, new stable, and possibly the new LTS. On Tuesday, we check the status of the builds. Is the CI okay? Have the images been built successfully? We have a checklist of things to check. And we start drafting and preparing the release notes, because that's quite important when you have a new release: you want to communicate to people that there is something new, and it's interesting to know what's inside this new release. So on Tuesday we start drafting the release notes, and on Wednesday we have the go/no-go meeting. This is a meeting held publicly on a Matrix channel, where we discuss whether we should actually go forward with the release. Are we in good shape for a release, can we move forward? We basically check: is everything green in the CI, are the release notes correctly prepared? And then we decide to go or not to go with the release. Then we have the actual release, which means we take the new images, publish them on Flatcar's release servers, and generate the new update payloads — because, as I said, Flatcar gets automatic updates, so we need to generate the payloads that will be downloaded by the currently running instances. Then we have the announcement; as I said, it's important to communicate that there is a new Flatcar release. And on Thursday, we have the marketplace releases, because Flatcar is supported on multiple vendors — AWS, GCP, Azure — so we want to publish Flatcar images on those marketplaces. If we look at the release process for Monday: one of the Flatcar engineers starts the builds and publishes the links.
Then on Tuesday, we start preparing the release notes. This is, for example, from the last stable, and there are some notes — for example, a flaky test with the Calico CNI on DigitalOcean. We try to identify: is it our fault, because of the test framework? Is it something really critical? Sometimes we have to stop the release because of an issue with the new kernel that the test framework has identified. So that's the kind of note we take during the release process. After that, we have the go/no-go meeting. As I said, it's held on Matrix, and everyone — contributors and maintainers — is invited to say go or no-go for the release. It's a ping in the channels to all the members, so people can give feedback on the release status, and we decide whether to move forward with the release. When it's done, we actually have the release: it's available on the public website, and we communicate on Slack, on Matrix, on Mastodon that there is a new release available — please update and give feedback. And finally, we have the marketplace updates; this is an example with an AWS update on the marketplace. What's interesting about this process is that the community is involved at every point, always. Nothing is done in secret. At any time you can give your input, and at any time you can see the status of the release — are we close to being done, or still far off? For example, the checklist items are public GitHub issues, so you can easily see where we are in the release process. The release notes are drafted in a HackMD document, so you can browse them and send comments. And the public discussion is always on Matrix — for the Flatcar release process, but also for Flatcar development in general.
So every decision regarding Flatcar is made publicly on Matrix — as I said, no secret discussions. The only thing that is still private is the build, for now, because we still have credentials for the various cloud providers. Ideally, we would like the builds to run publicly, so people could just look at the build logs and see how things are going, but that's not done yet. What we have done is that now, if you open a pull request against the Flatcar repository, it starts a build on GitHub Actions, so you can see your logs, see if something goes wrong, and see if the CI is okay — it just builds a QEMU image and runs the tests on that image. But the release itself still relies on Jenkins; eventually, we'll go public using GitHub Actions. And I think that closes the talk. If you have any questions, I'll be around with some Flatcar team members for the rest of the day. Thanks for your attention on this Sunday afternoon session. All right. Nice warm-up question. Great. What's the elevator pitch for using Flatcar over Fedora's offering, or MicroOS from SUSE? Well, MicroOS and other such operating systems are quite similar, but Flatcar is multi-vendor, for example: you can use it on premise, on bare metal, on different cloud providers. Also, there are new features we try to merge into the Flatcar operating system — for example systemd-sysext and other things we try to leverage. And there's this: we try to do things upstream first. There was the talk about upstream versus downstream before this one, and that's the idea. Each time there is a new feature, we try to implement it upstream first, before trying to solve it downstream. So we try to be more on the community side and fix things upstream rather than downstream.
And then, speaking of fundamental differences with MicroOS, for example: you don't have the same mechanism to provision the instance. With Flatcar, we use Ignition and Afterburn intensively, which are not yet available, or experimental, on MicroOS. That's the kind of difference you can see. And if I recall correctly — I'm not sure what MicroOS is using there — but those are the kinds of functional features where you see a difference. In the end, it's the same purpose of operating system: to give the user an operating system to run containers, that's it. And since it's open source, you have the choice of which solution you want to use. Thank you. How much has changed, or has it been noticeable, since Microsoft took over — the acquisition? Thanks for asking the question. Short story: Flatcar was developed initially by Kinvolk, a company that was acquired by Microsoft two or three years ago. And I'd say it hasn't changed a thing for the development so far. The governance has always been on the community side, community-driven. I'd say it's even better in a way, because now we can be totally dedicated to this operating system and to its support. And recently — a few months ago, six months, something like that — we started to look into CNCF incubation. Basically, we would like Flatcar to find a new home at the CNCF. There is an open issue on the CNCF tracker, so you can see the status of the incubation proposal. But in terms of governance, nothing has changed, and we're still dedicated to giving users the best Flatcar experience on any cloud provider. Thank you. Other questions? Yeah. Mathieu, thanks for the talk and for the distribution, the idea, everything. I'm not familiar with the project — I'm attending just to understand what's going on. So everything is a container, right? All tools and everything are running as containers.
But I'm curious how the kernel is booted, or how the initrd is done — is that part a container or not? I don't think so. So: Flatcar is not running inside a container. Flatcar is an operating system, like Ubuntu, like Debian, like whatever. It's designed to run container workloads, so you have the very minimum needed to run container workloads: you have a container runtime, and you have the kernel modules as well. In the end, it's like any other Linux distribution: you have your kernel, you have the boot process, you have the initramfs, and then after that you have the user space. Yeah, so if I understand correctly, the stuff that was previously managed as traditional packages is now containers, right? But if a new version of a kernel is released, how is that distributed? So, if there's a new version of some software, how is it distributed to the operating system — that's the question. You just wait for the new Flatcar release, because it's immutable. If there is a new OpenSSL version, for example, you have to wait for the next Flatcar release to ship that new OpenSSL version; that's how you get the update. So it's pulled from the internet — it's not in the format of a package, right? Sorry, come again? Is it in the format of a package, or is it pulled straight from the internet? Flatcar is based on Gentoo Linux. When you build Flatcar itself, you take the sources from the various repositories using the Gentoo mechanisms, then you build the packages. Once the packages are built, they are included in the image, which is the new Flatcar. Then the new Flatcar is released, and that's how you benefit from the software update. Okay. So my question is not only technical, but also more on the political side. The history is that this is a fork of CoreOS, where Kinvolk started this, right? Then it was bought by Microsoft, but Microsoft has its own, CBL-Mariner.
So how does this fit — is the essential idea that this OS has to be used in the cloud? Yeah, thanks for the question. CBL-Mariner is dedicated to running on Azure, while Flatcar is meant to run everywhere, and it's not mandatory to run Flatcar on a cloud provider. As I said, you can run your own Flatcar image on a Raspberry Pi, on ARM64 at home, if you want to. Or if you have your own, I don't know, Proxmox — we have some people who use Proxmox to run Flatcar. So Flatcar is really multi-vendor and multi-architecture, while CBL-Mariner is dedicated to Azure and nothing else at the moment. Hi. In my previous role, we used Flatcar quite a bit for a while, but then we ran into some trouble with AI, especially around things like trying to use InfiniBand — we ran into trouble getting everything set up with Flatcar. Are you working towards making AI workloads easier to run on Flatcar? I'm not at all an AI expert — maybe Rémy, behind you, who is a Flatcar member? I'm also a Flatcar maintainer, and I've been looking at NVIDIA and GPU support in the past. We want to get better at that. It would be great if someone from the community would also help, because I have limited cycles, but it's something I — and we — care about. Just one last question, if no one else has any. Do you support different container runtimes? At the moment, we only ship containerd. But, in a non-official way, you can use Podman using systemd-sysext, which is a systemd feature that allows you to mount overlayfs images on top of the base system. So we ship a Podman sysext in an unofficial way: you can just pull this Podman extension, load it on the system, and have Podman up and running. There is a tracking issue to have this out of the box, of course.
That way you wouldn't need to provision and pull the Podman system extension yourself. Ideally, we should be able to say: if you want containerd and Docker, use this configuration; if you just want Podman, use that one. But yeah, you can use Podman — I actually did some experiments, ran things with it, and it works. Cool, thank you. All right, I think we have time for one final question if someone's up for it. All right, looks like there's no question. Thank you very much. Thank you.
An introduction to Image Builder: building up-to-date, customised operating system images the easy way
Hello everyone. I will talk a little bit about Image Builder: how it works, how the Image Builder stack fits together, and show off some of the things it can do. But before all that, I'll try to explain why it exists. Image Builder builds bootable operating system images — base images. It runs on your local machine or as a hosted service; we also run and operate a service for Image Builder. Now, building bootable operating system images is not that hard a problem. You put a few bits in the right place, with some hopefully correct default configurations, and most of the hard work is done by the package maintainers — the people who maintain the kernel, the people who maintain systemd; they make sure it all fits together. But at a certain scale, consistency and reliability are very important, because you need to build images for different purposes. I'm talking now from the point of view of a distribution. You need to build images for different purposes, for different architectures, for different target environments like AWS, GCP, Azure, local virtualization, bare metal. And you don't want these images to differ too much from one another across those different variables. You want to reason about them and produce them in roughly the same manner. You want to produce and reproduce them often, without manual interference, as part of a pipeline — so you need infrastructure around it as well. And because I mentioned target environments, specifically the cloud platforms: every cloud platform today offers something like an image builder as well — AWS EC2 Image Builder, for instance. But then you're sort of locked into what the cloud provider provides, or you just end up using their images, full stop. So it's nice to have tooling that is cloud-agnostic, vendor-agnostic.
And as a final point, image-based workloads are becoming the norm. Everybody uses containers; people build images for a single specific workload. And really, one of the things we're trying to solve for end users is to make VM images almost as easy to deal with as container images. So Image Builder was created to address some of these problems. This is the stack, and I want to walk you through it to quickly give an idea of what each component does and why it's there. I will start at osbuild, the lowest base layer on which everything is based. At the very bottom, we have osbuild, which is a low-level tool that executes a manifest. What is a manifest? The manifest describes exactly what goes into the image, and also how to package it. This manifest makes images auditable, because you have a manifest of exactly what's in them, and reproducible, since the exact steps used to build the image are described inside it as well — not just the contents, but also how to get from those contents to the actual image. It is mostly distribution-agnostic: it doesn't necessarily have a notion of what makes up a specific distribution — what is a Fedora 38 image? It doesn't really know. But of course it needs to support some package manager; it currently supports RPM and Pacman packages, so the distributions it can build are things like Fedora, Arch, CentOS, et cetera. And as a final note, osbuild starts from a pristine tree — an empty directory — and builds it up piece by piece. To clarify this, let's take a little look at a manifest. So I'm going to need to... that's maybe too much. As you can see, this is roughly what a manifest looks like. First off, you have sources, because the content needs to come from somewhere. These are just RPM packages, indexed by their checksum so that they're nice and addressable.
Here you can see three pipelines. I won't go through all of them — just very quickly, what's inside this manifest. The first pipeline actually builds a container, because we build the end artifact, the end image, inside a container. The reason is that you want the same tooling that will end up in the image to build the image itself: the RPMs in the image, you want to install using that same RPM version, because otherwise you might get into a mess. The second pipeline is the one that actually puts all the bits in the correct place in the tree. First it sets up the kernel — I think it's the kernel command line — then an RPM stage, which installs a bunch of RPMs; then it sets up the locale, hostname, things like this; and it relabels the tree so SELinux is happy. Then, in the final pipeline, it actually goes ahead and packages the image up. In this case, it just ends up being a raw disk file — the most basic disk file — and it also sets up the file systems and so on. The takeaway: pipelines and stages are a very precise way of describing exactly what should go into that end artifact. Now I need to, yeah, go back a bit. Okay, that looks better. Like I said, osbuild doesn't actually know what makes up a specific distribution — so we have the image definitions. The image definitions contain all the information needed to describe an image of a specific distribution, of a specific architecture, for a specific target environment. They describe things like the base package sets, the default configurations, how exactly architectures differ — install these packages if it's aarch64, install those if it's x86 — and all the differences across that whole matrix of architectures and target platforms are contained within the image definitions.
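As a rough illustration of the manifest shape walked through above — not a verbatim osbuild manifest; the stage names follow osbuild's `org.osbuild.*` convention, but the options are abbreviated placeholders:

```json
{
  "version": "2",
  "pipelines": [
    {
      "name": "build",
      "stages": [
        { "type": "org.osbuild.rpm", "options": { "...": "tooling for the build container" } }
      ]
    },
    {
      "name": "os",
      "build": "name:build",
      "stages": [
        { "type": "org.osbuild.rpm", "options": { "...": "install the image's packages" } },
        { "type": "org.osbuild.locale", "options": { "language": "en_US.UTF-8" } },
        { "type": "org.osbuild.hostname", "options": { "hostname": "demo" } },
        { "type": "org.osbuild.selinux", "options": { "...": "relabel the tree" } }
      ]
    },
    {
      "name": "image",
      "build": "name:build",
      "stages": [
        { "type": "org.osbuild.truncate", "options": { "...": "raw disk file, partitions, filesystems" } }
      ]
    }
  ],
  "sources": {
    "org.osbuild.curl": {
      "items": { "sha256:…": { "url": "https://example.org/some-package.rpm" } }
    }
  }
}
```

The three pipelines mirror the talk: a build container, the OS tree, and the final packaging step, with sources keyed by checksum.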
The image definitions integrate tightly with Composer. If you remember the stack from earlier, now we're at the gray layer: we had osbuild, the image definitions, and then Composer. And Composer is really the part that brings it all together. It takes user input, in a format I'll get into shortly; package repositories, as a source for the packages — the kernel needs to come from somewhere; and the aforementioned image definitions, to generate the manifest, which is then provided to osbuild. So Composer takes all the input from the necessary places, generates a manifest, and then osbuild executes that manifest. Because, as you probably saw, the manifests are okay to read, but if you had to write them by hand, that would not end well — it's a tiresome job. As a final point, Composer also orchestrates all the builds: it manages the job queue and workers, so you can queue image builds — image builds often take a long time. That's an important point for being able to run this as a hosted service, as part of infrastructure, et cetera. So, what does Composer need? As you can see, it's already a lot simpler than the manifest we had before. Say I want to build a Fedora 39 image: I just ask for a Fedora 39 image, x86, qcow2. I provide it with some repositories, which are just the default Fedora 39 repositories, and optionally some customizations — in this case, I also want cockpit installed; I want the base system with cockpit. And that's it. That's all Composer needs. It will take it, grab the image definitions, figure out what the manifest looks like, and then pass it to osbuild. Okay, so how do we make this even easier, and how do we actually give this to end users? I'll walk through two tools that we have.
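The Composer input described above might look roughly like this — the field names are illustrative, not the exact Composer API, and the repository URL is a placeholder:

```json
{
  "distribution": "fedora-39",
  "architecture": "x86_64",
  "image_type": "qcow2",
  "repositories": [
    { "baseurl": "https://example.org/fedora-39/Everything/x86_64/os/" }
  ],
  "customizations": {
    "packages": ["cockpit"]
  }
}
```

Everything else — the pipelines, the stage options, the package checksums — Composer derives from the image definitions and the repositories.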
And then I will show off the hosted service we have running. I just realized I didn't spin up a VM, which I need to do. Okay — composer-cli. Composer-cli takes a format called a blueprint. A blueprint describes, again, how to customize the image. As you can see, there's nothing anymore about which architecture or which distribution you want. This is intended as the on-premise tool, and all of that is inferred from the host: if you're running on a Fedora 38 system, it will build Fedora 38; if it's an ARM system, it will build ARM, et cetera. All that's left is customizations. And as you can see, it can be quite powerful. Here — I sourced this from a colleague of mine — it puts a little systemd service in place, then it asks Image Builder to also enable that systemd service, and that systemd service sets up a second disk on boot. It also embeds a user. So yeah, what's left here really is just the customizations. And this is how you would then push that blueprint. The workflow is a bit cumbersome, but you push the blueprint, you start a compose from the blueprint, and you ask for the image type, which in this case is a qcow2. An important point: you use that same blueprint to build a qcow2, build an installer, build a cloud image — you really just have to specify what you want in common, and Image Builder takes care of the rest. There's also a little Cockpit application. So can I... ah, there we go. This is a Cockpit application which allows you to define blueprints and build images. It's again targeted at on-premise use — as you can see; I probably should have removed this question. It allows you to define the blueprints I showed earlier and build images from them.
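A blueprint along the lines described above might look like this — a sketch in the blueprint TOML format as I understand it; the unit name, user, and key are made up for illustration:

```toml
name = "second-disk-demo"
description = "Base system with a service that sets up a second disk on boot"
version = "0.0.1"

[[packages]]
name = "cockpit"

# Enable the (hypothetical) unit that the blueprint also drops into place
[customizations.services]
enabled = ["setup-second-disk.service"]

# Embed a user
[[customizations.user]]
name = "demo"
key = "ssh-ed25519 AAAA… demo@example"
```

Pushing and starting it would then be something like `composer-cli blueprints push second-disk-demo.toml` followed by `composer-cli compose start second-disk-demo qcow2` — and the same blueprint works unchanged for the other output types.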
I think if you click, right-click here, you can see some of the — okay, the blueprint, an output type; there's a bunch of output types that we support. Voilà. So now — the main reason I actually did this talk: we also run this as a hosted service for Fedora. For a certain hat company, we've been running a service for a while now, but we're also figuring out how you actually run services — software as a service — for the community. How do you involve the community? We want community users, we want community feedback; they often use software in different or interesting ways, so we wanted to offer support for that. If you go to console.stage.fedorainfracloud.org — it's currently just a staging service, but production is coming soon — it will tell you: this is how you make an account, this is how you use the API, et cetera. Currently it's just an API, but a UI is also on the roadmap. What's currently supported in the staging service: KVM, vSphere, AWS. On the right-hand side are all the images that we currently build; we just need to set up some stuff in the service to enable and expose them. Also just x86 for now, but ARM is relatively easy to add, and in the production service we'll definitely have ARM as well. So what does a request to this hosted service look like? At the very start, right before the talk started, I actually sent off this exact request to the image building service. This is what a more complex request looks like: there's the distribution — I'm asking for Fedora 38 — x86, it's a qcow2; don't mind the name, it's a bit weird, but qcow2; please upload it for me to AWS S3 so I can download it afterwards.
We share images with a pre-signed URL, which is valid for a couple of hours. And then we get to the interesting bits: can this hosted service — for which you don't need any setup on your local machine — also integrate with other Fedora services that are currently available? Perhaps some of you know Copr: a very easy way to build your own RPMs; you just upload your spec files and sources, and it does the difficult bits for you, and even hosts your RPMs. So here I'm asking: can you also, in /etc/yum.repos.d or wherever, make this repository available so that I can install stuff from it? That's this customization. Then, install these packages: I want cockpit, I want firewalld, because, I don't know, and npm as well. Why? Because I have this awful startup script, which installs reveal-md — which is the thing that's running these slides now — installs the demo slides, which come from that demo Copr up there, changes into it, and runs reveal-md. Yeah — it's a VM that runs slides. And then the second thing is: set up a service for me. It's a one-shot type of service, it runs that startup script defined above, and it runs after the network comes online, because it's a service that starts a server. Then I want to enable the cockpit service and the reveal-md service — that's the little service I defined above. And then, for the firewall customizations, I want to open this specific port, because that's what reveal-md listens on, and I want to enable the cockpit service as well. So when the machine boots up, cockpit is enabled, and these slides will hopefully be hosted.
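Pieced together, the customizations read out here might look roughly like this — the field names approximate the hosted-service request format, and the repository URL, script path, port, and unit name are placeholders:

```json
{
  "customizations": {
    "payload_repositories": [
      { "baseurl": "https://example.org/copr/demo-slides/fedora-38-x86_64/" }
    ],
    "packages": ["cockpit", "firewalld", "npm"],
    "files": [
      { "path": "/usr/local/bin/startup.sh", "mode": "0755",
        "data": "#!/bin/sh\n# install the slides from the Copr repo and run reveal-md…" }
    ],
    "firewall": { "ports": ["8000:tcp"] },
    "services": { "enabled": ["cockpit.socket", "revealmd.service"] }
  }
}
```

The one-shot unit ordered after network-online would itself be dropped in as another file entry; this sketch keeps it to a comment.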
Right, so let's go back to the terminal. That's the same request; I sent it off earlier and it was building, so let's see what happens now. I hope it didn't fail, because that would — okay, no, the build succeeded. As you can see, there's this monster of a pre-signed URL, which is technically a secret; it's valid for six hours, so you can just download it. And that's it — it's really that easy. There's a whole bunch more customizations available than the ones listed here, but once you've read the spec a bit, you figure out how to write a JSON file, and then you can get a whole bunch of images out of it for free. So let's run that image. This is how I'm running it — you can't see that, so I'll just go ahead and do that off-screen. How am I doing for time? Good. Okay, so I've now booted it up. I asked it to install cockpit, and — okay, yeah — let's just take a look at the networking. I asked it to enable the cockpit service and also expose this additional port; it has done that, super. Then I go to Services — these are all my systemd services — and I look at reveal-md. That's still starting, and it might take a while, because it's actually installing an npm package over my phone. Hopefully that will do something in the meantime. I think in the meantime we can maybe start with a question, if somebody already has one — I still want to show a little thing at the end, but for that I need this to kick off. It's really the most exciting talk ever, isn't it — we're all staring at some logs. Thanks for the talk; I have two questions. The first is to understand the architecture: the Composer thing, it's like a daemon or service, right? That's it, yeah — Composer is the thing that
runs as a hosted service and orchestrates everything, basically. Okay, so I understood that correctly. The second thing is the very expected question — it's not really creative: how hard would it be to support, let's say, Ubuntu, or something like that? That's the very expected question, thank you. So I've actually experimented with this a little bit already. First we would need to add support in osbuild — write some stages that can handle deb packages and set up, what's the thing, debootstrap. Then you would need to add an image definition: okay, what is an Ubuntu image? And then you would need to add it to Composer, to expose it a little bit. I've experimented with it before; I got it to the point where we can build a bootable Debian image — not UEFI — so we've got it to that point. It just requires some more work and cycles. Theoretically it can do it; it's just a matter of some work. In the meantime, look — it seems to have done something. All right, great, the slides are up, so let's go to that. This is the most efficient way to host image-based workloads, I think: single slides. This is where you can find us — this is our GitHub project; we have a Matrix channel, please come say hi; and we have a website. If you go to Service, and then Fedora, you can read a bit more about how the architecture fits together, and you can find instructions on how to use it if you want. Please do — there are currently only two workers attached to the staging service, so if it's not DDoSed two hours from now, I'm going
to be disappointed. Thank you. All right, any more questions? So, I'm a bit hazy on the architecture, so maybe the question doesn't make much sense, but how feasible is it to do all this locally? For example, I want to build a distro, then tweak something, then build another image and run it locally, in a tight loop — the whole process you described, starting from the definitions with local overrides, all the way to the image, downloaded and built locally. Can you do it all on a laptop on a plane? Yeah, so — let me go back a little bit — we have cockpit-composer. This is essentially the same thing, right? Or maybe I'm not answering your question, but you can do it all on your own laptop. Of course, your own laptop can't do cross-architecture builds, currently. Is that...? So basically I would run the service locally and then talk to a web server on — it's a Unix socket in this case. You can also set up the full service locally, but that's not necessarily supported very well — it's all there, though. When you install cockpit-composer, or sorry, when you install osbuild-composer on your Fedora machine or whatever, it's all shipped. Thank you. Yeah. Hi. Probably an annoying question, but under the "soon" there were ISO installers. Can you be a bit more specific about what "soon" means in this context? Because that would solve a use case that I have — really creating a bootable USB drive that a technician can plug in. So, yeah, you're right: the ISOs here, the installer and live installer, are absolutely intended for bare metal.
So those are ones you would burn onto a USB or DVD or a CD, plug in, and they have the Fedora Anaconda installer around them. And is there a specific Fedora release that's being targeted? Oh, this has all been in Fedora since Fedora 34. Okay, so it's been there for a while. But we're still actively working on it and making it better, with more customizations — like that files customization, where you can basically have an entry point as, you know, a one-shot systemd service. That was a more recent thing we were trying to do. And there are more customizations: I think you can set up your file system a little bit, partitions and stuff like this. Okay, then I'm missing the "soon". Thank you. Yeah. But try it. Yeah. I think we have time for one more. So maybe I misunderstood something, but the whole demo, everything was done locally, right? What you showed us on your local laptop — which part was in the cloud? I thought everything was localhost. Everything was localhost, except when I switched to here, in the terminal. Right at the start of the talk, I sent a request to — wait, hang on — oh yeah, here up top, I sent a request to the Fedora staging service that we're running now. And that built an image and then spat out a URL. Now, of course, I didn't download this one; I built the same thing earlier and downloaded it then, to run it now, because it's like 700 megs and the internet is not that good. But let me just show — the /composes endpoint shows you all of the composes that I've done, and it's basically all the same thing for this talk.
So I think I might have downloaded this one before. I mean, this one — the actual image for the VM that I booted up — was not built on this laptop; that was built in the cloud. Yeah. And I could also, if I have a powerful machine, do it locally, right? So do you mean if you want to build the image locally? Yeah. Yeah, you can do that. That's done with cockpit-composer, I think, which is the easiest way to get started. That would be a Cockpit app, I think they're called. Okay. Thank you. You're welcome. I think we're done with time. Thank you.
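For the local workflow discussed in the Q&A above, here is a rough sketch of what talking to osbuild-composer's local API could look like. The socket path `/run/weldr/api.socket` and the `/api/v1/blueprints/list` endpoint are assumptions about a typical Fedora setup; in practice `composer-cli` is the usual client and hides all of this.

```python
# Hedged sketch: osbuild-composer serves its (weldr) API over a local UNIX
# socket; the socket path and endpoint below are assumptions, not guarantees.
import http.client
import json
import os
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a UNIX domain socket instead of TCP."""

    def __init__(self, path):
        super().__init__("localhost")
        self.unix_path = path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.unix_path)


def list_blueprints(socket_path="/run/weldr/api.socket"):
    """Return the blueprint list, or None if no composer is running locally."""
    if not os.path.exists(socket_path):  # no local osbuild-composer
        return None
    conn = UnixHTTPConnection(socket_path)
    conn.request("GET", "/api/v1/blueprints/list")
    return json.loads(conn.getresponse().read())
```

On a machine without osbuild-composer installed, `list_blueprints()` simply returns `None` instead of failing.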
Homebrew's Evolution
That's a very nice soothing start to the talk, just people saying shh. As some of you may know, I really like to start talks with raising hands. So put your hand in the air if you use Homebrew. Lots of people, cool. Put your hand in the air if you've contributed to Homebrew. This clump over here will make sense for the next question. Put your hand in the air if you maintain Homebrew. Put your hand in the air if you're concerned about what happens if there's a CVE during this talk and no one is able to merge a critical PR to fix OpenSSL — because all the maintainers are here. Yes, good. Thank you. Yeah, so a little bit of background for you folks. Let's see if this is working. There we go. Oh, sorry. No, this is Homebrew, we're Mac people here. Okay, there we go. So I forgot, Homebrew doesn't actually support this version anymore. No, back to that one. Oh, there we go. Okay, that's fine. Homebrew supports this one. Sorry, the jokes don't get any better from here, only worse. Hi, I'm Mike McQuaid. This is my almost yearly tradition at this point: a sort of state-of-Homebrew talk at FOSDEM. The distributions devroom kindly lets me come and do this here, even though Homebrew isn't really a distribution, but it feels like the least square-peg-round-hole situation at the conference. You can find me at various places on the internet if you want to talk or ask me things, during or after or whatever. I'm currently the CTO of a startup called Workbrew, which is trying to do some interesting stuff around Homebrew — I'll talk incredibly briefly about that at the end — with two former GitHub people. I spent 10 years at GitHub, which I left as a principal engineer last year, and I'm Homebrew's project leader, which is something I have to get elected to every year. No one has ever run against me, so please, someone do that and set me free from my life of enslavement to an open source project that I suffer for.
And I've maintained Homebrew for apparently 15 years this year, which is a little bit worrying. So I'm going to talk through some stuff we've done in the last year or so. Some of it may be new to you, some of it will not be. None of it will be new to any of the maintainers; I don't know why they're here, but hopefully they will just laugh at my jokes anyway. The first major thing: I don't know if any of you noticed — how many of you run brew update or noticed Homebrew updating? Lots of people complain at me about how Homebrew does this automatically without being prompted. You can opt out, but please don't. This should have got, for most people, most of the time, a lot faster over the last year. And the main reason is that we have stopped using Homebrew's GitHub repositories as the main data source for Homebrew. When Homebrew was first created in 2009, one of the relatively innovative things it did was to use Git and put essentially all the data in a GitHub repo, and then, instead of building some complex update information system pulling from some server that someone would have to host, it just, essentially, runs git fetch in the background. And Homebrew has had a long-running battle with this — a little bit of a battle with GitHub, and more of a battle with its performance characteristics. Homebrew Core, the main Homebrew repository for all our formulae, for all our packages, has grown and grown over the years. We've had over, I think, 11,000 contributors, millions of commits, hundreds of thousands of pull requests at this point. And as a result, it is very, very, very slow to do almost anything related to Git. Particularly with git fetch: a no-op git fetch was, probably at its worst, taking about 30 seconds just to say, no, actually, you don't have any updates required at all.
So when I was lucky enough to be simultaneously working on Homebrew and GitHub, I added a call to the GitHub API that was there specifically to try and make brew update a bit faster. So you could go to the GitHub API and it could quickly respond: hey, don't run git fetch, you don't need to, it's going to be really slow, and you don't have any changes anyway. A few other package managers use that now as well, which makes me happy. But over the years, lots of people at GitHub have grumbled about using a Git repo as a CDN that's nicely globally distributed, and I believe at our peak we had a couple of GitHub servers that were essentially dedicated purely to people fetching from Homebrew Core. So eventually — after leaving the company; it's kind of weird that it took me leaving the company to actually make my former coworkers happy — we, with a bunch of work from other maintainers, moved over to essentially just curling a JSON file off the internet. So instead we have a roughly 15 meg, I think, compressed file for Homebrew Core, and one for Homebrew Cask. When there's an update, we don't have any sort of clever binary diffing or anything, unfortunately, so we just download the whole thing again. But that seems to be a lot faster for most of the people, most of the time. And we still, optimistically, will be able to make it faster in future. So in case you didn't know, Homebrew has a JSON API. This is basically the basis of what we're using. We've had to add some bits and pieces and move things around, and one of our maintainers here added nice signing to it and stuff like that, so that we could meet the security requirements and the performance requirements we wanted for this new API way of downloading. Our API is actually really, really fast, because it's hosted on GitHub Pages.
So if you've ever had the idea of statically building your API: it's incredibly painful in some respects, but also kind of fun in other ways. But yeah, don't dig too deep into how that's implemented, because it's pretty disgusting. Another thing, somewhat relatedly: if you have set any of these variables in the past — commonly people set these because Homebrew was updating too often and it was too slow and annoyed them, or, shortly after we rolled out the API stuff, a bunch of people opted out because it was a little bit buggy — consider un-setting them for a little bit. And then if things are still annoying for you, feel free to set them again, but you might have a better time without them than you used to. Similarly, if you still have these repos on your disk, you can now untap them, and you will get much more space back, and generally your updating could be a little faster and happier. The other relatively big thing we did in the last year — not super exciting for everyone — concerns our analytics, which were hosted by Google for a very long time. We had a lot of people who didn't like us having analytics at all, and I chose to ignore those people, because we need them to be able to do our job, unfortunately. But a concern we did hear again and again was: hey, we don't mind you having analytics, but we're a bit concerned with all this data going to Google. If you look at the analytics docs, you can opt out of certain data collection, but that relies on trusting Google to do what they say — which I kind of do, but I understand not everyone does. So we've now moved to a nice cloud-hosted EU instance of InfluxDB, which means we're gathering essentially the same data we had before, but we're not tying it to individual users.
We don't have the ability to do things like capture IP addresses even if we wanted to, and that makes everything a little bit nicer. So we've now destroyed all of our existing Google Analytics data, which means that if you want to know what Homebrew usage or user counts were like two years ago, tough luck. But the new analytics system automatically deletes data after 365 days, so this should give us a nicer, slightly more privacy-focused approach in future. And the other thing that has been kind of a principle with our analytics is trying to have it public. People may not trust us gathering analytics — I understand it's a touchy point in the tech industry with privacy and all this stuff nowadays — but we do try and make all the information we gather public. So we've got these pages under formulae.brew.sh/analytics, various pages of the analytics we gather. We've got a few more things there than we used to be able to have, and you can see the download counts, percentage counts, all this type of stuff. And basically maintainers don't have access to any more information than you do. A handful of people can access our InfluxDB console directly, but the data in there is in such a messy, horrible format that no one is querying it directly. They're all just using the same web pages you and I might use, which means, from a privacy perspective, we're all kind of on the same page, whether you're a user of Homebrew or one of the people maintaining it. So, again, another thing to stick your hand in the air for: who considers Homebrew to be slow? Yeah, a few people. Put your hand in the air if you feel like it got faster in the last year. Mostly just the maintainers who made it faster, so... It's all right, you still count, I value you.
So this is a relatively common critique we hear about Homebrew: it's slow, or why does it upgrade all my things all the time, and things like that. So we are working on this; it's a background, medium-priority thing for us that we've been considering for quite a while. In the last year, brew update mainly got faster from the API stuff I mentioned before. For brew upgrade, in certain cases at least, we can now upgrade fewer of your dependencies than we used to. This is a little bit of a hack, but I'll talk later about how we might be able to make it better going forward. And then similarly, around brew fetch, some of our maintainers noticed there was a bunch of work happening that didn't need to happen. So if you do find Homebrew to be a little bit too slow, be relatively confident that we do feel your pain and we are trying to make things faster most of the time. A really weird performance optimization we decided to do, considering everything I've said before: I don't know if anyone who's not a maintainer ever went and clicked around on the repo pages on GitHub, but due to the Git issues I mentioned earlier, a lot of those pages would time out and stuff like that. And another thing that Git and GitHub people who knew a lot about Git had said to us for a while is that, due to some complicated Git internals that I don't really understand, we had structured the Homebrew repo in pretty much the worst possible way for Git performance.
Git apparently really does not like having directories with thousands of files in them, and we had, I think, a directory with 8,000 files, something like that. You could see it on the GitHub interface, because all the operations that list that directory — if you did a git blame or git log on it — would time out, which meant increasing amounts of the GitHub user interface were just not useful when you were using Homebrew. That also contributed to why git fetch was so slow, why git gc was so slow; opening PRs, the pushes and the pulls and all the stuff involved were getting slower and slower. We were also seeing more incidents with GitHub that GitHub didn't seem to think were related to this, but I kind of did. So we've now sharded our repos: essentially everything is split into directories based on name, and because we have quite a lot of libraries, lib gets its own special directory — it doesn't get bundled in under l. We've done the same thing for Homebrew Cask as well. As I say, GitHub had been wanting us to do this for ages, and we've finally done it now, and that means on these pages you can actually see the commit information and timestamps and all this type of stuff, which makes them a bit more useful than before. So, a more exciting thing for us: we moved to using Ruby 3.1. Who knew that Homebrew was written in Ruby? It's this widely known thing, yeah, cool. Homebrew was originally, I think, on Mac OS X 10.5 in the first version, and back then Apple provided loads of stuff with the OS, including Ruby — 1.8 or whatever I think it was at the time — and Homebrew, particularly in the early days, tried to use as much stuff from the system as possible and not pull in its own libraries.
We still try to do that where we can, but Ruby was a case where Apple said a few years ago: okay, we're deprecating the system versions of Ruby and Python and, I think, Perl, and stuff like that. And with Apple deprecating this stuff, we've sort of been playing chicken, saying: well, you say it's deprecated, but you keep upgrading it for us, so we're going to just keep using your version as long as we can. Eventually we went to some Apple people before the last release and said: hey, the Ruby you supply is 2.6, that's really old, when are we going to get a new one? And they said: did you not read when we told you it was deprecated? And we were like, yeah, but, yeah, but please. And they said: no, this time we mean it. So finally — we've always had our own thing we call portable Ruby, which gave us a way to distribute a Ruby that you could install anywhere on your system. It worked regardless of where your Homebrew is, and it would work on a variety of Mac OS versions and stuff like that. That has now moved to Ruby 3.1, so now essentially everyone on Mac OS at least — on Linux there are some configurations where you don't need this — has portable Ruby, which supplies a nice, relatively new version of Ruby. This is nice for us: it's had some mild performance benefits, it lets us use newer language features, it makes Homebrew easier to maintain, and it means Ruby users coming to Homebrew don't have to get used to an ancient version of Ruby. And then there's stuff like Sorbet and RuboCop and all the other libraries we depend on that were creeping towards deprecating Ruby 2.6, or had already done so. So it lets us keep more up to date and stuff like that as well, which is very nice. We've also released an official Homebrew Mac OS package.
This is another thing that's been requested for a long time; people have a love-hate relationship with the installer. I think Homebrew was one of the first projects to do the whole "curl this bash script into your terminal and we'll install it that way" thing. Who has security concerns about that approach? Almost everyone, good. We're going to keep doing it, so yeah. All right. But if you don't like that, you can use this instead. This is the more standard installation process you would expect, where you get a nice installer and you click through these things, and you should end up at the end with essentially the same stuff — it prints the same messages for you as the bash installer — but you can do this through MDM tools and things like that. As I mentioned earlier, I've also been working on a few little bits which are not strictly Homebrew related. I've been working on Workbrew, where we're building some closed-source stuff on top of Homebrew, to try and find a balance. There have been a bunch of things — the package installer being an example — that people asked for over the years, where some people wanted to get involved and built them, and that's all fine. Whereas with Workbrew, there's been a bunch of stuff that people have asked for over the years that the Homebrew volunteers don't want to do, so: okay, well fine, we can do some of this stuff for you for money. So we have our own package here now, which does a few more things than the Homebrew one does. I'm not going to go on about Workbrew too much, but if you are interested, go and have a look at our website — there's a little demo of what we're doing, and we're recruiting people we want to work with on this stuff. So get in touch. But back on Homebrew: looking forward to the next year.
So we meet together as a Homebrew group each year, so I'm not entirely sure what our roadmap is — we're going to try and decide some things tomorrow as a group, figure out what we see as the most important things — but some ideas have been floating around, and I have open issues for some of them, around things like handling conflicts better. There's this ability for packages in Homebrew to conflict with each other, which means you can't have both of them installed at the same time. That's kind of a pain; it doesn't really work very nicely, so we're hoping to improve some of that. There are also inherent conflicts between casks and formulae. Who feels like they understand the difference between casks and formulae? Okay, only the Homebrew maintainers, great. So Homebrew had this somewhat alternate approach — integrated with Homebrew, but kind of its own separate ecosystem — called Homebrew Cask, which merged into Homebrew proper a few years ago. Homebrew, at least in the official repo, is all about taking open source software: we build it from source, we give you binary packages, and we ship those to you. Homebrew Cask is a little bit different: that's for distributing proprietary software, where the upstream supplier of the software provides the binaries, and we download those and install them for you. So, for example, wget might be a formula, because we can download the sources and build it from scratch, while something like Google Chrome, or Zoom, or whatever, would be a cask.
So there are some cases in which there's both a cask and a formula for the same thing. Docker, for example, is both an open source project where you get some nice binaries you can build from source, and also all the GUI stuff and whatever. And if you install the Docker formula and the Docker cask at the same time, things get angry and start shouting at you, and it doesn't work very nicely. So that's something we're probably going to try and make better this year. Another thing: we're continuing to work on our API stuff, trying to make it smaller and faster, and considering ways to make the updating experience more pleasant for people. Also, as someone who's been consuming the Homebrew API a lot recently: it's pretty crap. It was originally created in the relatively early days — 2013 or something like that — and we've just bolted on bits, to the point where it's got six arms and three legs and they're all the wrong shape, and it's, yeah, yuck. So hopefully we can release something this year that's a little bit nicer for people trying to integrate with Homebrew. And then there's the stuff I mentioned earlier about upgrades. Part of the reason Homebrew is often upgrading everything all the time — and people get grumpy, because that's really slow — is that we don't have a good way of figuring out what upgrades are needed and when. Historically we had a conservative approach: if there's anything new anywhere in your dependency tree, we will try and upgrade everything, every time, just to be safe. But then we realized: well, you upgrade a ton of stuff all the time, and that makes people sad and angry on the internet, and all this type of stuff.
So what I mentioned we did last year was basically to say: well, we can infer a little bit from the way the binary packages were built. This binary package was built with OpenSSL 1.1.1, and now we have OpenSSL 1.1.2; we know this package doesn't need 1.1.2, so we don't have to upgrade it, yada, yada, yada. But a lot of the bigger, proper package managers and distributions have actual ABI tooling — ABI stands for application binary interface, essentially which libraries you can link against and change the versions of without breaking things — and they have a lot of tooling around that stuff that we could adopt. Similarly, even with our existing tooling, we could make this stuff a little more explicit, which would mean we don't need to upgrade as much stuff as much of the time. But because we're an open source project, maybe what we do in the next year will be something we haven't thought of yet — something we think of because someone in this room has a good idea in a pull request, or you file a bug report that makes us think of something smart and then we go and do something in a clever way, or you file a really well-written feature request that inspires us to do something cool. So I really encourage you, even if you've never been involved in an open source project before: we're generally — myself excluded — a fairly friendly bunch, and we will all try and help you get involved with Homebrew and help you along the way. Particularly with something like a pull request: if you have an idea and you think you can make it happen and you can write some code in some sort of form, even if it's only 10% of the way to working, feel free to open a pull request and just say: hey, this is what I tried, this is what I need help with, and then we can help you along the way.
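The build-metadata heuristic described above — "built against OpenSSL 1.1.1, 1.1.2 is available, that's compatible, so skip the rebuild" — can be sketched as a toy version check. This is not Homebrew's actual logic; real ABI checking inspects exported symbols, and the "last component only" rule here is purely illustrative.

```python
# Toy sketch of upgrade pruning. Assumption: a bump in only the last
# version component is ABI-compatible; anything earlier is breaking.
def abi_compatible(built_against: str, available: str) -> bool:
    old, new = built_against.split("."), available.split(".")
    return old[:-1] == new[:-1] and int(new[-1]) >= int(old[-1])


def needs_rebuild(pkg_deps: dict[str, str], installed: dict[str, str]) -> bool:
    """True if any dependency the package was built against is no longer
    ABI-compatible with what is installed now."""
    return any(not abi_compatible(built, installed[dep])
               for dep, built in pkg_deps.items())
```

With this rule, a package built against OpenSSL 1.1.1 is left alone when 1.1.2 lands, but would be rebuilt for a hypothetical 1.2.0 — the asymmetry that lets an upgrade touch fewer packages.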
It's often much easier to talk about the code than it is to talk about the ideas about the code beforehand. We're not the type of project where every pull request needs an issue opened beforehand; we believe in discussing the code whenever you can, rather than discussing some abstract conception of what the code might look like when someone decides to write it. So I think we've got a little bit of time for questions now, and if you don't feel comfortable asking questions in this format, feel free to ask me anything privately — I'm on Mastodon and Twitter, and you can email me as well. And yeah, thank you very much for having me. APPLAUSE Are there any questions? Oh, all right. Just going to ask, where's the... Oh, the beer costume. Okay, so for anyone who was here last year: I was wearing a head-to-toe beer costume, because I love my Homebrew maintainer friends, but they're not always the most organized bunch. Someone posted a picture before FOSDEM last year saying: here's a beer costume, wouldn't it be funny if we could make Mike wear this, lol? And I was like, yeah, basically challenge accepted — you're not organized enough to make that happen. And unfortunately they were, and I had to wear a beer costume. There are pictures on the internet. Don't look for them. Thankfully they were not organized enough to bring it this year, so that is why I'm not wearing the beer costume. And shame on you, sir, for reminding people that it exists. LAUGHTER Any more questions? Awesome. Thank you, Mike. APPLAUSE
DNS for I2P: Distributed Network without Central Authority
Okay, let's do it like this. Thank you very much, Peter, for all the efforts for the I2P devroom — and by the way, do you hear me back there? Yes, lovely. Okay, right. I hope the sound check is good now and we're not muted anymore. I'm one of the I2P guys, and I'm talking about fully distributed networks and their specific problems. Fully distributed means truly fully distributed, so today we're also talking about systems without any trust involved, at least in theory. All right, hands up please: who's familiar with, and is using, I2P? Yes. Oh, I love you guys, that's really awesome. We have one third who are familiar with I2P, so I'll really rush through the I2P part. But then I'd like to talk about my depressing last 12 months, which gave me a really hard time implementing a persistent storage layer based on I2P, and I'll tell you why, and about all my failures and problems, and yeah, I'll complain a lot. And we're talking a bit about Byzantine fault tolerance and the good and the bad of the past year. Right. Diva — I'm working for Diva Exchange, but it's only an association, based in Switzerland. I'm sometimes a lecturer at the Lucerne University of Applied Sciences, where I talk about microservices and fully distributed trustless systems and stuff like that. But I'm singing nobody's song, so I'm really totally, completely independent, and so is Diva Exchange. We're not some coin guys or token guys — which doesn't mean that's bad, we're just not like that. So, hello I2P network. I2P is well known as a darknet, because the media talk about it as a darknet, which means — and we'll talk about it later — that it has something to do with confidentiality and anonymity. But in the end it's an overlay network.
So we have the existing internet, and on top of that we place software routers that pack the traffic into packages, repackaging them, encrypting them, and sending them over several hops and routers through the network. Like this we get a confidential and anonymous message transport. I2P is no storage layer. Whatever you hear about the darknet — that there is content stored there, etc. — that's not true: I2P is not able to store content by itself. There are storage mechanisms like the InterPlanetary File System, which is linked to Filecoin, and those are storage layers. But these storage layers do not necessarily feature confidential and anonymous transport; often they even fail at implementing such a layer. Six, seven months ago we made a study at Diva Exchange, and we were interested, obviously, in how big the latency of UDP package transport on the I2P network is. And as you can see, it's slow, really slow. And that's the price for privacy. Anonymity and confidentiality are not free; there is a price attached, and this price tag within the I2P network is time. It's slow. Maybe — but that's a theory, and we need to look into it at the university — maybe with a strongly increased number of routers we could increase the bandwidth. But that's just maybe; I don't know. We have to do scientific research on that. But this is the current state. Now, a darknet, an overlay network like I2P, has cryptographic addresses. They are public keys — often it's a hash of a public key. And these B32 addresses, long cryptographic strings like the one up here, are not human friendly. I will not talk about the so-called Zooko's triangle: at 6:30 this evening in this room there will be a presentation about that topic, which is for sure also highly interesting. But we have these hashes and we need to map them to human-friendly names. That's the job we have to do in such a network, and that's the motivation: we need a DNS. But the only thing I2P really has is a local address book.
So each router, each node — there is nothing like a central authority — each private node has its own lookup key-value store, called the address book. There you have a friendly name like diva.i2p linked to a hash, or to a B32 address, to simplify things. And if I'm loading somebody else's address book, that's a joke, because it's a trusted operation, and within I2P we usually say we trust no one; it's trustless. So obviously we cannot just load address books from somewhere. Additionally, within the I2P network, if you look at the specifications and at how the network works today, we do have jump services, we do have registries of a kind, but all these services are again a delegation of trust — nothing we really want. And as you can see, ladies and gentlemen, I'm really critical towards the I2P network: I see the central components we have within this network and I criticize them. Not criticizing in a negative manner — I'm rather trying to make myself, as a developer, and also the other developers, aware of these central components. Right. Now Goethe, in German: "des Pudels Kern" — the core of it all. Why are we doing this at Diva? Why am I doing this? I want to have a service, a storage service, and hence a DNS service, which is (a) fully anonymous, (b) immutable, (c) really barrier-free. And barrier-free is an interesting concept if you start to think about it. A coin — whatever it is: Filecoin, Namecoin, Monero, Bitcoin, Ethereum, I don't care — is not by definition barrier-free, because, well, you have to acquire it somehow. So there is a barrier. Barrier-free, in the meaning of Diva Exchange, means you have a very low hardware requirement to enter the network, just to drop a name: a Raspberry Pi or any other low-power device, and ta-da, you're a member of the network and you can store stuff. And if the barrier is that low, by definition, the spam will be high.
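The per-node address book described above is, in stock I2P, essentially a hosts.txt-style list of `name=destination` lines. A minimal parser sketch — the exact file-format details (comments, blank lines, case handling) are assumptions for illustration:

```python
# Minimal sketch of parsing an I2P-style address book (hosts.txt format
# assumed: one "name=destination" entry per line, '#' starts a comment).
def parse_address_book(text: str) -> dict[str, str]:
    book: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        name, _, dest = line.partition("=")
        if name and dest:
            book[name.lower()] = dest  # friendly name -> destination key
    return book
```

The point of the sketch is what it does not do: there is no lookup against any remote registry, only a purely local mapping — which is exactly why importing somebody else's address book is a trust decision.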
So we have to think about a cost function, but the question is what this cost function is going to look like. We're going to discuss this in a minute. And trustless: again, I2P has been built, architected, engineered over the last 20 years as a trustless system. Trustless means I really only need to look at my own node, and either my node is right or it's wrong. I don't need to care whom I'm connecting to, because all incoming data I have to verify myself. If I'm not doing the local math, I'm trusting somebody else, and that's a bad idea in the context of I2P. Trust: I can tell you, trust me, the earth is flat. We all know that the concept of trust means I may be believing in a wrong set of root data, a made-up set; it's just invented. And if I start to invent root data, I can prove anything, because the root data is fake. I don't like that word: the root data is made up. Now, if you're building your system on trust, your system will grow, and we know in IT, at least from my specific scientific point of view, that the larger systems grow, the more problems we have in them, because we need to introduce regulation so that the trust is not abused. More regulation means later even more regulation, and so it gets more and more complicated over time. One of the typical solutions, at least what I lecture about, is: keep your system small. So base your decisions on math, base your system on math, keep it lean, and at the end add a cost function to prevent spam, or abuse, to be a bit more generic. I2P is, at least from my point of view, a network which enables small and lean systems. Right, where am I? 15:40. In history, building a DNS on a fully distributed network, the approach isn't new. One of the older approaches are systems based on the Hashcash function, which was properly described in the 1990s and then led to proof-of-work systems, and these proof-of-work systems created currencies, like we all know Bitcoin.
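The Hashcash idea mentioned here, which later led to proof-of-work currencies, can be sketched in a few lines: finding a nonce is expensive, checking it is cheap. The payload and difficulty below are arbitrary illustration values:

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def mine(payload: bytes, difficulty: int) -> int:
    """Search for a nonce so the hash has `difficulty` leading zero bits.
    Expected work grows as 2**difficulty: this is the inefficient 'race'."""
    for nonce in count():
        digest = hashlib.sha256(payload + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(payload: bytes, nonce: int, difficulty: int) -> bool:
    """Verification is a single hash: cheap for everyone else."""
    digest = hashlib.sha256(payload + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty

nonce = mine(b"diva.i2p", difficulty=12)   # ~4096 hash attempts on average
print(verify(b"diva.i2p", nonce, 12))      # True
```

The asymmetry (slow to produce, instant to verify) is what makes it usable as a spam cost function, and the wasted race is exactly the inefficiency the speaker criticizes.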
Namecoin then came in, and other things which are proof of work. What I can guarantee you is: proof of work is working. Proof of work as a cost function is, at least as far as we know today, mathematically working perfectly, but it's extremely inefficient, because it's a race. It's a race for the fastest solution. This is a bit trivialized, but at the end of the day it's a race, and this race is inefficient. Now, I always resisted implementing yet another proof-of-work function, not because it's not working; I just didn't want it. What I also did not want was the Filecoin/InterPlanetary File System solution, which is a validator approach, because a validator approach means nothing else than delegation, and Filecoin did it in a clever way: they used drand to select validators. But they're just shifting the problem from their own system to another system, and then they say it's solved, but that's not true. In the end you just move the attack vector away from your own system, only to open up another attack vector. For me, just for me, currency-based, proof-of-work-based or validator-based is not really an approach, and as an economist by training I feel very uneasy, very, very uneasy, about non-fungible currencies. There aren't many fungible ones. A few are; do your own research and you will find out which are really fungible. The others are difficult, let me put it like this. Then there are naive concepts which are very nice and highly performing, but in the end, in the area of DNS and in the area of DIVA, what we're talking about, you need immutability and integrity. Right, I want to lose a few words about the CAP theorem, because consistency, availability and partition tolerance form a triangle within this CAP theorem, and it is said that you have to choose two out of these three. Now some blockchain guys said: hey, we solved it, now we have all three.
At least with Byzantine fault tolerance I have my doubts, and honestly I do not see any concept out in the wild which really solves that problem, except proof of work, and we don't want that. So this year, and that was part of my biggest struggle, we had to leave democratic Byzantine fault tolerance behind. That was the talk I gave in 2023, exactly here at this place: democratic Byzantine fault tolerance, which is developed by the university in Lausanne in Switzerland and also in Sydney, Australia. And sorry guys, with I2P this concept is not working, because (and we're talking about the fallacies right afterwards, about the problems with such networks) democratic Byzantine fault tolerance was a failure. So with divachain we went into eventual consistency, because the big problems in distributed computing, known since the 90s, are assumptions like: we have zero network latency (wrong), we have unlimited bandwidth (wrong), we have a secure network (wrong). We all know that as developers, but sometimes in the lab we go into a perfect world, dream of something, create something, and then in the real world it's not working. That's why my biggest tip for every blockchain developer in the universe is: test it on I2P. If it's still working, you've probably done a good job, and that's exactly one of the core messages. I2P has so many network transport problems, which are the price for the privacy which we want, that it's a very good test case, a very good transport layer, for all the blockchain developers out there, including myself. So what we did in the last 12 months with divachain (and obviously you'll find it on GitHub): we created a transaction-based system which is barrier-free, immutable, trustless and based on I2P, so fully anonymous. It's working now; it has been working for about three weeks.
The students at the University of Applied Sciences in Lucerne wrote a little prototype with I2P in the last three weeks, but they had a lot of API troubles and struggles because I made mistakes. So it was my mistake, and in the end I couldn't present the final prototype here, but because of me, not because of the students; they did a good job. And what we're thinking about today is how to implement the cost function, because, as I already said, a barrier-free system will attract a lot of spam: a lot of DNS spam, a lot of content spam, a lot of whatever-we-can-use-this-system-for spam, and that's not my intention as one of the developers. So probably it will be a function of availability and a function of cooperation. And when you read this now and think this is new: no, it's not. Filecoin has implemented this since 2014. The only problem they had was their validator selection, so they made the mistake of using a validator function to implement their consensus. But they call this proof of storage, the function of availability, and the other one they call window proof-of-spacetime, or something like that. You have to prove two things: first you have to prove to the network (prove means mathematical proof) that your content is stored, and second that your content is continuously stored. So these concepts are not new; I would just like to think about it a bit more and then implement it. I already talked about my core failures, or our core failures in our very little team: democratic Byzantine fault tolerance, a very nice concept, a very nice book, I learned a lot, but it didn't work.
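The two proofs described above (content is stored, and stored continuously) can be sketched as a naive challenge-and-response. This is not Filecoin's actual construction; real proofs of spacetime use succinct proofs so the verifier does not need its own copy of the data:

```python
import hashlib
import secrets

def prove_possession(content: bytes, challenge: bytes) -> str:
    # Only a node that actually holds `content` can compute this value.
    return hashlib.sha256(challenge + content).hexdigest()

class Verifier:
    def __init__(self, content: bytes):
        # Naive: this verifier keeps a copy to check answers. Real systems
        # avoid that with Merkle/succinct proofs; this is just the idea.
        self._content = content
        self.passed_checks = 0

    def audit(self, prover) -> bool:
        # A fresh, unpredictable challenge each round: old answers
        # cannot be replayed, so passing requires holding the data NOW.
        challenge = secrets.token_bytes(32)
        ok = prover(challenge) == prove_possession(self._content, challenge)
        self.passed_checks += ok
        return ok

data = b"diva.i2p -> some-b32-destination"
verifier = Verifier(data)

honest = lambda ch: prove_possession(data, ch)
cheater = lambda ch: prove_possession(b"", ch)   # dropped the data

print(verifier.audit(honest))    # True
print(verifier.audit(cheater))   # False
```

Repeating such audits at intervals approximates "continuously stored": the availability function the speaker is considering for the cost model.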
The eventual consistency approach has been working for a few weeks. The API is highly unstable, I have a lot of coding work ahead of me, and I'm very much looking for feedback. So if anybody is interested in hacking on it, I'm always happy if somebody wants to contribute. The academic feedback was also very positive, so I could show a few interesting things in the past months. Please, in the last minute, take away this: we believe that an eventually consistent DNS, or blockchain-like system, used for this DNS challenge is a reasonable approach. Eventually consistent, so we drop blockchain consensus and replace it by eventual consistency, transaction-based. The core challenges as we know them today: we need to implement a cost function which is reasonable. Decisions, in our wording, are nothing else than a global state, where all peers in the I2P network agree on a specific state of data. And participation is very welcome. In the presentation on the web, which you'll find on the FOSDEM page for this devroom, you find all the sources and some more stuff. So if you have questions, please shoot. Yes, please. Could you explain what you meant by immutability? His question was: could you please explain what you mean by immutability? The answer is: once written, never changed again. Yes, please.
Right, he's asking: in our system we're going to have a lot of traffic, we're going to have a lot of records stored; did I summarize that correctly, and that's a problem in terms of storage, right? First, compared to other approaches: for divachain, DNS is a side project. We never, like Handshake or other projects, intended to replace the current domain name system of the clearnet. We always wanted to map I2P names like diva.i2p to B32 addresses, because nobody is going to give us a domain. So no, we don't have much traffic there, and the storage problem is nothing I'm currently thinking about. But yes, sooner or later there will be scalability questions, you're absolutely right; in this baby state I just don't care yet. Yes, please. The question was: if it's immutable, how can I change things? In the blockchain world you never change a record. You just let it live in a block, or let's call it in a transaction, and then you create a new transaction on top, and this new transaction is the new state, because in a blockchain you always look from the top, and the last state is the thing you believe in, because it's properly proved using math. Does that answer your question? Okay. Maybe a question from the phone? No other questions? Thank you.
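The answer about immutability can be made concrete: an update is just a new transaction appended on top, and resolution looks "from the top". A toy sketch (names and values invented):

```python
# "Never change a record": an append-only transaction log where the latest
# transaction for a name defines the current state. History stays intact.

chain = []  # append-only: transactions are never modified or removed

def put(name: str, b32: str) -> None:
    chain.append({"height": len(chain), "name": name, "b32": b32})

def resolve(name: str):
    # Look from the top: the newest matching transaction wins.
    for tx in reversed(chain):
        if tx["name"] == name:
            return tx["b32"]
    return None

put("diva.i2p", "oldaddress.b32.i2p")
put("diva.i2p", "newaddress.b32.i2p")   # an "update" is a new transaction

print(resolve("diva.i2p"))   # newaddress.b32.i2p
print(len(chain))            # 2: nothing was mutated, history is preserved
```

A real chain would additionally hash-link the transactions so tampering with history is detectable; this sketch only shows the last-write-wins resolution.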
Algo-rollover for .nl
Hello everybody, welcome to the DNS devroom, if you just came in. Our next speaker is Stefan, who will be telling us about the DNSSEC KSK algorithm rollover for .nl, which normally isn't a very exciting thing, but I trust Stefan to have made it a boring situation that's still fun to talk about. Yeah, thank you Peter. Welcome, my name is Stefan Ubbink, I work for SIDN, the .nl registry, and I'm talking about the KSK algorithm rollover we did in July last year for .nl. So: why did we do this? What preparations did we do for this change? What was the planning like? How did we execute it? And what did we measure on the internet during our change? So why would we want to change the algorithm? The algorithm we used before was algorithm 8, which is an RSA algorithm, and we wanted to use a safer algorithm to keep up with the new standards, because since June 2019 the recommendation from the RFCs is to use an ECDSA algorithm for DNSSEC signing, and there's currently enough support in resolvers to do this. As you can see in the graphs, both RSA and ECDSA are supported equally by most resolvers. A plus is also that the ECDSA answers we are giving are smaller than the RSA answers, which gives us less impact when we are hit by reflection attacks. So it's better for the internet. On the way to the algorithm rollover, we had already replaced the HSMs we use for signing the zone with new HSMs from Thales, which could do 20,000 signatures per second, a big increase over our previous HSMs. And we started with a test run on our test environment, without any changes, to see how this works and how much time it takes, because there are a lot of things you can change which would change the time used for some steps in the rollover process. A normal run took about three weeks without any changes.
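The "smaller answers" point can be illustrated with rough numbers: an RSA-2048 signature is 256 bytes (the modulus size), while a P-256 ECDSA signature is 64 bytes (two 32-byte integers). The number of signatures per answer below is only an illustration, not .nl's actual response composition:

```python
# Back-of-the-envelope comparison of the signature portion of a DNSSEC answer.
RSA_2048_SIG = 256   # bytes: RSA signature length equals the modulus size
ECDSA_P256_SIG = 64  # bytes: two 32-byte integers (r, s)

signatures_in_answer = 2  # illustrative, e.g. an RRset signature plus a denial proof

rsa_bytes = signatures_in_answer * RSA_2048_SIG
ecdsa_bytes = signatures_in_answer * ECDSA_P256_SIG

print(rsa_bytes, ecdsa_bytes)                       # 512 128
print(f"saving: {rsa_bytes - ecdsa_bytes} bytes per answer")  # saving: 384 bytes per answer
```

Smaller answers mean less amplification value for reflection attacks and fewer truncated UDP responses, which is the point the talk makes.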
To be able to do it efficiently, we also made a test-lab policy for OpenDNSSEC, which rolled very fast, to be able to see what changes were done, and to create some scripts to follow everything that happens in the environment. We also used a local DNSViz installation to see whether a resolver in our setup (it was Unbound) could indeed resolve the new situation. And for that we also created a fake root, so we could play root operator, change everything, and validate everything to see whether it all worked without any issue. That went quite well. Then we went to our acceptance environment, in which we used a daily copy of the public .nl zone, which has 6.4 million domain names in it, and many more records. And then we had a memory issue. The machine had 128 gigabytes of memory and no swap usage, but still the system stalled on something. After we added swap to the system, it ran again and continued where it left off; nothing was broken. It was strange, but it helped us, so we could prevent this issue in production. Another thing: normally we generate a full zone every half hour, and in a normal run it took about 24 minutes to generate the zone, sign it and publish it, including validation. After adding the ECDSA keys, we did a run and it took 45 minutes. That's not what we wanted, because if you want to publish every half hour, you cannot take 45 minutes to publish something. So we had to find a way to stay under 30 minutes while doing both RSA and ECDSA. We saw that mainly the validation part cost a lot more time, because ECDSA is harder on the validation side than on the generation side. So we made some things parallel: we compiled the zone with BIND to raw format, we validated with validns, and all those things we did in parallel, and we added parallelization to validns. Or at least, it was already available, and we used the switch to enable it.
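The parallelization described (splitting the validation work across all cores to get back under the 30-minute budget) can be sketched generically. The per-record check below is a hashing stand-in, not SIDN's actual validation pipeline, and the chunk size is arbitrary:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def validate_chunk(records):
    # Stand-in for per-record validation work (signature checks in reality).
    return all(hashlib.sha256(r.encode()).digest() for r in records)

def chunks(seq, size):
    """Yield the records in batches of `size`."""
    it = iter(seq)
    while batch := list(islice(it, size)):
        yield batch

def validate_zone(records, workers=4):
    # Split the zone into batches and validate them concurrently: the same
    # idea as running the validator with its parallel switch on all cores.
    # (Real CPU-bound validation would use processes rather than threads.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return all(pool.map(validate_chunk, chunks(records, 1000)))

zone = [f"name{i}.nl. 3600 IN A 192.0.2.1" for i in range(10_000)]
print(validate_zone(zone))  # True
```

The whole run succeeds only if every batch validates, which matches the go/no-go character of zone publication: a single bad signature should block the publish.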
So we are using all cores on our systems to do the validation, and then we got to about 27 minutes of generation. That's under 30 minutes, a very good job for us, and so we were able to continue with the new zone generation. So how did we plan this thing? We were in June, and we knew that it would take some time. We also saw that we might have a ZSK rollover coming; we didn't want to do the rollover during a ZSK rollover, because then the zone would increase even more because of extra signatures. So we had to plan it, and we also had some data that we could use for the validation. So we had to do a lot of things. And people in the organization had to approve that we are going to change the DS in the root at IANA. We expected that the IANA change would take three days, and so we came up with this plan, and with all the holidays for people, et cetera, we were able to make this plan. As you can see, we have some asterisks next to some dates, and that's because those are dependent on the IANA change. If IANA would take more time than we expected, then those dates would change, and this is something we couldn't predict; but even so, we thought three days should be normal and should be okay. Luckily we did a blog post about this change, telling people we are going to do this: so if something breaks, you know we are doing this, and you have these dates to see whether everything is going according to plan, and we will update this if there's some issue or the dates change. But it was all good, and we planned it very well, because all the dates mentioned here were the dates that were used. So we did it according to plan. On executing a plan: it's good to have written commands, just to copy and paste them when you need them. You only have to check: yes, I'm doing this on the correct system; yes, it's all written correctly. But you don't have to think about it anymore.
During the execution we did continuous checking with the script we wrote, and we did some DNSViz runs on the public DNSViz site, to show people that we are changing and to have some records that we can show. I will show the DNSViz pictures later. As I mentioned before, there would be an increase in file size for the zone: before it was 4.5 GB in size, during the rollover 4.6 GB, and afterwards, with the smaller signatures, we only had 2.37 GB, which is very nice. Of course we had a go/no-go moment, because while we have double signatures we can still go back without any disruptions, et cetera, but once we go forward, we wouldn't be able to go back as easily. So we had to do a bit of a check, and that went well. So, some pictures: the algorithm 8 situation; the policy change, where you see the addition of the ECDSA; then we added the algorithm 13 DS to the root and removed the algorithm 8 from the root; and afterwards we stopped using algorithm 8, and this is the new situation with only ECDSA. During all this time we also did some measurements, and a colleague of mine, Moritz Müller, did most of them. He wrote a rollover monitor quite a few years back and used it again. We measured the items mentioned on the slide. I want to mention that we only measured two root servers, because all root servers should give the same answers, so we didn't want to measure all 13. What might be interesting is that you see a lot of numbers in this graph, and that was a bug in the rollover monitoring software. You also might notice that there are multiple lines at the top and at the bottom, and that's a measurement issue caused by using a small buffer size while still trying to get key IDs from the answer. That's why you see a lot of changes, and we looked at what we were seeing and thought: this is not correct, what's happening? Because if I do manual checking, everything is fine. What's happening here?
Finally we were able to find the issue and fix it. Another interesting thing: during the change we looked at the response sizes we sent. In this table it's only ns1.dns.nl; other systems have similar, but not the same, answers, because the sizes may differ based on the implementation of the nameserver that's used, because of name compression. Another interesting thing here is that the NXDOMAIN and DNSKEY responses increase during the rollover, but the NS-set response does not increase; it gets smaller. That's because the A and AAAA records are in the additional section, and during the rollover the response gets so much bigger that only the A and AAAA records for ns1.dns.nl are in the answer, but not those for all the nameservers that are in our zone. If we look at traffic: normally we have about one percent TCP traffic, during the rollover we had about five percent TCP traffic, and after the rollover it's back to normal again. Here you see a graph with a logarithmic y-axis, and you see that TCP increased a lot, about eight times more TCP traffic, and after the rollover it levels off again. So globally we measured no impact at all, as far as we know; I don't know of any trust issues people had. You can see in the left picture the adding of the ECDSA key and afterwards the removal of the RSA key, and the right picture is the trust chain, which stayed constant for the resolvers. And that's my talk already. Are there any questions? I've got two questions. The first one is on slide 17: you mentioned that during the rollover the NS size becomes smaller. Yes. I will let you ask the complete question.
So you said the NS set is getting smaller, yes, and the question is? The question is: there is an RFC out there that says glue is mandatory. If the size of the response is getting smaller because you're not including glue, you have to set the TC bit. Did you measure for that? So I'll repeat the question: there is an RFC that says glue is mandatory, and did we measure some things about this? What I know about this situation is that we looked at the DNSViz information that we got, and for the measurement for ns1.dns.nl the glue is available, but only for that name, so not the glue for the other NS records. And I don't know whether we looked at whether the TC flag has been set and whether it has been acted on. And the second question? Yeah, my second question is: I noticed you switched to Thales. With regards to support from the Thales company, did you test whether you got proper support, if you needed it? I will repeat the question: he said we switched to Thales, and did we test the support we had with Thales before doing this transition? No, we did not test the support beforehand, and technically we did not switch to Thales: we used to have Lunas, and that product line was taken over by Thales. So we continued with the same Luna HSM products as before. We had contact with Thales before we switched the HSMs, but we did not, before the rollover, try again to contact them to see how support would handle questions from us. Which might be a very good idea as well. Thank you for that; I'm asking for a friend. Yes. A related question to that: did you have any rollback plans in case something went bad? The question is: did we have any rollback plans? I mentioned we had a go/no-go to see whether everything was okay; if so, we go forward, and if everything starts to fail, we go backwards.
After going forward we had some thoughts about how to continue, but that might have impact. So the decision about what to do when depended on the situation at that moment. And we didn't write out every possible scenario, because that would be too much; based on our testing in our acceptance and test environments, we had confidence that it would all go correctly. We would have looked at the situation at the moment to see what the next step would be if something went wrong. Does that answer your question? Yes. If you had the choice to redo your procedure, do you think it's worth it to have an HSM at all, regarding the added complexity and the risk of losing your key in case backups are not there? Or having a hidden signer that is air-gapped, for example? If I understand your question correctly, it is about backups? Are you happy with having an HSM, versus having an air-gapped Linux that has the KSK on the disk and does the signing, with just the DNS updates going out into the world? I hope I understand the question, but if I try to answer it: we do not have an air-gapped system. We do have regular backups of all the HSM keys. So in that way we do have an HSM that is air-gapped, because the backup unit is an HSM, and we can use that to restore keys if necessary. Do you think it's worth it? Worth it? It depends on your risk assessment whether you want to have an air-gapped system, and if you are going to do this kind of thing, for instance in a public cloud, you might want a situation where you have an offline KSK, for instance. So that might be a setup. Did you conduct a penetration test on the HSMs beforehand, and what are your procedures in case a security issue becomes known in these HSMs? Did we do a pen test on the HSMs, and the next question was: what would we do if a vulnerability becomes known?
No, we did not do a pen test, and if a vulnerability would become known, that would require us to investigate our systems: what happens, how can we react to that, which information has leaked, and how can we recover from that? Those are not written-out scenarios, at least not known to me at the moment. Why no pen test? Why no pen test? I have no idea. Yes? I noticed that the NXDOMAIN response goes up to 1402 bytes. I'm curious, what were your truncation settings for the maximum size? So your question is: what are our settings for the UDP size? We have 1232 as the size of the UDP packets, as is generally recommended. Other parties that also provide anycast for .nl have slightly different settings for that; that's why we focused here on ns1.dns.nl, because that's something we operate ourselves. The second question? The second question was: you added the algorithm 13 DS records to the root zone, so you ran a dual DS. Was that to allow removing the algorithm 13 DS records if you had to in a hurry, or just as an additional acceptance step before you removed the algorithm 8 DS records? Because during some fairly recent transitions at other TLDs, they basically just did a swap. The question was: why did you not remove the algorithm 8 DS when you were adding the algorithm 13 DS? Correct? Yes. We did that because we wanted to have a solid path and the possibility to go back without any issue. So rather than take one big step, we took two small steps, to ensure more stability, at least from our point of view, and a good night's rest for us. Any other questions? Maybe not so much a question but a statement, if that's allowed. Yeah. One. I think it's incredibly brave for a national top-level domain to take a risk, right? And I mean that as a compliment. Because changing an algorithm is different from changing a key; changing an algorithm is fundamentally hard.
And for SIDN to do this as one of the early adopters, not the first one but one of the early adopters, I think it's very commendable. And I think you set an example for the rest of the industry, for all the other top-level domains, including ICANN, and we're looking at you. And we're looking at you to see what you're doing well, and of course we hope nothing goes wrong, but we also need to have that information. One of your colleagues is working with ICANN to make sure that if we ever do something in the root, that goes well as well; he's part of that group. So yeah, we're looking at this, we're hoping all the top-level domains follow the same example, and all my credit goes to you guys. Welcome. Thank you. I want to repeat that for the online audience, because if you get a compliment like that: the person, Roy Arends from ICANN, said that it is very brave for the .nl registry to do this algorithm change at the forefront of the people who are doing the change, and that it should be followed by other registries doing this change as well. And we have shown that it's possible, and without any incident. So, all other TLDs, please follow us. Good summary. Thank you. Thank you.
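Two wire-format details from the Q&A, the 1232-byte EDNS buffer size and the TC (truncation) flag, can be sketched with the standard DNS layouts. The sample headers below are made up for illustration:

```python
import struct

EDNS_BUFSIZE = 1232  # the UDP payload size mentioned in the talk

def opt_record(bufsize: int = EDNS_BUFSIZE) -> bytes:
    # OPT pseudo-RR per the EDNS(0) wire format: root name (1 zero byte),
    # TYPE=41, then the CLASS field carries the requestor's UDP payload
    # size, TTL carries extended RCODE/flags (0 here), RDLENGTH=0.
    return struct.pack("!BHHIH", 0, 41, bufsize, 0, 0)

def is_truncated(header: bytes) -> bool:
    # TC is bit 0x0200 of the flags word (bytes 2-3 of the 12-byte header).
    flags = struct.unpack("!H", header[2:4])[0]
    return bool(flags & 0x0200)

print(len(opt_record()))                                  # 11 bytes on the wire
print(is_truncated(b"\x12\x34\x82\x00" + b"\x00" * 8))    # True: TC set
print(is_truncated(b"\x12\x34\x80\x00" + b"\x00" * 8))    # False
```

A response that would exceed the advertised buffer gets the TC bit set, and the client is expected to retry over TCP, which is exactly the TCP-traffic increase measured during the rollover.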
Bootstrapping time on OpenBSD
Welcome to the DNS devroom. Our next speaker is Otto, who is an OpenBSD developer. He sits at the fateful intersection of DNSSEC, NTP, and maybe two other terrible things. Yeah. Okay. So I'm going to talk about bootstrapping time, specifically how we implemented that on OpenBSD, but I think the approach could be used on other systems as well. So, a small introduction: OpenBSD, a BSD derivative. We focus on security, and we do that in several ways. For example, privilege-separated daemons, in which we separate the various tasks a daemon has to do into separate processes. Each of those processes has minimal capabilities, and they communicate with each other through pipes, exchanging messages. There are also a lot of other techniques, from memory management, which I'm also pretty involved in, to new APIs that are, let's say, less easy to misuse, things like that. Apart from that, we also try to make a useful system, so we like to have sane defaults and focus on a system that is, out of the box, nice to work with. By default we do not have a lot of services active, but if we consider a certain functionality for inclusion in the default configuration, the configuration you get when you install the system, we are quite strict about that, in the sense that it has to be functionality which is useful for a very, very large fraction of our users. But also, the actual implementation is held to an even higher standard, because it's a higher risk, so we focus extra on the security aspects: the architecture of the software itself and the specific implementation. I'm now going to talk about time, and we'll see a bit later how that also involves DNS. Originally, when OpenBSD starts, it gets the time from a battery-backed real-time clock, if your hardware has that, because not all hardware has it, and even hardware that has it is not always functioning properly.
If you think of older hardware, the case of "my CMOS battery ran out" is pretty well known, and most battery-backed real-time clocks then give some default value way back in the past. So the booting system tries to read the clock if it's available; if that fails, or there's no clock, the time is set based on a timestamp that is stored in the root filesystem, which says: well, this was the last time the filesystem was modified. Basically, if you unmount the root filesystem, which happens on an ordinary reboot or shutdown, that timestamp gets set as well. So if you reboot the machine, you probably have a timestamp which is a little bit in the past, but reasonably okay. It's a bit behind, probably, especially if you shut down your machine, go on vacation, and you don't have a real-time clock, because then you come back from vacation and your clock is two weeks behind or so. So that's the problem. We have an ntpd implementation, which I'm going to talk about a bit more in a second, but originally that implementation did not bump the clock. It would only gradually speed up or slow down the clock to adjust it, to make sure the time corresponds to the NTP-derived time. You could enable setting the clock, but it was not the default, because we said: well, we are not going to make that the default, because we don't really have enough confidence that it will do the right thing. Why not? Because NTP in itself is not a secure protocol; that's one issue. And also, we would like to have more than one source of time, not only NTP, even if you talk to multiple peers. We would like to have an independent way of validating the time we see. So we formulated some goals a few years back, and we would like to be pretty sure that if you boot up an OpenBSD system, you have the proper time, if you have network connectivity.
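The RTC/filesystem-timestamp fallback described above can be sketched as a simple selection. The sanity threshold and the values below are invented for illustration; OpenBSD's actual boot-time logic differs in detail:

```python
# Prefer a plausible RTC reading; otherwise fall back to the root
# filesystem's last-unmount timestamp, which is at least "a little bit
# in the past" rather than decades off.

RTC_SANITY_FLOOR = 1_600_000_000  # ~2020: a dead CMOS battery yields times far before this

def initial_clock(rtc_time, fs_timestamp):
    """rtc_time: seconds from the RTC, or None if no RTC is present.
    fs_timestamp: last-modified time stored in the root filesystem."""
    if rtc_time is not None and rtc_time >= RTC_SANITY_FLOOR:
        return rtc_time            # RTC present and plausible
    return fs_timestamp            # no RTC, or it reset to a default in the past

print(initial_clock(1_700_000_000, 1_690_000_000))  # healthy RTC wins
print(initial_clock(None, 1_690_000_000))           # no RTC: filesystem timestamp
print(initial_clock(946_684_800, 1_690_000_000))    # RTC reset to year 2000: ignored
```

Either way the result is only a starting estimate; the rest of the talk is about correcting it over the network without blindly trusting NTP.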
So that's a nice goal, but we made things a bit harder for ourselves by stating: well, we do not fully trust NTP replies. Like I said, by default NTP is an insecure protocol, and the design of the protocol is in a way a bit dated. You can compare it a bit to the original DNS implementations; security was not a big thing at that time. We'll talk about it a bit more later. But the goal is still to get the correct time on boot with a high level of trust. Not necessarily a very high level of trust in the sense that you have a cryptographic proof; that's maybe a goal for the coming years or so, but at least a high level of trust. And if there's no battery-backed clock available, or it is not functioning properly, we still would like to end up with the proper time. For example, cheap boards like the Raspberry Pi, and other boards, do not have a battery-backed clock at all by default. And you can also have cases where very expensive servers forget about time when you switch them off. So the setting is: if we can solve the problem in this quite difficult situation, where we have a lack of hardware support and things like that, then the more easy cases, where you do have a proper RTC or other facilities, become easier. Now, if we need to be able to do DNS to resolve NTP peers, it might be that the resolver we are using is DNSSEC-enabled. If that resolver is running on another system, it's quite easy: probably that other system already has the proper time. But if we are running our own resolver on the same system, and we do not have the proper time, then DNSSEC is going to complicate matters. So we do want to consider, at least, what we should do in that case. Now a few words about the NTP protocol. It's pretty old, let's say the same era as the DNS protocol, and there are some design similarities between them. For example, in DNS a request and an answer have basically exactly the same format.
NTP is the same. There's also the focus on UDP, of course, and the fact that in the reply that comes back, a lot of the information — maybe even all the information — that you sent out also comes back. So as a client you have a reasonably easy task: you only have to consider the answer, because the answer contains all the information you sent out earlier. You only have to consider what's in the reply packet, do some processing, and you can continue. But of course that means you have to trust that reply packet even more than you maybe would want to. Later, there were additions to the NTP protocol. Shared keys were introduced: if you had an NTP peer with which you had some form of relationship, you could exchange a key, share it with that other party, and then secure the NTP packets. So you had pretty good confidence that you were receiving replies from a trusted source. Later on there were even more extensions: NTS was invented, Network Time Security, which includes a key-establishment protocol that is pretty complex. So far we did not want to implement that yet, but it might come at some point, because of course that gives you cryptographic assurance. And there's a process handling constraints — constraints are a thing I will talk about later. So our implementation does not have any cryptographic proof of the validity of the data, but we do have basic spoofing protection. In the NTP protocol there's a field called transmit time, and according to the protocol, the server which answers the query just has to echo that field. It's 64 bits, and the server is not looking at that field for any reason other than to echo it.
So if we fill in a random value — let's say a cookie — there, we can at least protect against an attacker who is trying to spoof us but who is not able to read the outgoing packets. Of course that comes with storing some state in the client, because you have to remember which cookie you sent out, but the protocol allows for it without any changes. When you actually compute the time — there's an algorithm in the NTP protocol which lets you filter out round-trip times and so on and get a good idea of the server's time — you have to use the original send time, and of course not the random value you filled in. Now, the trust model in the original NTP protocol is a pretty complex statistical analysis of all the replies you have seen from different peers. We take a simpler approach: we send out queries to several peers, we collect results, we filter out things we consider bad — unreliable servers, servers that do not reply, servers that reply with a bad cookie — and we select the median time. And we use constraints, which are a completely different source of time information, obtained by doing HTTPS requests to certain servers. The nice thing about an HTTPS request is that the reply header also contains a timestamp. It's a rough timestamp, with one-second granularity, so low resolution, but we use it to filter out bad NTP replies. If an NTP reply is outside our rough, low-resolution constraints, we skip it. There is a small complication there, because the certificate check needs a timestamp — without any idea of the real time — to decide whether the certificate is valid now. If you do not know what "now" is, what we do is use the timestamp from the reply itself, and say: well, it is at least consistent with what the server is telling us. So the HTTPS reply is valid...
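The peer-filtering step described above can be sketched in a few lines. This is a simplified illustration, not OpenNTPD's actual code: `replies`, `constraint`, and the 60-second window are assumed names and values chosen for the example.

```python
import statistics

def select_offset(replies, constraint, window=60.0):
    """Pick a trustworthy clock offset from NTP replies.

    replies    -- list of (offset_seconds, cookie_ok) tuples, one per peer
    constraint -- coarse offset derived from HTTPS Date headers
                  (one-second granularity, so low resolution)
    window     -- how far an NTP offset may deviate from the constraint

    Peers that failed the cookie check or fall outside the constraint
    window are discarded; the median of the survivors is returned,
    or None if nothing trustworthy is left.
    """
    good = [off for off, cookie_ok in replies
            if cookie_ok and abs(off - constraint) <= window]
    return statistics.median(good) if good else None
```

A peer that echoes a bad cookie or reports a time far outside the HTTPS-derived window simply never reaches the median computation, which is the behavior the talk describes.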
...at the time that server tells us it is. We'll come back to that later. Okay, but there is also a DNS dependency, because we want to be able to select NTP peers by name. We have things like pool.ntp.org, which are very dynamic and change all the time — also location-based, so depending on the particulars of your query you get a different answer. And you may want DNSSEC validation. Now, DNSSEC signatures contain a validity period, with the same problem as with certificates. So here we have the hardest case: if we run the DNSSEC-validating resolver on the same host we are trying to boot, we have a bootstrap issue. Luckily there's a way around that, and that is the checking-disabled (CD) flag in the DNS request header. You can say to a DNS resolver: I want to resolve this name, but do not do any DNSSEC validation. That's easy, at least from the protocol point of view: you can just set that flag and at least get some DNS resolution. But in the resolver API at that time — which also dates from the 80s or 90s — there was no way to enable that. Here we come to another point: because OpenBSD is a complete system — we build the C library, we build the APIs, we build the applications and the daemons that come with it — we could just add that API and then assume in our application that it is available. This is part of resolv.h in the source code: we introduced a new flag, RES_USE_CD. That enables us to use the DNS resolution APIs, which also rely on a bit of a legacy mechanism that stems from the 80s: a global struct called _res, which allows you to tweak the way DNS requests are done in libc. These days it would be designed completely differently — you would probably have some local object which you pass to that code each time, or some context, or something like that.
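To make the CD flag concrete, here is a minimal sketch of the wire-level bit being discussed — not the OpenBSD libc API, just the DNS header itself. The flag constants follow RFC 1035 and RFC 4035; the function name is invented for the example.

```python
import struct

# DNS header flag bits (RFC 1035, RFC 4035)
FLAG_RD = 0x0100  # recursion desired
FLAG_CD = 0x0010  # checking disabled: tell the validating resolver
                  # to skip DNSSEC validation for this query

def dns_query_header(qid, checking_disabled=False):
    """Build the 12-byte DNS query header for a single question."""
    flags = FLAG_RD | (FLAG_CD if checking_disabled else 0)
    # fields: id, flags, qdcount, ancount, nscount, arcount
    return struct.pack("!HHHHHH", qid, flags, 1, 0, 0, 0)
```

Setting one bit in the header is all the protocol requires; the work on OpenBSD was exposing that bit through the ancient resolver API so applications like ntpd could actually use it.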
But this is from the old days, where a global variable or global struct contains the flags to be used. So what we do: if we know the time is not synced yet, we first try without the CD bit; if the resolution fails, we retry with the CD bit set and hope that it gets better. That way we have an answer. Of course it's not DNSSEC-validated, so we are closer, maybe, but still not there. So here is the revamped mechanism. We get the time from the RTC; if that fails, we take the timestamp from the root file system — exactly the same as before, the kernel is doing exactly what it did before. When OpenNTPD starts, it will fetch constraints — that's the new thing — to get a rough idea of the time. It will also send out NTP requests, based on the DNS requests it has done, and those NTP replies will be validated using the constraints derived from the HTTPS requests. And we will bump the time if it's going forward, and otherwise do a gradual adjustment. We bump only forward, because we do not like to have logs with time going backwards — monotonically increasing time is pretty important. If we would have to set the time backwards — and that is probably an indication that something is really wrong — we don't do it, and we scream in the logs. After that, the regular NTP things just happen: gradual adjustment using several peers, et cetera. So then we have some idea of the time, and we do it one more time: once we are synced, in the sense that the NTP time and the system time agree — which can take several minutes, of course, because you have to slew slowly in many cases — we do it again. But now we know we are synced, so we do real DNSSEC validation; we do not have to fall back to checking-disabled queries. And we use the constraints to check the actual time.
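The bump-forward-only policy can be summarized as a small decision function. This is an illustrative sketch of the policy as described in the talk, not OpenNTPD code; the 60-second threshold is an assumed value, not OpenNTPD's actual constant.

```python
import logging

def plan_adjustment(offset, synced_once=False, step_threshold=60.0):
    """Decide between stepping ("bumping") and slewing the clock.

    offset is server_time - local_time in seconds.  A large forward
    offset found at boot is stepped once; the clock is never stepped
    backwards, only slewed, so logged timestamps stay monotonically
    increasing.
    """
    if offset < 0:
        if offset < -step_threshold:
            # Probably a sign something is really wrong: scream in the logs.
            logging.warning("clock %.1fs ahead of NTP; refusing to step back",
                            -offset)
        return ("slew", offset)           # never set the clock backwards
    if offset > step_threshold and not synced_once:
        return ("step", offset)           # bump forward once, early in boot
    return ("slew", offset)               # gradual adjtime()-style correction
```

The asymmetry is the whole point: a forward step at boot is cheap and safe, while a backward step would reorder log entries, so it always requires the operator's judgment.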
If at that point things are not okay, then of course we scream in the logs that we cannot trust your NTP peers — but then it's a system operator decision what to do. In a local LAN, of course, that might be a perfectly suitable setup. The default config uses several NTP sources, like pool.ntp.org, but Cloudflare also offers an NTP server on all their POPs — so with the same anycast IP you get a local, or at least close-by, time source; that's the idea. And there's an assorted set of constraints: well-known HTTPS servers, for example from Google, and we also use Quad9's servers for that — they have, let's say, a stamp of approval. So this is the default configuration. We are intentionally not using 8.8.8.8 for the constraints, because there would be a tie between NTP and DNS; www.google.com is served by a completely different set of systems at Google, so resolving it via DNS diversifies the sources we're getting time from. And a little detail: "servers", plural, means that if the DNS request produces multiple IP addresses, we query all of them; "server", singular, is a single source. And "sensor" is for when we run on a system which has hardware clocks — for example GPS-based, or a Meinberg device, a PCI card you can insert into your system which gets the time from the DCF77 clock in Germany, or other sources. We also use those, of course, as trusted sources; that's what we call time sensors. So that is my talk. I'd like to thank the other OpenBSD developers who cooperated with me on this. I'm reachable on Mastodon, and also via openbsd.org. And I'd like to ask if there are any questions. Yeah. So you mentioned that NTPD never sets the time back. But what happens if, for example, you have a hardware RTC clock that's misconfigured — for example, set one year in the future for some bizarre reason — and then you're a bit back?
Yeah. So the question is: our NTP implementation never bumps — never hard-sets — the clock backwards. If your RTC clock is misconfigured or set to the wrong time, then we require operator intervention. It's a human decision. Of course, you can still do that with the date command, or with rdate, where you get the time from a different system. But that is not a thing which happens automatically. We scream and say: this is not right. But we require operator intervention for that case. Next question: how tolerant is this if you don't have network during boot — because it's a laptop that might be joining a wireless network, and that takes ten seconds? Yeah. If we do not have a working network configuration when ntpd starts, it waits about ten seconds, and if no actual traffic was seen by then, it says: sorry, cannot do it, I'm just going to continue booting — because at that point in time the boot script is waiting, as we'd like to have as many daemons as possible starting with the correct time already set. That happens very early in the boot process. Of course, if you have a complex configuration with wireless LANs and whatever, then that's not going to work. ntpd tries its best and then says: sorry, I cannot do it; I'm going to do my background tasks like I always do, but I'm not setting or bumping the time. So there. Sorry, you're out of time. Oh. Okay, thank you.
Let's make people love domain names again
Our next speakers, whose names I forgot, are going to tell us about new developments in domain management. So, we were supposed to be only ten people, so I'm amazed that the room is full. So, Pierre-Olivier, are we ready? Let me check. No, no — the server is not responding. Okay, do we see a cute kitten, for instance? One minute. Oh yes, our internet is not broken, so I don't know what is happening. Does it respond to ping? Let me check. The audience says: thank you very much. Check the DNS, maybe? I don't know how to use nslookup. Makes sense. Okay, will I need to show you Happy Domain later? Oh — oh, wait, I have a clue. Oh, it works. It's always DNS. Thank you very much. Thanks to the sponsors, and first to the volunteers — we are here thanks to them. A warm thank you to them. So, I'm going to start with the first question. Look at these DNS outages. You all know them, and these are the best-known ones, from the big companies — big noise — although their teams are well skilled and are good professionals. What does that mean? That it also happens to small companies, all the time, to all of you. So, like you, we face the same technical issues. The complexity of the DNS system is increasing all the time. Yet it's invisible — not for us, but for all the people who type their URL into the browser. Zones and their records are often badly set up, through ignorance or lack of skills. So there are areas of improvement. Therefore we built a team of experts to try and improve the situation. Your speakers today: Pierre-Olivier Mercier, system engineer and professor in a computer science engineering school; and myself, Frédéric Griteur, contributor to free-software projects and volunteer for decades — like you, through my voluntary efforts and regular funding, one URL here, twenty for another, et cetera. There are nonetheless a lot of good ideas on the Internet — good stuff to find. DNS record assistance, for instance.
Tools to test your records — your zone, your delegation, your email parameters. Some services to monitor your domain: record propagation, for instance, or which domain is soon to expire. There are also online interfaces, sometimes with good functionality. But each provider has its own interface. What does that mean for us, managing several providers? It means learning each interface. And in the end, we end up using the raw mode and making errors. I remember a friend — not me — who missed the final dot at the end of a record. Another example, because it happens to us all the time: a few months ago, I wanted to add a CAA record to my domain. The CAA record is not very well known, yet it's ten years old — so if you haven't set one, do it; it's very easy. I found an online form, but it didn't support parameters, so it was useless for me. It was not open source, so I couldn't help and change it. Happy Domain is open source. So: so much time wasted switching from one tool to another, looking for solutions, digging into the RFC documents, which are not so easy to read. Or maybe all of you have read all the RFCs? Who has read all the RFCs? One, two, three, four. Ah, great. Great, great. I haven't. Apologies. So we would like to offer you a sort of magic wand, in the form of a modern interface, and we named it Happy Domain. It makes settings pleasant, makes the display easy to read; we would like to centralize all the DNS providers and registrars in one place, and forgive errors as much as possible. So let's have Pierre-Olivier do a demo. Let's dig inside the software, because it's five o'clock and you're supposed to be tired. So tell me, how does it work? And I would like you to change a CAA record in the domain for me. We're talking about CAA — could you do that for me? Okay, okay. So I go to happydomain.org. I log in to my personal account, and here are the domains I manage. Today we will make the modification on a domain that is not listed here yet.
We can see there are several providers on the left; I can use this to filter my domains. For the demo, I will use a local authoritative server, PowerDNS, that runs on my local machine. As it is not yet registered in Happy Domain, I click the provider and select the domain I want to manage — today this is happydomain.test. Before I forget, I assign it a group. This is useful, for example, if we have several clients or environments. So here it is, in the happydomain group. And now this is the abstract view of the zone. Instead of displaying the records directly, we group them into services. For example, we have the origin, with the required records such as the SOA and the NS records. And as you can see, there is no technical subdomain here: for example, the DKIM record is not in the list of subdomains; it belongs to the email service. If we look at that service, you see of course the MX record, but also the SPF, DKIM, DMARC records, et cetera. The corresponding records can be displayed here — we see that my domainkey record is listed. But this is not my goal: Frédéric asked me to change our registered certificate authority. Yes, I would like to use Buypass. The certificate authority is in the "certificate authority authorization" service. There is a simple form that assists the user with several choices. Currently we use Let's Encrypt, and Frédéric told me to change it to Buypass. Great, it's in the list. If it wasn't, I could select "other" and write the domain name corresponding to the certificate authority. We can also provide parameters: for example, some certificate authorities can restrict which user can generate a certificate for a given domain name. So here we can provide a client ID — happydomain — and only the happydomain account will be able to issue a certificate at the Buypass certificate authority.
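The record being edited in the demo has a simple wire format, which is what tools like this hide from the user. As a minimal sketch, here is the CAA rdata encoding per RFC 8659 — one flags octet, a length-prefixed tag, then the value; the `accounturi` parameter follows the RFC 8657 ACME binding, and the URI shown is purely illustrative.

```python
import struct

def encode_caa(flags, tag, value):
    """Encode CAA rdata per RFC 8659: one flags octet, a
    length-prefixed tag, then the value (rest of the rdata)."""
    t, v = tag.encode("ascii"), value.encode("ascii")
    return struct.pack("!BB", flags, len(t)) + t + v

# Restrict issuance to Buypass, with an illustrative account restriction.
rdata = encode_caa(0, "issue", "buypass.com; accounturi=https://example.test/acct/1")
```

Getting the tag, the parameter syntax, and the presentation-format quoting right by hand is exactly the kind of fiddly detail a guided form takes off the administrator's plate.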
We can see there are a lot of other settings, for example to restrict wildcard certificate issuance, or S/MIME certificate issuance. And at the end of the form we have the incident response: the iodef record provides a way to contact the owner of the domain, by mail or via a hook. Here, if someone at another certificate authority tries to issue a certificate, we will be alerted about the violation of our security policy. All of that is a summary of the RFCs on CAA records, and it's pretty easy to fill in for a system administrator. So the modification here is made, but for now only inside Happy Domain. We can make several other modifications as a batch, to ensure the coherence of the zone we publish. Here I have just one modification to make, so I can directly publish my changes. The interface asks us to review the change: this is a modification on the happydomain.test domain, for the CAA record, and it changes from letsencrypt.org to buypass.com, with some parameters. This is exactly what we want to do. And, like in Git, you can add a log message to retrieve it easily later. So I apply the change, and that's it. Frédéric, are you happy with that? Great — but sorry, it's only needed in one month. I'm really sorry, my mistake. Could you roll this back? With pleasure. Of course, I could just make the reverse modification, since there was only one — that would be pretty easy — but not in the case of many more modifications. As of today, we support 40 providers. We rely on the DNSControl project, which is led by Stack Overflow, and several providers are added to that project each month. We also support classic authoritative servers like BIND, PowerDNS, and Knot. We do our best for readability; even reverse zones are supported. First, you can review your changes at a glance before publishing them. We archive your modifications, so users can roll back. If required, you can easily export the records, and you can also import a standard zone file.
The icing on the cake: the interface can be scripted. For instance, you may need to create a dedicated subdomain for testing before putting a project into production. Or you can have a local environment for your development, then a pre-production on a domain hosted by, for instance, Gandi, and the production hosted by OVH. No problem: Happy Domain talks to every one of them. There is no need to learn the intricacies of everyone's API; you can just call the same script for all your environments. Use Happy Domain as you wish: online, or you can easily install it on your own server — we have binaries and a Docker image for you. Here is what we have today; you can use it right now, and you're welcome. We're convinced that we can save time and bring peace of mind to teams that manage domain names. Think for 15 seconds right now — this is your time to work. Think about what your main issues are, your main tasks, and your greatest areas for improvement. Then we will do our best according to what you say and how you vote. We don't promise to build everything, of course, but we'll do our best. Some ideas. For instance, today, when you use a domain name provider, you are the only one with access to the setup. We could share it, with different access levels, bringing several people together and keeping track of their modifications. It could also be used to delegate part of a domain, for instance to the marketing department, or whatever. And testing is really important: Happy Domain could perform tests on every record, not only on the delegation with the SOA record. Of course, we could use and display directly in Happy Domain the results of Zonemaster; but for email, we could also count the DNS resolutions for the SPF record. All these tests take time, and Happy Domain could save operational teams a lot of time by aggregating them in the interface.
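The SPF check mentioned above refers to RFC 7208's limit of 10 DNS-lookup-causing terms per evaluation. A minimal sketch of that count (illustrative only — a real checker would also recurse into included records):

```python
def spf_lookup_count(record):
    """Count SPF terms that cost a DNS lookup (RFC 7208 caps them
    at 10): include, a, mx, ptr, exists, and the redirect= modifier.
    ip4, ip6 and all are free."""
    costly = {"include", "a", "mx", "ptr", "exists"}
    count = 0
    for term in record.split():
        if term.startswith("redirect="):
            count += 1
            continue
        term = term.lstrip("+-~?")                   # strip the qualifier
        name = term.split(":", 1)[0].split("/", 1)[0]  # drop argument / CIDR
        if name in costly:
            count += 1
    return count
```

Exceeding the limit makes receivers return a permanent error, so surfacing this number in a dashboard catches a failure mode that is otherwise invisible until mail starts bouncing.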
Happy Domain could also constantly monitor your domain names and notify you when an issue occurs. And propagation time is certainly the least understood part of the DNS, so why not display the theoretical propagation time directly and clearly in the interface? Yes, please. We could also have some dashboards — this is an example for DMARC — so that at a glance you can see your features and how they are doing. And last but not least, with the effort we put into building the Happy Domain API and the forms, we can also imagine using artificial intelligence to create a chat and interact directly with your zone. This is made possible thanks to function calling in recent models and APIs. So please use the Element network of FOSDEM, or log in to our system; use your fingers, rate your dreams and priorities, and help us to help you. So thank you very much. Now the questions. Thank you. We have time for one question. Okay, one question. What time is it? No, that's another question. Please. So, my company uses DNSControl to have DNS managed by Git. Could Happy Domain be integrated with that, so that both work at the same time — so that somebody could edit the DNS using Happy Domain and have the changes committed to Git, where someone else follows them? I don't know. So the question is: can Happy Domain be linked with DNSControl? Currently not. We do not use the JavaScript language used by DNSControl, but perhaps that could be a good idea to pursue. And you're welcome. Thank you. Thank you very much.
dnsconfd: system integrated DNS cache
My next speakers are Tomas and Petr, who will tell us about dnsconfd, which is new to me, so I'm quite curious. So, hi everyone, my name is Tomas Korbars, this is my colleague Petr Menšík, we work at Red Hat, and today we've come to talk to you about our new project, which is called dnsconfd. Let's start with a bit of motivation behind this project. Last year we received a request from a user who required us to make it possible for Unbound to be used as a local DNS cache, and to be able to consume configuration from NetworkManager. In the past we had the dnssec-trigger package for this, but we dropped that in RHEL 9. So, should we reintroduce it? We thought about implementing a D-Bus API in Unbound, just as dnsmasq has, and then implementing a NetworkManager plugin, just as dnsmasq has. But then we realized that if some similar request came in the future, for a different service, we would be doing the same thing over again. So we thought about creating a new project that would serve as a conduit between NetworkManager and local DNS caching services. This project is dnsconfd. Our requirements for it: be able to easily exchange the underlying DNS cache, and to add more services in the future without too much work; support split DNS configuration; and auto-configure without manual interaction from the user. Also, we would like it to use the already present system configuration, defaults, and security features that are built in and that we maintain in our distribution. The behavior needs to be configurable enough that you can change the handling of corner cases and are not caught off guard by behavior you would not expect. Okay. Let's go back in the past a bit, to why Fedora 33 introduced a DNS cache and what it brought us: the possibility of multiple simultaneous VPN connections at the same time. And that's great.
It also made it possible to configure global servers but still reach names which are accessible only on the local network, which is nice for DNS over TLS — but that was not enabled yet, and still isn't. And it brought us an excellent configuration presentation with the resolvectl command, compared to what we had before; that was clearly better. And it also introduced a well-documented D-Bus interface, both for configuration changes and display, and for name resolution. They have a nice article about it, but that's not our job here. So what do we mean by split DNS here? When you connect to a VPN without a smart solution like this, you send all name queries to just that single VPN, and use only your primary connectivity to deliver traffic to the VPN server — and that consumes everything you use. You cannot use any other connection interfaces you have on your laptop or mobile phone, because you use just one DNS server, or the set of DNS servers that the VPN knows. With split DNS, you can send different name queries to different sets of servers provided by the different networks you are connected to at the same time — and most current devices today are capable of connecting to different networks at the same time, including multiple VPNs. All you need are non-conflicting names for them. So, for example, here the names are different, and if names in those domains point to something useful on those networks, you can access them all at the same time. And we could end here and thank the systemd guys, if everything worked great — but sadly, that was not entirely the case. I have listed a few issues I think are important and still aren't fixed sufficiently; there were more bugs in the meantime — some were fixed, some still are not. For example, it prevents any usage of DNSSEC on the host where it is enabled — which it is by default configuration, both on Ubuntu and on Fedora — because it just doesn't forward the DNSSEC-enabled bit set in the queries it receives.
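The split-DNS routing described above boils down to a longest-suffix match between the query name and the configured domains. A minimal sketch, with illustrative domain names and server addresses:

```python
def pick_forwarders(qname, zones, default):
    """Longest-suffix match: route a query to the server set of the
    most specific configured domain, falling back to the global default.
    zones maps domain -> list of server addresses, for example
    {"corp.example.": ["10.0.0.53"]} (names are illustrative)."""
    qname = qname.rstrip(".") + "."
    best, servers = "", default
    for zone, srv in zones.items():
        z = zone.rstrip(".") + "."
        if (qname == z or qname.endswith("." + z)) and len(z) > len(best):
            best, servers = z, srv
    return servers
```

With two VPNs up at once, each VPN's private domains land on that VPN's resolvers, and everything else goes to the default — which is exactly why non-conflicting names matter.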
So any library which is quite capable of using DNSSEC cannot use it, even if your network infrastructure provides the capability. Also, at least on Fedora and Ubuntu desktop, you would be quite surprised that top-level domains often do not exist — because it sends single-label names, without a dot, to just the local interface over the multicast protocol, and if it doesn't find something, which it usually doesn't, it just returns: no, that does not exist. So the com domain does not exist, but the github.com domain, surprise, does. This happens even on the server edition, where I think it is really unwanted. Another strange thing: when a response fails because of a DNSSEC validation failure, it still might contain a valid answer in the response — which is unexpected, and no other implementation I know of does it this way. So "dig +short dnssec-failed.org", even with DNSSEC enabled in systemd-resolved, gives you a very nice address. I've listed just a few issue numbers. So, the lessons we take from this: we want split-DNS functionality, auto-configured; we want the possibility of DNS over TLS; and we want a nicer front end than we had. The systemd people are very good at system integration — they are quite good engineers, I know that — but they lack expertise in the DNS protocol area, and I'm afraid it is visible. At the same time, DNS resolver people are excellent in the DNS protocol area, but their integration into the system is often very limited, or at least plain. We think only the integration is missing, and that is what we are trying to provide. So we want to reuse existing functionality. We want to provide a common interface to set forwarding to different servers, so it doesn't change much, and we want to provide a nicer front end for showing what is configured, regardless of which DNS cache is used in the end.
So, what do we need for split DNS? We need some local address which receives queries from applications — usually localhost. We need the ability to configure different domains to be forwarded to different sets of servers, and of course some default for the root, forwarded to the global default. And we want the ability to reconfigure the service without stopping it — without flushing the entire cache by starting it again. Here is the list of servers we have in Fedora, and I think all of them are able to provide split-DNS functionality; most of them are also able to provide DNS-over-TLS functionality. But only dnsmasq — apart from systemd-resolved — has any D-Bus capability, and that is quite limited, and dnsmasq has its own issues. So our approach is: use what already exists, provide just the front end and component coordination, and do not reinvent the wheel. We do not want to handle DNS queries ourselves in our service; we want proper services to do it, and we just provide configuration for them. And as I have already shown, almost every open-source resolver has that ability. Because we are not handling queries, a single-threaded application is enough, and we wrote our prototype in Python to verify this would work. What we also want is to change the /etc/resolv.conf file only once we have verified the basics — that the service is running — and to restore it when our service is stopped. I really hate what resolv.conf looks like when you uninstall something and have to fix it by hand. And we want a standalone daemon, because we do not think everything should be primarily configured in NetworkManager; there should be some unified way to configure it, and whether systemd-resolved or our daemon is used should not change anything — it should be just an implementation detail. And we think the common part is the biggest one, and just a very small cache-specific module is required to implement different caches. What we plan to support is what we have in RHEL: primarily Unbound, and also BIND and dnsmasq.
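To illustrate what such a small cache-specific module might emit, here is a sketch that renders a domain-to-servers mapping as unbound forward-zone stanzas. The `forward-zone:`, `name:` and `forward-addr:` options are real unbound.conf syntax; the shape of the input mapping and the function name are assumptions for the example, not dnsconfd's actual interface.

```python
def render_forward_zones(config):
    """Render an internal {domain: [servers]} mapping as unbound
    forward-zone stanzas (a sketch of the cache-specific step)."""
    lines = []
    for domain, servers in sorted(config.items()):
        lines.append("forward-zone:")
        lines.append(f'\tname: "{domain}"')
        lines.extend(f"\tforward-addr: {s}" for s in servers)
    return "\n".join(lines)
```

The point of the design is that only this rendering step differs per cache; the D-Bus front end and the internal representation stay the same for Unbound, BIND, or dnsmasq.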
And we want to provide basic compatibility for services using the systemd-resolved D-Bus API directly, because some things already use it — but we do not want to implement every aspect of what they implemented, because we do not think that is necessary. So, how does the flow of configuration look right now? NetworkManager receives its list of DNS servers from either DHCP or the connection profile, and then it pushes the configuration through the D-Bus API into dnsconfd. dnsconfd translates this configuration into an internal representation that we think is general enough for most underlying DNS caches, and then uses the specific module to transform it into the configuration used by the underlying service. For Unbound, for example, it is a list of forwarders. How does the system integration look now? dnsconfd uses the already existing unbound service that we ship and support, so it respects the defaults, security features, and configuration that we ship. We inherit the systemd-resolved D-Bus API, so we work as an in-place replacement as of now. We use the default system configuration that is provided, and then we watch for changes in the underlying DNS cache, so you are not caught off guard by a sudden inability to resolve domain names. Here's the life cycle of our program. dnsconfd itself is implemented as a systemd service, so you can inspect it as you would a normal systemd service, and it is started either on boot, when it is enabled, or when configuration is pushed to it — because it is D-Bus activated, systemd starts us on the first configuration push. After we start, we start the underlying DNS cache, we check whether it is ready or not — because some polling is needed right now — and we wait for the configuration provided by NetworkManager. After that, we watch for status changes and perform actions as needed. Here are some memorable issues that we've encountered.
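The readiness polling mentioned above can be sketched as a simple retry loop. The probe would be something like running `unbound-control status`; the function name, attempt count, and delay are illustrative choices, not dnsconfd's actual values.

```python
import time

def wait_ready(probe, attempts=10, delay=0.5):
    """Poll a readiness probe a few times before giving up.
    Polling is needed because the systemd start job can finish
    before the cache's command channel is actually open."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

A bounded loop like this trades a short startup delay for not pushing configuration into a daemon that cannot accept it yet — the instability the speakers hit in testing.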
The first one is a war over resolv.conf, because NetworkManager finds out whether systemd-resolved is running by checking the existence of some symbolic links on the system, and we cannot own them because they are owned by the systemd-resolved package; and if they are not present on the system, then NetworkManager always tried to override our modifications of resolv.conf. We got around that by implementing a command that pushes lines into NetworkManager's configuration and stops it from touching resolv.conf. We argued about whether it is better to execute the underlying service as a subprocess or as a systemd service; the subprocess approach provides an easier way to monitor whether it is running or not, but then I was persuaded by Petr that the systemd service is better, because we use things that we already have in place. There was the issue of whether unbound is truly up or not, because the start job was finished but the command channel was not open yet, so we faced some instability during testing; we got around that by polling a few times. And we need to update only zones that were changed in configuration, so we hold the current state that is set in unbound and update only the zones that are required. And we thought that implementing this over D-Bus would be easier than it really proved to be. We've created a way of testing this: we are using TMT, the test management tool, with containers, which allows us to simulate some network behavior in a way that verifies the actions of dnsconfd. If you ever want to contribute, this set of tests will verify that you don't change behavior that is already in place, or you will be able to just show us where we are wrong and what you want us to change. Okay, so what is working already?
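A quick aside on the readiness issue above: the polling workaround amounts to a small retry loop. This is a generic sketch with a stand-in probe, not dnsconfd's actual check.

```python
import time

def wait_until_ready(probe, attempts=5, delay=0.1):
    """Call probe() up to `attempts` times, sleeping between tries;
    return True as soon as it reports readiness, False if it never does."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False

# Stand-in probe: "ready" only on the third check, the way unbound's
# command channel may come up shortly after the start job finishes.
state = {"checks": 0}

def fake_probe():
    state["checks"] += 1
    return state["checks"] >= 3

print(wait_until_ready(fake_probe))  # True, after three checks
```

As the speakers note later in the Q&A, a readiness notification from unbound itself (for example via the systemd notify protocol) would make this polling unnecessary.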
I admit we wanted to present much more here, but it proved not so simple. So: split DNS configuration as pushed from NetworkManager already works. /etc/resolv.conf is changed only while our daemon is running and is restored when it stops. Unbound support is the only one we have at this moment, and the implementation uses only the D-Bus interfaces of systemd-resolved, and at this moment also only its D-Bus name, so either dnsconfd or resolved can be running, but not both. We reused NetworkManager's systemd-resolved DNS plugin for now, because it pushes configuration over D-Bus, but in the future we want to get rid of it and make our own, or use more parameters than just the IP address. And that is what we would like to use, unlike the opportunistic way systemd-resolved used, because these RFCs were not defined at that time, and we think this is the correct way. Support for multiple caches running at the same time is not usually necessary, but it would be very helpful for some kinds of testing. We would like the ability to forward over DNS over HTTPS, but there is a problem: no DNS cache we have in RHEL supports that, and in Fedora there are only a few. It is similar with DNS over QUIC. Auto-configuration of DNSSEC would also be nice; we would like some successor to, and better implementation of, what was once attempted with dnssec-trigger, but hopefully better accepted. And maybe, if there is time, sometime in the future: a rewrite into Rust, and reducing the memory required for our interfaces. That would be all from us, so if there are questions, now is the time, and if we can't answer them, please use these mails or file an issue on the project. Definitely stick around for the next speaker, who will talk about the Rust domain crate. Thanks for the talk.

Questions? Would it be helpful for unbound to have a D-Bus connection where it announces when it's ready? No, I don't think it needs to be a D-Bus connection. I think we need the proper libsystemd notify event, which it kind of supports, but I think the last time we tried to enable it in Fedora it started crashing. So it's not built in, but some kind of support is there; we just need unbound to tell us, "I'm the service, I think I'm ready." There's a systemd API for that, and we should use it whenever possible; it doesn't have to be D-Bus.

Visek? If you only want to communicate with local DNS servers over D-Bus, and I understand that you want to drop the NSS resolve bridge, how do you want to overcome this? That is the first part of the question. The second part of the comment is that D-Bus starts in parallel, which means a resolution can be needed early in boot, before the D-Bus server is up, which is why the bridge is always useful. So we had a plan to add a private interface. The second question was: do we plan to add a running interface? No, I don't think we want that.

The first question was about a "get additional info" API: how can you send additional information, for example about multiple interfaces? How can I say from which interface a query comes, or request a query just for selected interfaces? We don't want to, because in what cases is this needed? I think NetworkManager needs that just to verify the connection works. I think we might have a different service which will just let you ask us, "tell me the address resolved on this interface," and we will send the query just to the correct addresses, because we know which addresses are used for that interface. But that would not be served by the local cache, because it is not yet configured for that. Might it make more sense to take this separately, after the session? Because it seems quite specific. Yes, yes, it might. Any other questions? No? Thank you again.
Domain: A modular Rust DNS toolkit
Our next speaker is Martin, who will be telling us about the domain crate he's been building for Rust, and all the cool tools you will be building with it. Probably, yeah. Hello. Back in 2015, so in the times before, two things happened. One was that Rust released version 1.0 and became stable. The other was that I started working at NLnet Labs. You probably know Rust as the thing that everyone wants to write everything in, and you might know NLnet Labs from things like Unbound, NSD, OpenDNSSEC, and some more things. So I thought: I want to teach myself some Rust, and I need to teach myself some DNS, so why not combine these things? Now, what also happened at the same time: Benjamin Fry was working on Trust-DNS, which is now called Hickory DNS, and I figured, well, let's not do the same thing. That would be silly. So I came up with a different idea, which was: instead of building a giant application, or a set of applications, build lots of building blocks that people can then use to build their own specialized DNS applications. Because what happens is, we have a lot of good stuff for generic use cases: resolvers, primary servers, that sort of stuff. But if you want to build your own specialized DNS application, that's actually surprisingly hard. I found that out the hard way. For a project that we were doing, I needed a very specific primary server that had something like a REST interface, and then instead of people actually telling me which specific resource records to publish, they just wanted to tell me, "I want this thing to happen." And I built this in Python at the time, with Flask and the ldns Python wrappers, which are surprisingly weird. I would rather have had something where I could just build this all in Rust. So I think this is really a good idea. And so I started working on it, learned a bunch of Rust, learned a bunch of DNS. And where are we now with it?
This is the starting page of the documentation for the last released version, 0.9.3. Apparently quite a lot happened before 0.9 already. So there's a bunch of things, as you can see here. "base" does the handling of basic DNS data. There are types in there for all the things you could think of: domain names, resource records, record data, all the rtypes, and all these various types. And also messages, complete messages, so you can build your own messages and whatnot. There's "rdata", which is a massively incomplete set of record data types. We only did the ones that we need, because this is still very early, and whenever you change something and you have, what is it, 6,000 record types, you have to change all of that, which would be annoying. So we limited ourselves to the ones that we need. But there's a lot of stuff in there already. There are all the basic ones. There are the DNSSEC ones. Someone even contributed the SVCB and HTTPS ones; we even have that. There is a stub resolver, a very simple one, just as a proof of concept, although that's, I think, the thing that most people actually use. I wrote a signer. Well, it's not complete, because it doesn't do NSEC3, but yeah. I wrote some TSIG support that is actually complete. And that was kind of fun, because it happened right at the time when there was an update to TSIG, so I contributed back some thoughts on the quality of the RFC for TSIG; the original one is surprisingly weird. Someone contributed some very basic validation things; I think it's basically just validating RRSIGs and so on, looking at DNSKEYs and stuff like that. And we have a zone file parser in there, because everyone has one. And this is actually a second iteration. I wrote the first one, so I wrote the second one, which I'm probably also going to throw away to write a third one. But we'll see.
This also has to do with the fact that, for NSD, a colleague recently built a zone file parser that uses, what is it, those multiple-data SIMD things, and that is ridiculously fast. So now I'm kind of embarrassed. So it actually kept growing quite nicely. But then: oh no, distractions. We did a thing for routing, RPKI, for which we wrote two products, Routinator and Krill, which are quite popular in that field. But it wasn't actually that bad, because both of them are written in Rust, obviously. And these are actual products that are actually used by actual people. So we got a lot of experience with writing Rust, deploying Rust, and also got a bit more comfortable with doing things in Rust, because if you listen to a lot of people, they're all "blah, blah, blah, fad, go away." But actually, nobody really cared. If you build an application that works, that is very convenient, then of course everyone will like it. And that then meant that we could take a step back. The DNS is changing quite a bit currently: there are all of these new things, like the HTTPS record, there are a lot of new transport protocols, there's a lot of stuff happening right now. So probably the applications for DNS, the use cases for DNS, are changing too. And I think it's a good idea to explore that space and see where we can get by providing a lot of building blocks that you can put together quickly, to just play around with things. The Sovereign Tech Fund, a German funding organization for fundamental internet infrastructure, agreed with us. So they funded a year of development of this for us, which allowed us to spec what we're going to do with domain this year, and allowed us to have about three people work on this full time, more or less, this year. So what's the plan? We came up with three tracks. First is the client track, which basically is all the things that you need to be a DNS client.
So, the thing that sends requests and then receives responses, matches responses, and also preprocesses responses. The three things that we have on our list right now: basic transports, where we're currently only focusing on traditional Do53, DNS over port 53, so UDP and TCP; DoT and the other ones are coming, but a year is actually not that long, surprisingly enough. We're doing response caching, and we're doing, that's going to be fun, DNSSEC validation. The second track is the server track: the thing that receives requests and then figures out what to respond with. Again, basic transports. We're doing all the things that you need for zone handling. The current plan is to have zone file parsing, which, as we've seen, we already kind of sort of have; we just need to make it nicer, which also probably means we need to add lots more record types. Then stick all of that into a zone tree and use that zone tree to answer queries. Straightforward enough. Of course, DNSSEC signing is a thing. As I said, I already have parts of that; we need to actually turn it from a proof of concept into an actual thing that you can use. We're going to do zone transfer, so that with this server track you can technically build your own authoritative, which would then solve the use case I heard earlier. Then we have a third track, which we just call the bonus track. The idea is: well, we can build all of these things, but we don't know if they're any good if we don't use them. So let's build something where we actually use our stuff and put it together. One idea that we have is a DNS proxy. Sometimes it's called a forwarder (all this terminology is terrible and non-standardized), but basically a thing that sits somewhere and receives requests, but doesn't do the actual recursive resolving; it just forwards them to someone else.
What we've heard from various people about what they need is a way to decide what to do with a request based on a set of rules. Look at the request, look at where it comes from, look at the rtypes, that sort of thing, and then say: I want to process it, I want to forward it to some other recursive, I want to respond to it here, or I just want to drop it, because it's just the wrong thing. So the long-term goal for this thing is to build something that can do that, with maybe a scripting language and whatnot. But the initial proof of concept will basically just be some configuration; maybe you even have to write the rules in Rust by hand for now. But no, there's even more. We also want to do a diagnostics tool. I'm not going to say "dig." Mostly because we need it for testing anyway. But then we thought, well, just re-implementing dig is kind of boring. Let's look into something more useful. And some of the things we thought of were along the lines of what DNSViz does: have a thing you can use to check if the DNSSEC for your zone is set up correctly. Another idea was to compare your resolver, your upstream recursive: what data it actually has compared to your authoritative. Stuff like that. So if you have ideas, we're definitely open to them. In the first stage, we just want to see how this thing is going to look. And then finally, we have a bunch of things in ldns which were intended as examples and not as actual applications for people to use, which people then actually used. And that's kind of annoying, because we're actually not maintaining them; they're just there. And then people sort of, you know. So the thought here is: well, maybe we should actually make these official in some way, shape or form, and make them available. So things like: check that your zone is correct, sign a zone, and whatever else there is.
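The rule-driven dispatch idea for the proxy could look something like this toy sketch. The rule shape, the request fields, and the action names here are all invented for illustration, since the real design (a configuration format, perhaps a scripting language later) is still open.

```python
# First-match-wins rule dispatch: inspect a request, pick an action.
FORWARD, ANSWER, DROP = "forward", "answer", "drop"

def decide(request, rules, default=FORWARD):
    """rules: list of (predicate, action) pairs; first match wins."""
    for predicate, action in rules:
        if predicate(request):
            return action
    return default

rules = [
    (lambda r: r["rtype"] == "ANY", DROP),             # refuse ANY queries
    (lambda r: r["source"].startswith("10."), ANSWER), # answer LAN locally
]

print(decide({"source": "10.0.0.7", "rtype": "A"}, rules))       # answer
print(decide({"source": "203.0.113.9", "rtype": "ANY"}, rules))  # drop
```

Anything that matches no rule falls through to the default action, here forwarding to an upstream recursive.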
And I think all of that will keep us quite busy for the year. And we're also hoping that by the end of the year we actually have something that is useful for people. So if you have ideas, if you want to... No, we don't have that yet. Sorry, what was the question? Oh, sorry. The question was about support for ZONEMD. ZONEMD, right. Yeah, no, we have none yet. Doing ZONEMD will expose all the mistakes you made in your design. Excellent. And/or carve a path of destruction through your code base. Yeah, but that's what it did for us anyway. So we should probably do this as part of designing the zone tree thing that we're working on right now. Yeah, that's a good idea.

Can you maybe elaborate a tiny bit on the actual Rust experience for this? What did you learn, anything that surprised you, in the DNS part or in the Rust part? Yeah, so the question was what the experience was with Rust and DNS in all this. Well, this was basically my first real Rust project. So, like a kid, you take everything as it is, and everything is how it should be, because that's what it is. I think DNS wasn't all that surprising in and of itself. The situation with the RFCs, and stuff being hidden where you don't expect it, is super annoying. It absolutely helps to have colleagues who have been doing DNS for 25 years just one desk over. But all in all, I didn't have any surprises I could think of. Which is great fun. I can recommend doing DNS.

Any particular Rust libraries you're using for parsing, for putting together the packet that goes over the line? So the question was whether we're using any other dependencies for parsing and whatnot. No, we did all of our stuff ourselves. Because ultimately it's relatively simple, right? It's just binary: you just go over a sequence of bytes and you pick out the things. And I don't think it's worthwhile to pull in some complicated thing for it.
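That "go over a sequence of bytes and pick out the things" approach can be illustrated with the simplest piece of DNS wire format: a domain name encoded as length-prefixed labels terminated by a zero byte. This is a minimal sketch (in Python for brevity, not code from the domain crate) and it deliberately ignores compression pointers, which a real parser must handle.

```python
def parse_name(buf, offset=0):
    """Parse an uncompressed wire-format domain name starting at
    `offset`; return (dotted name, offset just past the name)."""
    labels = []
    while True:
        length = buf[offset]
        offset += 1
        if length == 0:          # root label: end of the name
            break
        labels.append(buf[offset:offset + length].decode("ascii"))
        offset += length
    return ".".join(labels), offset

wire = b"\x07example\x03com\x00"
name, end = parse_name(wire)
print(name)  # example.com
```

A full parser adds bounds checks and the two-byte compression pointers from RFC 1035, but the basic shape is just this walk over the buffer.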
And what we did, which might be interesting for other people, is that we didn't want to stick to a specific representation for these octet sequences. So we have abstractions over, like, Vec, or just slices, or Bytes, or whatever, and made all of the basic types generic over that. Which makes the usage a bit iffy, because you now have these type arguments, but it's super flexible. In theory, you should be able to use this in an environment without an allocator. So if you built this on top of arrays, byte arrays, then you should be able to do, probably only over UDP, a little DNS client for an embedded environment. Which was one of the use cases I was looking into and thought interesting; probably not very widely used, but fun.

Any other questions? Do you have a zone file parser? Yes. Does it preserve comments? No. This keeps happening, if you didn't know. Is that something you would want? Yes. Okay. Any other questions from the audience? Oh, yes, sorry. The question was whether we had a zone file parser and whether it preserves comments, which it does not. It currently really just parses things into data structures, so it's not a manipulation tool.

Talking about zone files: in BIND-style files there's also the convention that you can leave out the owner name and have it refer back to the previous record, without having to repeat things. That's awesome. So the question was, I guess, whether it's compatible with whatever BIND does. Yes, which it is, because that's the de facto standard. Which is also super annoying, because the RFC is very... It's true. It really is. We're also looking at this because colleagues were building a new zone file parser for NSD, and that has to be compatible with whatever we have, or at least with whatever NSD was doing before. And I think they're mostly compatible.
So what we're also looking into is maybe working on a sort of minimum definition of what a zone file should look like if it is to be portable between implementations, and have that as an actual grammar. That would be really cool, but it's also loads of work, so I'm not sure if this is going to happen. There we have a question. Yes. I had a question. Have I understood it correctly: you wrote building blocks for building an entire server, quite powerful, but you are never going to go forward with writing that server yourselves? Yes. So the question was whether we have plans to build an actual server, and the answer is: not currently. The horizon we're looking at is about five years, and we don't have a plan for that. But also, we're very flexible; stuff comes up. But currently we're very happy with what we have with Unbound and NSD. Yeah. And I'll be around in a second, of course. Thank you very much. We promised stickers; there are loads here, so if anyone needs more decorations.
The first 13 years of blockchain name systems
What do I call you, Neiman? I don't mind, nickname or real name, whatever is easier for you to pronounce, because Eyal is a bit difficult sometimes. Eyal? Yeah, I mean, Eyal, that's the name. I use Neiman because I live abroad and no one can pronounce it. Sorry? Yeah. Neiman seems a bit easier sometimes. Peter, do we have audio? Yeah. Unmuted. Unmuted? Right. Right. Your controls are here. You have 30 minutes. Welcome to the DNS developer room; this is our final speaker for the day, Neiman, and he will talk to us about the history of blockchain naming systems. Okay. Thank you. Thanks, everyone who stayed till the end. I imagine it was a great day; at least it was for me. I'm going to talk today about the history of blockchain name systems. I'm Neiman, or Eyal. I'm from Israel, but I live in Poland. And, oh, that's fast. And I'm a mathematician, but I have worked on peer-to-peer websites for the last few years. If you don't know what that is, don't worry, because the main important thing for this talk is that those websites use blockchain name systems. So I had a chance to talk with the developers of the main ones, even being engaged a bit in some of them, and that's why I'm doing this talk. These are some projects I did which use blockchain name systems. Don't focus on that, because the talk is not about me. I know that blockchain has a bad connotation, especially, I guess, in this setting. I'm not here to change your mind. I'm here to tell a story. And the story begins in 2001, when a guy called Zooko Wilcox sent his friends a draft of an article he wrote, and it began with the words: "Please do not propagate this information widely yet. I'm still working on it." Did they respect it? Absolutely not. It was propagated so hard that by now there is a Wikipedia page on it, called Zooko's Triangle.
Zooko's Triangle basically says that there are three properties a name system can have or not have. One of them is secure: secure means that two people cannot register the same name. The other one is human-meaningful, which basically means you can choose which name you register out of the ones which are available, and hopefully, because you're human, the name will have some meaning. And the last one is decentralized, which means that in order to register a name or to verify a name, you can do it yourself, without needing someone else, like a third party. And Zooko's Triangle says that any specific name system can have at most two of those properties; you cannot have all three. Here are some examples. A name system that I guess everyone here knows: DNS. DNS is human-meaningful, for sure. It's also secure. It's not decentralized in the definition of Zooko's Triangle. Public/private keys: secure, yes. Decentralized, yes: you can generate one yourself, and you can verify someone else's on your own. But they are human-meaningless; most public keys are a monster. And my favorite one, the state ID, which is secure, but otherwise neither decentralized nor human-meaningful, which I think is a shame. I would love to be able to choose my state ID, but, well, states. Zooko's Triangle was generally considered to be true for the first decade of this millennium. It was well known within the name systems community: you can only have two, and you shouldn't try to build one that has all three. 2009: Bitcoin is invented. And shortly afterwards, a year later, in some of the Bitcoin IRC chats, people started to say: hey, can we put names on a blockchain? This continued in the chats and on the Bitcoin Talk forum. At some point, the legendary Aaron Swartz heard about it.
And he wrote an article, "Squaring the Triangle," which basically says: if we put names on a blockchain, we can actually go around Zooko's Triangle and have a name system that has all three properties. You can argue whether a blockchain is really decentralized or not, in the sense that the requirement was that you can register and verify yourself, not register and verify with a blockchain. But for the sake of this talk, we think about the blockchain as a Big Dumb Object. It's a tool; it does what you want. I know it's not. I know that each blockchain has its own pros and cons. I'll be happy to argue about each of them afterwards over a beer, but not right now. "Big Dumb Object," by the way: I'm a huge science fiction fan, and it's a term from science fiction, a subgenre of books that feature a big dumb object that does something. That's Ringworld by Larry Niven, a classic science fiction book. If you haven't read it: I read it as a kid, and I hope it's still fun now, but I really loved it. So, 2011: Namecoin was launched. Namecoin basically did exactly that: putting names on a blockchain. Here are some interesting trivia details. The names it put on the blockchain were not actually names. It was just around 250 bytes on the chain, so you can put in a sequence of zeros and ones; whether it's a name or not, or whether you interpret it as ASCII or Unicode or whatever, is up to you. No one verifies anything besides the fact that the same bytes were not put there before. No subdomains, because all you put is bytes; it's just names that you register. They did have something called a namespace, though. It was in the software layer, not on the blockchain. I want to point it out because the developers had basically two that they were promoting. One was d/, which was for domain names for websites. But the other one was id/.
And that's important, because this already shows that the thinking was that those names are not necessarily for domains of computers; they can be used to identify people. The cost was 0.1 Namecoin; NMC was the coin, the currency of Namecoin. Adjusting it was very difficult: you can raise it in a soft fork, but to reduce it, you need a hard fork. Also, how much it really costs in fiat money depends on the moment you buy it. And this fee didn't go to the developers or to finance anything; it was just burnt, because in blockchain economic thinking, burning money is how you make money. There was a lecture about that a few days ago. These are the last blocks of Namecoin. One transaction basically just means the miner, which means no one did anything there. As you can see, at the moment, I think it's a project which is still being maintained, but not really being used. And there's a question: why did it fail? Or at least, I think it failed. Namecoin people here, I apologize. I think there are two things they did, maybe, wrong. First, they really copied Bitcoin's playbook one by one. But names are not money; it's a different animal. You can believe that a hundred coins have value, and it's okay, it's not contradictory: you go to one store that accepts dollars, you pay in dollars; another store takes euros, you pay in euros; another one wants Bitcoin, okay, you pay some Bitcoin. It's not contradictory. But no one wants to think that the same object has two names. This is not how it goes. Historically, if I think that some god has one name, and you think that the same god has another name, there's a good chance we will go to war; we will not accept each other's beliefs. But the other reason, which is maybe deeper, is that the Namecoin developers had the huge challenge of building it. It was the second blockchain. It was the first NFT blockchain. It was the first side chain; they had to invent merged mining.
And also, after it was launched, it was definitely not scalable, and I don't think it's very scalable right now either. And they spent lots of their time improving the protocol and handling all those technical details, and they didn't have time to also think: how do I make it useful? What is it good for? And, you know, pushing it to users. So, 2016: as I said, I entered the blockchain ecosystem. I asked people about Namecoin. I even bought one, I think, just a name, but I'm not 100% sure what I intended it for. The general feeling was that all the good names are squatted, there is nothing to do with it, and names on a blockchain are nice for playing, but not really a useful use case. And in the same year, ENS was announced. ENS is a very different animal from Namecoin, because it is built on top of Ethereum. And if you don't know anything about blockchain, you should know that writing an application on top of Ethereum is much easier than building a blockchain. Which means ENS, which is really well written and a nice engineering feat, was still easier to write than Namecoin. So they actually had time to have long discussions about how to get people to use it. And they did two things. One of them: they said, okay, names are going to have an auction. So it won't be the fastest person who takes a name, but the one who agrees to pay the most. It's not necessarily the best solution, but at least they tried something. But the other thing, which again I see as very crucial, is that they had updates. They could update their system relatively easily, and they were very open about it, because when they launched, on May the 4th, 2017, they called it the ENS Temporary Registrar. And in some of the messages they even said: you know, we are not sure how to do it right, so that's why it's temporary; at some point it will be changed, be prepared for it. At the time, May 2017, it was before the DAO hack. So it was not really common in blockchain to say that you are going to change things.
This was still the time of, you know, immutable programs and "code is law" and such. How did it go? Well, it went the same way as Namecoin: quite successful commercially in the beginning. I think someone put a bid of $2.6 million on the name exchange.eth. So that went quite well. Like with Namecoin, the money did not go into the pockets of the developers; instead it was locked. It was a deposit, and the moment the name expired, you got it back. Which, if you want to fight squatters or speculators, is not necessarily the best idea, because they have nothing to lose. A year passed, and another blockchain name system was announced: Handshake. And I like to say that Handshake took one step backwards, three steps forward. I think that kind of represents it. The step backward was that ENS was built on top of a blockchain, which could be very flexible. Handshake said: well, we are actually going to build our own blockchain. Already in 2018, having your own proof-of-work blockchain without updates was outdated. I remember hearing about it and thinking: OK, that's two years, at least, too late. But this gave them the ability to do something that the other name systems didn't do, and I don't think anyone else does at the moment besides them. Because, as I said, decentralized means registering a name and verifying a name by yourself. But actually, verifying something on a blockchain is very difficult. In the worst case, you need to have the whole blockchain, which is huge. In the better case, you only need to have, like, 30 gigabytes of a proof. And that's not very practical for a name system. And Handshake really made an effort: the whole white paper is about having short proofs, of a few kilobytes, that this is the name owner and that's the data attached to it. The other thing that they did is a gift economy.
I think, I know, that this is from Cory Doctorow's books, but at the time this was very popular among the burners. Handshake was actually the first one that said: we want to replace ICANN, we want to be the new root of DNS. And people were buying it. Namecheap bought a Handshake domain for 750K. There were people participating in auctions. And I checked: people still participate in auctions now, not in these amounts, but it seems to be a thing. There were some other funny stories: Sci-Hub joined Handshake and then left two days later, because they thought they got a domain on the blockchain, but actually they got a subdomain from someone who has a domain on the blockchain. So there was nothing decentralized about it; it was a misunderstanding. But besides those things, I don't think there was significant usage of Handshake, definitely not at the time. We'll get back to it towards the end, when we speak about what happens nowadays. But at the time, it was mostly, like the other blockchain name systems, buying and selling. So things were a bit grim at this point, in 2020. But don't worry: new decade, things are going to get happier soon. Before that, one year later, the ENS permanent registrar was launched. They took two years of lessons learned and actually changed things. The first thing they did: auctions out. Because in the first few weeks, people actually participated in auctions for some specific names, like exchange.eth; but by the time I wanted to buy an ENS domain, which was "neiman," no one participated in the auction besides me, and it was just an annoying process for the user. So they said: auctions are good for the beginning; afterwards, you don't need them. Which I think makes lots of sense. The other thing they did: by this point, they were almost broke. I mean, they started with a million dollars in grants from a few foundations.
I don't remember if they got anything else along the way. But time passes, you have to pay people's salaries, and they were almost broke. Their idea was to be a non-profit that lives off donations and grants, but in 2019 the blockchain world had a winter, and no one gave them any money. And then they figured out: well, there is all this locked money, and why do we actually lock it? It's not good for anything. It's not protecting against squatters, because they can still try to squat on names; if they don't manage, they just get the money back. So they made the next step: the money goes to the ENS organization, this NGO, which means it's supposed to be fed into development. And overnight they went from an organization which was almost broke to an organization that has millions of dollars. This was important. I was already developing for ENS before, but it was a side project. When this happened, you start to think, as a developer: well, maybe I should take it more seriously, because now they have money that they have to give to someone. They are an NGO; they are legally supposed, by their declaration, to give it to the ecosystem. They hadn't given it to anyone yet, but it was sitting there. Another thing they did is that they changed, or defined, what their names are for. They said: this is a web3 identity, or more specifically, because web3 is a very annoying marketing term, an identity to be used in the Ethereum ecosystem. And I think they actually managed to do it quite well. Their director, Brantley Millegan, did, in my opinion, magic. He has an infinite amount of energy. I wrote some message in the ENS forum, and immediately he said: hey, let's set up a call and meet. And he asked whether I wanted to build more for them. He had ideas. He started doing all those things where he asked people on Twitter to change their handle to their ENS name, to show that people actually use it as an identity.
At conferences, people started to use it; at Ethereum conferences it is their identity, their name, like neiman.eth. He was really pushing it well. And I got to see it all from the front seat, because at the time I was working on a project which was a search engine for the decentralized web, for ENS plus IPFS websites. So I got to see how every month more and more people got an ENS name. There was more buzz, and people actually used it as an ID. I'm not saying it was a huge thing, but it was a thing. There was a use case for it, and before, there was none. But still, when people asked me: hey, are you going to do something professional with it, are you going to build a serious big project or business on top of it? I was saying that I'm not sure, because the root of ENS at the time was held by a multisig of, I think, seven people, which is quite risky. Forget the centralization; it's just quite risky. If I do a project into which I put a lot of effort and investment on top of ENS, and then something gets hacked in a multisig of seven people, which is very easy to imagine, then what do I do next? So I was telling this to everyone. I also told it to the ENS people; I'm not sure if I said it directly or implied it, and I'm pretty sure I'm not the only one who mentioned it. And then we reached November 2021, when a very significant thing happened: the ENS DAO was announced. A DAO is a decentralized autonomous organization; if you're not from the blockchain ecosystem, it's okay if you don't know it. The idea of a DAO is that control goes to the community. And it got out of crypto Twitter: I mean, my mom's neighbor, who has nothing to do with blockchain, told me that he bought an ENS name. And I was like: oh, I'm working with it, that's nice. I think that lots of people who are now active in ENS joined at this stage, not because of the money, just because they heard about it. It made an impact: a big project that gave control to the community.
It's also a fit if you want to work on blockchain but don't want to get into all the protocols, and you're not interested in money: a name system is something which is a bit easier to understand and clearer. The ENS DAO is very active nowadays. I was a member of the ENS DAO for the first year; I was managing a subgroup on decentralized peer-to-peer websites, which is what I did at the time. I don't do it anymore, but I still follow it a bit, and I know lots of people there. It's super active: the forum is active, there are calls every day, there are votes. For good or for bad, a really active community. And at some point, I don't remember exactly when, they actually transferred the root key ownership to the ENS DAO, which means it's now owned by the DAO. There is one problem, or maybe two. The first one is that ENS voting goes by the ENS token, and you can buy the token, which basically means that someone who is rich enough or motivated enough can kind of take over the organization. If you want it to be critical infrastructure of the internet, that's very risky: if at some point it becomes that, then someone will take over. I mean, if someone can, then they will. The DAO could at some point decide to do voting by reputation, but at the moment, this is the situation. And the other thing: while Handshake has short proofs, ENS has no such thing. To verify anything on the Ethereum blockchain, you need quite a long proof. It's not very practical for anyone to do unless it's really your passion, like me, and even then it's super difficult. I don't know what the technical way to solve it is, if any; right now everyone compromises on that, and they actually verify things via other services. 2023, which for me is today, because we are at the beginning of 2024. The state at the moment is that once the ENS DAO went live, they had a huge market cap. Even during the crypto winter, they had quite a buzz.
People started to make clubs, like the 10k Club of people who own the names 1.eth to 9999.eth. There was a website for clubs and stuff like that. It made an impact, and as a result, every blockchain now has its own name system, because it's just easy to make, and they see that there are people who will pay for it. I know people that buy a few names in each of those blockchain systems, because normally they are quite cheap, and they say: well, we don't know which one is a good investment. There are articles like "top 10 blockchain domain name systems". I admit that for a while I was trying to follow that, but I didn't find any that had technical innovation, which is what I care about. And I reached the point of saying: well, if something technically innovative happens, someone will tell me. ENS itself at the moment is focusing on a few things. One of them is subdomains. They want subdomains to be kind of like domains, so you give someone a subdomain and then they own it; it's not dependent on who owns the parent name, it's completely independent. They have something for this called the Name Wrapper; it was developed over many years and launched last year. CCIP is basically cross-chain interoperability, which means how ENS can communicate with other blockchains. I am not a huge fan of that; I think it centralizes a decentralized technology, but lots of people like it. And they really want to join ICANN. Like, they really want to get control of the .eth TLD, but the problem is that .eth is reserved for Ethiopia. Nick Johnson, the founder of ENS, had a long thread about it recently, like a month or two ago. So if you want to read the details, and where they ended up in this discussion with ICANN about getting it or not, you can see it there.
As for the other projects: Handshake. I went and checked just before the lecture what's going on there, and I got the feeling that not much has changed since the launch, only that there is less enthusiasm now. People still participate in auctions, with less money. I didn't find any real use case besides that; if anyone knows of one and I missed it, let me know. Another story that happened, and I'm going to wrap it up with this, is Unstoppable Domains. It's another blockchain name system. They tried to patent some things, and now they're in a legal battle with ENS. And I thought of maybe speaking about what I think happens in the future, but time is up, so I will not. Thank you, everyone.
"Vanilla" Debian On An Industrial Embedded Device
Hello, everyone. Can you hear me? Hello, hello. So, I guess we can start. Is it working? Quiet please, we're about to start the next presentation. Hello, again. So, welcome everybody. Can you hear me? Okay, great. So, I am Francesco. I'm here to talk about installing Debian on an industrial embedded device. What do I mean by an industrial embedded device? Let's say, not a consumer device. So, it's a device that you might find in industrial automation, building automation, or an agricultural machine, not a Raspberry Pi, okay? And I am an embedded Linux engineer. I'm working with U-Boot, Linux and OpenEmbedded, and I have been using Debian for a very long time. It is my distribution of choice, but I have not been working with Debian professionally lately. So, what will we cover today? How did this start? I had some hardware available, I mean, because it's my job, something like that, okay? And I normally work with it using OpenEmbedded. And I was wondering: why can't I just install Debian on it? If the board is supported by the upstream Linux kernel, everything is in place. What is preventing me from just installing it? And this is where it all started. This talk is a little bit about which challenges I had and how I was able to get there. We will mainly talk about ARM and ARM64 devices. We will focus on the U-Boot bootloader. And the focus here is really about installing vanilla Debian. There are a lot of ways you can install Debian; Debian runs everywhere, and none of this was invented by me. The focus here is really about just following the instructions and getting it done. So, just to set the stage a little, a quick overview of how an embedded system, but really any system, boots today. After power-on, the system-on-chip and the CPU come out of reset and get configured in some way. This part can be really, really complicated. And at some point, the firmware needs to load the operating system.
What it does: it needs to figure out where the kernel is, it needs to figure out, in our case, where the device tree is, put everything in memory, prepared, and then jump into the kernel entry point. We will really focus on this step, about preparing the binaries and jumping into the kernel. This is our focus for this talk. Something that is important to mention, and that will make our life easier, is where the firmware is stored in flash. Traditionally, on your PC, you have an SPI NOR flash: that is where your UEFI firmware is stored, and it is completely out of band. It is not where you are going to install your operating system. On an embedded device, it depends. Sometimes, for cost saving, you have only one device on which you store everything, the firmware and the operating system. Sometimes they are separate. The hardware that I am going to consider today uses eMMC. eMMC has a very nice feature that allows hardware partitioning: normally there is a dedicated hardware partition for the boot firmware. This means I do not have to worry about overwriting my firmware while installing the operating system. This is not possible, for example, using an SD card like you would with a Raspberry Pi: the Raspberry Pi boots from the SD card, so you cannot really do that, which makes things more complicated. Good. In our case, our firmware is, as I said, U-Boot. U-Boot is a platform firmware. It supports a lot of architectures. In the end, it configures the hardware, as I said before, and then it is able to load the Linux kernel. Traditionally, let's say in the past millennium, this was very coupled with the operating system that was loaded. Some years ago, probably 8 or 10, I don't know exactly, a new feature called Distro Boot was introduced, trying to solve this task of loading the operating system in a generic way. How does it work? U-Boot is scriptable, with a sort of shell script and environment variables.
Distro Boot implements, with scripts, a generic way to search for a bootable partition and then a way to boot the operating system it found. In short, you tell the board which are your boot devices, and then you just include the header that I mentioned earlier, and that's it. It's very easy to integrate. What it normally searches for is a boot script with a fixed file name, looking at either the first partition or the first active partition, and it executes its content. It can also parse an extlinux.conf file that describes how to properly load the OS. We will focus on the boot script, which is more flexible; it really allows you to do everything, because it's really code. The reason we focus on that is that this is what is supported by default on Debian. extlinux works on Debian, but it is not the out-of-the-box experience you get from the Debian installer. What the boot script normally does, in the end, is load the kernel from some storage device, load the device tree, load your initrd, and then just jump into your distribution. Cool. Let's move to the operating system side. Debian has a package called flash-kernel, and it's really a glue package between the operating system and U-Boot. I mean, flash-kernel is a little bit more generic, but it is really able to integrate directly with U-Boot, generating the boot script we just talked about. It's integrated into the Debian installer, into the ARM one, and it integrates with the kernel packages out of the box using hooks. So it is supposed to just put everything together. Given that, in theory, we should be good: we have U-Boot, we have this package. So what I did was try to go through the installation. I took one board, exactly the one that I just showed you before (probably Debian is still installed on it), and I just followed the instructions, okay, as easy as that.
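To make the boot-script idea concrete, here is a minimal sketch of what such a script's source could look like. This is not the script the talk's tooling generates; the partition numbers, file names, and console settings are assumptions for illustration, and the load addresses (`${kernel_addr_r}` and friends) come from the board's U-Boot environment.

```shell
# boot.cmd -- compiled into the boot.scr that Distro Boot picks up:
#   mkimage -A arm -T script -d boot.cmd boot.scr
setenv bootargs console=ttyS0,115200 root=/dev/mmcblk0p2 rootwait
load mmc 0:1 ${kernel_addr_r} vmlinuz        # kernel image
load mmc 0:1 ${fdt_addr_r} board.dtb         # device tree blob
load mmc 0:1 ${ramdisk_addr_r} initrd.img    # initramfs; 'load' sets ${filesize}
bootz ${kernel_addr_r} ${ramdisk_addr_r}:${filesize} ${fdt_addr_r}
```

Distro Boot finds `boot.scr` by its fixed name on the first (or first active) partition and runs it, which is exactly the "load kernel, load device tree, load initrd, jump" sequence described above.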
I decided to do a netinst installation just because it was the most convenient for me at the moment, but again, whatever is copy-pasted here is just what you can find in the manual. And the result: it was not working. Why? What I figured out is that flash-kernel needs to know about the actual hardware to be able to properly generate a boot script. This is required because it generates boot scripts that really match the exact hardware we are running on. You really need to tell it the exact device tree file that will be loaded, and this is going to be part of the boot script. It's also possible to have custom boot scripts and do some additional customization, but if your board is properly supported in U-Boot, there is really not much to do apart from saying: use this flavor of the kernel and use this device tree. For instance, the Debian armhf port supports two kernel flavors, armmp and armmp-lpae, and you need to tell which kernel flavor you want to use. What I did was just open a merge request on the Debian GitLab instance, salsa.debian.org. I am not a Debian developer, nor a maintainer, I'm just a user, but my merge request was simply reviewed and accepted. Now, for instance, it's part of Debian Bookworm. So it is just open for everybody to contribute. Cool. Then I wanted to try something else. Distro Boot, at the moment, is considered deprecated, for a couple of reasons. One of the reasons is that being implemented with shell scripts is a little bit cumbersome. It's very difficult to understand how it works: there are scripts that set global variables, and then other scripts rely on the global variables set previously. If you look at this for the first time, you will just get lost.
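The hardware information flash-kernel needs lives in its small machine database. A hypothetical entry might look roughly like this; the machine name and DTB path are invented for illustration, while the field names are the ones flash-kernel's database format uses.

```text
Machine: Vendor Example Board
Kernel-Flavors: armmp armmp-lpae
DTB-Id: vendor/example-board.dtb
Boot-Script-Path: /boot/boot.scr
U-Boot-Script-Name: bootscr.uboot-generic
Required-Packages: u-boot-tools
```

The Machine line has to match the model string the running device tree exposes; that is how flash-kernel picks the right entry, the right kernel flavor, and the right DTB when a kernel package is installed.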
There was also a long-term plan for U-Boot to completely move to a Kconfig-based configuration system, and for that, there was a need to remove some include files; as we saw before, Distro Boot is configured through a configuration include header. That's the main reason standard boot was created. I didn't have any board with this enabled myself, so what I did was just enable it on the board. I mean, it's trivial: it's just defining the boot targets, which are equivalent to what we discussed a few minutes ago regarding Distro Boot, and then enabling a few configuration options. From the integration point of view, for the distribution it's pretty much the same. Standard boot is more generic; it also supports UEFI, and more, but in the end it integrates the same way with the flash-kernel package, and it's able to execute the boot script. The documentation is linked here, and there is way more to it than what I showed. Next, the board was exposed as if it were a USB flash drive, and with that, I was just running debootstrap from my PC, directly onto the target over USB. There are a few quirks to take into account, because the architecture my laptop runs is x86, while the target was ARM, and during the second stage of debootstrap you are going to execute target binaries. This was really easy to handle, because as of today, using the qemu-user-static package and the binfmt_misc support in the kernel, it's really possible to run foreign ARM binaries on x86 in a transparent way. And all of that is just integrated into debootstrap; it's just a matter of installing the package. So I went through it: my debootstrap installation completed, I did all the required steps, and then it was not working. What was missing this time? Apart from the fact that, of course, this board was also not supported (today this specific board is), the Debian kernel was not supporting the specific architecture.
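The cross-architecture debootstrap flow described above can be sketched as follows; the suite, mirror, and target mount point are placeholders, not what the speaker used.

```shell
# On the x86 host: qemu-user-static registers binfmt_misc handlers so
# foreign ARM binaries run transparently during debootstrap's second stage
sudo apt install debootstrap qemu-user-static binfmt-support

# Bootstrap an armhf root file system straight onto the target,
# which is exposed here as a mounted USB mass-storage device
sudo debootstrap --arch=armhf bookworm /mnt/target http://deb.debian.org/debian
```

Without binfmt_misc you would need the two-step `--foreign` / `--second-stage` dance on the target itself; with qemu-user-static installed, a single invocation from the host works.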
Again, the Debian kernel development is done on GitLab, on this GitLab project, and they are just open to taking merge requests; it was really something like a three-line change to enable this architecture in the defconfig. This is merged at the moment, and with that, finally, it was all working. Good. Last one. Until now, I was playing around with old ARM 32-bit architectures, and I wanted to try something more modern and faster; I was getting tired of the old ARM processors, so I wanted to try an ARM64 system. What I realized is that the Debian standard path for ARM64 just expects to use UEFI. I believe it would be possible to avoid it, it's not like it's a must, but if you just take the standard path, this is what is expected. And U-Boot supports a subset of UEFI, targeting this specification here. It has a few limitations; it's not a full-blown UEFI implementation. One of the main limitations at the moment is that it does not support SetVariable at runtime. We will see in a minute why this is relevant. In a very simplified way (I hope there is no UEFI expert here), what U-Boot does is search, on a GPT-partitioned device, for a specific partition called the ESP, which has a specific GUID. From that, it loads a binary in the UEFI format, which is more or less a variation of the Windows Portable Executable format, and just executes it. Anyway, moving on: UEFI was not enabled on the board that I have available, but there is really no hardware dependency in this functionality. It is pure software, and it was just a matter of enabling it. You can see it's a lot of configuration options, but it is really just generic code, nothing hardware specific. Good. So I tried again to install Debian, and this time, let's say, it was working more or less smoothly. It was possible to get to the end without having to do any kind of customization.
What was a little bit scary was that the installer was erroring out while making the system bootable. What it complained about was exactly the SetVariable-at-runtime limitation I just mentioned: that was not working, and the message is somewhat scary. How does this work? UEFI is configurable. There are variables, and the operating system, by setting the BootOrder and the Boot0000, Boot0001 variables, can specify which device and, really in practice, which file name inside the ESP partition to boot. That is what this is about. Normally, any modern operating system will install itself under a specific file name; I believe for Debian it might be something like debian.efi, I'm not sure. However, this is not possible at the moment because of this limitation in U-Boot. Debian is nevertheless able to install to a fallback location, the one used for removable devices. If you think about a USB flash drive or a CD-ROM, there is a fixed file name that serves as a sort of fallback. Debian is able to install there, and with that, I got a bootable system. I had, of course, a few small issues with some functionality not being enabled in the kernel, but that was straightforward to solve. What I was wondering at this point was which device tree was being used. The device tree is really the hardware description of the board; it is critical for the OS to properly use the hardware. With standard boot, this is well defined, and the flash-kernel package really tells you which device tree it is using. With UEFI, this is really not visible, so I started digging a little bit into that. What is used by default is the internal device tree from U-Boot: U-Boot also uses the device tree for its own configuration, and it is just able to pass it down to the operating system. Here we are. So, what's next? As you saw, as of today, with a very small effort, it is possible to run a pure Debian experience on an embedded target.
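What the installer tries, and what it falls back to, can be sketched like this. The efibootmgr invocation is what a normal UEFI install does (and the step that fails here, because U-Boot's UEFI implementation lacks SetVariable at runtime); the loader path is illustrative, and the fallback name is the fixed removable-media path for arm64.

```shell
# Normal case: create a Boot#### entry and put it into BootOrder
# (this is the step that errors out on U-Boot's UEFI implementation)
sudo efibootmgr --create --disk /dev/mmcblk0 --part 1 \
                --label debian --loader '\EFI\debian\grubaa64.efi'

# Fallback actually used: install GRUB to the fixed removable-media path,
# which firmware tries when no usable Boot#### variables exist
# (on arm64 that path is EFI/BOOT/BOOTAA64.EFI inside the ESP)
sudo grub-install --removable --efi-directory=/boot/efi
```

Because the removable-media path needs no NVRAM variables at all, the resulting system boots even on firmware that cannot persist Boot#### entries.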
Something that I assumed here is that you have upstream support for your device. This was the case for the hardware I have available, but if you take a random board, this might not be the case. Without that, Debian is not going to take your 1000 patches just to enable one board; okay, that doesn't happen. Second, the integration. What is envisioned is that the device tree really comes from the firmware, as the hardware description. In practice, if you are familiar with how device tree development works, as of now the device trees are stored in the Git repository of the kernel; it's really an incremental approach. You would have to update the firmware every time you update the device tree. This is just the reality we have today. So it would be very nice to be able to take the device tree from the root file system. However, it's complicated. When you boot with UEFI, normally you have an intermediate bootloader; you don't jump directly into the kernel. If you boot through GRUB, GRUB is able to load a device tree, but loading a device tree is disabled once Secure Boot is enabled, for example. The other issue is that the device tree normally goes through some fixups while the boot firmware is executing, and if GRUB loads it, these fixups are not applied. There is an implementation that is able to ask U-Boot to do these fixups afterwards, but all of that is available as a patch; it is integrated in Ubuntu, and not in Debian and not in mainline. And then, I mean, we can make this even more complicated if we think about device tree overlays, which are binary patches, normally used for the non-discoverable buses that are common on embedded devices: I2C, SPI, or an LVDS display. And that's it. Opening for Q&A. So, any questions? Thanks for the nice talk. I just want to ask: are you aware of the ISAR integration tool for Debian?
We are doing a lot of Debian on embedded devices, on industrial devices, with that, and it also addresses the topic of non-upstream firmware. I'm not aware of it; I will have a look. I was really focusing here on a pristine, pure Debian experience. There are tons of ways you can install Debian, tons of ways you can have a custom kernel; there are really a lot of possibilities. I could have filled probably three slides only with the options that are available. I really wanted to focus here on: I go to Debian, follow the instructions, and get it done. That was really my goal. Thank you for this presentation. I just didn't understand: when it is booting from UEFI, where is U-Boot placed? In which partition, or in the eMMC, or where? U-Boot boots from USB, if I understand? No, I was using USB just for the installation. The board was available as if it were a USB flash drive for the installation. After that, U-Boot was installed anyway, together with the UEFI implementation, in the eMMC boot partition. Okay. This side. Thank you. Okay. We don't have time for any more questions, but thank you very much, Francesco. Brilliant. Thank you.
Using linux-yocto as a Yocto BSP kernel
So, my name is Dmitry, I'm working for Linaro. Today I'm going to talk a little bit about linux-yocto, a fairly underused Linux kernel recipe, a Linux BSP kernel for Yocto. About me: I've been working on both OpenEmbedded and the Linux kernel and contributing to them since 2007. Maybe some of you remember OpenZaurus; I was using OpenZaurus, but not contributing to it, and started contributing when it became Ångström. In Linux, I have about 2000 commits. In Linaro, I'm part of the Qualcomm ecosystem: we are working on Qualcomm devices, and I'm maintaining meta-qcom, the upstream, open-source BSP for Qualcomm devices. This talk is based on our experience with providing the kernels for meta-qcom. Should I move somewhere? So, a typical OpenEmbedded board support package. Of course, it contains a Linux kernel. First of all, a custom recipe: usually the BSP vendors define their own recipe, their own way to do things. They have a SRC_URI that points into a Git tree. Yeah, sure. Sometimes this Git tree tracks the whole development history, with all the tries, with all the attempts; sometimes it is just rewritten for each major release; sometimes it's a mixture of both. So yeah, "fix of the fix" commits: this is not an imaginary thing, it is what I saw in one of the BSPs. Do you know how to track whether a patch has ever been sent upstream? What's the status of that patch? No. Which kernel version is it? Well, if you're lucky, it is a long-term support kernel, which gets updated to the latest LTS release. That is, if you're lucky. And security updates? If you're extremely lucky. How to configure it? There will usually be a defconfig file, either in the layer itself or in the same Git tree. Any idea how to upgrade it? Or how to enable netfilter or another obscure option that the BSP vendor did not enable for you? Troublesome. And yeah, everybody does it this way. I think so. Well, I thought so.
We tried to change this, for us. linux-yocto. I knew about it for ages, because it is the kernel used by OE-Core, by the core layer of OpenEmbedded and Yocto. It is used for the QEMU machines, and it has been used for some of the default BSPs. But why should I pay attention to it? I have my own kernel. Well, not quite. We found that it follows the stable releases. It also follows the latest Linux release; it tracks the release candidates. It has a very powerful kernel configuration tool based on fragments, based on an internal scripting language. And it is endorsed by the Yocto Project and the Yocto Project Compatible layer program. That's actually what made us look into it: we thought about getting meta-qcom certified under the Yocto Project Compatible layer program, and using linux-yocto is one of the recommendations. Sounds perfect: all the defconfig trouble, all the pain points from the previous slides, are addressed. The problem was that nobody uses it. We are trying to change that. So, a literally small how-to. This reminds me of one of the talks about reading the Emacs config, but I will be reading the meta-qcom configuration files. First of all, the recipe itself is in OE-Core. We do not have to provide any additional details; we do not provide the Git repo, we do not provide anything. We just say: yeah, let's use the defaults, let's enable it for our machine, our Qualcomm OpenEmbedded machine, and we will be using our paths. And the bonus stuff: there are a lot of files, in the .scc format and the .cfg format, that describe different options for different machines. We just need to reference a single root file; the kernel-yocto class will pick up all the files beneath it, so we do not have to list anything else in the SRC_URI. If you need to, you can add more configuration options.
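As a sketch of what "enable it for our machine" can look like in a BSP layer: the machine name `myboard` and the file names are invented for illustration, while the variables are the standard ones the kernel-yocto tooling understands.

```bitbake
# linux-yocto_%.bbappend in a hypothetical BSP layer
FILESEXTRAPATHS:prepend := "${THISDIR}/files:"

COMPATIBLE_MACHINE:myboard = "myboard"
KMACHINE:myboard = "myboard"

# A single root .scc file; the kernel-yocto class pulls in
# every .scc, .cfg, and .patch file that it references
SRC_URI += "file://myboard.scc"
```

Everything else (repository, branch, version) comes from the linux-yocto recipe in OE-Core, so the append stays this small.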
You can enable other features just by adding another append in your distro or in your enablement layer. That is it. You do not have to patch anything. You do not have to patch the defconfig in either of the layers; you do not have to create something that tracks "oh, that was the defconfig from that layer, and the options changed". It also tracks stable, as I said, so we do not have to bump versions, we do not have to upgrade anything when there is a new release from the stable team. Oh, sorry, wrong button. Patches. Of course, the BSP layer has tons of patches, hundreds of patches, and we have to list all of them. They come in a series; this is just a few lines. In our layer, there are currently 78, and we are trying to limit them to some sane amount. And a bonus point: because this is a recipe from OE-Core, we have to track the upstream status. For each of the patches, there will be an Upstream-Status trailer that says: yes, this patch has been submitted, or: oh, sorry, we did not submit this patch yet. History is no longer written in some obscure Git tree; the whole history of the patches comes from the BSP layer. As a user, you can take a look at any point and find: okay, these patches have been enabled for this and that platform, and they have been changed this and that way. Oh, and when rebasing from 6.5 to 6.6, they made this and that mistake, and I know how to fix it. So the whole history, the whole development, is visible to the developers and to the users of our layer. Config fragments. As I said, there is a powerful system of config fragments. There are the .scc files that describe how it all comes together, and of course the .cfg files, the pieces of the actual configuration. The .scc files provide a tree-like structure: they can include other .scc files, or they can include config files. So there is a huge set of default features that you can enable.
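A minimal sketch of such a fragment pair: the option values, patch name, and file names are illustrative, while the directives are the kernel-cache metadata syntax the .scc format uses.

```text
# myboard.scc -- the tree-structuring file
define KMACHINE myboard
define KTYPE standard

include ktypes/standard/standard.scc

kconf hardware myboard.cfg
patch 0001-arm64-dts-enable-myboard.patch

# myboard.cfg -- the actual Kconfig pieces
CONFIG_NETFILTER=y
CONFIG_SPI=y
```

Each patch listed this way carries an `Upstream-Status:` trailer in its commit message (for example `Submitted` or `Pending`), which is the tracking mechanism mentioned above.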
You just pull files from the default set that has been written for you by Bruce Ashfield. There are several downsides. There is no control over the exact kernel version. This is all handled by Bruce Ashfield, the maintainer of linux-yocto, and when he upgrades to the next version is his decision. Before this New Year, he decided to let everybody stay calm, and so he delayed upgrading to the 6.6 LTS for several weeks. Unless you know what's happening, this can cause some confusion; we had to ask Bruce what was going on. So sometimes linux-yocto gets delayed. Sometimes there are additional patches. In fact, as it is a BSP kernel for several devices, it has additional patches, and you have to understand how they relate to your device and whether they conflict with your patches. And the most important thing: previously, you could easily have a set of developers working on just a Git tree for your kernel. They do their job, they do it all right, and they just tell you: okay, this is the Git commit that you should be pulling into your BSP layer. Now it is also the responsibility of the maintainer of the corresponding layer to actually see what's going on, to work in close collaboration with the kernel developers. In our case, I'm also working as a reviewer for the patches submitted by our developers and by Qualcomm developers enabling particular features, because sometimes they are not. So you can no longer just say: oh, I'm a Yocto developer, I'm an OpenEmbedded developer. You have to be a kernel developer too. And last but not least: what if we have several hundred BSP patches? How do you track them? How do you actually manage them so that it does not become a mess? We solved that by splitting them into series. We are working with the Qualcomm people on this: you cannot just say, yeah, these hundred patches are there to enable this platform.
You have to say: this is a feature, these are 10 patches enabling this feature, these are 15 patches enabling another feature. So we are splitting and tracking different patch sets separately. Now, the repositories. Of course, there is the linux-yocto repository itself, which has all the branches, all the patches, all the history from prehistoric times till the recent 6.6. There is yocto-kernel-cache. That is what I mentioned when I said that you are pulling the config fragments from upstream. This is the repository with all the configuration fragments, with all the configuration scripts, that your layer will pull and combine into the final kernel configuration. And of course meta-qcom, the upstream Qualcomm layer. If you are working, for some reason, on the Qualcomm robotics platform, or if you are thinking about using Yocto for your phone for some reason — and it works — please take a look. This is an area you might be interested in. And yeah, of course, Linaro. We are Linaro Developer Services, and we now have an account on Mastodon, so please join the followers, and of course we are hiring. So that's it. Questions? Questions. Hello. How does the feature set of your kernel compare to the standard kernel Qualcomm provides to internal developers? I'm not working with Qualcomm, but I often see big differences between a vendor kernel with thousands of patches and what is upstream. Yeah. So as I said, we are working on Qualcomm upstream enablement. We are tracking what is going upstream, and we are developing and sending patches upstream. Right now there is a talk, or there has just been a talk, about the different Qualcomm kernels in another building; you might be interested in the statistics. In our case, as I said, we currently carry patches for enabling platforms that have not been fully integrated upstream.
We have about 80 patches. Ah! Oh! Yeah, it works. So, the internal git tree, before we switched to linux-yocto, contained from 150 to 200 patches. And one of the reasons for the switchover, one of the selling points, was that we were able to clean up that stuff. We were able to drop several — I think it was about 20 — patches that were just touching bits of the config in a different way; everything now goes through Yocto. We were able to drop several obscure patches that had been lingering for years, and to actually move patches upstream, send them upstream, rework them, and finally drop them. So I don't know if that answers your question. This doesn't work for pure downstream development, where you are a vendor with thousands of patches.
Embedded Security 2023
last year. Hello, everyone. Last year, for the first time, I was talking about errors in embedded development, and I would like to repeat a part of the experience that we had last year. Please think about an embedded project you are working on or have been working on recently. Lock it in your memory. No cheating — you lock in one project. Now, how many OpenSSL versions are there in that project? Raise your hand if it's zero. Like 10 people. Raise your hand if there's one. Like 20 people. Raise your hand if you are sure there are two or more. A few less. And raise your hand if you do not know. That's the majority of the room. I think there are a little fewer people who do not know, but still the majority. Why is the question important? You will see later. And a bonus question for the people who knew how many versions of OpenSSL they had: who of you has a full list of dependencies of that project? Okay, around 20 people. Congratulations to you. Now, who is Marta and why is she talking about such things and asking such intimate questions? I'm a security researcher. And what to expect from 2024? Let's start with regulations. Regulations — the plural is a little bit too much here. One regulation, because this is the 25-minute version of the talk. So, the regulation is the CRA. Now, one slight simplification of the CRA — tell your lawyers I am simplifying. The CRA is adding mandatory security requirements to all products that will be put on the market in the European Union, via the requirements of the CE mark. The CE mark — you know it, it's on all electronics. The CRA is extending the CE mark with mandatory security requirements. Examples of the things that are mandatory: no releases with known vulnerabilities; SBOMs; secure configuration by default; updates by default for all users; and so on and so on. There are two pages of those requirements. In the final version, it doesn't apply to open source projects themselves.
In most cases, it applies to products that integrate open source — all products, in fact. It will require paperwork, mainly risk analysis and a vulnerability management process. And what this paperwork will look like, I cannot tell you right now, because it's going to be defined further. As for most CE-related things, you have self-assessment by default, but there are certain classes of products that will require more, including an external security audit. That's an expensive thing if you haven't done one. And that's hot news, because we have a final version. It's expected to be voted next month, and from next month there will be three years till the final implementation. Now, the current version excludes non-monetized open source projects — that's a big simplification also. So if you are contributing to an open source project, it doesn't apply to you. But it does apply to all integrators, and embedded people are integrating open source in their products. So basically, it applies to the whole of embedded. There will be risk analysis to do for all components that you include, and that's why the question of what components you have in your project is important. And now the big question for the whole embedded open source community: is everyone going to do this paperwork alone? Or are we going to do the paperwork the open source way and share the documentation prepared for each single dependency? That's a big question for 2024, for all of us. If you want to know more, if I scared you enough, I've written an article published at LWN last year, so it covers the first version. And for your trip back from FOSDEM, there's a nice read: the final version of the regulation, just 189 pages. But it's not boring — I didn't fall asleep, it's not boring at all. Now, let's go to trends, apart from the regulation. CVE numbers. What is a CVE? A CVE is a way to name vulnerabilities, public ones. It stands for Common Vulnerabilities and Exposures.
And the number of registered public vulnerabilities is growing. In 2023, it went up yet again — yet another year with a record-high number of CVEs. I haven't been splitting embedded from non-embedded, but for embedded the statistics are the same: the number of vulnerabilities is growing in a very significant way. Now, a complex problem: funding of security work. In the recent two, three years — and a big part of this process happened in 2023 — there are external funds paying for security work in open source projects. Two main examples of that: the OpenSSF Alpha-Omega project, which funded — I've chosen examples relevant to the embedded field — Rust, Python and the Eclipse Foundation; and the Sovereign Tech Fund, which has paid for part of the work on the Yocto Project, and on other projects too in the embedded field. Because of this funding, and because of the pressure of the regulations that are happening not only in Europe — in the US there's also pressure, different but in the same direction — we are seeing processes being updated in different projects. An example of that: the Yocto Project now has a security team and a working security process. In relation to all that, we also have tools that are either being implemented or being used more and more frequently. For example, SBOM generation, either in the CycloneDX or in the SPDX format, is becoming a more and more common option in embedded projects. Yet another example from our field: SPDX output is now generated by default in the Poky reference distribution of the Yocto Project. And similar tooling around dependency checking and CVEs — you have that in platforms like Dependabot on GitHub, and in standalone tools also. Tools are happening, and the pressure to use them is happening too. And another big question for all of us: all that work requires someone to do it.
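As a concrete illustration of the SBOM tooling mentioned above: in Yocto releases where SPDX output is not yet on by default, it can be switched on with one line in the build configuration. A minimal sketch, following the documented `create-spdx` class (the optional pretty-printing variable is from the same class):

```
# conf/local.conf -- enable SPDX SBOM generation for the whole build
INHERIT += "create-spdx"

# optional: human-readable (indented) SPDX JSON output
SPDX_PRETTY = "1"
```

After a build, the generated SPDX documents end up under the image deploy directory alongside the other build artifacts.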
To do the security work, to run the processes, to look at the results of tooling — even if the tools run in CI, you have to have someone looking at the results. How can we do it long term, and especially, how can we fund it long term? Those external funds may disappear one day. Big question for 2024. Now, for the events, vulnerabilities and incidents: I had to cut things, because I want to have time for questions and it's only 25 minutes. This is what I have chosen for this year. HTTP/2 Rapid Reset, also known as CVE-2023-44487. This one was actually exploited in practice between August and October of last year. It's a vulnerability in HTTP/2 implementations, and a little bit in the specification itself too. HTTP/2 allows parallel streams on the same connection, and if a client creates a parallel stream and immediately afterwards sends a message to close that stream, this generates a high load on the server — the creation of a stream is pretty expensive. As a result, you get a denial of service. Most HTTP servers were affected, and there was a big number of releases in October 2023. What is interesting in the whole story is that the servers aimed more at the embedded market — with careful resource allocation, with limits on the number of clients or on streams per client — fared better and were less vulnerable to this issue. lighttpd, for example, clearly states that it is not vulnerable to this issue. I'm providing a link to the NVD entry for the problem, with dozens of links for different projects, with information on what they did, or what configuration options they expect users to set to prevent such things in the future. And then a little bit of fun. It's either funny or frightening, depending on how you read it. The whole thing happened in 2022, but it was published in 2023, so we can say it belongs to 2023.
It was a long story, but in short: some trains in Poland weren't starting after maintenance. The maintenance company brought in a reverse engineering team, and what they figured out was that there were things like the train locking up with a vague error message after staying in one place for a long time, or the train reporting errors after staying at certain GPS positions — which, by coincidence, turned out to be the GPS positions of the workshops of the manufacturer's competitors. Or, in some trains, there was a lock based on a date. Now — related to the CRA, but also related to all the things happening on the market: until now, embedded developers were choosing their dependencies with, well, it does the job, the license matches, I can take it. In the future, the license match may not be the only condition. There may also be conditions like: does this project have a security policy, does this project provide regular security updates for five years or more. And there may be the need to do triage in your dependency list, in some surprising places too. On the SBOM side, last year we had SBOMs being generated in more and more places. Generating SBOMs is cool, but it's even cooler to actually use them for something, so I think that's going to happen this year. Then, on the pure vulnerability side, we are still seeing products developed to sit in an internal network, not connected to the internet — and then someone puts a GSM modem in there. I am expecting a few funny vulnerabilities like that. Then the hardware stories are going to continue, not only chips but also firmware. Have a look at the size of the firmware of your network card, or your graphics card, or your GPU, or your phone chipset. That amount of software means there are bugs, and if there are bugs, there are likely also security bugs.
I expect that, maybe not this year but sometime in the future, we will have a big issue related to firmware in one of those categories. My personal pick is network cards — a crafted packet, to make things funny. Then there may also be issues in places where you do not expect them. Quite many open source projects have never issued a CVE before, and if they have never issued a CVE, users have a tendency not to update them. Not having a CVE does not mean that there are no bugs — in fact, quite the contrary. I expect that we may have a very serious problem happening in one of those projects nobody has been looking into before. Then everyone will be trying to figure out how many copies of that project they have. To sum up: it is going to be an interesting year. Do you have questions? Thank you for the interesting talk. I have a question about the legislation. Are there different regulations for real security bugs and denial-of-service bugs? If you have some wormable hole in your software which is network-connected, or something which is just a denial of service — for me it is a different class. You probably get my point. There are two parts to the answer to your question. The CRA is not the only regulation currently in progress. You know that there are European elections coming; things are being rushed. There is the CRA, but there is also the PLD; there are other regulations in the works, there is the regulation relating to AI, and all of them have certain requirements. For a typical vulnerability: if it is exploitable, like in the case of that HTTP Rapid Reset, it is a vulnerability — I would classify it as a typical vulnerability. If it were to happen in a network device, that quite probably also falls under other regulations. There may be things that apply in different places, depending on the actual use of the same software. Thank you very much for this talk.
I think this is probably the most important talk for me, as I am a designer and manufacturer of embedded hardware for startups and SMEs. I am desperately concerned about the situation. The timeline you lay out is scary enough, but you will know that we in the UK have an IoT connected-device law coming into force at the end of April. We have three months to be compliant with this. There is a £10 million penalty, potentially, to us, or a percentage of global revenue. I will say that broadly not one of the startups or SMEs we work with, and indeed ourselves, is in a position to deliver on this stuff, which scares the heck out of me. I would love to know who we need to be talking to, to work together to try to look at this. I haven't shared the scary part of the story about the penalties, but in all cases you are not able to pay them, so... That is another example: in different places, there are different regulations being brought in. For me, as an open source community, the only way we can solve this is all together, preparing the whole paperwork together. Otherwise, the big ones will be able to pay for the whole paperwork, but the small ones, well, not really. I think we are out of time, unfortunately. Thank you.
enioka Scan: say No! to vendor lock-in for your barcode scanners
So, hello everyone. My name is Antoine Gonzales. I am a French developer at enioka Haute Couture, and today I'm going to talk to you about barcode scanners, and barcodes in general, and why they matter for open source. So, a bit of context. Barcodes — why do they matter? You have probably noticed that everything in your daily life has barcodes attached to it, whether it's grocery shopping, parcels that you order online — even menus in restaurants these days have barcodes attached to them. The idea behind this is that barcodes end up being one of the most efficient ways to attach digital data — usually an ID, but sometimes more than that — to physical objects. There are many different types, one-dimensional and two-dimensional ones. The ones you're most likely familiar with are EAN-13, which is on pretty much every package ever, and QR codes, which are mostly used to share links and things like that. So, as I said, they're used everywhere, but depending on your workflows you may have more or fewer requirements around them. For example, if you deal with large-scale packaging or inventory keeping, maybe you need to scan lots of them very quickly. So some workflows have specific requirements and need dedicated devices, which is what barcode scanners are. And barcode scanners — the less wonderful side of it — come in a wide variety: some small, some big, some that look like phones and mostly are phones, some that look like rings that fit on top of your finger and that you can use to scan products. There's a wide variety of them and a wide variety of manufacturers for them, and that comes with a problem. Each manufacturer tends to have their own APIs, their own SDKs, their own licenses that are usually not very open source friendly, and documentation that can be more or less complete depending on who makes it. And obviously, most of the time it's proprietary — otherwise this talk would not be here.
So what this means is that when you pick a manufacturer for your barcode scanners, you usually end up sticking with it, because changing, or just adding more variety to your fleet, means having to rewrite your entire application. And for a lot of companies, that's just a lot of time to invest and not profitable in the end. So what is enioka Scan, then? It's an Android library with the goal of exposing a single common API to interact with different scanners. How does it do it? The goal is to let you pick the manufacturers and the scanners that fit your needs. That may mean combining different manufacturers, because you have multiple needs in your company, or just changing when the current contract does not fit your needs anymore, without having to rewrite everything. And how does it enable this? Obviously, there's no magic. If every manufacturer and every device requires a specific way to communicate with it, that specific code needs to exist at some point — that's in the library. The idea is: every device exposes its own API, and we implement a way of communicating with that device in the library, either through official documentation when possible or, if we don't have access to it, through reverse engineering the protocols used by the scanner. Once we have that communication set up, we can provide an abstraction layer that the end user can use to, for example, send very high-level commands to the scanner — start reading a barcode, turn on illumination, things like this. The library will find the connected scanners and translate to the appropriate protocol underneath. What's interesting about this approach is that it makes the library very extensible. If we don't support a given device but want to add support later, it's pretty simple to do: it's about implementing one interface or another.
We describe what the device can do and how we translate to something the device can understand. If the device has specific features that may not be common on most scanners, we can divide these commands into subgroups that are easy to implement or not, in a way that makes it obvious what the device can do. For the end user, nothing changes — that's the whole point of the library: the application that uses the library doesn't need to adapt anything, the library itself is plug-and-play. In terms of compatibility, so far we support quite a wide range of scanners. Some of them use Bluetooth, classic or low-energy. Some of them are integrated, meaning, for example, a smartphone with a scanner built on top of it. For some situations the Android camera is all you need — the smartphone's own camera — in which case we support both the legacy camera API and the newer Camera2 API. One of the biggest upgrades we made last year was compatibility with Zebra DataWedge. Zebra is one of the main manufacturers in the barcode scanner industry, and DataWedge is their proprietary service that communicates with most of their fleet of integrated scanners. This one, for example, allowed us to pretty much support everything this manufacturer produces. For any device that's not in the list, whether it uses the existing supported technologies or something else — for example USB — if Android lets you access that way of communicating, theoretically nothing stops you from adding compatibility. It may just require a bit more boilerplate to get it working the first time. But overall, we already have a set of helpers to make the process easier. What comes next for this library? Like I said, a lot of scanners are supported already, but obviously not all of them can be — there are a lot of devices out there — so more and more devices are going to be added as we get hold of them.
We also want to provide external documentation containing guides and examples for the code. Right now we have pretty complete API docs, but not a lot of guides and quick starts for people who want to judge, before starting to implement, whether or not this library meets what they expect and what they need. We also want to provide a more complete separation of the core library and the existing SDK integrations, to let you download just what you need and not carry dozens of devices you might not use. Another thing we want to add is a standalone app — both an application and a service published to the Android Play Store — so that, in case the default functionality of the library is all you need, you already have access to it and don't have to reimplement everything in your own application. And finally, better Bluetooth support. We already support Bluetooth pretty well, but a lot of devices have specific methods of pairing with your Android phone. Sometimes they require pairing by scanning a barcode generated by the device, or through NFC pairing, which we do not support right now but want to in the future, along with better Android support for the activities, like the camera activity, that we provide. Now, what we need help with. Obviously, like I said, we do not have access to every device — it's not possible, there are too many of them. But maybe you do. Maybe you have access to barcode scanners that we haven't tested yet, in which case you can probably help us expand this library — for example, by simply testing whether or not the device you have is supported by the library. Sometimes, even if a device hasn't been explicitly tested, manufacturers reuse some of their code and some of their protocols, in which case a device that we haven't tested may work with the library. And if not, we know we have some work to do on it. You can also add more SDK integrations to make more devices compatible.
For example, if you have a device that we do not support and that's not compatible with the library, you can try to either reverse engineer it or provide the documentation necessary so we can add its functionality. And finally, if you see any feature that you think is missing, or that could be done better or optimized, you can try your hand at improving the code base. We try to be reactive with the issues and questions that we receive. So if you want to take a look at it, there's a barcode linking to the GitHub repository. If anyone has any questions, I can answer some of them, and otherwise, if you want to stop me in the hallways or open discussions on GitHub, you're welcome to do so. Thank you very much. We have one question. Hi. Can I ask, are you planning on supporting any other platforms? Because a worrying amount of POS software still runs on Windows XP with serial. So, right now it's only an Android library, mainly because the mechanisms we use to connect to the different scanners are specific to Android. But the core compatibility layer with each device is not specific to Android. So even if connecting to a scanner uses Android, you could probably take the code base used to translate messages and port it to Windows or Mac or whatever else. Anyone else? Okay. Well, thank you very much, Antoine. That was great. Thank you very much. [Applause]
The Small Device C Compiler (SDCC)
So, welcome to the Small Device C Compiler. The talk slot is really short, so I'll try to fit in just the basic stuff. I'll start with a quick introduction to what the Small Device C Compiler is, then I'll talk about the architectures we target, and then a little bit about what the future hopefully brings for the Small Device C Compiler. Okay, so SDCC is, as the name says, a C compiler. It tries to support the C standards, in particular ISO C90, C99, C11 and C23. It's nearly always used as a freestanding implementation. The only exception I know of is FUZIX, an operating system for some 8-bit systems, which uses it as part of a hosted implementation. Now, those familiar with the C standard know that in a freestanding implementation you are more restricted, in particular in the features from the standard library that you can use. Of course, when your device has no file system, there's no point in standard library functions for opening, reading or writing files. There are some supporting tools apart from the compiler itself, in particular an assembler, a linker and simulators. The simulators are usually more or less cycle-accurate. We mostly use them for our regression testing internally, but they are also usable by end users who want to run their programs on a simulator rather than on real hardware. It works on many host systems; the most popular would be Linux and Windows, but it works fine on FreeBSD and so on. We target various 8-bit architectures, probably more than any other compiler does, and we have some unusual optimizations that make sense on these targets, where you really have very little memory and where optimizing for both code size and memory use is very important, and often more important than optimizing for speed. Our user base consists mostly of developers targeting embedded systems — I guess they make up about two-thirds of SDCC users — and the rest are retro gaming and retro computing enthusiasts, because we also support various older 8-bit architectures.
They're similar enough to modern 8-bit microcontrollers that it makes sense to have them all in the same compiler, and many high-level optimizations can be shared. And I believe that the user base benefits from having both these groups represented, because sometimes one group or the other is more eager to try some new feature, which of course helps us find all the bugs in corner cases and iron everything out, while the more conservative users who want to wait longer then get things in a more polished state. Our latest release was at the end of January, which is very recent; typically we do one release per year. The project is hosted at SourceForge. We have our issue trackers there, we have mailing lists for communication, we have the version control repository, and we use the wiki for some documentation outside the manual. And we have a compile farm for nightly regression testing, which means that every night, on many different host systems — both in terms of operating system and underlying architecture — the latest SDCC from trunk is built and then all the regression tests are run: compiling a lot of tests and running them on the simulators to see if the results are what they should be. There are something between 10,000 and 20,000 tests executed that way, and it also incorporates a large part of the GCC test suite. A quick comparison to better-known compilers. We don't see ourselves as a competitor to GCC or LLVM, so the "versus" up there is just for comparison. We specialize in targets that are hard to support in GCC and LLVM. For GCC or LLVM, you typically want some RISC-like architecture: many registers, a uniform instruction set. Then you can use a Chaitin-style register allocator, that's efficient, and everything is nice. The typical 8-bit architecture is not like that. And if you want to get into the compiler as a compiler developer, our learning curve tends to be less steep than GCC's.
Our internal interfaces tend to be more stable than LLVM's, which for some people is also a nice feature. Talking about the recent release: in the last two years, our main improvements were definitely in standards compliance, in particular ISO C23 support. This was partially funded as a project by the Prototype Fund from the German Ministry of Education and Research. And improvements in optimizations, in particular generalized constant propagation, which allows us to narrow variables. If people use an int as a loop counter, that's typically a waste of memory on an 8-bit target if that loop doesn't really need the 16 bits that an int has on those targets. The work on optimizations was partially funded by NLnet via the NGI0 initiative. We also got two new ports, namely one for the WDC 65C02 and one for the R800. One is a MOS 6502 derivative and the other is a Z80 derivative. Let's get to the ports. The STM8 port is our best one, because we generate really good code for the STM8. It's currently the most advanced port; it has all the bells, whistles and great features, and we do very well compared to the non-free compilers. Unfortunately, this architecture has recently become not recommended for new designs — the manufacturer is trying to move their customers to ARM. But just to illustrate how we do versus three other compilers, which are all non-free, in terms of benchmark scores: we generate essentially the fastest code, except for Whetstone, which is a floating-point benchmark we didn't put as much emphasis on. And we also generate reasonably small code for all of these benchmarks here. This is the current release from January versus the current versions of these non-free compilers. Now, our oldest port is for the 8051 and its derivatives. That's an ancient microcontroller architecture that Intel introduced long, long ago and abandoned long, long ago, and there are still many dozens of manufacturers that make compatible devices.
It's a very, very popular, common microcontroller architecture. It's not as nice as the STM8. It was the first supported architecture in SDCC, but in recent years it has fallen a bit behind: new features that got added for other architectures didn't always get added for the 8051. Also, the devices made by different manufacturers are often slightly different, in particular in newer features like additional data pointer registers, which are used in different ways. We have support for the HC08 and S08, a current microcontroller architecture by NXP. The problem is that there's not really much of a free, open source community around this architecture. There are individual bits here and there where someone wrote some free software for it, but in general, the typical sentiment from developers of S08 programs seems to be: we get the development environment from the manufacturer at no monetary cost, why should we try something else? And sometimes they complain a bit if the manufacturer drops support for an older device. Then there's Padauk, a Taiwanese company that makes billions of microcontrollers each year that are really not that expensive. They were not really meant to be programmed in C, but we still managed to support them — three of the four subarchitectures that exist we already support. The largest one, the pdk16, is not yet supported. One interesting thing about these is that they have hardware multithreading support, which we currently don't support: what we can do is write a C program, run it on one core, and then the other cores run assembler software. There's Microchip PIC. Those used to be very popular because they were cheap. The ports are currently unmaintained, but we still sometimes get contributions from users with patches — it's not like they're completely abandoned. Maybe sometime a maintainer will step up out of these user contributions. Okay, now we get to the architectures relevant to the retro computing people.
These are a large number of Z80-derived architectures. The SM83 might be known to most people here as the CPU from the Game Boy, even though it's also found in some other Japanese appliances and TV remotes. And then we have the MOS 6502 and its derivatives, which don't even fit on the slide anymore. They're found in old embedded systems. Especially the R2K and R3K — those are the Rabbits — were very early IoT devices, because they are kind of enhanced Z80s with Ethernet or Wi-Fi support on the chip. These architectures are relevant to the retrocomputing community, which often doesn't use SDCC directly, but instead via downstream projects. They package SDCC together with libraries for certain devices that use these chips, like video game consoles or historic computer systems. Now, what will the future look like for SDCC? We're definitely facing a problem at the moment, because the STM8, the architecture for which we're doing really great, and those Rabbit devices that I mentioned on the retrocomputing side, are both not recommended for new designs anymore. Meaning that the architectures where we really do great as a compiler are about to be phased out. We will keep supporting them, probably unlike many of those commercial compilers — I mean, two of the three commercial compilers for the STM8 haven't even seen any update in the last two years. But to stay relevant for current embedded systems, we need to try something else. And basically this is the idea: the main thing is putting the focus on the MCS-51, the 8051, again. It's an ancient architecture, and not exactly the nicest one. But due to the large number of hardware vendors, it's not likely to die any time soon. And looking at the reasons why users choose non-free compilers over SDCC for the 8051, the main reason is definitely that the main non-free compiler for this architecture can optimize better for code size.
So this slide about the future is basically a very rough outline of plans for the next two years. Generating better code in the MCS-51 port is definitely something that we want to do. We will look a little bit into the STM8, but due to the lack of community behind it, there's probably not that much that can be done. We still try to keep the STM8 up to par with the other ports feature-wise, even if maybe not optimization- and code-generation-wise. For the Padauk parts, it would be nice to support the multithreading better and also to support the one remaining subarchitecture. And then there's this F8 thing, which is basically a very early project to maybe come up with our own architecture. I've worked on the compiler for a long, long time, and very often there was this feeling: this could have been done a little bit better in this architecture, or that could have been done a bit better — it would have made it a much better target for C compilers. The STM8, for example, is a really good architecture. It has things like stack-pointer-relative addressing modes. That's something you really want for local variables in C, because you want them on the stack, so you have full reentrancy, C standard compliance, everything. But it has very few registers. The Z80 has more registers, but stack access is a little less efficient, because you have to set up a frame pointer, and it goes through index registers, and so on. The Padauk parts have great multithreading, but they don't have the instructions necessary to support good C standard atomics for communicating between the cores. And out of all those lessons learned from other architectures, the F8 is a project to come up with an architecture that, if it succeeds, could become for the 8-bit world something like what RISC-V is for the rest of the world. And with that, I see that the time is up. Questions? — Thanks for the talk.
Can you maybe give some hints about the internals of the compiler? — The internals of the compiler, okay. We have a classic lex/yacc front-end. — Yes, I just wanted to ask whether you are using an intermediate representation, and maybe also about the simulator: since it has to support many architectures, does the simulator use an intermediate representation? I would be curious about that. — Okay, so the front-end is a classic lex/yacc parser. We have an abstract syntax tree that gets converted into the i-code, which is basically a three-address code. This then gets annotated with some extra information, such as the register allocation, and then in the individual back-ends this i-code gets transformed into assembler code. The assembler code then goes through a peephole optimizer, and then gets written out for the assembler and linker. The simulators — well, that's not my area of expertise; Daniel Drótos is definitely doing most of the work on that part. They're written in C++, using classes and such to abstract things away, but I don't think there's any intermediate representation in the simulator, because they need to be fast. We want to run tens of thousands of tests for every architecture that we support every night, so performance is definitely a goal for the simulators. — You mentioned code size as one of the areas where SDCC lags behind the proprietary compilers from the vendors. What kind of factor are we talking about, and are you keeping regular statistics about the code size of SDCC, like across different versions and so on? — Yes, we are tracking this over time; we have graphs. And we are not lagging in code size in general compared to other compilers. I mean, we're doing okay for the STM8. Raisonance can generate smaller code, but Raisonance is in every other way the worst compiler for the STM8 around these days — I mean, they don't even support C90, and the code is very slow. It's specifically for the 8051 back-end that Keil can generate more compact code.
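The pipeline sketched in the answer (AST → i-code as three-address code → register allocation → back-end code generation → peephole optimization) can be illustrated with a toy lowering pass. This is our own minimal sketch of the three-address-code idea, not SDCC's actual i-code format:

```python
def lower(node, code):
    """Lower a nested tuple AST like ('+', 'a', ('*', 'b', 'c')) into
    three-address instructions appended to `code`, returning the name
    holding the result. Leaves are plain variable names."""
    if isinstance(node, str):          # leaf: a variable
        return node
    op, lhs, rhs = node
    left = lower(lhs, code)
    right = lower(rhs, code)
    tmp = f"t{len(code)}"              # fresh temporary per instruction
    code.append((tmp, op, left, right))
    return tmp

code = []
result = lower(('+', 'a', ('*', 'b', 'c')), code)
# code is now [('t0', '*', 'b', 'c'), ('t1', '+', 'a', 't0')]
```

Each tuple is one "instruction" with a destination and at most two operands; a real compiler would then annotate these with register assignments before emitting assembler.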
Just to preface my question: I have only experienced SDCC through the downstream projects, and I actually began using it in great part thanks to your talk a couple of years ago. But I have noticed that the compilation step takes a lot longer than with other compilers. I suppose it's optimizing and evaluating — why so? And what would help: a faster disk, more RAM, a faster processor? What would help bring the compilation time down a bit? — This depends on the back-end. Most back-ends use what we call the new register allocator, which was definitely the key to being able to compete this well with other compilers in generating faster code, and also to being competitive in code size. The 8051 does not use it yet, but for the Z80, this register allocator is used. It has a parameter, --max-allocs-per-node, that you can set to tell the register allocator how many different possibilities to consider at each node of the intermediate representation. The default value is 3000. If you set it lower, you get less optimization, lower RAM usage, and faster compilation. But there are people who set the thing to a million and let their program, which in the end fits into 8 kilobytes, compile for half an hour, because they really want it optimized as well as possible. So yes, most of the compilation time is spent in the register allocator and the peephole optimizer — and for the ports that have the new register allocator, definitely the register allocator, typically more than the peephole optimizer. And one interesting thing is that this can become provably optimal: if you also add --fverbose-asm, you get comments in the assembler output that tell you if the register allocator found a provably optimal assignment — per function. — Okay, I think that's all we have time for for the questions. I just wanted to say thank you very much for the fascinating talk. A round of applause.
Vehicle Abstraction in Automotive Grade Linux with Eclipse Kuksa
All right. While the last people join the room, let me ask a few questions to get an idea of the audience that we have here. Quick show of hands: who of you knows AGL, Automotive Grade Linux? That's quite a lot, awesome. Another question: who of you knows Kuksa? Okay, let us change that — fewer hands than for AGL, but I think that's a good thing. Last and final question: who's here still from the beer talk? Okay, I'm glad some of you actually came out of those talks. So, as you can already see on the introduction slide, we will talk about vehicle abstraction; we'll talk about Automotive Grade Linux and about Kuksa. Before that, maybe a bit of context: who am I? I'm not the super-automotive developer who has been doing CAN and AUTOSAR for the last 20 years of my career — also due to my age. I really come from the cloud side; I used to work on different projects on GitHub. And I thought: how can we actually make application development for vehicles more fun and efficient? One really large, essential challenge here is that there are no standardized signals. You can develop an app for one car and it won't run on another vehicle, maybe even from the same vendor. So what we often see in the industry is this kind of high end-to-end complexity: every application is developed for one specific model, one specific car. And that is a huge pain point, because you cannot port your applications, you cannot scale — if a developer is developing an app for one brand, it won't work on another brand — and maintenance is just a nightmare, because you build it for one car and then you completely forget it. So, as always in computer science, one solution to that is abstraction. That's why we put a lot of effort into the topic of vehicle abstraction. So, how can we make a world like this happen?
A world where we have tons of applications that are developed against the same API, against the same data model, and that just work on different cars — well, not the same car at the same time; I'm talking a bit too much about cars — that run on different models, different brands and so on. Basically: how do we get to a world where we write it once and run it everywhere, and also attract third-party developers? Because this is how you grow the ecosystem, make it more attractive to develop for, and realize synergies. For this abstraction, I would say we basically need two things. One is a data model to operate on, and the other is the APIs to interact with that data model. Coming to the first thing — I hope, here we go. When it comes to the data model, or you might also call it a taxonomy, we decided on the COVESA Vehicle Signal Specification (VSS). It's done at an organization called COVESA, formerly known as GENIVI — maybe that rings a bell for some. What it basically does is define a tree structure for all kinds of data that might be available in the vehicle. So, for instance, to get the tire pressure, you follow the branch Vehicle, Chassis, Axle, Row 1, Wheel, Tire, and then you get to the Pressure signal. The same way you have sensor values in here, you can also have actuator values. For instance, for a seat position, we could just change the value of the seat position, and eventually that seat in the car would move to that position. That's the idea of this whole data model. If you want to play a bit with that, there's also a really cool website called digital.auto that makes nice visualizations of it and also shows some example applications of how you interact with VSS. Okay, that was the first piece — how about the second? This is where Kuksa, or more specifically Kuksa VAL, comes into play.
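The branch walk described above (Vehicle → Chassis → Axle → Row 1 → Wheel → Tire → Pressure) can be sketched as a dotted-path lookup over a nested tree. The structure and the value here are a simplified stand-in for the real COVESA VSS tree, which is generated from the specification files:

```python
# Simplified stand-in for a VSS-style signal tree (not the real spec).
vss = {
    "Vehicle": {
        "Chassis": {
            "Axle": {
                "Row1": {
                    "Wheel": {
                        "Left": {"Tire": {"Pressure": 230}},  # kPa, made up
                    }
                }
            }
        }
    }
}

def get_signal(tree, path):
    """Resolve a dotted VSS-style path like
    'Vehicle.Chassis.Axle.Row1.Wheel.Left.Tire.Pressure'."""
    node = tree
    for part in path.split("."):
        node = node[part]
    return node

get_signal(vss, "Vehicle.Chassis.Axle.Row1.Wheel.Left.Tire.Pressure")  # → 230
```

The same tree shape holds actuators as well as sensors; only the semantics of writing to a node differ.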
VAL here stands for vehicle abstraction layer — so we talk about abstraction. The idea is to have Kuksa running on the vehicle computer, some kind of computer which might run a Unix or something similar. And we also assume this is the place where we decouple the hardware from the software in the vehicle. The underlying assumption is what you can see on the left: we have a lot of deeply embedded layers — CAN, AUTOSAR, LIN, SOME/IP, whatever you like or maybe don't like — which may be really proprietary in some cases, and the signals and the bits are really specific to the car. So people write something that we call a provider, or also a feeder, to translate between these really specific embedded systems and VSS, using the Kuksa API. This is where the APIs come in, because here we use Kuksa. If you like it more on the abstract side, we can also say: in the deeply embedded layers we mostly have data — really the ones and zeros, the bits — and we need to interpret those. So we translate them to VSS, get information out of that, and then, by combining this information in different applications, we actually create knowledge. And Kuksa is a nice building block for that. So, what is Kuksa in general? Since we are at an open source conference: obviously it is open source, fully licensed under the Apache 2.0 license. And as I just mentioned on the previous slide, it is a kind of digital twin based on VSS. It shows the current and the target value of signals in your vehicle. I don't want to go into the definition of digital twins, but I guess you get what I'm getting at here. So you not only have the current value, which is quite nice, but you also have the target value. Coming back to our seat example: if, as an application, you changed the current value of a seat position, this wouldn't mean the seat is actually where I want it to be.
So, I actually set the target value, and then it is up to the deeply embedded layers — the actual vehicle — to move the position of the seat over time. That is why you can change both values, and hopefully at some point the current value will equal the target value, because that is the whole idea. So much for the concepts; let's get to the code. Or maybe I won't show code here, but rather what it is actually written in. We wrote this in Rust. If you statically compile it, it is less than 4 megabytes — large or small depending on which world you are coming from, I guess: small for the cloud world, maybe large for the automotive world. And it is quite language-agnostic, because you interact with it through a gRPC interface with some basic functions like get, set and subscribe, and there are also a number of client libraries using this. And with that, that is actually the basics of Kuksa. I have to be honest with you: if you were in this devroom last year, you would ask where the news is, because this has been shown there as well. So, let's get to the news. What has happened in the previous year? First and foremost, it is used in AGL, so Scott will talk a lot about that in the next minutes. But we also have some other news. For instance, we now have a Kuksa Android SDK, we have a mock service, and we also did some work on Leda from our side. The Kuksa Android SDK is kind of straightforward: it's an SDK, now available on Maven Central, and you can interact with the databroker from an Android application — be it Android Automotive or maybe your own app on your smartphone. So, assuming you have some kind of Kuksa abstraction in your vehicle, you can use a companion app, for instance, which we are about to release to the F-Droid store.
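The get/set/subscribe surface with separate current and target values described above can be sketched as a toy in-memory broker. This is an illustration of the concept only — not the real Kuksa databroker or its gRPC API — and the seat-position signal path is a hypothetical example:

```python
class MiniBroker:
    """Toy stand-in for a VSS data broker: each signal has a current
    value (reported by the vehicle) and a target value (requested by
    applications), plus callbacks fired on current-value changes."""

    def __init__(self):
        self.current, self.target, self.subs = {}, {}, {}

    def set_target(self, path, value):
        # An app requests a state; the vehicle is expected to act on it.
        self.target[path] = value

    def set_current(self, path, value):
        # A provider/feeder reports the actual state; notify subscribers.
        self.current[path] = value
        for cb in self.subs.get(path, []):
            cb(path, value)

    def get(self, path):
        return self.current.get(path), self.target.get(path)

    def subscribe(self, path, callback):
        self.subs.setdefault(path, []).append(callback)

broker = MiniBroker()
seen = []
broker.subscribe("Vehicle.Cabin.Seat.Row1.Pos1.Position",
                 lambda p, v: seen.append(v))
broker.set_target("Vehicle.Cabin.Seat.Row1.Pos1.Position", 1000)
broker.set_current("Vehicle.Cabin.Seat.Row1.Pos1.Position", 1000)
# seen == [1000]; get() now returns (1000, 1000)
```

The split between `set_target` and `set_current` is the point: only the embedded side closes the loop by making current match target.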
We submitted the inclusion request at the beginning of the week, but we are still waiting for F-Droid to actually show the app in their repository. So stay with me till Monday, then it might hopefully be there. Another thing is the mock service. The folks in the previous presentation had their robot here; we cannot always have a car in our lab to test applications, but we do depend on the behavior of the vehicle, so we need a way to mock it. So the community came up with a behavior definition: for instance, whenever the target value of a seat position is changed to a certain value, like 1000, then the current value should also change to that value. This is what you can mock or emulate with the mock service. To show you just the example I mentioned: whenever the driver-side seat position changes, we create an animation that moves the current value to that position, which makes it quite easy and flexible to test whatever you desire with your car. And last but not least, a sneak preview into the lab: Kuksa is part of the larger community in the Eclipse Foundation. There's an Eclipse Software Defined Vehicle working group, or Eclipse SDV for short, and there's a distribution called Eclipse Leda, which tries to combine some of the major pieces of that ecosystem. What we managed to do is run the Leda Yocto layer on top of AGL, so that you actually get these pieces — especially Kuksa, but also some other projects like Kanto — running on the AGL stack. And I think this is a really good opportunity to learn a bit more about AGL here. Oh, okay, I'll take over then. All right, thank you, Sven Erik. So, I have done a lot of stuff around AGL, so people might recognize me. I'm Scott Murray. I've done Linux for a long time, and I've been doing embedded Linux for a reasonably long time as well.
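The mock-service rule described above — when the target seat position changes, animate the current value toward it — could be emulated with a simple stepping generator. This is our own sketch of the idea, not the mock service's actual behavior language; the step size is arbitrary:

```python
def animate(current, target, step=100):
    """Yield intermediate values moving `current` toward `target`,
    emulating the rule 'when the target changes, animate the current
    value until it matches'. Works in either direction."""
    while current != target:
        # Clamp the move to at most `step` per tick, in the right direction.
        delta = max(-step, min(step, target - current))
        current += delta
        yield current

list(animate(0, 250))  # → [100, 200, 250]
```

Each yielded value would be published as the signal's new current value, so a subscribed application sees the "seat" move over time instead of jumping.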
I've been working on AGL on contract for pretty much eight years at this point, doing all kinds of different things for the project: keeping the Yocto stuff up to date, and also doing a lot of the demo and integration type of work. So, maybe almost half of the people indicated that they know what AGL is, but I'll do a very quick run-through. It's a collaborative open source project, basically trying to build a base platform that you can build an automotive product on. It's about 10 years old. We have a vast array of members now: a lot of the major OEMs, and tier-one and tier-two suppliers. It's pretty much a code-first sort of thing, where we are more focused on: let's build the distro and get it out there for people to try and get involved. A lot of work went into that. You might have seen AGL demos for several years doing app-framework type of stuff, but our members were basically saying in 2020 that they weren't interested in maintaining that, because they weren't going to use it in a product. They all have their own application frameworks, or they buy an application framework, and they would like to see AGL focus on the lower levels — show us how to use open source, more than writing new stuff. So our tech demos, or integration demos, became more like taking best-of-breed open source projects and showing people, in an automotive context: here's how you use these things. And this really worked out well. We needed something to show how you will do vehicle signaling, and VSS and Kuksa.val basically were starting to come out around the same time that we needed a new thing. So I had started playing with Kuksa.val in 2021. Our first release with it basically was our spring release in 2022, and it replaced our old signal composer and our CAN service with basically the original Kuksa.val server. Since then, basically since spring 2022, we have recipes in our layers for AGL to build the Kuksa.val server, now the databroker.
As well, we actually have some signal customization stuff as an example of how you add custom signals. And we use their CAN feeder to basically wire things up and show how you put all these pieces together. We have our own sort of mocked-up AGL virtual car CAN definitions, and that acts as an example for people to use. So that was spring 2022, like I said, and I won't go into all the nitty-gritty there. Originally we were using the original WebSocket API, which is a standard thing, sort of a companion to VSS, and we actually had CAN working in our demos. Through 2022 and into 2023 we were keeping up with the Kuksa.val releases, with some nominal updates around switching how we were doing our signal additions and such. And then this past summer, in our Pike release, I started the process of switching over to the databroker, which is the Rust-based implementation. That actually got interesting, because we're based on Yocto Kirkstone, the LTS release, which at this point is two years old and has an older Rust, so we couldn't actually build the databroker. So at AGL we contributed upstream: I have a mix-in layer that you can get for Yocto Kirkstone that basically gives you a newer Rust to be able to build the databroker, which I now know other people are using for building other Rust projects. So we're now using the databroker; in this upcoming release we are on the absolutely latest version of Kuksa, and I now have us fully switched over — everything's databroker, using gRPC, and all our demos are converted. And that basically acts as a seed. We're trying to seed this with the automotive community, because we see a lot of vendor code where people assume open source is all custom IPC and stuff like that.
And it's like: well, no, there are heavily-used open source projects that do gRPC and interact with cloud providers and so on — you don't have to reinvent the wheel. So Kuksa.val has been a very good thing for us to try and get that across to people. So how exactly are we using it in AGL? There are the VSS applications. As Sven Erik mentioned, there's the concept of actuators. So there are apps that basically just listen to sensors — dashboard types of things. And then for acting on signals, basically implementing actuator behavior, we have some example services that do that kind of thing, like HVAC sort of stuff. There's also setting an actuator value; on a user-facing infotainment app that would be HVAC controls, or audio volume, that type of stuff. In our tree right now, we have two demo services that do the actuator side of things. We have an HVAC service that listens to all the signals in the VSS hierarchy around HVAC controls, and in our demo setup — which unfortunately we won't have the full version of here — actually pushes out to drive some fans and things like that. On the audio side, I'm listening to the audio volume signal that's in VSS, and we have some custom things that I'm working to push upstream, but basically we drive that down into WirePlumber and actually adjust the audio setup. The user-facing side are demo applications: the Qt demo, which I think we might be showing tomorrow. We're using the VSS signals for pretty much everything; all the applications in that demo, which are in our source tree — you can grab them — are wired up to do VSS signaling. And the code is in a nice little library now that basically allows you to reuse it.
In our newer Flutter demo — which, truthfully, I think we'll maybe have one setup showing tomorrow — there's a unified home screen, and it's doing gRPC from Dart. Right now I don't have that packaged up as a library yet, but that might happen this year. Or we might move it to native code; one of our members, who are big into Flutter, tell us that's what they do for some of their stuff. So this is what our newer Flutter demo looks like. In this demo, the tire pressure, the vehicle speed and so on, and the AC controls and the temperature — all of that is going through VSS signaling, driving daemons or whatever you want to do. Or CAN data coming in actually gets converted back into a signal update. So there are some extra presentations from Sven Erik and myself, and we're going to be in the AW building tomorrow; we'll have our table there with our demos. And this is — do you want to do your pitch? Sure. So if this sounds interesting — or even if it doesn't — there's a huge chance to engage with the community around Kuksa and the larger communities in the automotive sector. We have something called the Bosch Connected Experience. It's hosted by Bosch, but it's basically a very large hackathon in Berlin at the end of February. A bit short notice, but I would be really glad to see some of you there. You get the chance to work with a lot of things, maybe actual seats, hopefully maybe actual cars. And we also plan to have some kind of simulation of a car which is then connected to a databroker. So I think it will be cool to see what you can do combining this physical and cyber-physical world, if you will. I really encourage you to come. Normally you have to apply to attend.
But if you just approach me, I think we'll find a quick way to get you in, because being in this room, I think, qualifies you as a good hacker for that. So maybe see you there, or at another community meeting. Thanks a lot, and we're open for questions. — Yeah, I think we have a couple of minutes; we'll have to share. — Thank you, great talk. I just wanted to understand a little bit about your testing cycle. If you're developing something with this, you test it in a virtual environment, and then you want to test it on a real car — what do you do in practice when you're developing stuff? — Do you have an answer to that? So, I wouldn't have a straight answer, because here we talk more about implementing the abstraction layer, and we mostly test it against things like this mock service, or with something like a feeder where we have recorded data. But what you're touching on is a really general topic: how do I actually get my automotive software up and running and into the vehicle? That's a bit beyond the scope of what just the Kuksa project is doing. So there's not too much I can comment on here, but I think it's a good topic for the communities, either AGL or Eclipse SDV, because we have some rounds of meetings where we talk exactly about that. — Yeah, I would just say that it's still actually pretty early days for VSS. I mean, I know there's a bunch of OEMs and tier ones that are actively working to productize, so I don't think we have visibility yet into how they're actually going about testing. Hopefully in the next year or two we'll see more, and we'll maybe get some ideas there. — Any more questions? — Maybe in two or three words, can you share a little bit about the databroker? Is it something that looks like D-Bus? Is it something like an MQTT broker? Something else? What does it look like exactly? Is it something that we can reuse elsewhere, or is it specific to Kuksa?
I would say the databroker is really specific to VSS data; it's not like you can put any data in there. The way it works is: you start the databroker and you also give it the VSS data model that you have. The VSS data model is expressed in a JSON or YAML file. You feed this JSON or YAML file into the databroker, and then you can basically do get, set and subscribe. That's why I put up this slide again: this kind of data is expressed in the data model, and the databroker implicitly knows about it. When you mention MQTT — I have to admit there are also other APIs to interact with VSS. For instance, VISS, done at the W3C; they also looked a bit into how to do that over MQTT. But again, the databroker is specifically tailored to interacting with VSS signals; that's why I cannot generalize it too much. — Basically, when I go home, I have a project that our vehicle-to-cloud expert group in AGL wants to see: pushing from VSS up into the cloud. So I'm going to be building a proxy that takes a list of signals to listen to from the VSS databroker — the Kuksa databroker — and then basically MQTTs them up somewhere. So talk to us again in a while; I'll have a story for you then. — Maybe one final thing to add to that. There's one slide I actually removed from the slide deck, but there has been a huge discussion in the VSS community about whether VSS actually fits in the vehicle, or whether you should use VSS more on the cloud back end, so that you put all the data from the car, in whatever form, up to the cloud and then consume it as VSS there. And the databroker is kind of an answer: yes, it's also possible to do it in the car, in addition to the cloud. So that's kind of the background story as well. — Okay, I think that's all we have time for for the moment. Thank you very much, Sven Erik and Scott, and a round of applause.
An open-source, open-hardware offline finding system
Hello. So this is our talk about Spotnuts, a Teckids tinkering project. First, who we are. I am Pingu. I am 14 years old. I'm a member of the Teckids community. I began hacking like four years ago or something like that. I'm interested in Python, home automation and stuff, and obviously penguins. And I also work on the AlekSIS project. — And my name is Nik, or Dominik if you like longer names. I am more or less the founder of the Teckids community, about which Pingu will say a few words right after my introduction. Here I'm working at the intersection between education and free software. That means I'm showing young people what free software is, what the values around free software are, and also helping to develop and promote free software for educational institutions. In my day job I mostly spend my time as a trainer for Linux administration, PostgreSQL, Rust and Python related topics. — Yes, we mentioned Teckids. It's a community based in Germany. Our goal is to create a comprehensive technical world for and with children, and to empower young people to question things and hack and build stuff, like this project or the AlekSIS project. Here you can see where we were: this is an AlekSIS meeting; this was, I think, at FrOSCon, the second-largest conference in Germany. Here on the left side is our summer camp, where the kids come and learn something — I think here they are soldering things together and then programming them. So now, what is an offline finding system? Basically, you attach something — a small tag — to something like your backpack, then you lose it, and then you open some app on your smartphone or on your laptop and you can find it, or search for it, or don't find it. And the more technical view of offline finding: the tag sends a signal via Bluetooth, because it's offline — there isn't a connection between the tag and the internet.
Then a helper app on someone's phone receives this Bluetooth signal and says: hey, I found this tag there. And then I, as the owner, can go on my smartphone, search for the tag, and my phone searches the database for the tag. So, how we got into offline finding — it's a bit of a story. My scooter, the scooter I ride around the city, got stolen, and I had a Samsung SmartTag — an offline finding tag — attached to it. We drove to the approximate location, and with the feature that you can send a signal to the tag and the tag responds "I'm here", we could see roughly where the tag was. And then we did trilateration: we approached it from multiple sides until the signal pointed at one spot, and we got the scooter back. And also, there's our sketchy chef: he always loses stuff and wants to get it back or find it. So, offline finding basically has three components. There are the tracking tokens, the small devices that you attach to the things that you want to find; they aren't connected to the internet, because then it wouldn't be offline. Then there are the smartphones, or small helper devices: they receive the signal from the tag and then send it to the internet. And then there's obviously a server, where messages like "I'm here, and this is the tag" are sent to, and from which I can get them back. So there are obviously some challenges. Some are privacy-related: a stranger must not be able to abuse the beacon for tracking over the long term, and they should not be able to identify the owner, because then I could know where some person's stuff is. And the back end, the server, must not be able to identify the owners either, because otherwise I, as the operator of the server, could identify them. But some challenges are also technical, like encryption without knowing the receiver, and Bluetooth, because of its range.
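The trilateration approach described here — closing in on the tag from multiple sides — can be written down for the idealized 2-D case: three known observation points with measured distances determine the position by linearizing the circle equations into a 2×2 linear system. The positions and distances below are made up for illustration:

```python
def trilaterate(p1, r1, p2, r2, p3, r3):
    """Idealized 2-D trilateration: find the point at distance r_i from
    each anchor p_i. Subtracting the circle equations pairwise cancels
    the quadratic terms, leaving two linear equations in (x, y)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a1 * b2 - a2 * b1          # zero if the anchors are collinear
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

trilaterate((0, 0), 5, (6, 0), 5, (0, 8), 5)  # → (3.0, 4.0)
```

Real Bluetooth RSSI-based distances are noisy, so in practice this becomes a least-squares fit over many observations rather than an exact solve.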
And then, because of Bluetooth, also the energy efficiency. Yeah, because at one point we tried it out on an ESP: how long would it last? And I think we did it with SHA-256 hashing, and it lasted for a couple of hours, because the battery is small, and I think a couple of hours aren't enough for a tracking device. Yeah, design overview. All right. Thank you. Yeah. So after we somehow got nerd-sniped by this topic around offline finding and how this works, of course we wanted to try how far we can get building such a system. Of course, somewhat motivated by our grumpy, sorry, I mean sketchy, sketchy chef, who asked: hey, is there some system like this based on open hardware, open source? I'm not so very excited about Apple controlling where I lose and find and rediscover my stuff. So the first thing we did was look at how the Samsung SmartTag system worked, which is the sort of tag that Pingu had attached to the scooter. And we found out that it sends these strange beacons of some sort using Bluetooth Low Energy. I will come back to that in a minute. And in the course of time, while we looked at how this works, it more or less became obvious that this sort of system is actually an end-to-end encrypted mailbox system, because there is an owner device, and this has a public key, and, yeah, what can you do with a public key? You can receive some sort of messages. And there are helper devices that can see these beacons and more or less just send any sort of message back to the owner device. So if I lose something as the owner, and let's say Pingu wants to help me find it, then they walk around in the city and their smartphone receives the beacon signal, and now they somehow need to get the information back to me, telling me where they saw my beacon. And that's where these tags come in, and they are probably as dumb as you can imagine: they just send out a public key, so all the information you need to somehow get the location sent back to me.
It's more or less a coincidence that these messages carry location information. We could just as well put anything in there. If any of you are into this sort of system: Apple had a few vulnerabilities discovered in their implementation. One of the most interesting ones in recent weeks was that people actually used the beacons themselves to transport keylogger information out of otherwise air-gapped environments. I think your favorite search engine, or the search engine you distrust least, will bring up some really interesting information about this. So what we really want to build is a mailbox system, and some sort of key management system, because that's the really interesting part as far as I'm concerned: how we solve these privacy issues, and some of the technical issues, with cryptography. So this is the big picture. If this works, I can zoom around in this a bit, and now it shows that I should have used the headset. Can I do it with one hand? Yes, I can. So here's the big picture, and what you can see here is: all the red circles are showing secret keys that I use in the system, the green circles are showing public keys that I use in the system. Let's get a short overview of how this works. So we have the owner device, and we give the owner device a sort of main key. This identifies the owner device, and the easiest thing we could do now is make a Bluetooth beacon, simply copy the public key of the owner onto that beacon, and attach it to some bag or scooter or some plush squirrel or whatever you don't want to lose. At this point we are more or less done with the mailbox part and with the encryption part, but we would run into all the privacy troubles, because what you now can do is follow the tag around. It always broadcasts the same public key information. You can just walk around the city and always rediscover where one person is moving, and make a nice motion profile of this person.
Also, you could discover several tokens that are linked to the same owner device, and learn that all these tokens belong to the same owner. These are two of the most inherent privacy mistakes that you obviously don't want to make when designing such a system. So the next thing we do is derive, using hash-based key derivation, one key pair for each token, so that we can unlink the tokens from each other. And for the rest of the system, I think many of you will have heard the term "ratchet algorithm"; the rest of the system is more or less very close to what, for example, the Signal messenger does with its cryptography. We transfer this device key pair to the tag, and now we do one key derivation every, let's say, 15 minutes; at least that's what Apple does. And the interesting part here, because I never worked with cryptography on this level before, is that now we can derive new key pairs on the tag, and it will send out another elliptic curve public key every 15 minutes. So we fix the privacy issue of following someone around. Now you can follow someone for 15 minutes, and after 15 minutes you see another beacon, and you cannot distinguish whether this is the same tag, which rotated its key pair, or some other tag of another person. Yeah, that's more or less the main secret of the system, and then, if I find the tag, I can send a message to the public key it is currently broadcasting. There are some other things mixed in here, but I don't want to go into too much detail about that part right now. And the second secret is that when I try to retrieve my location information, all the messages that others sent to me, I just ask the server for all the information sent to all the public keys I know my tag will have generated within the time frame. And this request can also be encrypted, because we also use another set of keys, so that the server cannot find out that all these keys are linked to my device.
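The key schedule just described can be sketched as a toy. As a simplification (the real system derives elliptic-curve key pairs; this sketch uses a symmetric HMAC-SHA256 chain, and the label strings are invented), the tag and the owner both start from the shared device key, step it forward every interval, and the owner can later enumerate every beacon identity its tag would have broadcast in a time window.

```python
import hmac, hashlib

def next_key(key: bytes) -> bytes:
    # One ratchet step; the tag would do this every 15 minutes
    return hmac.new(key, b"ratchet-step", hashlib.sha256).digest()

def beacon_id(key: bytes) -> bytes:
    # What the tag broadcasts for the current interval
    # (in the real system this is an elliptic-curve public key)
    return hmac.new(key, b"beacon", hashlib.sha256).digest()[:16]

def ids_for_window(device_key: bytes, first: int, last: int):
    """Owner side: reproduce all beacon identities between two intervals."""
    key, out = device_key, []
    for interval in range(last + 1):
        if interval >= first:
            out.append(beacon_id(key))
        key = next_key(key)
    return out

device_key = b"\x01" * 32  # provisioned onto the tag by the owner device
ids = ids_for_window(device_key, 0, 1)
# Two consecutive intervals broadcast unlinkable-looking identifiers,
# but the owner can still reproduce every identifier in the window
print([i.hex()[:8] for i in ids])
```

An observer who only sees the broadcasts cannot link interval 0 to interval 1 without the device key, which is exactly the 15-minute unlinkability property described above.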
The server should have zero knowledge about the ownership relation between the tags and the owners. Okay, our experiments are implemented in Rust. We have split it into the Spotnuts crates. Hazel OS is what is supposed to be running on the tags, the helper device is a Rust-based mobile app, and in case you happen to find the time to review an implementation of Signal's XEdDSA in Rust, we also factored out that crate, so you can tell us what obvious mistakes we made in the cryptography there, if you like. And the JG crates are a general implementation of this mailbox system, which can be used for the offline finding system, but actually for anything that is supposed to carry public key information to someone and allow them to anonymously send back some sort of information. So, what do we have? We have this implementation of the general JG key exchange and mailbox system, with a library usable as an alpha version, and a small server implementation that actually does not care whether it is used for offline finding or whatever other purpose. And we have an experimental version of Hazel OS for ESP32, with the limitation that Pingu already mentioned: we get the ESP32 development board to run for something like five hours. So how long did it take to get your scooter back? Did you manage to do it in five hours? I don't think so. Okay, you have to be quicker next time. So we can either fix the technical issue, or you can start a running career, whichever is easier. Okay, so the next thing we want to do is find a decent microcontroller. I happened to give a Rust training last week, and one attendee told me: this ESP32, this has nothing to do with microcontrollers, this is a toy; get a more hardcore microcontroller. And I think this is what we will try. And alongside Hazel OS, we need to build an experimental companion app.
Maybe design a nice PCB, so you don't have to attach a breadboard with a development board to your scooter or stuffed squirrel or whatever. And maybe we can find others interested in an open offline finding standard, because Google and Apple and Microsoft and you name it are working on something like this, but of course it's not so very openly developed. Spotnuts is a tinkering project. Thank you for the talk. The question is: how do you allow the helper device to send the message to the owner device, and at the exact same time not allow some stranger to track the owner? I somehow have the feeling that at least one of my slides went missing when refactoring the slide deck. There's a backend infrastructure. One thing I mentioned is JGD, which is just a small mailbox server. It just has two API endpoints. One receives messages; it does not care what these messages contain. They are just JSON-encoded encrypted messages to the public key we saw. And the owner devices just ask: hey, do you happen to have received any message for this public key I think I might have had? So the thing here is, you can actually, even in the Apple ecosystem, ask the server for all messages you like. You can just send public keys there, and they will give you the information about all messages that were sent encrypted to this public key. So you can download the whole database from Apple's servers as well. The nice thing is, you can't do anything with it, because obviously you also need the second half of the key pair. If you don't have it, you get a nice bunch of random data. Over here. Hello. Would it make sense to make this key rotation time period not fixed at 15 minutes? Because if I was following a tag, I could time the key rotation based on the period and then know that it was rotated at exactly 15 minutes. Yes. A bit of a silly question, but have you considered Linux mobile support for the helper device? Can you repeat the question, please?
Have you considered supporting Linux mobile phones? Supporting mobile phones to carry the... Phones running Linux instead of Android or iOS, that is. It's supposed to be a web application, which will need Web Bluetooth support in more browsers than Google Chrome, but actually there's this Rust library, and it should be easy to use it in any sort of app that you like, on any platform. That's great. Thank you. Thank you again. Thank you.
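The JGD mailbox server described in the Q&A above has just two endpoints: one accepts opaque encrypted messages addressed to a public key, the other hands back everything stored under the keys you ask for. A minimal in-memory sketch of that contract (the class and method names are invented, not the project's API):

```python
class MailboxServer:
    """Toy two-endpoint mailbox: store opaque ciphertexts under a public
    key, hand them back on request, never learn who owns which key."""

    def __init__(self):
        self._box = {}  # public key -> list of opaque encrypted messages

    def submit(self, pubkey: str, ciphertext: bytes):
        # Helper devices call this; the server never inspects the payload
        self._box.setdefault(pubkey, []).append(ciphertext)

    def fetch(self, pubkeys):
        # Anyone may ask for any keys; without the matching private key
        # the payloads are just random-looking bytes
        return {k: self._box.get(k, []) for k in pubkeys}

server = MailboxServer()
server.submit("pk-abc", b"<encrypted location report>")
got = server.fetch(["pk-abc", "pk-never-seen"])
print(got["pk-abc"])         # [b'<encrypted location report>']
print(got["pk-never-seen"])  # []
```

This mirrors the point made about Apple's system: fetching is deliberately unauthenticated, because possession of the private key, not access control on the server, is what protects the content.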
From an artificial nose weekend hack to a future-proof IoT device
That was helpful. Thank you. Thanks for joining. This is going to be a talk about a fun project that I started, I think, almost four years ago now, so I feel like I'm sort of milking the idea, but it's pretty cool. Back in 2019, I guess, I ended up building an artificial nose using some cool tech, and I'm going to talk a bit about the tech behind it and how I ended up moving the project from a really, really dirty weekend hack into something that's hopefully more future-proof, using cool things like Zephyr. So, a few words about myself. I'm Benjamin. I'm based in France. For the past year, almost to the day, actually, I've been working as a developer advocate for the Zephyr project at the Linux Foundation, and I do many things, including, as a good French person, I guess, baking bread. And I don't know about you guys, but I've been trying to perfect my bread recipe for probably over 30 years. Like, I'm still not really happy about the way it turns out; it's a bit random, right? And so back, I think, in the really first few weeks of COVID, being stuck at home, lots of time on my hands, I was like: maybe technology can help me improve my bread recipe. What if I could figure out a device, with maybe some AI in the mix, that I could train to figure out when my sourdough starter would be perfectly fermented? In my head, at least, the idea would be that the AI figures out when the sourdough kind of looks all right, I bake the bread, figure out if the bread is good or not, give it a, like, oh, it's a nine out of ten, it's really crispy, really nice, and then do the training that way, right?
And so the idea would be to smell the sourdough starter to capture some information. In my head, at least (I'm not a chemist, I'm not a food chemist), measuring things like the amount of volatile organic compounds and CO, CO2, whatever, there has to be a correlation, and for the perfectly ripe sourdough starter there has to be a way to identify it, right? And back in 2019 there was also this new cool kid on the block, which was, and which is, TinyML, and things like TensorFlow Lite finally available on microcontrollers, things like that, right? And the thing is, I know really little about neural networks myself. For some reason, the math, like, whenever I would open a book about neural networks, and, oh yeah, it's easy, you're going to recognize handwritten digits, this is a bitmap, you go through some layers, blah blah blah, oh, you recognize the digit, that was going way over my head. The thing is, playing with physical things, more tangible things, I was actually on a roll in just a few hours, really, with the help of some tools some of you might have heard about, something called Edge Impulse. It's not strictly speaking open source, although it's based on TensorFlow Lite for Microcontrollers, but it helped me train a model. Basically, taking some Arduino-compatible device, this is a Wio Terminal, a Cortex-M4, taking a gas sensor, capturing the data quite often, feeding this data into some kind of training algorithm, and I would be able to figure out the difference, not necessarily between good bread and bad bread, because, remember, COVID, flour wasn't even available in the supermarkets, but between the booze that I had in my house. So I actually figured that it was able to make the difference not only between, like, rum and whiskey, but it was actually accurate enough that with two whiskeys, one really peated whiskey and one slightly less, it would make up the
difference, right? And I started to talk about the project, because I found it really cool. Like, not the silly bread thingy, but something slightly more useful, which is figuring out when you can spot, in human breath, the markers for fungal pneumonia. Kaleb, the kid, almost died, basically, when he was really young, and the doctors couldn't diagnose the disease. It turns out that since then there's literature available out there that says that, yeah, there are some markers, and he sort of built a proof of concept for that. So that felt really good. But what didn't feel really good is that the code of that project that I had put together, which was available on GitHub from day one, is horrible. It's like 2,000 lines of boilerplate, copy-paste, typical Arduino code, right? Like, I mean, I've been gathering bits here and there, and of course it works, but it's really, really bad. Just really quickly, because I think it's worth mentioning: how does a machine smell, anyway? We're all, I think, familiar with things like temperature sensors and humidity and illuminance; that certainly comes to mind, because we actually use them every day. But there are also sensors that can smell: they measure the concentration of particular chemicals in the air. The way it works is basically just a chemical reaction on a tiny slice of metal oxide semiconductor, and based on how many of said compounds can be found in the air, you can measure a change in resistance, right?
The more VOCs, volatile organic compounds, there are in the air, the higher the resistance, for example. Which means that I could start acquiring data, putting my sensor on top of bottles of alcohol and tea and coffee and whatnot, and capture basically what I would call the fingerprint, the olfactory fingerprint, of a particular smell, and then, with a bunch of AI and ML, figure out what in this raw data identifies a smell. And so my intuition, not knowing, again, a thing about signal extraction and all that kind of thing, would be: oh well, if this is whiskey, then if I were to write down what makes whiskey so special, it would probably be something like, oh yeah, when you smell whiskey, nitrogen dioxide goes up, carbon monoxide not so much, VOC goes up as well, maybe in a slightly more steady way. And basically that's what happens; the way the model works is just that, except that it's a machine doing it: looking at the raw data, doing some basic statistics to extract the min, the mean, the max, the standard deviation, all those things that could potentially characterize the smell. And then this pre-processing, this DSP, if you will, goes through a typical neural network. So this is fun; you get to the point where you have this funny looking thing, you can even go the extra mile and sort of 3D print the enclosure, and, yeah, you have a lot of fun.
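The statistics stage just described, collapsing a raw window of readings into min, mean, max and standard deviation per gas channel, is easy to sketch. This is an illustrative version of that pre-processing idea, not Edge Impulse's actual DSP code, and the readings are made up.

```python
from statistics import mean, pstdev

def features(window):
    """Collapse a raw window of readings for one gas channel into the
    summary statistics that get fed to the neural network."""
    return {
        "min": min(window),
        "max": max(window),
        "mean": mean(window),
        "std": pstdev(window),
    }

# Pretend VOC readings over a 2-second window sampled at 10 Hz
voc = [1.0, 1.2, 1.5, 2.1, 2.8, 3.0, 3.1, 2.9, 2.7, 2.6,
       2.4, 2.2, 2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3]
print(features(voc))
```

Running the same extraction over every channel (NO2, CO, VOC, ...) and concatenating the results gives the fixed-size feature vector the classifier expects.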
I ended up building and packing, again, in those 2,000 lines of code, plus all the libraries, of course, that I'm pulling in, a GUI; I would have Wi-Fi integration, actually, that's something that I added eventually, and whenever I smell something I can push it using MQTT to a server; there are, of course, tons of hardware interactions; and all that needs to work at the same time. Except that if you do it the Arduino way, and the lazy way, I guess, then you end up just doing this: if you're lazy and just eager to get your POC and your thing working, you end up putting a lot of code in, essentially, a superloop. And so, as often as possible, I need to do all this: acquiring sensor data, which, by the way, you don't need to do that often for getting good accuracy. The way the device works is that I just sample the gas sensor readings 10 times a second; it's not all that much, so every 100 milliseconds I would read sensor data. And then I need a bit of time to actually run the data through the AI model, which, again, doesn't really take a lot; the model, at the end of the day, is really simple, so you really only need a couple milliseconds there, fair enough. And then there's the whole GUI aspect, which, again, if you're lazy, it's not even interrupt driven, so you need to figure out if a button is being pressed right in the loop, not ideal, but you do that. And then, if you want, you post results to an IoT server, and you don't even know how long that is going to take, right? Like, if this is synchronous, it might be a problem. Enter an RTOS, right?
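The superloop problem just described can be shown with a tiny simulation: if sensor reads, inference, GUI refresh and network publish all share one loop, the next sensor read only happens after everything else has run. The step durations here are invented, but the effect is the point.

```python
# Simulated superloop: each iteration reads the sensor, runs inference,
# redraws the GUI and publishes over the network, and only then reads
# the sensor again.  Durations are made up, in milliseconds.
steps = {"read_sensor": 2, "inference": 5, "update_gui": 30, "publish": 80}

t, sample_times = 0, []
for _ in range(5):
    sample_times.append(t)      # the moment the sensor is actually read
    t += sum(steps.values())    # everything else runs before the next read

intervals = [b - a for a, b in zip(sample_times, sample_times[1:])]
print(intervals)  # [117, 117, 117, 117], not the 100 ms the model expects
```

And it gets worse in the synchronous-network case: if `publish` occasionally blocks for a second, the sampling interval jumps around, which is exactly the jitter an RTOS with dedicated threads avoids.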
That's basically how it was: for the first few years of the project, it was sitting there on GitHub, this really crappy thing, where people would open issues. I mean, yes, I would put up the ready-to-flash firmware for people to use, but anyone who wanted to actually tweak the code, they were just scared. And so I ended up using Zephyr to try and rewrite it, and also, frankly, to teach myself some of the best practices there. I ended up trying to leverage some of the features of Zephyr which, beyond it being an RTOS, would hopefully help me move away from the superloop, and also get a better solution for targeting multiple architectures. Originally I would be targeting the Wio Terminal, which is a SAMD51 Cortex-M4, but I actually don't mind ESP32, and having the same code, same portable code, and portable build infrastructure, test infrastructure, I don't mind getting that, plus all the libraries that also come pre-packaged. And yeah, that's basically what I did. So, from this point, I guess, the presentation is more about telling you how I replaced some of the concepts or some of the things that I had in my Arduino code, and pointing you to some interesting areas in Zephyr, features and subsystems that are available that you maybe didn't know existed, and, frankly, I didn't know existed either. Sensor acquisition, that might be sort of the easy part, but I really like the fact that now, in my V2 version, if you will, of the nose, I have essentially, and literally, a dedicated thread that acquires the data exactly at the sampling rate that I require for my model to perform accurately, right? That could be an issue: if I do the superloop thing, and for some reason the UI takes longer to refresh, or communicating with the cloud takes longer, then it will basically shift the sampling rate for the gas sensor data, which basically means that I will start feeding crap into my AI model.
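The dedicated acquisition thread can hold an exact 10 Hz rate by sleeping until absolute deadlines instead of sleeping a relative 100 ms after each read. A sketch of that idea (the real project does this with a Zephyr thread in C; `read_sensor` here is a stand-in, not a real driver call):

```python
import collections
import threading
import time

SAMPLE_PERIOD = 0.1                  # 10 Hz, like the gas sensor
ring = collections.deque(maxlen=20)  # ring buffer read by the inference thread

def read_sensor():
    return 42.0                      # stand-in for the real gas reading

def acquire(n_samples):
    deadline = time.monotonic()
    for _ in range(n_samples):
        ring.append(read_sensor())
        # Absolute deadline: even if a read runs long, the schedule
        # doesn't drift the way a relative sleep would
        deadline += SAMPLE_PERIOD
        time.sleep(max(0.0, deadline - time.monotonic()))

t = threading.Thread(target=acquire, args=(5,))
t.start()
t.join()
print(len(ring))  # 5 samples, spaced one period apart
```

The rest of the app then consumes from the ring buffer at its own pace; a slow GUI frame or network call no longer shifts the sampling grid.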
So, you may also want to sometimes put the sensor to sleep and make sure that it doesn't draw energy unnecessarily, and that's actually also integrated in the Zephyr APIs. Then comes the TensorFlow Lite aspect. So, I'm basically pulling TensorFlow Lite as a library into my application and leveraging something that's called zbus, which makes it, especially for someone like me who's not necessarily a hardcore embedded developer... I basically have this high-level framework where, okay, I have my sensor acquisition thread that does its stuff, basically puts the sensor readings in a ring buffer, and whenever there is data that's available for the rest of the world and the rest of my app to do something with, there's effectively an eventing system where my inference thread gets the data, like, subscribes to sensor readings, so that it does its stuff and figures out what it is smelling, and it also uses zbus to put the result out, using the same topic mechanism, if you will, so that, guess what, the GUI, for example, can in turn subscribe to this piece of information to do something useful with it. No need for FIFOs and queues and semaphores; it's actually really nice, and the overhead is minimal.
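The publish/subscribe pattern described here can be mimicked in a few lines to show the shape of it. This is an illustrative toy in the style of Zephyr's zbus (whose real API is C, with channels and observers), not actual project code.

```python
from dataclasses import dataclass

@dataclass
class InferenceResult:
    """The typed message carried on the channel."""
    label: str
    confidence: float

class Channel:
    """Minimal zbus-style channel: publish a message, every observer runs."""
    def __init__(self):
        self._observers = []

    def add_observer(self, callback):
        self._observers.append(callback)

    def publish(self, msg):
        for cb in self._observers:
            cb(msg)

inference_chan = Channel()
# The GUI and, say, an MQTT publisher both just subscribe to the channel;
# the inference thread never needs to know who is listening
inference_chan.add_observer(lambda m: print(f"GUI: {m.label} ({m.confidence:.0%})"))
inference_chan.publish(InferenceResult("coffee", 0.87))  # prints "GUI: coffee (87%)"
```

The decoupling is the point: producers and consumers only share the channel and the message type, which is why no hand-rolled FIFOs or semaphores show up in the application code.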
So, there's that. And then, for the GUI, one thing that's really nice with Zephyr is that you have LVGL, and it just works. There are obviously tons of drivers already available in Zephyr for a wide variety of display controllers, but then on top of that you even have the high-level framework that is LVGL for creating a GUI with, like, charts, and this gauge (I never know how to pronounce it, like, this gauge). Those are effectively widgets that subscribe to the data that comes in on zbus and just display it, and the code is really, really straightforward. It also integrates with things like the Zephyr input system: if you have buttons, keypads, touch screens that send events, you can have the LVGL app automatically react to that, right? So that's nice. And as you may notice, this is not a photo of LVGL running on the actual device; it is a screenshot of LVGL running in a desktop environment, because you can actually run the full artificial nose code in a fully emulated environment, if you will, on a POSIX OS, including the GUI aspect. So that's pretty nice, and like I said, it really feels like you're writing really high-level applications. I'm defining, like, I have a listener that wants to be notified whenever there is an inference result that's being made available, probably by the TensorFlow Lite for Microcontrollers task and thread, and when that's happening, it's pretty straightforward: you get the data, and you actually get it as an actual typed message, something you can really make good sense of. In my case, the inference result would contain both a label telling me it's smelling coffee, whiskey, whatever, and a confidence level, based on how confident the model is that it is effectively whiskey or coffee. And so I can actually display that on my UI, and the code really, literally, moved from, yeah,
2,000 lines of code to, I didn't count, but a couple hundred max. So there's that. And then, this is sort of nice to have: if you were to do more than just a kind of prototype toy project, you could think about having the device, probably with something less stupid than this enclosure, in the ceiling of the restrooms here in the building, so that whenever it smells pretty bad, you know that it's time to send someone to clean the place. But you don't want to send someone to clean the place twice a day if nothing happened, like if it's the weekend, or a day where there are strikes or whatever, or there's COVID and everyone is at home. So the device would need to be communicating somehow, remotely, and adding that to my project was also pretty straightforward, because there is a full-blown networking stack in Zephyr: TCP/IP, and, like, CoAP and MQTT and all the variants, all the flavors, and all the kinds of connectivity options you may want to use, they're all there. And so, effectively, I can maybe quickly switch to a really quick demo. So, well, this is the version with the enclosure; this is the version which is actually the Wio Terminal; this one is an M5Stack Core2, so this is effectively an ESP32; this is the sensor. It's already configured and already connected to Wi-Fi. So if I were to connect to my MQTT... I think I need to stop sharing, maybe... yeah, connected to an MQTT broker, and in real time, so this is really reaching the internet and then my laptop connecting to the very same broker that this guy is connected to. And, yeah, apparently it's smelling ambient air; I guess it's more, like, nerdy or geeky air. And if I put... so this is, yeah, well, that was fast, actually, this is lemon. And for the anecdote, not that you care, but I actually forgot to bring the lemon from home, so I bought this one just this morning, so
it's a different lemon than the one I used for training the model, I guess, but it apparently works just the same. So there's that. And what else? Yeah, many, many other things are pretty cool in Zephyr. The fact that it leverages Kconfig and devicetree, just like Linux does, makes for pretty neat code when it comes to: oh, I want my GUI to be slightly different if my screen is large, I want to cram more into the UI. Well, that's information that you can get really easily from devicetree, right? If my screen is wider than 300 pixels, blah. Testing framework, CI integration: every time I commit something and push something and make a modification to the artificial nose, it gets built immediately. By the way, I was working at Microsoft back then, and they had absolutely no problem with me putting everything on GitHub, so kudos to them for that. So now the new URL, if you wanted to check out the Zephyr version, would be the same, with Zephyr in the name. You can find all the parts online. I don't get any royalties or whatever for that, but Seeed actually has a sort of nice, ready-to-use bundle where you can order all the parts. And that's it. Questions! Hello, thank you very much. So there is some abstraction where you can use different sensors, but surely the sensors don't give the same values for...
Great question. I had a slide, I removed the slide, removed the notes, I forgot. One thing that I would love to see happen, to kind of answer your question, is some kind of open data set, an open ontology, to actually describe smells in a consistent way. Because you're right: you have sensors that give you readings as a unitless concentration, like, going between zero and 100% of VOC concentration; some would be talking ppm; some would be whatever; some have weird calibration things. So you're right, you would probably need to retrain the model. It's not like, at least with this code, you can easily say: okay, I'm going to switch from Bosch to AliExpress, and it's going to work just the same. I hope this answers the question. One more, yeah. We would like to know: how did it work out with the sourdough and your baguettes? That's super; everyone asks that question. I never actually did the whole thing, because back then, COVID, there was no flour, and it would have been painful to bake dozens and dozens of baguettes and eat them anyway, and it's more fun to play with just random things like spices or booze. The sourdough thing probably works; frankly, it could probably be done in a simpler way too. Maybe you just need an alcohol sensor and just measure the peak, and maybe that's it, I don't know. Thanks everyone. Okay, thank you.
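One pragmatic workaround for the "every sensor reports differently" problem raised in that last question is to normalize each channel against that sensor's own range before training and inference, so the model only ever sees dimensionless values. A sketch of the idea; the sensor ranges here are invented, and this doesn't remove the need to recalibrate or retrain when the sensor's response curve differs, only the unit mismatch.

```python
def normalize(value, lo, hi):
    """Map a raw reading onto 0..1 given that sensor's own range, so
    readings in ppm, percent or raw counts become comparable."""
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))

# The same physical situation, seen by two hypothetical VOC sensors:
# sensor A reports 0..100 (% of full scale), sensor B reports 0..500 ppm
print(normalize(40.0, 0.0, 100.0))   # 0.4
print(normalize(200.0, 0.0, 500.0))  # 0.4, the same normalized value
```

An open ontology of smells, as wished for in the answer, would essentially standardize the `lo`/`hi` metadata (and the channel semantics) so this mapping wouldn't have to be hand-maintained per sensor.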
Google Home, But Better: Building our own Smart Home Display with Flutter
So, welcome for the second time. Thanks for staying this long with me, for the last talk today, named "Google Home, But Better". Starting really well. Just a second. So, even though it's a short talk, just really quick, a little agenda for today, what you can expect: a really brief section about me, why am I talking about this, why should you listen to me talking about Flutter. The hardware used in this project, of course, that's one of the interesting parts, but no really big surprises there; it's just what you would all expect. Then we get to the software, part one, the embedded Flutter part, and part two, the implementation. And I think, for most of you, this will be the most interesting part of this talk. So, first about me. Hi there, I'm Moritz. A few years ago, when I was 15, 16, I started out with embedded development. Back then it was all hobbies. I started out with an 8051 derivative, I think it was an Infineon XC878, and I started developing in C. Back then I wanted to mainly build everything around music: hi-fi, loudspeakers, equalizers, digital sound processors, and so on. Following through college I kept working in embedded, which is why we created Snapp Embedded; that's what we're doing there. I'm also co-organizing the Flutter Munich meetup, so if you ever want to come over or speak in Munich about Flutter, just feel free to hit me up. So, I left embedded, and now I'm back at embedded. Why? And this is a maybe really short clip showcasing why I'm back at embedded user interfaces, because this is still stuff we get today in new projects. Sometimes you get a new coffee machine, state of the art, with a touchscreen, and you use the touchscreen and you're like: oh no, God, why did you build this? So, yeah, I don't want to build any more of those things. I want to build UIs like the one today's talk is about. I hope this looks a little better than the things you saw before.
That's the user interface of the Google Home replica we built, or I built, that I originally wanted to present here live, but sadly it would have been hard to set it up in five minutes and get it here on the table, so we'll rather stick with the presentation. Also, it would have been unfair for all the people online. But nevertheless, I have pictures of everything, and we're going through that now. So, the hardware. Yeah, as I said, not much more than you would imagine: a Raspberry Pi 4, still the model 4B, with 4GB of RAM, that's enough. 2GB of RAM with a desktop environment and Flutter, yeah, I wouldn't recommend that on a Raspberry Pi. Of course, the Raspberry Pi 5 would work; it would just be more expensive and would run just as well. A little thing we have in here, which deals with the "but better" part: with the Google Home, or smart home devices in general, we can't add whatever hardware we want. And as we will not be adding a voice command service on this device, I thought about what would be cooler. Voice commands are already out there, and what do we need to see? What is the most interesting thing? And that's, for a lot of people I guess, interacting with custom hardware. Therefore, we integrated an air sensor, the Pimoroni SCD41. It measures CO2, temperature, and humidity, connects to the Raspberry Pi with I2C, and, which is also very handy, it comes with a ready Python library that's known to be working with the Raspberry Pi. The touchscreen is just some Waveshare 11-inch IPS panel, capacitive touch, USB, HDMI, really nothing too special. Those touchscreens just got really good in the last years. Using them, at least with Raspberry Pi OS, just works out of the box; it's fine, nothing to worry about anymore.
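A toy version of what the air-quality part of such a dashboard does with the SCD41's three readings (CO2 in ppm, temperature in °C, relative humidity in %). The readings here are simulated and the comfort bands are common indoor-air rules of thumb, not values from the talk; on the real device the numbers would come over I2C via the Pimoroni Python library, whose API isn't shown here.

```python
def air_quality_label(co2_ppm: float) -> str:
    """Rough CO2 comfort bands, as commonly used for indoor air."""
    if co2_ppm < 800:
        return "good"
    if co2_ppm < 1400:
        return "ventilate soon"
    return "open a window!"

# Simulated SCD41 measurement: CO2 (ppm), temperature (C), humidity (%RH)
co2, temp, hum = 1250.0, 22.5, 41.0
print(f"{co2:.0f} ppm / {temp:.1f} C / {hum:.0f}% -> {air_quality_label(co2)}")
```

Mapping the raw numbers to a label like this is what makes the sensor data useful on a glanceable smart home display.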
Then for the last part: with smart home, what most people think about is turning light bulbs and plugs on and off, and for smart home projects, or whenever you want to do projects on your own, devices that come in really, really handy are those Shelly bulbs and Shelly plugs, because they just come with a built-in web server and a REST API. You connect them through your Wi-Fi, they come with an app, super easy, and you have a REST API where you can just interact: turn it on, off, change the colors. It couldn't get much easier. So, all together, without the whole bunch of cables, it then looks like this. Now that we have the hardware part together, comes the next, interesting part: the embedded Flutter part. And as the talk earlier already pointed out, there's not just one Flutter to run on embedded devices. If you Google it, if you want to start out with it, you will find a few repositories all dealing with Flutter and embedded devices. We just saw one, in fact, in the last talk; it was using flutter-pi. So what's with that? Why are there different options? Is this not Flutter? Well, it is Flutter, but to understand this, we may have to, yeah, next slide, we may have to look at the Linux embedder that Flutter uses. The custom embedders connect the Flutter engine with the targeted platform, and the main difference we have with those custom embedders, which I have, let's see if this works, here on the right side, fancy, I wasn't prepared for that. So, the main thing you can see here is that something's missing. Flutter for Linux just heavily depends on GTK and GDK, in fact GTK 2, which is getting to be a pain right now for Flutter itself. So, what most of those, or what all of those libraries have in common: we don't really need those GTK parts that Flutter uses anyway on embedded hardware.
We don't have tabs, we normally don't have windows, we don't need all of that stuff, so they just get rid of it — which sadly isn't that easy in the, let's call it vanilla, Flutter embedder, but they get rid of it, so you can use Flutter on custom hardware without GTK and GDK. And that means you can use Flutter, for example, with Wayland, with a custom embedder, as the talk before already pointed out, which is not possible right now with the, let's call it, stock Flutter on embedded projects, especially if you want to go in a really industrial direction. But we're getting there. Also, a big part that's missing right now is tutorials and tooling. There's still not so much out there — just Google it, yeah, there's not much out there. But I'm sure we will get through this within this year, or at least maybe the next, and then Flutter will also definitely become available to startups, to smaller and medium-sized companies. There will be tools, software-as-a-service offerings around that, and Flutter will get more mature. We don't know it, but I guess that Flutter will get more mature in the embedded world in the next one to two years. But if we want to do a project right now, where we just want to try out how Flutter on embedded devices works — at least for this project, when we use a Raspberry Pi, we have Raspberry Pi OS, we can just use Flutter as it is, we can build for Linux there, and it will work just fine. The newest Raspberry Pi OS changed to, I think, using Wayland. I haven't tried it yet, but apparently it works alright. Flutter needs to do something about GTK 2 anyway, so maybe it will be possible with just normal Flutter to build something suitable for Wayland and direct rendering as well in the future. For right now, if you're doing a hobby project, if you just want to try something out with a Raspberry Pi, just go with Flutter as it is, it's fine.
If you want to use direct rendering, if you want to go with Wayland, if you want to get something to production grade, then you have to look at flutter-pi, Toyota's ivi-homescreen, or the one from Sony — whereas the Toyota one really is amazing and is moving forward at a really fast pace. So, enough of this generic talk about Flutter. What about the implementation for this project? I want to go through it in a few steps. The first part that we need for this project to work is connecting the Raspberry Pi to the touchscreen. What do we do for that? We use the Raspberry Pi Imager, install Raspberry Pi OS, and it just works out of the box, thanks to a lot of guys who are also here. That's really, really easy. Then we need to get Flutter running. For that, we wrote a tool — I just mentioned it, with Snapp Embedded we're doing open source projects around that. We basically built a tool, there's a repo with the link, called snapp_cli, which allows you to, from your host machine, set up a Raspberry Pi that's connected to the same network as you are. It'll connect over SSH, it will install Flutter, all the stuff you need, and it will set it up as a custom debug device so that you can just run the code and debug out of VS Code on Linux, Mac, or Windows, and the code will compile and everything will run in real time with hot reload working, with the Dart tools, on your Raspberry Pi. If you just want to develop on a Raspberry Pi, that's already really easy and straightforward. Even the Dart DevTools work; all of that is already there. Just, yeah, no cross-compilation, we don't want to go in that direction yet. The next part is rather uninteresting. Here you can see a little bit of Dart. That code won't run, I cut out everything that looked ugly. So that's just basically a GET request.
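Since the slide's Dart snippet was trimmed, here is a rough sketch, in Python for brevity, of what such a GET-based Shelly call can look like. The endpoint (`/light/0`) and parameter names (`turn`, `red`, `green`, `blue`) follow Shelly's Gen1-style HTTP API as commonly documented, and the IP address is made up — treat all of this as an illustration of the idea, not the talk's actual code.

```python
from urllib.parse import urlencode


def shelly_url(ip, channel, **params):
    """Build a Shelly Gen1-style control URL (endpoint names are assumptions)."""
    return f"http://{ip}/{channel}?{urlencode(params)}"


def set_bulb(ip, on=True, red=None, green=None, blue=None):
    """Turn a (hypothetical) Shelly bulb on/off and optionally set its color."""
    params = {"turn": "on" if on else "off"}
    for name, value in (("red", red), ("green", green), ("blue", blue)):
        if value is not None:
            params[name] = value
    url = shelly_url(ip, "light/0", **params)
    # urllib.request.urlopen(url) would fire the request on a real network
    return url
```

In the talk, the same pattern is done from Dart with an HTTP client; the point is just that each bulb or plug is one GET request away.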
You connect the bulb and the plugs with your Flutter or Dart application, run this function to get the bulb status, set the bulb status, or set the bulb color. The more interesting part, I guess, and what I wanted to point out — which will also explain how you would integrate a voice assistant with a Flutter application on the Raspberry Pi — is: how do we connect this sensor on the I2C bus with our Flutter application? We have different approaches that we could use here. We could do a Dart implementation of everything, talking directly to the I2C bus. We could go through the data sheets of the sensor and implement everything by ourselves, all the commands. We could run an MQTT broker on the Raspberry Pi, publish the sensor readings to the broker, and subscribe the Flutter application to it, because MQTT is one of those plugins that work with most of the custom embedders, so that really works out of the box. That would be a possible route to take. We could, of course — here I use Python — use a Python backend, just make another REST API on this device and talk to it locally; I think in a lot of embedded projects it's done that way. Or we use D-Bus. We have D-Bus running on Raspberry Pi OS, we have D-Bus running on most Linux systems, and we can just hook onto the session bus for this purpose. The plugins are also already there. And for this example, this is what we did, because for connecting a Flutter application with whatever other process is running on the machine, you can just use D-Bus. We can just use the Python example library that was already shipped with the sensor, of course. I mean, we don't want to do work twice. So we can connect whatever we want right now with packages and plugins that are already available. Resources, thank you very much. Two minutes.
How do you write an emulator anyway?
Thank you. Well, welcome to this session, and congratulations on waking up so early after yesterday evening. It's always hard on Sunday morning. And thank you to those who are watching online. So, who am I? My name is Anisse, as Mahmoud said. You can follow me on social media and find my blog here. I'm writing this Game Gear emulator called Gears. This is not the subject of this talk, but maybe I'll tell you a bit more about the Game Gear hardware so you can see how that helps writing an emulator. I'm not an emulation expert — I know there are a few here who are very well versed — but I'm hoping this gives another perspective. I also gave a presentation on the Z80, which was pre-recorded, in the emulator dev room two years ago. You can watch that talk. And yesterday one on WebAssembly, putting this emulator into the web browser, in the Rust dev room; you can also watch the recording when it's online. So this is a small demo, what you can see here. This is the emulator running in a native window. Yeah, nothing very specific. So first of all, I'll tell you why I'm giving you this presentation. But before that, has anyone here ever written an emulator before? Okay. Oh, that's quite interesting. Who here knows how to program, how to write code? Oh, nice. That's good, because that's not the goal of this talk. It's not to teach you how to code, right? You know how to program. With those skills, I'm hoping to give you a few pointers on how to start: where to find documentation, things like that. The goal of this talk is not to be exhaustive, otherwise it would be a full university course over a semester or something. And I also want to tell you why you should write an emulator — that's something that should come from you. And yes, the focus of this talk will be on simpler platforms, because it's always easier to start with something a bit simpler. Yeah. So what is an emulator, first? A few definitions.
It's something I struggled with a bit because they come in many shapes, but in general it's a software program that is used to run software from another computer or another platform. To give a few examples, here I show a few screenshots of existing emulators. You have a Game Boy emulator named SameBoy. You have another one, BGB. Some support weird devices like the printer. I show an emulator running on the Android platform for the BBC Micro. There's also the Android emulator itself — so you want to emulate the computer that runs the Android OS. And I also put something in here which might be debatable, which is the Analogue Pocket, an emulator using an FPGA. So you write software-defined hardware and use real cartridges to run software from other platforms. An emulator can have a huge spectrum of, let's say, accuracy and emulation — what does it emulate? Accuracy is how faithful you will be to the original. When you're emulating something, will it be running just one piece of software? If that's your goal, that's all right. You just emulate enough of the platform to run one game, rather than all the available software. Or maybe you want to do even more and be able to run any software from the target platform identically, as if it was running on real hardware. We call that clock accurate, but there are nuances even in this spectrum. Before we continue, I wanted to show you a crazy example of an emulator I found a few weeks, a few months ago. It's a Linux emulator: a RISC-V emulator running Linux, written in Scratch, the Scratch programming language. We can't really see anything here on the screen, so I'll describe it. You have a Linux terminal. It has already booted. I wrote some commands. And here I'm scrolling, and you can see the Scratch code of the RISC-V core. So yeah, emulators come in all shapes and colors. So you want to write an emulator. Let's go with the first level. What will it be? Starting.
How do you start? If you want to start, the first thing you have to do is to pick a target, and by target I mean the platform you want to emulate. You have to pick this target. You also have to pick a host platform, to start somewhere. Even if your goal is to write something that's portable and runs on everything, you have to, again, start somewhere. So you pick a host platform, and make sure you have a bit of time. Emulators are something where it's hard to decide when they're complete. You can always have more features, more things. You don't have to have a lot of time in a short period; maybe over a longer period it works as well. For example, I started my emulator two years ago. I've been working on it on and off, so it's not something that's taking a lot of time every day. That's what I mean. Where to start? Okay. Start simple, with the CPU. You pick one CPU instruction. You write some code that will be able to disassemble it, which means you will take the binary form of this one instruction — it will be a few bytes, one byte, I don't know, it depends on your platform. Can your code recognize this one instruction? It might seem trivial to you — it's just a few bytes — but that's how it starts, basically. So you start with this, and then you start adding stuff on top of it. You have your disassembler; it's very useful to debug. Then you add something, which is execution. So you have the CPU. How do you model its state? What's inside the CPU? Go look for more information. Build this state, change it — which is basically what executing an instruction does — and verify the state changed as you expected. So if you want to add something to a variable, you do an add operation in your language. Let's say it's a good starting point, and as you go, you keep learning new CPU concepts and how a CPU works. So yeah, this is helpful for starting.
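A minimal sketch of this "one instruction first" loop, in Python for illustration (the talk doesn't prescribe a language). The two opcodes are real Z80 encodings — 0x3C is INC A and 0x3E n is LD A, n — with their documented cycle counts; everything else about the CPU is deliberately left out.

```python
class CPU:
    def __init__(self):
        self.a = 0    # accumulator register
        self.pc = 0   # program counter


def disassemble(memory, pc):
    """Return (text, length) for the instruction at memory[pc]."""
    op = memory[pc]
    if op == 0x3C:
        return "INC A", 1
    if op == 0x3E:
        return f"LD A, {memory[pc + 1]:#04x}", 2
    return f"DB {op:#04x}", 1   # unknown byte: emit it as data


def step(cpu, memory):
    """Execute one instruction and return its Z80 cycle count."""
    op = memory[cpu.pc]
    if op == 0x3C:               # INC A
        cpu.a = (cpu.a + 1) & 0xFF
        cpu.pc += 1
        return 4
    if op == 0x3E:               # LD A, n
        cpu.a = memory[cpu.pc + 1]
        cpu.pc += 2
        return 7
    raise NotImplementedError(f"opcode {op:#04x}")
```

Running `LD A, 0x41` followed by `INC A` leaves the accumulator at 0x42 — exactly the "build the state, change it, verify it changed as expected" loop described above.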
So, a CPU is a processor. It's usually considered the heart of these consoles. Nowadays it might be the GPU, but as I said, this talk is focused on 8-bit platforms. As I told you, it has state. This state is basically what we call registers. It has other kinds of state too, but it starts with registers. I told you about instructions. An instruction is the minimum operation that a CPU can do. It has an assembly representation, a textual one — you have probably heard of the assembly programming language; that's how instructions are visualized for humans. And it has a binary version, an encoding. And this binary is, yeah, bytes you have to recognize. There are other concepts that are interesting: interrupts. These CPUs usually execute instructions sequentially, and they can also be interrupted. So when they get an event from the outside world, you can change the way they're executing code. Also interesting is how you access memory. I told you about state. Usually as a programmer, when you write code, you think about variables and things like that, and this hides what's on the hardware: state can be in registers or in memory. And the way a CPU accesses memory is also quite interesting. But the goal is not to teach you those concepts; it's to give you pointers on how to learn. So we've learned about how we start. Let's talk about how we structure an emulator. You've been writing a bit of CPU code — how do you structure the whole emulator? Because the CPU does not make a complete thing. I'm giving you here an example of an emulator structure. It's a schematic by Rodrigo Copetti, who has been doing very nice introductory documentation on various hardware platforms. Here I took the Master System one. You can see, as I told you, that the CPU is the central part — it's the square in the middle where it's written Zilog Z80. Then you have other devices that are interesting. I told you about the memory.
On this platform, you have two kinds of memory: there's ROM and there's RAM. You have IO control, which is how you plug in a joystick — or, at the time, it was more controllers. So this is plugged, on this platform, into an IO controller, and this is connected to the CPU. You have the game cartridges. Those are a specific type of memory, with things like paging in order to access more memory than the CPU can address. It also has a way to generate sound: a small device, a chip called the PSG. This device is from Texas Instruments and generates very simple sound. It has a video display processor, which would be the ancestor of today's GPUs. It's a bit specific here, but it has access to its own video RAM, which is a concept that you have to think through if you want to emulate this platform. And the video encoder is used for TV output. So it depends, again, on the platform, but this is nothing very special; many platforms of the time had very similar architectures. This is interesting because, as you structure your emulator code, you will probably want to follow this structure. You want to take those devices as a code boundary and organize your code in modules — whatever your language has: functions, objects, classes, namespaces, whatever is in your programming language. It's an interesting code boundary to know: okay, this device could be emulated like this; there's another device like that. Another trick I'd like to share: when you're writing an emulator, you don't have to think about optimization too much, but you're allowed to optimize a bit. For example, you're writing a CPU. It's a very simple thing, but you probably want it to be fast — it's something that will have to be very fast. You might want to, for example, not do allocations on the emulation path.
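One way the "devices as code boundaries" idea can be sketched: each block from the schematic becomes its own object, and a bus routes CPU accesses to the device owning that address range. The address map below (RAM at 0xC000) is illustrative only, not the real Master System memory map.

```python
class Rom:
    def __init__(self, data):
        self.data = data

    def read(self, addr):
        return self.data[addr]


class Ram:
    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, addr):
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value & 0xFF


class Bus:
    """Route CPU reads/writes to the device owning that address range."""

    def __init__(self, rom, ram, ram_base=0xC000):
        self.rom, self.ram, self.ram_base = rom, ram, ram_base

    def read(self, addr):
        if addr < self.ram_base:
            return self.rom.read(addr)
        return self.ram.read(addr - self.ram_base)

    def write(self, addr, value):
        if addr >= self.ram_base:
            self.ram.write(addr - self.ram_base, value)
        # writes below ram_base hit ROM: ignored here, or routed to a
        # cartridge mapper in a real system
```

The CPU then only ever talks to the bus, and each device module can be developed and tested on its own.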
If you know what memory allocation is, it's something that can be quite costly. It's very useful, but when you're emulating, it's not something you want to do on every other instruction, or every instruction. You might want to use jump tables. This one is debatable: depending on your language, it might be automatic. Quite a common piece of advice we see when advising people to write an emulator is that you should write a vertical slice. What does that mean? You have all those things — I told you about the CPU, there's the video display processor, the audio. If you go on to writing an emulator, you probably want to see results quickly. That means you will write support for a few CPU instructions and then a bit of display code, so that very quickly you'll have feedback and see the screen showing something, like the Nintendo logo on the Game Boy, or the Sega one, or whatever. You can do that. That's not what I did. Do what works best for you. For example, I gave a talk two years ago on the Z80. It was a pre-recorded talk, and at the time I had nothing else but a CPU. It depends on what you want to do. Do not hesitate if you have any questions — no, maybe we'll take them at the end because the talks are recorded, sorry about that. Another trick: I told you a bit about the disassembler before. You should disassemble and write the text versions, the assembly versions, of instructions. It will be very useful to have a debugger. You might want to build debugging tooling early to see what's happening inside your emulator, because you will have bugs. You will have emulation bugs. Build this tooling early. Or you can use already existing tooling. Here you have Emulicious. It's a great one. It's not open source, unfortunately, but you should definitely check it out. It's a multi-platform emulator. I think we have the developer here in the room. Definitely use Emulicious.
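The jump-table trick mentioned a moment ago can be sketched like this: an array of handler functions indexed by opcode, instead of a long if/elif chain. The two opcodes are real Z80 encodings (0x00 = NOP, 0x3D = DEC A); in C this table would literally be an array of function pointers, which is where the performance benefit comes from.

```python
class TinyCpu:
    def __init__(self):
        self.a = 0    # accumulator
        self.pc = 0   # program counter


def op_nop(cpu):
    cpu.pc += 1


def op_dec_a(cpu):
    cpu.a = (cpu.a - 1) & 0xFF   # 8-bit wraparound
    cpu.pc += 1


# 256-entry dispatch table, one slot per possible opcode byte
HANDLERS = [None] * 256
HANDLERS[0x00] = op_nop
HANDLERS[0x3D] = op_dec_a


def step(cpu, memory):
    handler = HANDLERS[memory[cpu.pc]]
    if handler is None:
        raise NotImplementedError(f"opcode {memory[cpu.pc]:#04x}")
    handler(cpu)
```

As the speaker says, in some languages a big switch/match compiles to a jump table automatically, so this is an optimization to apply with judgment.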
I can't tell you how many platforms it emulates because I don't remember, but it includes the Game Gear and the Master System. It has a great debugging toolkit. You can see assembly, you can see the video devices, you can see many things. Always make sure you have debugging in whatever form works for you, whether it's tracing or logging — it's nice, but be able to inspect the state of the emulated target machine. I told you about all of this, but here's something quite interesting: how does one find information, where to start? That's a very common question I've had. Where do you find documentation on emulation, on how the hardware works? Well, basically, you look online. There are many different communities. If you want to emulate a Game Boy, you probably want to go to GBDev. It has information on how to write Game Boy software, but also on how the hardware works. In fact, I like reading documentation aimed at developers for the platform instead of at emulator developers, because it tells you how you're supposed to develop for this platform, and it also means you'll be able to understand how to emulate it. There's also gekkio's Game Boy: Complete Technical Reference. It's considered a definitive guide on the Game Boy. If you want to emulate a Sega platform, you probably want to go to SMS Power. It's a community around the Sega Master System and other devices like the Game Gear, the Sega Mark I and Mark III, the SG-1000, and, most recently announced, the Sega AI, which was a computer from the 80s. I'm sorry I don't have a screenshot here, but that was very interesting — an AI computer that Sega released in 1986. I invite you to look for it online. So, SMS Power has documentation on how to develop software for the Sega Master System and the Game Gear, it has documentation on how the Z80 works, and it has many links to other documentation on video, audio, etc. Of the guides I used when writing my emulator, there are three main ones. The first is the hardware reference manual for the Sega Game Gear console.
Some people in the community took this developer manual that Sega wrote for game developers, scanned it, OCRed it, and made a great PDF version. I don't know who did that, but it was invaluable as a preservation effort, and I also used it for developing my emulator. A small caveat here: when you're describing stuff for developers, you might not go into the details of how the hardware works, and sometimes you'll have edge cases that won't be explained to developers but that you need to emulate properly if you want the emulation to be correct. So yeah, in general it was very useful. The CPU of the Master System and the Game Gear is fully documented by Zilog. It has very complete manuals — the company still exists. As opposed to, for example, the Game Boy, where all the documentation is unofficial, the Z80 CPU is well documented. And even then, there are tricks and things that are not documented in the official manual. That's kind of part of the talk I gave two years ago. You probably want to go read "The Undocumented Z80 Documented" and then afterwards watch my talk for the things that are not in that document. On finding documentation, a very simple trick: when you do research, use technical terms. It might seem trivial, but even I fell into this trap many times. You're looking for something — how do you get the more accurate result? An example: look up what exact chipsets your target platform is using. For example, for the audio, it's a chip from Texas Instruments, so instead of searching for how to do audio for X console or X computer, use more precise keywords. It gives better results. I'm showing you here, on the left, that I Googled for "Game Gear sound" and you get audio videos, YouTube videos and things like that, but nothing very specific; and on the right, you find almost only SMS Power, and that's it — basically the link I gave you.
So let's get a bit more into what that means practically — how do devices work? What I'm showing you here is an extract of the Z80 manual on how to do device IO. It's very complex and you don't need to understand it — it's basically electronics — but behind it, back then, using devices was quite simple. It would be almost as simple as writing to a memory address; that's how you interact with a device. On the Z80 CPU there were dedicated instructions to do that, but it was quite simple, as opposed to modern platforms where you have GPUs, memory mapping, DMA, whatever is in a modern platform. It used to be much simpler, and that's something you can use when writing an emulator. So, in practice, you want to write an emulator for a host platform. Make sure you understand your host platform first. You want to write an emulator for Windows? Make sure you understand how to display a pixel buffer on Windows. So: do you know how to open a window? Do you know how to, I don't know, allocate a memory area where you can write pixels? What is the pixel format? Can you display something, a small image — make an image, can you change it multiple times per second? Make sure you understand your host platform, and it's the same for audio. Say you want to start emulating sound. Make sure you know how to play audio on your platform. You have a buffer — can you generate, I don't know, a sine wave or a square wave to make a beep? Nothing about this is emulator-specific, but it's really something you have to do when you want to do interactive development, or game development more specifically. So, let's start with the graphics emulation, okay? This is something where you will need hardware understanding. You will need to understand how the VDP works, for example, on the Game Gear, or how the PPU works on the Game Boy. And so you will need to read the documents I pointed to earlier.
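The port-based device IO described above can be sketched as a dispatcher on the port number: the Z80's IN/OUT instructions move one byte between the accumulator and a numbered port, and the emulator just routes it. Using port 0x7F for the PSG matches the commonly documented SMS/Game Gear convention, but treat the exact numbers (and the 0xFF read-back for unmapped ports) as assumptions for illustration.

```python
class Psg:
    """Stand-in sound chip: just records the bytes written to it."""

    def __init__(self):
        self.writes = []

    def write(self, value):
        self.writes.append(value)


class IoPorts:
    def __init__(self, psg):
        self.psg = psg

    def out(self, port, value):
        """Models the Z80 OUT instruction: one byte to a numbered port."""
        if port == 0x7F:
            self.psg.write(value)
        # other ports (VDP, controllers, ...) would be dispatched here

    def inp(self, port):
        """Models the Z80 IN instruction."""
        return 0xFF   # unmapped ports often read back as 0xFF
```

The CPU core then calls `out()`/`inp()` when it decodes an OUT or IN opcode, keeping the devices behind the same kind of boundary as the memory bus.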
I'm giving you an example here for the VDP, a few concepts that are interesting. When you want to display pixels, you need to understand how developers were interacting with the device: how, for example, they accessed the video RAM, how they used the registers of the VDP. Conceptually, I told you, it's very simple. You use specific instructions that basically send bytes one by one from the CPU to the VDP. That's how you send commands to it. So you write to its registers, you write to VRAM — that's IO. Internally, it has a display area. Here, this is an extract of the Game Gear documentation where the LCD display area, the small part of the screen, is part of a bigger buffer, and then it's like a viewport that can scroll over it. It has infinite scrolling: the top and the bottom are connected, the left and the right are connected, so it's like a torus, mathematically — the donut shape. Other interesting VDP concepts: you have the sprites. I told you about the background; on the background, you display sprites. Sprites are often used for game characters. And they're very interesting because the VDP was basically a sprite accelerator: at the time, if you wanted to display things very fast, it was not simple, and the VDP helped with that. The sprites also helped to do collision detection and things like that. But you will need to understand how color encoding works, how sprite pixels are encoded, because it's not really a simple square buffer or whatever — everything has a specific encoding. It's well documented. Here, what I'm showing you is a tile map. It's a dump of the video RAM of the Sonic 1 Game Gear game. This tile map has sprites on the bottom and background on the top. It's not exactly the same as the displayed screen, but it shows how things are represented in memory and how they can then be mapped to the LCD display.
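The torus-style scrolling described above boils down to modulo arithmetic: the visible screen is a viewport into a larger background buffer, and both axes wrap around. A minimal sketch (the buffer dimensions are illustrative, not the real VDP tilemap size):

```python
BG_W, BG_H = 256, 224   # background buffer size (illustrative only)


def visible_pixel(background, scroll_x, scroll_y, x, y):
    """Fetch the background pixel shown at screen position (x, y).

    Scrolling past either edge wraps to the opposite edge, which is
    exactly the "top connects to bottom, left connects to right" torus
    behavior of the hardware.
    """
    src_x = (x + scroll_x) % BG_W   # wrap left/right
    src_y = (y + scroll_y) % BG_H   # wrap top/bottom
    return background[src_y][src_x]
```

A renderer would call this (or, more realistically, the tile-level equivalent) for every visible position each frame, with `scroll_x`/`scroll_y` coming from the VDP's scroll registers.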
I won't go into details about that, but you probably want to have a synchronization strategy between your CPU and your devices. If you want to synchronize the VDP, for example, it's easier to do it line by line: you emulate a given number of instructions and then you emulate one line of the VDP. This allows keeping the emulator single-threaded, because it's easier to think that way. And it's a viable strategy, one that can give accurate-enough emulation. Sound emulation. Sound emulation is quite interesting. Again, it needs hardware understanding, so you will need to read the documentation. I'll give you an example with the PSG. You write to registers — it has fewer registers, it's much simpler, it's a device that's conceptually quite simple. It has four channels: three are tone generators — basically, they generate beeps at a given frequency — and one is a noise generator, and it generates noise, basically. So you have multiple things that are interesting. The tones are shown here on the top right. They are square waves, at least in theory, because when you interact with hardware, life is analog, and it's not perfectly square, so it might look a bit more like the wave on the bottom, just below it. And what I'm showing here is the noise generator. It's a very simple hardware device called a Linear Feedback Shift Register, or LFSR. It's used to generate noise by basically shifting a set of bits, right or left — well, it's the same, but here it's right. You start with one bit set and you shift, and then you output the bit that's on the right. But if you were to do that without feedback, it would just shift the one out, and then you'd just output zeros, and it's done. Except this one has an XOR function: it takes two bits, XORs them, and puts the result back as input. And with this feedback, you're able to generate random-looking noise.
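The LFSR just described can be sketched in a few lines. The 16-bit width and the tapped bits 0 and 3 follow the commonly documented SMS/Game Gear PSG "white noise" configuration — verify against the SMS Power docs before relying on those exact taps.

```python
def lfsr_noise(samples, state=0x8000):
    """Generate `samples` output bits from a 16-bit XOR-feedback LFSR."""
    out = []
    for _ in range(samples):
        out.append(state & 1)                        # output is bit 0
        feedback = (state & 1) ^ ((state >> 3) & 1)  # XOR of bits 0 and 3
        state = (state >> 1) | (feedback << 15)      # shift right, feed back
    return out
```

Without the XOR feedback the single seed bit would simply shift out and the output would go silent; with it, the register cycles through a long pseudo-random sequence of 0s and 1s — the "random-looking noise" the speaker mentions.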
It's not perfectly random, it's not cryptographically random, but it's good enough, and that's how it used to work. For sound emulation, again, you have to start simple. You want to generate a square wave, as I told you — it's a very good hello world for your platform, for sound. But then you will need to add more things. On the PSG — it varies on other platforms, but it's like this on the Master System and the Mega Drive, as on the Game Gear — you need to think about the tone channels as having counters, not frequencies. You need to think in terms of period and not frequency. It's almost the same, except that when you're emulating, you will have edge cases that won't work well if you think in terms of frequency. A quick piece of advice: there are multiple ways to synchronize audio emulation with the CPU. My advice would be to use CPU cycles. So when you're emulating instructions, you will need to count the cycles. Depending on the platform, one instruction can take a varying number of cycles — from four cycles up to around 20 on the Z80. You will need to count them accurately enough so that when you're playing audio, it won't be distorted. And in general, it's useful to count cycles properly, even for display. I wanted to give you an example about playing samples, very quickly. They also use a square wave, but with amplitude variations. So they play a wave that's always up — if you were to play it as-is, it would be silent — but they make the volume vary, and they do it very, very fast, like 7,000 times per second. That generates an audio signal, and that's how you get samples. Samples are when you hear, for example, the "SEGA!" voice — something like playing an audio file today. This platform did not support playing an arbitrary audio file, so developers had to get creative. Testing: how does one test an emulator? There are various strategies for that.
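Before the testing part: the "counters, not frequencies" advice above can be sketched as a PSG-style tone channel that reloads a countdown from its period register and flips the output each time the counter hits zero, clocked by (divided) CPU cycles. The tick granularity here is simplified; the real chip divides its input clock first.

```python
class ToneChannel:
    def __init__(self, period):
        self.period = period    # the value a game writes to the register
        self.counter = period   # countdown toward the next output flip
        self.output = 1         # current square-wave level (0 or 1)

    def clock(self, cycles):
        """Advance the channel by `cycles` ticks, collecting one sample per tick."""
        samples = []
        for _ in range(cycles):
            self.counter -= 1
            if self.counter <= 0:
                self.counter = self.period
                self.output ^= 1   # toggle: two toggles make one full wave period
            samples.append(self.output)
        return samples
```

Because the channel is driven by counted ticks rather than a computed frequency, edge cases like a game changing the period register mid-note fall out naturally — which is exactly why the period-based mental model is recommended.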
For the Z80 CPU, for example, there are unit tests you can reuse from other emulators. The Fuse test suite, for example, has very good unit tests that are not dependent on the Fuse emulator. You can also use integration tests, for example ZEXALL and ZEXDOC — Z80 exerciser programs that were written for the ZX Spectrum, a computer from the 80s. They generate lots of instructions, execute them, then dump the CPU state and make a very small checksum of it. They were run on the real hardware, and the checksum for each instruction group was recorded. These Z80 tests are a bit long — they can take a few seconds up to minutes; on real hardware it was much longer, of course — and they are very useful. Even if you're targeting another platform, you can reuse these CPU tests very simply by doing a few byte modifications so they work on your platform. How to test audio? Well, this one I was not sure about, because I don't really know how to test audio emulation well. Listen to the music: does it sound like the original? You need a good ear for that. You can use fast Fourier transforms as well, which are mathematical operations used to analyze an audio signal. For example, you generate a square wave through the emulation path: does it have the correct frequencies? And then, can you hear the samples? I told you about playing samples; these are, I'd say, the hardest part of audio emulation because they depend on many accuracy details. So yeah, can you hear them? Other examples here, for the Game Gear: this is DGTest, an SMS test suite — software developed by emulator developers for the platform that tests various features, here for the Game Gear and the Sega Master System. For the Game Boy, you probably want to look at the dmg-acid2 test, for example. The Game Boy is a platform that's very well emulated; it's a good choice to start. It has many tests.
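The exerciser-style checksum idea described above — run an operation over many input states, fold the results into one small digest, and compare against a digest recorded from real hardware or a trusted emulator — can be sketched like this. The "operation under test" and the use of SHA-256 are illustrative choices, not what ZEXALL itself does (it uses a CRC).

```python
import hashlib


def inc_a(a):
    """The operation under test: 8-bit increment with wraparound."""
    return (a + 1) & 0xFF


def state_checksum(op):
    """Hash the operation's output over every possible 8-bit input state."""
    h = hashlib.sha256()
    for a in range(256):              # exhaustive sweep of input states
        h.update(bytes([op(a)]))      # fold each resulting state into the hash
    return h.hexdigest()
```

The win is that a single stored digest covers thousands of cases: any off-by-one or missing flag in the implementation produces a different digest than the recorded reference.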
Blargg's test suite, the Mooneye test suite: many accuracy tests. Yeah, you should look into them.

Another testing strategy is frame generation. You're emulating stuff, you're generating a display, pixel buffers. You can very easily dump your buffers into an image and compare this image with a good emulator. And you can also compare it with real hardware; for example, if you use flashcarts and you don't have all the original games, that can be useful. In general, I would say test a lot of different software and look at how it works. For example, here you can see on the left side my test directory for a few games, which I'm basically using as a regression test suite: does it still work? Some images have a story, like bugs I had to fix; when it finally worked, I recorded that to make sure it kept working. On the right, what you have is SameBoy's automatic frame generation. It's a capture of a very small part of a web page where they test all the Game Boy and Game Boy Color games and make screenshots. It's very interesting. Other communities who are interested in testing are the speedrun communities; I'll let you look into that. They also do frame testing, but they record the frames on real hardware.

So, a summary of everything we said here. It was a bit fast, I'm sorry. Pick platforms: a host platform and a target platform, something you want to emulate. First of all, always do something simple first and then make it grow. Read a lot of documents; that's an essential part of emulator development, a lot of documentation. Testing, because depending on how accurate you want to be, you probably want to test your software properly. And if you ever go and write an emulator, don't forget to write blog posts about it, so people know about it. And come here to FOSDEM, to the emulator dev room, and give a talk. Please. Thank you. Any questions?

Testing, testing. Shall I do the question round? We have a bunch of questions.
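As an aside, the frame-comparison regression testing described above can be sketched very simply: render a frame into a pixel buffer, hash it, and compare the hash with one recorded from a known-good run. Everything here (the renderer, the screen size) is a stand-in for illustration.

```python
# Minimal sketch of regression testing by frame comparison.
import hashlib

WIDTH, HEIGHT = 8, 8  # a tiny "screen" for the example

def render_test_frame():
    """Stand-in renderer: a checkerboard pattern as a flat byte buffer."""
    return bytes(((x + y) % 2) * 255 for y in range(HEIGHT) for x in range(WIDTH))

def frame_hash(buffer):
    return hashlib.sha256(buffer).hexdigest()

reference = frame_hash(render_test_frame())   # recorded once from a good run
current = frame_hash(render_test_frame())     # produced by the build under test
assert current == reference, "frame regression detected"
```

In practice you would store the reference hashes (or the reference images themselves, for nicer diffs) next to the test ROMs, exactly like the test directory shown on the slide.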
So I'm just going to run around and... Thanks for the talk, it was really good. Two small questions. Approximately how long did you spend on your first emulator? Was it like a few weeks, a few months before you got something running? And do you have any recommendations from your experience? I know you did some stuff in Rust. Did you use other languages? How was your experience with Rust? Is it good for emulators? Does it make things harder?

Rust is very good, but I didn't use it for this. To be honest, it's a hobby project, so I didn't measure how much time it took to develop everything. You asked how long it took before I had real feedback. Part of my strategy was different from what I usually recommend: I developed the CPU first; I gave a talk about it. The feedback there was: do the test suites pass? I used different tests, and do they pass? That's how you get early results without having a complete emulator.

About the programming language: I intentionally did not go into details in this talk, because I want people to be able to write in whatever language they feel comfortable with. Rust is great. Go is great. Use whatever language you want. Especially if you're emulating an 8-bit platform, you don't really need to care too much about performance. You should be able to get very good results with whatever language you use. Unless maybe it's... I don't even have a good example. So yeah.

Next question. Thanks for the talk. Regarding the audio emulation, would it be possible to just record the waveforms and compare them? It can work, but... I didn't want to go too much into details on that. An example: on the Game Gear and Master System, the sound chip is generating audio at about 115 kHz. On modern platforms, you will run at 44.1 kHz or 48 kHz. You can have more on most laptops, but it's not what's usually used. So what you will need is a down-sampling strategy.
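A naive version of that down-sampling strategy is block averaging: collapse each group of chip-rate samples into one host-rate sample. This sketch assumes an integer decimation factor for simplicity; real rate ratios are usually non-integer and real emulators use proper resampling filters to avoid aliasing, so treat this purely as an illustration.

```python
# Naive down-sampling by block averaging (illustrative only).

def downsample(samples, factor):
    """Average each block of `factor` chip-rate samples into one output sample."""
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]

# A chip-rate square wave whose full period fits in one block averages to zero:
result = downsample([1.0, 1.0, -1.0, -1.0] * 10, 4)
```

This also shows where the artifacts the answer mentions come from: any detail faster than the host rate is smeared or lost by the averaging.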
So you will need to take the samples and down-sample them to your host platform's sample rate, and this will generate artifacts. Yeah, okay. Good. Thanks.

Next question, regarding the audio stuff. Do you know if there's any ongoing work to emulate the original sound of the Game Boy or Game Gear? Because they have built-in speakers which will compress the sound, and it will have a specific sound which you can't hear these days. Do you know if there's anything like that, like you could record the compression and all?

I think there is. I found a website, I don't remember the name, on audio emulation specifically. They were developing automatic filters to match the platforms as closely as possible using machine learning, with the target of putting them back into DSPs or FPGAs. But I can't remember the name, I'm sorry. That's something I'm very interested in, especially because audio emulation is not that simple. You need filtering. If you want to get closer to the Game Gear, for example, which has speakers, you will probably need filtering, because it has a frequency response which will not be the same as your modern speakers' frequency response. So yeah, I'm sorry.

More questions? Yeah, I almost forgot about you, I'm sorry. So you mentioned tooling and debugging. What? Sorry. Tooling and debugging of the emulator. You said there were two options: you can write your own tooling and debugging so that you can inspect the state of your emulator, but you also said you can use external existing debuggers. How do they help? How do they help for your emulator specifically?

They will help you understand if you have a bug with a given game. They will help you with multiple things. They will help you understand how the game works, so you have a better view of how the software is working. And they'll help you understand what you're doing wrong.
So you're emulating the game, you have your own logging, your own debug tooling, and they'll help you understand what you're doing wrong in your emulation. So it's more of a comparison type of thing.

Alright, we have time for a short question. And also, can you put your contacts on your first slide, I think, so people can find you after the talk? Oh yeah, I think that's a good idea. It was a short question, right? Or turn it into a short question. Are there any worthwhile platforms left to emulate that are also approachable?

I would say yeah. I gave the example of the Game Boy; it was very specific because it has so much documentation. So yeah, it's a good platform to start. Is there anything left? It's okay if a platform already has a lot of emulators. If you want a platform that no one has written an emulator for, it will be harder, because you have to discover all this information by yourself. So if you want an easy thing to start with, it's not the same as exploring new stuff and reverse engineering and things like that.

Well, thank you very much. Thank you, Anis.
Breathing Life into Legacy: An Open-Source Emulator of Legacy Apple Devices
So, we're going to start. Martijn here is going to tell us some stuff about Apple. And I have to confess, I'm very anti-Apple, so I wanted to actually refuse this talk. So, Martijn, take it away.

Thank you very much. So good morning, everyone. Thank you for providing me with the opportunity to speak here. My name is Martijn de Vos, and today I will present to you my hobby project, which involves an open-source emulator of legacy Apple devices. In this talk, I will explain how I managed to emulate these kinds of devices, what it takes, what the challenges are, and what the next steps are.

So let me first briefly introduce myself. I'm a postdoctoral researcher at EPFL in Switzerland. My main research topic is not actually emulation or reverse engineering, but distributed machine learning systems, like many people are working on nowadays, like LLMs and stuff. But I'm also a very big enthusiast of reverse engineering, and I actually started doing this during my master's thesis already. During my PhD, I worked on reverse engineering some mobile banking applications in the Netherlands and other countries as well, and that resulted in the first paper of my PhD.

And, yeah, two years ago, I decided to pick up this project. I was inspired by reading a blog post by someone who managed to emulate an iPhone in software, and that's how I was motivated to work on this project. This was actually Jonathan Afek. I think he was one of the first who managed to boot iOS, the operating system of iPhones and other Apple devices, with QEMU, which is a very popular open-source emulator. He managed to boot it to an interactive Bash shell. So he managed to boot this emulator to userland, which is quite an achievement. And I thought, well, I want to learn how that works. It involves some reverse engineering, which is a thing I really like.
I like seeing how software works, trying to decipher some of the secrets in the software. And it would also contribute eventually to long-term hardware preservation, because when people run it, it has some feeling of nostalgia. And, well, my first Apple device was an iPod Touch, and I decided to work on that.

So after reading the blog post, I was a bit puzzled. I was like, OK, where do I start? How can I set up my own project to work on this kind of stuff? You know, Apple has released many different devices over time, and the first question I had to answer is: which device am I going to emulate? If you think about contemporary devices, they are incredibly hard to emulate; at least, emulating all the aspects of these devices is a very, very challenging and difficult task. They contain neural engines. They have Face ID and Touch ID, which also have interactions with secure enclaves, but also software-based security measures like trust caches, which is a mechanism by Apple that only allows particular applications to have privileges.

So I was thinking, if I go back in time and take one of the first devices by Apple, at least in the iPod Touch family, that should be somewhat easy to emulate. It is a device that was released in 2007, and it doesn't contain the complicated hardware peripherals that I just mentioned. And, yeah, hopefully that would be simple enough to emulate. Well, those were some famous last words, because even these devices are very, very complicated, as I will outline a bit later in this talk as well.

I'm definitely not the first one to work on this kind of emulation; there are some related projects. I think the earliest attempt at emulating the SoC of an iPhone was by cmw, who actually is the founder of Corellium, which you might know as a company that provides virtualization services for both iPhone and Android applications.
Then we had the blog post that I just mentioned, which demonstrated the emulation of an iPhone 6S Plus. That work was picked up by someone else and eventually evolved into an iPhone 11 emulator. And there's also openiBoot, which is an open-source bootloader for early-generation Apple devices. All of these projects have been extremely helpful in understanding and connecting all the different pieces together, because without them I wouldn't have been able to get this far.

Then I had to pick a framework for emulation. QEMU is one of the most popular open-source frameworks for this kind of emulation. It provides support for hardware emulation: you can define your peripherals, your hardware components, and you can implement their expected behavior. And it already comes pre-shipped with support for many different protocols, like the USB protocol, network interfaces, SPI, I2C, SDIO, etc. So that was all very nice, but unfortunately it has a very, very steep learning curve. It's quite difficult to wrap your head around particular parts of the project, so most of the time I had to rely on existing emulations provided by QEMU to see how things work.

When doing emulation, you also would like to have a way of debugging your software, because you want to see which code path is being followed, what the register values are, and what's generally happening in the system. The nice thing about QEMU is that it automatically provides a GDB stub, a GDB server, that I can directly connect to; then I can step through the code, I can jump to functions, and I can inspect all the register values. And for the reverse engineering part, I've been using Ghidra, if I pronounce that correctly. It is a very popular open-source tool for reverse engineering, decompilation, and disassembly of binaries, and this has also been tremendously helpful.
Here on the right you can see, for example, some code related to the start procedure of the SPI controller, which controls the SPI interface. If you look at it, it's actually pretty readable. You can do a lot with this stuff, and the way Apple has engineered their software is very predictable. They're using the IOKit framework, which is very similar in structure everywhere. I mean, most of the peripherals look like this: you initialize some memory, you set some variables, and that's mostly it.

So now let's talk a bit more about the emulation itself. My philosophy when it comes to emulation is that I wanted to stay very close to the actual hardware, to what's actually happening on the hardware, no matter how difficult that might be. What I noticed is that many existing emulators cut corners, which is not surprising, right? Because, for example, if you run into some kind of signature check, it might take a lot of time to get everything working, to get the right functionality, and to make sure that the check passes. So one way is, for example, to just patch out that particular procedure or function call.

Why did I want to do it this way? Because I had a feeling that any hack, any workaround I would do in the very early stages of working on this emulator would bite me back later. So I'd rather do it right very early in the boot process, where things might not be as involved as when dealing with a higher level, like userland or applications. So I tried to get it right the first try. But, as expected, I still ended up with a bunch of hacks, patches, workarounds, and patched-out binaries, because some things I really, really couldn't wrap my head around, at least not within a reasonable amount of time.

Another philosophy that I had: I started by following the boot chain. So I started with the lowest-level component in here, which is the SecureROM, the boot ROM. This is the very first piece of code that runs on an Apple device.
It is actually fused into the chip of the device, so if you find a vulnerability in there, it's very nice, because it cannot be patched out. That's actually something that happened a few years ago. The SecureROM loads another bootloader, called the low-level bootloader, LLB. That in turn loads the main bootloader, iBoot. Then iBoot loads the XNU kernel. When the kernel has launched, it starts the launchd process, which is the very first process that runs on the system. launchd starts SpringBoard, which is responsible for drawing the iconic user interface with the app icons and the home screen. SpringBoard in turn starts all the different apps, like Alarms, Safari, and the other applications that you are familiar with.

So I started working on the boot ROM first. As a very first step, I had to get the boot ROM, which is fortunately provided online; it was dumped. So that's very nice. The main responsibility of the boot ROM is not only to load the next bootloader, the low-level bootloader, but also to initialize some key peripherals, like the clock, the timer, and the USB stack. Because even if everything else on the device fails, the boot ROM allows you to restore the device using a USB protocol. So if something goes wrong, you can use DFU mode to restore, to refresh your device.

Now, I had some instructions running there, but I very quickly found out, when emulating this boot ROM binary, that it jumps to some unknown memory locations. That was a bit problematic, because I didn't really know where it jumped to. I looked a bit on the internet and I asked around, and it looks like this first-generation device is using some proprietary logic by Samsung. Very early generations of Apple devices were made in collaboration with Samsung, so the boot ROM was also made by Samsung.
And I didn't really have any idea of what happens there, because the boot ROM is very obfuscated and very small, and there are almost no strings and no context to work with. I also didn't have any physical iPhone or iPod Touch device at that time, so I couldn't really figure out or dump that part of memory. The same actually goes for the low-level bootloader: I was running into the same problem there. It jumped to some unknown memory locations, so I decided to skip these two parts and go straight to iBoot.

Yes, and this is how I load iBoot in code. iBoot is the main bootloader. It is responsible for loading the kernel from, basically, the hard disk. I was very fortunate that the source code of iBoot got leaked in 2018. That was actually a newer version of iBoot, but at least it gave me some idea of how this all works. So I tried really hard to map all the different components in the leaked source code to what I see in Ghidra in the binaries. And I managed to boot iBoot and get all the peripherals up and running that iBoot expects.

One thing to mention here is the device tree, which you might be familiar with if you work with low-level Linux. It is basically a big dictionary of all the peripherals and their properties. It is included in the IPSW file, which is the firmware file that you can download from Apple and that gets installed. It is populated by iBoot: iBoot, for example, gets the MAC address of the Wi-Fi driver and then injects this number into the device tree. Here on the right, you can see a part of the device tree containing some information about the crypto AES engine; it contains some identifiers and some other things. That was also dumped, so I also used it as a reference to get an idea of which peripherals there are to emulate.

And I can tell you that these devices are extremely complicated. This is a diagram that I made of all the components that I managed to get up and running.
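The device tree idea just described can be modeled as a nested dictionary of peripherals and properties that the bootloader fills in before handing it to the kernel. All node names, property names, and the MAC address below are made up for the example; Apple's real device tree is a binary format with its own layout.

```python
# Illustrative model of a device tree: nested dicts of nodes and properties,
# with the bootloader injecting values (e.g. a MAC address) before boot.

device_tree = {
    "arm-io": {
        "wlan": {"local-mac-address": None},   # to be filled in by the bootloader
        "aes":  {"compatible": "crypto-aes"},
    },
}

def inject_property(tree, node_path, prop, value):
    """Walk a slash-separated path to a node and set one of its properties."""
    node = tree
    for part in node_path.split("/"):
        node = node[part]
    node[prop] = value

# The bootloader injects a (made-up) MAC address it read from the hardware:
inject_property(device_tree, "arm-io/wlan", "local-mac-address",
                bytes.fromhex("0016cb000001"))
```

The kernel then only ever reads the populated tree, which is why getting iBoot's population step right matters so much for the later boot stages.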
Not all of them are fully functional, but most of them at least have some functionality. And this is for the iPod Touch 2G, which is slightly more complicated than the first-generation iPod Touch.

Most of these peripherals you can talk to through something called memory-mapped I/O. In the memory map, there is a small part that is allocated to a particular peripheral. Here on the right, you can see the addresses of all these peripherals, which I also mostly got from the device tree. You can write to these memory locations to talk to your hardware devices. The main challenge then becomes, of course, talking with these hardware devices. And you have to do that in such a way that you get the expected responses, and that the kernel and the other parts of the boot stage are happy with what these peripherals are saying.

This is an example of how you can initialize the hardware components in QEMU. You define some initialization methods, and then you include them in some main file. I won't spend too much time on this now. This is how you implement the functionality of each hardware component: you create a read method and a write method. The read method is called when a hardware address associated with the peripheral is read, and the write function is called when you write to a register. You can see, for example, in the read method that you have a switch: you look at which address you are reading from, and then you return the right response. Sometimes that can be very arbitrary. I mean, I haven't deciphered the meanings of all the registers and what they expect, but you can at least make a best-effort attempt at returning the values that make the kernel happy.

And this can become complicated very quickly. Here you can see a part of the SPI controller, which was a particularly difficult component, because Apple does some, well, weird things sometimes.
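The read/write dispatch pattern described above can be sketched in Python (QEMU itself implements this in C via MemoryRegionOps callbacks). The controller name, register offsets, and the always-ready status value here are all invented; the point is only the shape: dispatch on the offset within the peripheral's memory-mapped window and return whatever keeps the guest happy.

```python
# Illustrative memory-mapped I/O peripheral: handlers dispatch on the
# register offset within the peripheral's address window.

class FakeSPIController:
    STATUS = 0x08        # hypothetical status register offset
    TX_DATA = 0x10       # hypothetical transmit data register offset

    def __init__(self):
        self.last_written = 0

    def read(self, offset):
        if offset == self.STATUS:
            return 0x1    # pretend we are always "ready", to keep the kernel happy
        if offset == self.TX_DATA:
            return self.last_written
        return 0          # unknown registers read as zero (best-effort)

    def write(self, offset, value):
        if offset == self.TX_DATA:
            self.last_written = value

spi = FakeSPIController()
spi.write(FakeSPIController.TX_DATA, 0xAB)
```

The "always ready" status is exactly the kind of arbitrary best-effort answer the talk mentions: you often don't know every bit's meaning, only what the kernel needs to see to continue booting.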
They make some modifications to their hardware which don't always follow well-established hardware protocols, so to say. And finally, you attach the peripheral to the overall machine in QEMU, and optionally you can connect the IRQ, the interrupt request; interrupts are also functional there. Again, I won't spend too much time on this now.

So after iBoot was running, I had to load the kernel. The kernel uses IOKit, and it starts all the device drivers that are declared in the device tree. So whereas the low-level bootloader and iBoot would only load the most important peripherals, this starts all the peripherals. Here on the right, you can see some of the peripherals that I reverse engineered with Ghidra. You can see the LCD display, the power management unit, and some other functionality that I didn't even know was part of the iPod Touch itself. This mostly follows a very similar protocol: when you start a peripheral, you usually execute some reset procedure, or you wait for an interrupt or something to indicate that the device is ready.

After all these devices are loaded, then you start launchd. This is the part I spent most of my time on, because I had to get past all these peripherals; I had to understand how they work. And the further you get into the boot chain, the more complicated things become, because then you are really building on the correct functionality of, say, the clock and the timer and interrupt requests, et cetera. So, roughly 20 peripherals later, I got most of the things functional. The clock, the timer, the interrupt controllers are all fully functional. I'm pretty sure there are a few bugs left, but nothing too major. And there is only partial support for some of the more involved peripherals, just enough to make it past initialization.
Then we're talking about peripherals like TV out, which is used if you connect your iPod Touch to a TV, the GPU, and also the accelerometer and the light sensor; they're not really important at this point. I was very fortunate that I could avoid hardware GPU rendering with a flag. So the GPU rendering in this emulator happens fully in software, which is slower, but still reasonable enough to use the emulated iPod Touch. So there's a lot of work to do, but at least at this point I managed to boot to userland.

To give you one more interesting challenge: the persistence layer. The iPod Touch contains two types of memory. There is the NOR memory, which contains small binaries; I think it's at most a few megabytes. And you also have the NAND memory, which is like eight gigabytes, and you can store all your applications and the operating system in there. There are some key differences between the layouts of NOR and NAND, so I had to spend a lot of time on this when I emulated the iPod Touch 2G to make sure that also works. The main problem here is that when the kernel requests some block, let's say block five, it uses logical block addressing, and that doesn't match how the NAND layout underneath works. So I had to really figure out how something is mapped from the logical block level to the physical block level. That took a lot of time. I ended up with some scripts in a separate repository that take a DMG file and convert it to a raw file system, a file system as it really is in the hardware.

This is the diagram for that, to give you some more context. This is for the NAND. We have the file system, which is implemented in the kernel, and if it wants to get something from the operating system, it uses a logical block address. That goes through two different layers, the flash translation layer and the virtual flash layer, each with their own numbering and addressing and mappings.
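As a toy model of that chain, the lookup goes: logical block, through the flash translation layer (FTL) to a virtual block, then through the virtual flash layer (VFL) to a physical page on a particular chip-enable. The tables below are invented single-entry stand-ins; the real layouts involve spare areas, remapping of bad blocks, and much messier bookkeeping.

```python
# Toy model of the two NAND translation layers: FTL then VFL.

FTL_MAP = {5: 12}            # logical block -> virtual block (made-up entry)
VFL_MAP = {12: (3, 7)}       # virtual block -> (CE number, physical page)

def logical_to_physical(lba):
    """Resolve a logical block address down to (chip-enable, physical page)."""
    virtual = FTL_MAP[lba]
    ce, page = VFL_MAP[virtual]
    return ce, page

ce, page = logical_to_physical(5)
```

Reconstructing exactly these tables from a DMG image is what the conversion scripts mentioned above have to do, which is why this part took so long.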
And that eventually results in some physical page number and a CE, which is basically like a bank, a number between one and eight.

In the interest of time I'm going to skip this, but I just want to say that multi-touch, even though it looks very simple (how hard can it be to convert a touch on a screen to an X and Y coordinate?), was very, very complicated to get right. And for this I actually needed a real device. Most of the things I could do without having an actual device, but for this I needed a real one, because I had to play with touches and see how the encoding of a touch works. Here on the right you can see me playing around: you press a button, and I recorded what the multi-touch driver gives back.

So, all in all, when doing all of this, I managed to boot the iPod Touch 1G to the home screen. As you can see, it's a pretty basic home screen, not many applications. I think I got this running about one and a half years ago, and a few months ago I managed to get the iPod Touch 2G working as well, running iOS 2.1.1; the iPod Touch 1G is running iPhone OS 1.0.

And that mostly concludes my presentation. I open-sourced all the code and created a GitHub project out of it, which is a fork of the QEMU project. I'm not sure if I want to upstream it, because it has a lot of ugly code and a lot of, well, workarounds. But contributions are very welcome. It currently has support for the iPod Touch 1G and 2G, and I'm currently focusing on getting the iPod Touch 2G stable, so I can get the App Store and third-party applications up and running. So that's all, thank you. And if you want to know more, I have some blog posts with more technical details on my personal website.

(Applause)

Right, hello. Yeah, so we have some time for questions. I hope the people asking questions are here in the front, because I don't want to run to the back.
But I'm going to start with a question, because you mentioned Corellium, which is awesome by the way; they are very expensive, but they are awesome. But Apple sued them into oblivion. So the question is: has Apple made any friendly inquiries? No, no, no. I think this project is still too insignificant for Apple to care about. I also know about Rockbox, for example, which targets the iPod generations. I don't think they've been sued. But I'm not that worried about it right now. OK, excellent. Questions? Sorry, come to the side.

Hi, thank you very much for your talk. Only one simple question: why did you choose the iPod Touch platform? Is it only a simpler problem, or are there patents or other problems in the way? Thank you very much.

Yes, thank you. So the question is, why did I choose the iPod Touch and not the iPhone? Well, when I started this project, I was not familiar with the architecture of either. But I was thinking, well, the iPod Touch contains at least one peripheral fewer, namely the baseband, the modem baseband, and I was not sure how critical that would be for the entire booting procedure. So that was, I think, my main motivation. But most of this work can also be applied to the iPhone. I think with some changes you can get the iPhone 2G working, because the iPhone 2G is architecturally similar to the iPod Touch 1G. Yeah.

Hi, great talk. What are your future plans for this project? Do you want to support newer devices, or a more modern iOS version? Yeah, thank you for your question. So, what are my future plans? I am currently working on getting the USB up and running. There is an independent researcher who also managed to get syscalls between the guest and the host running, so that's pretty cool; we can do some syscalls. So I'm currently working on USB.
Whether I want to work on newer generations, I'm not so sure. I think it would be possible to emulate them, but I think having one stable and actively used emulator is better than having ten fragmented, half-supported emulators, because there are so many Apple devices out there. So yeah. OK.

Hi, thank you for this great talk. I was wondering, you were talking about getting the App Store up and running. Have you considered getting in touch with Jay Freeman, the author of Cydia? Cydia... no, I haven't considered getting in touch with him. I know some people have asked me: can we jailbreak it and then install Cydia? I think we probably can. But there's almost no tooling around this emulator at the moment, so getting these jailbreaks up and running is kind of difficult right now. But I think it's a good suggestion. I think at one point I should. Yes. Thank you.

Yes, anybody at the front, hopefully? Thank you. Hi, and thank you for your talk. I don't remember: in 2007, did this type of device require activation or not? I think they did indeed require activation. Oh, actually, that's a good point. I used activation tokens from an actual device, because I also had to match the serial number, et cetera. So I matched the serial number, I used activation tokens from an actual device, and then it worked. But I could as well have patched out lockdownd; lockdownd is the daemon responsible for checking whether everything is activated, et cetera. I could as well have patched that out. OK, thank you.

Great talk. Have you had the opportunity to play with JTAG debugging, to cross-check whether your emulator works like a real device? What are you referring to? Like, how can you do this check? I would say you try to execute some peripheral accesses, both on the real device and in your emulator, and you cross-check the read results. That is a good point. I think you could do it with openiBoot.
So, I managed to install openiBoot on the actual device, and there you can play around with the peripherals. So I think you can have some kind of trace where you just fire requests at the hardware, get some responses, and cross-check that with what I get. No, I haven't done that yet, but I think that's an excellent idea to make sure that your emulator is mostly compatible with, or behaves the same as, a real device.

So I had a small question, actually, because at the beginning you mentioned you're a postdoc. How much time do you spend on this? It's very difficult to say, because sometimes I have a week where I spend every evening on it, and sometimes I don't spend any time on it for three weeks. It also depends on my main schedule for my work; it depends on paper deadlines, as with all postdocs, obviously. Yeah, I think when you get closer to getting something up and running, you tend to be more motivated, and then I spend more time on it than when I'm completely stuck. And yeah.

OK, does anybody have a question? I can keep going on. So another small question: in one of the previous talks, they mentioned motivation. How do you get the motivation to start something like this? And where do you start? Can you tell us something about that?

Yeah, I think for this, first of all, you need some curiosity. You want to know how things work, and you have to be able to dig deep into some components. And you know, there are many components, so you will inevitably run into something that you don't know anything about. I learned a lot about all the different components that are in there. But another very important thing, I think, is persistence. Because many times, for example when working on the multi-touch or the NAND, I was like, yeah, I really don't know how this works. And then you solve a small part, and then it turns out there's yet another layer of indirection going on.
And you have to figure that out again. And then it turns out that you made a wrong assumption earlier, which breaks all the other components further down the pipeline.
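The cross-checking idea from the Q&A, replaying logged peripheral accesses against the emulator, could be sketched roughly like this. The trace format and register addresses below are invented for illustration; they are not taken from the actual project.

```python
# Illustrative sketch (not part of the emulator): cross-check MMIO read
# traces captured on a real device against the same accesses replayed in
# an emulator. Addresses and values here are made up.

def diff_traces(device_trace, emulator_trace):
    """Return sorted (address, device_value, emulator_value) mismatches."""
    mismatches = []
    for addr, dev_val in device_trace.items():
        emu_val = emulator_trace.get(addr)
        if emu_val != dev_val:
            mismatches.append((addr, dev_val, emu_val))
    return sorted(mismatches)

device = {0x3E400000: 0x1F, 0x3E400004: 0x00}    # reads logged on hardware
emulator = {0x3E400000: 0x1F, 0x3E400004: 0xFF}  # same reads in the emulator

print(diff_traces(device, emulator))  # one mismatching register
```

Keying the trace by register address keeps the comparison order-independent; a real harness would also have to tolerate registers whose values legitimately differ between runs, such as free-running timers.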
Opening Energy: Reimagining this Ecosystem through Open Source devroom
So welcome to the Energy devroom. We're all happy that you're here. For those who are speaking today, thank you for generating the content for the talks; it was really exciting to see the proposals that came in. Just a couple of housekeeping rules: if there's an empty seat in the middle, to make space for people who might be coming in, we're going to ask you to squeeze to the middle. It would be great if you could start right now, because I'm seeing some empty seats in the middle. Squeeze together, so people coming in have somewhere to sit. Thank you. Anything else we need? Should we introduce anyone? Yeah, the organizers for the room. I'm Rachel Tipton; I work for Open Climate Fix. We have Boris Dali from RTE. Anna, do you want to introduce yourself? Hi, I'm Anna, one of the organizers. We have Dan from LF Energy, Kai from PIONIX, working on EVerest, and Nico, who's been managing all the things the past days. Thank you all. Also, if there are speakers who have questions about the setup, you can come to us between the talks if you're a little worried about how you're going to get your computer set up; feel free to approach us in the blue shirts, or any of the other managers will try to help you. Okay. And by the way, there's been a small organizational change: the first and the second talk have been switched around, so don't be confused. And have fun. Thank you for coming.
EVerest: One stack to charge them all?
Yeah, I'm just going to give a really quick presentation about EVerest. First, a few words about myself: my name is Kai, I have a background in computer science and robotics, and I've been working at PIONIX on this EVerest project since early 2021. So what's EVerest? It's a complete software stack for electric vehicle chargers which runs on embedded Linux, released under the Apache 2.0 license. The aim is to support many different hardware platforms, and you can also build your own. It comes with a lot of different modules already: board support drivers for AC chargers and for DC chargers, and it comes prepared for high-level communication, so we have SLAC implemented, DIN SPEC 70121 and ISO 15118-2 and -20; there's OCPP 2.0.1 and 1.6 support, with drivers for power meters, for DC power supplies, and so on. The project is primarily written in C++17; there's also language support for JavaScript and Python, and relatively recently we also introduced support for writing your own modules in Rust. Hopefully you can read this slide, but it doesn't really matter that much; I'm just going to talk a little bit about the timeline, how this project came to be. The first ideas on how to improve the EV charging ecosystem began at the end of 2020. The company PIONIX, which started this project, was then founded in early 2021. About a year later, EVerest was announced as the latest LF Energy project, with the source code being published in January 2022. We had chargebyte join the technical steering committee, and they also started integrating it into their charge controllers. In the beginning of 2023, different manufacturers of charging controllers and suppliers of chips launched several dev kits that are EVerest-enabled. And in October, we held our first little conference with about 100 people, the aptly named EVerest Summit.
There's always a bit of a mountaineering pun going on with some of the names. Pretty much at the same time, we had the US Joint Office of Energy and Transportation, as well as charger manufacturer Qwello, join our technical steering committee. And that leaves us pretty much here at FOSDEM 2024, with lots of exciting things planned for 2024 as well. This slide shows a lot of the ecosystem around it already: we have involvement from academia, from enthusiasts just wanting to work on this, but also charging station manufacturers, component suppliers, and standardization bodies as well. Looking at 2023, that was basically the year where the project took off, I would say. I held a short talk at FOSDEM last year in February, and you can see the stream of contributions increasing over the whole year, which was pretty cool. Lots of pull requests to review, lots of things to merge, and a lot of community engagement, which brings its own challenges with it. It was a fast-growing community. In 2023, we basically only had a mailing list, and at some point it was unmanageable because of all the traffic. So we thought about how to tackle this, how to make it sustainable for the future. We decided to move to a more chat-based solution, a Zulip chat, and you can see the amount of messages on the mailing list going down at the same time as the number of active users on the chat system went up. So I think this is on a good track, and we'll just have to see how this works out over the next couple of months. And with this introduction of the chat system, we also created a new organizational structure to better engage with the community and manage this growth. So we introduced different working groups.
One of them is focused on car communication, so ISO 15118, CHAdeMO, and things like that. Another working group that I'm very active in is cloud communication, which is mainly focusing on OCPP at the moment. Then there's one covering everything related to the core of the EVerest project itself, like build tools and the foundations, which has a bit of overlap with the CI and testing working group as well. And for everything there is not really a place for, there's the general and Q&A working group. What I find really interesting is that it's a multimodal approach: we try to have chat streams where people can ask questions and engage in a text-based way, but also regular meetings, video calls, where people can ask questions too. This seems to work pretty well. Let's talk quickly about some milestones in 2023. We had set a goal of monthly source code releases, and I think we more or less hit it: we had 10 monthly source code releases in a year, and we also just released the January 2024 release. Based on those source code releases, we also provide a Yocto layer for Kirkstone. We're also thinking about a new release strategy going forward, maybe doing releases every two or three months and focusing more on the stability of these releases, but this is still a bit up for debate at the moment. On the technical milestones of 2023: we worked pretty hard on OCPP 2.0.1, so the core and advanced security profiles of that are pretty much done, and some parties are already going into certification based on that code. In general, there was very active development on OCPP in the last year; OCPP 1.6 we also continuously improved. And on the car communication side, we now have a pretty well-tested DIN SPEC as well as ISO 15118-2 implementation, including Plug & Charge.
And we had the first successful charging sessions with ISO 15118-20 DC. To make all of this work well, we tried to attend lots of different testing events: we attended the OCA OCPP plugfests in Arnhem, as well as three different CharIN Testivals, which are focused on testing interoperability with ISO 15118. Some of you might remember this: last year I talked about the open hardware that we also launched around the end of 2022, early 2023, the Yak and Yeti boards, released under the CERN Open Hardware Licence Version 2. But I'm not going to go into any detail here; if you're interested in that, there are two talks I gave last year about this hardware, and you can find everything that you need on the GitHub page as well. Just another cool thing we built with this hardware: a DIY DC charger. We basically plugged this together, with a wiring diagram very similar to this one that you can also find on GitHub, and used our AC controller hardware to drive a functioning DC charger. Another cool thing that we've been working on last year is what we call the micro megawatt charger. This is a handheld DC charger powered by EVerest, and what's pretty cool about it is that it's a functioning handheld DC charger. It started out as an early prototype in early 2023, still in a box with cables and everything, and ended up as something that fits inside a small delivery box. What's cool about this is that, given it's a functioning DC charger that is battery powered, you actually have voltage on the DC pins, so you can plug it into a car and go through the whole charging sequence with the car. Not just protocol testing: you can actually get to the power delivery, and then most cars basically say, okay, I can't do much with one watt, I'll just stop. So why do we do this?
It's pretty cool to just walk around at these testing events, but also on a normal parking lot, with consent of the owners, and just plug this into a car to generate log files and packet dumps and things like that. We also try to publish these on GitHub afterwards. Then we worked a little bit on EV simulation: we got a small children's electric quad, outfitted it with a CCS port, and it runs a hacked-up EVerest EV simulation. I think it's one of the only children's electric quads that can charge on a commercial DC fast charger. We have some more plans for that in 2024: we want to have this EV simulation natively in C++, include an EV manager in there, and extend it with ISO 15118-20 support. And there's a little bit of work going on on CHAdeMO as well at the moment. This brings me to the roadmap for 2024, in no particular order. Like I just mentioned, the native EV simulation; we want to complete our OCPP 2.0.1 implementation and start integrating OCPP 2.1 once the spec has been released. There's going to be a lot of work on ISO 15118-20, so we have a C++-based EXI parser and a parser generator in the works. We also want to include Plug & Charge there, and work on AC unidirectional as well as bidirectional power transfer. And there's also a first CHAdeMO prototype for the charger side in the works. If this sounded interesting to you, here's how you can get involved: you can find documentation and how to engage with the project, the mailing list, the group chats and things like that, on everest.github.io. If you just want to look at the code, it's on github.com/EVerest. And you can also find the open hardware under those two links. I'm looking forward to your engagement, maybe contributions, and thank you very much. We have about three minutes if anyone has any questions.
Yes, I have a question about two things. The first is recuperation of energy when you are going downhill; and also motor deceleration, with trucks too, is that in the system? And the other question is about this hardware or software for bicycles with electric assistance. Okay, I think the first two are mostly on the EV side of things, the proper EV side, and we are mostly focused on EV chargers. But for bicycles, I think there's some work going on in some standardization bodies at the moment to specify charging for small electrically assisted bikes, as well as the little electric motorcycles, the scooters and things like that. Doesn't look like it. So, how much has the open hardware helped the project, in terms of contributions or, say, vendor adoption? I think it's really hard to quantify, because people can just look at the designs and build stuff with them. As a company, we had some orders for finished kits of these things, and I think we sold quite a few of those. So I think it helped, but it's more that we see it as a dev kit that people can just play around with. And it's really not that complicated; especially since it's an AC charger, you need some relays and a way to drive these relays. There's a power meter on there, but usually, if you want to build something for yourself, you don't need that. And then the high-level communication board needs a modem, a power line communication modem, to talk with the car, but only if you want to do ISO 15118. If you don't want that, you can just leave it out as well and build something really, really simple. But for starting to hack around with EVerest and all of these more advanced things, I think it helped, and there's definitely some interest there. Thank you very much.
Using FlexMeasures to build a climate tech startup, in 15 minutes
Welcome. Thanks for having me. My talk was actually scheduled for one o'clock this afternoon, but I'll jump in now. Am I too loud? It's fine. Okay. Well, I am Nicolas from Germany, living in Amsterdam. I'm co-founder of Seita Energy Flexibility, and we co-founded the FlexMeasures project. I will briefly talk about the FlexMeasures project. Last time at FOSDEM we also talked about some specifics; I like to introduce the project with specific applications. So last year, we talked about our vehicle-to-grid implementation, where we use FlexMeasures and Home Assistant. Today I'll take more of the developer perspective: how you, as a developer, would actually work with FlexMeasures. I only have 15 minutes, so I will fly over it a bit. Don't worry, we won't read every line of code; it's just to give you an impression of what it would be like. As an introduction: with FlexMeasures, we have been focusing on behind-the-meter optimization, so all these things you find behind the meter. There's enough complexity there to run an optimization and find the best running times for the things that are flexible, which are usually EV charging and batteries; today, we talk about hot water storage. Some of these things are not exactly behind the meter, but they matter as well. In the Netherlands, we have congestion on the grid, which influences the optimization of what you're doing; it's a constraint, and there are dynamic energy prices. So it becomes quite an interesting problem. Very briefly: FlexMeasures is a platform that takes in a lot of data, like meter data or prices, and gives you the best timing for your flexible assets; that's a very simplified picture of what it is. We have used it in a couple of areas, like I mentioned: bidirectional charging, in industry, in water sanitation, and now we're working on smart heating as well. Here's a little look at our dynamic visualization of what FlexMeasures knows at any given time.
So this is from the web UI of FlexMeasures. You can replay what happened: what data FlexMeasures knew, and what forecasts it knew. But I want to spend 10 minutes on this very brief tour: what if you were an energy startup? Let's say you work on smart heating, and you want smart scheduling for your e-boiler, as an example. These are the things you would like to do, and I will go through each of them. I'll touch upon a couple of ways to interact with FlexMeasures: writing your own FlexMeasures plugin, the Python client, the command line interface, and of course the API. While I go through this list, everything will be touched on for illustration, to show what you can do. The brief picture is that there's a house with the e-boiler, your energy asset, with temperature readings. There's a FlexMeasures server over here in the cloud, and all of these things are going to happen. So here's a little bit of an architecture diagram of what we'll try to touch: the FlexMeasures client will send temperature readings and ask the server to compute a schedule for the boiler; there's a data platform where we can get the prices; and we'll have a crontab, because we will have to do some things regularly. Let's keep that in mind. So this is the very first step. You don't have to read everything, but I'm just showing that we provide a cookiecutter template so you can quickly get up to speed and have your own code structure. You choose a name and a description, and you say, yeah, please give me the API blueprint. Blueprint is a word from the Flask ecosystem, because FlexMeasures is a Flask application. And you get some boilerplate like this, and that's for a boiler. This is the one endpoint we're doing here. What if we want to create a new customer for this project? This is a lot of code; it's basically the endpoint we wrote as an example. I'm not going to read everything.
Basically, this is how you plug it in: it's going to be plugged into FlexMeasures and be available as an endpoint. We're creating a user and an account. And maybe this is the most interesting part, your business objects; I will go a little deeper here. This is roughly the same code: we're creating the boiler as an asset, and we're creating a couple of sensors. Here are two examples, a bit bigger, where we really define things, where we tell FlexMeasures how to handle this: what kind of units we are handling, the event resolution, and so on, so that FlexMeasures knows what to do with them when data arrives and schedules have to be made. And if that happened, if somebody called this endpoint and your account was made, you would end up in the FlexMeasures UI, and you can see them here. Next step: let's say we measure the temperature locally. You have your own sensor, and you want the temperature data to end up in FlexMeasures as well. Here's a small example of how to use the FlexMeasures client. It provides you with some nice code to work with more easily, but it actually uses the FlexMeasures API in the background. For fun, we actually had the temperature reading in Fahrenheit; when we send it to FlexMeasures, where the data is actually to be stored in Celsius, it will automatically get it right. This is where a lot of the work goes, as you can imagine. But otherwise, this is just sending the reading; there's not much more. You'll do this regularly from a local script that runs on your Raspberry Pi, or whatever you're running there locally. One more step: there's some external information we need. Temperature is a local reading from your local asset; prices are a good example of information from a third party that also has to be collected in FlexMeasures. Another example is weather forecasts. In this example, I'm showing that we actually wrote a plugin for that, so we're cloning this plugin we wrote.
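As a rough illustration of the reading-upload step just described, here is a self-contained sketch. `build_reading` and its payload fields are hypothetical stand-ins, not the real flexmeasures-client API (which also handles authentication and the actual HTTP call), and in FlexMeasures itself the Fahrenheit-to-Celsius conversion happens server-side, based on the sensor's configured unit; it is shown inline here only for clarity.

```python
# Hedged sketch of posting a local temperature reading. All names below
# are illustrative, not the real flexmeasures-client API.

def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

def build_reading(sensor_id, value_f, timestamp):
    """Build a payload a client might send for one sensor event."""
    return {
        "sensor": sensor_id,
        "value": round(fahrenheit_to_celsius(value_f), 2),
        "unit": "°C",  # the sensor's configured storage unit
        "event_start": timestamp,
    }

# A crontab-driven script on the Raspberry Pi would do this periodically.
print(build_reading(42, 68.0, "2024-02-03T10:00:00+01:00"))
```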
ENTSO-E is the organization of European transmission system operators, and they provide a data platform where you can get various things like prices, but also day-ahead allocations for all the transmission zones. So we say we want the Dutch transmission zone: please give me the prices for that, and we configure everything. And then this is the command: through the FlexMeasures CLI, this plugin has registered a group of commands, for instance to import day-ahead prices. All of this, how we wrote the plugin, is public. So if you call this regularly, say once per day, you'll always have the next day-ahead prices in your system. Here's a small visualization of one day of prices in the FlexMeasures CLI. Excuse me. Okay, how much time do I have? Eight minutes. All right, that's not too bad. But the main part now is that you want to actually tell FlexMeasures to give you an optimized schedule for your boiler. Here I'll show how to use the API directly, though I could do that via the FlexMeasures client as well. This part is not so interesting, of course: you have to have an authentication token. But I have to spend a bit more time here. A lot of the time we spent when we made FlexMeasures went into how you configure the problem: how do you tell FlexMeasures the constraints of the problem? In the back, FlexMeasures will take your information about your setup and your problem, basically what you could call business rules, and really translate that dynamically into a linear program. FlexMeasures contains, I think, three different algorithms: we have one that focuses on storage-based problems, and that's what we also use for heat, for heat batteries as we call them; and we have one for when you just want to allocate processes. It's a very important part of developing a new application that you can tell the FlexMeasures server: this is how I want you to treat this problem.
Here's a constraint you don't know about, or here's a local thing you don't know about. That's where we're working on two things: the flex model and the flex context. The flex context would be, well, these are the prices that are relevant. We also have a project where we don't use prices but the CO2 signal, the anticipated CO2 content of the grid. The flex model is a bit more detailed. This is not everything you can do, but basically you're saying: well, the state of charge of this heat battery is this many kilowatt-hours; that's local knowledge you have. Here are some constraints: I can't, or don't want to, go under this. And also, here's a target for you: in the morning, I need to have this much energy content in my battery. I think this could also be a percentage; we're pretty flexible there. Some other constraints too. You can see how these translate into the constraints of a problem. And then you call our API to say: for this fill rate, I want a schedule; please start. That will trigger a scheduling job, and FlexMeasures will usually pass this on to a worker. In our implementations, we have a web worker and computation workers that handle those. Then you can call this GET endpoint to check whether your computation is ready. It will usually not be ready after three seconds, but soon after. And then, yeah, you get your values here, so you can implement these settings locally. Let's say you ask for a schedule for 12 hours; then your local gateway has the plan for 12 hours. If anything changes on the ground, you just ask for a new one; you'll update as you go. That's the general behavior. I'm almost done with, you know, two of the four here. One thing we maybe want to do in FlexMeasures is have a nice dashboard that has the most crucial data stacked on top of each other for some inspection.
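The flex model and flex context just described travel along with the schedule request. The sketch below shows the general shape such a payload might take; the field names (`soc-at-start`, `soc-targets`, and so on) follow my recollection of the FlexMeasures API documentation and should be checked against the current reference before use, and the sensor id is made up.

```python
# Hedged sketch of a schedule-trigger payload: the flex context points at
# the relevant price (or CO2) sensor, the flex model carries local
# knowledge and constraints. Field names are illustrative.

flex_model = {
    "soc-at-start": 3.5,   # current energy content in kWh (local knowledge)
    "soc-min": 1.0,        # don't go below this, kWh
    "soc-max": 7.0,        # physical capacity, kWh
    "soc-targets": [
        # "this much energy content by morning" style target
        {"value": 6.0, "datetime": "2024-02-04T07:00:00+01:00"},
    ],
}

flex_context = {
    # could equally reference a CO2-intensity sensor instead of prices
    "consumption-price-sensor": 14,  # hypothetical sensor id
}

payload = {
    "start": "2024-02-03T19:00:00+01:00",
    "duration": "PT12H",  # ask for a 12-hour schedule
    "flex-model": flex_model,
    "flex-context": flex_context,
}

print(sorted(payload["flex-model"]))
```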
And then you can actually put that on the boiler asset, and in FlexMeasures you have these nicely stacked, right? You want to see what you've been using for optimization on top, although that comes from a different asset; this is something for everybody, all the assets can use it. And, as you remember, we had like four sensors or so that are relevant, but we just decided these are the two we want to see. So we can easily see that in a period of low prices, FlexMeasures has tried to fill the boiler. I'll skip over this a bit, because I originally had a 25-minute idea for this talk. Very quickly: we also noticed it's very important to do some reporting. FlexMeasures gives you some logic for that, where you combine some sensor data so you get the outputs of what happened, for instance costs; that's very important. And that can become a cron job as well, where you regularly say: okay, now the day has happened, we optimized as well as we could, let's calculate how much energy cost we had. So you combine just the prices and the fill rate that actually happened. But we also saw that there are many more interesting computations that people want. This is a very simple multiplication, but we've made a pretty flexible architecture, so you can bring a couple of sensors together for a new result that can even be used further in your next optimization. It's a very flexible system we've built here. And this is the project website. From there, you'll find the GitHub, you'll find the Read the Docs, you'll find more information; for instance, I was interviewed for a Python podcast where maybe I go into more detail. The mailing list, contact details, everything's there. You can also just write me directly, of course, if you're interested in doing something yourself, and for joining our TSC, the technical steering committee, everybody's welcome. And that's it.
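The cost report mentioned above boils down to multiplying two aligned interval series and summing: price [EUR/kWh] times realized fill [kWh] per interval. A minimal stand-in, with made-up numbers:

```python
# Toy version of the "prices x fill rate" reporter described in the talk.

def energy_cost(prices_eur_per_kwh, fill_kwh):
    """Element-wise product of two aligned interval series, summed."""
    assert len(prices_eur_per_kwh) == len(fill_kwh)
    return sum(p * e for p, e in zip(prices_eur_per_kwh, fill_kwh))

prices = [0.30, 0.10, 0.05, 0.25]  # four intervals of day-ahead prices
fill = [0.0, 1.5, 2.0, 0.0]        # boiler mostly filled in the cheap hours

print(round(energy_cost(prices, fill), 2))  # 0.10*1.5 + 0.05*2.0 = 0.25
```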
Yeah, there are lots of things to do, of course. I've touched upon a couple of things, applications like vehicle-to-grid, smart heating, and industry, but the roadmap is still, of course, full; there are so many things in the energy world behind the meter, and a bit above it, to optimize. Thanks. We have time for a question then, if someone wants to ask one. You said that you create a linear program; what solver do you use to solve this program? What kind of solver? Yeah, we work with two solvers now. You could, of course, also use CPLEX, but we've used two open source ones. Right now their names don't come to mind, sorry. (An audience member suggests one.) Yeah, we switched to that one, and we had a different one before; both are possible. Those are even shipped with the Docker image, so you can just configure which one you want to use. We use Pyomo as a representation for the problem, so everything that works with Pyomo, you can use that as well. Thank you so much.
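As the Q&A notes, FlexMeasures formulates the scheduling as a linear program and hands it to a solver via Pyomo. Purely as a toy illustration of what the storage scheduler is aiming for, and emphatically not the real LP formulation or solver, here is a greedy sketch that charges in the cheapest intervals until an energy target is met:

```python
# Toy stand-in for the price-driven storage scheduling idea: charge in
# ascending price order until the target energy is reached. A real LP
# would additionally respect soc-min/soc-max and other constraints.

def greedy_schedule(prices, target_kwh, max_kwh_per_interval):
    """Return per-interval charge amounts, filling cheapest intervals first."""
    schedule = [0.0] * len(prices)
    remaining = target_kwh
    for i in sorted(range(len(prices)), key=lambda i: prices[i]):
        if remaining <= 0:
            break
        amount = min(max_kwh_per_interval, remaining)
        schedule[i] = amount
        remaining -= amount
    return schedule

prices = [0.30, 0.10, 0.05, 0.25]
print(greedy_schedule(prices, target_kwh=3.0, max_kwh_per_interval=2.0))
# cheapest interval filled to its limit, the remainder in the next cheapest
```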
OwnTech Project: An open-source generic reprogrammable technology suite for reimagining the energy ecosystem
So, okay, we got the hard task of being the first ones to speak, and we failed. My name is Jean Alinei. I'm the CEO and co-founder of OwnTech, and today, with Luiz, we will discuss what we've done so far and what we are trying to achieve. We wanted to start with a bit of a general introduction of how we see energy and how it could become more and more open source over the years. The idea is that we see it as a pyramid, with the base being the power hardware, and then levels of sensors, real-time algorithms, and industrial informatics; higher levels in terms of communication, how we dispatch information from these devices in the field, what protocols we use, how we dispatch the energy among different power hardware; and then the highest level, which is simulation, optimization, modeling, forecasting, and so on. Today it's really exciting, because if we look at what this session is all about, we have plenty of amazing projects that are filling this pyramid, and it's really interesting because eventually we can reach the point where we have the whole chain, from the power hardware to the modeling, the forecasting, the optimization, through all the complexity of communication and protocols. An interesting thing to note is that the time constraints in the power hardware are not necessarily the same as those for the modeling and simulation of a grid, for instance. So the complexity associated with these things makes the informatics different; these are different fields, from the embedded world to HPC and the modeling and optimization world. There is an inherent complexity in the energy domain that is really interesting as a technical asset and a thing to explore. And this is why I'm really excited today: in this session we are combining simulation, communication, and hardware, so it seems that we already have all the bricks, and maybe tomorrow we'll build the pyramid.
So we, the energy people, have the power to change the world, and I'm really excited about that. I'll give the floor to Luiz now. Thank you, Jean. This pyramid is built with different bricks, and these different bricks are hardware and software, like Jean just said. And hardware is usually hard, until it isn't anymore, until somebody comes along, bundles the hardware, and makes it ergonomic, makes it easy to use. That's what Arduino has done, that's what Raspberry Pi has done, micro:bit has done it as well, and they have inspired us to do the same for power hardware. And that's what we have achieved. There's a box there with one of our circuits, and I'll pass it around a little bit later. We propose a community-based, compact, versatile, open source, and low-cost technology for learning and prototyping power electronics. That's the goal, that's what we want to achieve. The idea is to create a technological sandbox, just like Raspberry Pi, just like Arduino: something that is standardized, that is simple to use, that can be used by academia for teaching, by industry for fast prototyping or for use in other applications, and by makers and fab labs to make fun stuff and burn it. And this is the place where we hope to foster new ideas and bring up new talents: people who are willing to build electric bicycles, people who want to build a microgrid, who want to understand how it works, put the bricks together, and build the hardware upon which they can test their forecasting algorithms or their models. Now, starting to get a little bit under the hood: how does power hardware work? If we look at it from the perspective of a functional analysis, the power is really the red arrow in the corner. And to get that arrow to work as we want, we have all these different arrows in the middle.
And if we take a top-down approach: we did a simulation, which allowed us to do a forecast, which allowed us to calculate an energy management strategy, which we then send via dispatch, through a protocol, all the way to the target. When it gets to the target, it comes in through the communication back door or front door. That goes into the industrial informatics and the control systems, which operate in real time, locked into this micro- or nanosecond-level loop. It also receives measurements from its own embedded sensors, but these are not normal sensors that we come and interrogate via LoRa once a week. These are sensors sending information at a one megahertz bandwidth, which you are sampling every 50 microseconds, or sampling at a very, very precise moment as well. Combined with the control algorithms that are in here, they create the low-level electric signals which then go there and trigger the power electronics so that they work the way you want them to. Then the loop is closed and the thing works. There's a little dirty secret in the middle, never forget it: the energy has to come from somewhere, so if that little part fails, the whole thing stops. So everything kind of stands on the choice of the little component that you made when you put it there. And what we did is we took all this stuff and put it onto a board, and you have all the different blocks bundled there together. But you don't have to understand it at that level of complexity unless you want to. You see the communication coming in and the power going out; that's it. And that's the idea. We have two products: a power product, the Twist board, which uses the second product, the Spin board, which I'll talk about in a moment.
The Twist board is a module which we can rack up together: we take several Twists, we put them together, and that allows us to handle more power, since they synchronize and communicate with each other. It's a linear progression: the more Twists we put together, the more power we can handle. And we created a communication bus at the low level which can talk CAN and can talk RS485, so we can talk at the millisecond, we can talk at the microsecond, and we can talk at the nanosecond with analog. So we have different bandwidths which we can dispatch through different communication methods and protocols. And we have the Spin board, which I'll let Jean present to you.

So eventually, in order to control power hardware this fast, you need a special embedded microcontroller. And this microcontroller has some real-time constraints to it, so it's not a regular Arduino or Raspberry Pi that will do the job. If you want good performance, you need really precise timers and really special communication peripherals. And so eventually we ended up designing our own board, which is the Spin board. The Spin board is a piece of hardware that looks a bit like an Arduino Nano or a Raspberry Pi Pico. And this thing has tremendous resolution for its PWM signals, the driving signals that will eventually drive the power stage, but also really flexible acquisition of signals, so it will connect with the analog signals on the board.

Eventually, microcontrollers are great only if they come with great ergonomics, and coding a microcontroller can become either a nightmare or a piece of cake, depending on the software and the IDEs that you use to do so. So we wanted to comply with the maker movement mindset, where you basically take a microcontroller board, you plug it into your computer with USB, and you start coding in seconds and minutes.
You don't have to install the whole toolchain and so on; everything is done by the IDE itself, without the complexity of setup and so on. In order to do so, we use PlatformIO together with Visual Studio Code, so it's a really seamless experience for the developer. But we also have a higher level of development that is possible with MATLAB, for simulation people who want to deploy some control loops directly onto the target. They can do so through a high level of graphical coding, let's say. And underneath, there is something from the Linux Foundation, Zephyr RTOS, that is providing a framework on top of which we've built APIs.

These APIs are calls that basically make things seamless for the user, so that you don't have to go through the hassle of the 2,000 pages of the microcontroller reference manual in order to program the power hardware. You have high-level functions that relate to the power world: okay, what is the duty cycle, what signals do I want on that MOSFET; or directly related to the application: I want to increase the voltage, I want to decrease the voltage. So I can pick my level of complexity in the language I talk daily, and I don't have to go through documentation and things like that.

So we have different APIs. One is the microcontroller API: if you want to develop your own power hardware and control it through the Spin board, you can do so. Or you can directly call another API that is built for the power hardware that we provide as well with the Spin module. This way you can call functions and not signals. And then there is a communication API, for how to synchronize things with the surrounding world, and task APIs to say: okay, I want to dedicate that amount of time to this calculation, and that amount of time to communication or higher-level housekeeping stuff. And then there is the user code, which is basically your main, as in an Arduino experience, let's say. So this is the pinout.
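To make that layering concrete, here is a toy sketch of the idea of talking in power-domain terms instead of register-level signals. The real OwnTech APIs are C++ on Zephyr RTOS; this TypeScript transliteration and all its names are invented purely for illustration, assuming an ideal buck converter where Vout = D × Vin.

```typescript
// Toy illustration of the API layering described above: the user talks
// in power-domain terms (volts) while a lower layer translates down to
// the duty cycle driving the switches. All names are invented; the real
// OwnTech APIs are C++ on Zephyr RTOS.
class PowerLeg {
  private dutyCycle = 0.5;

  // "Microcontroller-level" API: set the PWM duty cycle directly.
  setDutyCycle(d: number): void {
    this.dutyCycle = Math.min(1, Math.max(0, d));
  }

  // "Application-level" API: think in volts, not duty cycles.
  // For an ideal buck converter, Vout = D * Vin.
  setOutputVoltage(vOut: number, vIn: number): void {
    this.setDutyCycle(vOut / vIn);
  }

  getDutyCycle(): number {
    return this.dutyCycle;
  }
}

const leg = new PowerLeg();
leg.setOutputVoltage(12, 48); // ask for 12 V from a 48 V input...
console.log(leg.getDutyCycle()); // ...which is a duty cycle of 0.25
```

The two methods stacked on each other mirror the split between the microcontroller API and the power-hardware API: the application layer is built on the signal layer, and you choose which one to talk to.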
Of course everything is open source, so the hardware itself is under a CERN OHL license. The idea here is to push people to share back their modifications, so that we can move on with better and better hardware over time. All the documentation is Creative Commons, and all the interfaces and the graphical stuff are GPL. And we have a data viewer, something that you can plug in and see the data live, as if you had a kind of low-bandwidth oscilloscope, just by plugging in your USB cable and gathering the data from the device directly.

In order to make that thing happen, we've created a foundation that is under the aegis of the CNRS Foundation; the CNRS is the national scientific research organization in France, which has put a ton of effort into making this thing a reality. So we got a lot of support from a public lab in France, and this is where it comes from. And the foundation is holding the IP. So if you want to contribute to this project, everything will be under a dedicated foundation that has strict rules to enforce the open source vibes of the project forever. And then there is a startup that is basically providing the hardware, because if you want to develop things, you need someone able to provide the hardware so you can go fast, basically.

And on the foundation side, we create tutorials, content, MOOCs, and we make all that available online. So we create an online space for that. We also coordinate a small embryo of a community at the moment, but we hope that it will become more vivid, with some international collaboration around these fields of power for energy. And we are also starting to organize training sessions and events to answer local needs, and the idea is to spread and to make things decentralized, in a way that everyone can tackle their energy needs with this kind of Arduino-for-energy thing.
So to give an example of the first use case: at the moment we are working on a use case with a fully open source e-bike, and in this e-bike you have inverters, you have battery chargers, a BMS system to monitor all the cells of the battery, and a converter as well for the PV panel on the roof. And so we are collaborating with other great open source hardware projects, such as Libre Solar and Vhélio, and we are aiming at replacing all the closed source pieces of converters inside this e-bike and making it fully open source from A to Z, from the smallest piece of electronics to the frame of the bike itself. So yes, that's it for me, and hopefully Luis will be able to make a demo in five minutes. Yeah, maybe we can combine it with a question.

And how much? Sorry, can we buy the boards, and how much? We have started producing: we have our own pick-and-place machine, so everything is made in France at the moment, assembled in France. So we have started assembly; we have shipped our first eight boards to a university in France for students. They haven't destroyed the boards yet, so it's a good sign. And we have pre-orders at the moment. To give an insight into the price: at the moment the power module is 300 euros and the microcontroller is 45, 49 euros.

Can it be used in a fault-tolerant architecture? So yeah, to answer that really fast, maybe I'll come back to that slide. One of the strengths of the modular approach is that we've put a lot of effort into making the different modules able to share power loads and share communication. And it's a good thing for fault tolerance, because if you go modular, if a module fails, you can think of clever ways of replacing the faulty module with another module.

Yeah, just one: an application is a completely energy-autonomous home, with wind power, little solar photovoltaic panels, also a bicycle with electric assistance and so on.
So also with low-voltage DC for computers and other things, and high-voltage AC, and also taking into account the time of day, battery charging with lead-acid, battery charging with lithium and so on, something like that.

So definitely, off-grid applications are key, and also energy independence and so on. At the moment the module that we've developed is DC-based, so it's DC to DC. It has a really wide range of operation, from 90 volts down to 10 volts. So it complies with all battery technologies: 12-volt batteries, 24-volt batteries, 48-volt and, like, 86-volt batteries. So it covers the range of battery applications, let's say. In the future, our goal for this kind of grid application, and home and energy independence, is to go for a microinverter, basically, and this will be made by combining different modules. So this one is a DC module, and then we'll add an AC connection on top in order to cover these off-grid applications and energy independence. There is one in the back.

Could you also create some open source BMS? So we haven't developed a BMS, but that is already covered by the hardware from Libre Solar, I think.

Hello, it's a bit of an implementation question. So you are using CAN bus for now, maybe because the automotive world is using it. I was wondering if you were thinking about moving to something like 10BASE-T1S; I'm not sure you're familiar with that. It's kind of Ethernet, but with the CAN topology, so multi-drop. So it's really nice and kind of microcontroller-friendly, and IP-based thanks to Ethernet. So I was wondering if you were thinking about it. So yeah, we thought about it, because Ethernet has great features, but it tends to be costly. The idea is to lower the cost of the overall communication architecture a bit, and so on. Yet we are making things modular to the biggest extent, in a way that if you want to plug in a different way of communicating, you can do so.
You can access all the pins of the microcontroller that we have. Maybe in the future, we think we will support different microcontrollers as well, with more features and more peripherals, but at the moment it's not planned. We have two different things: CAN is for housekeeping and sending average data, and RS485 is for superfast communication. We go at 20 megabits with RS485. It's a bit uncommon, but it permits having, like, one control cycle of communication between the different modules. So they can share a reference and a set point, but also measurements, among multiple modules, still at a 10 kHz control frequency, for instance.

No? Sorry, no demo. We are here the whole day, but the thing just crashed, of course. Of course it did. Demo effect. Demo effect, but I would like to just share something with you though. Can we get into a... Yes. Yes. So we do have a GitHub, and what I wanted to show you is that on our OwnTech Foundation GitHub there is sample code, the examples and the data that we have. And in the examples repository we have multiple different examples of how to use the Twist board in different applications: DC to DC, microgrid, AC. What I wanted to show you, the demo that I failed miserably to achieve, was the microgrid. So what is that supposed to look like, this peer-to-peer AC microgrid? We have the documentation: how to connect the boards together, and the communication that goes here. And these two boards then work together to share power. In this case it's a peer-to-peer exchange: one board is drawing power while the other is supplying. And this is actually data from the board itself. That means that we can ask the power converter to sample data very quickly, keep it in its memory, and then we can retrieve it later. So we can do this kind of test where we get...
Every point is about five microseconds apart, so we get a lot of resolution and can see what's going on. It's offline, because we do it after the run, but it still works like that. And for the DC-DC side, same thing, we had the DC... There it comes up. Okay. We have the different structures and different examples; they are there. So we invite you to go there and take a look at our GitHub. Take a look at the Spin board, it's there, it's in KiCad. The Twist board as well. And if you want to talk with us during the day, I have everything that I would normally need for a demo, and we can just sit down and do it together.
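As a back-of-the-envelope check on the RS485 figures quoted in the Q&A above (a 20 Mbit/s link and a 10 kHz control loop), each control period leaves a fixed raw bit budget; a quick sketch:

```typescript
// Back-of-the-envelope check of the RS485 budget quoted above:
// a 20 Mbit/s link shared by a 10 kHz control loop.
const linkBitsPerSecond = 20_000_000; // 20 Mbit/s RS485
const controlFrequencyHz = 10_000;    // 10 kHz control loop

const periodMicroseconds = 1_000_000 / controlFrequencyHz;   // 100 µs per cycle
const bitsPerCycle = linkBitsPerSecond / controlFrequencyHz; // raw bits per cycle
const bytesPerCycle = bitsPerCycle / 8;

console.log(`${periodMicroseconds} µs per cycle, ${bytesPerCycle} raw bytes per cycle`);
```

250 raw bytes per 100 µs cycle is the ceiling before any framing or turnaround overhead, which is why a reference, a set point and a handful of measurements fit within one control cycle, but not much more.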
Enhancing OCPP with E2E-Security and Binary Data Streams for a more Secure Energy Ecosystem
Okay, welcome to my talk. We already heard a lot about the fascinating domain of e-mobility from the guys from EVerest. While EVerest is on the charging station side, this project is more about what happens behind a single charging station: all this back-end stuff, up to the energy providers, distribution network operators and so on. And why is this important? Because in the energy domain we have a lot of safety and security regulations coming from the government. We somehow have to comply with them, because in e-mobility it's not only important that you have IT security; you also need to provide energy safety, because everything is connected via the grid, and when too many people behave badly, we will have the next blackout. So nothing changed since last year, so I'll skip this, because I only have 50 minutes.

In the past, e-mobility was quite simple. We had charging stations on one side and a back end on the other side, and they more or less had been communicating. For the last couple of years it's HTTP WebSocket communication: this is the client, this is the server, everything is fine. But now the situation has changed a little bit. We no longer have a single charging station somewhere on the street; we normally have multiple charging stations at one location. So it's quite useful to have some middle box which combines the communication, so that you save money when you want to communicate with the back end. This is nothing new. There are a lot of vendors implementing it; there are even specialized vendors for this. It's already in the OCPP standard, but not really in greater detail; it's just mentioned that you could do it. We want to dig deeper into this problem and see what we need to realize it. Next thing, when you have this middle box, it's very natural to add additional stuff to it. So you not only want to combine the communication channel; you also have specialized energy meters which are now located at the grid connection point.
So, monitoring the grid connection point: the idea behind this is that you can do local load management, because you only have a limited capacity on your grid connection but want to share it between the charging stations, and somebody must be in charge of how to share this energy. There are other projects which do the calculation for this; this is the communication part. And here, for the first time, if you're German you know this fascinating world of smart meter gateways, which is more or less specialized hardware from the Federal Office for Information Security in Germany, which regulates this area, because energy, as I mentioned, is a safety-critical infrastructure. So they try somehow to improve on the situation that most vendors don't care that much about security and safety.

The first problem we have, because as I said we come from a very simplified view of this problem, with a connection from the charging station to a back end: because of limitations of the OCPP protocol, at the moment we duplicate every connection between this charging or communication aggregation box and the back end. This is not only a design flaw which nobody cares about; it's also starting to become more and more of a security problem, because the only security we have is HTTPS, so transport layer security, and in this box you have one transport layer security here and another transport layer security there, so you have a split communication channel. So your IT security is no longer given, because this box could be a man in the middle. It's getting even worse, because now we have specialized companies who sit in the middle, between your charging stations or your aggregation box and your back end, or even multiple back ends, and want to do analytics for you, because normally the charging station management operators or vendors just manage charging stations; they are not that much into analytics.
So very often those companies sit in the middle, and then you realize, okay, now the problem is getting more and more complicated, because people who are only interested in Excel sheets sit in the middle of your critical infrastructure, and maybe they not only analyze what you're sending; possibly Mr. Putin could be sitting here and sending commands back, because you have no chance to stop him.

So the first thing we want to have, and this is also nothing really new: we want to share these WebSocket connections. For this we need to adapt the OCPP protocol a little bit. There's already an internal draft on how you could do this, but when you look closer at this draft (internal means internal to the Open Charge Alliance, the organization managing the OCPP protocol), you see perhaps a couple of drawbacks. The first thing we obviously need is to add some additional routing information, so that we know we are sending from this box to that box, and my proposal is that we can do a lot of interesting things if we copy the good old concept of the record route option, which you may know as an optional IP version 4 feature, and implement it in a much more user-friendly way. Next thing: in the OCPP internal draft we have more or less source routing, so the sender includes the path through the network in the request. This is well known, it's a valid way to do it, but it also has a lot of limitations, because when the network changes very often, you have scalability problems. So it's much more logical to use a normal routing table in every box. You can use the typical auto-learning that you know from Ethernet switches, which also learn which communication partner is on which port, and implement it more easily.

Now, it's getting worse and worse, because we are in a modern world. A charging station management system today is no longer a monolithic thing on a notebook somewhere in the Netherlands.
It's a highly complex system of microservices, and these microservices are even from different operators. So we very often have complex systems where the asset management (which charging station is located where, coming from which vendor) is in an SAP database; then you have another database for all these real-time energy measurements, and so on and so on. So you realize, okay, now we have a bit of a problem, because we have a critical infrastructure, but in the back end we have a multitude of loosely coupled systems without much security. So the traditional OCPP security model is also no longer sufficient here.

For this, very simply, it would be nice to have digital signatures. Again, there's an internal draft in the Open Charge Alliance, but it had signatures on the transport part of OCPP, so it's limited to OCPP. It would be much more interesting to have them on the OCPP messages themselves, because then we can send end-to-end messages, and end-to-end means in this case from the EV to the energy distribution grid operator, or to the EMP, or to the smartphone of the driver, and so on. We will later see a lot of use cases for how to make use of it. When you want to have signatures, the next problem is, as usual, that you reduce the complex problem to a key management problem. So you need something like signature policies to define which signature is valid, which signature I should use, which signature I should verify. Once signatures are implemented, you can extend them to user roles, because at the moment everything in OCPP is more or less one user. You have no differentiation like: this communication partner is only allowed to set energy commands, that one can also change communication parameters, or whatever. This can be implemented using the signatures. And last but not least: at the moment OCPP only uses the text frames of HTTP WebSockets.
But there are a lot of useful use cases for binary streams, especially when you look at firmware updates or log file downloads, because these are at the moment external HTTP requests, and this makes your network security more complicated. If you integrated them into the OCPP protocol, you could close down your network, allow only OCPP communication, and improve security.

So nice, all these little details, but what are the real use cases for this? In Germany, since the first of January this year, there's a nice new law that your energy provider can send you messages (we are a highly regulated infrastructure) to reduce the amount of energy you're using, because we have more and more renewable energy and so on. But it's external, additional hardware. Why not use the existing infrastructure for this? The reason: because it's not secure and safe enough at the moment. Would it be secure and safe enough, perhaps, if we could talk to these guys and say: okay look, we have now improved our infrastructure, why don't we remove this additional hardware? In the same law, we have the possibility that an energy provider can get your measurements. This is again a regulated use case. We would do this with our normal OCPP infrastructure. The same with charging tariffs: charging tariffs coming from e-mobility providers or someone else should also be signed, secure, immutable data, and then used in OCPP. The good part is that in the upcoming version of OCPP there will be some support for tariffs, not yet end-to-end signed tariffs, but at least halfway there. Then there is this interesting use case where you want to pay for your charging, but in an anonymous way, so you don't have an account somewhere, but you pay with your smartphone. In the regulation, they are talking about QR codes.
Wouldn't it be interesting to use this QR code to get something like a direct communication channel to this charging station, over all this complicated infrastructure, but secure, so that you have something like a remote control? Because nobody stands in front of the charging station, not even for 20 minutes, just to watch what's happening; they want to have it on their phones. But for this you need a secure channel. The same idea, but for another user group: the charging station operators and the energy people also often don't know what's really going on at a charging station, because what is sent over the wire is very limited at the moment. They use a lot of AI to invent what might be happening on the charging station, but in reality it would be much nicer to have something like this digital twin idea: just send everything that is important to somewhere it can be analyzed. But again, we have no secure infrastructure in the middle, because every shitty marketing company could manipulate our data.

Then the German calibration law, which is my favorite topic, but we had this already last year. We have national contact points who want to collect all this data and statistics about your charging station infrastructure, how good or not good it is. No security, no privacy at the moment. The same problem as usual. The really biggest problem: this is on the street. Yes, more or less the last slide. The charging station is on the street, so there is no physical access security here. So even when we have encryption and signatures, we cannot be sure that somebody isn't sending us a lot of crap. Okay, it's a bit harder to manipulate a lot of charging stations on the street, but if you're Putin, you would probably try it anyway. So what can we do to analyze, here, whether this is a valid request or valid information or not? I try my best to get this into the OCPP standard, but at the Open Charge Alliance we have the normal problem.
There are many leeching companies and not so many really contributing companies. So if you find this use case interesting, if you think this is interesting for you, for your company, for whatever, feel free to contribute to this project, feel free to become a member of the Open Charge Alliance, and help us to get it out on the street. Thank you so much for your presentation.
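The end-to-end signature idea from this talk, signing the OCPP message itself rather than the transport, can be sketched as follows. This is not the Open Charge Alliance draft; it just uses Node's built-in Ed25519 support to show that intermediaries can route a signed payload onward but cannot modify it. Message fields and key handling are illustrative.

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Illustrative end-to-end signature over an OCPP payload: the signature
// travels with the message, so middle boxes can forward but not modify it.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

const payload = JSON.stringify({
  action: "SetChargingProfile",       // illustrative message content
  chargingProfile: { limitW: 11000 },
});

// Ed25519 signs the raw bytes directly (no digest algorithm needed).
const signature = sign(null, Buffer.from(payload), privateKey);

// Any hop, or the final recipient, can verify against the signer's key.
const ok = verify(null, Buffer.from(payload), publicKey, signature);

// A tampered payload must fail verification.
const tampered = verify(null, Buffer.from(payload + "x"), publicKey, signature);
console.log(ok, tampered); // true false
```

The hard part the talk points at is not this primitive but the key management around it: the signature policies that say whose key may sign which action, which is where the proposed user roles would hook in.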
CitrineOS
Hello? All right. I'm going to start my talk. My name is Christian. I'm a software developer at a company called S44. We make software for charge point operators and e-mobility service providers, so basically the cloud side of the EV stuff. All right. And today I'm going to talk about an OCPP implementation, if the clicker works. It worked a second ago. All right.

So if you take a look around at chargers and charging networks, what you'll often find is a broken charger, a charger with a black screen, and especially payment terminals saying oops. I found a study from 2022 that said that in the US less than 75% of the chargers were working: when users came up, they couldn't get a charge started. So now governments have gotten involved, right? There are uptime guarantees in the UK, and the US NEVI funding also relies on uptime guarantees. If I remember correctly, I think the AFIR also has uptime guarantees, but I'm not 100% sure. And the most recent thing I found for the US (the company I work at is mainly US-based, so that's why there's a little focus there) is that in 2023, broken chargers were, like, the major concern for users using public charging infrastructure. And then, maybe most importantly, Reddit users are super unhappy. I think some subreddits even banned talking about broken chargers because they were really annoying. And I'm going to click. All right.

So one thing that we found, or our thoughts on why this happens, is a lot of proprietary implementations. So you can see DALL-E's interpretation of proprietary OCPP stuff. So if you're not Tesla, which owns the entire vertical, right (they know what's happening at the charging station, in the car and in the cloud), then what do you do? Well, what happens right now is there are a bunch of different vendors. Wherever you sneeze in the EV charging cloud stuff, there's a different vendor.
And most of them don't really share what's happening under the hood, which results in, well, a bunch of unexpected behavior: you don't know what's about to happen, especially later in the field, when a user is interacting with something and you don't have known input. Then of course we have OCPP 1.6, which leaves a lot of stuff up to the imagination as to when which message should be sent. And then maybe the CSMS thinks, well, I'm expecting an ID token now, but gets some other message. But one thing that I think is one of the biggest problems with OCPP 1.6 is around monitoring. Right now, each hardware vendor builds in their own obscure monitoring messages. And if you want to integrate with, like, five different hardware vendors, well, then you have to work out how to understand all five different messages that basically mean the same thing. Well, that leads to broken parts in the field and no one knowing about them, which then leads to Reddit users being angry because the charging station has been broken for, like, a week and no one really noticed. Thanks.

All right. So what can we do to improve the state of things? Well, OCPP 2.0.1, I think, is already a huge step in the right direction; you can see DALL-E thinks so as well, OCPP 2.0.1 winning strongly. One thing that I really like about OCPP 2.0.1 is that it has a lot of use cases, it's super structured, and you can build your test cases on them. And then of course there's much more monitoring around the device model, which helps in identifying that there's something about to go wrong with the charger, instead of just: it's broken. But that still doesn't help with transparency. So if everyone just reinvents the wheel once again, just like with 1.6, well, you're still going to run into different interpretations. So we think there should be something that's open source, that's transparent, where you know what's happening under the hood.
And we hope that with something like that there is better cross-compatibility between different vendors, and the CSMSs can easily integrate with a bunch of different hardware vendors. And next one. All right. So we looked around and didn't find something that we were super happy with, so we came up with the project CitrineOS. It's open source, it's written in TypeScript; I know in this room that might not be the most popular choice, but on the internet it is, so that's why we went with it. It runs on Node. We have an API-based, modular architecture. So, similar to what Achim was saying, there are some microservices, and you can set it up so that, for instance, transactions is super scalable, but maybe provisioning is not as needed. It's released under the Apache 2 license. And most recently it's been adopted by Linux Foundation Energy, and it's in their hands now. Yeah. So in general, we think OCPP shouldn't be something that everyone works on again and again, but a stable cornerstone that you can adopt, that you can drop in where you need it. Because the messages are there, the protocol is well specified, and redoing the same thing... well, I can spend my time better.

So, taking a quick look at what we envision for the system architecture and how it works right now, going from left to right: charging stations connect via WebSockets to the central system. That helps us with scalability; you can have a bunch of different instances of the central system that manage the individual chargers. Then we publish on a message broker. What was important to us is to keep the underlying technology kind of agnostic, so you can set up Kafka, you can set up Pub/Sub, whatever you want. The same with the memory cache: you can use Redis or an in-memory cache; at least that's what we've implemented for now. And then you can adapt whatever interface you want.
And for relational databases, right now we have it hooked up to PostgreSQL, but you can set up whatever relational database you want. Then, down here, comes the maybe more interesting part: our modules. Like I mentioned, transactions is a big one; most of the bandwidth goes there. So we set up the modules based on how much we think they're used. One second, one back. One thing I forgot to mention is that we use Fastify as the web framework to interact with our setup. All right.

So, looking one step further under the hood: we have JSON schema generation, where we take part 3 of the OCPP spec and use it to validate all incoming and outgoing messages, and we generate our TypeScript interfaces out of it. Then, for the implementation of the modules, we work a lot with decorators and metadata about which decorator is used for which message, and that's how we route the messages within the modules. And then one thing that I think is quite nice is that we have some OpenAPI documentation that's generated, and you can easily try out some OCPP messages from the REST API. So you can either use the generated API docs and click try, or use Postman, and just straight up send OCPP messages that then get forwarded to the charger, and our system does the interaction with the charger for you. All right.

So then, looking up and looking at a UI: right now we've hooked it up to Directus, which is an open source project that gives you a nice UI on top of a relational database, which helps with keeping it simple. But you can go crazy on it: you can build your own flows in Directus and do whatever complex things you want. For now, we have it set up so that we have a little testing setup with an app that we whipped up to try charging. Yeah. All right. So where are we at right now? A few days ago we released the 1.0 version, which goes through the OCPP protocol test cases of Core and Advanced Security. We're quite happy that that's working.
It's been working for a while, but we only got to release it recently. Right now under development are advanced device management and an advanced UI. We're also talking to a few other people about integrating payments, and in general we've generated quite some buzz with people who would like to add modules or add on functionality. Moving forward from there, we're looking at ISO 15118 support. And hopefully in July, that's what we anticipate, we'll have the full OCPP 2.0.1 implemented. And then for the future, of course, similar to what Achim was saying, well, you can build on your BI tools or whatnot. And we hope that this is a nice interface to innovate on top of, and not that you have to hook yourself in as a man in the middle or something similar. And I'm really happy that so many people are interested in this topic. So maybe you also want to contribute. We're fairly fresh. You can find us on GitHub; the QR code at the top right goes to our CitrineOS core GitHub page. The first technical steering committee will happen on March 14th. So get involved, join, bring ideas. And we have a Discord server, so drop by and ask questions. Sometimes we're fast, sometimes we're slow in responding, depending on our workload. All right. Does anyone have questions? One simple question. We all know every vendor does its own thing. On the other hand, you generate everything from the JSON schema. So how do you implement extensibility? When an unknown message comes in, do you drop it, or can you handle it in a smarter way, knowing, okay, it's coming from this vendor and therefore I should interpret it somehow? So right now, I believe we drop it. Our major test has been EVerest, and they send normal messages. Am I in the wrong spot? All right. And for the detail on how it will be handled in the future, I'll get back to you on Discord. I have to check with a few people on what's going to happen there.
So you said you can make an API call and you send, for example, a start charging message to the charger. So when you get the API call, do you use Kafka or something, and then from Kafka it goes to the charging station? Yeah, exactly. Okay, that's very cool. I'm also doing that. I've seen implementations where they just write a flag into a database that gets polled, which wastes a lot of time, and I think that's very ugly. I think message brokers are a very elegant solution. Yep, we agree. Okay. With message brokers and 15118 you have very strict timing. How do you ensure that your message broker is not too slow? I gotta punt on that one. I'm too nervous for that right now. I'm sorry.
Power Grid Model: Open source high performance power systems analysis
Hello everyone. My name is Natesh. I work as a scientific software engineer at Alliander. I'm also a developer on the Power Grid Model project, on which I'm going to give a talk now. So it is a high performance distribution grid power system analysis library. Yeah. And the next slide. Oh. Yes. So in this presentation I'm going to cover: why do we need this project, and how did we come to build it? What does the library do, and how does it perform compared to other solutions that are already available in this space? And how do we use it within Alliander, which is a Dutch DSO, for its own products and applications? There's also some talk about open source, since we are open source and we would like new contributors as well. In the traditional way, up until a few years ago at least, power system analysis used to happen within the DSOs in this way: the electrical engineers would usually have some data files, they run the calculation in GUI-focused software with built-in presets for running the calculation, we get only certain results, and then we make decisions on whether to add a new transformer, a new cable and such components to the grid or not. Whether the grid can handle more solar panels or more EVs was decided this way. But now, with the new smart meters and EVs and renewable energy, we have to do a lot more, and for that we have to have all of the data of the smart meters, which is a really huge volume, in a database where our topology and electrical parameters also live. And then we cannot just use a preset calculation method; we have to have some customization available there, and we have to do the calculations in the cloud, because these calculations number in the millions now: we are trying to simulate, for example, an entire year of time series, and the volume increases a lot.
So why did we decide to make this, and what makes a good power system analysis library? Around 2018, Alliander faced a problem where we were not able to do this using any of the open source or commercial software. We ran into these pain points, and then we decided to make a library focused around them. So we needed a well-defined software API. That's because we want this calculation library to be part of a much bigger application that does a lot of things apart from just calculations. We also wanted this library to be cross-platform and scalable so that we can use it in the cloud. And of course, since the volume is in the millions, high performance and parallelization were needed; otherwise you might have to wait a month or so to get results, which is not adequate, and if it's in the cloud it costs you money as well. That was in 2018, by the way. After that, Power Grid Model was inner source within Alliander. We had some applications in 2021, then we made it open source around 2022, and we have a lot of applications now, which I'll cover soon enough. What does the library do? It does calculations: power flow calculations, state estimation and short circuit calculations, for both single phase and three phase grids. We have many algorithms with which we can do this, and that sums up the calculation functionality in a really short way. We have a huge focus on the software side of the library because of the pain points that I mentioned before.
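To make the power flow calculation mentioned above concrete, here is a toy sketch of the idea behind one family of solvers (an iterative current method, which the library also offers) on a single source-load feeder. This is an illustration only, not Power Grid Model's implementation, and all numbers are made up:

```python
# Toy power flow on a two-bus feeder (source -> line -> load): guess the load
# voltage, compute the load current from the specified power, update the
# voltage via the line drop, and repeat until it stops changing.
def solve_feeder(u_source=10.5e3, z_line=0.5 + 1.0j, p_load=2e6, q_load=0.5e6,
                 tol=1e-9, max_iter=100):
    s_load = p_load + 1j * q_load
    u = complex(u_source)              # flat start
    for _ in range(max_iter):
        i = (s_load / u).conjugate()   # load current from S = U * conj(I)
        u_new = u_source - z_line * i  # voltage after the line drop
        if abs(u_new - u) < tol:
            return u_new
        u = u_new
    raise RuntimeError("did not converge")

u_load = solve_feeder()
```

A real solver handles meshed networks, transformers and asymmetric phases, which is where the library's C++ implementation and algorithm choices matter.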
So we have native shared-memory multi-threading, and that enables us to do the parallelization for batches, on as many cores as possible, when we deploy it in the cloud. And yes, the implementation is in C++, and the API for users is in Python if they wish to use it. It's well documented, it's quite stable, and we have the binaries available on PyPI and on conda via conda-forge, and we support Windows, Linux and macOS, all three of them. And since making this library is not enough by itself, we have to show that these calculations are actually correct as well, and for that we have validated the library against some theoretical hand calculations at the start. Then Vision and Gaia, which are commercial software, and also PowerFactory: we validated the library against them, and against pandapower, which is another open source library. So we validated against these software packages, and then we use them as a reference for each new revision of Power Grid Model. It's part of our CI pipeline: if any new feature does not comply with the reference, yeah, it should not be worse than that. How does it perform compared to other libraries? Because, yeah, there are a lot of libraries in this domain, and there are some more presentations about them today as well, and each one has its own specific plus point. The plus point of Power Grid Model is its performance. For the performance benchmark, the link is in the presentation if you wish to run the benchmark yourself. We tried to compare it with pandapower and OpenDSS to get an idea of how it performs, and we found that, compared to pandapower, the calculation is almost 20 times faster, which is a huge boost and will really help in doing these calculations much faster.
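The batch parallelization mentioned above can be illustrated like this. Power Grid Model does this natively in C++ with shared-memory threads; this sketch just shows the shape of the workload: one grid, many independent scenarios (for example one per time step) fanned out across workers. The solver here is a stand-in, not the library's:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_scenario(p_load, u_source=10.5e3, z_line=0.5 + 1.0j):
    """Stand-in for one power-flow solve: returns load voltage magnitude."""
    u = complex(u_source)
    for _ in range(50):                       # fixed-point iteration
        i = ((p_load + 0.3j * p_load) / u).conjugate()
        u = u_source - z_line * i
    return abs(u)

# A year of quarter-hour load values would give ~35,000 scenarios; 96 here.
scenarios = [1e6 + 20e3 * k for k in range(96)]
with ThreadPoolExecutor(max_workers=4) as pool:
    voltages = list(pool.map(solve_scenario, scenarios))
```

Because the scenarios share the grid model but are otherwise independent, this kind of batch scales almost linearly with cores, which is exactly what the time-series use case needs.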
So those were the symmetrical calculations, and the asymmetrical calculations are where Power Grid Model shines as well, because it started as a distribution grid analysis library within Alliander, where this was really needed at that point. So for Newton-Raphson, compared to pandapower, it's around 100 times faster, and with OpenDSS we have to compare the iterative current method, which was four times faster than that library. We have data conversions as well, because we don't have the best data model for storage, and hence we have conversions to CIM and other formats that are used for power system analysis. CIM, because we can then integrate with other applications throughout this ecosystem. And we currently use it within 10-plus applications within Alliander, so it's a mature project at production grade. And yeah, there are many applications: grid planning, automatic network design, monitoring, asset allocation and congestion management. Since I do have some time: within automatic network design, for example, we try to forecast what the effect on the grid of EV growth and solar panels will be over the coming 30-40 years. Based on that we run the simulation, identify the bottleneck, add a cable, run the simulation again, and in this automatic way we design the whole network. That's what this application does. There are actually multiple congestion management applications as well. One is the active one, with which we do real-time congestion management: we take in the measurements from the previous 48 hours and predict whether there's going to be congestion in the coming 48 hours, based on any planned maintenance if there is any. And there's another type of congestion management that we also do.
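The congestion management step described above boils down to comparing a forecast load profile against asset capacity and flagging the intervals at risk. A minimal sketch, with made-up numbers (the real applications run power flows over the whole network, not a single asset):

```python
# Flag the forecast intervals where the expected load exceeds the capacity
# of an asset (e.g. a transformer). Values are invented for illustration.
CAPACITY_MW = 100.0

forecast_mw = [80.0, 95.0, 104.0, 110.0, 98.0, 101.0]  # upcoming intervals

def congested_intervals(forecast, capacity=CAPACITY_MW):
    """Return the indices of forecast intervals exceeding the capacity."""
    return [k for k, p in enumerate(forecast) if p > capacity]

alerts = congested_intervals(forecast_mw)
```

In the real-time application, each of those flagged intervals would trigger an action, such as asking contracted customers to reduce consumption or generation.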
It's based on assessing the measurements of the entire past year, and then what the congestion in the coming year would be, and based on that we might offer new contracts to our customers, because the grid in the Netherlands is highly congested right now. We have a lot of people waiting for new connections, but we can't add them, and hence Power Grid Model really helps in making all of these calculations. On the open source side: you can just use the library and provide feedback; that's a great contribution in itself. Report any bugs as well; that's really helpful too. And you can also do validation of the library with any of the test cases that I mentioned: you can provide more and validate the library. If you have an idea for a new way to shape the API, you can suggest that too, or you can add new algorithms and make the C++ code more efficient; that's also possible. We have a list of good first issues in the repository too, if you wish to have a look. We have a few partners: there are DSOs, TSOs, research institutions, universities and other open source projects as well. The DSOs do use it; Alliander has those products in assets and studies, and is also trying to add it to its operations. That's all from me. Do we have any questions? Hello. Thank you so much. This looks really, really cool. I have one question. Hello, Chris Adams, Green Web Foundation. If I have a new project to build a big solar farm, or put a 100 megawatt data center somewhere, can I use this to model how I might integrate with your grid, to say: this is why you should let me build here, or possibly: this is what the implication is going to be if we keep growing at this pace? Yes, definitely. We do some calculations on our side. We would be able to, I mean, Alliander does it on its side to see if it can integrate the customer.
On the side of the producer, the producer does it so it can identify whether it's profitable to make this investment or not: what would the ROI be in the coming years based on what the grid looks like. That's definitely what producers do, and they do use the model there. Hi, Peter Dutfield from Open Climate Fix. Thank you for the talk. You said some other TSOs have used this? Have you had any feedback from them on how they found it? Well, I said that they are active partners, so they have not actually used it yet. They are TenneT and RTE, and they are looking at whether they could use this model. But some of the core features for TSOs, we still need to add; that's one of the requirements from the TSO side. Once that happens, TSOs would use it as well. But the focus is primarily on the distribution system analysis side. In Germany, we have this: the TSO tells you, please reduce your consumption. Can I use your project for this calculation? Is it fine-grained enough, or is your project just scoped to the complete DSO grid or a larger part of the grid? Can I use it for a single grid connection point, or just for larger parts? Let me think if I got the question correctly. If you do have a single connection point and you wish to use the library, then the motivation would be whether it would be a profitable thing for you, right? Did I get it right? No. The DSO uses your library to calculate that tomorrow there's not enough energy, so it wants to tell some customers: please reduce your consumption tomorrow. Is the library able to calculate this for single connection and grid connection points, so that I can really say, you and you and you have to reduce tomorrow? Or does it just calculate for a large part, or also for a very narrow part of the grid? Now I understood. Nice point. The library does not do that. It just calculates the power flow results: the voltages, the powers.
In one of the applications that I mentioned, the active congestion management, we tell customers to reduce their generation. We have certain contracts within Alliander to do that, but it's not part of Power Grid Model. Yes.
GridSuite and PowSyBl: an Open Source approach to develop advanced tools for grid analysis and simulation of power systems.
Okay. So, hello everyone. I'm Jean-Baptiste, and Geoffroy is here. We are at the software development department of RTE. So, we will give some elements of context. RTE is the French TSO, the transmission system operator. We handle from 20 kV to 400 kV, so the high voltage grid, and we must provide electricity 24/7 for all the customers and all the inhabitants in France, and of course in Europe, because we have to cooperate. And the particularity is that we are the asset owner of the grid, which means that we are responsible for investing and for making sure that the equipment will work, to complete our mission as a TSO. And we are also responsible for adapting the infrastructure to make sure that we ease the energy transition. So, we need some interconnections, and we also try to adapt the grid to connect, for example, offshore wind generators. So, we have many, many challenges in a fast-changing world. We have, of course, a new energy mix, with a big goal around carbon neutrality, sorry, for 2050. So, it's a big challenge. And we also have some codes, some regulation, that bring drastic change, and we must adapt to that. So, it's a whole package where we have a lot of work to do in Europe. And for that, I will read this sentence, because it's very important. What is very important to understand is that today's need is not to build a tool that answers present needs, but to build a tool that is capable of integrating tomorrow's needs quickly and efficiently. And if you go the usual tender route to create new tools, sometimes you write a specification, then you run a tender process, then you ask a vendor to develop, and this cycle is maybe like four years. The problem is that we don't know what we will be asked to do in five years, because everything is changing very fast. So, the strategy for RTE to answer those issues is to use open source.
And Geoffroy will present two tools that are based on open source: PowSyBl, which is one of the first projects of the Linux Foundation Energy initiative, and then what we can build on top of PowSyBl. So, I'll give the floor to Geoffroy to present the tools in detail. So, hello everyone. So, the first project, PowSyBl. PowSyBl means Power System Blocks. Blocks: these are software components that we have as a foundation of many other applications, especially at RTE, where we have something like 15 projects that are based on at least a few components that have been developed in PowSyBl. So, what is the content of PowSyBl? What is it? It is many things, but first of all a way to model the power grid. We have a data model that allows us to build a grid model and to use it to make, for example, some evolution, some change to this grid model, and to study what the impact will be. We also have some components for visualization of the grid that can be integrated in higher-level applications. Also, what is very important is to be able to feed this data model with data. For that, we have converters for standard data formats. The most used and most famous one is CIM, the CIM data model, so we have premium support for a CIM converter, CIM data model support. And also, what is very important is to have interoperability with commercial tools. There are two very widely used commercial tools, PSS/E from Siemens and also PowerFactory, and we are able to import data into our data model from these tools. We also have converters from academic data formats, for example MATPOWER, which is widely used for research and science. And with this data model, we are able to run some analysis functions, for example power flow calculations and security analysis.
So, security analysis, for example, is a nice function that allows you to test what the impact of some contingency will be. For example, we have a line loss, an outage on the grid, and we want to see what the impact of this outage really is on the flows, on the voltages, to see if we have some trouble. We also have short-circuit calculation, which is also very important, and also dynamic simulation, so time-domain simulation. This is why we are integrated with another project of the Linux Foundation Energy, which is Dynawo. So, this is mostly written in Java, and it has been designed to be as light as possible: there is no dependency on a complex framework or anything that takes decisions on how you are going to use it in a higher-level application. So, GridSuite. GridSuite is an example of a tool that is built on top of the components of PowSyBl, and that allows people to do grid studies, and very different studies. It ranges from near-real-time studies, for example security analysis, to long-term development studies. For example, with this tool we can study what the impact of the connection of a new renewable generation power plant on the grid will be, and assess that everything is fine if we connect this generation at a specific place on the grid. So, this is a tool that has moved to production very recently, at the end of last year, since a few weeks, and we still have some very early users. What we plan to have is 400 users in the coming two years. And this tool will replace an existing tool which has been at RTE for 15 years. We have a team of more than 20 developers, and it is a growing team. On the technical side, the stack that we are using is, for sure, 100% open source. This is a microservice-based, very scalable application, based on Java and Spring Boot. We have everything based on REST APIs and also asynchronous messaging with RabbitMQ.
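The contingency screening idea behind the security analysis described above can be sketched with a deliberately tiny example: two parallel lines share a transfer, and losing either one shifts all the flow onto the other, which may then exceed its rating. This is purely illustrative (made-up admittance shares and ratings), not PowSyBl's engine:

```python
# Toy N-1 screening: for each single line outage, redistribute the transfer
# over the remaining lines in proportion to their admittance share and flag
# any line whose post-contingency flow exceeds its MW rating.
LINES = {                      # line -> (admittance share, rating in MW)
    "line_a": (0.6, 80.0),
    "line_b": (0.4, 60.0),
}
TRANSFER_MW = 100.0

def screen_contingencies():
    """Return {outaged line: [overloaded lines]} for each single outage."""
    overloads = {}
    for outaged in LINES:
        remaining = {k: v for k, v in LINES.items() if k != outaged}
        total_share = sum(share for share, _ in remaining.values())
        overloads[outaged] = [
            name for name, (share, rating) in remaining.items()
            if TRANSFER_MW * share / total_share > rating
        ]
    return overloads

result = screen_contingencies()
```

Here the base case is fine (60 MW and 40 MW, both within ratings), but either outage overloads the surviving line, which is exactly the kind of result a security analysis reports.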
On the storage part, most of the microservices are based on PostgreSQL and Elasticsearch. As it's quite difficult to manage such a distributed application with a lot of microservices, everything is deployed on a Kubernetes cluster. On the front-end part, this is a web application using React.js, and we also use a little bit of WebGL for high-performance representation of the grid. So, an important issue that we had is that we have Java components, which are very convenient to integrate into, I would say, a classical enterprise application, where often the backend is based on the Java ecosystem with Spring, Quarkus, or some such framework. This is fine, but we also needed to use these components for high performance, and for the research and data science community, and most of the people in the data science community are in the Python ecosystem. So, the question for us was how to use the same piece of code in these two ecosystems, and how to share the code between Python and Java. What we have done is use another fantastic open source tool, which is GraalVM. GraalVM is done by Oracle, and it is several things, but we are using its native-image component, which allows compiling Java code into native code. Thanks to this, we are able to build a C library for everything that we have in PowSyBl. And with this library, we can build a classical Python extension module based on the C library. So, some useful links. For sure, there are GitHub repositories for both projects, PowSyBl and GridSuite. Maybe I can highlight the two Slack channels; that is the place where we answer questions and discuss with the community. And also, there is an online demo of the application GridSuite. So, if you want to test it, you can do it: we have an instance of GridSuite deployed in the cloud.
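The last step of the pipeline described above, exposing a native C library to Python, can be shown with `ctypes` from the standard library. Since the GraalVM-built PowSyBl library can't be assumed here, the standard C math library stands in purely to show the binding mechanics (the real pypowsybl wrapper is a proper extension module, not this):

```python
import ctypes
import ctypes.util

# Load a native shared library and declare the C signature of one function,
# then call it from Python. libm is a stand-in for any native-image-built
# C library; "libm.so.6" is a glibc/Linux fallback if lookup fails.
path = ctypes.util.find_library("m") or "libm.so.6"
libm = ctypes.CDLL(path)
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

value = libm.cos(0.0)
```

Declaring `argtypes`/`restype` is the important part: without it, ctypes would pass and truncate arguments as C ints and the results would be garbage.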
And you can connect to this one just using, for example, your GitHub account. Also, there is a YouTube video if you want to watch a live demo. So, this is a screenshot of the application. What you can see here, just to explain: on the left side, we have the data manager. Starting from a case, from an initial grid model, we have a way to create a tree of variants, of modifications, that allows us to test different changes to the network. And for all these variants, we can run some calculations and analyses, and then compare which one is the best for us. Then you can see on the right side that we have some ways to represent, to display, the grid. This is a full representation of the French high voltage grid. We have substation diagrams like here, and we have what we call the network area diagram, which is a part of the grid shown in a nodal view. And then we can run some calculations. We also have tables to see the data in tabular form, specific user interfaces to show the results, etc. So, that's it for the presentation, if you have any questions. Or if you want a demo of this tool, we can do it after the presentation, if you are interested in a more detailed view of this tool. I was just wondering what format your network data is in, and whether you could, for example, take in the OpenStreetMap network data and run analysis on it. It's not complete, but can we do that? So, this is not OpenStreetMap, this is Mapbox. But we can change the tile provider to use whatever you want. Here we have used a very light tile style just to have a better view of the grid, but we can use OpenStreetMap. How do you make the link between the grid and the end-consumption items or humans on the grid? Do you go through a machine-to-machine communication system so that people stop consuming?
Or do you make advance polls in order to know the consumption within one hour, within one month? So, this tool works on a snapshot of the grid. That part is done by other tools which sit before this one. We have the SCADA system, for example, which does the acquisition of the measurements and has a database of the grid model. From this, we get snapshots, and then they can go into this tool. Okay? I don't know if I answered the question. How do you handle the stress when, for example, the grid is about to fold? Do you cover any cases with humans at the end of the grid? I don't know. Anyway. Okay. So, we will answer the question later. Thank you.
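Stepping back to the tree of variants shown in the GridSuite demo: the core idea, a base case plus modifications where each variant only stores what it changes, can be sketched with `collections.ChainMap`. This is illustrative only, not GridSuite's or PowSyBl's actual variant model:

```python
from collections import ChainMap

# A base network plus a tree of variants, each storing only its own changes.
# Lookups fall through to the parent, so unchanged values are shared.
base_case = {"line_1_status": "closed", "gen_1_p_mw": 500.0}

variant_a = ChainMap({"gen_1_p_mw": 650.0}, base_case)       # more generation
variant_a1 = variant_a.new_child({"line_1_status": "open"})  # child variant

def run_calculation(network):
    """Stand-in for running a load flow over one variant."""
    return {"ok": network["line_1_status"] == "closed"}
```

Each variant can then be fed to the same calculation and the results compared, which is the workflow the data manager in the screenshot supports.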
LFEnergy SEAPATH - Easier Operations in Electrical Substations through Digital Twin Empowerment
So, hello everyone. I am Paul Le Guin de Carnaison. I am an embedded software engineer at Savoir-faire Linux, and today I'm going to speak a little bit about the SEAPATH project and how we contribute to it. My company, Savoir-faire Linux, is based both in Montreal and in France, and we are experts in embedded software and free and open source engineering. We've been working on the SEAPATH project for the last couple of years. So what is the SEAPATH project? SEAPATH stands for Software Enabled Automation Platform and Artifacts THerein. What is it? We are in a context of energy transition, as you all know, and there are a lot of constraints with these new energy sources. The main constraint is that we have a multiplication of distributed controls: we have more and more power stations, and so we have an increase in the data management needs in these power stations. So the idea is: how can we bring some free and open source software into these power stations? And this is where SEAPATH comes in. A quick reminder on the aim of SEAPATH: the goal is to develop a reference design, at industrial grade, of an open source and real-time platform. SEAPATH provides a virtualization platform, and inside this virtualization platform we can run automation applications for our power stations. So we can host multiple application providers, and this combines performance and safety. In a 10-minute presentation I can't present the SEAPATH project in depth, but my colleague Erwan already did that last year at FOSDEM, so if you're interested, you can watch his presentation. So the main idea of this presentation is how we brought functional tests to the SEAPATH project. And for this, I want to take a simple use case. Here are the power lines you can see out in the countryside, and after a storm, a tree falls on your power lines and two lines touch each other. This is a big issue in your electricity system.
And so you have systems that must cut the current very quickly to avoid any danger to people or to the infrastructure. So how can you have all this safety equipment with SEAPATH? I have a very simple representation of how all of this works. We have, first, a protection algorithm that decides whether or not there is a dangerous situation. This algorithm runs inside a virtual machine, and this is where the SEAPATH project comes in, because this runs inside a SEAPATH cluster, inside a hypervisor, et cetera. And on the opposite side we have hardware which does the monitoring of our architecture. The communication between the SEAPATH cluster and this hardware is done with a protocol you know: IEC 61850. This protocol generates packets that we call sampled values, and this is the communication between the SEAPATH cluster and our hardware. So why did we need functional tests? SEAPATH, as you see, is designed to work on very critical infrastructure, which is power distribution, and if we have an issue on the power distribution, there is a need to protect people and to protect the infrastructure, because electricity is dangerous. So in case of failure, the safety protection must react as soon as possible. This is why we need a very, very low latency for the sampled values that transit through the SEAPATH cluster. And the last thing is that the power distribution in your country is running all the time, you have electricity in your home all the time, so we are in a 24/7 context, and we have to ensure that this latency stays as low as possible at all times. So we are in a system where determinism is the primary goal. And so we have a big infrastructure, we have expensive items, and so maybe you are wondering how, in our labs, at our desks, we can simulate this whole chain simply.
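Because determinism, not average speed, is the goal stated above, a latency test has to report the worst case and a high percentile rather than just the mean. A small sketch of the metrics such a functional test would compute from measured per-packet latencies (the numbers are invented for illustration):

```python
import statistics

# Made-up per-packet latencies in microseconds from one test run.
latencies_us = [82, 85, 80, 91, 84, 83, 88, 86, 81, 87, 95, 84]

def latency_report(samples, budget_us=100):
    """Summarize a latency run: for a deterministic system, max and a high
    percentile matter more than the mean."""
    ordered = sorted(samples)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "mean_us": statistics.mean(ordered),
        "p99_us": ordered[p99_index],
        "max_us": ordered[-1],
        "jitter_us": ordered[-1] - ordered[0],
        "within_budget": ordered[-1] <= budget_us,
    }

report = latency_report(latencies_us)
```

A single outlier above the budget fails the run even if the mean looks excellent, which is the right behavior for a protection chain.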
So this is the work we've been doing. I show here a very simple scheme of how we can reproduce this protection chain in our lab. The first piece is what we call the publisher machine, and the goal of the publisher machine is to generate the IEC 61850 sampled values. Then we have the SEAPATH cluster, which is composed of two parts. The first is the hypervisors, which run virtual machines, and then the virtual machines, which run all of the software: an SV client receiver that will process the sampled values that have been sent by the publisher, and a protection algorithm which decides, based on these sampled values, whether we have an issue or not. I show here a setup with two hypervisors and three VMs, but it could be a totally different architecture. So what tools did we use to do that? First, on the publisher machine, we use the pcap format. This is an ideal format because we can reproduce network traffic, and for example we can reproduce what could happen on the electricity infrastructure, for example a 50 hertz electrical signal, and then we replay it with some tools; here I use tcpreplay to send these packets with the spacing we want. We can use some PTP packets to synchronize all of this, but keep in mind that it's not required: PTP is only used on SEAPATH when you wish to use some SEAPATH features such as VM synchronization and VM migration, but it is not an obligation. Then, on the SEAPATH cluster, we have first the hypervisor side, where we have to achieve very low latency. First, we have some CPU core isolation: we have done some work to dedicate some cores only to the Linux system that runs on the hypervisor, and to isolate some cores only for the VMs which are running, and we also do some IRQ and process isolation inside the Linux kernel, to be sure we have priority for some applications, etc.
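The publisher side described above, generating a 50 Hz signal as sampled values, can be sketched by computing the per-sample timestamps and instantaneous values one would pack into SV packets. The 80-samples-per-cycle rate (4000 samples/s at 50 Hz) is the usual protection-class rate from the common 9-2LE profile; the amplitude is arbitrary:

```python
import math

# One period of a 50 Hz current waveform at 80 samples per cycle: the
# (timestamp, value) pairs a publisher would pack into IEC 61850 SV packets.
F_NOMINAL = 50.0          # Hz
SAMPLES_PER_CYCLE = 80    # -> 4000 samples/s, one packet every 250 us
I_PEAK = 1000.0           # amperes, arbitrary test magnitude

def one_cycle():
    dt = 1.0 / (F_NOMINAL * SAMPLES_PER_CYCLE)
    return [(k * dt, I_PEAK * math.sin(2 * math.pi * F_NOMINAL * k * dt))
            for k in range(SAMPLES_PER_CYCLE)]

samples = one_cycle()
```

The 250 microsecond spacing between packets is exactly what tcpreplay has to respect when replaying the pcap, and why the latency budget for the whole chain is so tight.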
We also did some BIOS optimization, which is very hardware dependent, but there is a lot to do: there are features that are very bad for determinism, and you have to disable them. Then on the virtual machines, it's much the same work that we did on the hypervisor side: all of the CPU and IRQ isolation, etc. We use what is called PCI passthrough, and this is a very interesting feature because it allows us, directly inside the VM, to take the packets which are received on the network interface of the hypervisor, and this brought good performance. And finally, we can also use SR-IOV, which can be used if you have multiple virtual machines, but keep in mind that even if it gives better results, this is an optional feature. Thank you for your attention, and please let me know if you have a question and I will answer it. Thank you. Have you got any examples of real-world adoption of this? So you are asking if we have concrete deployments of the SEAPATH project. Currently, no, we don't have any concrete deployment, because SEAPATH is currently in early adoption, but we have good opportunities for the future. Currently it is not in production, if that's your question, but that's the goal of the SEAPATH project. If you don't have a grandmaster clock, what is the source of time? I will maybe let Matthew answer this question about PTP. In production we have to have a grandmaster clock, but for testing we just use Linux PTP as the PTP clock. Thank you.
OpenSTEF: Open Source Short-Term Energy Forecasting
Hi everyone. Thank you for having some patience with me; computers are not my strong suit, although I am in IT. My name is Sunita Rijder, I am the community manager of OpenSTEF, and I work at Alliander. So let's get into a little bit of background. Alliander is a distribution grid operator: we are responsible for the distribution of energy, both electricity and gas, in about a third of the Netherlands. I think we all know these kinds of graphs. This shows energy consumption at some location in the Netherlands. However, we have no idea what's going to happen in the future. Well, this is where OpenSTEF comes in. OpenSTEF stands for Open Short-Term Energy Forecasting, so instead of a question mark, we actually know what's going to happen. After this very short introduction, let me tell you what I'm going to talk about today. First, I'll start with the challenges on the grid and why we actually need OpenSTEF. Then I'll talk about OpenSTEF itself, of course. And finally, I really want to discuss our recent developments and collaborations. So, the challenges on the grid. When everything was still good and easy on the electricity grid, it looked like this: on the left you see one big producer, a one-directional energy flow, and then our consumers. Fairly easy. However, due to the energy transition, as I think you're all aware, it now looks like this. Very chaotic. On the production side, we have distributed production due to solar and wind, both on the medium and low voltage grid and at our consumers. And on the consumption side, we have the issue that consumption has increased enormously. We heard a lot about EV charging earlier; well, those electric vehicles need electricity through the grid. And this is where our capacity issues start. So this is a map of the Netherlands, and I think you can all guess that red is bad. In the red parts, we actually have no capacity available.
So let's say you want to start a company in one of these areas: we cannot connect you. You get no power from us, because we simply have none to give. But of course, we're all very smart and clever people, so we have some solutions. One of these solutions is to shave the peak if we expect grid limitations to be surpassed. In the left image, you see a forecast of the load on, for example, a transformer. We see a very clear peak, and this is where our grid limitations are surpassed. So our solution is simply to shave the peak. For example, if this is production, we just ask one of our solar farms to shut off for a little while. Of course, they get money for this, but that's another story. And then this is the result: our grid limitations are not surpassed and nothing breaks. Great. But to be able to do this, we do need that left image, so we actually need accurate forecasts, and this is where we have OpenSTEF. Again, OpenSTEF stands for Open Short-Term Energy Forecasting, and let me explain a little bit more about it. First of all, what is it? Well, it's a complete software stack to forecast the load on the electricity grid. But it's an energy forecaster, so it could also do it for heat. And it's automated machine learning pipelines: a step-by-step process, which is automated, to make a forecast. The dark blue boxes show everything that OpenSTEF can do, and I'll talk a little bit more about them. So what does the software look like? First of all, you need a database; this is one that you have to provide yourself, of course. But we do have OpenSTEF-DBC, the OpenSTEF database connector, which is able to get all of your data from your database. And then we get into OpenSTEF itself. I already talked about pipelines; of course, these are in the software overview, and they are part of the task orchestration. Then we have data preprocessing, which includes data validation.
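The peak-shaving idea from the slide above can be expressed in a few lines: given a forecast load series and a grid limit, clip the series at the limit, and the difference is the flexibility that has to be bought (for example by asking a solar farm to switch off). This is a toy illustration, not OpenSTEF code:

```python
def shave_peak(forecast: list[float], limit: float):
    """Clip a forecast load at the grid limit and report how much
    power has to be shifted or curtailed at each time step."""
    shaved = [min(load, limit) for load in forecast]
    curtailed = [load - s for load, s in zip(forecast, shaved)]
    return shaved, curtailed

# Forecast load on a transformer (MW), with a 10 MW limit:
shaved, curtailed = shave_peak([6, 9, 12, 11, 7], limit=10)
# shaved    -> [6, 9, 10, 10, 7]
# curtailed -> [0, 0, 2, 1, 0]
```

The point of the accurate forecast is exactly the `curtailed` series: it tells the grid operator when, and by how much, producers or consumers need to be asked to adjust.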
For example, if we see a flat line in your input data, we are able to filter it out. And then there is something very interesting: feature engineering. With feature engineering we are, for example, able to calculate the wind speed at the height of a wind turbine from the wind speed at ground level, and we're also able to calculate the lagged load for each timestamp. Then of course the machine learning pipelines, so we have some machine learning in there. We're using open source models such as XGBoost to build our machine learning models. We're able to train, optimize hyperparameters, and of course make a forecast. We're also able to make a split forecast with our Dazls model. And finally, we are able to evaluate our forecasts, store our model, and do some post-processing. So let's look at the methodology at a really high level. On the left, we have our target load; this is what we actually want to forecast. Then we have some external predictors: our weather forecast, market prices, and typical profiles of companies and households. From these external predictors, we can calculate derived features; this is the feature engineering I just talked about. We're able to calculate lagged loads for each timestamp, but also derived weather features such as the wind speed at the height of a wind turbine. And there's also calendar info: it really matters whether you're forecasting on a Sunday or Christmas compared to a Monday. And then we can train a single model for all our lead times. Here you can see what the data looks like, for example: a datetime index with increments of 15 minutes, our target, and the external predictors; you can also see that we have the Dutch energy prices in there. If you have multiple training horizons, we simply duplicate our data and use it for each training horizon.
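The two derived features mentioned above, lagged loads and hub-height wind speed, can be sketched as follows. OpenSTEF's actual feature engineering is more elaborate; this is a minimal stand-alone illustration, using the standard power-law wind profile with an assumed shear exponent of 0.143 and assumed measurement and hub heights:

```python
def lag_features(load: list[float], lags: list[int]) -> dict[int, list[float]]:
    """For each lag (in time steps), shift the load series so that every
    timestamp gets the load observed `lag` steps earlier (None if unknown)."""
    return {lag: [None] * lag + load[:-lag] for lag in lags}

def wind_at_hub_height(v_ref: float, h_ref: float = 10.0,
                       h_hub: float = 100.0, alpha: float = 0.143) -> float:
    """Extrapolate wind speed measured at h_ref metres to hub height
    using the power-law wind profile v(h) = v_ref * (h / h_ref) ** alpha."""
    return v_ref * (h_hub / h_ref) ** alpha

lags = lag_features([1.0, 2.0, 3.0, 4.0], lags=[1, 2])
# lags[1] -> [None, 1.0, 2.0, 3.0]
v_hub = wind_at_hub_height(5.0)   # ~6.95 m/s at 100 m, from 5 m/s at 10 m
```

Lagged loads give the model "what happened yesterday at this time", and the hub-height wind speed is a much better predictor of wind-park output than the ground-level measurement.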
If there are questions about this, please ask me in the break; I don't have time to go into it in 15 minutes. With our trained model we can now actually make a forecast. And of course, we want it to look nice, so we have this beautiful Grafana dashboard, which summarizes all of the information that you need for your forecast. Let's look into it. First and foremost, our forecast. The red line on the left is the load that has been historically measured, and the yellow lines are our forecast. Now, you see that there are a lot of yellow lines. What do those mean? Those are actually the quantiles, so you have a certainty estimate in your forecast. This can be useful: at a location where you're quite sure what your forecast is going to be, you can use one quantile, and at a location with a lot of factors you don't know anything about, you can go with another quantile. Also very nice is our feature importance plot. Here we can see our lagged loads and some other features, and this is nice because for every location you can see which features are important for your forecast. For example, here we see radiation; I don't think it's readable for you, but it says radiation. So you know that there are quite some solar parks or solar panels behind, for example, your substation. Wind speed is nowhere to be seen, so probably no wind turbines in that area. So this was a really short overview of OpenSTEF. Let me see how much time I have left. Six minutes. Perfect. Okay. So, community and upcoming events. One of the main things that has really changed in OpenSTEF this last year is our community. Before, it was just Alliander, who created it together with RTE, working on OpenSTEF. And now it looks like this. Let me go over every company really quickly. Alliander, that's where I'm from; I've talked about that enough.
RTE has been working on OpenSTEF for quite a while, and they're ready to implement it very soon. RTE International just joined us this year; they have a very nice proof of concept and they're going to work on it further. Fidel has actually been using OpenSTEF for quite a long time. I've heard the term "leeching" today; well, that was Fidel until about a month ago. So we contacted them, and they said: oh yeah, we found some bugs, we fixed them, we can implement this. So they actually joined our community as of this year. Sigelman is still working on a proof of concept, seeing if they want to replace their own forecasting model with OpenSTEF. And Shell is working on OpenSTEF-DBC, seeing if they can use their method of data import. Now, I hope everyone feels like trying OpenSTEF. Well, you're in luck, because we are organizing a workshop: on Friday the first of March, from two to four. I would like everyone who's interested to join. You'll get a better introduction to OpenSTEF and a little bit more of the technical details. It will be virtual, and you will get a real hands-on experience: you get some example notebooks from us with exercises, and you can actually make your own forecast with OpenSTEF and see how easy it is. If you want to sign up, just scan the QR code over here; I also have it on the next slide for people who are too slow. If you want to know more about OpenSTEF, maybe even before you sign up for the workshop, we of course have our GitHub, website, documentation, etc. You're only one command away from using OpenSTEF. And if there's anything you want to ask, or any comments, you can just send me an email or a message on LinkedIn. So thank you for your time, and I welcome any questions. Who's running the microphone? I'll try to do my best; please feel free to gesture so I can find the best path.
Hello. First of all, thank you so much, this was very interesting. I have no experience; I had never heard of OpenSTEF before reading about it on the FOSDEM website. I have one question about the data collection. Do you provide some examples or standards on how and where to fetch data? Because finding data sources is hard; I tried, I looked. Very good question; I think this is something the community indeed struggles with. For the Netherlands, we actually do have those sources, because we are using them ourselves. For other countries, we are working on seeing if we can find some open data for everyone. But if you're interested, you can always send me an email and I'll see what we have. Yeah, great. Hi, it's Miné, I'm from Red Hat. So obviously I will ask the question about scaling this, right? How will you standardize and scale this? It sounds super interesting, but how are we going to scale this to 49,000 substations, or millions of smart meters at home? Very good question. This is actually something we're working on right now. We are deploying our OpenSTEF stack on Dexter probably anytime soon, and seeing if we can scale from that. Currently we have it scaled up to, I think, 100 substations. And if you're curious, we have a reference implementation on our GitHub, and you can see all the information there on how we deploy this. Thanks. I have a question about the data sources. Has any thought been given to adding geographical information system data into the system for the forecasting models? Because things like wind and solar radiation depend not just on the time of day and the wind speeds, but on the location itself. Great question. For our system, it connects to the closest KNMI station; that's the Royal Dutch weather institute. So it's able to find the closest station to where you actually want to forecast. So it definitely takes location into account.
We have a prediction job class where you put in all of the information for your forecast, including the latitude and longitude of your location. So it does take that into account. Question over there. Thanks for the question about the geographic data, because I was thinking about an approach of just using cheap Raspberry Pi weather stations in Austria, distributing them across some locations to fetch the data, because I have the Google Weather API and the OpenWeather API or whatever as comparison values. And for the geographic thing, thanks for the question. How would you connect that? Is this planned for OpenSTEF? Did I miss this? Yeah, thanks; that's kind of a difficult question, because I don't know the answer. I'll ask my colleagues who actually made this part of OpenSTEF, and I'll get back to you if we connect afterwards, so then you'll know. But it's very interesting, the Raspberry Pi idea. Thanks.
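The prediction job mentioned here can be pictured as a plain record that bundles everything one forecast needs, with the location travelling along with the job. The field names below are illustrative, not OpenSTEF's exact API (OpenSTEF defines its own prediction-job data class):

```python
from dataclasses import dataclass

@dataclass
class PredictionJob:
    """Illustrative bundle of settings for one forecast location."""
    name: str
    lat: float                    # latitude of the (sub)station
    lon: float                    # longitude, used to pick the nearest weather station
    resolution_minutes: int = 15  # forecast in 15-minute steps
    horizon_minutes: int = 2880   # forecast 48 hours ahead

job = PredictionJob(name="substation-A", lat=52.37, lon=4.90)
```

Because the coordinates are part of the job, downstream steps (such as fetching weather data from the nearest station) can be location-aware without any extra configuration.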
Unleash the Power of Flexibility with Shapeshifter: A Universal Flex Trading Protocol
Alright, let's start then. Tom and I have come here to tell you about Shapeshifter, a project we've both been working on. For me, I think the first time I got in contact with it was five years ago; for Tom it's one or two. Well, this is us. I work as a freelancer; I actually used to work at Alliander, where the previous presentation came from, and I'm freelancing right now. Tom, just like me, is a member of the TSC, the Technical Steering Committee of this project. We're here to tell you something about it. We have some technical issues; this is not great. Anyway, I'm going to talk to you about the problem we're trying to solve, what Shapeshifter is, how it works, and also how to use it. I do hope the screen will come back. Right, it's funny actually, the slide you saw just now. It's actually interesting to see the trend here, because just two years ago we had some areas where there were issues with congestion on our grid, and right now it's almost impossible to get anything going. Really, to get your data center up, or get your solar panels up; it's just almost impossible. Where the previous presentation was about forecasting, we're actually on the other end of things. With the trends that she talked about, production and consumption are getting more simultaneous, generation is getting less centralized, and there's a lot of electrification. This means we're slowly getting into trouble. So what does it look like if you draw it in a graph? This used to be our situation. Quite simple: we expected use up to the green line, capacity was the black line, and actually the grid companies oversold their capacity a little bit because, well, it's not usually all used at the same time. Really annoying, sorry for this. Let me pull out the clicker, maybe that's the problem. Yeah, it strangely shows up on the second screen there, but apparently the beamer's not very happy with it. Alright, almost back.
So it still wasn't that bad yet. I'm going to stand on this side so I can at least click on the laptop. At some point, it got so bad that our grid capacity is reached, and if this happens, two things might happen. Either your transformer is somewhat overloaded but keeps working; if that happens too often, it will blow up, or its lifespan will be shortened by a lot. So we try to prevent that from happening. We can also fix this problem by making sure the load doesn't go over this peak anymore. How do we do this? I'm just going to try and explain it without the screen. There are some major users on the grid, let's say above 1 megawatt for their connection, that actually have some influence on when their energy is used. Let's say a giant freezer, or a battery they have stored somewhere. In the case of the freezer, for example, it doesn't really matter exactly when it runs, as long as the temperature stays between minus 25 and minus 20. This means that if they don't use power at the peak, you can make sure the problem is fixed. We're going to try a different approach for the screen; give them two minutes maybe. So, really any major company. In the beginning it used to be voluntary, but at some point the problems got bigger and bigger, so we're getting to a state where it's actually allowed by Dutch law to force certain companies to comply with these rules, to make sure that we're not having blackouts and that the grid is safe. We've got a second screen, we're back. All right, thank you. That was kind of stressful. So, the previous presentation, OpenSTEF, was really about creating insight. Where is your problem? When is your problem? How big is your problem? The insight is just creating the graphs. You can then identify problems by having the experts at your company determine, okay, at which points, how much load can our transformers handle? How much can our cables handle?
Once you have that, you can say: okay, these are risky areas, because we see in our forecasts or in our measurements that we're nearing our limits in certain cases. The third step is really an interesting one: you can choose a solution, because in the grid there are, well, not hundreds, but at least tens of possibilities to alleviate the pressure on your grid. And Shapeshifter, which we're talking about here, is one of them, in the market-based direction of solutions. Tom will tell you a bit more about the contents of Shapeshifter itself. Of course, you need to activate a solution once you've chosen it. And the last part is actually kind of interesting too: if companies say they've decreased their use, how do you actually determine that that's true? Because power use is not stable anyway, so how do you know they've kept their promises? Well, that's where Shapeshifter comes in. Do you mind if I talk about this one? It was founded in 2014 by a couple of grid operators, but also IT companies and consultants, because it was a very common problem in Europe altogether. It was started in the Netherlands, because we've got a very, very stable power grid and we like to keep it that way; that's where the need came from. It's actually been put into practice by the Dutch DSOs. I worked on the first pilot project, where we made sure that we didn't have to put in a temporary cable, which would have cost 2 million euros for just three years, because by then the new transformer would have been in place. So it was a very nice project to work on. The USEF standard, that's the Universal Smart Energy Framework, has been updated a couple of times. In the meantime, the protocol was relabeled to the Universal Flex Trading Protocol, but that abbreviation, UFTP, conflicted with an existing file transfer protocol, which was kind of annoying. So it was renamed to Shapeshifter, and hopefully we'll keep it that way for a little bit.
And I think now it's Tom's turn to talk about what it does. Or should I still take this one? I'll go into this one as well. In the grid, there are a lot of roles defined, and the Universal Smart Energy Framework actually defines all of these roles. The most important ones for Shapeshifter are the DSOs, aggregators, and prosumers. Aggregators are companies that usually provide some form of IT with which they can trade on behalf of companies: prosumers, or consumers, the parties that actually have the physical flexibility. The aggregator will be the party participating in the flex trading protocol. It's possible that those two are the same; it's not necessary that the roles are separated. A prosumer may actually take on the aggregator role themselves, because they're big enough to just be a party in this flex trading protocol. The DSOs and TSOs are the parties that send out the flex requests, which Tom will probably talk about, into the market. All the aggregators then have time to respond to the request: okay, apparently tomorrow from 6 p.m. until 7 p.m. there's congestion on the production side, so please, if you can use extra power during that time in a certain region, we will actually pay you for it. That's an example of a situation you can handle with Shapeshifter. So, some typical flex applications. Solar parks, we talked about them; wind parks; freezers, which I think is a very interesting one. I'm not even sure if it has been done before, to actually use a giant freezer as energy storage for grid stabilization, so I thought it was a very cool use case. Farms with solar, close to the solar parks, of course. Steel mills, as they are major energy users and not necessarily time-specific in when they consume. And really, in most of our early projects, we talked with greenhouses a lot, because they have both large electricity connections and gas connections.
So they can switch really easily, or can even provide power back to the grid if needed, so they are very nice from a flexibility perspective. Then I'll hand it over to my colleague Tom. Yeah, so, in the time remaining: the Shapeshifter project consists of a specification, which is published on GitHub, so you can scan the QR code if you want. The specification is one part of the project; the other part is the XML schemas which are defined along with it. We are organized through a technical steering committee; we are part of that committee, along with a couple of members from the UK DSOs and from the Dutch DSOs. There are also two Shapeshifter implementations already. There is a Java library which is in use by Alliander and also at GOPACS, which is the congestion platform we are using in the Netherlands, and there is a Python implementation which is used by another DSO. Everything is published under the Apache 2 license. We are currently focusing on improving our processes, our quality control, etc., to meet the OpenSSF best practices. And we are part of the LF Energy initiative. I will skip this for now. So we have a couple of implementations already that are using Shapeshifter. There was a demonstration project in the UK called Fusion, with good results from that. As I said, GOPACS in the Netherlands. There are also some congestion service providers implementing Shapeshifter on their end; those are the CSP, aggregator-type parties. And of course the grid operators themselves are using it to facilitate trading of flexibility. I'll skip this for now. Just one simple example of what the protocol looks like. It is an XML-based protocol, where the DSO can indicate what flexibility is required on the prosumer side, to which the prosumer can reply with one or more offers, indicating that they will be able to offer some flexibility to the grid operator.
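To make the exchange concrete, here is a hand-written sketch of what such a flex-request message could look like as XML, built with Python's standard library. The element and attribute names are simplified illustrations, not the actual Shapeshifter schema (the real XSDs are published alongside the specification):

```python
import xml.etree.ElementTree as ET

def build_flex_request(period: str, start: str, end: str,
                       max_power_w: int) -> str:
    """Build a simplified flex-request message: the DSO asks for the
    load to stay below max_power_w between start and end."""
    req = ET.Element("FlexRequest", Period=period)
    ET.SubElement(req, "ISP", Start=start, End=end,
                  MaxPower=str(max_power_w))
    return ET.tostring(req, encoding="unicode")

msg = build_flex_request("2024-02-04", "18:00", "19:00", max_power_w=500_000)
```

An aggregator receiving such a message would answer with one or more flex offers for that same period, priced per amount of flexibility.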
And the grid operator can then in turn reply to this message and say: okay, I want to use your flexibility during these hours to solve a specific congestion problem. One minute left. Current challenges we have in the project: we are trying to get as many of the prosumers involved as we can, so one of the challenges is to keep a low entry barrier for these parties to connect using Shapeshifter. Security is of course a very important topic: how can we keep it secure but also keep the barrier really low for entering into this type of integration? And the other thing is that we need more contributors; that's pretty much the story for every open source project. But there are different ways you can contribute, so if you find this interesting, please take a look at our GitHub and see if you can improve one of these items, for example. Yeah, that's pretty much it. Any questions? I see one in the back. Oh, there's no time. Sorry. Yeah, just come to the front and then ask your question.
OpenSCD: Everything Everywhere All at Once
So, yes, you can hear me. Hello, everybody, and welcome to my talk: OpenSCD, Everything Everywhere All at Once. My name is Tamas Schuss. I'm a software engineer and the lead of domain development at SprintEins; we are a software company in Stuttgart, in the southwest of Germany. And I would like to talk about... Hello, hello. Yes. Okay. First of all, what is OpenSCD? Just to give you context: a brief introduction to its history, how it came to be, and where it is today. Then the goal of the talk is to discuss the challenges we have as a community, which approaches we took, and which approaches we are thinking about. I'm just going to talk about the technical approaches today. So let's start. What is OpenSCD? It's an open substation communication designer; we'll get into what that really means later. And it's also an IEC 61850 tool. I don't know if everybody knows that standard, but I'm going to explain it in a few sentences. It's a progressive web app, so it's browser-based, and it's also a platform; we think about it as a platform rather than an app. Okay. So just for quick context, probably everybody knows this, but for the recording: this is a substation, an electrical substation, and it converts high voltage to low voltage and vice versa. And IEDs, intelligent electronic devices, monitor and control the substation, so that everything works as it should. IEC 61850, which I mentioned before, is a communication standard that describes, or specifies, how these devices should communicate, or how you should design the communication between these devices.
So OpenSCD does something with this. How it came to be is that at OMICRON, one of our good friends, Jakob Vogelsang, first created a Java app, because he wanted to help his colleagues and his team create multi-vendor projects: every vendor had its own tool, and they all interpreted the standard a bit differently. So Jakob tried to create something where you can agree on a software level, so not just agree on a specification but also agree on the implementation. Later on, Christian and Danyill joined the team, and they restarted the project as a progressive web app, because they saw how hard it would have been to deploy and distribute a Java app to everybody. They saw the web platform as a nice way to distribute the software. Then the project started to grow: Alliander and RTE joined, then Transpower from New Zealand and TransnetBW, also from southwest Germany; we joined with them and had to create a few plugins for them. And now I'd like to think that we are at the scaling phase. Just last year, a colleague from Alliander, Pascal Wilbrink, and I took over the maintenance of OpenSCD, and just last week we were accepted into LF Energy. We are very happy about that, and we are looking forward to the onboarding process and getting to know all the other projects too. With scaling, of course, come the scaling problems I think everybody has: we have more interest in and more usage of the project, and we face a few challenges. First, to get back to the title: everything. What we see is that if a tool doesn't provide all the tooling to design substations, then people are just going to use other ones. And then we are right back where we were at the beginning: the specification is maybe interpreted differently by each tool, and these designs, these files, are not going to be as exchangeable as we would like.
So what we see is that in order to be successful, we need to provide all the tooling, all the features, that the users need. The next part is that we have to provide it everywhere, otherwise a standard couldn't really work. It's already bad enough that this IEC standard is not freely accessible to everybody; if even the software that uses it isn't accessible to everybody, then it's never going to work. So at least we are trying to change what we can: we would like to really make it available to everybody. And "all at once" means that, as you may know, in a multi-stakeholder project where everybody has their own deadlines, roadmaps, and timelines, everybody tries to prioritize their own needs over the others', because it makes sense. This is also what we are facing with all the TSOs: everybody just has a different need. Not every problem is solvable with technical solutions, of course; we try everything out. But today I would like to talk about just the technical ones, because otherwise we would be in the wrong room. One is web standards: it's really important to use them. We depend on them for flexibility, performance, and of course long-term maintainability. Then the plugins: we have a plugin system, and I'll get deeper into these topics in a bit. The plugin system helps you customize for every use case you would like. And also distribution, which is one step further: you can have your own version of the whole system. So, web standards: how do they help us? As I mentioned, OpenSCD is a progressive web app; it's browser-based. What we also need is offline usage capability, because not every engineer has an internet connection on site, or they would like to browse or design the digital substations on the go. So this is a really big point for us. And also, as mentioned, installing an app is not really possible.
That's especially true at utilities and TSOs, because IT just doesn't like to install apps. So providing it in the browser is a nice way, if you have an internet connection, to get it. Because it's a progressive web app, you only have to visit it once, and then you have it and can use it at any time. The next one is custom elements, as a web standard. We use it for the plugin system and for a few other things. Why is it important for us? Because, again, it's a standard: if you can compile to custom elements, then we are fine, then you can create your own plugins. That leads to technology independence, because we don't really mind what you are using. For example, OpenSCD is mainly Lit-based, but we have, for example, Svelte plugins at SprintEins: we created Svelte-based plugins for TransnetBW and just compiled them to custom elements, and everything works fine. This is also really nice to broaden our perspective and, let's say, the developer pool, because no company has to stick to one technology. Every company can pick whatever they are best at, or have knowledge of, and just use it. I'm going to show in a bit how easy it is. So let's dive into the plugin system. This is OpenSCD, and almost everything is a plugin. The menu items, for example, are all plugins. As an example, the Open Project plugin by default opens a file locally from your PC. But, for example, our sister project CoMPAS, which is also an LF Energy project, re-implemented the Open Project plugin so that it opens files from the server. You can do this with everything else; saving, of course, makes sense too. Then the next one is the editor plugin. This is basically the main content that you see in the middle, and also in the tab bar on the top, where you can switch between the plugins. And the editor plugins are the plugins that can really manage the...
Yes? Oh yeah. Yeah. Thanks. So editor plugins can really manage and modify the design. And what you don't see, which is a good thing, are the validator plugins. By default we have the standard XML validators for the standard, but you can of course create validators that check for semantic meaning. That means that if you have, for example, a naming convention at your company, you can create a validator for it, and it will then tell you if the name of a device is not correct. Right. So how can you create plugins? It's really simple, I think. It's just an unregistered custom element. That means, as you can hopefully see, it's just the standard way of creating custom elements. That's everything we need; we don't need much more, because this we can load and use. And basically in this function, you can see almost everything we need. At the top, highlighted, you can see... okay, maybe it's too small, but at the top we create a custom plugin tag: a custom HTML tag name for every plugin. This is just to make sure that no plugins collide, and we do it by hashing the source. So you can have as many instances of your plugin as you want, if necessary; only the source URL has to be different. In the next step, we load the custom element, the JavaScript file, and define the custom element with the already generated tag name. Then we render the element, put it in the HTML, in the DOM, and give it a few props, a few attributes, so it has something to do. And the result is going to look something like this: you have OpenSCD, and inside it we have this plugin with a random, hash-generated HTML tag. So this is another example. This is one of the plugins we created. On the left, it's just a small Svelte component that wraps around another component.
And on the left we have this relatively small wrapper custom element, and the main thing it does is, here, basically deploy or start this Svelte component. And why is Svelte pretty good for this use case? Because it doesn't really have a runtime. So even if you have Svelte, so to say, in every plugin, you are not going to have anything too big, because it really compiles down to basic JavaScript. In principle, something similar would also be possible with React, because React bootstraps in a similar way. The only thing is that with every plugin you would then load React—actually the whole library with it—which sounds like a problem, but to be honest, once you load it with one plugin, it is cached and you're not going to load it every time. So even React would be fine. So the last thing, I think: the distributions. One of the solutions we are currently trying out—what already works—is that you can deploy OpenSCD yourself. You can take it as it is today and deploy it on your own infrastructure. It is just a web app, so it's pretty easy, and it's yours. The other one is add-ons. We are currently working on providing building blocks so you don't have to use everything; you can use just some of it, and it's easier to recreate and modify. For example, the plugin system; there is a history system where you can undo and redo your actions; and also saving the project and editing. All of these you could replace yourself, and make it so that, for example, the editing doesn't happen in the browser but gets sent to a server, to the backend, and everything happens there. So this is what we are working on to increase the flexibility again. For the CoMPAS project it's necessary to create new add-ons—right now they use a fork of OpenSCD, but that's not the best solution, so we would rather provide building blocks so you can put together your own platform.
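The wrapper pattern just described—a thin custom element whose only job is to mount a compiled framework component when it is attached to the DOM—might look like the sketch below. The `Editor` component class and its props are assumptions for illustration; the mount/destroy calls follow Svelte's classic component API:

```javascript
// In a non-browser environment fall back to a plain class so the sketch parses.
const Base = typeof HTMLElement !== 'undefined' ? HTMLElement : class {};

// A thin wrapper element: it only mounts a compiled framework component
// (e.g. a Svelte component, imported here as the hypothetical `Editor`).
class EditorPluginWrapper extends Base {
  connectedCallback() {
    // Svelte's classic API: new Component({ target, props }).
    // `Editor` and the `doc` prop are illustrative assumptions.
    this.component = new Editor({
      target: this,             // render inside the wrapper element itself
      props: { doc: this.doc }, // hand over the SCL document
    });
  }
  disconnectedCallback() {
    this.component?.$destroy(); // Svelte's cleanup call
  }
}
```

Since the compiled Svelte output has no runtime library, each such plugin ships only its own code, which is why the per-plugin overhead stays small.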
And what we saw is creating your own plugins. You can do it today, at any time, and the nice thing is that you can load the plugins from your local PC, so you can keep them on your PC where nobody else can access them. But of course that's not the nicest way to do it, so you can deploy them anywhere, and you can install them in every distribution. So we already have a few distributions, and we already have a few plugins that we use everywhere, even though they are not developed by the same teams. So it's always a nice way to reuse the work of others. Yes, so I was a bit quicker than I thought. Maybe we have a few questions. If you want to get in contact, we have of course the OpenSCD organization on GitHub, we are part of LF Energy, we have a website, and you can try out OpenSCD at openscd.github.io. Thank you. Is there a question in the room? We have plenty of time, 10 minutes, so if you want to ask a question, of course, to Tamas—but other speakers are still in the room too, so feel free to ask them an energy question. But of course, priority to OpenSCD. It's post-break, that's why. Everyone's a bit tired; we understand perfectly. We can ask questions to the audience instead. Oh, that's nice. Let's jump in. Okay, IEC 61850, right? You said that. Apart from hearing it explained, does anybody have experience with it? Is that something you know of, or not? Raise your hand. Okay. So who works in the energy industry? Okay, about half, I guess. Are you doing something with energy at home? Home automation, maybe? Okay, yes, of course, you said. Okay, yeah. Anyone working with energy somewhere else? Not industry? Teaching? Oh, education. Higher education, primary schools, I don't know. Yeah, so many things. We had OwnTech here as well, of course. Has anybody thought of a question by now? Ah, there's a question. Great. You're a hero. I think it's a follow-on from your comments. I'm coming from the telecoms industry.
I think I see a similar problem. The community is not big enough. The energy community is not big enough to sustain these types of projects, and I think the telecoms industry sees the same. So is there a way that the projects can be widened so that their scope is even bigger, so there's a much bigger chance of getting a more sustainable community? Do you have an opinion on that? Yeah, for sure. So as I mentioned in the beginning, what I talked about today is a technical approach. One part was that basically having a desktop app is not going to cut it, so you need a new solution like the web platform, where you can really distribute your software everywhere. And also LF Energy and Alliander already do a great job supporting the open source communities and using their projects; it's already really big. How else can you grow it? I think it's really hard to get amateurs, or hobbyists I would say, into these projects, because the features we develop are not for a week or a month—the results are really long term. Until we really reach them, it could be years, even, in the energy industry, and in telecommunications I think it's similar. The way technology moves in these industries is quite slow, so to say, or slower. So how can you sustain such communities? I think you have to get across the chasm. If enough people use the project—I think we are just before the chasm, because Alliander uses it, RTE uses it and TransnetBW uses it—if you can get a few other TSOs on board, then probably we're going to get over the chasm, and the rest of the TSOs are going to see this is a nice project and maybe want to get involved too. So that's one way you can grow the project; and how you maintain the project is, of course, through foundations and through the companies.
Indeed, this industry is so specialized, and the closed nature of the standard doesn't make it easier. I'm Dan Brown from Linux Foundation Energy, and you're exactly right. There are so many parallels between networking and energy. I would say networking and telecom are like ten years ahead, actually, of where energy is right now, believe it or not. Ten years ago nothing was software defined and that sort of thing in the telecom space, and now it largely is. So we need to go through exactly that same transition. I'm not saying telecom is perfect by any means, and there definitely are not enough people in energy. So it's a matter of getting all of these traditional old-school suppliers on board as well, the vendors who have been selling proprietary black-box systems to the energy industry, to utilities, for years. They need to basically stop doing that and come to it with an open source approach, and so they need to bring in the resources. But we also need universities, we need researchers, we need government, we need the utilities themselves. So it's really a matter of community building and scaling, and it's not an easy task by any means. But that's why we're here, in hopes that some of you who may not currently be involved—you may be developers in other vertical markets or in horizontal technology areas—may find this interesting, be inspired, and come and join and start contributing to these sorts of projects. There's not an easy solution, unfortunately, but we're just doing everything that we can to keep building capacity. About the IEC 61850 market share: in terms of number of devices, what part of the substation market does it represent? Meaning, of the electronic devices that are deployed, how many are compatible with this protocol? So I'm not the best person to answer that, I'm not an electrical engineer, right? I'm not sure.
So far what I get is that they are capable of it—the IEDs, the Intelligent Electronic Devices, are capable of it. I'm pretty sure, at least in the EU; I haven't heard that they wouldn't be, so yes. Any other questions? Maybe to complete what you asked about: the last two days, some of us were at the Policy Summit organized by the European Commission, and we thought it's very important to make a big announcement on energy and the open source opportunity, because we all rely—our future relies on energy; our business, everything relies on energy. So if we can get funding, and if we organize through foundations to coordinate the effort, and not scatter efforts here and there and there, I think we will find a great path to get more and more contributors. Yes, you have a question. Can you please pass the mic? I just want to complement that—sorry if I stop you abruptly. In my experience, in my research in software-defined power electronics, software-defined energy is much harder to achieve than software-defined data and signal, because there's a lot of current, there's a lot of power, there are a lot of issues with that, and different use cases require different types of converters and all that. So for me, one of the hurdles that we have as a community is that we need more open hardware as well. I mean, try to do no-code with no computer: it's not possible. If we want to do software, we need a computer. And if we want to do power, we need a power converter. We abstract the hardware because eventually we want to, but there is a lack of hardware, and I think that's a very big brake on the process, because hardware is not only hard but difficult to abstract as well. We're going to get there. Thank you.
Power to the People - Technology for Access to Energy
Okay, I think we can start. Hi, I'm Vivien Barnier and this is Martin Jager. We will talk about Power to the People: technology for energy access, or access to energy. We have seen a lot of great presentations today, very technical, very detailed. This presentation will start slightly differently. It also ties into the general discussion we just had about community and how to grow the community for energy projects. I particularly want to talk to you about access to energy, and how we can use open source, and what potential open source has for this completely under-explored area, because a lot of these great projects are targeting industrialized geographies, let's say, and their realities. But there's also another reality, which looks like this. As you can see, in the northern hemisphere, the global north, there are a lot of lights at night. Meanwhile, a lot of people live here, and here, and also here, but there's much less light. That's because there is no electricity. So if this is the energy dev room, we should also think about how to leverage open source—software, hardware and community—to help electrify these areas. And if we look at how electrification has gone in recent years, you can see there has been some progress in reducing the number of people without electricity. We started at over a billion in 2012. But in recent years this improvement in access to electricity stopped, and we even see a reversal. Any idea why? Not exactly. Not only Bangladesh; also on the African continent there's a bit of progress. And now it's not getting cheaper anymore? Exactly: general population growth. The speed of electrification is the same; population growth has simply outpaced the speed at which we are electrifying. So every year we now have more people without electricity, not fewer. So we need to speed up the process of electrification in general.
Now, electrification of areas which haven't been electrified in the past brings particular challenges. You have extremely low-income customers. You have extremely remote areas—and don't think of what counts as remote in Belgium or in Europe; it's completely different. You have unknown future demand patterns, and forget about AI and machine learning: it's just not going to work, you don't have the data. These are people who have never used electricity, so you have no clue, and you can't use these digitalized methods to predict what the demand will be. Data connectivity issues, if you want to communicate with the assets; extreme weather conditions; and regulatory uncertainty, because many of these countries are not that stable politically. You also don't know if the main grid is coming. And if the main grid is not coming, what technology will they need, and so on. So you need extremely resilient technology to achieve universal electricity access, because you have to respond to all these particular challenges. And you have NGOs, private companies, international companies—large utilities trying to go into this market—non-profits, cooperatives, communities, even agribusinesses, because they are already present in these areas, often going into energy ventures. And all these companies, stakeholders and NGOs are developing technology: software, hardware and also business models. And very often they come up with exactly the same thing, almost the same thing. So they are constantly reinventing the wheel, because it's a very nascent sector with a lot of players. A perfect playground for open source, I would say, because we have to overcome this constant reinvention of the wheel, and because it's nascent there are still a lot of possibilities to shape the sector—which in industrialized countries is often a bit more tricky, because there are more established industries and bigger, more established players.
So what we do at EnAccess is promote and support open source development and adoption for energy access—particularly for technologies that are meant to provide electricity to people who have not been electrified in the past. Particularly to generate an equitable ecosystem where more local companies, particularly domestic companies, can participate and compete against large utilities, the Engies of this world and so on, trying to grab this market; and to have the adaptable and resilient infrastructure that we need to electrify. To show what that means, here are a couple of projects that we have funded and supported in the past. They range from software to hardware and business models. I will not go into details for the sake of time. We want to speak about one in particular today, which is an open hardware project: the open source battery management system, the Libre Solar BMS C1, which has been developed by Libre Solar, and which Martin will now give you more insights on. Yeah, thank you very much. So, I'm from Libre Solar, which is mostly an open source hardware and firmware project, but we're also a very small company doing some consultancy work around open source hardware development. And yeah, we've developed this battery management system with a particular focus on energy access, in a project together with the EnAccess Foundation. I will explain a little bit what a battery management system is at the very beginning, and talk about some of the technical aspects, as well as the community and how we are interacting with other people interested in joining our movement, so to say. A battery management system is part of almost any modern battery-powered electrical equipment, because all those systems use lithium-ion batteries. In the energy access sector, in the past, there was still huge use of lead-acid batteries.
Lead-acid batteries have some issues with environmental damage, and they also don't last that long. Nowadays lithium-ion batteries are getting cheaper. Okay, so I covered the antenna, apparently. Yeah. So nowadays, also in the energy access sector, which is very cost-sensitive as you can imagine, lithium-ion batteries are used more and more. And so we need a battery management system that takes care of the safety of those battery systems. That's basically what it does: we have a pack of many cells connected in series; we measure each single cell voltage and make sure that they are well balanced; we measure the current; and if something goes wrong, we have safety measures like a fuse and a switch which disconnect the battery, so that you don't get an overheating battery that could potentially even explode. Of course it's a safety-critical component, so you have to take care to develop it right. And open source is a really good method for collaborating and not reinventing this wheel, which could be a costly process. So this is the hardware board that we developed. It can largely be divided into two parts. This is the power part, which does the switching and the current measurement, and you see those pretty large connectors: we can handle up to 100 amps, which is not that common for people used to Arduino and so on. Of course, for the OwnTech folks who had the presentation before, it's also going in that direction. It's a little bit challenging to put the microcontroller and the power parts onto the same PCB, but we decided to go that route because we really have to make it as cheap as possible while still handling this amount of power. And with these 100 amps and up to 48 volts, you can feed a—for this sector—huge AC inverter with enough power, like 3, 4, almost 5 kilowatts. And that's sufficient to build a small AC mini-grid in such an area.
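The core BMS loop described here—measure every cell voltage, keep the cells balanced, and open the switch on a fault—can be sketched as a toy decision function. All thresholds below are made-up illustrative values (roughly in the LiFePO4 range), not the Libre Solar firmware's defaults:

```javascript
// Toy passive-balancing / protection logic for a series cell pack.
// Thresholds are illustrative assumptions, not Libre Solar's values.
const LIMITS = {
  cellOvervoltage: 3.65,  // V per cell (typical LiFePO4 upper limit)
  cellUndervoltage: 2.50, // V per cell
  balanceThreshold: 0.02, // V above the lowest cell before bleeding
  maxCurrent: 100,        // A, the board's design limit
};

function bmsStep(cellVoltages, packCurrent, limits = LIMITS) {
  const vMin = Math.min(...cellVoltages);
  return {
    // Hard fault: any cell out of range, or over-current -> open the switch.
    switchOpen:
      cellVoltages.some(
        v => v > limits.cellOvervoltage || v < limits.cellUndervoltage
      ) || Math.abs(packCurrent) > limits.maxCurrent,
    // Passive balancing: bleed every cell sitting above the lowest one.
    balance: cellVoltages.map(v => v - vMin > limits.balanceThreshold),
  };
}
```

For example, `bmsStep([3.35, 3.36, 3.41, 3.35], 42)` keeps the switch closed and flags only the third (highest) cell for bleeding; a real firmware would of course add hysteresis, temperature limits and fault latching on top.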
There are also so-called solar home systems, which are really tiny—50 to 150 watts. This BMS is not targeted at those systems, but at the slightly larger ones. But you can also use the technology for light electric vehicles and other things, especially because the firmware is open and can be adapted easily. In terms of communications, we have the CAN bus, which is the automotive and industrial state-of-the-art protocol, used for batteries as well. But we also have Wi-Fi and Bluetooth Low Energy; those would be more for the non-control part. So we can talk to inverters, for example, over the CAN bus, but you can also have a smartphone app and connect to your battery. Where's the antenna? Okay. All right. So, for an open source product, we think it's essential to use only open source tools, and that's why we decided to go with KiCad, obviously. For a few years now, KiCad has really been completely professional, great PCB design software. And it has very nice plugins which you can use to automate some processes. We're always trying to get towards a similar experience for hardware development as you have with software development. Because if you get a pull request and all you get is a binary, you try to understand what's happening, what the person changed—that's difficult. So we use some community-developed tools to create diffs in PDF format, so you can see what the pull request changes. That makes it easier to collaborate on hardware development too, even though it's still not as easy as with software. We also generate an interactive HTML bill of materials where you can see all the part placements. So if you want to assemble it manually, or you want to fix something, it makes it easier to understand and find the parts. All of those are community-developed plugins.
And for the firmware side, the software that's running on the BMS, we are using the Zephyr real-time operating system. I can really recommend that anyone who's into embedded development and maybe didn't use an RTOS before try it out. It's maintained by the Linux Foundation, so really fully open source, and it has extremely great features. You can use it on almost any architecture, and you can also switch between different microcontrollers, because it's very well abstracted. That's something we experienced during our development: we started with an STM32 microcontroller, and then the chip crisis came and we couldn't get any STM32 microcontrollers anymore. So we thought, hmm, what are we going to do? And we replaced the microcontroller with an ESP32-C3, and it was really just a matter of changing a few board configuration files. Then all the things like UART communication and so on worked out of the box—almost out of the box, I have to say, because there were some bugs in some drivers, which we of course fixed and upstreamed; that microcontroller was still pretty young, but this is fixed now. That's really a huge advantage. So if someone comes along and needs this battery management system, but with a different communication protocol, a slightly different hardware configuration, or a different microcontroller for whatever reason, it's almost a matter of a day to get it ported to a different board. As mentioned already, we have lots of communication stacks in Zephyr already working out of the box. For this energy access market, GSM communication is very important, because most of those batteries or off-grid systems need remote communication through GSM. But we can also use Bluetooth Low Energy, CAN and Modbus for more local communication. There are also some Zephyr folks here at FOSDEM.
We don't have a dedicated dev room, but if you want to learn more about Zephyr—I'm also an active contributor and maintainer—just let me know afterwards and we can talk about it a bit more. For the communications we are using a protocol, so to say, called ThingSet, which you can think of as an API, a REST API, but for microcontrollers. You can use it over the serial interface, over WebSocket remotely, over CAN bus, over Bluetooth Low Energy. We are using exactly the same upper layer of the communication protocol over all these lower layers, which makes it really easy to integrate into other projects; it's not meant to be just for battery management systems, and it's self-explanatory. You can see an example here: you can send a request—this one would request the battery data from the battery, with a question mark for a GET request—and then you get the data back as JSON, including the units, and whether values are read-only or writable, and so on. It's quite versatile and we've had good experiences with it, in case you're interested. Here are the links; the presentation is also uploaded online, by the way. Now, coming to one challenging part of open source hardware development, which is manufacturing, or production. Often you can order electronics hardware from JLCPCB, but for these power electronics boards it's a bit more difficult. Sometimes you don't get the board specifications you need, because you need thicker copper layers than on boards that just carry data signals. And we also have some quite heavy connectors on the boards, which need special soldering processes and so on. So far we haven't ordered anything from the Chinese manufacturers, but have mainly ordered boards locally in Germany.
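The request/response shape described here—a GET request introduced by a question mark, answered with a status code followed by a JSON payload—can be sketched with a small parser. The framing below (a `:` prefix, a hex status code, an optional status message, then JSON) is a simplification based on the talk's example; consult the ThingSet specification for the exact wire format, and treat the data-object names as made up:

```javascript
// Parse a ThingSet-style text-mode response: ':' prefix, hex status code,
// optional status message, then a JSON payload with the requested data.
// This framing is a simplification of the real ThingSet protocol.
function parseResponse(line) {
  const m = line.match(/^:([0-9A-Fa-f]+)[^{\[]*([\{\[].*)?$/);
  if (!m) throw new Error(`not a ThingSet-style response: ${line}`);
  return {
    status: parseInt(m[1], 16),   // e.g. 0x85 for a successful "Content" reply
    data: m[2] ? JSON.parse(m[2]) : null,
  };
}

// A GET request for the battery subset might look like '?Battery', and a
// successful reply could carry the cell data as JSON (names are illustrative):
const reply = parseResponse(
  ':85 Content. {"cell_V":[3.35,3.36,3.41],"pack_A":12.5}'
);
```

Because the same upper-layer format is reused over serial, WebSocket, CAN and BLE, a parser like this only has to be written once per client, regardless of the transport underneath.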
And we're in contact with some companies in Nigeria who are also trying to produce it, or will be able to produce it in the future, because we basically want to break this chain of developing something in Europe, producing it in China and shipping it to Africa. So the idea is that it can be produced locally. And thanks to these features in KiCad, you can also—kind of—easily solder it by hand. This is an image of the first prototype in our FabLab in Hamburg, where we have a manual pick-and-place machine, but you can also use tweezers. And you need a small reflow oven—a pizza oven works too—and then you can do it yourself. That being said, we have ordered some PCBs. If you want to participate in the project, they will be ready soonish, in one month maybe, and you can get in contact with us. Our plan is, of course, to be able to provide boards easily and at a low price in the future, so that everyone can participate; but there are also regulatory issues—that's on our list to improve as well. Right, so, final slide, almost. How's the situation with adopters? We had about 10 companies who started using the product during our project, and more than five companies who are actually starting to use it in their products in the field. Also, some companies from Europe picked it up and provided really valuable feedback and pull requests. We also tried, really from the very beginning, from the requirement specification towards the final design of the PCB, to have everything on GitHub. So the specification was on GitHub and everyone could interact with us. And we've also got community forums: one on the EnAccess Foundation website and one in the LibreSolar forum. But I usually prefer just using GitHub issues and pull requests for communication, so you have it in the right place and people will find it.
Yeah, so if you're interested in developing such systems, or using them, just join us on our journey to bring power to the people. Here are some resources, some websites where you can find all the hardware designs, the firmware and the community. And now we're open to your questions. So first, thank you very much for your presentation and for your great work in the open source community. I have a few questions. First, I saw that you have passive balancing on your BMS. I was wondering if you are also offering, or thinking about, active balancing. Yeah, good question. We really tried to use active balancing at the very beginning and had it on our requirements list, but finally dropped it for cost reasons. There are some Linear Technology chips and Texas Instruments chips which cost $5 each for six cells, and they are really expensive; you also need lots of passive components. So that was the main reason we are just going with passive balancing at the moment. There are some Chinese chips we couldn't get from any reliable distributor, and we couldn't get data sheets and so on, so that was not an option either. If you have an idea how to implement cheap active balancing, let me know. There were also already some contributors who wanted to do active balancing. Okay. Did you already have some cases in which you had to handle other kinds of storage than lithium cells, such as compressed air or concrete-based storage? No. So far it was only lithium cells of different chemistries, like lithium iron phosphate and lithium nickel manganese cobalt oxide. We were in discussions with some people developing redox flow batteries, where the voltage is smaller, so you could potentially monitor multiple cells at once and get an average, kind of. But so far the technologies you mentioned we haven't tried.
But if the voltage range matches, then it should also be possible to do the monitoring, at least for those cells. I have a couple of questions—I don't know if I can ask multiple questions or just one. Okay. I can also stay outside the room after the session. Okay. You talk about networking, notably having GSM, but you don't talk about security features for the software and remote control. So, is there any—how is it handled? Yeah, so we are using TLS certificates for the communications. And currently, one of the problems with security is that with TLS you have the handshake mechanism every time you restart sending something out via the modem, and in these areas you usually have SIM cards with roaming, which are rather expensive for high data rates. So you have, like, six megabytes per month only. And if you do a TLS handshake every time, that gets tricky, so you have to reduce your data rate. We also thought about other things, like CoAP, where you don't have the handshake, but it's tricky; we're still implementing security, so we're not making any compromises on that. Thanks. Originally I had the same question as the first one, but I can switch to the next one. Did you consider having a bit more modularity, like on the other models you had? Those were somehow split up, because on a bicycle, for example, at 36 volts I wouldn't need 100 amps—that would be a bit much. And for everyone, afterwards we'll come together outside and we can chat about the Vigotech project OwnTech mentioned. Yeah, so you have a certain degree of modularity, which you can achieve by leaving out some components. So you could just use fewer MOSFETs, or take cheaper and smaller connectors, and reduce the current rating that way. That's possible in that direction, but not in the other direction.
So we designed it for 100 amps, which is almost the limit of what you can get with the power part and the control part on the same PCB, because the chips have small pin pitches, and if you need thick copper, you get into trouble with that. And as you mentioned the bicycle: this is really designed for energy access, as we said, and for high-power appliances in energy access—milling machines and the like, or a tricycle, where you actually might get to the 100 amps you would need—and not for an e-bike, a one-person e-bike. Last question. We have 13 seconds. It's precise. I'll be quick. So, as you know, there isn't a standard protocol for talking to BMSes, because you're going to be talking to standard inverters in a typical system, assuming there's some solar input, and the problem is that most of those aren't open yet. They expect a whole series of battery protocols. So currently, if you buy a generic inverter, it speaks 15 different battery impersonation modes. I'm just wondering whether you've done some of that, or if anybody's thinking of a sensible standard for this madness. Five seconds. Answer. Okay, yeah. We thought about that, but that's all only software. So we have RS485 and CAN bus, so you can implement any special kind of battery protocol. We have CANopen ready-made in Zephyr, so that's already a higher-level, complicated stack that's pre-built. But anything very special would have to be implemented. Thank you so much. Thank you very much.
Sharing the operational cost of Europe's electricity grid: optimization and transparency through open source
Hello everyone, I'm Peter Mitri, I'm a software developer at RTE, the French TSO. Today I'm going to talk to you about two open source tools that help us optimize and share the operational cost of the European grid. In the first part of the presentation I will focus on optimization: I will talk about what we call regional operational security coordination and remedial action optimization, and in this part I will introduce the open source software called OpenRAO. In the second part I will talk about cost sharing through flow decomposition, and there I will talk about the open source software called flow decomposition. I'll try to keep as much time as possible at the end for questions, so yeah, I hope you have some. Great. So first of all, let's talk about why we need to optimize the grid. I understood that many of you work in the energy sector, but some don't. We talked a lot about congestion management in the previous presentation, so here I'm going to try to set the scene and explain what a congestion is. As you may know, electrical equipment in the grid has physical limits. Outside of these limits the equipment is not safe to operate. For example, a power line which transports electricity from point A to point B has a thermal limit. If we exceed this limit, if we transport too much power on this line, the line may heat up, it may deform, it may even catch fire, and of course that's pretty dangerous. So to help set the scene, imagine that you have a small grid, or a small part of the grid, represented by three nodes. The nodes would be sites where consumers and producers are connected to the network, and between these nodes you have power lines, which are in black here. Let's imagine that you have most of the power production on the left side and most of the power consumption on the right side, so most of the power will flow from left to right.
Let's say that we have a consumption increase on the node here to the right. Then of course the flow will increase from left to right, and depending on the network's topology it may very well be asymmetrical. So we may have a bigger increase of the flow on the bottom part here. And we may find that the new flow on this line exceeds its limit. This is what we call a congestion. Of course it's not just a question of consumption and production; there are also incidents that can happen in the grid and lead to congestions. Here you have an example: if we lose the line that transports electricity from here to here, then most of the power will flow through this line, and this can lead to a congestion on the upper line. As a TSO, RTE has the responsibility to be robust to all potential incidents on the network. So we have to do something about these congestions. What can we do? Fortunately we have what we call remedial actions. These are actions on the network that can serve one of two purposes. The first purpose would be to redirect the flows on the lines. For those of you who work in the electricity sector, you may know them as topological actions, HVDC actions or phase-shift transformers. I'll talk about them in an example on the slide that follows this one. There's also another type of remedial action which acts on the injections. We call that either redispatching or countertrading. These are actions that will change the power production plan of the producers. In general, the first family of remedial actions, which redirect the flows, are called non-costly, because the only cost of operating them is the aging of the equipment, and the TSO has power over these remedial actions. The second type of remedial action is costly, because when we ask consumers or producers to change their injections, we pay them for their service. So to help set the scene, this is an example of non-costly remedial actions.
So here in the example above, we have the base case where no remedial action is applied. Let's say that you have a congestion on the line here. One first type of remedial action is the topological action. Let's say that you can split this node here into two nodes. This will make the power flow equal on both lines, this one and this one, and it will relieve this line here, so we would have relieved the overload, the congestion, on the network. Another type of remedial action is the phase-shift transformer. Let's say that we equip the line here with a phase-shift transformer. This kind of equipment is able to shift the phase of the current on the line, and so act on the active power flow, and so it can relieve the congestion on the line. The second family of remedial actions, costly remedial actions, is maybe actually easier to understand. What we can easily do is call a producer on this node, a power plant, and ask them to decrease their production, and ask a power plant that is here to increase their production. Naturally this brings the power production closer to the consumption site, it reduces the overall flows on the network, and by consequence it relieves the congestion on the line. The key difference here is that power plants 1 and 2 get paid for their balancing service. The fact is that Europe's electricity grid is highly meshed, interconnected and synchronous. For example, if you have an incident in France, it is instantly measured in Romania. Thus the security of the network is no longer a national matter; it's a European one, a global one. TSOs have to conduct coordinated computations to ensure that the European network is secure. This is why ACER, the EU Agency for the Cooperation of Energy Regulators, imposes on TSOs to conduct what we call regional operational security coordination.
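The redispatching example can be sketched numerically. Below is a toy DC power-flow calculation on a three-node triangle with unit reactance on every line and invented injections (an illustration only, not OpenRAO code): moving part of the production closer to the consumer lowers the flow on the congested line.

```python
# Toy DC power flow: three nodes in a triangle, unit susceptance per line.
# Flow on line (i, j) = theta_i - theta_j.

def dc_flows(p0, p1):
    """Net injections (MW) at nodes 0 and 1; node 2 is the slack."""
    # Solve the reduced susceptance system [[2,-1],[-1,2]] @ [t0,t1] = [p0,p1]
    det = 2 * 2 - (-1) * (-1)  # = 3
    t0 = (2 * p0 + 1 * p1) / det
    t1 = (1 * p0 + 2 * p1) / det
    t2 = 0.0
    return {"0-1": t0 - t1, "1-2": t1 - t2, "0-2": t0 - t2}

# Base case: 300 MW produced at node 0, consumed at node 2.
base = dc_flows(300, 0)
# Redispatch: move 150 MW of production from node 0 to node 1
# (closer to the consumer); the total injection is unchanged.
redispatched = dc_flows(150, 150)
print(base["0-2"], redispatched["0-2"])  # → 200.0 150.0
```

The flow on the direct line 0-2 drops from 200 MW to 150 MW, even though the same 300 MW is delivered: that is the congestion-relieving effect the speaker describes.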
So in this process TSOs must choose the best remedial actions on the European scale to implement in the network in order to ensure that it is secure. Of course it's a large-scale problem, so we can hardly do it by hand. That's why we need an automatic tool, which is called the RAO, the remedial action optimizer. The RAO has to choose the most optimal remedial actions in a given perimeter, and it also has to do so while minimizing the costs incurred by costly remedial actions. Using an open source RAO has many benefits. First of all transparency, because we are in a European perimeter. What better way to be transparent about what the RAO does and which costly remedial actions it selects than to put its code in open source, given of course that it's well documented. It also serves the purpose of coordination, because when we put a tool in open source, different TSOs from different countries, and different vendors from different countries, can cooperate more easily. It also serves robustness, interoperability, reusability and time to market, because when a tool is used in many business contexts it becomes more versatile, more robust and quicker to deploy. At RTE we have developed an open source remedial action optimizer called PowSyBl OpenRAO. For those of you who may know it, it was called FARAO in the past. The journey started in 2019, but two weeks ago we made the move to PowSyBl OpenRAO, and we did this because we wanted to join the Linux Foundation Energy adventure, because LFE provides a clear governance which all contributors accept to abide by, and it also provides a clear methodology to work more efficiently and more cooperatively. OpenRAO is actually used internally at RTE but also in many European processes. So I talked about regional operational security coordination, or ROSC. OpenRAO is being implemented for the SWE region here, which covers France, Spain and Portugal.
It is already in operation for another process, called capacity calculation, on the Italy North region and on the Core region, which is actually the largest region in Europe conducting coordinated computations; it covers around a dozen countries. A few words about what our RAO can do. It's an optimizer, so of course it has to have an objective function: it can either minimize the worst congestion or remove all congestions in the network. About congestions: we can model flow congestions and we can optimize flow congestions; this is the example I talked about in the previous slides. We can also model voltage magnitude constraints and voltage angle constraints, but for now the RAO cannot optimize them, it can only monitor them. For remedial actions, we can optimize phase-shift transformers in a given range: if you give the RAO a range of possible tap positions for the phase-shift transformer, it will choose the most optimal one, the one that reduces congestions over the whole network. It can optimize an HVDC set point, so it can change the set point of the HVDC to reduce constraints. It can also choose to activate or not activate some topological actions, for example closing or opening a switch. It can optimize a subset of redispatching remedial actions; redispatching remedial actions are actually pretty complex, and OpenRAO only handles a subset, with strong limitations. It can also optimize a subset of shunt compensator actions, and for now it can only model countertrading remedial actions; we do not support optimizing them in the RAO. Of course, like I said, OpenRAO is used in a multiplicity of business contexts, so it is very versatile. There are a lot of ways you can use it by changing the input data or by changing its parameters, so if you need more information you can look on our website for all the ways it can be used. Under the hood, the OpenRAO software is licensed under Mozilla Public License 2.0.
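The phase-shift-transformer optimization just described can be illustrated with a miniature brute-force version. Everything below (line limits, base flows, tap sensitivities, tap range) is invented for illustration; a real RAO works on a full network model, not two lines.

```python
# Hypothetical sketch: pick the PST tap, within its allowed range,
# that minimises the worst line loading. All numbers are made up.

LIMIT = {"L1": 100.0, "L2": 100.0}          # MW thermal limits
BASE_FLOW = {"L1": 120.0, "L2": 60.0}       # L1 is congested (120 > 100)
# MW of flow change per PST tap step, per line (linearised sensitivity)
TAP_SENSITIVITY = {"L1": -4.0, "L2": +3.0}

def worst_loading(tap):
    """Highest |flow| / limit ratio over all lines for a given tap."""
    return max(abs(BASE_FLOW[line] + TAP_SENSITIVITY[line] * tap) / LIMIT[line]
               for line in LIMIT)

best_tap = min(range(-16, 17), key=worst_loading)   # taps -16 .. +16
print(best_tap, round(worst_loading(best_tap), 3))  # → 9 0.87
```

Note the trade-off the optimizer has to resolve: pushing the tap further unloads L1 but loads L2, so the best tap balances the two, bringing the worst loading under 100%.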
It's hosted on GitHub and the code is written in Java 17. We use JUnit for unit testing, of course, and Maven for dependency management. We monitor the quality of the code on SonarCloud, and we're pretty happy with our figures. We publish the code on Sonatype OSS, and we rely closely on the PowSyBl library to model the network and simulate it, in particular for sensitivity computations and load flow computations. A specificity of the RAO is that we also use Google OR-Tools. I don't know if you know it, but it's an open source modeling library for linear problems, developed in open source by Google, and through it we can support a multiplicity of linear solvers. For now, for example, we have SCIP, which is an open source solver, also CBC, but we can also support Xpress, Gurobi and CPLEX, which are commercial ones. As a side note, we have tested that OpenRAO is compatible with Docker, Jenkins, Kubernetes and Cucumber testing. So in conclusion, I'd be more than happy for you to participate in our RAO adventure, either by using it and giving feedback, or by contributing to the project. The best way to join the adventure is to join the PowSyBl Slack team and then to join the RAO channel. There is also a quick tutorial in Java on our website if you want to play around with the RAO. And if you want to know what the future of the RAO looks like, the roadmap is updated once per month and is discussed during the PowSyBl TSC, which you are free to join. I'm moving on to the next subject, which is flow decomposition and cost sharing. I'm going to set the scene with a small example here. Imagine that you have three zones; let's say there are three countries, A, B and C. Imagine that you have big power production in the north of A and big power consumption in the south of A. Then naturally you'd expect the power to flow from north to south, from producer to consumer, but in reality it's not so simple.
Only part of this commercial exchange, the power that is sold to the consumer, will transit through the internal lines of zone A; the other part will go through zone B, then to zone C, and then back to zone A to the consumer. So of course the consumer got the power they needed, but some of the power went through zones B and C. We call these loop flows, or polluting flows. So the commercial exchange is simply the sum of internal flows plus loop flows. And we say that they are polluting because they transit through zones in which they are not consumed. As you can imagine, more loop flows in the polluted zone means more load on the zone's internal grid. It eventually means more remedial actions to implement, possibly costly ones, and this leads to more costs for redispatching and countertrading. In the Core region alone we have up to 3.7 billion euros per year of redispatching and countertrading. And of course loop flows are a reality. They are a consequence of the topology of the network. We can do nothing about them; we cannot eliminate them. However, we can compute them, and we can better share costs when we know where they come from. So ACER, again, the European regulator, defined a clear methodology for computing loop flows in the Core region, and this methodology is followed by a methodology to better share costs between TSOs. Of course, using an open source tool has all the same benefits here, and most of all transparency, because when we talk about sharing costs we talk about TSOs having to share the bill, and being transparent is very important. At RTE we developed a tool called PowSyBl Flow Decomposition. It follows the ACER methodology, and you have the documentation for it here. It has both a Java and a Python API. Under the hood it's almost the same as the RAO: MPL 2.0, developed in Java, it uses Maven, it's hosted on GitHub, and it relies on PowSyBl for load flow computations.
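The bookkeeping behind the "commercial exchange = internal flows + loop flows" identity can be sketched in a few lines. This is only the accounting idea, not the ACER methodology itself; the exchange paths and MW values are invented for illustration.

```python
# A zone's loop flow is the power crossing it from exchanges in which
# the zone is neither the source nor the sink of the energy.

# (source_zone, sink_zone, MW, zones the physical flow crosses)
flows = [
    ("A", "A", 70.0, ["A"]),            # part stays on A's internal grid
    ("A", "A", 30.0, ["A", "B", "C"]),  # part loops through B and C
]

loop_flow = {"A": 0.0, "B": 0.0, "C": 0.0}
for source, sink, mw, crossed in flows:
    for zone in crossed:
        if zone not in (source, sink):
            loop_flow[zone] += mw       # a polluting flow for this zone

print(loop_flow)  # → {'A': 0.0, 'B': 30.0, 'C': 30.0}
```

Here the 100 MW commercial exchange inside A decomposes into 70 MW of internal flow plus a 30 MW loop flow that loads the grids of B and C, which is exactly the quantity the cost-sharing methodology needs to attribute.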
And most importantly, it's already supported in our PyPowSyBl API. That's it from me. Do you have any questions? So maybe I wasn't paying enough attention. The purpose of your system is to allow you, if something happens, like whatever that thing in the Pyrenees a couple of years ago, for the whole system to react appropriately. But you were showing that you're doing subsets of the computations. I didn't understand: in an emergency, presumably everyone needs to do something right at the same time, the whole network, however far the effect propagates. So what happens in an emergency, versus whatever you were showing on the screen with computations of various regions? This is not really an engine that is supposed to help decision making in real time. It's supposed to be used as an optimizer for the grid. For example, in regional operational security coordination, TSOs have a photo of the grid in the day ahead, so 24 hours before real time. We merge the grid models of the different TSOs, we conduct load flows, and then we see if there are any congestions. If there are, then we run a remedial action optimization. The optimizer tells us: okay, I found these non-costly remedial actions and these costly remedial actions that will make the network secure. So this is 24 hours ahead, during the day ahead; it's not supposed to tell the operator in real time which remedial action to choose. And this is really separate from balancing. If we go back to the example where I showed something that resembles balancing: every time we change production somewhere, so if we decrease the production here, we have to increase the production here, because when we handle congestions we cannot change the balance of the network. The balance between supply and demand is handled in another process.
Hello, I have a question about how much resolution you need to see into each of the grids in order to actually make some of this work. Could you talk a little bit about the visibility that's required at the TSO level or beneath it, for example? It depends on the process. In regional security coordination, we look at the high voltage levels, so 200 kilovolts and 400 kilovolts, and basically all big production hubs are on this voltage level. But this is a really generic remedial action optimizer, so we can generalize it to whichever resolution we need. Any other questions? Are there some ideas to adapt the software for real-time congestion management, for DSOs or for other systems? Yes, some experimentation is underway for balancing, in order to be able to find curative remedial actions in real time. For now it's not in operation, but it's being experimented with. My question is about impact. Have you noticed that other European TSOs are using your software as well? Is that the goal in the end, to share it among the different European TSOs? For now we are the only TSO using the RAO internally. However, there is Coreso, the regional coordination centre, which is using OpenRAO for these three regions. And the idea of joining the PowSyBl project is also to be able to develop a Python API pretty quickly and to have more users among different TSOs. What kind of algorithm is used in OpenRAO? We have an optimization algorithm, a linear optimization algorithm. I have a few slides in the appendix on this; we can talk about it later if you want. But basically it's a search tree in which we optimize the topological actions, and after every topological optimization we run a linear program to optimize the linear remedial actions. These are remedial actions that have a linear effect on flows, for example PSTs and HVDCs. How do you test it? How do you ensure that there isn't a bug that affects all OpenRAO instances running simultaneously?
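The two-level search just described (a discrete search over topological actions, with a linear optimization of PSTs and HVDCs after each one) can be caricatured in a few lines. All the numbers and action names below are invented; the inner step here is a brute-force stand-in for the real linear program.

```python
# Hypothetical miniature of the search tree: enumerate discrete
# topological actions; after each, run a cheap "linear" tap optimisation.

# Flow on the congested line after each topological action (MW, made up)
FLOW_AFTER_TOPO = {"none": 130.0, "open_switch_s1": 115.0, "split_node_n2": 105.0}
SENSITIVITY = -1.0          # MW of flow change per PST tap step (made up)
TAPS = range(-16, 17)

def linear_step(flow):
    """Inner step: best PST tap for a fixed topology, and resulting flow."""
    tap = min(TAPS, key=lambda t: abs(flow + SENSITIVITY * t))
    return tap, abs(flow + SENSITIVITY * tap)

# Outer search over topological actions, keeping the best combination.
best = min(((action,) + linear_step(flow)
            for action, flow in FLOW_AFTER_TOPO.items()),
           key=lambda result: result[2])
print(best)  # → ('split_node_n2', 16, 89.0)
```

The point of the structure is that discrete actions cannot go into a linear program directly, so they are explored as tree branches, while PST taps and HVDC set points, which act linearly on flows, are optimized inside each branch.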
With this, if it answers your question: we have a lot of input files and expected output files. And with this stack, with Docker, Jenkins and Cucumber. Cucumber is a framework for functional testing, so you write scenarios in the Gherkin language. You say, for example: given this input file for the RAO, then I expect that there is no congestion at the end and that this remedial action is activated. You write it in a very natural language, and of course there is code to run these things. Then we put that in Docker and Jenkins, and we run this every night on almost 500 scenarios. So every night we are sure that our main branch on GitHub is still solid.
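A functional scenario of the kind described might look like the following Gherkin sketch. The file names, step wording and action identifier are invented for illustration; the actual OpenRAO test suite defines its own steps.

```gherkin
Feature: Remedial action optimisation on a congested network

  Scenario: A PST tap change relieves the congestion
    Given the network file "case_with_congestion.uct"
    And the remedial actions file "crac_with_pst.json"
    When I launch the RAO
    Then the remedial action "pst_tap_change" should be activated
    And there should be no remaining congestion
```

Each Given/When/Then line is bound to test code, which is what lets the nightly Docker/Jenkins run execute hundreds of such scenarios automatically.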
Quartz Solar OS: Building an open source AI solar forecast for everyone
Welcome. I'm Rachel Tipton and this is my colleague, Zach Watts. Can everybody hear me in the back? I'm not used to talking into a microphone. Yes, it's good, okay. So today we're going to be presenting Open Quartz, building an open source AI solar forecast for everyone. I'm a full stack developer; I work for Open Climate Fix. I'm going to introduce myself and then Zach will introduce himself. I'm a career-change developer. Before working in climate tech, I was teaching English in France. I got a little bit tired of teaching 18-year-old French students the present perfect. So I decided to, the French people laugh, huh? Because it isn't. Yeah, because it isn't perfect. And I'm not a perfectionist. So I decided that I was going to channel my love for languages into learning code languages. I completed a boot camp about a year and a half ago, and that's how I landed in climate tech, and I'm quite happy. Zach? Thank you, Rachel. Yeah, I'm Zach Watts. If anyone's noticed, my last name is Watts, so I think I was destined to work in power or energy of some sort. I finished my master's in physics two years ago, where I was trying to make cells dance using acoustic sound waves. And then I kind of fell in love a bit with AI, and joined Open Climate Fix about a year ago, where I do some of our machine learning implementation and data science. All right. So what to expect from our talk: I'll introduce Open Climate Fix. We'll talk a little bit about why solar forecasting is important for balancing a power grid, and some of the use cases we use it for. We have a live solar forecasting service called Quartz Solar, and derived from that is the open source Quartz Solar model that we'll be talking about; Zach is going to present that today because he's worked on that model. And then hopefully we'll have time for questions at the end. And this is a sneak preview of the code that we'll have you run at the end of the presentation.
And we're hoping that the demo works, but we'll see. Open Climate Fix was founded in 2019. We're a London-based company; I'm based in the north of France, so getting to be in Brussels is kind of more my home territory. This photo is from the Sustainable Ventures office in London where we work. We're a non-profit product lab developing open source solutions to decarbonize the power grid, and generating solar forecasts is part of that work. All right. So we see ourselves as a middle man, or a bridge, between ML researchers and the energy industry. We want to make our data available to researchers, and we want to make the research that ML researchers are doing available to the energy industry. And how do we do that? All of our code is available on GitHub. We also have models and data sets that are available on Hugging Face. Does everybody know what Hugging Face is? I'm assuming this crowd does. Yes. Okay, we know what this is. A lot of the data sets are NWP data, numerical weather predictions, and to date we have 500 people who have signed up to download those data sets, so we like to say that we're making an impact in that way. We also make available the EUMETSAT data that we collect: we're connected to a live service, we get data from the satellite itself while we're generating our forecasts, and we're putting that data into the Zarr file format and making it available to ML researchers. That data has been downloaded 16,000 times so far from the Google public data sets site, so that's another way in which we're having an impact. The data has also been used to forecast rain, to do rain predictions in Sweden, and storm evolution in Taiwan, so it's been used for a lot of different purposes. And most recently there was a graduate paper published on, I think it was day-ahead PV forecasting. All right, so moving on to why solar forecasting is important.
The weather is unpredictable. The sun doesn't always shine; the wind doesn't always blow. If any of you have listened to a podcast on decarbonization, you've probably heard that phrase before. Moving into the future, our power generation is going to be dependent on weather-dependent energy sources like solar and wind. In this chart, you can see that by 2050 about 75% of the world's primary energy is going to be based on renewable resources. The resources at the bottom are gas and coal. These are what are called dispatchable resources: you can burn X amount of coal and get X amount of electricity; you burn X amount of gas and get X amount of electricity. This is a basic concept that I'm presenting, but it's important to think about, because you don't have that predictability with solar or with wind, and that's where our predictions come in. So does anybody know what this is, this image on the screen? I'm sure there's somebody who knows more about it than I do. Peter, would you? No? Somebody else? Anybody? Yeah, it's a gas-powered turbine. Thank you. This is a gas-powered turbine; I'm using it to introduce the idea of spinning reserves. A power grid, as we've seen, involves a lot of calculations; it's complex to balance a power grid. What we're doing with our work is helping power grid operators balance the power grid by providing them with a PV solar forecast that indicates how much solar energy is going to be on the grid. If they don't have that information, what ends up happening is they have something called spinning reserves that they keep running. And that spinning reserve is running at 50% capacity, and so it's running at 50% efficiency. So you're actually burning fossil fuels just to ensure that there is electricity that could be generated onto the grid.
If you don't know how much solar energy is going to be on the grid, it's more likely that you're going to have a greater amount of spinning reserves running at a given time. I'm just introducing this to explain how our solar forecasts are actually decreasing carbon emissions currently, through our work with National Grid. Our main solar forecast is a national forecast that's run for National Grid ESO, the electricity system operator in the UK. This is a picture of the control room; if you've never seen one, this is what the National Grid control room looks like, and our national forecast is in operation there. So this is what a solar forecast looks like. You have the dotted line here: the dotted line is your forecast. And the solid line behind, where it says 11:30, is basically the history of the forecast itself. I'm just using this to show you the information that National Grid is given; they're able to make balancing decisions based on this information. If they see that there are 3.5 gigawatts of energy guaranteed to be on the grid, then they can reduce the spinning reserves running at that time, and therefore decrease balancing costs for themselves, while also diminishing carbon emissions at the same time. The other model that we have in production is a sites model, and this is what the Open Quartz model is based on. This is a model that's not generating a solar forecast for the power grid itself or for an entire country; it could be for a solar farm, or for a smart home operator. And Zach is going to tell us how it all works. Great, thank you very much, Rachel. So as said, we've taken a lot of what we've learned from building these larger, more complex models and distilled it down into a site model.
But essentially, when we're trying to do a forecasting problem in general, we want to start by providing as much information as we can about the problem we're trying to solve. We start by providing a diverse set of historic solar generation data; that means we can capture all sorts of different conditions that might occur across different locations. We then provide multiple numerical weather predictions. These are forecasts made by the large supercomputers of different countries, forecasting things such as cloud cover, temperature, rain and irradiance. Not all of these numerical weather predictions are equal; some of them have slightly different biases, so we try to incorporate as many as possible to capture that information. We also utilize satellite imagery; as Rachel said earlier, we've made that data set public on Google data sets. That's really useful for near-term cloud formation. Not only that: the satellite, because it's up in space, can take a picture every five, ten, fifteen minutes, so you have a higher resolution of data going into the model, whereas the numerical weather predictions are run on quite resource-intensive, quite slow-to-run supercomputers, with much lower resolutions. We also provide some topographic data about the terrain we're forecasting for. And we feed all of this data into a machine learning model. If you've dealt with any data on this order of magnitude, 60 terabytes of satellite imagery, you'll know some of the pains in creating batches and the slow processing times involved. Out of this, we're able to create a national, a regional, and an individual site-level forecast, which I'll be talking about today. So, as we said earlier, we've been doing some work with National Grid ESO, which started a couple of years ago; they were our first pilot project with our forecasts.
And we managed to generate a forecast which was three times better than their existing in-house forecast. So that gives you an idea of the bar that was set when we started this: getting an error which is three times better. In this chart to the right, from one of our latest models, which we call PVNet 2, you're looking at mean absolute error as a percentage, per forecast horizon. I've used this to demonstrate the value of using satellite imagery combined with these numerical weather predictions. The light blue line is what you get if we train the model just using the satellite imagery: it's quite good early on, but the relative error increases quite a bit. Whereas just using the NWPs, which is this dark green line here, gives a very horizontal, consistent error. And by combining the two data sources, we get what I find a quite satisfying convergence, where the model learns to take the information it needs from both data sources. Moving on to our site-level forecast: just curious here, if you have solar panels, could you raise your hands now? All right, now keep your hand raised if you also have a battery pack in your house. Now, are any of you using solar forecasts in any way at all at the moment? You are, nice. So this is where we see the site-level forecast that we've generated and open sourced being really useful. There's a bit of a shift going on in the past couple of years, as consumers and households are realizing that there are technologies available that can help them optimize their energy consumption. And it's not just the consumers; it's the smart home operators who are looking to participate in these energy flexibility markets. Now, as we've heard, there have been lots of really great presentations today about how to manage a grid.
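For readers unfamiliar with the metric on that chart, mean absolute error as a percentage can be sketched in a few lines. The normalisation by installed capacity is an assumption here (the slide does not say what the percentage is relative to), and the numbers are invented.

```python
# Mean absolute error as a percentage of installed capacity, for one
# forecast horizon. In the PVNet 2 chart this is computed per horizon
# (all 4-hour-ahead forecasts together, all 8-hour-ahead, and so on).

def mae_percent(forecast_mw, actual_mw, capacity_mw):
    errors = [abs(f - a) for f, a in zip(forecast_mw, actual_mw)]
    return 100.0 * sum(errors) / (len(errors) * capacity_mw)

# e.g. four forecasts made at the same horizon, for a 10 MW site
forecast = [5.0, 7.5, 6.0, 2.0]
actual   = [5.5, 7.0, 6.8, 1.5]
print(round(mae_percent(forecast, actual, capacity_mw=10.0), 2))  # → 5.75
```

Plotting this value for each horizon is what produces the curves being compared: satellite-only (good early, degrading), NWP-only (flat), and the combination.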
The electricity grids really need a lot more infrastructure to be built onto them to meet electricity demands going forward, and one way of tackling this is by increasing flexibility through things like smart home management. So one way this could be used is: when a smart home operator has access to many, many households, they can incentivize households to turn electricity use up or down at different times, and this provides flexibility to the grid. From a consumer perspective, you might have an electric vehicle, and you might want to charge your EV at the times of lowest cost to you, which is when you have solar generation. So you can look at a forecast and say: I want to drive my EV tomorrow. I can look at my solar forecast and see that it's really sunny today and really cloudy tomorrow, so I'm going to charge my car up fully today, and then I can drive it tomorrow and it'll be lowest cost to me. So we see this being used by smart home operators; we're already speaking to a few startups in this space who are trying to integrate this into their smart home optimization systems. Also experts in battery optimization, researchers and academics, and general hobbyists who might want to incorporate solar forecasts into their current setups. To create this model, we've used a data set of over a thousand UK household sites, which you can see on the right here. And we've trained quite a simple model, just a gradient-boosted tree, which essentially tries to separate the data out into different buckets. This is quite a crude example, but say the cloud cover is less than 25%: you might predict 100% PV. If not, then you try to create another branch that will split the data up further. And what we're able to do by using a wide range of different sites, spread out all across the UK, is forecast anywhere in the UK.
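The "buckets" intuition can be written out by hand. This is only a caricature of one tree's splits, with invented thresholds; in the real model a gradient-boosted ensemble learns many such splits from the thousand-site data set rather than having them hard-coded.

```python
# Hand-written stand-in for one tree's decision rules: predict PV output
# as a fraction of capacity from a couple of features. Thresholds are
# invented for illustration, not learned.

def predict_pv_fraction(cloud_cover_pct, hour_of_day):
    if not 6 <= hour_of_day <= 20:
        return 0.0          # night: no generation regardless of clouds
    if cloud_cover_pct < 25:
        return 1.0          # mostly clear sky: near-full output
    if cloud_cover_pct < 75:
        return 0.5          # partly cloudy: reduced output
    return 0.1              # overcast: residual diffuse output

print(predict_pv_fraction(10, 12))  # → 1.0
print(predict_pv_fraction(90, 12))  # → 0.1
```

Gradient boosting builds a sequence of such trees, each one fitted to the errors of the previous ones, which is what lets the simple branching structure approximate a smooth weather-to-generation relationship.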
So we can now plug in the specific latitude and longitude of the site we want to forecast for, and forecast anywhere, hypothetically globally as well, depending on what data we have available. This brings us to Open Quartz, which is the open source solar forecast we're presenting here today. This uses open NWPs. There are two primary open ones (there are a few): GFS, which is the American Global Forecast System, and ICON, which is created by the German weather service DWD and is widely regarded as the most accurate free-to-use weather model. We take things such as cloud cover, temperature and visibility, we pull this data from Open-Meteo, and we use the pre-trained model that we previously showed. By doing this, we're able to create a forecast up to 48 hours ahead at a 15-minute resolution, and do all this in four lines of code. And we're able to get a pretty good error doing this: in comparison to some of our other models, which use slightly more up-to-date information, the error is not too much worse. Now you might notice that there isn't satellite imagery involved here, and that's because this model is something that you can run on your own computer, using our pre-trained model and pulling the data yourself, in just a couple of lines of code. When you involve satellite imagery, you need licenses and such to have that data live; the public data store that we keep has a two-day lag, I think, on live real-time data. So we were going to do a demo, but we've had to do a last-minute swap of computers, so instead I'm just going to talk through this with everyone. But if you do want to do the demo, you can follow along. If you head over to our GitHub organization, github.com/openclimatefix, I've pinned the repo, open-source-quartz-solar-forecast, so you won't have to type in that mouthful of a repo name.
And if you head to the examples folder, there is an example notebook you can follow, which will lead you through creating a solar forecast. Essentially, all you need to do is pip install quartz-solar-forecast, as we have here. Once you have that installed, these are the four lines of code we tempted you with at the beginning: first you import the function which we'll be using to run the forecast. Next, we import the PVSite class that we use. We then create the class: in this case, we specify the latitude and the longitude of the specific house or site that we want to forecast for, and then the capacity of our solar panels. Next, we run that run_forecast function, passing in our site as an object and specifying a time we want the forecast to start from. Using this time here, it would create a forecast starting at midnight that night, going out 48 hours from that point onwards. And what do the results look like? Well, this is where, had the demo worked, I'd click and you'd see a nice smooth graph, but you get the idea of the results anyway. We get our solar forecast, which looks as we might expect, peaking around midday. There are some bumps in the road here; this could be due to some clouds coming over, or a storm. And we've got our forecast from midnight out to 48 hours ahead. So hypothetically speaking, with the demo running, I could have shown you what it looked like exactly at this location here today, looking out over the next two days. But running it on my computer, it didn't look too great, and that's kind of reflective of, if you look outside the window today, it's a bit cloudy and not the nicest. So I'm going to pass back to Rachel now to talk about the roadmap. All right. So moving forward, the idea for the Quartz open source forecast is that other people can use it.
You could potentially input different types of data, so different NWP data could be input, or PV data. And for anybody who wants to do a bit of ML experimentation, this would be a place to start with that. As a company, we're looking to build our community as an open-source company; it's something that we're trying to put in place. So if people use the model, hook it up to an API or a database, and actually start generating a regular forecast for themselves, we'd love to know about it. So I don't know if we have any time left for questions, but yeah. Too many questions. The prediction: you can specify the capacity, but can you specify things like south-facing versus east-west-facing, that kind of stuff? And how does this contrast with forecast.solar, which provides a similar API for home users? Sure, thank you very much for the question. So providing features like tilt and orientation is something that we have built into the model, and it needs a little bit of a tweak to get it working. Originally, this model was based off a model that we have in production, which we run for a thousand household sites in the UK. And we found that the tilt and orientation data that is generally provided is not always that good or that accurate, because oftentimes with a solar installation the builder might have noted it down, but not accurately. And when we ran experiments, hard-coding the tilt and orientation versus letting a user specify it exactly, we got slightly better results if we assumed it was perfectly south-facing and at 30 degrees. But that is a little tweak and, I think, one of our issues to work on. And your next question, about another provider, what was it again, forecast.solar? I think what differentiates what we're doing compared to other people is that this is something that you can run locally on your computer and do it yourself.
And we're also forecasting generation. I think a lot of these other APIs forecast things like solar irradiance, and then it's down to the user to interpret that irradiance value into a generation value. Maybe forecast.solar is different, but I think that's what we do maybe slightly differently, if that makes sense. How do you handle the issue of long-term solar weather and recent critical events, like volcanoes or dust storms, which can affect the yield of the solar panels? Yeah, so things like volcanic eruptions definitely do affect the solar. And a lot of the time, I think that information is generally left out. So the numerical weather predictions that we use sometimes try to capture that information; I did see some research papers on how they actually don't capture things like volcanic eruptions, and the researchers were saying we need to improve these models to capture things like that. One other data set that we're looking to incorporate is aerosol data sets, which do include information like that. That is something we're doing with some of our other models, and at some point, I guess, we'd like to do it with this model as well, which should help to capture extra information like that. Hi, thanks for the talk. I wanted to ask, what is the geographic extent of this? You're using models which might cover more than, say, the UK or Europe. Or if it's confined to the UK or Europe, do you have plans to expand it to a wider region in the future? Thanks. Hi, thanks for the question. This model in particular is dependent on the weather data that you have available. We're using ICON's global weather forecast, and that essentially means that this model can be used anywhere in the world, because that forecast is a global forecast.
The only issue you might encounter is that, because the training set that we've used is just for the UK, there might be some sort of bias towards the UK household sites that we've not really looked into yet. So one of the things that we do want to do, to create maybe a more robust global model, is to have a PV data set which covers the whole world. But we've pushed this out very recently, and since we've done that, someone from Indonesia reached out to us who was testing it out there. I think they got it working. So it does have global coverage. Some of our other models, which we provide as a product and service, are quite specific to the UK, but we're expanding out to India at the moment and some other European regions. And that's mainly down to the satellite imagery data that we have access to, because we're using the European geostationary one, so it's easier for us to build on that as it is at the moment. Thank you, everyone.
Can open source development drive energy transition? PyPSA-Earth experience
So, we have stopped somewhere between the regional and global perspective. Let's go global. The energy transition implies that thousands of power systems around the world should be transformed at a pace which has never been seen before. And while we know what the picture should look like at the global level, it is still a question how it should be translated to the regional level. And what is special about this global-scale energy planning problem is that we should plan decades ahead, under deep uncertainty. And basically, we have quite an experience of energy policy failures: there have been quite a few cases where energy policy measures looked quite reasonable in advance but resulted in failures, didn't lead to the results which had been expected, and those programs had to be stopped. And that is why we actually need large-scale energy modeling: we can replace this painful experience of real-world failures by playing with energy models. And the obvious advantages of open source, open modeling and open data for energy planning have led to a rapid increase in interest towards open energy modeling. Currently we have dozens of open energy models, and we have a lot of open data sets relevant for energy modeling. But the picture is very incomplete and very patchy, and there are regions in the world where we do not even have a net-zero plan, not to mention an open net-zero plan. And that is exactly the gap which we are addressing as an independent research initiative. PyPSA-Earth aims to provide every part of the world with an open, reproducible and accessible energy systems model. What we are doing can be divided into three blocks: first of all, we are doing open coding; we are working with open data; and we support the open energy modeling community. So, just a reminder about energy systems models. There are, I would say, power engineering models, that is, the tools which we have mainly discussed today, and there are also academic integrated assessment models.
Academic integrated assessment models relate to the whole world and model global-scale, large-scale interconnections between economics, environment and energy. An energy systems model is the kind of tool which translates the results of that global assessment into a plan of action on the regional scale, and obviously an energy systems model should reproduce the behavior of power systems in a realistic way. So that is what our workflow, our architecture, looks like. We have a data block, we have a modeling block and we have an optimization block. Processing is orchestrated by Snakemake, and probably the most trying part of the whole picture is the work with data. There are different groups of data which affect the operation of a power system, and there is also a quite trivial but very impactful point which relates directly to open data licensing. Basically, we have a starter data kit which we provide with the model to facilitate getting started with modeling, and I think the most frequent how-to-start request is about loading this starter kit data. Many troubles are created by the fact that some licenses of open data sets do not allow redistribution or hosting of the data. So for some data we can collect the data set and transform it into the form which is needed for the energy system model to run, while for others we do not have the right to redistribute, and have to provide links to the sources and connect them with the scripts to clean the data and prepare it in the format which can be used for energy modeling. And that is exactly the link of the whole chain which breaks most frequently. So, open data in action. Environmental and climate data is the part of the data workflow where we are truly grateful to the open science community and to the geophysical community. That is the most unproblematic part of the whole workflow: we have a package which translates geophysics into energy-related parameters, and basically that's it.
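The licensing constraint described above shapes how the data workflow can ship its starter kit. A minimal sketch of that decision, with a hypothetical helper (this is not the actual PyPSA-Earth or Snakemake code; the names License, fetch_bundle, fetch_source and clean are all illustrative):

```python
from collections import namedtuple

# Hypothetical license record; real license metadata is richer than this.
License = namedtuple("License", ["allows_redistribution"])

def prepare_dataset(name, licenses, fetch_bundle, fetch_source, clean):
    """Starter-kit pattern from the talk: data under a redistributable license
    can be hosted pre-processed alongside the model; data whose license forbids
    redistribution must be fetched from the original provider and cleaned
    locally by the user -- the link of the chain that breaks most often."""
    if licenses[name].allows_redistribution:
        return fetch_bundle(name)   # pre-cleaned copy hosted with the model
    raw = fetch_source(name)        # original provider, original terms
    return clean(raw)               # user runs the preparation step locally
```

The point of the split is that the fragile path (original source plus local cleaning) only exists because of licensing, not for technical reasons.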
Mainly it just works. But as for electricity demand, here the biggest problem is data availability. What we need are hourly demand profiles for every country of the world, at least at an aggregated national level. The data exist, but they are not openly available, so we have a machine learning model which reproduces synthetic load profiles. We would be very happy to improve the flexibility and geographic coverage of this approach, and access to the original load profiles is currently the bottleneck for this group of data. Another part which is crucial, if you are interested in modeling the power system of some arbitrary country, is data on power infrastructure, especially on the grid. Here we have used the OpenStreetMap database and developed a dedicated package which extracts power features and allows us to prepare a model of the grid topology. Apart from that, we have packages from the PyPSA ecosystem which provide data on power plants and installed generation, and a data set which collects and curates data on technology costs, including forecasts of technology development. So that is what the modeling workflow looks like. We take the preprocessed data for power infrastructure, simplify the topology while preserving the electrical properties of the original power grid, then cluster it to make the problem tractable. The next point is the most challenging from the perspective of open source, because open solvers are still outperformed by commercial solutions; there is some room for improvement here, and we are collaborating with developers of open solvers to improve the situation. And once the workflow had been established, we had to ensure that it is actually possible to apply our model to every country in the world, in the most literal sense.
It took almost a year of work to introduce all the necessary fixes which account for different special features, and now it is done. There is a link to another report which contains schemes for the power systems of every country of the world, for 193 United Nations countries, and we also have the source code which we have used to produce these schemes as images. If you are interested in modeling any country of the world, please feel free to do that. Now let's look at what we can actually obtain if we apply this approach. This is a net-zero study for Nigeria, which we carried out in the course of developing the model as a kind of proof of concept, and from which we learned some lessons. The most interesting output of this study has been that a net-zero power system for Nigeria can actually be a little cheaper compared with the status quo. Admittedly, we haven't accounted properly for the uncertainties which exist in energy demand for Nigeria, and this work should certainly be continued and applied to every country of the African continent, but that is what it looks like, and it may be helpful to shift a paradigm, which is what this is all about. Next is a study which has been done in collaboration between PyPSA-Earth, Open Energy Transition and the German think tank Agora Energiewende. They have considered the Kazakhstan power system, and the question is whether it is feasible to roll out solar and wind faster compared with Kazakhstan's current national development plans. The results are quite encouraging and are currently being discussed at the policy level. And here is the output of a master's study for Saudi Arabia, a country where 99% of the energy mix relies on fossil fuels. A study which the author has done using PyPSA-Earth has shown that wind and solar can actually have quite a place in the power system of Saudi Arabia, and it isn't so expensive as could be
expected. That is a case where data accessibility and availability is a big issue, so these results are quite preliminary, because more advanced optimization methods are needed to account for this uncertainty, and also to account for the whole transformation pathway. But what is important, what is the impact of this study, is translating the conversation about possible futures for fossil-fuel-reliant countries from a purely hypothetical level to a level of numbers. Next is a case for Bolivia, where the networks of South America are considered, and that is a region where the OpenStreetMap data are of not-so-good quality, so quite some tricks were needed to restore the topology. The resulting model has been successfully validated for dispatch on the national level, so it works even if you don't have data of such excellent quality in OpenStreetMap. And here is a case for Malaysia, where we have considered decarbonization of industry. In Malaysia the local feature is that the renewable potential is not so excellent, so we have shown that it is basically possible to decarbonize one branch of the energy sector, but if we speak about the whole national economy, it looks like it makes absolute sense to include in the modeling, in the discussion, not only traditional onshore wind, offshore wind and photovoltaics, but also something more exotic like floating solar, or perhaps to consider cross-country interconnections. And last but not least, community is an essential part of the whole story. We have different channels of communication, and it is essential for us now to build a global community. As we have seen, there are some countries of the world where a lot of modeling evidence is available, where the efforts of researchers and developers are focused, but the energy transition is a global thing, and if we want it to work, we need to provide tools and we
need to involve people around the whole world. And we can unfortunately confirm that there is definitely a geographic gap in the free and open source software community; Tobias talked about that during the first day, and now I think we have some understanding of the reasons behind this gap. It is basically quite simple: people in different regions just have different patterns of communication, and that should be accounted for if you want to build an inclusive community. Another part of the story is that many things which we take for granted, like education or even a stable internet connection, cannot be taken for granted in too many parts of the world. But the good news is that those problems which cannot be solved alone can be perfectly solved if we join efforts, and we are doing it, we are solving them. We still have a lot to do. There are research tasks, and there are validation tasks, because we can build a power system model for every country of the world, but it would be nice to understand how close we are to reality: what are the modeling errors for each of the components, for the power grid model, for installed capacity, and how far we are from reality in demand profiles. That validation task is huge; if you're interested in joining, please feel absolutely free, we would be happy to accommodate you. Another big task is to increase usability, in particular conda environments and version conflicts inside our whole Pythonic stack; that is still a big question, and we would be very happy to improve it somehow. Another part relates to capacity building, to improvement of documentation and to spreading the word, spreading knowledge. So again, we are very happy to accommodate any suggestions, and we are inviting contributions. If you are interested, please do not hesitate to ping us using any of our communication channels. So just a reminder that the energy transition is
a global thing and can be tackled effectively only together. Thank you very much, and I am very happy to take your questions. What's the role of Earth observation for these models? Do you use satellite data to track transmission lines, or look for wind turbines or solar cells, or do you just use official data sets for your modeling? Thank you. We do not use satellite observations directly. For the power grid we are using OpenStreetMap data only; it would be great to supplement them with satellite images, and we had a team which was focused on adding satellite data to OpenStreetMap, but this team is currently not very active. That would definitely work, but we just don't have the capacity to do it right now, though we would be happy to revive it. As for installed capacities, we are using a fusion, a merging, of a number of open data sets on power plants. I am not sure if satellite observations have been used in any of these data sets, but at least we don't do satellite processing ourselves yet and do not use them directly. I also agree with you that it would be a very interesting idea, and it would also be a perfect academic topic. There are some countries which really don't want to collaborate, like North Korea or other countries where we don't get any data. Well, to answer that directly, we have data for North Korea, but we would be very careful about using them, because if you are modeling specific countries, I would be very much concerned about the safety of people who are affiliated with those countries. That also goes for China, for example, because for China there are some local regulations which basically forbid going into too much detail about the power system for people who are not approved by the national government. So I would be very careful about delicate areas of the world, but technically, yes, it
is possible. So my feeling is that the correct approach would be to try to build collaboration in a more or less safe way, providing tools to people who are safe using those tools. For example, if there is some group in China who are approved by the national authorities as experts in power systems, as people whom they trust, then we may provide the tool and support them in using it in the right way. I also agree that it is a complex question and it may get a little bit complicated. First let me remark that Agora Energiewende is a very good name in Germany, so congrats on getting them to use this. And my question now: are you also doing storage, like water reservoirs or millions of distributed batteries? Well, I agree that storage is one of the key questions when we are speaking about the energy transition, and we include a number of different storage technologies; that is one of the key points of the energy transition, and we are able to capture them. If you're interested, please feel free to investigate the details; we would be very happy to get your feedback, suggestions and contributions if you see that something can be improved. Actually, we have a huge pull request which should provide an interface to a big list of different storage technologies, and it would be perfect if you could revive it. I was just interested: I've got a friend who's a researcher who's doing geothermal in Nigeria. Do you have geothermal resources in there as well? Okay, all right, good, thank you. Yes, we have geothermal, and we have quite a recent request from Kenya, where people are interested in including geothermal in a more sophisticated way. Thank you.
Carbon measurement and energy attribution for processes and hardware devices in the Linux kernel
All right, everyone. I hope the mic is working. It's great to be here. This is my first FOSDEM, by the way. And I'm very happy to talk to you all about carbon measurement and energy attribution for processes and hardware devices in Linux. My name's Aditya, but you can call me Adi; that's the first three letters. I'm a grad student, and yeah, that's my contact. I'm always very happy to talk to people before, during and after my talk. Please reach out. Please. I would love to hear from you. So, a bit of background. I'm a graduate student at ETH Zürich in Switzerland, and I do research at the intersection of computer architecture and operating systems. I love this stuff very much. Great. What do we want to talk about? Let's get a bit of brief background to bring everyone onto the same page. Now, when we talk about energy sources in computing systems, you have a bunch of options. You can have direct input from DC, from USB. You can have battery-powered systems. And if you're really exotic, you can even have energy-harvesting devices. Okay? Now, we want to use the maximum... no, I'm sorry, we want to use the minimum amount of energy to perform our task. Why do we want to use the minimum amount of energy? Because energy consumption correlates with battery drain, and battery capacity is a significant design constraint for your consumers. Okay? All of us have cell phones; we have the recent buzz around the Apple Vision Pro and AR devices. These devices are significantly restricted by the amount of battery capacity. So we want to minimize the energy that we use to get the job done. Okay? Now, what is the problem here? What do we want to solve? Let's flesh it out. Energy consumption is defined as power times latency. Power is determined by your hardware. Latency is determined by your software. Okay? Now, how do we measure this? How do we get this data? Programmers often measure latency using well-established tools.
I'm guessing many of you are familiar with Linux perf, or you've timed your own software using wall-clock time or CPU clock cycles, right? These are well-established metrics and well-established tools to quantify your latency. Now, what if I ask you: do you know of any tools to calculate your application's energy? What comes to mind when I pose this question to you? How would you calculate your application's energy consumption? You would say, okay, Adi, I know, this is very simple, right? Energy is power times latency; we just talked about this. I'll get the power from the CPU. My CPU has this magical interface called RAPL, which stands for Running Average Power Limit. I'm going to read the value and, voila, my CPU says 15 watts right now. Great. Then I'll time my application, and my application turns out to take, let's say, five milliseconds. Okay? We put these values into the formula and, great, we have 75 mJ of energy consumption. Done, let's go home. Unfortunately, this is too simplistic. Let's try to dive into what we missed here, because this does not reflect the ground reality, okay? Now I'm going to deconstruct what happened here and what we missed. The first step: we saw the power was 15 watts. Unfortunately, this model assumes a constant power draw over time, and that is not the case. If you actually look at the system, this is what it looks like: you have these valleys and you have these peaks. And if you measure your power at the wrong time, you will end up with a significantly different number than what you should have. On the x-axis you have time; on the y-axis you have the power for the CPU. And power consumption is not constant, so this assumption of a flat power draw is incorrect. Second, we got the power value from RAPL, remember, Running Average Power Limit? It turns out that RAPL is only available on Intel, or sometimes on AMD.
ARM, for example, has a very, very different interface to report power. I would love to share a story. I was doing energy profiling on a server-class system back at university, and I said, oh, I've built this great infrastructure on my Intel platform, let me just run it on ARM and see what happens. And the moment I ran it on ARM, Linux perf said: I'm sorry, I don't recognize the CPU, I can't give you any numbers. And it just crashed, okay? So all of these interfaces are really different, and you need a significant amount of engineering to make sense of them across different platforms. The second limitation is that we do not have uniform interfaces or formats to measure power reliably. All right. Let's try to go deeper, let's try to get closer to the ground truth. Our model got the power value from the CPU. What about the other devices? I'm right now broadcasting from this device and... oh my God, I'm sorry for this. I hope not. Give me a sec. Beautiful. Beautiful. Okay, okay. So, back to the presentation. We were talking about the impact of devices like the screen, the memory, the network cards, right? We don't know how to quantify them. So we did a lot of experiments, and it turns out that these devices very often dominate your power consumption, and our findings are also corroborated by some similar observations at Google. What Google did was: they were trying to optimize their data centers and did a huge amount of profiling on their server-class CPUs, the heaviest CPUs that you can get on the market. And they observed that DRAM dominates their power, because DRAM is burning power all the time. The CPU turns on and off, but DRAM you cannot turn off; remember, it is volatile. So you need to break out of the mindset that the CPU is the be-all and end-all. Okay. So let me try to summarize everything: we are inaccurately calculating only a fraction of the system's actual energy consumption.
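For the Intel/AMD case mentioned above, a better starting point than sampling instantaneous power is RAPL's cumulative energy counter, which Linux exposes through the powercap sysfs interface. A minimal sketch, assuming the standard /sys/class/powercap/intel-rapl:0 package domain (the path and permissions vary by machine; the helper names are mine):

```python
import time
from pathlib import Path

# Package-0 RAPL domain via the Linux powercap interface (Intel/AMD only).
RAPL = Path("/sys/class/powercap/intel-rapl:0")

def read_energy_uj(domain: Path = RAPL) -> int:
    """Cumulative energy counter in microjoules."""
    return int((domain / "energy_uj").read_text())

def energy_delta_uj(before: int, after: int, max_range_uj: int) -> int:
    """Counter delta, handling the wraparound of RAPL's bounded counter."""
    if after >= before:
        return after - before
    return max_range_uj - before + after

def measure(fn, domain: Path = RAPL):
    """Energy (J) and wall time (s) of fn(), integrated by the hardware
    counter over the whole run instead of one instantaneous power sample."""
    max_range = int((domain / "max_energy_range_uj").read_text())
    e0, t0 = read_energy_uj(domain), time.perf_counter()
    fn()
    e1, t1 = read_energy_uj(domain), time.perf_counter()
    return energy_delta_uj(e0, e1, max_range) / 1e6, t1 - t0
```

Because the counter accumulates energy, the peaks and valleys of the power trace are already integrated for you, which sidesteps the "measured power at the wrong time" problem; the portability problem, of course, remains.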
And I would love to put this as a quote for you. This is not from me, but I like it very much: we cannot improve what we cannot measure. So we first need to understand how to measure energy correctly, and that's what my project is all about. That's what I love to do. What is the goal of what I'm trying to do here? My goal is to develop a framework to accurately and reliably measure the energy consumption of processes in the kernel. All right, so all of us can get this data. What is the use for it? Because data without a use does not get used. Okay. Once we have this data, we want to report it to end users in an easy-to-understand format: end users should be able to make sense of the number, right? What does this number mean for me? We want to report it to programmers to improve their actionability, to enable them to change their code to move the numbers. And we want to report it to system designers to enable them to iterate much faster over low-energy, low-carbon designs. Okay. So let's try to dive deeper. What do we mean by a framework? What are we trying to do? Let's flesh it out. A framework comprises models and tools. Let's break down these two words. A power model is how we think about a device: when I say that I want to measure power, the power model is the mental model that I will use to get the value. Okay. And it turns out that these power models are often very poorly understood for a number of devices. For example, DRAM power models are often not available to the public, not available to academia; they're, let's say, a proprietary trade secret, but don't quote me on that. Okay, and once we have these power models, we can build tools which accurately calculate power based on these models. A tool that I would like to mention is the nvidia-smi utility, which allows you to read the power of a GPU. It's a good tool.
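As an example of the kind of tool the talk mentions, nvidia-smi can report the board's instantaneous power draw in watts. A hedged sketch wrapping it from Python; it assumes an NVIDIA driver is installed, and returns None rather than failing when it isn't:

```python
import shutil
import subprocess

def gpu_power_draw_watts():
    """Instantaneous GPU board power via nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver/tooling on this machine
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=5,
        ).stdout.strip()
        return float(out.splitlines()[0])  # first GPU only, in watts
    except (subprocess.SubprocessError, ValueError, IndexError):
        return None
```

Note this is exactly the kind of vendor-specific, point-in-time power reading the talk goes on to criticise: useful as a tool, but it is one sample of power, not energy.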
Okay, so let's pull it all together. What I would like you to take away is that we need accurate models, first and foremost, and second, reliable tools, to calculate energy consumption correctly. So we defined our problem and we defined the goalpost where we want to go, and now let's see how we are going to get from point A to point B. Great. Before I dive into the mechanism, I would like to bring up what has been done before. All of us have been here for the entire day, right? We love energy and we love efficiency. If this is such an important problem, why didn't people solve it before? People did try to solve it before, and I'm going to describe to you right now what they did and why that is insufficient, why we need to do better. Okay, on the screen you can see a screenshot of a tool from Intel known as PowerTOP. You can see the first column, which reports a power estimate, and on the right side you have the description of the particular device, interrupt or process for which this power estimate is calculated. Now, what are the challenges? Well, first of all, we believe in energy, and power is a discrete-time event. What do we mean by a discrete-time event? Let's break this down. If you have a graph, power is a single point on that graph; energy is the area under that graph, okay? We want to calculate energy, because energy is what correlates to your battery drain: your battery supplies you energy, and power is just one particular instant in time. Second, PowerTOP has a vendor-specific implementation; I hope that is clear. Third, what is the actionability? I just showed you this data, I just showed you the screenshot. It says, oh, my display backlight is taking 350 milliwatts; great, this particular process is consuming 292 milliwatts; okay, fine. The question that comes to mind is: what is the use for me? What is the actionability of this data for the programmer?
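The power-versus-energy distinction above (a point on the graph versus the area under it) can be sketched as a simple numeric integration of sampled power. This is my illustration, not the talk's code:

```python
def energy_joules(samples):
    """Energy is the area under the power curve: trapezoidal integration of
    (time_seconds, watts) samples, rather than a single power reading."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2.0 * (t1 - t0)  # trapezoid between two samples
    return total
```

With a spiky power trace, one sample taken at a peak or a valley can be far off, while the integral over the interval is what the battery actually delivered.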
How does the programmer change the code to move this number? I don't know. How do I fix something that I don't know how to fix? That is the gap that I would like to bridge, right? So let me dive into the guts of the system. This is the system design. On the screen you can see an elementary flow chart which summarizes the system at a very high level, and this is a regression-based system. A regression-based system has two inputs: you have the parameters, and you have the inputs to the parameters. First we calculate the parameters, and then we calculate the inputs to the parameters. Great, we have time, so I will go into details now; please bear with me. Okay, let's first look into the parameters. How do we determine the regression model's parameters? There's an algorithm for this. First of all, we turn off everything, everything that we can turn off in the system, and we measure the baseline draw. This is what we refer to as minimizing the system load. Then we pick each device one by one: we isolate the impact of the device on the baseline load, and we measure the drain multiple times. So, let's say that I turn off everything and then I turn on just the screen, okay? And I measure the difference between these two values; the difference is the impact of the screen on my baseline. And then I also do one more thing: I sweep the screen. I change the brightness of the screen from the minimum to the maximum, because obviously the minimum brightness is going to have a different power draw than the maximum brightness, right? I hope this makes sense. Are you guys still with me? Okay, so this was just an example, but what we're trying to do is quantify the impact of each device on the baseline. Now, I would love to give a metaphor to help explain this better. Imagine that you have a water tank, and in this water tank there's one single input and there are 10,000 tiny outputs.
And the problem that you're trying to solve is: what is the rate for each of the output pipes? You cannot measure it directly. So you have these 10,000 outputs which turn on and off on their own at any time, and you don't have levers to control them. What you're trying to figure out is the drain rate for each of the output pipes. That is essentially the problem that you're trying to solve. So what you do is you turn off all the outputs, okay? You turn off all the outputs and you turn on one single output, and then you see the difference in the tank level before and after turning it on, okay? And that is essentially what we call isolation, or, well, in academic terms it's also sometimes known as an ablation study. We try to isolate the device and measure the impact. Next, we repeat this process for all the pipes in the system and we try to get a reasonable estimate of the impact of each pipe. Great. So that was the first step, the device-specific measurements. The second step would be the kernel process accounting step. This gives the inputs to the regression model. So we have the parameters that we got from the first step, and now we need the inputs. Now how do we determine these inputs? Sorry, did I hear a question? Okay, great. Right, how do we determine the inputs? We isolate the impact of each process. We identify how much CPU time the process used, how much network activity there was, the screen wakeups, file handles, memory usage, and we put all of these numbers together into the model. And this is what gives us a predicted energy consumption value for that process. Okay, so what are the challenges? This seems very simple. This seems like, okay, you've done this work, but what did you not tell us? Here comes the part that I did not tell you. First: this is an estimated value. It is not the reality, and it is really hard to find out the reality. There's a very famous line in the machine learning community.
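The two steps just described can be sketched as a tiny linear model. All coefficient values and counter names below are invented for illustration: the isolation measurements yield a power coefficient per device, the kernel accounting yields each process's per-device utilization, and the prediction is their dot product.

```python
# Hypothetical per-device coefficients (watts at full utilization),
# as obtained from the isolation/ablation measurements described above.
coeff_w = {"cpu": 12.0, "screen": 3.5, "network": 1.5}

def process_power_w(utilization):
    """Predict a process's power draw (W) from its per-device utilization
    fractions (0.0 to 1.0) -- a simple linear regression model."""
    return sum(coeff_w[dev] * u for dev, u in utilization.items())

# A hypothetical process using 25% CPU and a little network I/O:
print(process_power_w({"cpu": 0.25, "network": 0.1}))  # ~3.15 W
```

Multiplying that predicted power by the time the process was active gives the per-process energy estimate the talk aims at.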
It goes: all models are wrong, but some are useful. So my goal here is to build a useful model that I hope is less wrong. I would love to make it perfect, but unfortunately we cannot make it perfect. But yes, I would love to make a useful model first. Second, there's a bit of a catch-22 situation here, if you observe it. What is the catch-22? I am running a measurement process. There's a process that is doing measurement on my system, okay? That is also going to create load. So there's going to be a skew in the values that I get, because of my measurement. And the more accurate I want it to be, the more skew it is going to create. So we want to understand what is the right amount of accuracy that is still useful while also minimizing the bias. This is very challenging, right? Because it is different for every system. And that's a problem that I'm still struggling to solve. I would love to get your inputs if you have any. Great, next challenge. There are millions of devices out there, and these millions of devices have billions of ICs inside them. Very often we don't even have the data sheets for these ICs to correlate the values that we see. The estimates that we get can range across two to three orders of magnitude. One device can say, oh, I use one microjoule, and a second one can say, oh, I use 10 milliwatts, and those numbers don't make sense together. Those numbers really blow you away. So how do we maintain our sanity in the face of the variance that we see here? And one more challenge: assume you say, oh, let the users supply this data, let me collect it on a centralized farm and then try to make sense of it. Should the users share this data? Would users share their device usage data with you and allow you to put it on a centralized server? Who will own that data? Because there's enormous value in it. So this too is something I would love to get your inputs on.
One more challenge here would be validation. So we got a value that we estimated. How do we make sure this value is as close as possible to the ground truth? In an ideal world, I would have infinite money and I would go to every computer in this world, take a probe, put it next to the CPU and say, oh, this says 17.5 watts and my tool says 17.5 watts. Great job. Let's go. I cannot do that, because I don't have that much time. Okay, so we want to minimize the difference between the ground truth and what we see in the tool. There's a significant challenge in making sure that what we see is the reality. Right? Remember, there's accuracy, there's precision and there's correctness. And this whole trifecta comes together and makes this a very difficult tool to get right. But still, I believe it's going to be great. I'm very happy to work on it. Great. So once we have the energy consumption, how do we link it to the carbon emissions? We just saw that we can calculate energy consumption as power multiplied by time. The carbon footprint can be calculated by multiplying this number by the composition of the energy: where did the energy come from that you used to power the device you were running? And this composition depends on multiple factors. It can include the geography, the time of availability, the cost of generation of that energy. Right? So fortunately, there are good tools and libraries out there which can simplify this problem for you. So the energy composition is, let's say, something that I believe people will solve faster than I can solve the measurement side. That is why I would love to focus on the latter. Great. All done. Let's get back to the good stuff. How is this going to look? How is this going to make your life better? If you're an end user, I would love to ship you an application like this, an application which tells you how much energy your Inkscape usage consumed, how much energy your screen was dissipating.
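Going back a step, the energy-to-carbon link just described is simply a multiplication by the grid's carbon intensity, which varies by geography and by hour. The intensity figures below are illustrative, not real grid data:

```python
def carbon_grams(energy_kwh, intensity_g_per_kwh):
    """Carbon footprint (gCO2e) of hourly energy use (kWh), given the
    grid's hourly carbon intensity (gCO2e/kWh) for the same hours."""
    return sum(e * i for e, i in zip(energy_kwh, intensity_g_per_kwh))

# Two hypothetical hours: 0.1 kWh on a clean grid, 0.1 kWh on a dirty one.
print(carbon_grams([0.1, 0.1], [50.0, 400.0]))  # ~45.0 gCO2e
```

The same energy figure can therefore carry very different carbon costs depending on when and where it was drawn, which is exactly why the talk treats the composition as a separate input.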
So as an end user, you can remember to turn off Inkscape when you're not using it. Or you can figure out, oh, I need to deliver a presentation to so many people in five minutes, I'd better save my battery, otherwise I'll be in deep trouble. So it's for those use cases where you want to maximize your battery life as an end user. For programmers, we want to expose an API that enables them to take action. We want to indicate the devices and the code regions which consume the maximum amount of power and enable the programmers to change it, to modify it, to fix it. So actionability is the primary concern for programmers. In an ideal world, I would love to have direct suggestions in the IDE that tell the programmer: this code is going to burn this much carbon, you'd better change it. And for the system designers, we want to enable them to iterate on their designs faster. We want to enable system designers to discover designs which are really low on energy, really high on performance, really high on carbon efficiency. There's typically a design space that designers explore, and we want to enable them to explore that design space faster. That would be the end goal for this tool. Great. So what is the takeaway from this talk? There are two things that I would love for you to take away; if you forget everything else, okay, just remember these two things and I'd be very happy. First, we cannot improve what we cannot measure. We must measure correctly, okay, to improve things. And second, we need to break out of the CPU mindset, okay? Non-CPU system components can dominate your power. Please remember that. Please remember these two things. And the next time I see you, please come say hi and I'll buy you lunch. Okay. Great. Thank you very much for listening to me. It's great to be here. It's great to talk to you. Please be in touch.
Please reach out, and, oh boy, we're out of time, but I'm very, very happy to get your questions. Come talk to me. There's still like two minutes for questions. So if there are any questions, please. Go for it. There's one in the back. Yes. So hello, and thank you for this presentation. I hope you're not going to hate me for this question, because I'm primarily an infrastructure guy. One thing I was always concerned about is redundancy, like having everything at scale twice, so if one dies... Is this part of your thinking and scope? Or does the question make sense? I'm really sorry, I don't fully understand what you mean by redundancy. I mean, sure, I understand, but redundancy is trying to solve the problem of fault tolerance, okay? It's not trying to solve the problem of efficiency. I'm trying to solve the problem of efficiency. So redundancy is an orthogonal concern to mine. Does that make sense? Yeah. Thank you. But thank you for the question, I really appreciate questions. Yes. Did you try to measure the overhead of monitoring the energy consumption? Yes, that's a great question. No, we did not. On one side, I'm afraid it's going to be huge. On the other side, I don't know. It's like an infinite recursion, you know: how can I measure the impact of my tool itself? The tool is what measures the impact. But how do I measure the impact of the tool? I don't know. I hope, I would love to believe that. Yes, that's what I want to believe. Yes, please. Thank you very much. It was great to be here. Thank you.
Advanced Linux Power Management Evaluation using Perf
So, hello. Let's start. I think it's 10 minutes in all, so I will hurry up here. In the previous slides and presentations we saw the overall picture somehow, the grid thing. And in the last talk we dug into one system. In this talk we also want to dig a little bit more into the details of how to analyze the power consumption. What we saw in the previous presentation, there was an... Sure, sure, sorry. What we saw in the last presentation was the power consumption of one system, a little bit similar to PowerTop: we saw that this task consumes this much, and so on. But the question you often have after seeing this data is: how can you optimize your load, for your server, for your embedded product? What are the causes why the application runs too often, why the system wakes up too often and cannot go into the deep sleep states, the C-states, and such things, right? In the end, it's the hardware that consumes the power, and you can save power if you put things into deep sleep states or lower the frequency. And this is really important to save energy. What we did in the past was write scripts to optimize your workload and find the causes why an application runs too often, right? It runs too often and cannot go into deep sleep states. This is important for power optimization. And what I present in the next couple of slides is an application that helps you to optimize your workload and makes this visible. So what we are talking about is a perf script, an extension to perf. It's not yet mainline. I will send this script to Arnaldo on the mailing list and hopefully it gets merged quickly. But when it's merged, then it's really easy to use. It's just an apt-get install and everything works out of the box. And also for Yocto and Buildroot it's really easy to use these things afterwards. It's also important that it can be used in embedded systems and everywhere. How does it work?
It's just a record call where you record your workload after the workload separator, like always. And here I record for 60 seconds, one minute, a workload on all the CPUs. So you record everything for one minute, fine. And then you start the report, the power analyzer, and it has different modes. Because I have just 10 minutes, I show one mode here, but there are different modes for different optimizations and analyses. So what are the modes? There are several modes that can be activated and used, and you just activate the mode you want to focus on and dig into the details, right? This is how things work. What's also important: every mode has different trace points in the kernel. So usually you record only the trace points you require for the particular analysis, because if you record everything, every trace point, you get a huge amount of data. So normally you limit the data. How does it work? So there's the perf script. As always, you record your data, as we saw, for one minute here, and it records all the trace points that are required. On the other hand, you can also record only the data that is required for your analysis; it is documented which trace points are required. Then you have the data, and then you start the script, the report, and it outputs all the analyses. Here I start it with the timer mode, to see what the timer events are. And because there's a lot of data coming out of this, you can usually use this data directly and see: here's something that's not working well, too much timer interaction, for example. But what you can also do is some post-processing, to create graphs or to filter things afterwards, because it's a lot of data. And here, just as a showcase, this is one image that was created. You see the time and you see a workload, on a logarithmic scale here: how many timers are firing over time.
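The post-processing step just mentioned could look like this. The column names are an assumption for illustration, not the script's actual output schema: given rows of (timestamp, task, event), count timer wakeups per task so the noisiest wakeup sources stand out.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV as the analyzer might emit it (schema assumed).
data = """timestamp,task,event
0.001,kitty,timer_expire
0.002,kworker/0:1,timer_expire
0.003,kitty,timer_expire
"""

def wakeups_per_task(csv_text):
    """Count timer events per task, worst offenders first via most_common()."""
    return Counter(row["task"] for row in csv.DictReader(io.StringIO(csv_text)))

print(wakeups_per_task(data).most_common())  # [('kitty', 2), ('kworker/0:1', 1)]
```

The real script emits far more event types; the same grouping idea applies to any of its modes.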
Timers are one cause that triggers the CPU from a deep C-state into the active C0 state. Timers are not that good. Often, if you begin analyzing the C-states on your desktop, you see, here, I think this is kitty, the terminal I use: it has wake-ups all the time. Why are there wake-ups here? And then you often see some buggy applications, clipboard things, that are constantly triggering your system, and this prevents going into a deep C-state. These are the causes that prevent it. So it's really important. And here you see a workload I started, and you see all the timers that are correlated with starting the workload. You see a lot of kernel timers, and then you can start optimizing things. This was just a focus on the timer events, but there are a lot of other events as well. These are other subsequent analyses, also just for the timer events. For a tickless system, normally, if there is no load, the kernel can really go into a deep sleep state, and then it shuts down the timer tick altogether. But does it really stop the timer tick? You can see it here in these images, and you can analyze and optimize things. What are the kernel timers that trigger your system? If you look at the graphs a little bit, the resolution is not that good, but you see that there are timer ticks all the time, and the network interrupt timers are working here, and you can optimize this once you see it and know what's happening. What we see here in this graph is the timers that are firing for each particular task, so you can optimize your own task as well. How many timers are there? I often see in production environments that timers fire all the time, uncorrelated. What you can also do: there are system calls for the timer granularity, the timer slack, so that the kernel can optimize things.
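One concrete knob for timer granularity on Linux is the per-task timer slack, settable with the prctl(2) syscall: a process that tolerates late timers can raise it so the kernel may coalesce its wakeups. A minimal sketch via ctypes (Linux only; the constants are from <linux/prctl.h>):

```python
import ctypes

PR_SET_TIMERSLACK, PR_GET_TIMERSLACK = 29, 30  # from <linux/prctl.h>
libc = ctypes.CDLL(None, use_errno=True)

def set_timer_slack_ns(ns):
    """Allow the kernel to defer this task's timers by up to `ns`
    nanoseconds, so expiries can be coalesced into fewer wakeups."""
    if libc.prctl(PR_SET_TIMERSLACK, ns, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_TIMERSLACK) failed")

set_timer_slack_ns(2_000_000)  # tolerate 2 ms of slack
print(libc.prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0))  # prints the slack just set
```

The default slack is 50 microseconds; raising it for latency-tolerant daemons is exactly the kind of fix the timer analysis can point you toward.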
For example, since the introduction of the hrtimers, the high-resolution timers, timers are not coarsely spaced anymore: they are triggered at exactly a particular moment in time. With a simple system knob you can also say: oh no, it's not so important that the timer is triggered exactly at this time, so that the kernel may align timers at a particular time and allow a deeper sleep state again. This knowledge can be combined with what you see here, for example: where are the timers? Right, CPU 0 is somehow special, the timers are there. Can you move tasks to CPU 1, for example, so that the other CPU cores can go into a deeper sleep state, right? All this is important to do an optimization. There are some general options. Some are not always required and can be turned on with a particular flag. There's a CPU option: often you want the analysis for a particular CPU, so you can limit the data. And there's a file-out option: if you want to do post-processing, as we saw in the images, the data is not printed on standard out, it's written to a file, and you can use it from there. The data is also written in a sanitized way, so that you can just use pandas to read the CSV data; for the post-processing it's really easy. There are multiple modules provided. This was just a sneak peek at the timer module, but there are a lot of other modules as well that you can use later on. Due to the time limit, I just highlighted this timer module. But one last sneak peek here, for example, is the governor. The governor is the component within the kernel that decides and commands the C-states, the deep sleep states. You can select a different governor; it's normally the menu governor, but there are other governors as well. And here you see how often which C-state is commanded. And what is also analyzed is: was this good or not?
Because the kernel is doing guesswork, right? It thinks: the next timer will trigger in 10 milliseconds, there's a workload coming, so it puts the processor in a particular C-state. But was this the right decision, or was the sleep too shallow? This is also important. And here you can debug the governor. A student of mine also discovered a bug in the AMD code: for one particular C1 state, it switched all the time to the wrong state. I think the fix will be released in the next couple of weeks. So it's really also important for you to see: does the governor do the right job here? This is visible with another analysis, and there are multiple other post-processing steps. And yeah, that's all. I hope this will be integrated in mainline in the next couple of weeks. But if you want, you can use this kernel tree and this particular branch. It's just a perf script, really easy to use out of tree as well. The post-processing scripts cannot be shipped with the kernel; that's not how the kernel works. They are Python scripts and will always be available here, based on this. And in the end, well documented, hopefully. So yeah, that's all. Questions? Yeah, perfect. Questions. I always get a question. Processor coverage: just x86? What coverage have you got? I mean, look, I've got an M1 Apple thing. Would I be able to run it there if I run Linux on that hardware? Yeah, this script will work on ARM and on x86, for Intel and AMD. There are differences in the P-state tracking, because since the introduction of Skylake and HWP the P-states are managed in hardware, so some of it will not be visible there, but it will be visible on ARM CPUs. So some things will work and some will not, but it's just Linux, on all the major architectures. The more software-side analyses, of scheduling events for example, will always run, but the more hardware-specific analyses may not work. But yeah.
Just a follow-up to the previous question. Will it work for Graviton and these kinds of proprietary cloud processors? Yeah, it would generally run there. As long as it at least runs Linux on ARM, it will just be the same. So no difference there. Another question? If you want, later on we can install the script on your PC and test it. Hi, Aaron. That's just a follow-up on the previous question. There's actually an extra library, libopencsd, which gives you a whole lot of extra stuff on most ARM cores, not necessarily Apple's and Amazon's ARM cores, but any that actually come from standard designs. So a standard design here. One goal was that it runs everywhere, right? It must be general. And we deliberately skipped going into the eBPF world. There are advantages to doing things in the kernel, aggregation in the kernel, but this sometimes has problems on specific ARM SoCs and embedded products. So the design was really that it runs everywhere, it's easy to use and generally available. Working with eBPF, doing things in the kernel and filtering unwanted data out there, has some advantages, right? But then you need a toolchain on the embedded product, so it's not that great. Everything I told you was somehow the idea behind the design: no extra libraries, keep it a minimal bit of stuff which works everywhere. If you want to do more, and often you want to do more if you analyze your particular task, how the scheduling behavior is, you need more custom scripting as well. But that is not here. I think there is a lot of data already easily available. But if you want to do more, you need more scripting and things like that, and libraries you want to use. Sure, it's a compromise.
Maybe a question from me. Can you give us a few insights about the community? How many developers, how many people contribute? Currently I'm the main developer. But in the end it's just a Python script, so it's not really rocket science. And there are students also working on this, helping and looking at the details. But yeah, it's not that magic. It's just putting things together and making them easily usable. The trace points, Steven Rostedt and all the infrastructure that the kernel provides are the main drivers that make this possible, right? It's just a script on top. Thank you so much.
How can Open-Source help the Wind Power industry?
Hi everyone. Can everybody hear me well from the back? Okay. Good. So, to introduce myself: I work for ZF Wind Power. It's not a company doing software; what we are making are turbines. Okay? We produce pieces that go into producing wind power. But I'm part of the digital team. Why do we have digitalization? We'll talk about that later on. But to start, let me tell a bit about my story with wind. I didn't start by just working with turbines. My love for wind dates back much earlier, when I used to live in beautiful Marseille. Marseille, a nice town for those who have been there. Less than six hours from Brussels, and it has a great resource: wind. At the time, I used to sail. You come to the Vieux-Port, the city center of Marseille, have your pastis, and you will see beautiful boats. You go out, the sea is nice. What has this to do with the energy transition? Well, historically, Marseille's main industry was fossil fuels, oil. We have a place called Fos-sur-Mer. A great place to sail: it has great wind, some of the best in Europe. Just in front of the oil factories, what do we have now? The latest technology of turbines, floating turbines, meaning devices that you can put out at sea where it is very deep, a very recent technology. Those are, if not the most modern, among the most modern in France. The power is 8.4 megawatts each, and there are three of them now. They can already power a small city like Martigues, which is just close by, for those familiar with the area. So from the love for wind, from sailing, now I can see that this connects to the energy transition. Right? Is it only Marseille? No. What happens in Marseille does not stay in Marseille. That holds for energy too. These are graphs that I created myself from data from Kaggle, so open data. What do you see there? Well, you can see that around the world we have some big production of renewables in general. And I guess there will be somebody here who has been to or is from South America.
South America is a place where you have a strong input from hydropower. But to stay here in Europe: wind is a big deal, and it's increasing. If we look at countries like Denmark, we are already at almost half of the national energy production coming from wind. It's a combination of good wind, because up north it is really good, and of the politics willing it. I told you about Marseille: there is good wind, but France is not even close to that level. That means wind is definitely one energy resource that we will use more and more. Very important. And it produces at a big scale: we already have 25 megawatts in Marseille. It's huge, with only three turbines. All good? All great? Well, we have some problems in general with big installations. Things can go very wrong. Very wrong. And this is not just a matter of changing a small component. Let's say that a turbine, even on land, not as big as the ones at Marseille, gets faulty. It has to be stopped. What happens? Notification processing: it has to go out, we need to tell someone, two days. Then there will be an inspection: I get the team to inspect the fault, two weeks. And if I have the replacement component locally, it will be six weeks to replace it, but maybe much longer; the component may even be on the other side of the world. We don't know. And then the repairs, a couple more weeks. So for a turbine of 3.5 megawatts, not the latest technology like I showed in Marseille, the whole intervention, if you are lucky, takes let's say 10 weeks, and that is lots of money: 125k, at least. A lot. How do we tackle this problem? By forecasting and by optimization of spare parts, so getting the spare parts in-house in advance, and by starting as quickly as possible: a faster return to operation. And this is done by treating data. What do we do? Ideally, we monitor and predict. So there is an alert. It has to go to the cloud. I have to classify the failure already in the cloud; I have to know what it is about.
I have to prescribe a solution. I have to find the spare parts. And I have to forecast when I need to apply my solution. What, where, when. That means: alert, data are collected and analyzed, and compared to historical data. The graph that you see there looks like an exponential curve. I won't get into the Weibull modeling; this is not the moment for mathematics, but it is a model that predicts well when cumulative failures will occur, subject to a certain type of failure. As I said, I won't get too technical here, but we can discuss it later on. How do we do that? Well, wind turbine data come in, production data come in. So this is from us, from the people who produce. We get it to the cloud and we do prescriptive maintenance. What does prescriptive mean? Well, let's see quickly. Reactive: I fix the failure after it happens. I have a puncture on my bike, I change the tube. Preventive: I do it regularly, like changing the oil in the car. That's what is typically done also for bigger installations. Predictive: that's what I talked about just now. Prescriptive is AI. Okay, you're familiar with that: AI tells you this is going to happen, please do this. Data analysis with open source software allows more and more sophisticated maintenance. We have just been talking about AI powered by Python and so on; we have seen very good demonstrations earlier on. And what is our digitalization tech stack? To get more specific: I already introduced Python. Then pandas to treat data frames, to treat data, at least at a small scale. Lifelines is an open source package that implements Weibull models; I already had my own version where I added some modules. Docker, Git, of course, since we work as a team. And it goes to Azure DevOps. Notice, I talked about the cloud, I talked about DevOps. DevOps is not just a technology, it's a way of working. That's very important. The technology allows us to work in an agile manner. And that's how we get results.
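The Weibull model behind that failure curve is compact enough to sketch. The parameter values below are invented; in practice a package like the lifelines one mentioned in the talk would fit them from field data. The CDF F(t) = 1 - exp(-(t/scale)^shape) gives the probability a unit has failed by time t, so the expected cumulative failures in a fleet are just N times F(t).

```python
import math

def weibull_cdf(t, scale, shape):
    """Probability that a component has failed by time t (same units as
    `scale`); shape > 1 models wear-out, shape < 1 infant mortality."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def expected_failures(n_units, t, scale, shape):
    """Expected cumulative failures in a fleet of n_units by time t."""
    return n_units * weibull_cdf(t, scale, shape)

# Hypothetical fleet: 100 bearings, characteristic life 10 years, wear-out.
print(round(expected_failures(100, 5.0, scale=10.0, shape=2.0), 1))  # 22.1
```

A curve like this is what lets you size the spare-parts stock before the failures actually arrive.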
So what do we want as a result? Reduced downtime: we want to shorten it. Instead of ten weeks, let's get less. Reduced cost: we need to have a proper stock level, and for that we need predictions. Reduced unplanned maintenance: we don't want the scenario where we have to call the technician and the technician has to come from the other side of the world. And to avoid, of course, consequential damage, by addressing recurring failures. Okay, if I see that a certain bearing is always failing, let's address that. So, why? How do I know that I'm not just talking hot air? Okay, this all sounds great: you use AI, you use open source. Does it work? Well, what did we do? As manufacturers, we went to one of the customers and we proposed a pilot project: we apply our techniques, let's see the results. How did it go? Well, 50% less alert processing, all these alerts. Unplanned field inspections: 60% less. We strongly reduced the lead time to repair, because we could forecast and we had the right parts in stock. And in general, the annual energy production went up by up to 0.5% over the whole park, with all the turbines. This means lots of money. Of course, for corporate reasons, I cannot tell too much, but you can figure it out. To conclude, okay? We can go more technical later if you like. The take-away messages: a fragmented value chain affects wind energy efficiency very badly; don't have a value chain which is all dispersed. But data insights, and very good communication of the data, have great benefits. We reduced the alert processing effort. We have prescriptive maintenance, which allows us to decrease the time to repair, and we increased the overall efficiency. The annual energy production is one of the main KPIs for wind power, but in general for any energy source.
All of this could be achieved with open source software. Okay? I sort of showed the full stack. Finally, it was the DevOps practice, not only the software, that allowed the success of the pilot project. And I guess here we have a lot of people who are familiar with DevOps. Now, I guess I still have a couple of minutes for questions. Anyone? Yes. Hi there. I used to work for Siemens wind power and they had a predictive maintenance team. I'm just wondering, have you found any other companies using your tools, since you've gone open source? Well, I'm not dealing directly with the customers. In general, with the customers, we just propose our solutions and we exchange the data. For example, if Siemens Gamesa has failures, the failures are communicated to us and we can suggest a stock amount for a certain component, for certain turbines. But it's not that we are a software company going out to sell that. You see, it's more like the normal customer relationship when you sell parts. But how can we make, let's say, predictions? How can we interact? How can we serve the customers better, even as a company? Good analysis of data, and for that we use open source.
Energy optimisation: smart home meets smart district
Good afternoon. My name is Rik Barillot. I've been a core member of OpenRemote for a bit more than... louder? Okay, sure. A core member of OpenRemote for a bit more than 12 years now. I'm not the person who was supposed to give this talk, so I'll do my best to work through it. Don't hesitate to come to me afterwards, and I can point you to some of my colleagues who worked on those projects. A bit away? Yeah. Okay. Okay, I'll do my best. And speak louder? Okay. So OpenRemote is a 100% open source IoT platform, so it does whatever you would expect from an IoT platform: talk to the devices, have some logic, and user interfaces. We'll come back to that a bit later. So open source, fully free, available on GitHub, and a community throughout the world that's pretty active. But there are also projects that we work on with companies; that's mainly what the core team does professionally, working on those projects. There are projects in home security or smart cities, typical IoT projects, in more exotic things like smart clothing and architecture, and of course a lot of projects in the energy domain, energy management, but also some linked to other aspects of energy. And we'll go into a bit more detail on the Nottingham city project a bit later. So looking at OpenRemote, what is it? It's mainly a middleware developed in Java. It has a database that is used both for the configuration of the system and for the state of the system: the current values of your sensors, but also all the historical data. It has quite a few connections using standard protocols, so you can connect to gateways or to data feeds, we'll see that later, or to some proprietary hardware. It has a set of user interfaces. You have the standard management user interfaces where you can configure the system, see the values or trigger some actuators. You get Insights, which is a dashboarding kind of application.
But we also have a set of web components, freely available, that you can use to build your own custom application for a given project. And so you have an application that you can access through a browser, or you can embed it into a mobile app, what we call the consoles. And you can also connect to other systems like Grafana or Power BI if you want extra features. Then you have, of course, a mechanism for the logic. We support different types of rules engines, simple ones through the UIs, IFTTT-style, so if-this-then-that, or more advanced features like Groovy scripting if you want to go really deep. There is a set of default services, so building blocks that you can use, for instance, to push notifications to the mobile phones, or to place devices on a map, or to implement optimization services, which is what we'll talk about in a minute. And this is, of course, built with security in mind, so there is a strong identification, authentication and authorization layer in the system. So, coming to energy optimization, we'll talk about two things. First what we call the smart home, but it can very well be a smart office or even an office complex; basically, it's the concept of an island behind a meter, and you have kind of a sole proprietor of the island. And then when you move to the smart district, it's a composition of many islands behind one transformer. The problems are a bit different, but the system is the same. So if you look at the system, you have your renewable energy, so solar and wind. You have the grid, both import and export. You have a battery with charge and discharge, and you have your loads, your consumers, which can also sometimes feed energy back into the system; some electric vehicles can do that. So the goal for the smart home is to optimize either based on the cost, so you want to pay the least amount, or on the environmental footprint, so you want to be as green as possible.
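The simplest rule flavour described above, if-this-then-that, can be pictured as condition/action pairs evaluated against the current attribute state. This is a toy illustration, not OpenRemote's actual rules API; all attribute names are made up.

```python
# Toy if-this-then-that rules engine (illustrative only, not OpenRemote's API).
# State maps attribute names to current sensor/actuator values.
state = {"livingroom.temperature": 22.5, "livingroom.heater": "on"}

# Each rule is a (condition, action) pair over the state.
rules = [
    (lambda s: s["livingroom.temperature"] > 21,          # "if this"
     lambda s: s.update({"livingroom.heater": "off"})),   # "then that"
]

def evaluate(state, rules):
    """Run every rule whose condition currently holds."""
    for condition, action in rules:
        if condition(state):
            action(state)
    return state

print(evaluate(state, rules)["livingroom.heater"])  # off
```

More advanced rules (the Groovy scripting mentioned in the talk) would replace the lambdas with full scripts, but the evaluate-conditions-then-fire-actions loop is the same idea.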
The data that we have to do that: for the renewable energy, we are going to estimate the production based on the peak characteristics of the installation, so how much your solar panels can produce, and on weather data, so we can estimate that. For the grid, we have dynamic tariffs, so people can, for instance, have contracts where they pay a different tariff by the hour or even by the quarter, and so we have the data to know those costs, but there is also a carbon cost associated with the type of energy that is produced. The battery, it's charge and discharge, but there is also a cost, a levelized cost of storage; for instance, if your battery costs 1,000 euro and it can do 1,000 charge-discharge cycles, every charge-discharge cycle is 1 euro, so you need to take that into account when optimizing. And for the loads, we have the past consumption, and we do a weighted exponential average to predict the future consumption from that. So now, what we are trying to optimize, as I said, is minimizing the cost or the carbon footprint based on all this data. And so the system will control what we call the flexible loads: depending on this data, it can decide when to charge or discharge the battery, it can decide when to charge or potentially discharge the electric vehicles, or it can decide to control heavy loads, like heat pumps, where you have a bit of freedom in when you can power them up, or the temperature set point, things like that. And this can be automatic of course, but it could also be simply manual, by pushing information to the end user through the UI. When you move to the smart district, the collection of islands behind the transformer, you have a slightly different problem, which is the transformer that is between your district and the grid, which has a peak capacity, and so what you want to make sure is that you stay under the capacity of the transformer, both for import and for export.
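The two numeric ideas in this passage, the levelized cost of storage (a 1,000 euro battery rated for 1,000 cycles gives 1 euro per cycle) and the weighted exponential average used to predict future consumption, can be sketched as follows. The function names and the smoothing factor are illustrative assumptions, not OpenRemote's implementation.

```python
def levelized_cost_of_storage(battery_price_eur, rated_cycles):
    """Cost to attribute to each full charge-discharge cycle."""
    return battery_price_eur / rated_cycles

def ewma_forecast(history, alpha=0.3):
    """Weighted exponential average: recent consumption weighs more.
    Returns the prediction for the next interval."""
    forecast = history[0]
    for value in history[1:]:
        forecast = alpha * value + (1 - alpha) * forecast
    return forecast

# The talk's example: 1,000 euro battery rated for 1,000 cycles.
print(levelized_cost_of_storage(1000, 1000))          # 1.0
# Predict next hour's load from past hourly consumption in kWh.
print(round(ewma_forecast([1.2, 1.0, 1.4, 1.3]), 3))  # 1.243
```

The per-cycle storage cost lets the optimizer compare "charge the battery now" against "buy from the grid later" on equal footing.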
So when there is a really high production of renewables, you don't want to overload the grid. So the data that we have is basically the same for the battery, for the renewables and for the loads. In addition, we have the real-time net power of the transformer, so we know how much the transformer is currently taking in and out. And we can then adjust the optimization algorithm with a kind of fake tariff: if we know that we need to change the consumption on the transformer, we can fake how much the electricity would cost, so that the optimization algorithm steers one way or another. And so we keep doing the optimization at each individual island, but we push for a global optimization so that the transformer, the grid, stays under control. One additional problem comes from the fact that you have many households, for instance, in a district, which can have their own technology, so it's quite complex to control them, to automate them at all. So one way, and we're exploring that, is interfacing with more home automation systems, like openHAB or Home Assistant, for instance. Another way is manual interaction. And so what we can do is send personal challenges to every household, where the people can earn points, which basically earn them money if they play nice within the whole ecosystem. And another thing we can do: we also have shared flexible loads. For instance, in a district, you can have a shared charging station for the electric vehicles, and then we can control and, for instance, diminish the available power so that we can also keep the grid under control. So that is the general idea, that is what we are aiming for, and there are several pilot projects that are starting to implement it. One of them is the Nottingham City Council. The idea, it's a smart home, but really it's more a smart, well, we could say office complex.
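One way to picture the "fake tariff" idea is a price signal that each island's cost optimizer sees, inflated as the transformer approaches its import limit and deflated as it approaches its export limit. This is a minimal sketch under assumed names and a made-up scaling factor, not the actual OpenRemote algorithm.

```python
def steering_tariff(base_tariff_eur_kwh, net_power_kw, capacity_kw, k=0.5):
    """Adjust the tariff each island optimizes against.

    net_power_kw > 0 means the district is importing through the
    transformer, < 0 means exporting. The closer to capacity, the
    further the price is pushed away from the base tariff."""
    utilisation = net_power_kw / capacity_kw  # +1 full import, -1 full export
    return base_tariff_eur_kwh * (1 + k * utilisation)

# Near the import limit: price rises, so island optimizers discharge
# batteries and defer flexible loads.
print(round(steering_tariff(0.30, 90, 100), 3))   # 0.435
# Near the export limit: price drops, so optimizers charge and consume.
print(round(steering_tariff(0.30, -90, 100), 3))  # 0.165
```

Because each island already minimizes cost, feeding it this adjusted price steers local decisions toward the global constraint without any new control channel.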
The idea is to control the charging of all the electric vehicles that are used by the City Council at Nottingham. And so what it means is you can control a global static battery plus the charging of all the vehicles to save money. You also want to have your vehicle charged at least to some level, because you want to use it in the end. And you also want to prevent surpassing the power limit that you have for the whole district. Oh, sorry, Council. And so what you see on the right is the dashboard interface that we have in OpenRemote, which can show you the different locations of the vehicles. So we can track that anonymously, but we can track the different vehicles and the global power that is currently used by charging these vehicles. If we now move to the smart district, this is a project that is currently starting in Amsterdam, where we have a community of about 500 households that are part of this project. One thing is that each household can control their consumption: we interface with the meter, and they can see real-time information about the power they're consuming through the mobile app, so they can adapt their own consumption. We have the challenges that I talked about, so they see how the whole district is doing and they are proposed challenges, so that they can play nice within the neighborhood and, by doing so, earn money. And we can also, as we said, if there is really an emergency, control the heavy loads that are shared in the district, to make sure that we don't go above the limits of the transformer. So it looks a bit like that; these are design slides, so there are some inconsistencies in the wording, but globally every participant will see their own consumption, with a bit of history, and how the district is doing. And the green dots around the indication are a global indication of how the district is doing. So it's really gamification there.
Now you see that at some point the neighborhood might be reaching the limit, so we are reaching the limit at the transformer level, and so we will propose to the person in each household a challenge saying: well, for the next hour you need to keep your consumption below this level. If the person accepts, then for the duration of the challenge they will see their own consumption, see the limit, how they are doing against it, and how many points they will collect. And they also receive tips, saying: well, potentially, if you want to keep your consumption under the limit, maybe charge your car a bit later, or set the temperature a bit lower, something like that. When the challenge is done, they see how many points they have collected, and then they can of course see a summary of all the challenges they have completed, how many points they have earned, etc. This is the view from the manager, so we can see the different meters that are all connected to the system. At this stage, as it's a pilot project, they have 50 meters connected; the project just started. The target is to have 150 by the end of February, and with 150 this should be enough to already influence the whole behavior of the district. So with 150 connected meters we should be able to have a real impact on how the district behaves and on the transformer. And this here is the dashboard where you see a summary: the small diagram I showed with the consumption and the load on the transformer, how we are doing compared to the peak capacity of the transformer, a historical graph, and things like that. So thank you, these were the two projects that are currently running on energy management; at this stage, there have been others. You can find the OpenRemote platform in the GitHub repo; there is also the forum, where the community is active, and other information. Thank you very much.
A journey across the environmental materiality of digital services
Hi. So in this talk, we'd like to take you on a journey across the environmental materiality of digital services. So, the speakers in front of you: here is David, my name is Benoit. We are contributors to an NGO called Boavizta, which we'll present briefly later. We are also colleagues in a small company called Hubblo, working on ICT and environmental impacts. Regarding Boavizta, the NGO we work for, and it's the work of this NGO that we present to you today: this is an NGO based in France that gathers more than 250 members now, private companies, public organizations, universities, researchers, freelancers and so on. And the goal of the organization is to provide public and open methods, data, tools and knowledge about the environmental impacts of ICT and their assessment. And of course we try to do that with open source, open data and open science. Thank you Benoit. So today's objective will be to see how we can get from a digital service to its environmental materiality. Environmental materiality is another way of seeing its environmental impact, and it includes not only its carbon emissions but also all of the other pollution and its usage of renewable and non-renewable resources. To do this we need to follow a process which is called environmental accounting, and at Boavizta we have chosen to do it with an open source approach. What is very difficult when you're doing environmental accounting in the context of ICT is that you must take into account all the value chain of your digital service, including the end-user equipment, the network, the data centers, so all of the infrastructure that your service is using. But you also need to take into account another dimension, which is the lifecycle phases. So you don't want to only include the impact of the use phase, but also the impact of manufacturing the equipment that your service is running on, transporting that equipment to its place of usage, using it, and also the end of life of the equipment.
Today we won't be able to dig into all of these dimensions, so you'll see on the slide what we're going to focus on, but Boavizta is working on all of the dimensions here. It's still me. So why have we decided to do open source? We're at FOSDEM, so I think everyone here is convinced that we should do all of the data and development with an open source process. But when we talk about environmental accounting, it's even more important to follow an open approach. First, because we believe it's a democratic necessity. Environmental figures are often used to justify political orientations; for instance, the Green New Deal is full of environmental figures, and we believe that citizens should be able to audit and criticize the figures that are being used to make political orientations. Also, environmental figures and environmental accounting are used to label products and services. I think you might have seen some data centers that claim to be greener than green. But to say this, you need to rely on environmental figures, and often those claims are not based on open approaches and figures, which is for us a problem, since consumers cannot audit and criticize the figures. There is also a more straightforward argument, because today environmental accounting in the context of ICT is very immature, so the data that we use, the data that we report, are of very bad quality. To illustrate this, we've done some work: we normalized the carbon impact of manufacturing one inch of an LED panel, so an LED screen; this is the carbon footprint of manufacturing one inch. And you see, from the five data sources that you have here, we have an order of magnitude between the lowest impact and the highest impact. We could think that HP has a way more environmentally friendly process than Dell, but this is not the case. At least we cannot say that; this is not the justification for this difference.
There is this difference because all of those providers are not using the same data sources, the same hypotheses, and the same methods. And because those are not open, we are not able to explain to you why there are those differences. So open source could be a way out: if all of those figures were based on open approaches, we could try to normalize those impacts, compare the providers with one another, and explain why different providers have different impacts. So let's first focus on the energy footprint. I guess the energy footprint is the part of the ICT footprint we mostly think about when we work in ICT; it's easier to get a grasp on. But as David said, when we look at energy in ICT, it's still only one part of the impact. It's really about the usage phase; it doesn't cover the rest, which can be a way, way greater impact than just the usage phase. That's also true for data centers. In what I will present to you today, most of the information is accurate for data centers. Some of it may apply to end-user equipment, but we didn't include specific information on network equipment, for technical reasons and also because it's hard to get data on that part. So first, a little bit of context regarding data centers. I don't know if you've seen the latest figures from the IEA, the International Energy Agency; let's say it's a rather conservative organization so far regarding ICT and its impact figures. But their latest figures are quite enlightening, because we can see that in 2022 we were around 400 terawatt-hours of energy consumed by data centers, which is double what they previously said for 2020, which is a bit strange. And also, their projection for 2026, in two years, says that it will double again, so around 800 terawatt-hours. Part of it is because of AI, but not only, you guessed it.
So this is the context. What we can say here, at least, is that we are really in a hyper-growth trend, and not the opposite. That's not what we have seen in some media, where data center energy consumption is said to be flat; that's not the case. Then what's the issue here, actually? What do we want to look at? It's not just about the energy consumption, of course. I think I won't teach anything to anyone in this room when I say that energy consumption means that at some point we consume oil, gas, coal or other energy sources. This will emit greenhouse gas emissions, of course. But we will also consume water in the process, if we take into account the cooling of the data center. And we will consume minerals and metals and other resources. Not all the resources that we can account for are listed on the drawing, but there are 16 environmental criteria that we take into account in Boavizta tools. So what do you have at your disposal to work on the energy consumption of your own services? We have talked during the day in this room about perf and PowerTOP. There are other options as well. Of course there are physical measurement devices: smart PDUs, iDRAC or iLO administration cards if you have them on your server, wattmeters in general. This is one way. The other way is software evaluation. So those are the options that I've listed at the top; all of them are open source solutions. If you are, let's say, in a bare-metal server context, you might choose PowerAPI, perf, PowerTOP, Scaphandre. If you are more in the development phase of software, you could use PowerJoular. If you are in a Kubernetes context, Kepler or Scaphandre may help you. And if you are in a machine learning context, CodeCarbon could be of good help. These are some examples. What's behind the scenes is actually the interfaces that have been mentioned previously in the day: nvidia-smi for getting the energy consumption of GPUs, RAPL for Intel or AMD x86 CPUs.
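On Linux, the RAPL counters mentioned here are exposed through the powercap sysfs interface as cumulative microjoule counters; sampling one twice and dividing by the elapsed time gives average power. A minimal sketch (reading the file usually requires root, the package-0 path is a common default, and a real tool would also handle the counter wrapping):

```python
import time

# Package-0 energy counter of the powercap RAPL interface
# (a cumulative value in microjoules).
RAPL_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj(path=RAPL_FILE):
    with open(path) as f:
        return int(f.read())

def average_power_watts(e_start_uj, e_end_uj, seconds):
    """Two cumulative readings in microjoules -> average power in watts."""
    return (e_end_uj - e_start_uj) / 1e6 / seconds

# Usage (typically needs root):
#   e0 = read_energy_uj()
#   time.sleep(1)
#   print(average_power_watts(e0, read_energy_uj(), 1))
```

This is exactly the perimeter limitation discussed next: the counter covers a CPU package, not one process or one piece of software.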
And the third approach is modeling. So while we could classify the previous tools as measurement, this is more about modelization, and some of those tools also use modeling, so they don't necessarily use only measurement through those interfaces. And the Boavizta API is also part of it, because it does model energy consumption and answers about the carbon intensity of the electricity, if I take the words from the previous presentation. But we have to be precise about something: both hardware and software measurement tools have their limits. If you take the wider purple and pink squares, they represent the perimeter a physical device will be able to measure. So the whole machine, actually, but you won't be able to zoom in on the footprint of one software or a given component. On the other side, if you look at the yellow and green squares, not so green, but the smaller squares here, this is the perimeter that RAPL is able to measure: a CPU, an integrated GPU if there is one, and memory can be measured; for GPUs, nvidia-smi. In some cases you may have a broader perimeter with RAPL, but this is for recent machines only. So we have an issue here, because we are in a trade-off between completeness of the evaluation and precision, the ability to zoom in on the footprint of one software, for example. And so how could we fix that situation? In Boavizta we are launching a project called Energizta, which is basically collaborative science. This is a collaborative database that we open, and we propose that voluntary organizations and individuals share, through an open source agent, energy data and data about the hardware of the machine that has been measured. This will help us to do statistics and then, at some point, produce better models that will help us improve software-based power evaluation. Thank you Benoit.
So from the beginning of the presentation we've told you that the use phase and the energy consumption are not the only things that should be taken into account when you want to account for the materiality of your service. And this is where the life cycle approach comes in. A life cycle approach will try to take into account all of the phases of the life cycle of your service, but also all, well, most of the impact criteria. So not only the carbon footprint, but depletion of minerals and usage of water, for instance. We're going to focus here on how you can identify the environmental impact of manufacturing a server, so it will be mostly in this area. But at Boavizta we try to have a comprehensive approach by identifying the impact of all the phases, from all the value chain. So this is a very, very partial and simplified model of how you can get the environmental materiality of a server for a specific service. The first step that we do when we do environmental accounting is to identify the technical infrastructure that hosts the service. And this is often the most difficult part, because for instance if you take a function as a service that runs on AWS, it's very hard to know what the specific consumption of resources is and what technical material your function is running on. But we need this data to know and to understand which specific components are used and what the impact of those components is that we should allocate to the service. So this is sometimes like archaeology, when we need to dig and make some hypotheses to get from a service to its technical layer. But once we have the technical layer, we need to go down to the raw materials, because this is where the impact comes from. So we try to map all the processes that need to be completed to assemble and manufacture a server. In a simplified way, we could say that a server is an assembly of plastic for the casing, and packaged components.
So CPU, RAM, graphics card and so on. And a component involves many processes, but the most impactful one is making the die. The die is the part of the component that is engraved, where you have the semiconductors. And for this, you need to have metal. And to have the die, you need to engrave a silicon wafer. And as you can see, the process of engraving consumes a lot of water. And you also need metals to, of course, produce a silicon wafer. Across all of these processes, there is the use of energy, which also uses raw materials, which causes pollution and resource depletion. So of course, each time you want to assess a service, we are not going to draw this map and go all the way to the usage of coal, oil and so on. So what we do is factorize the processes and make them easier to access through the different tools we are building at Boavizta. The main tool that we have is the Boavizta API, which is an API that makes a translation between the ICT world, with IT people, and the environmental impacts. So you give the API a technical configuration; it can describe a digital service, an equipment, a component. And the API will give you back environmental impacts, not only on global warming, so not only the carbon footprint, but also, for instance, other impacts such as primary energy, which you should know if you know a little bit about energy, and abiotic depletion potential, which is a criterion that assesses the removal of non-renewable resources, so this includes minerals and fossil resources. Around the API we built, so, our architecture is in microservices, and the API is the central microservice. But we have other tools, such as Cloud Scanner, which will scan an AWS account and try to assess, with the API, the impact of that AWS account. And we also have a pedagogical front end, which is called Datavizta, which is based on the API; it's just a nice layer on top of the API for people who don't want to manipulate an API.
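As an illustration of "you give the API a technical configuration, it gives you back impacts", the sketch below builds a server description and shows, commented out, how it could be posted to a locally running API instance. The exact field names, endpoint and port are assumptions; check the Boavizta API documentation for the real schema.

```python
import json

def server_payload(cpu_units, ram_gb, location="FRA"):
    """Hypothetical server description; field names are illustrative."""
    return {
        "configuration": {
            "cpu": {"units": cpu_units},
            "ram": [{"units": 1, "capacity": ram_gb}],
        },
        # the location drives the carbon intensity of the electricity
        "usage": {"usage_location": location},
    }

payload = server_payload(cpu_units=1, ram_gb=32)
print(json.dumps(payload))

# Posting to an assumed local instance could look like:
#   import urllib.request
#   req = urllib.request.Request("http://localhost:5000/v1/server/",
#                                data=json.dumps(payload).encode(),
#                                headers={"Content-Type": "application/json"})
#   impacts = json.load(urllib.request.urlopen(req))
#   # the response covers criteria such as global warming potential,
#   # primary energy and abiotic depletion potential, as described above
```
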
So for instance, here is a way to assess the impact of a server, and you see you can configure a server. For instance, let's say that I have one CPU. Demo effect, okay. I can also change the location where I use the server; this will change the carbon intensity of the electricity where the server is running. So I invite you to play with this tool and see a little bit what the main cause of the impact is, both from the manufacturing and the use phase. And the manufacturing impact, you can have it by component, so it's also interesting to see which component is the most impactful. There are other features too, which are also in the API: you can assess the impact of your cloud usage, for instance, or of end-user devices, but we haven't introduced those during the talk. Yeah. For the API, you can scan the QR code and this will get you to the repository of the Boavizta API. We wanted to open up this talk. So we began by talking about energy, then we took a broader approach with the life cycle assessment approach, and we wanted to finish with an even more systemic approach, which I call the systemic footprint, but it could also be called a consequential approach. From the beginning of the presentation, we've talked about the direct impact of digital services, meaning the impact of the value chain of the service. But sometimes the most impactful part of a digital service is not its direct footprint, but the indirect environmental externalities brought about by deploying your service. You're building your service for some usages, and you need to be careful about why and how your service is used, because this service might be used to cause environmental harm.
So when you want to understand the consequences of launching your service, you need to take another approach, which is a causal approach, trying to map the different causes and consequences that follow the introduction of your software. For instance, if you take a cloud provider: clouds are known to often be more mutualized and more optimized in terms of energy usage and carbon footprint. But since cloud is very easily accessible, we are consuming way more compute resources than we did before. This is what we call the rebound effect. And this is something that we cannot get from a basic life cycle analysis; we need a more systemic approach to understand all of those societal transformations brought about by ICT. And I think we're done. Thank you for your attention. We have some minutes left for questions. Yes, it was very interesting. But the problem is that everybody must know this kind of thing, in relation to climate, environment and so on. And there are no studies from Greenpeace about this kind of thing, about energy providers, yes, in Belgium. So this kind of thing is very difficult, because, so, I know the big ones, Amazon Web Services and so on. And this kind of thing is very, very important: their data centers, how they take up energy, their harmful effects on the river or something like that. And all this kind of thing for the construction of a computer and so on. I would like there to be a Greenpeace barometer of this kind of thing everywhere, because it's very important for our future. Also when they dissipate energy in a river and so on. So your remark is about awareness, I think. I think there is no report from Greenpeace, but there is a report from WWF at least.
And I think the main purpose of Boavizta and the tools that we're building is not efficiency, but more making people aware of those problems and taking action, because, and I think I can talk for both of us, we think that having more IT people engaged is one way to fight against the impact of IT. Hello, thank you very much for this. When you were presenting the server impact thing, I have a technical question. There was a discussion about joules and primary energy, as opposed to something that we might use like kilowatt-hours, which is quite common. Could you maybe talk a little bit about why you chose that rather than a figure that we see used in lots of other places? Because that is something that I found a little bit difficult to understand when I first looked at it. So, primary energy versus secondary energy, if you could explain some of that, and explain the decision to choose joules versus watt-hours, for example. Yeah. You want to answer? Why do we express primary energy in joules? What I can say, but I don't know if it's an accurate answer: in practical terms, joules are often used for very precise measurement purposes. Most of the time when we talk about big figures, we are more about watt-hours, kilowatt-hours, megawatt-hours and so on. A watt is power, so it's not expressed over a timeframe; that has been said in a previous talk. I don't know if that clarifies or... Yeah. Oh, okay. Actually, I understand the confusion. Primary energy is an impact criterion. Secondary energy is a flow, so it's not considered a final impact. If you see here, we can model the secondary energy, the power usage here, in watts, and we use it to compute the usage impacts for the different impact criteria. Primary energy is how you deplete the earth of primary energy. Does it? Maybe time for one more. Maybe you can do both.
So the question is: because some countries now don't want any more of the scrapped servers from our countries, did the data centers change their policy in terms of management, for example for the storage systems? At Google, they used to break the hardware into small pieces, not even recycling them at all. And have there been changes recently in spare parts management, because countries don't want the recycling to be done in offshore countries anymore? Actually, that's a very complicated topic.
Power profiling my entire house with the Firefox Profiler
Thanks for coming so late. I'm Florian Quèze. I work for Mozilla as a performance engineer. You might have been here last year when I was talking about the work I do. As a performance engineer, my work is to understand how much power Firefox uses and what we can do to reduce it. So I was explaining last year how we developed power profiling tooling; that was the cover slide. For example, I was explaining that we have power profiling tools that let us understand how much power is used by things as small as just blinking the cursor in the address bar. So this is what I was presenting last year, and if you want to hear more on this topic, I will be doing a similar presentation, updated and extended, tomorrow in the main track. So today, I will be sharing a different story. It will be more of a story, actually, because it's late and I want this presentation to be easy to follow, maybe a bit entertaining if I can. So first, a story about why I worked on power profiling the entire house, then technical details, and then lots of examples, because those are the most interesting. So, the story. There was FOSDEM in February, and in April we had a new member in our family that I was very happy to welcome and that completely changed our life, of course. Two days before she was born, I installed this on the wall. One of the reasons why I installed this, it's solar panels if it's not obvious, is that I wanted most of the energy used for her to be renewable. I had tried before to have solar panels on the roof of our house, and it turned out to be extremely difficult, which means we failed to get those. The reasons were mostly that there were chimneys on the south side of the house that were making massive shadows on the roof, and lots of other issues with the roof. Basically, all the companies who came never gave us a quote, so we couldn't get panels on the roof. So I installed this, and I was wondering: can this power the bottle warmer that we use for the milk we give to the baby? I work from home.
I work on energy efficiency all the time. Will this power my home office? So I had questions, and I wanted answers. How could I answer those questions? I installed the power meter that you see here inside the electric switchboard of the house. It's communicating over Wi-Fi, and I'm measuring three different things: the link with the grid, so seeing if we are importing or exporting electricity; specifically the solar panels I had put on the wall; and my home office. So that I could answer the questions. Of course, I very quickly came up with more questions. I was also wondering about the washing machine, the freezer, and a few other things in the house. So this is what the thing quickly looked like, a bunch of things in here. I made the thing in the first place, so I could make a mess of it if I wanted. So now we are also measuring the link to upstairs, because there's a second panel upstairs, the freezer, the boiler, the washing machine, those kinds of things. And also, I needed to answer the questions, so we put a smart plug on the bottle warmer to be able to figure out what was going on there. So now let's go into technical details. What am I doing with all this? How can I get relevant information? First, I need to collect and store the data. I have a constraint: nothing in the cloud, because it's very personal, sensitive data. All the power meters are connected through Wi-Fi, but with parental controls, so they have no internet access. They all send data through MQTT, one piece of data every second. And there's an Ubuntu virtual machine somewhere in the house that hosts an MQTT server and, with trivial scripts, logs everything to disk. So that part is pretty simple. Then, second part, I need to visualize the data, because if I just have massive log files, I can do nothing with them. And this is where the Firefox Profiler part comes in, a tool I was very familiar with because of the power profiling work I did the previous year.
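The "trivial scripts" part can be pictured like this: one JSON line appended per MQTT message, one file per meter per day. This is a hedged sketch, not the speaker's actual code; the topic names, file layout and the paho-mqtt wiring in the comment are assumptions.

```python
import json
import time

def log_reading(topic, payload_bytes, log_dir="."):
    """Append one meter reading as a JSON line, one file per topic per day."""
    record = {"t": time.time(), "topic": topic, "value": float(payload_bytes)}
    day = time.strftime("%Y-%m-%d")
    path = f"{log_dir}/{topic.replace('/', '_')}-{day}.jsonl"
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Wiring with an MQTT client such as paho-mqtt (assumed, not shown running):
#   client.on_message = lambda cl, userdata, msg: log_reading(msg.topic, msg.payload)
#   client.subscribe("house/#")
```

At one message per second per meter this stays tiny, and plain append-only files keep the no-cloud constraint trivially satisfied.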
I have on the Ubuntu virtual machine a trivial script that converts the data from the files on disk to a JSON file the profiler can understand. And the profiles contain mainly two things: power counters and markers. So this is what it looks like. If you're not familiar with the profiler UI, you might not be, I will explain very briefly. There's a time axis here. The top part here is what we call the timeline; everything is against time. The various things I said I'm metering, you can see them here; you see the shape of the chart for each of those. And markers, they are here, and they can give us more specific details about specific things that the script thought were interesting. And you can see here that, for example, Beem is the brand of the panels on the wall; you can see that it typically produces more in the middle of the day, and you can see when it's cloudy it's less interesting. Many other things; we will go into more detail later about what we see there. One thing I wanted to mention here was the date, which is, sorry, the date is the most important thing here. We were three weeks in after we got the baby, and this is what I spent most of my days doing. And actually most of the nights too, and here's how that worked. Usually when people get a baby, they say they have no time left. I actually had the exact opposite: I ended up suddenly having plenty of spare time at night, because she was waking up so often that we couldn't sleep. So we were taking turns, and half of the night I was up. She would wake up, want to have some milk, and then sleep a few minutes later. So I had plenty of hacking sessions that were somewhere between 10 minutes and three hours. Unpredictable. But I had multiple weeks of having those sessions at night, which is why the code is maybe a bit messy, because I had to do it in small chunks. But it worked pretty well. Otherwise I would have had no time to do any of this hobby project on the side.
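The conversion step could be sketched like this. To be clear: this is not the real Firefox Profiler processed-profile schema, which has many more required fields; it is only a schematic of the two ingredients the talk names, per-second power counters plus markers for interesting events, with all field names invented.

```python
import json

# Schematic conversion: one watt sample per second plus a list of
# (second, label) events. Illustrative structure only, NOT the real
# Firefox Profiler JSON schema.
def to_profile(samples, markers):
    return {
        "meta": {"interval": 1000, "product": "house"},  # ms between samples
        "counters": [{
            "name": "office power",
            "samples": [{"time": t * 1000, "value_w": w}
                        for t, w in enumerate(samples)],
        }],
        "markers": [{"name": n, "startTime": t * 1000} for t, n in markers],
    }

profile = to_profile([40.0, 41.5, 120.0], [(2, "laptop charging")])
print(json.dumps(profile)[:60])
```

The real script just does this shaping at a larger scale: read the day's log, bucket readings per source, and emit one counter track per meter.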
Also, the generous parental leave at Mozilla helped a lot, because that meant I had lots of those weeks where I could stay up at night and do those kinds of things. And then, more seriously, generating a JSON file that the profiler can understand was really simple. Maybe because I work with the profiler a lot, but still, I think most people could get it done and get something that works relatively quickly. And also, I don't have to host any web UI or anything, because I can just generate URLs like this, with the URL of where I generate the JSON file. And that's all I have to handle; I don't have to take care of anything in the UI. Then there's the stuff that didn't work as well. The profiler was made to profile Firefox; typically we were having profiling sessions of a few seconds. I accidentally had profiles that spanned an entire day. So stuff didn't work so well in terms of units, for example. So I did put up some pull requests to add minutes, and then hours, and then a few weeks later, days also. Changing the units: if you remember the screenshot I showed of profiling the cursor blinking in the address bar, we were talking about milliwatt-hours and microwatt-hours. I wanted to see kilowatt-hours, because numbers with many zeros were not so fun. Performance also: showing a profile that contains data for an entire day was not that bad, but it took maybe five seconds to display. I fixed it. And another thing that was a lot more important when profiling the house, and that is completely irrelevant when profiling Firefox, is knowing when something happened. In Firefox, typically we want to know how long something took. Here I mostly wanted to know at which time of the day something happened, when we were starting to consume more power. So I also had to tweak that a little bit. It's also nicer for the Firefox use case, but it was a lot more important for profiling the house. Colors too, but that was just nicer.
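Generating one of those URLs amounts to URL-encoding the address of the JSON file and appending it to the profiler's from-url route. The host and path in this sketch are made up; only the profiler.firefox.com route is from the talk.

```python
from urllib.parse import quote

# Build a profiler.firefox.com link that loads a profile from a given URL.
# The /from-url/<encoded> route is what the talk refers to; the JSON file's
# host and path below are hypothetical.
def profiler_link(profile_json_url):
    return "https://profiler.firefox.com/from-url/" + quote(profile_json_url, safe="")

link = profiler_link("http://192.168.1.10/profiles/2023-07-01.json")
print(link)
```

This is why no web UI needs hosting: the profiler front end fetches the JSON itself, so the home server only has to serve a static file.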
Everything was gray in terms of power tracks in Firefox, because there were only a few tracks. Now let's go into examples. Doing laundry: washing machine and dryer. So washing, it consumes a lot of power twice, and this is most likely when heating the water. I also wondered why it's doing it twice here; I think I saw it doing it only once a few times, so it depends on the program. Actually, I would like to profile the various programs. And if we zoom into this part that looks interesting, but that we don't see because of the big thing here, we see there are lots of patterns that are probably good enough to figure out what the machine was actually doing. And then the dryer, and it turns out it uses less power than the washing, even though it takes longer. This is probably because we took the most efficient dryer we could find, with a heat pump. I also profiled my mother's dryer, and it uses seven times more power than mine. A typical day at the office, the home office. And this is why I don't want this data to be in the cloud, and I don't want my manager to have access to this data: you can say exactly at which second I returned to my desk throughout the entire day. And you can see that there are typical days like this, with small breaks in the middle; you can see the shape here is different. And then there are days like this one. The main difference here is, when you see that it's high first and then decreases, it means my laptop battery was not full. So that means I probably worked from somewhere other than my office. So here, here and here, I clearly worked somewhere other than my office. And then the last one is a Sunday. On Sunday, the only thing that remains powered on is the modem, which also provides Wi-Fi for the rest of the house. But maybe before working, I should have started with breakfast. So this is a microwave oven from the 90s, inherited from my grandmother.
And two things we typically do in the morning are unfreezing bread and heating milk. And I was surprised by the patterns there. The surprise is, I was thinking that in the defrost mode it would use significantly less power, and that's actually correct. But the thing is, it's heating at the maximum power for a few seconds, then nothing for a little while: every 30 seconds, it's heating for seven seconds. Which means that if I'm hoping to use the solar panels, and it's in the morning and they are not at their peak production, I'm basically buying all that power from the grid, even though the average power is only 300 watts. And that's the kind of stuff we see when power profiling with a high sampling rate, but that I would not see if I was looking at the data every hour. And heating milk is what you would expect: almost a rectangle. So now, time for a quiz, to ensure you are still awake. In your opinion, what uses the most power here: is it the massive chest freezer we've got that's full of milk, or is it the internet modem? Who thinks the freezer? Raise your hand. Who thinks the modem? So let's profile it to figure it out. So, of course, very different shapes. The modem is using the same amount of power almost the entire day, with very tiny variations. And the freezer, there's a spike at the beginning for a few seconds, then it's stable for a few minutes, then it stops entirely, and then starts again. The modem: 27 watts all day long. It also runs the virtual machine that does all of this power profiling. So the answer is: you are all right. They used exactly the same amount of power during the entire day. So, back to the initial question about warming the milk for the baby. There's the milk pump, and then there's the bottle warmer. How much does each of those consume? You can just see the numbers; I don't think I'm going to read them out loud.
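The defrost-mode duty cycle described above (full power for about 7 seconds out of every 30, with a 300 W average) implies a peak draw well above what morning solar production can cover. The peak figure below is back-computed from the stated average, not a measurement from the talk.

```python
# Duty cycle from the talk's numbers: ~7 s of heating per ~30 s period.
# Peak power is inferred from the stated 300 W average.
burst_s, period_s, avg_w = 7, 30, 300
duty = burst_s / period_s           # fraction of time actually heating
peak_w = avg_w / duty               # implied instantaneous draw
print(f"duty cycle {duty:.0%}, implied peak {peak_w:.0f} W")
```

So even though an hourly average would show a modest 300 W, each burst demands roughly 1.3 kW at once, which is why the grid ends up supplying it.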
Something that we quickly realized when looking at those profiles, that was interesting, is that we see the timing, the same way as figuring out when I'm working or not working. And I'm not sure if you've had a baby recently and had this experience, but you have lots of constraints about how long you can keep things. So milk that has just been pumped and kept at room temperature, you can use for four hours. If it has gone in the fridge and you are heating it, you can use it for two hours. So being able to know whether the bottle of milk in front of you is usable, when suddenly the baby wakes up and you don't know when it was pumped last time because you were not in charge that time, usually it's a mess. And we can make use of this data, and we did. That's actually what we used the power metering data the most for: figuring out whether the bottle of milk in front of us is usable. And we figured this out only because we could see on the chart that it's actually very easy to detect the pattern. So, it's time for a summer break. We visited my parents, and they recently had those nice solar panels installed on the roof of their kitchen, and it came with a gateway that's sending the data to the manufacturer of the gateway, who's collecting a lot of data. I'm not too happy about that, but it was not my decision. So it's sending one data point every 15 minutes, which is good enough to figure out how much electricity was imported or exported on a given day, and to figure out what you're actually doing with your electricity. And I noticed, during one night of taking care of the baby, that actually we can get one data point every second if we query a local HTTP API. So I did. I put a Raspberry Pi in there, and of course we can get profiles. So now let's see what they look like. That's what I saw at my parents' house, and one thing quickly caught my attention. It's a three-phase system, because of a large heat pump; I will go into that later.
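The milk rules above (4 hours at room temperature, 2 hours once refrigerated milk has been warmed) combined with the timestamp of the pump's or warmer's power spike could be turned into a check like this. The function name and interface are illustrative, not from the talk.

```python
from datetime import datetime, timedelta

# Usability rules from the talk: freshly pumped milk at room temperature
# keeps 4 hours; refrigerated-then-warmed milk keeps 2 hours. The power
# meter's spike gives the pumped/warmed timestamp automatically.
def milk_usable(event_time, now, warmed):
    limit = timedelta(hours=2 if warmed else 4)
    return now - event_time <= limit

pumped = datetime(2023, 7, 1, 8, 0)
assert milk_usable(pumped, datetime(2023, 7, 1, 11, 30), warmed=False)      # 3.5 h since pumping: ok
assert not milk_usable(pumped, datetime(2023, 7, 1, 10, 30), warmed=True)   # 2.5 h since warming: too late
```

Detecting the spike is the easy part, as the talk notes: both devices have a distinctive, short, high-power pattern that stands out on the chart.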
This thing looks strange. There's high power use here, and it's throughout the day. And the only thing that could be using that much power is this thing, the water heater. And it's supposed to be using power during off-peak hours, because the price of electricity in France is not the same at night and during the day. And after investigating a little bit, we realized that there was this switch here that was in the wrong position, and that was forcing the thing to be on all the time. So it was powering on whenever someone was using water. And we changed the switch, and now it's heating only around midnight, and then a little bit around 7 a.m., and then it stops for the rest of the day. And that probably saved quite a bit of money. I said there's a large heat pump, so now we are no longer in the summer. I forgot to say something: the heat pump here has a large accumulator also. And when we look at the power use pattern, we see the heat pump that's pumping and using a lot of power on all three phases, six times a day. And then there's the circulator here that's running throughout the day. So we actually can understand how things work, and we can see also how the power from the solar panels was used. Back at home, some magic happened. I said we couldn't have solar panels on our roof, but we had a baby, which means that we returned home, and after returning home, there was a midwife who came to visit to check everything was all right. And on the car that she used to visit us, there were ads for a company putting solar panels on roofs, owned by her husband, who's very proud of figuring out solutions to all the desperate cases where nothing seems possible, and who came and gave us a quote that was very reasonable, and a couple of months later... The baby solved all the problems that we were not able to solve for two years. So now we have real solar panels on the roof, but that's enough about this part of the story. Fast forward to December, and it's time for another baby picture. She's grown up quite a bit. She's really into trees.
Whenever she's crying and we don't know why, we show her a tree and she's super happy. So we had to get her a nice Christmas tree for her first Christmas. And it's time for another quiz. Of what you see in this picture, what's using the most power? So obviously there's the Christmas tree here; the Christmas tree turns itself on at sunset and turns off at midnight. Then, you might not have seen, but we have the solar panels here, and they produce power during the day but use power during the night, for some reason. So what's using the most power, in your opinion? Who thinks the Christmas tree? Who thinks the solar panels? Okay, let's profile it. So the Christmas tree uses 10 watts for a few hours here, and the solar panels about five, during the end of the day and the beginning of the next day. And if we look at the numbers: Christmas tree, 64 watt-hours; solar panels at night, 67. That was a surprise to me, but yeah, you couldn't be surprised twice by my quiz, I guess. But they did produce a lot more power than that, so it's still worth having them. And I think we still have a minute or two, so I have a few more things I can share. I have more power meters that are funnier, and the interesting thing about this one is that it can give me data at roughly a 50 hertz rate, which is the frequency of the oscillating AC power. And I forgot this profile at home on a computer that's not connected to the internet, but the profile was fun, because we can see what happens whenever the rotation direction changes. We can see that there's a break in power use for a few milliseconds, and then it uses more power when the motor restarts. So all those details we can see and expose with fast sampling and power profiling, and it's pretty nice to see. And then USB power meters: those are interesting if you want to look at the energy used by any random USB thing, or anything that charges through USB.
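The quiz numbers above can be sanity-checked with a bit of arithmetic; the hours below are inferred from the stated watt-hour totals and wattages, not measured values from the talk.

```python
# Back-of-envelope check of the Christmas quiz: hours of operation implied
# by the stated totals (64 Wh at 10 W, 67 Wh at 5 W).
tree_wh, tree_w = 64, 10
panels_wh, panels_w = 67, 5
tree_hours = tree_wh / tree_w
panel_hours = panels_wh / panels_w
print(f"tree on for {tree_hours:.1f} h, panels drawing for {panel_hours:.1f} h")
```

The panels' tiny 5 W standby draw wins only because it runs all night, more than twice as long as the tree is lit.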
And there are quite a few in this picture; all of those were reverse engineered to make them compatible with the profiler, and that's part of the topic of another talk that I will be giving tomorrow, but this is kind of how I worked with those. So, reverse engineering a bit, and then putting a load here, a USB light, where I knew what it would look like. The code is in here if you want to play with it. So I will explain why this is useful for profiling Firefox on Android, and even Firefox on laptops, tomorrow in the main track. Now let's see the things that were not working so well, or that I still need to look into. All the profiles I shared were looking good; I selected them. Some don't look that good. So this is a profile of the boiler. I said we profile the boiler; it's a gas boiler, so electricity is not most of the energy it uses, but still, during winter it uses a lot of electricity just to circulate water, so that the hot water it's producing goes through the house. And then, the Wi-Fi is not so good. It's especially terrible in our house because there's a lot of concrete with metal in it almost everywhere. Despite putting in multiple repeaters, it's still not so great. And some days I still have missing data like this, and profiles that are almost garbage. And it could lead to incorrect conclusions, because the shape here is just clearly wrong. So if we can, a wired network is probably better. But it's not really possible to put wires exactly everywhere, like on smart sockets or things like that. I think the best solution, if I have time, would be to change the firmware in those devices for an open source one, and ensure that they store the data until they receive an ack from the server that the data has been received, and include timestamps in the data. So that's probably a project for the next time I have many nights without sleep. I would really like to clean up this code so that all of you could play with it easily. It's not very complicated, but if we don't all have to duplicate it, that's much better.
So the code for power profiling with USB meters, I cleaned up enough, because it was part of my work, and I put it in an easily accessible repository. The code to make nice profiles from Enphase gateways, I would like to do that soon. And the rest is a bit of a mess, because it's a mix of my code and configuration data within the same files, because, you know, 10-minute hacking sessions. And I would also like to blog some of our profiles of appliances and devices that I tested, because I think there are quite a few surprises we could have when looking at devices. Some don't really behave like we would expect. And as a conclusion, I would say sampling at a high rate is useful to understand how things work, just because we are often curious. I definitely am. It's also useful to find and fix bugs, like the water heater thing at my parents' that was wasting a lot of power and costing money. And if we want to optimize consumption of the power that's generated by photovoltaic panels, it's better to have an idea of how much we will consume. Especially, unfreezing bread, like I was sharing, is probably not a good candidate for using energy from solar panels. And that's all I wanted to share for today. Thanks for your attention. Could you match the power used by your workstation with the solar panels, in the end? Oh, I forgot to say, but I could totally use the power from the solar panels for my home office, because it was clearly enough, and I'm mostly working during the day. And I could actually decide that when we have a lot of power from the solar panels, maybe it's time to compile Firefox, which will use a lot more power. But actually, the one thing that uses the most power, as we have seen in my profiles from the home office, is whenever I decided to use the computer without being plugged in and then plugged it back in, because then it charges, and that's when the power use is the biggest. The other thing that contributes a lot to the power use of my office is screens.
I have two external screens, and surprisingly, the 27-inch screen and the 20-inch screen have almost the same power use. So if I used only one, I could turn off the second one, and that would also save significant power. Profiling your stuff is often called NILM, non-intrusive load monitoring, so if you go and look that up, there are databases you can contribute to. For Enphase, be careful if you're running on version three and you're using production.json: it all goes away and it's all behind a paywall, horrible, don't upgrade. And things like microwaves, yes, are just on-off, so those are hard to do, so you should run them when it's sunny. And washing machines, right, so normally a washing machine is on, heating the water at the beginning, and then that's it, you know, there's mechanical effort, which you could see on yours. Dishwashers usually heat at least twice, because you get the main wash and then a hot rinse. So a washing machine with two is weird. So I'm not sure there was a question in this, or if it was just comments, but about the versioning of the Enphase gateway. The Enphase gateway we've got at home is not collecting data about our power use, so I put my own power meter behind it, and the reported data about how much power is used by the Enphase system at night is dramatically different between my parents' profile and mine. Because in my parents' profile, it's the data reported by the Enphase gateway, and it's counting only the power used by the micro-inverters that are on the panels, and it's around one watt. And mine is also counting the power used by the gateway itself, and then we are at five. So, time's up, thank you so much. And you can see the presentation tomorrow if you want more details about Firefox power profiling. Thank you so much.
Closing Energy: Reimagining this Ecosystem through Open Source devroom
Just a few words. If you want to, just maybe... think it won't work. Yeah, we'd like to take just a couple of moments to close off the devroom. Thank you all for being here. This was our second year having the energy devroom. We started, maybe we can turn this off. Well, we started with half a day last year, and we extended it virtually. Yeah, you were attending the virtual room last year. And we kind of had a feeling that there was going to be more attendance if we had a full day, and I think this demonstrated it pretty well. From the beginning in the morning until basically now, there were only a few sessions that were maybe not completely full. So thank you very much for sticking around, and I hope to see you next year. And maybe, if you want to volunteer, you know, please send us an email or hit us up somewhere else.
Gleam: Past, present, future!
Yes, good, now it is. Hello everybody, this is fantastically exciting. Look how many people there are; I thought there was going to be like five of us having a lovely time. But no, there's far too many of us, great. I'm so excited, I'm going to take a photo, just so I can prove this happened. Does everybody smile? Wonderful, thank you so much. I'm ecstatic. So hello, I'm Louis, I'm the creator of the Gleam programming language. If you want to talk to me, do so here. I'm here to talk about Gleam, which is, as you've just seen, a new programming language for the Erlang virtual machine. And it feels like we've hit a milestone, because the language has really matured, especially over the last year. And so I want to have a little bit of an indulgent look into the past, sort of where did it come from and how did it get started; a little bit of a celebration of where we are now, with a look at some really cool projects; and then I want to look into the future and ask: what's coming next for Gleam, what can we bring to the Beam?
So, this slide is ever so slightly irrelevant after that last talk, but I just want to start by saying: what is Gleam, to get everybody on the same page. It is a new functional programming language. It doesn't look like Ruby, it doesn't look like Prolog; it kind of looks like Rust, C, JavaScript, that sort of thing, perhaps. And it runs on the Erlang VM, so it is a sibling of Elixir and Erlang. But it is a bit different in that it is statically typed, unlike the other two dynamically typed languages, and most of the other ones. And that means that it brings a new style of programming to the Beam, and hopefully can draw more people to the Beam. It aims to be very small and consistent, and the point of that is that we want to make it as easy as possible to read code. We want it to be as easy as possible to learn the language and to get productive with it. And productivity is not just about having a good language; these days you often have really good tooling. Gone are the days when you can just give someone a compiler and say, okay, well, everything else is up to you, you figure it out. So we also have a really nice build tool that comes with a formatter and package management and a language server and all those sorts of things you probably expect. And also it can compile to JavaScript, which is probably less exciting to this room than most. But maybe you don't have to write JavaScript if you do front-end work, so maybe that's a cool thing. So first up, the past. What did I mean? Yes, good, okay, the past, the past: how did we get here? So this is a history of Gleam according to GitHub, and in the very beginning there was a tiny little blip of activity, and then nothing for absolutely ages. So what was that? What was the very first Gleam?
It was this, this hideous thing. This is the very first Gleam syntax. People keep saying that the first Gleam syntax was like a Haskell rip-off; it was not, it was this. You see it's sort of C-style, it's got braces, but it has the Erlang thing of multiple function clauses, so your top-level flow control is done that way. And it looks like nothing anybody's familiar with, and nobody really likes it. And it has this perhaps cool idea of having the tests for functions actually be part of the function, so maybe you could show them in documentation. And I thought this was great, this was the thing the language was going to be all about. But it looks kind of rubbish to me now, because you can't do any test setup; the only thing you can really do is give an input and an output. Well, maybe that's good if you're reversing a list, but other than that, what can you really do with it? What else could it do? Nothing. It didn't have a type system, it didn't really have a design, I wasn't working in any direction. You could return strings and maybe call a function, but that's kind of about it. And it was just a really bad layer on top of Erlang, which asked the question: why? Why did this exist? Well, it's kind of like today: I wanted to do a conference talk. So there I am, looking a bit younger, at Elixir London 2017. And I did a talk on how to write a compiler, how to write a compiler that targeted the Beam virtual machine. And it went really well; people liked the talk, I got to hang out with loads of my peers, and then I took that project and threw it away. And I didn't think about it ever again. Sort of. Because during this empty period where no work was really being done, I was doing my job, doing open source stuff, and I kept thinking back to that project and wondering: is there actually a point in making another Beam language?
And this was spurred on further because I was writing in all these really wonderful languages, and every time I was writing one of them, I was thinking, oh, I really wish I was using one of the other ones. Every time I'm writing Elm, it's really difficult to do this IO thing in Elm, and oh, there's no concurrency, I wish I was using Elixir. Or I was writing JavaScript and I really wish I had Rust's tooling. And I sort of figured, maybe it's possible to take all the things I like from all of these languages and merge them into one. Because I'd sort of accepted that the language I wanted to be writing didn't exist; I felt like I'd tried them all at this point. So could I make the thing that brings it all together? And so, after about a year and a half, the start-up I was working for was bought and trashed, and suddenly I had a lot of free time on my hands. So I thought, this is the perfect time to resurrect this project. So I remade the whole thing, and this is the syntax people keep telling me is the very first Gleam syntax, but it's not. It looks a little bit more like OCaml with bits of Elixir mixed in, I think. And this is in February 2018, okay, so maybe like a year and a half after that previous one. And I kept working on it an awful lot, and then fast forward a year and a bit later, to April 2019: we've sadly scrapped all of the nice ML syntax and we've got a much more sort of JavaScript-ish syntax. And this is version 0.1, which I was really excited about, because it did something: you could use it to write a small program, which is really cool. And it started to look a lot more like modern Gleam. Fast forward another half year: we've basically got the syntax as it is today. We did a little bit more, but that's kind of it. You'll notice the differences here are: we've got one of those little pipes, and if you look between the io and the println, we've got rid of that colon. So that's the last of the little Erlang things, sorry Erlang fans. What else happened?
We lost first class modules, a feature that people love. People absolutely love first class modules; that's something that you find in OCaml. And really, we do it a lot in Elixir and Erlang as well, because if you think about it, when we pass around an atom that is a reference to a module, well, that's a first class module: we're passing it around. We don't have module functors, but we do use them an awful lot in our APIs. Good, I am actually on the right slide. And we also had row-typed records, which is a really cool type system feature that enables you to do these really interesting sorts of polymorphism, with objects and variants, that sort of looks like interfaces in OO land, but without that same sort of subtyping thing. So these are two fantastically cool features. And we also had a more complicated way of declaring types and data structures that was much more akin to what you find in Haskell. So we got rid of all these really cool things and replaced them with a string concatenation operator, the ability to use callbacks in a slightly nicer way, and the ability to give names to arguments. So we've swapped really sexy, awesome functional programming stuff for things that are actually quite useful, but not very exciting. And this has kind of been the whole journey of Gleam. It's very easy, when making something, to get excited and distracted by all these things. And it could be, we could do this, we could do that. But what is actually the most useful thing?
And it turns out just removing things and honing in on that core, that most useful, that most productive thing, is hopefully the best thing to do. And I think we've got to a really nice place because of that. One thing that we have added that is quite big, actually, is that JavaScript compilation. So that wasn't in there originally; that sort of exploded afterwards. Which does make the ecosystem more complicated, but the language not so much. We also got a build tool, as I mentioned earlier. The idea is to have a really good, batteries-included one. Originally we were using Rebar3, which is the Erlang build tool, and it's really good and it worked quite well for us, but you could tell that we were using a tool that wasn't made for us. The user experience wasn't as good as I wanted it to be. And I didn't just want to match Erlang's developer experience, or even Elixir's developer experience; I wanted to best it in some fashions. And I've been writing a lot of Rust and Go, and they've got some really amazing tooling, and I thought, wow, let's take all this goodness that you find in these other ecosystems and pull it into the Beam ecosystem, make it grow even better. We've integrated with the Hex package manager. We're all Beamers together; it doesn't really matter what language you're writing. We want to be able to all share the same code, and all depend upon each other's projects, and share and give back. So we've integrated with Hex. So rather than just having a few hundred packages written in Gleam, we've also got the 20,000 packages that are written in Elixir and Erlang as well. And then we've got a code formatter and a language server and lots of goodies like that. So I said there are 20,000, a bit more than that, packages on Hex, the package manager, and about 200, a bit more than 200, of them are Gleam. That makes it extremely difficult to find anything written in Gleam if you want to make a Gleam project. So after a while we made the Gleam package index. And what that is,
that's a little window, just a little view that looks into Hex and allows you to see just the packages that are Gleam. So if you want to find a library for HTML, in this case, you can type in HTML. I didn't, I didn't, you're making... we'll talk about that later. Anyway, it will give you a list of packages that have the word HTML in the description or the name. Unless, of course, somebody's library doesn't have HTML in the name or the description. And then, if you find something suitable, you can use that in your project. And if you don't, then you can make a decision about whether you want to perhaps make something new, or whether you want to pull something in from the wider ecosystem. Internet points! Everybody loves internet points. So I know that stars on GitHub mean absolutely nothing, but it's been really uplifting and really wonderful, and it feels like a really good sign, that loads of people have taken those two seconds to say, yeah, this seems right, this is kind of cool. I've been doing this for an awful long time, and I think I probably would have stopped by now if it wasn't for loads of lovely people sharing their support in some small way, whether it be a star on GitHub, or a kind message on Discord, or absolutely loads of you turning up to this room today. And so it's been absolutely lovely to see that line go up and up and up. And I find it wild: I've plotted it here against two quite similar languages, Microsoft's F# and OCaml, and at some point in the last year or two we've overtaken both of them in terms of number of stars, which is absolutely incredible. I'm really excited about ML types, and also people really love the Beam, I think. So this is a really good sign for the future of the Beam. What else have we got? Has anyone heard of Exercism? Anyone a fan?
Fantastic. For those who haven't, it's your lucky day. This is a really wonderful website and project where you can go to learn new programming languages, and they've got tens and tens and tens of different languages on there, and for a few years we've had a Gleam track. They give you an exercise, some instructions, maybe some hints, and then they give you a series of tests, and you can solve it there in your browser, or you can use the command line and download it and use your favourite editor. Then, when you're happy with your solution, you can submit it off and they do a bunch of automatic grading. So they run some tests and they might do a bit of static analysis, like, oh, you've done this, maybe you didn't want to do that. And then, if you're feeling super brave, which is where the real value comes from, you can submit it to get some mentoring from an experienced programmer. There are loads of lovely people who are just sitting there helping strangers improve their Erlang or Java or Gleam or whatever. It's a really wonderful project. And last year, with some help from the wonderful Erlang Ecosystem Foundation, who sponsored this work, we went from just having a set of challenges that you can use to practice your Gleam to an entire course. So you can start by not really knowing any Gleam, and by going through this whole thing you can be taught, individually, all the different concepts. They give you a concept, then they give you a little challenge that's focused on just that concept, and then they unlock all of the exercises that they think you should be able to do now using those skills. So it's a really fantastic resource, and it's absolutely amazing that it's free, so do check it out. And people have really taken to this course. You can see in the middle, can you see where we launched the new syllabus?
Suddenly the uptake went absolutely skyward, which is fantastic. And this is not the number of people on the course; there are about a thousand, just under a thousand. This is how many solutions people are submitting, so this is actually the activity. It's absolutely wonderful: 30,000 submissions, which is a lot of learning, a lot of wasted time, who knows? So Exercism is really cool, and I really like that idea of being taught the individual concepts in a way that enables you to get somewhere and become productive. And off the back of that, and also inspired by the wonderful tutorial that Go has, we decided to take that idea of breaking the language into concepts and teaching them in an incremental fashion, where each concept builds upon the last one, and distilled it, minus all the exercises, into a sort of whistle-stop tour of Gleam. So if you go to the Gleam website today, at the very top there's that hero image with the tagline saying Gleam is great (I don't think it says exactly that, but you get the idea), and there's a big button that says get started, or try it, or something like that. And if you click that, it will point you straight onto that first lesson, and you can go from "this looks kind of interesting, maybe I'll try it" to "oh wow, I'm writing and learning Gleam", all in your browser, without having to work out how to install Erlang, and realising that apt has an outdated package so you can't actually install it properly, and oh, how do I install Rebar, and how do I do these things. No, you just go straight in and you can start learning. So hopefully people from other ecosystems, or people who are writing Elixir and Erlang, can turn up and go, oh, I want to give this Gleam thing a try, and then very quickly get whisked into being a Gleamer. They can be hooked; they can start working on the BEAM. And this comes because, A, the compiler is written in Rust, so if you have Rust you can compile to WebAssembly (WebAssembly is a very cool project), and
we can also compile to JavaScript. So if you have those two things together, you can run the compiler inside the browser, and you can also execute the code inside the browser. So we don't have to run any servers, so even I can afford this, and we don't have to worry about any security stuff; everything is just on the person's computer. It also means it's super fast; you get your feedback immediately. So, Gleam present. I'm going a bit slow; I'm going to speed up a bit. Where are we now? I want to look at some projects in the community that are really cool. My original version of this talk ended up being about an hour and a half long, so I've had to cut loads out. So if you're not mentioned, very sorry. First thing I want to say is that the Gleam Discord is wonderful. I'm super lucky to have loads of lovely people hang out there, I can see some of them here today, and there are just people helping each other, and sharing cool projects, and talking about the news, or talking about coffee or keyboards, or anything really. It's a really lovely place to either get help or to talk to people, so do join. The community is absolutely wonderful and delightful, and I'm super lucky that working with them is my job these days. So thank you so much, everyone. But now on to the things they've made. The first thing I want to talk and boast about is Mist. Mist is a pure Gleam HTTP/1.1 server that supports HTTP and HTTPS. It has WebSockets, I believe server-sent events are coming in the next version, and they're working on HTTP/2. The cool thing about this is that it doesn't wrap an Erlang web server. It is pure Gleam, and it doesn't even use Erlang's OTP; it uses Gleam's OTP. It's an entirely new implementation. And what's really cool is that it's not just proving that you can use Gleam to make sophisticated things (you know, implementing a fast HTTPS server is quite challenging), but that you can also get really good performance out of the Erlang VM. So here we've got a bunch of different web servers graphed. The ones at the
top are Mist and Bandit, Bandit being Elixir's new one. Bandit has had a new version since this benchmark was done, so I think it's actually slightly faster now, but they're about the same. You'll notice we're even beating Go, and everyone talks about how Go is super fast, but no, we in the Erlang world can do better. And we're obviously beating JavaScript. But the thing I think is really cool is that we are really beating Cowboy. We are really beating the one that we as the community have said is the best, fastest web server. It shows that we have further we can go, and it shows that Gleam can be just as performant as Erlang. So this really proves the language, I think. So, I mentioned OTP. Gleam has gone a different way for OTP, and a shout-out to Fred and his squid there. Gleam has gone a different way with OTP than most of the other languages. Elixir and Purerl and other languages, if they want to use OTP, put a very thin layer on top of Erlang OTP. Well, Gleam doesn't do that. Instead, Gleam takes the core concurrency primitives that you get from the Erlang runtime system and has made type-safe versions of all of those: the same things, like link, spawn, monitor, send, receive. And then it looks at the protocols that are implemented. OTP says you've got to implement certain messages, like system messages, and there are certain ways of doing synchronous requests and all that sort of stuff, and we've implemented those same things from the ground up in a type-safe way. And what's really cool is that we've discovered it's possible. For a long time people have said you can't have typed OTP. Well, if you have that same set of core primitives that you get inside Erlang, you can build the same thing from the ground up. So that's been really cool, and the fact that it's been used to make Mist shows that it can work, and that it can be practical and useful and performant. So, it's all very good having a web server, but you kind of need a...
Probably need a web framework, unless you want to spend all your time writing a parser for multipart form bodies. So we have Wisp. Wisp is a really lovely little framework; I can call it lovely because I made it. So if you want to do a web thing, that's a good place to start. Databases are pretty handy as well. We've got bindings for these sorts of databases, and probably some others that I haven't found. The first two, Postgres and SQLite, wrap Erlang projects, although the SQLite one can even work on JavaScript if you're using Deno. But the bottom two, they're really cool, because they're, again, written in pure Gleam using Gleam OTP. Now, this is a really cool one. This isn't quite so BEAM-y, but, so, Gleam can compile to JavaScript. Okay, so how do I do a front end in Gleam? I don't want to be writing all this JavaScript for my BEAM application if I can avoid it. So Lustre is this really lovely library, sort of quite similar to Elm, or perhaps some React state management systems, that gives you a way to make a declarative DOM. Then all you need to do is say what messages you're going to emit, and then how you update the state every time one of those messages comes in. And as an Erlanger, I look at this and I see a gen_server. I think the Elm architecture is basically exactly the same as an Erlang gen_server. Instead of calling it handle_call, we're calling it update, and then we've got this HTML thing on the side, which, I don't know, who knows. But what about LiveView? People like LiveView, right?
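Before LiveView: the Elm-architecture/gen_server parallel just described can be made concrete. The slide's example is Lustre code in Gleam; the following is only a minimal Python sketch of the same shape, with illustrative names (Increment, update, view are not Lustre's API): an init producing a model, an update folding messages into new model states, and a declarative view of the result.

```python
from dataclasses import dataclass

# Messages the component understands (illustrative names, not Lustre's API).
@dataclass
class Increment:
    pass

@dataclass
class Decrement:
    pass

def init():
    """Initial model, like a gen_server's init/1."""
    return 0

def update(model, msg):
    """Pure state transition, the analogue of handle_call/handle_cast."""
    if isinstance(msg, Increment):
        return model + 1
    if isinstance(msg, Decrement):
        return model - 1
    return model

def view(model):
    """Declarative render of the current model."""
    return f"<p>count: {model}</p>"

# The runtime is just a fold of update over the incoming messages.
model = init()
for msg in [Increment(), Increment(), Decrement()]:
    model = update(model, msg)

print(view(model))  # prints "<p>count: 1</p>"
```

Whether the loop is driven by DOM events in a browser or by messages in a process inbox is exactly the point of the comparison: the update function doesn't care.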
That's the hotness at the moment. So LiveView, in case you don't know, which I find you almost certainly do know in this room: that's when you have that same sort of idea, you get a declarative DOM on your front end, but all your state updating, where you hold everything, is on your back end, and they talk to each other over WebSockets. This results in a really lovely developer experience, and you can do all sorts of things that you can't practically do if all the state is on the front end. Well, Lustre can do that as well. That last component I showed you, there's nothing that says it has to run on the front end. It could also run on the back end, just rendering to HTML, or you could put it on both. So just by saying, hey, start an actor with this, and then here's WebSockets, you can have LiveView with Lustre. And what's really cool is that you can now pick which parts of your application are going to use which architecture. You know, there's a criticism of LiveView that certain actions that should be really snappy are quite slow, and that if you lose network connectivity your whole application stops working. Well, then maybe put those bits, the ones that need to be resilient to network failures, on the client. You can pick exactly what you want. So we've got loads of servers and clients and API clients and middleware that are all part of this wider HTTP ecosystem. And one of the really cool things about this is that there is a Gleam core library, called gleam_http, that defines a few types for requests and responses and headers and all these things. So all of these libraries, even though they've been made independently by different people, can all work together. They all share the same primitives, and you can say, well, I want that API client with that HTTP client on the front end, and that HTTP client on the back end, and I'm going to handle it with that server in my tests. Fantastic, and it all just knits together. Enough about web. There's
lots of other cool places we can run code. One of them, where we probably all do an awful lot, is the command line, and there's this really lovely project called Teashop, where, it's a similar sort of Elm update type thing, but rather than events coming from a DOM, it's events coming from a terminal. So you can make these really lovely interactive TUIs in Gleam. Sadly, at the moment you can't run this code on the BEAM, because there are a few quirks of how the BEAM handles standard input, but hopefully we can make a proposal to the OTP team and they can expose a couple of functions that you currently can't get to, and then we can have exactly the same thing in Elixir and Erlang and all sorts of other languages as well. And because I've shown lots of libraries, let's look at an application. I think this is really cool. This is, I'm going to butcher the name, Electrophonie, maybe, which is a music streaming app, similar to Spotify or such, and it is written in Gleam, compiled to JavaScript using Lustre. And because we've got this really excellent FFI, we can call into other languages, and we can use all these web APIs and do things like use the media keys, be on the lock screen of a phone, be in that little bit at the top of your computer where the music thing is, I don't know what it's called. And, yeah, the ecosystem is really growing. I think there's a name for that kind of curve; I'm not sure what it is. But we are now 1.2% of Hex, which is a tiny number, but bear in mind we're not at version one yet, and Elixir's been at version one for 10 years, something like that. I think that's really impressive, and I really hope that it's going to keep going. So, where are we going? What comes next? So, Gleam isn't done. A lot of things are very mature, but there are still things to work on, and the thing I really want to focus on for the next year is the language server. So, what is a language server?
Just to make sure everybody's on the same page: traditionally, if you are making a text editor, an IDE, and you want to support a language, or a plugin so that people can support a language, you then need to work out how to learn all those things about the code. Oh, how do I know if there's an error? How do I know what I can auto-complete with? How do I know what snippets to expand? How do I know what refactorings I can do? You'd have to individually implement all those things. But some clever clogs, I think at Microsoft, came up with this idea: we're going to have a language server, and we're going to define a protocol that all the editors can speak and all these backends can speak, and all you need to do is implement the protocol. And then suddenly we can have one brain of an editor, and it can talk to Helix and Vim and Emacs and VS Code and Zed and all these other cool ones. And so we've got one of those, built into the binary that you get when you download Gleam. Excuse me. And it works, but it doesn't work as well as I want it to. It's definitely the least mature part of the whole Gleam ecosystem, and a big part of that is my fault: I've been developing it entirely on Visual Studio Code, and the protocol is a little ambiguous in places, in a way that I find quite irritating but apparently is fine, so all of the editors do slightly different things when you give them certain data. So we need to spend more time working on the other editors and making sure that it's rock solid and works exactly the same in all the other ones. And I've switched to Neovim now, so it's not going to be a problem anymore. So, first step, we're going to get it all working super reliably for everybody, and then we're going to flesh it out to have everything. We want to have the same experience that you get with rust-analyzer, or maybe even try to get close to what a JetBrains IDE might give you. We want it to be a really excellent experience for all these different things: find
references, renaming things, all sorts of refactorings. And code generators, I think, are really cool too. There are loads of bits of trivial code that we bash out every single day without thinking about it. Well, if it's that easy, just press a button and have the tooling spit it out for you, and then you can choose to edit it in whatever way you want. So, breaking changes. Over the last year we've had an awful lot of breaking changes, because there was a design, and then suddenly a bunch of you lot turned up and now we had users, and then we realised that, oh, actually, that original thing that I made up five years ago while I was sitting in my room wasn't the best idea. There were problems, so we've made a load of breaking changes in order to refine them. What breaking changes are coming next? Hopefully nothing. I think we're there. I think we basically have the language working exactly as it should, which is wonderful. And that kind of begs the question: does that mean we can work towards a version one? Yeah, we're working towards a version one. So what does that mean? When we get there, what are going to be the points of version one?
And I think there are two pillars to this. The first one is productivity for people who are using Gleam. So that means no breaking changes; you can't build on top of a foundation that's constantly changing under you. We won't have language bloat. I'm really proud of how we've honed in on what makes Gleam good, and by having a very small, concise, consistent surface area it's easy to work with, and I want to keep that property. I think it's very tempting for languages to hit version one and then go, oh, maybe we need this feature, or maybe we need type classes, maybe we need these things. No, we're going to keep it super focused, and it's going to stay exactly that same language that you really love, or don't, you know, whatever it is; it's not going to change. And we're going to keep working on improving the developer experience. So, more tooling; keep improving that. If there's something that's annoying to do that everyone has to do, let's make a library for that. You know, just keep solving those problems. And document everything. We want to have cookbooks and guides and tutorials and examples, and just make it really easy for you to go: how do I do this in Gleam? Oh, look, it says here, here's how I do it, now I can get on. And the next thing is sustainability. I am not Microsoft. I do not have 50 developers working on this. I have me, and some lovely people who are kind enough to agree to join the core team, which means they're just called the core team and they do free work for me. It's fantastic. Thank you very much. So we want to make sure that every bit of work we're doing is as impactful as possible. You know, it needs to be...
Everything needs to be meaningful, and if we can't justify it as being impactful for a large number of people, we just shouldn't do it. We've got to make sure everything is as efficient as possible, not just in the code, but in our practices as well. We're going to document everything internally. We're doing really well with this, but I think we can do even better. I would like people to go, oh, there's something, there's a quirk with the build tool, I think this is a bug, okay, I'm going to look inside and see what it is, and then just see loads of comments, loads of docs, and then they can hopefully work out, oh, that's doing this, that's doing that, I can make a contribution to this. And the last two things are about funding the project. So, I work on this full-time, and I work on this full-time thanks to GitHub Sponsors, primarily. So here, charted in pink, that's how much income we have for the project. I'm super happy that it's stayed super stable. And up there in blue, that is the median for a lead developer in London, which is the city I live in. I really want to get that up to the blue line, for obvious reasons, but I'd like to go further than that. I'd really like it if we could afford to have, like, one two-pizza team, is that the term? I want that core development team to be able to afford to work on this thing that I think is useful and important and productive, to be able to work full-time and be rewarded appropriately. It shouldn't be charity, I think, for these people; they're doing really useful work for the ecosystem. And then, if that stable foundation is there, other people feel more confident building their businesses and their projects and so on, on top of that. So if you want to help out, do join the...
Do start sponsoring, or get your employer to. About half of that previous income comes from one place, and that's from Fly, our big corporate sponsor. They're a really wonderful deployment platform. And the other half comes from people donating, like, five, ten, twenty dollars. And they're both wonderful, but it means there's quite a lot of weight on one organisation. I'd really like to spread that out, so if we could have a bunch of smaller corporate sponsors, I think that would be much better for the long-term health of the project. And if you've got ideas for other things we can do (so, I know Elixir has a sort of quasi-support thing that you can sign up for), if you've got some other ideas, get in touch with me; I'd love to hear what your thoughts are. So, when is Gleam version one? How much more have you got to do? Well, the answer is: now. We're there. Like, we're completely ready, and depending on how much you lot distract me for the next few days, I hope to get a release candidate out today, tomorrow, at some point in the immediate future. So this is a really exciting time. Good. So, questions? Any questions? Thank you very much for creating Gleam. Could you elaborate more on what happens when we use --target js, when we're targeting JavaScript? Repeat the question? Yes, so the question is: can I explain what happens when we target, compile to, JavaScript? Okay, so we compile to... what can I say about it?
So we compile to JavaScript source code. We don't add a runtime; we keep very close to JavaScript, so your scripts end up being very small, suitable for use in a browser. But because we don't have a runtime, it means we don't have an implementation of, say, the Erlang concurrency inside JavaScript. So you'll be using a different concurrency pattern if you're using Gleam JavaScript than if you're using Gleam Erlang, and that means there are certain incompatibilities between the Erlang and JavaScript targets. You can't write a library that easily abstracts over both if it does file IO (well, that's a bad example), if it does, like, HTTP requests, for example. So that's the trade-off, but then it means you can work very well with the Erlang... sorry, with the JavaScript world. Can we run Gleam in the browser through... Sorry? Can we run Gleam in the browser through WebAssembly? Can you run Gleam in the browser through WebAssembly? No, but that's something we want to explore in future. Not because we particularly want to do WebAssembly... sorry, not because we want to do it in the browser, because we can already do that with JavaScript, but there are loads of other places you can use WebAssembly. I wanted to talk about this, but I didn't have enough time. I think it would be really exciting if we had a good way of executing Gleam inside the compiler, because there are loads of optimisations we could do. We could start looking at certain kinds of code generation, metaprogramming stuff that you can do in Elixir, for example, that you can't do in Gleam. But we can't do that because we don't have a copy of the BEAM, this massive thing, inside the Gleam compiler. So if we had, like, a little VM, maybe we could do that, and WebAssembly is a really good little VM for this whole thing. Thank you. Any other questions?
Yeah, I do have a question. I found Gleam, I think, in the last year during Advent of Code, and it's a really great project. But, as you know, I think one of the main things that drew me to the language was the vibrant pink colour. Is there a story behind it? Why is Gleam pink? Great question. Great question. This was, what is his handle, K-Tec I think it is, and he just threw out this idea that it should be pink. And I was like, oh really, why? That's really odd. And I liked it because it's different. You know, you see this pink and you don't go... you know, if you see a blue you're like, is that TypeScript? Is it Python? It's visually very different. And the other thing is, I think it's quite friendly, and hopefully it's welcoming to different people. I hope that if someone sees a bright pink thing they go, oh, that's cool. And it also says, like, you know, be nice to each other. No Nazis on the website, you know. I'm hoping people will see that and get an idea of what we're about. We're about being supportive and friendly and looking after each other. So: look different, and hopefully say something about the kind of vibe we want inside the community. Well, good work, thank you. I mean, it's probably the best thing about Gleam, I think. Currently you target both Erlang and JavaScript. Do you plan to introduce other targets, like WebAssembly?
So, you know, I think there's a problem when people are making languages: it's very easy for them to do things that are cool for a language maker to do. So, for example, it would be cool if I could target WebAssembly; it would be cool if I had type classes. I don't want to do it for those reasons. I want to drive changes by them being impactful to the community. And as I said with WebAssembly, it could be a nice VM that you can embed in the compiler to enable compile-time code execution. You could use it to do, like, Gleam script, so you could have just the binary on your server and use it to execute tiny little scripts, when you don't want a whole virtual machine installed on that computer, for example. All sorts of little things like that. And so I think there is a good argument in favour of having WebAssembly, and it's something I would quite enjoy, so I'd like to explore it in future. But it's not as high a priority as getting the language server working really well, getting the documentation fantastic, making sure we've got a really lovely, Elixir-Phoenix-like experience for doing web development in Gleam. So I would like it. Maybe one day; don't hold your breath. When you do message passing in Gleam, do the messages support function closures as well? And if so, how does your type system handle it? As in, you're asking, when you're doing typed OTP, can you send a function to another process?
Yeah, function closures. Yes, okay. So, it's quite tricky; how much context do I give this? Because I've thought about this for years, and it's quite hard. Yes, you can pass any data to another process. The key difference between message passing in typed Gleam OTP and in Erlang OTP is that you need more than just a PID to send a message to something. If you've used languages that have channels, so, for example, Go or Rust, you don't just have the handle for the thread and pass a message to it; you've got to have a channel, and you send the message via the channel. So it's the same idea. We have this idea of a subject. We don't call it a channel, because that would be confusing: it still goes to a process inbox, and you can't give a channel to a different process and have them start pulling from it. And the subject is the thing that's typed, not the PID. It looks like you should be able to type the PID, but then you suddenly realise, if you build it from the ground up, that you can't implement call, synchronous message passing, if you have typed PIDs, because the type of the return doesn't match the type of the PID. So you need something more flexible, so we have this thing. And if you look under the hood in Erlang OTP, they have the same abstraction (we've got 14 seconds left), and it's used to implement gen_server:call. So they have this "from" thing; it's the same as the "from" field in gen_server:call. That's the thing that you send messages around with. I have three seconds left. Thank you very much, everybody. Thank you very much.
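That last answer, about needing a reply subject rather than a typed PID to implement synchronous call, can be sketched outside Gleam too. What follows is a minimal Python analogue, not Gleam's actual API: queue.Queue stands in for a subject, and Call, actor, and call are illustrative names. The caller creates a fresh reply subject per request, so the reply's type is independent of the inbox's type, which is exactly what a typed PID alone cannot express.

```python
import queue
import threading

class Call:
    """A synchronous request: a payload plus a reply subject."""
    def __init__(self, payload, reply_to):
        self.payload = payload
        self.reply_to = reply_to

def actor(inbox):
    """A tiny counter process: handles Call messages until told to stop."""
    count = 0
    while True:
        msg = inbox.get()
        if msg == "stop":
            return
        if isinstance(msg, Call):
            count += msg.payload
            # The reply goes to the caller-provided subject, so its type
            # is independent of the type of the actor's own inbox.
            msg.reply_to.put(count)

def call(subject, payload):
    """Synchronous request/response, like gen_server:call."""
    reply_to = queue.Queue()          # fresh reply subject per request
    subject.put(Call(payload, reply_to))
    return reply_to.get(timeout=1)    # block until the actor replies

inbox = queue.Queue()
threading.Thread(target=actor, args=(inbox,), daemon=True).start()
print(call(inbox, 2))  # prints 2
print(call(inbox, 3))  # prints 5
inbox.put("stop")
```

The reply_to queue plays the role of OTP's "from": the one piece of per-request state that lets a response have a different type from the request inbox.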
Property based testing in Elixir
That's not helpful at all. Okay. Now it turned from yellow-greenish to green-greenish. Okay. Okay, cool. So let's write a unit test for a very simple use case in which we want to add two numbers together, and it would look something like this. So, usually when I write tests, I try to come up with at least three cases: a positive one which tests the happy path, one that tests the opposite, and then I try to find or think of edge cases in which my software could actually fail. So this is such an example, in which we assert that two plus two is four, that two plus two is not equal to five, and we also try to find some edge cases, like if someone combines types or does some other funky stuff, that my software still works. So if you look at that example, you can understand why I think writing tests can be pretty boring. So that's my conclusion: testing can be boring. Then let's look at another aspect of writing unit tests: what if our software project grows? If we have n features, then we have some linear amount of tests to accompany them. But what if we then start to combine features? So, function A and function B: we have to test combinations, pairs of those functions, as well. Then the amount of tests grows quadratically. And if we go further and combine even more features, at a certain point that growth makes it really hard to scale, to go further. So testing, I think, can be hard, at least if you want to do it properly. Like, if you really want to have confidence in your code, you want to have as many cases covered in those tests as possible. If you approach it that way, then testing can be hard. So how can we fix this? Well, some people came up with property-based testing. A summary of it: instead of us humans writing examples, let's define properties of our code and let the computer come up with the cases. So that's the folks at a company called Quviq.
They came up with this idea around the 2000s, and they built a project and a company around those ideas. They've also added some more features to it as well. But the general idea of property-based testing is that we define properties instead of examples for our tests. So let's have a look and compare how we could do that. Let's say that we write a test for string reversal. So we take some string, and we have a function that reverses the order of its characters; what would a unit test for such a case look like? Something like this. So, a raise of hands: if you write tests like this, who feels confident that these tests actually cover all cases of our function? No hands raised; nobody feels confident. One, maybe. Yeah, everybody is, like, you're feeling anxious, right? You're not fully convinced about these tests. You could probably write them in a different way. But if you were to translate these things into properties, so let's take a pause and think: if you were to try to express that behaviour, that functionality, in properties, how would you do it? Like, that it contains numbers, special characters and so on. You would come up with examples of special characters, numbers, these kinds of things. So, examples, right? Basically, examples of edge cases, weird input. But that's not how you would define your software as a property. Those are again examples, clear use cases, but they're not properties of our code, right? One property is that the length of the string you put in is the same as the length of the string you get out. Yeah, that's a good one. So if we reverse the string, in both cases the length of the string should stay the same. That's a property, right? And it's still readable. Another one would be: if we reverse the string twice, then we should get back the original one. And this is how you would write it down in a property-based test. So we define a property: reversing a string twice returns the original.
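The slides show this with an Elixir property-testing library; as a rough illustration of what such a tool does before any shrinking kicks in, here is a minimal Python loop. The names (reverse, random_ascii_string, check_property) are made up for this sketch: generate random inputs, assert the property on each.

```python
import random
import string

def reverse(s):
    return s[::-1]

def random_ascii_string(rng, max_len=20):
    """Generator: a random printable-ASCII string of random length."""
    n = rng.randrange(max_len + 1)
    return "".join(rng.choice(string.printable) for _ in range(n))

def check_property(prop, gen, runs=100, seed=0):
    """Generate `runs` random inputs and assert the property on each,
    which is roughly what a property-testing library does."""
    rng = random.Random(seed)
    for _ in range(runs):
        s = gen(rng)
        assert prop(s), f"property failed for {s!r}"

# Property 1: reversing twice returns the original.
check_property(lambda s: reverse(reverse(s)) == s, random_ascii_string)
# Property 2: reversing preserves length.
check_property(lambda s: len(reverse(s)) == len(s), random_ascii_string)
print("100 cases passed for each property")
```

A real library adds smarter generators, failure reporting, and shrinking on top of exactly this loop.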
And on the second line we actually say: from all the possible string inputs, so we ask the library to come up with any strings, if we reverse that string twice, it should come back to the original, right? And if we run this, the library will generate about 100 cases for us and, in doing so, try to prove that this property holds for our code. So, other examples: if we reverse a list, then the first item becomes the last one and the last item becomes the first one. If we have a palindrome and we reverse it, it stays the same; palindromes are strings which, when reversed, return the same string. And, like you said, the amount of items, and this applies to any kind of list or string that we're reversing: if we reverse, the amount of items stays consistent. It's not like some things disappear magically. And the funny thing is, if we write the property again, I don't know if anybody noticed, but in the previous example I specified that I want to generate examples of strings which only contain ASCII characters. But if we do the funky characters part, so we say, well, generate any string from the UTF-8 set, what will our library tell us when we run that? It finds an edge case. So there are Unicode characters that apply to the previous character, combining characters, and when we reverse them, you don't get back the original anymore. And these are the kinds of edge cases which we, as humans, probably couldn't come up with. Well, you do know that they exist, but if I asked you, like, now, within these five, ten minutes, to actually write this example, you wouldn't be able to do it. And it normally runs about 100 cases, but even after eight cases it found this example. So that's great: it found an edge case.
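That Unicode edge case can be reproduced without a property-testing library once you know what to look for. Elixir's String.reverse works on grapheme clusters; the Python sketch below uses a deliberately naive clustering rule (attach combining marks to the preceding character, a simplification of the full Unicode segmentation algorithm) to show how a lone combining accent breaks both the reverse-twice property and the length property.

```python
import unicodedata

def graphemes(s):
    """Naive grapheme clustering: attach combining marks to the
    preceding character (a simplification of the Unicode rules)."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def reverse_string(s):
    """Grapheme-aware reversal, in the spirit of Elixir's String.reverse."""
    return "".join(reversed(graphemes(s)))

# A combining acute accent followed by 'a': two clusters on the way in.
s = "\u0301a"
once = reverse_string(s)      # the accent now combines with 'a': one cluster
twice = reverse_string(once)  # a single cluster can't be split back apart

assert twice != s                   # reverse-twice property violated
assert len(graphemes(s)) == 2
assert len(graphemes(once)) == 1    # the length property breaks too
print("found the Unicode edge case:", repr(s))
```

This is precisely the kind of input a random UTF-8 generator stumbles onto quickly, while a human writing examples almost never would.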
The other thing that is not shown in the example: if you write a property and the tool finds a case for which the property fails, property-based testing tools are also able to shrink down the case. It does something like a binary search: if I have a list of numbers and our test fails, then it tries half of the items from that list, and if it still fails, it goes on and on until it finds the minimal input under which our property doesn't hold anymore. So let's talk about some use cases — where has this kind of tooling been used? Volvo, at a certain point, wanted third-party parts to be replaceable by other companies, so they came up with specifications for how these components should interact with one another. They wrote a specification about 3,000 pages long, they had about six vendors come in to test the specification, and combined those vendors had a million lines of code written. When they used property-based testing to test these six vendors' implementations of the specification, they found about 200 issues. 100 of them were actually in the specification itself, and 100 of them were in the combination of those parts. Because a car consists of several parts: it could take component A from vendor A and some other component from another vendor, and they had tested components in isolation but never together. So the combination of these components yielded errors as well. Klarna is a financial system, and at a certain point they had a problem which occurred only once every several weeks, and they had a hint: it came up when generated files were over 1 gigabyte big. They spent six weeks full time investigating this issue and couldn't find the source. They could stumble upon it, in some cases trigger it, but it was practically impossible to find out how and where it came from.
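The shrinking step the speaker describes can be sketched as a greedy search that repeatedly drops chunks of the failing input while the property still fails, halving the chunk size when no further cut works. This is a simplified illustration of the idea, not any particular tool's algorithm:

```python
def shrink(failing: list, prop) -> list:
    """Shrinking sketch: given an input for which prop(...) is False,
    return a smaller input that still makes the property fail."""
    current = failing
    chunk = max(1, len(current) // 2)
    while chunk >= 1:
        shrunk = False
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if not prop(candidate):      # still fails: keep the smaller input
                current = candidate
                shrunk = True
            else:
                i += chunk               # this chunk is needed; move on
        if not shrunk:
            chunk //= 2                  # no progress: try smaller cuts
    return current
```

For example, with the property "no element is greater than 100", the failing input `[1, 2, 300, 4, 5]` shrinks to `[300]` — the minimal counterexample that pinpoints the culprit.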
And it took them less than three days in total to come up with a model and write the properties, and less than a day of running the properties until they stumbled upon the race condition in which that error actually occurred. So those are two big examples. What are the other occasions in which we could use property-based testing? One obvious one is if we have symmetric functions: if you serialize and deserialize something, those are opposite functions, and you can easily property-based test them. It's also good if you have functionality that needs some kind of mathematical proof, and for comparing systems. In one case I had to rewrite a system in another language, and then it's nice to have the old system and the new system and test them against one another. And I haven't really mentioned it during this talk, but the tool that Quviq has built also has special facilities to test concurrency. So if you have a system like the Klarna financial system, you can test what happens if five people do some transactions simultaneously. So, conclusion. Property-based testing can generate all kinds of test cases for us — very often edge cases that we as humans don't think about, because it tries the spectrum of all inputs that you specified and finds very weird items. Because of the shrinking, it also helps narrow down and diagnose what the actual culprit of the error is. It helps reduce complexity. And like I said, because you have to think about properties instead of examples, it makes you think differently about your tests — more philosophically — and I think that in itself is already an advantage of learning property-based testing. I think we're out of time. So a small thank you to Slidesgo: if you think this is a nice presentation, I pulled the template from their website, so I have to contribute back and mention them.
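The serialize/deserialize use case is the classic round-trip property: `deserialize(serialize(x)) == x` for arbitrary values. This sketch uses Python's `json` module as the pair of opposite functions and a tiny hand-written value generator (in a real library, generators are derived from type specifications; everything here is my own illustration):

```python
import json
import random

def roundtrip_holds(value) -> bool:
    """Symmetric-function property: decoding an encoded value
    gives back the original value."""
    return json.loads(json.dumps(value)) == value

def random_value(rng: random.Random, depth: int = 0):
    """Tiny generator for JSON-compatible values: ints, short
    strings, and shallowly nested lists of those."""
    kinds = ["int", "str", "list"] if depth < 2 else ["int", "str"]
    kind = rng.choice(kinds)
    if kind == "int":
        return rng.randrange(-1000, 1000)
    if kind == "str":
        return "".join(rng.choice("abcxyz") for _ in range(rng.randrange(5)))
    return [random_value(rng, depth + 1) for _ in range(rng.randrange(3))]

rng = random.Random(42)
assert all(roundtrip_holds(random_value(rng)) for _ in range(100))
```

The same shape works for comparing an old and a new system: replace `json.loads(json.dumps(value))` with `new_system(value) == old_system(value)` over generated inputs.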
I also want to thank you all for attending this early and for listening, and the organizers, of course, for not forgetting us. So if you think: well, this was a nice introduction, it sparked my interest, how can I continue learning this? There's a good book on a website called propertesting.com. If you're not using Elixir or any of the other languages covered there, there are also libraries in other languages — Python has Hypothesis, for example. Look it up under either property-based testing or generative testing; some communities call it differently. And if you think: well, how should I think when writing properties? Then John Hughes, one of the founders of Quviq, also has a good talk in which he discusses how you come up with these properties, how you think in this way. So, I don't know if we have time for questions.
gen_statem Unveiled: A Theoretical Exploration of State Machines
especially state machines and how they are handled in Erlang, and also from a theoretical point of view. So, it's up to you. Thank you. All right. Yes, as he said, I'm relatively young but I'm an old-school guy, so I code in Vim and use Erlang. So, this went too fast already. I work at Erlang Solutions. We do Erlang stuff — concurrency, scalability, the useful things that most of you will hopefully be familiar with — and we also contribute a lot to open source. This talk is going to be about state machines, as you heard. First, a question of protocols. What are protocols? I wanted to make a survey and ask you and so on, but we have limited time, so I'm going to answer the question already: a system of rules. A few examples — okay, I need to point here for this to work. A protocol defines the system of rules for the syntax and semantics of the program that you want to write. Some examples, the usual ones: TCP for network communication — connection oriented, stream oriented, messages are ordered and they are acknowledged. Another common example, TLS for privacy, integrity and authenticity — encryption, very important; I hope that everybody has HTTPS enabled in their browsers by default. Some other examples are file formats or markup languages: parsers for them can also be implemented as state machines. The two classic examples are XML and JSON. XML is particularly interesting to me because I work on an XMPP messaging server, written in Erlang, of course. If you saw our talk at CodeBeam — for those that are following CodeBeam, Pablo and I talked about the state machine re-implementation in MongooseIM — this is a bit of a continuation of that.
Some more complex protocols can be implemented as state machines, like HTTP and, as I mentioned, XMPP, which is my specialty. It's extensible — that's the X, and my favorite part of the whole thing — an instant messaging protocol that also has presences (the green bubble, whether your friend is connected or not), does contact list maintenance in the core protocol, has 500 extensions, and lets you build your own. This is the state machine diagram for the protocol — much like a flow chart on steroids, I really like that analogy. With state machines we do the usual thing: you draw the states with some arrows, and the arrows have labels saying how you transition to the next state. Finite state machines give you a way to visualize a system that can be very complex. Why state machines? State machines can be seen as a model: we want to model the behavior of a protocol that can be very complex, like TLS or HTTP, which most of you will be familiar with, or XMPP, my specialty. Let's talk quickly about state machines in particular. A few formalities — I studied mathematics at university, so I'm excited by these weird symbols, but some people find them off-putting, so I will try to make it pleasant. Some terminology: we define an alphabet — mathematicians use Greek symbols — which is the set of input symbols: zeros and ones, or ASCII characters, UTF-8, or complex symbols treated as a single element, and you can build equivalences. One of the weakest formalisms is regular grammars; it's how you do regexes. A regex — this thing that is written once and never read again, but very powerful — is theoretically equivalent to a finite state machine. Again, this is jumping ahead too fast. Something a little bit more powerful is the pushdown automaton. I'm not going to focus on this one too much; the key difference is that it's the same thing as before, plus a stack, and the stack behaves as you would expect.
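The regex-to-FSM equivalence can be made concrete with a tiny example. Here is a deterministic finite automaton for the regular expression `ab*` (an `a` followed by any number of `b`s), written as a transition table in Python — an illustration of the theory, not anything from the talk's slides:

```python
# DFA for the regular expression a b*: missing entries are the dead state.
DFA = {
    ("start", "a"): "accept",
    ("accept", "b"): "accept",
}

def matches(s: str) -> bool:
    """Run the DFA over the input; accept iff we end in 'accept'."""
    state = "start"
    for ch in s:
        state = DFA.get((state, ch))   # None acts as the dead state
        if state is None:
            return False
    return state == "accept"
```

`matches("abbb")` is true and `matches("ba")` is false, exactly as `re.fullmatch("ab*", ...)` would decide — the table is the machine the regex compiles to.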
The transition function that used to take the state and the input symbol now also takes the top of the stack, and part of the function's output is whether you pop something from the stack or push something onto it. A PDA is said to consume a string you give it if it arrives at one of the final states with an empty stack. There are equivalent definitions — not all of them require the empty stack, but I choose that one. They are equivalent to context-free grammars: parsers, but not compilers. Why not a compiler? The thing about being context-free is that it doesn't remember symbols that were defined before. A compiler — for example the usual compiler for C — needs to remember the definition when you write `int e` and then use `e` later below. A parser doesn't remember that; you need symbol tables. The parser only builds the syntax tree. And the fancy one, the theoretical computer: Turing machines, which are again the same thing, but the stack is replaced by a tape that is infinite. It is equivalent whether the tape is finite on one side and infinite on the other — all of those are equivalent — and whether it has two tapes is also equivalent; we will get to that. The function takes the tape and the action: go one to the left and write something, or go one to the right and write something. Very similar. A Turing machine is said to consume a string when the next step is undefined — when it halts. You have all heard of the halting problem: there is no way to know whether a Turing machine will halt. That is important. They are equivalent to unrestricted grammars — compilers. In the Chomsky hierarchy there are four levels; the three things that I describe are types three, two and zero. There is something at level one that is not directly useful for the moment, so I skip that. So how do they compare? This goes very fast sometimes. That's the power that they have: a Turing machine can do all the others, and a PDA can do what the FSM can.
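The "plus a stack" difference is easiest to see on balanced parentheses — a context-free language that no finite state machine can recognize for unbounded nesting, because it would need unbounded memory. This is a Python sketch of that pushdown-automaton idea, accepting on an empty stack as in the definition above:

```python
def balanced(s: str) -> bool:
    """Pushdown-automaton sketch: push on '(', pop on ')',
    accept iff the stack is empty at the end."""
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)       # push
        elif ch == ")":
            if not stack:          # pop from an empty stack: reject
                return False
            stack.pop()            # pop
    return not stack               # accept on empty stack
```

The single stack is exactly the extra power over an FSM; it is still not enough to remember arbitrary earlier definitions, which is why parsers built this way need separate symbol tables.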
They contain the power of each other. Two FSMs running together still have the same theoretical power; likewise a PDA with a finite buffer, or a PDA plus a finite state machine, is still only as powerful as one PDA. Turing machines — whether multi-tape, or with a tape bounded on one side — are all equivalent again. A Turing machine doesn't get more powerful by giving it 100 tapes; maybe it gets more performant theoretically, but the problems it can solve are all the same. And a PDA with two stacks is really a Turing machine, because with two stacks you can go in both directions: when you give the PDA two stacks, you build a Turing machine. So conceptually, finite state machines can keep track of one thing: the state. Pushdown automata can keep track of two things: the state and the top of the stack. And a Turing machine can keep track of infinitely many things. When I was going through the mathematics and came to this conclusion, I found it funny for a completely unrelated reason. The Indo-European languages — I mean human languages — used to have the concept of the dual as something different from singular and plural. The function that these machines compute depends on one thing, two things, or an infinite number of them. And I found it very funny how human languages used to have such a thing as a dual, a different grammatical category between one and infinite: when you build the declensions, the dual had its own forms. Why do I know this strange thing about languages? Because I live in Poland, and Slavic languages have some remnants of that dual concept. There is this famous joke that in Polish you have like 100 ways to decline the number two — more ways to decline the number two than the number three — because of that old dual. So two is special. I live in Poland, but I'm not Polish. It's challenging. So, do FSMs produce output?
Let's move slowly toward what is useful here. We can define finite state transducers: the same thing as before, plus an output alphabet. The transition function takes the state and the input, and decides the next state and a symbol for the output. Consuming a string is defined the same way, and they are also equivalent to regular grammars. When it comes to the problems they can solve, again, they're all equivalent: you get fancier tools, but the properties are all the same. There are many ways of defining transducers, but let's focus on two: Mealy machines and Moore machines — the difference being whether the output symbol depends on the input and the previous state, or only on the previous state. There is a way to define a Moore machine from a Mealy machine, but not the other way around, so Mealy machines are a bit more powerful. Now something a bit more useful: how do they compare? They are still the same as the FSMs, but these can be composed. We are getting into a bit of engineering — almost there. This is a thing, laser... yes, oh god, come on. So: given three sets of states and three alphabets, one machine goes from one state and one alphabet to the next state and the second alphabet. The second machine uses the output of the previous one as its input, so you can define the composition as a state machine that takes the first alphabet and the first set of states and gives you the third alphabet and third set of states. Composition — cool. Why? Because you can implement all these things as state machines, where the output of one is the input of the next. Take my stack, XMPP: you can implement TCP as a state machine. Have you heard of the Erlang socket, the new socket? TCP there is implemented on top of gen_statem — you can go to the source code. So I have the output of one gen_statem feeding into the input of the next gen_statem.
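The Mealy-machine composition described here can be sketched in a few lines: each machine is a table from (state, input symbol) to (next state, output symbol), and composing two machines is just feeding one's output string into the other — the same shape as the TCP → TLS → XML → XMPP pipeline. The machines below are toy examples of my own, not the talk's:

```python
def run_mealy(transitions, start, inputs):
    """Run a Mealy machine given as {(state, symbol): (next_state, output)}
    over an input string; return the output string."""
    state, out = start, []
    for sym in inputs:
        state, o = transitions[(state, sym)]
        out.append(o)
    return "".join(out)

# M1 maps bits to letters; M2 uppercases every second letter it sees.
M1 = {("s", "0"): ("s", "a"), ("s", "1"): ("s", "b")}
M2 = {("even", "a"): ("odd", "A"), ("even", "b"): ("odd", "B"),
      ("odd", "a"): ("even", "a"), ("odd", "b"): ("even", "b")}

# Composition: M2 consumes M1's output, like TLS consuming TCP's output.
composed = run_mealy(M2, "even", run_mealy(M1, "s", "0110"))
```

Here `run_mealy(M1, "s", "0110")` produces `"abba"`, and feeding that through `M2` yields `"AbBa"` — one machine per protocol layer, composed by piping outputs to inputs.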
TLS is also implemented as a gen_statem, throwing output to my thing, to the XML parser, which throws its output to the XMPP protocol. So we are composing things. One last theoretical point. The union of FSMs — uniting all the states and strings — is also an FSM. Intersection — taking the states and input symbols they have in common — gives you a very small FSM, but still an FSM. Reversing: still an FSM. The empty machine — no states and no input — is also an FSM, one that does nothing when you take its union or concatenation with another FSM. And a homomorphism — a function that transforms alphabets and states into other alphabets and states — preserves the structure of an FSM. So FSMs form a semiring, which is an algebraic structure. Why is it useful to have such algebras? To prove things that you cannot prove with Turing machines, because Turing machines do not form an algebra. So now let's do some engineering: gen_statem. As I said before, it's a Mealy machine: it takes an input symbol and a state and produces the next state and an output symbol — you follow, I hope. We can consider that the inputs are the messages in the mailbox and the output symbols are side effects, for example sending messages to another mailbox. gen_statem — I'm a big fan. I love it, but I know that people sometimes don't use it, maybe because it's confusing, or complicated. So I'm going to try to explain one thing that is very useful here: an extended mailbox. This comes from a discussion the OTP team had when they put up the pull request for gen_statem — a big discussion with over a thousand messages that was probably forgotten, but when I discovered gen_statem and liked it, I went to the source and read that super long thread, and there are useful things said there. A way to visualize a gen_statem: imagine that it has one queue — something more than the process mailbox — with three pointers.
The head points at the oldest event, the tail points at the youngest, and current is where I am now. You can move where current is with some of the actions that gen_statem gives you, for example postponing an event. Postponing an event means that current moves to the next event, but the event is not forgotten. A different action will put current back at the head. If you don't postpone and you consume the event, it is removed from the queue. When the state changes, current goes back to the head. next_event inserts things where current is, not at the tail, and timeouts insert things at the tail. So the engine — the gen_statem implementation — allows you to extend the inputs that your formal state machine is going to get. How does it work? Imagine that we are here: we have event one and we decide to postpone it. What happens? It's still in the mailbox; we are just now going to deal with event two. Now we decide to do some stuff and then go to the next state. So event two has been processed, and current, because we changed the state, goes back to the beginning. Now we are again going to handle event one, and this time we decide not to change the state, but we generate new input, as if this process had received a message. But this event A, which is ad hoc — we just created it — is inserted where current is, so it's the next event that we are going to handle. We can decide to postpone it too. Now we handle event three. With event three we do some stuff, but we don't generate events — imagine there is real code here doing things. So event three has been dealt with. Now you go to event four, and you decide to postpone event four but also insert an event B. So event four goes behind, you insert an event B — you get the idea. The engine gives you a way to extend the process queue. What am I doing with time? Oh, one more important power — I'm not going to have time for everything. One more useful power of these state machines: managing accidental complexity.
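The queue semantics walked through above can be modeled in a few lines. This is a toy Python model of the behavior described — postponed events are retried after a state change, and next_event-style insertions go in front of the queue rather than at the tail — not the real gen_statem implementation:

```python
from collections import deque

class QueueModel:
    """Toy model of the gen_statem event queue: decide(event) returns
    (action, inserted_events) where action is 'consume', 'postpone',
    or 'change_state' (consume + replay all postponed events)."""
    def __init__(self, events):
        self.queue = deque(events)
        self.postponed = []
        self.handled = []

    def run(self, decide):
        while self.queue:
            ev = self.queue.popleft()
            action, inserted = decide(ev)
            if action == "postpone":
                self.postponed.append(ev)      # not forgotten
            else:
                self.handled.append(ev)        # consumed
            for new in reversed(inserted):     # next_event: front of queue
                self.queue.appendleft(new)
            if action == "change_state":
                while self.postponed:          # retry postponed events first
                    self.queue.appendleft(self.postponed.pop())
        return self.handled
```

With events `[1, 2]` and a policy that postpones event 1 until the state changes on event 2, the handled order comes out `[2, 1]` — the postponed event is replayed after the state change, just as in the walkthrough.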
There is a talk that I want to recommend, quite an old one, maybe 10 or 15 years ago, by Ulf Wiger, where he was complaining about some limitations of gen_fsm, and even of the gen_server that we all use. A very useful talk, and I have one tiny answer to it with the new gen_statem, which didn't exist back then. A typical state machine: on, off. You can imagine that you're switching a light, but your switch talks through a cable protocol to the light. So when the user says on — this is a gen_server — and the state is off, you send a request to turn on, you wait for the answer, and it's on; and vice versa. Relatively intuitive code. Now imagine that the request through the cable protocol is not synchronous, and imagine that the switch cannot block — it needs to do other stuff. So you send an asynchronous request to the light — hey, turn yourself on — and continue doing other things. But then the user sends more offs and ons. What do you decide to do here? It's not part of the protocol. The events are now asynchronous and out of order; there is no global ordering. So there are questions — you need to choose what to do. So we can use a state machine, the usual way: the name of the function is the name of the state, and you can postpone things. If you are already running a request, you postpone; and if the user presses on a hundred times, by the time the light says on, you have changed the state and you handle all those postponed events — it's already on, so just do nothing. But the code is terribly symmetric. It feels repetitive. So, problems: there is no ordering when things are asynchronous. Tying yourself to the ordering of events leads to accidental complexity — this is Ulf Wiger's point: when the order changes, the whole implementation changes. And it grows with the number of states. This example is super simple — a light that goes on and off.
But imagine complicated protocols, or for example a middle layer between a very talkative protocol and a quiet one, and code reuse. So I really like the handle_event way of handling things. It's a single function callback that gets the event, the state and the data. By the way, the terminology is very confusing, because we are used to the state of the process from the gen_server world, but here the state is the state machine's state; the other thing — where you save, I don't know, the socket, for example — is called data. Just confusing terminology. With this, you can pattern match on whether you're in the same state, and the previous code that was terribly repetitive is now a single function head. This is, I believe, a way to answer the problem that Ulf raised — and now I'm exactly on time. One more slide. It's an answer in a way that lets you reuse code and lets you decide the order of events, because you can postpone things and you can also insert things. Quickly here: how we use it in XMPP — we had this implementation. There is one thing that I really like here: the composing. As I said before, you have the TCP state machine feeding TLS, feeding XML, feeding messaging. If we want to implement this in a single process — this is, for example, a simplification of my data — I have a parser and the crypto library, and when I get a TCP payload, this is how we do it in Mongoose. Not TCP itself — that's a separate process. But crypto and the XML parser we run on the spot. There is C code that parses part of the XML; it gives you a new parser with a buffer, and the XML structure that you can then use to generate the events that my protocol cares about, the XML payloads. That's one use case that we have. That's me; you can find me by that picture in all the places.
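The handle_event style — one callback that dispatches on event and state together — can be sketched outside Erlang too. This Python illustration (my own names, not gen_statem's API) shows how the symmetric on/off clauses collapse into one: if the requested state equals the current state, do nothing; otherwise transition and emit a side effect.

```python
def handle_event(event, state, data):
    """Single-callback sketch: returns (next_state, data, actions).
    'state' is the machine state; 'data' holds everything else,
    mirroring the state/data terminology described above."""
    kind, wanted = event
    if kind == "request" and wanted == state:
        return state, data, []                   # already there: ignore
    if kind == "request":
        return wanted, data, ["send_to_light"]   # one clause covers on and off
    return state, data, ["postpone"]             # unknown event: defer it

state, data = "off", {}
state, data, actions = handle_event(("request", "on"), state, data)
```

Because both directions share one clause, adding a state no longer multiplies near-identical function bodies — which is the repetitiveness complaint the slide answers.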
Those are some of the projects I work on, and I was going to say questions, but we are one minute late. Thank you.
Guess Less with Erlang Doctor
Okay, I... Yeah, it switched off by itself. I didn't touch it. So yeah, when you debug your code — when you're trying to find out why you have a strange error or something like that — you can use Erlang tracing. It's very powerful, as we said before. For example, you can use tools like dbg or recon, which use Erlang tracing underneath. The first step is to choose which functions you want to trace, because you don't trace everything — you can trace what you want, but you cannot trace everything at once. So you say: I want this function, or this bunch of functions, to be traced. Then, when you call these functions, your traces get printed out: you get the information that this function was called, these are the arguments, the return values, things like that. You can send it to the console, to a file, and you can also send it over the network — and that's what I did for many years. I said many years — yeah, for 15 years, I think, with Erlang. I was setting up a special node that collected traces from all the other nodes. So you can also send them over the network. Afterwards, you either read the traces that you collected, or you can search them, grep them, parse them, do other operations on them if you want. But these are mostly just text logs, let's say. And the problem is that very often you have to repeat the whole process, because you traced one function but found out that maybe the problem is in another function, maybe in a completely different module, and so on. So you repeat and repeat, and that can be a problem. This doesn't scale well.
What I mean by that is: if you try to trace a lot of functions — well, I found that, at least for me, when I get something like 100 to 1,000 traces, it becomes difficult for a human to read that amount of information. Okay, but you can search, for example. That also has a limit. This is just a rough estimate, but for me, at around 10,000 to 100,000 traces it becomes difficult, because the system itself can slow down. IO can slow down — it's actually quite surprising, but sending traces to a file or to the network is quite expensive, and it can slow the system down a lot. It's a heavy operation, so sometimes I had traces still accumulating for three minutes after I finished tracing, with the messages still in the queue, still being processed. So this doesn't scale that well. Okay, let's sum up. Choosing the function to trace is kind of guesswork. Not always, of course — sometimes we know precisely — but most often I don't. I know roughly what I'm looking for, but not exactly, and that's the problem: I need to know the exact function to choose it for tracing. So: possibly many iterations of the process. For me, this is like ad hoc logging. It's very much like logging, but I don't need a log statement in my code; I just choose dynamically, right now, which functions I want logged. And what if the traced behavior is a test that fails every 20 runs, for example — do I need to repeat this 20 times? That's the problem, right? And the answer to some of those issues is Erlang Doctor — at least for me, and for the people who've used it. So what's the difference? You set up tracing for an entire application. Not always — sometimes it's not possible, sometimes you have to trace individual modules — but usually you can start with one entire application.
You capture traces, store them in an ETS table, and you clear the traces afterwards. And you can repeat this process instead of repeating everything, because you've collected enough stuff to query, to find out about different functions — to ask: was this function called, maybe another one, and so on. Of course, rarely you do have to repeat — for me it's only when, for example, I traced the wrong application, because the problem was not in my code but in a library that I used; then I need to trace another Erlang application, right? But it doesn't happen that often. This scales much better. What are the limits? On my laptop, for example, querying becomes slow at about 10 million traces collected, which is quite a lot — that's like tracing a system under heavy load. And of course it depends on the size of the individual traces, because you can have big arguments being passed, things like that. System memory becomes the limit at about 50 million traces — sometimes it's 10 million, sometimes it's 100 million, it depends. But basically, when you have many millions of traces, it's probably too much. So there is a limit, of course. To sum up: very few iterations of the whole process, usually one. For me this is ad hoc instrumentation instead of ad hoc logging, because you're gathering structured information in an ETS table — I'll show you the details in a moment. And the use cases: for me there are many. Debugging, system exploration — I often use it just to learn about a system: I run the system, do the usual stuff while tracing the whole application, and then I query what the system actually did, from the traces. And you can also do some lightweight profiling, without the need to set up a profiler for a particular function. So let's go to Erlang Doctor itself.
How to get it? From GitHub, for Erlang or for Elixir. For Elixir it's called ex_doctor, which reads like "ex-doctor", a former doctor — a bit funny. There are Hex packages for both of them. So how do you run it? Three options. The first one, which I use sometimes when firefighting, is for when you don't have it in your shell but you want it right now — in a system that's misbehaving or something. For both tools there are snippets that just download it from the Internet, compile it and run it, which works in this particular case; it's probably the best option if you just want it right now, and all you need is access to the Internet, which is usually the case. The second option, which I always use in development, is to set it up in your .erlang or .iex.exs file, so that it's always available whenever you start any Erlang or Elixir shell, be it in your project or wherever. And the third option is packaging: you can include it in your application, in your software, if you think it's that useful. Okay, let's move on. Let's start. The examples are in Erlang, but they are also available for Elixir in the docs; you can find them there. The first thing to do is to start it. It runs as a gen_server, so you just start it. There are a few other examples of how you can start it: you can choose a different ETS table — you can have multiple ones if you want, and switch between them — and you can limit the size of a table, which is very useful in a production environment: if you need to do some tracing, you just set the limit to 1,000 or so, the table will never grow bigger, and you will never consume all the memory. And there is also a start_link. Okay, so let's set up tracing. I'm tracing an example module — it's a test suite, but it contains functions that we can trace, so it's good. So I'm just starting the tracer.
I can also trace a specific function — provide a module, function, arity — or a whole application, or multiple applications. And a bit more: you can trace messages, you can trace specific processes, and so on. There are a few more options. Capturing the traces: okay, let's call a function from the traced module. I'm calling a sleepy factorial of 3 — a function that calculates a factorial and just sleeps for 1 millisecond between each step, so it will show some time differences. Very simple. Okay, now we can stop tracing. That's a good habit, because then you don't accumulate traces when you don't want them anymore. Now, what can you do with the traces we've accumulated? Let's read the record definition. By the way, I'm using records because they are very performant — even maps were giving me five times worse performance for some operations, so I'm using records. So let's get all the traces. I got all the traces, and I don't want to talk about everything, so let's talk about the arguments. These are the arguments, and these are the return values — for calls, for returns — and I will introduce the other fields as we go. Arguments are in brackets. Now, trace selection. You can do a select; it's a fancy way of doing an ETS select with ets:fun2ms. Let's get all function calls, and for each call, let's get its argument. So I'm getting a list of arguments, and of course, this is a recursive way of calculating a factorial, so it's 3, 2, 1, 0. There is also select/2, and this one takes any term and looks for that term. So here it found it as an argument, for example; here it found it as a return value. But there is more: it can be hidden inside any lists, maps, tuples — it will look recursively inside your data structures to find anything you're looking for.
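The capture-then-query workflow described here — record every call and return as a structured row in a table, then run selections over it later — can be illustrated in Python with a tracing decorator. This is a conceptual analogue of my own, not Erlang Doctor's API (which hooks the VM's tracing rather than wrapping functions):

```python
import functools
import itertools
import time

TRACES = []                  # stand-in for the ETS table of trace records
_index = itertools.count(1)  # auto-incremented index, like the one mentioned later

def traced(fn):
    """Record a structured 'call' and 'return' row for every invocation."""
    @functools.wraps(fn)
    def wrapper(*args):
        TRACES.append({"n": next(_index), "event": "call",
                       "fn": fn.__name__, "args": list(args),
                       "ts": time.monotonic()})
        result = fn(*args)
        TRACES.append({"n": next(_index), "event": "return",
                       "fn": fn.__name__, "result": result,
                       "ts": time.monotonic()})
        return result
    return wrapper

def select(pred):
    """Query the collected rows, in the spirit of select/filter."""
    return [t for t in TRACES if pred(t)]

@traced
def factorial(n):
    return 1 if n == 0 else n * factorial(n - 1)
```

After `factorial(3)`, selecting the call rows yields arguments `[3], [2], [1], [0]` — the recursive descent, queryable after the fact instead of read off a console.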
So for example, you can look for an error message — even if it's literally called "unknown error", which happened to me once: I just put "unknown error" in and instantly found the function that caused it. Okay, there is also filter. It's similar to select, but here you can pass any function. It's a bit slower, but it simply has more features. You can, for example, assign the result of a search to a variable, and then search in that list again, so you can narrow down your search: you got two traces, now you search within those two traces, but only for calls, and you get only one. So that's another way to query. And the tracebacks are very important for me, because I want to know the source where a particular function call originated. Here I'm looking for any return value of 1. The sleepy factorial of 1 matches — it returned 1 — so this returns the traceback of that call. The call itself is first, right here: sleepy factorial of 1, and the rest is the traceback. The sleepy factorial of 0 also returned 1, but it's skipped because of some skipping logic — details, but it helps limit the output you get. You can disable that and get all the traces with "output all"; then there is no skipping of tracebacks that are included in other tracebacks. And you can limit the number of matched traces, you can reverse the call order, you can search in a different table — or in a list, for example — and you can get only the first traceback if you want, a very useful shortcut. You can also get the traceback for a particular record, or just for an index of a trace, because there are these auto-incremented indexes. And similar to tracebacks, you have ranges, and ranges look inside.
So a traceback is, like, what's the source, and ranges give you all the traces starting with a function call until its return. Everything in between, from one process. And, yeah, so for example, here we are looking for any traces that are function calls with 1 as an argument, and you get a range of traces from the call until the return. Range options: limit call depth is quite interesting and very useful, because with a depth of one you just get the call and the return. And searching in a list of traces is also possible; getting only the first range, if there are many, is also possible. And getting the range for a specific trace. So, quite a lot of options. I've just been adding and adding over a few years of development of this tool. They're all quite useful. Utilities: two simple utilities I wanted to talk about. One is to just look up a trace. Nothing fancy here, an ETS lookup does it, right? But then you can execute the trace, which is quite useful for me. If this was a function call, I can just execute it right now, again. So, for example, let's say I fixed a bug; then instead of writing some long code, I can just execute a trace and see if the result is the same or different. Or I can trace again, right? I can start the tracing and trace again. Okay, now a bit of profiling. I find this lightweight profiling very useful, because it doesn't put as much stress on the system as fprof, for example, the Erlang profiler. And it's instantly available; I don't have to prepare for it in any way. So call_stat gives statistics aggregated by a key function. Here I'm aggregating everything under the total atom. So I'm getting, like, four calls, and this accumulated time and this own time. These are equal because I'm just accumulating everything. But if I aggregate by function argument, you can see that there was one call with each of the arguments.
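The lightweight profiling step could be sketched like this. The `call_stat` name and the shape of the result are assumptions from my memory of the library; the numbers are purely illustrative, standing in for the "four calls, accumulated time equals own time" case from the talk.

```erlang
%% Aggregated call statistics: the key fun decides the grouping.
%% Result shape and timings are illustrative, not real output.
1> tr:call_stat(fun(_) -> total end).
%% => everything grouped under one key: 4 calls; accumulated time
%%    and own time are equal, since all calls fall in one bucket.
2> tr:call_stat(fun(#tr{data = [N]}) -> N end).
%% => one entry per argument (3, 2, 1, 0), one call each, so you can
%%    see which argument's call accumulated the most time.
```

Because this works on the already-captured ETS table, it costs nothing extra at runtime, which is the contrast being drawn with fprof.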
And this call took the longest time, but that's accumulated time; its own time was actually the shortest, right? You can also do filtering here. So you can say: when N is smaller than 3, and we just skipped one of them. So you can do that, and you can sort them. Yes, you can sort them and print them as a table. Just some nice utilities to have. And the last feature I wanted to talk about is function call tree statistics. I called it like that because, let's say we have a function that calculates the Fibonacci sequence in a suboptimal way; you probably all know that it's suboptimal. It branches a lot. So let's clean up the traces, trace again, call fib of 4, which returns 3, which is the correct value, and stop tracing. So we now have different traces in the table, and let's do it. Let's just call this function with default arguments. So it says that there is a call tree, and by that I mean function calls, returns, everything inside, repeated twice, because there is this number 2, and it took 10 microseconds, as there is no sleep in this example. So it took 10 microseconds in total. And this is what the function call tree looks like. You can see that, yeah, indeed, it repeated twice. So this can help you find redundant code. Yeah. Okay, this function also has some options, but I don't have time to talk about them; you can just customize it a bit. And table manipulation: you can get the current table, dump it to a file, and load it on a different Erlang node. And then you can continue analysis on a different Erlang node. And that's all I wanted to talk about. And that's me on a mountain bike. Thank you.
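The dump-and-continue-elsewhere workflow at the end is the simplest part to sketch. The `dump`/`load` names are assumptions based on my recollection of the library's utilities; the file name is made up.

```erlang
%% On the node where the tracing happened: persist the trace table.
1> tr:dump("traces.ets").
ok
%% Copy the file over, then on a different Erlang node:
1> tr:load("traces.ets").
ok
2> tr:select().       % and continue the analysis there
```

Since the traces live in a plain ETS table, everything shown earlier, selection, filtering, tracebacks, ranges, statistics, works the same on the node that loaded the dump.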
Evolve your (web)app while it is running
Thank you. So I want to evolve my application. I'm going to use Gleam and Erlang. The organization asked for slides, but this is a Lustre application, so I can't give you the slides. My name is Kiro van Gelder. I'm a freelancer with my own company. I've been writing software for close to 30 years, on I don't know how many platforms, languages, environments. And I happen to like the BEAM: Erlang, Gleam, etc. So that's why I picked it. That's not the only reason why I picked it. So recently I was at LangDev: model-driven stuff, all kinds of things that people showed us. One talk that piqued my interest was about: okay, well, we have a game. We have a description of a game. And while we're running the game, there are people interacting with the game. Well, there are things we don't quite like, so we're going to change the rules. And then the game has to keep running. So, yeah, well, okay, that looked awesome. So I thought: I can build something like that, but I'm going to do it on the BEAM. The BEAM has these superpowers, all the reloading and things; it seems really suited for it, so why not? And then the other thing I thought: well, if I want to do that with a game, and I have to build some kind of infrastructure and things, let's start with a simple game. So I picked Holland's Ganzenbord, the Game of the Goose. You have a little goose pawn, and you have to reach the very middle spot. And if you look at it from the model and the rules, there are actually quite a few interesting exceptions for all the kinds of places that you can land on. If you land on a goose, you have to move twice as far. You can skip a turn. If you're in prison, well, you can only get released when someone else releases you. It's kind of special. So what will I talk about? A tiny introduction on model-driven development. I'm going to tell you why I do the modeling in Gleam. I'm going to explain a little bit about dynamic reloading as the BEAM provides it, for those who might not know it.
I'm going to tell you why the game itself, the instance, will be running in Core Erlang, and there will be demos all along. Model-driven development: you want to have some model. It needs to have a very precise description, and at some point me as an editor, sometimes in a computer editor, will make that description. And from that description, we generate an instance that is running what we described. Why would you want to do that? Instances sometimes are, well, compiled targets, and they know the things that they have to do, but they don't know things about themselves. A model is a description that does know that. A very nice example that was given at LangDev was about Dutch income tax. This is described in laws. Computers do not interpret laws, but they went through the effort to take the law and adjust it a little bit; they did it together with lawyers to make sure you got something that a computer could interpret. Laws might have ambiguities or vagueness in there; computers don't. Well, then they had these more precise versions of the law, and from that model they generated Java code. Dutch income tax is running on Java code at the moment. An additional thing you can do once you have it in a model is you can start reasoning about it. So by now, I understand that even if they want to introduce a new law, you can just plug it into the system and see: is there any contradiction here? If so, please adjust this law proposal. Often you'll have a domain-specific language, like the adjusted law. For my Ganzenbord, you could have stuff like this. Often people will want to edit these DSLs by hand, but I'm very much interested in small deltas. If I have a running game and I make a big modification here, I don't have a game. So I need to make sure I do a small step, a small delta. So, with what I do, I have to restrict the possible edits that can be done, which I call deltas. How does that look? I'm building a system called Eagle.
It's running on a server. I have a browser; that's my client. From the client, I do the editing of the model. From there, I generate my instance, my game, and then from the same or a different browser, I view the game and I play the game. Why would I want to do the modeling in Gleam? The model should be as precise as possible, as I just explained. Gleam gives us types. Type safety is better than no type safety, so my choice is Gleam. Another benefit, a superpower of Gleam: you can compile it both to Erlang and to JavaScript. So if I have my model and I can somehow transmit it, I can just use the exact same code on both sides of my client-server setup, and I know that it's the same thing. It saves me work. What does it look like? I have my model here. So in my Lustre app, I now have an iframe. This iframe is the model client that I showed in the previous picture. It's an iframe, so it's a web page, but as you might guess, that web page again is a Lustre app. And I want to grow the Ganzenbord from as minimal a thing as possible, in small deltas. So let's see where we can start. My model is very simple. It's a bit too simple here, so I need a bit more of a model. I can make a list of an int. I could even make a list of a list of an int, and it's represented here. And I not only want to describe that I have some type; it also needs to have some default value. In addition to that, we have a cell. My Ganzenbord isn't quite finished yet. Sorry. I have a plain cell, a start, and a finish, and I might even use the goose. And right now what I also have is a game type. It has a board, which has a list of cells, and it has pawns, which are a list of numbers indexing where the pawns are. It's a choice; there are other choices. And the default of my Ganzenbord then has one cell at the moment, which is plain, which is a bit ridiculous. So let's add a few more cells and say we're going to start at some of these things.
And of course the default that I gave to the int turned out to be a bit awkward, because I wanted it to be zero. So I mess a bit with my model. In very small steps, I modify my description to get, hopefully, to a better place. Dynamic reload: Erlang and the BEAM provide us the tools to do this dynamic reload. This is a loop. Usually there's a process running on the BEAM that is executing this loop. It has been started by another process. And generally it receives messages, handles them, sends back some results; that's what the From exclamation mark is. It sends back some result. Could be errors, could be part of the new state. And then it loops. And another possible message could be, well, just stop, and then you don't loop. That's all there is to it. What Erlang and the BEAM also provide is a way to load new code into the BEAM for a specific module. That's the code module in the Erlang kernel application. You call code:load_binary, here in the middle. Target is the target module. The thing in square brackets is the file name that it would be coming from; we don't care. And the object code is compiled Core Erlang code. When you do that, you have your old version of the code in the BEAM and you have a new version of the code in the BEAM. But the existing processes, my game, my Ganzenbord, are still running the old stuff. So now I'm going to send a message, upgrade, to my Ganzenbord process. And roughly like that: this is the same loop as before, but now I have the relevant upgrade part. Instead of just looping, which would loop in my old code, I have to fully qualify the call with my target module, my Ganzenbord module, explicitly specify the module with the loop, and then it's guaranteed to use the new version of the code. And then I can happily play along in my game with upgraded code. Now, why do we do that in Erlang? The difference between the local loop and the exported loop is something that Gleam does not know, so I can't do it there. Why did I pick Core Erlang?
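The loop-with-upgrade pattern described here is a classic Erlang idiom and can be sketched like this. The `handle/2` and `initial_state/0` functions are placeholders for the game logic; the module name `game` is made up.

```erlang
-module(game).
-export([start/0, loop/1]).

start() ->
    spawn(fun() -> loop(initial_state()) end).

loop(State) ->
    receive
        {From, Msg} ->
            {Reply, NewState} = handle(Msg, State),
            From ! Reply,            % "From exclamation mark": send back a result
            loop(NewState);          % local call: keeps running the OLD code
        upgrade ->
            ?MODULE:loop(State);     % fully qualified call: jumps into the
                                     % NEWLY loaded version of this module
        stop ->
            ok                       % don't loop; the process ends
    end.
```

After `code:load_binary(game, "game.erl", ObjectCode)` both versions exist in the VM; the running process only switches when it takes the `upgrade` branch, because only a fully qualified (module-prefixed) call is resolved against the current version of the module. The local/qualified distinction is exactly the thing the speaker says Gleam cannot express, which is why this part lives in Erlang.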
Because we can do this all in memory. No need to use file systems and other things. There is an Erlang cerl library that can generate these things. I wrapped it for Gleam; that's called Gencore Erlang, and you can find it on Hex. So let's start a game. I already made the type properly, so I'm ready to create a game. If I connect to the server, it won't do anything here, but if I create an instance, it's there now, and it will connect to the game. So as you can see, it picked up the start, plain, plain from the definitions that I had on the left. It also picked up a move button. The rules: there are some implicit rules that I did not edit anything about, I did not tell anything about. I need to be able to do something in my game, so there's a hard-coded move. It will just move the pawn one forward. And there's also a check for the win condition. So where the pawn is, it has to check on the board whether that's a finish location. And that rule is going on continuously. I have not made nice deltas and things for that in the UI; at some point I will. But these things are running in the background in the instance. Now at least I'll show you the move. If I move, my pawn moves one forward. So yay, getting closer to the finish in my Ganzenbord. All right. So a little bit more about the bits and pieces that are happening. There's the JSON communication from the browser to the server for the model. Whenever there is a delta that's made, we just recreate the entire instance module in Core Erlang, reload it, and upgrade it. And the instance just keeps talking; it doesn't even notice it from the client, and it also talks in terms of JSON with it. State and rules: the initial state is something that should be adopted, adjusted, and rules are... So yeah, that's right. The rule is something that is in the model. I talk about a conceptual thing there. My instance doesn't know the rule; it just has code. And my client doesn't know what the rule is either.
It just knows whether it can do something or cannot do something. So what is important is that I wanted to make two kinds of changes. Some changes such that, when I make them, the instance will in the end see them. So I'm going to change the board. And when I change the board, it's going to compile the change into the Core Erlang, reload it, do some small migration, and then also pass that information to our client. So, here we have it again. If I turn this into a goose, for instance, it becomes a goose. So that was one recompilation of Erlang in the background. Now I make another plain one. I can add another one if I want to. And at some point I'm going to have to be able to reach the finish here, so let's make that. So yeah, that was three, four recompilations of things in memory, and moving on. Evolving. But another thing that I might want to change, because I can also create multiple instances, is my starting state. And that would mean that the only thing that happens if I change that from my client is that it changes my model, but nothing else; unless I start a new instance, nothing happens with it. And that looks like this. So I have a game one and a game two. Game two hasn't been started yet. And if I change this one from zero to one, now I start at position one. Then we notice that in game one nothing happened. But if I start a new instance, then this one will now start at position one instead of position zero. And just to show that even though it shows two, it didn't silently change to one: if I move, it will go to three. Well, and where did I put the finish today? On start, zero, one, three... on number four. I'll just move to four. There. I finished my game. I won. So, things I want to do in the future. I'm very much interested in what kinds of deltas are usable, sensible. You saw me adding cells to a board. You saw me change the type of the board. Okay. What if I remove a cell from the board? Yeah. What if the pawn is on there?
It quickly becomes, like... you could think of a couple of solutions for when you remove the cell from the board. You move the pawn to the previous or the next cell, or you remove it, or you put it on cell zero. But why would, why, how do you pick one? It depends on your application. So I really want to look more into that. Another thing is that I don't think the Ganzenbord UI that I had looks very nice. It would be much better if it looked like the second slide that I showed. But if I do that, then the client really knows about Ganzenbord. But what if I want to make a slightly different game? So, okay. And what if my Ganzenbord knows about most of the things I do, but I add that labyrinth thing? Now I want to render a nice labyrinth. Okay. Can I make something that knows it's Ganzenbord, but can also adjust to changes that I make in the model, changes that expand on what was already there, that it didn't know about at the start? And obviously, it needs to be multiplayer, because playing on my own... I want you all to play with me. All right. The code that you saw in the iframes is at the top link. While reasoning about this, I also wrote a little... a start of a Gleam library that generates Gleam code, which is the second link. The one that generates Core Erlang is the third link; my own web page is the fourth. And if you want to know what the Dutch income tax looks like, it's all in there. Thank you. We have time for a couple of questions. Anyone? Any? Okay. Thank you. That was really quite amazing technology there. I was wondering, do you have thoughts on, like, when you might decide to apply these sorts of techniques to a problem? When would it be a good fit? Yeah. Okay. The question is when this kind of solution would apply to a problem. I might not be the best person to answer this, because I'm somewhat new to model-driven development. It helps when that description is going to give you something, whether that is checking that something is coherent or correct.
When, yeah, the model should give you something. When not to do it? Well, if you just want to play Ganzenbord, just make a Ganzenbord server and a Ganzenbord client, because it's much faster. Much quicker to build. So it is an investment. It's quite an investment. It's not just 10% extra; it's a big factor, possibly 5 or 10 times extra, to make sure that you can really do that kind of stuff. Other questions? It was the same question. Because it's fun. Okay. Yeah, that was really, really cool. Really interesting. When you changed, like, the initial state, you showed the running client didn't update, right? Like, it didn't update the client, because all these updates are triggered through messages. Is it possible that you could replay the message history from the beginning with a running client? So, like, if I updated the initial state of a running process, could I then replay all of the messages it's received so the change propagates? There are two answers to that, I guess. Yeah. Is it possible to replay all the events that happened, both to the model and the instance, or just to the instance game? Just to the instance. Just to the instance. At the moment, no. Would you? It would be interesting. One way in which at least part of the answer would be yes: if you look at the model, when I say please change this in this way with this delta, the server will respond by just giving back: yep, I applied this delta, now you do too. Okay, cool. Any other questions? Okay, thank you then. Thank you. Thank you.
Snap!
All right, good morning everybody. Welcome to the session on Snap!. My name is Jens, and with me is my colleague and friend Jadga. We're here to present Snap!, Build Your Own Blocks. But it's not just Jadga and me who are working on Snap!; we're just the ones presenting it today. There's a larger community and more people involved. There's Bernat Romagosa down there. There's Simon Walters from the Snap! community here. There's John Maloney, Turgut Güneysu. The whole blocks gang is here. And all of those have been more or less also involved in the development of Snap!. I've also brought Alonzo, our mascot, who's very involved in Snap!. I keep saying this for people who might not have heard it before. Every once in a while we get somebody who says: take down Snap!, it's a copy of Scratch. And so I just want to point out: yes, you know, this is also a mascot in Scratch. It's called Gobo in Scratch. And we're friends with the Scratch team, and we're allowed to use Gobo. And our friend and collaborator, co-author Brian Harvey, mutated Gobo with this funny haircut that, of course, everybody in this room knows about. It's a lambda. So Snap! is Scratch on lambda. This is what it's about. It's about building your own blocks. And so what Jadga and I are planning for today is not so much a talk as a demo. It's a walk, it's a lab visit. And we'd like to invite you to a visit of our lab. So when we say Snap!, the title is Build Your Own Blocks; it used to be called BYOB, for Build Your Own Blocks, until some parents in the US who didn't have a sense of humor decided that it also means something else, like bring your own bottle. Usually there's alcohol in it. And the implication that there is alcohol might entice children to try it. So we had to give it a different name. So this is why it's called Snap!. We're still calling it Build Your Own Blocks.
But to us, and what we'd like to show you today, it is as much about learning to code or learning about technology as it is about building one's own mind, about learning something about society, about the surroundings, about the environment. And we'd like to give you a little overview of not just our technology, but also our pedagogy. So just very quickly, this is Snap!. How many of you know Snap! or have used Snap!? OK, about one third. So just very quickly, I'm just going to show you Snap!. One thing that we have, when you open it, this is kind of pure vanilla Snap!. Here's a block; you can move something. You can also stick blocks together like these puzzle pieces. There are different kinds of categories with control structures. You have these control structures. This is a repetition. So when you have this four times, and you click on it, everything is live. It moves. You can see what it does. You have a drawing pen. You can put the pen down, pen up. So now you can draw a square. And always the question is how you get rid of it; so there's a clear block. You can click on these things. You don't really have to think that much ahead to do something. And it's called Build Your Own Blocks because you can build your own blocks. So you can make a new block. I want to make a block in the motion category. It's a square block, and it has as input a size. It's really hard to build a block with one hand. And so I'm getting this editor. I can say the size should be an input; it should be a number. So now I can say: oh, this part is the thing that makes a square. I can drag this in here, and the size is an input. And I can say the size is how much it should move. So now I've made my own block. Wait, did I make my own block? Or did it? Oh, look at it. It didn't work. OK, see, this is the fun of doing it with one hand.
OK, you guys, this is a collaborative coding environment. Let's try this again. Oh, I even need to make size an input again. As we were just saying while hanging out with a bunch of teachers: repetition is the mother of all learning. So now I have the square block. Let's try to make a square of 100 again. It works. And then we can do things like... now we can build around this. We can say: OK, let's clear this, and let's maybe say, OK, i times 10. And we also want to turn a little bit, like maybe 36. And we can build things. This is kind of the technology part that isn't really new. It's great. We love it. But the thing is that we want to build blocks, but we also want to build ourselves. We want to interact with our surroundings. And so we want to look at data. We want to look at the world. And through programming, as I think Mitch Resnick says it: learn to program and program to learn. But not just learn about technology; learn about something else. So here's a little example of that. A while ago, John and I were working on this and thinking about how we could express things that are really important. And here's a little puzzle. As an educator, you can make a project and give it to your students. And in this project, I've downloaded a bunch of data from the internet, from GitHub. This is the population data of 195 countries over a span from the year 1800 to today. It's the population development. It's a lot of data. I've also downloaded the life expectancy data. And for some periods, there isn't any. But it's the life expectancy in each of these countries. And also the gross domestic product, adjusted for inflation and broken down per person living in that country. And that might tell us something about life conditions and how life conditions in countries have evolved over time. And some of you who can read it probably already know this. There's the Gapminder project by Hans Rosling. Who knows this? Gapminder.
And so we thought: it's great to work with this, it's great to look at it. And we thought it's also really great to use coding not just to use the data, but to interact with the data yourself. And so what we like to give people is an imperfect project, something like this one here. It's something that already has a slider built in. And you can configure that slider, give it some numbers. And then it emits a broadcast, like: the slider changed. And now we could say, OK, in this project, when this receives the slider changed, we might want to hide it. And we might want to go to some coordinate, like go a little bit to the left and down. And we might want to set a color to something light and write the value of the slider. So we're going to query the value of the slider. But it should be big; let's make it big. Probably at some point we also want to clear it first. So now, as I move the slider, I'm getting a readout of the numbers. But I want it to be the numbers from 1800 until now. So I probably should have some kind of formula in here where I say it's not the value of the slider, but the value of the slider plus 1799. That's about as much math as I'm allowed to do. OK, so now I have sort of a timeline, an interactive timeline. And now I want to map all of this data and sort of scatter it in there. So I'm going to make a new... I'm going to draw a little bit of data. So I'm actually going to draw a new object. I'm going to use a vector pen and draw something like a... oops. Sorry. Draw just a little dot, which is about the limit of my drawing skills. I'm going to call this a country. And a country is going to... I'm going to add another variable that's just a name. So it's sort of a... OK. And actually, I don't need to show that. And so when the green flag is pressed, I can just set the name to nothing. And actually, what I want to do now is do something for all the countries.
So I can take out... so if I look at all the countries, they're all in the first column. So I could say: OK, this is the first item of the columns of this table. And as you can see, it's always live. So these are all the countries. So I can loop through them. We could say: for all of these, I want to make a new clone. So I'm telling a new clone: I just want to assign the name to whatever is in that list. And I want to do this really fast. I don't want to wait for it to take long. And when I'm done, I want to do something. I want to broadcast that something has changed. So I'm going to broadcast the slider change. And now what I want is that when each of these clones receives the slider change, it should align itself to the data. And here's something where we're saying: build your own blocks. So here's a block that we made that you can already use. We give you this block that arranges a record. So the record here is indicated by the name. And in these records, we're selecting the year indicated, which is dependent on our slider; it's the value of the slider. And so we want to map the wealth, kind of, you know, the money; we want to map that to the x-axis on a logarithmic scale. The life expectancy, we want to map to the y-axis. And the size of the bubble, we want to map to the population. And let's actually try this. Do you think this could work? OK. Wait, I wanted to do one more thing. I wanted to make them a little transparent. So we could set the ghost effect to something like 60, so we see the ones that are underneath. Let's try this again. Yeah. OK. So here's a map of the world of 200 countries in the year 1800. And now we can see some interesting things. As we move the slider, we can see how the countries develop. We can see how life expectancy rises, how things are distributed. Now, it would be fun to see which country is which.
So we could also add some other interactive elements. Like, we could say: when I am... mouse entered, we could say, you know, say my name. Say my name. Say my name. OK. Say my name. And again, when the mouse goes out, when the mouse has departed, just stop saying whatever it's saying. Say nothing. Does it work? Yeah. We don't have to restart it. So this is China, the big thing. This is the US. And now we can do interesting things. Like, now we know this big blob is China; we can see what happened to China. Whoops. Here there's a famine in the early 60s. Or we can look for other spectacular things, like right at the beginning, like here in 1880. And there's a problem in Tunisia: the last case of the Black Plague. And with all this data, it's already interesting to use this, and to use Google, and to find out what happened at certain periods of time. For example, let's go to the early 20th century here. 1904: the first genocide of the 20th century, committed by Germans in Namibia. We can see how... oops, here, you probably know this. It's not World War I; it's a drop in life expectancy almost everywhere except in Denmark. Nobody knew that before COVID. Now everybody knows it. That's right, it's the Spanish flu, which wasn't Spanish, but it's the flu epidemic. And so this is what we're talking about. It is as much about looking at data, finding out how to work with tables, how to model things, but also about really discovering things that are fun to discover with a computer. This is much more fun to discover with a computer than with a textbook. And it's also way more interactive. And even the learning can be more self-directed, because now I want to know what happens at this bubble at this time. And so it's about building blocks, but also building knowledge, very much in the constructivist, constructionist way.
And so one thing that's really cooking, that we're working on right now for the next version: with these abstractions, we're building up. We're sticking together blocks to build up abstractions that let us do more awesome things with fewer blocks. Now the question always comes, especially at conferences like this: but at the core, at the bottom, there's got to be some real language. And the real language has to be text-based, because everything is text-based. And even artificial intelligence uses large language models, so obviously there must be something to textual language that is very powerful. And there is. But here's something that we're working on. And we've had several projects. John and I were working on a project where we tried to make, and we actually went pretty far to build, a blocks-based language that was completely written in itself. Snap! is not completely written in itself, but we're trying this. Look at this. This is something that's in the development version. Here's a new thing we're working on. It says: blocks all the way. And if I click blocks all the way, now I can look at all these blocks that are called primitives, that are actually written in JavaScript, but I can edit them. I can edit them and I see blocks, how they could be written in Snap! itself. And here's a block that is a primitive block, which means we're actually calling up a native JavaScript function. But we could turn this off, and then it would run this, and it would totally do the right thing. And if you actually look at this code, you might be astonished, because this is probably not what you were expecting. It has sort of a NumPy, APL-ish way to deal with coordinates, in a way that uses vectors. And it's really fun to check out how things work. Like, for example, here's the glide block. So there's glide written in Snap! itself. There is, of course, if on edge, bounce, and this is way more complicated.
But we can even go to other things; obviously the control category is very interesting. Now we can look at how the forever loop is done. If you look at the forever loop, it's implemented recursively: it's a higher-order function that calls the function it is given, and then calls itself recursively. Or here's how repeat until would be, and repeat until is interesting, because it also uses if. Now, what about if? So here's how if is done, or things like not. And if you look at if, if again uses if else. So it's almost, not really infinite, but it doesn't go all the way down to the metal; it goes down some more subterranean stories. And what we're hoping to achieve with that is to let kids and learners explore the system, and find out it's not about one language versus the other, but about how to express your ideas. So let's actually try to really change something; we're at a hackers' conference, after all. So what if we said, you know, move? Let's break move and say: OK, I don't want to go to something, I don't want to use the primitive. Actually, I want to see it move; I want it to glide there, not so long. So what I'm going to do is run the glide block, leave the coordinates empty, which turns them into implicit parameters, and put in this vector here. So now I've redefined move. And if I run this, I can see that I changed the way move works: I've made it use glide instead of go to. And this is going to be fun, because it gives you agency to even change the way Snap works, because all of this is now editable. All of this is really a system that is malleable. So this is what it's about: build your own blocks. I'm now going to hand over to Jadga to tell you some more about the pedagogy and the kinds of things we're working on. Yeah, thanks. So we do this together with people at UC Berkeley.
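Snap's self-hosted control structures can be hard to picture from prose alone. Here is a rough sketch, not Snap's actual source, of the same idea in Python: repeat until defined as a higher-order function that calls its action and then calls itself recursively, just as described above. The counter and bump helpers are invented for the example.

```python
# A sketch of a recursively defined control structure, in the spirit
# of Snap's "blocks all the way" primitives (not Snap's real code).

def repeat_until(condition, action):
    """Run action repeatedly until condition() is true -- no loop keyword,
    only recursion, the way Snap expresses repeat-until in itself."""
    if not condition():
        action()
        repeat_until(condition, action)

counter = {"n": 0}

def bump():
    counter["n"] += 1

repeat_until(lambda: counter["n"] >= 5, bump)
print(counter["n"])  # 5
```

Defining the loop this way makes the point of the talk concrete: once you have first-class functions, the "primitive" control structures stop being magic.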
And it's also a lot about education. So it's used at universities around the globe, and by now also in a lot of schools. And if you have a school where you want to use Snap, feel free to reach out; we're always looking for more collaborators. And as you already saw, you can program Snap in different ways. That's also something that's really important to us: there's not one way to the solution, but several ways. And we also accommodate the boring ways, like using all the for loops that kids are required to learn in school. But we also want to elevate the mind to new ideas about programming, like the stuff that Jens has just been showing you. And I just wanted to give a short, can I close that? Oh, I'll go over here. Yeah, great. So, no, not right now, thanks. So I wanted to do a short, oh, have we tried whether we can record something on this thing? No, but we'll try. OK, I'll record a sound and see whether it works. Hello. OK, let's try to play it. OK, it doesn't. Good. Ah, OK, it comes out of here. Yeah, OK. Hello. So we can record sounds. This is something that's also really important to us: that we can extend projects beyond working with numbers. I personally am not a developer; I didn't study programming. I am a developer now, but I studied biology. And until I was 25, I thought programming is, can I say that? shit and boring. Then I tried it out, and I used Snap for it, and it is actually fun. I mean, you all know that, but it's awesome. So I'm not that much of a math person, so I really love that you can extend Snap using media, using data. And now I want to show something using this recording. You can access the samples of a sound file that I recorded by using this block from the sound category. You can see we have different categories here, with differently colored blocks; that helps to structure the programs and also to read the code. And if I click on that, it's a list with 51,840 samples.
At the beginning I didn't say anything, so it's all zeros. But then when we move down, we see that there are negative values, but very small negative values. So the samples are the amount that a membrane swings either to the left or to the right, in our ear, for example, or in a speaker or something like that. And now I could try to modify that sound. So let me grab one more recording, so that we can always play it. And I could do it the traditional way, using the for loop that I've just mentioned, because, as Jens said, in German schools kids are required to learn loops, and this is really important. So, for i from, we start lists with one, so let me start at one: for i from 1 to the length of the samples of my recording, I want to do something. And the something I want to do is, let's make it louder, maybe. Let's try this. So I want to create a new sound; I call this new sound, and I set the new sound to an empty list. OK, so I set the sound to an empty list, and now I want to add stuff to my sound. And what I want to add is the value that I had before, so item i of the samples of the recording, and I want to multiply it by a factor, because increasing the number that's in the sample makes the sound louder. So let's try times five, maybe, and hope I don't fry anything. And then I want to add that to my new sound. When I do it like that, it's pretty slow; as you see, this runs rather slowly. So we have this block that we call the warp block, which just speeds things up. So let me wrap this around here. And now I have my new sound, and I hope that it's louder than the one before. So this was the one before: Hello. Can you even hear that? Hello. OK, now let's try the louder one. Hello. Definitely louder. So we can use that way to change media. But we also want to support, as I mentioned, other ways of thinking. So what we have down here, and this is where the lambda that Alonzo holds comes into play, is the map block.
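The loop just described can be sketched in Python; the integer sample values here are made up, standing in for the real recording:

```python
# The "traditional" loop version from the talk: walk over the samples
# of a recording and multiply each one by 5 to make the sound louder.
# (Invented sample values; a real recording has tens of thousands.)

samples = [0, 120, -340, 560, -80]

new_sound = []                 # "set new sound to an empty list"
for i in range(len(samples)):  # "for i from 1 to length of samples"
    new_sound.append(samples[i] * 5)

print(new_sound)  # [0, 600, -1700, 2800, -400]
```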
So this is a higher-order function, a function that takes another function as an input. And we represent this function with these gray rings; this is like the Lord of the Rings, one-ring-to-rule-them-all metaphor. So this is one of the things that gives Snap its power. And we can use data here in the second input slot. So what I could use is, again, let me just duplicate that, the samples of my sound. And now I can add another function down here; let's do the abs function, maybe. So this gives back the absolute value. And now I can play that sound as well. Hello. So you can hear that I sound more like when I don't have air through my nose; you can make sound effects like that. And then the last way is the one that Jens already mentioned, and since I didn't come up with a third effect, I'm just going to use abs again: you can just drop lists, like vectors, into functions directly. So I could just drop this one in here to create the exact same effect. You could use floor. I could use floor. What? Let's try that. Beautiful. OK, so this is just, I think, negative ones, zeros, and ones. So with only three different values, you can still kind of understand the sound, because, as I said, the pattern is represented by the way the samples are arranged, and not by the actual values in the samples. OK, and you can also do that with more complex data. So we can also access the webcam from Snap, and I can try the same thing with the camera. Maybe I can try the same thing with the camera. Maybe we'll just unplug this real quick and I take a picture. OK, we broke the webcam, so let's leave it at that. Let's go over, and I'll tell you that you can do the same with graphic effects. And if you want, we have a workshop later; come by and I'll show you how it works with graphic effects or photos. OK, and then, stuff that we are currently working on: AI, since this is the big thing, and the schools want it, the universities want it.
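The map version can be sketched in Python too, with the same made-up samples; Snap additionally lets you drop the whole list straight into abs, which Python spells as map or a comprehension:

```python
# The higher-order version: drop a function into map instead of
# writing an explicit loop, like dropping abs into Snap's grey ring.
# (Invented sample values.)

samples = [0, 120, -340, 560, -80]

rectified = list(map(abs, samples))
print(rectified)  # [0, 120, 340, 560, 80]
```

The result is the same as the loop version, but the transformation is named once instead of spelled out step by step.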
So we had to come up with resources that they could use to teach artificial intelligence. And I wanted to show you one that we developed last year; it's available in German and English, and you can download it from the internet. It's a detailed walkthrough guide on how you can use it in a classroom setting, how to program the whole thing, and some ideas on what you could do with it. We call it Grand Gestures. It's a simple gesture-recognizer program based on the $1 gesture recognizer. I don't know whether you know that; it's a prototyping gesture recognizer for unistroke gestures, things that you can draw in one stroke. And I already prepared something, like a TV cook. So I have this project here already, and this is a simple drawing program. When I click on the stage, so we call this window here the stage, it will broadcast the word sketch. And when I receive the word sketch here, I'm reacting to that, and I can actually draw something. Yeah, since some of you are sitting really far away, let's just increase the pen size a bit. So I can draw stuff here. And what I also already prepared is that I'm storing values in a variable. What you can see here is the sketch variable, which I can also show: it's 164 points, and it's the positions that my sprite, the object that I'm programming, went to. And I also have this examples variable here, where I already stored a few examples, and each one is a path and a word that's attached to this path, so basically a label for that path. And now I want to create a few more things, and then we're going to animate them. Can you hold the microphone again, please? Thanks. So, to create an animation, let's start with the animation. We gave you a block here that we call animate, and here you can also see again one of the awesome things in Snap: you can make your own control structures. So this is a C-shaped block, like some control structures look in Snap.
And this is a custom block that runs the actions that you put into this C-shaped input slot; we made this block ourselves. What it does is it takes what I've drawn and puts it on as a costume; the costume is an image that the sprite is wearing. And then it does something. So when I draw a heart, I want it to have a heartbeat: increase the size a bit, and then decrease the size in one step. Increasing the size I do with the change size by 10 block, and I do that ten times, and then I reset the size. And since hearts beat like twice, I want to do this two times. And if I put that into the animate block, you see that I can do that with the drawing that I just made. So I can draw something, and it takes this actual drawing and does something with it. And now I want to trigger this reaction whenever I receive the message heart. OK, this is what's supposed to happen, but now I need to identify this heart. So we want to find out how these paths work, and to see that, I can render what I've drawn. This is also a block we prepared; I can just put my sketch in here and render it, and I see that the points along my path are not very well distributed. I drew really slowly at the beginning and the end, and then I was really fast here. So to really make paths comparable, it's important that we normalize them. So we have this resample block here, and now I can resample my sketch to 64 points, and this evenly distributes the points along the path. Yeah, so now I can use that to train my algorithm, to train my program, a bit more. So here, what I've drawn was a heart. Let me draw a new one, and I can now add that to my examples. Let's add another one. OK, and now I need to recognize this heart among all my examples. We also prepared a block for that, but you could build it yourself if you wanted to: this is the recognize block.
And this recognize block looks for the smallest difference between two shapes. It's measuring the distance between the first point in the first path and the first point in the second path, then adds up all the differences between the points, and reports the label of the example that has the smallest difference. And we can just use that as an input to the broadcast block. So let me just show this real quick: I'm recognizing my resampled sketch, which is the heart that I've just drawn, among all my examples. And since this is the heart that I've just stored, it should report back heart, which it does. I can try another thing. So this is also the heart. I also wrote down FOSDEM, so let me try to write FOSDEM. Writing is really hard on a touchpad. So this reports FOSDEM. Great, seems to work. And now I can just broadcast this thing. And we want kids, people who use this, to tell stories with it; this is supposed to be an interactive storytelling project. And the story that I just came up with was, how did it work again? Let me check. Ah, yeah, OK. The weather in Brussels is not really nice, it's raining all day, and I'm sitting in dark buildings all day. But still, I love to be with you at FOSDEM. OK. So this is a resource that we've been working on, again to also inspire people who might not be the traditional audiences for programming. But it's also pretty cool if you are a programmer and love math; you can still do stuff with it. And now I hand over to Jens to tell you what we're venturing for next. Thank you, Jadga. I also love to be at FOSDEM. This is so cool. So you might see, you know, this isn't really about an algorithm, about using AI, about using a large language model. What we're trying to do is at least lift the lid a little bit, to let you see a little bit underneath the hood.
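The idea behind the recognize block can be sketched in Python. This is a heavily simplified version in the spirit of the $1 recognizer: it assumes the paths are already resampled to the same number of points, and the example paths and labels are invented for illustration.

```python
# Nearest-neighbour path matching, as described for the recognize block:
# sum the point-to-point distances between the sketch and each stored
# example, and report the label with the smallest total difference.

from math import dist

examples = {                                   # invented labelled paths
    "line": [(0, 0), (1, 1), (2, 2), (3, 3)],
    "hook": [(0, 0), (1, 1), (2, 0), (3, -1)],
}

def recognize(sketch, examples):
    def difference(path):
        return sum(dist(p, q) for p, q in zip(sketch, path))
    return min(examples, key=lambda label: difference(examples[label]))

print(recognize([(0, 0), (1, 1), (2, 1.9), (3, 3.1)], examples))  # line
```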
So for us it's not about upskilling youth to be employable; it's about bringing across a sense of awe and wonder about what you can do, and maybe letting you reflect about things. So now, for two years generative AI has been this big thing, with ChatGPT being everywhere. And it really boils down to, as we've seen before, language, even textual language, being the basis for everything. And we thought, well, yeah, that's nice; we love language, but we also love structure. And so one thing that we've tried to come up with is an activity built on the basis of these language projects, which really are next-token prediction systems, and that might lead us to experiencing and learning something about the real generality of AI. It's all inspired by a wonderful little project, I have to give credit, by Michael Hielscher from PH Schwyz, this wonderful project, Zykia GPT; you all have to look at it. And so what we're trying to do is build something like ChatGPT ourselves, on little data, so we don't actually have to use ChatGPT. So here's something: I scoured the internet for 30 fairy tales of the Brothers Grimm, and here's the English version of these 30 fairy tales. It's not a huge corpus, but it's 30 fairy tales, and it's just a text. And in order to work with this and turn it into an AI, I need to split these 30 fairy tales into a list of words. So now I've got a list of about 58,000 words: "Seven Swabians were once together." OK, so just a list of words doesn't give us a lot. In order to use this in a language AI, we have to do some statistical analysis. And the way we do statistical analysis is by grouping these words by their sequences; they're called pairs or triples, bigrams, trigrams, tetragrams. So, you know, it's build your own blocks. So I'm going to, can you please hold this again, I'm going to make a category. Wow.
I'm going to make a category that's called generative AI. And I want to build this one function that I'm using. It's going to be a function: the n-grams of a corpus. And n is going to be an input; it's going to be a number like two or three or five. And the corpus is what's going into the language thing; it's a list of words. So what I want to do is get the numbers from one to the length of the corpus, to go through all of this. But I don't need the full length; I can decrease it by the n that I'm looking for, minus one. Now I want to take this as an input to map. So I'm taking these numbers, and for each of these numbers: when I have a list of the numbers one to ten, item 1 is the number one; but if I put in a list of indices to check, like three or six, I get a list of the individual items at those places. So I can slice my input. So what I want to do is get the items numbered from the number up to the number plus n minus one of my corpus. This is the function, the n-gram function. Let's actually try this. Here are the n-grams; let's go to the bigrams of my 30 fairy tales. Click on this. See: "seven Swabians", "Swabians were", "were once". You get it? It's kind of broken up. I can also do this for four. So now I'm getting "seven Swabians were once", "Swabians were once together". You get the idea, right? So this breaks up the corpus into these little sequences of words. Now I want to do a statistical language model. To make a model, I'm going to make a variable called model. And what the model should be, let's actually hide this, is several variants of this, not just bigrams or trigrams, but let's go all the way to five. So I'm going to say the model is going to be another map block: map the n-grams, I'm going to leave this blank, over the numbers from one to five.
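The n-grams block translates almost directly into Python; this sketch uses a made-up five-word corpus in place of the fairy tales:

```python
# The n-grams function from the talk: slide a window of length n over
# the corpus and collect each slice.

def ngrams(n, corpus):
    return [corpus[i:i + n] for i in range(len(corpus) - n + 1)]

words = "seven swabians were once together".split()
print(ngrams(2, words))
# [['seven', 'swabians'], ['swabians', 'were'], ['were', 'once'],
#  ['once', 'together']]
```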
So let's actually just run this once, and now let's look at this model. This model is a kind of weird-looking table. If I format this a little bit differently, you can see it's a five-item list: unigrams, bigrams, trigrams, tetragrams, and five is great, because pentagrams, that sounds diabolical, right? Pentagrams. So it's sort of a cascade of a model. And I can also try this: item 1 of the model is a list of single words, item 3 of the model is a list of triples. OK. So this is really the heart of a statistical language model. Because now, for example, let's take a list of words; let's say "the king's daughter". We want a good way to complete a sentence that starts with "the king's daughter". So we could look in all these four-grams for anything that starts with those first words, "the king's daughter", and then find out which words come next. Let's actually do this. So this is: keep, from item 4 of my model, everything whose items with the numbers from one to three are equal to "the king's daughter". OK, let's try this. One, two, three. You see? There's a bunch of sequences: the king's daughter came, began, loved, said, again came. So you see, sometimes the same thing is in there several times. So we could just take the last item of a random element of these things that we get, to complete that sentence. So we can say "the king's daughter came", "the king's daughter laughed". And this is basically finding, as the next token, something that has been used in that context before. So I've made a little block for that, which I'm going to import, that does exactly that: it's the next token block, it's literally just that. And now we could build something like ChatGPT ourselves. So we could say: OK, when the green flag is clicked, ask "Enter the beginning of a tale" and wait.
And then, when the user enters something, we're setting, oh, we need to make a new variable, that is going to be the tale. And we're going to set the tale to what the user entered, which is the answer, and we're going to split that by word. And then what we want to do is take the next token for that tale, based on the model, and add that to the tale. And then we want to say the thing, right? So we want to say the text of the tale, joined up, just so we don't see a list but a nice text. We want to say that. And we could say: OK, how about we do this when I receive "next", and here we broadcast "next". And then whenever the user presses the space key, we also broadcast "next". OK, well, let's try this. Does it work? Enter the beginning of a new tale. Once upon, ah, "once upon a time". So now, whenever I press space, it continues this, and it creates fairy tales. Sometimes it'll stick to one fairy tale pretty long, but since it only has a context of about five words, it keeps forgetting which fairy tale it's in, and then just finds something else that is plausible linguistically, but maybe not from the same story. And you all know, right, this is not how ChatGPT really works, but it is a statistical language model, and the similarities are actually striking. This is what they call a Markov text generator, a Markov chain text generator. And a GPT model, a transformer model, is just this, except a little bit more complicated: it has a longer context, and it has some neural networks, attention, so it doesn't just take the last n words, but has a little more memory and distinguishes which things are more important. But it's literally just making up stuff; it doesn't have any idea about the language it is written in. So when, for example, I take 30 fairy tales in German, and instead of these 30 fairy tales in English I build the model on the German ones, now, yeah, I'm just saying, you know, I don't care.
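Putting the pieces above together, a Python sketch of the whole Markov chain text generator might look like this; the two-sentence corpus, the back-off loop, and the helper names are stand-ins for the Snap project, not its actual code:

```python
# A tiny Markov chain text generator: build n-grams for n = 1..5,
# then pick a next token by keeping every n-gram whose first n-1
# words match the end of the tale and choosing a continuation at
# random. The miniature corpus stands in for the 30 fairy tales.

import random

def ngrams(n, corpus):
    return [tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1)]

corpus = ("once upon a time there was a king . "
          "once upon a time there was a queen .").split()
model = [ngrams(n, corpus) for n in range(1, 6)]

def next_token(tale):
    # try the longest context first, then back off to shorter ones
    for n in range(5, 1, -1):
        context = tuple(tale[-(n - 1):])
        matches = [g[-1] for g in model[n - 1] if g[:-1] == context]
        if matches:
            return random.choice(matches)
    return random.choice(corpus)  # no context matched: pick anything

tale = "once upon a time".split()
for _ in range(5):
    tale.append(next_token(tale))
print(" ".join(tale))  # e.g. "once upon a time there was a queen ."
```

With this toy corpus the continuation is almost deterministic; only at "there was a" does the chain have a real choice, between "king" and "queen", which is exactly the "plausible but forgetful" behaviour described above.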
So now it's going to speak German, right? Because it's been trained on these German fairy tales. So it's not about language, it's about statistics. And so again, we could think that this is all about language, but at the core it's about finding things that correlate with other things. So we thought, well, wouldn't it be nice if this were also a good pedagogical model for an AGI, for something that is more general, not really superintelligent, but more general. So there's lots of sequential data. One thing that I think every hacker loves, I don't know, is music. So here, I transcribed 20 songs. And these songs aren't words, they're notes. And the notes are, you know, there's a pitch, a MIDI pitch, and there's a duration, how many beats. So what if, instead of these words, we just took these notes and chopped them up into a bunch of n-grams? Well, OK, I prepared this little script that does that; it's the music improvisation script. This takes the 20 songs, chops them up, and uses the exact same blocks, the n-grams and the next token block. The exact same thing. And remember, the data is differently dimensioned: it's not a single list, it's a multi-dimensional list. I'll try whether it does something. So you sort of see, it's like me whistling, you know: ah, there's this, oh no, there's this other thing; it kind of goes and does some funny associations. But it's already beyond language. It's already generalizing the principle of finding the next meaningful token. So we thought, OK, wow, this is nice. So what about pictures? Could it work with pictures? I mean, pictures aren't sequences; they're a plane, they're multi-dimensional. But you know what is a sequence? Sketching something is a sequence. So we thought, why don't we take the sketching thing, but instead of recording coordinates, we remember for each line segment the direction it went in. So let me again write something, like, in honor of, so: hello.
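To see the generalization concretely: nothing in the n-gram machinery cares that the tokens are words. Here the same function from before runs over invented (MIDI pitch, beats) note pairs without changing a line:

```python
# The same n-gram function, now over notes instead of words. Each
# token is a (MIDI pitch, beats) pair; the melody fragment is made up.

def ngrams(n, corpus):
    return [corpus[i:i + n] for i in range(len(corpus) - n + 1)]

melody = [(60, 1), (62, 1), (64, 2), (60, 1), (62, 1), (67, 2)]
print(ngrams(3, melody)[0])  # [(60, 1), (62, 1), (64, 2)]
```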
So now I have the sketch in this model. And let's try to find out what happens if we do the exact same thing that we did with the music, but now with a sketch program, training it on just this little data. Oh, it tries to imitate my handwriting, and does something that sort of looks like I'd written it. And it's going to be different the next time. And it's sort of fun; it is finding some meaningful next tokens. And at this point we were really, really having fun. And we thought, well, what if we don't just write something? Bernat's idea was: what if we, for example, draw something that already kind of makes sense? Like, let's draw something that looks like a tower, something that looks like a castle, whatever, a roof and a moat. So here's a little thing that I drew. Let's try to find out how we can do a skyline. Isn't that cool? And that's almost, you know, a glimpse of the generality; it doesn't matter that it's language. [Audience Q&A] Like, if you want to edit it, how do you visualize what changed? This is a list of numbers; what do you want to do, show the delta between the change of your graph and the numbers? How do we do the software-engineering spiel? We don't teach children about version control; we teach children powerful ideas. There is a version-control tool for Snap called Smerge, which merges the graphics as you refine the code. Do you have example materials for Snap? We have the gesture recognizer, including all the materials you need for school.
We are working on the Snap GPT thing, which will be published in a few weeks. If you want, you can search for projects like Grand Gestures: here is the project; go to snap.berkeley.edu and use the search bar. I am not aware of any other blocks-based programming language offering the notion of procedure as data. We are more like researchers. We do not teach kids lambda; we use lambda to build blocks that kids can use. In higher classes there is a curriculum that is used as an Advanced Placement course; there we use higher-order functions, totally. We have a high-school version of that, and there is a middle-school curriculum for seventh graders. Thank you, everybody. See you next year.
Hedy
So, hello everybody. My name is Jesus Pelai. I am a developer of the Hedy programming language. I am also a teacher at the University of Carabobo, which is located in Venezuela, where I come from. Here with me is also a colleague who teaches with Hedy, so she will also be answering your questions. And today we will be talking a little bit about the Hedy programming language: what it is, why it is useful over its alternatives, at least for our use cases, and also how you might get involved with the project. So let's begin. OK. So, what is Hedy? I will summarize it really fast for those of you who are as jet-lagged as me. These are the three core concepts of Hedy: Hedy is a gradual, multilingual, textual programming language built for teaching. This is what our entire deal is about. If someone outside asks you what Hedy is, you can tell them that it is gradual, multilingual, and you can use it in the classroom. It's a bit of a mouthful, but I will be explaining each one of these concepts along the talk. But first, let's talk a little bit about our misconceptions when it comes to teaching programming. Because we as programmers aren't really that good when we want to teach programming; we have this series of misunderstandings about how people understand things. One of these is that compilers are friends. That is a lie made by the compiler PR group. The compiler is not your friend; it is a tool, and especially not a friend for a kid, because compilers are made for professionals, and the error messages are very much tailored for adults who are working with the language. So when a kid sees an error message that is meant for an engineer, it is really intimidating. And this second issue also stems from the fact that programming languages are made for adults and engineers. And people often think that that is not really a problem, that syntax is not really an issue.
And one of the core findings behind Hedy is that syntax really poses a barrier to entry for novices. There's a study that says that 50% of programs submitted by novices have syntax errors, and 75% of programs submitted by the weaker students have errors. So this is not a kid having fun reading the error messages and fixing them; it's a kid trying things out, trying to make the program work. So it's a really tall barrier for someone to get into programming when you're confronted with the syntax right away. And the other misconception is that you mostly learn alone. When we think about learning programming, we often think about someone who is sitting in front of a computer, alone in a room. And that was the case 20, 30, even 10 years ago. But today there are many schools that teach programming, so now we also have to think about programming as a discipline that you can learn in the classroom. And these were some of the misconceptions that the creator of Hedy, Felienne Hermans, had when she was asked to teach a high-school programming class. Felienne is a researcher from the Netherlands; she has a PhD in computer science. But she was teaching high schoolers as a Saturday activity. And she thought: OK, fine, I know programming, I can teach programming. And for that she used Scratch. And Scratch is a great tool, a great language. Scratch is a visual, block-based language where, to program, you only drag stuff around, and by dragging stuff around you have a program. For example, this one right here makes the cat move around and play meow sounds. And kids love making the cat do meow sounds. However, as time went on, the kids grew up, and they said to Felienne: we like Scratch, but we don't want a toy anymore. You know, this is like a toy; dragging stuff around is like making a puzzle. We want real programming languages, the ones that programmers use, the ones that can get me a job.
Because these were students from a technical school, they were very much interested in learning programming as a way to get jobs in the future. And many people learn programming this way: not everyone learns programming because of a passion for computing, but because it is a really viable way to get out of poverty, to get a job. So she said: yeah, that's fine, I can do that. Let's teach you Python. And Python is also a great language. It is certainly easier, the syntax is certainly easier, than the likes of Java, C or C++. But still, it's very much a language made for professionals: the error messages are for professionals, and the syntax is a barrier right away. And you can see here, if I'm trying to teach a classroom how to write this, I would say: yeah, kids, today we are learning to print some text. And for some students that is not really that exciting, just printing text on a screen, but you go along: we are going to print text. And to do that, you write print, and then you write a parenthesis. To write a parenthesis, you have to press Shift plus 8 on the keyboard. And then a quotation mark; the quotation mark is right beside the Enter key, so you have to hold Shift, press that, write the text, and then do it all again. So as you can see, kids at this level might not have the proficiency in typing, or in knowing how to type things or click and select text. So that is also a barrier for them. But let's assume that the kids do it, and then you have text on a screen. Awesome. But what if some kid makes an error? For example, this kid wrote an uppercase P. And to them, this will look pretty fine; perhaps this kid does not know that the red squiggly line means that they have an error. So they try to run it, and they are faced with this: a really ugly error message, you know? It says "Traceback (most recent call last)", a file and a path, and then at the end it says: NameError: name 'Print' is not defined. Did you mean: 'print'?
And frankly, this is really intimidating stuff. The kids probably don't even read the complete error message. They'll say: teacher, teacher, what is happening here? And even if they do read it, if they don't know English, that's another barrier, because Python is an English-based language. If the kid does not know English, there is no chance they will understand this. But okay, they read it, they fix it, it's fine. But now you also have this mistake: this kid switched around the parenthesis and the quotation mark. And now it says: SyntaxError: '(' was never closed. And they might say: but I closed it! And what the heck is a syntax error? Now, I have this program, and it looks perfect, except for the squiggly line that tells us adults that there is an error, but the kid doesn't see it. They try to run it, and it says: IndentationError: unexpected indent. Teacher, what is an indent? You don't want to be explaining all of this to a kid. This is just the first class, remember. They are just learning to print text on a screen, and they are faced with complicated error messages and syntax elements that have to be placed exactly where the programming language expects them, otherwise you don't have anything. And what an ugly error message. It was also a problem when she tried to teach concepts. For example, if I show this to my students, I want them to understand the underlying concept of repetition. I want them to see that this right here will print 0, 1, 2 and 3. But they are not seeing that. They are seeing that they have to put a colon and brackets and spaces, and arrange everything exactly the way the computer expects. And the kids think: why is the computer not smart enough to understand what I'm throwing at it? There is artificial intelligence that can write stuff. Why can't it understand the simple program that I'm making?
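The three beginner mistakes described above can be reproduced in plain Python. As a small sketch (the helper function is an invention for illustration, and the exact wording of the messages varies by Python version), we can compile and run each snippet and look at the error class Python raises:

```python
# Reproduce the three beginner mistakes from the talk and report which
# error class Python raises for each one.
def error_name(src: str) -> str:
    try:
        exec(compile(src, "<kid>", "exec"))
    except Exception as exc:
        return type(exc).__name__
    return "no error"

print(error_name('Print("Hello!")'))    # uppercase P -> NameError
print(error_name('print("Hello!"'))     # unclosed parenthesis -> SyntaxError
print(error_name('  print("Hello!")'))  # stray leading spaces -> IndentationError
```

All three come back as different error classes, each with its own professional-sounding message, which is exactly the wall a first-time learner runs into.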
So we see here that syntax creates cognitive overload. We humans have a limited workspace in our minds; we have a small amount of short-term memory. And when you are learning a programming language, your short-term memory fills up with the symbols: one slot for where the variable is, another for where the bracket goes, the name of the program, the name of the function. So it is really hard for the kids to understand the syntax and, on top of that, the programming concept you are trying to teach. But this is not only a problem that we programmers face, because there are other disciplines that are very hard to learn. We have language and mathematics, famously hard disciplines. But we don't teach them all at once. You don't expect a kid to understand the Riemann hypothesis; we do it step by step. For example, if I'm teaching a kid to write and the kid writes these letters, I don't say: kid, this is a really ugly letter, this is an ugly A, this doesn't make any sense. No, you tell them: this is really good, you wrote the A, the I, the N, and you are beginning to understand it. And then you tell them that the vowels and the consonants make words. Then you put more complicated stuff on top of that: uppercase letters and punctuation marks. And then you end up with an understandable sentence. Notice that the rules changed gradually. You didn't start with the whole thing all at once; you changed the rules little by little, so the kid could understand each of the steps. Now, changing the rules also creates a little bit of cognitive overload, but that is very much made up for by the fact that they understand each concept individually before learning the next one, building on those foundations.
And this is also the case for mathematics. If I'm teaching a kid to subtract numbers, I can tell them that you can't subtract a greater number from a lesser number, because they don't know that negative numbers exist. So I tell them 5 minus 3 equals 2, and that 3 minus 5 can't be done. But then I change it, and I tell the kid: actually, 3 minus 5 equals minus 2. You just changed the rules of subtraction a little bit; they didn't have to take in from the start that there are negative and positive numbers. And the same happens with division. First you say that 8 divided by 3 is 2, remainder 2. Then you introduce the concept of fractions, and now 8 divided by 3 is 2 and 2 thirds. And later, 8 divided by 3 is 2.6666. So the same operation had three forms for the kid, and it became more complex as time went on. Now, this is the idea: we can do that for programming too. And that is how Hedy was born. It is a gradual programming language. We can see here an example of Hedy. In level 1, we have a really simple language. You can only print and ask input from the user. But you see that there are no syntactic elements attached to it. In its simplest form, print just prints some text: print hello. I can ask input from the user, but I don't have to assign a variable or write an equals sign; it's just ask what's your name. And the echo command repeats the information just given to us by the user. This level consists of six commands: the ones for text, plus commands from the turtle module of Python, so you can also make drawings. So it's not as boring as just printing text. And you can also play some music; we just merged that, so you can now play some simple music notes in this level. And then you move on, because that is a really simple language, and I add some stuff to it.
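The shape of level 1 described above can be pictured with a small, hypothetical Python sketch. This is not Hedy's real transpiler; the function name and the fixed `answer` variable are inventions for illustration. Each line is just a keyword followed by free text, with no quotation marks, parentheses or variables:

```python
# Hypothetical sketch of how level-1 commands (print, ask, echo), which
# carry no syntactic elements at all, could map onto plain Python.
def transpile_level1(program: str) -> str:
    out = []
    for line in program.splitlines():
        line = line.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        if keyword == "print":
            out.append(f"print({rest!r})")
        elif keyword == "ask":
            # Level 1 has no variables, so a fixed name holds the reply
            # for echo to reuse.
            out.append(f"answer = input({rest!r})")
        elif keyword == "echo":
            out.append(f"print({rest!r}, answer)")
        else:
            raise ValueError(f"unknown command: {keyword}")
    return "\n".join(out)

print(transpile_level1("ask your name\necho hello"))
```

The point of the sketch is that the kid only ever types a keyword and some text; all the punctuation that trips beginners up is added on their behalf.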
And now in level 4, I've already introduced quotation marks and also variables. You see that the program changed a little bit: if I want input, I store it in a variable, so now it's name is ask what is your name, and I can concatenate that in the print command. And then in level 18, you end up using a syntactically valid subset of Python. So in the end you are just programming in Python, and you can move on from Hedy to Python and start using that ecosystem, Pybricks or anything else that you like. But you don't have to explain all of these syntactic elements right away, because you do it slowly. The levels also give you an opportunity to have all the kids in the same place, because you don't have many different things going on, with some kids using one feature and others using other stuff. Now, it's important to note that the syntax of level 1 is not valid in level 4, for example. When you move from level to level, you are basically switching to another programming language. So in essence we have 18 different programming languages built on top of each other, each with its own parser and its own syntax. And these are our design goals; we have these six goals that we use when we are designing a level. The first one is that concepts are offered at least three times. In the example of print, first you print plain text, then you print variables, and then you add the quotation marks and everything. The second one is that a concept is introduced as simply as possible. For example, the repeat command is first introduced with repeat and a single sentence on a single line, so you don't have to deal with blocks or multiple commands at the same time; it's just repeating a single instruction. And then only one aspect changes at a time. So if I introduce the repeat command in level 7 on a single line, the next change is just that, and now we have blocks.
And in the next one, level 9, we introduce nested blocks. So you see that we try to change as little as possible, to make the levels feel like a gentle progression and not so steep. Also, the syntactic elements are deferred to the latest moment possible. An example is the quotation marks. They are introduced in level 4, because you can't write more complex programs without them. And this also poses a problem for the students: some kids really struggle with the introduction of the quotation marks. But we cannot do it any later, because that would mean we could not make more complex programs. And the concepts are interleaved, which means that if we change something in one level, we don't change it again in the next level or the two after that, so the concepts can really sink in for the students. And it is always possible to create meaningful programs, because in the end this has to be fun, otherwise the kid won't want to use it. Okay, so that is the gradual part. But that is not all we are about. We are also multilingual. We ran a user study with Dutch kids; it was composed of 21 kids, I think, in an online class during the pandemic. And they told us that they liked Hedy. They liked that the error messages were easier than Python's. They liked that it offered a step-by-step guide. And they told us: we would like to program in Dutch. We don't just want the interface to be Dutch; we want the keywords to be in Dutch. And it was a bit of a surprise, because Dutch kids know English, but even then they want to program in their own language. And that is even more the case for Arabic-speaking and Spanish-speaking kids, because, for example, for an Arabic-speaking kid, I don't only have to teach them what a program is. I have to teach them what a P is, what an R is, what an I is, because they don't use the same script as us. On top of understanding syntax, they also have to understand the Latin script. So it's very hard for them.
And on top of that, they have to switch the keyboard from Arabic to English and then back. So it's kind of a mess for them. So we did it, and now Hedy is available in 49 languages. I made these slides like two weeks ago; by now we have 49 languages. As you can see here, we have Spanish, we have Arabic there, we have Japanese, and we also have Dutch. Dutch or German? That's German. German. And this is kind of hard, and I will talk a little bit about it; it is a compromise for us to be able to support the 49 languages and growing, because some languages have their own quirks. Arabic has something called tatweel, which is like a valid space: you can add it almost anywhere inside a word, and we need to highlight it correctly, because it's still the same word, it just has that stretch character in there. So when a language has these quirks, we tell the translators: translate the program as closely as you can, into as many languages as you can, and we will try to solve the problems that arise in the grammar. And there are some cases we can't solve because of limitations in parsing technology. For example, in Chinese, the words for plural and singular are the same symbol. So if I write for animal in animals, the same symbol appears in both of those places, and there is no way to differentiate the variables. That is a problem we really cannot solve. But there are others, like tatweel in Arabic, or keywords with spaces in them, that we can solve. And French, for example, also has valid spaces at the end of a word. So, now let's do a demo. So this is Hedy, this is level 7. Let's go back to level 1. This is the interface. We have several elements: the editor where kids can write the programs, the output window, and the top part is where they can read the assignment. These are sort of like exercises that they can do.
These we had to build ourselves; they are provided by us and are translated by the community. What if I want to change to another language? Say I want to switch to Spanish. And you see here that it switched the interface, but the keywords are still in English. Let's see if this works... and you see now that the keywords are translated too. It only translates the keywords; it doesn't translate the program text, because for that we would need some automatic translation technology, which is hard to integrate and poses its own problems. But you can switch the keywords. And what about bilingual kids? What if I know English and also Spanish? I can mix Spanish and English keywords. You can only mix your own language and English; you cannot mix any arbitrary combination of two languages. But this is really useful for kids, for example Latino kids in the USA, who know a little bit of one language and a little bit of the other. Okay, cool. That's it for multilingual, but we are not only that. We also build for teaching. One of the main aspects of Hedy is that it's not only made for learning programming; it's also a system for teaching programming. So we are the programming language, but we are also the learning and teaching system built around it. And this is very good for a teacher, because the levels offer a step-by-step guide and are already like a lesson plan, so they don't need to make a lesson plan from scratch. That makes it a little easier to teach. Let's compare that to teaching Scratch. This is Scratch, and when you open it, you are met with a blank canvas. The kids can do whatever they want. But that also poses a problem for the teacher: what can my students build? Some of them will have no idea. Others will have many ideas that they will want to build.
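The keyword switching and mixing shown in the demo a moment ago can be pictured with a small hypothetical sketch. This is not Hedy's real mechanism (Hedy generates a separate grammar per language), and the Spanish keyword spellings below are assumptions for illustration:

```python
# Hypothetical sketch: map translated keywords onto the canonical English
# ones before parsing. Unknown words pass through untouched, which is
# what lets kids freely mix English and their own language's keywords.
KEYWORDS_ES = {"imprimir": "print", "preguntar": "ask", "eco": "echo"}

def normalize(line: str, table: dict) -> str:
    keyword, sep, rest = line.strip().partition(" ")
    return table.get(keyword, keyword) + sep + rest

print(normalize("imprimir hola mundo", KEYWORDS_ES))
print(normalize("print hola mundo", KEYWORDS_ES))  # English keyword also accepted
```

Because only the leading keyword is rewritten, the kid's own text ("hola mundo") is left exactly as typed.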
So now you either have to get lessons from the internet, or build your own, and you don't really know whether those will work. So you make your lesson. And now there will be a kid who tells you: hey teacher, what do I do now? I don't know what to do next. And this is fine, they are all in the same place; there are five other kids calling you, but you go to them and you help them. But then there is this kid who ignored your lesson: I built this, can you help me debug it? And now I have to understand this program and help this kid debug it, while five other kids around me are calling: teacher, teacher, I don't know what to do. This is hard for a teacher to manage, because there is too much freedom and too much variation in the levels of proficiency between the students. So what we can do in Hedy is build your class. You can make a class, the students have student accounts, and they can submit their own programs in the class, so you don't have to download or upload any files; that is all handled by the system. You can have quizzes, so you can lock the next level if a kid does not have the necessary grade, and you can also lock levels by date. That way you don't have one kid at level 2 and another at level 18 with wildly different levels of proficiency; you have all of them right in the middle. The slower ones will pick up the pace of the faster ones, and the faster ones will not build such complex programs, because they are limited by the level. And all of this is customizable. You can also make your own lessons by building your own adventures. So it is possible to ignore what we've built, or to build your own adventures on top of it. Some teachers like to have an adventure that is like a project: the project starts in level 1 and changes from level to level, and in the end the kids have made a really big program, built bit by bit across all of the levels.
Now let's talk a little bit about how this works. What is our architecture? It's a really simple client-server architecture. On the server we transpile the programs using Lark. Lark is a library that generates parsers: you write your syntax grammar using EBNF, and it generates the parser for that. With that we build the Python program on the server, and we send it back to the client, where it is run by Skulpt, a library that executes Python in the browser by translating the Python program to JavaScript that the browser can understand, and then executing it. But what are some architectural challenges? Ace, the editor we used, didn't work well with right-to-left languages, so we were having problems with Hebrew and Arabic. Changing this wasn't easy, but we did it, and it was hard. As I told you, making a language that is available in many languages is very challenging, because you have to debug in languages that you don't understand. You have to look at an Arabic program while you don't understand Arabic. So it was very much a decision by the team to make this language more accessible to everyone, because it's not only a social justice matter but also an economic one: it is so much easier for the kids to program in their own languages. Arabic speakers, for example, don't use our numerals; they have their own numeral system with other symbols, and they can use those in Hedy. So it's really like programming at home; they don't have to learn everything from other languages first. And to support this many languages we use parser generators, because building our own parsers for 18 levels times 47 languages would be really hard. So we generate them using Lark on the back end, and Lezer, the one we use in the front end. That means we have to maintain 18 different grammars times 47 languages.
So that is a lot, and we have some tricks for that. For example, we don't write each entire grammar from scratch; we only specify the bits that change from level to level. If in level 4 I'm just introducing quotation marks, I only change the print and ask commands. So it becomes a little bit easier. We had to find a creative way to do that, so we made a grammar merger, both for Lark and for Lezer. And it's complicated, because those two use different parsing algorithms, so you have to account for that. Okay, now let's talk a little bit about me. As you see, I am Venezuelan. So what is a Venezuelan doing here giving this talk? I became involved with Hedy in 2021. At that point I was an undergrad student trying to get involved in an open source project, to understand a little bit more about professional projects and Python, et cetera. And I heard about Hedy because I followed Felienne on Twitter. So I texted her: hey, I want to get involved with Hedy, would that be possible? And she had a meeting with me on Google Meet. That was really impressive for me, for a researcher from the Netherlands to be meeting an undergrad student from across the world. It really meant a lot to me. Then I became more and more involved with the project, and she offered me the opportunity to work full-time on Hedy, and then to give this talk. So to me, open source has not only been about building a great product; it has been life-changing. And this can be the case for other people, because, for example, my students, I now teach at the university where I studied, and they tell me: teacher, I want to get a job, I want to support my family. And I tell them: come here, help us with Hedy, with open source, and then you will have something on your resume.
And this is very important for us at Hedy: not only is our product accessible to everyone, but our systems are accessible too. So if you are a first-time contributor, if you are learning Python and you are not proficient with professional tooling, we try to help you and make it as simple as possible. You can boot Hedy really quickly on your computer without a problem; you don't have to deal with complicated stuff. If you only know Python, you can help us a lot in the back end; you don't even have to know JavaScript or HTML. And this is also something we keep in mind when we make changes to our systems: would this still be accessible for students building Hedy? Because students have helped us a lot; they have built some of the features that you can see in Hedy, and then we polish them. So we welcome everyone: novice programmers and professional programmers. If you want to get involved with us, you will be very much welcome. You can join us on our Discord and GitHub, and you can try it at hedy.org. Thank you very much. Now we are open for questions. Yes? Just a quick question. You mentioned that one of the problems is syntax and errors, and I wanted to know what language is used for showing errors, and how you know in which language to show them. Oh, right. So the user sets their own language. For example, I have Spanish here, and if I make a mistake, you can see that the error message is shown in Spanish. And the same if I go to Arabic: the error message is in Arabic. So it knows which language to use for the error messages. And there is also a problem with the tooling itself, because all of the professional IDEs that we use day to day are so overcomplicated: profiling, debuggers, et cetera. Okay.
There should be just a small tool for kids, just a window, and as you showed, something that runs directly on the computer. So you're talking about tools for the student? Yeah, like IDEs that could help you. Oh, okay, right. For Hedy, the IDE is right there on the page, and it does include some tools. For example, we do have the Hedy debugger, so you can debug Hedy programs and execute them step by step, and you can also see which variables are defined in the program. So we have some light IDE features that can help the students. We do not want to make it very complicated. For example, we could also offer auto-completion, but that would crowd the page a little, and it would be more complicated, more stuff for the kid to take in. So we try to keep it as simple as possible for them. Yes? Have you thought about interfacing the IDE and Hedy with external hardware, like microcontrollers or GPIOs or LEDs? Something more hands-on for the kids. Yes, that is actually something in the works. We have a student right now working on interfacing Hedy with, I think... MicroBlocks? Micro:bits. Yeah, micro:bits. I am not personally involved with that student, so I don't know exactly how he is doing it, but we know it is in the works, so it will be coming shortly. Maybe I can also step in here as a teacher. Is the other microphone still there? Yes. So it's a question we often get when we show this: is it boring for the kids, only text? And that is exactly what I thought before I was teaching with it. But I wanted to teach with it anyway, because I wanted to do Pybricks with them, Lego robots with Python, so they needed to learn Python first. So I thought maybe Hedy is the way to go. And then I found out, in the classroom where I teach nine and ten year olds,
they actually enjoy doing these text things over and over and over again. And at the end of my lessons I always have a portion where they can show their work to me. Often six or seven kids want to show exactly the same thing, and they're still proud of it even if the ones before them showed exactly the same. So for kids it isn't as boring as I thought it would be. And the turtle they love, obviously. I'm not saying it is boring, but maybe Hedy could be an alternative for a system like you're describing, because real programming there is impossible. If you could do similar things but with a simple language, that would be very useful for the kids. That's my idea, because I have a kid and I'm facing this problem with her. You can go two ways there, I guess, because you can use MicroBlocks, which Peter can tell you everything about, and also the micro:bits, for instance; they have their own Python on their own website. If you go to python.microbit.org, I think, you can do Python as well, but there you have the full syntax and everything. So if they know all the concepts and the syntax before diving into that, it's probably more successful. And there is somebody working on that. Yeah, I think the idea with the micro:bits within Hedy is that you can do a print and it will appear on the pixel screen of the micro:bit, for instance. Probably the way to go is what we often do for these connections to Python libraries: we take the Hedy code, it gets translated to Python, probably Pybricks Python, and that gets executed. Any other questions? Yes. Very nice presentation, thank you. How much time do children usually spend to get from level 1 to level 18? It depends on age and prior knowledge, obviously. The nine year olds I teach, I have them an hour a week, and I take two or three weeks for one level.
But the quicker kids could go way faster than that; I want everybody to come along, so I take it slowly. And the second question is: is there an offline version? No, there is no offline version of Hedy. To use Hedy you need to be connected to the internet. Okay, because I'm also a teacher and I have problems in bigger rooms with many children; the Wi-Fi is terrible. Yeah, just check the GitHub, there is an offline version; it already works. Yeah, there is an offline version, but that was a person who took it on as a personal project, so it doesn't work very well and it's not updated to the latest version of Hedy. Since you are using parser generators, every time you change the set of keywords, every time you update, you have to regenerate the parser. Is this a real programming language, or just a transpilation? You have a set of keywords and then it's Python; you execute it there, you get the result and show it on the screen. Yeah. So what we do is we cache the parsers. We have the grammars, and we generate the parsers using Lark. The parser is a Python object, and we cache that object on the server. So when a user of one of the commonly used languages comes along, they don't have to wait, and it doesn't overload the server. But if you switch to a lesser-used language, then yes, we will generate the grammar for that level, the full grammar, then generate the parser, and that will live in memory. Is it automatic, or do you have to rebuild the project? No, this is all automatic. Lark takes the grammar file automatically on the server. So we have the Spanish grammar, the Arabic grammar, the English grammar, all of them on the server. But you have to change the grammar file of Lark? Yes, of course, when we build. So you will have the new version of the parser. Of the parser, yeah. And do you cover... I think you cover just, let's say...
Just printing and some basic keywords, let's say. But if it's about advanced algorithms, that's not part of it. It is a full language. At level 7 you already have repeat and conditionals, so you can write somewhat complex algorithms. Of course, you don't have interfaces to files or anything, so you are limited in what you can do with the outside world. But it's a fully functional programming language. But it is still a subset of Python? It is a subset of Python in the end, yeah. Yeah, of course, it's just a subset. And for example, you said from level 1 to level 18, let's take Arabic. At the beginning the kid will start learning in Arabic, you know what I mean? Yeah. And in level 18 it's still Arabic, nothing more than Arabic. But later they have to switch to Python to write a fully correct, syntactically correct program. How do you do that? So, that transition: we haven't yet gotten to the point of designing content for it. You end up in level 18 programming in Python in your own language. So then it is up to the teacher or the student to learn Python with the English keywords. But you already know part of the syntax and understand how some of it works, so what they have to do is a little bit less hard than in the very first lesson. And we're also obviously hoping that other programming languages will pick up on the whole multilingual thing and make, maybe, Arabic versions of Python, of the real Python thing. It's a long way to go, though. I still think that for people not in the English-speaking part of the world it's hard to get into programming, since almost all languages are based on English. So we're hoping to help. And for other languages it's not just about the language itself; it's about the computational culture we have. Yeah, absolutely. Everything is built in English, all the meanings and all that. Yeah. It is designed for professionals. Even having a right-to-left editor is not easy.
I'm not that proficient, but there are a lot of problems in creating something in a language other than English, anything not written left to right; when you go into that, you always face problems, even for the smallest things. Yeah, and that's why Hedy is trying to do it for right-to-left as well. So when you're learning to code, you can learn the concepts of doing loops and variables and all those concepts that most of us could probably dream in, but that are really hard for kids and for people new to coding. You can learn the concepts without learning the Latin alphabet, or having to switch your keyboard because you want your output to be in Arabic, probably, if you're an Arabic speaker, and you can't type Latin and Arabic on the same keyboard. So we try to make it so that you can learn programming apart from learning English and the rest of it. Yeah. You have a question? Can we use the turtle library? Yes, we have embedded turtle functionality. It's not the complete library, I think. No, I don't think you can do fills at the moment, but you can do the turtle and colors and stuff. Yes? Can the students do input and output beyond the screen, store things from the screen? No. No, there is no long-term storage. The only output is to the output screen, so output is printing, and asking for input comes directly from the keyboard, from the user. There is no connection to files, and no pixels either. You can have the output read aloud by the built-in text-to-speech thing, and you can also download the turtle drawings; that's the only thing you can download. Yes? Why did you choose to use lex and yacc and not a more modern parser? Oh, it's not lex and yacc. It's Lark and Lezer. Lark is a Python library, and the other one, Lezer, is part of CodeMirror, which is the editor; Lezer is purpose-built to work with that code editor. So we're not using yacc, no.
So it's a little bit more modern than those. Yes? Yeah, I need to follow up on the offline question. Can you self-host it? You want to run it locally? Yeah, you can self-host it. Okay, and another thing: it's really nice, this approach of incrementally teaching kids to program. Is there also an approach to incrementally, at the same time, teach the kids to read technical documents? Because in the end they're going to face the challenge of reading technical documents, so it would help to teach them from the ground up. So our exercises and adventures are very much tailored to be fun for kids around 12, 13. We don't teach them to read complex material. It's more that we want them to understand the programming aspect of it before moving on to other endeavours. Yeah, of course, but incrementally teaching how to approach sources of information... No, it's the same approach across the levels. That's a good point, though. Yeah. Related to the question they asked about the warnings: I suppose that at level 1 you get a simple warning, but at level 17 you get a more technical warning message, for example. So the error messages change a bit, because we are adding more complexity to the language, so they do get a little more complex. But if you make an error in level 1 that is the same as in level 7, it will be the same error message. I think it would be a great opportunity for students working on the project, or anybody working on the project, to help with, say, graduating from Hedy to real-world Python, introducing its concepts gradually just like that, and also more concepts such as using a code editor instead of a web page; that could be a level 18 or level 19 extension. The Hedy project is only two or three years old.
Yeah, we are still moving very fast with new features and stuff, like the music notes that have been in there for about a week and are only available in English, because the translators haven't come round to translating them yet. So what's not there now, maybe next year, you never know. Yeah, you are very much welcome to come join us in our meetings, because we have public meetings. We put them in the Discord, so you can come join us and give us these ideas, so we can begin working on them and consider them. Yes? Do you do anything at the beginning, as a preliminary exercise, about typing? Kids do not even know how to type an uppercase letter. Yeah. Uppercase letters. Just to help them understand how to do that. Shall I answer that? There's not anything for that in Hedy particularly, but as a teacher, I do that bit before I start them on Hedy. So I teach them how to make a password, why it's so important to have a password and keep it secret, stuff like that, and how to type your name with capital letters and accented letters, which we have a lot of in the Netherlands. So I do that before I start them off in Hedy, but not for too long, because once they are in, they are really eager to learn that kind of stuff, because they really, really need it. And kids want to learn the stuff they need to do what they want. So if they need to type the quotation marks, they'll just ask me or the person next to them, and within 10 minutes everybody knows how to do it. And they don't forget, because they have to use it all the time. So it works both ways, I guess. You've had your finger up for a while. I can see how it's analogous to learning to read and write. But what happens after level 18? Have you done any studies to see what the impact is?
Here you're subjected to learning, as I understand it, from level 1 to 18 in any of those languages, but then to progress to, I don't know, hopefully your professional development, suddenly you have to learn everything else that you haven't learned, and you have to learn different languages and so on. So how useful is it beyond teaching programming in Hedy? So when you finish level 18, you already know, you know, Hedy in your language, or Python, because it's level 18. But more importantly, you already know the programming concepts, which is the important thing that we are trying to teach, because we are not trying to teach the syntax; the syntax will change with every programming language. So we want the kids to know what a loop is, what a conditional is. So if they move on from Hedy to, you know, Python, or some other language that is English-based, it will help them a great deal, because they already know the concepts behind that syntax. So the transition is easier. It's easier, but how much easier? Julia, do you know if Felienne did any research on that? There has been some research on this, yes. I do know that going from one level in Hedy to the next is still a smaller step than going from level 18 to real-world Python. So, yeah, we are still thinking about how we can make that a smaller step, so that going from Hedy to Python really feels like just another step between levels. But it does help a lot, of course, because if you have no programming experience and you come through Hedy, then the step to Python is much smaller than going to Python without anything in between. So, what I'm asking is not whether it's easier; I'm asking if there is any evidence that it's easy enough for people to make the transition. Do you know? Do they make the transition or do they...
Yeah, I do know that at the school. Felienne, aside from being a researcher, is also a high school teacher, so one day a week she's teaching in the Netherlands, mostly 12-year-olds, around that age. And she has students in her classroom of 12-year-olds who go on to Python after being done with Hedy. So she does use it herself, and I've been observing in that class myself. Yeah, and it's working fine. So I know, I mean, it's very anecdotal, so I don't know about a big study with lots of kids, but I have seen a classroom where 12-year-olds switched from Hedy to Python and were enjoying themselves. So, yeah, it is possible at least. Also, if you'd like to hear Felienne's answer, I encourage you to come into the Discord and ask the question there as well. She's on there every day, all day, or so it seems to me anyway. So she'll definitely answer whether she's done any research, or maybe even put a student on it to research it, because it's a really, really good question. I know there is a paper about multilingual programming that she published a while back, but I don't think it's specifically about Hedy, much more about the concept. So I don't know if that paper will answer your question. Yes? Related to the previous question: is there an option in Hedy to show the program you wrote in actual Python, to ease that process? No. There is no built-in functionality for that in the interface, but if you know your way around the browser a little bit, you can actually see what was transpiled. Let me put this back in English. And maybe to add to this, I don't know if it's clear, but if you have code in, say, level 18, you can just copy-paste it into Python. It actually is code. But the program might still be in, say, Arabic, from Morocco. Well, what if it's in English? Yeah. Assuming your program in Hedy is in English, then you can copy-paste your level 18 code into Python, and then it's exactly the same.
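As a concrete illustration of that last point (my own example, not one from the talk): a level-18-style Hedy program with English keywords is, by design, also valid Python, so it runs unchanged in a Python interpreter.

```python
# A small program in level-18-style Hedy (English keywords).
# Level 18 is a subset of Python, so this is also plain Python.
fruits = ['apple', 'banana', 'cherry']
for fruit in fruits:
    print('I like', fruit)
if len(fruits) > 2:
    print('That is a lot of fruit!')
```

The same code pasted into Hedy at level 18 and into `python3` behaves identically, which is exactly the "copy-paste it into Python" point made above.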
So level 18 Hedy in English is actually the same as Python; it's a subset of Python. Remember that you can also mix English and your own language. So you can make that transition within Hedy as well: you can mix English keywords and keywords in your own language right there in the editor. So that makes the transition a little bit easier. Yes? How do you handle reserved words? How do I handle what? Reserved words — variables that cannot have certain names. Like, you cannot have a variable called print. Ah, okay. When you translate from one language to another, a variable name might become a reserved word. Yeah, that depends on the specific variables. So there are some keywords that we can parse in context. If you name a variable, for example, I don't know, color, I think you can make that work. So it depends. If it's a variable name that you cannot assign, it will tell you that you are misusing the command, or something like that. Otherwise, it will work normally.
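The keyword-translation idea can be sketched like this: map localized keywords to canonical ones before parsing, and flag variable names that collide with commands. The tiny table and rules below are invented for illustration; Hedy's real keyword translations live in its own translation files.

```python
# Illustrative only: a tiny localized-keyword table (invented,
# not Hedy's real translation data) plus a reserved-word check.
KEYWORDS = {
    "print": "print", "imprimir": "print",  # Spanish 'print'
    "ask": "ask", "vraag": "ask",           # Dutch 'ask'
}

def canonical(word):
    """Return the canonical keyword for a possibly-localized word, else None."""
    return KEYWORDS.get(word)

def check_variable(name):
    """Mimic the behaviour described above: a keyword cannot be a variable."""
    if canonical(name) is not None:
        return f"error: '{name}' is a command, not a variable"
    return "ok"

print(check_variable("print"))  # rejected: it is a command
print(check_variable("color"))  # ok: plain identifier
```

Mixing languages then falls out naturally: both `print` and `imprimir` map to the same canonical command before the program is parsed.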
ZIMjs 2D PWA apps into 3D+VR with ThreeJS
Hello. Hello, Europe. Welcome to FOSDEM 2024. I am Carlo Roestel and I want to explain ZIM. ZIM JavaScript is a new way of coding creativity, for kids who want to type and not drag blocks. We can do different things. I will show you version 014, version 015, and the one we launched this January, two weeks ago: number 016. So let's enjoy creating the new tablet apps online. ZIM is created by Dan Zen; you see the little man at the left side. He's from Canada and his name is Dan Zen, but his alias is Dr. Abstract. And I am Carlo from Belgium, and I created my own signature, Zim Salabim, the magic of code. Let's start. So this year ZIM has existed for 10 years. If you go to ZIM's history, you can see the different versions you can click on. So we now launched speech. Everybody is talking about AI, but we can now show you: Happy New Year. Okay, it's February, but still, we can say Happy New Year to everybody. When we launch an app, here you have the possibility to see the promo for the app, the full version at the right side, and the editor to see the code and view the app in the online editor. So I'm going to show you the promo and the full version. Click on the button of the computer. You see an animation of the hand; it's a GIF you can import into an HTML canvas, and the computer should talk if you click on it. So here it's a computer voice speaking, but you can choose whichever voice of the computer is active. Speak. This is Dutch, the language Dutch. I speak Dutch. So we are 10 years old, and we celebrate with this FOSDEM edition. I'm happy I can give you some information. I was here a year ago also. But we now have the new ZIM Forum. A little applause, because we always worked in Slack before. So if you have problems or questions about ZIM, please go to the ZIM Forum. There is a big team of developers helping you if you have a problem making apps. It's a pleasure to meet us there. So how did I start with ZIM?
It was corona, three years ago. I'm a teacher, and I was looking for something for my wife, who is a teacher for little people, toddlers, in school. She was making videos, and videos are not interactive. So I wanted to make apps, because I'm a graphic designer. I had no lessons in JavaScript. So I searched: I found Frank, I found Khan Academy. But suddenly Dan Zen came up, and he is the inventor of ZIM. He's a little crazy man, but he loves inspiring people all over the world. You can see a little introduction in this movie; I'll let him tell a bit of that himself. "Now a two-time Canadian New Media Awards winner. The second time was for educator of the year; I'm a professor of interactive media at Sheridan College in Canada. The first time was programmer of the year, for my website called Dan Zen, because I'm also the inventor of Dan Zen. Dan Zen is a large site; it's been going since the 90s. Many games and gadgets. Tools. There's a tool called Avatartica, which was the number one avatar-making tool on Google for 20 years, until Flash died. And most of the Dan Zen site was made with Flash, or even Director before that, and so it's now treated as a museum. I'm well known in the physical world for dance and fashion parties. I had a happy celebration. Here are some of the things people have to say about me. These can all be found at the front of the creativity framework." Okay, so many people are talking about him, and there are reviews of the website; I'll show you some reviews. So kids want to be happy and they want to make apps. That was my first idea. And Dan Zen helped me create simple interactive apps, to move a robot up and down, for example. It's a D-pad. And you can, for example, make dialogues, and the kids can animate robots and aliens. So if you click on the aliens, you can move them around in space, or you can move the robot left and right. This is what kids like a lot. They want to be satisfied within a 10-minute lesson about coding.
And they can make this. We have assets for children made into ZIM Kids Slate. That's a new way of programming for kids. We also have a gyro sensor: here's the mobile device, and when we tilt the device, the robot tilts left or right too. And in 2023 we got Frank Los. Frank Los is somebody known for game making with JavaScript; he made Space Invaders. And I asked Dan: can we also make retro games? It's fun for kids too. And with the knowledge Dan gave me, I could use the information from Frank Los, and he says about ZIM that it's really fun. And fun is a factor for me to start exploring ZIM. There are even five-minute videos you can explore. So, for example, Angry Birds: if you want to make Angry Birds, a very well-known game, there are articles on dev.to, Angry Birds in 15 minutes. There are little videos where he's explaining how to code. So this is the intro. You see, he's really doing cool things for children. We look at other libraries too for examples, and then maybe I'll show you my example. I made it on my website, Zim Salabim. You can find it there, and you can drag and drop. We have physics in ZIM. You can do whatever you want in ZIM: just your imagination, and you can make it. All for free, and with a little help from others on the forum. It's nice to know. So, for example, the reviews: thanks to the ZIM developers, time was quite short and it was a very smooth ride along a quick learning curve. So we are all around the world now. Whether you are a beginner, intermediate, or an expert: no problem, you can join. This is a survey, so we can collect the people who are meeting us. And the cat is Omi. We gave it a name: as Scratch has a cat, we also have our cat, the grown-up one, you know, the big one, because with us you have to type a little. Okay, so this is the framework, ZIM Slate. We changed it a little; one year back it looked like this, but the main idea stays. On the left side...
You have the view of the app, and on the other side you can code. For example, color is white: you can use that to color the background white. Okay, you see Harry Potter is teaching you, because it's magic code, you know. Okay. What is bubbling at ZIM? Bubbling is like: what are we doing with ZIM. I'm Carlo and I just had the idea to make kids happy. So I put in a lot of effort each night, because I also have kids, one of five years and one of one year and eight months. So I need to negotiate time slots with my wife, and it's very difficult. But I believe in the product. It's free, it's open source, and why shouldn't I invest in this beautiful project, because I want to spread it to the world, you know. But I'm not alone. For example, ICTgames. It's from the UK; they make a lot of kids' apps. I'll show you. This is the website of ICTgames. You have math or English you can learn. There are even Dutch examples of apps. They're learning tools. This is, for example, a beautiful one; you can expand it. You have to learn the numbers, so you have to push 1, 2, 3. The kids just have to follow with the finger. It's not doing anything fancy, but the animation is made in ZIM. So I clicked the one. It's okay; there comes a star. Here we use physics, so you can drag the wooden block and you have to make a bridge. So the children have to count, because the one comes before the two, and then the three after the two. And then we have the running: this is a sprite running, you know; this makes it fun for kids. So that's the idea behind ZIM. Also Topmarks, for example, has a lot of examples for kids, if you want to be inspired as a teacher. This is all made in ZIM, you know. And also Close Up makes a lot of games; I'll show you a little further on. So here we have the ZIM Kids golden bar at the right side. If you go to the red side, you can click Kids and then you can make the apps. But we also do VR for kids. And we do AI: it can be imported into AI as well.
So here at the right side you have a virtual environment. I'll show you an example in a few minutes. ZIM also has a Facebook page where you can find some information; the latest apps are posted there. This is TapLab from Israel, who are making apps in ZIM: "Millions of users have already played and used the products we developed this year. In the past, a game would take us about two months of work; now it takes us only two weeks. Working with ZIM is very flexible and makes it possible to work with dozens of components developed in the past, and also to develop and expand our own components and code elements that optimize the work. One of the biggest advantages of ZIM is the connection and support from Dan and the team. The availability of the ZIM developers on Slack or Git allows us to rely on ZIM as our main platform for development. Any question, bug, or request is immediately attended to, and we and our customers know that there is something to trust. Unlike other platforms, ZIM is constantly being updated, and every few weeks new features and innovations are released that optimize the work. Working with ZIM is easy and fast. Images, videos, samples, vector files, and much more work naturally with ZIM. ZIM allows us to be present in the design and focus on finishing our developments and games. The number one reason to work with ZIM is: coding with ZIM is just fun. So thank you, ZIM, and all the team developers in ZIM. ZIM is here to stay." Okay, so I hope — it's a little advertisement, but they also made the possibility at TapLab to create your own game within an environment. Three games are free, and for the others you have to pay. And I'll go to the next slide: the ICTgames story. Why the name ICTgames, and why am I in the game? "So I started it in the year 2000, and I wanted to make educational games to support the children I was teaching."
"On the learning objective of the day, to take home. I wanted it to be free, so that all children had equal opportunity. I originally made games in Flash, and then about six years ago I looked for an alternative, and then found ZIM. ZIM.js, which is just brilliant. So hopefully we'll see some things. So you can turn things and move through this space. It's cute. And there's a mask to chop off the edges, to make it look like it's masked. We've got randomly sorted shapes and words, and then you can drag. And if you look at the code for this, it's broken up; see the grid there, and the comments. They describe how it works; it makes sense of what's happening. Also, I really like the documentation. It's clear, the documentation. So we saw rectangles. I can search rectangles, I can type in something, I have a new rectangle again. So remember the themes of light in that game over there; this is all this section, these are all rectangles. These are all the parameters and variables you need to make a rectangle. So you'll see." So that was the ICTgames man. The library has a lot of components made by ZIM and the community. You can use them, for example, to make games for kids to connect lines. Many kids want to learn the numbers and connect them, and then you have an image popping up. So that was a question of mine, and then it got implemented. So I was happy I could make games like that for my wife. This is the website: we have dragging games and matching games together. You can visit it if you want; the slideshow is online. But now we will talk a little more about the environment where kids can make games. So it's a scene. You have to choose a background and some pictures, and you have to combine some pictures, like the eyes on a gem. And then you can apply your own creativity. So first of all, the names: for a picture, we shortened it to Pic. So you have new Pic. New Aud is audio. A video is Vid. A GIF is an animation; a GIF is a GIF.
And an SVG is a Scalable Vector Graphic, in a purple box; you have seen it. So Pragma, the little woman in the middle, is the daughter of Dan Zen, and she's also helping kids program, because girls can code too, of course. That's good, because they love it a lot. If you want to know how, for example, to make a new Pic, we also have a help button at the right side, and then you see a video. So if you click the video, it explains how you can work with the assets. A little explanation: "So that's a little tricky. But let's try that again. Test. So no sound is playing, because the user is supposed to interact with the app first. Here's what they do: Start. And now the background sound plays. Yay. So that's a little tricky." Okay. You can get to Kids at kids.zimjs.org. "So come on in, and I'm going to show you how you can make a scene. Scroll down here: Make Scenes. If you want to learn about coding, then you should be taking a look at all of these parts and trying out the handy tutorials here to walk you through. But you can also make a scene in Slate, and Slate is this link right here, or you can click here. And that's a place where we can type any code that we want. Of course, it needs to work. And then we can test our code over here. You may see it's starting with this demo right here, a fragment, and there's the code for that. But we want to make our own scene, and we'll do that over here. And test: right now we've got nothing in it. All right. We can add some assets; they're available up here. And if you want help on how to do that, well, this is a help video, so hopefully this will give you help. But you can press help there, and what that does is it jumps you down to this help section right here, where it tells you how to use the asset buttons. And it says: see the video. There's a bunch of examples as well."
"We can add a background color, and that just adds a color; that's not really using assets. But here's adding a background image, such as a beach. And there's adding images, like a butterfly. We'll do that too; there's code to be able to do that, and we'll talk about that. Here we can put things into containers and move them all together as well, so we'll take a look at that. Here's adding sounds, so that when we press on the butterfly, it plays a sound. And then here's adding a background sound, which is a little bit tricky, because we're really supposed to click on something in the app before we can play a sound. It's a sort of polite thing that the browsers make us do, because otherwise people would go to a page and sound would be playing right away. So that makes it a little tricky to add the code for a toggle button, as you can see, or the start button. It's not too bad. We can actually test just fine, nice and easily, with the background sound, but that will only work because we're pressing the buttons here. Don't worry too much about that. Before we go into the assets, let me just show you how to change the color of this. The way you do that is you go frame.color is equal to something, such as yellow. We have a bunch of colors; most of the plain colors, green for instance, will work. And we hit test, and there it is. This outer color here didn't change; it's still black. We can change that color by going frame.outer.color is equal to blue. Oops. That's cool. And we can test that. We may have toned some other colors down in the example. There you go. There's also a way: if you take a look at the demo, you see how that's called a gradient. And if you take a look at the code there, we made a new Page, and we told it the stage width and the stage height, and we told it two different colors, and we added that to the stage. So if you did that: copy this, copy, and paste it in here."
"And do a test. There it is: it's going from pink to blue, and our outer color happens to be blue. Okay, so those are some ways that you could change the colors. There we go. Those are some ways that you could start off without the background image and just start building things here. For instance, a new Rectangle, or a new Circle; we'll make it 100 radius and red, and dot center, and dot drag. So this is you making an app. It just doesn't have any pictures, but it allows you to think: what if we want to start making a game? It just has these shapes. Okay. Now let's go to assets and see how that works. We'll clear that, and test, so we're back to nothing here. We're going to go and add a background image, so I'll click on Backings, and then we'll go to the background image, and here's a bunch of different backings we've collected for you. And thank you to the artists, and thank you to our co-ops who helped process all these. Okay, let's choose the beach: beach one. So you turn on the beach picture, with the checkbox, and then hit Save. What that does is it says: okay, so we're back, and we can access that by going asset — singular, asset — we're using an asset, and then you put the name that is right here, beach01, inside of little quotes. That turns it into a string. If you don't use a string, it will break. And then we will dot center that. So there's the asset. Asset, by the way, is a word for images and sounds. We have images here and also sounds; one day we may add sprites. So there is our background image. Now you can go to phone mode here. You'll see the phone mode makes it a bit longer, and you can see that the image doesn't really fit perfectly in the stage."
"And it's not intended to; it's not high enough. That's because the asset is just a little bit too big for this, which means we'll need to adjust. We can do that with scale: .sca. So we put in .sca, and if we make it twice as big and save it, and hit test: there it is, and that fits. It's a little bit too big, as a matter of fact. If you want, we could drag it and sort of see what we mean. See? Maybe it's just fine; maybe we like it like that. But we sort of have to guess at the scale a little bit. Or we could keep this scale and then move it over. For instance, we could center, and then m-o-v for mov, and then we want to move it this way. This way, it has to move negative: it's negative x. So we could move it minus 300. Hey, there we go. So if we moved it to zero, there's what it looks like, you can see. But we're going to move it 300 pixels over in the negative direction, okay? So that would look like this: negative 300. And hey, that looks pretty good. I like that. But there's an easier way, perhaps, to just fit the picture, or make the picture fill, and let me show you how we can do those two things instead of moving it. What we'll do is we'll use scaleTo. So if we use scaleTo, that will scale to the stage. And I save this, and now it's scaled to fit in the stage, like that. If we make it landscape, it fits to the top this time. And if we make it portrait: there it is, fitting in there. That's okay. Well, not really, because it leaves this stuff on the top and the bottom. That's FIT. So we want the squiggly bracket: type, FILL. So you see what we've done there: we've used the squiggly bracket to go to the type parameter, of FILL. Hey, look: now it fills the stage. It fills the stage in landscape, and you know, in portrait it fills it. So it fills it no matter what you choose. You can still move it afterwards. But that will probably do us for now." Okay. A little introduction to Slate.
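The FIT/FILL difference the video demonstrates comes down to one line of math: FIT uses the smaller of the two width/height ratios (whole image visible, possibly with bars), FILL uses the larger (stage fully covered, image possibly cropped). A language-neutral sketch in Python, with invented function names (ZIM's actual scaleTo does this internally):

```python
def scale_for(stage_w, stage_h, asset_w, asset_h, mode="FIT"):
    """Return the uniform scale factor for an asset on a stage.

    FIT  -> smallest ratio: whole asset visible, may leave bars.
    FILL -> biggest ratio: stage fully covered, asset may be cropped.
    """
    rx = stage_w / asset_w
    ry = stage_h / asset_h
    return min(rx, ry) if mode == "FIT" else max(rx, ry)

# A wide beach photo on a portrait phone stage:
print(scale_for(400, 800, 1600, 900, "FIT"))   # 0.25 -> bars top and bottom
print(scale_for(400, 800, 1600, 900, "FILL"))  # ~0.889 -> sides cropped
```

This is also why FILL often needs a follow-up `mov()` like in the demo: after covering the stage, you may still want to choose which part of the image stays visible.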
We have now made several changes to import 3D into the editor and Slate, so kids can make 3D apps. Everything is explained by new teaching videos. And for teachers we have made the editor. Yes, the editor. You can save, for free, all the zaps you're making; otherwise you have to download them and save them in a place you'd forget. So now you can make zaps — ZIM apps, that's the combination behind the name — and then you can make lists of zaps and share them with your students. You can also find some demos, and all that is for free. So no advertisements. So this is the other one, the newer one, where you can explore ZIM with demos. And here you can have a look at the editor. "So we have the search here, like in the docs. And as soon as we search, it collapses all of the results; but this collapse also opens my collapse, my results part. So there's how many we have so far, 30-some of them, and in fact the current one that I happen to be in here. So if we take a look: there's the Dial, the Buttons, the grids, the guides, and passing parameters or object pages, the variables, and so on, and you go into the slides. So those are the Bits. In other words, great, we can just hide it that way. But we do have as well — and this one's called lists. So once again, I'm logged in, which allows me to keep track of the files. And then here, what we're going to do is copy the Bits, and I'm going to copy them to my own list, and I will call it Bits. And note they're all the names of the Bits, and you can work nice and happy with them. And it's saved. Now under Save, here, see, there are the files. List — okay, so that's right, that's not a list. So Save is anything that I happen to hit the plus sign on."
"And then Save, plus my own list. So here under Save, we have my own list. So you would have your own list. But we want to look into this. And I can view these: they are my lists. They are my art, my preparation, my basics, and here's my new list. Wow, take a look at that." Okay, so you understand the idea: making lists of all your zaps. You can share them with the world. And then we can go much more into depth with children to learn code, of course. So we're building, building, building, and we hope everybody understands how you can manage and share your zaps, your ZIM apps, your little pieces of code, to let children be creative. For example, this is a Blob. A Blob is a shape that is closed, with Bézier curves, and you can change the parameters, the points, to make any form of shape. And this is needed to make, for example, the kids' connector app. So you have to make a path or a blob that is closed, and when you have the blob, you can make a game of it. Okay? So I was also thinking: can kids make games themselves? If I have to make all the games for the kids, it's a lot of work. So I made a tool. Kids can import their own image, and then they can change the blob themselves. So you have this basic blob, hide the numbers, I can say I want to edit a blob, and I can just drag the blob around the cat. Otherwise, you'd have to program it; here you drag and drop with big buttons. It's very easy. And then you have saved the new blob, and you have the points as code, to use in your game. So it is very important for children to understand what the form of a shape is, to explore code. Also, Beads are possible. Beads, for example, stars: you can make the Europe flag. We are at FOSDEM, it's Europe, so 12 stars need to be on a circle, and then you can add beads. Also emojis: we made a tool to get kids creative with emojis. So what is the code for an emoji? They all send emojis on the phone, but what is the code behind an emoji?
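Placing the 12 stars of the Europe flag evenly on a circle, as Beads does along a path, is just equal angle steps around the center. A minimal sketch of the geometry (plain Python, not the ZIM Beads API):

```python
import math

def points_on_circle(n, cx, cy, radius):
    """Return n (x, y) positions evenly spaced on a circle,
    starting at the top, like the stars on the Europe flag."""
    points = []
    for i in range(n):
        angle = -math.pi / 2 + 2 * math.pi * i / n  # start at 12 o'clock
        points.append((cx + radius * math.cos(angle),
                       cy + radius * math.sin(angle)))
    return points

stars = points_on_circle(12, 0, 0, 100)
print(stars[0])  # first star at the top, roughly (0, -100)
```

In ZIM you would hand a star object and a circular path to Beads and get the same layout without doing the trigonometry yourself.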
So we go into the depths for the kids, and then: oh wow, we can make emojis. So, for example, the code for an emoji is new Emoji. We have the code, and then 200 pixels wide, and then in the middle of our game. So if you go to the site, there is the emoji tool. You say: ah, I want to do a tennis or ping-pong game, I want this one, I can copy it and import it into ZIM Kids. Or I want another one; you can also find one in your own operating system. So, for example, I'm happy, I'm in love with FOSDEM, so I want to use the code of this emoji. So this is why kids love ZIM Kids a lot: they can be real coders; they can see the code behind it. Also, we have panels you can open and close. For example, emojis in a list for kindergarten teachers: they can talk about what is this, what is that. So in little time you have pictures, because they are emojis. And if you don't want them on the screen anymore, you can remove them. So you can open and close, you can move them around, so you can make whatever you want and let children talk: what is this? Oh, my daddy likes this a lot; I don't know what it is, but I think here everybody knows. I'm also a little thirsty, so I will drink something afterwards, because I'm happy and in love with ZIM Kids. The indicator also has love emojis. Indicators in games: you want to know how many lives you still have after hitting something, and now you can use emojis for that. Another tool we launched in ZIM Kids is the Emitter. ZIM Kids has lessons, but the emitter wasn't yet possible to show to people. This is the confetti. You can choose whatever you want; you can make fire, if you want, or other colors. You emit something. And the code you can generate with the button. This code you just copy and paste into the editor, ZIM Kids Slate, and we just do it this way, and then we test it, and we have the confetti. You see?
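"The code behind an emoji" is its Unicode code point (or a sequence of them), which you can inspect in any language. A quick Python sketch of what the emoji tool shows kids:

```python
# Every emoji is one or more Unicode code points; ord() reveals them.
emoji = "😀"
print(hex(ord(emoji)))   # 0x1f600
print("\U0001F600")      # builds the same emoji back from its code

# Some emojis are sequences of several code points, e.g. flags:
flag_eu = "🇪🇺"
print([hex(ord(ch)) for ch in flag_eu])  # two regional-indicator codes
```

So copying "the code of this emoji" from the tool really is copying a Unicode escape that any editor or language can reproduce.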
What I also explain to kids is how color works. Color is a big problem for children: how RGB works. So I made a tool for kids so they can change the color. And they understand what alpha is, what looking through is, what hue is, what black and white are, how it all works. So they can learn a lot of the techniques about color. Hexadecimal, all that thinking. This is all possible in ZIM Kids. Sliders. It's unbelievable. These are tools I really need as a teacher to explain what coding is.

Also, if you want particles for stars or snow or for simulating fire, we have a website. And ZIM Kids has different levels for animating objects, of course. I'll show you a little example. So we go to the parts about events. Here you have a triangle. You click on the triangle, and it goes to the right side. No problem at all. I go to the code, and kids can learn the code. This is a triangle. It's rotated. It has a cursor, the hand. And it has a position at the left center. So at the left side, in the middle, it's positioned some pixels to the right. That's the start position. But if you go to level two, they learn about animation with the spacebar of the keyboard. So that is also possible. But here, you can click, and the circle is moving. And what is also possible on the other side: you can go with the arrows up and down. And when it hits, you get confetti. This is why we needed the confetti, the Emitter: to show children what is possible. Because when they are fighting in a game online, they also have to see something when they die. So this is also a possibility. Confetti, when we shoot something in the air, it goes to another place. And this one is a little faster. But we also have the sound with it. We have a counter for the time. It goes down. And then: oh, we have 10 seconds left, we need some more points, come on, come on. And when it's finished, we need to see... what do you think, what do you see? Game over. So this is a little animation for children.
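The RGB-to-hexadecimal idea the color tool teaches can be shown in a few lines of plain JavaScript (illustrative, not the ZIM tool's actual code):

```javascript
// Each of red, green and blue is a number 0–255 (one byte).
// A hex color just writes those three bytes in base 16, side by side.
function rgbToHex(r, g, b) {
  const toByte = n => n.toString(16).padStart(2, "0");
  return "#" + toByte(r) + toByte(g) + toByte(b);
}

// Alpha is a fourth value: 0 = fully see-through, 1 = fully opaque.
// In CSS you would write it separately, e.g. rgba(255, 165, 0, 0.5).
console.log(rgbToHex(255, 165, 0)); // → "#ffa500" (orange)
console.log(rgbToHex(0, 0, 0));     // → "#000000" (black)
```

Moving a slider in the tool changes one of the three numbers, and the hex code changes with it, which is exactly the connection kids need to see.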
How can we make a game? If you want something more complex, then we have sprites. We can really animate like in a real game. So this is the Emitter and why it's used. You can also catch beads. An emitter I used is for my own name. I taught kids of 12, 13 years how to make their own name. And a little game for the name is: please erase all the cookies. And when you're done, happy, all the cookies come raining down. These are pictures in the Emitter. So you can do whatever you want in the Emitter. And if you want to play again, you can press the button to play it again. So creativity is all yours, of course.

Walking along a path is also something we use the path for, of course. Here you have the path. And on the horse, there's Sinterklaas. It's a typical Flemish ritual. He comes onto the roofs of the houses to bring presents to the children. So he's walking above the rooftops in Antwerp. And you can stop and get help: what do you have to do? So there's also a path, but he comes back when he's at the end. OK. I also made a game for Saint Martin. I live in Beveren, next to Antwerp. And there is like a candy man. The candy man is there at night, and he has to throw candy. So you have to hit the little people to give them candy. And if they are happy: yay. And this is creativity. You can put your own pictures into the background and onto characters.

You can even play multi-pong now. If you are with three screens, so two phones and one laptop, you can play pong together with your phone. So I would say, test it sometime. I cannot show it now because I have no time anymore, but it's working. OK. We also have animating. If I see something in a classroom, I want to make it myself. Animating is done with sprites. This is Dr Abstract. It has columns and rows; sprite sheets, we call them. And then we use them for games too. So for example this one, we have three layers, and we can shoot. That's what kids also like, of course. They want to play games. But I'm not a shooter man.
So I make it with a vortex. A vortex is a little braver, or another way of thinking. So you can remove somebody standing over there or something. But there is nothing to do; it's suddenly walking and jumping. But you can shoot. So, another idea. What do I have here? We also have video. Meester Sander is somebody from Holland; he makes a lot of videos. I asked him: can I do something with your videos? So he's asking: where do we hear the letter B, or something? On the left side the different images appear, and then you have to click left or right: what is the answer? So this is only the first step that I've shown you. But we also have Meester Sander in the 3D space.

So I want to show 3D. Why? Because in September, ZIM launched the possibility to go into a 3D space. So everything is 2D, but you walk around. And you can go with your mouse left to right. And this is very amazing for kids. So they want a game. And you can even play videos in the 3D space. All made by Dan Zen. So this is also about the science day in Belgium. This is Meester Sander again. So this is playing, we are moving. You can pause, you can go forward, and you can do several interactions with the several ZIM apps. So everything is a Zapp, and you can place it wherever you want.

This is the book. We also have an interactive book in ZIM. We can go left and right by touching it. This is unbelievable: we can do it in 3D. Interactive books. If I put the little people there and I go back, it's still a Zapp. So we can make books. We can make paths. We can go like here: we have a circle, we make it bigger, move it to the right side. We can animate everything in 3D space now. So this is a big step forward for people who are 3D makers. This is another example. We can even use the VR glasses. At the bottom you have a button; it's not supported on a plain website on a computer or Chromebook, but if you have the glasses on, you push the button below to enter. And then you can go forward, backwards.
You can then animate things, make them bigger, change the scale. So this is very amazing for me. This is for the high school in Belgium; I made an example for Thomas More in Mechelen. So this is a big step forward for us. The game I made then is also on the right side. But it's not with animated panels, it's only with cylinders. Also, we have a game module. You have to import some modules, and then you can make a talking character. We have sensors, ZIM Cam, for making games interactive, for playing music. You can wave at your cam, and it's interactive. So this is the magic of code. I pushed the wrong button. So if I go to number 5, demo 5, it's very cool. We have like a cursor, but now your cursor is your hand. You move your hand onto the red button circle, and it's activated, and you can place it somewhere else. So this is what I want to say about the magic of ZIM. ZIM can do everything. I didn't push the button, I just moved my hands. So you can make a UFO landing on the world and giving everybody some happiness, some love.

The next thing I want to show is speech. Speech was launched two weeks ago, and I did some pioneering work for it. So this is a little man, a boy. And of course, I want to talk to him. So I have a red button. "Hello, everybody. How are you doing?" And he's typing what I'm saying. So a big applause for Dan Zen for implementing it, over in Canada, thank you very much. All over the world, speech technology is huge, and I've seen nobody else implementing it into apps like this. And we can do it. Also, for example, a robot: Bassie en Adriaan is from the '80s. I was a huge fan of Bassie en Adriaan from Holland, and I made a robot because of that. So I'll show you. I made the finger on the left side so you know what you have to do. "Hello, I am Robin, the robot." And then, for the next sentences: "Hello, I am Robin, the robot." Again: "Hello, I am Robin, the robot." Yes, robot, of course. And then you can play it.
And Robin is Robin in any language. So you can choose the language you want. "Hello, I am Robin, the robot, of FOSDEM." Even my name: "Thank you for listening, from Karel Rosseel." Thank you very much. So I don't know what to say anymore; this is what I was liking a lot. But the problem is that children can say every word. So what will they say? Words you don't want to hear as a teacher. So what did I invent? A list, where children can only say the words of the list. So here is Sinterklaas. He comes, he wants to bring presents. What are the presents that he brings? A "vliegtuig", so a plane, a bear. And if you say "vliegtuig" to the circle, you get the word and you see the image. But you cannot say just any word: it only listens to the words on the list. That's extra magical. So I'm very excited about it.

And we even have controllers possible. So you can also play games with your controllers. If you go to part 2, you can use the mouse for now, but if you have a Bluetooth connection or a controller, you can collect all the buttons, the circles. And when you are finished, you get love. OK, so I want to inspire my kids of five years and one year and eight months. I don't know what more I can expect from Dan Zen. But if you have questions, ask him. He already made the Groovity app. It's a PWA app: a progressive web app. Progressive web apps can be installed on your phone and work without internet. So we have Groovity with many pages. You know, you have seen apps with one page. Of course, for children of the kindergarten we want multiple games at once. So these are all stages after each other. You can play around; it comes back, it becomes bigger. I am on KlasCement in Belgium. It's a big website where teachers can share all kinds of things, you know.
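The "only words from the list" trick can be reduced to a few lines: compare whatever the speech recognizer returns against an allow-list and ignore everything else. A plain-JavaScript sketch of that filtering step (the real ZIM speech API will differ; `matchAllowed` is a made-up helper name):

```javascript
// Only accept recognized speech that matches an allowed word list,
// ignoring case and punctuation. Returns the matched word or null.
function matchAllowed(transcript, allowedWords) {
  const heard = transcript.toLowerCase().replace(/[^\p{L}\s]/gu, "").trim();
  for (const word of allowedWords) {
    if (heard === word.toLowerCase()) return word;
  }
  return null; // anything off the list is simply not "heard"
}

const presents = ["vliegtuig", "beer", "pop"]; // plane, bear, doll (Dutch)
console.log(matchAllowed("Vliegtuig!", presents)); // → "vliegtuig"
console.log(matchAllowed("something else", presents)); // → null
```

In the classroom app, only a non-null result would show the image and speak the word back, so off-list words never reach the screen.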
And it's very liked by the people, the teachers in Belgium. So now they want to make things themselves, and that's why I'm making Slate better and better. Also, the physics. We have a ball, for example, that jumps. This is a basic example of ZIM. You have to push the ball, and it can go outside of the ZIM frame; there are no limits at the border. So you push, and then it comes. And now, cool: we can add sound to it, because we have the speech in ZIM. So the kids hear: it's number one, it's number two. They can learn counting to 10, for example. This is what I showed you: the explosions already; ZIM coding with Slate. So these are all the websites you can visit. And also on TikTok, on YouTube, on Instagram, we are everywhere on all social media. Just take the time to explore ZIM, because it's a lot. And if you have grown out of Blockly and all the block dropping, just go typing. That's the thing I want to say to everybody.

And last but not least, in March we have Flanders Technology and Innovation. It's a festival about what we invent in Belgium. We invented the JPEG compression, by a woman from the university in Limburg, the province. And I made a little VR glass where you can see a video playing while you are turning around. [A promotional video for Flanders Technology and Innovation plays; the audio is only partly intelligible. It mentions the first windmills, imaging work in Leuven, research in Limburg, and asks whether we look to the future or calm down.] Because that's exactly what we want with Flanders Technology and Innovation. Do.
Do it. Do it. Here in Flanders and all over the world. Yes, even this world. To stimulate, inspire and support talents. The problems of tomorrow are solved now. So please join us. Thank you for coming to see how the world changes, and above all, how it started with Flanders Technology and Innovation. So you see the head of the person: this is Flanders, by night, the lights of Brussels, Antwerp, the coast. If you know, we have two parts in Belgium: Wallonia and Flanders, of course. And I have this in a Zapp: you saw a website, and I also made the Zapp. You can go to the full mode. You click on the QR code, and you can do it yourself. But here I put it into the VR glasses, the Quest, each eye with an image. That's what we want: experiment. So you see, you take a 360-degree picture and you can put whatever you want into the space with ZIM. Thank you very much. APPLAUSE See you online or somewhere else. Any questions? We have, I think, one minute for questions. Yes, any other questions, please? You are all amazed at what ZIM can do.
Mainline Linux on Qualcomm SoCs, are we here now ?
Thanks. Welcome to my talk. It's my first time at FOSDEM, so excuse any hiccups. I will do a summary of where we are now with Qualcomm SoC support in mainline, because I think it's time to take stock. I'm part of the Linaro Qualcomm Landing Team; I joined one and a half years ago. My main daily work is actually Qualcomm platform support. Together with Caleb, I'm co-maintainer of the U-Boot Qualcomm board support, and I bring new platforms upstream in Linux. I also maintain and develop other pieces of Linux, namely the Amlogic SoCs and the DRM bridge and DRM panel drivers. I've been working only on upstream Linux and U-Boot for the last few years, so I have a lot of patches upstream in U-Boot and Linux.

So, I'm part of Linaro. Linaro was founded to enable Linux, and software in general, to work better on ARM. We're basically helping vendors and product makers make better products that work on ARM. We have plenty of services to help the whole software stack run better on ARM, and open source is at the heart of Linaro: we mainly work on open source software. Qualcomm joined Linaro 10 years ago because they wanted better open source support, which at the time was minimal. They joined to support Linux, but it quickly grew into collaboration in plenty of other places, and so far so good: Linaro is happy and Qualcomm is happy with the situation.

In the last 10 years, Linaro and Qualcomm pushed a lot of really huge features for the ARM ecosystem in Linux, namely the energy-aware scheduler, which really changed the way Linux schedules correctly across cores. Qualcomm also participated in standardizing the software stack on ARM servers. The Dragonboards are the reference today to test AOSP, for example. We have CodeLinaro, which is the principal git hosting for Qualcomm and for Linaro engineers. And for the last three years, we have been pushing the flagship mobile platforms upstream.
So this year, in the last three months, I pushed the Snapdragon 8 Gen 3 upstream, and it was 98% supported two months after the announcement, which is pretty cool. So the agenda: where we came from 10 years ago, where we are now, the supported devices, a demo, and what's remaining.

So, where were we 10 years ago? Ten years ago, Qualcomm and the vendors using Qualcomm SoCs shipped kernels with something like three million lines of changes. It was basically a separate kernel within a kernel. This was a problem, but a hard one to solve: how do you upstream that much change into mainline Linux? That's why Linaro started the Landing Team, to fix this. And this is a graph I made to show, over the last 10 years, how Qualcomm managed their downstream kernels. Initially, they used each long-term kernel for a very long time and kept accumulating new SoC support on it over each time frame. For the last four years, the company changed strategy: they stopped adding new code and mostly change existing code. I think the reason is first the Android strategy with GKI, and second that mainline Linux now has enough support, having gained the principal missing architecture pieces over time. This is what I posted eight or nine years ago, and it was true: mainline support was mostly nonexistent. Qualcomm was the one SoC vendor that was upstreaming almost nothing. Thankfully, that changed.

So Linaro worked on Qualcomm-specific features over the last 10 years. The biggest feature was remoteproc, to handle the DSPs, because before, Qualcomm had a complex custom solution to speak to DSPs, which was like two million lines of code just for that. The biggest work we did was to implement it correctly upstream. We now have a fully integrated way to speak to the DSPs, and it works really, really well. The other big feature of Qualcomm SoCs is interconnect, because Qualcomm SoCs are very complex: you can fine-tune any data path in the SoC, and you can change the bandwidth.
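As an illustration of what the interconnect framework exposes: a consumer device in a devicetree declares which data paths it uses, and its driver can then vote for bandwidth on them at runtime. This fragment is schematic, with made-up node and phandle names, not copied from a real Qualcomm dtsi:

```dts
display@ae00000 {
    /* Path from the display engine master to DDR memory.
       The driver looks this path up by name and votes bandwidth
       on it (e.g. via the kernel's icc_set_bw() API). */
    interconnects = <&mmss_noc MASTER_MDP0 &mc_virt SLAVE_EBI1>;
    interconnect-names = "mdp0-mem";
};
```

Every high-bandwidth block (display, GPU, USB, crypto) carries a vote like this, which is why the framework was a prerequisite for booting the big SoCs properly.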
You can change the performance of any data path. So it was a huge feature, and it took a very long time to upstream it. Also, the Venus video decoder was complex, because it needed the remoteproc support first. The DSP audio support also needed proper DSP handling. The DRM driver is a huge beast, because the graphics display engine is really complex and supports a lot of features. Lately, SoundWire support was upstreamed for Qualcomm and other platforms, and we worked on plenty of tiny but very time-consuming subjects. All of these are needed to actually boot a platform.

So this is a graph of the upstream contributions. You can see it was quite slow at first, because all these features are really complex to upstream: they are either Qualcomm-specific, or very complex, or don't fit in any framework. So it took like seven or eight years to push the base features needed to boot a high-end device. And in the last four years, because we had completed support for all the small but very important features, we were finally able to boot high-end commercial phones. We had a lot of contributions from Linaro, Qualcomm and also the community, which explains the huge peak in the last two years.

This is a graph of the supported boards over time. Ten years ago we had only two boards, the Dragonboards, and now we have about 300 boards, which is huge. And most of them are community boards, not reference or base boards. These are the newly supported boards over time: for each release, the number of supported boards added. You can see in the last 10 releases a huge amount of new boards were added, which is great, and the community helps a lot here. As for the boards: like Caleb said, the historical Dragonboards were the first really publicly available boards in the SBC form factor, and they really helped start the mainline development. And while they were like low-end SoCs, we supported a lot of features on them.
They support cameras and quite high-end features, so they helped develop the baseline support to eventually enable high-end SoCs. Then, like Caleb said, there are the robotics boards. They are quite expensive; it's the current Qualcomm offering in the IoT world, and the aim is to support them fully upstream. Each board is mid-end or low-end, so it's quite diverse, and it helps support all the new features. Then there are commercial phones which run very, very well. You shouldn't expect all the features for daily usage: you don't have haptics, you don't have camera, but they work fine and you can boot and actually use them with Wi-Fi, Bluetooth and storage. You have a few tablets and convertibles running mainline Linux, like the Lenovo Yoga C630. And these are the Qualcomm high-end reference devices: those are the devices we use daily to upstream the high-end platforms. This one is a one-year-old platform, this one is two years old, and this one is actually running this presentation. And those are the specific Qualcomm reference devices, with test points, used by Qualcomm engineers to develop Android; we upstream mainline Linux support with them.

So, as I said, I was upstreaming the Snapdragon 8 Gen 3 support, the latest Qualcomm high-end SoC, which is in the Samsung phones announced two weeks ago. And in Linux 6.7, released shortly before the Samsung phones were announced, we already had display, UFS, PCIe, USB, thermal, CPUfreq, QSPC, suspend-resume and crypto working on mainline Linux. Check out Linux master: it works. In the meantime, we developed audio, DisplayPort Alt Mode, full DSP support (modem, compute and audio), USB PD and charger; GPU is the last remaining one, and I won't talk about it. The flagship device you can use today is the Lenovo ThinkPad X13s. It's actually the best platform to use Qualcomm devices on: it's really powerful and you can use it daily.
My colleagues are actually developing mainline Linux on this platform. My colleague can use it for about eight hours of work, and almost everything is supported. This is an example of what works: you have the GPU, storage, keyboard, thermal, USB, suspend-resume, audio, and you can boot over EFI. But obviously it's still a work in progress, like every piece of software. The most important gap is the camera. The camera doesn't work; it's complex, due to the sensor producing raw data and Qualcomm not wanting to upstream that stuff. It's a work in progress: we have something working, it's not public, we are working on it. Plenty of other small features are missing, like the embedded controller and power management. Power management is never finished; it will never be perfect, so we're gaining milliamps every release. It's constant work. There are always some small Wi-Fi and Bluetooth issues. Audio needs active speaker protection. This is a big, modern feature: all the new, modern audio needs active speaker protection, because it's no longer included in the codecs. And some things are still missing, like the fingerprint reader or video decode acceleration. But we aim to support all this in the short term.

So today, if you want to test mainline Linux on the X13s, you can use Fedora, Armbian, Ubuntu or Debian without changing anything. It will install directly and boot, and you can use it daily. So this is a great, great advancement. So, demo time. I mean, no real need for a demo, because I'm running it: I'm running the presentation on a Qualcomm device. So yeah, for example, this is the SM8550. You can play a video, for example. It works fine. You can switch windows; I'm still in full screen. You can see everything is fine, the video is still running, so the GPU works very well. Oops. Demo effect. Okay. So... to show it's really usable: you have Wi-Fi and Bluetooth working, and the GPU, and this platform is one year old, but I got the hardware only two weeks ago.
So it was great. And the support for this board is actually on the list; it's made by the Qualcomm ARM maintainer, so it should be part of 6.9. So globally, what's remaining to properly support the Qualcomm SoCs? Power optimization: it's long-term, nearly infinite work, because Qualcomm hardware is complex and there is always more to gain. Performance: like I said, each data path can be optimized, so it's also a long journey to support power and performance tuning. There are still some advanced graphics features missing, mainly for non-phones and non-laptops, like HDR and multi-plane support. The video decoding accelerator is a work in progress; we're working with Qualcomm on it. Camera support is a big feature. For audio, we still need DisplayPort audio, to support audio over HDMI or DisplayPort, plus speaker protection, the sensor hub for the phones, haptic feedback and the vibrator. And there are the new platforms, because each year we have two or three new platforms to support, whether computers, phones or IoT. All of that keeps us working a lot.

So we need the help of the community, because we need testing and we need to support more devices. Thanks to the community, we have had the largest arm64 changes in the last years: every single release, we are a top contributor, because it's changing really actively. We are really supporting mainstream devices: phones, laptops, modems, accessories, convertibles. And we are working on U-Boot; Qualcomm is porting it to new devices, and it will simplify installing distributions. And if you want to know the status of each SoC, you can go to this address on our GitHub (linux-msm). It gives you a nice overview of the support. The last line is the Snapdragon 8 Gen 3, so all the yellow cells will be green in four weeks. It's really kind of cool: we simply describe each feature in a file, and it automatically generates a website.
So thank you for listening. I was happy to present the state of the Qualcomm SoC support and demo it live. And it works fine, so no demo effect. Thank you very much.

Very nice. Does anyone have any questions? Yeah, hi. When can we expect Qualcomm to start upstreaming support for the Linux that runs on the modems? On the modems? I have no idea, I'm sorry. Another question? Thank you. The question is: is Linaro or Qualcomm considering doing any upstreaming for legacy platforms, for older chipsets? We do it daily. Okay, so this is also happening? Yeah, we continue adding features for all platforms daily, the community helps a lot, and we are testing it. In fact, Qualcomm is pretty consistent in the firmware interfaces, APIs and registers, so we in fact support all devices quite consistently. And the other thing you mentioned, specifically on camera: on Android there's a lot of work, and you have a lot of out-of-tree drivers. Could a Qualcomm platform instead get everything supported directly in upstream Linux? I hope so. And a question here, one second. Hello, very nice talk. Any plans for the Spectra ISP? So yeah, it's the same question: I don't know, it's not in our hands. Another one? Okay, I'll pass the mic. You talked about many distros already working. If we had, for example, a root filesystem from another distro, is the bootloader situation the same as on mobile phones and their SoCs? Or can we just expect to boot from UEFI or similar? So for the laptop, there is a functional UEFI shipped with the laptop, so there's no need for U-Boot. It's not perfect, but it works fine. You can directly install Fedora via UEFI on the laptop when you open it. So it works. Thank you. You mentioned something about video decoding. How exactly will that work? Will there be a VA-API driver, or will it use something else? Today, there is already a Venus V4L2 driver for the old platforms.
And we are working to support the new platforms using V4L2. Qualcomm wants to push support for the platform, so we need to find a way to merge it and make it prettier. But yeah, V4L2. Okay, thanks. Another question, anyone? Yeah. Hey, thanks for the talk. I had a question about the availability of certain documents required to write a lot of the drivers. Is Qualcomm making those documents available to the public? No; as usual in this industry, they don't want to document the hardware publicly. So for regular people who want to help, it would be reverse engineering? Yeah, from code. I mean, I've implemented all the Amlogic support using code only, almost no documentation. So it's hard. We need documentation for the more complex features, but most of the time we use code, even us. Because in the documentation you have registers, but you don't have the state machine; you don't have the behavior of the hardware. Okay, we could fit in another question if there is any. Otherwise, yeah. Okay. Yeah. I'd actually like to continue the question that was just raised. So how does it work then? You sign an NDA with Qualcomm, get the docs, and you can write the code, but you're not allowed to document it? Yeah, or speak about it. That's how it works. Yeah. Gotcha. Please give another round of applause for our speaker. And it was really all running from this device here, the board. No laptop. Yeah.
VoLTE for FOSS
I'm not sure if you should attach the mic. Maybe one of you should attach the mic. Okay. Do you want this one, or a handheld mic? I'll give you this mic. How can I use it? So put this in your pocket, and this one you attach to your... Yeah. Is it correct? Because I'm bad with mics. Yeah, that should work. Can you put it up here? Maybe like this? No. I'll just speak and test it. Okay, let's try it once. It shows here. Yeah, it shows there, but not... It's here. I have a green light.

Alright, next up we have a talk called VoLTE for FOSS. It's voice over LTE; I'm pretty sure we all want to see that on Linux mobile. This talk was originally supposed to be given by Marius from UBports, but Marius is not here today, so please give it up for Nikita and Ivan instead.

So, hello everyone. I'm Nikita Ukhenko from the UBports community, also working for Jolla. People mostly know me from Telegram as NotKit. Since Marius is not here, we can only be his replacement, but we'll try to share what we have learned so far while trying to make VoLTE work on Ubuntu Touch and other mobile Linux distros. Currently it's still a mess, but I hope we can get more people involved, and if you want, you can stay afterwards to discuss how we can implement VoLTE on more Linux distros and what we can do together. Can you turn up the volume? Put the mic up or something. Yep. Okay, is it much better now? Yeah. Great. Just go on, yeah.

So, I expect people in this room are familiar with voice over LTE and what it is for, but briefly: it's a communication standard for voice calls over LTE networks. And there are similar standards for voice over Wi-Fi, and for voice over 5G networks, which is called VoNR.
The main reason we have to worry about this is that GSM and 3G networks are now becoming a scarce resource, and if we want to make calls from our mobile Linux distros, we need to implement VoLTE at some point. And if you have voice over Wi-Fi, it allows other cool things: when you're roaming, you can connect to your mobile operator's endpoint and make calls to your home country at local prices, not roaming prices.

So, let's start with how it currently works on Android. There's a picture from a Linux telephony website, but the point is: there is the modem firmware. On top of it, there is a modem interface library, a library that is used by the RIL, which stands for Radio Interface Layer on Android. On top of that, it provides a HIDL server which implements the HIDL radio interface; on recent Android versions it became AIDL instead, but that's not what we care about at the moment. The frameworks part is what implements the communication with the HIDL radio interface, and there are vendor-specific IMS parts which plug into the Android IMS interfaces. But the vendor implementation is closed source and unfortunately device-specific as well, or at least chip-specific.

When we go to Ubuntu Touch, we keep those four bottom layers, but we don't have the frameworks anymore. And here the problem comes: the IMS parts are provided by the frameworks on Android, and instead of the frameworks we have oFono, which talks to the radio interface, and oFono is talked to by telepathy-ofono or other layers, depending on the distro. On Sailfish OS we use telepathy-ring, but that's just implementation details. So, if we don't have the IMS part of the frameworks, what can we do?

From here we have multiple options. First, we can reimplement the Android frameworks part of IMS, so it can still talk to the vendor interface. That's how it's currently done on Sailfish OS for the Xperia 10 II and 10 III.
It's also been tested to work on other Qualcomm devices, but unfortunately the plugin by Jolla is currently closed source, as it relies on Qualcomm-specific headers, and I think Jolla is afraid of legal issues if it is publicly released. On Ubuntu Touch we've been trying to use the same plugin, but the problem is that the Qualcomm IMS part is a black box, and sometimes it works and sometimes it doesn't, for no apparent reason. It's quite hard to understand what's happening, because basically all the oFono part is sending is a request to the modem: hey, can you connect to IMS for me? And the modem just answers: yes, I can. And that's it. You don't know when it's connecting, why it's connecting, how it's connecting; it's a complete black box. So, as you see in the picture, we can try to write an IMS plugin and plug it between the radio interface and oFono or some other telephony layer. It works, but it's device-specific and a bit of a pain. I've been trying to write a similar plugin for MediaTek devices now, and the idea is very simple: you tell the modem, okay, please enable IMS, here is the IMS APN, connect to it, and copy the config from some path. But whether it works or not is a bit of luck, depending on your carrier. The positive of this approach is that you don't need 4G network knowledge, and you don't need VoLTE protocol knowledge. But it's a black box, and if it works, you don't know why. The second option we have is very similar to the first, but it may be interesting for mainline people, and that's why I'm mentioning it. We can ignore the HIDL Android parts completely and just write a library or a driver that talks to the modem firmware directly. That's how it currently works on the Pinephone, actually, because on the Pinephone the Qualcomm modem is a separate USB modem and you can tell it via QMI to enable IMS and VoLTE.
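A minimal sketch of the black-box flow just described: tell the modem firmware to enable IMS, and get only a yes/no back. Everything here (the `Modem` class, `ImsConfig`, and the command names) is hypothetical and only for illustration; real implementations go through oFono's binder plugin on Qualcomm, or vendor AT commands on MediaTek.

```python
from dataclasses import dataclass


@dataclass
class ImsConfig:
    apn: str = "ims"       # the IMS APN the modem should attach to
    config_path: str = ""  # vendor config blob, copied "from some path"


class Modem:
    """Stand-in for the modem transport; a real one talks QMI/AT/binder."""
    def __init__(self):
        self.registered = False

    def send(self, command, **kwargs):
        if command == "enable_ims":
            # The firmware just answers "yes I can" -- when, why and how it
            # actually connects stays inside the black box.
            self.registered = True

    def query(self, what):
        return self.registered if what == "ims_registered" else None


def request_ims(modem: Modem, config: ImsConfig) -> bool:
    """Ask the modem to register to IMS; all we learn is whether it did."""
    modem.send("enable_ims", apn=config.apn, config=config.config_path)
    return bool(modem.query("ims_registered"))
```

This is exactly why debugging is so hard: the only observable state is the final yes/no, with no visibility into the registration process itself.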
So it's the same black box approach, but at a slightly lower level, and I don't think we will use it on Halium distros, because it would cause a mess if Android and direct modem interface communication were done at the same time. But it's possible, yes. This approach requires at least a little understanding of the network stack, and you also need to know your modem firmware protocol: on Qualcomm it's QMI, as mainline people probably know, and on MediaTek it's, interestingly, AT commands, though of course vendor-modified AT commands. And then the most annoying approach, but also maybe the most interesting: we can ask the modem to set up a data connection to the IMS APN and interface with the mobile operator's services at the network transport layer, and it becomes a real mess of standards and protocols. Basically, that's the end goal, but we wanted to show you how the VoLTE stack looks, and that's the picture. It's not only VoLTE; it's the full 4G stack, and this is where VoLTE sits in it. This is the TCP/IP network stack, this is the transport protocol used for the 4G network and the VoLTE network, and then we go up through the stack which we showed you previously. So our end goal is to implement this in software, so it would be open source for every distro to use, but as you can imagine, it's quite a challenge. It also allows for some interesting things: there is a project to perform the SIM card authentication and set up an encrypted IKEv2 IPsec tunnel to the mobile operator's endpoint from your laptop. It makes it a bit easier to debug if you can just use the phone for authentication and then set up the voice over Wi-Fi connection from your host PC. And there are multiple projects that try to implement open source telephony. The most prominent one is currently doubango for the IMS services, but sadly it has been unmaintained for the last five years or so.
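A small aside on the laptop voice-over-Wi-Fi experiment just mentioned: the client first needs to find the operator's ePDG endpoint, and 3GPP TS 23.003 defines a well-known DNS name derived from the SIM's MCC and MNC. A sketch:

```python
def epdg_fqdn(mcc: int, mnc: int) -> str:
    """Well-known ePDG address per 3GPP TS 23.003; MCC and MNC are
    zero-padded to three digits in the operator domain."""
    return f"epdg.epc.mnc{mnc:03d}.mcc{mcc:03d}.pub.3gppnetwork.org"


# The IKEv2/IPsec tunnel for voice over Wi-Fi is then established against
# whatever this name resolves to, authenticating via the SIM (EAP-AKA).
```

Many operators also ship their own ePDG address in carrier configs, so the well-known name is a fallback rather than a guarantee.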
Doubango is in a working shape, but you'll probably need to iron out a lot of rough edges later on. However, courtesy of Mohammed, who also wanted to make it here to Brussels but was suddenly refused a Belgian visa, we have a screenshot of doubango connecting to the mobile operator's endpoint via an IPsec tunnel for voice over Wi-Fi. It tried to receive a call, though it couldn't, because the audio part wasn't working yet. But at least it could receive an SMS. The strange symbols you see are there because SMS uses UTF-16 text encoding and the console is UTF-8. It did receive something, and that's where we currently are. To summarize the state of things: we have voice over LTE working, device-specifically, using the Android radio interface on Sailfish OS and on Ubuntu Touch with the Sailfish OS plugin, but only for specific Qualcomm devices. We have something cooking for MediaTek, and we are trying the third option of implementing the IMS services in software. Both of them are possible, but we are not there yet. Since Marius is not here, he cannot speak about all the operator weirdnesses he encountered along the road, but we are open for discussion, and if there are other mobile projects that want to get voice over LTE working, it would be nice to see how we could collaborate. Do we have any questions? Maybe in the chat room? There's a question. Can you pass the mic? Okay, so you are Ubuntu Touch, did I get that right? Who is developing oFono? I was wondering who is pushing these kinds of efforts forward. It's an interesting question. There are multiple forks of oFono. The upstream one, I don't even know who its developers are. It was sponsored by Intel at some point, for MeeGo, but now it has some community maintenance.
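Going back to the garbled SMS in that screenshot for a moment: the strange symbols appear because a UCS-2/UTF-16 payload was dumped to a UTF-8 console, which is easy to reproduce:

```python
# SMS text is commonly carried as UCS-2 (UTF-16BE). Writing those raw bytes
# straight to a UTF-8 terminal yields NUL bytes and mojibake, not the text.
payload = "Hi".encode("utf-16-be")                    # b'\x00H\x00i'
garbled = payload.decode("utf-8", errors="replace")   # what the console shows
correct = payload.decode("utf-16-be")                 # 'Hi'
```

The fix on the doubango side would simply be to decode the payload as UTF-16 before printing it.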
But the oFono version we are using is developed by Jolla for Sailfish OS, and it's heavily forked from the upstream oFono, sadly. There have been efforts by Adam Pigg to bring the latest oFono changes back into the Sailfish OS fork, so it's getting closer to upstream oFono, and it can be used beyond Sailfish OS. And the oFono binder plugin I've been talking about has been developed by Slava Monich, also at Jolla. Is there any cooperation with Jolla, or are you just taking their stuff and developing it forward? I have a Sailfish device, so I'm interested in it from the user perspective. It never worked for me, by the way. Obviously, the stuff in the fork is open source, and when it's possible, we try to make upstream merge requests, but the code bases have diverged quite a lot, so currently we are taking from their oFono and building on it. There's a question in the chat. Somebody asked: on the Librem 5 I learned that whether voice over LTE works or not can be very carrier-specific, and that carriers whitelist or blacklist specific modems. Is there anything we can do in this regard, like spoofing modems? So, there are multiple carrier-specific things. First, each modem has vendor-specific configs provided by its vendor. For example, on Qualcomm you have the vendor firmware_mnt partition with an image/modem subfolder, and for many carriers there is a carrier-specific modem firmware configuration. It's very much a signed black box; we cannot do anything about it. Of course, we can try to load the configuration from a different carrier and whatnot, but as Alan from Sony would say: do not do this, you will break the carrier's network. The detection of the modem at the carrier level is mainly done by a few parts. First is the IMEI of your phone, which you cannot spoof in most cases, and there is also the user agent. The user agent used when connecting to the IMS service at the network stack level can, of course, be spoofed. Okay, thanks. Any more questions from the room?
Yes, at the very back. One sec. So, a bit related to this: are you encountering any pushback from carriers, because you could potentially be messing with their stack? I guess we are too small for carriers to care about us, unless we break something too badly. So, not at the moment at least. Do we have another question in the room? Yeah. Hi, just by chance I saw on the schedule that later today there's a talk about providing VoLTE using OpenSIPS. Have you heard of OpenSIPS, and is that interesting for us? Yeah, I haven't heard of them, but it would be nice to check the talk and see if it can be used. Just to expand on the previous question a little bit: in order to not have problems with the carriers, I'm also trying to set up a private 4G network with a software-defined radio, so we can test whatever we like without breaking stuff. Okay, maybe from the chat again: are there any plans to upstream oFono changes to the kernel.org oFono version? I don't know, I cannot speak for the oFono developers on the other side. Okay. Is there another question? Okay, yeah, I guess then we close it. Give another round of applause, please. Don't forget the mic. The mic? Yeah, I'll take it. We talked in a video chat at some point, from Sysmocom. Yeah, yeah. Nice. So this is also your field, how do you say that. Did you get any further with it? No, not really, no. I think it's a bit like with everyone: we get it working, but can't figure out why some things work and some don't. Yeah, so I don't know, we just need people to try different approaches. Is this your bag? Are you also here for the next mobile talk? Yeah, I guess I'll check it out. Thank you.
Here's the HDMI and here's the... Just be safe, whatever you need. Okay. Nothing too specific. And you have some more minutes before the start, so it's like 9 minutes. Okay. I have so many problems with the intros. I get in front of the camera and I'm like: I can't even say my name. Seriously. I should have written something down. I still have a few questions. Have you talked before about the Dragon Messenger? No. I was just wondering.
Universal Serial Bug - a tale of spontaneous modem resets
Okay, thank you all for coming. The next talk is Universal Serial Bug, a tale of spontaneous modem resets, from Sebastian. Give a big round of applause, please. So hi, I'm Sebastian Krzyszkowiak, also known as dos, and I have many hobbies. I make games, for example, maybe you have played some of them. Among the other hobbies there is also mobile GNU/Linux, which started many years ago when I got an Openmoko Neo FreeRunner, and eventually I was contracted by Purism to work on the Librem 5 phone, which is this chunky boy here. Within this device there is a USB-connected cellular modem on an M.2 card, the Broadmobi BM818, and this is the main character of our talk today, because we had a problem with it. The problem manifested itself in this way: sometimes, occasionally, seemingly at random, the modem would just disappear from the bus. It would be just as if it was unplugged from the socket, and it would come back right away. Even though it did come back, it was still pretty disruptive, because the network interface would go down, and the audio routing would be torn down if you were in a call. So this wasn't really great. The modem wasn't crashing, it wasn't rebooting, because it maintained its state, or at least some of its state; it was just as if you would pull the plug and plug it back in very quickly, with some external power still connected to the modem. There were also other instances where the modem wouldn't come back, or where the whole USB bus would actually die together with it. However, we won't be talking about those today; even though they looked related, they turned out to be separate issues that weren't as technically interesting as these resets turned out to be.
So this talk will be some kind of debugging case study. I would just like to talk about how we identified the issue, how we debugged it, and how we worked around it in the end. At the start I would like to note that this is not some groundbreaking research, this is not a new discovery, because it turns out that this was known for ages already. But I think it's still not common knowledge, and it turns out that it can still bite, so I thought that this would be an interesting thing to talk about and to share. In order to understand what's going on, I'll quickly go through the topology of the devices on the Librem 5. We have two M.2 slots inside: one of them is the cellular modem, and the second one is the Wi-Fi and Bluetooth card. There are two USB controllers: one of them goes to the USB-C port and is dual-role, and the other one is connected to the internal USB hub, and therefore works as host only. The internal hub is a USB2642, which has three downstream ports. One of them is internal, as this hub has a microSD reader built in. The other one, the one that we will be interested in today, is the modem's port that goes to its M.2 slot. And there's also a third port that goes to the Wi-Fi M.2 slot; however, none of the cards that we use on this phone actually use USB there, they use different interfaces, so this third port effectively remains unused.
So, Universal Serial Bus. I'm just going to assume that everyone here knows what USB is, we have all used it, so I won't read Wikipedia definitions. I will, however, go through some of the properties of USB, either to remind you or to make you aware of how this works on the wire. The first thing: devices can be suspended. This is a power management feature; you can put a USB device to sleep. Theoretically all of them can be put to sleep; not all of them react well to that. The specification says that they should, but yeah, reality is different. There are two ways in which you can suspend a device: you can either selectively suspend a single port, or put the whole bus into so-called global suspend. Another thing is that no device on the bus speaks unsolicited. All communication is actually driven by the host; it's the host that keeps polling each of the devices for whether it has something to say or not, and the device only responds to what the host is asking it for. There is one exception: when a device is suspended, it can signal that it wants to be woken up, but that's the only thing it can signal on its own. One interesting thing, and I think not everyone is aware of this, is that all USB hubs are technically smart. A hub is, on its own, a proper USB device that you can talk to, that you can send commands to, and that can respond and send some status. The features that you can control this way vary, so not every hub will, for instance, provide power switching control. However, this is exactly how suspend is implemented: you send a message to the hub, and the hub interprets it and does it.
On the wire, USB works with two wires that form a differential pair, and on two wires you can have four states. However, one of them is illegal in the specification, USB doesn't use it, so we are down to three states. They are called J and K, the two states where one of the wires is high and the other is low, and SE0, which is when both of the wires are low. There are some differences between the various speed modes of USB 1 and USB 2; we won't be going into newer versions, as they are different, and the modem here uses USB 2. However, the differences between USB 1 and 2 are small: the states are similar, they use different voltages, but logically it's basically the same thing. So, let's go back to the bug. At some point we noticed that those modem resets are somewhat dependent on movement or signal strength. An easy way to trigger them was, for instance, to ride a train: you could often see the cellular connection icon just disappearing for a moment, or if you were downloading some file, it could drop out, and that was pretty annoying. And sometimes, in some places, it basically never happened, like at my desk where I worked on it, while it quite often happened in my bedroom, for example, where I would wake up to a bunch of resets that happened overnight. So, in order to look at those issues, we have to check some logs, like the ones I showed earlier, but that's not enough. Linux has this pretty useful feature called dynamic debug: pretty much all the kernel drivers are sprinkled with debug print messages; however, they are by default compiled out for performance reasons. But you don't need to recompile the kernel to put them back in. They can be dynamically patched in, and this is how you can do it: using this interface, this invocation tells the kernel to re-enable all the print statements from the C files in the drivers/usb/core directory of the kernel source tree.
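The dynamic debug step just described boils down to writing a query string into debugfs (it needs root and a kernel with CONFIG_DYNAMIC_DEBUG). A sketch, using the same query the talk describes:

```python
# Re-enable the compiled-out debug prints for drivers/usb/core/ at runtime.
# Shell equivalent:
#   echo 'file drivers/usb/core/* +p' > /sys/kernel/debug/dynamic_debug/control
CONTROL = "/sys/kernel/debug/dynamic_debug/control"


def usb_core_debug_query(enable: bool = True) -> str:
    """Build the dynamic_debug query: +p enables pr_debug, -p disables it."""
    op = "+p" if enable else "-p"
    return f"file drivers/usb/core/* {op}"


def apply_query(query: str, control: str = CONTROL) -> None:
    with open(control, "w") as f:  # requires root and a mounted debugfs
        f.write(query + "\n")
```

The same mechanism works per-function or per-module, so you can narrow the flood of messages once you know which code path matters.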
So, we did that, and it told us a bit more. This is an example of such a reset happening, and it turns out it happens when the device wants to wake itself up from USB suspend. Here we can see the status given by the hub for its ports. Port one is the microSD reader, and we can see 0507, which means: the 5 says that it is connected and enumerated properly, the 7 that it's suspended, and change zero means that nothing changed. Port two is the modem, and here we can see that it's different. The 01 at the end means that it is connected; however, it didn't actually go through the whole process of connecting, so something happened there. It also means that it's not suspended, and change 5 tells us that it changed both its suspend status and its connection status, so it's just as if it was pulled from the plug and quickly put back in at this point. To compare, this is an example of when things go right. After the port has been resumed, we can see that the status is 0503, which is different from port one, because port one is still suspended while port two is already waking up, so there's a 3 at the end, and change 4 tells us that only the suspend status has changed. So this is how it looks when it works fine. That told us something, but not much. There is another feature called usbmon, which can be used to sniff the traffic on the USB bus, and it can be used with common tools like Wireshark. However, it still didn't tell us anything new; it just looked as if the device was disconnected and put back in, so it was not very useful at this level. We have to take a few steps back. The first Librem 5 phones shipped to the first customers in December 2019, and the issue about those resets was filed, actually by myself, in August 2020. So there was plenty of time to notice this issue, and it hadn't been noticed earlier, so it's safe to assume that it wasn't there initially and just came up later.
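Those hub status words decode according to the USB 2.0 hub class wPortStatus/wPortChange bit definitions. A small decoder reproducing the reading above (the hex values are the ones from the talk; the modem's exact status word is assumed to be 0x0501 from the "01" low byte):

```python
# wPortStatus bits (USB 2.0 spec, hub class); only the ones relevant here.
STATUS_BITS = {0: "connected", 1: "enabled", 2: "suspended",
               8: "powered", 10: "high-speed"}
# wPortChange bits: which statuses changed since the last time they were cleared.
CHANGE_BITS = {0: "connect-changed", 2: "suspend-changed"}


def decode(word, table):
    """List the named bits that are set in a 16-bit port status/change word."""
    return [name for bit, name in table.items() if word & (1 << bit)]


# Port 1 (microSD reader): status 0x0507, change 0x0000
#   -> connected, enabled, suspended; nothing changed.
# Port 2 (modem) after the bad wakeup: status 0x0501, change 0x0005
#   -> connected but NOT enabled, and both connect and suspend status
#      changed -- exactly as if it had been unplugged and replugged.
```

This is why the log reads "as if replugged": a connect-changed bit with the enable bit cleared is precisely what a physical replug produces.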
So, looking at the state those first phones shipped in: USB power management was already enabled in the software that was running on them; however, it turned out that the driver for the SD card reader kept the USB hub active all the time. It was basically polling it for media change status, and that's why it never suspended, so the whole USB hub was kept active. That was fixed in August 2020, and there is a somewhat lengthy thread on the Linux kernel mailing list that you can follow if you're interested. And there was also another thing: at some point I noticed that ModemManager polls the modem for signal strength every 30 seconds, and I wanted to change that, because that's not very nice on the battery, and make it listen to unsolicited messages coming from the modem whenever the signal strength changes instead. I got that working for the first time in the context of the Librem 5 in August 2020, and later I noticed that with this change the resets popped up more often. Without this change they were still there once the hub started suspending, but not as often as with this patch. So now we know that this is related to power management, and it turns out that disabling suspend makes the issue go away, so yay!
However, doing so costs almost half a watt, so not so yay. Basically, this was the main reason behind the poor reputation of battery life on those devices when they first shipped, so power management is essential and must be kept on; we just have to find a way to solve this without disabling suspend. And there was one vital observation, I think Elia was the one who observed it first: the issue only ever happens right after the hub has been suspended, never when the hub has already been sleeping for some time and then the modem wants to wake up. It's always: the hub goes into suspend, and right away the modem wants to wake itself up, and things go wrong. So this starts to smell like some kind of race condition. And what do we do with race conditions? We start playing with some timeouts, if not in hopes of fixing it, then maybe to make it happen all the time, just to learn something about what's going on. Martin Kepplinger was earlier working on that other issue that made the modem not come back; he had some progress on that, but he didn't really make progress on this one. When I took it over, I based my work on his, tried to figure out what's going on in the kernel's USB code, and started changing some timeouts. Eventually I figured out that this wasn't going to help, because at the earliest possible point where we could query the hub for its status, it was already telling us that something wrong had happened. So this didn't really help. What really helped was finding out that you could reproduce it by pinging the phone. If you pinged it over the cellular network interface and set the packet interval just right (I think it was just above two seconds), you could actually make the modem reset this way. So this helped to investigate it. At some point I also started playing with a USB M.2 adapter, to pull the modem from the phone and put it into other kinds of USB sockets in other devices. The idea was to identify
whether it was the hub, the SoC, or the modem itself that caused the trouble. I found out that, with the kernel modules for the modem blacklisted and all sleep timeouts set to zero, I could get it into some kind of reset loop: it would basically reset every second or two and keep resetting. And at some point I noticed that it depended on which hub it was plugged into. I tried pretty much all the hubs I had in my house, some pretty ancient ones as well, and with some of them I could never make it reset, while with others it was pretty easy. And whenever it was connected to the host directly, with no hub in between, it always worked, it never reset. This even applied to the USB-C port on the Librem 5 itself: when the modem was plugged into that port, the resets were never there. So it was time to start reading some specs, to find out what's going on, or what should be going on. It turns out that a USB device enters the suspend state after three milliseconds of no activity seen on the bus, and this can happen in two ways. You can send a message to the hub to enable the port suspend feature; then the hub stops sending frames to that port, so the device doesn't see activity and suspends itself. Or you can stop any communication on the bus (this is called global suspend), and then all the devices on that bus see no activity and go into suspend. When the device detects that the data lines have been in the idle state for at least three milliseconds (the high-speed idle state is SE0), it must revert to the full-speed configuration, which is J, so D+ high, if I remember correctly. Then it must sample the state of the line, so it checks what the hub or host has asserted, and if it's full-speed J, it continues with the suspend process. This is required because SE0 is also a reset signal: if at this point the bus stayed in SE0, it would mean that this is the default state the bus has been put in, and the device must reset. But if it's J, then
it means that a suspend has been requested, so the device then asserts J itself and stays in J. Now we know how suspend works; how about resume? The host can resume the port at any time. It must send the resume signaling, which is K, for at least 20 milliseconds, and after resuming the bus, the host must resume communication within three milliseconds, because otherwise the device would go into suspend again. And what if it's the device that wants to wake itself up? It cannot wake itself up until it has been in suspend for at least five milliseconds; after that it can, and it must hold the resume signaling, which is still K, for at least one millisecond, but for no more than 15 milliseconds. And the controlling hub, which is the hub that actually handles the resume of the suspended device (as there might be more hubs in the hierarchy), must rebroadcast that upstream within one millisecond and ensure that it is signaled for at least 20 milliseconds, so it kind of takes over that signaling. So now it was time to get dirty. Fortunately, I didn't have to do that myself: Eric Kuzmenko, who is the hardware guy at Purism, did it for me, soldered some wires, and put a differential probe on them in order to sniff what's going on electrically on the wires. This could then be seen on an oscilloscope and recorded, and this is an example of what's going on.
We can see here, at the beginning, some kind of high-speed communication, as it's a lower voltage than full speed. At this point we can see that the modem went into suspend: this is the J state, for some time. Then here we can see the K state, which means that it was either resumed by the host or it wanted to wake itself up, and it cycled this way for some time. Eventually, something went wrong here. To zoom in: what happened here is that there was some kind of high-speed communication, it stopped for three milliseconds, at which point the modem went into suspend, and there was a J signal for another three milliseconds. Then it went into the K state (we can assume that the modem wanted to wake itself up), and it lasted about 20 milliseconds. But then the bus went into SE0 and communication did not resume; it stayed at SE0, at which point, after another three milliseconds, the modem just suspended itself again. So this is somewhat informative, but still not enough. My hypothesis at this point concerned the grace period the specification requires: five milliseconds before sending a remote wake-up request. But I wasn't quite sure whether the wording wasn't ambiguous, because it says that the bus needs to stay continuously in the idle state for five milliseconds, and if we check here, we have two idle states: the high-speed idle state for three milliseconds, and the full-speed idle state for another three milliseconds. So where is the point where it starts counting? However, besides the English prose description, there is also a more formal state machine description in the specification, and after deciphering that, it turns out that both of these idle states count as one continuous idle state, so this probably wasn't it.
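The timing windows discussed above can be collected into one place; the constants follow the USB 2.0 figures quoted in the talk, and the small checker illustrates the grace-period hypothesis:

```python
# USB 2.0 suspend/resume timing as described in the talk (milliseconds).
IDLE_TO_SUSPEND_MS = 3           # device suspends after 3 ms of bus idle
MIN_SUSPEND_BEFORE_WAKE_MS = 5   # grace period before a remote wakeup
DEVICE_K_MIN_MS, DEVICE_K_MAX_MS = 1, 15  # device drives resume (K) this long
HUB_REBROADCAST_MAX_MS = 1       # hub must take the K over within 1 ms
HUB_K_MIN_MS = 20                # hub/host drives K for at least 20 ms
HOST_RESUME_COMM_MAX_MS = 3      # traffic must restart within 3 ms of K ending


def remote_wake_in_spec(suspended_for_ms, k_drive_ms):
    """Is a device-initiated remote wakeup within the spec's windows?"""
    return (suspended_for_ms >= MIN_SUSPEND_BEFORE_WAKE_MS
            and DEVICE_K_MIN_MS <= k_drive_ms <= DEVICE_K_MAX_MS)
```

Note that even a wakeup that satisfies these per-device windows can still go wrong, because nothing here constrains what the hub's upstream port is doing at the same moment, which is exactly the race the talk goes on to uncover.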
So we go back to getting dirty, and this time, instead of just sniffing what's going on between the modem and the hub, we also sniffed what's going on between the hub and the phone's processor at the same time, which required quite an interesting contraption to be made. But it worked, and we got some data, and this is an example of things going wrong. We can see some USB microframes here, so the host polling the devices, then some actual communication, and then nothing for three milliseconds on the modem port. On the bottom we can see the link between the hub and the SoC, and there the microframes continue. The modem goes into suspend, and after, I think here it was, two-point-something seconds, it wants to wake itself up, so it asserts K, and the hub takes over. Then, 20 milliseconds later, it stops. But look at what happens at the bottom: the microframes continue while the modem is suspended, and when it starts to wake itself up, the communication still happens, until this point; then it stops. This is the point where the hub has been suspended by the host, and after three milliseconds the hub went into the suspend process by itself. And what happened here is that at this exact point the hub started to wake itself up; at the same time it should have started forwarding frames from the host to the modem, but the hub itself was still waking up, so there was no data to transmit, and it all fell apart at this point. I started looking closely into the specification, following the state machine, and I couldn't really figure out what the hub was exactly supposed to do in this case, when the upstream-facing port went into the suspending state while a downstream-facing port was already in the resuming state, and I wasn't sure whether it was my misunderstanding or what it was. At this point in time the host has no way to know that the downstream-facing port is already attempting to wake itself up; if here we
would query the status of the port, it would say that it's still suspended; there was no indication, and that's actually how it works in the spec: that information only becomes available when the port has already finished resuming. So now I knew what was going on, and I knew what to put into the search engine, and I found an email from many years ago from Alan Stern, who is a guru of the USB and power management subsystems in Linux, and he stated that the USB 2.0 spec does not take that possibility into account. So Alan basically validated my suspicion, years before I even made it, so at this point I could safely assume that my suspicion was true. What's worse, that mail ended with: I don't know what we should do, suggestions, anybody? There were some replies, but it didn't really go anywhere. However, that mail pointed to an erratum, and the erratum said that there is a very unlikely possibility of a race condition, and that this issue can be avoided if system software suspends all downstream-facing hub ports before suspending a hub. I had completely forgotten to check errata; this was the first time I had seen it, and I was so happy that it was the first time, because, what the hell, this recommendation of suspending the ports before suspending the hub is exactly what makes this issue happen. Alan Stern said so himself in his mail, that this erratum is completely bonkers, so I'm glad I didn't see it earlier, because I would have been so confused. So, the workaround, what I did to actually prevent it from happening: I added a port quirk in the USB subsystem in the kernel. When it is enabled, the port is never actually suspended selectively; Linux only pretends to suspend it, but doesn't actually send the command to the hub. This alone would cause trouble: if we just pretend that the device is suspended, we stop polling it for more information, but the device isn't actually suspended, so it can't wake itself up. So to prevent that
from happening, we keep such quick port active whenever any sibling port is active as well, and when the hub gets resumed, all ports marked with this quick are also resumed as well, and this lets us rely on global suspense when we just stop sending any communication, and all the devices suspend at the same time, preventing this race condition from happening, and this works well with the topology on the LibreM5, but wakes apart on different topologies, if we added another device, for example on this third port that also wanted to use remote wake up, it wouldn't work, there's the code, so what can we do now? This hack isn't really a sweet table for mainlining, it's really a bad hack, so for now it stays in our downstream tree, however I believe there is a way to do it in a way that could be potentially upstreamed, it wouldn't be the default, I'm pretty sure, because this it would be quite inefficient, but I think it should be possible to have this as an option if you have such devices that are resettled in this way, that you could actually have them work reliably and wouldn't have to disable power management completely, and to do so we would have to ensure that no downstream wake up capable port is suspended while the hub goes into suspend, and there's also another thing that made me implement it as this hack instead of a proper solution first, is that while the proper solution is less efficient, this hack actually gives us some efficiency because we can skip suspending each device one by one, we just suspend them all at once and it takes less time, so this lets us make the modem go to sleep more often saving more battery, and so that's basically it, I'm available for consulting so I can turn your money into code if you're interested to have something done in mobile gaming space, and if you have some questions like my reviewer had here you can ask them now, thank you. Great. You already have a question here? 
Oh, you mentioned the influence of ModemManager on this effect. Can this be explained with your findings? Yes. When ModemManager is polling every 30 seconds, it's the host that initiates the communication; but if we switch to unsolicited messages from the modem, then it's the modem that initiates it, so it wakes itself up, as opposed to the host waking it up, and when the host initiates, this issue never happens. Hello, thanks for your presentation. How many man-hours went into this bug fix? Oh, I don't really know. There were many false starts, let's say, and red herrings; this is obviously just a chunk of it, because I had to fit it into the presentation. There were many approaches, and we were really in the dark at the beginning. I didn't know anything about how USB works, so initially I had to learn it from scratch, and that took some time. Hi, quick question for you, actually two questions. The first one is: is USB the ideal way to connect the modem, or is there a better protocol that we could be using in a future design? It depends on what you have available. On the Librem 5 we could theoretically use PCI Express; however, PCI Express, at least on this SoC, would be much more power-hungry than USB, and USB makes it easy to find devices that you can have on a replaceable card that you can put into the phone pretty much off the shelf, so the options are quite limited here. And the second question, actually on that: when it comes to adding a different modem, this isn't a modem issue, obviously, it didn't come down to which modem you were using, but are you guys looking at releasing a Gemalto modem? Because that would be pretty cool. I'm not really a person that has any power in this regard, so I can't really say much about it. We have a question from the Matrix channel: when will it be fixed upstream, hopefully? Hopefully soon.
Making this presentation and submitting it here was actually a way to force myself into going through this again, because after getting this hack done, I just wanted to take a break from all this USB stuff. So maybe soon, maybe not, we'll see. I think it should be pretty simple, in fact. We'll see what the maintainers say, whether they'll be happy to take such a quirk approach, or maybe they'll have another idea. We'll see. Are there any proper solutions to this problem, like in the USB specification? For example, are there any hubs that don't have that issue? So the USB 2 specification never fixes it. USB 3 works in a completely different way, and there are also supplemental low-power modes in USB 2 that could be used and that also don't have this issue, but you have to have a device that supports those modes, and we don't. So we can say that it's fixed, because it's all completely different in USB 3 and higher. For USB 2 devices, it's all up to the hub and how it's implemented. If it's implemented to the letter of the spec, there's a high probability that it will have this issue, but the spec gives you some leeway in timing, a minimum and a maximum, and some hubs are faster, and then you may not see this issue happening with them. So at this point, with USB 2 devices, it's probably down to luck with the components you are using. I'm working on open-source USB debugging tools, sniffers, software, so I'd be interested in talking to you about capturing this as a test case, to make sure that we're able to spot this happening on the wire in future. Okay. Very nice. Yeah, first another one from the chat apparently, then over to you. Are any other mobile devices known to suffer from this issue? I relate some aspects of the bug to the Pinebook Pro's Wi-Fi. Honestly, I have no idea. This was the first time I experienced this issue and had to basically go through what I told you today. So I don't know.
This was known for years. The email was 12 years ago, and Alan Stern said that it came up in testing. So obviously this came up somewhere, but where that was, and which devices were affected, I have no idea. You mentioned the other USB bug you were facing, where the whole bus died. Did you fix that as well? And can you say, like, two sentences about that? Once again? The other bug you mentioned in the beginning, where the whole USB stack died and the modem didn't come back. Did you fix that as well, and can you say maybe two sentences about the cause? Basically, that one was pretty boring. It turned out to be a missing quirk in the host driver, one that was already implemented but wasn't enabled in the device tree. And at some point NXP actually enabled it for all i.MX 8 boards, so this is fixed now in mainline. So please give another round of applause. Thank you.
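The "pretend suspend" quirk described in the talk can be illustrated with a small toy model. This is not the actual kernel patch; the class and attribute names are invented for illustration, and the model only captures the decision logic: a quirked, wakeup-capable port is never really suspended selectively while a sibling is still active, and only the hub-wide (global) suspend takes it down, together with everything else, so the port-resume/hub-suspend race cannot occur.

```python
class Port:
    """One downstream-facing hub port (toy model, not kernel code)."""
    def __init__(self, name, quirked=False):
        self.name = name
        self.quirked = quirked           # wakeup-racy device behind this port
        self.reported_suspended = False  # what the USB core believes
        self.really_suspended = False    # what was sent to the hub hardware

class Hub:
    def __init__(self, ports):
        self.ports = ports
        self.globally_suspended = False

    def selective_suspend(self, port):
        port.reported_suspended = True
        if port.quirked and any(
            not p.reported_suspended for p in self.ports if p is not port
        ):
            # Pretend only: skip the actual suspend command while a sibling
            # is active, so the device can still signal remote wakeup.
            return
        port.really_suspended = True

    def maybe_global_suspend(self):
        # Once every port *looks* suspended, suspend the whole hub at once;
        # quirked ports go down together with everything else, so no port
        # resume can race with a hub suspend.
        if all(p.reported_suspended for p in self.ports):
            for p in self.ports:
                p.really_suspended = True
            self.globally_suspended = True

    def resume(self):
        # On hub resume, quirked ports are resumed as well.
        self.globally_suspended = False
        for p in self.ports:
            if p.quirked:
                p.really_suspended = False
                p.reported_suspended = False

modem, eth = Port("modem", quirked=True), Port("ethernet")
hub = Hub([modem, eth])
hub.selective_suspend(modem)   # pretend only: modem can still wake itself up
print(modem.really_suspended)  # False
```

As in the talk, the trade-off is visible: the quirked port is only ever suspended as part of the global hub suspend, which is less flexible on other topologies but lets all devices go down in one step.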
The Linux Phone App Ecosystem
Okay, our next speaker is Peter from linmob.net and linuxphoneapps.org, and he's talking about the Linux phone app ecosystem. Please give him a round of applause. Hello everybody. I hope everybody can hear me. This is my first talk and I'm really glad to be here. It's amazing that this conference runs every year, volunteer-based, and that we have another room this year for all these great mobile Linux talks. Now we'll get one that's maybe less great, but I don't know. So, I think I need to hold this. This is the important thing: you can use these devices, Qualcomm SoCs or the Librem 5 and whatnot, with Ubuntu Touch and all of it, but supposedly it has no apps, right? In theory this could be so simple: you just install Waydroid on your distribution, simple, then you install F-Droid and free software apps, and then maybe you need some proprietary stuff, so you do that, and you have all the apps. Well, you know, I've done that in the past, and microG is amazing and whatnot, but there are always issues, especially with virtualized Android. There are better and worse approaches, but I would rather go native if possible, so this talk is only about native apps, whatever that means. But not so fast, let's have a brief agenda. Who am I? Some dumb puns, maybe. What's not in this talk? I don't have a slide for that, because why? And then apps on Sailfish OS, Ubuntu Touch, and the new contenders, so what I do with LinuxPhoneApps.org, or rather what others and I do with LinuxPhoneApps.org, because I don't develop any apps like other people do, and I don't add all the apps; can't do it. And then highlights, gaps and challenges, and Q&A maybe. So, motivation.
We already heard of the three major projects, maybe mentioned in passing, like Sailfish OS and Ubuntu Touch, and all these new Linux distributions that have sprung up, which we'll get to later. I think this is a small space in terms of market share, and to make matters worse, it's heavily fragmented. So maybe there's something to learn: maybe another platform, project, product, whatever you call it, does something differently, and that's great, and maybe others can learn from that. And then I wanted to spend some time with Ubuntu Touch and Sailfish OS again after a while, but, I don't know, life happened, so that part is going to be rather thin. I had some assumptions at first: surely stuff like email, that's easy; document protocols, well, maybe quite complicated, but it's there; Matrix, it's there; XMPP, just do it; and then stuff that has free APIs, people will do it; and everything that has an API, even a paid one, should also be doable. So, let's get into it. So, Sailfish OS. Wait, oh, I forgot the introduction part, shit. So yeah, this is my website, it's linmob.net, which stands for "linear mobster's network"... no, actually not. This is the logo, you may know it from YouTube, and this is the current homepage: weekly updates, a lot of work. And now, how it started, because I think that part matters a bit. It started in 2007, and even back then we had plenty of mobile Linux projects, community ones and others, coming over from the handheld age to the smartphone age: handhelds.org, linuxtogo.org. I don't know if anybody was in those IRC rooms at the time; if you were, great, that's real stamina, or whatever you'd call that. So I somehow stayed around. Well, I left briefly, because in 2011 we had two major things killed by CEOs: what happened at Nokia and what happened at HP. New CEO, and then boom, mobile Linux, which had looked promising, died. Also Openmoko.
Now, to get into this talk: while I was doing the blog, I got totally back into it in 2020 with a PinePhone, and oh my god, what can you do with it, this thing only lasts five hours, but hey, I want to use it, so is there a list of apps for this? I found one, forked it, and eventually turned it into LinuxPhoneApps.org, because the previous implementation would no longer work for these new Linux phones. And it's still pretty bad, I think; there's an issue tracker on Framagit, and we'll get to that later maybe. So, improvements welcome, I'd say, but a lot has been learned, and I think it can be helpful. So we skip that. Sailfish OS: like I just said, Elop killed Maemo and Harmattan at Nokia, and from the ashes rose Jolla, and they introduced the Jolla Phone in 2013. It was quite modern: Btrfs (well, who cares about file systems), a Wayland system in 2013, Wayland, really. And then things got going, but troubled: surely, they don't make their own devices any more; you can buy a license and bring your own Sony device. And they've got something that's quite interesting for those that need those proprietary bits to close the gap, which is Android app support; not a topic of this talk. So, what do they have? There are multiple interfaces to get software. There's the Jolla Store: requires a Jolla account, no paid apps, has no web interface, so I did not count those apps; maybe there's an API or something, but I didn't look into it. It looks quite nice, though. And that's not the only source of software that's well organized: there's also OpenRepos.net. Now, that one is really old. If you go to OpenRepos.net, you will see that it lists one app for the Librem 5, or for Phosh, but it also has many apps for the N900, which I think many people still have fond memories of, and the N9, and there's even still some development for the N9, so people are still using that thing today. It has Storeman as a frontend on Sailfish OS, also no paid apps; it, like I said, lists apps for other projects too, and it has approximately 1,800 apps and counting
listed for Sailfish OS, but I don't think that, with the transition from 32-bit to 64-bit ARM and the long history of four major Sailfish OS releases, you will be able to use all of them. This is what it looks like, a little bit less entertaining than the Jolla Store, but also quite fun, I think. And then there's the newest contender, because more options are better, and that's Chum. Since recently it has a web frontend; it also has no paid apps; it has 170 apps listed for Sailfish OS. And it includes, and this is a total highlight for me, because it's this cross-project collaboration I keep talking about, it includes some Kirigami apps, by packaging a modern version of Qt, because Sailfish OS uses Silica for its widgets and is stuck on Qt 5.6; forgot to tell you that earlier, I mean, who wants to talk about those sides that aren't so nice and shiny? But people made it work, and you can run, like, Kasts, or the Angelfish web browser, which is nice, because sometimes you may want a Chromium-based web browser, as the real web browser on Sailfish OS is Gecko-based, which is also really unique, and there will be a talk about that later on. So yeah, highlights. I did a little impromptu poll on Mastodon; I wanted to do something better, but these are the highlights of Sailfish OS. If you're using Sailfish OS and you haven't installed those apps, I mean, what are you doing? Just take out your Sailfish phone and install them, and maybe enter your security code, and then you can do this nice multitasking gesture thing. I will not go into demoing apps on Sailfish OS; I did that for YouTube and failed miserably, people were making fun of me, not doing that again. So there's a lot. Sailfish Connect, by the way, integrates with KDE Connect, if that's not obvious. And we even had contact tracing, so if you were like me, having a relative that was in deep danger, that was something to appreciate at the time. I mean, now, no more tracking, why would we? So yeah, that's it for Sailfish OS. Now let's go
for Ubuntu Touch. It's about as old, if not older: envisioned in 2011, there's a nice quote on the slide. It was in 2011 that it was announced that Ubuntu would support smartphones, tablets, smart TVs, smart screens, smart watches, head units, whatnot, everything; peak Ubuntu maybe, I don't know. And then, I left out the crowdfunding part, you all know about that one, they had the first commercial device in 2015, February 2015, so like nine years ago by now, man, time flies. And they used Mir, which these days is a Wayland compositor but then wasn't, Upstart, because yeah, and Unity 8, their own convergent thing. Unity 8 is amazing; it's now, you know, Lomiri. Thankfully, because Canonical eventually dropped all that great effort, because it didn't have market success. So, another death by CEO, if you will. But it was picked up by the community, and could be picked up by the community because it was completely open source. So maybe that's one of those lessons: only trust projects that are completely open source, because then it doesn't matter if they go under. And yeah, the UBports folks are doing great work; the latest release was just a few days ago. The store situation is also pretty simple: there's the OpenStore. It has a web interface, so you can browse it without even having to touch a device and get an idea of what's available. It even has ratings, so you can look into whether something is actually working. It has more apps for the older release than for the new one; so really, I think those numbers, you know, with 210 versus 610, I think it's actually 217 or 215 by now, but who cares about the exact number, that really should improve. The OpenStore has one neat feature; I wanted to put a screenshot of it into my slides, but who has the time: when you install an app from the OpenStore, sometimes, if that's specified, it asks you for a donation to the developer, and, if I remember correctly, it may do that again later on. And I know, nagging: nobody likes to be nagged, me
neither, and nobody wants to feel bad because they don't have the time to fill out the details and do all the stuff that you need to do to make that donation, because it's also complicated, because payments. But I think that's a nice idea, because, you know, giving something back, and not just feedback like "does not work for me, fail, I don't know, this is garbage". You know, maybe communicate in a friendly way, that might help, and maybe donate if there's a way, to keep this going; we need to do that. And then of course there are other ways to install apps, so you can do Libertine containers. On 16.04 this was totally uninteresting for all those new apps we'll get to later, because, well, in 16.04 not much was mobile-friendly in GTK land, and neither in KDE land, really. With 20.04 it's a little bit better, but you need to bring your own environment variables, and then there's new development that only works on some devices. Snaps: you can install snaps on Ubuntu Touch now. Snaps are known to be controversial, but on a system like Ubuntu Touch, which is also in a way "immutable", air quotes, and was very early with that, I think it's great to have another option to distribute software more widely. If Flatpak had been added first (got a sticker on my little tablet here), that's what I would have preferred, but it's good to have; really nice. Well, you need to bring your own store, and worse, to make it scale properly, but it wouldn't be fun otherwise. Highlights, you must know: if you do a poll on Mastodon, apparently people favor messaging clients, it's weird. Webber, a tool for web app creation: generally, Ubuntu Touch has a bunch of web apps, which is great; they have a way to do those, and other projects should do that too, because it's maybe a relatively simple way to make a service seem available from an app store, because people don't think of the web browser that they could use. Then Dekko, a great email client, though it might
use some work to get GPG support, but, I mean, come on, it's an email client; it didn't have one of those when it was on the Canonical throne. That was fun: when I first tried Ubuntu Touch it was like, what the fuck, because the only email client that shipped was a Gmail client. Anyway, past memories. And then uNav for navigation, and there are more; some of those really should be brought over. Just some highlights, I think you can read those yourself. FluffyChat had Flutter, which is interesting, because they did not ship GTK and Flutter in that click package, as far as I could tell; they made it a web app, so Flutter can do web apps, and they went that way. Also an interesting hack, I'd like to see more of that. And then there's an app for scooters, you know, that urban mobility stuff, supporting two services, really great; I don't know whether it works, I didn't try; be friendly if you try it and it has bugs. A Tesla app: I don't have a Tesla, no idea. Nostr: nobody needs Nostr, but they have a client, and it's "works for me", because I tried to go there with my blog and whatnot. And then of course there's a Spotify Premium client, because, like the assumptions earlier: the Spotify Premium API works, good. And then gaps, briefly, for this one: Matrix apps, maybe; yeah, I'm not really happy with that situation. It's interesting: the Element adaptation is something like a hack, some CSS hacks on top of Element desktop; a nice approach, but of course something like that is prone to breakage, you're basically patching a moving target; how that goes, just ask all the Android custom ROMs. And then XMPP, of course, and desktop Firefox on Ubuntu Touch, that's one from the poll; yeah, that would be great. Now, the new contenders, and that's the area I'm actually competent about, which is why I spent so much time on other stuff, to not talk about it too much. Up top you see the UIs: Phosh, or also GNOME Shell mobile, I could have put another logo there; Plasma Mobile; and then, as a joke, because I'm not going to talk about terminal apps, sorry: Sxmo. It's
awesome, I use it on my PinePhone Pro. And then distributions, you know: Danctnix, postmarketOS, Mobian, Fedora (there's a Mobility SIG), and then that fun icon; anybody know what the icon on the right is? Any hands? Yes? No? It's OpenMandriva; they made one image for the PinePhone, but I had to put it here. Then there are some rolling ones, and NixOS, nice to have that too, and of course openSUSE, the lizards are here too. So yeah. And then, of course, how did that all get started? It's all history: 2017 maybe, 2020, Librem 5, PinePhone, projects based on desktop distributions, like we saw; that appears two times in that list. And then Plasma Mobile, with Kirigami for apps, and Phosh with, first, libhandy, a widget library to make GTK apps more adaptive, and these days libadwaita for GTK 4, which really made things take off. So that's more of a success than I would have hoped for as a spectator on the sidelines, really impressive. And then the downsides: well, no proper app store solution, -ish, hence LinuxPhoneApps.org. You know, I was really hoping that we wouldn't need that by now, because you don't want to maintain a website that lists, I don't know, 500 apps these days, maybe, including games, and you have to check those: does it still work? Oh no. I don't know, who has the time? So these are all the fun UI frameworks that are used in apps listed on LinuxPhoneApps.org; most of these don't really matter, and I already mentioned the ones that do really matter, except maybe Flutter, because that's going somewhere; well, we'll touch on that later, this is just an overview. So there are plenty of apps with Kirigami, it's like 140 apps listed, so Plasma Mobile is going rather strong there. The GNOME side is a little bit stronger, up top with libhandy; I mean, I could also count GTK 4 and GTK 3, but some of those don't really fit super well, you know, only with Phosh's scale-to-fit hack and whatnot. If you've been in that arena, you've seen that
rodeo: it works, and it's great to have it, but it shouldn't be necessary. So yeah, libhandy: 66, libadwaita: 156; it used to be more in the libhandy camp, stuff is moving over, which I think is good to see. I don't know why I've got one Ubuntu Components app in there; I think it was elsewhere before it was in the OpenStore. And then programming languages: well, I think everybody in here is more competent to judge this than me; I can do a little bit of Python and some CSS and HTML and whatnot, and barely any JavaScript. It goes along with the toolkits, right? There are also some things that I did not know before I started this list: I didn't know that there were GTK apps made with C++; I always assumed that was all Qt. But yeah, you learn. Looking at the interfaces you can use to browse software, here's one that's really nice these days: GNOME Software. See that fun little thing there that says "adaptive"? Yeah, that's great, that's metadata. If that were everywhere, I could stop working on LinuxPhoneApps; boy, would I love that. But we're not there yet, so yeah, it's tough. So it can show that, and then there's even a fork that only lists the apps that are indicated to be adaptive. You know, you can always write anything into metadata, nobody checks, so you could claim your app is super adaptive when it's not, but then you will get that feedback, so don't do that. And also don't do it because otherwise I really can't retire that website, ever. And then Discover: well, it doesn't show adaptiveness, but the thing is, if something is Kirigami, most of the time it should work, except a few things that don't, but you don't need everything on your phone, right? Then there are of course also some Qt Widgets apps that also work, but only barely, and if you're lucky. And then metadata, it's beautiful. My day job is in publishing, and in publishing we still love XML, and AppStream metadata is also XML. This is a common specification that has been extended over the years; I think it started, I
don't know, decades ago maybe, but it's definitely more than one decade old at this point. I have some links on the site, and a blog post, on how to specify those; before that there was an interim specification by Purism. You can put your licensing in there, you can put descriptions, release notes, you know, go crazy. And the good thing is, except for the release notes, if I execute a script I can pull all that nice information into LinuxPhoneApps.org. Ain't that great? So yeah, if you are developing an app, please add metadata. There's the MetaInfo Creator that makes it relatively easy. I know it's some extra chore and it sucks and nobody has the time, but I think it's really useful for people. And if you maybe want to contribute: run through the code forges, find apps that don't have metadata, and make merge or pull requests adding that metadata. Go for it, thank you. But that was AppStream metadata; sorry about my excitement for XML, nobody likes that anymore, I know. Highlights for apps: I don't think I need to iterate through them all, just one highlight, Itinerary. It's really a better travel companion than the app by Deutsche Bahn, for example, which I know very well, unfortunately, because it generally not only tells you about delays, it also tells you how to get from the platform you arrive on to the platform you continue from, and you can see that, because it's not always the case that platforms with adjacent numbers are next to each other, and that matters if things are delayed, once again. And then Angelfish, a nice mobile web browser, also on Sailfish OS, like I mentioned. And Pure Maps; Pure Maps, again, we had that before, could also have been on the Ubuntu Touch list; Pure Maps is everywhere. Oh, I forgot Kasts, sorry; Kasts is also great, it's really feature-rich, does chapter markers; I like podcasts, sorry. And then highlights on the GNOME side: well, Chats and Calls, because, you know, SMS, MMS, phone calls; who wants to get phone calls?
But yeah, people do, and if your whole stack works, it works. Then another one from the poll, and it's really nice: Tangram, that little thing for web apps; you can also use it on your desktop. All of these apps are also available on the desktop, so if you don't have a Linux phone, you can still use all the apps on the past two slides, and they are also great on the desktop, because adaptive apps aim to be great anywhere, and I think all the ones listed here succeed at that. And then of course communication, and Railway; maybe I trouble you with the trains too much, I don't know, can you travel on trains too much? No idea. And then Spot, a Spotify Premium client, again, API magic. And then Flatseal, because that helps sometimes. And then other highlights: these are Kirigami apps, and I've put two Matrix clients in there; maybe I use Matrix too much, yeah, I must use Matrix too much. One is using Qt Quick components, that's Nheko, and the other one is using Flutter. Another special one: apps that run anywhere on mobile Linux. We had, well, Pure Maps, maps, navigation, whatnot; and Amazfish, smart watches and stuff like that. And then Kaidan, that's an XMPP client, and yeah, it's only on Ubuntu Touch 16.04, that's why the asterisk is there, but otherwise it looks like building Qt apps that are cross-platform is possible. Another special one: apps that run everywhere, including legacy platforms, so iOS and Android; well, see the next talk, Flutter maybe, I don't know, really interested, looking forward to that. And then current gaps: so if you have time and want to start something, here's the list. We already saw that some of these things are solved somewhere; I think Ubuntu Touch also has a cryptocurrency wallet, if you need that, I don't know, maybe you do. And then of course WhatsApp; yeah, tough. And then more current gaps that I found elsewhere: attention-grabbing social media; I think we need Instagram and TikTok to make this mainstream, and we need Facebook for the grandparents, and we need office
suites to edit fucked-up Word documents and shit, and Excel, well, you need that; there are some approaches, by the way, there's one Qt app. So, gaps; this brings me to packaging. Aside from metadata, you know, releasing an app helps. I'm not explicitly stating that I'm looking at KDeltaChat in this very moment, but I am. It's a nice app, it works; Delta Chat is encrypted chat via email protocols, nice, but no release, so it's not packaged anywhere aside from the AUR and NixOS, and also, I mean, maybe Flatpak. So in my little impromptu poll, one answer was, and that one really got me: "this app seems great, I'm looking very much forward to it landing in Debian stable", and I'm like, oh god, this person is patient, I should learn something from them, crazy. So please, if you maintain an app, maybe do that tagging thing, release it at some point. You know, don't release it while it doesn't work, that won't help anyone, but maybe release it once it barely works, because it works barely, but it works. Then of course: Flutter apps built only for x86_64 Linux; Electron apps built only for x86_64 Linux, what the fuck, Signal; and then generally apps built only for x86_64 Linux. You know, aside from doing this mobile Linux phone thing, I've been running ARM laptops for years, and, I mean, now with fast ARM laptops it's less of a problem, you can compile stuff, but oh god, imagine a Pinebook compiling a big Electron app; I mean, you can't do that, that's like waiting for stuff to land in Debian stable. And then, future challenges: things actually get worse. More and more services disappear behind apps, and apps that, on the Android side, often require Play Services and thus don't easily work in Waydroid, and that's the deal for public and private services; these are some German examples, who cares. So yeah, we need virtualized Android maybe, or we need to reverse engineer other things, or we need to push governments... well, governments, I mean, we're in Brussels
here double capital Belgium and the EU and NATO they're not state whatever but yeah so technical solution obvious one is the web and then of course what would I like to see more cross project collaboration in the app space I think I stress that enough but I've made it it's stress it enough to access to non-distribute sources easier and distributions and now that's controversial like enabling flat up from the get go and maybe even the snap store oh god people with throats brings at me and then donation bagging and other app install things maybe a future for software thingies and then a bug tracker like mozilla's platform tilt if you don't know they list this stuff whether disadvantage by last large companies also goes into that political avenue and help with linux phone apps or or so yeah yeah I want to make it a progressive web app I want to make search and filtering better but yeah who has time so conclusions I hope this wasn't too overwhelming or boring there may be more apps than you'd think regarding initial assumptions I think honestly despite trying to prove it people are just scratching their itch and that is perfectly fine so thank you this is the stuff where can reach me and where the next four minutes are always and if you want to contribute from agate it has issues with sign in so send just send that page to the mailing list and that last link is a really cursed really bad my skills at web development level thing that helps to create things time for questions thank you very much Peter any questions from the audience ha successfully over they're all taking it in still bored them to tears I'd ask the question oh it's actually not a question it's a statement this is David but no I just wanted to I wanted to thank you for taking all your time and preparing the weekly post as a user of mobile linux not so much a coder it has been huge to get me in the community to keep me in the community keep me up to speed with everything that's happening I realize that one 
person can't always do it, but I just want to say thank you. Thank you, that helps keep me going. Another question or statement? Yeah, in the back, we'll take a second. So, I too want to have a Linux phone, so can you please tell me how much time and suffering do I have to give to achieve that goal? Depends on your approach. So I think it's impossible to answer without knowing your specific use cases and the services you want to use, and how much pain you're willing to go through, or whether you're going to be like, well, you know, waiting is fine. And also it depends on which hardware you choose, but to get to hardware choices we first need to establish which distribution you go with, and then go down some huge decision tree. Maybe that's a talk for next FOSDEM. I have a PinePhone, but it's lying on my desk... So it's catching dust, like most of those. Yeah, I've got one of those too. I've got two of those, yeah. So, PinePhone of course; since I've been paid by postmarketOS, ha, no: postmarketOS is amazing, Mobian is also amazing, I think those are safe choices. And then try to solve your issues one thing at a time. But if you have issues with your carrier and reliability and stuff, then yeah, it gets tough, so maybe a different device, maybe a different carrier. It's complicated. Okay, I'll keep on dreaming. Do that! A question from the Matrix: what do you think of Box64? I think we can use this to run some of our x86-64 programs as a current workaround until we have an arm64 version of the binaries. I think in some cases this is definitely useful, and I think people love that for proprietary games, mainly. With some Electron apps you can actually use an arm64 Electron runtime and then run that, so it's not always necessary to go that route. But I mean, why not? I personally haven't played with it, because I am too thick to understand the instructions and don't have the time, but yeah, Box64, also great: just emulate, shit works. All right, another question? Yeah, there's one, okay, please pass on the mic. Hi,
once again, I echo the comment: thank you very much for your weekly LinMob log of everything that's going on in Linux mobile. But my question is, I think it's about Purism: about a year ago they talked about a payment mechanism for developers, I think maybe it's still only a theory, but I don't know if you know anything about that, about how that might be changing the landscape of Linux mobile apps. Well, I think it would be very good to have something like that, and they are in a place to do that; as a business they've got an easier route to it than all these non-profits. But I don't have any news, so I very much look forward to something like that, but as far as I know it hasn't happened yet. Thanks. Please give another round of applause.
Flutter - about the nightmare of cross platform development targeting Linux mobile
Okay, next up we have Brage, she's going to talk about Flutter apps. Please give a round of applause. Hello, yes, I'm going to talk about Flutter, but not about the fancy ecosystem we were just introduced to in the previous talk; I'm going to talk about development, and rather about the nightmare of development targeting Linux mobile. Because from the perspective of app developers, there's still much work to do until we can properly target Linux mobile with cross-platform software. Who am I? My name is Brage. I've done Flutter since it was publicly released in 2018, and I work in healthcare actually, so my work has nothing to do with what I'm presenting here, but I find it an interesting topic anyway. I use ARM, by the way; that's why I talk about Linux mobile. Even the talk is held on ARM, maybe people recognize the laptop here. You can reach me via Matrix, since I do Matrix for work, when you have any questions. I am from France. Back to topic: why would we like to use Flutter? We had a fancy overview about the Linux native ecosystem, about GTK progress, about KDE targeting Linux mobile. Why Flutter? Because Flutter is a decent cross-platform ecosystem. Unlike when I develop a GTK app, I do not uniquely target Linux, but a giant ecosystem consisting of iOS, Android, desktops, maybe even web, and I can potentially also target Linux mobile. It has a fancy plug-in system for developers to access native resources, so we are not bound, for example, to web browser APIs, as we know it from many JavaScript-based cross-platform toolkits. We have an amazing UI toolkit, and that's what people love Flutter for. You have animations, the 2024 style, and it's fun to use. It renders perfectly, it renders 120 frames per second on high-end devices, unless you have some vendors doing weird things, and then it won't work.
And it's no JS, no XML, so we have design as part of the code, no external design resources, which makes it quite fancy to refactor and to use for development. Yeah, Flutter, but let's talk about Linux and especially Linux mobile. We will talk about both in this talk, but the goal is, finally: what are the issues about Linux mobile? We have a giant ecosystem, I already said; there are 10,000 apps in the Google Play Store made with Flutter, a bit fewer in the Apple App Store, but we have a giant ecosystem, and all these Flutter apps could target Linux and Linux mobile too. They are optimized for mobile views; they'd actually be handy to use on Linux. We just need to make it happen. And we have big players in it, namely Canonical and Google. I know they are very popular here, but they use Flutter, especially on Linux, and push it. Unfortunately, that's a problem too, that they are the ones pushing it, not the community; we will see that later. Yeah, what are the key points in targeting Linux mobile, and Linux in general? The first is: okay, if I have the application, it should not have runtime issues, it should be usable on the mobile screen, it should have functional interaction with the user. The second, from the developer perspective, is that I should be able to debug the app. I should be able to compile the app for my Linux phone; there we get to a big problem. And the third thing is redistribution. First of all, I need to redistribute Flutter itself, in order to have a package system which can target Linux distributions with dependency trees, with Flutter as a build dependency. The second thing is, I need to package my Flutter app for Linux distributions. It sounds easy, but it can be hell. This is the first thing we are going to talk about, because that's the most complicated when talking about Flutter. Afterwards, debugging and runtime; I will give you a brief showcase of Flutter on Linux. Yeah, Flutter redistribution consists of two parts.
We need to build the Flutter toolchain, so everything we need to develop, and we need to package it in a way we can use it on Linux distributions, in order to have it as a dependency. Yeah, let's look at packaging, because that's easier to understand at this point. If we follow the instructions on docs.flutter.dev/get-started on how to install Flutter, we simply clone a Git repository. I mean, that sounds amazing. It's just a Git repository. It should be packageable. You download or clone that Git repository, you execute Flutter for the first time, and you see that: we're downloading lots of things. First of all, we are downloading the Dart SDK. We could use that one as a system dependency, but that's difficult. But then we continue downloading. Let's look at where we are downloading to. I mean, it should be a user directory or some similarly decent location, which is user-configurable. Yeah... no, no, no. We download all the stuff into the installation directory. Now imagine how that is when packaging stuff for Linux distributions. It's a bad idea if your runtime is hard-coded to download stuff into the installation folder. That's a bit annoying. But that's something you can work around with patches. Yeah, step by step: what is it downloading? You download the Flutter source, blah, blah, blah. You execute Flutter for the first time, and it's downloading the Dart SDK. So Dart is the underlying programming language Flutter is using. And yeah, afterwards, it's creating the snapshot of the Flutter tool. So it's compiling the Flutter tool, which is written in Dart, in order to have an executable of the Flutter tool itself. Then this compiled Flutter tool, remember, you clone source and it's first compiling stuff, then downloads the engine, the Flutter engine, and dozens of platform dependencies. And they keep changing each and every release. Good luck capturing that. So what do we have? We have fonts, we have Android stuff.
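To make the download behavior described above concrete, this is roughly what the documented install flow looks like, and where the artifacts end up; the paths reflect how a Flutter checkout is currently laid out, so treat them as illustrative:

```shell
# Clone the SDK as docs.flutter.dev/get-started suggests -- this *is* the install.
git clone https://github.com/flutter/flutter.git ~/flutter
export PATH="$HOME/flutter/bin:$PATH"

# The first invocation triggers the downloads described in the talk:
# Dart SDK, flutter_tools snapshot, engine artifacts, fonts, ...
flutter --version

# Everything lands inside the checkout itself, not in a user cache dir:
ls ~/flutter/bin/cache
```

That `bin/cache` directory inside the installation folder is exactly what makes naive distro packaging painful: a package installed read-only into `/usr` cannot write there at runtime.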
If I use the Flutter tool to target Android development, I have different compilers, one per architecture, compiled with native APIs. I have the Web SDK to target web; I need to download Skia and CanvasKit in order to render on the web. All this is going to be downloaded. Generic Flutter SDK tools, platform compilers for Windows, Linux, macOS, the frontend renderer, for example the GTK embedder on Linux. And then I'm mostly done. Let's look at where these downloads come from, in order to capture them and in order to improve that. A Git release? Nah, that would be too easy. Some package registry, like, I mean, that could be a hosted Nexus or something? No: Chromium CI, the build system of Google for their open source and proprietary components. They build all these components you need at runtime in order to save time while executing, I don't know. And it's built in Chromium CI and then downloaded at runtime. So you need to capture that somehow. You cannot know what's happening in this Chromium CI. No one knows. We just download blobs from proprietary storage, and this is not very open source of you. It's hell to package. It's hell to work with that. But back to the topic: how can we package that? Now that we know where all this stuff is coming from, we could take all this stuff from Chromium CI. I mean, it's the easiest approach. I just want to have Flutter function. I want to develop my apps. Let's just package that stuff we get from Chromium CI. We could pre-cache it at prepare time of the packaging process. So: download all these dependencies, create the snapshot and so on, and then just have it packaged in the distribution package we ship. The other option would be, and I won't give a definite answer on it, it's just a prospect: you could also patch Flutter to make this user-configurable. I made a merge request for that like two years ago. It was rejected, because the Flutter authors did not see any use case.
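The pre-cache-at-prepare-time approach mentioned above can be sketched like this in a distro package's prepare step; `flutter precache` is the upstream command that fetches the per-platform artifacts up front, and the rest is a hypothetical recipe outline, not any particular distro's actual build file:

```shell
# Run once at package-build time, while network access is still allowed:
export PATH="$PWD/flutter/bin:$PATH"
flutter precache --linux        # fetch only the Linux desktop artifacts
flutter config --no-analytics   # don't phone home from the packaged copy

# Then ship the whole checkout, including bin/cache with the pre-fetched
# artifacts, as the distribution package, so nothing has to be downloaded
# into the (read-only) installation directory at user run time.
```

This works, but it means the package simply captures the Chromium CI blobs rather than building them from source, which is the deeper problem the talk gets to next.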
It's obviously a perfect idea to download stuff into the installation directory. Yeah. But even better, we could build them ourselves. Because actually, when I talk about FLOSS and mobile devices, I do not want stuff dropping out of this Chromium CI that I have no clue about what's happening in. Yeah. Building Flutter, next topic. I don't know, has anyone of you already built Flutter? Like the Flutter engine, the Flutter tool? I guess a couple of people here. I guess you had fun. At this point, very special thanks to Lauren: amazing work on patching Flutter to make it buildable with a more or less un-vendored toolchain. Amazing work. So the next few slides are going to present the work done by Lauren. Yeah. Issues with Flutter FLOSS builds. First of all, you have vendored packages: everything you could use as a system dependency is being vendored from some random upstream source from Google. We do not want that. It's coming from Chromium CI, by the way. Also, the Flutter sources themselves are written in a way that's not musl-compatible; existing patches adding musl support to the Flutter engine were so far always rejected. The same applies to existing patches making it compile on BSD. Those are not that functional yet, but there were clear statements: there's no interest in adding support for that, there's no use case in it. So the Flutter team is not willing to accept these patches, this work done there, which is super sad in my opinion. Yeah. So the toolchain to build Flutter itself: it's basically a gclient solution. So you get the fancy depot_tools repo from Google and download the solutions, and it's downloading lots of stuff from Chromium CI. This is a screenshot, can you see it here, from the Alpine package build files for building Flutter. You have, I don't know how many there are, it's 15 patches, only to make Flutter compile.
There, you have some patches affecting the engine, so for building the engine, and some for runtime, for the Flutter tool, and in both cases it's giant overhead just to package this simple tool. Yeah, it's sad. Upstream work? Nah, so far not wanted. It's not appreciated. There was upstream work until all the patches were rejected; it's been known for a while. So far all attempts to improve that were rejected, and that's why there's unfortunately lots of downstream work going on. Yeah, mostly rejected. There we are. So in order to build Flutter using a FLOSS toolchain only, you first need to patch the buildroot in order to have a functional environment to build the Flutter engine itself. First of all, things like: hey, use my local Python, I do not need your vendored Python; use local libraries and stuff. By default everything is vendored. Afterwards, you need to patch the engine to, for example, work, or functionally work, on musl. This is not required if you target glibc devices, but the postmarketOS and Alpine people in this room, maybe the Void Linux people, might be happy about that. And the patches are pretty similar to those targeting BSD, because Flutter has lots of stuff hard-coded to function on Linux only, though it could at many places work on BSDs too. I'm talking about BSD because I actually love using BSD, and I'm sad Flutter doesn't work there yet. And afterwards, once you've patched the engine, you still need to patch the Flutter tool. Like we were talking about: these artifacts. We do not want to download the Dart SDK; I want to use the Dart language installed by my distribution package manager, rather than some pre-compiled stuff. At the moment, for example, Alpine has the Dart musl port packaged in order to work around that. So there's no canonical way yet, there's no clean way yet, though there is ongoing work on that. And yeah, so that was the brief overview. I mean, I need to hurry.
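For reference, the gclient-driven engine build mentioned above looks roughly like this; it's a sketch, since exact targets and flags vary per engine release and per distro patch set:

```shell
# depot_tools provides gclient, which fetches the engine "solution"
git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
export PATH="$PWD/depot_tools:$PATH"

# .gclient describes what to sync -- dozens of dependencies,
# many of them prebuilt blobs pulled from Chromium CI
cat > .gclient <<'EOF'
solutions = [{
  "name": "src/flutter",
  "url": "https://github.com/flutter/engine.git",
}]
EOF
gclient sync

# Configure and build the host engine; this is the stage where distro
# patches (system python, system libraries, musl support) are applied first
cd src
./flutter/tools/gn --runtime-mode release
ninja -C out/host_release
```

The point of the downstream patches is precisely to make the `gclient sync` step pull less from Chromium CI and use more of what the distribution already provides.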
The talk is way too short to dive deeper into it. The second thing is debugging and cross-compiling. If I have a Linux mobile device, it's usually another CPU architecture compared to my host machine. Though host machines with ARM CPUs are evolving now, most people still use AMD64 devices, and that's why in most cases, to debug a Linux mobile app targeting a device like this, it needs to be cross-compiled. And that's the moment where I wished Flutter was Go, because Go is fancy at cross-compiling, and Dart is like, oof, crappy at it. But wait a second. There are these fancy arguments existing for the Flutter tool, like target platform and target sysroot, where you can specify a sysroot of, for example, an arm64 installation. Let's try that. That's the reply you get. I mean, nice that you added these parameters, but that's not exactly what I expected to be shipped. So yeah, you see, there we have the aim of the upstream team to support it, but it's too slow. There are other solutions making it better already, and now I'm going away from upstream, presenting some possibilities to get Flutter to debug and cross-compile for your ARM device, your Raspberry Pi, your watch and whatsoever. At this point, I can also recommend the embedded Linux talks on Flutter taking place this weekend; they are diving deeper into the solutions I will present. Yeah, the chat is very confused by this output. Yeah, if I just want to compile, I could also just use QEMU and, if it's functional for release builds, compile the stuff on my host then. I could use QEMU, use a static binary, and I have my ARM binary. Okay, it's compiled. I could ship it. But I actually want a debugging session where I can use the fancy Flutter features like hot restart and hot reload, where I just do flutter run towards my beryllium, instead of building it locally, pushing it, not debugging it, checking whether it works, manually checking some outputs.
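The attempt described above looks something like the following; the flag spellings follow the flutter tool's cross-build options as I understand them, the sysroot path and container image name are hypothetical, and the exact error text varies by Flutter version:

```shell
# The tool advertises cross-build flags...
flutter build linux \
  --target-platform linux-arm64 \
  --target-sysroot /srv/sysroots/debian-arm64
# ...but on an x64 host this currently bails out with a
# "not supported" style error instead of producing an arm64 bundle.

# Fallback that does work for release builds: run the whole build under
# emulation, e.g. an arm64 container via qemu-user-static. Slow, and it
# gives you a binary but no hot-reload debugging session.
podman run --rm --arch arm64 -v "$PWD":/src -w /src \
    my-arm64-flutter-image flutter build linux --release
```

This is the "compiling is not debugging" distinction: emulation can produce the artifact, but not the interactive `flutter run` session against the phone.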
Compiling is not debugging. That's a huge difference. Yeah, for cross-compiling and debugging, there's no canonical way yet to do that. You can compile Flutter cross-platform using a QEMU static binary. Thanks, but that's crappy. We actually don't want to do that. You could also just have your standalone ARM64 build server. That's what I do: I have ARM64 CI devices at home with which I build all the Flutter stuff I build, in order to have test builds targeting, for example, Debian on mobile. Or you use custom devices. Flutter supports custom devices, which means you have configuration files; you tell the Flutter tool at runtime to run on device configurations that are actually not supported. And there you have projects dropping in. You have flutter-embedded-linux, developed by Sony, for Flutter on embedded devices. It's basically a wrapper around the Flutter tool which enables you to run on ARM devices, also remotely, and you have flutter-pi, which also uses the custom devices API in order to target remote devices on Linux. But again, there is no built-in way. There are these fancy projects enabling us to do that, but there's no Flutter built-in way, and that's sad. Yeah. As of now, it's easier... I have a full Linux installation on here. It's easier if I have my Flutter development environment installed on the device, SSH onto the device, and debug on there, because that's way more functional than using the typical stuff you know from Android, where I just plug in the second phone and debug. That's not the state of debugging here. It's rather easier to develop on the target device itself, if you have a decently powered CPU and a desktop Linux distribution there, or you can do it by SSH; that's way more convenient. And you should hopefully see an image... No, that's a joke. I have prepared a short showcase for you. It was number seven. Yeah. That's a showcase of Flutter.
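The custom-devices route mentioned above can be sketched like this; the feature is experimental, so the hostname, paths, and JSON field names here are illustrative rather than the exact schema (the `flutter custom-devices add` wizard generates the real entries):

```shell
# Enable the experimental custom devices support in the flutter tool
flutter config --enable-custom-devices

# A custom device entry tells the tool how to push and run an app
# on a remote Linux phone over SSH (hypothetical PinePhone example):
cat > ~/.config/flutter/custom_devices.json <<'EOF'
{
  "custom-devices": [{
    "id": "pinephone",
    "label": "PinePhone (Mobian)",
    "sdkNameAndVersion": "linux-arm64",
    "ping": ["ping", "-c1", "pinephone.local"],
    "install": ["scp", "-r", "${localPath}", "user@pinephone.local:/tmp/${appName}"],
    "runDebug": ["ssh", "user@pinephone.local", "/tmp/${appName}/${appName}"]
  }]
}
EOF

flutter devices           # the phone should now show up as a target
flutter run -d pinephone  # hot reload over SSH
```

Projects like flutter-pi and Sony's flutter-embedded-linux build on exactly this mechanism to make remote Linux devices look like regular Flutter run targets.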
In a few moments, you will see me opening a Flutter app. I recorded it while traveling here; that's why it's a bit blurry. That's an example of a Flutter app. You see, animation rendering is pretty decent. Scrolling is crappy, because it requires upstream patches in order to handle Linux touch events as touch events by default, and not as pointer events. There it's getting crappy, but on the UI side, Flutter is fancy. And for example, some Flutter apps ship these patches to get scrolling to work. Most others do not. Some vendors ship patches. For example, Alpine again has patches to include a scroll behavior treating Linux touch and mouse input as drag-to-scroll input. I think it's broken; I know it's broken since the last few releases, but I think that's because the patch must be adjusted. Originally Alpine had a patch for it; it's no longer functional, but one could adjust that patch to still work. And a short summary: the first point is that touch is treated as mouse. That's why if you swipe, it selects instead of scrolling. Scaling is sometimes an issue, but that's an issue everywhere in Linux mobile: these devices have full HD or even higher resolution, so everything is scaled dozens of times. You saw a GTK header bar, which is pretty annoying. I do not want to see your header bar, but that's again a GTK issue, not an issue of Flutter. And multi-window is pretty crappy, because if I start a new instance, you run into issues with any database connection you have open, if you use local databases, and you mess up your applications. You run into those issues on Android too, but on Android it's handled way better, because by default it does not start two instances of your app. And yeah, that's the state of the art. It's crappy, but there is momentum. There is work going on.
If you use all the patches, all the toolchains around Flutter, if you actually use them to target Linux mobile, you can target Linux mobile in a pretty decent way. And I hope it keeps going. Some work is going on upstream; unfortunately most of the work is going on downstream, which is pretty sad. That's not very open source of Google. But I mean, it's Google. Yeah, so let's get Linux mobile ready as a cross-platform target, and that was my talk. Awesome. Does anyone have questions? Yeah? You talked about upstream not wanting to support musl. But doesn't Android already have a libc other than glibc, and do they even support that? If we look at Flutter, we are talking about completely different targets with Android and Linux. And the Flutter Linux engine does not support anything apart from glibc upstream. Of course it supports Android, that's what it was initially developed for, but those are completely different components of the engine. And there they compile with the Android libc. Forgot the name. Yeah: Bionic. Any more questions? Yeah? Martin. Your demo video showed a Flutter application running pretty smoothly. What device was that? Sorry? What device was your demo video running on? That was a few years old smartphone from Xiaomi. It's a Xiaomi Pocophone F1 running Debian... no, how is it called? Mobian. Ah, okay. So, Freedreno. Yeah. Okay, thank you. If you tried it on the PinePhone, for example, you wouldn't have that experience, because the GL driver is broken. That's exactly what I saw in the last video. I often have that in my issue list, believe me. Any more questions? Yeah, there's one. So it seems like quite a pain to get Flutter to build and compile, and get it all the way to an app running on a Linux phone. Is it worth it? Is there really nothing better to get an app running on a Linux phone? As of now, I consider Flutter as pretty viable for targeting Linux mobile, because you have this giant ecosystem of existing Flutter apps.
You have thousands of them which could theoretically run on Linux mobile, but simply do not target it yet. You have 10,000 proprietary apps in the Play Store; okay, we do not want to have them. But we have dozens of open source Flutter apps on Android as well, and all of them could run on Linux if we made it easier. And all those patches are usually not patches that I, as an app developer, need to apply to my projects. Okay, I need to apply some patches too, or the vendors shipping my app do. But it's usually the vendors, or the distributors shipping the distribution package, who ship Flutter. I can easily build the Debian package for a Flutter app. But if I want to do it the fancy open source way, if I want to use Flutter as a build dependency shipped by my package manager, then it's difficult. But I have the vision of getting there one day, where I do not need to use my local Flutter installation set up the flutter.dev getting-started way, but a vendored Flutter, vendored in the upstream of my Linux distribution. It's harder, but it's not work done by the app developers. So I think it's worth it, because it's mainly the distributors who need to do most of this work. Okay, thank you. Questions? Okay, in the back, one second. Thank you. So, not related to Flutter, but you said it's so painful to get something upstream; from an open source perspective, how difficult would it be, or what would be the challenges, to say, okay: as a community, we fork Flutter and we start supporting this fork, because the maintainers don't want these patches on the official one. And we, as open source citizens, adopt this fork. How difficult would that be, culturally? Well, forking Flutter entirely would be pretty complicated, because Flutter is a rapidly moving ecosystem. There are many patches in the upstream, and that could always break your fork, with the giant company standing behind it pushing Flutter development.
So you have on the one side this giant company, namely Google, working on Flutter with a giant community, and you would need to maintain your fork of the entire Flutter system on your own. What I consider more realistic is patching the buildroot and single components of the Flutter ecosystem, which you could use as drop-in dependencies when shipping Flutter in a Linux distribution, for example. That would be way easier, and that's also where I currently see the Flutter FLOSS Linux mobile ecosystem moving towards. So this work is more or less being done, but it's at the beginning stage. I would not consider forking Flutter entirely as a new framework, like, hey, with this one you can target Linux mobile too, because then you would lose all the big players who already have their apps and continue using Flutter. Thanks. Please give another round of applause.
5G in ModemManager
All right, thank you all for coming. So next up we have a very exciting topic, 5G and ModemManager. Have a round of applause for Aleksander. So let's talk about ModemManager. Let me know if you don't see me, because I'm not sure if this is going to work very well. About me first; I think I'm going to keep it like this. I have been the ModemManager maintainer and developer for the past 12 years, and I've also been involved in developing and maintaining the two libraries that we use to communicate with modems, which are libqmi for the QMI protocol and libmbim for the MBIM protocol. I'm now working in the Google ChromeOS team, since two years ago. And this talk is going to be about not only how we're going to add 5G support properly in ModemManager, hopefully, but also how we added 4G, which issues we had when we added 4G, and how we are going to overcome the same kind of issues when developing 5G support. So we will look at what went well and what didn't go that well with 4G support. So before I joined the ModemManager project, there was already support for 4G, in the sense that you could connect the modem while it was using 4G, and the modem would tell you, hey, I'm using 4G, and then we would expose that, and that's about it. So we were treating 4G just as a different mode: we had 2G, 3G, and now we have 4G, nothing else. So when I joined the project, I started to review the 1.0 API suggestions that were in the mailing list, and the major focus at that time was to support multi-mode devices. So at that time we had two separate families of modems: we had 3GPP modems, so GSM, UMTS, LTE, and then you had another family, which were 3GPP2: CDMA and EV-DO modems for 2G and 3G. 3GPP2 had its own standard for 4G, but they ditched it and started to use LTE as the standard in 3GPP2 modems as well. So we had these strange combined 3GPP plus 3GPP2 modems, multi-mode modems, that had to be managed kind of in the same way, but they were very different in nature.
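For orientation, the pieces mentioned so far are all visible from the command line through ModemManager's `mmcli` client; a minimal session looks like this (modem index 0 and the APN are examples):

```shell
mmcli -L        # list the modems ModemManager currently manages
mmcli -m 0      # modem status: state, access technology (2G/3G/4G/5G), ports

# The classic connect-with-explicit-settings flow discussed in the talk:
mmcli -m 0 --simple-connect="apn=internet,ip-type=ipv4v6"
```

Underneath, ModemManager talks to the modem via libqmi or libmbim depending on which control protocol the device exposes.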
Like, 3GPP modems require a SIM card; 3GPP2 modems, most of them, required some kind of activation with the network to bind your user account to the device itself, and it was a manual or automatic activation, depending on the carrier. So there were many different things. And managing these new multi-mode devices, we thought, was the most important thing. But it wasn't, because 3GPP2 no longer exists. So, can anyone tell me which main feature of 4G we missed? Because we didn't think of it. Any guesses? No. Much more important than that; actually related, sometimes. So what we missed is the idea that when you attach to the network in 4G, you are actually creating a data network interface between the modem and the network, even if the host hasn't seen it yet. So you actually get an IP address, a full data setup, communication between the modem and the network in the user plane, but the host knows nothing about that. And why did we not catch that? Because most operators didn't really care about it. They would allow you to send a blank APN during the attach, and that was fine for them; they would tell you back which APN you are using. That was one approach. The other approach was that the settings used for the data connection were actually going to be the same ones as used for attach. So when you connect, you're actually configuring profile number one, which is the one used for attach in Qualcomm modems. There were lots of assumptions happening at the same time. There was also no consolidated approach to define these settings in the control protocols. The MBIM 1.0 spec did not have a way to specify attach settings. And many of the APIs that we developed at that time were based on looking at what MBIM-capable modems were doing. So there's a use case where this does not work, which is when the settings are different. And so in 1.10, we added the support to explicitly specify attach settings.
This is the case of Verizon, for example, where they have one specific attach APN and one specific data APN. So now we were able to say to the network, okay, we want these specific settings for registration, and then the network will tell us, yeah, you can have those, or you can have a subset of those. You may ask for v4v6 and then only get back v4. That's a very, very common thing that may happen. And this was added very late, in 1.10, many years after the 1.0 API was introduced. Another thing that we missed in 1.0 was the support for profile management. Up until that moment, the way you connect the modem is that you specify all the settings that you want in the connection attempt. And in 1.18, we added the support to say: we already have a set of profiles, maybe even provided by the modem itself, because when you insert the SIM card in the modem, the modem itself will come up with some carrier-specific settings, with some predefined profiles. This is very common with US carriers. So you insert the Verizon SIM card, the modem boots with profiles already defined the way Verizon wants them. And in that case, you can just say, connect profile three, and that's about it. So we did miss that. We missed some other things, which are maybe not as important as that one. What did we do right? So the first API that we defined for 1.0 had multiple PDN connections in mind from the very beginning. Even if we did not support them in the same way as it's implemented now, at that time we had modems that would expose two network interfaces at the same time, physical network interfaces, where we could choose: okay, please connect this one to this APN, please connect this other one to this other APN.
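Both late additions described above, the 1.10 attach settings and the 1.18 profile management, are exposed through mmcli. A sketch, with option names as found in recent ModemManager releases (the APN and profile id are examples; check `mmcli --help-3gpp` and `mmcli --help-3gpp-profile-manager` for the exact spellings on your version):

```shell
# 1.10+: attach (initial EPS bearer) settings, kept separate from the
# data connection settings -- the Verizon-style split attach/data APN case
mmcli -m 0 --3gpp-set-initial-eps-bearer-settings="apn=attach.example,ip-type=ipv4v6"

# 1.18+: list the profiles the modem/carrier predefined when the SIM
# was inserted...
mmcli -m 0 --3gpp-profile-manager-list

# ...and connect by profile id instead of passing explicit settings,
# if the modem supports profile-based connections
mmcli -m 0 --simple-connect="profile-id=3"
```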
The multi-PDN support that we have right now is based on multiplexing, so we can have one single physical network interface, but then we can say, okay, I'm going to connect three different PDN connections, I'm going to create three different virtual network interfaces. And then the host can assign different data flows to each of these PDNs separately, because you have three different network interfaces, so you can do all the routing logic in the host itself. And this very same support was used to support Qualcomm SoC boards with the IPA driver, for example, which requires multiplexing by default. Now, where are we right now with 5G support in ModemManager? The picture is very similar to what we had before 1.0 for 4G. We just have the way to say that we are using 5G. We can say that we are using 5G SA networks if we only expose the 5G access technology, but we also have the way to say that we are using NSA, so we are registered in 4G and we will use a 5G extra bearer when the bandwidth requirements happen. And that's about it. We don't have any other 5G-specific feature for now. What are we missing? So I'm not going to talk about 5G-specific features that apply, for example, in the radio interface, because ModemManager does not really care about any of those; we only want to support things that the host is aware of, not what is completely hidden from the host. So one of the things that we are going to try to support is 5G slicing, which is this important word that, if you read about 5G, is everywhere. So in 4G networks, there is no clear separation between different types of UEs. A UE is the combination of host and modem. So in 4G networks, you don't have any differentiation between different UEs; they are all treated in the same way. And in 5G, they do define specific types of UEs with different quality of service requirements. So you may have a UE that wants to have a bigger bandwidth.
You may have a UE that wants to have extremely low latency. You may have UEs that may send data to the network once a day or twice a day, but they need to be spread across a very big area. So in order to support all these different kinds of UEs, 5G introduces the concept of slicing. And so with slicing, you have one single physical network, but then it can be logically divided into separate virtual networks, each of them with its own quality of service requirements. And the separation, this is very important, goes up to the base station, which is something that 4G did not have. So imagine this use case. We have thousands of people here, all of them with a phone, and all of them trying to get access to the network. There's congestion, there's a lot of radio interference between all the devices. With 5G, what you gain is that you could have a phone using a slice that has a specific base station only for that slice. And so you get priority access to the network through this slice. And this may happen even with the same PDN. So you have one single APN that you want to connect to, to the internet. You may have different paths from your host to connect to that same APN, based on the quality of service requirements that you have. Now a 5G slice, as I said, is a logical partition of the physical network. And they are defined, they are specified or named, by something called single NSSAI, S-NSSAI. It's a really bad name, I think. And so how are we going to support this in ModemManager? There are two main things that we need to support. One is during the registration, we want to specify which is the slice we want to connect to, at the time of registration, and right now we can't do that. And then you may ask for multiple slices, and the network will give you back, okay, you are allowed to use this one, you are not allowed to use this one, and you also have available this other one.
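The S-NSSAI mentioned above has a simple structure: a one-byte Slice/Service Type (SST) plus an optional three-byte Slice Differentiator (SD). The standardized SST values below come from 3GPP TS 23.501; the Python model itself and the example SD value are illustrative, not any real API.

```python
from dataclasses import dataclass
from typing import Optional

# Standardized Slice/Service Types from 3GPP TS 23.501 (subset)
SST_EMBB = 1   # enhanced Mobile Broadband: big bandwidth
SST_URLLC = 2  # Ultra-Reliable Low-Latency Communication
SST_MIOT = 3   # Massive IoT: many devices, sparse traffic, wide area

@dataclass(frozen=True)
class SNssai:
    """Single Network Slice Selection Assistance Information."""
    sst: int                  # Slice/Service Type (1 byte)
    sd: Optional[int] = None  # optional Slice Differentiator (3 bytes)

    def __str__(self):
        return f"{self.sst}" if self.sd is None else f"{self.sst}-{self.sd:06x}"

# The three kinds of UEs described above map naturally onto slice types:
phone_slice = SNssai(SST_EMBB)             # bigger bandwidth
robot_slice = SNssai(SST_URLLC)            # extreme low latency
meter_slice = SNssai(SST_MIOT, sd=0x0A1B)  # once-a-day sensors, big area
print(phone_slice, robot_slice, meter_slice)
```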
So this is one simple way of binding, for example, all the traffic of the system to a single connection, to a single slice. This is the case that I told you before: a UE connected to two different slices separately, both of them going to the same internet APN, and they use completely different virtual network connections on the operator side, with different QoS settings. The complex way of using 5G slices is by using URSP rules, in the sense that the operator will tell you which is the way that you need to route the traffic through that network. So they will give you rules, the UE receives the rules, in this case the modem will push the rules to the host, and then the host needs to make all this separate traffic differentiation and move one data flow to one slice and another data flow to another slice. The UE should not be capable of deciding by itself which slice to use, because this is mandated by the network, and so if you try to use a slice that you are not supposed to use, they may kick you out. So that's a way that the network has to control the access to the high-privilege slices. The way ModemManager supports slicing will look very much like a multi-PDN connection. We will have virtual network interfaces created for each slice, and that is about it. There are other 5G features that we could consider, but I'm only going to name them here. So non-3GPP access support, that's basically accessing the operator network through Wi-Fi, for example; you can authenticate to the network through Wi-Fi. And then you also have non-IP based 5G connectivity. If you have a network connection between machines using different protocols, you could virtually create a 5G network connection between them without using the IP protocol. Now, what is it going to look like for the next 10 years?
I think we need to focus on what went right and try to avoid the mistakes that we made in 1.0, but we also know the limitations, because everything changes, and what is important now may not be important at all in 10 years. So the planning needs to be done carefully, and actually made in a way that if in the future you need to change course, then you can do it more or less easily. The first thing we should be doing is remove legacy features. A lot of the structure in the ModemManager code base is based on this logic of having 3GPP2 (CDMA) devices as a separate type of device. We can remove all that. Same for POTS, the plain old telephone system, like those dial-up modems. We said we would implement them 13 years ago, and we did not do anything. I think it's time to say that we're not going to do it. We had enough time to try to do it. And then obviously, all the plugins for modems that are very old, we can remove them. There is no point in having them anymore. The focus should be on 4G and 5G modems, and on PCI and USB modems that expose a network interface. So we acknowledge that there are other types of modems, that is, serial modems or USB modems that don't expose a network interface, where you can only do AT-plus-PPP connections. Those would still be supported, but let's say in life-support mode only, like bare minimum data connection setup, and not thinking about adding many features to those. For example, not thinking about trying to add 5G slicing in those devices. It wouldn't make much sense. We may want to have a new API. This API that we are using right now has been mostly untouched. We haven't broken API in more than 12 years. I think it's time to do some breakage. As I said before, remove interfaces that we don't want, and probably not follow the same process as we did for 1.0. In 1.0, I spent one year and a half with my branch, until it was mostly ready to be launched. I mean, I want to change that. That cannot happen again.
I don't have as much time as I had back then. So the idea would be to do it progressively and start to add new APIs, at least the basic ones, and so on. We will have registration settings as a first-class citizen in the APIs. We will no longer treat them as something automatic, which is what we do right now. We now want to configure 4G LTE attach settings. We want to configure 5G registration slice settings, and several other common settings that you may have in the modem, like the manual versus automatic registration settings. All those should go in their own separate API, with the idea that in the future we may have more. So it should be open to updates in the future. Regarding connection management, I think it's time to use profile-based connection management as the default whenever possible. There are many reasons for this, especially when you use carrier settings, where the modem gives you all the settings that you need to use. There's no point in trying to add new settings on top of those when you already have them. So using profile management is the way to go there, and enable multiplexed connections by default. So as I said, the primary modems to use would be the ones that expose a network interface. Most of them allow you to do multiplexing. So we should enable that by default. This is one of the main things that I would like to change as well. So right now when you have a modem detected by ModemManager, and it happens to have voice support, even if you're on a laptop that does not have any audio connectivity, ModemManager will try to configure voice-related stuff, call waiting status, all that. It doesn't make any sense to do that if you know that you're not going to use it. So let's move that to separate interfaces as they are right now, but in a way that you can actively enable them. And if there's any application with the intent of using voice capabilities, it can say, hey, ModemManager, please enable voice capabilities in the modem.
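The opt-in idea above (don't initialize voice support until an application asks for it) boils down to lazy feature setup. This is a sketch under invented names (`ToyModem`, the URC strings are just examples of unsolicited result codes); it only illustrates the pattern, not any planned ModemManager API.

```python
class ToyModem:
    """Sketch of opt-in feature interfaces: nothing voice-related is
    initialized until an application explicitly asks for it."""
    def __init__(self):
        self.enabled = set()
        self.urc_handlers = []

    def enable_feature(self, name):
        if name in self.enabled:
            return
        if name == "voice":
            # Only now do we register unsolicited-message (URC) handlers,
            # query call-waiting status, and so on.
            self.urc_handlers += ["+CRING", "+CLIP"]
        self.enabled.add(name)

modem = ToyModem()
assert not modem.urc_handlers   # laptop without audio: zero voice overhead
modem.enable_feature("voice")   # a dialer app announces its intent
print(modem.urc_handlers)
```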
Then we will enable all the URCs, all the unsolicited message support, and everything that needs to be done to support voice, for example. Oh, no, that's another one. Yeah. This is the extended wish list. So, things that I would love to have, even if they are extremely difficult. So we have a QMI proxy, an MBIM proxy. Why not have an AT proxy? Other programs could use AT commands through ModemManager, through the proxy, to do other stuff that does not interfere with ModemManager's own control. If we had that, it would allow many applications to use AT commands as well. Then we could move all GNSS location out of ModemManager completely, as a separate daemon. There's no reason for ModemManager to have all this support for configuring A-GPS and injecting extra files into the GNSS module. We do that because the modem has that. But if we have the proxies in place, there would be no reason not to do it outside of ModemManager. And, yeah, Rust, maybe, for binary parsing of messages and all that. That was something that was already investigated. And that is all I have to say. Thank you very much for this great talk. Do we have any questions in the audience? Yeah. Thanks for the good talk. I was wondering, how do you test all this? What is your CI? So in Chrome OS, we have a lot of automatic testing for the modems that we use. So I do rely a lot on that. Like, when I joined Google, I found that there were a lot of metrics about crashes and things, backtraces. I was like, I need to fix all this. But I do rely also on my own testing. I have a home network, a home LTE network with srsLTE and Open5GS, and I have my own SIM cards. And that allows me to do a lot of testing that otherwise I would not be able to do. Because all the slicing stuff is also very core-network dependent? Yes. So you might run into problems. Oh, yeah. I know many operators are doing pilots, like private pilots, and some open ones. I think also in the US, T-Mobile is doing it.
But for example, for 5G slicing, I think that my home network is enough for this kind of testing. Thanks. Next question from the back. Hi. I'm debugging voice calls on my device. And from ModemManager I see messages like gained audio, lost audio. And I have no idea what happens after that. And whenever I try to... So how do you use AT commands to control the modem? No, it does it by itself. When I'm trying to get to the bottom of what's going on in the code, I only see interfaces behind interfaces behind interfaces. But where can I find the actual code that makes the audio work? Like, where should I look? So ModemManager is only in charge of starting the call and hanging up the call, that's all, plus accepting an incoming call. Nothing audio related. I mean, ModemManager knows absolutely nothing about the audio path. So who is responsible for setting up the audio? It depends on the platform, of course. So if you're using a Librem 5 or a PinePhone or something like that, then you may need to talk to them. Thank you. Thanks. There was a question from the Matrix apparently. I'm rushing to the Matrix. So somebody's asking, can we anticipate 6G features, such as sharing machine learning data for connection optimization? I have no idea about any of that. I'm still on 5G. Maybe in 10 years we will talk about it. The same talk, but for 6G. You talked about Rust for the protocol parsing and how there have already been experiments, and it's on your wish list. So I assume those experiments are somewhat successful. Can you talk any more about what those experiments are? So, not much. I mean, it's useful. I think it's very useful. And I still keep finding bugs. For example, in the 3GPP PDU parsing, which we wrote 10 years ago. And there are still bugs there. Nasty memory-related bugs. So Rust is very promising in that regard. Cool. Thanks. One more question in the back. Thanks for the talk. So, the question regarding the AT proxy.
With all the possible vendor crap, et cetera. So how do you plan to define if a command is going to interfere with ModemManager or not? Is it going to be allow by default or forbid by default? So that's why we don't have the proxy yet. That's the main reason. Especially because ModemManager handles a lot of crap that manufacturers push on the AT port. So the idea would be that, in the same way that ModemManager disables a lot of URCs that it knows may happen, the proxy could do the same. And so we would still need to work with known URCs as they happen in the wild. But I hope that manufacturers will start to use other things than AT at some point in the next 20 years. Give a round of applause for Aleksander.
Droidian - Bridging the gap between various platforms with convergence
Anyways. So, thank you all for coming. The next talk is about Droidian from Bardia. Please give a big round of applause. Good afternoon, everyone, and welcome. My name is Bardia, as you've heard, and if you've been following our project, you know me as FakeShell in the community. And I'm one of the core devs of the Droidian project. And if you have any interest in embedded systems, mobile devices, that's why we're here, obviously, you might be particularly interested. So today, our topic of discussion is going to be Droidian and what we're doing, how everything works, how everything goes together, and why the whole project even works. OK. So who are we? Well, we're a number of FOSS and privacy enthusiasts committed to building a free and open source project and operating system that is user friendly and open, that can be utilized in different environments, such as phones, maybe even single board computers, tablets, different things. So Droidian is, as the name states, based on Debian. We take the core of Debian, add our own repository on top of it, and add our own so-called finishing touches. So Droidian utilizes a number of different projects. Should I go down like this? OK. That's too far. OK, I messed it up. So Droidian utilizes a number of different projects. Some of the more well-known projects are Halium. We use libhybris and GBinder from Jolla. We use the stack from GNOME, as you guys may know, Phosh. And we currently have a selection of devices supported in our official CI, our build system. And I think it should be over 20. We haven't updated that device page, so it's not exactly up to date. It should be 25 or 26. So devices vary pretty largely, from different manufacturers, different release dates. We have the OnePlus 3 from 2016. We have the Pixel 3a, the F(x)tec phones, the Galaxy S9, the Lenovo ThinkPhone. Like, the list goes on and on.
So the barrier of entry for getting into Droidian development and porting is fairly low, because there's already a number of devices that do exist, and they mostly cover most of the possible cases in the Android space. So for Droidian, one of the main things that people who just get to the project need to know about is our porting guide. The porting guide is mostly split into three sections: kernel compilation, rootfs debugging, and rootfs creation. Kernel compilation is going to be the initial testing and compiling, changing a few parameters in the kernel, and packaging it to get a Debian output, because we need Debian packages to do over-the-air kernel updates through apt. We have rootfs debugging, which covers what happens after the phone actually boots into the Droidian root file system. And last but not least is rootfs creation, because we obviously need to somehow get builds for each device. So how do we actually get from Android to Linux, or what we call Linux? So on Android, there's usually the bootloader, LK, loading the kernel, the kernel loading the ramdisk. And the ramdisk does everything to start up the init process of the system partition to actually start the system. And then system mounts a bunch of stuff, mounts product and vendor, and a bunch of other garbage. So on Droidian, we take the same kernel that there was on Android, and we change the ramdisk. We have a modified fork of the Halium ramdisk, which the Halium project and UBports used to maintain. Now, in our fork, we have support for a bunch of stuff that we use that is not in the upstream Halium ramdisk. The Halium ramdisk mounts the userdata partition, which is where Droidian actually resides. We don't use system, which is kind of unusual, but it is what it is. It mounts userdata, it does a bunch of Android bootloader stuff to get everything up and running, and it starts init, which is systemd, obviously.
So now systemd starts, and systemd starts up all the usual services. We have systemd-timesyncd, systemd-resolved, and all the other stuff. But then we have our own services in systemd. We have a service that starts a very small container that runs Android. And that Android starts and mounts a bunch of Android partitions, modem, and everything that the firmware and the drivers need. And the vendor script starts, the system GSI script starts, and we get all the drivers loaded, all the firmware loaded, and a bunch of interfaces started from Android. Then we have the usual file system of Debian: there's the user interface, feedbackd, and the rest. So from the Android services, we have hwcomposer, which we use for compositing to the screen. We have audioflinger. Well, not exactly audioflinger, it's droidmedia, but ignore that. We have droidmedia for audio and camera. We have the radio interface layer which, as the name states, is for radio. And a bunch of other services: the perf manager HAL for power, NXP NFC, et cetera. So all the communication that we do from the Linux side of things to the Android side of things is done through Google's binder pipeline, the binder IPC. And we'll explain how we actually use the binder IPC, how we actually communicate with it directly to the interfaces. So from the Linux services, everything looks kind of familiar. There's Phosh, obviously. There's feedbackd for feedback. There's oFono, kind of ancient. And because nothing in the modern Linux stack can actually talk to oFono, we have ofono2mm, which exposes ModemManager interfaces as a drop-in replacement on top of oFono. It's kind of a hack, but we don't talk about that. Yeah. So we have droidian-fpd. It's a fork of the Sailfish community FPD, which is used for fingerprints. We have callaudiod, as usual, for call audio. Again, we have custom backends, because Android. And PulseAudio, again, ancient, but Android.
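Conceptually, talking to an Android service over binder means crafting a transaction: an interface token, a method code, and a parcel of serialized arguments. The sketch below is a deliberately simplified toy (the real binder wire format has strict-mode headers, UTF-16 strings, and alignment padding, and the interface name and method code here are made up); it only shows the shape of what libraries like GBinder handle for us.

```python
import struct

def build_transaction(interface, code, args):
    """Toy picture of a binder transaction: an interface token, a method
    code, and a little-endian parcel of arguments. NOT the real wire
    format, just the conceptual shape."""
    parcel = bytearray()
    token = interface.encode("utf-8")
    parcel += struct.pack("<I", len(token)) + token
    for arg in args:
        if isinstance(arg, int):
            parcel += struct.pack("<i", arg)      # 32-bit signed integer
        elif isinstance(arg, str):
            data = arg.encode("utf-8")
            parcel += struct.pack("<I", len(data)) + data
    return {"code": code, "parcel": bytes(parcel)}

# e.g. asking a radio interface to dial a number (code 71 is invented here)
txn = build_transaction("android.hardware.radio.IRadio", 71, [1, "+3212345678"])
print(txn["code"], len(txn["parcel"]))
```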
And a bunch of other services. NFC and GeoClue, again, need their own backends. But we're going to talk about these later. So most of the components that we have are not directly used by the user. So the camera, which goes through droidmedia, is abstracted, and users just see the Droidian camera app. The modem goes via oFono, but users just see kind of a ModemManager sort of imposter. For fingerprints, this part is completely customized for droidian-fpd. We just forked the settings app and added it, everything. For battery management, Batman, very funny name, does the work for battery management. I started that project as a shell script. It was a mistake. So Batman does a bunch of funny stuff: turns off CPU cores, sets governors, sets power save, whatever, and watches a bunch of nonsense. And then we have Phosh, which is the user interface. Again, we maintain our own fork of Phosh, because sometimes stuff happens, stuff breaks. We kind of have to maintain our own. We have bad experiences. We don't talk about those ones either. We don't say that in public. Droidian needs to have a good image. Then we have the encryption service. Again, a custom tab in settings, which uses LUKS and LVM2. And the unlocker, which was, I think, initially developed for postmarketOS. We added a MinUI backend through LVGL. Again, custom backends, because Android. I mean, it's the usual. So now, how does everything actually go together? As we mentioned, we have a bunch of custom backends, a bunch of custom plugins. We have the Qt5 camera plugin from the days of, I think, Canonical, which developed it. There's the oFono binder plugin, which was developed by Jolla, nice of them. There's a bunch of PulseAudio modules that allow us to talk to the audio HAL, through droidmedia itself, and get audio through the hardware working: microphone, speakers, everything. We have gst-droid, which again talks to droidmedia to give us a nice and shiny GStreamer pipeline that we can use for the camera.
And well, that's pretty much it for backends, because we can't add plugins to everything; not all pieces of software accept plugins. So we kind of had to hard-fork a bunch of stuff. Some of them are not that frequently updated, so that was good luck for us. GeoClue is barely updated, so we just added the hybris backend, slapped it in, and it just works. We have the wlroots hwcomposer backend. I don't even know who started that. I know a bunch of people are involved in it. It's a mess. We have the callaudiod backend, which routes a bunch of stuff through hard-coded values. But it works. And the feedbackd backend, which talks to the Android vibrator HAL through AIDL and HIDL and gets the job done. It's not beautiful, but it works. And for MinUI, as we mentioned for the unlocker, we added a MinUI backend to LVGL itself, so it can draw to the screen without GPU acceleration, of course. Who needs GPU acceleration in the ramdisk? Anyways. For the boot animation, I think Plymouth is used. We also have a MinUI backend for Plymouth. I think it started life as the MinUI backend from Jolla. I don't remember. So to actually talk to the Android services, there are two main pieces doing the job for us. One is libhybris and one is GBinder. libhybris has a bunch of compatibility layers, and GBinder gives us a way to craft transactions and send them to the Android interfaces. And the whole system, how the whole thing works, pretty much ends there. Stuff's maybe hacky at times, I'm going to admit. But it works, because we use pre-built vendor services and a bunch of stuff that was provided by the vendor itself. Stuff works for now. Maybe in the future, too. I'm joking. Like, stuff actually does work. So what is next for Droidian? Because the services work and the system itself starts up, everything works for the most part. But in reality, one of the main issues of the whole Linux ecosystem is app support.
We don't have apps, let's be honest. And no one wants to develop any either. None of the big companies do. So I guess: start integrating Waydroid better into the system, getting like zero startup time on Waydroid, maybe developing something that replaces Waydroid, again a drop-in replacement. And clean up all the garbage that we added. We have a lot of garbage, so it's not pretty. We definitely have to go through everything, at least I do. I'm not a good programmer. We have to go refactor a lot of code, clean up a lot of code, see what we have to do, and possibly actually add some new features. So some of the actual features that I had in mind and have been working on: wireless displays, which have to go through PipeWire, while we're using an old version of PulseAudio. So it's kind of tough, and I don't want to do a drop-in replacement of PipeWire; I'm kind of tired of hacks. So we kind of have to fix up PulseAudio to actually get PipeWire working. Then we can get wireless displays working, because there's an XDG portal for it. So that's one of the things on my to-do list that I actually have some work put into. Face unlock was something that I've been working on for the past two months. We can get face detection working through GStreamer, and GStreamer will actually track your face as you move it along. I'm going to admit it's like 3 FPS. But it does detect. And the rest of the work can be done with OpenCV, because not all Android devices have the sensor to do it in hardware. So that has been on my to-do list. I've been working on it. Maybe we can help out other open source projects if they'd like face unlock, maybe. And two other very annoying features that are kind of deal breakers for others: one is MMS. We don't have MMS. I tried many times. I couldn't get it working. MMS is very important. RCS is maybe even more important. But MMS also, at least in Canada and the US where I live, Android users are always using MMS to talk to the iOS guys. So MMS is very important.
Dual SIM is very important, a deal breaker for many. And we have to work on dual SIM. That is a very big priority for me also. We've seen many users who actually looked at Droidian and they were like, oh yeah, this is great, but you guys don't have dual SIM, so I'm out of here. That's not exactly the nicest. And besides all that, we still do have to work on app support for Linux and the ecosystem. With libadwaita and GTK4 becoming very mature and things working out, I have been, at the very least, working on porting all the old GTK3 applications that I've been using to GTK4 and libadwaita. Not exactly Droidian specific, but it will benefit everyone. So that's something. A lot of applications are very slow. The Settings app, as we all know, is very slow, the GNOME Settings app. Much of the stuff is not threaded. Everything is running in a single thread. It's just horrible. A lot of code we have, I mean, well, I do have, will soon possibly become PRs for many different projects, making many things threaded. We at Droidian have a big PR to optimize GTK4, speeding everything up. We've had a user who was working on a Blackberry, and he was seeing 70%, 80% performance improvement on GTK4. Because apparently there are a lot of issues in GTK4. Who could have thought? And the very last issue is that we, as the Droidian people, don't allow community devices in our build system. So if one of us core devs has a device, it can be made an official device, so like, be added to the build system, get stable builds and nightly builds. But we kind of don't have that for other people porting devices. So we should probably look into having a way to allow community people to port their phones and have them in our build system. I know many community porters have worked on devices, and they saw that, oh, they couldn't add it, so they just gave up. And the most important thing: documentation.
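The "make it threaded" work mentioned above follows a standard pattern, independent of GTK: keep the UI loop responsive by pushing slow, blocking work onto a worker thread and delivering the result back through a callback. This is a generic Python sketch of that pattern (the function names and the fake probe are invented for illustration), not code from the Droidian or GNOME patches.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_probe():
    time.sleep(0.05)  # stands in for a blocking hardware or D-Bus query
    return "42 Wi-Fi networks"

def on_done(future):
    # In a real app this would schedule a widget update on the main loop.
    print("UI update:", future.result())

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_probe)        # blocking work leaves the UI thread
    future.add_done_callback(on_done)       # result is delivered asynchronously
    # ... the main loop keeps handling input and redraws here ...
```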
And that's something I have to do, because none of the code I wrote has documentation. We have to do a lot of documentation. At least the stuff that I worked on basically has nothing. I just worked on it, slapped it on, and was like, yeah, it works, whatever. That has to be worked on a lot. And that is at least my to-do list for now. No. Don't go down. Don't go down. OK. So if you want to contribute to Droidian: via our device page, via our website, via our Telegram channel, which also syncs to our Matrix. I think you can also find the Matrix group for the Droidian project. I don't use Matrix much, but apparently you can have a group that has a bunch of channels in it, I don't know. So you can find us there as well. And one kind of announcement that I have is we have been working towards getting phones with Droidian pre-installed. What a weird sentence. We have been working with an ODM to get Droidian phones, or so-called phones with a Droidian-based system installed on them, and have them be sold, kind of the way Pine64 does it. But it's like, yeah, we as Droidian developers are doing it. So we understand the system and we understand the hardware. So it's going to be much easier to develop on, because we also understand the system itself. So you might want to look out for that. FuriLabs, not "Fury Labs". FuriLabs, please. And possibly the bigger news of this sort of project of getting Droidian-based phones will be coming out in a few months. But you can be on the lookout for it. We have a website at the moment, kind of not exactly the best, still being worked on. We have a survey asking users, if they wanted to have a phone with a Droidian-based system, what would they want? What specs would they want? What would they want the devs to be focusing on, et cetera? So you can expect a Linux-based phone sold on the market in a few months. Thank you. Thank you very much for the great talk.
I know we have a lot of questions in the Matrix, so I'm going to pass it on. So the highest upvoted question right now is: do you have any plans of switching to ModemManager from oFono? OK. So I have looked into this. I'm going to be 100% honest with you. I have looked into this. I am by no means a professional. And when I tried getting this working, I could never get a ModemManager kind of backend to register commands over the binder IPC, GBinder. Again, I am by no means a professional. And this is probably doable. And it will be a huge step forward, which will make the whole modem stack a lot better. It won't have to go through this, and this, and this, and this, a thousand things, before the user sees something working. So yes, it would be great. I spent some time, I couldn't get it working, but it is on my to-do. One question. You mentioned that you implemented a wlroots backend, I guess, to get Phosh running. Are there any plans? For example, I currently use postmarketOS on my phones, which is actually running a mainline kernel, so I guess it's a little bit of a different situation. But for example, different other Linux mobile UIs, like GNOME Shell, just the GNOME Shell branch for mobile, stuff like Plasma Mobile, SXMO. Is there a project to get those running on Droidian as well? Or is Phosh the only focus at this point? So at the moment, I actually understand the question, and we get a lot of questions like this, like getting different UIs running. So each UI that uses an underlying graphics library needs its own backend, obviously, because we have to use hwcomposer. And I know that there's Wayfire, which uses wlroots. So that one works fine. There's a bunch of other wlroots compositors that work fine. But as an example, Plasma uses KWin. There used to be a KWin backend for hwcomposer, and it's pretty old, or it's really old, and someone has to revive that to get it running. I currently don't have the time.
I have a full-time job, and I'm a student. I'm kind of already under a lot of pressure. So for GNOME, which uses mutter, well, that's a beast by itself. Because KWin and wlroots are modular, somewhat, but mutter is the opposite. The code for the DRM backend, or framebuffer, or whatever, everything is baked in so hard that it's a very tough task actually adding a new backend, let alone maintaining it. Because no one's going to accept any of our backends upstream, because no one can test them other than us. So if someone spends the time, sure, but for GNOME Shell with mutter, I really doubt it. Because of mutter itself. I might piss a lot of GNOME people off. I use GNOME myself. Mutter is a mess, at least when I looked at it six months ago. Thank you. How does Droidian support standard Debian, like Bookworm, Bullseye, .deb files for ARM64 targets? Well, yeah. You can run the packages. Right now, Droidian is based on Debian Trixie, the testing branch. We also have a branch for stable. Well, we have a snapshot for stable that you can use. It doesn't have many of the new features; that one is based on Bookworm. But any repository you add, any deb repository you add, if the packages are built for ARM64 or the architecture is marked as all, like Python packages and stuff, everything will work. Flatpaks work, snap packages work. If AppImages are built for ARM64, AppImages work. It's just like a computer. Thank you. Thanks. Maybe another? Yeah. OK, you. And then another question from Matrix. Thanks. Just a quick question about the strategy, because you mentioned all these hacks you've built around to get it working. So my initial understanding was that you built Droidian to foster the development of these apps for Phosh, for instance. But now you're trying to also have a phone delivered with it. So does it really make sense to have a device running these, let's say, many hacks from the start? Well, yeah, that's a very good question.
Well, we're trying to eliminate every single thing that we think is a big hack. But it really depends on what you consider a hack. Is libhybris a hack to you? Then the whole system is built on nothing. To my eyes, I have a different view of it, and in my opinion we can slowly get rid of most of the hacks. Again, we have custom backends; fair enough, but I don't see those as hacks. In my opinion, a lot of this can be cleaned up and made ready to ship on a phone sold to customers. So it's not so far gone that I would consider working on it a waste of time. I still think that getting it done is very doable. Give a big round of applause again. Please, thank you.
Genode on the PinePhone on track to real-world usability
All right. Next up we have Genode on the PinePhone, on the track to real-world usability. Have a big round of applause for Norman. Thank you very much for the chance to be at this developer room for the second time. I was here one year ago introducing a Genode-based phone, and now I will give you an update on what has happened in the meantime. I had very little preparation time, but I wanted to show some demos, so if something breaks, please bear with me; I hope it will run smoothly. First, some background. (The microphone is acting up a bit; I will try.) To recap: back in 2003, my best friend Christian and I had a dream of a truly trustworthy operating system. We knew from academia certain puzzle pieces that should lead us there, but each of those pieces seemed to belong to a different puzzle, so it was quite difficult to align them. Can you still hear me? Okay. It took us a few years to bring them into alignment and build the first prototype back then. Once we saw how all this could work, we were quite motivated to bring it to the real world, and we bootstrapped a company in Dresden by ourselves, doing contract work, with the idea of licensing our technology. Fast forward ten years: we kept working on this, and during that time we had grown into a small team of ten people. At that point we were able to move our own work onto Genode on our laptops, which was a big milestone. And now, a few years later, we have taken the first baby steps to bring it to a mobile phone as well. On the PC it looks like this. This is Sculpt OS, an operating system built on top of the Genode framework, and it is actually running on this machine right now; it's what we use day to day on our development machines. On the phone, what I presented one year ago is this system running on the PinePhone.
The basic idea is that there is a part of the phone that has fixed functionality, think of a feature phone, or of a boot loader, something that is really fixed, and then there is a user-defined part of the phone where the user can install software and switch it in and out. I will give you a quick tour of the user interface. Let me just log in and type into my Linux VM over here, and let's see if we can get some video running. Yeah, this one. So here you see the phone UI, which divides the phone into five categories. Is it doing something? Yeah. The device category gives you control over physical aspects like the brightness or volume, control over the microphone, like a kill switch for the microphone, and some power settings for how you want to operate the phone. You can see that when I modify the brightness, it takes immediate effect. Then there is a second section, all related to telephony. You see here that the user has complete control over even the lower-level aspects, like the powering of the modem: the power lines to the modem are really controlled by the user. So now, for example, when you switch it on, the modem is booted. Then we interact with the SIM card, and the user can type in a PIN to get access to the network, and we can receive or initiate calls. We can also initiate a mobile data connection, which I'm going to do now, basically switching on the option to use mobile data. You can see there is also the option to use Wi-Fi. Now you see the three dots over here; they tell us that the connection is currently being set up. Once this is done, we see an IP address appearing here, which means we have data connectivity. And with this data connectivity, we can now do interesting things, like installing further software.
The image, when you first install it, comes with a few example systems. These are systems for the other side of the device; you can switch between the two sides using this gesture on the left. This, for example, is a very minimalistic example of an interactive application running as a subsystem on the user-defined side of the device. There are a few other examples, like this small oscilloscope that just shows microphone data. You can see that when switching to the other side, nothing is really visible there at first; that's because the microphone is still not enabled. The user must enable the microphone first, and then you can see that the application on the user-defined side can observe the audio data from the mic. There are a bunch of other examples; I think the most interesting one is a web browser that we ported to the system. This is based on the Chromium engine, the Morph browser specifically. In order to bring this to our OS, we also had to port, for example, the Ubuntu Touch UI toolkit, nowadays called the Lomiri UI toolkit, and enable the GPU and things like that. You see here that the browser is running. It's not super smooth, but keep in mind it's a PinePhone we are running on. It's actually usable: you can browse websites, including modern JavaScript-based sites. I think visiting GitHub is also possible. This was basically what I could show you last year. Okay, you see here a number of controls that you may know from the Ubuntu Touch UI project. So, what do I want to cover this time? Shortly after my talk here at FOSDEM, we published a first image for the community to try out and give feedback on. And once you get user feedback, of course, you have to incorporate it somehow; you have to do something with it.
Then you want to give the user a new version, and the user needs to install it. So how can this interplay work in a way that is enjoyable for both sides? I also want to talk a bit about the first wishes from the users, and then move on to how to bring software onto the device. First, when speaking about user feedback, you have this loop: the user installs the system, originally on an SD card, following the instructions on the website, and gives feedback to the developer; the developer improves the image and publishes it; the user installs the new version and gives feedback again. The question is how fast this loop can turn and how frictionless it can be. Friction comes in at two places. When the user wants to install a new version: can the user trust the new version? It means downloading something from the internet. And how much work is it to install? If it's a really big operating system upgrade, it's a lot of effort and also a risk: what happens when there is a regression and you want to roll back to the previous version, for example? On the developer side, it comes down to labor: the developer has to put thought and work into improving the image, and then also build it, publish it, and host the images for the users. Those are the costs on the developer side. (That ringing is okay; it just disturbs you somehow.) So we tried to look at this cycle in a holistic way. The developer cycle can come down to about five to twenty seconds: for the image that I will show you in a minute, iterating on these UI tweaks took about five to twenty seconds, depending on whether I could start the whole thing on my Linux system or on the PinePhone via fastboot. Publishing a new version takes about three minutes, or thirty minutes if I do a full release.
This is all done from my laptop, so I don't need any special hardware for it. Out of this complete process comes a really small image, about 16 megabytes, and the user can simply install it. It's signed by the developer, so there is some integrity protection. And the nicest thing is that the installation is very simple and very transparent: it basically replaces just one file in the boot directory, and the user can instantly roll back to another version if a regression occurs. So let's try this as a demo. To do that, I first have to start a USB webcam over here. Let me see if this works. Okay, here's the webcam. Can you see this? Okay. There are a few risks involved, because the update will run over the air, so I hope I get some connectivity over here. You have now seen the boot of the image from the SD card; it comes up quite quickly, but it's also a small image. Let's try to connect to the Wi-Fi. Okay, I think I tested this first. Let me see if I get an IP address. Ah, I got an IP. Okay. Now we can go to the software dialog over here; there's this tab, Update. One concept is that we can select different software sources, which are basically URLs. This, for example, is the software source of my company, Genode Labs, but I can also select other sources, like those of my colleagues, or this guy here, nfeske; that's me. I can check what nfeske has to offer. Now some metadata is downloaded, and I see there are different images offered by me, and I can get some information about each of these images. This was the last real release, and there's a new image, the FOSDEM edition. Let's try to download this one. And, yeah, I'm lucky that the wireless connection works well.
You can see the progress of the download. With these buttons you can also download the other versions; you can have any number of versions downloaded, from different software sources, and keep them on your system. We are almost there. Okay, now the integrity is checked using OpenPGP signatures. Everything went smoothly, and now I can install this image to my system, which basically copies one file to the boot directory. It says: reboot to activate. Let's do this. I go to the device section and say reboot, and I have to confirm it. Now it's doing a hardware reset, and I'm crossing my fingers. Sometimes the boot loader... ah, now it's actually working. For anyone of you who also grew up in the 80s using Atari 800 computers, you may recognize the font and the color scheme; they are inspired by my childhood. But what you see here is really a custom image. I hacked this together in the last week; we had a lot of fun with these graphics. It has basically the same functionality as the regular Sculpt image, but you can see that the appearance has changed completely. It's a completely different image. Using the update feature, I could now also go to another software source, install another version, and switch back to the earlier version. Okay, let's continue. The first response we got from the community of users was about power. The PinePhone is quite well known for not lasting very long on battery, and people found it quite unacceptable that we left the screen on all the time. So they asked: how about implementing some kind of screen saver to save energy? That was the first thing we considered, and I will give you a brief tour of how this normally works.
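The update model just demonstrated, verify a downloaded image and then activate it by replacing a single file in the boot directory while keeping the old one for rollback, can be sketched roughly as follows. All file names here are made up, and a checksum stands in for the OpenPGP signature check that the real system performs:

```shell
set -e
# Hypothetical sketch of the "replace one file, keep rollback" update model.
rm -rf /tmp/boot-demo && mkdir /tmp/boot-demo     # stand-in boot directory
echo "previous release" > /tmp/boot-demo/image.img
echo "fosdem edition"   > /tmp/new.img            # the freshly downloaded image
sha256sum /tmp/new.img > /tmp/new.img.sum
sha256sum -c /tmp/new.img.sum                     # integrity check before install
cp /tmp/boot-demo/image.img /tmp/boot-demo/image.img.prev  # old version kept for rollback
cp /tmp/new.img /tmp/boot-demo/image.img                   # "install" = replace one file
```

Because the previous file is still present, rolling back is just copying it back, which is what makes regressions cheap for the user.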
When you speak about power management on the PinePhone, at the bottom you have this power management chip, which is in control of the actual voltages, power rails, and the battery; the power button is attached there too. These are the lowest-level, electrical concerns. Once the PinePhone is switched on, the application processor starts up together with a kind of companion chip, the system control processor. These are completely separate: one is an ARM processor, the other a small microcontroller based on an OpenRISC 1000 CPU core. The first thing that happens at boot is that the ARM Trusted Firmware is started, and this loads the firmware into the system control processor. This firmware is also open source nowadays, which is pretty cool. It is meant to interact with the power management chip, and it can keep running when the application processor gets switched off, if you want to save power. Then the Linux kernel is started, and you have this bunch of drivers. One driver talks through a mailbox device and shared memory to the firmware; you can give commands, for example, for suspend or resume. Then you have drivers for the display, for the touch input and so on, all as kernel drivers. On top of that, you have the user space that uses these kernel services, like the input driver services and kernel mode setting, and on top of that, the applications. That's the traditional architecture that you may know. With Genode, we can be a bit more flexible, so the picture looks like this. What's the same is the startup: we have the ARM Trusted Firmware, but this time it loads a custom firmware, which is basically a small Forth interpreter. You have to know that the execution environment over here is just about 16 KB, so it's really small.
So we put a small Forth interpreter there, but left it basically as a hull: it has no predefined functionality, it's just an open-ended Forth interpreter. Then the system boots a small microkernel, and on top of that, things get turned upside down, because here we have the GUI server running directly on top of the microkernel with no dependencies underneath. So it can run without any driver running. The drivers come in later and connect to the GUI server as clients. So we have put this upside down. The applications talk to the GUI server using the GUI interface over here. The display driver talks to the capture service, which is the same service you would use for screenshots and things like that, and the touchscreen driver talks to the event service for injecting input events. Then there is the platform driver over here, and this guy has the job of arbitrating access to the physical device resources, like interrupts, memory-mapped I/O and things like that. So then, for example, the display driver comes to the platform driver asking for a platform session; the platform driver turns on the right power switches and the right clocks, and the driver can do its work. And then you have the power driver here. It uses this interface over here to send Forth commands to the Forth interpreter, and can basically extend it from there at runtime, which is quite flexible. When we started this system, it initially drew about two and a half watts, which is quite a lot. And now, when it goes to sleep... five minutes, oh, okay, I have to hurry up. This picture is basically different. You see this difference? We just removed two components, and that's it. The power draw goes down to just one watt, also by tweaking some voltages. Okay, live sleeping demo. I don't know if I should really show this given the time constraints, but let's do it quickly.
So now it's sleeping, you see. I would also have to connect the console here with picocom... oh, no. Okay, I will skip this small demo. I wanted to show you how the drivers come up, but given the time constraints I will just skip it. It's a bit sad, but yeah. Okay, here we were. The last point I wanted to talk about is the question of extending the system. We identified this whole bunch of work items that a developer typically has in front of him, and we touched on parts of this in the previous talks; the Flatpak talk was quite interesting in this respect. You have a bunch of different tools and different build systems and so on to consider, and all these different steps. This is quite complicated, and when targeting a system, the developer is confronted with all of it. So we came up with a tool called Goa. It's called Goa because it's a goal, but reached a little bit sooner. It assists the developer with these steps, and I will show you an example of how the Goa tool can be used. Using my Linux VM over here, I go to my Goa playground directory, where I'm just playing around. I wanted, for example, to port the Atari 800 emulator to this phone. So I can do a "goa run", and this runs the emulator here, now on Linux. When I do a ps, I can see that these Genode components you see in the background, with Atari BASIC running here, are actually Linux processes. So this is the Linux version of Genode running inside the VM here. And there is a nice demo I wanted to show you for the Atari 8-bit. I can make a modification here in this runtime file, for example adding this argument to the emulator, and do a "goa run" again. You see that this now runs a small graphics demo that's quite famous on the Atari 8-bit. And you see the cycle is really fast. Okay, correct.
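The edit-and-rerun loop just shown boils down to a few commands. This is an illustrative sketch only: "goa run" is the command demonstrated in the talk, while the directory and file names below are hypothetical.

```
cd ~/goa-playground/atari800   # hypothetical project directory
goa run                        # build and run the scenario on the Linux host
$EDITOR pkg/atari800/runtime   # tweak an argument in the runtime file (path illustrative)
goa run                        # seconds later, the change is live
```

The same project can then be retargeted at a real Sculpt system, which is what the next step of the demo attempts.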
Then, when I want to try this out on a real machine, I can say I want to target the Sculpt system over here. On my host system I start the test environment, and I can say I want to target Sculpt and give the information about where this Sculpt server, where this should run. Okay, thank you. Still not going. Time's up. Time's up. Okay, so I will stop here and invite you to catch up with my colleague Johannes, who will give a talk in the microkernel devroom later at 6:30. I will be there as well. So if you want to get in touch and discuss where we should go: thank you.
The Journey to Ubuntu Touch 20.04 on PINE64
Hello everyone, thank you for coming, and thank you to all the live streamers. A little bit about me: I'm a college student living in the US. I've been doing a lot of tech tinkering on open source stuff since I was little, and there's been a lot of that experimentation in my house. Ubuntu has also been a very common operating system in our house, just as much as Windows or macOS, so I have a particular affinity for it. On top of that, the Asahi Linux project that came out in 2022 sparked my interest and reminded me what my mobile devices were capable of running on their chips. At the beginning I was running virtual-machine Ubuntu images on my iOS devices, but that wasn't native, those were virtual machines, so I wanted a native Linux-first device that was also affordable and accessible, and that is where PINE64 particularly stands out. Another fun fact is that "Oren" actually means pine, so I've had a particular connection to them, and an affinity with them and dedication to their work. What makes Ubuntu Touch on PINE64 different from most devices splits in two ways. One, PINE64's devices are not like most Ubuntu Touch devices: as many of the other talks earlier today have mentioned, Ubuntu Touch usually runs on Halium kernels as opposed to mainline kernels, which means a lot of extra components are thrown in the middle to do some abstraction and get the sensors, modem and so on working. On PINE64 devices we don't have to use that; instead, we often have to use our own middleware. Two, Ubuntu Touch is different from a lot of mobile Linux distributions, because almost all of those distributions allow you complete control over your operating system, with a read-write file system and updates as they come.
Ubuntu Touch uses a read-only file system to provide an immutability layer, as well as over-the-air updates, so updates happen in big chunks at once rather than as individual packages as they come. These pieces in particular have made adapting PINE64 devices for Ubuntu Touch a challenge, but a welcome one. Some background starts with the original 16.04 port, which came at a pivotal time for both UBports and PINE64. For starters, there was ongoing work to move from 16.04 to 18.04, although that work was later abandoned in favor of focusing on the jump to 20.04, as the project was focusing mainly on migrating away from legacy tools like Upstart, used when Canonical was developing the project, and towards a systemd-based stack, which the UBports team has done a great job with. Around this time they also announced the renaming of Unity8 to Lomiri, which is still an ongoing process and involved changing the name not just in one place but in every single bit of code, which has caused some incompatibilities, as we will find out later on. The original PinePhone Community Edition shipped with Ubuntu Touch, as did the original PineTab, and both of these ports were developed primarily by one guy, Dalton Durst, who did a lot of work not only on these ports but for the entirety of the UBports team; he was also handling a lot of internal infrastructure. That meant that when the team was working on the eventual switch to 20.04, the PINE64 port had to be pushed aside in favor of a lot of other things Dalton was working on. Another pivotal moment came in 2022, when Dalton left the development team to work on other projects, which left the PinePhone port completely abandoned at that point. PINE64 also came out with the PinePhone Pro Explorer Edition, which was around the time I started getting interested in the device, but notably the device didn't have an Ubuntu Touch port, which meant I had to make one.
My process with this port originally began with looking at some of the other builder scripts that were around. Notably, there's one linked on the wiki called the DPA image builder, which taught me a lot about how the images are structured and allowed me to create this chart here. What's important about the PinePhone Pro is that the bootloader lives on a separate SPI chip rather than within the images themselves, which meant I didn't have to pack it anymore, a great benefit. We can also use Tow-Boot in particular as our bootloader, which lets us dual-boot using the volume keys, or even switch into mass-storage mode to flash the device directly from any other machine. But as I quickly found out, most of the fun was in the kernel, and it didn't work immediately when I booted it, because at the time the PinePhone Pro device tree files were not in the mainline kernel yet, so I had to pull them from downstream. A lot of my kernel work has reflected Megi's work, and it was looking at his work that helped me figure out how to get those device trees in. Once I was past that process, I had a booting Ubuntu image, but this was not a distributable image; it was built manually and was heavy. So I had to switch to making a proper Ubuntu Touch port. It uses a very similar process, but slightly different: rather than debootstrapping from scratch, we pull a CD image from Ubuntu Server and then use a program called debos, which can open a Docker or Podman container and build on top of that CD image to create our final distributable images. Last year I wasn't here, but an early stage of my PinePhone Pro port was shown off at the FOSDEM stand, and this year I now have four devices, the PinePhone, the PinePhone Pro, the PineTab and the PineTab 2, all running a much stabler version of the port.
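The debos step described here can be sketched as a minimal recipe. This is a hedged illustration, not the actual UBports recipe: the file and package names are made up, but the top-level keys and the `unpack`/`run` actions are part of debos's documented recipe format.

```yaml
# Hypothetical debos recipe sketch: unpack a prefetched Ubuntu base image
# and customize it inside the build container. Names are illustrative.
architecture: arm64
actions:
  - action: unpack
    file: ubuntu-base-arm64.tar.gz   # the base/CD image pulled beforehand
  - action: run
    chroot: true
    command: apt-get update && apt-get install -y example-session-package
```

The appeal of this approach is that the whole build runs inside a container, so the distributable image can be reproduced on any machine with Docker or Podman.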
Once I had the PinePhone Pro ported, it was time to move on to the PinePhone, which was still stuck on 16.04. I didn't have a PinePhone myself, but I could do some research in the meantime, and I found out that there was no reason why I couldn't support both devices inside one kernel image, which I also learned from Megi's work. Once I had a unified kernel, I also found out that we could use Tow-Boot on the PinePhone as well, which once again removed the necessity of packing the bootloader into our images. I asked someone to try it out on their device, and sure enough it worked, which was wonderful; we had both the PinePhone and the PinePhone Pro up within about two weeks of each other. Shortly after that, the PineTab 2 pre-orders went live. At this point I was looking to make another port, and the UBports team actually reached out to me and asked: do you want us to send one to you so that you can make the port? I happily obliged, and they also sent me one of the original PinePhones to maintain at this time. The PineTab 2's port was very similar to the other ones, and I had the hang of it by this point, but it was too early for a Tow-Boot port to exist yet, so we had to use the U-Boot binaries, which meant going back and learning how to pack the bootloader into the image properly. Luckily, besides the bootloader, the rest of the process was essentially the same. After the PineTab 2 port, another community member reached out to me and said: hey, I see you have these other three devices ported, and I've got an original PineTab sitting in my drawer not doing anything; would you like me to send it to you so you can create a port for that as well? Once again I said of course. Unfortunately, Tow-Boot doesn't work on the original PineTab either, because the production run of the original PineTab was quite limited, so the main maintainer of Tow-Boot never got his hands
on the device to create that port. So we used the PineTab 2's process again and packed the bootloader back into the images. That left two parallel sets of images: a PinePhone set without the bootloader in it, and a PineTab set with the bootloader in it. Notably, the PineTab and PineTab 2 use different bootloaders because they have different SoCs, so there are individual images for each of those devices. I was also warned about using kernel versions greater than 6.1 on the PineTab, because apparently it would cause a kernel panic and an infinite reboot. I found that this was partially true, but it was a very easy problem to solve: all I needed to do was change a module from built-in to a loadable module, which lets it load after the DRM subsystem it relies on, and then it never hits that kernel panic, because it never starts before it's supposed to. As I stated previously, though, a ported device doesn't mean all of its features are working, so there were a lot of software hurdles to get over to reach the state we are in today. Two of the biggest have been rotation and the modem, both due to the niche circumstances of trying to conform to Ubuntu Touch's Halium software stack. In particular, there is a split between what most PINE64 distributions use and what Ubuntu Touch uses: for starters, ModemManager versus oFono, which has also been mentioned in a few talks earlier. ModemManager generally has much better stability with the EG25 modem that the PinePhone and PinePhone Pro use, but with several scripts we were able to get oFono into a similarly stable state. Another of those components was the difference between iio-sensor-proxy and sensorfw.
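The kernel-panic fix described here, switching the offending driver from built-in to a loadable module so it only probes once DRM is up, is a one-character change in the kernel config. A minimal sketch, where CONFIG_EXAMPLE_PANEL is a placeholder rather than the real option name:

```shell
# Hedged sketch: flip a driver from built-in (=y) to loadable module (=m)
# in a kernel defconfig. CONFIG_EXAMPLE_PANEL is a made-up placeholder.
echo 'CONFIG_EXAMPLE_PANEL=y' > /tmp/demo-defconfig
sed -i 's/^CONFIG_EXAMPLE_PANEL=y$/CONFIG_EXAMPLE_PANEL=m/' /tmp/demo-defconfig
cat /tmp/demo-defconfig   # =m: built as a module, probed after the DRM core exists
```

Built-in (=y) code initializes during early boot, while a module (=m) is loaded later by userspace, after its dependencies are available, which is exactly why the panic disappears.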
Sailfish OS also uses sensorfw, and we also use the oFono Sailfish port. The thing with sensorfw, compared to iio-sensor-proxy, is that you have to write your own configuration files for your devices, and it also has to use a second adapter in order to properly read from the IIO buses. You can see on these charts that both oFono and ModemManager can use EG25-Manager, which handles the powering of the modem and a lot of the data exchange with it, and that is how we got a much more stable modem on 20.04 compared to 16.04. As for the sensors: even after all of those patches were properly in place and all of our sensors were reading correctly, rotation still wasn't working, and this was maybe my biggest frustration for eight months. Then one day I decided to look in the log files, and I noticed that the display was being enumerated as "unknown" rather than "DSI"; in some places it was set correctly, but in other places it wasn't. Sure enough, once I had fixed that enumeration in all of the places where it had to be, rotation was working. The other big group of struggles was the read-only images and recovery images, both of which use a special initramfs script. These two components support the OTA updates: the read-only images provide a level of immutability, so that a user can wipe the system into a reset state rather than having to re-flash the whole image, and it also protects the system from too much destruction. And there are the recovery scripts, which allow the device to switch into an updating mode so that it can install those OTA updates, as opposed to installing updates for individual packages live like most Linux distributions do.
While the 20.04 PINE64 images currently release as image files, most Ubuntu Touch images ship their updates as tarballs, which is where we are moving towards, and the recovery image is the final component we need to get the tarballs working. Recently we did succeed in getting the read-only images working, so now we can copy much more of the deployment style of the other Ubuntu Touch images. Looking forward, we have a lot of different types of images we can use. The distribution as a whole is moving towards 24.04, which will likely be around when these recovery and over-the-air images become available, and this rebase is going to be a welcome one for us, because most of the components we backported into 20.04 for the PinePhone Pro and PineTab 2 will already be upstream in 24.04, so we won't have to carry them in our repositories anymore. Outside of Ubuntu Touch, we are also working closely with the Lomiri team, which works outside of regular Ubuntu as well as on Debian, and we are hoping that some of the changes, like the display enumeration fix, can help fix some of those issues on Debian too, with rotation for example. Right now our ports are the closest thing Lomiri has to stability on mainline, but we are hoping to expand that to a more generic set of devices in the near future. And that's about it. Thank you. We have demos of the devices available at the FOSS on Mobile stand in building AW, so feel free to check those out afterwards. Great, first question. You talked about the PineTab 2; there are two versions of that, the Dev one and the Early Adopter one. Is it fixed for both? Yes. Thank you. Thank you, very interesting. Having heard some of the talks today in this devroom makes me feel like this is the early days of ARM system boards, or even worse, like those days when every game had to ship 36 audio drivers.
Do you envision a future where we have a sort of standard platform, like UEFI on PC, for ARM? I would hope so. I think that the Asahi Linux project is certainly a push towards that, and I'm hoping that other companies can follow suit. Hello, great talk. Is it technically possible — you mentioned that the PinePhone images are the same image for the two different Pine phones — would it be possible for non-Pine phones to use the same image if they didn't require a bootloader, or is there a specific reason why they only work on Pine devices? The only reason right now is the kernel. Otherwise we absolutely can boot those images that don't include the bootloader on plenty of other devices. How did you find out to change the kernel module — was it from internal to external? I was looking in the device tree files and I noticed a mention of the display driver in there, but it looked like there was actually a duplication of those mentions. So when I switched one of those display modules from Y to M, it worked, and that's all it needed. In the kernel logs it also said that that display driver was trying to start before DRM was available. A question from the Matrix. I've heard this question before today, but yeah, the question is: any plans on migrating to ModemManager? I saw that question earlier, and I would also hope so, but I don't think that is actually viable right now, because that would mean the whole Ubuntu Touch stack would have to move to ModemManager, and so we instead have to rely on what the rest of the distribution is using, which right now is oFono. Another question. According to the picture, recovery was dropped in the 20.04 layout. Was recovery functionality integrated into boot in the initramfs? So it wasn't dropped, it's just not available yet. It's still a work in progress.
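The display-driver fix described a moment ago — switching the module from Y to M — is a one-line kernel configuration change. A minimal sketch, with a hypothetical config symbol standing in for the actual panel driver:

```
# Hypothetical defconfig fragment (the real symbol name depends on the panel driver).
# Built-in (=y) drivers can try to probe before the DRM core is ready; building the
# display driver as a module (=m) defers its load until DRM is available.
CONFIG_DRM_PANEL_EXAMPLE=m    # was: CONFIG_DRM_PANEL_EXAMPLE=y
```

This is why the kernel log message about the driver starting before DRM was available was the decisive clue.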
I do not necessarily have a question, but I have a quick addition for the person who asked about the standardized boot format — about the DOS games. I think it was that guy. People are moving towards U-Boot and chain-loading U-Boot on other devices, and making repartitioning possible. So in the end it would look the same as on the PinePhone that you developed. So that was a quick addition. Thanks. A follow-up question. You meant kernel options before compiling, with Y and M? Say it again. Did you mean kernel options Y and M? Yes, yes, in the defconfig. Thanks. Could you name a single thing that would make porting to another device easier? What was the hardest thing — what would make your life easier if you had to port to a new device? If the bootloader was figured out for me, it would make it really easy, because as I mentioned with the PinePhone and PinePhone Pro images, it's really just the kernel at that point. It's not hard to figure out what kernel modules you need to get a certain device to boot. Maybe one more generic question. What's the current status regarding full disk encryption in UBports? Say it again. The full disk encryption status in UBports. I actually don't know that. Does anyone — Alfred? Yeah, passing on to Alfred. Yeah, thank you. So first of all, there is no home encryption whatsoever right now, unless manually set up with scripts, in which case you can do that yourselves. We don't provide any default yet, but we want to provide one. And that's probably not going to be LUKS-based encryption, but rather file-based — directly file-based, with ext4- and F2FS-based solutions. Because the Android devices have Android partitioning schemes, with various differences that make it pointless to do full disk encryption in the way we're used to from the desktop.
And with it being on the userdata partition, we can ensure that selected things inside userdata are encrypted, like the home directory of the main user of the device. In that case we can unlock it with the same on-screen keyboard that the Lomiri desktop uses, without having to add an on-screen keyboard to the initramfs early in boot — so that they don't look different, so that they look cohesive, work with similar technologies, and form one completely fitting thing that does it all for you. So in this case: full disk encryption, probably not; file-based encryption, or filesystem-based encryption, more likely. There have been experiments with that, and they were successful. How did you feel when you first successfully booted Ubuntu Touch on the PinePhone? It was an awesome feeling, but as I mentioned, I have been tinkering with tech for a long time, so it was also a very familiar feeling of: oh yeah, I got it working. Thank you.
Towards a bright future with Mobian?
Thank you all, and thank you for attending this talk. So yeah, I'll be talking about how we can improve our future as mobile Linux users, especially with Mobian, but this all applies to other similar projects such as postmarketOS and so on. The first question you might have is: who is this guy? Basically, I'm working as a Senior Software Engineer at Collabora. I'm dealing mostly with building and maintaining custom distributions for embedded systems, so kind of related to what I do with Mobian. I've been a long-time FLOSS enthusiast, and I've been a Debian developer for a few years. Back in 2020 — at the last FOSDEM before the pandemic, basically — I got my hands on a PinePhone, and this prompted me to work on mobile Linux in general and to start, and still continue, working on the Mobian project. So what's Mobian, actually? It's a Debian derivative — or, in the Debian jargon, we call that a blend — which targets mobile devices such as smartphones and tablets. It has a separate package repository and provides ready-to-use disk images you can flash on a few devices. It's actually a very small overlay on top of Debian: we currently provide only 25 source packages in our repository, compared to the vastly greater number in Debian itself, which means that of all the packages you have access to from a Mobian device, more than 99.9% are pure Debian. We have a few packages with downstream patches which can't be upstreamed at the present time — half of those are kernels, a few others are userspace applications — and we're working on dropping those patches and finding upstream-friendly solutions. We also have a few packages which are basically workarounds, because the feature does not exist in the upstream world, not yet at least. One of those is, for example, Millipixels, which is the camera application for the Librem 5.
Once the Librem 5 gets supported by either or both of Megapixels and libcamera, we can basically just drop this package and rely on upstream applications. And finally we have six Mobian-specific packages which are to be reworked for inclusion in Debian itself, so we can lower the impact and footprint of Mobian. We hope that we can get below 10 packages by the end of next year. We'll see if we make it, but that's our goal for now. So, latest developments — what happened over the past year? We had our first "stable" release — with air quotes around "stable". Basically, we released Mobian Bookworm at the same time as Debian Bookworm was released. So that's our stable release. It doesn't mean it's bug-free; it just means that we don't do huge upgrades, only targeted fixes, so the system stays stable and keeps working as it does currently, even after software updates. It was released in June last year. We have a few devices supported out of the box, which are several Linux-first devices: the PinePhone, the PinePhone Pro, and also the Librem 5. We support a few Android-based devices thanks to the work of the community, especially on the SDM845 kernel support — so we support the OnePlus 6 and 6T and the Pocophone F1. And we also provide x86 images for desktop PCs or x86 tablets such as the Microsoft Surface Pro and Go. We provide a single desktop environment in this release, which is Phosh. And we ship 6.1 kernels — 6.1 is not the latest LTS but the previous LTS branch, meaning it's supported until 2026 if my memory is good. We have a script in CI which runs daily and automatically rebases all the kernel packages we have on 6.1 onto the latest point release. So basically, when there's a security update, usually the same day or the day after, the kernel is up to date in the bookworm-updates repo, which is basically our staging repo for the stable release.
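The daily rebase that CI script performs can be sketched as a couple of commands (the remote and branch names here are invented for illustration; the actual Mobian CI job differs):

```
# Fetch the latest 6.1.y point release from the stable kernel tree
git fetch stable linux-6.1.y
# Replay the Mobian downstream patches on top of it
git rebase stable/linux-6.1.y mobian-6.1
# Bump the Debian package version so the update lands in bookworm-updates
dch --newversion "$(make -s kernelversion)-1" "Rebase onto latest 6.1.y point release"
```

Automating this is what lets a kernel.org security release reach the staging repo within a day.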
There are, however, a few things we wanted to include in this release that couldn't make it. The first one is universal images. The plan here was to have a single kernel package for all supported devices. It works quite well for SDM845 devices, because they already share a single kernel and the people working on those devices all put their patches into the same repository. But the PINE64 devices, for example, are based on different chips — Allwinner A64, Rockchip — and it turned out that making a single kernel package out of those was trickier than we anticipated, so we basically dropped this effort at some point and focused on just having per-device kernels, at least for this release. So we couldn't make universal images, obviously. We also didn't find the time to improve the hardware support upstream. We still carry lots of patches: across all the devices I mentioned, it must be a total of 800 to 1000 downstream patches in the kernels alone, which is quite a significant amount. We'd like to get them upstream, but we all have day jobs, and for now every day is still only 24 hours, so we have to make choices. We also wanted to switch to the latest LTS kernel, which is now 6.6, and finally realized that we couldn't, because we didn't have the time or resources to spend on that. That means Bookworm is stuck forever on 6.1, which is not too bad, because the life cycle of Bookworm will end in about a year and a half, and until then this kernel will still receive security updates and bug fixes. So as long as Bookworm lives, the kernel lives along with it, and we stay up to date and avoid security holes anyway. However, the next release, which I'm about to talk about, is Trixie, and it is already on 6.6. So what about the recent developments? We are still trying to unify our disk images, slowly. Instead of aiming for a single image for all devices, we're taking a step along this path and trying to ship just one image per kernel.
Until now we had one image for the PinePhone, one for the PineTab, another for the PinePhone Pro and the PineTab 2, and so on, because some of those devices required hardware-specific tweaks to be included, with configuration scripts, udev rules and so on. We came to a point where most of these tweaks weren't needed anymore, because upstream had caught up and had the necessary features for those devices. So instead of having one image per device, we could envision having one image per kernel. And so we have our kernels per architecture — per sub-architecture, really: one for the Allwinner A64 devices, and one for the Rockchip-based devices, which are the PinePhone Pro and the PineTab 2. Those are two different SoCs from Rockchip, but we can still use the same tree, and so on. It was already working well for the SDM845 devices, but we took this step a few weeks ago, and it reduced the number of images we were building quite a bit. Regarding Qualcomm-based images, until now we had one image for the SDM845 devices and another one for the SM7225, which is basically the Fairphone 4, because we used to maintain different kernels for all of those. This is going to change — it actually already changed recently — because we pretty much imported all the patches we needed into a single kernel for all the Qualcomm devices we support. There are not many of those, which is why we manage to do that, but for now we have a single kernel which handles all the SDM845 devices (OnePlus 6 and so on), the Fairphone 4, which has a different chip, and also the Fairphone 5, which has yet another chip. So we have a single image for all Qualcomm devices, and we just use a simple config file at build time to generate the boot image for each device — because although the root filesystems are identical, the boot images are really device-specific: they need the device tree appended, the specific ramdisk, and so on.
But other than this boot image generation, everything is handled at runtime using droid-juicer, which fetches the binary firmware from the Android vendor partition — those devices ship with Android first, so the firmware is already present on the device. This makes things a bit easier for us, because we don't have to care about the firmware license: we don't distribute it, it's fetched at runtime from data which is already available on the device. There's also a small package with Qualcomm phone tools, which basically just includes a few scripts and configuration files that are the same on all the Qualcomm-based devices we support. In the process, we're also adding a simpler way to add support for a new device, at least if it's Qualcomm-based. The thing is, until now we needed to have a kernel package in the Mobian repo and a few device-specific tricks in the image build process. We created a new target for these build scripts — build recipes, basically — which is a Qualcomm work-in-progress target, a kind of dummy device. The point is that you can separately build, or rather cross-compile, your downstream kernel using the bindeb-pkg make target, which is supported by upstream Linux, so you don't have anything specific to do there. It generates a Debian package which you can drop into the Mobian recipes folder; you edit a config file, run the build script, and it provides you with a rootfs image and a boot image tailored for your device. Then you can flash it using fastboot and hopefully celebrate that your device can run Mobian. It's almost never that easy, but the point is that we're moving the complexity from knowing the internals of the build system to just debugging the kernel booting on your device. There's nothing Mobian-specific in that — it's just general debugging — and we basically made sure it was as simple as it could be on the Mobian side.
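The porting workflow just described can be summarized as a handful of commands. This is a sketch, not the exact Mobian procedure: the directory and partition names are invented, and only `bindeb-pkg` and `fastboot` come from the talk itself:

```
# Cross-compile the downstream kernel into Debian packages
# (bindeb-pkg is the upstream kbuild target mentioned in the talk)
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- bindeb-pkg

# Drop the resulting package into the Mobian build recipes,
# edit the device config file, then run the image build script
cp ../linux-image-*.deb mobian-recipes/

# Flash the generated boot image and rootfs with fastboot
fastboot flash boot boot.img
fastboot flash userdata rootfs.img
```

From there, any remaining work is ordinary kernel bring-up debugging rather than build-system archaeology, which is exactly the shift the speaker describes.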
And we also have a small FOSDEM present, in the sense that Mobian now provides Plasma Mobile images — it's been a week since the first ones were published. It actually started over a year ago, and the goal from the start was to have everything in Debian itself rather than carry downstream packages in Mobian. So Marco, one of the Mobian developers, worked on that for more than a year, and he managed to get all the needed packages into Debian itself, including the Plasma Mobile meta-package — you just have to install it, apt install plasma-mobile-full for example, and it pulls in all the packages you need — and from there we could build our Mobian image. So that's basically what happened over the last year. Now, what's next? We're trying to take a step further towards universal images. I've talked about the kernel issue — unifying all patches into a single kernel — but there are also all those little device-specific tweaks I mentioned earlier which have to be handled, and until now we had per-device packages. That means one new package in the repo for each new device we want to support, and this approach doesn't scale at all. It works fine if you manage 10 devices; if you aim for tens, or let's hope hundreds, of devices, it's just too much work for a small team. So the idea here is to have a runtime service which identifies the device it runs on — using the device tree "compatible" property, for example, or the ACPI/DMI vendor and manufacturer strings on x86 — then selects or generates the needed config files, puts them into a runtime-generated Debian package, and installs it on the device, with the ability to place triggers on that, so that when a specific config file is modified by another package, this tweaks package is regenerated, rebuilt and updated as well. That's something we hope to achieve this year, as well as getting closer to a Pure Blend.
This is a specific class of Debian derivatives, and it involves having all the packages in the Debian repository. This is our next step once we have working runtime tweaks management: basically, it would mean having all our meta-packages, tweaks packages and so on in Debian itself, so you can install everything Mobian from the Debian repository. Not all hardware features will work unless you use the Mobian-provided kernels, of course, so Mobian will stay relevant for some time at least, and we'll still be able to generate ready-to-use images, which makes things easier for users compared to building everything themselves from the Debian packages. Another big topic is call audio management. A few years back we created callaudiod, a daemon monitoring the phone call status and switching audio profiles and routing on the fly, depending on the situation. This was in a PulseAudio world, and back then PulseAudio didn't really bother with such things — the only automatic switching it did was when you plugged in headphones and so on — and we made sure that callaudiod disabled that on the PulseAudio side. But now we are living in a PipeWire world, and with PipeWire comes a session manager, which by default is WirePlumber, and the session manager is meant to do exactly that: switch audio profiles and routing to match the current situation. And so callaudiod races with WirePlumber — most of the time it loses — which means you're having a phone call and you actually don't hear anything in the phone earpiece, because WirePlumber did the switching right after callaudiod instructed PipeWire to do so. So there's clearly a conflict there, and the goal is to make callaudiod basically a part of WirePlumber itself. This needs some work in PipeWire to make it aware of the modem and to monitor the phone call stages, but we hope to submit an initial RFC implementation at some point this year. No promises, obviously.
And finally, we plan a few other minor improvements. Most of the project's development process and infrastructure is under-documented, as is most often the case. We have very user-centric documentation written by users, but we are very few developers and we didn't take the time to document the rest. We'd like to improve that, because basically a significant portion of the project has a bus factor of one — which is me. So I'll try to change that, make sure we have backup solutions, and become more welcoming to other contributors. And we'd also like to keep working on upstream device improvements. The PinePhone Pro has a few low-hanging fruits we can probably upstream easily. The support for the PineTab 2 is being merged upstream as we speak — it now has a working Wi-Fi driver, and we'll have to see if that can be upstreamed as well. We also hope to support the PineTab-V, which would be the first RISC-V device supported in Mobian. And we obviously welcome contributions to support more devices, to help us with documentation, and to basically help us make the mobile future brighter for all of us Linux mobile users. So here are a few links; I've put the slides up. Thank you very much. Questions? Hi. So I was profoundly disappointed to read your blog post in October about the travails with the PinePhone kernel, and the fact that essentially all of the work that had gone into the PinePhone kernel in megi's kernel tree was not being upstreamed — which I presume was the case really since the PinePhone came along. So I was wondering what had happened, whether anything had changed on that front, whether megi was upstreaming patches now, or anyone else, and what the situation was. For the original PinePhone, the current situation is that someone in Mobian stepped up to maintain and update this kernel.
He also started upstreaming a few patches and is monitoring the kernel mailing list and working with upstream to improve the situation over time. There's lots of work to be done. I know there's also another person who has started working on a driver for the Wi-Fi chip — until now it was a downstream Realtek driver, full of crap basically, and nothing close to being upstreamable. The new driver will hopefully be upstreamed, and that's already one big pain in the ass which will be removed. So now there's a bit more hope for the original PinePhone, and if things continue that way, it will probably be great. A question from the Matrix: is there any plan to port eg25-manager to libgpiod 2.0? Right — eg25-manager is a very specific piece of software for the modem found in the PinePhone and PinePhone Pro. It uses GPIOs through libgpiod, and there's a new release which changed the API completely. The thing is, for now libgpiod version 2 isn't packaged in Debian — version 1 is — so I don't have any definite plan yet. The plan is: once version 2 is in Debian, we go with it; before that, I'm not sure I have the time to deal with all of this. But merge requests are welcome, as always. Yeah, so a question regarding your tweaks approach. Why do you want to build the tweaks on the device, package them there and then install that package, instead of having just one package that carries all the tweaks? The thing is, we will have one package carrying all the tweaks, but those tweaks can conflict with each other. You can have conflicting configurations for oFono, for example, and depending on the device you have to select the right one. You also have devices which can't suspend, because otherwise they don't resume, and other devices which can. So you have to select the appropriate tweaks, and the idea of creating a Debian package is that the packaging system is aware of those files.
If you have some files and the user changes something, the packaging system won't overwrite them with a file from another package. If we don't build a package on the device and install it — if we just move files around — the packaging system will not be aware of those files, and if at some point a Debian package ships a file with the exact same name, things will break. So that's the idea. Alright, please give another round of applause for Arnaud. Thank you.
Exploring Quarkus Native: Choices and Implementation
Hello everyone, I'm Foivos Zakkak, and today I will talk about Quarkus Native, some choices it makes, and how it implements them. So, how many of you are familiar with Quarkus — know what Quarkus is? Well, fewer than I expected. Okay, so what is it? It's a Java framework — well, an open-source stack for building Java web apps — so a Java framework that aims to bring developer joy, is Kubernetes-native, brings in best-of-breed libraries and standards, and supports both imperative and reactive code. And that stopped working. So what does a framework typically do when you use it? Usually you write your Java application using the framework, then you package it, ship it wherever you want to deploy it, and start the application. What it does then is load configuration files, perform some annotation processing, create some metadata graphs or whatever is needed, and eventually run the application. What Quarkus does to improve that situation is move part of this configuration to build time, so you run the configuration and setup of your application only once, and then when you deploy your application, it starts up faster and you don't have to repeat the whole process. One benefit of this Quarkus feature is that it also allows you to go native: instead of deploying on the JVM, you can deploy a native binary. So why would someone want to go native? We have put so much effort into making the JVM very mature, very stable, very high-performance, et cetera — so why go native? Without going into too much detail, I will list some of the pros and cons. First, the pros. One of the major advantages of going native is faster startup, because you don't have a JVM that needs to start up, load classes, do class initialization, warm up, and so on.
You also get close to peak performance right from the beginning, because nothing is compiled just in time — everything is compiled ahead of time, and that gives you close-to-peak performance from the start. You get a smaller standalone binary — a hint here: I'm comparing with shipping your application together with a JVM; otherwise the JAR file alone is smaller than the binary. You also get a smaller memory footprint when running your application, because you don't have to keep all the data the JVM keeps to track internal things. And another benefit is that if you launch the same application multiple times on the same host, the instances can share the heap as a copy-on-write memory segment. Now, what are the disadvantages? First of all, you get a slower development cycle: compiling to native takes longer than compiling to a JAR file. So we suggest that you develop and debug on the JVM, and only move to native when you are happy with your application, because the native build takes some time. You also get lower peak performance, because when you run a binary you don't get just-in-time compilation, so the compiler doesn't have the benefit of profiling your code to do better optimizations. The JIT can also perform very aggressive optimizations, relying on the deoptimizer to fall back to a slower version if something doesn't go as assumed at compilation time. Another issue is that security patches require recompilation. If a third-party library is vulnerable, you can't just update the JAR file of that library and skip recompiling your code — you have to rebuild, because parts of that third-party library might be embedded in your application. Your application is also not portable: you lose the write-once-run-anywhere principle, because you are generating a binary that will only work on the target platform you compiled for. And last but not least, native lags behind in terms of tooling support.
So debugging is not as simple as in the JVM world, and the same goes for observability — that doesn't work as well there. Okay. Now that we have seen that there are some benefits to using native code, let's see how it works. Quarkus uses GraalVM — particularly GraalVM's native-image tool — to generate binary code from Java code. GraalVM takes as input your Java application classes, the JDK classes, and the Substrate VM classes. Substrate VM is a thin runtime layer that allows your application to run on bare metal; it takes care of some of the system-level things going on. Then it performs a static analysis, which allows it to do dead code elimination — it essentially doesn't compile any code that you don't need. If your application doesn't reference some part of your classpath or your dependencies, it won't go into the binary. So it creates a graph where your Java application references some JDK classes, the JDK classes reference some Substrate VM classes, and it eventually compiles all that into a native binary. However, GraalVM comes with some limitations: there are things that are not supported, and things that are supported but need manual configuration. Some of the unsupported parts are currently work in progress — I don't have enough time to go through them all. So what does Quarkus offer on top of that? GraalVM takes Java and produces native code, so where does Quarkus Native come into play? Because of the limitations I mentioned earlier, developing native applications for GraalVM's native image can be painful, and that's where Quarkus comes in: it aims to help Java developers write their application and compile it to native without having to handle all the extra things that GraalVM native image requires. First, Quarkus drives all the gathering of the metadata that GraalVM needs.
That means: what's reflectively accessed, which JNI interfaces are used, which resources we want to include in our binary, and so on. Another benefit is that most of the ecosystem — anything that comes with Quarkus — is already supported for native-image compilation. So if you want to use a library that's already supported by Quarkus, you don't have to do anything special: you just add it as a dependency to your application, and it should work with native as well. Quarkus also minimizes the dependencies, because it already does a dependency analysis before going native; that lets you pass fewer things on the classpath and helps the static analysis do its dead code elimination. Furthermore, through annotations, APIs and some configuration properties, Quarkus allows you to further refine the configuration of your application for native. Now, some might think Quarkus is not the only framework that does this, right? So why Quarkus? Quarkus takes an opinionated approach, and it differs from the other frameworks in that it tries to build-time-initialize all the classes, while by default GraalVM's native image runtime-initializes them. This might create some issues, so Quarkus takes care of re-initializing anything that's necessary — like random seeds or platform-specific values — and it also resets fields that we don't need at runtime. It also doesn't allow incomplete classpaths: when you build, everything needs to be on the classpath, otherwise the build fails, and this ensures that you won't get any unexpected NoClassDefFoundError at runtime. And last, it uses Mandrel instead of the upstream GraalVM Community Edition; Mandrel is based on the Eclipse Temurin OpenJDK build instead of the Labs JDK build, and it's specifically tailored to Quarkus and maintained by Red Hat. So how does this really work under the covers? First of all, Quarkus takes care of generating the GraalVM native-image JSON configuration files.
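To make the JSON-configuration point concrete, here is a plain-Java aside (not from the talk; the class and config entries are chosen purely for illustration) showing the kind of reflective access that forces such metadata: on the JVM this just works, but GraalVM's closed-world analysis cannot see a class named only in a string, so native-image needs it declared, for example in reflect-config.json.

```java
import java.lang.reflect.Method;

public class ReflectionDemo {
    public static void main(String[] args) throws Exception {
        // The class name arrives as a plain string, so static analysis
        // cannot see this dependency and dead-code elimination would
        // otherwise strip the target class from the native binary.
        Class<?> clazz = Class.forName("java.lang.StringBuilder");
        Object sb = clazz.getConstructor().newInstance();
        Method append = clazz.getMethod("append", String.class);
        append.invoke(sb, "hello");
        System.out.println(sb);  // prints "hello"

        // Under native-image, this needs a reflect-config.json entry like:
        // [ { "name": "java.lang.StringBuilder",
        //     "allDeclaredConstructors": true,
        //     "methods": [ { "name": "append",
        //                    "parameterTypes": ["java.lang.String"] } ] } ]
    }
}
```

Quarkus generates these entries for you from the build items it discovers, which is exactly the metadata-gathering the talk describes next.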
It performs code substitutions wherever necessary. Substitutions allow us to patch third-party libraries or even the JDK itself, so if we don't like something there, or if something is not compatible with native compilation, we can adapt it. It generates some bytecode that is responsible for configuring things, it changes the defaults for GraalVM native image, and it also allows the user to pass additional parameters. For the JSON configuration part, it generates five files: one each for JNI, proxy classes, reflective accesses, resources, and serialization. The generation of these files is handled by the corresponding native-image configuration build steps, and Quarkus decides what to put in these JSON files based on the build items that exist in your application — in Quarkus, you define the build pipeline using these build items. Earlier I mentioned substitutions. Substitutions are heavily used in Quarkus because they assist in dead code elimination, and they also make sure that things that are not supported in native code are not reachable, throwing appropriate exceptions instead. Quarkus performs 303 method substitutions and 32 field recomputations in a total of 208 classes. This means you don't have to do any of this on your own — it's already handled by Quarkus, and that's only Quarkus Core; if you use a Quarkus extension, it performs its own substitutions on top. To see an example: here we substitute the method allocateBuffer in this class, and we only do that when zstd is absent from the classpath. What we substitute the method with is a throw of an exception saying the operation is unsupported. So if you compile your code to native and it invokes this method while the zstd library is not available, you will get this exception. And this is how we recompute fields.
So here, in Bouncy Castle's ECPoint, we go and reset the testRandom field, because this is a secure random class and we don't want it to be pre-seeded and pre-initialized in the native image; that way, whenever we restart the application, we get different random numbers. We can similarly change the value of a field by reinitializing it from an alias. That means that we can pass whatever value we want, not just reset it to null. Here we change the unavailability-cause field to put a Quarkus-specific exception in there, and we also substitute the method isAvailable to return false, to show that OpenSSL is not supported in this specific case. Regarding features generation, this is handled by the native-image features build step class, and it will use Quarkus Gizmo to generate bytecode. This bytecode is used to invoke GraalVM's APIs to perform stuff that cannot be done through the JSON configuration. So here is a part of the native image feature that we generate. What it essentially does is: first it gets the method descriptor for the RuntimeClassInitialization.initializeAtBuildTime method, and then it invokes this method, passing it a string array containing the empty string. This instructs GraalVM to build-time-initialize everything, which is different from what it does by default. We can also parameterize the options that are passed to the native image build, and we do that in the native image build step; here we see part of it. What it does is that it always enables allow-fold-methods, which is off by default; it makes our application headless by default; it doesn't allow the creation of fallback images, because fallback images are essentially JVM launchers, so you don't get the native application that you asked for; and we also always ask it to link at build time. And that concludes the talk. I would like to acknowledge that Quarkus participates in an EU-funded project. And I'm ready to take questions, if any. Any questions in the chat?
Yeah, the custom class loader is a bit tricky because of Quarkus. The question was whether Quarkus also supports the standard JDK instead of the GraalVM JDK. So this is the first part of the question, and the answer to that is yes. This is Quarkus native, and this is optional; this is only if you want to go native. If you want to stay on the JVM path, you can use any JDK and it will work just fine. Now to the second question, about custom class loaders. Although I'm not very familiar with that, I think this might be a bit tricky, because Quarkus already uses custom class loaders, so you have to make sure that they are somehow compatible. I couldn't hear the question, so... Okay, you found a library and you wonder whether you can use it or not. Okay, if the library is supported by Quarkus itself, you will find it listed in the Quarkus supported libraries, or in a Quarkus extension that supports this library. In that case, everything should work out of the box and you don't need to do anything. In the case that your library is not supported by Quarkus core or any of the Quarkus extensions, then you need to use some of the tricks that Quarkus does to make it work, and Quarkus gives you some APIs and annotations that may assist you. Let's see... is there a website, like supported libraries, that I can go to and have a look? I think if you go to code.quarkus.io, you can see a list of supported extensions and libraries. Do we have time to get some more questions? One more question. Sorry. I was wondering if Quarkus Native works with JNI-based providers, sorry, the provider interface, not JNI. The foreign API? No, no, sorry, like class discovery when you want to load a specific service, SPI, that's the name, sorry, the service provider interface. I think... I don't know. Okay, thank you. Okay, for the rest of the questions, please feel free to approach me in the break. Thank you.
An in-depth look at JFR in GraalVM and how it compares to JFR in OpenJDK
Hi everyone, my name is Robert Toyonaga and I work at Red Hat. Today I'll be talking a little bit about JDK Flight Recorder in GraalVM Native Image, and from now on we'll just refer to JDK Flight Recorder as JFR. So as a high-level breakdown, I've broken this presentation into two sections. The first section is a high-level overview of JFR in Native Image, and then we'll go into a low-level deep dive of JFR in Native Image and talk about some comparisons between Substrate VM and HotSpot. And I want to make note that even if you're not interested in GraalVM Native Image at all, you may still be interested in the second half of this presentation, because the details of JFR we're going to be talking about there extend beyond just Native Image and also apply to HotSpot more generally as well. Okay, so as a very quick refresher, JFR is an event-based monitoring and profiling tool. It's built directly into the JDK and it can give you some really valuable insights into what your application is doing, both at a high level and also at the VM level. Okay, so Foivos already talked about this a little bit, but GraalVM Native Image is essentially a technology that allows you to convert your Java applications into binary executables. The appeal of this is you get much faster startup and use fewer resources, and a big reason for that is you don't have to warm up a traditional JVM alongside your application code. How it works is you compile your Java application to bytecode like you normally would, and then you run the native-image tool to convert that bytecode into your executable, which you can later run. So why is JFR different in Native Image than in OpenJDK?
The reasoning behind this is that a native image executable doesn't require a traditional JVM to run. However, it still requires certain runtime components that your Java code expects, such as GC and synchronization constructs like monitors, for example, and what's providing that in native images is something called Substrate VM, which you can think of as sort of a scoped-down replacement for HotSpot. So it does a lot of the things that your Java code requires, but strips out a lot of the dynamic stuff that HotSpot does that we don't really need in this environment. And the key here is that since a lot of the JFR code is embedded within HotSpot, when we transfer it over to Native Image, we're using Substrate VM, so it has to be re-implemented in that VM instead. That involves everything from the low-level JFR event instrumentation to the actual infrastructure that ferries that JFR data from the point of instrumentation to the point where it's later consumed by a user. Yeah, so in terms of the current state of JFR support in Native Image, you can do things such as starting and stopping recordings from the command line or from within your application code via the Recording API. Several events are implemented, especially at the VM level: we have events for threads, monitors, allocations, GC, safepoints, etc. You can dump snapshots to disk and inspect them with tools such as VisualVM or JDK Mission Control, as you normally would. The custom event API is also working, so you can create your own custom application-level events. Stack traces and CPU profiling are also possible. Event streaming has recently been added as well. You can even connect via remote JMX to the FlightRecorderMXBean, which practically means you can do things like, from within the JMC UI, interact with JFR recordings that way, start them and manage them on the fly.
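The custom event API mentioned here is the plain jdk.jfr API that ships with the JDK, so a minimal sketch can be shown outside Native Image too. The event name and fields below are made up for illustration:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Recording;

public class JfrCustomEventDemo {
    // A made-up application-level event with two fields.
    @Name("demo.HttpRequest")
    @Label("HTTP Request")
    static class HttpRequestEvent extends Event {
        @Label("Path") String path;
        @Label("Status") int status;
    }

    // Records one event, dumps a snapshot, and returns the snapshot size in bytes.
    static long recordOnce(Path out) {
        try (Recording r = new Recording()) {
            r.start();
            HttpRequestEvent e = new HttpRequestEvent();
            e.begin();
            e.path = "/index.html";
            e.status = 200;
            e.commit();          // hands the event to the JFR buffers
            r.stop();
            r.dump(out);         // write the chunk data to a .jfr snapshot
            return Files.size(out);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(recordOnce(Path.of("demo.jfr")) > 0); // prints "true"
    }
}
```

The resulting demo.jfr file can then be opened in JDK Mission Control or VisualVM, exactly as the talk describes.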
How you might first interact with JFR in Native Image is at build time: you specify the --enable-monitoring flag, specify that you want JFR specifically, and that builds the JFR components into your executable. Then at runtime you can use the normal -XX:StartFlightRecording option and pass all of the normal parameters that you would require, such as specifying a file name to dump the recording to, or a duration, etc. There are still quite a few limitations to JFR in Native Image. Not all events are implemented yet; it's an ongoing effort to keep up with OpenJDK in that area. Specifically, events relying on bytecode instrumentation are not yet supported, and of course there are some new JDK events we're trying to keep pace with as well. Event streaming doesn't yet support stack traces, so that's one limitation of that. And we have a couple of things that are in the review pipeline as well and are not yet supported in any release. That said, we've reached the deep dive, which is going to take up the majority of the presentation. And yeah, let's take a deep breath. So this road map essentially represents a very high-level, zoomed-out view of the flow of JFR data through the system. From now on, each slide is going to contain this road map, and the highlighted part will indicate the part that we're currently talking about, just for convenience and easy reference. So firstly, the points of instrumentation. These are various points where JFR events are emitted, at either the application-level code or the VM level. The screenshot on the slide is just from JDK Mission Control; I'm just using it to show some content that an event may contain. You can see there's a bunch of fields and corresponding values, and this is just one example; it'll vary by event. And you can think of JFR events as the primary thing that we're concerned with, really.
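Tying back to the build-time flag and runtime option described at the start of this section, the invocations look roughly like this (a hedged sketch; the application and file names are placeholders):

```shell
# build time: compile the JFR support into the executable (GraalVM/Mandrel)
native-image --enable-monitoring=jfr -jar myapp.jar myapp

# run time: start a recording with the usual JFR parameters
./myapp -XX:StartFlightRecording=filename=recording.jfr,duration=60s
```

The same -XX:StartFlightRecording syntax works here as on a regular JVM, which is the point: existing JFR workflows carry over.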
And the rest of the slides going forward are basically just piping to get that JFR data from the point of instrumentation to the chunk file, where it can be consumed later. So yeah, speaking of chunk files, we're jumping all the way to the end of the road map. Chunk files are essentially the resting place of the JFR data, as far as we're concerned for this presentation. They must contain basically the same information, in the same format, regardless of whether OpenJDK or Native Image is generating them. They can be dumped to snapshots, the JFR snapshot being the .jfr file format, and that's usually how people are going to interact with them, via JMC or VisualVM or the JFR command-line tool. Yeah, so chunk files are self-contained and they have four distinct sections. You can see them in the diagram here: a header, which contains pointers and other metadata; the event data section, which contains the core JFR event data; the metadata section, which describes the format and layout of the events in the event data section; and the constant pools, which contain constants that are referenced from the event data section. So, the constants: in order to reduce the size of JFR data, we use a referencing-ID scheme to increase compactness. How this works is that entries in the event data section of the chunk file use unique IDs to reference into the constant pool section of the chunk file, and this helps with deduplicating the actual constants that are used by the JFR events. In this slide you can see an example of one event entry which uses the unique ID 12, which is then used to index the thread constant pool and reference the actual thread data residing there. All this increases the compactness of the JFR data, which reduces overhead when dealing with it while it's in flight and when writing it to disk. It reduces the overall chunk file size as well.
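The referencing-ID scheme is straightforward to sketch in plain Java. This is a hedged toy model of the idea, not JFR's actual data structures: constants are interned once, and event entries carry only the small ID:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a JFR-style constant pool: each distinct constant gets a
// small unique ID, and event entries reference the ID instead of the value.
public class ConstantPool {
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    // Returns the existing ID for a constant, or assigns the next free one.
    public int idFor(String constant) {
        return ids.computeIfAbsent(constant, k -> ids.size());
    }

    // Number of deduplicated constants that would be persisted.
    public int size() {
        return ids.size();
    }

    public static void main(String[] args) {
        ConstantPool pool = new ConstantPool();
        // Three events referencing two distinct thread names...
        int a = pool.idFor("main");
        int b = pool.idFor("GC Thread");
        int c = pool.idFor("main");
        // ...share IDs, so only two constants need to be written out.
        System.out.println(a + " " + b + " " + c + " " + pool.size()); // 0 1 0 2
    }
}
```

Many events can point at the same entry, which is exactly why separating event data from its pool makes the events undecodable, as the next part explains.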
However, the downside of this increased compactness and this referencing-ID scheme is that we have a tight coupling of the event data and the constant pool data, so that if they're ever separated and not found in the same self-contained chunk file, then we can't decode the event data section and it's basically unreadable. So that's one downside. Right, so now that we've talked about the very beginning and the end of the road map, we'll jump in and fill in the middle. So, after event emission the JFR data splits: the core event data goes to the JFR thread-local buffers, while the constant data goes to the constant pools. In both HotSpot and Substrate VM, the JFR thread-local buffers essentially have the same purpose and same structure. They're structured in a segmented way that allows for concurrent writing and reading of data, and there are various pointers which define the sections. There's the write position pointer, which basically determines where new data is written into the buffer; when an event write is in progress, that's the pointer that's going to be in use. Then there's the committed position pointer, which represents the end of the committed data section. The committed data section is data that has been fully written, so it's not an in-progress write, but it hasn't migrated anywhere else yet. The flushed data section is essentially committed data that has been migrated somewhere else, so it can be overwritten at the earliest convenience. Eventually the buffers will fill up with committed data and will have to be flushed elsewhere, and at that point all the pointers reset back to the start position. HotSpot is a little bit different in that it uses buffer pools to recycle buffers. There's a live list and a free list, and when a new thread requires a TLB for JFR, one will be taken off of the free list and put on the live list, and vice versa when that thread goes away. But in Substrate VM we have it a little bit simpler.
We just allocate a thread-local buffer in native memory when it's required, and when the thread goes away we destroy that memory, so we don't really have to manage access to these buffer pools and maintain them. Right, in the case of virtual threads: multiple virtual threads may share the same thread-local buffer of the carrier thread, and that's not really an issue, because each one has exclusive access at any point in time and the JFR data is eventually going to the same place anyway. Right, so after the thread-local buffers fill up, the data is migrated to a set of global buffers, and the global buffers essentially act as extra capacity for overflow storage. It's more efficient than increasing the size of all the thread-local buffers, because not all threads will be equally as busy with respect to JFR events. Right, so, constant pools. Previously we mentioned how constant pools use a referencing-ID scheme to reduce the size of JFR data, and this essentially works by deduplicating constants. In HotSpot, one way the deduplication works is by using JFR-specific bits in the metaspace data for certain constant types, such as klass (with a k) and also methods. These JFR-specific bits act essentially as boolean toggles: when event data somewhere in a JFR local buffer references a constant, that bit in that constant is flipped to indicate that it's referenced somewhere. That way, when it's time to actually persist the constants to disk, we only have to persist the ones that are actually referenced, not all of them. Additionally, if multiple events reference the same constant, that bit is only flipped once and it only needs to be written once, so that's where the deduplication happens.
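The tag-bit idea can also be sketched as a toy model in plain Java (again a hedged illustration, not HotSpot's metaspace bits): constants are registered up front, events flip a referenced flag, and only flagged constants get persisted:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the JFR tag bit: flipping the flag is idempotent, so a
// constant referenced by many events is still written out only once.
public class TaggedConstants {
    private final Map<String, Boolean> referenced = new LinkedHashMap<>();

    public void register(String constant) {
        referenced.putIfAbsent(constant, false);
    }

    // Called from event emission when the constant is referenced.
    public void markReferenced(String constant) {
        referenced.put(constant, true);
    }

    // Only referenced constants are persisted at chunk rotation.
    public List<String> toPersist() {
        List<String> out = new ArrayList<>();
        referenced.forEach((c, ref) -> { if (ref) out.add(c); });
        return out;
    }

    public static void main(String[] args) {
        TaggedConstants pool = new TaggedConstants();
        pool.register("java.lang.String");
        pool.register("com.example.Unused");
        pool.markReferenced("java.lang.String");
        pool.markReferenced("java.lang.String"); // second event, same constant
        System.out.println(pool.toPersist());    // [java.lang.String]
    }
}
```

The unused constant never gets written, which is the space saving the talk is describing.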
There are some constant types, such as stack traces, that don't have metaspace data, and in those cases a lookup table is instead used for the deduplication and tracking. An interesting thing is that in Substrate VM and Native Image there is no metaspace at all, so we have to rely on the lookup-table approach for all the various constant types. Right, so after enough JFR data has been generated, a chunk rotation must be requested, and this is essentially the way that JFR data is continually persisted to disk. The current chunk file on disk that's open is sealed, and then a new chunk file is opened, and in that process all the in-flight, in-memory data is flushed to that chunk file before it's sealed. The thread that's performing this chunk rotation must flush the thread-local buffers of other threads, and to do that safely we have to request a safepoint. So the order of operations at a chunk rotation safepoint is as shown on the slide; I want to make note that it's pretty similar in OpenJDK and in Substrate VM. The recording time between chunk rotation safepoints is called an epoch, and you can see in the green safepoint box that that's where we're actually flushing the JFR buffers, both local and global, to disk. But the most interesting thing here is that we're writing the constant pools to disk outside of the safepoint, when we're already starting epoch 2. What that means is we're simultaneously writing the constants from epoch 1 to disk while recording constants belonging to epoch 2, so they're kind of mingling inside the constant pools and we need to keep them isolated. We want to avoid writing constants belonging to epoch 2 into the chunk file for epoch 1, because otherwise we'll have that mismatch and we won't be able to decode the data referencing the epoch 2 constants, the same issue that I explained a few slides back. So how we do this is we tag each constant according to its respective epoch to keep them isolated, and
essentially, the moral of the story is that this allows us to reduce safepoint pause time by writing these constant pools outside of the safepoint. Another way we reduce safepoint pause time is by having a dedicated JFR thread flush the global buffers to disk periodically throughout the epoch, so it's not actually happening in the safepoint; there's less work to be done when we are stopping the world to flush the buffers to disk. Right. One related note on safepointing is the question of whether a chunk rotation safepoint can interrupt concurrent event emission that may be happening in other threads. We'd have a scenario there where the safepoint and epoch transition actually interrupt the event emission and separate the constant data and the event data into different epochs and different chunk files, and then it would be unreadable. So that's the scenario in question right now. In OpenJDK, in HotSpot, the JFR code is written in C++; it's native code, so it can't actually be interrupted for a safepoint, so it's not really an issue at all. However, in Substrate VM it's Java-on-Java: the VM code is written in Java, so the JFR code is Java code and could potentially safepoint at a very inopportune moment. So how do we prevent that from happening in Substrate VM? How it's done is we have this annotation called @Uninterruptible, and what it does is, at build time, prevent the insertion of safepoint checks, so that code annotated with @Uninterruptible doesn't actually safepoint at all. You'll find that a lot of the JFR code is sprinkled with this annotation all over the place in the VM, especially code dealing with buffers, constant pools, and event writes. But this has pretty big consequences for the implementation itself, because uninterruptible code that can't safepoint can only call other uninterruptible code that can't safepoint, which means a lot of the JDK
code that's written in Java is off limits. So we can't use things like the normal hash tables, reentrant locks, etc.; we have to kind of roll our own versions of those which are uninterruptible. Another thing is we can't even use managed memory on the Java heap, because that can induce a garbage collection, which requires a safepoint, and that's not uninterruptible. So we have to use unmanaged native memory to craft our own data structures to deal with a lot of these things; it's a little bit of work dealing with that. The last difference I want to mention between JFR in Substrate VM and HotSpot is related to how JFR interfaces from the Java-level JFR code to the VM-level JFR code. In OpenJDK that happens in the JVM class, which you can see on the right side of the slide. These are basically the points where the Java-level JFR code in the JDK calls down to HotSpot at the VM level using JNI. We reuse that Java-level JFR code from the JDK in Native Image, but there's no underlying HotSpot implementation to call into, so how do we resolve that mismatch? We use substitutions, which Foivos talked about a little bit, but I'll mention it again: essentially they allow us, at build time, to specify redirects from these Java methods to our own implementation of the VM-level JFR code. On the right side of the slide you can see markChunkFinal is highlighted, and that corresponds to the Java-level code; we're grabbing that and then redirecting it to our own Substrate VM-based implementation of that code. So that's how we resolve that mismatch. Yeah, with that said, that basically concludes my presentation. If you're interested, there are further links for more reading; there's some documentation and some blog posts as well, and you can always
approach me outside as well if you have more questions. Um, yeah, how are we doing for time, Chris? Okay, if there are any questions, I'm happy to answer them now. You just did such a good job explaining it... thanks. Yeah, on Substrate VM, did you measure the impact on time-to-safepoint? Because if it's uninterruptible, you know, this uninterruptibility trades off time-to-safepoint. Yeah, I could imagine, yeah. Um, I'm not really sure of the exact figures, I can't really give you a number, but I know what you're saying, it could potentially be an issue. I'm not really aware of one, but yeah, that's definitely a concern. But it's not just the JFR code that's marked as uninterruptible: a lot of the GC code as well, a lot of the low-level operations, they must also be uninterruptible, so it's not just JFR. Understood, thanks. Yeah, actually, to tag on to that, a lot of JFR code is really just instrumenting other low-level code which is already uninterruptible, so it's like collateral damage; it's not really an issue to add a little bit more on to code that's already uninterruptible, such as the JFR GC event handling and the slow-path allocation stuff. You can't safepoint there anyway. Thank you. Okay, thank you for listening.
Ruby on the Modern JVM: Fibers, FFI, and More
Our next speaker is the esteemed and very famous Charlie Nutter, so let's give him a round of applause. Alright, microphone working. Can you hear me okay back there? Alright, great. I've got a lot to cover. This is going to be a retrospective of all the problems that we've had trying to get Ruby onto the JVM, and then a little status report along the way about how we're doing on making the JVM catch up with those needs. Charles Nutter, that's me; there's my contact information. I've been working at Red Hat now for, I think, 12 years. Before that I worked for Engine Yard, which was a Ruby software-as-a-service company, and I was at Sun for the three years before that as well. So I probably won't have time for interactive Q&A, but if you contact me online or post something in the Matrix channel, I will definitely get to it. I want to answer all the questions. Okay, so a little brief review of JRuby here. Ruby for the JVM, not too surprising there. It runs on Java 8 currently, but because of all the cool stuff and because we've ridden the Java 8 horse into the ground, we are going to require 17 or 21 minimum next release, which should be this year. It's been in development for a long time, running Rails since 2006, and by probably 2008 we started having production users. And we're the only alternative Ruby that's really had production users during that time; there have been a few other experiments, but nothing's ever really taken off as well as JRuby. Maybe the most successful off-platform language brought to the JVM: Jython and Rhino/Nashorn might give us a run for our money, but given the maintenance state of those projects, I think we're probably currently the most successful and most widely used JVM language that was never envisioned for this platform. So we've been chasing Rails all the time; that's kind of the gold standard for whether we can say we're a Ruby implementation or not. And after about two years of good work, we managed to get Rails working back then.
Running Rails tests, running CRuby's tests, running all of the different libraries' suites, as much as possible. Compliance testing for Ruby has improved over the years, but we pretty much just run everything to try and make sure that we really are compatible. And very quickly, we ran into some serious challenges trying to bring a language like Ruby to the JVM and make it also usable and perform well. This is the quick summary; these are all areas I'm going to cover during this talk, so we will just blow right through here. These challenges help us grow both as a platform and as a community. They open up new worlds to Java developers, to JVM developers. They open up the potential of bringing new and unusual languages to the platform. It opens up the entire world of native libraries, native features that are out there that we don't necessarily have on the JVM. So we really need to focus on: what are these challenges in bringing a language like Ruby to the JVM, and how can we make the JVM better to support languages like this in the future? So we'll start with strings and regular expressions. Excuse me for a moment. Okay. So one of the first things we ran into: JRuby's strings were just based on Java strings, and we used Java's regular expressions. And at the time, regular expressions were being used in very unusual ways in the Ruby world. We ran into a case in an early version of Rails where they were using regular expression matching to parse HTTP requests that came in and look for, say, a MIME header for an image, and pull the image out. So you'd end up with a regular expression operating against a very large piece of data. And the built-in Java regular expression engine is implemented in such a way that for certain types of expressions, like an alternation like this, it actually will recurse and recurse and recurse, and then very easily you can blow the stack out by feeding it too much data; just giving it too much data to process will blow it up.
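The failure mode described here is easy to reproduce against stock java.util.regex. A hedged sketch (the exact input size needed to overflow varies with the configured thread stack size):

```java
import java.util.regex.Pattern;

// Demonstrates the recursion in OpenJDK's regex engine: a repeated group
// containing an alternation deepens the Java call stack once per repetition.
public class RegexStackDemo {
    static String tryMatch(int count) {
        String input = "a".repeat(count) + "b";
        try {
            boolean matched = Pattern.matches("(a|aa)+b", input);
            return "matched=" + matched;
        } catch (StackOverflowError e) {
            return "StackOverflowError";
        }
    }

    public static void main(String[] args) {
        System.out.println(tryMatch(10));      // small input: matched=true
        System.out.println(tryMatch(100_000)); // large input typically blows the stack
    }
}
```

Note that the input is entirely well-formed; the error comes from input *size*, not from pathological backtracking.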
So we had to find other options. JRegex was an early one that worked against Java strings, and we ran with that for quite a while. But eventually it came to be that the Java string itself was insufficient for us to represent Ruby's string behavior. Here's what that exception looks like: a very simple match, just 10,000 of the 'a' character followed by a 'b' character, with that same regular expression. It'll blow up on every version of the JVM that's out there, or anything based on OpenJDK classes, and I believe this is still an issue. So as we went forward and needed a more robust regular expression engine that would work with a more custom type of string in JRuby, matching CRuby's behavior, a contributor to JRuby ported over Ruby's regular expression engine. So Oniguruma is the C library that Ruby uses for regular expression matching, and ours is Joni. It's a bytecode-based register machine, so there are no stack issues; it doesn't deepen the stack at all. It matches against byte arrays, and it'll be clear in a moment why we need that. It can also do byte array matching with pluggable encodings, so it works regardless of what encoding those characters are in, and potentially with a different grammar for the regular expressions. This library was ported to characters and used by Nashorn to do JavaScript regular expressions; they had the same sort of problems, and so they used our library but made it specific to JavaScript. So you see that I'm matching against byte arrays here, and I said that strings were insufficient. Well, the problem is that Ruby's string is not just one encoding, it's not just a blob of characters: it is represented as a byte array with an encoding. So within the system you can have many strings that all have different encodings, and it all needs to be negotiated together when you combine them or use them against each other. So we had to follow suit, essentially.
We had to make a new string class for JRuby that used a byte array, represented all the encodings, and we had to port all of the encoding logic over, and the transcoding logic, which was a major piece of work. Essentially we have our own string now, and we've had it for over a decade, because Java strings just could not emulate all of the behaviors we needed for Ruby. This does complicate interop with Java, but there are improvements coming there. So JCodings is the encoding library that we use. This provides the full set of encodings, similar to what's in CRuby, and the transcoding from any encoding to another, which is used internally when we have two different strings come together and need to negotiate that. So where do we stand on the JVM today? Well, rather than just having a character array inside strings, Java now actually has a similar model, where there's a byte array but only two encodings are allowed inside it: ISO-8859-1, which covers ASCII plus the Latin-1 characters, or UTF-16, the old standard char representation. So this does lower our cost going to and from Ruby and Java when we are just using ASCII-range characters, but UTF-8 would be nice to have there, because most Ruby strings are going to be UTF-8, probably with at least one multi-byte character in there, so all of that has to be copied over much less efficiently. And java.util.regex does still blow the stack. I would love to see it get replaced at some point, but I don't know if there's any work being done to do that. Okay, so the next area that we ran into was that we had a nice runtime, but the performance wasn't there. We needed to be able to generate JVM bytecode from Ruby code and have it optimize like regular Java. The interpreter was good; it was similar to Ruby 1.8, before they moved to their own bytecode runtime, but it was very difficult for the JVM to optimize.
We could walk through this stuff quickly, and it was very easy to write as an interpreter, but you had a lot of polymorphic calls within that AST; the JVM never really could see the optimization path through there. So we had to write a JIT. The reason that we did not just immediately start compiling all Ruby code into bytecode is because, for example, the Rails library will load into memory thousands of classes and tens of thousands of methods. That's a massive load for us to put onto the JVM when only a few hundred or a few thousand of those are ever going to be called. It also was slower for us to go straight to bytecode, because the bytecode would end up being interpreted by the JVM's interpreter, which actually turned out to be slower than our interpreter after it JITs. So it made more sense for us to leave code in our interpreter until we saw it really needed JVM bytecode, and there we ended up with basically the first mixed-mode JIT runtime on top of the JVM. Later on, we moved to a more modern compiler design. We had a compiler engineer, Subbu Sastry, come in and help, and he basically helped us move a lot of the little peephole optimizations I was doing in my JIT up to a more modern compiler architecture. This simplified the JIT, simplified what I had to write as far as emitting bytecode, which then let me explore performance a lot more in other ways. And then of course, as we moved forward, we got invokedynamic in Java 7. It's been steadily improving since then, and it's used incredibly heavily in JRuby. If you look at the bytecode our JIT outputs from Ruby code, it's pretty much just stack moves and invokedynamics for almost everything that we do. We will access local variables normally, but everything else has to have some little dynamic aspect as part of Ruby. So we use it very heavily, probably more heavily than almost any other project on the JVM. This is invokedynamic performance over time, from Java 8 up to 17.
Really happy to see the performance improvements every release. It gets a little bit better. Looking at what we're doing with a more numeric algorithm, we get a bigger boost out of it. With something that's just walking a lot of objects, we're already kind of close to where Java would be on just walking an object graph, but we're still seeing that we do get some improvements from running invokedynamic, making that more direct. Really cool is when we plug in a different JIT compiler here. So this is now using invokedynamic on the Graal JIT. And for a numeric algorithm where we're creating tons of numeric objects, we really see the impact of partial escape analysis helping us. And this is now really starting to get to the point of Java-level performance for a numeric algorithm. These are the cases where it really helps. But over time, we have not seen that Graal is generally faster, and we don't generally recommend it unless you have something numeric or something that's doing a massive amount of allocation of temporary objects. So where are we today? One of the problems that we have generating individual methods or compiling at runtime is that ideally we want that compiled method to go away if the class goes away, or if it's a one-off generated method that eventually doesn't get used. So it's a class per method, and the only way to make those garbage-collectible is a class loader per class per method. So every method that we JIT into the system has both a class surrounding it and an entire class loader, just to work within the confines of the garbage collector. There's no other way to make garbage-collectible classes right now on the JVM. There is the anonymous class loader, but it's a hidden class, and we don't try to access that right now. Indy is clearly working very well. We're going to be doing more advanced call sites, where we will have special-case code along one fast path and then a slower dynamic path if it turns out it's not the logic we expected.
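The fast-path/slow-path call sites described here are built on the MethodHandles API that backs invokedynamic. This is a small hedged sketch (all names invented, not JRuby code) of a MutableCallSite that starts out bound to a slow path and is later relinked to a fast path, the basic mechanism a dynamic-language JIT uses:

```java
import java.lang.invoke.*;

public class DynamicCallSite {
    // A mutable call site like those a dynamic-language JIT emits: the target
    // starts at a generic "slow path" and can later be relinked to a
    // specialized "fast path" without the caller changing.
    static final MutableCallSite SITE;
    static final MethodHandle INVOKER;

    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            MethodType t = MethodType.methodType(String.class, String.class);
            SITE = new MutableCallSite(l.findStatic(DynamicCallSite.class, "slowPath", t));
            INVOKER = SITE.dynamicInvoker(); // what invokedynamic would call through
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static String slowPath(String arg) { return "slow:" + arg; }
    static String fastPath(String arg) { return "fast:" + arg; }

    public static String call(String arg) {
        try {
            return (String) INVOKER.invokeExact(arg);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    // Relink the site, e.g. after profiling shows a monomorphic target.
    public static void relink() {
        try {
            SITE.setTarget(MethodHandles.lookup().findStatic(
                    DynamicCallSite.class, "fastPath",
                    MethodType.methodType(String.class, String.class)));
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The JVM can inline straight through the current target of the site, which is why invokedynamic-heavy bytecode can still optimize well.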
It is a tricky API to use, but we have a lot of tooling that we've built around it. I've got some links to older talks of mine that go into detail on that. Okay, I think we're doing pretty good on time here. I know I talk fast. Come back to the video and play it at like half speed, and then maybe you'll catch everything that I'm trying to say here. So the next big area that we ran into was native interop. The CRuby world really lives in a POSIX native C environment. It's almost a DSL for writing POSIX code, really. And originally that's kind of what Matz, the creator, wanted. He wanted something where he could write C, but essentially with a nice API, a nice language on top of it. So they are very heavily using JNI-like extensions to the runtime for most of their native access. This is clearly way too invasive for JRuby. It calls into internals of their object structures. It has direct access to the heap, direct access to garbage collector endpoints. Nothing that we can emulate efficiently in JNI, and we have tried. So we ended up pushing people more towards using programmatic access, like Project Panama, like libffi. Rather than writing C extensions for CRuby to wrap a library, let's just wrap the library by writing a little bit of Ruby code. And so we started out with the Java Native Runtime. It's basically our API for calling from Java down into native code and native memory. And then on top of that, we ported the Ruby FFI layer over with some invokedynamic magic to try and make that all as clean and fast as possible. Java Native Runtime is actually a set of projects. Up at the top, jffi is the wrapper around libffi. That's where we ship about 20 different binaries in the jar for all the base platforms that we support. libffi is in there, and we're just using standard libffi with some extra wrapper logic around it. jnr-ffi is kind of the baseline user API. If you're familiar with JNA, this is that level, where you say, I need a struct that's laid out like this.
I need a function that takes these arguments, make these calls, allocate this memory. Then above that, we realized there were a lot of functions and a lot of behaviors that people were going to be rebinding over and over if we didn't provide them. So we have jnr-posix, which is a slowly growing corpus of standard POSIX functions bound on top of jnr-ffi. So you can go in there and you can call things like posix_spawn, or open a file, or do native IO. You can even call fork, and it's a lot of fun to see what happens when you do that. jnr-enxio, extended native cross-platform IO, builds on jnr-posix and provides an NIO channel that is all native downcalls. So where we can't get selectable standard IO on the JVM, where we can't get selectable sub-process channels, we can use jnr-enxio to have actual interactive control over standard input, standard IO, and sub-processes. You can actually use JRuby to spin up a vim instance, and it will have full console control and work properly. Basically impossible to do with the standard ProcessBuilder stuff on Java. jnr-unixsocket, not too surprising, just wraps this other stuff with the Unix socket calls. And then jnr-process, like I mentioned, we have our own selectable channels for processes. You can use this as a Maven library; you pull it in and you'll have the same API as ProcessBuilder, but you'll get selectable channels out of it instead of streams. So it's available right now for that. This is a little bit of what Ruby FFI looks like. Pretty straightforward: we're setting up a structure with particular widths of fields, attaching a function, gettimeofday, and then we can call it directly. Under the covers, this all uses JNR and ideally inlines as much as possible up to the native downcall. So today, native interop on the JVM. Of course, we have Panama coming along, so the talk before me, Maurizio's talk, that's where all the information is about where things are going, and we're really excited about that.
I actually wrote the original JEP for Panama, which has since been reworked many times, but we've been needing this for over a decade now and had to make our own, and we don't want to maintain it anymore. JNR is pretty much the fastest way outside of Panama to do these native downcalls, in some cases actually beating JNI, because there are extensions to generate a little JNI function in memory using assembly that can cut out some of that overhead, rather than just doing pure programmatic calling through libffi. jextract from Panama is coming along. We're also hoping that we can use that at runtime as a library to access those data structures internally and generate Ruby FFI code. This would be kind of the last mile for getting Rubyists to switch from writing C extensions to using FFI. If we could generate the Ruby FFI code the same way we do the Panama code, there'd be nothing to stop them at that point. There is back-end work happening right now on JNR to integrate it with Panama. Michelle at Oracle is working on that, and I'm hoping that we'll see something in the next couple of weeks. A little more review of some of these ideas. If we have jextract that can generate Java code, we should be able to use jextract to also generate Ruby FFI code. That'll be the next big fun toy to play with as of Java 22. We also use the existing SQLite JDBC driver. Rubyists like to use SQLite for local development. But it's going through a JNI back end. You have to make sure it's available for the platform that you're on. They are also playing with Panama behind the scenes. Early numbers look like two-ish times faster than the JNI wrapper around SQLite that they have. So this is coming along. We also are integrating a new Ruby parser called Prism, which is a simple C library that all the implementations can share so that we are using the same Ruby parser. That we will integrate through Panama as well.
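For contrast with the jnr-ffi style shown above, here is what a programmatic native downcall looks like with Panama's Foreign Function & Memory API as finalized in Java 22 (this is my own minimal sketch, not JRuby code, and it assumes a Java 22+ runtime on a platform with a C `strlen` in the default lookup):

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class PanamaStrlen {
    // Bind and call the C library's strlen() purely from Java code:
    // no hand-written JNI stub, just a function descriptor and a MethodHandle.
    public static long strlen(String s) {
        Linker linker = Linker.nativeLinker();
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // Copy the Java string into native memory as a NUL-terminated C string.
            MemorySegment cString = arena.allocateFrom(s);
            return (long) strlen.invokeExact(cString);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

This is the same shape as the Ruby FFI gettimeofday example: describe the signature, get a callable handle, allocate native memory, and call down.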
And we'll use Panama to make it much faster for us to downcall into this library, get our AST back out, and then proceed. Interestingly, we're also exploring using Prism as a Wasm-compiled library running on the Chicory Wasm implementation on top of the JVM, so that we can parse Ruby code using a native library even if we're not on a platform it's compiled for. And it's amazing that it works. All right. Moving along here. So lightweight threading is the next big one. Around Ruby 1.9, they introduced fibers, a coroutine-like concept, a micro-thread concept. You would still have your native threads there, but they can bounce around to different fibers at any given time. And you get a little structured concurrency; structured use of fibers allows you to do multiple tasks in the same thread. There's also been a push toward structured concurrency in the Ruby world now, where fibers can wait on IO or make a blocking call on IO. The runtime will see that and schedule another fiber to run in its place while it's waiting for that. So you can easily handle tens of thousands, hundreds of thousands of concurrent connections, for example, without blocking that many threads or having to write your own select loop and whatnot. So for fibers on JRuby, without a coroutine API at the JVM level, of course, we've had to use native threads. And that clearly only scales up to a certain number of threads. With the structured concurrency example, we could have potentially thousands of fibers in the system, and it's almost impossible for us to support that with full, heavy native threads all along the way. Ruby also primarily uses internal iteration. Collections just have to implement an each method, basically a for-each. And all collections in the system then expect you to pass a block of code into it. Well, how do you turn internal iteration into external iteration? You have to use a coroutine that can yield values back out while staying inside that loop.
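The internal-to-external iteration trick can be sketched in plain Java using a virtual thread as the "coroutine" and a rendezvous queue as the yield point. This is an illustrative toy (all names invented, not how JRuby actually implements it), but it shows the shape of the problem: the producer stays inside its loop while the consumer pulls values out one at a time.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.SynchronousQueue;
import java.util.function.Consumer;

public class EachToIterator {
    // Turn an internal iterator (a method that calls a block once per element,
    // like Ruby's each) into an external Iterator by running it on its own
    // virtual thread and handing each value across a rendezvous point.
    public static <T> Iterator<T> externalize(Consumer<Consumer<T>> each) {
        SynchronousQueue<Object> handoff = new SynchronousQueue<>();
        Object end = new Object(); // sentinel marking the end of iteration
        Thread.ofVirtual().start(() -> {
            try {
                each.accept(v -> {
                    try { handoff.put(v); }           // "yield" the value out
                    catch (InterruptedException e) { throw new RuntimeException(e); }
                });
                handoff.put(end);                     // signal completion
            } catch (InterruptedException e) { /* iterator was abandoned */ }
        });
        return new Iterator<T>() {
            Object next;
            boolean done;
            public boolean hasNext() {
                if (done) return false;
                if (next == null) {
                    try { next = handoff.take(); }
                    catch (InterruptedException e) { throw new RuntimeException(e); }
                    if (next == end) { done = true; next = null; }
                }
                return next != null;
            }
            @SuppressWarnings("unchecked")
            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                Object v = next; next = null; return (T) v;
            }
        };
    }
}
```

With heavyweight native threads, every Enumerator built this way costs a full thread, which is exactly the scaling problem the talk describes.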
So now we've got that potential for all sorts of fibers, hundreds of thousands of fibers all over the system, just because we're iterating collections with an external iterator. I'm going to kind of blow through this because the next talk will cover fibers a bit more. The example here is handling requests on a thread. We've got a thread, a request comes in. Now it's waiting for more information; the thread's not being used. Finally we get more data, and we can proceed with the rest of our request handling. With fibers, of course, we can use multiple different fibers handling different connections on the same native thread. So the request comes in, this fiber's waiting on IO. Well, let's spin up another fiber that can handle the next request that comes in. And they can multiplex use of that same thread. This is what we're starting to see more and more in Ruby, and this is where it will be critical for us to have lightweight fibers, lightweight coroutines, on JRuby. Okay, so here is a little benchmark, a little example of trying to test how long it takes to spin up 100,000 fibers and run them all to completion. So there are 100,000 live fibers in the system at any given time in this benchmark. And of course, as you would expect, this doesn't work. We can't spin up 100,000 native threads, and it just crashes in horrific ways. I'd love to see this crash in less horrific ways, but ideally we just move away from this problem altogether. And that's where we get Project Loom. So on the JVM today, as of 21, we now have an official API for lightweight coroutines, for essentially fibers, that maps almost perfectly to what we need in the Ruby world. And we've already got this integrated. We integrated it a year ago actually, and have only made minor changes along the way. I'd like to show this just to demonstrate how much work we had to do to switch from our built-in native-thread fibers to the virtual-thread fibers.
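The 100,000-fiber benchmark described here is straightforward to reproduce with Loom's virtual threads on Java 21+. This hedged sketch (my own, simplified from the described benchmark) spawns N virtual threads and joins them all, something that fails badly with platform threads at this scale:

```java
import java.util.concurrent.atomic.LongAdder;

public class SpawnFibers {
    // Spin up n virtual threads ("fibers") and run them all to completion.
    // With Thread.ofPlatform() this crashes long before n = 100,000;
    // with virtual threads it completes easily.
    public static long run(int n) {
        LongAdder counter = new LongAdder();
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = Thread.ofVirtual().start(counter::increment);
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.sum();
    }
}
```

Swapping `ofVirtual()` for `ofPlatform()` is essentially the one-line difference the talk shows between the old native-thread fibers and the Loom-backed ones.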
I was shocked that this was all it took, and suddenly this benchmark actually could run. It could actually spin up all of those fibers and run them to completion. So amazing work on the Loom side, and very happy with the results. Performance-wise, here I drop it down to 10,000 so that I can actually try and get the threaded version to work. Clearly we're getting significant gains on context switching between different fibers, because Loom is just better at that, and there's a much lighter-weight process for going from one fiber to another on the same thread. Not quite as fast as CRuby. I suspect this is probably due to us relying on a very general-purpose scheduler for the virtual threads behind the scenes, where we really just want to say, this fiber's done, now run this one, rather than unblock that fiber and wait for the scheduler to pick it up. I think we can make up most of this overhead. Similarly on M1, and I don't know if this is general to ARM or not, but these are the performance results we have. Could not get 10,000 to go on M1; I had to drop it down to like 2,000 or 3,000. The impact is a bit bigger here, but again I'm hoping that as Loom evolves, and as we use it better, we'll see improvements. Five minutes for the last section here. The classic problem with JRuby is still startup time. If we did not have startup time, we probably would have won the Ruby war a long time ago. The number one, two, and three complaint about JRuby is how much longer it takes to start up. The JVM is just not designed to start up quickly. Most of the core JDK code starts in the interpreter. It takes a long time for that to optimize, and then your application can start getting fast. We make it worse because we interpret Ruby code, and then every once in a while we'll just throw more bytecode at the JVM, like, okay, now this call site's actually bound to a bytecode method, not an interpreter, and we're just confusing the hell out of it all the time.
This is one of the reasons we actually do lazy compilation to bytecode, because we want to reduce the amount of overhead we force onto the JIT at the JVM level. Walking through JRuby's architecture here quick: we have our Ruby parser, which gives us our Ruby AST; we compile into our intermediate representation and interpret that for a while, and here's where it becomes mixed-mode. Then eventually we will generate bytecode for those methods, and then hopefully the rest of it all works and optimizes to native code. One of the early ways that we've tried to improve startup time is basically to turn most of that off: rather than turning anything into bytecode, rather than even running C2, the fast JIT in HotSpot, we turn on only C1, the simple JIT in the JVM, and we only use our interpreter. This improves our startup time by about 2x. By far the best thing we've had so far. Now, another way that could potentially fix this would be ahead-of-time compilation. Of course, GraalVM solves this very nicely for that world, but it completely disables all of the dynamic things that we want. General-purpose invokedynamic and method handles essentially just don't work. Then beyond that, we would have to pre-compile all of our code to bytecode. We'd have to link it in some way that it could be ahead-of-time compiled to native. This is just not going to work for us. We're hoping that Leyden will actually pick up here with an ahead-of-time option that can also do some dynamic stuff at runtime. Where are we today? The solutions we're looking at in the short term mostly surround checkpointing features. Checkpoint and restore in userspace, the CRIU API on Linux, allows us to run JRuby to a certain point, like just after startup, and then save off a copy of it that we can start quickly later on. This is being standardized in Project CRaC, an unfortunate name, but a lovely project.
This is working pretty well with JRuby right now; we're just experimenting with it. We are still hoping that Leyden, with some ahead-of-time compilation that still enables the rest of the JVM features, will be our ultimate solution. You can see here, this is CRuby on the left side just doing a baseline startup. JRuby's baseline startup without our --dev flag, which turns off all of the optimization. The dev flag here, not quite 2x, but giving us a good boost. CRaC, of course, significantly faster than all of those. We've actually gotten to a point in execution where we can start running Ruby code now, starting to get competitive with CRuby, which was essentially designed for fast startup. Same example generating a Rails app: again, getting very close to where CRuby sits on these numbers. So, wrapping up in the last minute here, JRuby is a test bed for all of these crazy JVM things that we're doing. We're pushing all of these edges. So whether you care about Ruby or not, we are the best invokedynamic torture test. We're going to be hitting Panama extremely hard as it gets integrated into the system. All of the virtual threading will be massively exercised by all of the structured concurrency stuff coming on the Ruby side. So if you're interested in helping us integrate any of these features, or if you're an implementer interested in testing these features at scale, JRuby is definitely something you should look at. This is more background. I'll let you take a quick picture of this if you want. These are talks I've done in the past that basically cover all of my many complaints about the JVM. That list of complaints gets smaller and smaller every year, thankfully.
The Challenges of Running the Fuzion Language Natively on the OpenJDK
Okay, people, we are ready for the next talk. Please listen up, quiet down, and get ready for the next talk. Thank you. Okay, thank you, Andrew. So I'm going to talk about the Fuzion language, and a bit more concretely about how we are running this on the OpenJDK. It's basically the problem of mapping a functional language to efficient Java bytecode. About my background: I've basically been doing compilers all of my life. I won't go into details, but what is important right now is that I'm working at Tokiwa Software in a team of three, where over the years we have together developed the Fuzion language and its implementation. A quick overview of this talk: I'm going to give a very short intro into the Fuzion language, and the rest I will show you through examples in the code. So I'm going to go into mostly the different types and how they are realized on the OpenJDK. We're talking about tagged union types, about product types that have value semantics, about type parameters, and about how to do monomorphization and inheritance. I'll also talk a bit about the class file verifier. So I'll start. You can't hear me in the mic. Can't hear you? Yeah, yeah. We're not getting anything. Is it all plugged in properly? Turned on. Is it on? No, it's on. Okay. I'm sorry. Okay. Sorry for those online who missed that. Okay. I will start with a quick intro to the Fuzion language. Fuzion is based on simplicity. It's all based on one single concept, which is a feature, which takes the role of Java classes, Java methods, and functions in other languages. Fuzion is aiming at safety-critical systems. We see more and more systems becoming safety-critical. And in that area, we see that tools can make the developers' lives much easier. In short, the language: Fuzion is statically typed. It is polymorphic in different ways. It has union types, parametric types, and object-oriented-style inheritance. And it is pure, using effects to model side effects. The Fuzion tool chain looks like this.
We start actually with Fuzion source files that go through a front end, compiled into Fuzion modules that are then processed by a middle end and by a static analyzer into a Fuzion application represented in the intermediate representation. And that is then the input for different back ends. And in this talk, I'm going to go into details of the JVM back end, which then transfers this intermediate representation into bytecode in Java class files that are run on a JVM. The first aspect I want to focus on is tagged union types. I'll explain immediately what that is. As an example, I take an oven. I implement an oven in Fuzion, and an oven has a setting: that oven can either be on, either be off, or it can be on with a temperature setting given in degrees centigrade or Fahrenheit. So there are three options in that union type. And off is just a unit type; it's just a value with no other state. While the temperature setting is either a setting that gives a centigrade temperature as an integer, or a Fahrenheit temperature, which is a float in this case. And within the oven, we can then use a union type value and match on this, or match on the setting, and do different things depending on whether it is off or it is a temperature given in centigrade or Fahrenheit. Now, I have to make some space here. Now, when we compile this to Java (I show Java source, not Java bytecode, to explain what we do there), we actually compile such a tagged union type into several fields. First, we have a tag field. That's why it's called tagged union types. That decides: do we actually have an off value? Oops, I have to make space again. Do we have a temperature, and what kind of temperature? And in case it is a centigrade temperature, we need another field to store that value, or another for Fahrenheit. So we have basically several fields to store a tagged union type value. I'll drive this example a bit further now. This is the most generic case.
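The "tag plus one field per payload" encoding can be written out directly in Java. This is a hedged sketch of the shape the talk describes (the class and field names are mine, not what the Fuzion back end actually emits):

```java
public class TaggedOven {
    // General tagged-union encoding: one tag field selects the variant,
    // plus one payload field per distinct payload type.
    static final int TAG_OFF = 0, TAG_CELSIUS = 1, TAG_FAHRENHEIT = 2;

    int tag;
    int celsius;       // valid only when tag == TAG_CELSIUS
    double fahrenheit; // valid only when tag == TAG_FAHRENHEIT

    // A match on the union becomes a switch on the tag.
    String match() {
        return switch (tag) {
            case TAG_OFF        -> "off";
            case TAG_CELSIUS    -> celsius + "C";
            case TAG_FAHRENHEIT -> fahrenheit + "F";
            default -> throw new IllegalStateException("bad tag " + tag);
        };
    }
}
```

The later slides then show how this general form collapses into a single field, an integer, or even nothing, as the analyzer learns more about the program.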
We have the tag and the different kinds of values. And during the talk, I will go more and more specific until I reach the point where the oven literally disappears into the void. So the next step towards that is, if we take a bit more of an object-oriented approach, we can use a temperature that is actually a reference to an object that provides an inner feature to get the temperature as a Celsius value. So it's a union type of off or the temperature, and the matching changes accordingly. Now, how this could be implemented is that we would have a tag that now decides whether it is off or it contains a temperature. But this is not what somebody would typically do in Java. This is a typical case where a null pointer or null reference would be used. So this is also what our back end does. It just uses one pointer value in that case and uses the null value to represent the state off. So that's the so-called nullable implementation of the tagged union type. Going further, having a more complex example: now we extend the oven to also have a clean mode and to maybe have some error state, which essentially means we have four different possibilities, and we need the temperature and the error state. But of course, there's never the case that the temperature and the error state are both used simultaneously. So we can join them into a single object value, because only one of them is used. Now, the tag field decides which of these four states we have, but actually we could use specific values for the object reference to represent the off and the clean states, such that this all collapses into a single reference field for all four states. This is also what our back end does in this case. So such a tagged union collapses into a single field. It gets even simpler if you have a very simple oven that doesn't allow any temperature setting. It just has the modes on, off, clean, or error. That is basically a classic enumeration.
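The nullable encoding for a two-variant union can be sketched in a few lines of Java (names are mine, illustrating the idea rather than the actual generated code): the `off` case needs no tag field at all, because `null` serves as the tag.

```java
public class OvenSetting {
    // A union of `off | Temperature` where Temperature is a reference type:
    // no tag field is needed, since null can represent the `off` variant.
    interface Temperature {
        int celsius();
    }

    static String describe(Temperature setting) {
        if (setting == null) return "off";           // the unit-valued `off` case
        return "on at " + setting.celsius() + "C";   // the temperature case
    }
}
```

Reserving further sentinel reference values (as the talk describes for the off/clean/error states) extends the same trick to unions with several unit-typed variants while still using a single field.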
Internally, this is just an integer, so we only have an integer type for that. If we have an even simpler oven that could just be on or off, there are only two states. So that falls down into a simple Boolean type and is compiled into a Boolean value by the Java back end. We can go further if we now have an application. We have a static analyzer for our application. And if the application actually never uses an oven that is on, that value can actually be determined to be a void-type value. A void in Fuzion is not like in Java. It's much more like a never type, the result of a call to exit or so, something that really never occurs at runtime. So if you have that, we don't need any data to store, because all ovens in that application are always off. We can even go further if you have an application that doesn't even produce any ovens that are off. So maybe an application that only uses a list of ovens, and that list is always empty. So both variants are never used, so we don't even need the oven anymore, because this all can be thrown out. So that's as much for tagged union types. The next point is product types, and Fuzion has value semantics while Java has reference semantics. A small example: I want to do a small graphics example here. I'll start with a very simple product type of point, which is just a combination of two coordinates, x and y, two integers. And I pass these points to a draw function that should draw them. I won't go into details there, but I just show you a bit of the syntax, how Fuzion uses effects, in this case a graphics effect, to document that this function actually has a side effect. It requires an environment where graphics is available to actually draw this. Now we create a point, store it in a local variable p1, and pass that to the draw function. Now the question is, how do we do this passing of the argument? How do we represent this point? We have value-type semantics in Fuzion.
So what we do is actually split this up into two fields, two parameters here for the draw function, that are passed separately in a call. Similarly, when we create a new point, that point is split up into two local variables, or two fields, that are then assigned separately. And finally, when we make this call, we pass these two values individually. That works nicely and is performant. Problematic in the JVM back end is the case of returning a value product type with value semantics. So here we have a shear function added to our feature point that creates a new point that is returned as a result. Java cannot return multiple values, so what can we do instead? I need more space for that. I've looked into a number of possibilities for how we can return such a product type in Java. As the first baseline in that analysis, I looked at inlining. If you just inline the call, returning a value is just assigning between local variables. So we can use that, but of course that doesn't work in the general case, because inlining will blow up the code and will not work for recursion, among many other restrictions. That's why I put a flash behind that: it is not a solution for all our cases, but it gives a baseline for comparison. The typical way in Java to do this is that the called method returns a newly allocated temporary container object that just contains two integers. We could also do it the other way around: the caller could preallocate an object and pass a reference to that object to receive the results. The fourth solution I looked into was that we could use static fields. So when returning two integers, returning a point, we just have two static fields, X and Y, and we store the values in there, and the caller then retrieves them. I put a flash there as well, because that is not thread-safe. It doesn't work when we have more than one thread. What would be thread-safe would be using thread-local variables to return this value.
Or we could put these two static fields into our own thread instance. If our threads have a specific instance of that, it could have fields that we then use for returning values thread-locally. I've analyzed these different possibilities using the Java Microbenchmark Harness, JMH. Actually, to my surprise, the allocation of a temporary object that is returned was even faster than the inlining version that I analyzed. But unfortunately in JMH, I couldn't analyze the last case of using my own thread type and fields in the thread structure. So I added my own ad hoc benchmarking framework to do the same measurements, and I got fairly different results, but basically the same picture, and I also cover the last case. Now, I exclude those cases where I said that they are not generally applicable, so the inlining and the static fields; of course, we can't use those in our implementation. Next, thread-local fields, thread-local variables, which are relatively expensive, so kicking those out as well, we are left with allocating temporary objects and relying on the JIT to optimize this, because the JIT does very well here, but I don't know what the heuristics are behind it and whether we can actually rely on that. So for now, we're using thread-local variables to return structured data on a call. So we're looking forward to seeing Project Valhalla coming to life, because Project Valhalla will introduce value types and will use type descriptors for so-called Q types that provide value semantics for method arguments and method results, which is exactly what we need here. What I don't see from the current state of Valhalla is whether, when you return a Q type, you actually don't have any allocations. So I would like best to have the guarantee of value semantics and no allocation on a return. So next, type parameters. Generics would be the counterpart in Java. Here's a small Fuzion example of how type parameters can be used.
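The "temporary container object" option that the benchmarks favored is what a Java record expresses naturally; until Valhalla, the JIT's escape analysis is what decides whether the allocation is elided. A minimal sketch of the shear example under that approach (my own names, not generated code):

```java
public class Points {
    // Returning a product type by allocating a small temporary object.
    // Under Valhalla value types, this allocation could be guaranteed away;
    // today we rely on JIT escape analysis.
    record Point(int x, int y) {
        Point shear(int factor) {
            // Value semantics: build and return a fresh point, never mutate.
            return new Point(x + factor * y, y);
        }
    }
}
```

The alternatives the talk benchmarks (static fields, caller-preallocated out-parameters, thread-locals) avoid the allocation but trade away thread safety or pay heavy access costs.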
This is a function that calculates the arithmetic mean of three values that could be of an arbitrary numeric type T; it just sums up those three values and divides them by the value of three, but it has to first create a value of three in that given numeric type. That could be something like a complex or a double or whatever is fed in there. And now we can call this with integers or with floating-point values. Java's implementation of generics uses type erasure, so there's no type information at runtime, but Fuzion uses so-called monomorphization. That means we have specialized versions of the code for every type that is in use. What that means is that our back end, in this case, creates for every actual type that is used with a generic function a specific implementation for that type, with all the types stripped down to the actual type parameter. So that's quite straightforward. Next, inheritance: Fuzion has multiple inheritance, and the question is how to implement that. The ways we've looked at include putting some type identifier into our instances and then doing some kind of table lookup and invokestatic to call the target. We looked into how invokedynamic could help us; unfortunately, Fuzion is so static that it doesn't help us much at all. And finally, invokeinterface is actually the most useful solution for us, because it supports multiple inheritance. So in effect, what our back end does is, in most cases, our dynamic binding just binds to one possible target, so we can just use an invokestatic. And only in a few cases do we actually see that there are several possible dynamic targets, and then we compile them to an invokeinterface, and we have specific interfaces defined for every single function that is called with this dynamic binding. So we have a case where the classes that we generate could actually implement really, really many interfaces, and we have to see how that scales with bigger applications.
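Monomorphization can be illustrated by writing out by hand what a specializing back end would generate. This is an invented sketch, not Fuzion's actual output: one erased generic in Java versus one concrete copy per used type, each with the "value of three" already in the right numeric type.

```java
public class Mean {
    // What a monomorphizing compiler emits: one specialized copy of the
    // generic mean function per actual numeric type used in the application
    // (method names invented for illustration).
    static int mean_i32(int a, int b, int c) {
        return (a + b + c) / 3;        // the literal 3 as an i32
    }

    static double mean_f64(double a, double b, double c) {
        return (a + b + c) / 3.0;      // the literal 3 as an f64
    }
}
```

Java's erased generics would instead compile a single method over boxed `Number`-like values; the specialized copies avoid boxing and runtime type dispatch at the cost of code size, which is why the closed-world static analysis matters.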
So, coming towards the end: the class file verifier. Not much to say there, but the class file verifier helped a lot in developing the JVM back end, compared to the C back end that we did before, because we saw so many errors much, much earlier than we would see them in the C world. So the status of Fuzion at the moment: the language definition is getting more and more stable; the base library is still a lot of work in progress. We have a JVM and a C back end. And we have basic static analysis tools available. And if you came to see the dogs, this is Fedex and Shadow, who disturbed me working on this during the last year. This is where you find more information. Follow us on Twitter, give us stars on GitHub, stay informed. Thank you. Any questions? Hi. Hi, so you mentioned monomorphization, and you had this example with this function that takes T, which is numeric, and then you generated, I think, 10 different versions for all the numeric types. But then if you have, like, three type parameters which are all numeric, do you generate a thousand different versions? If there are three type parameters, there will be one version generated for each combination of these three type parameters that is actually used in the application. So it's only what's used in the application. So it's kind of a closed world? Yeah, so we have a static analyzer over the whole application. Okay, so you don't have incremental or separate compilation? It's static at compilation. When we compile, we look at the whole application. We don't have any dynamic adding of more code, so we luckily don't have the problem of having to be open to adding more; we know exactly what will be in there. And do you have a Java FFI with the JVM back end? Do we have a Java foreign function interface? Can you call into Java?
At the moment, no. We are looking forward to using Project Panama there, as we've learned about today, because that would be a big step for us — it would also help on the C interface side. We don't have any FFI for calling C code at this point. Okay, thank you. Next question: have you made up your mind on the approach to concurrency you want to take? On the JVM, virtual threads could be an option, but at the same time, if you have a C back end, that could be really expensive to implement on your own. We do have very simple concurrency support right now, but it's basically limited to starting threads — there's not much synchronization or anything going on. Our current idea is that when we do something concurrent, we want to use the static analyzer as much as possible to prove that there are no race conditions and that the code is safe. The question is what channels we want to provide to actually allow inter-thread communication and all of that. We are still looking into possibilities — there are many, many things we could do, and it's not decided yet. Thank you.
OpenJDK Project Wakefield: The Wayland Desktop for JDK on Linux
Hello. Hi. Everyone, please take a seat — we are starting. This session is OpenJDK Project Wakefield: the Wayland desktop for JDK on Linux. It's going to be a bit of a show here because there are three of us, so we're going to swap over; there are only two mics, we'll do our best. Very quick intro and then I'll turn it over. I'm Phil Race. I work at Oracle, and I'm the lead for the client libraries in OpenJDK. Next to me is Alexey Ushakov — he actually used to work in the Java 2D group at Sun a long time ago, but these days he works at JetBrains. And there's Niels De Graef, who is a product manager — product owner, product whatever — at Red Hat in the desktop group, and he is going to do our first section. I'll hand it over to Niels. Okay. Maybe I should. You should. This is going to be very interesting to do. I did not sign up for some dancing, but it's fine. First, quickly, about structure: we're first going to explain what Wayland is, then explain how OpenJDK tries to work with it and the whole Wakefield project, and then finally we have an actual demo and some explanation by Alexey. So, quickly, Wayland — because we don't have that much time. A bit closer; okay, that should be better. First of all, what is Wayland — and what is X11, which it tries to replace? It's about displaying: rendering things into, let's say, a matrix of pixels, something called a frame buffer. You usually try to get that onto a screen, but maybe you also try to stream it somewhere over the internet. And why do we need something fancy around that? Because once you have multiple applications trying to render something, you need some kind of decisions: do we put them next to each other, on top of each other, that kind of thing — basically window management, or tiling, or whatever you want. And that's where a display server comes in.
And the apps talk a display protocol to the display server. It's usually also very related to input management because, you know, if you're typing something meant for your browser, you don't want it to end up in some keylogger application instead. So, quickly, X11 — let's say, the old thing. We start from the right: we have the X11 server, which you start up normally using something like startx, and which listens on a socket, usually named after the display, usually something like zero. And then each of your applications — imagine those two top applications being your browser or your file manager — will also connect to that socket. That's what the DISPLAY environment variable is for: it defines which socket, and therefore which server, the client will try to connect to. Then you say something like: okay, X11 server, please create me a new window — XCreateWindow, with this width and height — and then you can do the fancy things you would normally expect to be able to do with a window. Now, the whole logic of "should we be tiling or should we be doing overlapping windows" and so on — that's where another X11 client, usually the window manager, comes in; that's what implements all that policy, let's say. So that's how the usual thing in X11 goes. Now — very oversimplified — X11 is old. It's from the 80s. Being old is not necessarily a problem, but it is older than, for example, Java and Linux. One of the issues is that it made a lot of assumptions that don't necessarily hold anymore and that are baked into the core protocol. For example, it talks about a root window: once you have multiple monitors, you can still try to have one big frame buffer that spans all of those monitors.
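The DISPLAY convention mentioned above can be made concrete with a small sketch (helper names are mine, not from any X library): DISPLAY has the form `[host]:display[.screen]`, and a local display N conventionally maps to the Unix socket `/tmp/.X11-unix/XN`, which is what an X11 client like the JDK connects to.

```java
// Sketch: how an X11 client resolves the DISPLAY variable to a server socket.
// DISPLAY looks like "[host]:display[.screen]", e.g. ":0" or "remote:1.0".
public class DisplayParser {
    public static String socketPathFor(String display) {
        int colon = display.indexOf(':');
        if (colon < 0) throw new IllegalArgumentException("no ':' in DISPLAY");
        String host = display.substring(0, colon);             // "" means local
        String rest = display.substring(colon + 1);
        int dot = rest.indexOf('.');
        String number = (dot < 0) ? rest : rest.substring(0, dot);
        if (!host.isEmpty()) return "tcp:" + host;             // remote display: TCP, not a Unix socket
        return "/tmp/.X11-unix/X" + Integer.parseInt(number);  // local display: Unix socket
    }

    public static void main(String[] args) {
        System.out.println(socketPathFor(":0"));    // /tmp/.X11-unix/X0
        System.out.println(socketPathFor(":1.0"));  // /tmp/.X11-unix/X1
    }
}
```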
If you have multiple DPIs, though, you get into trouble. Once GPUs get more and more complex, there are overlay planes, and you want to do fancy stuff with those for performance and battery reasons. There's security: X11 allows you to screen-share anything and do input sniffing, snooping, and spoofing without any kind of consent or notification. There's a devroom about Linux on mobile — I do not want a device that could do all of that with my private data. And there are new UX paradigms like HDR, tiling window managers, and so on, that make things harder; HDR especially is very hard to do in X11. So at some point people got together to create a new display protocol, which is Wayland. It's very minimal — really, really minimal. It tries hard to make sure it does not fall into the trap of making assumptions again. It just says: okay, we have clients that want to send something rendered to a server — a compositor, let's say — and then we can extend things from there. It doesn't even have a concept of top-level windows, for example; you need an extension for that, called xdg-shell, if you ever want to look it up. And some things you just don't want in the display protocol anymore. Screen sharing, for example, is also related to video, so we said: okay, let's move that out of the core protocol, into something called portals — I'll explain what portals are later. So what does a typical Wayland session look like? Again, we start from the right. We have the Wayland compositor, which you start: with GNOME, that's going to be GNOME Shell; with KDE it will be KWin; elsewhere it will be something else. It will create a Wayland socket, which clients connect to, speaking the Wayland protocol. And a Wayland client will say: okay, please create me a surface.
And then, using that protocol extension — xdg-shell — you have something where you can say: I want to make this a top-level window, and I want it this size. You cannot do positioning in Wayland — always fun; there are a lot of reasons for that. And, for example, another Wayland client can be XWayland, which is its own X11 server. So inside your Wayland session, let's say, you can actually also run multiple X11 clients, which speak the whole X11 protocol, and XWayland does the translation to Wayland itself in the best way possible. And I did lie a little bit — I did say that not everything is there yet. There are some things that we don't want to put into a display protocol anymore; we want to do those with portals. That's something new that came with Flatpak and Snap and all these fancy containerization methods: we want to make sure there's some kind of permission mechanism — for example, if an app wants to do screen sharing, let the user choose whether that's okay or not. There's a D-Bus interface that you can access even from within a Flatpak, and then it can go to PipeWire and other components which do not necessarily need to live in the compositor. And with that we can go to the next step, which is how this can be used from within Wakefield — and I think that's the part where Phil comes in. Okay. So Niels described what Wayland is. What we now have is a project to support that Wayland compositing desktop from OpenJDK. And what's it all about, really? Well, the JDK is clearly a platform — it's not just an application — and it abstracts away the underlying platform. So we're not going to be exposing anything about Wayland. Today, on Linux, the JDK happens to be an X11 client; at a crude level, it's basically an X application.
But to be able to support the Wayland desktop, we need to make some changes in the JDK. Some of the policies that Wayland has, which Niels touched on around security and the like, mean that things which are just supposed to work in the JDK won't work — even if we're running as an X client via that XWayland component Niels showed on his diagram, which is what we call X11 compatibility mode; that's Wayland's compatibility mode for X11 applications. And although we don't even fully work in that mode today — even if we get it working, is that really the long-term solution we want? What we'd really like is to be a full-on native Wayland client. So OpenJDK Project Wakefield — there's a URL there — was started a couple of years ago, and there are people from Red Hat, JetBrains, and Oracle working on it. We have a repo in the standard format for OpenJDK project repos. And what are our goals? Well, first off, we're going to support that compatibility mode properly: a fully conformant JDK where everything works as well as it can — as well as it should — when you're running on a Wayland desktop but talking the X11 protocol. Most people will see this these days: if you log into a Linux desktop and pay enough attention, there's an opportunity to switch between pure X.org and the Wayland desktop, which supports that compatibility mode. And right now the JDK only supports X.org. The longer-term goal, as I just touched on, is to support native Wayland. The X11 compatibility mode needs some fixes I'm going to touch on, but the much bigger thing is the native Wayland port.
And there's a list here — I won't read it out — of the different kinds of things we need to deal with to make all of this work. Some of what we need for the native Wayland port is really only just emerging in the latest versions of GNOME. So this work is not intended for older versions of Linux; it's something you'll want — or have to — use on upcoming versions. And the security policies that Wayland enforces — I think that's the right word — are going to be some of the drivers for the things we need to change. For example: one of the most important issues is that we have an API that lets you capture screen pixels. Capturing the screen is, as Niels touched on, something Wayland was very clear about very early on — it doesn't like applications being able to do that, for privacy reasons. But AWT has expected to be able to do that forever. We also expect to be able to do things like move the mouse and grab the pointer. We want to put our windows where we want to put them — Wayland will say no, that's kind of our job — and you can't actually find out where windows are on the screen. Also, in XWayland mode, the HiDPI support is not complete. Some of the things I described above sound like things you'd only need for a test framework, but they're actually part of our standard APIs; not many applications use them, but we have to be able to support them. And there's a bunch of bug-level fixes we've found we need to do as well — as the project went on, we actually found some bugs on the Wayland side too. All of these things are described in detail at that URL, which I'm obviously not going to read out for you. So where are we now? JDK 21 pretty much did all of the work.
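The AWT calls mentioned above — screen capture and pointer control — really are part of the standard platform API, which is why a Wayland port can't simply drop them. A small, headless-safe reflective check (the helper is mine) confirms they exist on `java.awt.Robot`:

```java
import java.lang.reflect.Method;

// Headless-safe sketch: verify that the java.awt.Robot methods discussed in
// the talk are part of the standard API a Wayland port has to reimplement.
public class RobotApiCheck {
    public static boolean hasMethod(String name) throws Exception {
        for (Method m : Class.forName("java.awt.Robot").getMethods()) {
            if (m.getName().equals(name)) return true;
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasMethod("createScreenCapture")); // true
        System.out.println(hasMethod("mouseMove"));           // true
        System.out.println(hasMethod("getPixelColor"));       // true
    }
}
```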
In JDK 21 we got the spec changes in that we needed, and there is a complete new implementation of the Robot screen capture, done almost entirely — well, actually entirely — by Alexander Zvegintsev. It uses the ScreenCast portal and PipeWire. The first time somebody tries to do a screen capture, there's a manual step of the user saying "yes, that's fine", and after that it's okay forever, in theory. There are some follow-up fixes going into JDK 22. Basically, if you have a desktop with GNOME 42, we should be good, and that will probably mean vendors will be able to officially support running on the Wayland desktop in this compatibility mode with 22, which should ship in a month. And that's when we shift real focus to the pure Wayland toolkit. So what's involved there? A complete new toolkit spans all of AWT, on both the window management and the rendering side. All of these things here — creating windows, configuring windows, the management of everything, integration with the desktop, how you render to the surface — need to be redone. We can't use X11 OpenGL — GLX, really — or XRender, and X11 plus XRender is the default way we do everything on Linux today. When I was trying to describe it to somebody who's more of a VM person, it's like: well, we need a new JIT and we need a new GC. That's the kind of scope of the work. So how would you do this? Well, GTK4 makes it fairly easy for a lot of applications to just port over, because it deals with all of that and hides it from you. And Wayland — one of the things is it doesn't have a window manager, so it's client-side decoration; it's all client side, and GTK would do that for you.
It sounds like it would be easy to get a lot of things up and running that way, but it brings in a lot of baggage — if you do an ldd on a running GTK4 process, you'll be paging through the output for a while. And really, one of the problems was that it's just very hard to get the level of control over when you render to the surface that we need with GTK4. It's more work to use the low-level libwayland, which is basically the equivalent of libX11. But whenever we've tried to do something like this in the JDK — the last example was when we were doing a pipeline for macOS for their new rendering engine: they have a high-level toolkit that's intended for applications, but we needed to use the lower-level one. It just seems to work out that way every time. Anyway, there are some new challenges that native Wayland brings which aren't there in the X compatibility mode. We need a new library, libei — however people pronounce it — and that's only just appearing in GNOME 45. Well, the API is final, I believe, but I think it's the first time it's been out there. The inability to lay out windows that I touched on has some oddities — like splash screens coming up in the top left-hand corner of the screen, and that's not great from my perspective. So there is already a toolkit in progress, and Alexey is going to show it to you right now. It's called WLToolkit. Over to Alexey. Okay. Yeah. We use a separate thread for event handling in our prototype, called WLToolkit, and the actual handlers of Wayland events are dispatched on the event dispatch thread. On Wayland, rendering happens on the client side, so the client needs to notify the compositor about the region that should be updated.
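The threading model just described — one thread polling the display connection for Wayland events, with handlers run on a single dispatch thread, mirroring AWT's event-dispatch-thread rule — can be sketched with a plain queue (class and thread names are mine, not WLToolkit's):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the WLToolkit-style threading model: a poller thread posts event
// handlers, and a dedicated dispatch thread is the only one that runs them.
public class EventLoopSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread dispatchThread = new Thread(() -> {
        try {
            while (true) queue.take().run();   // handlers run on this thread only
        } catch (InterruptedException e) { /* shutdown */ }
    }, "dispatch-thread");

    public void start() {
        dispatchThread.setDaemon(true);
        dispatchThread.start();
    }

    // Called from the polling thread: hand the event handler over.
    public void postEvent(Runnable handler) { queue.add(handler); }

    public static void main(String[] args) throws Exception {
        EventLoopSketch loop = new EventLoopSketch();
        loop.start();
        CountDownLatch done = new CountDownLatch(1);
        StringBuilder who = new StringBuilder();
        loop.postEvent(() -> { who.append(Thread.currentThread().getName()); done.countDown(); });
        done.await();
        System.out.println(who);   // dispatch-thread
    }
}
```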
On the rendering side, we need to have the complete frame content ready and then submit it to Wayland. That required some changes in AWT and Swing code to make sure all the content is ready for submitting. Also, we use software rendering for our toolkit. Software rendering was actually the only option in the early days of the Java platform for Swing applications, but since then the situation has changed, and the Java platform now provides hardware acceleration on all major desktop platforms. Surprisingly, in the current WLToolkit implementation we sometimes get better performance than XToolkit in X11 compatibility mode — for example, the SwingMark benchmark shows about 40% better performance than XToolkit. So it's quite sufficient for desktop applications; we can use it now. However, there are some areas where we still have lower performance: in the current implementation, for example, we are about three times slower compared with the hardware-accelerated XToolkit. Of course, modern desktop applications — including our IDEs — need rich text capabilities, so we're going to work on this and improve the performance of font rendering. Our current plan is to use Vulkan for hardware acceleration with WLToolkit. And let's see our demos. Let's try to run them — it's quite an unexpected resolution here. Here you can see a special awt.toolkit.name property — a standard property — set to WLToolkit to enable this toolkit for us. And let's run SwingSet2. Oh yeah, it looks like it fit the screen. Okay. You may notice that we have unusual window controls here, if you look. That's because Wayland, in its core part, doesn't provide window decorations, so these controls were rendered by us — client-side decorations — in WLToolkit. Okay, and let's see how it works. Here are frames. Buttons — we have an animated curve here. Some combo boxes, dialogs. Checkboxes working. Some more dialogs here. The progress bar demo.
A scrollable image here in a scroll pane. And sliders. Here we have a split pane, a tabbed pane with some animated images, tables — yeah, they work quite well — and tooltips. And a tree with expanded nodes. So, as you can see, all the controls are properly rendered here. Then I'd like to show one more standard demo — it has been bundled with the platform for many years — to show Swing UI framework capabilities: the Java 2D demo. It shows some advanced graphics. We can see curve rendering here. It's not running at full speed, so we can reduce the delay and see that it actually works quite well. We have many different things here: clipping, color conversions, compositing rules, font rendering, image ops — that's conversions on images — and some stroking capabilities. Here's the Mix demo, with different primitives and paint modes. We also have path rendering and transformations here. As you can see, the performance is quite acceptable. Now we'll try — because of the resolution — to launch a real-world application: the Community edition of IntelliJ IDEA. Wow. Yes, it works. And here we also see that we use the same awt.toolkit.name property, in the special property file that we use for the IDEs. Here we can see the actual WLToolkit implementation — this is the constructor; it's quite difficult to see — and here is the separate thread I mentioned that handles Wayland events. Okay, that's it. Thanks for listening. Any questions, if we have time? We have a minute. So, any questions? What is missing — I'm repeating the question: you showed IntelliJ working, so what is missing? It's in the details. We have picky users, and there is some stuff in some corner cases that doesn't work well, but it's generally usable. We have some users who gave us feedback about the quality. But we still need to polish it.
If it wasn't completely clear: you can try this for yourself. Didn't you have a slide showing the branch? You can just check out that branch from the Wakefield repository, build it yourself, and try it. Yep. Yes, over there — does it work with JavaFX? No; at this point this is implemented within the JDK. JavaFX is an entirely separate toolkit, and we'd have to repeat this exercise for it, unfortunately. Yes, over there — sorry, feedback — did we fix bugs in Wayland? No, we reported them. We have friends at Red Hat, so we've had some calls with a couple of the developers who work on the desktop and even on Wayland directly, and they'd say, yeah, file a bug here. So we've reported bugs and they've been fixed. Yes — any plans to support fractional scaling? My recollection is that Wayland itself fundamentally decided not to support fractional scaling. There's an extension? There's a... of course, there's an extension. So yeah, there's a protocol extension for fractional scaling, and if WLToolkit wants to implement it, it can do so properly. And with the native Wayland mode it should already be better than the blurriness you sometimes can't escape today. The one thing about that, though: with fractional scaling — we don't have to deal with it on macOS, because there it's always an integer multiple. With the Windows look and feel we are still sorting out bugs trying to make it work, so we would undoubtedly find a whole bunch of new bugs with the GTK look and feel when we started doing that. It's not just a simple matter of saying "oh, yes" — there'll be a mess to be sorted out that's separate from the Wayland project. I think we're probably out of time. Yeah.
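As shown in the demo above, the prototype toolkit is selected via the `awt.toolkit.name` system property before AWT initializes — e.g. `java -Dawt.toolkit.name=WLToolkit ...`. A minimal, headless-safe sketch of reading that property (the class name is mine):

```java
// The toolkit is chosen before AWT initializes, e.g.:
//   java -Dawt.toolkit.name=WLToolkit -jar SwingSet2.jar
// This sketch just shows that it is an ordinary system property; unset, it is
// null and the platform default toolkit is used.
public class ToolkitNameDemo {
    public static void main(String[] args) {
        String name = System.getProperty("awt.toolkit.name");
        System.out.println(name == null ? "<platform default>" : name);
    }
}
```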
Zeroing and the semantic gap between host and guest
Hello. I want to start. Hi, everybody. My name is Volker Simonis. My talk, my slides, and my examples are on GitHub — I'll show this link one more time at the end of the talk, so you can take a picture if you want. I'm currently having some fun in the Amazon Corretto team working on OpenJDK, and I did the same for quite some time in the past on the SAP JVM and SapMachine team. Today I want to tell you some details about running Java in containers — in two different kinds of containers. One is CRIU and the other is Firecracker. So what is CRIU? CRIU is Checkpoint/Restore In Userspace. That's functionality in Linux which allows you to serialize a whole process tree to the file system — basically to an image, or a set of image files — from which it can later be restored and continue running in the same state it was checkpointed in. It only saves the anonymous pages of the process, so it's quite efficient; it doesn't save the shared pages. We will see what impact that has. And CRaC — that's Coordinated Restore at Checkpoint; it was mentioned before in several talks. That's a project in OpenJDK which has basically two goals. One is to create a userland checkpoint-and-restore notification API, which allows applications to react and take actions on a snapshot or restore event. Before the snapshot they can do certain things, like zeroing out secret memory; on restore they can, for example, re-establish network connections which they tore down at snapshot time — things like that. It's gaining quite some traction in the community: new versions of popular frameworks like Spring, Micronaut, Quarkus, and even AWS Lambda support this API, so if you write applications or code for these frameworks, you can already use it. The second goal of the CRaC project is to make the JDK itself snapshot-safe — the JVM as well as the JDK class libraries.
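The notification API described above is, in the real CRaC project, the `Resource` interface (in `org.crac` / `jdk.crac`, with `beforeCheckpoint(Context)` and `afterRestore(Context)` callbacks registered on a global context). To keep this runnable without a CRaC-enabled JDK, the sketch below uses simplified stand-in types of my own; only the callback idea is taken from CRaC.

```java
import java.util.UUID;

// Simplified stand-in for the CRaC notification API (the real one is
// org.crac.Resource / jdk.crac.Resource with Context parameters).
interface Resource {
    void beforeCheckpoint() throws Exception;  // wipe secrets, close sockets...
    void afterRestore() throws Exception;      // recreate them; runs once per restored clone
}

// A resource holding a unique id: it must not be baked into the snapshot
// image, because every restored clone needs its own fresh value.
class UniqueIdHolder implements Resource {
    private UUID id = UUID.randomUUID();

    public UUID id() { return id; }

    @Override public void beforeCheckpoint() { id = null; }
    @Override public void afterRestore()     { id = UUID.randomUUID(); }

    public static void main(String[] args) throws Exception {
        UniqueIdHolder h = new UniqueIdHolder();
        UUID before = h.id();
        h.beforeCheckpoint();   // what the JVM would drive at snapshot time
        h.afterRestore();       // ...and at each restore/clone
        System.out.println(!before.equals(h.id()));   // true: each clone gets a distinct id
    }
}
```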
Making the JDK snapshot-safe means that the JDK classes themselves use this notification API to take the actions I just talked about. And this is sometimes useful, or even required, to react appropriately not only on checkpoint and restore but also on clone — because once you've checkpointed an application, you can not only restore it, you can restore it many times, which I call cloning. Then it's important, for example, if you have UUIDs or secrets, to wipe them out, reset them, or recreate them, as I said. And CRaC uses CRIU as the tool to do the actual checkpoint and restore, but as I said, the API can be used without CRIU itself, and we will see how it can be used with Firecracker, for example. So let's dive into a quick demo. I will use Pet Clinic as the example here. Oh, this is the wrong window — this one is for CRIU. I just start Java with some default settings which I pick up from the Java options; it's basically Java 20 or 22, I think, running with 512 megabytes of max heap, running the REST version of Spring Pet Clinic. It takes about 10 seconds to initialize, and then I use curl to access it, just to make sure that it works — and yes, you see, it really works. Now we use pmap to look at the RSS of the Java process: it's almost 450 megabytes, as you can see. And we can now use CRIU to dump this process. Oh, I think it's hard to see — I'll scroll up. So this was the command line to dump the Java process into a directory. Once we've done that, we can take a look at how much space that used, and you see that the image itself is smaller than the footprint of the process. That's because, as I said, the image only contains the private and dirty pages of the process, not the shared pages from mapped files, for example. And we can now restore this process: we use criu restore from the same directory, and it completes in about 200 milliseconds.
And if we use pmap again, you see that after restore it uses less memory — about 20 megabytes less than before. Why is that? Again, because of shared pages. This is the diff of the whole pmap output between the initial process, before it was snapshotted, and after restore. The basic difference is that for a lot of system libraries — libnss, for example — we used around 140 kilobytes during startup, but those pages are not required anymore after restore. CRIU has still recorded that the process can access this memory, but until the process touches those pages, they won't be paged in. That's why the process uses less memory after restore, which is a nice side effect. Okay, what other possibilities do we have? We start the application once again — it always takes about 10 seconds — and it works again. Now, there is a feature for trimming the native heap, introduced by my former colleague Thomas Stüfe, which basically frees the malloc — the glibc malloc — buffers, and this can have quite a significant impact on the footprint of the process. We see that the glibc malloc cache used about 60 megabytes, and if we run pmap again we see that the RSS is much lower now. I also experimented with a new option which zeroes the unused parts of the heap: it basically does a system GC, and then all the unused parts of the heap are zeroed. If we do that and look at the memory footprint of the process, we see that the footprint actually got bigger — because parts of the heap which weren't mapped before now get paged into memory, but they contain only zeros. And I have a pull request for the CRIU project such that CRIU can recognize zero pages and ignore them while dumping. If we checkpoint now with this zero option, it's basically the same as before.
For that checkpoint we just used the skip-zero flag, which is not standard yet, but I hope it will be integrated soon. And if we take a look at the image size, we see it is now considerably smaller — just 200 megabytes — because all the pages which contain only zero bytes are replaced by a reference to the kernel zero page; the image file is basically a sparse file. And when we restore the process, the memory footprint is smaller as well. We restore now from the new directory, and when we take a look at the pmap output, you see it's just 270 megabytes. This is all a little cumbersome — so why not use a CRaC release itself? The good thing is that CRaC does everything I've just shown you doing manually with a normal JDK and a normal CRIU release; it's built into the CRaC build of OpenJDK. We use the checkpoint option — -XX:CRaCCheckpointTo — and give it a directory. We run the application, and once it has initialized and we see it works, then instead of using the CRIU command directly we can use jcmd to checkpoint. I'll scroll up here: we call jcmd with the PID of the Pet Clinic application and execute the JDK.checkpoint command, and that did the checkpoint and also killed the process. We can now restore it again with the help of Java, using the second CRaC option — -XX:CRaCRestoreFrom — giving it the directory the image was saved to. This again takes just a few milliseconds, and we see it works. And again the memory footprint is like before — around 280 megabytes after the first restore — so considerably smaller, because the heap was shrunk. What CRaC is actually doing is not zeroing the memory but unmapping all the unused parts of the heap, and I also recently added a feature to call into Thomas's trim-native-memory functionality to also free the glibc malloc buffers.
So, to summarize: the Spring Pet Clinic application has a memory footprint of a good 500 megabytes, and after restore it's a little smaller because not all the shared pages have to be restored. The image size is about 500 megabytes. If we zero out all the unused heap and use the skip-zero flag of CRIU, the RSS goes up just before the checkpoint, but in exchange we get a much smaller image and also a smaller footprint after restore. And it's the same with CRaC, because it doesn't zero but unmaps the memory, with the same effect — so you might wonder why we need the zeroing at all and don't just use CRaC. I hope that becomes clear in my next example. For it I will use Firecracker, which is KVM-based virtualization — basically QEMU on steroids: a stripped-down virtualizer with only a restricted set of devices, a block device and a network device. It has a REST-based configuration, it's written in Rust, and it's open source under the Apache 2 license. If you've ever used AWS Lambda or Fargate, that's the technology which drives those offerings — every Lambda function runs in its own KVM-based Firecracker container. This is a diagram of how it works, but I think we don't have time to go into the details today; instead I want to show you how this works practically. So I have prepared another window here. I use a script which starts Firecracker, and inside Firecracker it then starts the Pet Clinic application again. If you take a close look, this boots its own kernel — Linux 6.1.7 here — in a virtual machine, and then inside it starts the application. And as you can see, this virtual machine has its own network address, so we cannot use localhost anymore.
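The "REST-based configuration" mentioned above can be made concrete: a Firecracker micro-VM is defined by a handful of JSON resources (`boot-source`, `machine-config`, `drives`) PUT to its API socket, or supplied together in a config file. A hedged sketch of such a config — the paths and sizes are placeholders, not the ones used in this demo:

```json
{
  "boot-source": {
    "kernel_image_path": "/path/to/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1"
  },
  "machine-config": {
    "vcpu_count": 2,
    "mem_size_mib": 1024
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/path/to/rootfs.ext4",
      "is_root_device": true,
      "is_read_only": false
    }
  ]
}
```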
So we have to use the IP address of the virtual machine running on our host system, but apart from that it works exactly the same. We now have to look at two footprints. We have to look at the footprint of the Firecracker VM itself, which is about 670 megabytes — slightly bigger than that of the plain Java process — and we can also look at the size of the JVM inside the guest, which is about the same as when you run it locally, which makes sense. And we can now snapshot the whole Firecracker container. Again, that takes just a few seconds, into this directory, and if we look at how big it is, it's about 670 megabytes — about the size of what the whole Firecracker container had in memory when it ran. Just to demonstrate how it works, we can now restore from the snapshot. This again spins up the whole virtual machine in about 200 milliseconds, and if we check how much memory it takes, you see it takes very little, because it only pages in the pages which are really required to start the virtual machine. CRIU paged in all the pages from its image file into the newly created process, whereas Firecracker does this lazily — that's why it initially needs so little memory. The funny thing is that if you look inside the container — SSH into it and do a pmap — the Java process within the Firecracker container still needs basically 500 megabytes, but the VM itself has only paged in about 50 megabytes of memory. And what can we do now? We already saw how big it was, so let's just do a request. You see it's still working after the restore — the network devices and interfaces are restored — and if we look at the footprint of Firecracker after the restore, you see it gets bigger, to about 270 megabytes, which corresponds mostly to what the CRIU process used.
That's what the CRIU-restored Java process used, so about these 270 megabytes seem to be required in order to process this request in Pet Clinic. So now, how can we make the image size of the Firecracker container smaller? Because 690 megabytes is quite big. So again we run Firecracker, and you can see that I started the Firecracker process with the CRaC checkpoint option, so I can use the jcmd version to checkpoint. So we have the Java process in the KVM guest. Again we SSH into the Firecracker container, and inside the container we execute jcmd to checkpoint. This is a special version of checkpoint: it doesn't make sense to use CRIU within the Firecracker container, because we will snapshot the whole Firecracker container anyway, so instead we use a special mode of CRaC which only executes all the callbacks — and thus all the optimizations — but doesn't actually call CRIU. So then we start it again inside, and when we look at the memory inside the container, we see that it's about 290 megabytes — the RSS went down. But unfortunately the Firecracker process itself still uses that much memory, and if we snapshot it — that works — but let's take a look at the size: it's still 600 megabytes. That's why I chose this title: that's what I call the semantic gap between the guest and the host. Even if I free memory in the guest, the host kernel cannot know that these pages are not used anymore by the guest system; they are still dirty from its perspective. So if I snapshot the container — the whole VM — it has to save them to disk, which makes it inefficient. So there are different possibilities to cope with this. One is to use
the trim-native-memory and zero-unused-heap options I showed you before, because then Firecracker has the chance to wipe these pages out of the image, which makes the image size smaller. I have summarized this here in this table. Initially the Firecracker process needs almost 700 megabytes of RSS, the JVM inside, like before, 500. The snapshot is about 600 megabytes; after restore, 50 megabytes; and after the first request, again about 266. If we run this with CRaC and do the checkpoint, we can minimize the memory size within the VM to about 290 megabytes, but the snapshot size itself stays at 600. If we do the trim-native and zero-unused-heap, the memory consumption of the virtual machine goes up, because again we touch all the pages in order to zero them. But we get a much smaller image size, because now the virtual machine monitor again replaces these pages by the kernel zero page — so we get a much smaller image and a faster startup time. There's another possibility, and that's called init_on_free. That's a kernel option. Usually, when you unmap a page and give it back to the kernel, the kernel doesn't do anything with that memory; the kernel zeros the memory when you allocate it — when you mmap a new page, the kernel will give you a page containing only zeros. But there is an option called init_on_free which does this the other way around. It exists, for example, for security-critical applications, where people want to make sure that once they release memory, it is immediately zeroed out. The thing with this is that the initial memory size of the container goes up, because when the kernel boots up it zeros the whole memory — it touches all pages — so the footprint of the Firecracker process is about one gigabyte, which is what I gave it for the guest. On the other side, when we snapshot this,
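The init_on_free idea — pay the zeroing cost when memory is released rather than when it is handed out — can be mimicked in user space with a buffer pool. This is my own illustrative analogy, not kernel code:

```java
import java.util.ArrayDeque;
import java.util.Arrays;

public class ZeroOnFreePool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int size;

    public ZeroOnFreePool(int size) { this.size = size; }

    public byte[] acquire() {
        // Like init_on_free: pooled buffers are already zeroed,
        // so handing one out is cheap and leaks no stale data.
        byte[] b = free.poll();
        return b != null ? b : new byte[size];
    }

    public void release(byte[] b) {
        Arrays.fill(b, (byte) 0); // pay the zeroing cost at free time
        free.push(b);
    }

    public static void main(String[] args) {
        ZeroOnFreePool pool = new ZeroOnFreePool(4);
        byte[] b = pool.acquire();
        b[0] = 7;
        pool.release(b);                        // zeroed here, not on next acquire
        System.out.println(pool.acquire()[0]);  // prints 0
    }
}
```

The trade-off mirrors the kernel's: releases get slower, but freed memory is immediately clean — which is exactly what lets the snapshotter treat those pages as empty.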
we get down to 420 megabytes, which is already quite nice. The last feature I wanted to mention briefly is ballooning. That's a special device inside the guest which can allocate memory — think of it like a file cache — and which has a means to communicate back to the KVM manager on the host and tell it that it can now reuse this part of the memory. So if we inflate the balloon, we can decrease the footprint of the whole virtual machine, but unfortunately the snapshot gets bigger again, because from the host side these pages still look tainted. So we have to combine ballooning with init_on_free; then we get all the benefits — a very small footprint of the running KVM process and the smallest image size. With that, I've come to the end of my talk. There are some references here, and links to the examples I showed you, and this is where you find the presentation. So thanks a lot. Thank you.
Inner Workings of Safepoints
Hi, I'm Johannes Bechberger. I'll be talking about the inner workings of safepoints. Essentially, what are safepoints? You have this nice little VM, and you stop it. And as you saw, we have local safepoints that only stop one thread — one thread stops, the others just keep running — and we have global safepoints. The local safepoints are quite cool because we stop a single thread: the thread doesn't modify its stack anymore, which is nice for things like garbage collection that just want to operate on the thread's stack. And the global safepoints are also quite interesting: essentially, you stop all the threads. That's useful when you want to do code deoptimization and other stuff. So that's what we're talking about: safepoints. Safepoints guarantee either that the state of the whole VM is stable, or that the state of a single thread is stable. And they are one of the building blocks of the JVM. And they're getting even more important, because newer garbage collectors like ZGC use safepoints — especially the thread-local safepoints — to do as much work as possible concurrently, checking every now and then at returns of methods, instead of having a stop-the-world garbage collector as before. So essentially: when do we check whether we should be going into a safepoint? Take a typical method — it just multiplies and adds. What we see here is that we go into a safepoint either when we return from a function, which is pretty neat, or when we are at the back edge of a non-counted loop. So in this example, we check for a safepoint every time we're here at the back edge, or when we are at the end of the function — but beware of inlining. And that's a problem here.
When we inline a function, we don't have a safepoint at the end of the inlined function, because that function doesn't exist anymore for the JVM. And then, of course, we sometimes have loops — some of you have written such for loops. The problem is that, some years ago, a counted loop like this didn't have a safepoint check inside at all. And especially if B got really large, it meant the JVM could take quite a lot of time to reach the next safepoint. So people started to do loop strip mining. The idea is essentially that you split the loop you had before into an outer loop, which typically increments the value by a thousand per iteration, and an inner loop — and you have a safepoint check here, at the outer loop. That's quite good; it's called loop strip mining, and it reduced the latency in your JVM quite nicely. I'm Johannes Bechberger. I usually talk about profilers and sometimes about safepoints. I work with an amazing team of talented engineers in the SapMachine team; we're the third-biggest contributor to this little OpenJDK project that you might have heard about. And I sometimes fix bugs in the template interpreter. The template interpreter is what people mean when they say Java code is interpreted at the lowest tier of compilation. It turned out to not produce safepoint checks at the return of functions, and that's not great when you rely on that fact. So I sometimes fix bugs in the OpenJDK — sometimes people call this work — and this fix was backported to all the older LTS releases. What I'm not talking about is how safepoints suck. Because they sometimes do, especially when you have profilers that are safepoint-biased, so they only sample your threads at safepoint borders. As Nitsan Wakart says, safepoint-biased profilers are tricky, sneaky, filthy — all of the above.
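The strip-mining transformation can be sketched in plain Java: the inner loop is a short counted loop that the JIT keeps poll-free, while the outer loop's back edge is where the safepoint poll would land. The thousand-iteration stride below matches the typical value mentioned above; the sketch is mine, not compiler output:

```java
public class StripMining {
    // One long counted loop: before loop strip mining, the compiled code
    // would contain no safepoint poll inside, delaying time-to-safepoint
    // for large n.
    static long sumPlain(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) sum += i;
        return sum;
    }

    // Strip-mined shape: an inner loop of at most 1000 iterations,
    // with the (conceptual) safepoint poll at the outer back edge.
    static long sumStripMined(long n) {
        long sum = 0;
        for (long outer = 0; outer < n; outer += 1000) {
            long limit = Math.min(outer + 1000, n);
            for (long i = outer; i < limit; i++) sum += i;
            // <- a safepoint poll would be checked here, every 1000 iterations
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumPlain(10_000) == sumStripMined(10_000)); // true
    }
}
```

The two shapes compute the same result; the strip-mined one simply bounds how long the thread can run between two polls.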
And if you want to know more about safepoints and why they suck in profiling, ask this guy in the front — he knows a little bit about it. But on to the real topic of my talk: the implementation, because we're all here to see some C code. Yay. In the beginning, I want to tell you how the checks work. Essentially, for safepoints to work, we have to insert these checks somewhere, and then the JVM goes into a safepoint handler, which does all the amazing stuff like garbage collection work — or, hopefully in the future, some safepoint-based stack walking to make profiling a little bit easier and faster. Essentially, what we could do is insert code that asks, at every return and at every point where we want a safepoint poll: if the thread is at a safepoint, please call the safepoint processing. This would probably compile, if you include the right headers, and in the common case it does nothing. And it's slow, of course, because we have this branch everywhere. But the cool thing is that this occasion is pretty rare, so we can do some tricks. In interpreted mode — and here is the C++ code showing how it actually looks — your template interpreter generates code with a test instruction that essentially sets a condition bit: it just tests that the poll bit in the polling word is not zero. So it's just a simple check; that's how it's implemented in the template interpreter. We could essentially just implement it this way everywhere, and it is implemented this way there. But notice that the branch is taken pretty rarely: usually, we're not going to be at a safepoint — because if we went into the handler at every return and every loop back edge, we would just constantly be at safepoints, and we wouldn't get any work done in our JVMs. So usually, the at-safepoint check fails.
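The branch-based check can be mimicked in pure Java with a flag that a worker tests at each loop back edge — the rarely-taken branch described above. This is a toy model of the idea, not the JVM's actual mechanism (which lives in C++ and generated assembly):

```java
import java.util.concurrent.CountDownLatch;

public class CooperativePoll {
    static volatile boolean armed = false;                  // the "polling word"
    static final CountDownLatch atSafepoint = new CountDownLatch(1);

    static void worker() {
        long work = 0;
        while (true) {
            work++;                      // mutator work
            if (armed) {                 // the poll: almost always false
                atSafepoint.countDown(); // "safepoint handler" slow path
                return;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(CooperativePoll::worker);
        t.start();
        Thread.sleep(10);     // let the worker spin for a moment
        armed = true;         // arm the safepoint
        atSafepoint.await();  // wait until the thread reaches its poll
        t.join();
        System.out.println("worker stopped at its safepoint");
    }
}
```

The `volatile` flag plays the role of the poll bit: the fast path is a plain read-and-branch, and only the armed case pays for the slow path.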
And this is cool, because now we know that we can make this the slow path. It doesn't matter that much how slow it is, if we make the fast path really fast. And one idea is that we could just read from a pointer, because reading from a pointer on the fast path is quite fast, especially if it points into the cache. And the thing is, when we read from a pointer, there are two options: we either read some data — good, it's just a simple mov instruction — or we get a segmentation fault. And that's one of the things the JVM does: it uses segmentation faults to its own advantage. Segmentation faults are somewhat expensive, but the fast path is really, really fast. So the idea is: in our method, where we insert a check, we just access a so-called polling page. When safepoints are disarmed, this pointer points to a good, readable page; but when we arm them, it points to the bad page, and we get a segmentation fault. Then the segmentation fault handler looks and asks: hey, did we just try to access a safepoint polling page? And if so: cool, we're probably at a safepoint. That's one of the reasons why, when you poke around the safepoint handling of your JVM in GDB, you see a lot of segmentation faults — which doesn't help with debugging. But anyway, before we go further and look into the C++ code: I was told by someone in the audience that people like cute cat images. And the thing is, working with the OpenJDK is interesting, but sometimes you have to calm down, take a cat, stroke it, have a nice time — and then you go back to learning how the OpenJDK works. I learned this because I wanted to fix a bug. So, essentially, how safepoints are initialized: we have here a bad page and a good page.
And then we use the magic method protect_memory. Essentially, it calls mprotect, and thereby we make the bad page neither readable nor writable, and the good page just readable — we don't need to make it writable. So essentially we use the memory management unit of our CPU to implement safepoints, and that's pretty nice. How, for example, C1 implements the safepoint check is quite simple: it just accesses the value at this address — our polling page. So it's really just a single address and a single mov, which is nice. Then of course there's the question: how do we arm these? Essentially, when we arm a thread, we set its polling page pointer to the armed value — the bad page. And since we're not doing the segmentation-fault trick in the template interpreter, we also set the polling word, so that the template interpreter can check it too. And when we do a global safepoint poll, we simply do this for every thread; that's pretty simple. And of course, sometimes we want to track safepoints, because they can get quite annoying. If some of you saw the One Billion Row Challenge, you probably saw that some of the winning contenders went out of their way to avoid safepoint checks, because they can get quite annoying — but usually they aren't. So essentially there are a few ways to track them. For one, you can use JFR events — I've built a website called the JFR Event Collection, where you can see all the JFR events available, for all the recent JDK versions. You'll see there is a SafepointBegin event and also a SafepointEnd event, so you can check which safepoints are created. You can also just pass -Xlog:safepoint; you get lots of output. I did this for a Renaissance benchmark, and this is the distribution that I get.
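The JFR route can be tried with the jdk.jfr API that ships in the JDK; jdk.SafepointBegin is one of the event names listed in the JFR Event Collection. A minimal sketch — using System.gc() simply as a convenient way to provoke a safepoint-triggering VM operation:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class SafepointEvents {
    // Records safepoint-begin events around a forced GC and counts them.
    static long countSafepointBegins(Path dump) throws Exception {
        try (Recording recording = new Recording()) {
            recording.enable("jdk.SafepointBegin"); // record every safepoint start
            recording.start();
            System.gc();                            // a VM operation -> safepoint
            recording.stop();
            recording.dump(dump);
        }
        long count = 0;
        try (RecordingFile file = new RecordingFile(dump)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                if ("jdk.SafepointBegin".equals(event.getEventType().getName())) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Path dump = Files.createTempFile("safepoints", ".jfr");
        System.out.println("SafepointBegin events: " + countSafepointBegins(dump));
        Files.delete(dump);
    }
}
```

The same dump can of course be opened in JDK Mission Control, where the event durations show up alongside the -Xlog:safepoint-style information.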
And essentially, most of the safepoints in this case are related to G1, because G1 was my selected garbage collector. If you want to learn more about me or my team, just go to this link. I'm Johannes Bechberger, and I've been telling you a bit about the inner workings of safepoints — I hope you learned something. You can find me on Twitter, on GitHub, and my team at SapMachine. That was all from me. Yes, of course — we have four precious minutes. Any questions, or any corrections from the OpenJDK developers here? [Audience question.] So the question was: before Java 5, how did this work? Any of my colleagues who were around at that time? Any of the OpenJDK developers here — any ideas? I don't know; I only started two years ago. No problem. If these people don't know, then nobody knows. But if you have some ideas, come to FOSDEM next year and tell people about it — a history lesson. Any other questions? None? Good. Then it was a pleasure talking to you. And if you want to learn a bit more about Python, I'm presenting tomorrow at 4 PM in the Python devroom, telling you about Python monitoring. That's all from me. Thank you.
Java… to unlock GPU acceleration for Polyglot Language Runtimes
Okay, can you hear me? Excellent. Thank you. So it's a pleasure to be here among these amazing speakers today. I'm Thanos, a research fellow at the University of Manchester, and part of the TornadoVM team. Today I will talk about polyglot language implementations, which enable programming languages like Ruby and Python to run on top of the JVM, along with Java, of course. And I will try to make a step forward and show you how they can harness GPU acceleration from the JVM. I'll start with polyglot programming, which has been around for many years, but has in a sense been reignited by the advent of the Truffle framework from GraalVM. It enables multiple programming languages to run on top of the JVM and interoperate: one Java class can invoke a Python function, and the Python program can invoke a Java method. This is very interesting and comes with many advantages. But what about GPU programming — GPUs from Java? Well, this is not a thing yet. That's why we have been motivated at the University of Manchester; we have done all this research over the past eight years, and we have created TornadoVM. Here is a link to the TornadoVM resources, with all the presentations that explain the programming model — because my goal today is not to dive very deep into TornadoVM itself, but to present the interoperability with the other programming languages and how they can use GPU acceleration from the JVM. TornadoVM is an open-source plug-in to existing JDK distributions. It is compatible with JDK 21, as you will see later, and it has some very cool features. It has a platform-agnostic API, so developers don't need to know GPU or FPGA programming. It comes with an optimizing compiler: we extend Graal with new phases that take Java methods and compile them to GPU code.
We have a feature of dynamic reconfiguration at runtime, which means that method execution can be migrated from a GPU back to the JVM, and then go to an FPGA if that is appropriate. And with the latest release, 1.0, we have enabled support for off-heap data types: data can be allocated off-heap with the Foreign Function and Memory API — the API that Maurizio described earlier today. So feel free to follow TornadoVM on Twitter, to engage with the website, and of course to fork and try our examples, which are open source on GitHub. I spoke a little bit about off-heap data types, so I'll give an introductory example, because I'm not going to dive deep into the API. Here we see two snippets of code. On the left side, we see a main method that allocates a float array using primitive types, but it is allocated as an object — on-heap, in a sense. To migrate from such an allocation to the new allocation API exposed by TornadoVM, we have created a FloatArray object, which internally allocates memory using a MemorySegment from the Foreign Function and Memory API — and it allocates this memory off-heap. So this memory segment can be used directly from the GPU, without the need to worry about GC collections and such. And the cool part is that even if you don't use GPU programming — even if you don't want to execute on GPUs — you can still use this API to allocate memory off-heap. Here is a link that explains more. I hope it's visible from where you sit; if not, you will find my slides online on the FOSDEM webpage. So the motivation for today: GraalVM enables interoperability between programming languages like Ruby, JavaScript, and others, and TornadoVM enables hardware acceleration for Java. So what if we combine them and harness GPU acceleration from all these programming languages that run on top of Truffle? Let's dive into the tech flow.
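The shape of such an off-heap float array can be imitated with plain JDK means. TornadoVM's real FloatArray is backed by a MemorySegment from the FFM API; the sketch below substitutes a direct ByteBuffer, which is likewise allocated outside the garbage-collected heap (the class and method names here are mine, not TornadoVM's):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class OffHeapFloatArray {
    private final FloatBuffer data; // view over memory outside the Java heap

    public OffHeapFloatArray(int length) {
        // allocateDirect reserves native memory; the GC never moves it,
        // which is the property a GPU runtime needs for safe data transfers
        this.data = ByteBuffer.allocateDirect(length * Float.BYTES)
                              .order(ByteOrder.nativeOrder())
                              .asFloatBuffer();
    }

    public void set(int index, float value) { data.put(index, value); }
    public float get(int index)             { return data.get(index); }
    public int length()                     { return data.capacity(); }

    public static void main(String[] args) {
        OffHeapFloatArray a = new OffHeapFloatArray(8);
        a.set(3, 2.5f);
        System.out.println(a.get(3) + " " + a.length()); // 2.5 8
    }
}
```

On JDK 22+ the same idea would be written with `Arena.ofConfined().allocate(...)` and a `MemorySegment`, which is what the talk's FloatArray actually does.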
In this slide, I present the software stack of GraalVM and Truffle. On top, we see the Truffle framework and many implementations of polyglot runtimes, like GraalPy, GraalJS, and TruffleRuby — and others, because Truffle also enables language implementers to create their own programming languages by using its Java API. So I have grouped Python, Ruby, JavaScript, and Node.js on this side of the slide. Beneath them, there is the GraalVM JIT compiler — the optimizing compiler from Graal. Java is, of course, also running on top of the JVM, and all these languages start in interpreted mode; once they reach a hot state, the optimizing compiler kicks in. And the cool part of such a polyglot implementation — for the compiler enthusiasts — is that there is one Graal IR. The nodes are rewritten at runtime, which means it can adapt: if we invoke a Python function, the nodes can be rewritten, and the Graal compiler will take that shape and emit the assembly code that will run on the CPU. So this solution offers interoperability, and it offers execution across different CPU instruction set architectures. But what if we have heterogeneous hardware, like GPUs and FPGAs, which are available in some systems and servers? Well, then we have TornadoVM, which enables Java methods to be compiled for GPUs, FPGAs, and so on. TornadoVM has its own JIT compiler, which is an extension — a superset, I would say — of the Graal compiler, enhanced with new compiler phases that automatically specialize the code of a method for GPU and FPGA acceleration. At the back end of the compiler, we have three backends at the moment: OpenCL, CUDA/PTX, and SPIR-V. And such a solution enables many things. If you want to learn more about the APIs, you can scan this QR code.
And code implemented with TornadoVM can harness, besides the off-heap data types, also the TornadoVM profiler. If you want to learn more about the characteristics of your application, you can see how much data will be copied to GPU memory and how expensive the I/O is — this can be critical for the performance of the system. And you can even customize how the data transfers are performed: for example, if you have a method that consumes read-only data, then maybe you need to copy the data only once, instead of copying it every time you execute the kernel. Okay, let's jump to the deployment. As I said, TornadoVM is compatible with different JDK distributions — it's not a JVM, it's a plug-in for JDK distributions. So it can be seen as a library, in a sense, because it offers an API in Java, and it is compatible with all these distributions. On the other side, we have the compiler backends, which make it compatible with different heterogeneous hardware accelerators: we can emit vectorized code for multi-core CPU execution through OpenCL, and we can run on different GPUs and FPGAs. In this particular talk, I will focus on GraalVM, because we want to leverage polyglot programming, and on NVIDIA GPUs, because I have created Docker images that run on NVIDIA GPUs. Now, regarding the GraalVM deployment, I will focus in this slide on GraalPy, which is one polyglot runtime implementation. It is shipped in two different standalone releases: the native standalone, which comes with Native Image, and the JVM standalone, which enables the execution of Python programs on top of the JVM and includes the Graal JIT compiler. The version we tested is 23.1, because TornadoVM is compatible with this version of Graal. And here you can see that we have downloaded the community JVM standalone version.
Well, we need the JVM standalone because we want to run with TornadoVM, and TornadoVM extends the GraalVM compiler — that's the reason. The problem is that when we tried it, we found the JVM standalone is shipped with a compiler built as libgraal. This exposes not many compiler modules, and that broke the setup for TornadoVM. This was done to keep the image footprint low, which makes sense, but it broke compatibility with TornadoVM. The good part of this story is that the Graal community is very active in their Slack workspace, so we managed to figure out what the problem was. The bad part is that the solution was to build GraalPy and GraalVM from source, which was quite painful. So, to spare this pain for anyone who wants to try this work, we decided to build a Docker image that contains GraalPy and TornadoVM, and we have also added the NVIDIA driver. If you have a Linux machine — or any machine with an NVIDIA GPU — and you have the NVIDIA Container Toolkit installed, then you will be able to run this image. The Dockerfile and the image are open source on GitHub. On the other side, you can see the QR code for the acceleration library: the code we have implemented in the examples module of TornadoVM for the computation parts that we will offload to the GPU, like K-means and matrix multiplication. There are also other compute examples in the GitHub repository, and you can pull the Docker image from Docker Hub. So let's jump into the examples. As you see here, we have Python and Java with TornadoVM: a Python program that imports Java and then loads a class from the compute-examples module of the TornadoVM repository. And in this Java class that we have loaded, there are two methods that can be accessed from the Python program.
The first one is setInputs, which sets the actual data points and the number of clusters that will be used for K-means. The second one is runWithGPU, which triggers the actual JIT compilation for the GPU and the GPU execution. And on the other side, we have the Java TornadoVM part, where we use Java and the TornadoVM API to create these parallel implementations of K-means. In this slide you see the steps for cloning the repository that contains this Python program, and you also see the Python program, kmeans.py. Beneath, we have the invocations of the actual Java methods, and here is the link to the Java implementation of K-means. Now let's jump to the Java part, which contains the computation that will be offloaded to the GPU. But before we jump to the computation: we have the setInputs method, and I wanted to make a connection back to the off-heap data types. With `new VectorFloat(...)` — an API type exposed by TornadoVM — we can allocate vector data types off-heap. And then we have createMatrixOfClusters, which performs some initialization of the objects and also allocates other data, like the clusters, which are going to be allocated off-heap as well. And now we are ready to move to the actual computation part. On the left side, you see the runWithJava implementation of this method, and on the right side the accelerated one using the TornadoVM API. As we see here, the actual computation is performed by the assignClusters method, and the corresponding one on the right side — the TornadoVM implementation — is this one. In this one, I would like to focus on two parts. First, you can see the task-graph code: TaskGraph is an object exposed by the TornadoVM API. In a sense, a task graph enables you to define what code will go to the GPU.
So: what is going to be the actual computation, and what data should be used on the GPU — the input data and the output data. In a sense, the task graph enables programmers to define what is going to go to the GPU for execution. And second, once we have done this, as you can see here, we can also define the data-transfer mode: how often we want the input or output data to be copied back and forth from the GPU. Once we have defined that, we can move to the second part, which is the execution plan. The execution plan is another object that enables programmers to define how the execution will take place: it could be, for example, with the profiler enabled or disabled, or with a custom grid size defined by the programmer. And once we have defined how the execution will be performed, we are able to execute the actual task graph. With the execution plan's execute call, it is this part that triggers the actual execution of the code, along with the JIT compilation. The second time we invoke the execute method of the execution plan, the code will not be JIT-compiled again, because it already has been: the OpenCL or CUDA code will be retrieved from TornadoVM's code cache. So now we can move to the actual example run. I have recorded a video showing the execution of K-means and matrix multiplication, because my MacBook doesn't have an NVIDIA GPU. So we fork the actual repository with the examples, go inside, and check out the FOSDEM branch. And this is the Python code that we saw earlier: first, we load the class, and then we are able to invoke the Java code from Python.
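The cluster-assignment step that gets offloaded can be sketched in plain sequential Java. In the TornadoVM version, the outer loop would carry an @Parallel annotation and the method would be registered in a TaskGraph; this standalone sketch (my own simplification, not the repository's code) shows only the computation:

```java
public class KMeansStep {
    // For each point, find the index of the nearest centroid by squared
    // Euclidean distance. In TornadoVM the outer loop would be annotated
    // with @Parallel, mapping roughly one point per GPU thread.
    static int[] assignClusters(float[][] points, float[][] centroids) {
        int[] assignment = new int[points.length];
        for (int p = 0; p < points.length; p++) {
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                float dist = 0f;
                for (int d = 0; d < points[p].length; d++) {
                    float diff = points[p][d] - centroids[c][d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            assignment[p] = best;
        }
        return assignment;
    }

    public static void main(String[] args) {
        float[][] points = {{0f, 0f}, {9f, 9f}, {1f, 1f}};
        float[][] centroids = {{0f, 0f}, {10f, 10f}};
        int[] a = assignClusters(points, centroids);
        System.out.println(a[0] + " " + a[1] + " " + a[2]); // 0 1 0
    }
}
```

Because each point's assignment is independent of the others, the outer loop has no data dependencies — which is exactly why the compiler can turn it into one GPU work-item per point.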
And here we will run, first, the Java implementation, and then the GPU-accelerated implementation. We can also pull the Docker image that we have created. In the repository, we have a launcher script. At first, we query the Tornado devices to see how many NVIDIA GPUs exist in the system — and here is the 2000-series GPU in my machine at home. Once we have done this, we run the Python program with Truffle: tornado with the Truffle flag for Python will run the actual Python program. We see that at first it prints Hello World from Python; then we run the Java implementation, which is sequential — runWithJava — and then the runWithGPU method. As we see here, the first one takes one second and the second one 140 milliseconds. Then we try the same example with the thread-info option, which enables printing of the actual threads used on the GPU. As we see here, the number of data points that we passed with setInputs has become the global thread size that is deployed on the GPU. Now we move to the second example, which is matrix multiplication with TornadoVM. In this example we run the matrix multiplication five times, and we see the execution times on the GPU: the first time it was half a second, and then it dropped to three milliseconds. This is because the first execution also involves the JIT compilation, which is expensive; from the second and third run onwards, the execution time has saturated, because it is just the actual launching of the code. Okay, I have shown you an example of Python with GraalPy, but this is not the only one: we also have the Docker images for the other programming languages — JavaScript, Ruby — and you can find more details in those links, where we have a blog post that also explains polyglot programming with TornadoVM.
So rather than trying the other examples, I will now jump to the summary of my talk. As key takeaways, I would like to emphasize that GraalVM and Truffle enable Java interoperability with other programming languages that run on top of the JVM. TornadoVM offloads Java methods to GPUs, FPGAs, and multicore CPUs, so you can create parallel implementations. TornadoVM offers a Java API, so programmers don't need to know GPU programming: it is a Java way to express parallelism. And we also have new off-heap data types. So finally, yes, it is possible to create high-performing implementations of code for data science libraries in Java and reuse them from other programming languages. This is a slide that summarizes everyone who has contributed as research staff or students at the University of Manchester, and these images are from our campus. And, surprisingly, it was not raining when they were taken. So I would like to invite you to join our community: follow us on GitHub, join us in the TornadoVM Slack space if you have questions or want to interact with the team for discussions, and also try our examples on GitHub. And in my last slide, I would like to acknowledge all the research funds that have supported the work on TornadoVM, like ELEGANT, ENCRYPT, TANGO, AERO and INCODE. So with that I conclude my talk, and I think we have time for one or two questions. Okay, I've got the mic here, but first: I lived in Manchester for five years, and it doesn't always rain. Just mostly. Just mostly. Thanks for a great talk. One of the first pictures you showed had TornadoVM in parallel to the Graal JIT, using the JVMCI. So do you interact directly with JVMCI for generating code? Correct, yes. The JVMCI enables other JIT compilers to be hooked into the JVM, and that's how we run, because we extend it. So do you work with the standard JVMCI in upstream OpenJDK, or do you need the labs JDK with the latest JVMCI changes?
Because the Graal JIT compiler, as far as I know, requires the labs JDK with the latest changes. We work with the standard JVMCI, yes. Thank you. Thank you. So when you write the kernel code in Java, is it usually high-level code that you write, or do you try to write optimized code in Java? Usually when you write, let's say, CUDA code, you try to write something very specialized, using warp intrinsics and that kind of stuff. Is that something that is in scope for TornadoVM, or not so much? That's a great question. To answer it: we do both. We have two APIs. One is created for Java programmers. Say we have a computation that has for loops; this is something that you can parallelize if you don't have data dependencies. So we expose an annotation in this case, similar to OpenMP: you can add the parallel annotation on the for loop to give a hint to the compiler that this can run in parallel, and it will create parallel implementations in OpenCL or CUDA. And the second part is that if you are familiar with OpenCL and CUDA and you want access to low-level intrinsics, for example to use barriers or to allocate local memory, then we have a second API, which is called the kernel API. With that you can pretty much access every intrinsic that exists in OpenCL and CUDA programming from Java. So personally, I have used the second API to port existing OpenCL kernels to Java with TornadoVM.
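The two styles from the answer above can be sketched like this. The names follow TornadoVM's documented annotation and kernel APIs, but this is an unverified illustration rather than code shown in the talk, and it needs the TornadoVM runtime to compile and run.

```java
import uk.ac.manchester.tornado.api.KernelContext;
import uk.ac.manchester.tornado.api.annotations.Parallel;

public class Kernels {

    // Style 1: the loop-annotation API. @Parallel hints that the loop has
    // no cross-iteration dependency, so TornadoVM may generate a parallel
    // OpenCL/CUDA implementation, much like an OpenMP pragma.
    public static void vectorAdd(float[] a, float[] b, float[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Style 2: the kernel API, for programmers used to OpenCL/CUDA who
    // want explicit thread indices, barriers and local memory.
    public static void vectorAddKernel(KernelContext context,
                                       float[] a, float[] b, float[] c) {
        int i = context.globalIdx;   // explicit global thread id
        c[i] = a[i] + b[i];
    }
}
```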
How to bring up GCC for your new chip
Okay, ladies and gents, if you'd like to get yourselves settled down: because of the way we're running this room back to back, my talk has already started, and I haven't got much time to cover this, but we'll do what we can. So, yes, that's everything that makes up the GNU toolchain. I'm going to go through some of these slides very fast because it's reference material, so you can go back and look at the video afterwards if you want to check something. This is only going to look at GCC, so I'm not going to worry about the assembler or any of the other stuff; I'm just going to look at the compiler and how you add support for a new chip. So: how you get the back end up and running, where you can get more information, and what the key things you need to do are. What I hope is that at the end you won't be able to write a new compiler, but you'll know where to get started. First of all, sources of information. There's loads of theory behind compilers, and there's an excellent beginner's textbook there; you can still buy it second hand, and I believe someone bought one for a penny on Amazon. I haven't used the second one myself, but it was strongly recommended to me by someone else. And this is the Bible: if you've got a lot of money you can buy the one on the left; if you haven't got so much, the one on the right is still readily available. But this is what we're going to worry about today: the GCC internals manual. Everything you need to know is there. Some of it's out of date, but it's generally a pretty good document, and it's online so you can just go and get it. So, we've got a new chip. Our new chip is an entirely fictional architecture, taken from my textbook that I showed earlier. It's a simple byte-stream architecture used just as a target you can compile to, for demonstrating how to write a compiler.
So, we've got arithmetic, we've got logic, we've got shifts, we've got the ability to store and load, and we've got some branching and a branch-and-link so we can do subroutine calls. And there are all the details of it, but we'll come back to it. Getting started: first of all you need GCC, so you can clone it, and there's a mirror on GitHub as well. You've seen this from Dave; here's the structure, and the bit we're going to be concerned about is within the gcc directory, primarily config, because that's where you put the configuration for the new back-end architecture. There's one for RISC-V, there are dozens of them in there, and we're going to add one for VAM. If you were to look in the RISC-V one you'd find these four key files (there are loads more in the RISC-V directory). You have a .h file, which is where you define a lot of parameters that say what my back end looks like: you know, how big's a char, how big's an int, and so forth. You have a .cc file, which is where you put C code, and it's really helper code to get you off the ground; you need hardly anything in the .cc to get started. The big one, where we'll spend quite a lot of time, is the machine description. It's the thing that describes what your architecture looks like, and GCC will then pick that up and use it to be able to compile to your target. Okay, and it's written nominally in a dialect of Lisp called Scheme. And lastly there's a file called .opt. You don't actually even have to have a .opt, but it's where you put target-specific options. For our architecture, we're going to give it an option that says you can have soft multiplication, where you do multiplication in software, or hard multiplication, where you actually generate multiplication instructions. So first of all, we need to see how we configure GCC for my new target. Well, first of all, we actually need to go into the whole autoconf system and add it in there.
So at the top level in the repository you'll find a file called config.sub. Now, that is actually pulled in from a separate project, so if you're doing this properly you would go to the project listed there and make your change there. But I'm just going to hack it today: if you look in there you'll see case $cpu, where all the CPUs are listed, and I'm just going to add a line for VAM, our architecture. So now the autoconf machinery will understand about VAM. And then inside the GCC proper sub-directory there's config.gcc, and that's where you put all the GCC-specific configuration. Now, the full name of our compiler will probably be vam-unknown-elf-gcc, because we'll put the full triple in front: so vam, then whatever you like, then elf will match that. So if you go and say, I want to configure for that target, what do I define? There's a whole load of variables you can set to say what goes into your target. The thing is, you don't really need to put much, because by convention it knows: if my target is VAM, you must have a vam.cc, a vam.h and a vam.md, and maybe a vam.opt. I'm going to say I actually want one other file, because this is bare metal: I'm going to take the standard elfos.h file, the bare-metal operating-system file, and add it to the target-machine list of files that make up the architecture. So that is all I need to do for GCC to know about it. And now I can say go and configure GCC. You'll see it's a bit like Dave did, but this time my target is going to be vam-unknown-elf, and it will configure for that. I'm going to set the prefix so that when we've finished, it'll get installed in /opt/vam. We'll do it without headers, just to keep it simple.
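The edits just described might look like this. This is a sketch: the vam entries are of course hypothetical, and the exact tm_file contents for a real port depend on the target.

```
# config.sub (maintained in the separate GNU config project): add the CPU
# to the big `case $cpu in` list so the triplet is accepted:
    | vam \

# gcc/config.gcc: a stanza for the new bare-metal target. vam.h, vam.cc
# and vam.md are picked up by convention; we prepend elfos.h by hand:
vam-*-elf*)
        tm_file="elfos.h ${tm_file}"
        ;;

# Then configure for the new target (more options follow in the talk):
../gcc/configure --target=vam-unknown-elf --prefix=/opt/vam --without-headers
```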
We'll just do the C language and, as Dave said earlier, disable bootstrap, so just stage one, which gives a plain C compiler; there are loads more options, and we'll come back to that later. And then I can just say make all-gcc, and lots and lots happens, and then it complains and says: ah, but I can't find vam.md, the machine description. Because I didn't actually create a machine description; I just told it here's my machine. So we're going to have to do something about that. So let's start adding those files in, starting with the header file. Let's create our configuration directory: we come out of our build directory, go to the source directory, create a sub-directory within gcc/config for VAM, our architecture, and I'm just going to create empty files: vam.cc, vam.h, vam.md and vam.opt. Come back into our build directory, make all-gcc again, lots more happens, and then I get an error message. It says: ah, somewhere deep inside the GCC world I haven't found a definition of FIRST_PSEUDO_REGISTER, and maybe you meant FIRST_VIRTUAL_REGISTER. And that's actually one of the macros that I have to define in vam.h. So in vam.h there's a whole load of macros I've got to define. Here's an example of some things in vam.h. You define TARGET_CPU_CPP_BUILTINS, which is the built-ins I want to appear. You know that when you compile for a particular architecture in GCC there are some predefined macros, including one that tells you what your architecture is. So we want VAM, in capitals and in lower case, actually defined, so if you're writing code you can put an ifdef on the VAM macro and put your VAM-specific code there. And there are a couple of asserts there: I assert the CPU is VAM and the machine is VAM. Okay, so what goes in the header file? There's a whole section on this in the internals manual.
You'll be here till 2057 if you try to put all of those in. The easy approach, which is what we all do, is to copy an existing architecture and hack it around. OpenRISC is a really good one: it's quite small, and Stafford Horne knows what he's doing, so it's a good starting point, and it's what I used. Okay, and the associated implementation code is in vam.cc, and it's things like data storage, data types, the register model, the ABI implementation, and all the constants that define those. So here we are, here's my storage layout: the number of bits that go in everything, what boundaries I'm aligning on, the sizes of all my data types, and what the ABI looks like, with a comment to say what each does. And then I define the first pseudo register. I've got a total of 33 real registers, and anything beyond that would be a pseudo register (I'm not going to go into pseudo registers): my 32 real general registers and my status register. I don't have the program counter as a register, because it's not actually exposed in my architecture, so nowhere am I treating it as a real register; it's just something behind the scenes. And I've got names for all my registers, and some of those have fixed purposes: r0 is always tied to 0, r1 is the stack pointer. So I've got an array telling me which of those have predefined uses, and the last one, the status register, has a predefined use. And then: what are good registers to allocate? When GCC needs to use a register, what's a good one to choose, so that I don't end up choosing one where I then have to worry about saving and restoring everything? So I can give a priority order for the order in which I want registers allocated.
Then we talk about register classes. Now, this is very simple because we haven't got many registers. Normally you would separate your integer registers from your floating-point registers, and then you can tell GCC to do different things depending on whether you're doing floating point or integer. In our case it's an integer-only machine anyway, so we've just got GENERAL_REGS, and we've got one class for the status register, which is FLAG_REGS. You always have a NO_REGS class, which is no registers, and an ALL_REGS class, which is all registers, and you define the last thing in that enum, LIM_REG_CLASSES, because it tells you the size of the enum. From that we can define a macro called N_REG_CLASSES, and we can define the names of the classes, which are just text strings. And lastly we say, for each of those classes, here are 33 bits telling you which registers are in it. So for NO_REGS none of them are set; for the general registers all the bits are set except the 33rd bit, with the bottom 32 bits on the left and the top bits on the right; the status register class just has that one remaining bit set; and ALL_REGS has all the bits set. And you've got a macro that tells you, given a register number, which register class it is in. There's loads more in there, and you can read through it and see what happens. So we say make all-gcc, and even more happens, and then it complains that it can't see SP_REGNUM. Now you think: ah, didn't I define a stack pointer? I did, but something else is going on, because this is not SP_REGNUM as known by a header; this is SP_REGNUM from the machine description. So some of these things are actually not defined in the header; they're defined in the machine description.
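The class-contents machinery just described boils down to one bitmask per class, one bit per hard register. Here is a small Python model of that layout, purely illustrative, since the real definitions are C macros in vam.h:

```python
# Model of the VAM register layout from the talk: 32 general registers
# (r0..r31) plus one status register, 33 hard registers in total.
FIRST_PSEUDO_REGISTER = 33

# One bitmask per register class, bit 0 = r0.
NO_REGS      = 0
GENERAL_REGS = (1 << 32) - 1          # bits 0..31: r0..r31
FLAG_REGS    = 1 << 32                # bit 32: the status register
ALL_REGS     = GENERAL_REGS | FLAG_REGS

def regno_reg_class(regno):
    """Which class a hard register belongs to (a REGNO_REG_CLASS analogue)."""
    return "FLAG_REGS" if regno == 32 else "GENERAL_REGS"

print(bin(GENERAL_REGS).count("1"))   # 32 general registers
print(bin(ALL_REGS).count("1"))       # all 33 hard registers
```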
So if we look at how code generation works in GCC: it's generic; it's a pattern-matching compiler. It looks for patterns and replaces them by new patterns. That's how it does code generation, and it's actually also how it does optimization. What we have to do is give it all these pattern templates from which code can be generated, and that is what the machine description is. And when we come to optimization, replacing patterns by better patterns is what you do. So we heard from Dave about the different representations: you've got GENERIC, then GIMPLE, then RTL, and we're really worrying about how you get down to the RTL level. A side note here: GCC has its own names for type sizes, everything from quarter-integer, at eight bits, up to double-integer and tetra-integer, and double-float and so on. They're known as QI or HI and so forth, and you can have unsigned variants of those. They come up all the way through, so when you see them, they're just sizes of things. So how do you get GIMPLE down to RTL, which you can then generate code from? There's a set of standard named patterns, and all you're going to do in the machine description is tell it what to do given, say, addqi3: that's add quarter-integer with three operands, two source operands and a destination. They're mostly three-address code like that: add two quarter-integers and so forth.
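A named pattern like addsi3 looks like this in a machine description. This is an illustrative sketch rather than a real back end's code; the first form is the usual three-address RISC shape, and the second shows the alternative for a two-address machine such as the talk's VAM:

```lisp
;; Three-address form: operand 0 is the destination ("=" means written),
;; operands 1 and 2 are source registers.
(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "r")
                 (match_operand:SI 2 "register_operand" "r")))]
  ""
  "add\t%0, %1, %2")

;; Two-address alternative: the "0" constraint forces operand 1 to be
;; the same register as operand 0, so "add %2, %0" adds in place.
(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_operand:SI 1 "register_operand" "0")
                 (match_operand:SI 2 "register_operand" "r")))]
  ""
  "add\t%2, %0")
```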
There's a whole set of these to define. You define all those, and then GCC has all the patterns and it will generate code for your machine. Quite a lot of them have to be defined, but some don't need to be: you don't need atomic patterns if you haven't got atomic ops, or vector patterns if you're not a vector machine. When we build the compiler, all that Scheme description of the patterns is parsed and turned into C, which is then compiled and put into your GCC compiler. There's a whole huge chapter on this in the internals manual, on machine descriptions, but we will do the same thing: we will copy an existing machine description and hack it, so we'll take OR1K again. Okay, so let's have a look at the machine description. I'm taking these examples from riscv.md, just because I want to show a lot of ideas quickly, and they're richer in the RISC-V one than in my simple one. At the heart of it is define_insn, which gives the semantics of a pattern this architecture supports. The name can be anything, but obviously we're interested in the predefined ones, and addsi3 is one of them; that's how GCC can generate RTL using that name. So the first thing you see is match_operand, which tells you how to match the first operand, and then the second operand. You see there's match_operand, the mode, and the single integer number of that operand, so we've got 0, 1 and 2, and then a bit about what it is. So register_operand says: I can be any register. It's an allow-or-deny gating function, a predicate, and you can write your own predicates as well, but there's a whole load of standard ones. And then we have constraints on the operand. Now, the constraint here is "=r,r", and that's saying: I'm giving you two scenarios, and they both happen to be r in this case, but we'll explain why that is. And the equals means I'm writing to
it, so I'm either writing to a register or I'm writing to a register. Now, the reason that matters is that these pairs go together. So operand 1's constraint is "r,r", a register in both scenarios, and operand 2's is "r,I", a register or an immediate. And you have to read those as though they were in columns: in one scenario, the first operand is a writable register and the other two operands are registers; in the second scenario, the destination is a register and the first source is a register, but the second operand is an immediate. If you think of them in columns, that's how to think of them. The next line, which is just empty here, is for a global predicate on the pattern. That could be where you put one of your flags: you might have defined a condition like, is this soft multiplication, in which case I can't generate a multiply. Just empty means true: always use this pattern. And then there's the code-generation template, which is just a C fragment. In this case it says: if it's a 64-bit architecture, generate the string addw and so on; if it's a 32-bit architecture, it's just the generic add instruction. And the percent elements there, %0, %1 and %2, refer to operand 0, operand 1 and operand 2. And at the end you can add some attributes. We're not going to worry about attributes in VAM; attributes are useful because they're a way of tagging the insns, and sometimes code-generation options and optimizations can take advantage of them. Okay, so let's look at what we did for VAM. First of all, you define some constants; that's where SP_REGNUM and the numbers of the other key registers are defined. And then we've got a very simple insn: it's called nop, and it doesn't really have anything to match, it's just constant zero, and the text string it generates for code generation is just nop. Here's a more complicated one, addsi3, and you've seen
that before. We've only got one sort of add. The first operand is the destination register and the second operand is a register, and because VAM is a two-address machine (add a, b means add a to b and put the result in b), we have to say something about the destination: you see I've constrained the source to be "0", which means it's got to be the same as operand 0, which is the destination. I've got the same for subsi3, plus the template to generate the code. So those are the standard names, the standard machine-description patterns, and the output statements, which is how you do the assembly-language templates; there are some useful files in there, and as I say, the OpenRISC one is a good, pretty simple example. So what about the option file, vam.opt? There's a whole spec on this, and we're going to allow hard or soft multiply and divide, whether or not you generate multiply and divide instructions, and the options have a fairly simple pattern of saying what the option is, with a bit of descriptive text. Okay, putting it all together: we do make all-gcc, and almost everything happens, and away it goes, and then it blew up: cannot stat insn-emit-10.cc. I have no idea what this means; it's deep in the bowels, it's genemit. So what do we do about this? I asked for help, and thank you to Maciej Rozycki, who came up and said there's a trick: you can tell it to emit fewer partitions, and it might be a bug. So I tried with emitting five partitions and it all worked fine. And I actually ended up with a GCC (xgcc is what the GCC within the build tree is called), and I ran it, and it ran its self-test, saying let's check if the compiler is any good. And then I got an internal compiler error, because I haven't actually finished my compiler: vam.md is missing some patterns, and it essentially blew up because it couldn't work out how to find a pattern to generate the code for one of the test cases. But I do actually have a working compiler. Well, I
have a working compiler in the sense that I've got a compiler I can run. It will crash whenever it compiles things, but that's actually quite an achievement; now I just need to debug it. But I have actually got a GCC build. Dave covered how to dump stuff, so you can dump all the different intermediate codes. But what Dave didn't cover was the wrapper option, and the wrapper option is your friend: it lets you go inside. You can do the same sort of thing as gdb with args: I just copied the command from the internal compiler error I got, ran it under the wrapper, and now I can reproduce my internal compiler error under gdb, and I have the ability to debug it. Self-test is even better, and this is why we work as a community: make selftest-gdb will do all this magic for you. Okay, so there was a bit of smoke and mirrors in there. I created a minimal vam.cc; guess what, I copied it. There was a bug with the vam.opt.urls file, so I had to hand-create that and hack around it, and that I think is a genuine bug. I had to create vam-common.cc, and I'm not quite sure why I had to do that, but everyone seems to do it except OpenRISC, so I just took a template one and used it. I added VAM to the documentation; that's a good thing. I also compiled with maintainer mode enabled, which is used to regenerate some files; that was when I was trying to fix the .urls problem, and I'm not sure I actually needed it. But that's what I did to get there. So, what next? The reason this is rather rushed is that it's part of our three-month graduate training course. This stuff was put together by my colleague Maxim Blinov a few years ago. It's a five-day part of the course, eight hours a day with exercises, and I've compressed it into 25 minutes, but
hopefully it gives you just a little bit of a touch on how you can get started, and there are enough hooks in there to get you off the ground. And if you get stuck, ask for help; we're a friendly bunch. I have an ambition that one day I'm going to create a full public tutorial on GCC. That's probably my retirement project, but in the meantime everything I've just shown you is on GitHub. Thank you. Okay, I've got two minutes for questions. Are there any ready-made CPUs that are a bit weird that we can use and play around with for fun? So the question is: are there any ready-made ones? There are loads. There are what, 50 or 60 back ends for GCC, and some of them are really weird and some of them very normal. I would look at OpenRISC, because it's relatively recently done, it's well done, and it's quite small. Great, excellent. So the comment was about working on the POWER ISA and adding the scalable-vector functionality into the back end: please join in, ask for help; scalable vectors are the flavor of the month at the moment. You said that we have to add the architecture-specific stuff in the machine description; I was wondering if there is a minimum set of patterns to be Turing-complete, given that you said you do the assignment, the addition, and so on. That's a question for the audience, then. Our time's up, so: what is the minimum set of patterns? I don't know, but if someone could tell me; I couldn't find that. Thank you. Thank you.
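The debugging tricks from the talk reduce to a couple of commands. These are sketched from the GCC documentation; crash.c is a hypothetical failing test case, and the exact driver flags depend on your build layout.

```shell
# Re-run the in-tree driver (xgcc) with every subprocess wrapped in gdb,
# so the internal compiler error stops under the debugger
# (-B. tells the driver to find its components in the build directory):
./xgcc -B. -S crash.c -wrapper gdb,--args

# Run the GCC self-tests directly under gdb (a standard Makefile target):
make selftest-gdb
```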
What can Compiler-Explorer do for GCC
In parallel, let's get started here. Up next we have Marc Poulhiès, if I pronounced that right. Yeah, that's mostly correct. Even for French people it's complicated. He also worked on the GCC Rust front end for a bit, and somewhere along the way got involved in Compiler Explorer. And now he's telling us what Compiler Explorer can do for GCC developers. Yeah, thank you. So my name is Marc. I'm a compiler engineer at AdaCore, and today we'll talk about Compiler Explorer in the context of GCC. So what's Compiler Explorer? For people who may not know, it's a website where you can enter a program in some language, for example C on the left, and you can pick one or more compilers and get the corresponding assembly. That's the very basic usage. Compiler Explorer was created roughly 10 years ago by Matt Godbolt. That's why you may know the website as Godbolt: he was hosting it on his own domain, and the name stuck, so now people refer to it as Godbolt. We are now a team of eight people. We host around 2,000 compilers, support around 60 languages, we have around four million compilation jobs per week, and thanks to our sponsors and Patreon supporters we are able to pay the bill of around $2,000 every month. In the interest of time, I will only showcase a very small subset of what the website can do; you should go and check it out yourself, experiment, and see if there's something you find useful. And at the end, I will answer questions and maybe get feedback or ideas for the future. So, the basic use case. I'll try it with the live site, if it works and it's not too slow. Okay. Let's say you have a .cc file; then you can add a compiler like this. So by default it's TCC. You can see that the assembly is color-coded to match the user source code on the left. You can also execute the code; for example, here you can see that the printf output is displayed at the bottom.
You can also ask it to stop after the assembler and instead get the objdump view of the object file; you can see here that you still have the relocations in the file. Or you can ask for a fully linked program, and you can see that the relocations are gone and resolved. The last thing I wanted to show is that you can share all this by clicking on Share. You get a link, and if you send this link to someone and they open it, they get the exact same setup and layout. So it's very useful for sharing code, bugs and things like that. The next use case is when you need, for example, multiple files. That's the case in Ada, where you have to have different files for a package. For example, the foo package is in the two files named foo.adb and foo.ads, and we have a main unit called example. This unit is using the foo package, as you can see here. And you should see I'm also using an input file called input, so you can also put text files in if you need that. Then you can add a compiler as before. It's not compiling because I need Ada 2022... and you get the same features as before: you can execute, get the object files, share the session. Everything works as before. So those are the very basic use cases. We support many more features: you can build your program using CMake; we have GPU support, so you can execute code on actual GPUs and see both the target and the host view of the code; we have diff views for assembly, so you can compare the output of different compilers, or of the same compiler with different options; we support libraries and environments; there is documentation for some ISAs; and many more. So please try it yourself and experiment. Now, the first feature that can be useful for compiler development is the conformance view. For example, suppose you have a bug report, in this case from the GCC Bugzilla. It's an internal compiler error.
You can use the conformance view to find when it started regressing. So you add a conformance view, and from there you can add some compilers: GCC for x86, for example trunk. You can see this is red, so there's an error; if you hover on the right you can see the backtrace, and it's an internal compiler error. From there you can just duplicate and check with a different compiler. So GCC 13: still failing. And you can do that for all the compilers. I won't do this now because we're short of time. Okay, I will skip this one and just use... so this is local, so there's only a subset of compilers, but it's fast. And you can quickly see where the problem started: around the 13 release. The nice part is that if you want to modify the code and see whether it changes anything, the view will update itself, so you can play around and see if you have better ideas or things like that. And again you can share the session and send it to anyone. Something I use during my day job, where I need to test against different compilers or targets or languages: I create empty templates, meaning that I simply create the conformance view with the compilers I'm interested in for the given target and language, and I leave the code mostly empty. Whenever I need to test something against, say, C++ for x86 targets, I click the share link, it opens up, I copy-paste the code, and I directly have the result. I don't have to add the compilers by hand every time. So that's it for the conformance view. Very recently, Jeremy on the team added support for GIMPLE. It means that now you can use GIMPLE as any other language in Compiler Explorer. So maybe that's useful for some of you; you can just copy-paste and use any GCC starting from the 9 release. We also have support for the dumps Dave and Jeremy talked about previously. So this is C; I can add the compiler, and then you can add the GCC Tree/RTL view.
And from there, you have access to all the dumps that GCC emits, like this. If you need, you can filter between the tree, IPA and RTL passes, and you have access to all the options that you would have from the command line. And again, if you change something like the optimization level, the view should refresh itself. So believe me, it should work. That covers the most-used dumps; but if you have debug dumps from front ends (for example, I've added the one for Ada), we can support you, you simply have to ask, and maybe we can guide you or we can do it ourselves. So just ask and we'll be happy to help. Something else we offer are the nightly compilers. For GCC, we build a subset of supported targets from the GCC master branch. We also build from different repositories, for example the COBOL one or the Rust one from GitHub. We can build topic branches, if you have some that you would like to see on the public website. Or we can build more complex stuff like rustc_codegen_gcc, where you need to take rustc, build GCC, package it all, and publish it on the website. So again, ask and maybe we can help. We provide an API where you have access to the basic features, mostly compile and execute. You can use that from a shell script to do tests, or you can embed it in an application, plug-in or IDE. For example, this is a screenshot from a tool I've made for work: I can run against different compilers using filters from the command line, and I find it very useful. So maybe this could be of some help for you. And the last thing I wanted to mention is how easy it is to create a local, private instance. It's mostly git clone and make, and it will do some npm magic for you. This will bind to localhost, so that's fine; you can use it yourself. But if you want to do that for a team, multi-user, please, please take extra care, because this is basically remote execution as a service.
You are, from the web browser, asking people to enter code and click execute and do everything. So for yourself, easy; for multi-user, not so easy. And we have ideas for new features we would like to have in the context of GCC. For example, for Clang we have a nice view where you have all the optimizer passes and you can see how each pass is modifying the IR, with a nice diff view. It would be nice to have the same thing for GCC. Maybe a better diff view where you can do diffs on the RTL directly. Someone asked for more Windows compilers, so maybe you have other ideas. So this is the end. Again, that's only a small subset of the features, so go and experiment by yourself. We accept any kind of contribution: code, feature requests, anything. So thank you, and I'll be happy to answer. So, one question. There was one question: how do you manage security? I don't. We have people working on this, mostly Matt, Partouf, and Austin. They are doing very complex stuff which I don't understand, because that's really not my domain. But everything is sandboxed. The nodes where you are executing are mostly empty, so even if you exit the sandbox, there's nothing to steal. And if you crash the machine, we just reboot a new one. That's as far as I can give any details. But you can contact them directly; they'll be happy to answer that. Okay. Thank you. Thank you.
Can the mold linker be /usr/bin/ld?
So up next is Rui. I hope that's reasonably correct. Yeah, that sounds right. A linker thing, let's just say that. Yeah. Now talking about whether the mold linker can actually be used as a system linker. Yes. So thank you for coming to this talk. My name is Rui Ueyama. I'm the creator of the mold linker as well as the LLVM lld linker. So I wonder if you guys are using my linkers. Raise your hand if you are using the mold linker. And what about lld? OK, maybe almost everyone is using my linkers. So it makes me very comfortable to be here. Anyways, the mold linker is my latest attempt to create the best linker for developers. And that really matters, because in many builds the link time dominates, especially if you are doing a quick edit, compile, debug cycle: you edit a single file and build the thing. The compiler finishes pretty soon, because it compiles just a single file, but the entire executable needs to be built from scratch. So the link time matters. I've been developing the mold linker since September 2020, so it's been almost three and a half years. So it's relatively new. It's available under the MIT license now. It's been under a different license, because I was trying to commercialize it, but it turns out that didn't work out, so I decided to go with a permissive license. And the main purpose is to offer the fastest linker to developers. It's an order of magnitude faster than the GNU linker, and it's also faster than my previous one, lld, as well as the GNU gold linker. To give you a rough idea: on a decent multi-core machine, mold can write one gigabyte of output per second. So if your executable is two gigabytes, then it takes two seconds on your machine. And that's pretty fast. But modern executables are gigantic as well. For example, if you build LLVM with debug info, the output would be like one and a half gigabytes. But it can be linked in one and a half seconds.
And the mold linker supports almost all major targets, except MIPS. The reason is that the MIPS ABI has diverged too much from the other ABIs. The other ABIs have evolved since 2000, but the MIPS ABI has stagnated since the collapse of SGI, because SGI was the de facto player in that field setting the standard, and no one has since made any effort to improve the ABI. So MIPS has diverged. At this point, I'm not sure if we want to continue working on MIPS support, because it seems like no one is really making a serious effort to refresh the architecture. But anyways, it supports a lot of architectures, even including LoongArch, which is a newcomer in this field. And despite being pretty new, I think the linker is production-ready, and many people are actually using it in production. I will talk later about how I tested the linker. So, this slide explains what the mold linker is from the developer's perspective. It's written in C++, specifically with C++20 features, and with Intel TBB as a threading library. And the one thing you would notice immediately if you take a look at the source code of mold is that almost all functions and data structures are templates rather than just plain functions or structures. And the templates are specialized for each target. And I care about source code quality; ideally it should be readable source code, so I put a lot of effort into making it readable. So this is an example of how you write target-specific code in mold. It uses if constexpr in the source code. If you are not familiar with C++20, this is a new feature. And the beauty of this feature is that the condition is evaluated at compile time rather than at runtime, so this if constexpr block will be compiled to nothing if the function is not specialized for PowerPC64 ELFv1.
So as long as you guard your new code in this way, your new code cannot do anything harmful to other targets, and it cannot slow down other targets. This is another example of how we use C++20 features in mold. This is a data structure representing the ELF format of relocations. But there are many types of relocations, because we at least have big-endian, little-endian, 32- and 64-bit versions. So in combination we already have four different versions. And the beauty of C++20 is that you can use a requires clause after the template keyword to specify what kind of type parameters you want to specialize for. In this case, this data structure is specialized for little-endian and RELA-type relocations, which is very technical stuff. But we have two different versions of relocation data structures. And below the definition, we have different versions of data structures of the same name. We even have a completely different version of the data structure specifically for SPARC64, because SPARC64 has this weird field that doesn't exist in any other architecture. But we can just define this data structure only for SPARC64. And as long as you guard the code that accesses this field with if constexpr, your code will not fail to compile complaining that you are using a missing field of the data structure. So this is a very beautiful way to compile your code for a specific target. So... it's not loading. Okay, so this is the machine description of a specific target, in this case the machine description for x86-64. We have a bunch of constexpr static variables as parameters. It defines whether it's a little-endian architecture or a big-endian architecture, whether it's 32-bit or 64-bit. And basically, if you want to port the mold linker to a new target, you define this kind of data structure — you basically copy and paste it, make the modifications as needed, and it's just as simple as that.
And since these fields are compile-time constants, the compiler knows their values at compile time, so it can optimize code based on these values instead of dispatching at runtime. So this is a comparison of the number of lines that you need to port a linker to a new target. On the left-hand side, we have gold. It is not a really precise comparison, because lines of code is not a direct indicator of how easy or how hard it is to port a linker to a new target. But it gives you enough of an idea about the scale, about the amount of work you have to do. Apparently for gold, you have to write tens of thousands of lines of code for each target. But the reality is that most of the target-specific code in gold is just copy-paste. So for example, if you want to port gold to, like, SPARC or LoongArch or whatever, you would start by copying an entire file to loongarch.cc or whatever, and then make modifications. So you have a lot of copies of code, and that's not a really good way to port the thing to a new target. On the other hand, we have very little code in mold to port to a new architecture. We have some amount of code outside of these files for target-specific code, but overall the amount of code is very, very small — only a few hundred lines of code. So, testing. Testing is the most important and the most difficult part of writing a linker. Writing a simple linker is not really hard, because it's just a program that takes object files and combines them into a single executable or shared object file. But the thing is, there are so many edge cases, because there are hundreds of thousands of programs that use the linker — essentially every program uses the linker. So every corner case has some use case out there. So testing is very hard.
We have two kinds of tests for mold to ensure that I find a bug before you would notice it in a production use case. The first kind is shell-script-based tests, which are very simple. I have a slide for this. This is just a very simple test case. We actually compile code, try to link the object file with mold, and then actually execute it on the machine. And as you can see, if you have a cross compiler and QEMU, you can run this test for architectures different from the one you are running on. So for example, you can test SPARC64 on an x86 machine. But apparently that kind of test is not enough for real use cases, right? So the other test that I'm doing is to try to build all Gentoo packages with mold in a Docker container to find any bugs. And the beauty of using Gentoo is that with Gentoo, you can use the exact same command to build any package. And it can also run the unit tests that come with the package. So it's a very easy way to test whether you can build the program and whether the built program will work or not. So I did that, and it takes a few days on a 64-core machine. But it works. The thing is, it is sometimes extremely hard to debug when something goes wrong. But somehow I managed to fix all the bugs that I found this way. Well, yeah, it was a fantastic experience to fix all these bugs. But my point is that it is very important to fix all bugs before users notice them in the wild. Because if mold didn't work out of the box for your project, the next thing you would do is just switch back to the original linker, and you would never try the mold linker again, right? So why is mold so fast? Well, we use multithreading, multithreaded parallelization, from the beginning. That's essentially why mold is so fast.
But the other thing is that mold is simply faster than the other linkers even in the single-threaded case, sometimes, because we are using optimized data structures and code. Actually, the data structures are more important than the code. As Rob Pike once said, you should write code around data structures, and not the other way around. So designing the right data structures is important for making a fast program. So here is, I think, a good visualization of how good the mold linker is at using all the cores available on the machine. On the left-hand side, LLD fails to use all the cores, but mold finishes very quickly using all the cores. But the question would be: why do we want another linker even though we have LLD? My answer is: LLD is not GNU, first of all. And the other thing is that LLD does not support GCC LTO. LLD is actually tightly coupled to a specific version of LLVM. So LLD version 15, for example, can do LTO only for LLVM 15. It of course cannot handle any GCC LTO object files. So if you want to do LTO with a faster linker, then mold is the only viable option. So what about GNU gold? I think the problem with GNU gold is the lack of clear ownership. It looks like it's not really well maintained anymore. And the original creator of GNU gold, which is Google, has lost interest in maintaining it, because they have now switched to LLD. So I think the future of GNU gold is not clear. And gold is not as fast as my linker either. So can we improve GNU ld so that it gets as fast as my linker? My answer is no. I think it's almost impossible to make it faster unless you rewrite everything from scratch. And if you rewrite from scratch, that would be the same thing I did. And in my opinion, the source code of GNU ld is not very easy to read. The source code was written more than 30 years ago and has been maintained since then.
But people are still adding new features to GNU ld first and then porting them to the other linkers, because what they are actually using is the other linkers. I think that situation is silly, because people do not really use GNU ld anymore for their real-world projects. So I think that needs changing. And my question is: do we want to stay with the current GNU ld forever? My answer would be: I don't think so, since we have a good replacement. So if I can, I'm open to donating mold to the GNU project so that we can call it GNU mold, if that accelerates adoption. It's not something that I alone can decide, and it means a lot, but I'm open to that option if it makes sense. So the last missing piece for using mold as the standard linker is kernel and embedded-programming support. Userland programs are mostly fine: if you install mold as a system linker, you wouldn't notice any difference other than speed. But kernels and embedded programs need more special care about memory layout, because the hardware, for example, forces you to put some data structure or code at a very specific location in memory. And if you are programming an MMU-less computer, then you want to lay things out as the hardware memory map is. That kind of stuff is usually handled by linker scripts, as you know. But the linker script language, in my opinion, has many issues. The first is that it doesn't have any formal specification; it only has the manual, and the other linkers try to mimic the behavior of GNU ld, which of course causes compatibility issues. The other thing is that the linker script language predates the ELF file format, so not all linker script commands translate directly to ELF terminology, and that causes more confusion than necessary. And I think it is almost impossible to add linker script support without slowing down the linker. So I think we need something better.
So this is my current approach to supporting embedded programming and kernels. I added a very simple command-line option, which is called --section-order, and it specifies how to lay out the sections. I think this option alone can satisfy more than 90% of the usage, but I'm pretty sure it doesn't cover all the usage of linker scripts. So I need help from you guys, because especially in the embedded programming world, the programs are not open source; they are not available on GitHub, and they tend to be in-house programs. So I don't know what the real usage is for embedded programs. If you can tell me "I want to do this with the mold linker", then I can implement that for you. So I would appreciate it if you give me a hint. All right. This is the end of my slides. Thank you very much. So — you mentioned that it's possible to do link-time optimization, like as a plugin in GCC, but in general, how easy is it to do link-time optimization inside the linker? Like, is it possible for the linker to disassemble some instruction and try to put something else there? Okay, so the question is how easy it is to do something like link-time optimization, but not quite that. I don't know if I correctly understand your question, but it's... It's basically optimization during the linking. Yeah, of course, but the thing is... It's still LTO, but it's not done by the compiler. So, the way LTO works in the linker is this: from the user's perspective, all you have to do is add -flto to the command-line options for the compiler and the linker, and everything works automatically. But behind the scenes, the compiler emits intermediate code instead of the actual machine code into the object file, and then the linker recognizes that intermediate code.
And then it calls the compiler back end to compile everything at once into a single object file, and then the link continues as if that gigantic single object file had been passed to the linker. So in that sense, you can do anything with the intermediate files inside the compiler back end, because the linker doesn't really care what is going on behind the scenes. So, well, does that answer your question? Yeah. So, you said that you tested mold against all of the packages in Gentoo Linux. How long did that take? How long does one run take? So, how long does it take to test all Gentoo packages against the mold linker? If I remember correctly, three or four days on my 64-core machine with 256 gigabytes of memory. And yeah, it's a very long time, but it's definitely doable on a beefy single machine. One target? Only for x86-64, because in order to cross-compile everything to different architectures and run the tests, you have to do that on QEMU, which is like 100 times slower than the real performance on the computer. Yeah. Yeah, sorry. What kind of mistakes did you make in LLD that you're fixing in mold? And are there any mistakes in mold that you think are interesting? So the question is: what mistakes did I make in LLD that I fixed in mold? And did I make any other mistakes in mold? That's a good question. The first thing is that the relocation processing in LLD wasn't as good as in mold. It's complicated, it's hard to maintain, and it's slower than mold. So I fixed that. And the other thing is that LLD uses templates to support ELF64/ELF32 and big-endian/little-endian, but that's just four instances. It doesn't instantiate for each target. So you cannot use the technique that I used for SPARC64, which I showed you on the slide, for example. And did I make any mistakes in mold? Maybe not. I am pretty satisfied with the quality of mold. I'm personally enthusiastic about the readability of the code.
I tried to make the source code as readable as a book. I don't know if I achieved that goal, but the point is that, well, yeah, it's definitely readable. One last question. Are there any plans to ever support any other file format than ELF? Oh, so the question is: can you support other file formats? Do I have a plan to support anything other than ELF? Well, I did for macOS, which is a Unix-like environment but uses a different file format, which is called Mach-O. And I succeeded in creating a faster linker for macOS, which was much, much faster than the Apple linker. But the thing is, last year in September they released a new Xcode with their own new linker. So there were ongoing efforts within Apple that I wasn't aware of. And their new linker is as fast as mine. Maybe they read my source code as well, because it's available online. But wasn't it AGPLv3, then? Oh, my linker is now available under the MIT license. So it's, yeah. So maybe you only heard... well, Apple hasn't released their source code yet. So, okay, we have to stop. So thank you again.
Build Distribution for Maintaining the Famous GCC 4.7
It's great to see so many people interested in GCC from more than 10 years ago. Okay, let's get started. So we are taking a step back in time by more than 10 years, I think. Yes, almost, yeah. Okay, so Oliver Reiche is going to talk about the maintenance of GCC 4.7, and the reason for that is that GCC 4.7 has a special property which I'm sure you will talk about quickly. Exactly. So, hello everybody, my name is Oliver Reiche. I'm working for the Huawei Research Center in Munich. And I would like to talk about a build distribution for maintaining the famous GCC 4.7. I would like to start by dissecting the title a little bit. First of all, what is the famous GCC 4.7, and what is it famous for? Then I'll also talk a little bit about what we mean by the term build distribution. Then I will show some of the patches that we applied to that GCC version, and show a little bit of the bootstrap process, before I wrap up the talk. All right, so, GCC 4.7. Well, there is a movement called Bootstrappable Builds, and this movement strives to build all software from source. And of course, you have to start somewhere. So, in practice, you usually start with a minimal set of binaries that you need to start the bootstrap process. At some point you bootstrap your C compiler, and at some point you want to bootstrap your C++ compiler, and then you might ask yourself: how do I build a C++ compiler without a C++ compiler? Because most modern C++ compilers are actually written in C++. So, this is exactly where GCC 4.7 comes into play. It plays a key role for the Bootstrappable Builds movement, because it's the last GCC version that can be compiled with only a C compiler. So, if you want to enter the realm of C++, and everything beyond it in this bootstrapping process, you will need this version of GCC. And it's also about software preservation, because, yeah, it's a quite old code base.
It does not build out of the box with modern compilers. It does not build out of the box on modern systems. Modern systems and modern compilers usually default to the C11 standard, and this code base has some issues with that. Also, GCC 4.7 does not build reproducibly in all scenarios; I will come to that a little later. The next thing from the title is "build distribution". I mean, this is a very fuzzy term that we invented. So, what do we mean by that? We actually have a project called Bootstrappable Toolchain — there's a little bit of advertisement here on the right side. You can build this project using our very own open-source build system, called JustBuild. And if you use this project, you can bootstrap the latest compilers and latest build tools with it. And all you need is, of course, our build system and a reduced binary seed: we need coreutils installed, we need a POSIX-compliant shell, and some C compiler with a working C standard library. Even TinyCC will work. And what we do is: all of those toolchains are actually built from source. We didn't reinvent the wheel; we used the existing build descriptions — Make for GCC, or CMake for Clang. And our build system basically takes care of orchestrating the build and calling those foreign build systems. And, as you might have noticed, Make and CMake are not part of our initial binary seed, so we have to bootstrap those first. This is also what our build system takes care of in this project. So, what we basically do is an on-demand bootstrap of all the necessary tools during this process, to make sure that we have everything we need in the next steps to bootstrap the next toolchains. And by doing so, we basically unfold a minimal Linux distribution on the fly that is barely enough to build just the toolchains we are actually interested in.
And this minimal Linux distribution is what we're referring to as the build distribution. All right, next I would like to talk a little bit about which patches we applied to patch up GCC 4.7. Well, most of them are actually maintenance patches and backports from newer GCC versions — in the square brackets you see the GCC versions we backported those commits from. In the PDF these are clickable links that bring you directly to GitHub. Just to mention a few: the largest commit was the general musl support. And this is just an excerpt here; of course, the commit is much longer. It introduced the entire macro infrastructure that is necessary for GCC to work with musl. Another interesting commit was the actual linker support for musl. It adds this magic string here, which is the hard-coded path where GCC expects the program interpreter to be located. Much more interesting, though, is how we patched up reproducibility for GCC 4.7. Well, if you use our build system, or any other modern build system, as a build orchestrator, they usually build in isolation. So, all of the stuff that runs in the action — the make command, the make binary, everything that is needed to get the job done — is located in an isolated directory. It could be a temporary directory at a seemingly random path; it could also be located in the user's home directory. And there's a problem: for instance, those two binaries — you heard about them today already — cc1, the C compiler, and cc1plus, the C++ compiler, contain checksums. And those checksums are computed from many things, and part of that is the path of the linker that was actually used. And because we build in isolation, the linker is also located in this temporary isolated directory, and that seemingly random path finds its way into the final checksum.
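A minimal sketch of the path-stripping idea just described (function and placeholder names invented here, not the actual GCC 4.7 patch): before the linker path goes into the checksum, the isolated build directory is replaced with a constant marker, so two builds in different temporary directories produce the same digest input.

```cpp
#include <string>

// Replace the build-directory prefix of a path with a constant marker,
// so the value fed into the checksum no longer depends on where the
// isolated build ran. Names are illustrative, not from the real patch.
std::string strip_build_dir(const std::string &path, const std::string &build_dir) {
    const std::string placeholder = "<build-dir>";
    if (path.compare(0, build_dir.size(), build_dir) == 0)
        return placeholder + path.substr(build_dir.size());
    return path;  // path lies outside the build tree: keep it as-is
}
```

The same idea is applied to the object files: copy them aside, strip the debug info (which embeds the build directory), and hash the stripped copies.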
And the other problem is that the object files relevant for linking those binaries are also hashed to compute this checksum. Well, the object files contain debug information and therefore also contain, in some form, the build directory. So we needed to patch that as well in order to compute a reproducible checksum that is independent of the build directory. Which is actually fairly simple: we know the linker, we control the linker, so it's not necessary to hash its full path. We just strip the path and replace it with a constant string. And of course, we copy the objects that are relevant during the process to some temporary directory, strip them of any debug information — using strip for the target, of course — and then hash those to compute the final checksum. So, at the end, we still get a meaningful checksum that somehow represents how those binaries were built, while still being reproducible in the sense of being independent of the build directory. And all of the patches that I just showed are automatically applied during our bootstrap process. So, what does the process look like? We actually have multiple stages until we end up with the modern compilers that we actually want to build; because of time limitations I will only go into the details of the very first stage. We start off with just coreutils, a shell, and some C compiler. And the very first thing that we do is bootstrap certain parts of BusyBox, because it includes very important tools that the autotools and autoconf scripts will need later. And we restrict ourselves to those very specific parts — grep, find, sed, for instance — and of course we need patch, for patching GCC later. With those tools at hand, we can now bootstrap Make. Make can be built with Make, of course, but it also has a bootstrap path.
Luckily for us, there's a shell script, and with a little bit of magic, we end up getting the make binary, and now we have the Make build system available. Then, together with those tools and the Make build system, we can bootstrap the archiver from the binutils sources, and then we also have an archiver available for producing static libraries. Okay, now we can do the first real build. With those at hand, we can build the latest binutils the normal way it's meant to be built — configure and make — and then we can patch GCC and build GCC. If you're interested in running this on your machine, it should work on any x86 64-bit Linux system. You only have to install JustBuild, clone this project, and run this command. It should give you a working GCC 4.7 installation. Okay, so let me wrap up the talk. We tested this on many systems; it should work on any x86 64-bit Linux system. We also tried to test it on very different systems like NixOS, where actually everything is located at some custom path. We also tried very reduced images that only contain a TinyCC and a musl libc. And with our project, together with our own build system — if you have a C++ project and use our build system — you can easily import this toolchain into your project. And then you can make the toolchain a committed dependency of your project, which has several advantages. Of course, it's easier to set up for the user: they don't need to have a certain C++ compiler installed. You can just clone the project, run the build, and then the first thing that happens is that the toolchain is built. And don't worry about compile times: of course, bootstrapping the toolchain takes a while, but this only needs to be done once. The next time you build, the toolchain is a static part of your dependency chain that doesn't change, so it will come from cache. And if the toolchain is committed to your project's history, git bisects are also easier.
And we can even show, if you do it right, that you can predict the binary hashes of the binaries that your project produces. Because you have a very confined toolchain, you know exactly what the output should be — if you use musl libc, stripping, and static linking everywhere. We have a demo application showcasing that: we can predict binary hashes for this project that should hold on every x86 64-bit Linux system. All right. As a last thing, I would like to encourage everyone who's interested to just install JustBuild and try those commands yourself. It will take about 30 minutes. If it doesn't work on your machine, please let us know, because this is super valuable information for us to make this process even more stable. All right. That's all. Thank you very much. Thank you. And we will allow maybe three minutes of Q&A, because we started late. And actually, I want to start with one question from the Matrix online channel, to give them a chance to have some of their questions answered. So Ismael Luceno asks if there is any collaboration with OpenBSD, because they have been maintaining their own fork of GCC 4.7 as well, I guess because of the C++. Okay. No, there's no collaboration. So the question was: is there any collaboration with OpenBSD? Because they maintain their own fork of GCC 4.7. No, there is not. This is actually a good question — I hadn't heard about that before, so this is already valuable input for us. Okay. Got a question? Is part of the motivation for this kind of bootstrapping the "trusting trust" problem — what do you do to avoid the possibility of your compiler being subverted? I didn't recognize it. Trusting?
Trusting trust — could an attacker subvert the compiler so that, when it compiles its own source code, it reinserts a backdoor into the resulting compiler, a backdoor that is not present in the source code? Okay. It's pretty hard for me to repeat that question; let me just paraphrase. So the question was whether this is security-related. Yes, to some extent it is security-related. One idea is that if you build reproducibly, in a way that lets you say "okay, this source code compiles to this binary and will have this hash", pretty much independently of the system you're building on — of course there are some restrictions — that gives you the opportunity to say: well, we can basically prove that this binary originates from that source code and that source code alone. That is actually also one of the motivations. Yes. Okay, time is tight. One more question. Do we have the next speaker in the room? So, yeah. I was surprised that it's machine-dependent. I wonder why different architectures aren't easily done. So the question was why this is machine-dependent and why different architectures weren't done. The reason is just that we were focusing on x86 64-bit Linux, because it's the most widespread right now. And it's also quite a bit of work to patch GCC up to make that happen. So we basically just haven't had the time to look into other architectures. But we already have it on our to-do list: we want to at least support ARM 64-bit, and then let's see where we go from there. All right. I guess we have to go. Yeah, then. So at the end of this process you get a C++ compiler, but it is an older C++ compiler. Yes. I was just wondering how many stepping stones there are to get to the latest. All right. So the question was that after the bootstrap process of stage zero, we just have GCC 4.7.
This is a quite old compiler, so what other steps are necessary to reach modern compilers? This is a very good question. Yes. So modern compilers usually need C++11 support. GCC 4.7 does not have that. And so the next stage, stage one, is actually bootstrapping GCC 10.2, which is to my knowledge the first one almost completely supporting C++11. 4.8? Is that all right? Okay. So current GCC can still be bootstrapped with GCC 4.8. Oh, okay. Okay. But that wasn't clear to us. Okay, but we definitely need one more step, and we have that covered: currently GCC 10.2 is stage one. And then from there we can go on. So you don't need more than one step after 10.2? Yeah, exactly. Yeah. And I guess the advantage of picking a later GCC version is that we don't have as much patching for new back ends and configurations and stuff like that, because that's all still there. And it looks shiny and new. And it's still maintained, GCC 10.2. Yeah. Okay. I'm afraid that'll have to be the end. I'm sorry. Okay. Thank you, Arvid. Thank you, Arvid. Could you help me with this? Yeah. I'll cut it off. I'll see you there. Thanks. Thanks.
Sega Dreamcast Homebrew with GCC
Okay, cool, cool. Okay, so up next is Falco Girgis, telling us I'm sure an entertaining story about the Sega Dreamcast. How did you get this idea? I have an entertaining story about Sega Dreamcast homebrew with GCC. That's true. Not the standard thing you would do. You ready? Alright, so I'm talking today on behalf of the Sega Dreamcast community. I'm actually a developer on the independent homebrew SDK called KallistiOS. And we're talking about how, basically... Yeah, no problem. We good? Okay, yeah. So basically this entire homebrew community is powered by GCC. And I'm just showing you the kind of stuff that being part of the GCC ecosystem is allowing us to do. So first of all, what is the Sega Dreamcast? Maybe some of you don't know, because it had two years in the limelight. It was released in 1999 and it was only commercially viable until 2001. Despite that fact it had a substantial effect on the gaming industry. It left a huge legacy, and it competed directly with the PlayStation 2, a little bit less with the GameCube and Xbox, because it didn't last that long. And then a little bit about it: it had a Hitachi SH4 CPU, an architecture which is now owned by Renesas, and an Imagination PowerVR2 GPU, which was the predecessor to what eventually got used in the iPhone. So that same GPU technology actually went on to do quite a lot of fancy stuff. And then there's a little bit extra about it. But the key thing here is the Hitachi SH4 CPU. And that's what has made our destinies intertwined with GCC, because GCC is the only compiler that supports the SuperH architecture. So why the Sega Dreamcast? What's the big deal? So I think there's a lot of strong arguments for doing it. I think in an era where people are into Raspberry Pi programming and embedded systems, it offers a really good middle ground between high performance, because it's good at graphics, it's good at floating-point operations, and embedded programming.
We have a lot of established tools that are really good. As you'll see, we have really modern compiler support. We have a lot of language support. Thanks to Matt Godbolt we have SH4 in Compiler Explorer, so you can actually look at what the disassembly of your Sega Dreamcast code looks like to make sure it's optimized. And as a beginner you can treat it like just kind of a weak PC using cross-platform APIs. Or as you mature and advance you can go down to the hardware level and optimize for it. There's also a lot of cool toys and peripherals. There's light guns, Samba de Amigo maracas. There's the visual memory unit, and the visual memory unit itself, the little VMU, has its own little homebrew scene. So as I was saying, we have a pretty decent community, and because our independent SDK uses no Sega code we're actually able to release our homebrew commercially and sell it online and through retail stores and stuff like that. So this is how many we've released each year commercially, and there's just a collage of different commercial games. So as you can see you're not going to get rich on Dreamcast, but you know, if you're making a PC game within that spec range, maybe you should check it out. So this is a little bit about KallistiOS before I get really deep into some code stuff. This is a little bit about the architecture. So KallistiOS, it's like a big SDK, but it also is like an operating system. We have a kernel. We integrate with Newlib 4.4.0, which as far as I know is the latest one that's out there. That's where we do file I/O, date/time, malloc. We have a really cool virtual file system which abstracts away the CD-ROM. You can stream from your PC. You can use the new SD card readers. Networking, we even have IPv6 on this thing. We have examples. We have add-ons and ports for OpenGL, OpenAL. The toolchains, as you'll see, we have GCC 13.2.1, the latest Binutils, GDB going on it.
We're trying to take this retro game console and let you use the latest and greatest versions of the languages of your choice on it. That's kind of a little bit of what we're going to touch upon. This is a little bit about my Dreamcast. I'm not going to go into too much detail, but as you can see, it's like a car: you can totally spend all your money on it if you want and go to town on it. You don't need to do any of this, though, to develop for it. That's another big point: as long as you can burn a CD-ROM, 90% of the Dreamcasts out there can boot your homebrew game, as long as it's burned a certain way. That's part of why the homebrew scene became so big. The first thing we're going to look at is C23. We wanted C23 on the thing. What did it take to get there? It didn't take as much as we thought. One of the first things that we had to do was support atomics from C11, so that you can say atomic_int, atomic_bool and have, since we have a preemptive multi-threading scheduler, atomic variables that aren't interrupted and such. Unfortunately the SH4 is old, so there's no hardware support for atomics. But since it's single-core it's not a big deal. You just disable interrupts around it, you load or store your value, and then you enable interrupts afterwards. So this is actually offered by the compiler, the SH compiler, as the soft-imask model. What it did not offer is 64-bit and generic atomics. So we had to implement that, and there's the C code for it, kind of an ugly C macro, but you can basically see: we disable the IRQ, we load or store a value of the type, and then we enable it later. And that's the basis of our atomic model. If the scheduler can't interrupt you while you're accessing an atomic, then it's atomic. Then we validated the atomics. You'll see a bunch of the output there is from my Dreamcast.
So we have a bunch of tasks we ran through, a bunch of different atomics, an atomic buffer, and yeah, the atomics work now on the Dreamcast. It's pretty nice. Something that was much harder was adding thread-local storage support. So in C and C++ there's a thread_local keyword, and there's a lot of stuff you have to do for that. It's a delicate interplay between the compiler and the operating system. On the operating system end, don't worry if this code is a little dense, that's the whole point, this was actually a pain, and that code is just there to show you what a pain it was. With every thread, you have to allocate an extra block for thread-local storage with the .tdata and .tbss segments, and then every time you swap context, you have to swap the thread pointer to point to the new thread's chunk. So we did that, and this is some of the validation tests for it. What actually makes it hard is that your TLS storage can be aligned arbitrarily, so we had to compensate for arbitrary alignment. That was all the extra logic: more than just a malloc with a fixed size, you have to also align those segments. So yeah, now TLS works on the Dreamcast. And then that was pretty much it. We got C23: we have nullptr, auto, typeof, all the cool stuff that C23 added. __VA_OPT__ is now in C23, alignas, constexpr, compound literals, one of my new favorite things to use right there. This is just me throwing in a bunch of C23 with a breakpoint API. Oh, binary literals, pretty nice, a C23 addition. This is a little video, uh-oh, was a little video, it's not working. Okay, well, cool. Maybe afterwards you can check out my Twitter, all the videos are on there in case they don't work. So C++20 and 23 is up next. What we got for free, we actually got a whole lot for free, it's kind of cool.
Concepts, constraints, modules are not fully supported by GCC yet, but hey, everything that was supported worked fine for SuperH, we were pretty shocked. Ranges, look at that crazy range stuff that we can do with C++23 on the Dreamcast. Pretty sweet, std::format, and this thing, a static, variadic, multi-dimensional, overloaded subscript operator. You can do that on your Dreamcast now, it works. That was pretty awesome. What we had to earn: std::async did not just work for us, because our kernel had a serious bug in it, nothing had exercised that code path with the ferocity that modern C++ did, and we found a race condition there. std::random_device took a little bit of work, I'm going to get into that. std::filesystem is not quite supported. Yeah, that's a sore point for me right now, we're working on that, that's our fault. We're not propagating errno properly with Newlib, working on that. Standard time zones, well, the Dreamcast doesn't really have a time zone, so not much we can do about that, although I will say we gracefully don't support it, so it's not a big deal. std::stacktrace is one where it doesn't look like there's much we can do. Yeah, for C++23 stacktrace, I got the library compiling, but it looks like deep within the library, where it's trying to look up the binary path, reflecting over the ELF executable to unwind the stack and look up the symbols, there's just not really any way for us to tell it where to look over the network for a Dreamcast. So yeah, there's no stacktrace right now. Maybe we can hack something up for that later. std::random_device, it actually works fine, so you can do all this crazy random stuff.
This is the Newlib hook where we actually hooked in: we supplied the entropy from a bunch of uninitialized RAM, so that's where the entropy is coming from, uninitialized RAM, which goes to std::random_device, and then this is just a uniform distribution getting generated on the Sega Dreamcast, showing, you know, it looks pretty uniform. Yeah, C++ concurrency meets the Dreamcast, this is pretty exciting. Yeah, there's a bunch of interesting C++20 stuff there, so I made a huge test thing that we're running on the Dreamcast, which is generating a bunch of std::async threads, testing everything from semaphores, latches, shared locks, condition variables, barriers, everything. And at this point, I guess I can't show it because the video is not loading, but it would just be a big printf printout showing that all the tests are passing. So, yeah, as far as I know, including coroutines, everything in GCC's support up to C++23 is working fine on the Sega Dreamcast, because you definitely need that level of concurrency to work with this machine here. Alright, let's see. Yeah, I had another little video that's not, I don't know why they're not loading, but they're all on my Twitter. Alright, Objective-C. There's a little more to this, for a couple of reasons. GCC, it looks like, doesn't quite support the latest version of Objective-C, Objective-C 2.0; I guess that's because Apple didn't want to fund it anymore, I'm not sure. So we had to make do with what we had. It looks like Objective-C might be a little broken right now for cross-compilation. We had to patch a build script to get it to cross-compile for bare metal. It was failing at a compilation stage and we just basically commented it out in one of the config files and then it worked. We were able to build libobjc.
The problem with, you know, building plain Objective-C is that libobjc is a C library that lets you access all of the object-oriented features of Objective-C. It's not very pretty, it's not very idiomatic Objective-C, but that's the raw runtime. In order to do anything useful with Objective-C, or anything that you normally associate with Objective-C, you need the Foundation standard library, which is typically associated with Apple. That's where your NSString, NSObject, all that comes from. Luckily, GNUstep has an open-source implementation of that. So we tried to port that to the Sega Dreamcast to give you, you know, this very big, nice Apple API that you definitely want for your Sega Dreamcast homebrew. And that went pretty well. So now you've got data structures, you've got the autorelease pool, you've got NSString, NSLog, all that kind of stuff on the Sega Dreamcast. You know, that's just basically some Hello World stuff doing that from Objective-C, and that's the Dreamcast output. Now, what gets a little more interesting is the concurrency model for Objective-C, which is actually pretty cool. We support NSRunLoop, which has NSTimer, which lets you schedule periodic timers. They're used for things like GUI updating, and you can use them for game engine logic. And then we're firing NSNotification events asynchronously from that event loop. And the video was really just showing, like, a bunch of events firing asynchronously on a Dreamcast. I don't know why it's not working. But anyway, so you've got the Objective-C concurrency model as well. And for the record, if you need Objective-C++23 to get everything, that works too, if you want to mix both of them. Okay, so then we tried to get D on the Dreamcast. This was not done by me, this was done by someone who goes by Luna, the Luna Fox girl on Twitter. Thankfully, she helped us, because I didn't really know much of anything about D. She did a great job. What was involved with bringing D to the Dreamcast?
Well, we used the GDC front end for GCC. We cross-compiled it for sh-elf. She wrote a custom runtime to do some of the stuff that the D runtime does, which I'm a little sketchy on, but I believe it's stuff like lifetime management, allocation, deallocation, the entry point. She did not use the garbage collector, not because it won't work on the Dreamcast, because we run Lua and it's fine, but because she wanted manual lifetime management. And at this point, we did not try to do libphobos for the standard library. We actually are just binding to libc for that kind of stuff. And then that's kind of a folder view of what the project looks like. It's called DKOS, which is the D bindings for what we did. And as you can see, I was worried that a bunch of the low-level stuff we were doing in C and KallistiOS would have to change. Like, hey, can you bind to inline assembly? What are you going to do about the C macros? And actually, D is quite capable. And here's some of the crazy stuff that she either rewrote or bound to from D. So there's inline assembly. It can handle flexible array members, inline functions, macros, versioned enumerations. I started getting a little jealous there as a C and C++ programmer, actually. It's really good stuff. So yeah, D meets the Dreamcast. So here's some fairly idiomatic-looking D. There was a video there, and all it was doing was basically animating the background color with the PowerVR on the Dreamcast, the frame buffer, and printing some stuff to standard out. And it worked great. And let's see. Here was one more video, which was a bunch of animated cubes showing 3D accelerated graphics with the D language. That's on her Twitter, actually. And then finally, everyone had been asking the entire time we were doing this on Twitter, hey, what about Rust? Hey, what about Rust? And we're just like, hey, man, I don't know what to tell you. LLVM doesn't support SH4, take it up with them.
And then gccrs came along and happened. We weren't having any luck with rustc at the time. We couldn't get it cross-compiling properly for sh-elf. So we started playing with gccrs, even though it's very, very new, in its infancy. I mean, we were seeing, like, for loops being added almost in real time, you know, we'd pull it down and, oh, you can use a loop now. It was pretty cool, you know? So this is not stuff that is necessarily ready to be played with, but we don't care. It's what we do here. There's no borrow checker yet, so you'll notice everything is just unsafe, but it's still fun and it's still Rust. So this is, oh man, the video's not there. It's a rotating cube that is driven predominantly by Rust. It's unsafe, as you'll see. The main control flow is Rust. The OpenGL API is calling into C for that. And then there's a mystery third language that you're about to see, in which we implemented miscellaneous support utility functions for things that gccrs wasn't able to cope with just yet. So, all right, we're going to go into that demo here. All right, so on the left we have the Rust, which is calling into C. On the right we have the utility functions, which are Fortran. So we had C, Rust, and Fortran, all on the Dreamcast. And yeah, here was the rotating cube. So, yeah, I would say we inherited quite a good deal from the GCC ecosystem. And yeah, may your homebrew be powerful and good and fast, and yeah, that's it for us. I just want to say thank you to everyone who contributes to GCC, and to GCC in general for supporting us, for supporting the SH backend. If you're interested in looking into any of this stuff, that is a link to our wiki page, which has everything on how to set this up. You can do it from Windows, Mac, Linux. It's mostly just running a script that works in any POSIX environment that sets up the cross-compiler. And I wanted to say that we are just one community that's powered by GCC and is modern.
I'm friends with the guys who do the PSPSDK, the Sega Saturn stuff, libdragon for Nintendo 64, SGDK for Sega Genesis, and the Vita SDK. I can tell you right now we're all using GCC. So, yeah, there's a lot of people out there who owe you guys a lot. And if you like this kind of stuff and are interested in hearing more, you can follow me on X or Twitter or GitHub. And that's it. Any questions? Over there you actually have sitting one of the Fortran maintainers. Really? Oh, that's awesome. We have a couple more in the room, shortly. Oh, yeah, but... No, no, no, that's... Oh, I'm sorry. Our application at the moment is targeting Lib Rome, basically. Oh, yeah, yeah. Which is a good library, but why that over KallistiOS? Because our app has been targeting the Dreamcast for the last 12, 15 years. It's developed by Marx, I think, in North Marx. Oh, my gosh. Oh, okay. Yeah, I know. Yeah, yeah. I was wondering, is it fairly easy to install your GCC toolchain? Because trying to patch up GCC yourself is a pain. Oh, you should totally use our toolchain. Yeah, our toolchain should definitely work, and our scripts, there are so many people in the Dreamcast community that by now they're pretty battle-tested. Like, people will want it for Mac, Ubuntu, every flavor of Linux, Windows with Cygwin versus Windows with WSL. You know, ours is pretty solid at this point. You should definitely check it out, actually. I'll definitely be trying to pull it. It's pretty nice, yeah, yeah. But, oh, that's really cool, though. Very nice. Anyone else? Yeah. Which version of OpenGL are you supporting? All right, so the latest you can get on the Sega Dreamcast is 1.1, because we don't have any shaders. It's all fixed-function. But I will say, it's one of the most epic late-stage fixed-function GPUs. We have a lot of the stuff that went into shaders in hardware. Like, we have hardware-accelerated bump mapping.
We have some things called modifier volumes, which are really cool, that you can use for cheap shadows and stuff like that. So there's a lot of cool stuff you can play with, despite it being OpenGL 1.1. You guys ever heard of raylib? Yeah, we actually just got a port of raylib that sits on top of GL 1.1. So it's really cool being in the raylib community right now. And, like, someone makes a game for PC and you're like, hey, check out your game on my Dreamcast. It looks pretty good. And they're like, what's a Dreamcast? But yeah. Anyone else? Yeah. Well, as you know, I'm the SuperH kernel maintainer. I do know that. My hero, man. The SuperH backend in GCC is actually still in a questionable state. Oleg Endo was working on this. So yeah. Well, he hasn't been working on it so much recently. Yeah, yeah, yeah. There used to be two people working on it. So if I'm seeing now that there are so many people working on SuperH, it would be nice if some people came to the Debian community, or there's also a Linux-SH IRC channel. Because doing this all alone, like what I'm doing in Debian, is quite a burden. So there are some people who would like help to also improve GCC. Absolutely. So the Linux kernel almost dropped the SuperH architecture and he saved its life. So yeah, we owe this man a great debt. And yeah, I meant to reach out. Definitely. Anyone else? Oh, yeah. I wanted to ask, as I know the Dreamcast had sort of support for Windows CE. Yes. And was there any plan or something about that? Because I remember systems that ran on the Windows CE platform, and if I'm not 100% mistaken, GCC might have a Windows CE target. I'm not sure about that, but when the Dreamcast was released, there were two SDKs. You could use the one that was Windows CE, which a lot of games used, and it was very impressive, it supported a lot of the Windows kernel, and there was one that was pure Sega.
But the thing is, we try to distance ourselves from that, because those are official proprietary SDKs. They're not independently developed, so you can't really sell your homebrew with that stuff. So I don't know too much about that, to be honest with you. Yeah, sorry. Anyone else? Yeah. We have, actually, there's a giant chart on that wiki page that I linked to, going back to, like, GCC 4, where we run one of our polygon benchmarks and look at performance versus binary size versus a few other variables, and it's kind of interesting how it's varied across versions of GCC. Definitely GCC 13 is not the best or the worst, and it's not a linear trend either. But yeah, you can definitely take a look at that, and that's a very good question too. Yeah, please. If anyone wants to port anything else, we are very interested. Okay, thank you. Thank you.
The secret life of a goroutine
It's time for our first actual talk of the day, which is by a very frequent speaker whose introduction I didn't have to look up, because every time I look at one of his talks, it's like, wow, I learned something very deep about Go. So, small applause. Okay, just... Hello, everybody. Well, I'm going to talk about the secret life of a goroutine. This comes from my interest in how Go works internally, and I was investigating how goroutines work internally. So, when I started investigating it, my idea of how goroutines were created and all that stuff was something like this: a caring mother with a baby in her arms, taking care of that beautiful, full-of-joy baby. It wasn't like that, okay? I started digging into the code and I realized that it's more like this: a necromancer raising the dead. I was like, why? There's a reason for that. But before that, I'm going to talk about something more general, which is the Go scheduler. To understand how goroutines work, we need to understand how the scheduler works and how it is shaped. So, let's start with the different pieces of the Go scheduler. One of them is the P struct, which is the representation of a virtual CPU. Whenever you set GOMAXPROCS, what you are setting is the number of Ps that the scheduler has. And a processor, as I said, is a virtual representation of a CPU. It has a status that can be, for example, running, in a syscall, or gcstop. It has associated the current M; we are going to see what an M is in a moment. Then each processor has a queue of goroutines that need to be executed, and a list of free goroutines. We are going to see what free goroutines are later. And, of course, other metadata. This is a very shallow explanation of the scheduler, an oversimplification. Of course, it's more complex than that. But, well, there's a lot of other metadata inside the P struct. Let's talk about the M. The M is the representation of an operating system thread.
It's what is executing your code in the CPU. It normally has associated the current goroutine that is running on this M, on this machine, and the current processor that is associated with this M, which can actually be nil. There are some cases where the M is not associated with a processor, but in general they are associated. And other metadata. Let's talk about the scheduler itself. On top of all these Ms and Ps, there's a struct called sched. It has a list of all the idle Ms, all the Ms that are not doing any work; all the idle Ps, processors that are not doing any work; a list of globally runnable goroutines, a queue of work that is not associated with any specific processor for now; and a list of global free goroutines. Okay. And the star of our show, the goroutine. There's a struct called g. That struct represents a goroutine. And a goroutine is composed of a lot of stuff, but mainly you have: a stack, which is a two-kilobyte chunk of memory; the program counter, which is similar to the program counter in a thread, pointing to the current instruction that is executing; the status of the goroutine, which can be running, waiting, runnable, there's a lot of different statuses; the current M that this goroutine is being executed on right now; and the wait reason. If the goroutine is waiting, it has to be waiting for something, there has to be a reason for waiting, and that's the wait reason. There's a lot of other metadata. But let's take a look at the whole picture. As I said, we have the scheduler at the top left with a list of free goroutines, a list of runnable goroutines, a list of idle processors, a list of idle machines. And we have running processors with running goroutines associated with machines and all that stuff.
Also, another interesting thing is that at the global level in the runtime, as global variables, we have a list of all the Ms, a list of all the Ps, and a list of all the goroutines. Those really are three global variables in the runtime. Okay, but how are goroutines created? This is where the necromancer-raising-the-dead metaphor comes into place. Because whenever you create a goroutine with the go keyword, you think you spawn a new goroutine and start running things on it. But that's not what is happening. There are two ways of creating a goroutine. One option is to create it from scratch, and the other option is to reuse an old goroutine that is no longer working. So this is what is happening: whenever a goroutine finishes, its state is changed to dead. So all those free goroutines are actually dead goroutines. So whenever you need a new goroutine, you can reuse one of them. Or, the other option, if there's no free goroutine, no dead goroutine to reuse, you create a new goroutine full of life, you kill it, and then you raise it from the dead. So that's the process. And that actually is how it works in the source code. It was shocking for me, and it's a funny way of representing this. So let's see an example of that. Imagine that I have this goroutine here that wants to create a new goroutine. What it's going to do is pick one of the free goroutines in the free list, raise it from the dead, convert it into a runnable one, put it in the queue of runnable goroutines of the processor, and call the scheduler, and the scheduler is going to eventually execute that goroutine. Another option is this goroutine here wants to spawn a new goroutine, but there's nothing in the free list of the processor.
So it's going to go to the global free list of the scheduler, pick a chunk of them, move them to the processor, then pick one of them, raise it from the dead and add it to the queue. And finally you have the option where it wants to create a new goroutine, but there's nothing in the global list either. So what it's going to do is create a new goroutine, kill it, and then raise it from the dead and put it in the queue and all that stuff. So that's how goroutines are created. Let's see what the life of a goroutine looks like. A goroutine can go through a lot of different states: it can go from runnable to running, from running to waiting, from waiting to runnable, from running to preempted, from preempted to waiting. There's a lot of stuff. Let's see all these transitions one by one. From runnable to running: that happens when, for example, a goroutine has finished its job or a goroutine starts waiting for something. So it's going to call the scheduler, and the scheduler is going to try to find another goroutine to execute. The first thing it does is try to find a goroutine in the local processor, in the runnable list of the local processor. If there's nothing there, it's going to go to the global runnable queue, take some of that work, move it into the processor, and schedule one of those goroutines to be executed. Then, if there's nothing in the global queue, it's going to go to the netpoll. The netpoll is the system that allows Go to do I/O work in an efficient way. What it does is the I/O work, and whenever that's finished, it makes the goroutine runnable again. But sometimes, when we need to find work to do, we go to the netpoll and check if something is already done and start executing that. If there's nothing in the netpoll, we are going to steal work from other processors.
And if not, we are going to help the garbage collector in the mark phase. Well, once we have found a goroutine through all that process, we are going to mark it as running, we are going to assign the machine, the operating system thread, to that goroutine, and we are going to start executing the code. Another transition is running to waiting. One of the interesting parts of this is that it exemplifies how goroutines are cooperative entities. They cooperate to give you the sensation of concurrency. So when the goroutine needs to wait for something, it is the goroutine itself that parks itself. Whenever I have to write to a channel, for example, if the channel is not buffered and I have to wait for something, what I'm going to do as a goroutine is park myself: stop myself, change my state to waiting, set the wait reason, detach myself from the operating system thread, and run the scheduler. It's the goroutine that is marking itself as waiting, the one that is calling the scheduler to schedule a new goroutine. So the scheduler is going to find another task and it's going to start running that. So what are the reasons why we can wait? If you go to the Go source code, and actually in the bottom right corner I usually put some references to the Go source code, but well, if you go to that point in the Go source code, you are going to see the wait reasons, and that's the list of all the wait reasons. No more, no less. Those are all the wait reasons. Don't pay too much attention to that, I'm going to summarize it. If you want to take a look, you can go. But the summary is: you have GC reasons, garbage collector reasons, mutex reasons, semaphore reasons, channel reasons, sleep reasons, and other reasons. That's mainly what goroutines wait for. Okay, from running to syscall and back to running or runnable again.
Well, syscalls are an interesting part. A syscall is basically calling the operating system to do something, and that can be fast or slow. For some syscalls it's kind of obvious which one it is, but for some it's not so obvious. So whenever a goroutine enters a syscall, it detaches from the processor, and the runtime detects whether the syscall is slow or fast. If it's a fast syscall, it finishes the syscall and goes back directly to running. But if the syscall is slow, the goroutine just stays in the syscall state and the processor stays detached, so the processor can select another goroutine to execute. Eventually the syscall finishes, and when it does, the goroutine is moved to runnable again and queued on a processor and all that stuff. The other thing that is interesting is the copystack state. Whenever a goroutine needs to grow its stack, because it needs more space for function parameters or for the local variables of the function execution, it goes through this process: it moves from running to copystack, reserves double the current stack size in memory, copies over all the information from one place to the other, adjusts the pointers, and then moves back from copystack to running again. From waiting to runnable: this is a very interesting case because, again, as I said, goroutines are cooperative. Normally a goroutine changes from waiting to runnable when another goroutine calls goready on it, when another goroutine tells my goroutine that it's ready to keep executing. We are going to see examples of that later. So whenever goready is called, for example if a goroutine is sending something to a channel and some other goroutine is waiting, it's going to wake up that goroutine and mark it as ready.
Then it marks it as ready, adds it to the queue of the processor, and tries to get a processor to execute it. Another way is when you reactivate a whole list of goroutines. That happens, for example, with the garbage collector: goroutines are waiting for a garbage-collector phase, for the mark phase, and when that's finished, it wakes up a list of goroutines. Another case is when it turns out there's no need to wait. Imagine that you say, hey, I'm going to wait for X, but X is already fulfilled, so you go back to runnable directly. Another one is when the scheduler is trying to find a goroutine to execute: you check the netpoll, and the netpoll sometimes has goroutines that in theory are waiting, but the data is already there or the job is already done, so it just moves them from waiting to runnable. Okay, from running to preempted, and then to waiting or runnable. Go has a preemptive runtime, and what it does is this: when a goroutine has been executing for too much time, the system monitor detects that and sends a signal to the operating-system thread that is executing the goroutine. That signal marks the goroutine as preempted, so it moves from running to preempted, and eventually the goroutine itself finds the time to move from preempted to waiting. And after the next garbage-collector scan, it moves from waiting to runnable again. So again, this is the whole life cycle: runnable, running, syscall, waiting, preempted, copystack. Now all these states should be more obvious, more clear, to everybody. There are some other, kind of similar, parallel states related to the garbage collector. This is again a bit of a simplification, but this is in general the set of states that you have in goroutines.
So let's see some examples. Imagine that you have a channel and you want to send data to that channel. The channel is not buffered, and there's nobody waiting on the other side. So I try to send the data and, because nobody's waiting, I'm going to need to wait. So the goroutine parks itself, adds itself to a list of goroutines that lives inside the struct of the channel, and waits there. So it's there, waiting, and eventually another goroutine comes to read from the channel. What that goroutine does is go there, read the data directly from the memory of the other goroutine, and then, when it has the data, call goready on that goroutine, saying this goroutine is now prepared to keep going. That ends up in this state, and eventually the scheduler selects that goroutine to be run and everything keeps going. Yeah, this is the whole picture: trying to send the data, waiting inside the channel, getting the data from the other side, and the receiving goroutine is the one responsible for waking up the goroutine that was waiting on the channel. Let's see another example. Let's talk about wait groups. For example, I can create a wait group and call Add(3), in this case. This is a very common pattern. And then I just spawn three goroutines that are going to do certain work in parallel. Then I wait at that point: maybe one goroutine is already running, maybe not, it doesn't matter. So I call Wait, and now I'm waiting. The goroutines keep going; maybe some of them are executing, maybe some of them have finished already, it doesn't matter. Some of them finish, and they're there. And the last one is going to call Done, the last Done, and it's going to see that, hey, the wait group is already at zero, so I'm going to call goready on the list of goroutines that are waiting on this wait group.
So that ends up in this situation where there's a runnable goroutine that is eventually going to be scheduled by the scheduler, and that's it. Again, the whole picture here. Okay, let's talk about how goroutines die. A goroutine normally dies when it finishes its work. Basically, whenever there's nothing else to execute, it changes its state to dead, sets most of its data to the zero value, disconnects the goroutine from the M, adds the dead goroutine to the free list of the processor, and calls the scheduler to find anything else to execute. So, yeah, that's the whole life of a goroutine. Again, if you look at this scenario where the goroutines are doing things: if I did my job correctly, you should now understand this better, and this should sound familiar too. So let me finish with a couple of things. One of them is that I want to thank Laura Pareja, who did all the illustrations for this talk. All the illustrations are Creative Commons BY, and you can see the webpage of Laura Pareja, so you can reuse them and do whatever you want with all those images. Also, I have a gift from Mattermost, the company that I work for: I have some stickers. I'm going to leave the stickers over there, like Maartje said, right there, so feel free to pick as many as you want. I also have some pins too, but those are probably going to fly. Another thing: what is missing? I haven't talked about certain things because, for the sake of simplicity, I tried to avoid getting too deep into the details. One of the things that I removed from the equation, and it has a lot to do with goroutines, is the garbage collector.
I ignored the garbage collector entirely, and it's a big chunk of how the scheduler interacts and how the goroutines move from one state to another and all that stuff. The netpoll: I mentioned the netpoll, but I haven't gone into the details. There are very good talks about the garbage collector and the netpoll out there. Also cgo: cgo has certain implications for goroutines too, but I have ignored them. The mark assist phase, which is kind of important, is a relevant part of what a goroutine does, assisting the garbage collector in the mark phase. And the system monitor, which I have mentioned but haven't talked about in detail; again, there are talks about the system monitor out there. One of the main references is the Go source code. I totally recommend you go there and explore it. There's an illustrated tale of the Go runtime scheduler, a YouTube video. There's a series of posts from Ardan Labs about the Go scheduler; it's from 2018, so it's not super up to date, but the general patterns are still there. Well, I hope that after this talk you have a better understanding of how goroutines work, how they change from one state to another and all that stuff. But, and this is more important to me, I want to encourage you to go there and explore the Go source code, because it's a great source of information. There's a lot of super cool stuff there. And, depending on the combination of your passion for learning and your taste in movies, this can be more exciting than a zombie movie. So thank you. If you want to keep in touch with me, feel free to contact me. And the other thing: if you want to have a follow-up session, asking questions or whatever, feel free to join there. If you're leaving, thank you.
You're already running my code in production: My simple journey to becoming a Go contributor.
And I would now like to introduce our next speaker to you. I would say he needs no introduction, because you're already running his code. But he might need an introduction anyway. Sorry, could I have some silence in the room, please? Thank you. You're already running his code, and he's telling a story which, after running the Go devroom for five years, I'm for some reason still curious about, because I haven't contributed to the Go project yet. And he has. I'm jealous of him. So, a round of applause for a Go contributor. Thank you. Can you hear me okay? Is the microphone in a good spot? Yep. So, quick show of hands: who here is a Go contributor, has contributed to the standard library, the compiler? I see one, two, three, four, five hands. Who here would like to be, like Maartje, who would like to be a Go contributor? There are a lot more hands. And who of you who wants to be is afraid to become a Go contributor? Who thinks it's intimidating or complicated, or you just don't know enough about goroutine scheduling or something like that? Okay. This talk is for you folks who have your hands up right now. So my goals for the talk... oh, first my agenda. I'm going to talk about goals, who I am, and then I'm going to tell my story of how I became a Go contributor and talk a little bit about how you can too. My goals today: tell my story, and ultimately encourage you to be less intimidated about becoming a Go contributor. My non-goals are to be exhaustive. I'm not going to do a deep dive into how the proposal process works or how Gerrit works or all the technical stuff. And I'm not going to show you a lot of code. There's a little bit of code, but you don't even have to be a Go developer to understand the code I'm going to show you. Who am I? I'm a Go contributor, technically. I'm a fractional gopher. Fractional CTOs are all the rage these days; I'm not that, I'm a fractional gopher. I work for different clients.
You can hire me if you want some help with your Go. I also do Go mentoring and career mentoring; hire me. I'm also the co-organizer of the Go Amsterdam meetup. And I'm a podcast host and YouTuber. I hate that word, but I put videos on YouTube, so I am one. Some of you may know me through the Cup o' Go podcast. Listeners here in the room today? All right, a couple. I hope there are a lot more after this. I have stickers, by the way; they'll be over there. If you like Brewster, our little gopher mascot for the Cup o' Go podcast, get a sticker for your laptop a little bit later. So how did I become a contributor? Well, first I needed an idea. Long ago, I wrote this public open-source library called Kivik. It's for CouchDB; it's sort of like database/sql, but for CouchDB, if you want to do document-store stuff. And I had a request from a user of my library. They were trying to send a regular expression as JSON to CouchDB, because it's a JSON store, and it was just submitting an empty object rather than meaningful data. So they said, hey, could you make your library do this thing the right way and send the regular expression as a string? I thought, that's a really great request, but I don't feel it's my library's responsibility to do that; that should go in the standard library. So I created a request, which we'll talk about. But first, here's the problem they were describing. Here's the code; I think this is the only slide in the presentation with code. Imagine you have this regular expression, `foo?`, so it would match "fo" or "foo", pretty simple. And you call json.Marshal on something that contains that. This is the output you would get: an empty object, not very useful. And this is the output the user of my library wanted, and what I thought made sense. So I created a proposal on the Go issue tracker on GitHub. Now, this is a great point to mention that there is a process, a proposal process. Some of you are probably familiar with it.
If you listen to the Go podcast I just mentioned, Cup o' Go, we talk about proposals fairly frequently: oh, this one's in the accept phase, or this one's been declined, or this one is likely accepted, and so on. That all relates to this process. Now, this is a very simple proposal, so it didn't need a design doc, which some do; generics had a design doc, actually multiple design docs in the end. But this is a very simple proposal; I mean, I just explained it to you. I don't need a design doc to explain what I just showed on the last slide. So I just created a little issue; you can see there, that's the entire issue. I showed the code that I just showed you, the current behavior, the expected behavior, and a little bit of discussion of my reasoning. That happened on May 13, 2021, if I read that correctly. And then that kicked off the proposal process, or a truncated, miniature version of it anyway. So we had some discussion. One of the first comments came from Daniel Martí, who said this would also be useful for this other thing, and tagged Joe Tsai, who was working on another issue it would be relevant to. I don't know the next person's name, I didn't look it up, but they said losing the options feels like a deal breaker. What that was referring to: there are actually two flags you can put on a regular expression in the Go library. You can say it's a POSIX regular expression, and you can say whether it's longest-match. So there are two Boolean flags you can set on a regular expression, and those are not expressed when you call the .String() method on the regular expression, so those flags would be lost. And this person said that feels like a deal breaker.
And there were some other comments too, but ultimately Russ Cox came in on June 9, about a month later, and said it looks like this is probably going to be declined, based on the fact that it would be a lossy representation of the regular expression. That was sad. Not really sad, because this isn't a feature I desperately wanted; I was just kind of excited to see a feature I proposed get through the process. And then Roger Peppe, I think his name is, came in and said, I think it would be fine if we went ahead and did this: just use the equivalent of String, it's already lossy, why don't we just go with that, and so on, and gave his reasoning. So, just a month later, we're into July 2021, Russ says: this is the current idea, we're going to have Marshal and Unmarshal do exactly the same thing that String does, blah, blah, blah, and now it looks like it's likely accept. So, cool. Happy about that. Fingers crossed, let's see if it really becomes accepted. A week later, no change in consensus, so it became accepted. Yay! So, who's going to do the work? Sadly, just having your proposal accepted in Go doesn't mean it's done; someone has to actually do the work. Now, this isn't a lot of work. In fact Russ said, even before it was accepted, I'll do the implementation and see if I come up with anything surprising. I don't know if he ever did; if he did, he never mentioned it on the issue tracker. If I ever have the chance to interview him, I'm going to ask him: did you ever do that thing? So in January, six months after it was accepted, I said I'm interested in working on this, and nobody really responded, except somebody gave my comment a heart, and I felt good about that, but that was it. And then three or four months later, Joe Tsai says, hey, are you going to do this, Russ? I could actually use it now. And crickets from Russ; he's a busy guy, no shame on him. So more waiting ensues.
So I decided I was going to go ahead and do it. I don't remember exactly when; we'll see the dates in a few moments. So I decided to go ahead and write the code. Now, this is a good time to talk about the contribution guide. This is probably the part that, at least for me, was the scariest part of contributing to Go, so I'm not going to talk about it in detail, but the TL;DR is: you have to create a Google account. You probably already have one, unless you're intentional about not having one for security or ethical reasons or whatever. If you want to contribute to Go, you have to have one, I'm sorry to say. So if you're avoiding that bandwagon for ethical reasons, maybe Go contribution isn't for you; I understand your reasons, but you have to have a Google account. Then you have to set up a Gerrit account with that Google account. What's Gerrit? Who has used Gerrit, I'm curious? Who doesn't even know what the word means? All right. Think of GitHub, except an open-source version of GitHub from 1992. That's what it looks like, but it's really powerful in ways that I can't really comprehend or explain, because I haven't used it that much. It's not bad, so don't be afraid of it. The Go project uses Gerrit for code review. Actually, I lied a little bit: they do use Gerrit, but you can also do this through GitHub. I've not done that process, but if you're really afraid of Gerrit and you can't read the documentation and follow the instructions, you can create a GitHub pull request instead. So that's an option open to you if you're really afraid of this, but don't be; it's not that bad. So, eleven months later, I finally wrote the code. I created my Google account and all that stuff, and the Gerrit account, and I wrote the code. This is my change; this is what I added to the standard library, plus some tests and a couple of other metadata things. It's like 20 lines of code if you count the comments and the blank lines. It's not a big deal.
I was really hurt, though, that Maartje didn't mention this in the Go 1.21 changes talk, because I know it just barely flew under her radar. I actually only got this slide yesterday evening; you had to find it. Yes, yes, okay. And you knew I was going to talk about it, so why mention it twice? So, really simple. I guess I lied: there are two slides of code. It calls the String method and turns the result into a byte slice; that's all it does to marshal your regular expression. And to unmarshal it, it does the same thing in reverse, with an extra error check. Super simple code. So I pushed that up. This is a screenshot of Gerrit, by the way; like I said, 1992 GitHub, that's what it looks like. And I got some code review. And then it was time for some humility. I kind of pride myself on writing tests, and writing good tests; I usually write them before my code. The first comment: make sure the tests pass. I had tested my code, but I didn't run the entire test suite, which takes 10 minutes or something on my machine, and it was failing. The reason it was failing is that I had failed to add some metadata about public API changes. It wasn't a big deal, it was easy to fix, but it made me feel a little bit silly for not running the test suite before I asked other people to spend their time reading my code. Then I had to learn the project style. This was my original commit message; I don't see anything particularly wrong with it, but it wasn't the style that they wanted. They wanted something much shorter; they didn't want a long paragraph of explanation. I say they; Ian felt that "add these functions" was enough, I didn't need a paragraph of explanation, so I followed his style guide and ended up with something shorter. The tests: he wanted some changes in the tests. I called t.Fatal, but it was in a for loop, so if one test case failed, the other cases wouldn't run; he wanted me to use t.Error instead.
Cool, makes sense. And then godoc recently, I don't know how recently, recently in my mind because I'd used it before this, added these square brackets to make hyperlinks in doc comments, and I hadn't done that, so I needed to add that. Yeah, little nitpicky things, plus I forgot to run the tests. That was kind of it. That was my thing. It got merged on March 27, so just under two years after the original issue was opened, and then it was in Go 1.21. Yay! My name's not there; it's in Git somewhere, but whatever. It still felt good. So, I think I just breezed through that. I have a lot of time here; we'll have time for questions. I have a few more slides, but this is the point of my talk, really: what does it take to become a Go contributor, and what does it not take? Non-requirements: you don't need mad hacker skills. I mean, you saw the simplicity of the code I wrote. Now, I've written much more complicated code, at least I like to think so, but not in the Go project. I've spoken to people who contribute to Go just by adding those square brackets to Go doc comments. That's cool, that helps; I mean, that's valuable, right? It's not cheating. That gives me hyperlinks when I go to the Go doc for that package; I can click on a hyperlink now. That's useful. So if that's what you want to do to contribute to Go, that's all you need to do: all you need to know is how to type square brackets. You don't need to know about zombie goroutines and whatnot. You don't need deep Go knowledge. What do you need to be a Go contributor? I think the main thing I learned from this process is that, for me, being a Go contributor required patience. A lot of that wall-clock time was me not doing anything. If I had been actively pushing the process forward, I probably could have cut it down to maybe three or four months.
But that's a long time to get 20 lines of code implemented, I think, relative to what I do at my day job, anyway, where I do that 15 times a day or something. So it takes patience. But if you're willing to put in the time, you can become a Go contributor. It takes a little humility, especially when it comes to learning a new project's style. I don't know if you've contributed to other open-source projects before; I have, and each one has its own flavor, its own style. You need to be willing to learn that and just put your ego to the side. That's not the point; the point is to do something useful according to the community's guidelines, and to learn some new things. Yeah, I think I'll breeze through this. Those of you who raised your hand that you were intimidated earlier: do any of you feel less intimidated now? One, two, three. Okay, my talk was a success; that was my goal. If you're interested in learning other ways, one of my goals is to make Go less scary for people. That's part of the Cup o' Go podcast idea, where we talk about the weekly Go news. It's part of my YouTube channel, Boldly Go, if you want to watch that. If you have questions, reach out. You can find me at boldlygo.tech; that's my Go-themed website, and you can find all my socials and contact details there. Any questions? We can do questions, right? We have enough time for questions. We have time, so yeah. I will hand you the microphone. If you're too far away, you'll have to shout and he'll have to repeat. Hi, thanks for your talk. I'm a Cup o' Go listener. Wonderful, thanks; shout-out to the podcast. My question is: are there other ways to become a Go contributor, like, you know, good first issues and stuff on GitHub? Other ways, other than introducing a proposal? Yes, definitely. You can find one of the existing bug reports or proposals. This was the first code I wrote that made it into Go.
I had participated before in the sense of filing bug reports and stuff like that, which others then fixed. And many had just been closed as invalid or something; that happens too. There's that humility part coming in again. But yes, there are a lot of open issues. There are some tagged as good first issues. You can find typo fixes; I actually have an open CL, which is the Gerrit terminology for a PR, for a documentation fix in a package in the standard library. Things like that. There are a lot of things you can do. You don't need to file either a bug report or a feature request yourself; you can find one that's already there. Hello, thank you for your talk. I've tried several times during Hacktoberfest to do some contribution, and the big part of it was finding an easy issue to begin with. Do you have some tips for that? Not really. I mean, I believe there's a tag on the GitHub issue tracker for good first issue, or needs help; I know there's a needs-help one. You could look at that. I think there's a good-first-issue tag too, but I might be confusing it with a different project. One thing that is understandable but frustrating to me about the Go project is that it's not really designed for newcomers. That's one thing I hope to help change with this talk: to at least lower the mental barrier that you might have individually to doing this. But I say it's understandable because they're trying to build a professional-quality, high-quality language and standard library, and that requires one set of skills and guardrails around the project. Being open to all new contributors is a different one and requires very different types of open-source management. So Go, I think mostly intentionally, has moved to the high-barrier-to-entry side, for reasonably good reasons. But that is frustrating for this question: how do you find something you can do to contribute? I don't really have a great answer, except to look through the issue tracker and find something. In front.
Become a FOSDEM organizer, you get fitness for free. Yeah, hello. So you had this requirement at the beginning, and this sparked the problem and the solution in the library. But what did you do in the meanwhile? Because this took three years, right? So, what did I do about this in the two years between filing the issue and the fix landing? I didn't do anything, honestly. The person using the library, I'm assuming, had their own workaround. I mean, there are workarounds for this sort of thing. Suppose this feature already exists; you're using Go 1.22, but you want a different representation of the regular expression to be produced. You have the same problem, right? So you would probably end up wrapping the regexp.Regexp type and putting your own custom marshaler on it, for example. That's probably what they were doing. I do that with time.Time or time.Duration fairly frequently, depending on the application's needs. So that's probably what I would do. Are there any differences between the main Go code and the golang.org/x modules? Yeah, that's a good question. I haven't contributed to the x stuff, so I don't have experience to go on there. I think it's pretty much the same process, though. I do think the requirements for inclusion in the x packages are lower. So if you want to add, say, something to the experimental slices package, some ridiculous thing, there's a lower barrier to entry to get it in there, because it's considered experimental. If you want it in the standard library, they have a high standard: we want to make sure that we're never going to regret doing this. In the experimental packages they're like, yeah, we don't know if it's a good idea, but let's try it. So in that sense it's easier, a lower barrier to entry. Any last questions? Okay. I think this can only mean one thing: it was an amazing talk with not too many questions left. Round of applause, everyone.
Efficient Integration Testing in Go: A Case Study on Dapr
Our next speaker is actually an ex-coworker of mine; we worked together on cert-manager, if I recall correctly. We wrote a lot of tests there, not enough tests in my opinion, but there are never enough tests in the world. And I have to be honest: when I code and I'm not being paid for it, I do not write tests. But Josh does, and that's why he's going to talk to us about how to make your testing life way, way better. Right, is that possible, Josh? Thank you very much. Cheers, Maartje. Good. So, hi, everyone. Yeah, hopefully I can change Maartje's opinion on that during this talk. So I'm Josh. I work on the project Dapr, which is an open-source project; I'm going to talk about that in a second. And the talk is about efficient integration testing in Go, a case study on Dapr. I work on Dapr, so I'm coming from a Dapr perspective, but the idea here is that the kind of learnings we made through Dapr, you can bring to your own project and make your project better, more efficient and correct, and these kinds of things. So this is the agenda. Like I say, we'll talk about testing, we'll talk about Dapr a bit, the framework that I wrote for the integration testing in Dapr, and then some learnings and some gotchas and some things you can pick up for your own project. Cool. So, testing. Why do we test software? Fundamentally, why do we test software? The first reason is to prove the correctness of software. That's the main point, right? We write software, software is complex, code is hardly readable by humans, and we make mistakes; the more software you write, the harder it gets to keep track of the state, and, yeah, we all write bugs. But that's not necessarily the only reason we write tests. If it was the only reason, we would write our tests once and then, once they start passing, we would delete the test file. So writing tests just for correctness is not the only reason. Another reason is putting guardrails in place.
Implementation code changes over time, and so the assertions you make about your code behaving in a certain way are things you want to keep into the future. So yeah, that's why we don't delete our test files after we've written them. The next thing is ensuring compatibility with external APIs. If you depend on external services, and I come from a Kubernetes world and things like this, think Kubernetes: the Kubernetes version changes, and they break stuff all the time. You want to make sure that your code still behaves in the expected way when external things change. Then verifying performance: performance testing, these kinds of things, making sure that your code is not only correct but also does things in a timely manner, or uses fewer resources than your limit, or things like this. And finally, and this is what we'll follow in this talk: if you write a testing framework which is usable by humans, efficient, and easy to read and use, then that testing framework itself can be used as your sandbox for doing experiments in your software and testing features and things like this. So a really good testing framework is really important for improving your developer experience. And the final thing is increasing developer velocity, which is largely the big thing that we care about, right? We want to write features. So, test types. If you open a textbook on testing, you'll probably see this graph somewhere. It's a very classic visualization of the different types of testing. At the bottom you have unit tests; that's testing your logic code, and it tests that a variable equals another variable. Really exciting stuff. At the very top you have things like your performance testing and so on. And in the middle section you have your end-to-end and integration testing.
The difference between these two things is semantic and depends on what project you're talking about and who you're asking and things like this. Again, I'm coming from a Dapr perspective. End-to-end tests for us are deploying to Kubernetes and running in a Kubernetes environment and invoking it there. Integration testing is running binaries locally, typically, and that's where the differentiation takes place. Integration testing ideally runs quicker than your end-to-end testing. Kubernetes is slow software, so it's a pain in the ass to write loads of tests as end-to-end tests. So yeah, the talk is about integration testing. What are integration tests? Fundamentally, this is what an integration test is, and this is true for a lot of testing as well: you're setting up your system to be in a particular state that you care about, you're then asserting a particular behavior, and then you are cleaning up that system state. That is it. That is fundamentally what you're doing. As an example, again, going back to Dapr, this might be executing one of the Dapr services, then doing a curl, in this case, to make sure that the healthz endpoint returns a 200 or something like this, and then finally killing that process at the end. That's it. That's what an integration test is. Keep talking about Dapr. That's interesting. That's not Dapr. Okay. Try that again. What is Dapr? Not that. Dapr is an open source project, all written in Go. The tagline, the marketing headline, is that it is a set of APIs and SDKs and frameworks to make a developer more productive in a cloud-native environment. What that means fundamentally is that the project will expose a bunch of APIs for you that you typically need to write some business logic that does something interesting. There's a list of APIs here — so it gives you some state management, PubSub, actors — and then you can back those APIs by whatever implementation that you want.
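That setup, assert, cleanup shape can be sketched in a few lines of Go. This is a hedged stand-in, not Dapr's actual code: an httptest server plays the role of the exec'd service, and healthServer and checkHealthy are hypothetical helper names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthServer stands in for a real service process; Dapr's framework
// would exec a binary here instead.
func healthServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return httptest.NewServer(mux)
}

// checkHealthy is the assertion step: the healthz endpoint must return 200.
func checkHealthy(baseURL string) bool {
	resp, err := http.Get(baseURL + "/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	srv := healthServer()              // setup: put the system in a known state
	defer srv.Close()                  // cleanup: tear that state down again
	fmt.Println(checkHealthy(srv.URL)) // assert: observe the behavior
}
```

The three phases stay visible even when the "process" becomes a real exec'd binary.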
It might have different concerns, so the infra team might manage your Postgres, and then to you as a developer, you're just exposed to the state store API. That's fundamentally what Dapr is. What is important for this talk is that Dapr is a complex software system. We have multiple services running, and they're all doing different things and all talking to each other. Sometimes it's mTLS, sometimes it's not. Sometimes gRPC, sometimes HTTP. We have a whole set of APIs. We have a bunch of backing services that we support, whether it be Postgres or some Google stuff, whatever it might be. The point here is that this is a very complex software system, which all software turns into over a long enough period of time. When your software system becomes this complicated spaghetti mess, it becomes a house of cards. It will happen, and anyone who's worked on a larger project will have first-hand experience: you make a small change, and that has unexpected consequences or behaviors in a completely seemingly unrelated part of the system. Your software turns into a house of cards, you don't want to make changes, and again you slow the developer velocity that we were talking about. How do we resolve this? Tests. We use integration testing. When I joined the project, there weren't any integration tests, so it was kind of a blank slate. I could start from the very beginning with how I wanted our integration tests to look. I came with this set of design decisions. First of all, I wanted Go as the sole dependency for these integration tests. I hate Makefiles. I think make is terrible, and I don't want that anywhere near having to invoke tests. The next thing: to run a test, I didn't want to need something like Python or, God forbid, have to run Docker or something like this. Plain go test should just run my tests.
We want them to be as close to what developers are doing in their day-to-day, because remember it's a community project, we have lots of contributors. Having Go as the sole dependency was really important. They need to be quick. time.Sleep is banned — we'll talk about that later. Tests need to be portable. We basically get that for free with Go, because Go is very good in that it can be compiled for different architectures and operating systems and things like this, and it's designed from a portability perspective from the start, so we get that for free. It needs to be extensible. We have lots of contributors; people need to be able to write code for the integration tests as they contribute to the project. And it needs to be readable, for similar reasons. That was the design philosophy, the design decisions I came into the project with, or into the integration tests with. Next was actually writing the framework itself. If we go back to our original diagram of fundamentally what an integration test is, the first thing we can do is turn this into Go stuff. We create what I call the process, which is the thing that is managing the setup and also the cleanup, and then we have the test case, which is doing the assertions that we want on that particular test scenario. We can then put in some kind of wrapper stuff, so this is actually executable, and there's an entry point into this kind of test case. And we're in Go, so it probably makes sense to make these interfaces. So this is what a test case is fundamentally: if it can do a Setup and it can Run, it will be executable in the integration test suite. This is what an integration test looks like in Dapr. It's a single self-contained file, we do some registration of the test, and we'll talk about that in a second, and then we do a Setup and then we do a Run.
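As a sketch, the two interfaces described above might look like the following. The names process and testcase, and the exact signatures, are illustrative; the real Dapr framework's definitions may differ.

```go
package main

import (
	"context"
	"fmt"
	"testing"
)

// process manages one dependency: Run sets it up, Cleanup tears it down.
type process interface {
	Run(t *testing.T, ctx context.Context)
	Cleanup(t *testing.T)
}

// testcase declares the processes it needs in Setup and performs its
// assertions in Run.
type testcase interface {
	Setup(t *testing.T) []process
	Run(t *testing.T, ctx context.Context)
}

// noop satisfies both interfaces; real implementations exec binaries,
// write files, and so on.
type noop struct{}

func (noop) Run(*testing.T, context.Context) {}
func (noop) Cleanup(*testing.T)              {}
func (noop) Setup(*testing.T) []process      { return []process{noop{}} }

func main() {
	var tc testcase = noop{}
	procs := tc.Setup(nil)
	fmt.Println(len(procs)) // one dependency declared by this case
}
```

Anything implementing these two small interfaces can be slotted into the suite, which is what keeps the framework extensible.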
You can see here in my Setup that I'm creating a process, which is going to do the setup and the cleanup, and then the Run bit is where I'm going to do the actual assertions. Talking about the process part, the bit that's responsible for the dependency creation and cleanup: again, similar story, it's an interface, it does a Run, and it does a Cleanup. Really simple, and that's the point — it needs to be simple. We'll talk in a second about why this is a great thing. This is what a process would look like. This is kind of a no-op example, not super important to read the whole thing. The whole idea is that it's, again, a self-contained package. We have the New, which creates the thing with a bunch of options, using the functional options style here, which isn't necessarily everyone's favorite — the options-struct style versus the functional style is a bit of a hot topic — but it made sense in this particular case. Yeah, it has a Run and then it has a Cleanup further down. I know it's very abstract, but it's obviously very important to get your interfaces correct because you're going to live with these forever. Cool. We have a framework run. The thing that I wanted to point out here is that we do a process Run here, and then you can see that we're using the Go test Cleanup function, which is amazing because it puts things on a stack. When you create your dependencies, whether these be binaries or whatever else we're using in our processes, it will clean them up in reverse order. You have that stack, which is the natural order for things to be created and then cleaned up in. Cool. We have all our test cases defined. They're running various processes — again, they might be executing binaries, writing to files, things like this. We do our assertions and then we do our cleanups. These get put into test cases and then we have some kind of suite runner that executes these tests. That's what it looks like.
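The stack behavior of the testing package's t.Cleanup can be shown with a tiny stand-in (cleanupStack is illustrative, not framework code): functions registered first run last, so a dependency created first is torn down last.

```go
package main

import "fmt"

// cleanupStack mimics (*testing.T).Cleanup: registered functions run in
// reverse (LIFO) order when the test finishes.
type cleanupStack struct {
	fns []func()
}

func (c *cleanupStack) Cleanup(fn func()) {
	c.fns = append(c.fns, fn)
}

func (c *cleanupStack) runAll() {
	for i := len(c.fns) - 1; i >= 0; i-- {
		c.fns[i]()
	}
}

func main() {
	var c cleanupStack
	var order []string
	c.Cleanup(func() { order = append(order, "postgres") }) // dependency created first
	c.Cleanup(func() { order = append(order, "daprd") })    // created second, cleaned first
	c.runAll()
	fmt.Println(order) // [daprd postgres]
}
```

In a real test you simply call t.Cleanup and get this ordering for free.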
It's a for loop over a set of tests and it executes them. Simple stuff. The next thing is: how does the integration suite runner know about these tests? What we need is a case registry, which is just a very fancy way of saying that we have a global variable that has a slice of test cases. What is important here, and I mentioned it before, is that it was a design decision that our test cases should be self-isolated in single files. I think as a developer, when you're reading test cases and you have to go backwards and forwards to various places just to follow what the test is doing, that's not good practice and it's confusing. In order to eliminate that, we went for the style of having an init function, which does the registration to that global variable, and then using blank imports to pull our init functions up into the top-level registry. The next thing is naming, which is always hard. I think there's a thing where developers generally don't respect testing code as much as they should. They care a lot about their implementation code and making it look pretty and performant and things like this, but they don't necessarily respect their testing code as much. This leads to the kind of mess that people don't want to add to because it's difficult to read. Having respect for your test code is really important. Similarly, naming is generally really important. Go has good standards on how you should name things, i.e. meaning should be derived through context. If you have an HTTP package, don't call your thing HTTPServer, call it Server. It should be hierarchical: derive meaning through context, let the package path describe your thing. Less is more. Go is not an IDE language; you don't need really long names. Just be very specific. No underscores, things like this.
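The registry pattern can be sketched in one file. In the real multi-package layout, each case lives in its own file and the init functions fire through blank imports (for example _ "example.com/tests/suite/daprd/base" — a made-up path); here everything is collapsed into one package for illustration.

```go
package main

import "fmt"

// Case is the minimal contract a registered test must satisfy.
type Case interface {
	Name() string
}

// cases is the global case registry the suite runner iterates over.
var cases []Case

// Register is called from each test file's init function.
func Register(c Case) {
	cases = append(cases, c)
}

// base is a self-contained test case; in the real layout this sits in its
// own file, pulled in via a blank import.
type base struct{}

func (base) Name() string { return "base" }

func init() {
	Register(base{}) // runs automatically when the package is imported
}

func main() {
	for _, c := range cases {
		fmt.Println(c.Name())
	}
}
```

The suite runner then only needs to range over the global slice; adding a test is adding one file with one init.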
The benefit of treating our test cases as this package hierarchy with very meaningful, purposeful names is that we can do some reflect magic that gets us a lot of benefits. So when I showed before that we're doing this suite test case registration — when we register a test, or when we pull out all the tests — you don't need to read the code, but basically what we're doing is using reflect to name the test by its package path plus the struct name. So before, our thing was called base, so it pulls out the package path of where that base test file is, plus the struct name itself. So in this particular case, this test would be named something like tests/integration/daprd/foo/base. Why is this a cool thing to do? Because it means we can start doing regex searches over our tests. You can imagine, for example, if I'm writing a feature for daprd or trying to fix a bug — maybe I'm working on the actors subsystem or something like this, or placement — I can, in another terminal, have my integration tests running and just do a regex search on all the tests in the project for related things. So yeah, being very specific about your naming means that you can search through the tests and run all the relevant ones. Again: being quick, developer focus, good UX. Yeah, that's how you do regexes in Go: a for loop, and then you filter out all the test names that don't match the regex. Here's another example: I'm working on Sentry-related things or mTLS-related things, I want to run all the Sentry tests, I can just give it a query. The next is processes. So these are the two bits down here, the dependency setup and the cleanup. We've been talking a lot about the different services in Dapr, so these are obviously using exec — we're exec-ing processes on the computer, using the exec package. What we've decided to do is follow the kind of UNIX philosophy with these processes: do one thing and do one thing really well.
So the exec process is really good at exec-ing a binary on the computer. You can then wrap that process in another, more meaningful one — again, being intentional about naming — which has a bit more context about how that binary should be run. So for example, this Sentry process has all the context: it knows what the CLI flags are and things like this, gives sane defaults, and exposes the options in a human-readable way in order to run that binary. And as I mentioned before, Dapr has lots of different services — it's a complex software system — but following this UNIX philosophy you can do this wrapping in your processes to make more meaningful, higher-level naming and interfaces for your developers. So I can talk about a Kubernetes process, and it's very easy as a developer in my test suite to say run Kubernetes, whatever that might mean; under the hood that's actually a mocked Kubernetes API server, which is actually an HTTP server, yada yada yada. So yeah, having this kind of wrapped process is an elegant way to handle that. Here's an example of another one — there's an operator service, we're doing some log line stuff in here, some daprd stuff — but these are very high-order concepts of dependencies that we're creating, and these are all wrapped going down. Process binaries: I mentioned before that we want Go as the sole dependency, and Go is a good language with a very good build caching system. What that means is that in our integration testing itself, we're building the binaries in the test. So one of the first things it's going to do is build all the binaries that are in the project — that's the code that's doing that. It's then going to write them to a deterministic, static file location, and what that means is that every time I invoke the test it's going to run that go build, but because of Go's build-cache magic it's not going to take any time at all, so I can just re-run my go test and it will be quick.
The other nice thing about this is that if I change my implementation code and just write go test in my integration tests, it's going to pull in all the changes that I've just made to the code, right, because it is building from source every time. So that's a neat thing with Go. Next, piping. Software writes things to logs, and these can typically be very noisy. If you're running lots and lots and lots of tests, this is going to take up a lot of disk space potentially, it's going to write a lot of things to the screen, and it makes it impossible to read the test output. If you've got oodles — like a gigabyte — of test logs and you're trying to find one test failure and read the logs of what happened, it becomes impossible. So write these things to in-memory buffers, and then you can do things like only write the in-memory log buffer to the screen if the test actually fails, which is the only time you actually care about what the log lines are. Then, because it's in memory and you've got a reference to it, a pointer to it, you can also do assertions on what was in the log lines and test log lines that way. Go is quite good for this — you can create pipes and things like this, all very idiomatic Go stuff that you're familiar with. Asserting eventually. All software is eventually consistent, fundamentally. Computers are not infinitely quick — the speed of light is as fast as they could possibly go, and they're not even as fast as that. Fundamentally, computers take some time to do a thing. And so we have to wait a period of time to observe some behavior after we put the system into a particular state. We just fundamentally have to do that. However, you should never use time.Sleep to do this. It's always there and it's very easy to just say time.Sleep three seconds or something like this, but you should never do it. time.Sleep is the nuclear option.
To illustrate this: if a single test sleeps for five seconds, and Dapr CI, for example, runs four times a day — not counting PRs or anything like this, just the standard four scheduled runs a day — this equates to two hours of idle CPU time a year. If we do the sums further: Dapr currently has 133 integration tests, and if just 10% of those tests sleep for five seconds, that equates to more than an entire day per year of idle CPU. Which is crazy, right? This is bad for the polar bears, bad for the environment, and it's bad for our developers too. If your tests take ages to run, no one will want to run them and no one will want to add to them. So being very intentional about the speed of your tests is very important. The way to do this is polling, basically. In Go there's the testify package, which is really, really good and I highly recommend using it, and it has this Eventually function. All of the functions in this package are super sane and I highly recommend using them. And yeah, computers are faster than you think they are. Stuff does not take as long as you think it does — HTTP calls over localhost take, like, milliseconds. So even though I've got here a polling interval of every 100 milliseconds, maybe that is even too slow itself. So yeah, computers are faster than you think they are. Be more aggressive with your assertions and your polling. Cleanup. Tests should never leak. Having data leaking from one test case to another will invalidate your assertions, just fundamentally. So it's very important that you clean up state in between test case runs. And it's also the case that if you're not cleaning up the state in your project between case runs, then you're going to reduce the resources available to each test case, and that's going to slow down your tests.
I'm thinking, you know, if you've got database tests or something like this, you're writing a bunch of stuff to disk. What if you fill up the disk? You're not running any more tests, right? So cleanup is important. To list through some of the things that could be interesting for you to use: use temporary directories from the testing package — that's really good. t.Cleanup — we just spoke about that earlier; that's doing the stack thing, so it does things in reverse order. Use port zero: your kernel is going to give you a free port if you ask for zero. Use in-memory stuff. Don't use the internet. Don't pass stop channels into functions — use context. Context is one of the best things in Go; always use context. Very quickly, operating systems. Operating systems are very weird. Use build tags where you need different files depending on the operating system. Work through the pain. Use if statements. Yeah, and then finally, being productive. Building a culture of integration tests in a distributed team is always a work in progress. No one necessarily really likes writing tests; however, if you write a really good test framework, that's going to encourage people to add to them — if they're quick and they're easy to use. A good testing framework should be usable as a development sandbox. What I mean by that is: if you're writing a new feature, your testing framework should be your first port of call for wanting to use that new feature. Tests are great because they're in code, which means they're reproducible, and I can execute them and make changes over time, and it's very clear what's going on. Just running binaries in your terminal and things like this is fine, but having it in test code makes the reproducibility better. And then, again, the higher-order your processes are, the more productive your team will be.
So your developers shouldn't be describing things like exec this binary and things like this. They should always be describing things at a higher order. Again, it decreases the amount of code that you have to write in your test case and makes them more approachable for contributors. And that's me. Thank you, everyone. APPLAUSE I saved some time for you, but I don't know if you want some questions or to leave it there. I can fit in one quick question. Otherwise, you can just grab him in the hallway. Ah, a question there. Let me run over one second. Keep holding your hand up. So, quickly, why did you make your own sort of test filtering system instead of using Go's test filtering system? And secondly, why didn't you use an event hub instead of polling? Say the first one again, sorry. Why didn't you...
Effortless Bug Hunting with Differential Fuzzing
Our next speaker is Maché, and he's going to talk to us about hunting bugs. And how do we hunt bugs? We do that by sending a bunch of random input into our programs — or, more scientifically, fuzzing. Round of applause. All right, welcome. So in the spirit of testing, let's talk about fuzzing. I'm Maché, I'm an offensive security engineer; in the past I've been a platform engineer and a software engineer; I sail, climb and play board games. So what we'll talk about: we'll talk about fuzzing, we'll talk about differential fuzzing and how it differs from fuzzing, and we'll talk about bugs in the x/net/html library and how you can actually find those bugs and fix them using fuzzing. And then at the end we'll talk about fuzzing in continuous integration pipelines. What we'll not talk about is how fuzzing works under the hood. There are excellent resources out there that talk about fuzzing engines and other stuff; I'll link to them at the end, but this talk is not about that. Why should you care? So there's the OSS-Fuzz project — who's familiar with this? Cool. This is a platform that gives open source projects compute resources to run fuzz tests continuously. There are about 1,000 projects in there, and within six or seven years it has found 10,000 vulnerabilities and 36,000 bugs. If you do the simple math, that's 10 vulnerabilities per project and 36 bugs per project. So this seems like an effort that's worth investing in. So let's assume we have a simple function: it accepts a string, mutates it, and gives you a transformed string back. It transforms letters of the alphabet to the character that is thirteen positions later, so you get n for a, o for b, p for c, and so on and so forth. In your regular testing, you'll come up with some inputs, you put those inputs into the function, and then you make assertions on whether the output is correct.
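The function being described is ROT13. A straightforward Go version with a unit-style check might look like this (rot13 is my naming, not necessarily the speaker's):

```go
package main

import (
	"fmt"
	"strings"
)

// rot13 shifts every ASCII letter thirteen positions, wrapping around the
// alphabet; other characters pass through unchanged.
func rot13(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z':
			return 'a' + (r-'a'+13)%26
		case r >= 'A' && r <= 'Z':
			return 'A' + (r-'A'+13)%26
		}
		return r
	}, s)
}

func main() {
	// Regular testing: devised inputs with known expected outputs.
	fmt.Println(rot13("abc"))   // nop
	fmt.Println(rot13("Hello")) // Uryyb
}
```

These fixed input/output pairs are exactly what later becomes the seed corpus for the fuzz test.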
You're all familiar with this, probably; you can run it using your standard Go CLI. With fuzzing, the situation changes a little bit. Instead of your devised inputs — the things you came up with — you have random input; you put it into the function and make some assertions. It looks very similar, it's supported in Go since Go 1.18, and you can also run it using the CLI. You see some boilerplate around the test, but in the middle you basically have the unit test that you had before. I intentionally left the assertion blank, because how do you assert stuff if you don't know the input, right? If you run the fuzz test, you'll see that it tries hundreds of thousands of inputs per second, in this instance, and it runs indefinitely, so you can run it as long as you want. As you've seen, it's easy to create fuzz tests if you have unit tests in place, so there is no reason not to do it, really. One thing that we haven't talked about is that it's not all magic. You still have to instruct the fuzzing engine so it's able to come up with inputs that make sense for your test. You can actually reuse the inputs you use for unit tests and add them to what's called the corpus, and that tells the fuzzing engine to come up with something that's similar but quite random as well. Add the inputs from your unit tests — that helps a lot. I've talked about those assertions; it might be pretty tricky to come up with them if you don't really know what the input is. So what you commonly see in fuzz tests is that they don't make any assertions — the engine just checks if the function crashed, which is still very useful because it tells you that there are, for instance, out-of-bounds accesses. But you should, and can, assert on invariants — things that don't change. In our instance, for example, there is a property of the ROT13 function: you can call it twice and you get the input back.
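A fuzz target exercising that property might look like the following — a sketch in the shape of Go's native fuzzing, not necessarily the speaker's exact code. The seed corpus reuses unit-test inputs, and the assertion is the round-trip invariant rather than a fixed expected output. The FuzzRot13 function would live in a _test.go file and be driven with go test -fuzz; the main here only sanity-checks the invariant on the seeds when run directly.

```go
package main

import (
	"fmt"
	"testing"
)

// rot13 shifts each ASCII letter thirteen places, wrapping around.
func rot13(s string) string {
	b := []byte(s)
	for i, c := range b {
		switch {
		case c >= 'a' && c <= 'z':
			b[i] = 'a' + (c-'a'+13)%26
		case c >= 'A' && c <= 'Z':
			b[i] = 'A' + (c-'A'+13)%26
		}
	}
	return string(b)
}

// FuzzRot13 checks the invariant that applying rot13 twice returns the
// input. In a _test.go file, run it with: go test -fuzz=FuzzRot13
func FuzzRot13(f *testing.F) {
	for _, seed := range []string{"", "abc", "Hello, World!", "1337"} {
		f.Add(seed) // seed corpus taken from the unit tests
	}
	f.Fuzz(func(t *testing.T, in string) {
		if got := rot13(rot13(in)); got != in {
			t.Errorf("rot13(rot13(%q)) = %q, want the input back", in, got)
		}
	})
}

func main() {
	for _, s := range []string{"", "abc", "Hello, World!"} {
		if rot13(rot13(s)) != s {
			panic("involution violated for " + s)
		}
	}
	fmt.Println("ok")
}
```

Note the assertion never names a specific output: it only states a property that must hold for every input, which is what makes it fuzzable.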
And this holds true for anything that has an inverse. So if you have an inverse function, you can make a simple assertion like this: you call ROT13 of ROT13, and then you expect the input back. If they don't agree, the test fails. Some examples that are commonly used are encoders and decoders, marshallers and unmarshallers. You can just, you know, decode the encoded thing and you should get the input back. There's other stuff too: if you do a SHA-256 sum, for instance, you always expect it to return 32 bytes. But there is another technique. What if you had two implementations of ROT13, right? Something that you wrote, and then something else. That's called differential fuzzing. Basically, you take a random input, you put it through two implementations, and you see if they disagree. So think for a moment about where we can get those second implementations from. The first thing is refactoring. Let's say you have your function, but it's unreadable, or maybe it's not performant enough, so you're refactoring the code for whatever reason. You can save your old implementation to the side and use it as a reference while you refactor the code. The second example is performance. You might maintain two implementations in the first place. For instance, the first implementation follows a specification very closely — it's written very close to the spec, but it might be inefficient — while the second one is heavily optimized but might not be quite as readable, you know, with some buffers or whatever. The third option, which is really interesting, is that there might be a C library that does a similar thing, and you can use cgo to call it. And that's what we'll explore further.
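A minimal differential check, assuming two home-grown implementations: a readable strings.Map version as the "reference" and a byte-level version as the "optimized" one (both names are mine). A fuzz engine would feed in random strings where this main uses fixed samples.

```go
package main

import (
	"fmt"
	"strings"
)

// rot13Reference follows the definition closely: map each letter thirteen
// positions forward.
func rot13Reference(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z':
			return 'a' + (r-'a'+13)%26
		case r >= 'A' && r <= 'Z':
			return 'A' + (r-'A'+13)%26
		}
		return r
	}, s)
}

// rot13Fast is the "optimized" variant operating on raw bytes.
func rot13Fast(s string) string {
	b := []byte(s)
	for i, c := range b {
		switch {
		case c >= 'a' && c <= 'z':
			b[i] = 'a' + (c-'a'+13)%26
		case c >= 'A' && c <= 'Z':
			b[i] = 'A' + (c-'A'+13)%26
		}
	}
	return string(b)
}

// differ reports whether the two implementations disagree on an input;
// inside a fuzz target, any disagreement fails the test.
func differ(in string) bool {
	return rot13Reference(in) != rot13Fast(in)
}

func main() {
	for _, in := range []string{"", "abc", "Hello, World!", "ROT13"} {
		fmt.Println(in, differ(in))
	}
}
```

The appeal of the technique is that no expected outputs are needed at all: the second implementation is the oracle.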
So back in January last year, I saw an interesting bug report in a Go newsletter, where there was an issue with the HTML tokenizer — the part of the golang.org/x/net library that does HTML tokenization. The thing was that it was incorrectly interpreting comments, and this led to an XSS attack. So what does an HTML tokenizer do? It takes an HTML input and gives you the HTML tokens. For this example, for instance, you have a paragraph with text inside and an anchor afterwards. You'll get a start tag of p, a text token with the text inside, an end tag of p, and then a start tag of a. This is a very well-defined process and there is an HTML specification for it. It's very detailed, it's easy to follow, and it's a state machine, which will become important later. If you look at the Go implementation, though, it's not a state machine, and it's not quite easy to follow, at least for me. So I thought, you know, if there was one report for it, there might be other bugs lurking around. So let's use that function a bit and make another one that gives you a list of tokens, because the API works in a streaming way. We'll just call the tokenizer, collect all tokens, and then return the tokens it generates. So when we start with the fuzzing, we supply some HTML input to the corpus and then call the tokenize function without making any assertions. And there are no results — it doesn't crash. Which is to be expected from a solid library, even the experimental part of it. So let's try differential fuzzing, right? We'll have our tokenize function that we wrote and some alternative implementation of it, and if they don't agree, we'll fail. And as you can imagine, because the C ecosystem is very mature, there probably is a library that does the same thing.
So in this case, I found Lexbor, which is a web browser engine that ships as a software library. It has no external dependencies and a permissive license. It sounds about perfect for what we want to achieve. So don't look at this slide too closely — it's basically implementing the tokenize function that we implemented with the net/html tokenizer, but using Lexbor. It's actually a lot more complicated than that, but it will be good enough for our tests. So we call our tokenize and the Lexbor tokenize, do some equality checks, and if they fail, we fail the test. And it found something. So there is some weird-looking, malformed HTML, and Lexbor says that, you know, it's an a tag, but the net/html library is like, oh, there's nothing in there. So let's transform this a bit and see what the browser thinks. So we have this disagreement — could it be a security issue? What if we made trust decisions based on the tokenizer? Imagine you have some user input on your website: you accept HTML input and you decide whether the stuff people input is safe to display or not. And, by the way, you really shouldn't do this, but we'll have an isSafe function that returns a boolean for whether it's safe or not, and we'll just look at the tokens we get and only allow strong tags and text tokens, nothing else. So the isSafe method thinks that the thing we got from the fuzzing is safe, because it thinks there's nothing in it. But the browser says otherwise. When you look at the documentation, though, there is a security considerations section for the HTML tokenizer, and it says, you know, care should be taken, especially with regard to untrusted inputs; if your use case requires a well-formed HTML, the parser should be used rather than the tokenizer. So let's implement this using the parser, right? I won't go into detail, but we use the parser here.
That's also in the same library. The thing is, the parser also thinks this is safe, and the reason is that it uses the tokenizer underneath, so it doesn't really differentiate between the two. So we still get the XSS. So we have two things. The first is that the documentation could be improved, because it's unclear — it steers you in the wrong direction — and second, there is a bug in the tokenizer. So I thought, right, if there was a vulnerability report in the VRP program for the comments issue, I'll do the same thing. So I submitted a VRP report. There was some back and forth. They closed my ticket. I told them to reopen it. They reopened it. And the result of that was a documentation update, which is cool. It now says that in security contexts, if trust decisions are being made, the input must be re-serialized, for instance using Render or Token.String. So what they are saying is that instead of having an isSafe function that returns a boolean, you should actually transform the input: reconstruct it in a way that basically sanitizes it — transform the string. And there are two ways to do this. One is to use the Token.String method — when you loop over the tokens, you can reconstruct the input — or Render when you use the parser. A few months pass, and there is a commit to the library. And they fix the actual bug — handle equals signs before attributes — and they quote the spec and fix the bug that was there. So now if you call the isSafe function, it returns false. That's pretty cool. But let's run the fuzzer again. And, you know, you get something that is very similar, and it acts the same way. So I thought, all right, I have this fuzzer. It's not pretty — it's in no state to reach the standard test suite — but we can use it to learn the code base and iterate over it. So fix the problem, run the fuzzer again, and so on.
So I prepared a patch, and you've seen a Gerrit screen today already. It has passed code review, but as Jonathan mentioned, you need a lot of patience. It's been stuck in ready-to-submit for, like, three months, I think. So it still hasn't reached master, but it's close, I think. But when you run the fuzzer again, there are no more findings. So the takeaway from this is that fuzzing is very effective, and differential fuzzing helps you write correct code. So let's talk about what good fuzzing candidates are. We've used it on parsers, which are pretty complex code. You can use it on encoders and decoders, you know, marshallers, and any complex code that can be unit tested, basically. But running those tests in CI is kind of problematic, at least in my experience, because the tooling is not really mature enough yet, I think. When you run the go test fuzzing invocation, it can only run a single fuzz test. So people have been doing a lot of hacks, like grepping the fuzz code trying to find the fuzz targets, sleeping — some pretty hacky bash scripts, for instance. There is also a very cool project called ClusterFuzzLite. It's actually a subset of OSS-Fuzz that you can run in your CI. But we found some problems with it. First, it has problems with extracting the failing inputs — if you have a byte array, for instance, it doesn't really translate one-to-one to what the actual input is, because you have to apply some of your own transformations over it — and it's inconvenient to run locally. So we built Go-CIFuzz. It's a kind of lightweight wrapper around go test's fuzzing, and it supports multiple fuzz targets and allows you to extract the failing inputs. So if you want to give it a try, there is a link here. And, yeah, good to go. It's basically plug-and-play.
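The single-target limitation mentioned above is why such wrappers exist. As a rough CI configuration sketch (not Go-CIFuzz itself): list every Fuzz target in the module and give each a small time budget. The flags are real go tool flags; the loop and the FUZZTIME variable are the hypothetical part.

```shell
#!/bin/sh
# Run every fuzz target in the module for a short, fixed budget.
# `go test -fuzz` accepts exactly one target per invocation, hence the loop.
set -eu

FUZZTIME="${FUZZTIME:-30s}"  # keep PR runs fast; raise this for nightly runs

for pkg in $(go list ./...); do
  # -list prints matching test/fuzz function names, one per line.
  for target in $(go test -list '^Fuzz' "$pkg" | grep '^Fuzz' || true); do
    # -run '^$' skips regular tests; -fuzz runs just this one target.
    go test -run '^$' -fuzz "^${target}\$" -fuzztime "$FUZZTIME" "$pkg"
  done
done
```

A dedicated wrapper additionally deduplicates and extracts the failing corpus entries, which this sketch leaves to the go tool's default testdata output.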
You can use it to run fuzz tests as part of your pull request workflow, or you can run it on a schedule, during the night or whenever you want to run it. All right. But, yeah, if you want to say hello, there is my email address and my handle. I also wrote a blog post about this, which goes into more detail about this actual finding. And there are some references. If you want to start fuzzing, there is a very excellent introduction in the Go documentation. There's also a good article on Wikipedia about how it works under the hood. And there are links to ClusterFuzzLite, Go-CIFuzz, the blog post, and a pretty interesting entry in this list is the second one: there was a recent paper from Google where they use AI to actually generate the fuzz tests. So maybe you don't really need to write them, and AI will be able to do it for you. All right. So if there are any questions, happy to answer. All right, any questions? We still have some time. At the front, that's nice. Okay. How many minutes do you run the fuzzer in the CI? Because this is important, right, because it costs money. That's true. Yeah, so it depends on the workflow. For instance, on a pull request you really don't want people waiting, so we run it for five to six minutes. That's enough time, in our experience, to catch those bugs in the edge cases that are quite common. But you can run it indefinitely during the night; it depends on how much money you want to spend on your CI runs. Yeah. All right, any other questions? Questions. Can you keep your hand up so I can go to the right row, if you could pass this along. Have you tried fuzzing only by inserting random strings, or also a combination of valid tokens in different orders? Could you repeat that, please? From what I got from the slide, if I'm not wrong, you were inputting the data.
You were putting in random strings, right? Okay. So how it really works is that you provide a starting corpus, so think of your unit test inputs, and then the fuzzing engine underneath takes those inputs and applies transformations to them. So every time you'll get a slightly different input. It won't be completely different, but it will be a bit mutated. So if you saw the findings here, for instance, right? It outputs valid HTML, or almost valid HTML. It reached this conclusion based on some coverage data it found. So it also looks at test coverage: when it runs the fuzz tests, it captures which branches of code have been covered and tries to reach the ones that have not been covered. So it's an iterative process where it applies transformations to the inputs. Right, there's another one. How does the engine know which part of the corpus it may change and which not, so it doesn't just input random strings like the ones I could obtain from the random package? Could you repeat the beginning of the question? Yeah, sure. The fuzzing engine, you give it a set of example strings. How does it know which part of that it may change, so that it doesn't just put in random things? Okay. So I don't know the exact details, but I think it works like this: it makes a change and it looks at the coverage data. So it looks at the branches it discovers when it made the change, it notes some interesting inputs, and then tries those inputs. So if the coverage increases, it will try to make more transformations similar to the one that it made. Yeah, one more. What kind of coverage metric is it? The question is what kind of coverage metric it is. I'm not so sure, but I think it's branch-coverage based. If you run the fuzz tests with some verbose flags, you will see that there are coverage bits, and I think it tells you how much coverage there is for a particular input. All right.
There's one more. One second. I can probably just speak up. So the question is: when you run fuzz tests, there is a Go cache folder that captures the inputs already run, and the question is whether the tool will or can support this. And the answer is it doesn't right now, but it's planned. For those that are unaware, when you run a fuzz test, there is a directory that captures all the inputs it has tried, or the interesting ones. And when you run it again, it will start from that point, which is really handy because you will not redo the same or similar work every time; you can start from where you left off. Yeah, thank you. Yeah, there is one more. The question is slightly tangential to this, but you said we provide a starting corpus and then there are transformations on that, which is run against whatever we're testing. So is there a way to optimize the starting corpus to increase the kinds of test cases that are actually generated by the fuzzer? Is there a way the starting corpus can be designed to cover as many edge cases as possible? Okay. So there are several angles to this. There are corpora that you can find online, on GitHub for instance, that you can employ in your fuzz tests. Also, when there's a finding, for instance when you run the fuzz test and it finds a string, it will add it to the corpus that you have in your repo. So when you run this, there will be a directory created in your repository called testdata, and inside that testdata folder the finding will be captured. And you should actually commit that folder to your repo, so that every next time you run the fuzz test, it will actually check for regressions. So yeah, I hope this answers your question. Any more? Thank you. Are there ways to customize the kinds of transformations that are applied by the fuzzer? Not in Go's native fuzz tests. So there are other tools that were being used before Go introduced native fuzzing.
There is libFuzzer, for instance, which is very commonly used by OSS-Fuzz, and I believe if you use that, you can customize it. But the way native Go fuzz tests work is that they use a libFuzzer-style engine, and it's not very configurable. It's supposed to be good developer-experience-wise and cover most of the needs that you have, but I don't think you can drive the transformations from it. I'm going to end the questions here.
How we almost secured our projects by writing more tests
The careful eye might have noticed something in my schedule: I put a lot of similar subjects together, and because Philip was actually replaced by this speaker, this would have been three hours filled with only tests. Glad we were saved from that. But let's continue with this test thing, because tests are important, and many people love them and many people hate them. So Alessio is going to take us away with security by testing. All right, applause. Hello, everybody. Welcome to my talk. Let me give you a little introduction about myself. So, who am I? My name is Alessio Greggi. I'm a software engineer at ARMO, the company behind Kubescape. My full-time job actually is to be a cat food opener for my furry friend. But jokes apart, I'm passionate about reading and taking long walks. You can find me on GitHub and Twitter with this account and the following avatar. But let's start the talk. I will give you some introduction, some easy concepts that can help you understand the whole talk better. So the first question is: what is code coverage? Code coverage is a metric, a percentage actually, that we can use to understand how much of our source code is covered by tests. Mostly it is used when we write unit tests, but not only for this kind of test. Let's go a bit more in depth: code coverage in Go. It was first introduced in Go version 1.2, more or less 10 years ago. I guess it was April 2013, if I remember well, with support for unit tests, in this specific article. But the story continued: after more or less 10 years, so one year ago, the community introduced in Go version 1.20 a new kind of coverage support, this time for integration tests. So what happened since last year is that we basically noticeably increased the coverage percentage in our projects, of course, if we were already doing integration tests.
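The coverage workflow he's describing can be sketched around a trivial function. Everything here uses only standard Go tooling; the commands in the comment are the generic forms, not his project's actual invocations:

```go
package main

// Clamp is a tiny example function; the point is the coverage workflow
// around it, not the function itself.
func Clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// A matching unit test would normally live in clamp_test.go:
//
//	func TestClamp(t *testing.T) {
//		if Clamp(5, 0, 3) != 3 { t.Fatal("clamp high") }
//	}
//
// Unit-test coverage (supported since Go 1.2) and the HTML report he mentions:
//
//	go test -coverprofile=cover.out ./...
//	go tool cover -html=cover.out
//
// Since Go 1.20, whole-binary coverage for integration tests:
//
//	go build -cover -o app .
//	GOCOVERDIR=covdata ./app          # run your integration scenario
//	go tool covdata percent -i=covdata
```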
And yeah, basically in these 10 years a lot of things changed. They also implemented some nice tooling to check the coverage, rendering the profiles in an HTML page that you can view in your browser. It's really nice to use, really helpful. But let's see another concept that is important for this talk: what is a seccomp profile? First of all, seccomp is a kernel feature, and it helps you block certain syscalls during the execution of a program. You can define a seccomp profile as a kind of rule set: you list all the syscalls that you want to allow or block during the execution of your program. And what else? It is extensively used in the Kubernetes ecosystem. Also in Docker you can attach this security profile when you run a specific pod or container, and the container runtime will use this seccomp profile to check whether each syscall is allowed to run. Another important thing is that in Kubernetes, if you enable the SeccompDefault feature flag, you can basically use the default profile, which blocks a list of deprecated and really dangerous syscalls that you should not use during execution. So by default you can use this profile and be more or less safe. But it's even better if you create your own seccomp profile for the project that you are implementing. So the main idea that I had was to generate a seccomp profile during the test pipeline, since it is probably the best environment, of course if we write a lot of tests, to exercise all the syscalls that are used in your project. So the test environment is probably the best candidate to use in order to extract all the syscalls that are going to be executed by your project.
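As a concrete sketch, a seccomp profile in the JSON format that Docker and Kubernetes consume looks roughly like this. The syscall list here is a made-up minimal allowlist for illustration, not one generated from a real project:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Everything not explicitly allowed fails with an errno, which is exactly why generating the list from a test run that exercises all code paths matters.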
So the idea was to generate the seccomp profile, and in case your project is based on Kubernetes, if you are developing something related to Kubernetes, the way was to create an init container that injects the seccomp profile into the node, and then use a security context with a localhost seccomp profile in order to attach the profile that you just injected into the node. And here's one example. So you have the init container that downloads the seccomp profile. In this case it was just a test, but you can think of providing it as an artifact on GitHub or wherever you want. And the application container can use the seccomp profile type Localhost, referring to that profile. Okay. This was the first part of the talk. But now let's see how I tried to achieve this goal. I mean, how I tried to extract the syscalls from the tests. In this case we are talking about integration tests and unit tests. Here you can see a kind of execution path of your project: if you run the project, you are going to have this kind of tree. With code coverage you can understand which part of this tree has been executed, so you can use it as a metric for your seccomp profile, to understand which part is missing and how reliable it could be, since it's a metric that gives you a percentage. So, first thing: extracting the syscalls from the integration tests. Let's say this was the easiest part. With integration tests you can build a binary and provide some scripts that basically check for expected results. And when you run the binary that you built, you can use one of the tracing tools, for example strace or perf or whatever you prefer, in order to extract the syscalls during the execution of the binary during the test. So this was the first part, but let's see the other one: extracting this information from the unit tests. First of all, it was a bit more complicated, and I'm going to explain why.
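The init-container pattern described above can be sketched in pod-spec form. The image names, volume layout, and profile filename are placeholders for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  initContainers:
    - name: profile-installer
      image: example.org/profile-installer:latest   # hypothetical image
      # Copies myapp.json into the node's seccomp directory.
      volumeMounts:
        - name: seccomp-dir
          mountPath: /host/seccomp
  containers:
    - name: app
      image: example.org/app:latest                 # hypothetical image
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: myapp.json   # path relative to the kubelet seccomp dir
  volumes:
    - name: seccomp-dir
      hostPath:
        path: /var/lib/kubelet/seccomp
        type: DirectoryOrCreate
```

The `type: Localhost` reference is what ties the application container to the profile the init container dropped onto the node.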
So the reason is that go test actually compiles and runs the tests all at once. So you cannot just strace go test, because otherwise you are going to catch all the syscalls that are not related to the function that you want to trace. Remember that we are speaking about unit tests: we are testing only a specific unit, only specific functions, and you want to extract the syscalls that are executed during that function. So you cannot strace go test, and even if we build the test binary, we cannot strace ./test-binary either, because the test binary could include some noise. For example, suppose that you have some data file that you want to run against your function: you open this file, take the data, and put it into your function. When you do this, strace will also catch that open. So it's not really suitable. So, my personal solution. Let's see the steps. More or less, the solution is to split it into steps. First of all, we can compile the test binary without running it: you do go test -c followed by the package that you want to build. Consequently, from this binary you can extract the function name just by using objdump --syms, so you can extract the full symbol of the function that you want to trace. So at this point, let's see my personal solution. I don't know if it's the best one, but it's a solution. This project is called harpoon. You can find it on my GitHub, and it makes use of eBPF. I want to clarify that I'm not an eBPF expert, but while learning the technology I tried to use it to solve this issue. So the main idea was to define a tracepoint with eBPF that starts its execution, so it starts tracing the function, when a uprobe that was previously attached to the function emits an event.
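As an aside, the information `objdump --syms` gives (the function symbols in the binary built with `go test -c`) can also be read in-process with Go's standard debug/elf package. A sketch; functionSymbols is an invented helper name:

```go
package main

import (
	"debug/elf"
	"strings"
)

// functionSymbols lists function symbols in an ELF binary whose names start
// with the given prefix -- the same information `objdump --syms` prints,
// which the talk uses to find the symbol of the function under test.
func functionSymbols(path, prefix string) ([]string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	syms, err := f.Symbols()
	if err != nil {
		return nil, err
	}
	var out []string
	for _, s := range syms {
		// Keep only STT_FUNC entries matching the package prefix,
		// e.g. "main." for symbols like main.doSomething.
		if elf.ST_TYPE(s.Info) == elf.STT_FUNC && strings.HasPrefix(s.Name, prefix) {
			out = append(out, s.Name)
		}
	}
	return out, nil
}
```

For example, `functionSymbols("./pkg.test", "main.")` would list the test binary's functions in package main (assuming the binary is not stripped).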
So the uprobe informs you that the function started executing, and another probe, the uretprobe, emits another event when the function finishes executing. Another important thing to know is that this project is actually a POC; it's not production grade. It's based on gobpf, which is part of the iovisor BCC project. So, how does it work? You put the uprobe and the uretprobe inside your ELF binary at the symbol of the function. In this case we have main.doSomething, which is our example function. And the uprobe and the uretprobe will inform you when the function starts executing and when it finishes. In the meantime, the tracepoint knows when to trace the function, and it traces on the sys_enter event, so it captures all the syscalls that are executed during this window. So, that's an example. On the right side there's a function that does some simple things, and on the left side you have the result: you have the write, the openat and the other syscalls, and at the end you can also see the read. Okay, so all these things are really nice, and I was really happy to have achieved this result. But at some point I also realized that this was not really working, I mean, not every time. And after a while I discovered why it was not working. But first, let's understand how the uretprobe works, because we have a problem with the uretprobe in this case. A uretprobe basically overrides the return address of the probed function with the address of a trampoline. The trampoline jumps into another function, which in this case is our eBPF program. But since the Go stack changes dynamically at runtime, due to the garbage collector, when the trampoline function tries to return on the stack it is not able to, at least not all the time, because the stack has changed and the previous address is no longer valid.
So, a possible solution. Luckily for us, uprobes can be attached at a specific offset in the ELF binary. So we can basically simulate a uretprobe, one that informs us when the function has finished, by adding a list of uprobes on the RET instructions of the function. So if the function returns in three places, we place a uprobe on those three RET instructions. So we can simulate the uretprobe instead of using a real uretprobe. As for future improvements: when I realized this solution could work, I checked the iovisor gobpf library, but it was impossible to attach uprobes at a certain offset. That was my fault, actually, because this library is deprecated. So a future improvement is to move to another library first. We can use, for example, ebpf from Cilium or the one from Aqua Security, and so on. In this case we will be able to attach the uprobes at specific offsets, and so place them on the RET instructions of the function. So here are some references that I found on the internet that helped me understand the problem better and how to solve this issue. Also some special thanks to some people who really helped me during this experiment. So thank you for your attention. Well, I have your attention, or you're sleeping, depending. I have two announcements. One is: read the whiteboard, I'm not repeating this again; lightning talks, we still have available slots. And the second one is: this room is not possible without volunteers. This is a 110% volunteer conference. I get no money, I even have to pay for my own dinner tonight. Oh no, that's sponsored now, thank you. But I want to make a special shout-out to my dear co-organizer Eva. I'm proud of her. Eva is a student in computer science, more specifically in application development. If you have internship positions at your company, you can hire her for free.
Dependency injection: a different way to structure a project
I'm going to talk about using Go. What is important when you use Go is dependency management; you cannot write a program these days without depending on something. Dylan is a co-worker of mine. We work on Cilium together, and he's going to talk about everything to do with dependency management. So, a round of applause. Hey everyone, thanks for coming. So, dependency injection. Before we start, a little introduction, though I already got one, technically. My name is Dylan Reimerink. I work at Isovalent on the foundations and loader team, so we're responsible for a lot of the changes that I'm going to talk about within the Cilium project. You can find my GitHub there, in case you find anything interesting; you never know. So before we dive into dependency injection, why, how it works, and what it is for those who don't know, a little journey about why I'm here, why I'm talking about this and how I got here. So what is Cilium? Cilium is a CNI. Long story short, we use eBPF to do networking; we secure it and we make sure that you can see what's going on. And that actually involves a lot of components. This is our nice visual of a lot of the different features, and we actually have way more that wouldn't even fit on the slide. You can imagine that with that many components, we get quite a large application. I checked, and we are currently the third most active project in the CNCF. Last time I checked, which was like a month ago, we have 650,000 lines of code outside the vendor directory. So we have a big code base, a lot of things happening, which also means that we have a lot of dependencies. To illustrate that, I picked one of the features that I personally worked a lot on, which is called the L2 announcer. It's a little feature in Cilium that basically makes sure that certain IP addresses are reachable on the local network via ARP.
So both gratuitous ARP and responding to requests. We have the big L2 announcer block there, which contains most of the business logic, but all of the other things are dependencies. All the way at the top, still in white, are our external dependencies: we create ports, we get environment variables, configuration, standard output, et cetera. Those are connected to our infrastructure layer, which does all of the things that are really common in the application: logging, metrics, configuration, and so on. And then we get to the orange layer, which is our control plane, and that's where the abstract business logic happens. This business logic reads Go objects and it also writes Go objects; it's all pure Go world, and it mostly doesn't have to care about everything else. And then we go down to our data path, where the translation happens from this perfect abstract world into the real world, which in turn often means, in our case, that we talk to the kernel via netlink, eBPF maps, raw sockets, et cetera. So for my big component to be able to work, I basically need all of this to exist, at least in production. So I went back to version 1.11, which is before we started working on dependency injection in Cilium, and looked at what initialization looked like at that point. So we have our main function, which calls into Cobra. This is common, hopefully. We go into our run function. It starts up three components. It initializes the environment, where we already have 50 components. Then we call something called runDaemon, which has 50 components spread both before and after the new daemon. And then in our newDaemon constructor, we actually create at least 150 components. I stopped counting, sorry. So we have a lot of components, but they all have to somehow wire into each other.
And at some point, the development team decided to go for a sort of hub-and-spoke model, because we had so many components. We had this big daemon object, which was our hub, and it had pointers to almost all components. And then it's easy: you only have to give the daemon to everything, and via the daemon you can find every other component. But that becomes a real mess, because when is this pointer nil, when is it not, et cetera. So I started looking into this newDaemon function, like, what is this about? And then you see a pattern. You don't have to read everything: we initialize this before creating that; we must close this before we open that; this must be done before we start identity allocation; IP cache must be done after the initialization below; this must be read after that has happened. And so on, for a while. At this slide, I'm at roughly the first snippets, at about line 350. And then I basically stopped; I just scrolled down at that point, my point was made. The last reference I found, something like "do this before, do this after", was at line 718. But what is perhaps interesting to note is that this top snippet is basically a sort of defer: it talks about cleanup instead of initialization, which is also a really big thing that we have. So, to summarize the problem that we were facing at this point in development: we have a lot of dependencies, but this is just inherent to the product that we're making. Nothing to do about that. What we can do something about, and what is a lot of the source of the pain, are these implicit dependencies. We have dependencies on global variables, on these very big objects, or on system state, which require us to use comments to tell other developers how our dependencies work. So our dependencies are all implicit in this state, which makes things really hard to modify.
Like, when I started and I created a component, it broke CI, it broke everything, and I couldn't figure out why. It turned out that I had to move it up a few hundred lines in the initialization, or down in some cases, to make sure that everything I implicitly depended on was there. So it's really hard, and it really destroys confidence. It's also hard to shut this application down, at least correctly. You can kill the application, sure, but then open files are not saved. And if you are running end-to-end tests or anything like that, you need to make sure that all your resources are cleaned up, so the next time you start, you are not blocking other things. So this was really hard, and it made things really hard to test, because if I wanted to test my L2 announcer, I had to recreate all of this additional infrastructure a lot of the time, even if I had interfaces, because some dependencies were still problematic. So we started looking into solutions, and this led us to dependency injection, for a few reasons. Before I go deeper, for the people that don't know: dependency injection is basically a way to, instead of explicitly initializing your project in a very big main file, define your components and explicitly declare what their dependencies are. And then you have some component, in this case I call it a graph builder, but it's basically the framework that you use, that actually initializes everything: you hand off the job of correctly initializing your application to a piece of software. And we know software never has problems or bugs. But in all honesty, this is actually quite a popular pattern in other languages like Java, C#, PHP, but we don't see it that often in Go projects. The only thing that is required for this to work, or at least work correctly, is that you specify your dependencies explicitly, as arguments to a constructor function.
So what I would like to introduce to you is the Uber FX library, made and maintained by Uber. It was originally developed by Glipp, who is now actually a colleague of mine, which is how we got into this library. It's really well battle-tested, and I'm going to show you how it works and what it looks like. But what's important to know is that it is a dependency injection library, and dependency injection libraries might not all work for your use case; this one didn't fully for ours. So if you were to look at Cilium today, we actually use our own custom-flavored framework, built on dig, which is basically the underlying library under FX. But if you want to go ahead and try something first, then FX is your starting point. And this actually was made to solve a lot of the problems we had, not only this initialization issue, but also because we have a lot of binaries in a big monorepo, so it also allows for really good reuse, which is, as far as I understand it, where Uber first started with it. To explain this, I first created a very, very small application. Normally you wouldn't use dependency injection on such a small application; it's just a simple web server. And this is how I might write it without dependency injection: in main, we construct everything, link everything together, call server.Serve, and we're done. So this is nice and short. When we do dependency injection, we have to be a bit more formal. So I defined a NewListener, a NewLogger, and a NewServer. My listener and logger at this moment don't have any dependencies. I could give them configuration or something else, but that wouldn't fit on the slides. And the server takes both of these and constructs itself.
So we defined what everything needs, and then at the top left, in our main, we create a new FX application: we provide the listener and the logger, and we invoke the server, because, if you recall, the Serve function was the thing we were interested in calling. In practice, the invokes are basically your entry points, and the library will look for all dependencies of that entry point. So you could, for example, create a very big graph and have multiple entry points, and call different entry points depending on, for example, commands in your binary. And then it will only construct and start the dependencies that you need. So it also does a little bit of dead code elimination implicitly. And then you call Run, which actually wouldn't do anything in this example, sorry, because Serve is not called. So this would start and construct everything, but nothing extra would actually happen. For that, FX has something called lifecycles, which are really useful. The last slide talked about construction time: when we construct our graph and then run it, the lifecycle gets invoked. So what we can do is say, okay, the server is now dependent on a lifecycle, and within the constructor we tell the lifecycle: while I'm alive, I want to do something. So I have an OnStart and an OnStop hook. When I start, I want to start a goroutine and serve whatever I serve, and when I stop, I want to shut down, which is something that my initial program didn't even do: a proper shutdown of the HTTP server. It's a little bit hard to show that in the original example, so I threw together a very small sample that still fits on the slide, which is important here.
So I have A, B and C, and they basically all depend on each other, so it's a very deep dependency chain. And then I have this print function, which you can decipher later, but basically I call it in every constructor: it prints at construction time and it prints in the lifecycle hooks, so you can see what happens. And if I were to run this program, the output would be something like this: A is constructed, B is constructed, C is constructed, because that's the order in which the dependencies become available during construction. Then the start hooks are called in the exact same order as we constructed them. So if you have dependencies, for example A opened a file and we need that file to be open because B will start calling things in its lifecycle, we know that the start hook of A is always called before any of its dependents get time to run. And then when we stop the application, we Ctrl-C or something else happens, we shut down. But the nice thing is that we automatically shut down in the exact opposite order, just like a defer, but at the application level. And this allows you to do a proper shutdown, write your files away, do everything else. And you also know that, because everything else depends on you, you get the first chance to shut down properly, and no one will call into you after that in their shutdown functions, because they don't have references to you; you depend on them, not the other way around. There's also a nice feature called groups. There are actually quite a few features, and I couldn't touch on everything because of time constraints, but this one is nice for a small class of problems. And it's called a group. What you can do is, and I actually use two features here, the fx.In and fx.Out feature.
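The ordering he describes, start in dependency order and stop in reverse, can be illustrated without fx at all. This is a stdlib-only stand-in for the lifecycle idea, not the fx.Lifecycle API:

```go
package main

// Hook names a component's start/stop pair, standing in for the OnStart
// and OnStop hooks from the talk.
type Hook struct{ Name string }

// Lifecycle runs start hooks in append (dependency) order and stop hooks
// in reverse -- like a defer, but at the application level.
type Lifecycle struct{ hooks []Hook }

// Append registers a hook; fx appends hooks as constructors run, so the
// append order is the dependency order.
func (l *Lifecycle) Append(h Hook) { l.hooks = append(l.hooks, h) }

// Run returns the event sequence so the ordering is easy to inspect.
func (l *Lifecycle) Run() []string {
	var events []string
	for _, h := range l.hooks {
		events = append(events, "start "+h.Name)
	}
	for i := len(l.hooks) - 1; i >= 0; i-- {
		events = append(events, "stop "+l.hooks[i].Name)
	}
	return events
}
```

Appending A, B, C yields starts in that order and stops as C, B, A, which is exactly the A/B/C output he walks through on the slide.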
And it basically allows you to return multiple dependencies from a constructor, or take multiple dependencies, in a nice way. So I can, for example, have a parameter structure that takes in 20 different dependencies without having to spell them all out separately in my arguments, and I can also return multiple things. Crucially, in my case, I can specify group names to basically route outputs from one place to another. In this case, I created a mux, and this mux collects all of the mux handler objects that are there. And I have a foo and a bar, and they both emit their own handler, and they are collected by this mux, which we could then give to a server. And the cool thing about this is that you have this once, and you can then add a lot of additional parts to your whole application, and it all collects as an array into this group. There are some caveats; I'll come to that in a bit. So, under the hood, how this works, very simplified: we have our definitions, and at least FX and dig use reflection to look at the parameters, and then, based on the types, build a directed acyclic graph. And that graph can then be walked to get the correct ordering. So there is a small bit of magic there, and it's called reflection, but it's not much; it's quite understandable if you actually dive into how something like this works. And then, again, the constructors, the starts and the stops, are called in the order determined by the DAG. It also means that you can't have cyclic dependencies. That's a no-no, so it's a good reason to remove those from your code as well. So I would like to share with you, in case you want to try dependency injection, some tips, tricks and lessons we learned, because there is a good way to do this, and there are definitely also bad ways to do this.
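A toy version of what such a graph builder does under the hood, assuming nothing about dig's internals beyond what the talk says: reflect over constructor parameters, then construct in dependency order. Cycle detection is omitted for brevity, so a cyclic graph would recurse forever here, which real libraries detect and reject:

```go
package main

import (
	"fmt"
	"reflect"
)

// Container is a toy constructor-graph resolver in the spirit of dig/fx:
// each provided function's parameter types are its dependencies, discovered
// via reflection, and construction happens lazily in dependency order.
type Container struct {
	providers map[reflect.Type]reflect.Value // result type -> constructor
	built     map[reflect.Type]reflect.Value // memoized instances
}

func NewContainer() *Container {
	return &Container{
		providers: map[reflect.Type]reflect.Value{},
		built:     map[reflect.Type]reflect.Value{},
	}
}

// Provide registers a constructor of the form func(deps...) T.
func (c *Container) Provide(ctor interface{}) {
	v := reflect.ValueOf(ctor)
	c.providers[v.Type().Out(0)] = v
}

// resolve builds (and memoizes) a value of type t, recursing into the
// constructor's parameters first -- a depth-first walk of the DAG.
func (c *Container) resolve(t reflect.Type) (reflect.Value, error) {
	if v, ok := c.built[t]; ok {
		return v, nil
	}
	ctor, ok := c.providers[t]
	if !ok {
		return reflect.Value{}, fmt.Errorf("no provider for %v", t)
	}
	ft := ctor.Type()
	args := make([]reflect.Value, ft.NumIn())
	for i := range args {
		dep, err := c.resolve(ft.In(i))
		if err != nil {
			return reflect.Value{}, err
		}
		args[i] = dep
	}
	out := ctor.Call(args)[0]
	c.built[t] = out
	return out, nil
}

// Invoke calls fn (the entry point) after resolving all of its parameters,
// constructing only what that entry point transitively needs.
func (c *Container) Invoke(fn interface{}) error {
	v := reflect.ValueOf(fn)
	ft := v.Type()
	args := make([]reflect.Value, ft.NumIn())
	for i := range args {
		dep, err := c.resolve(ft.In(i))
		if err != nil {
			return err
		}
		args[i] = dep
	}
	v.Call(args)
	return nil
}
```

Note how Invoke only constructs the subgraph its parameters need, which is the implicit dead-code elimination mentioned earlier.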
First: inject, but in moderation. Not everything has to be a component. For example, math libraries are stateless; there's no reason to make one a dependency in this system, because you can just use it — they're pure functions. My rule of thumb is: if it has state, make it a dependency, because then you benefit from all of the state-specific handling. But if you have libraries that don't use state, please don't make it harder than it has to be. Also under "inject in moderation": we saw that doing dependency injection adds a lot of boilerplate, which is worth it in very big applications, or even moderate ones, I would say — but likely not for your small CLI tool. So this is really a technique for medium to large projects. When you do this, pick logical boundaries. We, for example, started out making 20 components within the same package, and then no one outside the package actually ended up using those components — a massive amount of complexity and overhead that's just not necessary. In my experience, using packages as the logical boundaries for these components is the best thing to do, because you can also leverage which types you export: you can provide something and not export that type, for example, and only export an interface that matches it. That's a really powerful combination. The last thing to note is one of the other features I wasn't able to show because of time constraints: fx.Options. fx.Options is really cool because it allows you to take multiple of these components and bundle them under a single variable. So while global variables are big no-nos when doing this, you can still use a global variable on your package to export these constructors. And the nice thing there is you can make a sort of hierarchy.
So if you have a package hierarchy that's three layers deep, you can basically reflect that, and in your main application you don't have to list 200 constructors all separately. That also really helps with readability — seeing where what is provided, and so on. Next: provide targeted interfaces. Go idioms still apply: the smaller your interface is, the more powerful it is and the better you can swap it out. When I depend on the smallest interfaces I can, it's really easy for me to mock things out in my tests: create a new fx app, only provide the direct dependencies — which are interfaces I can then mock out — and it makes everything really nice. This is general advice, not specific to dependency injection, but it goes hand in hand: if you do dependency injection and don't do this, it takes away a lot of the benefits you would otherwise get. It also makes it easy for external components to rely on your interface rather than on your internal implementation. So when I provide a component, I always try to provide it as an interface as well. And the last thing, which is more of a trick: a struct can implement multiple interfaces, so instead of having one interface with three methods, I can provide it as three separate interfaces with one method each. That way, on both the receiving and the providing side of your dependency, you have the smallest possible interface — again to help with mocking, but also so that if you don't use certain methods, you don't have to write fake methods that panic if anyone were to call them. I mentioned groups, and they are really powerful, but go easy on them. Groups are really only ever useful if you have multiple parties that are interested in the same list of objects.
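The "one interface per method" trick above can be sketched like this. The names (`Store`, `Getter`, `Greet`) are illustrative, not from the talk's slides:

```go
package main

import "fmt"

// Store has several responsibilities, but consumers should only
// see the slice of behaviour they actually use.
type Store struct{ data map[string]string }

func (s *Store) Get(k string) string { return s.data[k] }
func (s *Store) Put(k, v string)     { s.data[k] = v }
func (s *Store) Delete(k string)     { delete(s.data, k) }

// One interface per method: a reader never sees Put or Delete,
// and a test double only implements the method under test —
// no fake methods that panic when called.
type Getter interface{ Get(k string) string }
type Putter interface{ Put(k, v string) }
type Deleter interface{ Delete(k string) }

// Greet depends on the smallest interface it needs, so mocking
// it in a test is a one-method stub.
func Greet(g Getter, name string) string {
	return "hello " + g.Get(name)
}

func main() {
	s := &Store{data: map[string]string{"fr": "bonjour"}}
	var p Putter = s // the same struct satisfies all three interfaces
	p.Put("en", "hi")
	fmt.Println(Greet(s, "en")) // hello hi
}
```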
For example, we have metrics: we have a Prometheus metrics registry which collects all of the metrics to actually use them. But we also have tooling that automatically generates documentation about these metrics, and I can write a very small CLI tool with just one component that depends on all the metrics we have defined in our application, and I collect all of them automatically — everyone who registers a new metric automatically appears in this metrics tool. So it's really great, and the same goes for our configuration and HTTP elements, which also have CLI tools that want to interact with the same things. The alternative to using groups is a registry pattern: you provide a registry, and everyone else — say I have 20 other components — can depend on it and register themselves during construction time. The upside of doing that is that with any decent editor you can follow those traces back: you can always use find-references to see who actually uses what. With groups it's all magic: everything goes into this group and comes out, but you can't trace that back in the code itself without difficulty. Next: stay with a static graph when possible. With this fx application you can, in theory, provide or not provide components depending on configuration. In Cilium we have opted to never do this, because it makes it impossible to verify that you never have missing dependencies or other problems like circular references under certain combinations. The graphs are verified at runtime, so you would have to have a good CI that runs everything to make sure it works.
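The registry pattern described above might look like this. A hypothetical sketch, not Cilium's actual code — the metric names and function names are made up for illustration:

```go
package main

import "fmt"

// MetricsRegistry is the traceable alternative to value groups:
// components register themselves during construction, so
// "find references" on Register shows every caller.
type MetricsRegistry struct{ names []string }

func (r *MetricsRegistry) Register(name string) {
	r.names = append(r.names, name)
}

// Each component takes the registry as a dependency and
// registers itself at construction time.
func NewHTTPMetrics(r *MetricsRegistry) { r.Register("http_requests_total") }
func NewDNSMetrics(r *MetricsRegistry)  { r.Register("dns_lookups_total") }

func main() {
	r := &MetricsRegistry{}
	NewHTTPMetrics(r)
	NewDNSMetrics(r)
	// A docs generator or CLI tool can now consume the full list.
	fmt.Println(r.names)
}
```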
What you can do instead is use the lifecycle: you always provide the objects, but they can choose whether or not to subscribe to the lifecycle, and that way you can enable or disable certain logic without making the graph dynamic. And that was it. Thank you very much. I have time for one question. I see a hand there; I'll quickly come over and hand you the microphone. If you are exiting already, please do it quietly. Question: What made you choose dig and fx instead of, for example, Google's Wire, which is more popular? Answer: Like I mentioned, a colleague of mine, Glib, had authored it, so we were very quick to jump on it when he suggested using the library. So it's purely advertisement. Thank you. Any questions?
REST in Peace: using generics to remove REST boilerplate
Well, this is a depressing title: RIP. Rest in peace — and I hope "rest" means RESTful, and not that this is the end of Go, because I kind of like Go. Anyway, this is going to be a very interesting talk. Rodolfo Plaza, thank you. Oh, actually I don't have my notes, so I'm going to wing it. Anyway, hello everybody, thanks for coming. I'm going to present a project of mine that I created a few years back now. It's called REST in Peace, and it's about to make REST... in peace, oh ho. So in 2021, Ian Lance Taylor and Robert Griesemer gave a talk about how to use generics, once they had actually implemented generics in Go. I don't know if I have audio — we'll see. No. Anyway, Ian Lance Taylor's final words were basically: please use generics wisely. And of course, when a figure of authority asks you to use something wisely, what do you do? The total opposite. People from CrowdStrike, the security company — I don't know if you can read it in the back — created a channel on the Discord: a creative-usage-of-generics contest. Submit your worst implementation of generics in Go — basically everything Ian told us not to do. My contribution to making the world a worse place was, I think, an async/await in Go, because who needs goroutines and channels anyway. Some people even did try/catch, if you're missing those good things from other languages. I got a plush, because when you do something, the world gives you something back; it's called karma. And just for the record, I listed everything that was attempted — we had monads and stuff like that. But out of all of this, I created something I thought was actually a useful use of generics — maybe not the intended one, because the current implementation is not optimized for this use case.
But I thought it was nice anyway. So, about me: I'm Tanguy, I'm from France, I've worked 17 years in IT, and I'm also "CEO of HTMX". Okay, one person knows about HTMX. As you will see from this video — oh, we have sound now — I'm ready to do anything for money: I'm a freelancer. Specialized in Go since 2015; I worked in normal consulting before that. I've worked mostly on classic RESTful APIs and backends, and I've done some blockchain. I stopped freelancing for about a year to work at Dagger — you should check them out, I think what they're doing is pretty cool: CI/CD as code. I'm also very interested in pushing Go into more areas than just microservices and web backends — GUIs, game engines and things like that. The next talk will be about GUIs, so I'd advise you to check it out. So now, can anyone recognize this? Yeah, thank you — basically, it's the HTTP handler code that everybody writes. You might even have a validation step if you're fancy. We do all of this code — in the end, we decode the JSON and we encode the JSON for the response — but all of it is just for this one line right here. For just this line, we do all of these gymnastics. That's a lot of code. And let's say we add another handler to deal with another type in our API: now we basically copy-paste all of that previous code and change a bit here, here and there. So again, I see a lot and a lot of duplication, and for me duplication is something we should try to avoid as much as possible. There are rules of thumb about this — the rule of two or three, which I think is good — but when you create a big API, you have more than two or three copies of that code. So one solution is to abstract the handler and create a very unsafe type to basically accept whatever you want in it.
You can make it work, but then you call your backend, and the backend has to deal with a lot of type casting everywhere, and it can fail in many places, so you need to do a lot of error handling. Here I put what we'd have to do to convert from one type to another and make sure it works — and it's all of this. All of this is for two types, for a struct with two fields, A and B. For all of that, we have all this boilerplate that takes the dynamic code and transforms it into something type-safe that you can actually use in your backend. That's again a lot of code. The real backend can be easy once you have the right types, but we had to do all of this just to be able to call a simple backend. (By the way, if your backend is just that and you can make money out of it, go for it.) So: a lot of runtime reflection boilerplate to get back to types, for a potential reuse of the handler — not so sure it's worth it. Finally we have a solution: thanks to Go 1.18 we have generics, and that's when this idea popped up. The pros of generics: better type safety, and better performance than the empty interface — as a wise person said, the empty interface says nothing. Sorry, mistype: for my use case we don't actually get better performance; there's an article by Vicent Martí that talks about this in depth, though somebody told me it may have been improved since, so maybe that's outdated now, I don't know. In general, it's more readable code for the users, and it allows more don't-repeat-yourself all over your code base. For example, without generics we have this: I just want the minimum of X and Y. Can anyone tell me what it prints? Okay, not very interactive. Well, actually it doesn't print.
It doesn't even compile, because math.Min accepts only float64. So you just have to do this — casts — and I hate it. I do it grudgingly, whatever. Not my native language, sorry. With generics, we have this function instead, which is way better. It doesn't look like it, but it's way better. The library code is not that great to read, that's for sure, but you can get used to it. Compare the previous one and this — yeah, neither reads great. But the user code — the user code is really way better. You don't have to cast everywhere, so it makes for a better code base. So, what about REST in Peace? The idea is to use generics to avoid all this HTTP boilerplate that I presented. For example, here we have some user code: we just wrap strings.ToUpper in a function which is an input/output func — I don't even remember the exact name — but basically it takes a context and a typed input, and returns another type and an error. As long as your function respects this interface, you're good. You can just wrap it, send it to rip.Handle, indicate the method, and then you have route options. And then you can just call curl on /uppercase, and it will put your input into uppercase. Magically, you don't have to handle any HTTP for that. The library code is less readable, I'll admit: we have the input/output func type, which is the function signature that needs to be respected, then the handle, which takes the input, the output, and the method for this route, and then you can just pass it like that. This was fun to do — it was my first experiment — but I really wanted to go a bit further, because I write a lot of REST backends, and there are a lot of routes to deal with for resources: we need to create, delete, update them, et cetera. So I wanted to automate that as well. Hence REST in Peace.
The key concept of REST services is the notion of resource: it's accessible via a URI, and you act on the resource URI via HTTP methods — I mean, this is one implementation of REST; normally it doesn't have to be HTTP, but anyway — and the current state is sent back through the same channel, which in this case is HTTP. So in the user code, I create a user provider — an entity provider — I pass it here, I decide the path I want, and here I just take the default route options. This user provider needs to implement this interface: create, get, update, delete, list all. I will revisit that, because "list all" is a little too much — I need to handle pagination and such, and it's not there yet. But once you implement that, you can just use your own code; you don't have to deal with any HTTP whatsoever. You pass it to this function, and then you have a whole /user route with all the bells and whistles: you can create the entity, get it, update it, delete it, list them. And I recently added fields. You can use PATCH to update just part of your entity, but the protocol is not defined — you have to define your own way of doing PATCH — and it's a little quirky. I found somebody describing a pattern I liked, which is basically: you take the whole path to the field, and then you can PUT and GET on it, and that's how you update part of your resource. So you have your entity and the entity provider, and thanks to type inference, which has improved, you don't even have to write the square brackets and spell out those types — you just pass the URL, the entity provider, and the route options, and you're good to go.
What you get: creation of CRUD HTTP endpoints; content negotiation for many encodings — right now we have JSON, XML, Protobuf, MessagePack, HTML, HTML forms; automated resource web pages that can edit the resource (not a very nice UI right now — you'll see, design is not my specialty); and a harmonious way of handling common scenarios. Because I've worked on many projects, and with duplication you write it once, then forget you need to update all those copies of the boilerplate, and then the behavior across your endpoints is not really coherent. So I think that makes this a good thing. For example, this is the entire implementation of adding a new encoding to the platform — the whole code to add the JSON encoding to REST in Peace. I have a facility, the rip codec: I use json.NewEncoder from the standard library, json.NewDecoder, then I define the MIME types. That's it, you're good to go. Most of the implementations are like that. So: RIP is to HTTP what an ORM is to SQL, maybe. I know how many of you hate ORMs, so, okay, you might hate me as well. But seriously, I hope it will help you create services more easily, because it pains me to repeat all this code all the time. Here's the QR code — like, subscribe, click the bell icon, something like that. And here's a demo: last time I did live coding it was awful, so now I have a video. Amazing. So I just run the server. All the logs that print in yellow are from the server: one stream is from the logging middleware, and the other is from the backend code that we log ourselves. Here I just get the list of users — there is only one, called Jean. Here we see the backend. Whoops. No. Sorry. We'll check on the next one. So now we're going to create a new one, named Cam. Can you stop, please? Thank you. I'm sorry. All right.
Did I check this video? Maybe not. Okay, so maybe it will take a little longer; I'm sorry about that. Yeah, that's karma. Are you serious? Okay. So — oh yeah, there was no output because we just saved this new user. Here is the log from the backend, and this is the logging middleware, which is just Apache log style. Then I list again to confirm that we have a new user in our list. Then we get one user and decide to display it as XML — because why not live in the past? On each endpoint you can have multiple encodings; I do content negotiation, so if the browser or whatever client asks for a format and I have it, I will give it to you. If you want XML — that's your problem. Here we're going to modify the first user and call him Philip instead of Jean. There he is — check, it's still Philip. Good. Now I just want one field: I just want the name of this entity, and it's returned as a JSON string, in green. Now — oh yeah, the email address got thrown in the trash, because I did a full PUT on the entity and didn't specify the email address. So I'm going to modify just that field: I do a PUT just on the email address field, and then I check again that it's correct. Now we have a correct email address and a correct name. Then I delete, and then I check that we did indeed delete, and there is only one user left. So that's what you get — sorry, it switched again — that's what you get with just this one line of handling entities, plus the whole backend implementation, of course. But I think it's pretty cool. Route options are something I added since I gave this talk at GoLab: now each route can have its own set of encodings and middlewares.
That's pretty nice, because before it was like global state. Not good. To implement this we need the entity, which is just the user struct, and we implement those two methods: IDString, which returns the ID as a string (because our ID is an int), and the other way around, to convert from a string back to an int. If you have a better design, come talk to me, because I'm not very satisfied with this, but it is what it is and it works pretty well. The implementation is quite simple. Then the entity provider — in this example it's just an in-memory map, and I'm only showing you the update. This is the only backend you have to write: I put logging in because why not, I get my user from the memory map, and then I just update it, and that's it. Basically, with the memory map it's about 100 lines of code for all those methods; I did it in SQL and it was 110, something like that. So you really reduce it to just that. For the future — oh, I have time, so maybe I'll show one more thing, but let me finish this first. For the future, I would like to do nested resources — though I've heard even Django REST Framework doesn't do nested resources, so maybe not. I want to add pagination; I want to add OpenAPI auto-generation, so you could then generate a client for your system directly. I would love even more HATEOAS — I don't know how to pronounce that — to have links and such, so the API is self-discoverable, even more than with OpenAPI. And I would like to generally improve the API. Since last time, I did the route options and the fields, I added Protobuf on my way back from Italy, and I would love to use log/slog, better handling and customization of the HTML templates (you will see why), and I would also love to generate GUI apps directly, so you don't have to bother with that either, of course.
Simple GUI apps. Let me check if I can do this. Yes — okay. So yeah, these are my beautiful HTML GUI skills. We have the user Jean. We decide to rename him, I don't know, Jean-Marc. All right. And we can add a new person — see, a very well-designed form from the 90s, as vintage as me. Let's add Marc here. Okay, we go back, we have our full list, and we can just delete. All this is thanks to HTMX, which — you should check it out, it's pretty cool. I wish you could customize those pages however you want, actually. And the last demo: I play a game with my daughters called GoCraft, which is a simple implementation of Minecraft in Go. To bother them, I thought: how about I use my thing, see how usable it is, and just create blocks in the middle of their construction to annoy them. Or I can just delete blocks. For this — I'm just going to show the code in the last minutes — I created a block type where the ID is the coordinates, X, Y, Z. So the ID is basically, if I show this, like that: I encode the coordinates X, Y, Z as the ID, and then I just have to marshal and unmarshal it. In the game, I didn't implement get, I just implemented create: I take the coordinates, create the block in the right format for the game, and then I update the block — a dirt block. This is really just code about the game; I'm not doing any HTTP in there. The delete is the same: just code about this specific Go game. That's it, and it works. So if you're excited to use it, or want to talk about it, come talk to me — I have a bow tie, you should recognize me. I would love to talk about it; if you have design ideas and such, I'm really open to them, because I think we could improve it.
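The coordinates-as-ID idea can be sketched like this. The field names and the `.` separator are illustrative, not necessarily what GoCraft's integration uses:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Block uses its world coordinates as its resource ID, so a request
// like PUT /blocks/10.64.-3 addresses the block at X=10, Y=64, Z=-3.
type Block struct {
	X, Y, Z int
	Kind    string
}

// IDString marshals the coordinates into the resource ID.
func (b Block) IDString() string {
	return fmt.Sprintf("%d.%d.%d", b.X, b.Y, b.Z)
}

// IDFromString unmarshals a resource ID back into coordinates.
func (b *Block) IDFromString(id string) error {
	parts := strings.Split(id, ".")
	if len(parts) != 3 {
		return fmt.Errorf("bad block id %q", id)
	}
	coords := make([]int, 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return fmt.Errorf("bad block id %q: %w", id, err)
		}
		coords[i] = n
	}
	b.X, b.Y, b.Z = coords[0], coords[1], coords[2]
	return nil
}

func main() {
	b := Block{X: 10, Y: 64, Z: -3, Kind: "dirt"}
	fmt.Println(b.IDString()) // 10.64.-3

	var c Block
	if err := c.IDFromString("1.2.3"); err != nil {
		panic(err)
	}
	fmt.Println(c.X, c.Y, c.Z) // 1 2 3
}
```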
And discuss it, contribute, anything. All right. I want to thank the Go team for generics — without that, this wouldn't have been possible; the Go Strasbourg meetup, because they had to suffer through my first iterations of crappy slides; a fellow Strasbourg gopher for the logo; you, for coming here, and you online — I don't know where the camera is, I guess there; the FOSDEM Go devroom organizers, because you're really, really top; and HTMX for the meme, of course. One of us.
Low code graphical apps with Go top to bottom!
We're going to continue with more creative uses of Go. Most people use Go for microservices, Kubernetes stuff, servers, whatever, but usually not user interfaces. Every year there are a few crazy people who come to talk about some crazy new front-end thingy built in Go, and I personally always like it. So I also invited Andrew this year to talk about low-code graphical interfaces in our favorite language, Python... Go! Thank you very much, Dave Matcha. So yeah, I'm going to talk to you about low-code graphical applications — low code on two levels: there's not going to be much code on screen. I think I actually have less code than John's description earlier about how to get involved in contributing. We'll see how I manage that. There are, however, lots of pretty pictures, so hopefully I can keep you engaged that way. So yeah, hi, my name is Andrew. I'm a software engineer; I've worked at various startups, written a couple of books, and occasionally appeared on podcasts and interviews talking about graphical app development with Go. It's exciting to be here on stage at FOSDEM — I've been coming for decades, having been an open source contributor for years as part of the Enlightenment project, Maven, all sorts of things that potentially predate and certainly stand outside of the Go ecosystem. More recently I started the Fyne project; perhaps a few of you might have heard of it. It's a way to build native graphical applications using Go that are going to work on any device. If you've never heard of it, I'll do a quick recap; if you have, just hold on a second and I'll move on to some new stuff. I've been a Go developer for about two weeks less than I've been working on the Fyne project, because we had an ambition of what we were going to do, and then we figured out what language was going to deliver on those ambitions. Hopefully everybody agrees Go is just a fantastic choice. How did all of that come together, and what are we building on top of it?
My day job is at Fyne Labs, where we're working on products and services that help businesses get more out of the type of technology I'm presenting today. Like I said, the Fyne project started in 2018, and over that time it has aimed to be the simplest way to build native graphical applications: they should look good, they should be easy to build, and they should work absolutely everywhere. Of course, "easy to build" is relative. We've had great feedback from people who have never coded Go or never built an app before, but there are plenty of people out there who feel that's still a little overwhelming to learn — they don't want to be a coder, they just want to build stuff. That's why I'm talking about something a little different today: building with absolutely no code at all. But before I do, here's the recap for anybody who's not familiar with Fyne. It's been running for six years now — I can't believe it's been that long, but hey, it's come a long way. We're currently ranked around sixth of all cross-platform graphical user interface toolkits by OSS Insight. That puts us up amongst names you might have heard of, like Flutter or React Native, and I should probably shout out to Wails as well — they're very popular. There are lots of different ways to build in modern toolkits, and actually in Go, so there's variety out there. Last week I was really excited to realise that we have become one of the top 1000 GitHub repositories of all time — out of, I don't know, 350 million or something; a long tail, perhaps, but it's a bit of a milestone, very exciting to be celebrating that. We have about eight core contributors; they come and go. This year has seen a lot of new contributors coming in, and as part of the Go community it feels like a really welcoming, inclusive space. We have channels on the Gophers Slack, we have a Discord for people to chat, and there are about 2000 people across the different platforms where we're discussing.
But that's enough about the technical side and the project — if you're interested in hearing more, there's a talk in the graphics devroom tomorrow afternoon at a similar time. Today I wanted to talk about not using code to build applications. So I'm going to introduce you to a tool called Fysion. The spelling is just as peculiar as "Fyne", but why not? This basic screenshot isn't going to reveal too much; I'm going to step through a little of what it's capable of, but more how we pulled it together, how it has been enabled by Go's built-in functionality, and what we've been able to build on top of that. This is the screen you might be greeted with if you load the app for the first time; it's going to help you get started building a project. So what did this set out to achieve? There's so much we could do, and probably I should have thought twice about getting into the space — I think there are 130 to 150 no-code/low-code platforms out there — but if you've ever tried them, they're mostly building websites or web technologies, and if they produce apps, they might be bundling them up into native installs, they might be targeting specific platforms, or they might rely on technologies that I might refer to as legacy, or certainly not ones with the same awesome modern tooling the Go community has. So we wanted to do something new, something that was truly building native apps. Like I said before, Fyne applications are going to compile and run on any device with a screen — at least that's the ambition; we're about 95% of the way there.
So we want to build native apps, but we also want to make this really easy to get started with and easy to build in — as much graphical editing as possible, as you would expect from a low-code platform. We started with the UI and the theming capabilities, so although the application has a long way to go, as you might see, there's something to get started with right away. It should always be building on a real code base: if you don't like the front end, or you want to work with a team of developers who just love Go at the low level, you should be able to work with them collaboratively through the Git repository, for example. The applications should compile for all platforms, but the tool should also run on all platforms. We're making use of our own technology: if you want to build an app for an iPhone, but you want to do it from an Android tablet, that's cool. If you want to use Windows as your development environment but target only mobile devices, that's just grand as well. A little tweak on the bus, because you know the boss was expecting something before you get in in the morning. Of course, being at FOSDEM, everything I'm showing you today is open source, and it's going to remain open at the core. Some day companies may want business add-ons and plug-ins, so we're going to run this as open core, but like I said, nothing I'm showing today is proprietary or held back. The repositories are evolving and some of them haven't landed in the right place yet, but I'll point you in the right direction at the end of the talk. Like I showed you at the beginning, we're going to give a UI that lets people get started with templates so their application is running really quickly, but you could also build an application completely from scratch if you want, with the building blocks we've provided, on top of a Git repository for managing the source control. But there's so much to get started with when building your first project.
I kind of don't want to say that. When I started, it was super easy: you opened a text file, wrote a couple of lines in there, and then you just ran it. I mean, it felt a little bit like a script, but really good, solid code. I've opened a few issues upstream with the project team about why modules have made things more difficult to get into. Workspaces are amazing, but it's more metadata. We're going to have to manage that for you, and that's exactly what's going to happen: you tell us what your application is going to be called, and we'll generate all of the metadata and set up the modules for you. The metadata about the UI, about the themes, everything you're editing is going to be stored in source control as well. So if you decide that you want to work, like I said, with somebody else who's not on top of this UI, they can pick up absolutely that code and work with it. But we also want people to be able to pick up this project having worked on the code directly for a period of time. So it's not like a project where you can really quickly pull together a user interface for an application and then export it: it's amazing, you've got a React Native app out the other end, but nobody can read it, and if you want to start working graphically on it again, you're possibly going to be starting from scratch. I don't know. Anyway, so everything is synchronized with source control onto the file system, so we are working on a Go project. I did promise something a little bit graphical, so here you have the first slightly better looking screenshot, I think. We're going to be working just now on the theme. We have a pretty crude mock-up of a smartphone device here, a generic one. The cutout is somewhere between a magical island and a place where, let's face it, cameras exist and we don't need marketing about it, but it's there. The UI is going to allow you to see how these applications work on a mobile device, smartphone, tablet, or a standard window, inset inside the application.
It's going to handle the scaling, the alterations that you would expect for these different types of devices. But we also need to present in light and dark mode, so you can see a toggle at the top of the color picker on the right-hand side. All of this lovely information is just saved directly to JSON. We've used the standard encoding package that Go provides to save it to the wonderfully comprehensive file that you see illustrated on the right-hand side. That wasn't easy; Go made it super easy, completely built in. But then we needed to load that data into the application that you're building. We didn't want to do any weird generation of things, stuff that could get in the way of working on the code like you would in a real code base. So we just store the file there and embed it into the application using Go embed. I hadn't realized how easy it was to work with this. I'm going to call it new functionality, because I work a few iterations behind the cutting edge, since we're trying to support as many devices as possible. To be able to stream this most effectively into your application: a Fyne app can have its settings set to a certain theme, you just call SetTheme. But it doesn't really expect a JSON file, it expects some Go code, a struct. We provided this FromJSON functionality in the theme package. You can see illustrated here how we can provide both light and dark alternative colors for applications. Less well illustrated here is that you can work with fonts, icons, and the different sizes, everything that makes an application feel the way that it does, that gives it a look. You can imagine how that file might have your brand identity or something stored in it, and you can port that across multiple different applications. Widget editing is the other thing that I feel is actually quite an enabler in a UI like this.
If you're thinking about building out your first graphical app, and you're looking at Fyne and you want to use Go but you're not quite sure how to get started, something like this, just this one screen, could provide you with the graphical editing that helps you understand how things are put together. The functionality in the user interface here maps to the APIs that are available if you're looking at this as a developer. Actually, let me just go back a little bit; I'll show you a little bit more later. You can see basically here there's a section highlighted on the user interface. We've selected that, and down on the right-hand side it is giving you the different areas of settings that are available, plus the option to insert more things into your user interface. I feel like I've said a little bit too much about JSON already. The fact is, it's really super helpful. I don't like to read it, and I don't know if it's the winner, but I'll agree with the folk that perhaps suggested XML was a little bit cumbersome in comparison. We use it again here. Actually, it is great that Go not only supports serializing something like a map to JSON, but, because we have a stateful widget toolkit, we're able to serialize the entire state of your application: the way the widgets are positioned, the containers around them, and the metadata for them, streamed directly to a JSON file. Again, illustrated over there. There is also a little blank field on line four for a name, a chance to put an identifier on your widget so that you can hook it into code later, because this is a low-code solution. We know we haven't solved all of the problems and you might want to write a little bit of Go, so you can hook into that through the name, which is going to be exported as a field on the application, which I can show in a little more detail.
As part of the Fyne project we've created a library which did start out as a project a little bit like this, but has now shifted focus to helping more applications load and save graphical state. It will also allow you to understand which widgets are available, so you can iterate through them. You can, at runtime, create new instances of an object based off some textual representation, or just the ID of the object type that you're looking to work with, which, as you can imagine, is pretty helpful if you're trying to generate at runtime a user interface that's normally fixed at compile time. One thing that I find really quite surprising, in fact I don't know how many people have realized this, is that your objects and types in Go in memory can be written out to Go code that reconstructs them, as though they were source code. That's pretty cool. It's like Stringer, but it's GoStringer. Has anybody heard of GoStringer? I'm really curious about that. Right, cool. So hey, that's really interesting: anything that you have in memory can pretty much be serialized as the Go code that generates it. You may need to write a little bit of code to make that fully functional yourself, but we built on top of that. That means that every time you save your user interface state, it's not just saving JSON, it's also spitting out the Go code that will generate the application source, so that you can be working with developers, but also so you can actually compile and run it. Which moves us on to compiling applications. Now, Go is amazing at cross-platform compilation, portability, building applications for anything, but there are certain requirements when it comes to building native graphical applications. Partly they want metadata around them, but partly people who own certain platforms put licensing restrictions in place and require that you run on their hardware or with certain toolkits present, so there's a little complexity here.
The project that I've presented and will illustrate uses local developer tools, so you're never beholden to anything at all. If you've got the tools installed, you can build the application that you have coded, have it run on the local system, and install it into your Applications directory or the Start menu, whatever the equivalent would be on Windows. For the local system that's really quite straightforward, the tools are there. For cross-compiling, we've had some really great contributions to the Fyne project called fyne-cross, from Luca, and Jacob and Cédric as well, so that you can, with that level of simplicity, build for any platform. It pulls down images with all the developer tools installed that you would need, but even then you still need to have it running on a Mac to do iOS development, or on a Windows box to ship off to the store. So, I'm not going to say this is proprietary, but if your business was interested in something that just worked in the cloud, there's going to be an option here that, good timing, there's going to be an option that allows you to spin up basically a pipeline in the cloud. It sends off the latest version of your code, and it comes back to you with the native app bundles for store deployment or ad hoc distribution, and included in that we also have support for over-the-air self-update of application software. This little diagram here is something I created a while ago to try and explain to people why platform-agnostic development, or building with a toolchain that works on any platform, makes a really big difference. If you think this would help to convince people to use Go more in your organization, there's a couple of postcards over there next to the stickers, and on the other side there's a couple of really sweet doodles which just show how coding nirvana can be achieved: great tooling, cute doodles. Lastly, before I actually show it in action, there's a project out there called FyshOS.
Again, you might get the theme here with the "Fy" starting the name. It is a Linux desktop operating system that is built, from the basic graphical level all the way up, entirely with Fyne and Fyne applications. We're moving to all of the applications being created or editable with Fysion, not just with source code, so that you could very well be running your desktop software and go, "I actually think this could be tweaked, something's wrong here, I can improve on it", and you could go and edit the software that you're running: load it in the UI, make some modifications, and then install it right back over the top of the software that you were editing. If that sounds really interesting, well, we're working in that direction; you can head over to fyshos.com to see where things are. There is a beta ISO, stick it in a virtual machine, nothing more, and some of this functionality is not in the version that's there yet, but keep an eye out, because this is all coming very soon to a platform near you. With that, I thought I might just try and show you that this works, and bring up the UI editing of an application on my system here. I have the bar of icons on my installed system here, and I see this calculator app. It's nothing special, it's a calculator, it's going to calculate some things. Clearly there are some things in this that could be improved; for some reason I think that's true. Let's actually go ahead and look at how this works. We can edit the calculator application and it's going to load it in the editor that I showed you. I was demoing this for somebody else just immediately before, so it's defaulting to smartphone, apologies for that. Of course we're really working on desktop software, so this is the more familiar button size, text size, that kind of thing. This C button doesn't seem to be quite right, it's very vague; I feel it should be a robust red warning. It is a warning, it should be a danger button.
Let's really indicate that there's a problem likely to happen if you press this. You might be one of these people that thinks Clear isn't quite substantial enough, so All Clear, or AC, might be more familiar to you. We could also look at the layout of our application, expanding down here on the containers. If I tap this, this tap here, this is a two-item grid, I think it's this container here. I could do something a little bit bizarre and make those rows... wow. That did make sense from a mathematical point of view, because this is an evenly spaced grid and I just asked it to do something a little bit daft, but actually the columns were just fine, so we can go back in there. This application, obviously, it's just some quick editing inline. I want my app, I want to test it, I want to run this piece of software, so I'll press Run. It's going to go and... okay, I forgot to save the file. Sorry, I should have asked if anybody realized what I'd done wrong there and offered a prize, but it's happened to me once before and it will happen to me again, I'm sure. So there you go, it has compiled the application and built it natively on our system. That is a live native application compiled for the current system, but it is just a binary, a single binary as any good Go application would be, running off our hard drive. But actually what I wanted to do was commit my really significant improvement to this calculator app to my operating system, so I'm going to use this other developer button called Install, and it is just going to improve every day of my life now. So when I go back to my calculator app over here, I now have a new version of this little piece of software, and I just feel like this has been a big improvement for me. Hopefully you can imagine a lot more possibilities, and you see the next project that you're going to build, and I would love to hear about that.
Anyway, let me just... oh, by the way, if you really like building applications, you like Markdown and think it's the future of all good things: this slideshow application is a Fyne application called Slydes, with a "y" of course, and it's just Markdown. Anyhow, sorry, I'm pressing the wrong button, aren't I? There we go. If you would like to learn more about what I have shown you, please do check it out, and any feedback you have is welcome. It's early days, but we're looking for people to get involved, beta testing. fysion.app is the homepage for everything that we're doing. It offers you links to, I feel like, some surveys: let us know what you think is going to be useful, and sign up to beta testing when it's available. The second link there is actually not connected to that app website: we recently completed some user interviews and got some really great feedback about where the opportunities might exist in this area. If you're intrigued, we're running a questionnaire-based follow-up, so through the second link there it would be really interesting to get your feedback. Like I said, this is all open core, and everything so far is fully open sourced under the BSD license. Actually, it's dual licensed with GPL as well, for the licensing of business add-ons later, but it is all out there with a compatible license. If you would like to see the source code, which I didn't tell you about but honestly is fully available and pretty straightforward, you can go to our YouTube channel. There is a video series called Creating an App Builder, I think. We used to do them weekly and then moved to monthly. There are 11 videos there that take you through almost all of what you've seen demonstrated, and the source code is currently in the tutorials repository, because we're just working on neatening up the first iteration of the actual product that I demonstrated to you there. But the majority of the code, as I said, worked through in the videos, is available in the tutorials repository.
Hopefully that's been really interesting. I'd love to take questions now, but also, like I said, there are these little cards out there, and if you're interested in building the future, pick up one of these stickers and slap it onto a laptop, and tell the next person how it is that Go is going to be the best, brightest future for graphical application development. Thank you very much. Did you all just realise we just saw an operating system user interface completely built in Go? Yeah. Wow. I'm shocked.
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
Hello everyone, thanks for coming here. I'm Danylo, I've been working on the Turnip driver for three years at Igalia, and I want to give a status update on what we have achieved so far and what's coming for us. Let's start with the new hardware we support. We now support a lot of hardware, and recently we started supporting the 700 series of Adreno GPUs. We already merged Adreno 730 and 740 support, and the merge request for the most recent Adreno GPU, the 750, is in review. There are a lot of changes between Adreno generations, mostly for performance reasons: registers changed, and there are many new performance features. We also currently implemented only direct rendering and not tile-based rendering. Adreno GPUs are a bit weird because they support two modes, tiling and direct rendering, the latter being the same mode that desktop GPUs use. Tile-based rendering is still work in progress for now. We also support almost all 600 series GPUs: there are five sub-generations of the 600 series and we support all of them, though there are some variants out there we don't support yet. To add a new variant of a GPU, we just need to change some registers. As for features and extensions, we now support Vulkan 1.3 and a lot of extensions with it. The most interesting one for us was dynamic rendering. It's rather simple for desktop GPUs, because they mostly don't care about render pass boundaries, but for tiled rendering on mobile GPUs it's a big deal. We have to stitch the render passes together, sometimes even at submission time. It can be really nasty; the code is barely readable there. And we have all extensions needed by DXVK, VKD3D-Proton and Zink implemented, so that's great. While we do not claim Vulkan 1.3 conformance, we do regularly test against Vulkan CTS. We test a lot of game traces, we test games, but with games it feels like a vacuum right now, because there are not a lot of real users out there.
And we don't have a proper CI with game traces, like RADV does. Other big changes we've done are in pipelines. Our GPU has a somewhat unique way of dealing with pipelines, and with all the new pipeline-related extensions we have to rewrite them every time in some way. But thanks to Connor Abbott, our pipelines are healthy. We've done a lot of IR3 optimizations (IR3 is our backend compiler); they add up a lot as time passes. And we've done a lot of work on debug tooling, because we have to reverse engineer the GPU. We deal a lot with unknown registers and unknown instructions, so we have to be able to quickly understand what's going on there. So I want to spend some time on the debug tools we've implemented so far. I gave a more in-depth talk at the last XDC; you can find it at this link. So what are our debug tools? We have GPU breadcrumbs, like in Google's Graphics Flight Recorder. We have the ability to replay command streams. We have the ability to edit command streams. We can print from GPU memory, we can print from shader assembly in these command streams, and we can debug the reading of undefined state from registers. I'll describe each of these features a bit more in the following slides. Why do we even need our own GPU breadcrumbs? There is already a solution for this at the Vulkan API level: it's called Graphics Flight Recorder, from Google. It can already tell you where a hang occurs, at which command, but there are two issues with it. It's too coarse, because, for example, the start of a render pass could translate into 10 or 20 blits in the worst case, and each of them may hang, so API-level tooling can be not great at this. And what really prompted me to implement breadcrumbs in our driver is debugging of unrecoverable hangs: when your computer or board just completely hangs, you cannot do anything, writes to disk don't come through. Graphics Flight Recorder doesn't work with that.
And to make it work, you would need some new Vulkan extension and so on. It was much easier to deal with in the driver itself, by doing all the things synchronously, and it worked rather well. But this tool is currently not used too much, due to the tooling I will talk about now. Okay, let's say you cannot even reproduce the bug. Some bugs are random hangs occurring in different parts of the game, and so on. The easy way to reproduce them is just to record all commands submitted to the GPU and then replay them back. For most hangs and issues this works great. There are a few caveats: it's necessary to record all buffer objects submitted, and there can be a lot of them for some triple-A game, so it works mostly for one or two frames. And not all issues are reproducible this way; there are some that are too finicky for this. But most of them are reproducible, so it's good enough. But it's not enough to just be able to replay the trace and see the hang in dmesg; you have to have a way to narrow it down. So what we implemented is a simple way to edit the command stream. We can decompile a submission to the GPU into very trivial packets, with the packet names in comments beside some of them. It's really easy to do for probably any GPU, and even in this form it's very powerful, because you can bisect the trace and find the exact command which hangs, even if it's impossible to determine in any other way. Then you can edit some part of a packet and see if it helps. If it solves the hang, you can deal with it as with ordinary code. What if the issue is inside the shader itself? We can already compile shaders from assembly, so with this replay tool we could add the ability to just print some registers from the shader. And the most trivial print is good enough: our print takes temporary registers for the address and so on, and the registers to print.
And it prints them: it increments a global counter and writes to global storage, and the replay tool just reads from it and prints the registers. It's trivial, and it was incredibly useful in reverse engineering the hardware. You get a trace from the proprietary driver, you decompile it, you edit the shader to print something, and you see the values and what's going on. It's incredibly useful. And the last tool in our toolbox is the way to debug undefined registers, stale registers. A lot of issues are due to reading a wrong value from the registers: some state is not emitted. Even games have issues of not emitting some state, and so on. A simple solution, at least for us, was writing random values to all the registers and seeing what breaks. And it mostly works. It's not that trivial, because there are at least registers which are written at the start of command buffers and never touched again, and there are registers written in each render pass, like the register sets that are set by pipelines. So we divided the registers into two categories: the ones that are set at the start of the command buffer, and the ones that should be stomped before each blit and render pass. Again, there are some other caveats, but it helped us quite a lot in debugging various issues when we implement new features and forget about some weird registers. Okay. What are the real users of our driver at the moment? Where could you see it? At the moment, they are emulators on Android. Why? Because proprietary drivers are terrible on Android. Not due to their code, but due to the update policy of proprietary drivers there: they are not updated at all. So users are stuck with terrible, many-years-outdated drivers, and these drivers have many issues. They don't have the necessary extensions. It's bad, it's really bad. And emulators need new features, they need the drivers to work, they push drivers to the limit.
So, for example, Yuzu is now able to load our driver, Turnip, and use it instead of the proprietary driver, and it works rather well for them. And I remember some other emulators use the same technique to deal with issues in the proprietary driver. Let's see an example. Here is some Zelda game running on Android on an Adreno 650 with our driver. It's running rather great, even if it's a previous generation of Adreno: the FPS is nice, it runs correctly, it's great. The proprietary driver there is a bit weird, to say the least. Maybe it works with the most recent one, but it's hard to tell: drivers are not updated, it's hard for users to update them, and so on. So there are lots of issues, and they probably don't test with these games. Okay, fair enough, we also don't really test these games. But the developers of at least Yuzu are willing to implement some debug tooling, like recording game traces, to make it easy for us to debug them. Because it's not that easy to launch a game without having the Switch itself; that wouldn't be legal. Okay, earlier I said that Turnip implements all the features for DXVK and VKD3D-Proton. So can we run desktop games? Yes, we can. Here you see a ThinkPad X13s laptop running Cyberpunk. It runs via a lot of layers: you need the FEX emulator to translate x86-64 code into ARM64, you need Wine for Windows compatibility, you need VKD3D-Proton, and so on. There are lots of layers. So we mostly test game traces, not the games themselves. We do test games, but mostly traces, because they are easier to deal with. But we will test games more soon. So what is the future for us? We need to support tile-based rendering on the 700 series, because while it would maybe not give a big performance boost for desktop games, it would lower power consumption and probably help on Android for games. Mark Collins, my teammate, is working on it, and I hope we will see it merged soon. It would be great.
And then we need to squeeze out even more performance. There are lots of performance features we still need to implement. So even if we don't reach the proprietary driver's performance, we expect to be somewhere near it. At least we hope for this. I hope. And in the distant future we want to implement ray tracing, because at least the 740 should be able to support ray queries, and the 750 probably could support ray tracing pipelines. I hope we implement this someday. And maybe we would be able to implement mesh shaders. That would be cool. Okay, another exciting development, not from us; it's not an Igalia project, but it's an easy way to run desktop games on Android. There is a work-in-progress project called Cassia. It's worked on by one of my teammates, again Mark Collins, and some other people. It's an amalgamation of Wine, DXVK, VKD3D-Proton and FEX on Android, and I hope Turnip will have first-party support there, so it would all be bundled together and work as one. Or you may say that people are already running desktop games on Android. Here you see some person running Assassin's Creed on their device, and it runs. Yes, that's true. There are several projects for this; this one is done with Termux. I'm not sure exactly what it is, but it's an even more unholy amalgamation of projects. It runs, it's really cool, but there are some performance issues, some issues with how all these moving pieces are stuck together. But people are running desktop games on Android, and that's super cool. Okay, that's all from me for today, so if you have some questions, suggestions... So you said you... Mic, mic? No, okay. So you said you could use this on Android to replace the proprietary drivers? Yes, you could use... So does that need a rooted device or a custom kernel? There are two cases. If you want to replace the proprietary driver for the whole system, you need root.
You cannot change system libraries without root. But if you want to use Turnip with an emulator, if the emulator supports this, it can just load the shared library packaged for it. And Google Play allows emulators to use custom drivers; they asked for it, and Google Play allowed it for this case. And the loaded driver talks to the proprietary kernel driver? Yeah, there is a proprietary kernel driver, KGSL, it's a downstream driver, so we have backends for several kernel interfaces. That's right. Anyone else, then? Sorry, what about the upstream kernel driver? Could you repeat the question, sorry? How does your implementation interact with the upstream kernel driver for the 7xx series? We develop Mesa for the 700 series on MSM, on upstream. Not exactly on upstream MSM, because we have some custom changes to make it work; not all of them are upstreamed, at least for the 750 GPU. But it will all be upstream, we need it upstreamed, it will be there. But the kernel side is not done by us, so we don't have much control; it's other people working on it. Okay, I guess that's all. Thank you.
Graphics stack updates for Raspberry Pi devices
So, as I said, thank you for attending this talk about the status update of the graphics stack on the Raspberry Pi devices. Thanks also to the organizers of FOSDEM, and especially to the people organizing this devroom, which is great. So, let me introduce ourselves. My name is Juan, and with me is my colleague Chema Casanova. We work at Igalia in the graphics team, and we are working on the Raspberry Pi graphics stack. So, what is this talk about? It basically covers the changes that happened in the graphics stack since the release of the Raspberry Pi OS Bullseye edition in November 2021, up to the latest version, which is Bookworm, released several months ago in October 2023. So, for people that are not familiar with Mesa: we have like five Raspberry Pi devices, well, there are more, but they are variations of those devices. The Raspberry Pi 1, 2 and 3 use the VideoCore 4 GPU from Broadcom, and the name of the Mesa driver is VC4. The Raspberry Pi 4 and 5 use the VideoCore 6 and 7, and there the name of the driver changes: it's V3D for OpenGL ES, and in this case there is also support for a Vulkan driver, which is called V3DV. So, what things happened? Well, probably the most exciting one is the release of the Raspberry Pi 5. Its GPU is an evolution of the one in the Raspberry Pi 4: the same architecture, but with more benefits. It has a higher clock rate, so it's faster. It supports up to eight render targets. It has better support for subgroup operations, which is interesting for Vulkan. And it brings a lot of changes at the instruction level, which allows more compact shaders that run faster. The drawback is that it has a bit fewer registers, so it suffers a bit more register pressure. The support is integrated in the V3D and V3DV drivers, and it was submitted for review almost the same day the Raspberry Pi 5 was announced.
And now it's released in Mesa 23.3, and the kernel support is in Linux 6.8, which is required. As I said, this GPU is more or less an evolution of the one in the Raspberry Pi 4, so nowadays the features are more or less the same in terms of driver implementation: it supports OpenGL ES 3.1 and Vulkan 1.2, plus a non-conformant version of OpenGL 3.1; we will see why in a moment. So, from the point of view of the Mesa drivers: for the OpenGL driver, one of the important things was that we promoted it from OpenGL 2.1 to OpenGL 3.1, with some caveats I'll explain later. I think this is quite important because, in the end, the Raspberry Pi is intended to be used as a desktop PC in most cases, so targeting OpenGL desktop apps is quite interesting. There are some applications that require OpenGL 3.0-something, and now they run on the Raspberry Pi. The upgrade from Bullseye to Bookworm allowed us to expose 35 new extensions across OpenGL and OpenGL ES. As I was saying before, the driver is not fully compliant with 3.1 because there are some missing features in the hardware. For instance, this version requires 8 render targets; this is fixed in the Raspberry Pi 5, but the Raspberry Pi 4 only supports 4. Then the hardware itself does seamless cubemap filtering, while the OpenGL spec requires non-seamless filtering. And then there are some formats that are not supported. But all in all, these are probably not the most used features, and we support everything else, so from a practical point of view probably any application that uses OpenGL 3.1 will work on the Raspberry Pi. Then, in the Vulkan driver, we moved from Vulkan 1.0 to 1.2, so that's Vulkan 1.0, then 1.1, and then 1.2, which meant exposing around 80 new extensions if you compare both versions of the driver, from Bullseye to Bookworm. So there are a lot of new extensions. I mentioned extensions dealing with subgroups, as I said, which is very interesting for Vulkan.
There are extensions dealing with geometry shaders. But probably the most important work done was improving performance. When Vulkan 1.0 was released, the target was just having a conformant driver, so we didn't spend any time on making it fast. During this period we worked a lot on making it more performant, specifically in the shader compiler: improving the liveness analysis and adding strategies to make the shaders smaller and faster. The good part is that the shader compiler for the Vulkan driver is actually shared with the OpenGL one — both OpenGL and Vulkan use the same compiler — so all the improvements in the shader compiler also benefit OpenGL. Basically, the improvements apply to both the Vulkan and the OpenGL drivers. Another thing worth mentioning is that Zink, the driver that implements OpenGL on top of Vulkan, now works with the V3DV driver, which means you can use Zink to run OpenGL applications. Related to that, Roman Stratiienko was working on Android support, so now you can run Android on the Raspberry Pi 4 with the Vulkan driver. And now my colleague will continue with the work in the kernel. — Okay. Well, continuing with our work on Vulkan on the Raspberry Pi: we needed to implement several features that were not available in the hardware, so we created what we call CPU jobs. These are parts of the expected behavior for which there is no hardware support in the GPU, so we implemented them in user space in the Vulkan driver. That mainly affected some queries about performance counters, timestamp queries, and indirect compute shader dispatch. This caused issues, because when we were submitting the different command buffers to the GPU, we needed to stall the stream of GPU submissions, do the work in user space, and then continue after having the result.
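One of those CPU jobs, indirect compute dispatch, is easy to sketch: the workgroup counts live in a GPU buffer (three 32-bit values, the same layout as Vulkan's `VkDispatchIndirectCommand`), and since the GPU cannot read them itself, the CPU reads the buffer and issues a plain dispatch. An illustrative Python model — not the actual driver or kernel code:

```python
import struct

def emulate_dispatch_indirect(buffer_bytes, offset, dispatch):
    """CPU-side emulation of an indirect dispatch: read the
    (x, y, z) workgroup counts from the indirect buffer, then
    issue an ordinary direct dispatch with those counts."""
    x, y, z = struct.unpack_from("<3I", buffer_bytes, offset)
    return dispatch(x, y, z)

# The indirect buffer would normally have been written by the GPU.
calls = []
indirect = struct.pack("<3I", 4, 2, 1)
emulate_dispatch_indirect(indirect, 0, lambda *grid: calls.append(grid))
print(calls)  # -> [(4, 2, 1)]
```

Doing this in the kernel, as described next, means the read-then-dispatch step no longer forces a round trip back to user space between GPU jobs.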
So one of the improvements we just recently landed upstream in the kernel was these kernel CPU jobs: we moved those operations, which are already well defined, into kernel space. Now, when we create a submission in the Mesa driver, the kernel handles it, so we don't stall the submission of GPU jobs. That was quite an interesting improvement in terms of performance, because before this there were a lot of stalls in the submission path. Another feature that was quite interesting for users was knowing whether they were really using the GPU when running their applications. It happens to a lot of developers: "I don't know if this is really using the GPU." So we implemented GPU stats. We expose per-process usage stats using the standard way of doing it in DRM, and we also expose global stats, so an application that just wants the overall GPU usage can read a single percentage value. Otherwise, you would need to go through every process, check how much GPU time each one has used, and sum it all up. Because we use the standard interfaces, you can run applications like gputop, which is really nice because it works across several drivers. For the global stats, as there is no commonly defined interface to expose them, we are currently using sysfs. The hardware lacks the features that other drivers use to provide stats — as is the case, for instance, with Intel — so we went with a simple approach: in the DRM scheduler, when we submit a job to the GPU we take a timestamp, and when the job ends we take the finish time. As we only process one job on each queue at a time, that gives us the information about how much the GPU was used. Here you can see, for example, on the top right, a graph widget where users can check the GPU usage.
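The per-process stats mentioned here follow the standard DRM `fdinfo` format, with keys like `drm-engine-<name>: <nanoseconds> ns` read from `/proc/<pid>/fdinfo/<fd>`. A minimal Python sketch of how a tool like gputop derives a busy percentage from two samples — the engine name and all numbers below are invented for illustration:

```python
import re

def parse_fdinfo(text):
    """Extract per-engine busy time (ns) from DRM fdinfo text,
    e.g. 'drm-engine-render: 1000000 ns'."""
    engines = {}
    for line in text.splitlines():
        m = re.match(r"drm-engine-(\w+):\s+(\d+)\s+ns", line)
        if m:
            engines[m.group(1)] = int(m.group(2))
    return engines

def busy_percent(before, after, wall_ns):
    """GPU usage per engine between two samples taken wall_ns apart."""
    return {name: 100.0 * (after[name] - before[name]) / wall_ns
            for name in after}

sample0 = "drm-driver: v3d\ndrm-engine-render: 1000000 ns\n"
sample1 = "drm-driver: v3d\ndrm-engine-render: 51000000 ns\n"
print(busy_percent(parse_fdinfo(sample0), parse_fdinfo(sample1),
                   100_000_000))  # -> {'render': 50.0}
```

The 50 ms of engine time accumulated over a 100 ms sampling window reads as 50% busy, which is exactly the per-process figure a task manager can display.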
And in the task manager we now have the information about GPU usage. For example, on this screen the main user of the GPU is Chromium, and the second one is the compositor — in our case Wayfire — because it is compositing all the different windows and surfaces we have there. So, those are the highlights of the kernel changes. One of the main important changes we made from Bullseye to Bookworm in Raspberry Pi OS was changing the default desktop. Previously, in Bullseye, we were running Mutter with the X server on the Raspberry Pi 4 devices, and it was OK. For the previous generations of hardware — Raspberry Pi 1, 2, and 3 — we were running the desktop we had before, which was Openbox with the X server; Mutter was too heavy for that generation of hardware. With the Bookworm release, all the Raspberry Pis we ship in the official images get a Wayland desktop using Wayfire. For the Raspberry Pi 4 and 5, it is the default. For the previous generations, we still maintain Openbox and the X server, but I want to comment on that in the last part of the talk. So, Wayfire uses OpenGL for doing the composition. It is based on the wlroots backend, and it is quite tied to OpenGL: all the plugins are implemented using the OpenGL API. One of the most important goals in this transition from Bullseye to Bookworm was that the user experience shouldn't change much. As you can see — Simon Long from Raspberry Pi put a lot of effort in here — if you don't notice the change of background, it is difficult to figure out the differences between the previous version and the new one: that is Bullseye, and this is Bookworm running. Everything has been rewritten — the panel, the theme — because they are different compositors. And now, let's go to the desktop on the previous generations of hardware.
Well, we are still using the X server with Openbox — that is what we have. It has been this way since Bullseye; we didn't try to switch them to Mutter. The main reason for still using this is that we need to use software composition: we use the CPU to render the desktop because of a hardware limitation. These devices have a GPU memory limit that is 256 megabytes by default, and the problem is that we have no control over when the GPU memory, which uses CMA (the Contiguous Memory Allocator), runs out. So the moment you open a new Chromium browser tab, which uses CMA memory, and we run out, the next thing that makes an allocation could be the X server — and it crashes — or the compositor. The solution that has been in place all this time is that, on these devices, everything uses CPU software composition: GLAMOR has been off the whole time, and there is no hardware acceleration by default. You can enable GLAMOR and get hardware acceleration, say to run a full-screen application, but then you should expect your desktop to crash at any moment. There are a lot of memory-hungry applications, like the browser, that can kill it: open six tabs and your desktop is completely frozen. So, during the previous development cycle, on Bullseye, we wanted to make it possible to enable hardware-accelerated applications: if you launch glxgears, it should not be using llvmpipe, but the driver for the hardware. We managed to do that. We enabled hardware acceleration for the applications while still doing software composition for the rest of the desktop. So, in case you run out of memory, what crashes is just the application; you don't expect the X server or Mutter or whatever other application to crash, because they are not prepared for a memory allocation to fail — they assume it always works.
This was implemented by modifying the modesetting driver in the X server. We implemented support for DRI3, but without needing GLAMOR — the way it is currently written, it is GLAMOR that enables DRI3. So on these devices we can use DRI3 even though we don't have OpenGL during the composition. There is a merge request for the X server, but there is not much interest in integrating it; we understand, because X server development has basically stopped. But we have been using it downstream for almost a year, and it was a huge improvement for users. With these changes, we avoid the problems of the GPU memory subsystem. When we were about to release Bookworm, with the idea of transitioning to Wayfire as the stock compositor, the question was: what can we do for the older generation devices? We needed to rethink how we had solved these problems with the X server, now with Wayfire. We need to do software composition using the CPU, and we would again like to allow hardware-accelerated applications. The problem with using Wayfire for software composition is that Wayfire is quite tied to OpenGL. It uses the wlroots backend, and as you can see in parts of the code — mainly in the plugins implementing the different effects — there are calls to the OpenGL API. We don't want that. The first thing is that wlroots already has a Pixman backend that works: you can just transition the parts of wlroots that Wayfire uses, with small changes, to use the Pixman backend, and it works. The next part is reimplementing all the parts that were tied to OpenGL in the different plugins we are going to use in the distribution, using the Pixman rendering logic. There are some quite complex ones that we didn't need, so we didn't convert those. This way, we managed to get all of Wayfire's rendering done using CPU rendering.
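Per pixel, software composition boils down to blends like the Porter-Duff source-over operator — note that it has to read the destination pixel back, which is the read that becomes expensive on uncached memory. A minimal sketch, assuming 8-bit premultiplied RGBA (the convention Pixman uses); this is illustrative, not Pixman's actual code:

```python
def src_over(src, dst):
    """Porter-Duff 'source over' for one premultiplied RGBA pixel,
    8 bits per channel: result = src + dst * (1 - src_alpha).
    A software compositor runs this for every pixel of every
    translucent surface -- reading dst back from the framebuffer."""
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    inv = 255 - sa
    return (sr + dr * inv // 255,
            sg + dg * inv // 255,
            sb + db * inv // 255,
            sa + da * inv // 255)

# Half-transparent red (premultiplied) over opaque white:
print(src_over((128, 0, 0, 128), (255, 255, 255, 255)))
# -> (255, 127, 127, 255)
```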
The problem is that if you do that and start doing blending operations, on this architecture they become really slow. Reading back from the framebuffer while blending — with uncached memory, where every change goes straight to memory — is terrible. So we experimented with using non-coherent memory for the buffers we render into. That means that if you write from the CPU and then put the buffer on the display, there may be no coherency, so you need to start flushing caches in some places; you need to handle that. Some funny things happen because 32-bit ARM is different from 64-bit: things that work in one place don't work in the other. On 32 bits, you can just flush the memory before putting it on the display and it works; on 64 bits, it doesn't, because in the end the flush doesn't do anything on that architecture. So we need to handle the synchronization in the compositor. The upshot of that change is that everything runs fast enough. The remaining problem is that when you enable non-coherent buffers, you only want to do it for the compositor, not for the rest of the applications. That is complex, because some applications don't work with non-coherent buffers, and we are still deciding how to deal with that — maybe enabling it with a parameter, creating a new IOCTL for getting non-coherent buffers, or something else. For the other part — getting hardware-accelerated applications — that was quite fast, as we already had the knowledge from doing this on the X server. In the end, we needed to make the Pixman backend in wlroots handle passing modifiers with the memory buffers, and then it was already working. So, I'm going to show the current work in progress. This is our Raspberry Pi 3 running the desktop, using the non-coherent buffers — otherwise you would see how slowly the windows move. The performance is quite good.
Some of the more complex, most expensive things are the shadow calculations — you cannot imagine doing this on the CPU; every time you scale a window, it's costly. You can see that this glxgears is using hardware acceleration. It's not the best we can do, because there is the possibility of using a different hardware plane to show it directly on the display, but for now we are blending it in the compositor. We have enough time, I think, so we are going to see several plugins working. As a conclusion: we are at the point of maybe thinking about shipping this to users, but it's still not ready. One of the things about Raspberry Pi devices is that they try to support every generation of hardware: you can run the latest Raspberry Pi OS on a Raspberry Pi 1 and it will work. We have already tested this with the Raspberry Pi 1, which has slower memory; Juan was doing that, and it was good enough — comparable with the results we are getting with the X server. Here we see Chromium running, using hardware acceleration. The good thing is that, as we are not spending CPU memory, we can run more applications. You can still crash Chromium by opening — I think it's eight tabs — but in that case you only lose one window. This is Zoom working; this is all software composition. And I think that's all from us. Ah, and the window switcher that Wayfire has by default: we reimplemented it with Pixman, and we tried to do a simpler option, but this one was already working fine in the end. The most complex part we maintain is doing the transparency, using the alpha channels, in software. So, questions — I think we are on time. [Q:] Which features do you need the CPU jobs for? Are they used a lot — do applications need them, and does that impact performance? Well, our colleague Maíra Canal did a lot of the upstream work on this, and the patches have already landed.
The question is which features in particular need to be done on the CPU and cannot be done on the GPU. I think I already mentioned them. Some are related to performance counters — mainly resetting the counters while you are running command buffers. That cannot be done from the GPU; you need to write to a particular register. Another one is related to getting timestamps, because there is no support for getting a timestamp from the GPU. And the other one is indirect compute shader dispatch: when you are sending several instances of a compute shader, you need to handle it from the CPU, because there is no support in the GPU. So now you just submit the buffer to the kernel and the kernel handles it; before, you were in user space, sending one, waiting, and sending them one by one. So, well, time's up. Thank you very much for your attendance.
Delegated compositing utilizing Wayland protocols for Chromium on ChromeOS
So, hello everyone. My name is Maksim. I'm a browser engineer at Igalia, and today we are going to talk about delegated compositing utilizing Wayland protocols for Chromium on ChromeOS. Here's today's agenda. First, we'll talk about the goals and motivation of the project — why we have Wayland on ChromeOS and why it's in Chromium. Then I will talk a little bit about what Lacros is. I will also need to cover the Chromium display compositor a little, to give you an idea of how it works and why we actually needed delegated compositing there. Then the delegated compositing itself, the Wayland protocols, and the big picture of what we actually have. So: Chromium and Wayland on ChromeOS. There are quite a few vendors shipping ChromeOS on their devices, and as the devices age, they stop receiving updates, which leaves them with an old browser and so on. To improve that, and to improve the maintainability of the devices, it was decided to split the Chrome browser from the ChromeOS system itself, because they are tied together. That would also make it possible for them to receive browser updates. But how is it possible to do that? The idea was to decouple the browser, as I said, from the operating system itself — that was called the Lacros project. ChromeOS itself has a system UI and a window manager called Ash, and Chrome was tied to that operating system. At this point, there was also a Wayland implementation already in ChromeOS, and it was decided to use Wayland. Basically, in 2015, if I'm not mistaken, ChromeOS got its own Wayland server implementation, called Exo. It's currently used by ARC to run Android apps on ChromeOS, and also by Crostini to run Linux apps. And around 2016, we started to port Chromium to Wayland; on Linux, you can use Chromium with the Headless, X11, and Wayland backends.
So it was kind of a natural choice to employ that implementation and have the browser running on it. Basically, Wayland is used for graphics and window handling, with the stable protocols employed plus some custom extensions. For the high-level features, like file picking, crosapi is used — basically Google's own IPC implementation, called Mojo. This is similar to Win32 and Cocoa. But what is Lacros? Lacros is a project to decouple the Chrome browser from the ChromeOS window manager, called Ash, and from the system UI. In this diagram, the green box is the ChromeOS operating system, and the yellow box is the Lacros browser, which uses the Wayland backend through the Ozone layer. The Ozone layer is basically an abstraction in the Chromium browser that lets you implement your own backend — as I said, on Linux there are X11, Headless, and Wayland — and it's switchable at runtime. ChromeOS itself runs on DRM, but you can also use X11 and run the ChromeOS emulator on your Linux device. So, Lacros uses Wayland to communicate with Exo, which is built into ChromeOS, and which forwards the input events and handles the graphics communication. But there was a problem: this split resulted in performance and resource costs. Why, and how to mitigate it? To understand why it was causing a problem, we need to switch to the Chromium display compositor and understand a little how Chromium actually draws frames. As you may know, Chromium has a multi-process architecture. We have a GPU process, or Viz service process, and we have clients: the renderer processes, the browser process, and also a video client, which sends the video frames. Basically, we call them frame sinks.
The way it works, if we are talking about GPU acceleration and GPU rasterization, is that — taking the renderer process as an example — it prepares paint operations for the compositor frame. Then, when preparing the final compositor frame, we submit those paint operations to Skia on the GPU process; that is GPU rasterization. And we produce textures. These textures basically represent tiles, if we divide the whole window into tiles. The compositor frames have references to the tiles, plus some frame data like masks, filters, clipping, and other things. On the right side, you can see the Viz service process, or simply the GPU process. It represents clients as surfaces, and each of the surfaces has its own compositor frame. So we need to aggregate all the surfaces into a single compositor frame and do the final compositing. This is a high-level overview of how it worked before delegated compositing. Lacros was aggregating the quads, which would end up creating a final surface, and that final surface was of course backed by a single buffer. It was sent over Wayland to Exo. Then, on the Ash-Chrome side — Ash-Chrome, you can call it ChromeOS — it was maybe combined with frames from other windows, say the system settings if you had that open, and the compositing was done once again at that step. That resulted in double compositing and a big resource overhead. But how to fix that? The solution was to use delegated compositing. Basically, we kept the aggregation step and created our final compositor frame, but the quads we got, which are basically the textures, all had to be sent over the Wayland protocol to Ash for the final compositing.
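One way to picture the change is to count blend passes per frame. Before, Lacros blended all its quads into one intermediate buffer and Ash then blended that buffer again; after, the quads cross Wayland untouched and only Ash blends. An illustrative count, nothing Chromium-specific:

```python
def blend_passes(num_quads, delegated):
    """Count per-frame surface blends: compositing locally costs
    one pass per quad in Lacros plus one more in Ash for the
    resulting buffer; delegating leaves only Ash's passes."""
    lacros = 0 if delegated else num_quads   # local aggregation step
    ash = num_quads if delegated else 1      # final system compositing
    return lacros + ash

print(blend_passes(6, delegated=False))  # -> 7  (6 local + 1 in Ash)
print(blend_passes(6, delegated=True))   # -> 6  (Ash only)
```

The bigger win in practice is not the raw pass count but dropping the full-window intermediate buffer and the second full-window composite entirely.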
And of course, I should say: this is basically about serializing the Chromium compositor frame, sending it over a couple of IPCs through Wayland to Ash, which at that stage deserializes the data it receives and basically recreates the same kind of compositor frame for the final compositing. To achieve that, a couple of protocols were needed — at first I thought we had implemented more custom things, but in the end it wasn't that much. Wayland subsurfaces, which are standard: each quad — let's say we were sending quads as overlays — is represented by its own surface. Of course, Wayland buffers and the explicit synchronization protocol, because we want the job to be asynchronous. And the main thing is the surface-augmenter protocol, because we wanted to send this data from the Chromium compositor frame with additional information like rounded corners, clipping, and also pixel precision — this is one of the important things — so we needed to make our own protocol extending the Wayland surface. Also, in the beginning we used our own protocol for single-color buffers, but since upstream now has a single-pixel buffer protocol, we just employ that one, so that we don't need to create a real buffer. At first, when nothing was there, we were just clearing a buffer to a certain color, but that's not really efficient. So, why did we also need to pass this rounded-corner and clipping information? The reason is basically that when Chromium rasterizes the quads into textures, those do not have any masks applied, so when we do the final compositing step, we apply those masks, filters, and so on and send them to Skia, which draws the final picture for us.
As for pixel precision, the problem is that Chromium basically works in pixels, and since Wayland uses DIPs (device-independent pixels), this resulted in some precision loss: when the quads were composited together, we could see some glitches. To overcome that, we added some additional requests to the surface-augmenter and started to pass this information using wl_fixed, basically, which allows us to use fractional values. It was also required to update the viewport destination and some of the other state, like setting the transform and setting trusted damage — because when we, for example, change the Z-order of the Wayland subsurfaces, the question at some point is whether we need to recalculate the damage or not. Basically, all of that is managed with these additions. There may be some other bits, but I would say those were the most important ones. And this is the big picture of how everything is implemented. At the top, we have the Lacros Viz process and the Lacros browser. Lacros Viz basically prepares the frame with the quads and sends it over Wayland to Ash-Chrome, which then creates the same compositor frame that Lacros would have had if it wasn't delegating but was compositing itself. It prepares the final frame, prepares the overlays, and sends them to DRM, and that's it: you have the final frame with the system UI and the browser content as well. That's it. Questions? No? Go ahead. Yes? Well, I can just repeat the question. Okay, so the question was whether GTK and Qt can also benefit from this. Do you mean the Chromium browser, or the toolkits themselves? [Regular apps using GTK or Qt.] Yeah, I think so. Basically, wherever double compositing happens, it is possible. We had to use some additional protocols, because since ChromeOS is a really closed environment, we can do whatever is possible, whatever is convenient for us.
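The `wl_fixed` type mentioned here is libwayland's signed 24.8 fixed-point format, so the finest coordinate step a Wayland protocol can carry is 1/256 of a unit — which is what makes subpixel-precise quad positions possible. A Python sketch, semantically equivalent to libwayland's `wl_fixed_from_double`/`wl_fixed_to_double` (the real C code uses a bit-twiddling trick instead):

```python
def wl_fixed_from_double(d):
    """Encode a double as wl_fixed_t: signed 24.8 fixed point,
    i.e. the value scaled by 256 and rounded."""
    return int(round(d * 256.0))

def wl_fixed_to_double(f):
    """Decode a wl_fixed_t back to a double."""
    return f / 256.0

# A subpixel position survives the round trip to 1/256 precision:
x = 12.34375                    # exactly 12 + 88/256
f = wl_fixed_from_double(x)
print(f, wl_fixed_to_double(f))  # -> 3160 12.34375
```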
But I think it is possible for GTK to get this performance improvement as well, because if the Wayland compositor can do that, why not? [Q:] Yeah, basically in a similar direction: Chromium on a regular Linux Wayland compositor would benefit from such features as well — there is double compositing there again. So, have you looked at getting generic protocols upstream to manage that? Right now you have custom protocols, but for it to work on regular Linux one day, you'd need a generic protocol — have you looked at doing that? So the question is basically whether Chromium on Linux can benefit from the same implementation, and whether we considered creating some generic protocol and upstreaming it. Well, coming back to pixel precision and rounded corners: for pixel precision, if the browser doesn't run at some custom scale, the scale is one, so it's fine — we don't need that kind of protocol. But for the rounded corners, we could probably do that processing on the Chromium side, but it's not very efficient, right? It should be possible, but creating a protocol and upstreaming it would take quite some time. I personally hadn't thought about that, but it's an interesting idea for the future, of course. [Q:] I mean, especially for embedded it can also help: if, for example, the video in the browser is offered as a subsurface, and the compositor on the embedded device can then put that video on a hardware plane and the rest is not re-rendered, then you can benefit from these kinds of things much more easily. Yes, of course. [Q:] Do you delegate all the compositing, so the compositor can decide what to put on a plane? Well, at least we can submit the video frame as an overlay. If I'm not mistaken, somebody was doing this for Chromium — I actually saw the patches. [I think that landed.] Yeah, probably, yes.
I didn't pay attention to that; I was busy with Chromium itself. Yeah. Yes? [Q:] What's the granularity of these subsurfaces? How many would you expect to have on a regular web page — are we talking almost every screen element, or something more coarse? So, the question is how many subsurfaces we are going to have — how the page is divided, whether each element gets its own subsurface, or whether it's done some other way. Well, basically, if you imagine a simple page with no additional textures and so on, we can split the page into tiles — there will be, I don't know, maybe six tiles, something like that. So that's how many you are going to send. But if you take, for example, MotionMark, there are some tests — like the images test — that can create hundreds of those textures, and then we start sending all of them over the pipe. But there is a limit for the IPC, so we have to limit the number of quads we are able to send — if I'm not mistaken, it's limited right now to 50. Beyond that, it just doesn't make sense to do any delegation; it becomes too expensive, in the sense that there would be too many subsurfaces. If we could squash them together, that would definitely help, because it seems this wasn't a use case that was considered when Wayland was designed. So, any other questions? Thank you.
Flutter in Embedded
Okay, we can start. Hi everyone. Today I'm here to present Flutter in embedded systems, specifically embedded Linux. A quick introduction: I'm André Rikki, I'm from Italy, and I work at Amarula Solutions as an embedded Linux consultant and developer. My background is mostly C++, on both console and UI applications using different frameworks, and recently also Flutter. In this talk, I'm going to present the Flutter framework from a developer's point of view, then go into how the framework works in embedded and how it integrates with the most common build systems, such as Yocto and Buildroot. And if there is enough time, I'll show a quick video of a commercial product that we developed with one of our customers. So, what is Flutter? Flutter is a UI framework developed by Google. It was first released in 2015 for Android and was later ported to iOS, Windows, Linux, and web applications. The idea behind the framework is to have a single framework and codebase to create good-looking UI applications, natively compiled and multi-platform. It uses Dart as its programming language, and we will talk about that later. So, let's go through the advantages of Flutter. First of all, Flutter is fast, because it compiles natively for ARM and Intel machines; you can expect great performance both at startup and at runtime. Also, the framework was designed to help developers achieve 60 frames per second, so you can expect fluid and responsive UIs on any kind of device. Now, let's talk about Dart, the programming language used by Flutter. Being modern and designed precisely for UIs, it comes with multiple advantages and tools that are really helpful when dealing with UI applications. First of all, the language is completely asynchronous, so it's quite hard to make the application freeze.
By contrast with C++, for example, I have seen multiple times in my experience where, say, in the UI loop, people were opening files or doing blocking operations, and performance was really bad. In Dart this is avoided, and the architecture of Dart abstracts away the complexity of typical shared-memory multi-threading. This is really, really important in my opinion. Another important point is that the language is completely null safe, so it's practically impossible to get a segmentation fault in a Dart application. This is another really important point, because the UI application is what the final user sees, so it's important for it to always be responsive and alive. Finally, all error management is handled through exceptions: even if there are errors during the execution of the application, exceptions are simply thrown, and even if they are not caught, the application keeps running — we will see an example in a later slide. And all of that in an easy-to-learn language with familiar syntax. As I said, I came from a C++ background, and working with Dart was really smooth and really easy to learn. So, Flutter is fast and Dart is great, but that's not all: Flutter, the entire framework, is also productive, and by that I mean it makes developing and maintaining applications a breeze. One of the most important features is hot reload and hot restart. Hot reload is the ability to apply changes without recompiling and restarting the application. As you can see in the GIF, by simply changing the code from a dark theme to a light theme, without recompiling, the changes are applied directly in the running application. This is really important: it greatly reduces development and maintenance time, and also the stress on the developer. Also, being modern, it comes with a set of useful tools, such as static analysis, widget introspection, debugging, logging, and much, much more.
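The "don't block the UI loop" point can be illustrated with an analogous pattern in Python's asyncio — a sketch, not Dart code, with invented names and timings: the blocking I/O is pushed off the event loop so the UI keeps ticking while it runs, which is the discipline Dart's async/await pushes you into by default.

```python
import asyncio
import time

def read_config_blocking():
    """Stands in for slow disk or network I/O."""
    time.sleep(0.2)
    return "config"

async def ui_tick(loading):
    """The 'UI loop': it must keep iterating to stay responsive."""
    ticks = 0
    while not loading.done():
        ticks += 1                # e.g. paint a frame, handle input
        await asyncio.sleep(0.01)
    return ticks

async def main():
    loop = asyncio.get_running_loop()
    # Off-load the blocking call to a worker thread instead of
    # calling it directly on the event loop.
    loading = loop.run_in_executor(None, read_config_blocking)
    ticks = await ui_tick(loading)
    print("loaded", await loading, "after", ticks, "UI ticks")

asyncio.run(main())
```

Had `read_config_blocking()` been called directly inside the loop, the UI would have frozen for the full 200 ms — exactly the C++ anti-pattern described above.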
Finally, Flutter itself is flexible. As I said, it's meant to work multi-platform, and you can expect the same look and feel of the application on any screen. From a more commercial point of view, this means that if you need to deploy an application on embedded and also on mobile and Windows, you can expect the same look and feel, and this is really important from the final user's point of view. Also, the ahead-of-time, native compilation of the code allows for really fast, great performance on any platform. All of that in a single programming language, of course. This is my typical setup: I use Visual Studio Code with the official Flutter extension, and on the right is the running application. As you can see, on the top left there are the typical run and debug tools, such as start, stop, and step into, but also hot restart; hot reload is built into the Flutter extension, so if I apply any change in the code and save the file, the changes are automatically reloaded in the page. And at the bottom, there is the debug console. As you can see, an exception is thrown because the application tries to save a file when starting, but even with this error, the application is still running and works without any problem. In my opinion, this setup is really easy to use and really productive, and I think that, from the developer's point of view, it makes maintenance and development really easy and less stressful. I'm happy to see that Flutter is becoming more and more popular in embedded systems, specifically Linux — yesterday I saw a different talk on this topic. This is important because the community is really active and huge, and as it becomes more popular on Linux, if you have any difficulties or face any problems, most of the time you will find a solution online. Then, there is a huge list of packages.
There is an online repository where free packages are hosted and developed by the community. They come in different types. I use packages to visualize, for example, Lottie animations or SVG files, but there are also packages that are more code-oriented, such as MQTT communication or file parsing. Flutter is actively developed and updated. In the last year, I had to update the Flutter version both on my laptop and on the target multiple times. So Google and the community keep updating the framework, improving security and adding new features constantly. Finally, it is used by big tech companies, first of all Google, the creator, but also BMW and Toyota. And those companies keep the project alive by contributing, because Flutter is completely free and open source. It is under a BSD-3 license, so you can use it without any trouble. Now, let's do a quick comparison with the most famous UI frameworks for embedded Linux. First of all, LVGL. The first point for me is the most important one: C is not Dart. C is a really powerful language, but when it comes to UI applications it's not so easy to use, and it's really easy to mess things up. Instead, Dart is designed for UI applications, and we saw all the advantages that the language comes with. Also, hot reload and hot restart: there is no way to achieve that in LVGL. Of course, you have to rebuild and recompile and redeploy the application every time. Instead, Flutter has this amazing feature. Also, Flutter has more platforms supported, because LVGL can only run on desktop or embedded, while Flutter can also run on mobile. There are more packages, or call them libs, available, as we saw in the previous slide. And of course, Flutter has a bigger community behind it. Finally, with Flutter it's much, much easier to build and publish the application. We'll see in a later slide how to integrate a Flutter application inside Yocto, and you don't have to mess with the build arguments.
It's all handled by the framework and the Yocto project. Instead, with LVGL, if you need to cross-compile, it can be a bit tricky. Then, Flutter versus Qt: C++ and QML versus Dart. C++ is a step up from C, but it can still be quite difficult. I saw multiple times Qt applications having really bad performance issues because the C++ was not optimized. And QML is designed for UI applications like Dart, but first of all it's an interpreted language, so if you start doing any kind of logic inside QML, the application will be crap. And I think that Dart is still much better for UI development. Here again, hot reload and hot restart. It is possible to achieve hot restart and hot reload with QML, but in my experience I was never able to do that. Most of the time, QML is strictly connected to C++ for modelling and such, so of course you need to recompile everything. Third point: Flutter, as I said, is completely free and open source. Instead, Qt has commercial licensing, so if you want to use Qt in a commercial product, you probably end up paying a lot in royalties. And finally, I think that Flutter is rapidly improving. I mean, Qt is improving too, but the release cycle is much slower compared to Flutter, so I think that this one is also really important. So we saw a lot of advantages and good points, but not everything is perfect. In my experience, one of the, let's say, tricky parts of Dart is that when working on embedded Linux, coming from C or C++, you expect to be able to do anything you want, for example accessing the hardware directly from the UI application. For example, in the product that will be showcased later, I had to read a proximity sensor input directly from the UI application simply to turn on the display. This was not possible, because Dart doesn't allow you to read from the hardware, to access the hardware directly. The structure is a bit complex to use, so, long story short, I was not able to do that.
But there is a solution: the foreign function interface, also known as language bindings. What is possible is that we can create a C library with a public interface, and then we can call those methods directly from the Dart application. This is really important, because we can solve a lot of the issues related to more complex stuff that the language can't do by using a C library. So at application startup the library is loaded, and then I can call the public functions directly. By doing so, I was able to solve the issue and read the proximity sensor input from my Dart application. Now, how to integrate Flutter into your project? Well, for Buildroot, there is the Flutter package developed and maintained by my co-worker Adam Daskat. He has done a great job on this package and is currently maintaining it. In my experience, I used Yocto for my project, so I'm a bit more into that. There is meta-flutter, hosted on GitHub, which is the, let's say, official Flutter layer, maintained by some people from Toyota and the community. Integrating Flutter inside your operating system is really, really easy. Just include the layer, add the dependency in your image, and you are pretty much done. The Flutter engine and Flutter embedder are automatically compiled and added to your system. Obviously, you will also need to include your application. You can use the Flutter Gallery recipes as a reference, pretty straightforward stuff. You just copy the recipes, adapt the repository, maybe adapt some build arguments if you want, and then add it as a dependency, and it's done. You don't need to mess with any cross-compiling stuff or any of that. On GitHub, on my page, I have a repo manifest that I made almost a year ago, hopefully still working, for a Tinkerboard. It simply creates the Yocto project and downloads all the layers needed, and then you add the dependency inside the image.
When you compile, you can simply put the result on an SD card, and you will have the Flutter Gallery on your hardware. Now I'll show a video of the product that we developed with one of our customers. The device is an intercom device. It's running a Rockchip processor, so obviously a bit more powerful than the Tinkerboard. This is the intercom device. It's able to connect to another device on the other side and have a video and audio stream. Under the hood there is a lot running, but as you can see, the application is still smooth and responsive, and the performance is really, really good. This is a custom keyboard that we developed to achieve the design that the customer wanted, and those are some Lottie animations that are running with a package that I included from the Flutter package repository. The final result, in my opinion, is really good. We as developers are really happy with the result, and the customer is happy with the result from the commercial point of view. So that's why I was presenting Flutter today. So, if you have any more questions... Thank you, everyone. Great. You mentioned that Dart is a custom language, but there's an MQTT library. How does it work? Do you have to implement each protocol in Dart again? So, the question is whether, for example, for MQTT, there is actually custom stuff running under the hood. Not really, because in reality Dart is compiled like C++, so you can use Dart as if you were using C++ code. As for MQTT, I didn't really look into it, because I downloaded the package and it was working flawlessly, so I included it, it was running, and I was really, really happy with the result. So, yeah. Thank you for the talk. I'd like to know what the memory footprint of the Flutter engine is, I mean flash and RAM. Okay, the question is the memory footprint of the Flutter engine. Let's say that one of the disadvantages of Flutter is that it's a bit more resource-hungry compared to Qt or LVGL.
I think that the Flutter engine was like 14 or 15 megabytes on storage, and for memory, I didn't run any kind of profiling on the hardware, but I think that it's comparable to a Qt application when running. Yeah? With Yocto, do you really need a big operating system underneath, or is it capable of running on, say, FreeRTOS or something like this, a more lean operating system? Okay, the question is whether Flutter is able to run on smaller operating systems such as FreeRTOS. Oh, perhaps bare metal. Oh, bare metal. I don't think so. I think that at the moment it requires a Linux operating system. Okay, thanks. I have one more question. You said about integration into an existing project based on Yocto: we just integrate meta-flutter, and you said we do not have to take care of any cross-compilation. How does that work? Like, my distro project is compiled with GCC, and for Flutter I read that there is a dependency on Clang. So the question is how it is all handled by the layer, and how the cross-compilation is managed. I think that everything is handled by Yocto and the meta-layer. The meta-flutter layer is really well done. You can simply download it and add the layer dependency in your Yocto project, and if you include the dependency on the Flutter engine in your image, it will automatically be compiled, because Yocto manages everything about that, so you don't really need to take care of it. Hi. Hi. What about Flutter, Yocto, and ARM32? Sorry? ARM32, AArch32. Have you had any experience? Actually, the question is whether Flutter is capable of running on a 32-bit platform. ARM32? No. Simply no. Do you have any port, any project on Yocto? No, at the moment, I don't think so. Yeah, the question is whether I know any company that is moving from Qt to Flutter. As the video before showed, one of our customers was mainly using Qt for UI applications, and is now moving to Flutter.
So I think that, because of the open-source licensing, this is really tempting from a commercial point of view. Yeah? It's a bit of a two-part question: for the current project, did you do the whole project in Flutter and Dart? Yeah. Okay. In your company, do you also have projects where part of the product is made in C++ or something else? And how would you integrate that with Flutter? Okay, the question is whether the whole project was made in Flutter or if there are also other applications running in C++. Well, in this case, the UI is run with Flutter, and there is a set of microservices running under the hood, written in C++. For example, they take care of the video and audio stream and all of that, and the application communicates with those microservices via MQTT, for example. Okay, I think there are no more questions, so thank you very much. Thank you.
Building Cross-platform GUI apps with ease (and Go) - desktop, mobile and beyond!
Thanks very much everybody for coming here to the graphics room to listen to another talk about building natively compiled applications that are going to work everywhere. I had the title of the slide up and realised it didn't actually say Go on the title slide, so I just wanted to get that out there right away. It's very exciting to be here in the graphics devroom and to be presenting in the same place where fantastic people over the last decades have shown great new features in KDE and GNOME and had fantastic discussions around all of that, and hopefully I can bring something new and interesting to the room as well. Just out of interest, to get this right, I recognise some faces from the Go devroom yesterday, so maybe a show of hands if people have programmed Go at all. Wow, okay, cool, probably unusual for this room. And anybody then that is a C developer, just in case I need to go back to some common ground? Right, okay, cool, well thanks very much. So just a little bit about myself. Hi, my name is Andrew, it's really nice to meet you all. I am a software engineer, have been for 20 years I think now, I stopped counting. I work a lot in startups, either my own or other people's companies, solving interesting technical and personal challenges, building teams, all that kind of stuff, and I've written some books and gone on a couple of podcasts on the topic of building applications like the ones I'm going to show you today. I have a background in open source; if you've seen me here before it might have been talking about the Enlightenment project, where I spent a lot of my time, and before then the Maven project as well. I started the Fyne project, which I'm going to present today, to build graphical applications on top of Go in 2018, and I have been a Go developer since two weeks after the project was founded. I'll tell you a little bit more about Fyne as we get into it, but I didn't pick up a language I wanted to learn and then decide it needed a graphical toolkit.
I had an ambition to make getting into graphical application development so much easier. I knew what I wanted to achieve and then I hunted for a programming language, and I don't know if it's a good or a bad thing to say, but I wanted it to be Rust, I so wanted it to be, but I couldn't figure it out, and so I picked up Go and I haven't looked back; I've never felt more productive. My day job is at Fyne Labs, which is a company set up to help businesses get more out of the types of platform-agnostic technology that I'm going to show, and so we have products and services that can help companies working in this space. So I don't know whether or not people would think that Go is a strange choice of programming language for building graphical applications. It's certainly what the Go development team have said over the past few years, although I think they're coming around now they've seen how easy it is. But just to summarize the benefits for anybody that doesn't know: much like Dart in the previous presentation, it's going to allow you to write applications that compile natively for absolutely any device, so they can pretty much run anywhere, from desktops through mobiles, WASM in the web browser, through to embedded devices as well. It's important to me also that there are no runtime dependencies. These pieces of software should drag-and-drop or install through a store in the usual manner without any need for additional steps, no runtime setup, no hidden pre-conditions required to get the applications running. We may have to do some work as developers, but we take the pain so that our users get the big benefit. We're going to deliver native performance. These applications are compiled down into the same machine code as any piece of software built with C or other platform-specific technology. But fundamentally, I thought it was important to lower the barrier to writing graphical applications, to help people realize it's not so difficult.
It's something that you can see and do and have installed on your device very, very quickly indeed, and Go provides the ability to do that whatever platform you're on. But also the standards and the tools and the techniques in the language help to make everything easier to understand. There's good documentation, standard ways of writing things, unit testing built right in, all of those good things helping to promote good engineering principles. And so for me this is why it made such a good fit, and it's why the Fyne toolkit picked Go, because we want to be the simplest way possible to get people building beautiful and usable native graphical applications, without having to think about any changes that might be necessary to get them running on any particular device. So the Fyne project, like I said, started in 2018, so it is now six years old, possibly as of this weekend actually, complete coincidence; I was not sitting in a FOSDEM room when I thought of the project, which is a shame, it would have been a good story. It has become the most popular graphical toolkit for Go, which is pretty exciting. Over the years there have actually been quite a few started, and it's nice to have choice. They have started perhaps with different technologies under the hood, some using embedded web browsers for example, and others are interested in enabling more control, more power, where we're focused on the simplicity and the ease of use, I suppose.
OSS Insight, if you track them on Twitter, X, wherever they are, have ranked us sixth out of all cross-platform graphical toolkits, which is very exciting, although for some reason Qt and GTK don't seem to be listed in the top 10, so how they came up with the numbers I don't know. But it puts us up there with others like Flutter, React Native and other names that you would have heard of, and just last week I realized that we had got into GitHub's 1,000 most popular repositories across the entirety of their code base, which I think is something like 350 million repositories. As part of the Go ecosystem we make use of the really excellent and welcoming community that they have established over there, and across Slack, Discord, Matrix and in-person meetings we've got about 2,000 people that like to get together and talk about building applications, offering help for people who want to get started. So let's get a couple of pictures on these slides as well. This is the Fyne demo application, so if you're interested in checking it out you can load this right now; it's in the standard repository that we have, we ship a few demo applications, and if you're on the Google Play Store you can download this right now onto your phone and see how it renders on a mobile device. Hint: it looks exactly the same, except it's adjusted for the different screen sizes. And of course, I mean, as a developer at heart, light mode is no good to me; we ship the dark mode as well, sorry, they're both in there by default, and it will pick the right variant of the theme depending on your user preferences. So let's get started and build an app. I'm not going to overwhelm you with complex code, which is perhaps a relief to people who don't know Go or C, but I'll step through what we do have.
Go is known for being easy to compile across all different platforms from whatever developer device you have, which is fantastic, it's a good place to start. But because we're going to be doing some graphics programming and we want optimized binaries that are going to use your hardware acceleration, we've got to get a little bit of C in there under the hood. You're never going to see it, but we do need a compiler installed as well, so you'll need to install Go and GCC or Clang or a compatible compiler. If you're unsure whether you've succeeded in setting up a development environment, we have a Fyne Setup tool which will verify the runtime; it's linked from the Getting Started pages, which I'll reference later, and that's just going to check that the Go compiler and the C compiler are found, and catch typical challenges around having your path set up properly so that tools are discoverable. And we have a tool called fyne that's going to be useful for our packaging later. So there are a couple of steps that we need to do to get started with a project, nothing like if we were going to be starting with a C code base, but nonetheless it's something to be aware of. We need to make a directory for our code, and we need to initialize the Go module; this is a step that was introduced relatively recently and allows much more powerful dependency management in a Go project. It used to be that you could just open a file, save it, run it, and you would have an application displayed, and I'm trying to coax the Go team to allow that as a default for the really early stage, because the mantra is start with the smallest thing possible and then add to it over time. So apologies, we've got a couple of steps there that you need to know. We're calling go get, which is going to grab the library; Go looks all of the stuff up on the web through a pretty efficient caching mechanism, but as you can see, that's a URL: it's finding the source code, and that's going to download it into the module that you've just
created. Actually, it's referenced in the module and put in a common space so you don't need to download it again for another project. Then we're going to edit our Go file; I'm calling it ui.go because I'm really good at naming. This is the code that we're going to put in it; I'm not adventurous enough to live code, I'm afraid, so I'll step you through it. We have package main, because every application enters through the main package. We're importing two packages; they're in the same namespace of fyne.io/fyne/v2, because this is our second major API, and it's the app and the widget sub-packages that we're going to be using. The app package sets up the runtime, pulls in the appropriate drivers for the device that you're running on, and then is going to bootstrap the application, and the widget package we're using to add something into our window. Our main function, again probably no prizes for guessing that's the entry point for a Go app, is creating a new Fyne application, which is invoking the driver, and creating a new window from the application with "Hello" as its title, so if your device has title bars, "Hello" gets popped in there. And then the one line which is basically our entire user interface says: set the content of the window to a new widget, a new label widget, that says "Hello Fyne". We then call ShowAndRun on the window, which is a little shortcut for show my window, run my application; if you're not familiar, part of the challenge with graphical apps is that they do have to run in an event thread, the operating system has specific requirements for things, and we just bundle it up as simply as possible. So there we have, I think, four lines of code and a couple of import statements. We can type go run followed by a full stop; the period there just says the current directory, and you could equally have said main.go because we only have one file. And it's going to load this picture, this window here that says "Hello Fyne". I was running this on a desktop in dark mode at the time, which is why it looks that way. Wow!
Yeah, I can see you're really, really excited about a hello world application; I mean, I was the first time it appeared on the screen, but that was a few years ago. So let's do something a little bit more interesting and show that it's still going to be easy to do something useful: we're going to make a markdown editor. We have, built into the standard widget package, an entry widget that is going to take the user input; we're going to use a rich text widget in our application to render the output. Part of the reason this is going to be really straightforward is that a rich text widget understands markdown as a source for the information to mark up a text document. And we'll use a horizontal split container for laying out our user interface. I showed you widgets before, but containers are sort of like a type of widget where you have multiple things in it, and it has a layout that's going to describe how things should be laid out on screen. You don't position widgets manually; you have an area and a container fills it, which means that we can adapt to screen sizes and orientations very easily, and widgets don't have to think too much about how they're placed or what type of device they're running on. That's actually very powerful when you're not really wanting to think too much about what system your application is going to be running on. And I'm going to hook the two together with an OnChanged callback, so that when the user edits their text, the runtime will update the preview. Okay, so I said four lines; it is a little bit more than four lines of code, but we have the same imports with the addition of the container package, and we're starting the application and window in the same way, although as you can see, a very exciting new title is going to appear on our window. The editor field, sorry, I can't seem to point, anyway you can read it, it's not a lot of text, is a new multi-line entry, which is a standard entry widget but with more than one line in it. We don't
need to specify how many, because it will fill the space available. We have a new rich text created from markdown, but we're loading nothing; as you can imagine, if you passed a markdown string in there, it would actually render that as it was loading for the first time. Then the hook that I mentioned is again one line of code: OnChanged on the entry passes a string to whoever is interested in what changed, and the ParseMarkdown function of our preview accepts a string, because you would parse your markdown from a string, so we're able to set one function to the other. When OnChanged happens, it fires ParseMarkdown, so we can avoid signals and slots, string-based IDs and comparisons to connect multiple widgets together, and just use a single line of code instead. And then the most complicated piece of code in this entire snippet is the container. We're using an adaptive grid, which is like a grid but it adapts the number of columns or rows that it should have. If you had a standard grid, it would have columns or rows specified, and as it reaches the end it flows onto another; with an adaptive grid, it's going to decide whether it's columns or rows based on the space available. So if we were loading this on our phone in portrait mode, one will be above the other, and if it's in landscape, one will be to the left and one will be to the right. So, sorry about the sneak preview before, but this is a markdown editor. There we go, that's better, thank you, you're too kind. As you can see, this is not difficult. We have the entry widget on the left, we've typed some markdown into it, and it has rendered on the right; there is a link which you could tap, and it has referenced a local image as well. That's quite cool, but this is cooler: it's exactly the same software that has been packaged as an IPA and dropped into my iPhone simulator; actually it's a .app, because it's a simulator, not a real device, but exactly the same, so the code could be
dropped onto a device as well. As you can see, it's also running in landscape, so the arrangement is the same. So there, that is the application running across multiple different platforms. How did we get it there? Compiling for targets that aren't your current machine is a little bit more complicated, but let's start with compiling locally. The fyne tool, the helper that I mentioned before, is pretty important and very helpful, as many helpers are. You can get it from the URL there, and that go get command is going to download it and put it into your path, and then you can use it to do helpful things like package the application or install it locally. As I'm sure you're aware, a binary that you get out of a compiler is fantastic and efficient, and you can move it around because it is portable, but it doesn't look good and you can't put it in your start menu. So fyne package is going to give you a binary with whatever metadata around it is necessary: it will inject an icon into an exe for Windows, or it will put the icon and desktop file into the appropriate places on your Linux system. And fyne install, on the second line there, is doing all of that for the current system and installing it into the right place for you, so /usr/local probably for most people here, or the start menu, or your Applications folder on Mac. Line three there is how we do it for a different platform, because we can't just invoke the compiler; we also want to package it differently, so an exe is going to appear instead of whatever our native format is. That is going to use local tools, and so if you're familiar with cross-compiling in C, having a toolchain and specifying the CC variable is likely going to be needed for some of these cross ports; we'll come back to that in a second. The fourth one there is to build an Android application; to do that on our platform you just need the Android SDK installed, essentially quite straightforward and relatively portable. The only reason it's a bit more complex is
that we need to say what the application ID is, because the sandboxing and the operating system's rules say you can't just be an anonymous piece of software, so we pass that in. There is also a metadata file that you can use if you prefer to avoid command-line arguments all the time, FyneApp.toml; I'm not going to cover it, but it's there and can help you save a little bit of pain. But what if you don't want to manage multiple developer toolchains, installing packages, even if it's for your local environment? You just might not want to, you might not be able to. So, contributed to the project is fyne-cross, from Luca Corbo and Cedric, who many of you will know, and another guy, Jacob, on our project. They have pulled together a Docker-based build system with a very standard command-line front end, so much like you would say fyne package -os windows, you could say fyne-cross windows, and it's going to take your application, bundle it up inside the Docker container, put the binary back into your current directory, and exit the container. So it helps you to avoid all of the setup; if you don't mind running Docker or Podman on your local instance, that's going to be super, super helpful. Very briefly, I want to touch on some more interesting parts of the toolkit, because it's not all about just showing graphical elements on screen. One of the hard things about making applications portable is the file system. We take it for granted, but we shouldn't: it's not always there. So we've provided dialogs to open and save files, and a package that helps you to manage storage in an abstract way, even more abstract actually than the recently added Go file system package. It doesn't assume file paths; it uses URIs to uniquely identify any data source, so you could have your data remotely on a network. Somebody made an application to browse their Steam library: they connected it through the storage API and they used the file open dialog to browse their Steam library. Cool, but why it's really
cool is the picture on the right here. I've asked my application to open a file, I've put that onto my iPhone simulator, and it has shown me this file-picking dialog. I don't know if people are familiar or not, but this is what's going to come up if you have an iPhone set up with an iCloud account: I can pick data off the cloud, or I can back out and go to the Dropbox picker, where I might have something stored. So third-party applications can provide data as though they were files, because we're not making the assumption that they're files. And if you get further into this and you want to separate your UI from the data that you're managing internally, to separate state from rendering, then we have a binding package. So you could pass around a string binding and not have to remember that it's going to a label, or you could multiplex: I have some data and it's going to go to two or three widgets. Most of the standard widgets provide a WithData constructor, so I can pass the data binding in, and that's a two-way data binding; everything's always going to be kept up to date. So two pretty helpful things, but I wish I had more time to tell you more. Obviously there's a full widget library, or I wouldn't be shouting about hey everybody, you should try this. We have a dialogs library, and full-featured forms as well, which surprisingly is one of the things that can be a little tricky to get working in a mobile app; menus; some more complex containers than I have shown you. We have notification integration, and a system tray for desktop, popping in and out wherever it happens to be appropriate for the device you're on. And we've provided native access to APIs that you might not have in Go, so if you need to use a library that's not available in Go, you can call out to that natively through a C API. The Go team have done a fantastic job with making that integration really easy: you essentially import C and call it with a C namespace, and it works pretty much transparently. Again, there are some
complications: if this is Android and you want to access the NDK, you need the JVM instance, so we've provided some native integration hooks that give you the context necessary; I'm not going to step through that today. However, there is a little bit more than that. It wouldn't be a presentation in the graphics devroom if I wasn't able to say: but hold on a second, we built an entire desktop system using this API stack. The presentation that I've just run through is in an app called Slides; it is a markdown file, we support markdown rendering, and that is pulled together in a Fyne app. The terminal here is another Fyne app. In fact, the desktop system, everything in front of us, it's all rendered in Fyne; it is Go APIs and very, very easy to understand. So there you go, I feel like I've fulfilled what's necessary to consider ourselves a serious graphical contender. If you would like to learn more, well, I'm here, and I'll hang around outside if anybody wants to chat. There's a lot of documentation online; like I said, the project's been going for a while, so you can find a lot of what we have at docs.fyne.io. There's also a pretty good video channel at Fyne IO on YouTube, where you can find tutorials and examples, and do search for Fyne tutorials outside of what we have as well, because there are translations in, okay, I'm not going to list them in case any of them is politically insensitive, but plenty of different languages, for folks who find that they want to try a platform that's this different to the standard ones available, perhaps. There's a book available about Fyne that I wrote; you don't have to buy it, but it's out there. If you would like to contribute, we would really love it if some people came along and helped us to improve this project. Everything is on GitHub, including the documentation, the websites, the examples; you can find it all in the organization, and the main repository is simply called fyne, where you can find the source code to everything I've shown you today. We're of
course looking for sponsors, but who isn't? If you love it, you know, help out in whatever way you can. Appreciate your time, thank you so much, and I'll take any questions that you have. So, what's the support like for iOS? Is it more complicated than Android? It is very complicated for us; it is trivial for you, the developer who's using the toolkit. You don't need to think about it, you don't need to do anything at all; the tools that I've shown you will create any type of application from your code. The one proviso is that if you want to put it onto an iOS device, Apple is going to insist that you own some hardware that they have produced. It may or may not be possible to do it in other ways, but that's the licence. But no, fundamentally this is platform agnostic; the APIs are all guaranteed to work absolutely everywhere. Next question: I was just wondering if you could provide a sense of whether there are certain kinds of application that are a particularly good fit, and maybe a less good fit, for this framework. Yeah, so: are there applications that are a good fit, or not a good fit, for using Fyne? I think the easiest answer is going to be that if you have a document-rich piece of content, if you're helping people to browse archives of documents and things like that, you'd probably be better off with a web framework, honestly, because that's what it's built for. If it's more interactive, if it's graphically driven, then that's something that we're going to be much better placed to do. Fundamentally, if you want to get this out to many people, we're going to alleviate a lot of the pain of getting it out there quickly, and some of the things that other toolkits might offer as built-in or community add-ons might take a little bit of time to implement, but you've saved a lot of time up front. I wouldn't go
and implement games, because we don't offer the 3D acceleration as part of our API; we just use it internally for the speed improvements. But it has been used for such a wide variety of things; we have a remote desktop application streaming at 60 frames a second, full screen, so that kind of thing. That's pretty cool. And can we squeeze one more in? Yeah, just there please. You mentioned using OpenGL as the back end; under iOS, for instance, are you going straight to Metal, or are you using, say, ANGLE or some other solution? Okay, so: what are the graphical back ends that we're utilizing? It is OpenGL on the desktop, you're quite right, and we are using GLES on the mobile platforms, iOS and Android. I'm aware that some of these have been deprecated and they may change over time. On the desktop, Apple is trying to kill off OpenGL on the Mac; they've not really said that they're going to kill GLES on mobile, but it's inevitable that they will want to. We're looking in the future to build more back ends, sorry, more platform-specific engines, because of performance; it also, ironically, offers slightly better portability if you build for everything separately internally. But we've designed the API so we don't need to make those decisions: it's really easy to use, it's going to work, and over time we're going to adapt the back ends to be more efficient, or whatever is needed by the platforms. And we do updates every four to six months, so that we can keep up with the specifics of each of those platforms; so if we have to look at a different one, it'll be there before you have to worry about it. Thank you so much everybody, enjoy the rest of your day.
The FIM (Fbi IMproved) Universal Image Viewer, in a Nutshell
Welcome everybody. So this is a talk which introduces the FIM image viewer, and it's a pretty classical introduction talk, with some new things about new features which I'm working on, which are not very stable yet; half of the presentation is about that, so something which will work very soon. Good. So, the classical introduction: in those happy times I was using the framebuffer a lot, and I was using the FBI image viewer there, and I was happy. But I needed more, and from time to time I made more and more use of the X Window System, and foremost I wanted the Vim keys working in FBI, because this was not there, and this is very bad. So at some point I started wanting more things from FBI, but FBI couldn't give them to me. So, not in this order, I started introducing things like new graphical front ends, arbitrary key bindings, custom keywords or commands, shortcuts of course, in a completely different order of course, but actually much more than this: handling metadata of image files, searching those metadata, so incrementally adding things which one could do; and also interaction with the standard input and standard output and shell scripts, or scripting from the inside. It had to be a 360-degree nerdy image viewer for my purposes, so it was just driven by my curiosity. But still I maintain that this is meant to be, and still is, a unique tool: it's meant to show images, not to show movies, not to edit images, and to do that with a possibly minimal user interface. Yes, this is right for command-line user interface enthusiasts, for people who like to have perhaps a few lines in the configuration file for some custom caption or perhaps some other customizable graphical feature, and for people who want to use it with other utilities, of all sorts. So what happened is that some kind of chimera came out of the original FBI, so I thought maybe I can call it "FBI improved" because of this.
So with one executable you get pixel output, or ASCII art output with or without colours; and the second part of the talk will also mention that a GTK back end, or front end, sorry, front end, is there. And what's the core of FIM? What is FIM based on? FIM is based on a number of commands of a small language, which perhaps I could have spared myself if I had only taken some time to learn perhaps Guile or one of those languages. So I validated, I think, Greenspun's tenth rule: that in every highly configurable program, perhaps a Lisp interpreter would have done it better. So maybe that will come. Anyway, keys are always bound to commands. Commands are customizable; keys have names, and you can bind them to commands. Commands live in the space of a simple language which is documented in the man page, and there are also shortcuts: once you enter the command-line mode with the colon, of course, you can have shortcuts for things which are useful, like going to the third picture, or scaling smaller by 10% or by some other precise amount. You have tab auto-completion of commands on the command line. You don't always need this, but sometimes you need it, and it can be useful. I also forget the commands, so therefore I have these features. I also like, when I show pictures to my friends, to have nice custom captions over the images, and perhaps having those captions customizable in different parts of the screen; therefore there are four captions which you can customize with expando codes, which can expand to internal variables or to specified values like width, height and so on. You can make it look the way you really prefer. This is your right. I think this is important in an image viewer.
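As a sketch of how caption expando codes work in principle: a template string with short codes gets expanded against the viewer's internal variables. The codes `%n`, `%w`, `%h` below are hypothetical placeholders chosen for illustration; the codes FIM actually supports are documented in its man page.

```python
def expand_caption(template, variables):
    """Expand hypothetical %x expando codes against a dict of internal variables."""
    out = []
    i = 0
    while i < len(template):
        if template[i] == "%" and i + 1 < len(template):
            key = template[i + 1]
            # unknown codes are left as-is, known ones expand to their value
            out.append(str(variables.get(key, "%" + key)))
            i += 2
        else:
            out.append(template[i])
            i += 1
    return "".join(out)

# e.g. a caption showing file name and dimensions:
print(expand_caption("%n (%wx%h)", {"n": "pizza.jpg", "w": 800, "h": 600}))
# → pizza.jpg (800x600)
```

The point is only that one template line in a configuration file is enough to describe a fully customized caption.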
I also like to use metadata a bit when I have collections of my travel pictures, for example to quickly show somebody a concept from a place where I have been. So there is a simple text file format which FIM can parse, with simple commands there, and perhaps some other features to make those commands more flexible: to categorize groups of pictures, of files actually. And this allows quite functional forward and backward search with the slash and question-mark keys, just like in Vim. What happens if you have a lot of pictures in a list, in a collection? Sometimes you just limit what you're showing, for a time, to some category, to some key-value combination. So for instance here you see that, of the five files that we gave to FIM, we are showing just the four which respect this specification. There are other limiting features: on size, on date, and so on. Another recent functionality is that, if you want to systematically try out some particular filter command that can be expressed as a UNIX command, you can just start FIM with this particular command specified as an argument, and then systematically all of the pictures will be filtered that way before being shown to you. It can be useful if you don't want to convert all your files just to have a preview, but want the preview happening on the fly, on temporary files. I find this useful, to me at least, and the syntax is absolutely inspired by find's -exec feature. And just another important thing: sometimes FIM doesn't read all the formats you wish, although it reads many of them; but thanks to the help of external converters it can open many, many other sorts of files, also with the use of ImageMagick as a last resort.
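The idea of limiting a file list to a key-value combination can be sketched in a few lines. The one-record-per-line `name|key=value,key=value` notation below is invented purely for illustration; FIM's actual metadata file format is its own, described in its documentation.

```python
def parse_line(line):
    """Split a hypothetical 'name|key=value,key=value' record into name and tags."""
    name, _, meta = line.partition("|")
    tags = dict(kv.split("=", 1) for kv in meta.split(",") if "=" in kv)
    return name.strip(), tags

def limit(lines, key, value):
    """Keep only the files whose metadata matches the given key-value pair."""
    return [name for name, tags in map(parse_line, lines) if tags.get(key) == value]

records = [
    "a.jpg|place=rome,year=2019",
    "b.jpg|place=paris,year=2020",
    "c.jpg|place=rome,year=2021",
]
print(limit(records, "place", "rome"))  # → ['a.jpg', 'c.jpg']
```

The same structure would support limiting on other attributes such as size or date, as the talk mentions.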
So zip files, PDF files and ISO files; many of them. And what's new in FIM? All of what I've shown so far is the classical presentation, but there's some nice stuff coming, a bit unstable yet. It's about GTK menus, because I'm getting old: I don't always remember the internal commands of FIM, and I wish people to find FIM more accessible, more usable perhaps; I never actually wished people to remember the commands, why the hell? But still I wish to allow myself and others quick access to the functions which exist in FIM, because I also forget them, actually. So this is the boring menu which you see now if you use -o gtk, and it's a normal menu, right? There is nothing really particular or funny here; there are not even icons. But still I cared, I wanted that you should be able to specify this menu from the textual configuration file, with like one line per item, which seems sane to me. No XML files, nothing else; just my custom and dangerous notation for that. And also you should be able to customize the menus at runtime, why not? I think that is also very important. And this is actually to have a reminder of what you can do with FIM when you need it; the menus should be there, just waiting for the moment. So, back here: those are the menus which I just had on my instance, on my computer, today. Let's explore the first one, the File menu. In the File menu you will find the classical "open file" and "quit", and other things. Each one of those is like one line, which is shown here, and which you can also have in your fimrc file. So if tomorrow you want some specific variation of "next", or "go to", or "jump to" something, it's just one line in the configuration file: no recompilation, nothing is needed.
Since commands are bound to keyboard keys, there are automatic documentation hints here showing which key a specific command refers to, if one exists. For instance here, "previous file in list": by the way, it's "b", which is specified in a completely different line, but this is, let's say, detected, cross-referenced. And this "previous" command actually is an alias, and this alias is actually "goto -1". In the specification of the menu, it's only written that this is something like "/File/Previous in list", a space, the "goto" command, and a few tabs; so it's pretty light as a specification, I would say, and it should stay so. And only at that moment is there a correlation which notices that, by the way, "prev" is "goto -1", and why not show this in the tooltips? So I find this useful; I find it eases the path into using FIM. And you also have perhaps useful things like "next in this directory". I didn't put "open directories recursively" here, because you don't do this interactively, you do it when you start FIM, but it's there; so you can put here the things you use more often, or "go to the next file from the last search", for instance. Anyway, these are all things which, I don't know if many other image file viewers have them, but this just reflects what has already existed in FIM for a long time. There's also the categorization functionality which I introduced lately, which I use for my vacation pictures. Once I have used this very simple custom notation for categorizing files, listing the name of the file, a comment, and a few variables saying, I don't know, which artist made a particular picture, then such a menu can also be computed automatically during the rebuild of the menus, and this can be useful to shortlist, remember, the pictures of Bill and Richard from before.
To shorten the list to only the files which pertain to those particular keywords. Now, the usability of some functionality, like interacting with internal variables, should perhaps also be easier with these GTK menus. I said that you can specify the menus, right? But you can specify not only menus which run an action, but also toggle menus, which, let's say, change between two values; and this again fits in one line. It can be something custom for you, if you want some particular variable on which something else gets triggered, because, like in Vim, you have auto-command hooks, so something can happen after you change a value. So the toggle is on two values; afterwards, say, some hook detects that the variable changed, and the picture is redrawn. And after such a thing happens, in case you have another menu which actually refers to the same variable, some consistency should be ensured, the same state of the widget, to avoid situations like the widget being in inconsistent states in different parts of the menu.
Same story here. Now, there are so many variables in FIM that sometimes there is a bit of confusion. For instance, there is a variable that says: flip everything, whatever comes. And if you flip, let's say, the file, but also flip all the files, what happens is that it flips twice; so you have to make special menus, one for this image and one for whatever comes. This can be useful, to me at least, and perhaps to you as well; and remember that this is just my choice of what I like, so here there could be other defaults which you find better, like your own values for the scaling or orientation. So let's go forward. There is a default Window menu which I specified, and again, you can specify your own, or take it away if you don't like it, where I have put other things: since FIM is actually using the font from the Linux kernel, and it's just very simple pixels, you can magnify the text or change the text font via the menu here; it just refers to what the kernel uses. And if you want to play a bit more with the configuration of the GTK menus, you can rebuild the menus in more verbose ways; there are defaults at the moment, I mean in the unstable version, which you should not use yet, which give more verbosity. So it's good that you can have things which are experimental to you yourself, like, I don't know, taking a selfie with the camera and reading it into FIM at a given moment. Here we are calling the webcam alias, the FIM alias called "webcam", which reads from the output of a command, and it will show whatever that output is, as long as FIM can read it; the first few bytes will say which format it is, and it will be shown. Same story as I have shown here: there is another default here, just to mention something, which means reading the comments aloud via a speech synthesizer,
which I find also useful sometimes; so somebody who is not able to see the picture can still be told about the pictures being shown. There are also some special menus which are not specified by you, but just requested, like the "all action aliases" menu or the Help menu, which you specify with just one line, but a lot of menu items actually get in, because this is a kind of automatically generated recapitulation of which commands, aliases, key bindings or variables exist. So this is a documentation-oriented feature; in the case of the Help menu it's also a handy way to get to the strings telling you what the options or particular usage modes of particular commands or aliases are. Over the years, I'm very grateful for the help of the Debian maintainers; they always ensure a higher level of quality of what goes into the code, so they help in finding problems. So, the different maintainers, in particular La Basia; and I'm thankful for whatever bug report or patch comes in once in a while. And that was this presentation. FIM is packaged in many Linux distributions, and in case you wish to run it somewhere else, perhaps I can help you with it and we can do it. I know that most people use FIM on the Raspberry Pi or on such systems; I don't play with that, but it's okay if they do. I like to use it interactively, and I invite you to do the same. And I welcome your feedback, or your ideas about what's not there yet. Thank you. Time for questions. So, the question is whether there is a drawing functionality; for doing what, exactly? For instance, starting an image from scratch, if there was any such thing. No; but if you motivate, now or in a different venue, why
it's useful, important, or whatever, we can do it; but at the moment, no, it's simple. Sorry, is it a GNU project, an official GNU project? No, it's not an official GNU project; this is why the website where it's hosted is called nongnu.org. It looks very GNU, I was confused. Maybe GNU one day, perhaps, yes; I mean, it's not going to be GNU by itself, and so far there are no plans there, that is the reality. Next question. All right, who has used FIM so far? Maybe one, yeah, one user. Who thinks they will use FIM? Who really thinks they will never use such a thing? Oh no. Good. What's the typical use case for using FIM? I don't track users. I know that many people use it for things like picture frame displays, which is boring to me; if it's useful to them, good for them. It means they have small computers changing the picture every second; super boring to me, but they do it, they're happy, I'm happy. But what am I using it for? I use it for whatever. I mean, I create plots sometimes via a pipe, so I pipe them to FIM and FIM shows me the picture. I test graphical filters, so I pipe a picture through and show it in FIM. I have my collection of vacation pictures; I want to show a friend how a good pizza looks, so I just search for "pizza" in FIM and show you: look, this is how you can visually inspect a pizza. I don't have the pizza detector yet, the plugin for FIM which tells you the pizza is good just by looking at it, but you can easily integrate FIM with some OCR program, if the OCR program is fine, or with the pizza detector, if you train the pizza detector well. Emotions are not yet supported, but some other caption can be programmed to be shown, like "good pizza" or "bad pizza"; and these are uses which I really support and invite
you to have in FIM. Or I use it for aliases in my shell: sometimes I want to show some specific concept, like somebody being nasty to me, so I start a special alias which shows something, a grumpy picture for instance, communicating "you're being nasty". I don't know what other people do, but whatever; it's about showing a picture, not really manipulating it. So keeping the file, not modifying the file, I think, belongs to FIM. If you want to go dangerous, you can compile FIM with the system functionality, I mean, you can interact with the shell, and you can do dangerous things with it; many things are possible. But I think the best way is to have a powerful, possibly minimal viewer, because you don't have to compile in all of those functions; it should be a flexible image viewer that you can use over SSH, over screen, over tmux, over whatever, and with different graphical outputs. I don't know of other programs that do the same. Sixel is not there, I'm really sorry; many people use Sixel nowadays. I know that for some reason they like to use more bytes than necessary for viewing pixels. Perhaps one day we'll have Sixel here as well. Thank you.
Productionizing Jupyter Notebooks
We're trying to get some fresh air into the room as well; have to be creative. Okay, we'll get started with the next speaker now. Anton is going to talk about productionizing Jupyter notebooks. Please welcome the speaker. Hello, nice to meet you everyone. Yes, my name is Anton. So, as I was saying, I'm a software engineer, a programmer, and I've been working on data engineering and data processing projects for the past 10 years at VMware. I'm glad that VMware decided to invest in open sourcing Versatile Data Kit, and for the past three years I've been focusing on maintaining and developing open source projects. Today I'm going to talk about the challenges in productionizing Jupyter notebooks, and show some possible solutions to those challenges using Versatile Data Kit. So let's get started. Who has been using Jupyter notebooks? If you can, raise your hand. That's about 20%. Okay. So, as you can see, Jupyter notebooks are a really versatile tool. They help a lot in terms of doing experiments, exploratory data analysis, POCs, things like that. With Jupyter you can do a lot, because it allows you to mix documentation in Markdown with visualizations, and also code in different languages. Still, there's quite a bit of struggle, in that you don't really deploy your notebooks directly to production. Most likely, and you can correct me, but from my experience, you'll be redoing this using some kind of Python scripts or another type of application or framework, and not deploying the notebook directly to production. This is often a bit of double work, because you do experiments in notebooks and then you do the actual productionizing separately. The tool I'm generally going to talk about is Versatile Data Kit, so I'm going to quickly introduce it. Versatile Data Kit is an ETL framework, or ELT framework, depending on how you want to use it.
It provides you with an SDK, which allows you to write, basically, steps, and in those steps you can ingest data or process data; there are abstractions that make this a lot easier. This is just Python, so you can install it with pip. And separately there is a control plane with an operations UI, which is the server part, optional, that can be installed on top of Kubernetes. So let's now dive into Jupyter and the challenges of productionizing. I've listed five challenges here. Though that's not an exhaustive list, these are some of the most prevalent ones in productionizing and using notebooks in production. I'm going to go over each one, explain how I understand the challenge, and then show a possible solution to it. Let's start with the first one: reproducibility. In this example, let's say that you have a notebook which you develop in three different cells. In the first cell of the notebook you set some variable to zero, then you increment it, and then you print it. What result would you expect at the third cell? I imagine one. And that's what you would get if this notebook were run in production in an automated, self-sufficient way. But that's not necessarily the result you get during development, right? It's quite possible that the user executes the second cell twice; in this case, the result would be two. It's also possible that you change the second cell after executing it; in this case, your result, if you deploy this job to an automated production pipeline, wouldn't be the one currently shown. It's possible to remove the cell; but because the notebook keeps state, the variable is still one, and every cell after it assumes it's one, even though if you deploy it in a scheduled, automated way that would no longer be so.
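The hidden-state hazard described here can be simulated outside Jupyter: each cell is just a code string executed against a shared namespace, so replaying the cells in a different order than top-down changes the result. This is a plain-Python illustration of the problem, not VDK code.

```python
# Each "cell" is a code string; the kernel state is a shared namespace dict.
cells = {
    "cell1": "x = 0",
    "cell2": "x += 1",
    "cell3": "result = x",
}

def run(order):
    """Execute the named cells in the given order against one shared namespace."""
    ns = {}
    for name in order:
        exec(cells[name], ns)
    return ns["result"]

# Top-down, as a scheduled production run would execute it:
print(run(["cell1", "cell2", "cell3"]))            # → 1
# Interactive session where the user re-ran cell2 before cell3:
print(run(["cell1", "cell2", "cell2", "cell3"]))   # → 2
```

The same code, a different execution history, a different result: exactly the divergence between development and production that the talk describes.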
This means that notebooks are not really reproducible, and that you can start developing against a state that diverges from the one that will exist in production. That is clear trouble, because things may seem to work locally, and then when you deploy your job, your notebook, to production, all of a sudden things break. So what can we do? One thing, for example, that we've thought of with VDK is this: let's say that we have this kind of notebook, and we want a predictable order in which the execution is done. In VDK, we can mark certain cells as production-ready, and mark the order in which they will be executed, which is top-down, always. So we have created a visual way to see that the first cell is executed first, the second one second, the third one third. If our state, which is on the left side, the current execution order, is different from the one that's expected, we can issue warnings; and this also allows you to smoke-test your notebooks before deployment. It shows and predicts exactly the order in which you'd expect the cells to be executed. For now this is done by setting a tag called "vdk" on each cell that you want to run in production; there may be different ways later, but the order is always top-down. This helps solve some of the problem by providing a deterministic execution order of the cells; it can detect divergence, and, as we'll see later, it makes testing easier. The second challenge I want to speak about is code organization. Overall, in notebooks you can expect to have quite a bit of irrelevant or debugging code. That might be useful during development, but it's not something you want to run in production in an automated, scheduled manner. Like this very simple example: the first two cells import pandas and read a CSV, and the third one visualizes it.
We can say that this is most likely relevant for your workflow during development, or if you want to share the notebook with a colleague. This, again, can be helped with VDK tags. With VDK tags, you tag only the cells that are relevant, the ones you think will need to be deployed and run in a scheduled manner in a production system. All the other cells will be completely ignored when the notebook is executed. Like in this example: the first, second and third cells, which are on the right side, covered in blue, are expected to be executed in production, and the debugging code that simply inspects the data frame or visualizes it gets skipped. The third challenge I'm going to talk about is the execution model. In notebooks you generally have a much more complex execution model. This is necessary because of the way the notebook needs to keep state, and the way you want to be able to use multiple languages; but that kind of execution model is really bad for automation, or for using notebooks as part of a workflow. It adds a lot of extra work on top: in order to execute your Python code, you need to go through a notebook server to the IPython kernel, for example if it's Python, and so on. Usually, the way you want things in production is much simpler: you have a Python script, and it executes on top of Python; or a SQL script, and it executes on top of some SQL engine, and that's it. You don't have a lot in the middle. And if you can extract exactly those Python pieces and construct your Python script, or those SQL pieces, you can do the same thing, which is what VDK does. When VDK executes a notebook, it basically extracts the Python and SQL pieces and executes them directly. This enables things like reusing another notebook as a template, or almost as a function, in a similar way to this one. Let's say that we have...
This is some kind of job, a Python script, and we are going to execute another notebook, a "process" Jupyter notebook, almost as a function, with arguments and so on. You can also execute it within a workflow, and you can run automated tests; I will show how you can do that in a little bit. Which brings us to the fourth challenge I want to talk about: automated testing, and CI/CD. There is no doubt that automated testing, and being able to have a CI/CD pipeline, is the cornerstone of reliable software nowadays. Jupyter notebooks do not really lend themselves easily to these traditional testing paradigms, and testing is really vital if you want to push code to production and make sure that any changes don't break things, and that things work as expected. There have been some attempts to solve this for Jupyter notebooks, and it has been quite a challenge. With VDK, we are attempting it as well. One of the things you can do, because of the predetermined order that VDK tagging provides, the fact that you can mark which cells need to be executed, and the fact that VDK skips the kernel and the extra layers in the execution model, is to use a command, for example "vdk run", which is provided in a Jupyter notebook by the VDK plugin, and which will execute the job as it is supposed to be executed in production: one by one, each of the cells, in the order you expect. You can, of course, also do this with the command-line interface, the CLI. Beyond that, if you end up using the control plane, which is the part with which you deploy the job to production, the integration with the notebook UI makes sure that when you create a new deployment, it will prompt you to run, basically, an end-to-end smoke test of the data job, as it's called, the notebook files, to make sure that they execute correctly.
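The extraction idea behind this can be sketched in plain Python: a .ipynb file is just JSON, so one can pull out the code cells carrying a given tag and execute them top-down with no notebook server or kernel in between. This is an illustration of the principle under that assumption, not VDK's actual implementation.

```python
import json

def tagged_sources(notebook_json, tag="vdk"):
    """Return the source of each code cell carrying `tag`, in top-down order."""
    nb = json.loads(notebook_json)
    return [
        "".join(cell["source"])
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
        and tag in cell.get("metadata", {}).get("tags", [])
    ]

def run_notebook(notebook_json, tag="vdk"):
    """Execute only the tagged cells, top-down, in one shared namespace."""
    ns = {}
    for src in tagged_sources(notebook_json, tag):
        exec(src, ns)
    return ns

# A minimal in-memory notebook: two tagged cells and one untagged debug cell.
nb = json.dumps({"cells": [
    {"cell_type": "code", "metadata": {"tags": ["vdk"]}, "source": ["x = 21\n"]},
    {"cell_type": "code", "metadata": {"tags": []}, "source": ["print(x)  # debug\n"]},
    {"cell_type": "code", "metadata": {"tags": ["vdk"]}, "source": ["y = x * 2\n"]},
]})
print(run_notebook(nb)["y"])  # → 42
```

Because execution is just "run these strings in order", the notebook behaves like a deterministic script, which is what makes smoke testing and workflow reuse feasible.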
Finally, because the notebook can practically be used as a function, you can now test notebooks using Python and pytest. You can write pytest tests, and there is supporting test code in VDK that helps with that, in which you specify your dependencies through plugins. For example, you can specify dependencies like SQL databases, HTTP servers and so on, and a way to mock them, for example using pytest-httpserver if you want to mock an HTTP API. You can then verify the results after the notebook is executed. There is a link here about the different cases that can be covered with pytest and notebooks. This is pretty powerful, because it allows you to actually run automated tests over all the notebook code that you want to productionize. Finally, another potential issue with notebooks is version control. A notebook file is this kind of JSON structure when you put it in version control, which contains all kinds of extra fields, including outputs and so on, which you generally want to clean. This is something Versatile Data Kit does on deployment: when you deploy a data job to some kind of managed environment, it can strip all the unnecessary parts. Instead of having a diff where the only relevant information is really just the source, the last three lines, you can have a diff which shows there are actually no changes, despite what it first appears. Those are the five challenges and the potential solutions that I wanted to share with you. There is a self-paced tutorial showing some of the things I've shown and how they can be used, so you can actually try things out yourself. Overall, I'd like to discuss more: do you think these challenges, reproducibility, code organization, the execution model, automated testing, are really relevant for your use of notebooks? Do you think the solutions make sense?
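The output-stripping step described above can itself be sketched in a few lines: drop each code cell's outputs and execution count before committing, so version-control diffs show only source changes. This mirrors the idea from the talk; it is not VDK's own code.

```python
import json

def strip_noise(notebook_json):
    """Remove outputs and execution counts so version control sees only sources."""
    nb = json.loads(notebook_json)
    for cell in nb["cells"]:
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1, sort_keys=True)

nb = json.dumps({"cells": [{
    "cell_type": "code",
    "execution_count": 7,
    "outputs": [{"output_type": "stream", "text": ["42\n"]}],
    "source": ["print(6 * 7)\n"],
}]})
cleaned = json.loads(strip_noise(nb))
print(cleaned["cells"][0]["outputs"])          # → []
print(cleaned["cells"][0]["execution_count"])  # → None
```

Re-running a notebook then changes only the volatile fields, which the cleaner removes, so the resulting diff is empty unless the source itself changed.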
Do you see any other challenges that are important? I urge you to contact me; I'll be happy to talk about it. You can do this through LinkedIn. If you want, I would appreciate it if you could take the survey, which simply asks: do you have any comments? What did you like? Do you have any other issues? And you can leave contact details if you want to talk more. Or you can just contact me directly through LinkedIn. And yeah, that's everything I wanted to share. Thank you very much for listening to me today. Thank you. This is going to be interesting. Okay. We have time for a couple of questions. Do we have any questions? If you can, please wait until the Q&A is done; that would be helpful. Yeah, there's a question there. Can you repeat the question? Yeah. Okay. Let me run it around. Okay. Thank you. I was wondering, if I want to deploy VDK, does that replace my JupyterLab, my existing Jupyter notebook server, or is it an add-on that goes with it? Okay. So, does VDK replace an existing Jupyter server? Is that the question? No. It's actually plugins, both an IPython plugin and also a Jupyter plugin, that provide this functionality. You'll be running your notebooks, or what VDK calls jobs, which are directories of notebook files or other scripts, with VDK, using either `vdk run`, as I showed, or the UI, or you can just run them as a notebook. The plugin provides some extra variables on top of it, and there's also a programmatic way in Python to run them. But it doesn't really replace the Jupyter server; it's more that it coexists on top of it. Thank you very much, it was very interesting. Does VDK, say, for example, I want to export my Python. For example, respecting the ordering rules in VDK, if I wanted to export that from Jupyter, will that impact the Python that's produced by doing a pure Python .py export?
So, you're saying that you want to export the Python which is marked with the VDK tag, for example, into a script? Yeah, it's possible. And, assuming that the script runs with `vdk run`, it's supposed to run in almost the same way as a Python script. VDK provides some extra libraries like job_input; if you're using it, you might need to initialize it yourself, but other than that, there's no reason for it not to work. Hi, thanks for the nice presentation. My question is: what was the first requirement that made you need to productionize Jupyter notebooks? What was the purpose of productionizing them instead of using regular Python scripts? So, what is the purpose of productionizing notebooks instead of using Python scripts? Well, the idea is to prevent double work, so that you can reuse the same environment that you are developing in and not need to redo the same things as Python scripts in a separate environment. It also makes it easy for people to productionize things without needing to know a lot of Python internals and software engineering practices. There's a point, I think, where this might break down if you have some very complex application. In that case, you probably still want to switch to using IntelliJ and Python and some kind of framework, but up to that point it should be much easier for other people, I hope. Thanks for the talk, first of all. I just wanted to ask a couple of things regarding dependencies. You know how Jupyter notebooks typically don't have any versions of specific dependencies stated at the top; it probably just imports specific packages here and there, and probably doesn't do anything about pip-installing those unless you specifically put a shell command at the top. So, how does it process those dependencies? Is it automatically interpreting them? Or do we still need a separate requirements file, or a pyproject, or something else that specifies those dependencies?
So, how are dependencies specified in VDK? A basic VDK data job is a directory which has a couple of special files. You can have Python files, SQL files, notebook files like this one, but you also have a requirements.txt file where you specify your dependencies, and either you install them locally, or, when the job is deployed, they will automatically be installed in the environment. That's how it handles it. Okay, thank you very much. Thank you.
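To make the answer above concrete, here is a sketch of what such a data-job directory might contain and how a runner could separate the dependency and configuration files from the executable steps. The file names and roles are assumptions based on the talk (requirements.txt for dependencies; .py/.sql/.ipynb files as steps), not the exact VDK layout.

```python
# Hedged sketch: a hypothetical data-job directory, split into
# executable steps (run in name order) and supporting metadata files.

JOB_FILES = [
    "requirements.txt",   # dependencies, installed locally or on deploy
    "config.ini",
    "10_ingest.sql",
    "20_transform.py",
    "30_publish.ipynb",
]

def split_job(files):
    # numeric name prefixes give the steps a deterministic order
    steps = sorted(f for f in files if f.endswith((".sql", ".py", ".ipynb")))
    meta = [f for f in files if f not in steps]
    return steps, meta

steps, meta = split_job(JOB_FILES)
print(steps)  # ['10_ingest.sql', '20_transform.py', '30_publish.ipynb']
```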
Overcoming MPI ABI incompatibility
All right. We're ready for the next talk. Mark is going to explain to us how to overcome MPI ABI incompatibility with Wi4MPI. Yes, can you hear me? Great. So, hi, my name is Mark. I'm working at CEA. And today I'll talk about how to overcome the MPI ABI incompatibility using Wi4MPI. Just a couple of words about CEA. Can you hear me? Yes. We are a French organization. We host big supercomputers that are used at the national and European level, and in a couple of years we'll host one of the two exascale systems in Europe. I'll start with a quick introduction to the MPI ABI incompatibility issue: why we actually care about this problem. I will try to give you an insight into what the problem actually is, then I will show how we can overcome this issue by dynamically translating between different implementations, and I will show you some use cases before concluding. So first of all, why do we need MPI library portability? There are a few reasons that motivate us. The first one is to be able to work around the limitations of an MPI library. As you may know, MPI is a norm; there are different implementations, and like all software libraries, they may have limitations, they may have bugs. Being able to switch between libraries can be interesting to actually diagnose the source of a problem. It's interesting at the level of a user; it's also interesting at the level of managing the supercomputers, or at the level of the developers, of course. It can also help you choose the best MPI implementation, because, as you also may know, the different implementations won't necessarily use the same algorithm for a specific communication, and on a specific cluster, some MPI implementations may be better optimized than others. It's interesting to be able to test all those things easily.
It's also useful for enabling fast and portable containers, because what containers offer, or claim to offer, is flexibility and portability. That is almost true, but there is a problem in HPC: you will lose portability if you need to match the host MPI, and in almost all cases you will need to match the host MPI, because it's vendor-optimized and you have no other choice. And it's not so probable that you'll have the right underlying libraries in your containers; sometimes they're even closed-source. It can also help you add flexibility to high-level languages. Some high-level languages like Julia or Python at some point depended on a specific MPI library, so if you wanted to use another one, you were stuck. Also, sometimes you build a very complex software stack with those languages, and if you want to switch the library you need to rebuild everything, which is really, really time-consuming. If you can switch easily, that's also interesting. And the last point is to be able to run on bleeding-edge or early-access systems, because in many cases, on state-of-the-art systems, you will have a single vendor-optimized library. So if your application is already compiled with another implementation, you'll be stuck once again. That's something we sometimes saw with cloud providers: some cloud providers provide a single vendor-optimized MPI library and that's it. So now let's talk about ABI compatibility in MPI, and why it is a problem. MPI, as I said earlier, is a norm, and the norm actually defines a single API, which is great, but there are several ABIs. At the moment, there are at least Open MPI, MPICH and all the MPICH-based or MPICH-compatible implementations, and MPC. And if you go back in time, others also existed. The problem is that they are in general ABI-incompatible, because even the simplest element of an MPI library won't be implemented in the same way.
If you look at just a communicator, which is the very basics of MPI: within MPICH it's an integer, and within Open MPI it's a pointer to a struct. So you have no way of going from one to the other, and it means that if you want to switch between those libraries, you will need to recompile. Sometimes that's possible, sometimes it is not, as I said, because you could have a very complex software stack and it can take literally days to recompile everything. And sometimes you're stuck with proprietary software; even though we are at FOSDEM, such software exists and we have to live with it on our clusters. So how to do that? That was the motivation to create Wi4MPI, which is a library that allows us to switch between MPI libraries. The idea is to catch the calls to a library and to translate them. We take the input arguments and translate them from the original ABI to the destination ABI; we call the function from the destination ABI; and then we go the other way around: we translate the output arguments and the return value from the destination ABI back to the original ABI. There is just one catch: some MPI functions actually call other MPI functions, so you have to check that you're not already inside an MPI function, to avoid re-translating; if you translate twice, it will crash. So we have an assembly-code selector to deal with this issue. And as you may also know, there are literally hundreds of MPI functions, so the functions doing the translation are actually generated: we have templates, we have files defining the different functions and their input arguments and so on, and we generate everything, to be a bit more robust. So now, how to use Wi4MPI. One of the great things with Wi4MPI is that you have two available modes. The first one is the preload mode, and this one is quite interesting if you already have a software that is compiled.
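The translate-in, call, translate-out scheme with a reentrancy guard, as just described, can be modeled in a few lines. This is purely illustrative, in Python rather than the generated C of the real library: the handle tables, function names, and a simple flag standing in for the assembly-code selector are all made up.

```python
# Hedged toy model of the translation scheme described for Wi4MPI:
# translate handles from the source ABI to the destination ABI, call the
# destination implementation, translate results back -- with a reentrancy
# guard so MPI functions called *inside* another MPI function are not
# translated twice.

in_mpi_call = False                          # models the "already inside MPI" selector
A_TO_B = {"A_COMM_WORLD": "B_COMM_WORLD"}    # source -> destination handles
B_TO_A = {v: k for k, v in A_TO_B.items()}

def translated(dest_fn):
    def wrapper(handle):
        global in_mpi_call
        if in_mpi_call:                      # nested call: pass through untouched
            return dest_fn(handle)
        in_mpi_call = True
        try:
            result = dest_fn(A_TO_B[handle])      # translate in, call destination ABI
            return B_TO_A.get(result, result)     # translate out
        finally:
            in_mpi_call = False
    return wrapper

@translated
def comm_dup(comm):          # stand-in for a destination-ABI MPI function
    assert comm == "B_COMM_WORLD"
    return "B_COMM_WORLD"

print(comm_dup("A_COMM_WORLD"))  # A_COMM_WORLD
```

The guard is what prevents the double translation the speaker warns about: a nested call sees `in_mpi_call` set and skips the handle tables entirely.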
With the preload mode, you can dynamically, at runtime, translate between MPI implementations. So imagine that you have a software that was compiled with MPICH: at runtime, you can use Open MPI. And we also have an interface mode, where we use Wi4MPI as a stub implementation: you compile the code against Wi4MPI, and it is at runtime that you choose which MPI you want to use. So if you know that you will have Wi4MPI at hand on the cluster you will be using, this one ensures you greater portability. From an installation point of view, it's really simple: it's a basic CMake-based installation, and Wi4MPI is also available through the Spack package manager, so if you're using Spack, just `spack install wi4mpi` and you're all done. And in practice, how can you use it? There are at least two ways. The first one is to use it directly as a wrapper. Imagine you're using Slurm: you call Slurm, and where in general you would then put the binary you want to launch, you just add the wrapper, saying that you want to go from one implementation to the other. Another way is to use it transparently through environment variables, and here the main catch is to have the right LD_PRELOAD, because you inject the Wi4MPI library at runtime to catch the MPI calls and translate them. If you do that, you can run your app directly with srun. Don't worry about all those variables; they are, of course, documented. And if the translation works, the only thing that differs from a normal execution is a message saying: hey, you're using Wi4MPI in preload mode, from one implementation to the other. For more advanced usage, I invite you to read the documentation. We have a bunch of tutorials, available online; at the moment, we have seven main tutorials.
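The environment-variable route just mentioned might be assembled like this. The specific variable names (`WI4MPI_FROM`, `WI4MPI_TO`) and the library path are assumptions following the mechanism described in the talk; check the Wi4MPI documentation for the exact names on your installation.

```python
# Hedged sketch: assembling the environment for Wi4MPI's preload mode.
# Variable names and the library path are assumptions, not verified
# against a particular Wi4MPI release.
import os

def wi4mpi_env(src, dst, libdir="/opt/wi4mpi/lib"):
    env = dict(os.environ)
    env["WI4MPI_FROM"] = src                 # ABI the binary was built against
    env["WI4MPI_TO"] = dst                   # ABI to use at runtime
    env["LD_PRELOAD"] = f"{libdir}/libwi4mpi_{src}.so"  # inject the translator
    return env

env = wi4mpi_env("MPICH", "OMPI")
print(env["WI4MPI_FROM"], env["WI4MPI_TO"])  # MPICH OMPI
# then e.g.: subprocess.run(["srun", "./my_app"], env=env)
```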
We have a few basic tutorials about the installation of Wi4MPI and how to translate MPI dynamically using either the preload or the interface mode, and a few examples with Python, with GROMACS, which is a molecular dynamics code heavily used in the HPC community, and we have examples with containers. The container tutorials we have at the moment use Podman. The only reason we use Podman is that it's the easiest to install almost anywhere; you don't need root privileges to use it. But even though the tutorial is specific to Podman, the idea stays the same, so if you have another container runtime that you like that is not Podman, you can apply what we did with Podman to your specific use case. Wi4MPI started in 2016 at CEA, and it's still in active development. It's, of course, open source, under a dual license, CeCILL-B plus BSD-3, to be compliant with both French and international law. All our developments are validated using a CI that runs well-established benchmarks, especially MPI benchmarks like the OSU benchmarks and IOR, and more user-oriented benchmarks such as GROMACS. It's also an ongoing collaboration between CEA and the Lawrence Livermore National Laboratory in the US that started in 2020; we have publications together, and last year we gave a tutorial at ISC, and we hope to host another tutorial this year at ISC in May. Regarding the support and the limitations of our library: we support x86 and ARM architectures, GNU/Linux, and BSD; it was tested recently on FreeBSD. We also support, of course, the C and Fortran interfaces of the MPI 3.1 norm. In terms of limitations: for it to work, you need dynamic linking; if you compiled your code statically, it won't work. It's better if you can avoid or circumvent rpath, because our idea is to inject the library at runtime. And we are not supporting the timeout feature on BSD distributions.
We added a few features in Wi4MPI that are not defined in the MPI norm, and in particular we have this timeout feature that allows you to set a timeout on a specific MPI call, which is very interesting for debugging, actually. For the translation of some constants that define the maximum length of some strings, you may have truncation, but with all the tests we did so far, it was not an issue per se. And lastly, the MPIX functions, which are the experimental functions implemented in the different MPI implementations: we deal with those on a case-by-case basis, so they are not all supported; we have started to support a few of them. So now, to give you a glimpse of what you can do with it. The first example I wanted to show you is something that happened to us: we had a GROMACS version (GROMACS, once again, is a molecular dynamics code used in HPC) compiled against MPICH, and on the target cluster we were using, it could run only on GPU; we had an error on CPU. This is due to the fact that in GROMACS there is a call to MPIX_Query_cuda_support, which just checks whether you have GPUs. But the way it's implemented in MPICH is that if you don't have GPUs, it crashes. Other implementations don't do that; they just tell you: no, you have no GPU, sorry, and then you can do whatever you want. But the way they did it for now in MPICH is that it crashes, so we couldn't run this code on CPU. So we used Wi4MPI to run the code, going from MPICH to Open MPI. And here you see the results: using MPICH, it fails; using Wi4MPI, we get some performance; and we actually recompiled the version of the code against Open MPI, and the performance is really similar. The other use case I wanted to show you is with containers. The idea is to have an MPICH-based container in which an OSU micro-benchmark was compiled.
We did some comparisons between two AMD Milan nodes at TGCC, which is one of our clusters at CEA. In everything I will show you, I compare an execution using Open MPI directly on the cluster with the OSU micro-benchmark; an execution using a container in which Open MPI is available, plugged into the Open MPI of the cluster; and, last, a container with MPICH, where we use Open MPI through Wi4MPI. The first graph here shows the init time for those three cases, and you see that it's very comparable: we have the same results, it takes the same time to init MPI with Wi4MPI. Here is the bidirectional bandwidth, and it's the same: the results are very comparable. Another example is an allreduce; here all the cores of the two nodes we used were participating in the communication, and we have very comparable latencies between those three cases. So the good point is that the overhead of Wi4MPI in those tests is really minimal. Now, in conclusion. For the future, the good news for HPC, and the bad news for us, is that a standardization project for the ABI layer started last year. It's really great, because it will greatly help all HPC users: there will very likely be a common ABI defined in one of the next norms. You can refer to the Hammond et al. paper from 2023. And we can actually reasonably hope for convergence, because nowadays there are two ABIs that cover more than 90% of HPC platforms: MPICH and Open MPI. The plan is to have a single-feature, ABI-only release for MPI 4.2; at some point they were talking about MPI 5, but in the end it should be 4.2. They are hoping for a draft for SC24, so more or less one year from now. There is already a prototype available in MPICH, and there are also lots of ideas regarding this common ABI implemented in Mukautuva, from Jeff Hammond. I put in the links if you want to have a look at it.
And if you want more info on that, you can check the MPI ABI working group on GitHub. The good thing with this standardization effort is that Wi4MPI is actually cited as a reference implementation. So, to conclude: Wi4MPI, in a nutshell, is a library that helps you switch between different MPI libraries. It allows greater portability and better flexibility for HPC applications, including containerized apps and proprietary software. Its usage is mostly transparent, and in most cases we studied so far there was no significant overhead, which is also a good thing. The library is still evolving: in the years to come, we hope to have MPI 4 support, and we would like to support the Mukautuva ABI, the Jeff Hammond project, which should be close to the common ABI defined in the future norm. And of course the project is open, so we are waiting for your contributions. In conclusion, I also want to thank all the people who contributed to Wi4MPI, especially at CEA, at the Lawrence Livermore National Laboratory, and at Eolen, which is a company we are working with. Thank you all. Okay, perfect timing. Questions? Yes, thank you for the presentation; I have several questions. First, Fortran: which version do you support? So, the idea is to support the Fortran API of MPI, so we have no limitation in terms of supported Fortran, but it's not the part that is the most tested. So if you have Fortran use cases, you're welcome to try it, and if it works, great. If you have any issues, open issues on GitHub; we try to be very proactive on the GitHub issues. Okay, thank you. The next question is about the ILP64 ABI, because a lot of programs are compiled in ILP64 mode, for example with -fdefault-integer-8. Do you support this feature?
Once again, the idea is that if it's working with your MPI implementation, if you can actually do it with a compatible MPI implementation, it should work. But in some cases you have to compile your library and the ABI in kind of the same way. Maybe I didn't state the problem well: a lot of implementations have two wrappers already, so initially you go through a wrapper which translates from ILP64 to LP64, and only after that does it make the traditional calls. So do you support this feature? I'm not entirely sure; we should discuss it afterwards. Yeah, and the last question is about the runtime, because, for example, we have some additional parameters to mpirun, and some programs like ORCA or MRCC do internal calls to the runtime, I mean they just exec mpirun. Did you do something with this feature? Because, for example, Intel MPI supports some additional flags, and Open MPI also supports some additional flags, and these differ between the two implementations. Same thing: I'm not entirely sure, because for quite some time we actually didn't support mpirun, partly for this kind of question, and also because on our clusters the norm is to use srun directly. So there were a few things in this regard that we didn't support. We have started to support mpirun, but I'm not entirely sure of the level of support we have for specific things you could find in one implementation versus another. Thank you. More questions? Hi, cheers for that. I wanted to ask: you mention some example programs and benchmarks that have performance portability here. What HPC applications are you targeting with this that require the performance portability? Because if it's a GROMACS application that wants to use all of the system, typically that code will run for years on the system and be optimized for it anyway.
So at which point do you need something that works on multiple systems, and at which point do you need something very specialized, where you wouldn't actually need a wrapper in between? I'm not sure I understand your question. So, what applications use this in the real world, on the clusters that you're working on? Which actual programs require the translation from one MPI to another? The only ones for which we really had to use the wrapper, with no other choice, were commercial softwares. And especially, when I talked about bleeding-edge systems: we have some systems with the BXI interconnect, which is an Eviden interconnect, and they were supporting only Open MPI. The thing is, some commercial vendors distribute their software built against a specific MPI implementation, and they have no interest in supporting another one. So if we wanted to really be able to use those commercial softwares on our system, we had no other choice than using Wi4MPI. That is the only case in which it is really, really mandatory. For other cases, in most cases, it's actually more a comfort, in some sense, because it helps you to debug quickly, it helps you to test things more easily. If you had an infinite amount of time in front of you, you could do anything; you could always recompile everything. But we don't have an infinite amount of time in front of us. Okay, thank you. Any more questions? I have a quick one: have you looked at MPItrampoline? That's a very similar project, no? Yes, MPItrampoline appeared a few years ago. The main difference between MPItrampoline and Wi4MPI is that MPItrampoline only gives you what we call the interface mode in Wi4MPI: you can compile against MPItrampoline and then use another MPI implementation, but you can't bring your own already-compiled code and run it directly with MPItrampoline.
But yes, in the past few years, there were a few projects with ideas similar to what we did with Wi4MPI. And it's interesting, because for quite some time people didn't really care about this ABI incompatibility issue, and we see now that there is some interest. That's also why, at some point, the MPI Forum decided it was time to have a common ABI. So it's great to see other projects like that emerging. All right, I think that's it. Thank you very much. And if people like the project, there are some stickers here in front as well.
PyPartMC: engineering Python-to-Fortran bindings in C++, for use in Julia and Matlab
We'll get started. Sylwester will introduce us to PyPartMC. Thank you for coming. I'm Sylwester Arabas. I work at the AGH University in Kraków, Poland, and this is a project carried out together with a team from the University of Illinois at Urbana-Champaign in the US. So PyPartMC is the highlight here, but from the perspective of this conference, I should probably read the subtitle, namely: how to engineer Python-to-Fortran bindings in C++, for use in Julia and MATLAB, and why to do it. The package that this tool interfaces is called PartMC. It's a Monte Carlo simulation package for aerosols, particles that are, for example, floating in the air. It's an open-source tool developed for more than 20 years at Urbana-Champaign. And just one line about the physics: usually it's a kind of box model, studying processes without a spatial context, but it also has an option to be coupled with the WRF weather forecast model. So here is the HPC context. It simulates things like the evolution of air pollution due to collisions of particles, condensation, chemical reactions, et cetera. On the technical side, it's an object-oriented code base written using a quite classic subset of Fortran, but still in a very much object-oriented manner. And despite 20 years of heritage, it has a very comprehensive test suite; I would say it could be an example of best practices in Fortran. However, its usage poses several challenges, for example to students who intend to start using it from, say, a Jupyter notebook. These challenges are related with, first of all, multiple dependencies; the need to compile it; no ready workflow for getting updates; automation of simulations, analysis, et cetera, usually involving shell scripting; and input/output handled through multiple text files.
To analyze the output of these simulations, one usually needs to actually look at, or use, some of the Fortran code the simulation is based on. So the question posed when we started was how to bring together two seemingly separate worlds. On one side, there is the simulation package, PartMC, with its Fortran code base, a bit of C code, and various dependencies. On the other, the perspective of a modern student, let's say, who starts with Jupyter and expects basically everything to be importable and interoperable with other libraries: SciPy, NumPy, et cetera. So the goals were to lower the entry threshold for installation and usage; to ensure the same experience on different operating systems; and to streamline the dissemination of studies based on the simulation tool, for example for peer review with scientific journals. As for the status of the project: after two years of development we released version 1 of PyPartMC, these Python bindings; it's on PyPI, and we also published a description of the package in the SoftwareX journal. So we are ready for a rollout, and today I will talk more about the internals. The internals start with pybind11. Despite the fact that we are talking about Python and Fortran, we picked pybind11, which is a C++ tool for developing Python packages, as our backbone. Some highlights: for those who are new to it, the project is quite a remarkable success, I would say, with over 300 contributors on GitHub, 2,000 forks and 14,000 stars; congratulations to pybind11. And it's very useful. It fits into the picture here: essentially, we developed in C++, in C and in Fortran, so it's a triple-language project, something that uses pybind11 and a few other components to automate building PartMC and offering the Python package.
Probably also worth mentioning here is that most of the work on PyPartMC was around substituting the text-file input/output with a JSON-like, Pythonic input/output layer. And, as I mentioned, the original project has an object-oriented structure, so we also tried to couple Python's garbage collector with the Fortran functions that are provided for creating and deallocating objects. And the project has many, many dependencies, in Fortran, in C, in C++; here, let me just mention that we picked Git submodules as the tool to pin the versions of these dependencies, which is useful because the pip install command is able to grab packages from a Git repository, and this includes all the submodules with their pinned versions. So let me now present a bit of code and how it looks from a user perspective. In this example, please don't look at the particular lines of code, maybe just at the bulk of the code, and the type of code. Here on the left, we have the Fortran "hello world" for using the PartMC package, and on the right, three text files that would be the minimum to start the simplest simulation. And this is the end result using the PyPartMC layer: essentially the same can be obtained with a single file, starting with an import from the PyPartMC wrapper, and then using this kind of JSON-like notation, essentially lists and dictionaries that are wrapped. One big advantage of using Python is that by providing Python wrappers, you are catering also to Julia users: here, through the PyCall.jl package, essentially the same code and the same logic can be used by Julia users with PyPartMC. And finally, an example using MATLAB, which ships with a built-in Python bridge, which also allows PyPartMC to be used to access the Fortran code from MATLAB.
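The garbage-collector coupling mentioned above might look roughly like this: tie the lifetime of an opaque Fortran-allocated handle to a Python wrapper object, so that collecting the wrapper triggers the matching Fortran deallocation. The alloc/free functions here are Python stand-ins for the real C-binded routines, and the class name merely echoes PyPartMC's style.

```python
# Hedged sketch: couple Python's garbage collector to foreign
# allocate/deallocate routines via a finalizer. fortran_alloc/fortran_free
# are mocks, not the real C-binded PartMC functions.
import gc
import weakref

FREED = []

def fortran_alloc():        # stand-in for the C-binded Fortran constructor
    return object()

def fortran_free(handle):   # stand-in for the C-binded Fortran destructor
    FREED.append(handle)

class AeroData:             # wrapper in the spirit of PyPartMC's classes
    def __init__(self):
        self._handle = fortran_alloc()
        # fires when the wrapper is collected, not only at interpreter exit
        self._finalizer = weakref.finalize(self, fortran_free, self._handle)

obj = AeroData()
handle = obj._handle
del obj                     # wrapper collected -> Fortran memory freed
gc.collect()                # no-op on CPython; helps on other runtimes
print(len(FREED))  # 1
```

The point of `weakref.finalize` over `__del__` is that it runs reliably even during interpreter shutdown and never resurrects the object.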
These three examples I've shown are actually part of our CI: we have them in the readme file, and on CI we execute the Julia, the Python, the Fortran, and the MATLAB examples, upload the output as artifacts, and there is an assert stage that checks that the output from all these languages matches. By the way, the timings here are essentially compilation and setup; it's not that Fortran takes much less time to execute: the execution is always done through the Fortran code base and binary, but clearly compiling just the Fortran code is faster than setting up the Python, Julia, or MATLAB environment. And how does it work in practice, looking at the code? This diagram might not be perfectly visible, but the right column is the C++ layer, here is the C layer, here is the Fortran layer, and here is the user code, in either Julia, MATLAB, or Python. The different color is to depict the package that we are interfacing with. So if we start with this readme code here, the user's Python code, we have an import and the instantiation of a single object of the AeroData class as an example. What happens when we call it? First it goes through, barely visible, I guess, this kind of outer layer of the C++-implemented Python package. And now, I hope it's more visible: this is how one works with pybind11. This is the C++ code where we define a module for Python; creating a Python class from C++ code looks roughly like this, with some templates defining the class that we interface, how to handle memory allocation, and defining particular methods. Here there is an init method, a kind of constructor, and this constructor, when called, goes through C++ code, this AeroData class that we wrap, but on our way to Fortran we quickly need to go into what is written up at the top: C-binded signatures for the Fortran functions.
They cannot take exceptions; exception handling across these languages is essentially undefined behavior, depending on the compiler. This is how it looks from the C++ perspective. When we look now at the C signatures here at the top, they match what is later defined in Fortran with Fortran's built-in C-binding module. Wherever you see this bind(c) or these c_ types, they ensure, within the Fortran code, that we can access this code from C, and each of these routines, written for our wrapper, is essentially a thin wrapper that calls the original Fortran routines that we wanted to wrap; for example, the one below, spec_file_read_aero_data. So now we finally get to the wrapped code. This is the unmodified code that we access, and it sits in a Git submodule of the PyPartMC project. Now the fun starts when this Fortran code actually calls its input/output layer: usually a simulation takes something like 20 different text files to be read, and these text files are nested. So what we've done is replace one of the components of the original Fortran package with our own implementation, which starts in Fortran, then goes through a C layer back to C++, which then uses a JSON library for C++ that helps us get very readable C++ code. This was our solution for replacing the multiple text files with what, from the user perspective, are essentially in-memory MATLAB, Julia, or Python objects. We also have online documentation for the project, generated from the source code, and as you can see here, for example, the types are hinted correctly. So despite the fact that in Fortran, in principle, the parameter ordering is the key, we do inform Python users about the types of the arguments.
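The input-layer idea just described, a nested JSON-like Python structure replacing several cross-referencing text files, can be sketched as follows. The keys and values are made up for illustration; see the PyPartMC readme for the real schema.

```python
# Hedged sketch: the nested, JSON-like structure a user writes instead
# of the ~20 nested text files the Fortran code used to parse itself.
# Species names and field names are illustrative, not the real schema.
import json

scenario = {
    "gas_data": ["H2SO4", "HNO3", "HCl", "NH3"],
    "aero_data": [
        {"SO4": {"density": 1800.0, "kappa": 0.65}},
        {"NO3": {"density": 1800.0, "kappa": 0.65}},
    ],
}

# the binding hands such a structure down through C to Fortran, instead
# of having the Fortran code open and read the text files on its own
serialized = json.dumps(scenario)
print(json.loads(serialized)["gas_data"][0])  # H2SO4
```

Because the structure round-trips through JSON, the same scenario works unchanged as a Python dict, a Julia Dict via PyCall.jl, or a MATLAB struct via the Python bridge.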
To start the summary, here is what we achieved with the PyPartMC wrapper. We have a single-command pip installation on Windows, Linux, and macOS, with the exception that on Apple Silicon we are still struggling to get it done, and help is welcome if any of you is a Fortran hacker who could help us produce universal binaries. We provide access to the unmodified internals of the underlying PartMC package from Python, MATLAB, and also C++. So as a side effect, a by-product of this goal of providing a Python interface, we also got Julia, MATLAB, and C++ layers. Something that might not have been obvious from the original plan, and that we ended up using extensively, is that this provides us with a nice tool for the development of other Python packages, because we can use PartMC in test suites to verify against an established simulation package. Also, it's maybe a non-trivial way to use pip, but since C and Fortran are not the technologies where you see mainstream package managers being established, we managed to ship Fortran code to users of Windows and several different variants of Linux as binary packages through pip. So it's essentially one way of thinking about the PyPI.org platform. And from the point of view of what I mentioned earlier, it provides students or researchers using this package with a tool to disseminate their research workflows, including input data and the output data analysis workflow, in a single Jupyter notebook, for example for paper peer review. And finally, PyPartMC allows you to extend the Fortran code with some Python logic. Since we expose the internals of the package, the time stepping of a simulation can actually be done from Python. So if you have 10 different steps of the simulation done in Fortran, you can add an 11th one that is in Python, Julia, or whatever.
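The hybrid time-stepping idea, Fortran steps plus an extra Python step, can be sketched like this. The compiled step is stubbed out here (a hypothetical stand-in; the real one would call into the wrapped Fortran binary):

```python
def fortran_step(state):
    """Stand-in stub for one compiled Fortran time step; in reality this
    would go through the wrapper into the Fortran binary."""
    state["t"] += 1.0
    return state

def python_extra_step(state):
    """An extra step written in pure Python, inserted into the same loop
    that drives the compiled physics."""
    state["extra_runs"] = state.get("extra_runs", 0) + 1
    return state

def run(n_steps):
    state = {"t": 0.0}
    for _ in range(n_steps):
        state = fortran_step(state)       # compiled physics step
        state = python_extra_step(state)  # user-added Python logic
    return state
```

Because the time loop lives in Python, the extra step can be anything: diagnostics, coupling to another model, or logic that the Fortran code never anticipated.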
And the final point, probably one of the key things here, is that having statically linked all the dependencies, we can actually use the package on platforms such as Colab or the Jupyter Hubs of various institutions by just doing pip install and importing what would otherwise require getting a lot of dependencies and a lot of compile-time stuff available. Take-home messages. I wanted to underline that pybind11, despite being a C++ tool, is actually a valuable thing for interfacing Fortran with Python. This is linked to the fact that pybind11 offers CMake integration, so your C++ project can have build automation in CMake, and CMake handles Fortran well; this was the key thing here. The glue-language role of Python is, I think, nicely exemplified here with Julia and MATLAB, including in CI. Static linkage of the dependencies was essential for us, for example because there is no standardized ABI for Fortran; different versions even of the same compiler have binary incompatibilities, and this was essential to get it working on platforms such as Colab or other Jupyter Hubs. But it prevented us from publishing the package on Conda, due to the Conda policy of no static linkage. We've used more than 10 Git submodules for tracking our dependencies from the GitHub repo. As I mentioned, help is welcome in getting the universal binaries generated with gfortran. The CI using MATLAB is possible thanks to the MATLAB actions: MathWorks, the producer of MATLAB, offers GitHub Actions for CI that do not require any MATLAB license. So if one wants to run MATLAB code on GitHub, this is important, and I just wanted to thank them.
And finally, a fun fact, or a positive thing: when we submitted the paper about the project to the SoftwareX journal, the reviewers during peer review indeed tried the code and provided us with feedback that also helped. So it was positive that it did work. Let me acknowledge funding from the US National Science Foundation and the Polish National Science Centre, and thank you for your attention. Any questions? Yes, thank you for the presentation. My question was: what exactly did you keep in Fortran, and what did you pass to the Python side? Is it arrays, or just single values? So the question, if I understand correctly, is about what kind of data we are passing around during the simulation. The Monte Carlo simulations here track particles in a kind of attribute space that captures their physical and chemical properties. It's usually a 20- or 30-dimensional attribute space that is randomly sampled, so we have vectors of these particles in this attribute space; usually this could be from thousands to hundreds of thousands of particles, and each particle has around 30 attributes. From the Python perspective, usually the user does not really use the raw data of the simulation, the state vector, just some aggregate information, which is passed back to Python as enumerables that can be used with NumPy, but we don't actually assume that it must be NumPy, so one can use just lists if they are enough. I hope that answers it. My question is because we need some raw data from the Fortran side on the Python side, and then it's just some two-dimensional matrix; here we have some problems, because we need to know where we keep the data. We are not exposing particle locations in memory; they are always returned as new objects to Python, because it is never the state vector of the simulation, just some aggregate information that characterizes it in a simpler way. So usually we have just a one-dimensional enumerable.
Then for you it's much simpler. Thank you. Time for one more question, if there is one. Okay, if not, we'll wrap up here, because apparently there's a queue outside to get in for the next talks. Thank you. Thank you very much.
Feeding ML models with the data from the databases in real-time
All right, it's two o'clock, we'll get started. Hello, my name is Vojtěch Juránek. I work as a developer at Red Hat, and in this short talk I would like to discuss one approach for ingesting data from your databases into a machine learning pipeline in real time. First, this is how a machine learning pipeline can look. I will be speaking about the beginning of this pipeline, basically here: how to get the data from your databases into the machine learning pipeline. It can be as complex as this, but I will simplify. Just imagine you have your application running, you insert data into one or several databases, and now you want to take advantage of some machine learning, or maybe AI, and do, for example, prediction on your data. You want to solve the question of how to actually feed the data from your databases into a model, and especially, when new events are coming in, how to do, for example, online predictions. It might sound simple, you just do some selects from the database, but when you have quite heavy traffic and you run the selects in a loop, you will quickly find out that you overload the database with the selects. And if you add more conditions, like not wanting to miss any data and wanting a consistent view of your data, you will shortly find out that it's a pretty hard problem. There are several ways to tackle this problem, but the one I would suggest, in my opinion the best one, is change data capture. What does it mean? As the name suggests, change data capture is the process of observing some resource for changes; if there is any change, you extract it and propagate it further, typically into some messaging system. What does that mean in terms of databases?
It typically means observing the transaction log of the database, because the transaction log is typically the source of truth for the whole database: all the changes that happen to any data in the database are recorded in the transaction log. So basically you observe the transaction log of your database, and if there is any change to some data you are interested in, it can be one table, the whole database, whatever, you extract this change from the transaction log and send it, for example, into some messaging system. This probably sounds good, but maybe you ask yourself whether it's easy to implement. Fortunately, you don't have to solve these questions yourself. You can use Debezium, an open-source platform for change data capture. It's pretty mature. It has connectors for most of the popular databases, and it currently comes in two flavors. It originated as a Kafka Connect source connector: you deploy the Debezium connectors into Kafka Connect, and they connect to one or several databases you are using, extract the changes, and send them to Kafka, where you can use sink connectors to do whatever you want, like, for example, updating a search index or invalidating a cache. You can, for example, also use it for replicating one database to another, but in our case you will probably want to push it into some data lake, data warehouse, feature store, or maybe even directly into some machine learning model. If you don't want to use Kafka for whatever reason, you can use Debezium Server, which is a standalone process that basically does the same but allows you to push the events into whatever system you like, such as Apache Pulsar or Google Pub/Sub, and it can even be, for example, an HTTP endpoint. And if you are missing any sink for Debezium Server, it's pretty easy to implement your own. Debezium provides some other features too; for example, it's capable of taking a snapshot of your database at any time.
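The log-based capture idea above can be illustrated with a toy sketch; this is not Debezium's actual API, just the core mechanism: remember an offset into the transaction log (like a replication slot position) so no event is missed or read twice, and emit only the events for tables being watched:

```python
class ToyCdcReader:
    """Toy sketch of log-based change data capture (names and shapes are
    illustrative, not Debezium's API)."""

    def __init__(self, tables):
        self.tables = set(tables)
        self.offset = 0  # position in the log, like a replication slot

    def poll(self, txlog):
        """Return new change events for watched tables, advancing the
        offset so each event is delivered exactly once."""
        events = [e for e in txlog[self.offset:] if e["table"] in self.tables]
        self.offset = len(txlog)
        return events
```

A real connector would then forward each event to a broker such as Kafka; the key property shown here is that repeated polls never re-deliver or skip events.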
It can also transform the records before sending them out into the messaging system, and so on. So, back to our problem. Unfortunately, this talk is too short to go through the whole example. On the FOSDEM page for this talk there is a link to our blog post where we describe in detail a use case where you store images of handwritten digits in the database. Whenever an image is stored there, you extract the change using Debezium, send it to Kafka, and from Kafka to TensorFlow, which recognizes what number is written in the picture. So it's a well-known example, and everything works in real time. As I said, we also have a full example on GitHub. But what I would like to show you here is that it's really simple. It basically consists of deploying Debezium and configuring it, and it's really just one page of JSON config. Here you just provide credentials, and the more interesting part is here: some transformations. There is one predefined transformation where I extract only the content of the newly inserted image, and because there is a caveat when you use TensorFlow with Kafka, it cannot correctly interpret the raw bytes, I transform the image into a string, which is later on parsed in TensorFlow. I would have to do it in TensorFlow anyway, so there is no overhead. But I can also define my own transformation here, and it's just a couple of lines of code that convert it into a string. On the TensorFlow side it's similarly easy; it's again one page. Here I define the decoding function, which decodes the string that it retrieves from Kafka. Most of the code is just defining the Kafka endpoint, and it's about three lines to push it into the model, which recognizes what number is in the image and produces a result you can consume further. So, as I said, if you are interested, please go to our website and GitHub and take a look. And basically, that's it.
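The bytes-to-string round trip described above can be sketched as follows. The talk only says the image is "converted into a string", so the choice of Base64 here is an assumption for illustration, not necessarily what the blog post's transformation uses:

```python
import base64

def encode_image(raw_bytes):
    """Sketch of the producer-side transformation: turn raw image bytes
    into an ASCII-safe string before sending the record through Kafka
    (Base64 is an assumed encoding, chosen for illustration)."""
    return base64.b64encode(raw_bytes).decode("ascii")

def decode_image(s):
    """Counterpart run on the consumer (TensorFlow) side, recovering the
    original bytes before feeding the model."""
    return base64.b64decode(s)
```

The important property is that the encoding is lossless: whatever bytes go in on the database side come back out identically on the model side.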
So, to sum up, Debezium is able to take a snapshot of your database and load existing data from your database into a messaging system, or directly into TensorFlow. And it can retrieve any change: once anything is stored in the database, it can immediately extract this change and send it further into your pipeline, so you can do real-time analysis of your data. And it basically works for many databases; you can deploy it and do more things with it. So that's all. If it sounds interesting to you, try out Debezium, and please share feedback with us on Zulip or the mailing list. We have a pretty large, vibrant community, and we will appreciate your feedback. Thank you so much for your attention. Thank you very much, Vojta. We have time for one question. Is this working? One question. Someone come up with one question. Come on. If no one, and questions pop up later on, just catch me in the corridor or elsewhere at the conference. Thanks for the presentation. My question is: is there any database that already provides what Debezium does by default? Can you repeat? I can't hear. Yeah, sure. Can you please stop moving so we can hear the question? Thanks for the presentation. My question is: is there any database that natively already does the change tracking that Debezium does? What do you mean by natively? Without any external tool like Debezium, is there any database that does this already? Well, it again boils down to what "natively" means, because we typically leverage some native features of the database. For example, for Postgres we use a replication slot, and we just read the replication slot from Postgres. So you always need something which reads from the database, or, for MongoDB, from the change stream.
So the database usually provides this natively, but you need something that will read it and translate it into something usable, something that will parse, for example, the data you get from the Postgres replication slot, and so on. So yeah, that was the question, actually: is there any database that does this anyway, without using Debezium? But from what you said, I think there is no competitor then. Pardon? Are there any competitors? Yes, like, is any database already doing natively what Debezium does? I'm not aware of any database that does this itself. I'm aware that some databases provide this, but several of them use Debezium under the hood, as far as I know. Okay, perfect. Thank you very much, Vojta. Thank you.
HPC Container Conformance
Okay, next lightning talk. We don't have a lot of time to switch between speakers, so please take a seat. The next lightning talk is Christian, who's notorious for being very good at staying on time. I did a very good job once; I still benefit from that. Yeah. So if you see me, I talk about containers a lot. This time I would like to give an update on the HPC Container Conformance project, which I started last year and which got a little addition when we created an OCI working group together with some other folks. So what is the problem, or maybe let's just call it challenges? Everyone knows modules, right? Even if you're new to containers, if you use native code, you most likely use modules to figure out the best binary for your program on the current system. So you do module load gromacs, and the module system will pick the best binary for the system you are on. It's a runtime decision: you have a bunch of software on a software share, and it will just pick the best one. Problem or not, I think it's a good thing. With containers, you don't want to have a lot of binaries or different variations of binaries within the container; you want one, right? A single set of libraries and a single binary for a given problem. So what we ended up doing was to create multiple containers for different systems, let's say for the CPU, like Graviton, Skylake, Zen 3, or maybe we even use a name to identify a cluster we are running on. That's fun, but the problem is: how do you pick the correct image? Within the container space, you have something called an image index, which is just a matching artifact that says: okay, you are on an Arm system, you get this image; you're on an AMD or Intel or x86 system, you get this image; and if you are a wasm guy, then you even get another one. But the thing is, that's not fine enough, right?
It's very coarse-grained. You cannot just put all your different x86 builds in there. So what we actually want is an image index that is more specific, so that it can say: this CPU and this accelerator give me this image; if I have that CPU, I get another one; maybe even configured with MPI in mind, so that if I have MPICH in this version, I get this image, and if I have Open MPI, I get a different image. You get the idea: have a very, maybe long, image index with different variations, and then you pick the best image. Another thing I didn't mention on the first slide is that runtimes will go through the normal image index and just pick the first match they get. So even if you have an image index with five different x86 images in it, the runtime will just pick the first one that matches, and off you go. With this, of course, we cannot do what we want; we need to go through all the versions we have, all the different specific images, and then the runtime ideally picks the best image for you. Okay, so I did some hacking back in the day: I used an unused feature in the image index to add some identification, so it would say, okay, this is a Broadwell with this NVIDIA driver, and I hacked the Docker runtime to also recognize which image is the best match for the specific platform you're on. With this ugly hack you were able to create an image index with a lot of different images for different systems, and then you configured your runtime to search for a specific tag list, if you will. That was hacky, and I didn't intend it to be more: I created a pull request for Docker, which of course was turned down, because it's ugly, right? And what's ugly about it is that you need to implement it in every runtime, you need to implement it in every scheduler to make sure it works, and that was of course bogus.
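The finer-grained matching argued for above can be sketched as a scoring function over an extended image index. This is an illustration only: the field names are hypothetical and are not the OCI working group's actual schema, and real runtimes do not work this way today:

```python
def pick_image(index, host):
    """Sketch of finer-grained image-index matching: instead of taking
    the first architecture match, score every entry against the host's
    CPU microarchitecture, accelerator, and MPI flavor, and return the
    most specific compatible image (field names are illustrative)."""
    best, best_score = None, -1
    for entry in index:
        plat = entry["platform"]
        if plat["arch"] != host["arch"]:
            continue  # hard requirement, as in today's image index
        # An entry stating a specific requirement the host lacks is out.
        if any(plat.get(k) is not None and plat[k] != host.get(k)
               for k in ("cpu", "accelerator", "mpi")):
            continue
        # Each matching specific detail makes the image a better fit.
        score = sum(1 for k in ("cpu", "accelerator", "mpi")
                    if plat.get(k) is not None and plat[k] == host.get(k))
        if score > best_score:
            best, best_score = entry["image"], score
    return best
```

A generic x86 image (all optional fields None) then serves as the fallback, while a Zen 3 + Open MPI build wins on a matching host.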
So what we did, as I said, last year or the year before, was create the HPC Container Conformance project to establish best practices and provide some guidance on how to build images for HPC and how to implement the use of those images. The first thing, which is very brief, is how we expect an image to behave. There are two types of containers: application containers, as I call them, and login containers. An application container is when you have, for instance, a binary and you set the entrypoint to be this application; you can create an alias that just runs some program within a container without you knowing about it. For instance, instead of running a binary, you point to an alias and then you run it. The problem here is if you want to debug things, it's hard, because the entrypoint is always tricky to get rid of, or at least I need to look up the Docker command every time. The other thing is that a lot of HPC applications have multiple binaries you want to run; maybe you have a pre-processor, the application, and a post-processor. In this case you would have three different images, because the entrypoint is different. So that's kind of ugly; for HPC it's not really usable. What we actually want is a login container: you start the container and it drops you into a bash. That way you can just augment your script and say docker run or singularity run or whatever to execute the gmx command, for instance; you can just run it within a container and it just works. Another aspect that's very important, but hopefully everyone does it anyway, is that the container, if you use a shared file system, needs to be user-agnostic, right? You cannot rely on a certain user within the container.
So you should make sure that the container is able to run as nobody, because the user ID and group ID will be injected from outside to grant access to shared file systems, so that the process is owned by the user outside of the container, and the container itself has no knowledge of the actual user. Okay, so that's how we expect a container to behave, and I think that's common and already understood. Last time I talked about annotations; that was an idea of us HPC guys and girls, simmering in our own soup, trying to come up with something to put forward. That was a nice exercise, but in the end we jumped on the train of the image compatibility working group at the OCI. And you might ask, though hopefully a lot of you know it already: what is the OCI? It's the Open Container Initiative. It maintains the relevant specifications about containers: what an image looks like, how runtimes interact with images, distributions, and registries, the distribution specification, so how registries work, and security stuff. It's a body that maintains the specifications, and we formed a working group together with others called image compatibility. As I discussed in the beginning, we want to extend the manifest list, the image index, so that you can not only pick by platform and architecture, but extend it to what I described as the desired state for the image index, so that we can pick the right image, an optimized image, for a certain application. And of course we want to express what the image was built for, what we expect from the host, what runtime we might want to use, and so on. All this cool stuff we want to incorporate. And why is this a better way? I mean, we HPC folks like to do our own thing, right?
And we are kind of special and snowflakey, but this is of course a better way, because we interact with the OCI community and we put it in front of them, so that we can take into account other things. For instance, wasm is a thing; I haven't used it, but it seems to be a thing, and it's a runtime within the container ecosystem. And of course we also have different runtimes, right? We have Singularity, Apptainer, Sarus, what have you. Picking one runtime over another is something that we are interested in, and the wasm folks are interested in it too. Say you have a Kubernetes cluster and you have an x86 image for an application and a wasm image for an application; maybe you want to pick one or the other depending on the conditions. So they want this, and we want this as well. Scheduling, registries... of course HPC is great, but container tech is much wider than HPC, to say the least, and we want to make sure that we align with Kubernetes, that the registries are aligned with us, and the OCI working groups have a well-oiled machine of standardization, so that's also very good. Okay, where are we now? We are discussing use cases, and while discussing the use cases we already brainstormed some implementation ideas. We came up with a couple of use cases, or stand-in stakeholders, let's say. The first one, of course, and we are all building images, is the image author: if you build a container image, you want to define this compatibility definition that we want to propose. Ideally it's implemented with an EasyBuild spec or Guix, so that you don't need to do it yourself; all this stuff can be put there, and Vanessa already wrote a little tool for that.
The other is of course the system admin, who wants to make sure that the system he is maybe procuring is able to run the container. You just go through all the compatibilities and figure out what works and what doesn't, all this good stuff, and you also want to make sure that the configuration of your system is actually able to run this image. The end user just wants it to work, right? So we need to make sure that the system admin, the image author, and the other stakeholders come together and conclude on a certain configuration; that's what this wants to do. There are other use cases; I don't have time to go through all of them, but we have a list of use cases we are going through currently. Our meeting is every Monday, and if you want to join, please do. I have some links; there are resources; the slides are available online. If you want to get in touch, we have an HPC container Slack, we have an OCI Slack channel, and there is an HPC social Slack channel as well, if you want a more general overview. And if you're at ISC, make sure to join our high performance container workshop. It's the tenth edition, so we've been doing it for 10 years now, which is pretty cool. And we have a friends-of-containers boat trip, so if you'd like to meet container guys and girls, make sure you mark your calendar for the 13th of May. Yeah, that's it. Thanks. And I think I'm good on time. Awesome. Maybe do I get a sticker if I do it three times in a row on time? You get a beer. Oh, even better. Right. We have time for one question.
Automating Spark (and Pipeline) Upgrades While "Testing" in Production
Okay, that's it. Please take a seat and we'll get started. So Holden is going to talk about automating Spark upgrades and also lots of testing in production. That's going to be interesting. Testing in production is the best place to test when the alternative is no tests, which it often is. Okay, cool. So let me know if you can't hear me, because I'm very easily distracted and get excited, and I might not notice that I'm not talking directly into the microphone, so please grab my attention if I screw up. So yeah, I'm Holden. My pronouns are she or her. It's tattooed on my wrist; super convenient when I wake up in the morning. I'm on the Spark PMC. You can think of this as like having tenure, except it doesn't guarantee I get paid, it just guarantees that I have work to do, so it's like the shady version of tenure. And I've worked at a whole bunch of different companies, not super relevant, but I've seen a lot of mistakes made in production, and I have made a lot of mistakes in production, so you can learn from some of my mistakes. My employer who sent me here, Netflix, is hiring, and I would be remiss if I did not mention that. They're actually finally hiring remote people after who knows how many years. I'm a co-author of a bunch of books; some of them are related to HPC-ish stuff. I get the highest royalties on Scaling Python with Ray, so I think it's a fantastic book and everyone should buy several copies with your corporate credit card. If you don't have a corporate credit card, the internet will provide. You can follow me on social media, and there are lots of pictures of my dog, if you're into that stuff, and a lot of complaining about American healthcare; if you enjoy Schadenfreude, I highly recommend it, it's great. I also do a lot of open source live streams. If you like seeing people struggle with computers, once again, it's great; you can watch me fail. The code for today's talk and a lot of my other code is on my GitHub; you can check it out.
And there will be more pictures of my dog. In addition to who I am professionally, I'm trans, queer, Canadian, in America on a green card, I make great life choices. It was a great time to move to America. And I'm part of the broader leather community. I can make that joke now because I have a green card; it's slightly more difficult for them to kick me out. This is not directly related: there are no secret Canadian code modification tools. Everything we use is open source. There's no secret Canadian GitHub alternative. If you go to GitHub.ca, you don't find... Actually, I don't know what you find. Maybe you do find something cool. I'm imagining you don't. But this is something that I like mentioning, because I think for those of us who are building big data products or machine learning things, it's super important that we look around and see, hey, who is on my team? And if you realize you're hanging out with only Canadians, that's fantastic, enjoy the poutine, but maybe it's time to get some other perspectives. And if you don't know what poutine is, you're missing out; you should try it someday. Cheese curds and gravy and French fries. Best thing ever. Okay. So what is our problem, and why do we care about automating upgrades? Fundamentally, our problem is we have unsupported versions of our big data tools and other data tools running in production. And this is a problem because when things go wrong, I get woken up. I don't like getting woken up to figure out what I did five years ago; that's just not fun. The other option is sometimes I get, sorry, not woken up, interrupted when I'm trying to focus. And this is important because we are also getting Spark 4 soon. That's super exciting, super lovely. There are going to be all kinds of new breaking API changes, and that's just going to be so much fun, right? Like, yeah. Anyways.
And so I don't know about you, but I'm not looking forward to going back and trying to figure out all of the different things that I've built over the years and upgrading them. I know I'm going to have to do it, but that is not the thing that excites me in my life. Which leads into: why do we have these problems? Why do we have old things running in production? We have them because APIs change and code breaks, and then people are just like, you know what, I don't want to upgrade, just keep running on the old version, it totally worked, it's fine, what could go wrong? The other one is: this isn't fun, right? Does anyone here wake up in the morning excited to upgrade their API usage? Yeah. Okay. So this is zero people, right? And the other possibility is we could try and keep this old software alive, but we don't want to. So how are we going to work around our problem? We're going to use software, and then we're also going to have to deal a little bit with humans. We're going to do automated code updating; it's super fun, so much fun. If you took a compilers class, this is going to look very familiar; if you didn't take a compilers class, this is so cool. Abstract syntax trees are really cool. And we're also going to do automated testing and validation in prod. The social problem is much harder; I am completely unqualified to solve it. I work with other people who are much better at talking to humans, and they did a fantastic job. They made newsletters. They tried to make the project exciting; that failed. Then they tried to make the project required; that failed. And then we set deadlines; they slipped. But for sure, totally, we're definitely going to hit our new deadline for real. Okay. And now let's go and see how else we addressed it. So the other thing that we did is, like, hey, we have this problem that humans don't want to do a thing: what if we made it so they didn't have to do as much work?
And so that's sort of the approach that we took: we can automate a bunch of this. The other part is, we've got API changes, which we mentioned, and then the other thing we have is that testing code is a nightmare, especially code that you inherited and that is called untitled_7.ipynb. I don't know what it does, let alone can I make tests for it. It's terrible. So yeah, we have a problem, and we're going to fix it with computers. Google has a lot of really lovely code-mod tools that I saw while I was there, super fantastic. This encouraged some counterproductive behavior; I don't know if any of you have used Google APIs and watched them change underneath you. So this is a double-edged sword, and we should heed the warnings before we go super all-in on this. So what are we going to do? Basically speaking, we're not going to use regular expressions; for the most part, there are going to be a few times when regular expressions are the simple hacky way and we're just going to do it. For Scala, we use ScalaFix. For Python, we use something called PySparkler. For SQL, we use SQLFluff. And for Java, we looked at it and we were like, we don't have that many Java pipelines; get them to update their code by hand, it's fine, we know where they work. Okay. So how do we figure out what rules to make? We could read the release notes, but they're not very complete. We could look at the MiMa changes, since Spark has a binary compatibility checker that it uses, but, oh dear God, there are just so, so many things in there. Or we could do my favorite approach, which is run it in production, see what breaks, and then fix it afterwards. So we went with the YOLO approach: we're going to try migrating some things, and as it fails, we'll add the rules that it turns out we needed to add. So what do these rules look like? Today we're just going to look at Scala and SQL.
If you love Python, you can check out the GitHub repo. It's got some samples there. So in Scalafix, we override this function called fix. We take an implicit semantic document that's really just the syntax tree, so that's the parsed version of the source code. And we specify the things that we're interested in looking at, and then we can write a recursive function which will match on this tree and generate a patch. And so here, we can see like, hey, do we see something that's calling the JSON reader? Because the JSON reader, certainly no one would use that ever, so they decided it was a great idea to change that API because who has JSON data? That was a joke, by the way. Everyone has JSON data. And so it turns out like, yeah, this actually happens a whole bunch. So we should write a rule for this. Do we see someone trying to read JSON data from an RDD? And if so, this is the patch we're going to add. Now the really cool thing here is that we're matching on a syntax tree to produce a new syntax tree. I can just say, like, swap this part of the syntax tree for this string, and then underneath the hood, Scalafix is very smart and turns it into a syntax tree. Everything's happy. I'm quite happy. I've got a bunch of sketchy hacks, and they're all inside of a function, sorry, a library called utils. So it's great. We hide all of our mistakes inside of utils because only nerds look inside of utils.scala. Huzzah. And here you see we're recursing on the tree, and we just return nothing if we don't find any matches. SQL is very similar, but the AST is a little bit fuzzier because we're using SQLFluff, and it has to support a whole bunch of different versions of SQL, not just Spark SQL. Things are a little fuzzy. So we go ahead and we look and say, like, hey, do we see someone calling this function that we know has changed? If so, go ahead and extract out the part that we care about. And so we go ahead and we grab the third element because, God, whatever, don't worry about it.
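Scalafix itself is Scala-specific, but the match-on-the-tree, emit-a-rewrite idea is easy to sketch with Python's stdlib ast module. This is purely illustrative, not the talk's actual rules (those live in the spark-upgrade repo); the `sqlContext.jsonRDD` / `spark.read.json` names are just a plausible stand-in for the JSON-reader change:

```python
import ast

class JsonReaderRewriter(ast.NodeTransformer):
    """Toy codemod: rewrite sqlContext.jsonRDD(x) -> spark.read.json(x).

    Same shape as the Scalafix rule described in the talk: recurse over
    the syntax tree, match the old pattern, swap in the new call, and
    let the library re-print the source.
    """
    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # recurse first, like the recursive fix()
        func = node.func
        if (isinstance(func, ast.Attribute) and func.attr == "jsonRDD"
                and isinstance(func.value, ast.Name)
                and func.value.id == "sqlContext"):
            # Build spark.read.json(...) keeping the original arguments.
            new_func = ast.Attribute(
                value=ast.Attribute(value=ast.Name(id="spark", ctx=ast.Load()),
                                    attr="read", ctx=ast.Load()),
                attr="json", ctx=ast.Load())
            return ast.copy_location(
                ast.Call(func=new_func, args=node.args,
                         keywords=node.keywords), node)
        return node

def migrate(source: str) -> str:
    tree = JsonReaderRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

So `migrate("df = sqlContext.jsonRDD(lines)")` comes back as `df = spark.read.json(lines)`, while code that doesn't match the pattern passes through untouched.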
Magic number, totally fine, no mistakes. And then we go ahead and we say, like, hey, what is the type of this element? If it's a keyword and it's cast, we know we're good. The types are matching. Everything's fine. Otherwise, if it's not a keyword and the type is cast, we probably need to go ahead and change this. Because the types changed, we actually need to add explicit casts into this function. And so we go ahead and we check it, and then we say, like, okay, function name: if it's cast, we're fine. If not, we go ahead and we produce these edits. Now unfortunately, SQLFluff isn't quite as amazing. We can't just give it a string and have everything work. We have to produce, like, the chunk of the syntax tree. But this is still better than writing regular expressions, right? So much better. So this is totally fine. Everything's great. How do we know if it works? So there's a bunch of different things that we could do. We could try and make tests, but realistically, that's not going to happen. What we do is we do side-by-side writes, and we use Iceberg's ability to stage commits. You can do the same thing with Delta Lake or lakeFS. They're all open source. I don't know how to do it with Delta Lake because I haven't used it, but I'm sure that you can do it. You might be saying, like, Holden, this sounds like you're running all of your data pipelines twice. Isn't that expensive? The answer is yes. Does it catch everything? The answer is no. But it's a hell of a lot better than just hope, right? We've got hope and a little bit of data, and together, they are better than hope alone. So now we're going to do the demo, and it crashed last night, but it's totally probably going to work today. Yeah, thank you. Thank you. You'll see I made a backup copy just in case it fails. What our demo does is it builds a regular Spark project, and it also makes a copy of it first. This is a Spark 2.4 project. Did I break it? Hello? Oh. Okay. We're back. Yay. Okay, cool.
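The shadow-run idea can be sketched in a few lines of plain Python. These are toy stand-ins only: the real thing runs Spark pipelines and uses Iceberg staged commits rather than an in-memory publish callback, and all the names here are invented for illustration:

```python
def run_shadow_migration(old_pipeline, new_pipeline, data, publish):
    """Run the trusted pipeline and the auto-migrated one on the same
    input, compare the staged outputs, and only publish when they agree;
    otherwise leave it flagged for human review."""
    staged_old = old_pipeline(data)
    staged_new = new_pipeline(data)
    if staged_old == staged_new:
        publish(staged_new)  # the "commit" step of write-audit-publish
        return True
    return False

# Hypothetical word-count pipelines standing in for real Spark jobs.
def word_count(text):
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

published = []
agreed = run_shadow_migration(word_count, word_count,
                              "to be or not to be", published.append)
```

Running both versions doubles the compute, exactly as the talk admits, but a disagreement caught here is far cheaper than a bad commit in production.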
So you see here we've got everyone's favorite big data example, word count. And so, okay, this is going to go ahead and it's going to add the Scalafix plugin to our example. So we're just going to go ahead and say, like, yes, add Scalafix. And now it's going to run Scalafix, and it's going to run Scalafix with our additional rules that we created. So much fun. It's probably going to work. This is where it crashed yesterday. Everyone send good vibes to my computer. Come on. Come on. How's that? Okay. You can see I subscribe to println debugging. Oh, well. And now, so it's run the first set of rules which do automated migrations, and now it's doing a second set of rules, and the second set of rules warns about things that we didn't think were important enough to create rules to automatically migrate, but we wanted developers to be aware of. And one of them is the groupByKey function changed behavior between Spark 2 and Spark 3, because who uses groupByKey? Turns out everyone, but very few people depended on the specific weird behavior. And so it's just warning, like, hey, I see you're doing this, and I applied a regular expression and I see some, like, bad words, not bad words in the sense of ones that I use, but bad words in that, like, they're bad. Okay. And we say, like, everything's fine. It says we should review our changes, but we're not going to, just, like, real developers. We're just going to hit enter and see if it works. And now it's going to go ahead and replace Spark 2.4.8 with Spark 3.3.1, and it's going to run these two pipelines side by side and compare their output. And so we will see if the demo finishes, ooh, five minutes left. Okay. We'll probably finish inside of five minutes. If it doesn't, we'll give up on the demo. That's okay. That's okay. So here we see it's running these two pipelines side by side. You can tell because Spark loves logging. And it passed. Yay. Okay. And then this, this, okay. Hmm. Okay.
Well, this part didn't, and that's how you know it's a real demo: it failed at the final end part where it's copying the jar to a new special location, but that's, that's okay. The important part of the demo worked. So we'll call that mostly a win. And if we want, actually, yeah. Okay. I'm going to go. Oh, thank you. My lovely assistant. And so I wanted you to see that, like, yes, this actually did update some code. So we go here, src/main/scala, spark demo project, WordCount.scala. And then we're going to go ahead and we're going to look at the regular version of this. Oh, God. Emacs, come on. Now is not the time. Eight megs and constantly swapping. I can make that joke as an Emacs user. Okay. So here we actually do see, like, it has made some small changes between the two of them. And, oh, sorry. Yeah. So here we see, for example, we have this old pattern of creating the Spark context, and it's been swapped for the new pattern of creating the Spark context. And it's done other similar updates to the code. And the important thing is it now works. And this is fantastic. I think it's really cool. Thank you. Thank you. Hand for my assistant, please. Thank you. So I'm super stoked that the demo did not crash, unlike last night. I switched it back; I was running a nightly build of the JVM, and not surprisingly that didn't go well. Okay. So this is all cool, but, like, where does this fail? So this kind of fails when it comes to dependencies, right? Like, we can only update the code that you've got. We don't rewrite byte code. We just rewrite source code. So if you're depending on something that doesn't support the new version of Spark, it's not going to work out. The good news is, for us, we got to this so late that all of our dependencies were upgraded. So there's something to be said for waiting right until the software goes end of life. Don't tell the security people I said that. The other thing this doesn't work super well with is programming language changes.
In theory, that was actually the original purpose of Scalafix. In practice, this didn't work so well for Scala 2.11 specifically because it's just so old. We had a bunch of Scala 2.11 code. So in conclusion, you should definitely check out the repo. It's here. It's spark-upgrade. It is in my personal GitHub, but a whole bunch of other people have contributed to it. They're awesome. I'm lazy. I wouldn't do all of this work myself. Thanks to my employer again for sending me here. I'm super excited that I get to hang out with a bunch of other nerds. The good news from this talk is that we haven't made a system so powerful that the Spark people stop caring about making breaking API changes. The bad news is we also haven't made a system so powerful that we can just not care about breaking API changes. The excellent news is that my dog is cute as fuck. He's here. I said that at the end of my talk just in case I'm not allowed to swear. He's really cute. His name is Professor Timbit. I miss him so, so much. Y'all are lovely, but I miss my dog. Hopefully there's time for a question, maybe. Yes? We can also do... Thank you. Thank you all. We have a couple of minutes for questions. Thank you very much for the talk. Very interesting. One general question out of curiosity. How long did it take to convert everything? Because you just showed, like, I don't know how big the script was, but I can imagine just how big the repositories that you guys have are. Totally. So that's a great question. It takes a really, really long time to convert everything. And we actually, internally, we have a whole bunch of different projects. One of them is a project that goes through all of the repositories, because we have a whole bunch of different repositories, and it generates PRs to these projects. And that code runs daily. And it doesn't actually catch everything. So what we do is we generate the changes, and then, as I mentioned, we sort of did the YOLO run-in-production approach to life.
So we'll look at these changes, and especially for SQL, it'll be like, hey, we do this shadow run. Does it look like it works? And if not, we actually flag it for review rather than raising the PR, so that we can go back and say, hey, do I need to add a new rule, or is this a one-off special case where we'll just have a developer deal with it? So I know that's not exactly an answer, but several hours. Okay. Thanks. Any other questions? Yeah. There's one right there. No. How many rules did you end up coming up with for this migration from two to three? And do you anticipate going from three to four? What? Do you anticipate going from three to four? Oh, yeah. Okay. So two questions. I love them. I don't remember how many rules we came up with. For Scala, it wasn't a huge number, and that's because while there are a lot of breaking API changes in Scala, our usage of the APIs in Scala is more narrow, and so I'm very thankful for that. For SQL, I think we ended up with around 20, maybe between 10 and 20. And for Python, I haven't kept track, mostly because that code has been working really well, and so some of my other teammates have been working more on the Python side, so I don't remember how many rules we made there. But they're all in the GitHub. As for do we anticipate going from Spark three to four? Yes. Probably not, like, the same month Spark four is released. I love Spark, and we'll make Spark four available internally, but we're not going to go ahead and start pushing users to migrate to it right away. We normally wait a little bit for things to stabilize before we start doing managed migrations, just because it's better for our sanity, and there are more fixes to the code base in general. Cool. We got another question. Any more questions? Okay. Cool. Huzzah. Actually, hold on. You can keep talking, because the next speaker is on the bus. Oh, okay.
So with the next speaker on the bus, I'm super excited, and we can go ahead and we can actually look at more of the changes that it made to the code, which I sort of skimmed over because I didn't want to eat into the next person's time. So it's kind of basic, right? But we can see here, this is the side-by-side for the Scala one, and we can actually go ahead and what we're going to do is we're going to go outside of our end-to-end, and we're going to go ahead and we're going to look at some of the other SQL rules. Oh, fancy. I don't... Okay. Oh, this is so that it's better to read. Okay. Okay. Okay. Cool. Fantastic. And we're going to go ahead. I need my lovely assistant again. Thank you. Thank you so much. Hand for my new lovely assistant. So here we see one of the things that changed between Spark 2 and Spark 3 is that previously you would be able to do just an arbitrary cast of things to integers, and even if they weren't integers, it would do kind of a fuzzy conversion. But in practice, if you wanted to parse a string as an integer rather than casting a string to an integer, you should use int(). And so here we see we've got something similar. We use a lot of print debugging. It's not great. But what we do here is we return this lint result, and what it's doing is it's taking this expression and swapping it to an int when we see a cast with a data type of int. So much fun. There are a lot more rules, but I didn't do a git pull on this because the demo barely worked, and I was just like, let's not tempt fate and do a git pull, because I hadn't tested the end-to-end demo. But this is kind of cool. We've got similar updates to our format string. Super fun. Oh, right. And then char versus string types also got updated. Super fun there as well. And where was another one? I want to find it. Sorry. Then we've got, there's a rule down at the bottom. Oh, no. Okay. I guess the rule that I was looking for isn't in this version of the code. Let's go back to Scalafix.
So the other cool thing about this, sorry, doot, doot, doot. So one of the really cool things about Scalafix, just while we're waiting, is that you can test your rules. And so, for example, like, I wrote these accumulators, and this is the old bad style of writing accumulators, and I was like, okay, let's make sure that it updates to the new good style of accumulators. And this is super convenient because I don't have to manually construct syntax trees. Scalafix just has built-in functionality for this. And we see here what this rule does is it actually flags a bunch of situations. And it's actually going to generate a bunch of warning messages. But there are situations where, like, this doesn't directly translate to the new API easily. So we just told users, like, hey, you need to make a change here. But we'll get it to compile, and then it'll pass the test, and it'll yell at you because you're trying to access a null. It's not perfect. Like, this is very much like a, how would I say this? This is a very mediocre rule. But in practice, we didn't find all that many people were creating accumulators with fixed values to start at. But the one that we did see was people creating accumulators that explicitly started at 0L, and so that we just converted to a long accumulator. And then the other one that I saw here was I also added some tests to make sure that, like, I had a rule which was applying itself too eagerly. So I also created a test which was just, like, make sure that this rule doesn't do anything if it's not, like, encountering the thing that I wanted it to match. So we can also make essentially negative tests for AST transformations. That's super convenient. How much time do I need to kill? How much time do I need to kill? Do we know how long the bus is going to be? Okay, cool. Okay. So we see another one, the groupByKey thing that I told you about. We actually had two different situations.
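The test-your-rules idea carries over to any AST rewriter. Here is a self-contained toy in Python's ast module, with a hypothetical `sc.accumulator(0)` to `sc.longAccumulator()` rename standing in for the Scala accumulator rule, plus both a positive and a negative test in the spirit the talk describes:

```python
import ast

class AccumulatorRewriter(ast.NodeTransformer):
    """Toy rule: rewrite sc.accumulator(0) -> sc.longAccumulator().
    (Hypothetical names; the real rules are Scalafix rules in spark-upgrade.)"""
    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)
        f = node.func
        if (isinstance(f, ast.Attribute) and f.attr == "accumulator"
                and node.args and isinstance(node.args[0], ast.Constant)
                and node.args[0].value == 0):
            f.attr = "longAccumulator"  # rename the method in place
            node.args = []              # the new API takes no initial value
        return node

def apply_rule(src: str) -> str:
    tree = ast.fix_missing_locations(AccumulatorRewriter().visit(ast.parse(src)))
    return ast.unparse(tree)

# Positive test: the old pattern gets rewritten.
assert apply_rule("acc = sc.accumulator(0)") == "acc = sc.longAccumulator()"
# Negative test: the rule must not fire where it doesn't apply.
assert apply_rule("acc = sc.accumulator(5)") == "acc = sc.accumulator(5)"
```

The negative assertion is the cheap insurance against a rule that applies itself too eagerly, which is exactly the bug the talk mentions catching this way.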
These are ones that we could automatically rewrite, and so that's what we do here. And so here we see, like, the situation where someone was explicitly using the column name in a way which we could detect. But then we also have the situation where, like, we weren't super sure, and so these ones we did with a warning. And so we said, like, hey, this should generate a warning because we don't know for sure what's going on here. So we want to generate the warning, but in the other situations where we could do the full rewrite, we made sure that the full rewrite was able to be applied, which I think is kind of cool from sort of, like, a point of view of: you don't have to get everything right, and you can, like, add these warnings in places where, like, it's worth it to let people know their code might not work, but, you know, it's not 100% required. Um... Choo-choo-choo. Cool. Let's see here. Ah... Just a quick interruption. The next speaker is going to be late. He texted us that he's still on the bus, so we're letting Holden entertain you. Oh, I got an idea. I got an idea. Hi. I'm just a speaker. What does that mean? Where am I? Oh. Yeah, I got a... I think I got another minute of something fun that I want to talk about, if it's okay. So the other thing that we sort of, like, glossed over was the, like, side-by-side comparison in pipeline runs, right? And so that's totally really... I think it's really neat, right? Like, because it's super important, because people don't write tests at the end of the day, and that makes me sad. But we've got this pipeline comparison project, and... Oh, God. I'm just remembering how ugly this code is. Please don't judge me. This code was originally written at a conference and then made it into production, as you can tell by the fact that it's called domagic.py. Very sorry. Very sorry. So yeah, so this domagic.py does a bunch of really interesting and terrible things.
And I was mentioning how we mostly don't do regular expressions, but we do a little bit. And one of the things is, when you've got Spark 2 versus Spark 3 and you've got Scala or Java code, you're going to need different jars. Whereas in Python and SQL, like, we could maybe just be using the same files, or we can use the same files with a little bit of a transformation. But so for the jars, we use a really nasty, really terrible regular expression to just kind of extract what we think the version 3 version of our jar is going to be. And then this is convenient because we can run it side by side. And then so we've got sort of different options. Here we've got it so that you can specify the input table. But I actually did a hack that I'm super proud of, because I'm a bad person. We made this plug-in, the Iceberg Spark WAP plug-in, where what we do is, oh god, we use the Iceberg listener and we output this string to the logs any time something happens. And so if anyone's touching a table while their job is running, we know what tables it's touching, so we can go back and run our comparison on those two tables. We actually have some special code that goes ahead and looks at these tables before doing the comparison and says, if the user updated more than 1,000 partitions worth of data, just don't bother, and tell the user they're responsible for validating their data. And if they're touching more than 1,000 tables, sorry, 1,000 partitions in a table, they should really have some reliable tests. For the people who are touching five or 100, like, I get it, untitled_7, it's great in production. When you're updating that much data, maybe it's not time to depend on Holden's sketchy domagic.py. So I think this is really cool. And we're going to go back to our friend Pipeline Compare and down to our friend Table Compare. And so Table Compare is really basic. And there's actually an updated version internally that I need to bring out that does better tolerances.
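That jar-name extraction might look something like this. This is a toy guess at the kind of regex domagic.py uses; the naming convention and target versions are made up for illustration:

```python
import re

def guess_spark3_jar(jar_name: str) -> str:
    """Rewrite a Spark 2 artifact name into its guessed Spark 3 twin:
    bump the Scala binary version and the Spark version suffix.
    Exactly the kind of 'nasty' regex the talk admits to using."""
    jar_name = re.sub(r"_2\.11\b", "_2.12", jar_name)             # Scala 2.11 -> 2.12
    return re.sub(r"-2\.\d+\.\d+\.jar$", "-3.3.1.jar", jar_name)  # Spark 2.x -> 3.3.1

assert guess_spark3_jar("wordcount_2.11-2.4.8.jar") == "wordcount_2.12-3.3.1.jar"
```

Brittle, but for a one-time migration of a known naming scheme, a small regex beats building a full parser for jar file names.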
But we just go ahead and we compare these two tables with a sort of traditional join, which is part of why we had this limit on the number of partitions. Because when we didn't have this limit on the number of partitions and we tried to do these comparisons with some of the pipelines that ran on all of the user data, everyone was very sad, and we took down production. Yeah, anyways, there was an incident, and I got woken up, back when we did not have that limit. And so, yeah, all kinds of fun. But you see here the thing, the magic here, is the snapshot ID, because the other thing that we output in our listener is what snapshot IDs we're writing to. Super convenient. And Iceberg allows us to read from snapshots even if they never got committed. There's a new thing in the new version of Iceberg that allows for branching, which would be even better, because then we would have named things rather than random git hashes. But we're not running that, and it's also not supported in the really old versions of Spark. And because we want to do the migrations from the really old to the really new, I went with sort of the lowest common denominator. And that's kind of how we ended up there. Okay, that's all that I had that I thought was interesting. And I think there was someone else who had something that was interesting. Do you want to come and do your interesting bit? Thanks to Holden for filling in. Does anyone have any questions? That's that? Yeah, all right. First of all, thank you for the talk. I have a quick question: in the summary of your talk, you also mentioned that if time permits, you might have an overview of the changes coming in Spark 4. Do you have this overview? Yeah, so if you're interested in the changes coming in Spark 4, the place to look is the Spark Jira. And there's actually, like, this meta tracking Jira that's in there. And you can see sort of, like, the things that are planned to come.
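A stripped-down version of that table compare could look like this: plain Python rather than a Spark join, with the partition cutoff folded in, and all names illustrative rather than the real Table Compare code:

```python
from collections import Counter

def tables_match(old_rows, new_rows, partitions=0, max_partitions=1000):
    """Toy table-compare step: diff the rows the old and migrated
    pipelines wrote (order-insensitive, duplicates matter), and refuse
    to run at all on very large updates. The real version reads the
    Iceberg snapshot IDs that the WAP listener logged."""
    if partitions > max_partitions:
        return None  # too big: the pipeline owner validates instead
    return Counter(map(tuple, old_rows)) == Counter(map(tuple, new_rows))

old = [("be", 2), ("or", 1), ("not", 1), ("to", 2)]
new = [("to", 2), ("be", 2), ("not", 1), ("or", 1)]
assert tables_match(old, new) is True                       # same rows, any order
assert tables_match(old, new[:-1]) is False                 # a row went missing
assert tables_match(old, new, partitions=250_000) is None   # skip: owner validates
```

The three-way result (match / mismatch / skipped) mirrors the policy in the talk: agreement publishes, disagreement gets flagged, and oversized updates never enter the expensive comparison at all.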
Historically, I would say, without naming names, there's a particular vendor that loves to show up at the last minute with giant piles of code and just kind of YOLO it as a nice surprise for everyone. So this Jira will give you a good idea of what's coming. But my guess is there will be a surprise that we find out about in June, just based on history. I could be wrong. Maybe everything is actually planned this time. That would be a pleasant surprise. But there's a non-zero chance that there will be something new in June too. Cool. Okay. Take it away, my friend. Or no, you don't. Oh, okay. You've got a USB key. I think my employer would be mad if I let you plug the USB key into my work laptop. I enjoy being employed. No, no. I just had more time to kill.
Semantically-driven data management solution for I/O intensive HPC workflows
So people can hurry and sit down for the next speaker, please. Okay, thanks. For our next talk, we have Metin talking about a semantically driven data management solution for I/O intensive HPC workflows. Thank you. My name is Metin Cakircali. I work in the Forecasts and Services Department at the European Centre for Medium-Range Weather Forecasts, ECMWF. I will talk about the semantically driven data management solution for I/O intensive HPC workflows; this work was funded by the EuroHPC project called IO-SEA. It is work done by many people. So a little bit of background on ECMWF, the European Centre for Medium-Range Weather Forecasts. It was established in 1975 by 23 member states and 12 cooperating states as an intergovernmental organization. There are three duty stations with more than 450 people: Reading in Great Britain, Bonn in Germany, and Bologna in Italy. So ECMWF is both a research institution and a 24/7 operational service, producing numerical weather predictions and other data for member states. There are two big projects in which ECMWF is a key player. One is Copernicus. It is the Earth observation component of the EU's space program. We provide climate change information, atmospheric composition information, and also flooding and fire danger information. The other big EU initiative is the Destination Earth project, which is prototyping digital twins of the Earth. So ECMWF's production workflow looks like this. There are, per day, 200 million observations, collected, acquired, and fed into the Earth system model. Those observations and the output from the weather predictions are archived. Also, these data are used to generate products, which are 300 terabytes of data per day, which then accounts for 65 terabytes of data per day as products disseminated to around 350 destinations, to member states and other customers. So in the information system, the data is central.
It provides access to data, models, and workflows, and data management is very critical for the operations. We need transformation of data into information, insights, and decisions. So, semantically driven data management: we have been doing this for a long time. It means managing data based on its meaningful, logical description, rather than just storing data. We also abstract the backend technologies, and we abstract where and how the data is stored from the users. So we try to avoid nested folder structures or UUIDs, such as /home/user/projects/ecmwf and blah, blah, blah, or some cryptic UUIDs that don't make much sense to the user. Instead, we want to use meaningful, scientifically meaningful metadata to describe the data. For example, in this case: the project is ECMWF, experiment number 42, the date is 2024, and then the parameter, pressure, and level. So for that, as part of the IO-SEA project, we developed DAISY, the Data Access and Storage Interface. We index and identify data using its meaningful description, and that also allows us to implement optimized algorithms to archive and retrieve data. And this is based on the ECMWF object store called FDB, which is also free and open source on GitHub. We also abstract the storage technologies; we support POSIX, DAOS, Motr, and Ceph. And we provide interfaces and tools, as well as C/C++ and Python APIs. So the main complexity is the schema, which describes the database. It is a collection of rules, and each rule is a tree of attributes. In this example, I have a schema file, and inside that I have two rules, and each rule consists of multiple attributes. For example, here, project, experiment, date, parameter, level would translate to a key: project ECMWF, experiment 42, and so on. The other rule is event, city, year, and this could translate into event is FOSDEM, city is Brussels, and year is 2024.
So the rules are blueprints of the database, of how to construct the database. They have three levels, and each level can have multiple attributes. A rule has to be unique and complete in describing the data, so that we can distinguish one piece of data from another. We also need to think about locality: data that is related should be stored together. So we can think of the first level as the directory, the second level as the file level, and the third level as the indexes in the file. So the locality increases as we go deeper in the levels. We can set up Daisy with a YAML configuration file. We can point to the schema file, and we can set the backend storage technology; in this case, file is the reference backend. We can also have different paths to the databases; we can have multiple databases, called roots, and we can set different behaviors on individual roots. So aside from data, we also have keys and queries. A key refers to a single object, while a query can match any number of objects. In this case, the key identifies a single object; on the right, I have level as a list of values, 0, 1, and 3. So it means I make a query for three different pieces of data, where the difference is the levels: 0, 1, and 3. So we provide multiple interfaces: command line tools, C, C++, and Python APIs. But here I present an example of the Python API, because it's the simplest. So for storing data by key, I need a key and data. The data can be anything; in this case, I just put a string here, but it can be a PNG file or a PDF file or any other type of data. Then I make a key: user is metin, project is IO-SEA, date is 2023, and city is Bonn. And I pass this key and data to Daisy, and Daisy archives it. Then the other main feature is list: searching for data in the database.
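The three-level rule idea can be sketched as a toy mapping from a semantic key onto the directory / file / index levels. The attribute names and rule layout here are illustrative, not the real DAISY schema language:

```python
# A rule is a tree of attributes split into three locality levels:
# level 1 -> directory, level 2 -> file, level 3 -> index within the file.
RULE = (("project", "experiment"), ("date", "parameter"), ("level",))

def key_to_location(key):
    """Map a fully specified semantic key onto the three levels.
    Data sharing the first levels lands close together on storage."""
    directory, file_part, index = (
        tuple((attr, key[attr]) for attr in level) for level in RULE)
    return directory, file_part, index

d, f, i = key_to_location({"project": "ecmwf", "experiment": "42",
                           "date": "2024-02-03", "parameter": "pressure",
                           "level": "0"})
assert d == (("project", "ecmwf"), ("experiment", "42"))
assert i == (("level", "0"),)
```

Two keys that agree on project and experiment map to the same "directory", which is the locality property the talk describes: related data ends up stored together.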
I need to make a query, in this case: user is metin, project is IO-SEA. And in this case, I just want two data objects for two different dates. And I pass this query to Daisy, and it returns me the keys that I need for retrieving. And in the next example, I have the retrieve: getting data by a key. I make a key: user is metin, project is IO-SEA, and so on. And I pass this key to Daisy and retrieve the data. So it's very simple. So to sum it up, we describe data semantically instead of with UUIDs and nested directories. And we index and identify data by its meaningful semantic information. This also allows us fast and efficient retrieve, search, and archive algorithms. Also, we abstract where and how data is stored from the user. And we make blueprints called rules. And we make keys to attach to the data and pass to Daisy. And Daisy stores and manages the data using multiple different storage technologies. So, more about Daisy: Daisy is free and open source, published on GitHub. We have example C API and Python API code. We also provide binary packages on GitHub for C and C++ as well as Python. We have Python packages for Linux; RPM and deb packages are available. We also have documentation on Read the Docs. And that's all. Thank you for your attention. APPLAUSE Thank you. Do you have any questions for Metin? No? Oh, there is one. Thank you. Hey, nice presentation. I was wondering if you can specify the type of the values in your schema. You mean integer? Yeah, yeah, like that. To facilitate the queries. Yes, attributes can have types. You can set integer, date, string; they can have multiple types. Okay, thanks. Hi, thank you for your talk. I was interested in the indexes, because you mentioned that you index and identify data. Is it some standard type of index, or do you have some format of your own, optimized for this tree type of data? Yes, indexing is based on the rules. So the rules here are tree structures which have three levels.
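The archive / list / retrieve flow the talk walks through can be mimicked with a dict-backed toy. This is not the real DAISY API, and the attribute values are made up to match the talk's example:

```python
class ToyDaisy:
    """A dict-backed sketch of the archive / list / retrieve interface
    described in the talk (illustrative only, not the real DAISY API)."""
    def __init__(self):
        self._store = {}

    def archive(self, key: dict, data):
        """A key fully identifies a single object."""
        self._store[frozenset(key.items())] = data

    def list(self, query: dict):
        """A query maps attributes to lists of accepted values and may
        match any number of objects; it returns their keys."""
        hits = []
        for stored in self._store:
            kd = dict(stored)
            if all(kd.get(attr) in values for attr, values in query.items()):
                hits.append(kd)
        return hits

    def retrieve(self, key: dict):
        return self._store[frozenset(key.items())]

db = ToyDaisy()
db.archive({"user": "metin", "project": "iosea",
            "date": "2023-01-01", "city": "Bonn"}, "data-1")
db.archive({"user": "metin", "project": "iosea",
            "date": "2023-01-02", "city": "Bonn"}, "data-2")
keys = db.list({"user": ["metin"], "date": ["2023-01-01", "2023-01-02"]})
```

The query for two dates returns two keys, and feeding any one of those keys back into `retrieve` yields exactly one object, which is the key-versus-query distinction the talk draws.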
So it has to have three levels, which translate to a directory, a file, and the data. And we have an in-house indexing mechanism that translates this tree into identifying data. So is it something like GIN indexes? I couldn't hear. Is it something like GIN indexes? I'm not sure if I understand. GIN indexes? Yeah, I'm not the right person, I think, because we use FDB, which has been developed for a long time. And it's a big library. I cannot answer the question because I haven't worked at that level. No problem, thank you. Thank you for the talk. I would like to know where these keys are stored, because you need to query this kind of index, and where are they stored for the user? Yeah, so the indexes are stored separately, but together with the data. And they would go, for example, in this case, inside the roots. So each root would be a different database. And if you looked inside root one, output one, for example, here, you would have the index keys inside as well as the data, so together. Okay, so is it a file system or a database stored inside these directories? Yeah, for POSIX, it would be a directory, but for object storage, it would not be a directory, it would be containers or something like that. Okay, so the way the index is stored depends on the type of the storage you describe here. Yes, but we also have two different abstractions. One is the indexing, which we call the catalogue, and the other is the bulk data. So we can have the indexing catalogue inside a POSIX directory and the bulk data on an object store. Okay, thank you. Any more questions? Is the next speaker in the room? Okay, thank you again, Metin. Thank you.
Vector Search in Modern Databases
Okay, next up we have Peter telling us about vector search in modern databases. Okay. Well, hello everyone. My name is Peter Zaitsev. I was supposed to speak here together with Sergey Nikolayev, who is actually a much better expert in this space, but unfortunately he couldn't get a visa. So, guess what? You are stuck with me. And we are going to talk about vector search. How many of you are familiar with vector searches? Oh, well, that's a good number of hands. Some not, so that's a fun audience to have. Let me maybe start with ruining the suspense and showing the highlight: what is the state of vector search in a variety of databases. And I think what is interesting here is that we have, A, a number of new databases started in the last few years which are specifically focused on vector search and related applications. And then also, at the same time, we can see a lot of mainstream databases have added support for vector search. You can see that starting back in 2019, which is actually relatively recent and relatively quick. I think that is very interesting, because databases are often rather conservative, relatively slow-moving products. And what that reminds me of is something we saw with JSON. We saw databases like MongoDB come and really get a lot of developers' hearts and minds. And then later on, JSON support came to pretty much every major database out there and was also added to the SQL standard. Now, what is the unfortunate omission here? What you can see is MySQL, right? Well, you can say, well, MySQL is now owned by Oracle, which is a big, slow-moving corporation, so they're not doing that. But there is another nuance with MySQL: vector search actually exists in the HeatWave solution, which is the cloud-only Oracle MySQL variant, right? And it's, I think, not very clear to what extent it will come to MySQL proper. At the same time, MariaDB is working on a solution in the MySQL space.
PlanetScale announced that they are going to implement vector search. So if not in MySQL community edition itself, it will come in some variants. And obviously, if you look at PostgreSQL, that's always a wonderful ecosystem: as PostgreSQL does with a lot of things, there are multiple ways of doing this stuff, and a number of vector search extensions have been created, but pgvector seems to be the one getting the most support and the most attention these days. What is also interesting, with vector search being so hot because of AI, is that some databases, like Elastic in this case, are pretty much calling themselves a vector search database first and a full-text search database second. For me that was a very interesting change to see. Well, anyway, with the big picture out of the way, let's talk a little bit about vectors and vector search, and why this suddenly became so important these days. If you remember, I don't know, high school algebra or something like that, you probably know what vectors are. We can think about 2D or 3D space, that's all very clear. But we can also think about a vector as something that represents, say, colors. As software engineers, you probably know colors as red, green, blue, often encoded in one byte each, and that can be seen as a vector in a color space. We can then think about the similarity of colors as the similarity between those vectors. Now, if you think about vectors, there are a number of different approaches to defining which vectors are similar. There is, for example, Euclidean distance, but the most common in this case is cosine similarity.
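The color-space example above can be sketched in a few lines (my own illustration, not from the talk's slides):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Colors as 3D vectors in RGB space, one byte per channel.
red = [255, 0, 0]
dark_red = [128, 0, 0]
blue = [0, 0, 255]

print(cosine_similarity(red, dark_red))  # 1.0 -- same hue, just darker
print(cosine_similarity(red, blue))      # 0.0 -- orthogonal, unrelated hues
```

Note that cosine similarity only compares direction, which is why red and dark red come out identical; Euclidean distance would separate them by brightness.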
For example, if you go and ask one of the thought leaders in the AI space, OpenAI, and say, hey guys, I'm using your API, what do you suggest as a distance between vectors, they will suggest you use cosine similarity. Now, let's talk a little bit about history and how vectors have been used in databases, specifically in the information retrieval space. Vectors are actually not something that suddenly became new, as you might think from the term "vector search database". If you think about full-text search applications specifically, vectors have been used for a long time. For example, one approach is to build a sparse vector: we have a document, we look at all the words in the dictionary, and we store the frequency of each word in that vector. Then, if you want to compute the similarity between two different documents, you can essentially take the cosine similarity between those two massive sparse vectors. That is something that has existed for a very, very long time. You can find it in Lucene, Elastic, various libraries, and so on and so forth. Now, what we use in vector search much more these days are dense vectors, which are called vector embeddings, and they are different. What is interesting about those sparse vectors, also referred to as bag of words, is that we can think of every dimension in the vector as meaning something. We can say, oh, this dimension means whether the word, you know, "cat" was seen in a document and how many times, or its relative frequency. When we compute a dense vector, or embedding, what that means is we take a document and run it through a model which generates those embeddings, and then we don't really understand very well what the individual dimensions mean.
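To contrast with those opaque dense embeddings, the older sparse bag-of-words similarity described above can be sketched like this (a toy illustration; real engines like Lucene weight terms, e.g. with TF-IDF, rather than using raw counts):

```python
from collections import Counter
import math

def bag_of_words(text):
    """Sparse document vector: word -> frequency. Every dimension has a
    human-readable meaning (how often that word occurs)."""
    return Counter(text.lower().split())

def sparse_cosine(a, b):
    """Cosine similarity over sparse word-frequency vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

doc1 = bag_of_words("the cat sat on the mat")
doc2 = bag_of_words("the cat ate the mouse")
doc3 = bag_of_words("vector search in databases")

# Documents sharing words score higher; no shared words -> 0.
print(sparse_cosine(doc1, doc2) > sparse_cosine(doc1, doc3))  # True
```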
What we know, though, is that similar documents should end up close by; that is how the system works. And it doesn't even have to be a document. If you think about, let's say, an image recognition application: I can train the system on the faces of a bunch of people, and I know that's not exactly legal in Europe, but let's imagine we are in China and we are going to do that. Then we can compute the vector embedding of somebody's face and look at which person in the database it looks most like. That should give us the closest match. Here is something else interesting. These are embeddings computed for single words using the GloVe model, one of the open source frameworks, from a visualization by a guy called Jay Alammar. What is interesting in this case is that we see a bunch of words, and we don't really quite know what the different dimensions mean. That is very common in AI: we know these things work, but we can't really figure out how exactly they work. And then you can try to rationalize about what those dimensions could be. For example, all of those words have something to do with humans, besides "water", and you can see there is one band which is blue for every word except that one, so you might say, well, maybe that dimension is something related to being human. Or you can see something else: "king" and "queen" also have a lot of similarity, so you might say, well, you know what, we don't really know exactly how, but the model may have learned something about royalty. Okay. So if you think about vector search in a nutshell: we have vectors, and typically those will be dense vectors which came from AI applications.
Those would be embeddings, and some database systems focus purely on supporting operations on those embeddings. For example, Postgres with pgvector just says: hey guys, I don't know how you compute those vectors, that is not our problem, we are just going to help you operate on them. Some of the more advanced vector databases also support creating embeddings, maybe even from the database itself, especially cloud databases which can, for example, call OpenAI's API in the background to compute the embeddings for you. So anyway, let's talk more about the technology, how does it work? What operations do we typically see in vector search applications? Well, typically we have a bunch of vectors stored, we have a vector on input, and we are looking to find the stored vector which is closest to it, where the distance we typically use is cosine distance. Now, if you look at this problem, of course, as with about everything, you can just scan all the data and find the closest vectors. That is a fantastic way to do it exactly, but it is also very slow. That is why it's not used in practice; instead, special index structures are used. HNSW seems to be the most popular algorithm, the one the industry seems to be coalescing around. What is important to note about it, and different from many other things in databases, is that this index is not exact. It gives us rather good accuracy, but it does not guarantee that it will always give you, let's say, the closest vector when you ask. If you are familiar with databases, you may say, ooh, that looks strange.
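The exact-but-slow baseline the talk mentions, scanning all the data, looks like this (my own sketch; an HNSW index replaces exactly this O(n) loop with an approximate graph traversal):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def knn_scan(stored, query, k=1):
    """Exact k-nearest-neighbour search by scanning every stored vector.
    Always correct, but O(n) per query -- which is why production systems
    use approximate indexes such as HNSW instead."""
    return sorted(stored, key=lambda v: cosine_distance(v, query))[:k]

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(knn_scan(vectors, [0.95, 0.08], k=1))  # [[0.9, 0.1]]
```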
But in AI applications, the vectors are not really exact to begin with, given how they are computed, so these approximate indexes are quite usable. Okay. So let's talk about what we really use this for. As I mentioned, the most common feature is finding your nearest neighbors. There are some other, more global operations, such as clustering the data or classification, which are supported by the more advanced systems. Okay. Let me show a little example. Here, in an interesting way, we'll use the Manticore Search system, which is what it was created for, but we'll connect to it through the MySQL protocol, which it supports. You can see that we can go ahead and, say, create a table where we've defined a float vector column. That is not a traditional database type, but something this engine supports; typically vector-capable databases offer some sort of vector column functionality. Then we can compute the distance between a given input vector and the vectors in the database, and it gives us the results. What we actually had in this case were different images, which we ran through embedding creation. And then you can see, I think that was an image of a bag, and the result was saying, hey, this is much more similar to a yellow bag than to a white bag. That is a pretty common kind of application. Okay. I mentioned embedding computation. If you look at non-specialized databases like Postgres in particular, they say: hey guys, you can use an external encoder, we don't see that as a database problem.
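Before moving on to embedding computation: the table-and-query demo just described follows a pattern you can sketch with pgvector, mentioned earlier (a hedged sketch; table, labels, and the toy 3-dimensional vectors are made up for illustration):

```sql
-- pgvector sketch: store embeddings, then find the nearest one.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE images (
    id        bigserial PRIMARY KEY,
    label     text,
    embedding vector(3)   -- toy dimension; real models use hundreds or more
);

INSERT INTO images (label, embedding) VALUES
    ('yellow bag', '[0.9, 0.8, 0.1]'),
    ('white bag',  '[0.5, 0.5, 0.5]');

-- <=> is pgvector's cosine-distance operator.
SELECT label, embedding <=> '[0.95, 0.85, 0.05]' AS distance
FROM images
ORDER BY distance
LIMIT 1;
```

Without an index this query scans the table exactly; adding an HNSW index on the column makes the same query approximate but fast.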
You process that information out there, with your favorite external API, or use some local open source model, and that's fine. Though there are some libraries being added in this space, and I think over time, given the desire to offer developers simplicity, we will see more direct support. Now, something else I think is interesting about embeddings and information retrieval tasks specifically. If you look at search applications, for years a lot of time has been spent on sort of hard-coding, hand-engineering the structure of the language: defining synonyms, defining antonyms, and so on and so forth, to improve search quality. The other approach is to use AI for this: we can match documents through embeddings, and that really allows us to avoid a lot of that manual processing while keeping good quality. What is interesting, though, is that this new generation of AI search may not always be best, and it may not be the most efficient, because if you think about it, if I have a lot of documents in my database, billions, then the vector search can be relatively expensive. So one of the approaches being used is a dual approach: we may get, say, the top thousand documents through legacy frequency-based search methods, and then use AI to rerank those. You can see that this sort of combination, like Vespa's hybrid approach, shows better results on many benchmarks.
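The two-stage hybrid approach described above can be sketched as follows (my own illustration; `embed` is a stand-in for a real embedding model, and the keyword stage stands in for BM25-style frequency ranking):

```python
import math

def embed(text):
    """Toy embedding: character-frequency vector. A real system would call
    a trained model here; this just makes the sketch self-contained."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_candidates(docs, query, k=1000):
    """Stage 1: cheap frequency-based retrieval narrows billions of
    documents down to a small candidate set."""
    terms = set(query.lower().split())
    scored = [(sum(t in d.lower() for t in terms), d) for d in docs]
    return [d for score, d in sorted(scored, reverse=True)[:k] if score > 0]

def hybrid_search(docs, query, k=3):
    """Stage 2: the expensive embedding model reranks only the candidates."""
    qv = embed(query)
    cands = keyword_candidates(docs, query)
    return sorted(cands, key=lambda d: cosine(embed(d), qv), reverse=True)[:k]
```

Usage: `hybrid_search(["yellow bag", "white bag", "red car"], "bag")` never embeds "red car" at all, because the keyword stage already dropped it.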
Okay, well, to close: vector search is very usable, both in information retrieval tasks and in many other applications. What we also see is that vector support in databases is at a relatively early stage; it has only been implemented in the last few years, and I would say the current implementations cover the basics. We will see a lot more features as we figure out what we actually want, with continuing innovation in the data structures to support fast vector applications and to improve accuracy. Well, that's all I have. Yes, the gentleman was about to show me zero minutes, so if there are any questions, I would be happy to try to answer. Any questions? Yes. Hi. Okay, so you say that databases are going to transition from just supporting externally modelled embeddings to generating them internally. But given that we see a lot of very advanced models that generate embeddings, for example GPT, which basically takes entire concepts and translates them very efficiently into embeddings, wouldn't the standard be to provide a single external embedding generator model, so that everyone can just benefit from a state-of-the-art model? Well, yes, absolutely.
Well, what I'm saying in this case is that I think it will be supported, maybe in different ways. I'm not saying, oh, you know what, you should expect Postgres to incorporate all those possible models inside it. I think it's the same as saying that if Python supports those things, that just means it's easier to do from a Python standpoint. Right now, if I want to generate embeddings even for data I already have in the database, I pretty much have to do that externally. What I'm saying is that the database could give us a convenient way of talking to those external models, rather than us wiring it up ourselves. Does that make sense? Thank you. Any other questions? Thanks. Thanks for the talk. As a database developer, how do you deal with a very rapidly moving target of both algorithms and storage formats? In the sense, do you deprecate rapidly? Because if today you have a KNN function with a certain API, and a month later, you know, ML is a very rapidly moving target, you have a better KNN function, and the same goes with the dense format. Well, yeah, I think that is a good point. Now, I think what is interesting about what you're saying is that you have this industry at an early stage, where on the one hand things are changing quickly, while on the other hand, often people implement something, put it in production, and it's good enough; the fact that there is a better state of the art out there doesn't mean they want to change. And that means, at least in the database world, you always have to say: well, you know what, there are certain things you would love to kill, but some very big corporations have already deployed them in production and they're not freaking changing that in the next 10 years.
So in reality, I think there will be evolution on that side, but we'll still have to keep it competitive. If you look at, as I mentioned, the pgvector extension for Postgres, it supports a whole bunch of different options in this case, so yeah, I think that's what we should expect. Okay, hi, yeah, can you go back to the embeddings slide where you showed the similarities between queen and king, I think it was the first one. Oh, this one? Yeah, yeah. You mentioned you used Ada, I think? Did you use ChatGPT embeddings? No, this one is actually a GloVe model, one of the open source models. Ah, okay. But I mean, in this case, that is just an example. What you want here is to visualize: when you say, hey guys, we generate these embeddings and the dimensions do not particularly mean anything, can you still just plot them, the way we might plot, say, the DNA of a frog and a fish, and see whether we can spot something? That was the point here, to visualize a particular embedding generation model. Yeah, no, I find it really helpful, also for my future students, because it really captures the essence of embeddings and similarities, I think. Yeah, I mean, again, in this case it is just a visualization, to show people what these things look like. The point I was trying to make is that on the one hand we cannot really state what exactly those dimensions correspond to, but as humans we can try to rationalize: it seems to be, you know, something. Yeah, yeah, also. You know, something there, right. Also because the embeddings from OpenAI have, what, 1500 features, no, more.
Well, that's right. Yes. Really difficult to plot. That's right. That's right. So in this case, the example was specifically cut down to have many fewer features. Yes, because, you know, with 1500, or, well, like 3000 for the large one, that would be too many to display. Okay. Thank you. Thank you.
How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads
Next up, a talk about Kubernetes and HPC and AI. Hello everyone. So today I'm going to be talking to you about what the Kubernetes community is doing to improve batch workloads in general. Just a brief background about who I am: I work as a senior software engineer at Red Hat. I'm a big upstream developer in Kubernetes and OpenShift. At Red Hat I focus mostly on CRI-O and the kubelet now, but I also dabble elsewhere: I'm a reviewer in the job area in Kubernetes and in a project I'll also talk about, called JobSet. I was a maintainer of a batch project called Armada, which was for running batch jobs across multiple Kubernetes clusters. And I actually started my Kubernetes experience by trying to build a platform that could run jobs on Slurm and Kubernetes. I liked the Kubernetes aspect a little bit better in some ways, but the Slurm scheduler was a lot easier to use than Kubernetes. I saw a gap in Kubernetes and I've been trying to help contribute since. Just to give a little outline: I'm going to give a historical perspective on Kubernetes, how it developed, and why we are in the place we are now. I will not really be talking too much about how best to get the most performance out of your cloud vendor or what else you need to tune; I'm going to be focusing on the APIs that users can use in Kubernetes. So these are my couple of slides on what Kubernetes is. It's pretty complicated. But generally I've noticed that people start using Kubernetes as a library; I like to think of it as sort of a React, but for distributed systems. You're using all the Kubernetes client libraries, you're using the APIs, you're composing custom resources on top of objects and exposing them to your customers. That's where I've seen a lot of companies start using Kubernetes, especially when they're trying to build a quote-unquote Kubernetes-native platform.
So what does that mean really for most people? Well, generally I think the benefit for this community is that you have a declarative API for workloads. If you're running on the cloud, failures happen; it sucks, but they do. And a lot of times your users also don't want to be told: oh yeah, you had a network failure so your job failed, sorry, restart it. And users are pesky, and they ask more and more of you as time goes on. We all know this. Also, for better or for worse, everything starts with YAML; take that how you want. What that really means is that we have a big focus in Kubernetes on what your API is, on backwards compatibility, most of the time, and on how to make it useful for people. Generally a Kubernetes cluster has not too many components, but I want to focus on a couple of them for this talk. You have the API server, which everyone talks to, via the CLI or whatever. etcd is essentially your database for storing all your objects in Kubernetes. The scheduler is an interesting component, because I think the hardest thing for the HPC community to grasp about the Kubernetes scheduler versus Slurm is that Kubernetes schedules at the level of the node. You get a lot more fine-grained control in the Slurm scheduler than you do in Kubernetes, because Slurm can actually target, I don't know, sockets and everything on a node; it's much more fine-grained than Kubernetes. So I like to think of the Kubernetes scheduler as kind of a heat-seeking missile for a node: you give it hints, it targets one, and then your pod is on that node. So, what is actually on a node? Well, there's this thing called the kubelet, which talks to the container runtime, and I will talk about that on the next slide. The point of the kubelet is to actually start a pod, but I want to walk through what actually happens with a pod.
This is, you know, step one: a user creates a pod, that's a workload, and it goes to the API server. The API server stores it in etcd, and then the scheduler says: oh, you don't have a node specified on your pod, okay, let me do a little scheduling loop and find a node. Once your pod is placed on a node, the kubelet will pick it up and actually start running it, and if you're running a batch job, it will run to completion. If you're running microservices, it's just there and it keeps running. The kubelet talks to the container runtime and the host; it also handles a lot of stuff with volumes. It does a lot. So now you've seen the pod lifecycle, and I'll be honest, my first time using Kubernetes I was like: deployments, stateful sets, this is so complicated, I'm just going to use a pod. Unfortunately, I learned pretty early on that you lose a lot of the benefits of Kubernetes if you're using pods directly. Pods are stateless, so if your node goes down, you essentially lose your pod. And a lot of times if your cluster is overworked, well, not overworked, but your pods will get deleted after a while. You also don't get self-healing. That is an important part of Kubernetes, even, I think, in the batch community. It just means that when you define an API, things are going to keep running, and if you have, say, a job, it is going to keep retrying, as one example. The more pragmatic thing is that the pod API has to fit the needs of both the microservices and the batch areas, and you cannot really change it for one area and not the other. So generally, I don't recommend people use pods directly. Now, a quick tour of the projects people have built in this space. YuniKorn is more popular in the Spark community; it's trying to bring the YARN scheduler to Kubernetes by adding a separate scheduler.
And then MCAD is a project from IBM for deploying arbitrary objects to multiple Kubernetes clusters and adding its own queuing. So what does it mean when you have all these projects? Well, you have chaos. You have Kubeflow, and I'll pick on Kubeflow a little bit. I only list two machine learning frameworks here, but from the last I checked, there are like six different APIs for representing a machine learning job in Kubeflow. That means there are a lot of APIs for running a batch job with Kubeflow. They are trying to consolidate most of them into a single one, called the Training Operator; still, that gives you a new API. You have two versions of running MPI jobs on Kubeflow. Now, I actually don't know if that MPI Operator fits all the use cases that people have for MPI, but as far as I know it is the only public open-source way of running MPI on Kubernetes. And you also have Armada and Volcano, which have their own representations of jobs. Well, this is honestly pretty chaotic. It's not really fun as a developer to be asked, if people want to bring a new API, can you support it? And you say no, because we don't really want to install all of Kubeflow, or install the controller, just so you could run a PyTorch job or whatever. It gets complicated. So this working group was founded in the Kubernetes community. Batch workloads run the full gamut on Kubernetes, from scheduling all the way down to the node to some representation of the batch APIs, so they formed a working group to coordinate. Not that they really had to, but it's a way to focus multiple people on a single area and try to improve it. And some of the goals of this group are: let's make the batch API useful again; let's allow people to actually use these APIs without having to install something like Kubeflow or Volcano to run a batch job.
And also, the other one I'll talk about is queuing. Carlos over there could probably tell you all about DRA, which is another exciting area that's happening. That's about getting more use out of GPUs, and it is in scope for this group, but it's actually mostly led by NVIDIA and Intel right now. I'll be focusing on those two bullet points for the rest of this talk. So what is the job API? Well, this is generally a pretty simple way of representing a batch job, and I think that's one of its downsides: it was originally really focused on simple use cases. I have an example here of computing pi, and I'll walk through the API so you'll see it repeated again and again. Generally, Kubernetes has this concept where you define a template and you define replicas. In the job API that's called parallelism, and it just means: how many pods do you want running in parallel? Completions is how many of those you want to actually complete before you consider the job successful. Active deadline seconds is how long the job is allowed to run, and backoff limit is the retry count. It's how the job gets some self-healing, if you will, because it says: if the job fails for any reason, I want to retry, in this case up to the backoff limit; the default is six. And one of the first features that this group added is the pod failure policy. It's essentially a way to short-circuit the retry limit, because let's say your user has a segmentation fault and they're using a GPU. You probably don't want them to be holding that resource when other people could be using it, and you probably don't want to keep retrying. And there's no upper limit on these retries, so someone could say 10,000 retries and sit on that node forever. So the pod failure policy was a way to short-circuit that.
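Put together, the fields just walked through look roughly like this (a sketch following the upstream compute-pi example; the exit-code value 42 is an arbitrary illustration):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: compute-pi
spec:
  parallelism: 2              # pods running at the same time
  completions: 2              # successful pods needed for the Job to succeed
  activeDeadlineSeconds: 600  # hard wall-clock limit for the whole Job
  backoffLimit: 6             # retry budget (6 is also the default)
  podFailurePolicy:
    rules:
    - action: FailJob         # short-circuit the retries on this exit code
      onExitCodes:
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never    # required when using podFailurePolicy
      containers:
      - name: pi
        image: perl:5.34
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
```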
Now, how do we actually make the job API useful for workloads that need to talk to one another, which covers most of the interesting use cases in the HPC world? Well, this is the idea of an indexed job: can we provide a stable name and an environment variable, so that applications can refer to a specific replica of a pod without having to worry about losing track of it, and can say: my replica zero, my index zero, is always going to be this pod, so I can talk to it. You can think of this as the common pattern in, say, the MPI area, where you have maybe a rank-zero pod and a series of workers, and you want to make sure you always have a rank zero. That's the idea of an indexed job. Now, I wish I had shown a slide here, but when you couple an indexed job with a headless service, in Kubernetes speak, you're actually able to get all these pods to talk to one another. The last area is: if you're trying to build queuing in Kubernetes, you run into a problem with the pod lifecycle. I like to joke that this lifecycle is kind of like a racehorse: once you create the pod, it's just running and it's never going to stop. And the reason this can take down a cluster is that if you have a million of these things running, it's an infinite loop and it's going to drain all the resources of your cluster. You still need to know how many objects have been created, but you do not want creating the object to start this loop. So this was the idea behind adding suspend to the job API in the Kubernetes community, and Kueue supports a wide range of jobs via this use of suspend.
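The indexed-job and suspend ideas above combine in the Job spec like this (a hedged sketch; the job name, image, and subdomain are made up, and the `workers-0.workers` DNS form assumes a headless Service named to match the subdomain):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: workers
spec:
  suspend: true            # created paused; a queue controller (e.g. Kueue)
                           # flips this to false when capacity is available
  completionMode: Indexed  # each pod gets a stable JOB_COMPLETION_INDEX
  completions: 4
  parallelism: 4
  template:
    spec:
      subdomain: workers   # pair with a headless Service of the same name
                           # for stable pod-to-pod DNS like workers-0.workers
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sh", "-c", "echo I am rank $JOB_COMPLETION_INDEX"]
```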
So Kueue supports all the Kubeflow operators, a project I'll talk about next called JobSet, the job API itself, and another project called Flux. This is a nice thing that Kueue provides. So what do you do about representing a more complicated job? Well, with the job API you pretty much have to have the same pod definition for all of your workloads, and that may not fit a lot of use cases. So JobSet was created as a way to say: can we create a representation of a single job that could have maybe different pod templates, and also have its own failure and success policies? When you run these jobs at large scale, you're going to see failures, and you may want to restart some jobs, or maybe you don't want to restart, and I'll talk about one interesting use case of success policies. And one of our goals is that Kubernetes should be an implementation detail. If you're a researcher, you mostly don't want to know about it; you just want to know: I'm running this. So we want to streamline the creation of things like indexed jobs and headless services, because we know people want to communicate with their pods. At a high level, the API for a JobSet looks very close to that of a job: instead of replicating pods, we are replicating jobs. I didn't have it specified here, but there's a replicas field under the spec which says how many replicas of my replicated job I want to create, and inside of a replicated job is a job template. This example job is a PyTorch job. It creates an indexed job with a headless service, and then it creates a single job that has four pods. I'll show in a little demo why this is useful.
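A JobSet along the lines just described might look like this (a hedged sketch based on the v1alpha2 API as I understand it; the names and image are placeholders, and the controller fills in the indexed-job settings and headless service mentioned above):

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch-training
spec:
  replicatedJobs:
  - name: workers
    replicas: 1            # copies of this job template to create
    template:
      spec:                # an ordinary Job spec
        parallelism: 4     # one job with four pods, as in the demo
        completions: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: trainer
              image: pytorch-trainer:latest   # placeholder image
              command: ["python", "train.py"]
```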
And the other area where we've actually gotten quite a bit of interest, and both Volcano and Kubeflow have implemented this in their projects, it's one of the main reasons they created those projects, is: what do you do if you have this leader-worker paradigm, where your leader, let's say, is a Redis database or a message queue, and your workers are talking to it? Well, I want my workers just to finish. I want to say: hey, once my workers are done, my job is successful, and I don't really care about the progress of the leader. So this is one of the use cases we had in mind with this project. There are a lot of them, but this was one: can we use something called a success policy to say I only really care about one set of jobs completing; the rest are fodder, essentially, or not fodder, but they play an important role until the workers are done, and then they are also taken down. So, how am I doing on time? Okay, I'll walk through the demo a little bit. Generally, with JobSet, you have this controller, the JobSet controller manager. Right now, you can check it's running, great. In this demo, I tried to take the PyTorch job, show the template, and then run it as just a normal job to show you what happens: you can't communicate with the service, because if you create this job normally, there is no service to communicate with, and it just automatically fails. So then, what do you do? Well, you can use JobSet. Woo-hoo. And so, I already created the JobSet, and you can see with kubectl logs that the JobSet is running, it's doing training, using PyTorch. And also, it created a headless service called pytorch. So this allows you to hide all this stuff from the user. And then, I think in the next part of the demo, I'll show the success policy. Come on. Oh, well.
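The success policy that part of the demo was going to show might look like this (a hedged sketch of the leader-worker case described above; field shapes follow the JobSet v1alpha2 API as I understand it, and all names and images are placeholders):

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: leader-workers
spec:
  successPolicy:
    operator: All            # every targeted job must complete
    targetReplicatedJobs:
    - workers                # only the workers count; the leader's
                             # progress is ignored and it is cleaned up
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: redis
              image: redis:7
  - name: workers
    replicas: 1
    template:
      spec:
        parallelism: 4
        completions: 4
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox:1.36
              command: ["sh", "-c", "echo done"]
```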
So, I guess, I mean, it will go on for a little bit, but does anyone have any questions? Any questions? There's a couple up there. Wait, wait, wait. Who was first? Hi. Yeah, I'm very much from the Slurm, bioinformatics, Snakemake, Nextflow world. And we have an IT department, and they have a Kubernetes cluster, so this is a very interesting talk for me. But are you thinking about the kind of workflow managers that typical researchers use? Because I was just in a high-energy physics session, and they also use Snakemake, and they have schedulers, of course, but somehow that also has to interface. Do you have any comments on that? So, generally, we don't want to add another workflow engine; there are too many of them. But what I view JobSet as is like a single node of a DAG, and one of our goals could be that either a Job or a JobSet could be added to something like Airflow or Argo Workflows, as a single element that you could run, rather than Argo having its own way of representing what it actually runs on Kubernetes, which is, you know, fine for pods; Airflow is also pods. There are a lot of other workflow engines out there. Two jobs ago for me, applying bioinformatics, we took a lot of inspiration from some of their workflow languages and tried to standardize a workflow language so we could actually run across different environments. So I'm familiar with the area, but we're trying not to be a workflow engine with this project. Thank you for the talk. I noticed that a lot of the things you were talking about seem to play in the same field where Slurm plays.
So, I don't know, a few years down the road, do you see Slurm giving way to this Kubernetes-based infrastructure, or do you think they're targeting different tasks, and Slurm will always have its place? That's a really good question. I was not at KubeCon North America this year, but I heard of a company called CoreWeave that was actually collaborating with SchedMD to try to provide Slurm on Kubernetes. From what I understand, they are using the Slurm scheduler, but also allowing people to run some of the more popular Kubernetes stuff, like Kubernetes for services and Slurm for batch. Generally, everyone is converging in this area. Our model actually takes inspiration from HTCondor and tries to apply that to Kubernetes. And then I know that, sorry, I'm drawing a blank, the University of Wisconsin, who created HTCondor, they're big on trying to use Kubernetes for a lot of their infrastructure too. Also, we do talk pretty closely with the SchedMD folks, at least in my last role, and there is a lot of interest in bringing Kubernetes to Slurm. Part of it is that Slurm has been around a long time, so they had to do a lot of work even to get to the point of, I want to containerize Slurm in Kubernetes. Okay, great. Now, do I want to schedule a pod, or do I want to schedule a single container? That's where it gets challenging, and the other thing is convincing more and more people to use containers, because containers are great, but it's also a pain to change everything you have so it can go into a container. Okay. Any more questions? So, if I understand correctly, you're primarily optimizing so that I do not schedule 10,000 pods, and instead have job sets, right?
Because when I think about batch processing, I think about, let's say, CI. We are running about 5,000 jobs per day, and we do this with Jenkins, which actually works great with the Kubernetes plugin, but I'm not seeing enough features in this proposal to get rid of Jenkins or any other components. I'm primarily seeing a way of not overloading the cluster with pending pods. Is that right? No, I would say the main thing is: if you want to, say, run a PyTorch job, one option is to use Kubeflow. Fine, that will work. But what if I don't really want to use Kubeflow? What if I have my own representation? What if I want to add my own...
Kubernetes and HPC: Bare-metal bros
Okay, this is going to be interesting. We are relying on the Wi-Fi a bit here as well, so it would actually help if you turn off your Wi-Fi. I know that's a big ask. Consider it for the next half an hour; that would be really helpful. So Vanessa is live here through a video call. Give us a wave, Vanessa. We can... Well, can you try speaking? What's up, folks? Sorry, it's not working. Is that working? Try again? Okay, that's much better. Nice. So we'll start your recording, Vanessa, and then we'll try to do live Q&A at the end. Sounds good. I have some answers for the previous Q&A too, so we can talk a little bit about that. We can try. We can try. By the way, Vanessa is also the one who designed the HPC social logo, so you should thank her for that and take some stickers when you leave. Thank you. Thank you. All right, here comes the talk. Hi, folks. I'm Vanessa Sochat, and today we're going to be talking about Kubernetes and HPC, the bare metal bros. So I thought I would open this talk by putting two words on the slide. Those words are cloud and HPC. So probably the question on everyone's mind is: what does the future look like? I'm going to answer this question by posing a question back to you. Where is the money going? We can look at polls from Gartner and Hyperion Research that suggest that cloud is projected to reach $40 billion by 2026, with a CAGR of 6.4%. So, very superficially speaking, the money is going to cloud. Now, we can also follow up on this question: okay, that's great, but who's going to get left behind? We can look at a paper from Reed, Gannon, and Dongarra from 2023 that identified some really interesting trends. For HPC, it suggested that the way we design our systems will not continue to work.
We cannot depend on Dennard scaling and Moore's law. There are rising costs for improved semiconductors. This is going to make it harder, and increasingly more expensive and laborious, to deploy new systems. And they define something called NREs, non-recurring engineering costs, that we incur for every new system. Now, cloud, on the other hand, is leading the space in innovation. As we know, there's this massive expansion of large-scale commercial clouds. They are not depending on software vendors or hardware vendors; they're making their own stuff in-house. And guess what? They're hiring away and attracting the talent pool. And they made a really interesting analogy with temperature. They described HPC as endothermic, requiring the absorption of heat for survival, and cloud as exothermic, really giving off heat. And we know, folks, we're not talking about heat here. We are talking about money. But to continue the heat analogy, you'll know that if you've ever been out in the snow, in a cold environment, you are much more likely to survive if you're giving off heat. So who gets left behind? Well, the one that's probably going to run out is the one that needs to constantly absorb heat. And that's the reason that we're all here: we need to ensure that the needs of our science are represented in this new environment. And guess what? The success of our science, the reason that we're all here, really depends on our ability to be collaborative in this space. And this is really the manifesto of converged computing: if we bring them together, we get this new technology space where we have the best of both worlds. So where do we start? Well, here is how the talk is going to proceed today. We're going to start with models for convergence, talking about patterns for bringing together traditionally disparate environments. We're then going to move into strategies for convergence.
So, designs that I've noticed allow for easy movement between the spaces. Let's start with those models for convergence. Now, if you've looked in paper land, you've probably seen many different models; there are many different ways to take HPC and cloud and put them together. I'm going to talk about the high-level patterns, from the perspective of someone that's maybe deploying a system. So let's say that's me, and let's say I want my cloud AND my HPC: I'm going to take my limited set of resources and try to split them in two. So I spend a ton of money and I do this, and then... I chose poorly. No one's using half my resources, and oh my god. So four years later I come back and I'm like, all right, I want cloud XOR HPC, exclusive or. I understand I can't have my cake and eat it too, so I am just going to choose one. We've used HPC for all these years, bread and butter, this is how you've always done things. I choose HPC. Great. Six months later, someone comes into my office: are we dinosaurs? You know, everyone over there is using YAML and automation, and we have this old setup, and ah. So you go back into your office, you contemplate your life choices, and you're like, all right, no, it's okay, I'm not going to wait another four years. I'm going to sneak it in. So this is where you see all of these ideas, like bursting and multi-cluster, and these are generally referring to this idea of having some home base of resources and reaching out to get more. And the problem with this approach, as I see it, is that the complexity of these approaches often reflects the complexity of the systems. They tend to be snowflakes, they tend to be complex, and this is why there hasn't been a single leader that has emerged in the space. So here is a different idea that's less common, because it doesn't superficially make sense.
I want cloud AND HPC, meaning I want to be able to run HPC, or cloud, at the same time, or something together that's more converged. Like, what the heck am I talking about? Don't worry, we'll talk about it. Let's first talk about strategies for convergence. These strategies, I need to point out, are not just about the technology; they are also about the people, which is often harder. The first is common goals. In order to get two different communities working together, they have to care about the same things. You can't get around that. The second is modularity: the degree to which your application or infrastructure can be modular, so that you can use things interchangeably and swap them, and be very creative. The third is integration. This is consumption of an entire thing into another thing, by way of different strategies. So let me give you some examples. For goals, the best overlap of goals I've seen is with respect to batch workloads. A few years ago, the Kubernetes community started the batch working group, and this was because of the new need to have AI/ML workloads in Kubernetes. Traditionally, Kubernetes is where you run services: you keep something running. There wasn't this concept of starting something and having it complete, but all of a sudden there was this new need, and guess what? We have been doing that in HPC land for a couple of decades now. For modularity, a really great example is actually with Kubernetes and Flux Framework. You may think of Flux as just this workload manager, but actually it's called a framework because we assemble many different components together into the workload manager known as Flux. Kubernetes is the same, a different set of components, and there is going to be a creative way that we can use these interchangeably. For the final example, integration, the best technologies I can point to are containers and language bindings.
Container technologies are literally a vehicle to let you move between spaces, and language bindings are going to let you take a traditionally C++ HPC project and extend it into a language that is native to cloud, for example, Go. Alrighty, let's get into some examples, just like eggs three ways. Here are some projects that we've actually been working on at the lab. The first is Fluence. As I alluded to, this is the Flux scheduler swapped in for the Kubernetes scheduler. The next is the Flux Operator, the entirety of Flux Framework implemented inside of Kubernetes. And then the namesake of this talk, the bare metal bros: Flux and Kubernetes working side by side. So let's start with the Flux scheduler within Kubernetes. You may be familiar with Kubernetes: when you launch a job, you ask for a certain number of resources, that's given to the scheduler, and the scheduler says, okay, here are four pods, have a nice day. So what we're going to do is bring in Fluence: our C++ package, flux-sched, mapped with Go bindings into a custom scheduler plugin. We're going to swap it in. You're basically going to be asking for the same amount of resources, but the scheduling is going to be done by flux-sched. How does this do? Well, we find that the workflows run three times faster. What you're seeing here is kube-scheduler on the top, Fluence on the bottom. You see a lot of randomness with respect to how kube-scheduler places jobs. What this leads to is a pathological scheduling pattern. Anywhere you see a red box on there, that is a startup delay. What that means in practice is that although the workloads themselves run in similar times, we have a lot of outliers; we have a lot of jobs that take a really long time to get started. And Fluence improves upon this. So Fluence is a really great example of modularity, because we're taking an HPC technology and literally swapping it in.
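From the user's side, a swapped-in scheduler plugin like this typically surfaces as nothing more than the pod's schedulerName field. A hedged sketch (the plugin name fluence and the image are assumptions; the field's default value is the built-in default-scheduler):

```yaml
# Fragment: point a Job's pods at the custom scheduler plugin instead of
# the default kube-scheduler. Names here are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: lammps
spec:
  template:
    spec:
      schedulerName: fluence        # assumed plugin name
      restartPolicy: Never
      containers:
        - name: lammps
          image: ghcr.io/example/lammps:latest   # placeholder image
          resources:
            requests:
              cpu: "8"              # what the scheduler packs against
```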
And the modularity of the software allows for that. It's also a great example of integration: because we have those Go bindings, we can speak the language of the cloud-native communities. Alrighty, next project, the Flux Operator. Super cool. All the gophers in Flux land are pretty cool. All right, so the Flux Operator is implementing the entirety of Flux Framework inside of Kubernetes: your own HPC cluster. This happens by way of a custom resource definition, or CRD, where you basically give all the parameters that you want for your cluster, whether that's a single job or an interactive cluster. This creates what we call the MiniCluster, and, you know, Flux doesn't know the difference between running in Kubernetes versus on bare metal. There's a lead broker that's connected to several follower brokers. So here you have one pod for one physical node, the tree-based overlay network, and within each pod, or node, you have Flux that's added on the fly to your application. The Operator is just going to reconcile until the state that you asked for matches the actual state of the cluster. How well does it do? We compared it to the best in the space last year, the MPI Operator, and the Flux Operator consistently outperformed the MPI Operator, we believe because of the ZeroMQ bootstrap. So the Flux Operator is a beautiful example of integration, because we're taking the entirety of Flux Framework and implementing it inside of Kubernetes. Bro, bro, bro, is it time for the bare metal bros? Yeah! Okay, so, warning: I've been saying bare metal, but nobody's going to give me bare metal, let's be frank about that. We're using virtual machines as a proxy for bare metal. So just a warning. So what's different about this picture? The orange is on the outside.
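As a rough sketch of what such a CRD looks like (the size and containers fields follow the Flux Operator's MiniCluster examples as I recall them; the apiVersion and the image are assumptions to check against the operator release you install):

```yaml
# Hedged MiniCluster sketch: one pod per "node", pod 0 is the lead broker.
apiVersion: flux-framework.org/v1alpha2
kind: MiniCluster
metadata:
  name: flux-sample
spec:
  size: 4                     # four pods, each running a Flux broker
  containers:
    - image: ghcr.io/example/lammps:latest   # placeholder application image
      command: lmp -in in.lammps             # run under the lead broker
```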
So we actually have Flux Framework on the outside, spinning up a Kubernetes cluster, and notice that we still have compute running on bare metal alongside Kubernetes. How's that possible? Don't worry, I'll tell you. So why do we need this in the first place? As you know, there are increasingly more complex, heterogeneous workloads coming to HPC. This means not just, you know, embarrassingly parallel stuff, but also adding in services, databases, task queues. Ah! Okay, so... this slide is not wrong. I was going to give you an example of such a workload, and apparently this slide is warning you that I'm a bad scientist, but I will point out that my example is actually a very good example, a prototype for this kind of design. Let's talk about that. So let's say that we're running simulations. We're training on examples one through N, whatever, doesn't matter, and we want to send them to a machine learning server, a specific endpoint, to do the training. We then want to wait until some metric of goodness, or perhaps a number of samples, and then we want to flip it around: we want to run simulations again, but we want to instead give them to our machine learning server without the actual values. Then we're going to have a vector of the true values and the predictions, and we're going to see how well we did. Now, very superficially, if we match this to HPC versus Kubernetes, this is how we'd do it: we would expect that the simulations would run better on bare metal, and the services would run better in Usernetes, or Kubernetes. But we need to prove that to ourselves first. So a lot of you are probably out there like, Usernetes? Kubernetes in user space? Are you nuts? I'm not nuts. There's actually something called Usernetes. There's a Kubernetes Enhancement Proposal, or KEP, from 2022 by a very talented developer named Akihiro Suda.
Akihiro, I must point out, won the top maintainer award at KubeCon last year. He's an incredibly talented developer; if you've used any of these technologies, he's the one behind them. Hats off to Akihiro. So at the beginning of last year, Usernetes was really a hodgepodge of bash scripts; it was really hard to use. So I engaged with Akihiro, and we released generation 2 of Usernetes in September. And guess what? It is using containerization, which is really great. It has these components that we'll go into in more detail. So what does it mean in practice? Well, it means when you're building a virtual machine, you need to have cgroups v2 enabled. I recommend Lima, Linux virtual machines, if you're prototyping this for the first time. It also means that you need to enable these kernel modules. Very generally speaking, br_netfilter is going to allow you to apply iptables rules to bridged traffic, and vxlan is going to allow you to connect VXLAN devices on different hosts to a standalone bridge. This is important because we actually have different physical nodes. Now, it's going to use rootless Docker. This isn't such a crazy idea anymore; many clusters have Podman these days. And so what does it mean? When you bring up these VMs, you're going to run a make up command that has two contexts. Both of them are going to build and start a base image that is using kind, Kubernetes, and Docker with CNI plugins. The two contexts are the control plane and the worker. The control plane is going to install Flannel and run kubeadm init. This makes a join command, which is basically a token that you give to the workers, and then the workers can authenticate and join the cluster. And so that's what they do; they're just like, I'm ready to serve. All right, so we created this cluster, small and mighty, using oVirt and Ansible. It is small and mighty: each node has eight cores and 30 GB of RAM.
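The host preparation described above might look roughly like this on each VM (the module names are from the talk; the make targets follow the Usernetes gen2 README as I remember it, so treat them as assumptions):

```shell
# Check that the host uses cgroups v2 (should print "cgroup2fs").
stat -fc %T /sys/fs/cgroup

# Kernel modules: br_netfilter lets iptables rules apply to bridged
# traffic; vxlan lets flannel connect nodes across physical hosts.
sudo modprobe br_netfilter vxlan
printf 'br_netfilter\nvxlan\n' | sudo tee /etc/modules-load.d/usernetes.conf

# Then, with rootless Docker running (assumed target names):
#   make up              # build and start the node container
#   make kubeadm-init    # control plane only: flannel + kubeadm init
#   make join-command    # emit the join token/manifest for the workers
#   make kubeadm-join    # workers: authenticate and join the cluster
```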
And I want to point out that we have seven nodes here because, generally speaking, we're going to have six that we run compute on, and one is going to be an admin node, or control plane. Again, warning: not bare metal, you get the deal. All right, so what's in these VMs when we bring them up? We have a complete system install of Flux, Singularity on bare metal for reasons I'll tell you in a little bit, LAMMPS installed on bare metal, and of course Usernetes ready to be brought up. Once I shell into these VMs, my Flux cluster is ready to go: I can do flux resource list and I can see all my nodes. For Usernetes, again, that administrative node is also a control plane, so we technically have six nodes to work with, and we can still see them with kubectl get nodes. Here's what we're working with: Usernetes and Flux running side by side, the bare metal bros. All right, bro, bro, what experiments do we want to run? All of them, bro! All right. So we first need to sanity check that what I said earlier about bare metal and LAMMPS and the simulations is actually true. We need to look at application performance between Flux and Usernetes. The way we're going to do that is by running a few things. We're first going to run LAMMPS on bare metal with Flux. We're then going to do the same thing, but in a Singularity container, and I did this just to demonstrate that you don't lose anything by using containers, which is great. We're then going to run LAMMPS in Usernetes with the Flux Operator. And then finally we're going to repeat cases one and two, but with Usernetes running in the background, to see if there's any overhead from that. And I need to pause for a second, because I know how incredibly cool this third case is. We have Flux on the outside. Flux is running Usernetes.
Within that, we are launching the Flux Operator, which is bringing up another instance of Flux, and inside there is where LAMMPS is running. So folks, I know Thanksgiving is over, but this is the ultimate turducken. And we expect LAMMPS to be slower in Usernetes because, as we know, it makes MPI collective calls, and Usernetes is using something called slirp4netns, which requires additional processing of packets with a tap device. I have a great paper I can share if you're interested in learning more about that. So, drumroll, the results. As we expected... well, actually, maybe we didn't expect it, but guess what: the Singularity container case is very comparable to actual bare metal. I was very surprised by this. So the container does not add a lot of overhead. And, as we did expect, that guy up there running in Usernetes is about twice as slow as running on bare metal. So what did we learn? Well, we learned that for a setup like this, the network-sensitive stuff probably should be run on the HPC side. But I'll point out there's opportunity for improving this in Usernetes. If you have experience with networking, I'd like you to go over to the GitHub right now, I'll wait, and engage there to work on this problem. Now, the next thing we want to look at is distributed machine learning, specifically two cases: one distributed across six nodes, and the second on one node. So for the distributed case the network is a variable, and for the one-node case, obviously, the network is not a variable. Drumroll, results: same thing, it's about twice as fast on bare metal, or twice as slow, I guess, on Usernetes. And interestingly, when you look at just a single node, these are really comparable. So there's no issue with running something on a single node in Usernetes in and of itself; it's really when you bring in the networking that it becomes a variable.
So it's the network, right? Well, let's sanity check one more thing. Here's iperf. We did one bit of transfer from each node as a client to each node as a server. We see the bit rate, in gigabits per second, is between 10 and 30 for bare metal, while the Usernetes numbers, barely detectable next to those, are really, really terrible. We can see the same patterns for transfer, so yes, it's the network; we're pretty confident that for this setup, it's the network. All right, can we do the fun workflow now? We absolutely can. So guess what, I actually prototyped this kind of workflow, because I was really excited about it. What we're going to do is launch a batch job with flux batch. This means a Flux instance that's owned by the running user; it's going to scope resources using hwloc, and in this batch job we can basically bring up and tear down all of Usernetes. We're going to take that workflow I mentioned before and map it into our Star Trek cluster space. So we're going to run simulations with LAMMPS, randomly selecting the problem sizes, to predict wall time. We're then going to bring up a machine learning server, a special server I made using River a few years ago, and then we're going to do the test cases: we're going to run LAMMPS again, but we're going to leave out the actual wall time and ask our models what it is. We're doing a thousand training samples and 250 testing samples. How do we do?
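A hypothetical sketch of that batch job (flux batch, flux run, and flux resource list are real Flux commands; the script names, manifest, and make targets are placeholders for illustration):

```shell
#!/bin/bash
# Submitted as: flux batch -N6 ./workflow.sh
# Runs inside a user-owned Flux instance, resources scoped with hwloc.

flux resource list                   # the nodes this sub-instance owns

make -C usernetes up kubeadm-init    # bring Kubernetes up for the services
kubectl apply -f ml-server.yaml      # placeholder: the River ML server

flux run -N4 ./lammps-samples.sh     # bare metal simulations, posting
                                     # samples to the server; blocks until done

make -C usernetes down               # tear Usernetes back down (assumed target)
```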
I put no thought into these particular models, but I did three kinds of regression. The Bayesian one, sampling from a probability distribution, didn't do super well, but for the first two there's an actual pattern between the predicted and the actual time. And so, although I put no thought into this, I was really pleased with this result: to see that the general prototype, this idea of having bare metal simulations running alongside a service, there is something here. We can do science this way, with actual, real scientific questions. And I'll point out that there are real heterogeneous workloads out in the wild that need this capability. Here's MuMMI, the massively parallel multiscale machine-learned modeling infrastructure, and this is basically simulating biological systems, the interaction between proteins and a plasma membrane. I'll also point out that the Moomins are what the name is based on: the Finnish comic book series with really cute hippos, often with yellow spiky hair. Very awesome. So this is the perfect example of the bare metal bros, of coexistence: adopting technologies to make it possible to coexist, and continuing to improve upon them so that, for example with networking, this environment can get even better. So what should you remember from this talk? If you take nothing else away: the first is looking out for opportunities for collaboration. Look for that alignment of goals between spaces; that's an opportunity. The second is providing handles for your components. If you don't have the bandwidth to look for opportunities, add some Go bindings to your C++ projects, because someone else could find you. The third is engagement. We need to show up at the table. We need to go to working groups, conferences, places that we haven't traditionally been, to engage and find these opportunities for collaboration. And possibly the most important is the mindset. We've had this mindset of cloud versus HPC, that one has to win or that they're different, for so long. We need to throw that away and get
rid of the adversarial thinking and have a more collaborative mindset. This is the vision that we have for the future, for converged computing, and we hope that you'd like to join us. So thank you. That's how to reach me, my email and social networks, and here are some interesting links for Flux and the various projects. I think I will take some questions virtually now. Okay, we can take a couple of questions. It seems like the Wi-Fi is stable enough to let Vanessa answer them. Do we have any questions? Okay, so Vanessa, we may have to repeat a question for you; we'll see how that works. Hi Vanessa, amazing talk, congrats. So I was wondering if your architecture can support sidecars, because one of the nightmares I had when I was trying to do something similar was that in order to get the sidecars running, I had to spin up a second network stack, and that created a lot of overhead. No, no, just one is on. Okay. Did you get the question, Vanessa? No, I didn't hear the question at all. Neither did I. Yeah, maybe that's better. Okay, let's do it like this: you'll come up front and ask it here. Yeah, that's perfect, that'd be great. I can hear you great. Hi there. Hi. So I was wondering if your architecture can support sidecar containers, because, as I was saying, when I was trying to do something similar, when I tried to create the sidecars, I had to create a second network stack within Singularity, so the network overhead was amazingly high. So, absolutely: the Flux Operator actually uses a sidecar container, similar in concept to an init container, to add Flux on the fly as a view. What's going on in Kubernetes is sort of a different thing than the networking issue, so the short answer is yes. To add to that, though, I'm not sure that Singularity as the container runtime for Kubernetes would work. I have never tried that, but it doesn't sound like it would work. Yeah, it needs to be done. Yeah, exactly. Hi Vanessa, thank you. Hi. It was the most fun presentation at FOSDEM so far, thank
you. So, when you were saying that the main difference in performance between the VM and bare metal workloads was related to the network: was that also the case for distributed training? And if so, were you using InfiniBand or not? So, we did not have InfiniBand, and you make a really good point that this kind of setup would need to be tested with an actually great network, and that is still a very big challenge even for cloud. For example, if you use AWS, you can bring in the Elastic Fabric Adapter, which will give you great networking performance, but if you go to other clouds, and I don't have to name them specifically, you tend to only get really good networks when it comes to using TPUs or GPUs. The exception, though, is Azure, which has a lot of really great HPC stuff built in. So absolutely, you could get that setup with InfiniBand. Hi, thank you for your talk. I had a smile on my face the whole time; thank you for having such high energy at the end of the day. What was I going to say? Oh yeah: in my workloads, I can probably reduce the network traffic by a very large margin if I can constrain certain jobs to specific nodes, because then large files don't have to be moved across the network for certain jobs. Is that something that you could keep in mind? So, if you remember the very quick machine learning experiment that we showed: when we're running something on one node and you're not using the network, there's no issue. So if you're just running something on one node in Usernetes, you won't have an issue, and to the degree that you can reduce anything that uses the network, so moving data, MPI, etc., you will get similar performance, at least from this small prototype experiment that we've seen, as you would on bare metal. I have to say this because it wasn't really bare metal. Thanks. One more question. Hey Vanessa, it's Danny. I'm going to dye my hair soon, so you won't recognize me again. I really liked your framing, actually. I thought it was going to be sort of
adversarial, and then I actually realized what you were saying, and I really appreciated it. However, regarding the adversarial framing: I have some experience with, for example, cloud tools and cloud environments being used as platforms for vendor lock-in. I think that what you described, especially with your converged computing, is kind of a way that you can push back, so that scientific labs aren't indebted to corporations. I actually think that you made a really useful example of one way to do that in your talk. So again, I was very, very impressed by the way you explained that. I would like to know, in a more general sense: how can labs, and potentially RSEs, make use of cloud tools without getting locked in or becoming beholden, again, to a corporate environment? And again, by the way, I think that you effectively did that in this talk, so I'm more looking for a general kind of thought about that. You're totally correct that vendor lock-in is an issue, and when you see many sort of niche APIs in different clouds and you build your entire thing around them, you do face that as an issue. But the great thing about Kubernetes is that it is this open source project that is available across clouds. There are subtle differences, but if you make a workload that can run on Kubernetes, you're going to have an easier time moving it between clouds. And, speaking for my lab, we work on Flux Framework, and one of our goals with Flux is to make things portable, not just between clouds, but between cloud and HPC. That's also why something like Usernetes, actually running Kubernetes on bare metal alongside HPC, is so important: because all of a sudden you have the same workload and it runs in all the places. That is sort of the vision. We want to make sure that the scientific workloads that we're running today can run in all places, not just one niche, specific cloud, not just one niche, specific center. Just convergence, TL;DR. That is very
exciting, and I really appreciate that response. Thank you so much.

Okay, that's all we have time for. This was great, Vanessa, I hope you agree.

Yeah, it was really fun. If anyone has further questions and stuff, please reach out to me; I love chatting. It was a pleasure, and I hope you have a great rest of your FOSDEM. Thank you.

And the best way to reach out to Vanessa is via HPC social, so don't forget to grab a sticker as you walk out. Please consider making a small donation in the box as well, to help cover the costs. And if you're leaving, please check if you see any trash around and take it with you: bottles, anything. Anything you clean up, we don't have to clean up. Thanks a lot, Vanessa, this was great. Bye.
Welcome to the Identity and Access Management devroom!
We are starting the second edition of the Identity and Access Management devroom. My name is Alexander Bokovoy, and this is Iker Pedrosa. Formally, we are the ones who organize this devroom, so if you have any questions, anything, please talk to us. A guy in a blue t-shirt will be the one moderating a specific session. I'm not talking about him or Trevino specifically, because this is a moving target: we have one t-shirt for whoever is moderating. I wanted to do a bit of a history reminder. We had the first edition six years ago. It was, I think, a successful one. We got roughly the same number of talks as we will hear today, and they were just as diverse and wide in topics. We also had quite a lot of people coming to listen, to the point that at some talks we actually caught the FOSDEM sickness: there were like 50 people in the room (it was a smaller room) and hundreds of people waiting to get in. So we truly enjoyed the FOSDEM-sickness part six years ago. I hope we will have enough space now, because this room is twice the size of the first one. For this year, as you all know, the full schedule is online, so you can get access to everything there. I will just remind speakers: please upload your slides using the new Pretalx interface, and please upload them roughly half an hour before your talk, so that people watching the live stream have a reference point and can follow along. You can omit things from the slides that you want to keep as a surprise during the live presentation, so it's not spoiled, but it's typically good to have them uploaded. Since this is a smaller room and we don't have another mic, when you're taking questions, please repeat each question so that it's recorded on the mic. Finally, we have shorter slots, so please leave one or two minutes at the end so that we can change over to the next speaker and the time isn't taken from the next talk.
And since this is largely done by automation and volunteers, you will get an email from the video team with a link to the details of your presentation recording, and you need to act on it, preferably today or at most tomorrow, so that they can re-encode the video and publish it. There will be an interface where you set where the talk starts and ends. In our room we have one mic, or maybe two; some other rooms have more mics, and there you can also choose which audio to take. Once you've set this up and signed off on the video, it gets re-encoded and published automatically on the schedule page. So all the people who missed your talk will be able to get the recording, and how fast they get it depends on you as a presenter. And now Iker does an overview.

You forgot one thing regarding the video review: you need to make sure that the sound is correct.

Yes, I forgot to say that you need to check that the audio sounds correct in the video. In some of last year's talks the sound wasn't very good at the beginning.

Yep. So, now it's over to you.

Okay, so my name is Iker. This is my first FOSDEM, and I hope you are enjoying it as much as I am. This Identity and Access Management devroom is, well, about identity and access management. We have several talks regarding passwordless; we also have multi-factor authentication, single sign-on, and user federation. In short, I hope you enjoy it a lot. Please leave space for the next speaker to prepare everything. And we are all volunteers, so if you find that something is not correct, either just fix it or tell us, so that we can try to fix it and have everything correct. I don't have much else to say, so thank you and have fun.
SpiceDB: mature, open source ReBAC
All right, so this is the talk on SpiceDB. Thanks, everyone, for showing up so early in the morning. I'm starting to lose my voice, because there was a long day yesterday of talking and meeting awesome people. This is my first FOSDEM. So who am I? My name is Jimmy Zelinskie. I'm the co-founder of a company called AuthZed, and AuthZed built SpiceDB. Previously, I worked at Red Hat and CoreOS, so I've been around in the container and Kubernetes ecosystem for a pretty long time, basically since the beginning. I'm actually a maintainer of OCI, which is the standard specification for Linux containers, and I've also started a bunch of projects in that space, notably the Kubernetes Operator Framework and some others. This talk is entitled SpiceDB, but since FOSDEM is more of a developer community conference, I really wanted this talk to be less of a vendor pitch for SpiceDB and more of a level set on the problems in the authorization space and the history and status quo there, so that everyone understands what might be the best tool to solve their problems. I'm not going to try to sell you SpiceDB for all problems, because the more informed you are, the better you can pick the product that's actually going to complement your software stack and what you need. And that means there will be way more qualified people using SpiceDB and way more qualified people using other authorization tooling. Obviously, I'm the most jazzed about SpiceDB, because I created it. So why are we all here? We're all here because there is a not-for-profit organization called OWASP, the Open Worldwide Application Security Project. They got started in the early 2000s, and they're famous for a list called the Top 10, which is basically an enumeration of the highest-risk threats for web security. As of 2017, broken access control was number five. As of 2021, broken access control is number one.
That means this is the biggest threat to the web and to all the applications running internet-facing on the web. But really, the question is: how did we actually get to this point? How did this happen, and how did it happen so quickly? I'm not going to point any fingers. What I'm actually going to do is dive into two different groups of stakeholders in the history of authorization. There's academia, the people publishing papers in this space and defining concepts, and then there are the industry practitioners who actually build the software and realize these systems as they're connected to the web. I'm going to start with academia. On the right-hand side, you're going to see a timeline, and on the left-hand side there will be some notes. Not on this slide, but you'll also see QR codes in this corner; those link to the specific papers, so if you're interested in any of these particular concepts, feel free to scan them. Our history of authorization actually starts in the 80s, and it really gets kicked off with the publication of the Trusted Computer System Evaluation Criteria, a security practices book published by the US Department of Defense. It outlines a lot of different security practices that are effectively part of the United States military, and in it they describe two different access control systems: discretionary and mandatory. Discretionary is effectively: if you created the information, you can share it, and if you're then given access to it, you can share it onward. It's at your discretion. I use file systems and Google Docs as examples here. It's not a perfect one-to-one match, but if someone shares a file with you on a UNIX file system and you have read access, you can copy that file.
And then you can change whatever permissions are on that copy and share it, similarly with Google Docs. So it's at your discretion how you share that information once you're given read access. Then there's mandatory access control, which is effectively a long, exhaustive list of all the access for a particular thing. Most notably, people are most familiar with SELinux as the example of this. If you're unfamiliar with SELinux, it's a way of locking down the Linux kernel. Honestly, it comes with a bit of a negative connotation, because mandatory access control is very verbose and very difficult to get right: you have to enumerate absolutely everything. Some people say that the three-letter agency of the US government that created it is the only group that actually knows how to configure it correctly. I don't know if that's true, or how many people use it; I know Red Hat is one of the folks that does promote SELinux. But the one thing about this slide I really want to drive home is that these ideas are as old as the military and war itself. There's nothing novel about the 80s where these ideas got invented; what actually happened was that someone only thought to write them down in the 80s, after the ideas had been in use for many, many years. So we jump roughly nine years to 1992, which happens to also be the year I was born, which makes me feel relatively old. In 1992, we get the paper on role-based access control. Role-based access control, often called RBAC, is where most people believe the state of the art for authorization systems is. The core idea is that there is a group that is assigned access to a particular thing, those groups are called roles, and you map users into these roles. By means of being in a role, access gets delegated to you.
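The role mapping just described can be sketched in a few lines. This is an illustrative toy, not any particular product's RBAC; all names and permissions here are invented:

```python
# Minimal RBAC sketch (hypothetical data): access is granted only
# through role membership, never to users directly.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "viewer": {"read"},
}
USER_ROLES = {
    "jimmy": {"admin"},
    "ana": {"viewer"},
}

def can(user: str, permission: str) -> bool:
    # A user has a permission iff some role they belong to grants it.
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )

print(can("ana", "read"))    # True
print(can("ana", "delete"))  # False
```

Note that this tiny model already exhibits the scope problem raised next in the talk: "admin" says nothing about which app or resource the role applies to.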
The number one problem with RBAC is that everyone defines it differently. If you build any enterprise software, you're going to talk to clients, and they're going to ask you for RBAC. But if I look at two different enterprise applications, they implement RBAC entirely differently. The only commonality is this mapping of users into groups that then have access. This is going to be a recurring theme across all of these papers published in academia, anything with *-BAC in the name: they're documenting concepts, not actual specifications that would give you an ultimately cohesive, designed, and secure system. Most famously, the biggest issue with RBAC is that there really is no scope. If you say someone is an admin, does that mean they're an admin of the entire web app? Does that mean they're an admin of a particular resource in the app? You just don't know until you actually build it yourself, so there's not really an easy way to reason about these systems until you actually touch them. Then we jump well into the future, to 2015, when the paper on ABAC, attribute-based access control, is written. Effectively, the idea behind ABAC is to generalize RBAC and say: the role that you're assigned is just one attribute that your user can have. Other attributes might be that you logged in from this IP address; many other dynamic attributes can be assigned to you. The really important thing about ABAC is that it provides real-time context. Now you can write rules like: are they connecting from this country, this subnet, at this time? You can delegate access during particular windows of time and perform more logic on the attributes that folks have. And now we're going to take a huge digression back to 1965. If you're unfamiliar, Multics is an operating system that was developed between MIT, GE, and Bell Labs.
You might not remember it, but it actually inspired an operating system you're probably familiar with: Unix. Unix is actually an attempt at porting Multics concepts to less expensive hardware. Multics is often credited as the first operating system with access control for the file system. I actually don't know if that's true, but it's often credited as such. In Multics, you have a file system tree, so you get hierarchical structure, and at every branch, which would be a file or a directory, you can have five different attributes assigned. You get read, write, execute, and append, which are all file operations you'd be familiar with. But there's a fifth one that's super interesting, called trap, which actually gives you the ability to do callbacks to user functions. It was initially designed so you could do file walking in user space. The whole reason I bring Multics up is that there was inheritance, there were attributes, and there were user-defined functions in an authorization system in 1965, when in academia the ideas behind attributes were only published in 2015. So there are systems using these concepts, but they hadn't been formalized and written down in concrete form. And this is a huge issue with the whole space, because people are doing things, but they're not really studying how to make these systems robust with these ideas; they're more just documenting the ideas ad hoc. So, getting back to the normal timeline: it's actually in 2007 that the term relationship-based access control is coined. The idea is that by establishing a chain of relationships (Jimmy is a speaker at FOSDEM, and speakers at FOSDEM have access to the FOSDEM speaker Matrix chat), and by following those chains, you can get to: Jimmy has access to the FOSDEM speaker room.
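The chain-of-relationships idea can be sketched as reachability in a tiny graph of relationship tuples. This is a deliberately naive illustration with invented names; real ReBAC engines such as Zanzibar evaluate typed relations against a schema rather than doing bare reachability:

```python
from collections import deque

# Each tuple reads: `subject` has `relation` to `object`.
TUPLES = [
    ("user:jimmy", "speaker", "event:fosdem"),
    ("event:fosdem", "speakers_can_join", "chat:speaker-room"),
]

def has_path(subject: str, target: str) -> bool:
    """Breadth-first search over the relationship graph: if any chain of
    relationships connects subject to target, access is granted."""
    edges = {}
    for s, _rel, o in TUPLES:
        edges.setdefault(s, []).append(o)
    seen, queue = {subject}, deque([subject])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(has_path("user:jimmy", "chat:speaker-room"))  # True
```

The point of the sketch is only that the answer falls out of traversing stored relationships, not out of per-application authorization code.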
So this term is coined around then, looking forward at what tech in the Web 2.0 era will look like. It's published initially while considering how Facebook's social graph works internally: when you share photos on Facebook, you say "friends of friends can view this." You're literally defining access in terms of relationships to yourself. Then we hit 2019, and that's when Google publishes a paper called Zanzibar, documenting an internal system at Google powered by these concepts. The difference, and the reason why I put 2019 on the timeline for ReBAC, is that Google is documenting a concrete implementation. Unlike a lot of these other papers talking about concepts, it describes an application of these concepts and really gives you a framework for how to use them effectively and correctly across multiple products at Google. Then in 2021, SpiceDB is open sourced, which implements similar concepts to Zanzibar; obviously, I'm going to get into that later. There are other *-BAC models, but these are the primary ones I mostly see in industry; you can dive into Wikipedia if you're interested in the others. But now for the industry side of things; we're leaving academia. Industry has this problem where they go to build a web application, and the first job is just to build the MVP, the minimum viable product. So what you're going to do is what you do with everything in a web application: store data in a database, probably the relational database you're using for everything else. And you're going to check whether a user has particular access based on some data you store in the database. It might be a role, if you're inspired by RBAC, or maybe it's just an enumeration of the list of users that can do a particular thing. So you may have written code that looks like this.
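The slide itself isn't in the transcript, but the kind of MVP permission check being described typically looks something like this. The table name, columns, and data are hypothetical, purely for illustration:

```python
import sqlite3

# Naive MVP-style check: a plain table of (user, resource) grants in the
# application's own relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE document_access (user_id TEXT, doc_id TEXT)")
db.execute("INSERT INTO document_access VALUES ('jimmy', 'doc-1')")

def can_view(user_id: str, doc_id: str) -> bool:
    # One row per direct grant; no roles, no groups, no inheritance.
    row = db.execute(
        "SELECT 1 FROM document_access WHERE user_id = ? AND doc_id = ?",
        (user_id, doc_id),
    ).fetchone()
    return row is not None

print(can_view("jimmy", "doc-1"))  # True
```

This works fine for an MVP; the talk's next point is exactly where it breaks down, e.g. once someone asks for recursive groups or cross-region checks.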
But the problem is that this falls over at some point in time: either the system you built is fundamentally just really slow, or it has to become way faster than you ever intended it to be, or users of your software demand new functionality that is not actually possible for you to implement until you refactor your authorization code. A great example of that is recursive teams. If you have groups of users, what if you have groups of groups? Or groups of groups of groups? That is something most people cannot build, or don't build, in their initial MVP, and when you get a requirement like that, you're forced to completely rewrite your authorization system. The other thing that could happen is that your company buys another company based on a different continent, and that means all the requests for checking permissions now have to travel across an ocean if they want to be correct. That's a huge problem, and making sure that the performance is viable and the answers you get to authorization questions are correct is difficult. So you hit one of these big issues, and then you're forced to enter the cycle that I'm going to get into. These numbers are fudged, but the whole point is: if you take an engineer, probably with expertise in that web app, who has worked on this authorization system, it's going to take them a while to implement this. It's going to be super sensitive, because someone else is going to have to review it, and that person also has to be deeply embedded in the code base. They're going to be extraordinarily careful, because any mistake in this code is going to be a CVE: it's giving access to people who shouldn't otherwise have access. So that's going to take a long time. Then you're going to do QA.
You might actually have to perform a security audit before you can deploy this software, because you're deploying to enterprise environments. And then you're also probably going to want to take extra time rolling these changes out into production. You probably don't want to deploy to everyone all at once; you want to deploy to a small subset, just in case you find something wrong with the code. All of this takes time, and the problem is that it puts the security of your software at odds with development velocity. Fundamentally, it's going to take you too long to add this functionality, and you're going to want to take shortcuts. But shortcuts are security flaws in your software. Then you rinse and repeat: you basically don't know how long until the pain builds up to the point where you're forced to rewrite the authorization system again. And that is the mystery box entirely. You could finish (or not even be finished) rewriting your authorization system, and then all of a sudden a new user sets some requirement for you, and you're doomed: you have to completely rewrite the thing you just re-architected to be future-proof. So how do we fix this never-ending cycle? OWASP themselves actually have recommendations for this. They say you should no longer adopt RBAC, but instead take concepts from ABAC and ReBAC. Obviously, I'm biased towards ReBAC, because I think it's a more modern approach. The OWASP folks also give some high-level benefits for why you would adopt these newer models over RBAC; I'm going to take this from the ReBAC perspective. When you're doing a graph-like thing, a relationship-based system, you're forced to talk about individual entities: this user, Jimmy, has access to this particular document. Because you're doing that, it has this buzzword, fine-grained. You're not resolving Jimmy to a role or a group.
You're actually following Jimmy directly to the document. You're talking about individual entities in the system, and as a result, you get more fine-grained access. I'm not generalizing about any users or painting over anything; I'm talking about the exact objects I care about. And that means you can develop systems where you delegate access to a particular row in a database or a cell in a spreadsheet. All of these systems are designed for speed, because they understand they're going to have to store a lot of data to be this fine-grained. And because your applications only talk about the direct objects they care about, none of the relationships in between get written into your code. You just ask the question: can this user perform this action on this thing? How they got access to it does not live in your code base anymore, even if you refactor or change how they get that access. That means you can make changes to your permission system and not change a single line of code in any of your web applications. And believe me, when you do that for the first time, it is a magical feeling, because you don't have to touch any code. Then there's also multi-tenancy and management ease, which is just simplicity around modeling. And with ABAC and ReBAC systems, you're paying it forward. RBAC might be really easy conceptually to implement at the beginning, but ABAC and ReBAC systems are more focused on forward thinking. If you need to make changes, like I just described, you can change ReBAC designs without changing code. It may be a little more effort to get started building and integrating with one of these systems, but by day two, if you ever need to make a change, it's going to pay dividends. So I wanted to get deeper into this Zanzibar paper I mentioned earlier, which kicked off the interest in ReBAC that you see today.
Basically, Zanzibar is a purpose-built graph database that is very specifically optimized for one thing: finding a path in a graph. By virtue of finding that path, the user has access to that particular thing. It's actually one of the few good things that came out of Google+. There are only two things that came out of Google+: Zanzibar, internally at Google, and Google Photos. The novelty of this paper is that it solves an authorization problem with a focus on distributed systems. If you'll notice, the title of the paper is "Zanzibar: Google's Consistent, Global Authorization System." It is fundamentally trying to tackle authorization as a distributed systems problem, which is not really something anyone else had done in the past. They acknowledge that if they're going to deploy one system at Google, it needs to work across all geos in the world, it has to be extremely reliable, and it can never be wrong. These are really difficult requirements. The anecdote I like to use is: when you're on a cloud provider like Amazon and you go to provision something like, say, an S3 bucket, you're always choosing a region. But if you go to set IAM rules in a cloud provider like Amazon, you don't pick a region. That is because these systems fundamentally have to be global, and when you're designing one yourself at a particular scale, you need to think about how you're going to make your system global. This paper inspired two companies, Carta and Airbnb, to go and implement their own internal systems based on its ideas. None of them are truly 100% faithful to the original paper, I would say; rather, the paper fused with the requirements of their business at the time.
So I think the real superpower of Zanzibar is this: if you go to send someone a Google Doc in Gmail and they don't have access, Gmail will pop up a box and tell you, hey, you didn't give access to this person. That fundamentally means Gmail has a way to ask questions and check permissions that are defined in Google Drive. So you can have one central source of truth for authorization data that your whole application suite, your microservices, can share. This is incredibly powerful, because not only does it allow integrations like this, it also gives you a central source of truth: if you need to audit something, you can just ask that one service. It's the only service you have to trust, and the only service you have to query if you're trying to really dig into this data, say when you have an outage or an incident and you need to understand what the access control looked like. So you might be wondering: how do I Zanzibar? This is exactly what we set out to answer. Basically, the year after the paper was published, my co-founders and I left Red Hat to found AuthZed and build SpiceDB in the open source. There were some folks experimenting with the ideas around ReBAC at the time, but no one was really moving the needle towards making this a production thing that you could use in a real enterprise environment or at a real tech company. We originally prototyped the thing in Python. It was type-annotated, lazily evaluated, functional Python, so it was way faster than you'd ever think Python should be, but it was not fast enough, so we ended up rewriting it in Go and open sourcing that. The name is actually inspired by Dune, because internally at Google the project was called Project Spice, because the ACLs must flow.
So the timing for that has actually been really good, with all the Dune resurgence in the movies. Internally at AuthZed, all of our software is named with Dune references as an homage. If we fast forward to today, the SpiceDB community has gotten contributions from a lot of companies, big names like Netflix, GitHub, Google, Red Hat, and Plaid, and there are production users ranging from small startups, where it's just the co-founders, all the way up to Fortune 50 companies. But I still haven't actually told you what SpiceDB is. SpiceDB is, as I described with Zanzibar earlier, an extremely parallel graph database. Developers apply a schema, just like you would for a relational database (I've given an example schema here, modeling a Google doc), and then they store data inside that database and query it according to the schema. And it's really magic: you can make schema changes in a forward-compatible way that lets you modify your permission systems without changing any code. We don't actually have a SQL API, despite being a database; we give you gRPC and HTTP APIs, and the primary interface we recommend is gRPC, for latency reasons. Because authorization is in the critical path of everything your web applications do, and possibly everything at your business, you really have to make sure this stuff is fast. Everything needs to be kept in memory, and everything needs to be returned in single-digit milliseconds, so gRPC is pretty critical for that. In addition to the main server, we also expose services that power dev tools, so you can get auto-complete and the like in your editor, as well as integration testing services. And it's Kubernetes-native, designed that way from the beginning; our background is all in Kubernetes. So SpiceDB is actually self-clustering.
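The slide's Google-doc-style schema isn't reproduced in the transcript, but the canonical example from the SpiceDB documentation looks roughly like this (the exact relation and permission names on the slide may differ):

```
definition user {}

definition document {
    relation writer: user
    relation reader: user

    // Permissions are computed from relations: writers can also view,
    // so changing this line changes access without touching app code.
    permission edit = writer
    permission view = reader + writer
}
```

The application then only asks "does user X have `view` on document Y?"; how `view` is derived lives entirely in the schema.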
So if you deploy SpiceDB directly onto Kubernetes, it will discover the other nodes and automatically start to divide and shard the in-memory graph that it uses to serve this data across them. We also offer a SpiceDB operator in the open source, which does automated updates for SpiceDB. Notoriously, zero-downtime updates for a database are very tricky, so we took that problem off the table for most people and made it automatic for anyone using Kubernetes. We remain true to Zanzibar's goals of consistency at scale, so we have pluggable data storage systems. Depending on your requirements, say you need to deploy everywhere on the globe, you can store all of your raw relationship data in something like Spanner or CockroachDB, and then deploy regional deployments of SpiceDB that exist as independent caches for those geos. Fundamentally, they share the same core data and are consistent across those environments. If that sounds too complicated, or you don't need it because you're a single-region shop, that's fine: we also have deep integrations with Postgres and MySQL, if you just want to use something like Amazon Aurora or RDS. Obviously, there's also an in-memory store for testing. We also have a tool called Zed. Zed is the command-line tool: it manages cluster credentials and backups, and it gives you a command for every single SpiceDB API. I've given an example of running a permission check with debug flags: you can actually see it gives you a whole graph traversal. It shows you a tree of how it computed whether or not someone has access, with timing data associated with all of it, so you can see where things slow down. We have a web IDE: the two things you just saw, SpiceDB and Zed, we compile to WebAssembly and run in the browser.
We basically build that all on top of Monaco, the engine that powers VS Code, and give you a full IDE where you don't have to install any of the software I just showed you. You can just go to play.authzed.com and start playing with this stuff: run Zed against live data, load in test data. And what we actually do is generate, exhaustively, all of the paths available in the graph for you, so there's somewhat of a model check happening here. You can prove exhaustively that all the ways the graph can be traversed are the ways you think they are, and that lets you prove a system is correct without deploying it into production or having someone do an extremely long security audit of your process. You can then check this stuff into CI/CD, so if you make a change to the schema, you can guarantee that certain assertions always pass and that everything is exhaustively checked. Now, Zanzibar is not a silver bullet; we've actually had to extend it in a bunch of different ways. SpiceDB remains true to all of the core concepts you'll find in Zanzibar, but not everyone is Google. Not everyone represents users the same way, so we are more flexible in how people can model their own users. We also add developer experience on top, because at Google they can say: you're forced to use this software. When you're building open source software, you can't force people to use it; you have to compel them by offering a better experience than what they're currently doing. We've also added contextual relationships, with ABAC: relationships can exist dynamically, based on context that you provide at runtime. That was a joint project with Netflix.
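The contextual-relationship idea (a relationship that only holds given runtime context) can be illustrated with a toy predicate. This is not SpiceDB's actual caveat mechanism, just a sketch with invented names and rules:

```python
from datetime import time

# A grant that only applies when the runtime context satisfies a condition.
# In a real system this condition would be a declared caveat expression;
# here it is a plain Python predicate over a context dict.
def business_hours(ctx: dict) -> bool:
    return time(9, 0) <= ctx["now"] <= time(17, 0)

GRANTS = [
    # (user, permission, resource, condition on runtime context)
    ("jimmy", "view", "doc-1", business_hours),
]

def check(user: str, permission: str, resource: str, ctx: dict) -> bool:
    # The relationship "jimmy can view doc-1" exists only while the
    # caller-supplied context passes the attached condition.
    return any(
        u == user and p == permission and r == resource and cond(ctx)
        for u, p, r, cond in GRANTS
    )

print(check("jimmy", "view", "doc-1", {"now": time(10, 30)}))  # True
print(check("jimmy", "view", "doc-1", {"now": time(22, 0)}))   # False
```

The point is that the stored grant is static, while the context evaluated at check time decides whether it currently applies.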
So if you're wondering how you SpiceDB: you can go to our Discord at discord.gg/spicedb, or check out GitHub; basically anywhere on the internet where you'd expect to find open source projects, SpiceDB is there. So thanks, everyone. Thank you. Thank you.
Improving Infrastructure Security Through Access Auditing
Our next speaker is Scott Bryan. He's going to talk about improving infrastructure security through access auditing. So, you're up. Morning, everyone. So, I recently joined Red Hat and I work full time on the Adoptium Temurin JDK project. We use a very traditional build model with a large suite of machines. We support between 12 and 15 different platform and architecture combinations, so it's very difficult to do with just Docker containers or single machines; that doesn't work, so we have a massive, massive suite of infrastructure. We're currently undertaking a massive piece of work to secure our supply chain, so we are looking at SBOMs, reproducible builds. But underpinning it all is a good infrastructure security strategy, and we've implemented centralized keys, rootless access, things of that nature. But how do you know all of that stuff is working? Unless you can visually see the results of all your security work, it's very difficult to prove whether it's working. When I came in, there was no auditing and no strategy for verifying that any security fixes had worked. This is a very cut-down presentation from the full-length one. So, first things first for us was identifying what we wanted to get out of an auditing system. We want to capture logins, any access attempts, anything at all where somebody was accessing a system, particularly in the build sphere. If you think about the SolarWinds attack, which was a compromised Jenkins server, I believe: if your build system infrastructure is compromised, your builds and source code are potentially compromised. You build something, it's got a vulnerability in it, but the checksums and everything else look valid, so that's what any end user sees. The other thing we wanted was automated response and alerting. Should somebody try to log in as root on a build system, that needs to be stopped straight away, and we need to be alerted that that's a thing that's happened.
I'll come to why in a little while; the scale of the problem when you don't know about it is very different to when you do know the numbers involved. And then we want some analytics and reporting so we can, again, gauge the program and the success thereof. Ultimately, for us, our infrastructure is all provided by a dozen different cloud providers and it's all publicly accessible, so even our build infrastructure is open to the web. You can request access to it when you join the projects. So, again, the attack surface is significantly large. We don't have a single firewall that we can use to restrict the IP addresses; it's all publicly available. So, for us: host-based intrusion detection using Wazuh. Not a tool we build, but it's open source, and it's a very good tool for this use case. I would recommend you do a very similar exercise: analyze your requirements and then have a look into the tools that are available; there are quite a few of them. Wazuh itself is a fork of OSSEC, which kind of stopped development when it became semi-paidware; Wazuh was an offshoot that is still open source, and they've continued to develop features for it. So, the scale of the problem. Some numbers from 24 hours across our infrastructure suite: just slightly over 2 million attacks in 24 hours. It's a bit of an eye-opener. Of those, 12 are deemed by the standard rule set from Wazuh, which is really excellent, to be serious enough to warrant concern. And you can see, in 24 hours, about half a million attempts of people just brute-forcing the build machines to try and compromise them. A demo is slightly impossible without my laptop, but you can drill down into all of these. You can see all the metrics that are available for the attack vectors, the CVEs, and you also see the 79,000 authentication successes here on the right. What's the difference between SSH and brute-forcing?
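The automated response described earlier (stop a brute-force or root login immediately, then alert) is typically wired up in Wazuh's `ossec.conf`. A minimal sketch, assuming the stock sshd brute-force rule ID; the rule IDs and timeout are assumptions you would adapt to your own rule set:

```xml
<!-- Sketch for ossec.conf: when the stock sshd brute-force rule (5712)
     fires, drop traffic from the offending source IP for 10 minutes on
     the affected host. Rule ID and timeout are illustrative. -->
<active-response>
  <command>firewall-drop</command>
  <location>local</location>
  <rules_id>5712</rules_id>
  <timeout>600</timeout>
</active-response>
```

The `firewall-drop` command ships with Wazuh and inserts a temporary firewall rule on the agent; the alert is still generated, which is how the dashboard numbers shown here are collected.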
Not all machines are accessed by SSH, so there will be things like Windows brute-force password attacks. But Wazuh detects, again, remote services, modify-registry attempts, all via RPC and things like that. So, again, the first thing it does is give you a nice visual view of how big the scope of the problem is; it's why I like this tool quite so much. Drilling down a little bit just into the authentication failures, you'll notice that Windows, by far, is the key attack platform compared to the Linux servers; the attempts number in the hundreds of thousands. And you'll see there that the top three machines are all Azure Windows build machines. It's a very popular thing to attack, and, again, you get a much better breakdown of the attack vectors: people trying to access restricted accounts, people trying to guess valid accounts. Although they're disabled on ours, the standard Windows administrator and guest accounts are accounts anybody can guess, and unless you've disabled them, they're a very easy attack vector. And then just brute-forcing things. And then, looking even deeper into just a single host, you can see down here at the bottom of the screen the login failures: unknown user, a bad password. In theory, that's somebody just typing an IP address in wrong. However, every single one of these attacks has been stopped with an automated response. You can go even further into blocking IP ranges, geographic ranges, so you don't even get the alerts; though I like the visibility, so I would say alert on only the really high-priority stuff. And you'll notice that once you drill down, there are actually no serious alerts. That proves it's working. So, again, you can take some comfort in that your infrastructure is fairly secure.
And then another really useful feature is that you can go into the details of each individual attack. You get a geographic region name, IP address, things like the target users they've tried to brute-force on our SSH-based hosts. There isn't a slide for this, but we've extended it, because Wazuh is eminently customizable: we also capture the SHA-256 checksum of the SSH key being used in the attack. We can then determine if it's one of our valid users, because we have all our keys stored centrally and distributed centrally via Bastillion. If it's not one of our keys, we can then start blocking SSH keys at that level. But, again, we've extended it to capture that information. And Wazuh is basically an ELK-stack-based system. It uses the logging part of it, Elasticsearch, and it just captures all the logs from all the systems. Again, you can customize it to capture whatever you like: your Windows system registry, whatever the Mac equivalent is, audit log, syslog, and it harvests it all into one place. Really nice and easy to query and work with, and it's got the capability of doing dashboards and searches. We're still fairly new to rolling it out and leveraging it for really serious stuff, but I think it's worth sharing even at this stage. And again, more extended audit information: this is from one of our Docker hosts. Somebody there has logged in as root; it's probably me. But again, you can see the kind of information you capture even on successful logins, if you're trying to find out who's doing stuff they shouldn't. And Wazuh itself goes much further. It's got a file integrity monitoring tool, which, again, you can alert on, so you can track all the changes to key system files. It's got an SCA component, so it will check your system against the NIST databases, look for any vulnerabilities, give you the links to the CVEs, and then the potential fixes if that information is in the NIST databases. All of that in one happy place.
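The SHA-256 checksum of an SSH key mentioned above is, in OpenSSH's convention, the unpadded base64 encoding of the SHA-256 digest of the decoded public key blob, which is what `ssh-keygen -lf key.pub -E sha256` prints. A small sketch of computing it yourself, for example when post-processing harvested logs against a central key store:

```python
import base64
import hashlib

def ssh_fingerprint(pubkey_line: str) -> str:
    """Compute the OpenSSH-style SHA-256 fingerprint of a public key.

    Expects a line like 'ssh-ed25519 AAAAC3... comment' and returns
    'SHA256:<unpadded base64 digest>', matching the output format of
    `ssh-keygen -lf key.pub -E sha256`.
    """
    blob_b64 = pubkey_line.split()[1]       # the base64-encoded key blob
    blob = base64.b64decode(blob_b64)
    digest = hashlib.sha256(blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")
```

Comparing such fingerprints against the centrally managed key set is one way to tell a legitimate login attempt from an unknown key, as described in the talk.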
Worth a look, and if you want some more information about how we use it, feel free to connect with me on the Adoptium Slack after this meeting. Whatever you need. I think we've got like a minute left, so time for one question, maybe. Say we're already using something like HashiCorp Vault, but it's lagging behind in audit capability, and audit capability is something we want to elaborate on and get ahead of. Does this even give us an advantage? Is it doing everything Vault does, or not? What is the wisdom on that? Okay, so the question is: compared to HashiCorp Vault, what does Wazuh give you? I can't see any reason why you couldn't use both. You could still use Vault for everything you're using Vault for, but what this would give you is the reporting tool on top. Would that work? Yeah, yeah. How much effort would go into it? I've never used HashiCorp Vault, so I really... But with Wazuh, say you want it to monitor your Vault: as long as Vault's putting some logs out for you to monitor, you could customize Wazuh to look at those logs, as well as your system logs, and still use the same visibility features and log harvesting. I don't see why that wouldn't work, but... So it's string-matching based, right, as long as I have log output? At the base level, yes, it's string matching and regex from log files, but that's just what it ships with by default. You can extend it to do whatever you like, pretty much, if you're willing to write it. OK. Right, I think that's it. Thank you very much. APPLAUSE Thank you. Thank you. APPLAUSE Adoptium is an Eclipse Foundation project for the Temurin JDK. Although Red Hat is paying my wages, I work full-time on the Adoptium project. So... Wazuh is a third party from the Eclipse Foundation's point of view. I just think it's... Yep, sorry. Sorry. Well, cheers, George. I'll catch up with you later, mate. We only evaluated Wazuh a little bit, but I saw it was best for our needs. OK.
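On the Vault question: Wazuh harvests whatever log files you point an agent at, so one plausible wiring (the log path is an assumption, and Vault's file audit device has to be enabled first, e.g. with `vault audit enable file file_path=/var/log/vault/audit.log`) is a `localfile` entry in the agent configuration:

```xml
<!-- Sketch: have the Wazuh agent collect Vault's JSON audit log.
     The path is an assumption; enable Vault's file audit device first. -->
<localfile>
  <log_format>json</log_format>
  <location>/var/log/vault/audit.log</location>
</localfile>
```

Custom rules can then match on fields of the JSON audit entries alongside the system logs, which is the "reporting tool on top" arrangement suggested in the answer.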
And it's all good being a little bit independent while working for the foundation.
Role of IGA in Access Management with Multilateral Identities
I have two affiliations. My main one is with Evolveum, which is the company behind the open source IGA system midPoint, and I'm also active in academia, helping scientists across the world get together and solve their identity problems. In this talk I will kind of combine all of my experience with this rather complicated topic. So let's start with some introduction. If we're talking about multilateral identities, we mean basically the whole scale of identities that are available to users, because users today have a lot of identities that they just own and can use to access systems. It can be identities from one's institution, but it can also be a social identity, an identity on, for example, GitHub; states, especially here in Europe, are pushing some European eIDs and digital wallets; there are academic identities, banks, and so on and so on. There are quite a lot of them, and all of these identities can be used somehow. Then the next item in the name of the talk was access management. This is the component responsible for actually giving access to people and doing everything related to access. One thing you can do is, of course, just type your username and password in, but in principle you can use all these identities as well. And then we have IGA, which is identity governance and administration; for those who don't know this term, it's basically an extension of identity management, and its main purpose is to take identity management, which is rather technical stuff for administrators, to the people who are actually making decisions: some managers, or even support. Just get them in, let them manage what they are supposed to manage, and not have everything done by technical people when the others call them. In this talk the identity governance system will be represented by midPoint, and I will try to show you how all these pieces fit together and what you can do with the combinations. So let me introduce midPoint as well.
As was said, it's an identity governance and identity management system, and because I'm here, of course, it's fully open source. Usually it shouldn't be necessary to say this, but when you are dealing with the identity management and access management areas, a lot of the products claim to be open source, but in reality they are just open core or something else entirely. With midPoint we are really doing our best to make it fully open source, including all the documentation and guidelines for developers; whatever is needed, everything is open and available to use. The product itself is maintained by Evolveum, and we have a few external contributors. We would be happy to have more of them, but it's kind of hard: identity management and identity governance are very complex and the tool contains a lot of code, so it's very tough to get contributors. But luckily we have some contributions, at least to the integration part, something that is easier to get into. So, about midPoint: it's very feature-rich, and I would say it's really comparable to any commercial alternative. I consider it a big success, and we are even recognized by some analyst companies, which is really nice. And, as you can expect from an open source system, it's really customizable and uses as many standards as possible. If you want to learn more, there is a link where you can find all the information. So let's get to access management integration, because this is quite common, but I think there is a lot of potential if you are integrating an IGA system with access management. From the IGA to access management, this is the more common path: the IGA, because of its identity management part, holds the most information about users and their accesses, so naturally the IGA can provision all this profile information about users to the access management and also provide data for authorization. It might be attributes, it might be roles, or even some combination. So this is quite natural.
The other way around is not that heavily used, and I think there is a lot of potential in it, because the access management, especially when using external identities, has a lot of information to pass back to the IGA. If we are talking about a single organization, when you're using a password you get no new information; but if we are using these external identities, usually with the identity we are getting some attributes that can be used. If it's a state identity, we at least know this person was verified by the state, and we have some identifier from the state that can later be used, for example if we are dealing with some big security incident. If we have academic identities, we can get information on whether this person is an academic employee or a student, and again use it for access control later. If we have social identities, at least we have some social identifiers of the person that we can use for integrations, for example, or we might also have other attributes like names, emails, whatever, that can be used to make the life of the person easier: instead of requesting them, just use the information that we already have. The second thing that we can get from access management is the access timestamp, because the access management of course knows when the user was accessing the system, so we can get these timestamps and work with them later; I will get to this. What are the typical interfaces for the integration? There is no standard, unfortunately, but we have some common options. From the identity management part, integrating anything is usually done through some kind of connector, basically writing a custom connector to whatever API the access management has, or there can be some middle layer, let's say LDAP or Active Directory or some standard database, that the access management can use.
And to get some information back, if there's direct synchronization, the identity management connector can read it back, or if you want some runtime integration, you can always call some API and do something like that. Let's move to the identity governance benefits. If you are familiar with identity governance, this probably won't be anything new, but I will just repeat it. The very important one is overall visibility. If identity governance is deployed, because you usually deploy it within a single organization, you want to be in control and have some visibility of what's happening in your organization, mostly to tighten your security and be able to go through audits and so on. So the main feature is some kind of reporting, or web pages and dashboards: who has access to what, and why. You can visualize, for example, if you are using role-based access control, who has which role, what the role is entitled to, in which applications, and why each person has this role. In midPoint we are using something that we call policy-driven RBAC, because RBAC is a very good tool, very easy to visualize and to explain to people, but you need something more in order to work with attributes and automated rules, so we have a kind of extension of RBAC. And if we are thinking, for this talk, about how we can bring these multilateral identities and the data we can get through the access management into the IGA and use it: the first option is to use these attributes. For example, if I know the person was vetted by the state because they come with a state identity, I may record this as a level-of-assurance attribute; I have high assurance in this identity, and based on it I can grant some access through RBAC classically, and then I can visualize it in the standard way using dashboards and really know what the person has access to.
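That level-of-assurance idea can be sketched as a tiny automated policy. All names and numeric levels here are illustrative assumptions, not midPoint's actual model, where such rules would live in mappings and role definitions:

```python
# Map identity sources to an assumed level of assurance (illustrative values).
SOURCE_LOA = {"state-eid": 3, "university-idp": 2, "github": 1}

# Roles gated by a minimum level of assurance (illustrative role names).
ROLE_MIN_LOA = {"payroll-user": 3, "research-portal": 2, "wiki-editor": 1}

def effective_loa(identities: list[str]) -> int:
    """Highest assurance among the identities the user has linked."""
    return max((SOURCE_LOA.get(src, 0) for src in identities), default=0)

def assignable_roles(identities: list[str]) -> set[str]:
    """Roles an automated RBAC policy could grant, given linked identities."""
    loa = effective_loa(identities)
    return {role for role, min_loa in ROLE_MIN_LOA.items() if loa >= min_loa}
```

For example, a user who has linked only a GitHub identity would qualify for low-assurance roles, while linking a state eID would unlock the full set; the dashboard view described in the talk would then show why each role was granted.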
Also, when I have a timestamp for when the user last accessed each system, again obtained through the access management, I can use it to build some policies, either to remove unused accounts and tighten security, or to naturally work with some kind of expiration and renewal of accounts, whatever I need for my particular workflow. And of course IGA wants to automate all of this: using RBAC, automated rules, provisioning through connectors, and integration with the systems, we make sure that everything I just said is completely automated and you don't have to worry about it. If full automation is not enough and you want some human element, some kind of interaction, you can have approval processes, expirations, renewals, and so on. So let's get to some interesting features of integrating all of this together, and I will start with the integration of access management to a given service using just-in-time provisioning, for now without identity governance. What you can do, and this is a very nice trick, is basically create accounts on the fly, because when we are using these multilateral identities, which already come with attributes, we can just pass the identity to the target system, and by passing it we are basically authorizing the identity to access the system; and the system, if it supports it, can create the identity and accounts for it on the fly, use the attributes, and grant the proper permissions within the system.
The tough part here is how to deprovision such accounts, because this creation of accounts is ideal, it's very simple, you can use it really on the fly, but you have no way to disable these accounts. The only way to do it is, again, for the end system itself to have some kind of expiration, because, and this is important here, if the person loses the access, they just don't go to the system anymore; the system never gets that information, there's no way for it to get that information. And also with this we have no central visibility of who has an account where and why, which might be tough when doing audits or resolving security incidents: you have to go through all the systems manually. So with midPoint, with the identity governance component in place, we can basically extend this using some extra tricks. The basic premise is that in midPoint we are managing entitled users. I'm not saying the user should have an active account on the target service at the moment; we're just saying he or she is entitled to have it, and whenever the user decides to access the system, again using just-in-time provisioning, we can create the account on the fly using this entitlement that midPoint manages. Also, what is nice is that midPoint supports provisioning, and it's really quick, it can be done in real time. So even if the target system doesn't support creating accounts just in time, the access management can ping midPoint and say, now it's time to provision this account; midPoint checks whether the user is entitled, and if so, triggers the provisioning. So we can have just-in-time support even for systems that don't support it natively.
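The flow just described, entitlements held centrally, accounts provisioned only on first access, and stale accounts removable without losing the entitlement, can be sketched as follows. This is a toy model with hypothetical names, not midPoint's API:

```python
from dataclasses import dataclass, field

@dataclass
class GovernanceStore:
    """Toy stand-in for the IGA database: who is *entitled* to what,
    and which accounts are currently *active* on target systems."""
    entitlements: dict[str, set[str]] = field(default_factory=dict)
    active_accounts: set[tuple[str, str]] = field(default_factory=set)

    def on_access_attempt(self, user: str, system: str) -> bool:
        """Called by the access-management layer at login time.

        If the user is entitled but has no active account, provision it
        on the fly (just in time); if not entitled, deny and leave no
        account behind.
        """
        if system not in self.entitlements.get(user, set()):
            return False                          # not entitled: deny
        self.active_accounts.add((user, system))  # idempotent JIT provisioning
        return True

    def deprovision_stale(self, user: str, system: str) -> None:
        """Remove an unused account; the entitlement itself is kept,
        so a later login can re-provision just in time."""
        self.active_accounts.discard((user, system))
```

The key design point from the talk is visible in the split: `entitlements` gives the central visibility for audits, while `active_accounts` tracks what actually exists on the targets, so deprovisioning for inactivity never destroys the right to come back.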
Also, midPoint, through its provisioning connectors, can read some data back. So regardless of whether the account was created on the service or through midPoint, midPoint will get the information that the account is there and active, and we can also read any additional information. Then we have full-scale information for the IGA: we know who has an active account and who is entitled but hasn't activated the account, and we can build all the policies on top of that, including expirations and renewals, work with the last access timestamp, and combine all this together. For example, if a user doesn't use a system for a long time, that's a kind of security risk; we can deprovision, but we still know the user is entitled, so on the next usage we can still work with that. And now we get to the part with multilateral identities, because that brings another level of complexity: with multilateral identities we expect that a single user can have multiple identities, and we can even combine them. We can say, okay, one identity, for example the state identity, brings your account to a higher level of assurance; we know this account was vetted. Then you can have some social accounts, saying, okay, this is your social ID, and we can connect with some social systems and do an integration because we know this ID. If we have some academic scenario, we know the person is a student or employee of a given university, or even of several universities; you can combine it all together. The tough part is how to correlate these identities, because there is no common identifier, nothing on which we can automate it.
In midPoint we support something we call smart correlation, and it enables you to configure how the individual accounts should be correlated. You can base it on the source: if from the source you know this is an email which was verified, and you are happy to correlate with existing accounts based on this email, you can set up that rule. You can set up some fuzzy rule, like matching on name, or even fuzzy matching accounting for typos and things like that; but this you probably don't want to fully automate, because there is some risk as to whether it's really the same person. So you can also define which matches should be processed automatically, just connecting these identities together, and which rules need some human interaction, and there are two ways to do that. If you want strict control, it can be done by an administrator or some other delegated responsible person, basically manually, in whatever process you need: you will just see, okay, these are the attributes, this is the new identity, you have to decide; here are some potential matches, decide whether one of the matches is real or whether you want to create a new account for this user. The second option might be to again use the access management part, because the user in principle owns all these identities and can use any of them to sign in. So let's just have the user sign in with the first one and then the second one in the same session, and then we know for sure that the user owns both identities and they can be connected together.
Also, what is nice here is that you can combine all these external identities, like state, social, academic, with local ones. If I have deployed IGA, as I usually have within a single institution, I have some local accounts managed by the HR department, so even a combination of these local accounts and the remote identities is possible using the same principle; there's nothing really different there. And what I can do with that is build some kind of unified profile: take all these attributes that I'm getting from different sources, and then usually build a single user profile. I don't want to work with a user who has six names from six sources, where most of them are exactly the same value; though maybe sometimes, if you have your name from a social network, you have a different spelling because you like it, or something like that. So we can just gather this data and then put in a formula for how to build a single user profile: how to select which name is the one, how to select which email is the one that should be used. Or, if we need it, we can just build this one profile, which is always handy to have if you don't have any special requirements, but then also have some extension of this profile. For example, we can have an official email within the institution and then a personal preferred email, and then we can decide, based on the target application, which one of the emails should be used and provisioned. This is all possible with midPoint; just put in the rules for how it should be processed. Of course, the most difficult part is deciding, because we want to have it simple so people can understand it, and also give an option to select, for example, their preferred email or preferred address, at least for the systems where this is not that important. And what we can also do, thinking about these rules and how to combine all this data together, is put some organization policies in, because it's really nice if you let users
use this freedom to select their preferred name or preferred email, but sometimes we have systems where we really want to enforce strict rules, because this is something that is, I don't know, sent to authorities for some validation, or might be tied to your payroll, and you want to have real data there. But then you can have, say, a company social network and give users the freedom there. So when constructing these rules, we can combine the organization policy with some user preferences, and even decide, based on the target system, which values should be used where. Sounds complicated; it is. But it's all about programming and putting it together for your organization, again with the end goal of having fully automated processing at the end, with some minimal user inputs, user preferences, and so on. It's not complete; there are still some missing pieces. We were experimenting with running some demos and improving midPoint as a product to support this better, but for sure it's not fully finished. The biggest issue is user experience, because a lot of these options, especially dealing with external identities, where users need to sign in and actively work with them, are hard, and this will be hard for a while. But it's getting better as people get more and more used to working with their identities and using one identity to sign in to a completely different system; now, with the push for European eIDs and so on, people will get more used to this principle, and it will get easier. Also, the interface between access management and IGA is not well defined now; we are just writing custom integrations on both sides depending on the needs. For sure it would be better to have some prepared interface that we could use to connect our product, midPoint, to existing access management systems; that would be really handy. Also, the life cycles of the individual identities, because we
are combining different identities into a single profile, and we should also think about the life cycle of each individual identity. Some of them are pretty persistent, like the state ones, but for others, if I know that someone is a student, I probably should verify this statement once in a while, and I can put in some policies or conditions. It would be nice if the protocols supported this, so we could, for example, query each day; but with protocols like SAML, basically until the user signs in again you don't know the current state of the information, so having some expirations and some renewals here as well would be really nice. Also, the whole assurance and trust model in this might be very complex: again, working with different sources of information, which are trustworthy and which are not, how we can process them, how we can use them, what our assurance of this information is. It's difficult to even decide what we want to do, and once we have made this decision, the essential thing is how to process it. We experimented with a small project which we called midPrivacy, and it was about attaching metadata to each value that we are storing: the source of the identity, the assurance level, and also, for example, how the value can be used within the GDPR framework. Having this all tied up, and again automated, so we can use it in automated processing and provisioning rules, would be really nice. We started it as an experiment just to get some feeling for it; it's fully available to people, but as far as I know nobody has tried to put it into practice yet, which is a bit of a pity. Again, there is a link if you want to read more about it. So, just to conclude, and I hope to leave some time for questions: it really is possible to combine these worlds and tightly connect an identity governance system with access management, and it basically unlocks potential for new features. Nowadays there are a lot of identities that people
can use to sign in to our systems: state, banks, and I'm expecting there will be more and more of them, and people will get more and more accustomed to using them, especially with these eIDs on the European level. So this is something that we should be prepared for, and IGA, even though I think of IGA as mostly within a single institution, making sure everything is tied down, well ordered, automated, and auditable, can work very nicely in this world of multilateral identities and bring the same conditions and the same benefits from the IGA to this world as well. But having a full implementation covering everything is complex, and it will probably take some time until we all get there. midPoint is kind of halfway through, and that doesn't mean halfway exactly: we have something now that can be used and experimented with, but to reach the maximum potential the product will need to improve as well, and because everything is open source and available, all contributions are always welcome. So thank you for your attention, and we have a few minutes for some questions. Yes; the question was whether we have some machine learning on our roadmap. We are already experimenting with that, though not for this particular problem. We decided to first start with role mining: if you are migrating towards IGA, you usually have a lot of manually managed roles, and it's good to mine some business roles out of the existing roles you already have, roles that can be easily managed, and we are using machine-learning principles for that. It could be good to use it, for example, for this identity matching, but so far we have been working on role mining with some customers. Yes; so this identity management is only one side of the picture, because if I have a user, he might be a sysadmin or whatever, he's leaving traces in the applications you grant access to, okay? So if this user is now
leaving the company, the institute, the university, how do you deal with the traces? Do you have a mechanism to maybe scramble the username, change the username in the application, so if it's reused... Yeah, reuse of usernames in the target application might be a big problem, yes. So the question was: we have all this in place, and what can we do when a user is leaving the organization, with his or her data; scramble it, remove it, do something with it. This is a tough question, because one part is the application itself, and when you have this automated identity management and identity governance system in place, you can usually deprovision the data completely out of the application. But then you have this central point of the identity governance, and the question there is how long you want to keep the data, for security incidents for example; that's valid, and you probably want to have it unscrambled for a year or two, depending on your policy, and then you should again automate the process of either scrambling the data or completely getting rid of it. I would say except for identifiers: especially if we're talking about usernames, you probably don't want to reuse them, at least not within a certain period of time, so I would recommend keeping those. Yeah, but it's a bit more than that; like in the talk before, with Wazuh: we have some web application, and some person creates a dashboard, so within the application it belongs to that person, and everyone else is using it. I can't just delete it, but the creator is gone, so it's more complex than this. Yes; so the comment was, if a user creates something like a dashboard in a web application that others are using, and the original owner leaves, can we delete it or not? And if you are within an organization, where you have complete control over your users, you need some process to pass this work to someone else, and I would say you have to have a process for leavers: in the same way you are returning
your keys to your office, you should also return all your digital systems or transfer them to someone else. But what you can at least automate in this case, if you have something like a dashboard: have a process that, before the deletion, will send a notification, or let someone approve it, and that could help to automate it. Okay, time is up. Thank you, and we can continue this question later.
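The leaver process discussed in this Q&A (notify before deletion, transfer shared artifacts, and keep usernames reserved instead of freeing them for reuse) can be sketched as follows. This is my own simplified illustration, not midPoint's actual API; all class and method names here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Account:
    username: str
    owned_resources: list = field(default_factory=list)

class LeaverProcess:
    """Hypothetical automated offboarding, modeled on the Q&A discussion."""

    def __init__(self, reuse_ban: timedelta = timedelta(days=365)):
        self.reuse_ban = reuse_ban
        self.reserved = {}       # username -> deletion timestamp
        self.notifications = []  # messages sent before deletion

    def offboard(self, account: Account, new_owner: str, now: datetime) -> None:
        # 1. Notify before deletion so someone can approve or object.
        self.notifications.append(
            f"{account.username} leaving; transferring "
            f"{len(account.owned_resources)} resource(s) to {new_owner}"
        )
        # 2. Transfer shared artifacts (dashboards, etc.) instead of deleting them.
        account.owned_resources.clear()
        # 3. Reserve the identifier instead of freeing it immediately.
        self.reserved[account.username] = now

    def username_available(self, username: str, now: datetime) -> bool:
        deleted_at = self.reserved.get(username)
        return deleted_at is None or now - deleted_at >= self.reuse_ban
```

The key design point from the answer is step 3: identifiers are kept out of circulation for a policy-defined period rather than deleted outright.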
FusionIAM - a full Open Source Identity & Access Management solution
So, we're going to start our next talk: FusionIAM, a full open source identity and access management solution. Bonjour, I am French, but I will speak in English, okay? No problem. Yes. So, some words about me. I'm Clément Oudot, I work in a French company called Worteks, and I'm doing a lot of stuff about identity management, of course, because I'm here to talk about it. I'm also doing other things, like music. If you want to listen to French music, open source music (it is Creative Commons), you can go on my website, and I'm also doing theater and other things. Very quickly about Worteks: we are a service company and we provide many solutions, like collaborative tools, containers, and of course identity and access management, and I will talk about this. And if you want to not play music but work on open source, you can apply on our website. So, for the topic today, I will talk about the FusionIAM project, explain why we created this project and which open source components we use to try to build this big solution. So, we decided to create this with Benoît Mortier, who is the leader of FusionDirectory. I don't know if you know the FusionDirectory product, who knows it? Okay, not many people. So, it's cool that you came here, because you will know about it today, and about Worteks. We are both people working on open source products around directories and identity management. The goal was to offer a complete identity and access management solution, because you know that with proprietary solutions, when you buy one, you get all the components of identity and access management. But if you are using open source tools, most of the time you only get one piece of the full picture, and you need to install them and connect them. So, our opinion was to say, okay, we know that in open source, each product must do one thing and do it well.
But if we want to be able to go to companies and say we are doing identity and access management, we must provide a full, integrated solution. So, that was the reflection. And today we are working on this project at Worteks, David Coutador and myself. So, who knows OW2? Okay. It's normal, because it's a French consortium, like Eclipse, but you know Eclipse and you don't know OW2. So, today you know OW2. There are a lot of projects inside OW2: BlueMind, GLPI, LemonLDAP::NG, et cetera. And we are an official project of OW2. So, one option when you want to build a new open source software is to say, okay, everything that exists is a mess, so I will write everything myself. But of course, I have a family, so I don't have the time to write everything from scratch. So, we took all the open source projects that we know and we tried to combine them together. The one you may know is this one, OpenLDAP. Who knows OpenLDAP? Okay, yes. Of course, we are not the developers of the OpenLDAP software; it's something that is managed by the Symas company and Howard Chu, who is the leader of OpenLDAP. But we are very involved in the community and we work a lot with OpenLDAP. So, our choice for the directory server, which is clearly the base of identity management, is OpenLDAP. And then, we put in a lot of products. So, LemonLDAP::NG, who knows it? Ah, yes. And we have the founder of LemonLDAP::NG, Xavier Guimard, here. So, we have some of the community here at FOSDEM around LemonLDAP::NG. I will explain all of this: FusionDirectory. So, as I said, LDAP Tool Box. Okay, LSC. Okay, it's normal, because these are the products that I created. So, okay, it's normal. Okay, so, these are all community projects, open source projects. Of course, you only know this one, but you will see how we try to combine them. Our approach was to say, okay, we can be like IBM, HP, et cetera, and we can go into your company and say, we have all the components.
So: access management, the access manager, the directory server, the directory manager, synchronization, the connectors, and two other components, White Pages and Service Desk; I will present them. So, that's a typical big proprietary IAM solution, okay. But we put all the open source software behind the same picture. So, of course, the directory server is OpenLDAP, but we added some tools in the LDAP Tool Box project to better manage OpenLDAP, to do backups, et cetera. The directory manager is FusionDirectory, the connectors are LSC, the access manager is LemonLDAP::NG, and the other tools are parts of the LDAP Tool Box project. Of course, I will present them, but I know that you know other software to do that. Typically here, the best-known open source tool for the access manager is Keycloak; who knows Keycloak? Of course, everyone knows Keycloak, but I will explain why we chose this one. We made another choice for the single sign-on product, and this is LemonLDAP::NG. And likewise, of course, Evolveum midPoint, which we just saw before, is another possibility here for the directory manager, et cetera, et cetera. So, everyone can choose which technical components they will bring into identity and access management. We made this choice because we are clearly developers of a lot of these components, so we can act on the roadmaps of these components, and we know how to make them work together. So, if you choose FusionIAM, you will take the choice we have made. If you do not agree, you can just fork and replace the components if you don't like them. From a technical point of view, if you have already installed Keycloak and a directory server, you know that it's quite simple. All components are linked to the directory server, because that's where you have your users, passwords, groups, et cetera. And here you have the connectors, to be able to synchronize from a database, from an Active Directory, for example. So, it will go into the LDAP server. These are tools to manage the data.
So, White Pages, just to display the photo, et cetera. Here is to be able to reset the passwords. Here is to create accounts, et cetera. And the access manager will also be connected to the directory server to do the authentication. All these are LDAP or LDAPS flows. You have just one database, used by the access manager to store the configuration and all the sessions, but the other tools do not need any database. All the tools are only using the directory server. And of course, you have here the access manager. So, the end user will only see the access manager part, to be able to access all the data here and to access also all the components. So, some explanation of the software. The first one, everyone knows. So, like Tina Turner said, it's simply the best. I hope you all have the song in your head now. It's the best LDAP server in terms of performance and standards compliance, because the people coding on OpenLDAP have also written part of the RFCs of the LDAP protocol. So, we're sure that this component respects the LDAP standard. And if you manage your LDAP by yourself, you know that you can add a lot of features with overlays, like password policy, which is very important in identity and access management, to be able to expire your accounts, to lock your accounts, et cetera, et cetera. And we will see that we bring other tools to be able to manage the OpenLDAP password policy. And in the LDAP Tool Box project, we provide some packages to be able to install OpenLDAP on different distributions. You may know... are there people from Red Hat here? Okay, it's not a problem. But you may know that Red Hat has chosen to push OpenLDAP out of the distribution, in order to use the Red Hat Directory Server as the main directory server. So, if you want to install OpenLDAP with a package on a CentOS, et cetera, you can use the Symas packages or the LDAP Tool Box packages. And of course, we also provide packages for Debian, Ubuntu, et cetera.
Okay, so the directory is okay. The directory manager: we chose FusionDirectory. It's a PHP application. It's not like phpLDAPadmin, which is a very technical tool in which you browse the tree, et cetera. Here, you have a functional view of all the objects that are in your LDAP directory. So, of course, users and groups, but you can also model the services, applications, et cetera, et cetera. So, it's a very functional view of this. And it includes administration delegation. So, you can say a person is connected to this interface, but they can only manage the people in their service, et cetera, et cetera. So, it's like midPoint or other software like this: it offers a user interface for people to read, edit and administrate the data, depending on their rights. The connector: no UI. It's just a command line, but it's a very powerful tool written in Java. And it talks with REST APIs, it talks with databases, it talks with Active Directory. So, we are able to easily synchronize OpenLDAP and Active Directory with this tool. Very efficient. And LemonLDAP::NG. So, the Keycloak killer. No, I know it's not, but okay. It's like Keycloak, but we provide an application menu, and we manage all the access control. White Pages is an easy way to display the data of your directory, for end users to search for a phone number or an email address. So, these are only LDAP data. I created an LDAP directory with Star Wars data, and you can display them, search for the Empire, the Jedi, et cetera, et cetera. But there is no database, it's only an LDAP directory. And Service Desk is a little tool for the support team. First, you can see all the password policy data from OpenLDAP. If you work a little with the OpenLDAP password policy, you know that it's very technical to understand how the state of the password is managed. So, here you have all the dates, et cetera. And you can test the current password.
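The password state that Service Desk derives from OpenLDAP's ppolicy overlay can be sketched roughly like this. The attribute names `pwdChangedTime` and `pwdAccountLockedTime` are real ppolicy operational attributes; the parsing and the way the maximum age is passed in are my own simplification, not the actual Service Desk code.

```python
from datetime import datetime, timedelta, timezone

def parse_generalized_time(value: str) -> datetime:
    # LDAP GeneralizedTime, e.g. "20240203120000Z"
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)

def account_state(entry: dict, pwd_max_age_seconds: int, now: datetime) -> str:
    """Derive a readable state from ppolicy operational attributes."""
    # Presence of pwdAccountLockedTime means the account is locked.
    if "pwdAccountLockedTime" in entry:
        return "locked"
    # Expiry is last password change plus the policy's pwdMaxAge.
    changed = parse_generalized_time(entry["pwdChangedTime"])
    expires = changed + timedelta(seconds=pwd_max_age_seconds)
    if now >= expires:
        return "expired"
    return f"valid, expires {expires:%Y-%m-%d}"
```

This is exactly the kind of computation that is tedious to do by hand from raw LDAP attributes, which is why a support UI that shows the resulting state is useful.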
Of course, you can reset the password. And if the account is locked, you can see it and unlock it; you can see if the password is expired, et cetera. So, it's very easy for a support team to know if an account is expired or locked, and to unlock it, et cetera. So, moving to the cloud, because that's how we need to work now. Why? Because before (and we still do it for customers) we had virtual machines, and we deployed all the packages and configured all the packages. And we'd say, okay, the LDAP directory is here and you need to connect to this web server, et cetera, et cetera. And when you want to put in the logo of the customer, you need to put the logo in every product, so the customer says, okay, it's integrated. Okay, this still works, but it's a lot of work indeed to reproduce this for every customer; you need to script all that. And the cloud approach is to say: we will move from packages to containers, to images, and we will try to configure all the images, all the containers, through variables. And indeed, we saw that the LDAP server is the same for each component that needs to connect to it. So, I only need one parameter, the LDAP URL, for all components. I configure it once and then I can have the full solution. Of course, when you do cloud... okay, it's a mess. We need to have pods, we need to have volumes, et cetera, et cetera. So, you see that what was fairly easy with some bricks and some components is not so easy in the cloud, because you need to identify which volumes you need to run the containers. And when you split, you usually split the web application between the front end and the PHP-FPM part, or the LDAP server. But it's better, because we can run all these images and, of course, for the LDAP server we have a volume for the data, a volume for the configuration, and also one for the certificates, the CA certificates.
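The "one parameter for everything" idea described here can be sketched as follows: every container reads the same `LDAP_URL` environment variable and derives its own settings from it. The component names and variable names below are illustrative assumptions, not the real FusionIAM configuration keys.

```python
import os

# Hypothetical component list; each one needs the same directory settings.
COMPONENTS = ["fusiondirectory", "lemonldap-ng", "white-pages", "service-desk"]

def build_configs(env: dict) -> dict:
    """Derive per-component settings from a single shared LDAP_URL variable."""
    ldap_url = env.get("LDAP_URL", "ldap://localhost:389")
    base_dn = env.get("LDAP_BASE", "dc=example,dc=org")
    # Configured once, applied to every component.
    return {name: {"ldap_url": ldap_url, "base_dn": base_dn} for name in COMPONENTS}

if __name__ == "__main__":
    for name, cfg in build_configs(os.environ).items():
        print(name, cfg["ldap_url"])
```

The design point is that the directory is the single shared dependency, so one variable is enough to wire the whole stack together.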
And so, in the FusionIAM project, we identified all of this and we created all of these images and volumes. So, you just need to do "make run" and it's running. We have a container registry. It's open source, it's available. You can just pick the images and run them with Docker, Podman or Docker Compose. So, it's very easy to test. And you can also clone the Git repository and run everything with the Makefile, and it works. The only thing you need to do is to initialize the volumes and, of course, put in some configuration for your domain, et cetera, et cetera. But you just have to do this and you will have the full identity and access management stack running and configured. So, it's very easy. At Worteks, we chose to create a new offer, identity as a service, and we put FusionIAM in our cloud for our customers. So, for each customer we run one FusionIAM instance. And so, a customer doesn't have any directory, doesn't have anything, but they can connect all their applications through SAML or OpenID Connect. They have all the applications to manage the data inside the enterprise directory, which is in the cloud. And we, of course, have a lot of REST APIs. We have REST APIs for provisioning, to create accounts, to create groups, et cetera; you can do all this with the REST API. And we also have some REST APIs to be able to create a new OpenID Connect client or a new SAML client. So, you can provision the users and the groups, and you can also provision the applications through the REST API. And, okay, I know I have five minutes. Yeah, I can do a demonstration. Okay. Ta-da! So, it's not a screenshot, okay? It's a real interface. It's hosted by Worteks, running on OpenShift, which is the Kubernetes from Red Hat. And so, this is the login form. You see, it's the access manager component, LemonLDAP::NG. And inside, we plugged all the IAM components.
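The REST provisioning described above (creating accounts, groups, and OIDC or SAML clients) could be driven by a small client like this sketch. The endpoint paths and payload fields are purely illustrative assumptions, not the documented FusionIAM API.

```python
import json

def provision_request(kind: str, attributes: dict) -> dict:
    """Build a POST request description for a hypothetical provisioning API."""
    paths = {
        "user": "/api/v1/users",          # assumed path, for illustration only
        "group": "/api/v1/groups",
        "oidc-client": "/api/v1/oidc/clients",
    }
    if kind not in paths:
        raise ValueError(f"unknown object kind: {kind}")
    return {
        "method": "POST",
        "path": paths[kind],
        "body": json.dumps(attributes),
    }
```

In a real deployment you would send these requests with an HTTP client and an authentication token; here only the request construction is shown.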
So, this one is for configuring LemonLDAP::NG, just the administration interface. And, okay, so you get all the parameters. And here, you have the other components. Of course, it's a demo. Yes. Okay. So, this is how we manage the users. What you can see is that you can work with departments and branches in the LDAP directory. And we can create... so, I create, for example, a Kurt Cobain account, because, okay, he died 30 years ago this year. So, it's a simple account, okay? As an administrator, I have this view, okay? But if I want to browse the directory, I can see it. So, I'm an end user, and I want to see the information of Kurt Cobain. I can browse it through White Pages. But this is clearly the same data. You see that you can also browse the groups, with Britney. And this tool is wonderful, because we can dynamically use the postal address inside the LDAP directory to display people on a map, okay? So, it's a nice feature for an intranet, for example, when you are all in remote locations: you just put in the postal address of people and you can see them on the map. It's quite nice. And, of course, you can click and see, okay, this one. And if you are in the support team... okay, Britney Spears has lost her password, okay? So, Britney. Okay, I can reset the password. I'll say, okay: baby, one more time, okay? And, okay, the password was changed. We activate the flag, so she must change the password at the next connection. All this is managed by the OpenLDAP password policy. And of course, when she connects through any of the components, the component will respect the password policy and she will be forced to change the password at next login. Okay, that's all for the demonstration. Thank you. Some questions, maybe? Yes? Can you change Active Directory passwords from this? It's a feature that we did not implement. Can we use this component?
This component is the LDAP Tool Box Service Desk; changing the passwords on Active Directory, not yet. Lots of people are saying, okay, this is wonderful, but I don't have OpenLDAP, I have Active Directory and I want to use it. So, it's on the roadmap, but it's still not available. But, for information, this one has some hooks, so you can reset the password in OpenLDAP and also hook in a change at the same time in Active Directory. If you have both directories, OpenLDAP and Active Directory, you can use the hook to push the password to Active Directory. But, if you only have Active Directory, you cannot use it for the moment. But maybe next year. Maybe next year. Yes? Do you support private ACME servers for the certificates for these web services? Sorry, private what? ACME. ACME. Let's Encrypt. Do we support ACME or Let's Encrypt? Of course, yes, because you just have to run it in the container, yes. How do you handle applications which cannot use OpenID Connect or SAML, and where you use host headers and authentication? Yeah, so how do we manage applications that are not modern applications, which would use either SAML or OpenID Connect to do single sign-on? LemonLDAP::NG is also compatible with the CAS protocol, so we can also use CAS, but in the cloud, we say that CAS is not secure enough. And we have a component in LemonLDAP::NG called the Handler, which is an agent that you can install remotely on your infrastructure and which can communicate through REST with the portal in the cloud. So, you can secure some local applications with an agent on your side and let the agent deal with the sessions, et cetera, through the REST API of LemonLDAP::NG. So, we can do a mixed mode between the cloud and your local applications. It's over? Last question. Last question, a very good question. Can we authenticate users using certificates, personal certificates? Yes, you can.
The question is: can we authenticate users with certificates? Yes, LemonLDAP::NG can use certificates, Kerberos... We are compatible with second factor authentication, WebAuthn, et cetera. So, we have a lot of methods. It's like Keycloak, but it's French. Time's up. Okay, thank you. Thank you.
Add user self-management, brokerage and federation to your infrastructure with Keycloak
Adding user self-management, brokerage and federation to your infrastructure with Keycloak. Keycloak has been mentioned now and then in the previous talks; it was great to hear. I'm Alexander Schwartz, just Alex. I'm working at Red Hat on the Keycloak project full-time, and I'm also a maintainer since last year. I've been using Keycloak for several years. When I was back then an IT consultant, we were building applications and using it as an identity and access management solution, and back in the day a lot of customers did not have Keycloak, so when we brought an application in there, a custom-built one, we put Keycloak next to it to do the IAM stuff. And over time, when we built applications for customers, they already had Keycloak, so that was great. Two years ago, I joined Red Hat full-time working on Keycloak. What do I do at Keycloak? I'm doing a lot of performance testing, database stuff, also a bit of LDAP. Keycloak has so much to offer, and when I was reading the corporate presentations, one was stating things about federation and LDAP, and I thought, yeah, I could present you this today, and this is what I will do: presenting what already exists in Keycloak and also some of the things that will arrive in the next version. The current version is 23 and the next version is Keycloak 24, and you can already download the things that are shown today in the nightly build of Keycloak. Right, so, the agenda that I brought for today is more like a journey that I saw customers going through when they enter the identity and access management space. Day one is: single sign-on is cool, right? I need only one password to access all my services. So that's where it all starts.
Day two is: yeah, well, I need to get a bit more flexible, because I have maybe one directory with users, maybe multiple directories of users that I want to integrate, lots of applications. And then day three: yeah, I want to eliminate the daily churn, like resets of passwords, with user self-management. And that's especially where the things come in that we have in Keycloak 24, around user self-registration and declarative user profiles; we will see that. So why is single sign-on cool? I said, well, users need to remember only one password, and they authenticate only once a day. In the morning, usually, when they get to work, and then, depending on how you configure it, it's maybe valid for 24 hours, for 10 hours, for eight hours, whatever the policy of the company is, and then they can access all these applications over the day with the credentials they entered. And well, usually a password might not be enough, so you have a second factor, you have one-time tokens, maybe a mobile app that generates these small codes, you have FIDO keys, WebAuthn and all that stuff. And maybe some applications need it and other applications don't need it when you access them; maybe you want to re-authenticate during the day when you access a special application. So all those things come with Keycloak. And, not the least thing, but usually somewhere in the middle when you deploy Keycloak in your organization, you want to theme the front end, right? It should show at least the colors, maybe the logo of your organization; it's to make your users feel at home. It might seem like a small thing, but it really helps the acceptance of it in an organization. So I'd say, even if you're deploying a single application and need identity and access management for it, it makes sense to deploy Keycloak for that, because then you don't need to reinvent it yourself, right? And doing user management right, with all the bells and whistles, is not an easy thing.
So how does Keycloak work in the end? You have a user with maybe a mobile device, maybe a regular device, and they log in with Keycloak. So Keycloak presents a login screen, does the handling of all the second factors that you configure, and then the user sends, from their browser, a token to the services in the cloud, whatever they are. And the application can then either check the token directly, by inspecting the token's cryptographic signature and the timestamp, or it will send this token, for example, back to Keycloak to figure out who that user is and retrieve some additional information. This is possible. You might also use that token when you're integrating other authorization services, like OPA or something like this, which then decide: is this user allowed to access this service or not? So that's the basic setup. And Keycloak, you can deploy it as a single container connected to one of a bunch of databases that you can choose from, be it Postgres, MySQL, MariaDB, Oracle, Microsoft SQL Server. Usually, as an admin, or even as a developer, you don't have a choice; usually an organization has chosen a database, they know well how to do backups, how to restore, how to operate it. So we give you a choice of which database to connect to, and then you have Keycloak either deployed as a single binary or container, or you deploy it using an operator with a high availability setup to the Kubernetes of your choice, or to the bare metal of your choice. And well, this is what users usually see when you don't customize the login screen: it's a username and password, right? And once I log in (let's see if the demo works with me), so I'm logging in here, maybe it's expired... oh, it hasn't expired yet. So I get an admin screen here, where I can set up clients (clients are basically applications), and have client scopes, users, groups, all of this, and roles somewhere as well, right?
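The first validation option mentioned above, checking a token's signature and timestamp locally without calling the server, can be sketched like this. It is a toy HS256-style check for illustration; real Keycloak access tokens are RS256-signed JWTs that you would verify with a proper JOSE library against the realm's public key.

```python
import base64, hashlib, hmac, json
from typing import Optional

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(claims: dict, secret: bytes) -> str:
    """Build a minimal JWT-shaped token signed with HMAC-SHA256."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str, secret: bytes, now: float) -> Optional[dict]:
    """Local check: signature first, then the exp timestamp."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # signature check failed
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if claims.get("exp", 0) <= now:
        return None  # token expired
    return claims
```

The second option from the talk, sending the token back to the server (token introspection), trades an extra network call for the ability to see revocations immediately.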
I can configure all of this in a web UI, and in a very basic installation it will just be stored in the database of Keycloak, which then takes care of all that. So, yeah, that's a simple start: you have your application, it's secured, it's all working well. But then, you usually don't start on a green field, that's very rare, so you need to become a bit more flexible in what you're doing and integrate with all the existing stuff that's already in your organization. So for example, there might be one LDAP, there might be many LDAPs in your organization; I think whenever there's a merger there might be other LDAPs joining, other user directories joining, that you want to integrate with. And there's Kerberos, so people might already be authenticated on their machines, especially in corporate environments. There might be some service around in your organization, or external to it, but it only talks SAML and your applications want to talk OpenID Connect, so it's great to put Keycloak there in between. There might also be other OpenID Connect things, but then why would you put Keycloak in between? Well, Keycloak can translate it to SAML, or Keycloak can also give the right tokens to the right application, because maybe this one application is on a special diet and requires this or that attribute in its tokens, and Keycloak can provide them in a way that this application is then finally working.
You can also create your own extensions to Keycloak. For that you need to familiarize yourself with a bit of Java, and then you can integrate custom stores. You might have... well, it's usually called legacy for a good reason: maybe the old systems, the customers are known to those systems, they make money, you can't shut them off, and you want to integrate Keycloak with those existing user stores. You can do that: you can connect to a database directly, call some REST services, wherever you get this information from, and make it work. And also, we might hear later today about SCIM integration; all that is possible by adding extensions to Keycloak in this area. So we use everything that is already there and integrate and connect with that. So that's, I'd say, essential for your day two: you say, yes, Keycloak is cool, single sign-on works, but then you need to integrate with a lot of stuff, and Keycloak hopefully makes that a lot simpler for you. All right, so there are some diagrams around that: identity brokering, Kerberos, SAML, OpenID Connect, you can connect to those, and we can show that in the demo shortly. Well, the good thing about Kerberos is your user might not see Keycloak at all. Look: the user tries to access the application, the application wants to get an OpenID Connect token or some SAML token, it forwards the browser to Keycloak, Keycloak will negotiate with the browser that the user already logged in using Kerberos, and it will not even show the login screen but forward directly back to the application with the right token, so the user can continue. The user will never see the login screen; that's Kerberos. But on the other hand, if, on the system the user is currently on, Kerberos is not configured correctly for whatever reason, it will fall back to a login screen, and you can use the regular credentials and then, as we'll see in a second, maybe use those credentials and verify
these credentials against an LDAP. So it's like Kerberos, but without the Kerberos; it works the same way, with the same credentials, in the end. We can get all these social logins integrated. With those, the user usually has a login screen where they pick the right social login provider they want to use to authenticate. It might not be the right thing for corporate environments, but it might be the right thing when you are integrating your public-facing website, with users coming around that you want to integrate. Yeah, and federation, as I said: OpenLDAP is there, Active Directory, custom user stores. You can have none of those, when you want to store things only in Keycloak's database; you can have one of those, but you can actually have multiple of those as well. So I wish, or I hope, for you that you have a simple environment, but on the other side you can't really choose when, I don't know, there's another merger coming around the corner, and then you might have another directory to integrate, or maybe a customer has some users they want to bring in and you want to integrate those as well. So, looking at the demo: you have identity providers, that would be OpenID Connect, all the social logins that you want to integrate with here; they're either custom or predefined with some sensible defaults. User federation: I already configured LDAP here. So LDAP, telling you, okay, I'm running an Apache Directory Server here locally on my machine, because it was simple to set up, the usual LDAP I'd say. I can choose if it's read-only, writable or synchronized; all these things are here. And then, not all LDAPs are the same; they need some special configuration, seen here, and you can configure it so that it matches the organization. There are usually also some mappers, because there are lots of attributes in LDAP that you want to leverage, either to put them into
the tokens that you want to pass on to the applications, or to expose at the userinfo endpoint, where the application can then query them if you don't want to put them in the token. So all these things can be configured here, mapped per realm, per LDAP connection, as needed. Eventually you can also configure which application should get what kind of attribute in what kind of token. Yeah, but then it's the real world catching up on this: the simpler you can make your setup, the better off you'll be, but on the other hand you need to make it work with the things you have, and, well, we hope that we built Keycloak in a way that it's not standing in your way. So let's go on to day three, eliminating the churn: all these repetitive tasks that you have to do every day when it comes to users. They're annoying for admins and also annoying for users; ideally, users want to do these things themselves, they don't want to be bound to the opening hours of IT or something. I've shown a minute ago the users' required actions: basically, as an admin you can choose... well, as an admin you might have sent out an email, "please enable a second factor", and you sent another email saying "please finally enable a second factor for login", and then you say, well, now's the time: I go through maybe all of my users, or some of my users, and on the next login they must enable the second factor, no matter what. So you can do that as an admin and you're done, because no one will enter your system without a second factor enabled.
Also password recovery: you can add a link to the login screen, we will do that in a second, so that for password recovery you send out an email, the user can click on a link, and it works: it works with the internal database of Keycloak, but it will also work when the user is in an LDAP, and it will also work when the user is in an Active Directory. All these kinds of things work when you're using the password recovery mechanisms of Keycloak. Also, well, in a corporate environment you might not want people to self-register, right? They probably need to sign a paper contract first. But then, on the internet, on the public-facing side, you want people to self-register; again, this is something that comes with Keycloak. Also, once you're registered, you want to maintain the data yourself as a user: maybe update your mailing address, your blog, your social handles, whatever. All these things should be managed by the users themselves, and Keycloak allows you to do that, and this is something that greatly improved over the last releases. In Keycloak 23 you can enable it as a preview feature, and we are pretty sure that we will have it in the final release of Keycloak 24, enabled by default, so that you can really use it in a very good and configurable way. So yeah, it's great to resolve the need for phone calls or tickets or chats nowadays, right? So let's go back to these required actions; there are lots of them, so let's maybe have a look here. So in authentication, for each realm, I can really decide what I want people to be required to do when they log in, or to be checked when they log in: for example one-time passwords, maybe you want to have them confirm the terms and conditions, updating the password, update profile, verify the email address, where we send out an email with a link people can click on. So that's very useful for public-facing registration.
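A link-based password recovery flow like the one described generally works along these lines: issue a random token, email a link containing it, store only a hash of the token with an expiry, and verify it on click. This sketch is a generic simplification of that pattern, not Keycloak's actual action-token implementation, and the URL is hypothetical.

```python
import hashlib, secrets

RESET_TTL = 15 * 60  # seconds a reset link stays valid

_pending = {}  # token hash -> (username, expiry timestamp)

def issue_reset_link(username: str, now: float) -> str:
    """Create a single-use reset link; only the token's hash is stored."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode()).hexdigest()
    _pending[digest] = (username, now + RESET_TTL)
    return f"https://sso.example.org/reset?token={token}"  # hypothetical host

def redeem(token: str, now: float):
    """Return the username if the token is known, unused, and not expired."""
    digest = hashlib.sha256(token.encode()).hexdigest()
    entry = _pending.pop(digest, None)  # pop makes the link single-use
    if entry is None:
        return None
    username, expiry = entry
    if now > expiry:
        return None
    return username
```

Storing only the hash means a leaked pending-reset table cannot be turned back into usable links, which is the usual design choice for such flows.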
WebAuthn is in there, people should be able to choose their locale, we want to have them verify their profile, and I can enable those and maybe also have some policies about when and why. Then in the realm settings, on the tab called login, which configures the login screen, I say, okay, from now on user registration should be enabled, and for the forgot-password flow I want to have a link there so that people can reset their passwords. Once I do this, when I sign out, these fields have appeared: the forgot-password link is here, I'm asked for my username and email address, and I have a register button where I can register with some required fields. If I then log in again and we go to the user profile, there we are. This is the configuration where I can say: these are the fields that should exist for the admin to edit in the admin UI, these are the fields that should exist on the user self-registration form, and those are also the fields that are available for user self-management.
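Under the hood, this user profile configuration is a JSON document. Purely as a sketch of what one field definition can look like, here is a builder for a single attribute; the field names follow the Keycloak declarative user profile schema as I understand it, the `blog` attribute is a made-up example, and the homograph validator name is an assumption to double-check against your Keycloak version.

```python
def profile_attribute(name, min_len=3, max_len=255):
    """One field of the declarative user profile: who may view/edit it,
    plus length and homograph validations as described in the talk."""
    return {
        "name": name,                              # technical attribute name
        "displayName": f"${{{name}}}",             # localizable display-name key
        "permissions": {
            "view": ["admin", "user"],
            "edit": ["admin", "user"],
        },
        "validations": {
            "length": {"min": min_len, "max": max_len},
            # Assumed validator id: rejects look-alike (homograph) letters
            "up-username-not-idn-homograph": {},
        },
        "annotations": {"inputType": "text"},      # hint for how the form renders it
    }

# The whole profile is a list of such attributes, optionally grouped.
profile = {"attributes": [profile_attribute("username"),
                          profile_attribute("blog")]}
```

The same document then drives all three forms: admin UI, self-registration, and self-management.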
So basically you can think of this as a form configurator. For each of these fields I can go in and say: okay, there's an attribute name, a technical name I can reference it by later; a display name, this one here is an automatically localized name, but you can put a simple name in as well. I have attribute groups, so we can group things on the form. For each field I can decide who can edit it, either the user or an admin, and who can view it, either the user or the admin, and then I can put lots of validations on top of each field, about the length for example: for the username it's a minimum length of three and a maximum length of 255 characters. For the username there are also some prohibited characters, so you should use regular keyboard characters. We also don't want any homographs in here, basically letters that look like real letters from, for example, the Latin alphabet, but are actually from a very different alphabet. Otherwise you could have a user registering with a username that looks like an already existing username, which might lead to confusion, so that's really sensible security by default. And I can add more things here; I can also add annotations saying how this element should be formatted: what the input type should be, whether there is a helper below the input, the size, the columns. I can also reorder those, so it's basically a form builder, and the form builder will be consistent in all three places: user self-registration, the admin form for users, and user self-management. So when I go here, for example to the blog field, I can change it here with a different display name, and once I go, let's do that as an admin, to manage my own account, then I would see, okay, now it's renamed, and there is another field here. I can then choose when this is shown: is it shown or mandatory on first login, is it maybe mandatory once a month, like you can have maybe a scheduled process that inserts these actions
on each login or once a month, and then I can see here how all these things configure my login flow and have this information populated by my users. So yeah, we saw that as well, right: we have this recovery, and we have seen the configuration, how we can configure those fields with validations and attributes and all the necessary information. And again the three areas: on the left the admin UI, in the middle the registration screen, and on the right the personal information that users can self-manage. All this information is either stored in Keycloak's local database, or, if you choose to store it in an external service like LDAP, it will be stored in that external service. Right, so that's basically almost the end. We saw on day one that single sign-on is cool and that it makes a lot of sense to not reinvent identity and access management, even for a single application. On day two, you want to get more flexible and integrate with a lot of existing security infrastructure in your organization once you are a happy user of Keycloak. And day three allows you a lot of automation around users when you really want to scale, especially if you want to scale with lots of users signing up on the internet, when they want to manage their stuff on their own and you don't want to get calls or emails from them. So I brought some links. This is the Keycloak homepage, please pay it a visit; we have some docs on there on how to install it. I linked the Keycloak nightly release directly, so if you go there you can download the zip file and extract it, but there's also a container registry on quay.io where you can get a ready-built container with the nightly release. If you're on GitHub, please give us a star. There's the Keycloak book, second edition, that was published last year. If you've been using Keycloak maybe two or three years ago, you might know that it was based on EAP and WildFly; it has now moved to Quarkus, so some of the
things changed, so it might be good to look at this second edition. Something that is one of my very personal goals: I want to start a Keycloak hour of code, to get more people into contributing. I'm planning, maybe once a month, maybe every two weeks, an online session to get people familiarized with coding: how do we code in Keycloak, how do you maybe contribute documentation, how do things work around Keycloak. At some point we also want to bring in the community to review issues, helping with triaging them, and if a community member creates a pull request, maybe the community joins in and helps to get it to a maturity level where we can merge it in. That would take some weight off the shoulders of the maintainers, which would be great. So that's my thing for this year that I want to try out. So, that's me. I'm around for the rest of the day, so meet me here, meet me in the hallway. I also have some Keycloak stickers and some postcards, so if you want to sell Keycloak to your managers or friends or colleagues, send them a FOSDEM postcard with Keycloak on it. Thank you very much. All right, we might have like two questions or something. Yeah? What is the best way to configure Keycloak declaratively? So, usually you want to use the UI to figure out what's there and how it works, and then one way is to maybe export the full realm as JSON and then re-import it, so that's the full export, full import. There's also a Terraform, hopefully OpenTofu-compatible, Keycloak provisioning mechanism, and there's a REST interface, so you might use this API to configure it. There's a command-line interface as well, but the command-line interface is basically a wrapper around the REST interface, so that you can configure different settings of a given client, or maybe override the client with a new config, that kind of thing. But it then depends on how
you want to do things: if you have the chance to, I don't know, delete it and re-import it, that might be very helpful for test environments; but if you're more bound to an incremental, database-schema-migration style of doing things, where you really want one step at a time and always in that order, then maybe OpenTofu would take some shortcuts that might not work, and you want to have some migrations instead. So it depends on what you want to do, but the good news is that it's all automatable. Just one question: how can Keycloak be beneficial in the Linux ecosystem? So, how can Keycloak be beneficial in a Linux ecosystem, like if you're logging in, say, with SSH somewhere? I haven't seen it done that way, but it connects very well if you have, for example, Kerberos around. If you have Kerberos, I have it on my machine as well when I'm in a corporate environment, then Keycloak can leverage that. Okay, to repeat it for the video: there was a talk in 2023 at FOSDEM on passwordless authentication on Linux, here at FOSDEM, right. Okay, one note: there's a Red Hat SSO Ansible Collection. Yeah, so there's also a Red Hat SSO Ansible Collection that allows you to configure Keycloak. Right, yeah, that's the old name: Keycloak is the upstream project, it's a CNCF project, and there's also Red Hat SSO, the thing you get with a subscription from Red Hat, where you find tools that work with it as well. Since the end of last year it's no longer Red Hat SSO but the Red Hat build of Keycloak, so it's going to be easier to find in the future: whenever you search for something for Keycloak, it will cover both the upstream project and what Red Hat offers with a subscription. Okay, I think the time is up. Thank you very much.
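The full-export/full-import option from that answer can be sketched in a few lines: build a realm representation and POST it to the Admin REST API. The base URL and token are assumptions; `registrationAllowed` and `resetPasswordAllowed` are the realm settings behind the register button and the forgot-password link shown earlier in the talk.

```python
import json
import urllib.request

def minimal_realm(name):
    """A tiny realm representation enabling the login-screen features
    demonstrated in the talk."""
    return {
        "realm": name,
        "enabled": True,
        "registrationAllowed": True,   # "Register" button on the login page
        "resetPasswordAllowed": True,  # "Forgot password?" link
    }

def import_realm(base_url, admin_token, realm):
    """POST the realm JSON to the Admin REST API (full import)."""
    req = urllib.request.Request(
        f"{base_url}/admin/realms",
        data=json.dumps(realm).encode(),
        method="POST",
        headers={"Authorization": f"Bearer {admin_token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # 201 Created on success
        return resp.status
```

The Terraform/OpenTofu provider and the CLI mentioned above ultimately drive this same REST interface.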
Ipa-tuura: FreeIPA connector for Keycloak
Hi guys. So now we have Francisco Triviño, who will speak about the ipa-tuura project, which is a FreeIPA connector for Keycloak. That is awesome. Yes, thank you. Thank you very much. Yes, my name is Francisco Triviño, I'm a principal software engineer at Red Hat, specializing in identity management systems, and I'm part of the FreeIPA team. I'm very excited to introduce you to ipa-tuura, a collaboration between the FreeIPA and SSSD teams. Basically, what you are about to see is a redesign of the system integration between Keycloak and FreeIPA. And the Alexander talk, the one before this one, was perfect, because he was explaining all the concepts; he gave a very good overview of Keycloak, of all the features that are supported, and specifically of how to integrate Keycloak with other identity management systems through user federation and identity federation and all the brokering and things like that. And he was missing one integration, actually, which is nice because he didn't spoil my presentation: he was missing the integration with FreeIPA through user federation. So, yeah, that's what this is all about. Well, this is very basic stuff, but I would like to scope the project well, right? I just want to spend a few minutes talking about some of the background and some of the key aspects of this project that we have been keeping in mind as we have been undergoing this work. As far as the background goes, IAM is an umbrella term. It defines processes to access the right assets at the right time for the right reasons, keeping authorized access under control. Some of the common products are Microsoft Active Directory, for environments where Windows is dominating, and we also have Identity Management, which is FreeIPA.
If you are familiar with it, you can understand FreeIPA as the open source version of Active Directory, but for a POSIX environment, that is, Linux, right? Basically, it relies on the same building blocks as Microsoft Active Directory: LDAP, Kerberos, PKI, a CA, DNS. And, yeah, another one is Keycloak, okay? It doesn't need an introduction; it's more scoped to modern applications and services for users in general. And there are more solutions, such as Okta and Entra ID, which are more oriented to cloud-based environments. So when comparing these solutions, we soon discovered that one of the main differences is the number of assumptions regarding how users and groups are controlled. For instance, FreeIPA is tied to POSIX users and groups; they are necessary for the applications running in a POSIX environment. On the other hand, Keycloak offers authentication services to modern applications, where these applications are usually deployed in a cloud environment and the identities completely differ from the system-level ones. Meanwhile, Active Directory relies on other identifiers, like security identifiers or organizational units. I'm not going to talk about POSIX environments, because the very last talk of today is all about that, so I recommend you watch that one. And, yeah, the key point is that sometimes you are happy with a standalone solution with all your configs in place, but often that is not the case. I mean, many times you will need to integrate multiple identity management solutions so that the same user can access different operating systems as well as different cloud applications with the same set of credentials, right?
So, luckily, IAM, this umbrella term I was talking about on the first slide, defines some processes like single sign-on and also identity and user federation, where a user is basically authenticated once, and then the fact of authentication is consumed by other services for a certain amount of time, regardless of how and where the applications are operating, right? So that's basically it. And when talking about federation, which Alexander was talking about, Keycloak is very well known for providing these functionalities, okay? The way it works is: when a user logs in, Keycloak will look into its own internal database, and if the user is not there, it will fetch or iterate over every user storage provider that is connected until it finds a match. This is basically how it works, right? And guess what, Keycloak already supports the integration with FreeIPA, as a backend to look up authenticated identities and so on. This is already supported; you can do that. By default, well, I'm not going to spend a lot of time here because this was explained by Alexander: the diagram on the left, the one from the previous presentation, includes the LDAP and AD providers, so you can federate with multiple LDAP servers in one Keycloak. At the same time, the one on the right, which is the one I'm going to focus on, shows that Keycloak also includes an SSSD plugin, and it provides access to multiple identity and authentication providers from FreeIPA, with very nice features like failover and offline support as well. So then, what is the problem, if we support everything and we can integrate with anything? All right, let's have a look. What are the problems we are trying to solve? One of the main issues, and the most important, is that we are missing feature parity between those integrations.
I mean, they are really different. You can integrate from Keycloak through user federation with LDAP, with Red Hat Directory Server, with AD, in a rich manner, because we support a lot of features there. At the same time, you can integrate with IdM, with FreeIPA, but there is a huge limitation: it's a read-only interface. Keycloak can fetch information from SSSD, and that's all; it can't write there. That means that if you make changes in your user database in Keycloak, you will need to drop by FreeIPA and make the changes there as well. So this is a very limiting factor, right? Another one is that if you want to integrate with SSSD, you need to deploy SSSD on the same host or container where you are running Keycloak. This is also a limiting factor, especially when talking about cloud environments and OpenShift, where you usually deploy the pods on different hosts and different machines. So that's another limiting factor. So then, yeah, we were thinking about a redesign; we needed to redesign this, okay? And in this slide, well, this is where the ipa-tuura service comes into play, and these are the requirements we kept in mind when redesigning it. We are thinking about supporting all of these things at the same time. We need a common API for managing identities; the requirement is to be able to read and write, this is the most important one, and also to authenticate users from any integration domain. At the same time, now that we are redesigning everything, we are going to try to simplify the integration, and one idea is to replace all the existing plugins with just one plugin for all of them, so you can configure it easily and then connect with anything. Another one is a cloud-friendly, maintainable solution; yeah, we need to get rid of this limitation about deployment.
Requiring Keycloak and SSSD in the same container is kind of difficult, and we want to do this without performance impact; that concern is always there. And ideally we shouldn't reinvent the wheel, so we should rely on existing open source projects, okay? So then, now this is a question for you: how many of you know about SCIM? You can raise your hand. That's about half of the room, that's nice. It stands for System for Cross-domain Identity Management. This is a protocol, and this protocol defines, or helps with, the exchange of user identity data between different identity management systems. It simplifies provisioning, the updating of attributes, also the deprovisioning of users and accounts, and it helps with interoperability. Okay, so it sounds like this is what we need, right? So the idea there is to implement a SCIM server for FreeIPA as a backend, to process all the requests coming from Keycloak. And, yeah, the idea is not to start something from scratch: based on this protocol, I think there are some 10 to 15 projects already implementing it, and we were paying attention to one in particular, which is django-scim2, and it is written in Python. The reason is that FreeIPA is also using Python, especially the API, so it's very similar; I mean, the interconnection between the SCIM server and FreeIPA will be somewhat straightforward. So, okay, let's start building it, let's start building the new service. We mentioned that it must be a cloud-friendly solution, so we are targeting a container. This is the container: on the left we have Keycloak, on the right we have FreeIPA. The first thing to add into the container is the Django framework, because that is where we have the implementation of SCIM, based on an open source project. This project in the container is already exposing some endpoints. Okay. What is the next requirement?
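Before going on, to make the protocol concrete: a SCIM 2.0 user resource (RFC 7643) is plain JSON with a well-known schema URN. A minimal sketch of what a client would POST to a SCIM server's Users endpoint; the name and email values are made up.

```python
# Core schema URN defined by RFC 7643 for a SCIM User resource
SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"

def scim_user(user_name, given, family, email):
    """Build a minimal SCIM 2.0 User resource as a plain dict."""
    return {
        "schemas": [SCIM_USER_SCHEMA],
        "userName": user_name,
        "name": {"givenName": given, "familyName": family},
        "emails": [{"value": email, "primary": True}],
        "active": True,
    }
```

The attribute mapping mentioned later in the talk works on exactly this structure: the bridge translates these fields into the corresponding FreeIPA (or AD/LDAP) attributes.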
It must be secure enough. You know, the Django framework implements an HTTP server, but that server is meant for developers; it's not protected at all. So, okay, we can include Apache, a well-known server, and we can enable HTTPS for production-grade environments. All right, we add Apache, and we connect Apache through the WSGI connector to Python, to Django. Okay, now we have a secure API. What is the next thing? Yeah, it must provide a generic API. So the idea is, this is a bridge, right? We have it connected to Keycloak already, through the user federation storage, and other identity providers and brokers can connect to the container as well. The SCIM protocol will help us translate, so we can make another call to FreeIPA through its API, and this is basically how we connect everything. And it's generic because it's based on SCIM. All right. And then, I was mentioning that we need to read and write, so we implement two interfaces to connect to FreeIPA. And about performance: well, deploying a container with a standalone service talking to FreeIPA, making API calls, is kind of expensive. But no problem, because we can rely on SSSD, because SSSD implements a cache. So, okay, let's include SSSD in the container and connect it to Django through the D-Bus InfoPipe, and this is how we can access the user and identity materials, right? All right. So it looks like we are almost done, but we mentioned that it must be generic enough. So these interfaces, the read and the write one, we can easily configure them to talk not only to FreeIPA, but also to any Active Directory through LDAP, and also to any Red Hat Directory Server, right? Okay, so this is basically the idea: to unify. And what about Keycloak?
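As an aside on the Apache-to-Django wiring: mod_wsgi hands each request to a Python WSGI callable. Purely to illustrate that layer, here is a toy WSGI app answering one SCIM-style request by hand; the real bridge of course dispatches to django-scim2 views instead.

```python
import json

def scim_app(environ, start_response):
    """Toy WSGI callable: answer a single SCIM-style GET.
    A real bridge dispatches /scim/v2/Users etc. to Django views."""
    path = environ.get("PATH_INFO", "")
    if path == "/scim/v2/ServiceProviderConfig":
        body = json.dumps({
            "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServiceProviderConfig"],
            "patch": {"supported": True},
        }).encode()
        start_response("200 OK", [("Content-Type", "application/scim+json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

Apache (via mod_wsgi) calls `scim_app` for every request, which is how the HTTPS termination in Apache and the SCIM logic in Python stay cleanly separated.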
Does it support the SCIM calls? Because, well, we have implemented a SCIM server exposing a generic API, but Keycloak doesn't support SCIM calls as a client, right? So, okay, as Alexander mentioned, I said there are two ways to integrate user federation with an identity management system, LDAP and the other one is SSSD, but there is a third one: you can implement your own user storage provider for Keycloak. So this is what we did, basically. We implemented, and this is another project you can find on GitHub, basically a custom user federation provider that is capable of acting as a SCIM client, all right? And this is what we need in Keycloak to connect with the server. All right, and this is how it looks: you go to Keycloak, and you will see all these options. You will see parameters for connecting to the bridge, basically the server URL and the username and password, though we may add other authentication mechanisms; basically, you specify the details about the integration domain. You can choose between the types: IPA, which is FreeIPA, AD, but also LDAP. So, just to sum up: if we combine both projects, then we have it. And let's say that this one, the server running in a container, is basically exposing a number of endpoints. For instance, there is one called domains, which is kind of an administrative endpoint: basically, when Keycloak tries to enroll with SCIM, it sends some details, and then the SCIM service implements the automation to make an enrollment with any other identity management system, right? And once this is done, the Keycloak plugin can simply make user calls to the user federation storage to fetch users, or write and read, whatever.
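What the custom federation provider does as a SCIM client boils down to plain HTTP. A sketch of the two typical calls; the bridge URL is hypothetical, and the standard `/scim/v2` path prefix and `filter` query parameter come from the SCIM protocol (RFC 7644).

```python
import json
import urllib.parse

def user_lookup_url(base, user_name):
    """GET URL to find one user by name, using a standard SCIM filter."""
    flt = f'userName eq "{user_name}"'
    return f"{base}/scim/v2/Users?filter={urllib.parse.quote(flt)}"

def user_create_request(base, user_payload):
    """(method, URL, body) triple for provisioning a user on the bridge."""
    return ("POST", f"{base}/scim/v2/Users", json.dumps(user_payload))
```

During a login, the plugin issues the lookup; when an admin creates a user in Keycloak, the create request is what propagates the user to the bridge and onward to FreeIPA.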
So, yeah, it's important to note that the Keycloak plugin now doesn't communicate directly with the databases or with the other identity management systems; it only talks to the SCIM service, which is in a container. So, let's go for a quick demo. This is what is going to happen in the demo: basically, you will see Keycloak, you will see FreeIPA, and a container running on another host, and we will make an HTTP POST request to the domains endpoint on the bridge, and the bridge will be capable of doing all this automation, because these are steps that were done by an administrator in the past. I mean, if an administrator today wants to enroll user federation to FreeIPA, they have to do a lot of steps, like adding a service, the proper role with the proper privileges, generating keytabs, and so on, right? So this is fully automated now. And once that process is done, every time Keycloak looks for a user or something, it will make a generic call to the SCIM server. What this looks like is a POST to the SCIM server, in JSON format over the REST API, and this service will translate that into a FreeIPA API call and make it to the domain. Okay, so, yeah, I love live demos, but I have to admit that today I got a bit of cold feet: I have a recording, and I don't think I want to do the real demo, because I have all the infrastructure deployed at Red Hat, everything here is recorded, and I don't want to expose DNS names or internal IP addresses, so I have a video anyway. And so, yeah, if this works... So, no. Okay, let me see what is going on here. So, very quickly, how many minutes do we have? Okay, then I have a three-minute video. So, there we go. The three consoles you see at the bottom: the one on the left is Keycloak, the one in the middle is the bridge, and this one is FreeIPA.
So, this screen is Keycloak, we are authenticating there, and the same in FreeIPA, all right? Then we go to the user federation, so that now we see, it was quick, that was super quick, quicker than the speed of light. I wanted to show you that there is a new user federation storage provider there. You see? Wait, let's see if I can... yeah. These are the ones that Alexander was talking about in the previous presentation, and this is the new one, okay? So you will see that. All right, let's continue. So, yeah, Keycloak, FreeIPA. These are the services, because, you know, the bridge will configure things here automatically, okay? Now I'm going to show you where the container is running. This is a different host, by the way; it's not tied to Keycloak anymore. If you do podman ps, this is the container for the demo, right? It does the proper network mapping. You see, Apache is not running on the host; if you log into the container, Apache is running inside the container. There we go. Yes, I'm tailing the error log just to show that there is movement here, but I'm cutting the content, because otherwise we would see a lot of IP addresses and internal stuff. So, I'm too lazy, I don't want to go to Keycloak and type all the parameters, so I do a curl call, okay? Or, well, this is with kcadm. You can see all the parameters we are configuring for integrating with FreeIPA. And once we execute that command, the container does its thing, you see the activity, and now we have the user federation enrolled, okay? And you will see a new service here, which is this one, the bridge, okay? It was done automatically, so you don't need to worry about that anymore. So, now everything is set up, everything is configured, and now you can manipulate users.
When we create one user, for instance, I'm creating a user in Keycloak: right after we click on create, the user is added to the Keycloak database, but Keycloak is also making a call to the SCIM service, and the SCIM service is creating the user in IPA as well. So it appears here, okay? Yeah, that's the user, and, you know, we can do all the administrative stuff, like changing for instance the email; everything is fully replicated to FreeIPA, to the identity management system, based on all these curl calls that are happening to the bridge. So the user is there; you can see it from the CLI as well. Yeah, the modification was correctly propagated. So, now I'm deleting the user, and, basically, it supports the whole group of operations, I mean create, modify, list, and delete; the user is not there anymore. And also, when you delete the SCIM federation, it unenrolls, okay? It goes to FreeIPA and removes the service, because it's no longer needed. So this is also fully automated. Okay, so what you just saw in this video is the user provisioning, okay? We are not done yet; let me see, because I have a bonus. If I can close the video now... So, the bonus is... oh, now it's working, now it's working, okay, when I don't need it. Okay, this is a bonus, and this is work in progress, all right? This is the other piece, the identity federation. You just saw the user federation in the video, but this is the identity federation, and it is all about exposing another endpoint on the bridge, so that Keycloak can also make calls to the bridge, but now not for user provisioning or modifications, but for authenticating, and this one is for Kerberos, okay?
So, this is kind of controlling Kerberos, and then the bridge, ipa-tuura, is capable of translating that into an operation against the FreeIPA API, using the proper keytab, and then FreeIPA will answer with the session cookie, or the bridge will fetch it from SSSD, well, from the cache, and then it will respond back with the session cookie, so that the cloud application running here, trying to log in to Keycloak, is actually authenticating against IPA, not against Keycloak, okay? So, yes, and then this final slide is about potential usage. This can be used for synchronization of identities across different providers, as you can see. Also, we can use it to migrate all the users, because the beauty of SCIM is that you can do mapping of the attributes, so you can translate anything you have in any cloud application into something that is more powerful, like FreeIPA, with UIDs and GIDs that are generated automatically; I mean, it's amazing. And the other good point about potential usage is that, if we merge this into Keycloak, there will be a user federation provider capable of connecting to any SCIM server; it doesn't need to be this one. So now Keycloak can talk to SCIM as a client, right? And this service, the SCIM server that we implemented for IPA as a container, can also be used with other clients; it doesn't need to be Keycloak necessarily. We can connect Azure AD or any other client, for instance, anyone that supports the protocol. So, yeah. So, yeah, that was it. I think we have time for questions, right? More or less? One, two minutes? Okay. Yes, please. You spoke about integration with AD; on the client side you would have winbind; would you be able to replace SSSD with winbind and still use this solution? So the question is if we can replace winbind... yeah, replace SSSD with winbind in this solution.
So, not yet, the answer is not yet, but I think we can look into it, and potentially, potentially, yeah, it could be done, if we decide to prioritize that use case over the others, why not? But not yet. What's the "not yet" part of it? Say again, please. Will it happen? Will it happen? That's a good question, because we haven't done any release yet. These are upstream projects, we have two upstream projects. So, yeah, our intention is to make this happen. This will greatly simplify the Keycloak integration with identity management systems, and it's also very convenient now to have a deployment that is independent from the host, so that you can use a container; this is kind of going towards the cloud, cloud-based applications. So, about our plans: the Keycloak plugin is more or less completed, and now we are thinking about submitting it to Keycloak so that it gets merged upstream first, and then a SCIM client will appear there. And for the service, as soon as we finalize the Kerberos authentication redirection, I guess we will be in good shape to make a first release upstream, okay? And later on, once we have prioritized various aspects, then, yeah, potentially it will replace, in particular, the SSSD connector we have in Keycloak, that's for sure, okay. Okay. Thank you.
Passkey authentication - the result
So, guys, I will start the presentation now. My name is Iker Pedrosa, and I'm a senior software engineer at Red Hat. Well, today Erwin was supposed to come here and present something about garage door opening with passkeys, but apparently there's some kind of curse, because, well, he couldn't come, and I will present a topic that I was supposed to present last year, also about passkeys. So I will show you today the final results, because last year Alexander, who is over there, kindly volunteered to present my talk, and now I will do a kind of lightning talk, very fast, about the problem and the solution that we came up with. So, introduction. As you may all be aware, in January 2022 the US government released a memorandum that requires their agencies, and the companies working for them, to use a zero trust architecture. If we focus just on the topics of user authentication and authorization, we'll see that the memorandum speaks about centrally managed users, and more specifically about using multi-factor authentication and passwords. On top of that, it says that they should use single sign-on as much as possible, and it mentions two specific protocols to achieve this: one of them is PIV, or smart cards, and the other one is FIDO2. So let's speak about FIDO2 a little bit: why users should be aware of this authentication method and why it's important for them. First of all, because it's passwordless, so you don't need to remember lots of passwords. You also don't need to worry when there's some kind of leak on a webpage or in some service that you are using, because the private key resides in this token here, and it never leaves it, so you will not have any problem with data breaches or anything like that. On top of that, it enables strong authentication by providing multi-factor authentication.
The keys that I'm using usually ask for the PIN, but there are others that ask for a fingerprint or some other kind of biometric reading. The design is quite simple: we have a user with a FIDO2 key who goes to some computer and connects it there, and using SSSD the computer will contact the IdM server, authenticate there, and get a Kerberos ticket to do single sign-on. In this case we are talking about an IdM server because the best integration is achieved with it — we get the Kerberos ticket. If you are using some other type of LDAP server, you will be able to authenticate, but you won't get the Kerberos ticket. If you want to know more details, the first link is the talk I mentioned before, given by Alexander here at FOSDEM last year. The second one happened last year at DevConf in June — if I remember correctly it was me giving it — and there you will find some progress in that area. So now it's time for the demo — or the demo gods, who knows, because I was never able to do this demo live. Yeah, you know, it's like that. First of all, I'm authenticated on an SSSD client, and we also have an IPA server. I will add a new user, which will be called iker. And here the important point is that you need to set the authentication type to passkey. Sorry. The first part, I guess you are aware of if you are IPA users, but the second one is kind of the new thing. So I will create a user like that. Okay, it already exists. Let's try another one — that's my sister's name. So we have created the user, and now we need to register the passkey for this user. Okay, yes, with that, I press enter. Again — oh yeah, I forgot the name. Now I need to enter the PIN. And now, well, I need to touch the device — the device is already blinking, so it's kind of obvious. And you see down below the passkey mapping data. I will show it to you.
Well, I will clear the screen and show this user. So we have user iker, and here we have the passkey mapping data. Now I will change users, because, you know, if you are root you can authenticate as any user, and I will try to authenticate as user iker. Okay. First of all, you need to insert the passkey, and as it is present, you are prompted for the PIN, which you need to input. And finally — you don't see it on the screen, but the LED is blinking here on the FIDO2 device — I need to touch it. Okay, perfect. So we are here, logged in as iker, and as we are using a FreeIPA server, if I show it — okay, we have here a Kerberos ticket. So at this point we would be able to authenticate to any other service or application that is enrolled to this server. One thing to notice here is that the key needs to be physically connected to the device where you are trying to authenticate; you cannot do it remotely with SSH or something like that. This is important because I've heard some people asking me this question, and currently it's not possible, at least, to do the remote authentication. Okay, so some conclusions. Availability of this feature: first, SSSD 2.9.4 — you can try with 2.9.0, but it has some bugs, so I would recommend you go to this one. We also have FreeIPA 4.11.0. And if we are speaking about specific distributions that ship this software, you can use Fedora 39 or CentOS Stream 9. Some reference links: we wrote three design pages, two for SSSD and one for FreeIPA. The first one for SSSD is about doing the local authentication, and the second one is about the Kerberos integration. And if you would like to test this feature on your own, I brought a Fedora Magazine article that was kindly translated by a Chinese reader. There you have the demo and how to work with it. If you don't want to mess up your production environment for some reason, you can use the SSSD CI containers.
This project provides a set of containers that you can use to test SSSD, IPA, LDAP and things like that. You will find the instructions in the GitHub page. The only thing that changes is that you need to connect the FIDO2 key first, and then you need to run make up-passkey instead of make up, so that the FIDO2 device is redirected to the containers. Okay, so that was all. I think we have some time for questions, right, Tvenho? Yes, we have four minutes. Thank you. Thank you. So, the system didn't ask you to touch your device — is that some limitation, or was that just the implementation? No, it's a feature, really, because you can have the FIDO2 device connected, and some application or some malicious actor could try to sneak in — your device is already connected, so they could use it to perform the authentication. If you have to press it, you demonstrate that it's actually you who is trying to authenticate on this device. Thank you. Can you speak louder? You indicated it will not work remotely at the moment; however, would this possibly work with USB redirection, for instance? Yeah. So the question is, would this work with USB redirection? The answer is yes, we would be able to do that. Question here? If we lose this key, what happens? Okay, somebody asked a good question: what happens if you lose your key? You are doomed. So my recommendation is to have at least another authentication method, or you could have two keys. That's what I have: I have one here and another one at home. If I lose one, okay, I won't be able to do this demonstration here or to authenticate somewhere, but when I arrive home I have it there and I can use it to authenticate. Can't we store the private key somewhere? No. This uses public key algorithms, and the private key resides in the key, and as far as I know you cannot get it out. Yeah. So.
Do you have any plans to support built-in platform authenticators like Windows Hello? So the question is if we have any plans to integrate Windows Hello, you said, right? No — the hardware now has the FIDO key in it. Yes. You don't need a USB device to extend the hardware; every piece of hardware now has FIDO built in. Can we just use the platform authenticator? Not yet. I will answer you. Okay. Yeah. Not yet. What this project supports is libfido2. So any effort to extend it to support a platform authenticator should go against the libfido2 project, and then we would inherit that. So the question was whether we are supporting platform authenticators, and the answer is no, we don't have those plans yet. So, Tveniu, how much time do we have? We have room for one more person. Okay. Which algorithm do you use to store the PIN code, and do you have any PIN code policy? Okay, so the question is which cryptographic algorithm we use to store the PIN, and the answer is: I don't know. That is embedded in the FIDO2 key. In reality, we ask for the PIN and we relay this information to the FIDO2 key, and it's the FIDO2 key that does all the decryption and signs an attestation that it's you who is doing the request, and sends this to the server. It's PBKDF2. Sorry, can you repeat? It's normally PBKDF2, a key derivation algorithm, which is used for this. Thank you. All right. Okay.
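The demo walked through above can be reproduced roughly with the commands below. This is a hedged sketch, not a verbatim capture of the speaker's session: the username, name fields and exact option spellings are assumptions based on FreeIPA 4.11-era passkey support, and the commands need a live IPA server plus a physically connected FIDO2 token.

```shell
# Sketch of the passkey demo (assumes a FreeIPA >= 4.11 server and an
# enrolled SSSD client; option names may differ between versions).
kinit admin

# Create a user whose authentication type is "passkey":
ipa user-add iker --first=Iker --last=Pedrosa --user-auth-type=passkey

# Register the connected FIDO2 token for that user
# (prompts for the PIN, then asks you to touch the blinking device):
ipa user-add-passkey iker --register

# Inspect the stored passkey mapping data:
ipa user-show iker --all | grep -i passkey

# Authenticate locally as the user (insert the key, enter the PIN,
# touch the device), then check the Kerberos ticket obtained via SSSD:
su - iker -c klist
```

To try this without touching a production environment, the SSSD CI containers project on GitHub provides a ready-made IPA/SSSD lab; as noted in the talk, you plug in the FIDO2 key first and run the passkey variant of the make up target so the device is redirected into the containers.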
Post-Quantum Cryptography transition: where we are now
Okay, great. Let me introduce myself. My name is Dmitry Belyavsky. I have worked at Red Hat for several years and maintain OpenSSL there. I am also involved in the development of OpenSSL as a member of the OpenSSL Technical Committee, and my current work is dedicated to the post-quantum transition in Red Hat. So first, a brief reminder: why do we need a post-quantum transition? There is a wide consensus that quantum computers — if they ever happen; nobody knows, and nobody knows when — will break traditional cryptography, in the sense that digital signatures become forgeable, key exchange becomes reversible, and so on and so forth. So if a malicious actor records your communication now, and it is still secret and confidential at the moment the quantum computers happen, they can get your secrets. I am not sure it will happen soon, but this is considered a threat, and it means that the technical community has to implement quantum-resistant algorithms that will be unbreakable even with quantum computers. Some words about the challenges we have. First, once quantum computers happen, we can't trust the existing algorithms, as I mentioned before. Second, when we implement new algorithms, they have not been tested for long enough, so we can't fully trust them either. For example, in the NIST contest, one of the algorithms, which had even moved to the fourth round of the contest, was completely broken without any quantum computers. It's a pity — it was a wonderful algorithm. So currently a lot of effort goes into so-called hybrid schemes, where we use both classical algorithms and post-quantum algorithms simultaneously and combine them in one way or another. It can be two different signatures; it can be some combination in the key derivation. But the point is that if one of the algorithms is broken, the second still provides some relevant security. The second area where we can expect problems in the post-quantum transition is related to key size.
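The hybrid idea described above can be illustrated with a toy sketch: derive the final secret from both a classical and a post-quantum shared secret, so that recovering only one of them is not enough. This is a simplification added for illustration — real combiners, such as those in the TLS hybrid key exchange drafts, are specified far more carefully; the random values here merely stand in for the two shared secrets.

```shell
# Toy hybrid combiner: hash the concatenation of a classical and a
# post-quantum shared secret. The random hex strings are stand-ins
# for the outputs of, say, X25519 and a PQ KEM such as Kyber.
classical=$(openssl rand -hex 32)
postquantum=$(openssl rand -hex 32)

# An attacker must recover BOTH inputs to reconstruct this value:
combined=$(printf '%s%s' "$classical" "$postquantum" | openssl dgst -sha256 | awk '{print $NF}')
echo "combined secret: $combined"
```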
Well, let's compare the key sizes. For classical algorithms — RSA, at a practical 3k bits — we have about 400 bytes of key and 400 bytes of signature, right? For Dilithium, one of the algorithms chosen for standardization — and the figures I give are not for the strongest version, but for an intermediate one — we will have more than one kilobyte of key and two and a half kilobytes of signature. The key and the signature are parts of the certificate, and a certificate doesn't travel alone: you have a chain. So you can imagine that where you currently have, say, four kilobytes of certificate chain, switching to Dilithium you get, well, 18, 20, something like that. We should also expect performance problems, because the new algorithms will, with high probability, be much slower than the existing ones. We will have compatibility problems, because other implementations of the algorithms will contain this or that mistake, and will probably implement various versions of intermediate standards instead of the final ones, at least at the early stages. And sometimes, though I am not sure, we will hit problems with middleboxes analyzing the traffic passing through them: is this something known that should go forward, or something bogus that should be stopped? Let me remind you that when TLS 1.3 was in the process of standardization, people measured and found that something between five and ten percent of TLS 1.3 traffic didn't pass through middleboxes, and the TLS 1.3 protocol was significantly redesigned to better mimic TLS 1.2, which was already familiar to the middleboxes. And of course, when we are speaking about the network, we also get the traditional problems: big keys don't fit into TCP or UDP packets. We have to do something about, for example, DNSSEC, which is currently stateless and expects the response from the server to come in one packet.
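The size comparison above can be put into back-of-the-envelope numbers. The figures below are approximations: ~400 bytes for an RSA-3072 key and signature as quoted in the talk, and the published sizes for the intermediate Dilithium parameter set (1312-byte public key, 2420-byte signature); the per-certificate overhead for names and encoding is my own rough guess.

```shell
# Rough certificate-chain size arithmetic (bytes).
certs=3        # leaf + intermediate + root
overhead=700   # assumed per-certificate overhead (names, extensions, DER)

rsa_chain=$(( certs * (400 + 400 + overhead) ))     # RSA-3072 key + sig
pq_chain=$(( certs * (1312 + 2420 + overhead) ))    # Dilithium key + sig

echo "RSA chain:       ${rsa_chain} bytes"
echo "Dilithium chain: ${pq_chain} bytes"
```

The result — roughly 4.5 KB versus 13 KB for a three-certificate chain — is in the same ballpark as the 4 KB versus 18-20 KB quoted in the talk once longer chains and extra fields are taken into account.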
And of course, if you send a little request, get a huge response, and use the UDP protocol, then all the protocols that rely on post-quantum algorithms offer a good chance to implement a so-called amplification attack: you send a legitimate request to a server but spoof the IP address, and with UDP the response, which is much bigger than the initial request, goes to the victim's computer — and so a distributed denial of service is implemented. Okay, now that I have briefly covered the threats, let's move to something more positive. First, we have several standards bodies involved in the process of post-quantum standardization. NIST, which organized the post-quantum contest, has chosen four algorithms for standardization; here are links to three of the four draft standards. Kyber is the algorithm for key encapsulation; Dilithium is an algorithm for digital signature; and SPHINCS+ and Falcon are also algorithms for digital signature. The final versions of the standards were expected in Q1 of this year, but have not happened yet. Then, once we have algorithms, we have to specify their usage in protocols. Okay, sorry, how do I switch this off? Yeah, sorry. So, the IETF is the standards body that works on protocols. The work happens in almost every working group dedicated to cryptography — that is, in the so-called security area of the IETF — and a dedicated group named PQUIP was created, which covers the protocols that currently don't have dedicated working groups, such as SSH, for example. I will briefly speak about that at the end of my presentation. And for hardware implementations of the keys — for example tokens and HSMs — the standards are developed by the OASIS group. As far as I remember, as of several weeks ago there was no final version of that standard; there were some drafts, but they are not public. So, despite the lack of final standards, you are already able to use Fedora for experiments.
We have chosen the liboqs project. It provides an implementation of a wide set of post-quantum algorithms; for Fedora we build only those chosen by NIST for standardization. If you want to play with something else, you will probably have to rebuild it yourself. liboqs is part of the Open Quantum Safe project, which also provides a fork of OpenSSH using a post-quantum mechanism for key establishment, and — what's also important — an OpenSSL provider. Let me briefly remind you what OpenSSL providers are: basically a plugin-style mechanism that allows you to add or modify OpenSSL functionality, including providing new cryptographic algorithms or hardware-backed implementations. In Fedora 39, released at the end of 2023, we have OpenSSL 3.1, liboqs 0.8, and oqs-provider 0.5.1. We plan to update all these components in Fedora: in Rawhide, liboqs and oqs-provider are already updated, and we are currently finalizing the rebase of OpenSSL to the next version. I'm sorry, I am too lazy and not brave enough to give you a live demo, but it's quite simple — if you have a Fedora machine, you can do it yourself. You should install the oqs provider; that's the first line. Then you should generate the key pair. I have chosen elliptic curves, but it's a matter of taste. And then you just run the OpenSSL server, but now you have to specify exactly which groups you plan to use for key exchange. It can be done with the -groups command-line option. And here, if you look at the group names in red, the names consist of two parts: X25519 is a classical cryptography algorithm, and the second part, Kyber, is the post-quantum stuff. The second group allowed for key establishment has the same structure but uses a different parameter for the classical part. And once you have the server running — yes, it's a demo server — you can also connect to it. When you run the client connection, I strongly recommend using the -trace option.
It shows the handshake process in a more or less human-readable form, and, trust me, you will see that you are using the hybrid algorithms for key establishment. Well, s_client and s_server are sort of fun, but I don't recommend them for any sort of production use. You can, however, already use such a popular web server as nginx. But again — for now I'm speaking about Fedora 39 — you will have to load the oqs provider in the global OpenSSL configuration, or in a local copy provided explicitly to nginx. For demo purposes I recommend the global one; it's just simpler. You load the provider and activate it — this is done by adding the section dedicated to it — and then you configure nginx in the regular way and add a directive, ssl_ecdh_curve, which is more or less equivalent to the -groups parameter I mentioned on the previous slide. Then, after restarting nginx, you have a web server that uses hybrid key exchange for those groups, and you can use curl, which is OpenSSL-based, at least in Fedora. Again, you have to specify the curves, but you will get something over a post-quantum protected channel. Of course, it's worth mentioning that the big companies also have their post-quantum stuff. Google Chrome allows enabling post-quantum algorithms — it requires switching on special flags — and you can check that your server, set up as on the previous slide, can communicate with a standard Google browser. You can also use curl to reach, for example, the Cloudflare demo site; they use the same algorithms and a compatible implementation. Okay, future plans. First, we want to pack all our results into a container, because a do-it-yourself demo is fine, but for practical purposes a container is much more convenient. Then, as I mentioned before, we are going to provide the recent versions — that's work in progress in Fedora Rawhide — so that you can use the post-quantum algorithms for digital signatures as well.
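The demo just described might look like the following on Fedora 39. This is a sketch under assumptions, not the speaker's exact slides: the package and group names (oqsprovider, x25519_kyber768, p384_kyber768) follow oqs-provider conventions of that era and may differ in your version, and the nginx part shows only the relevant directive as a comment.

```shell
# Install the provider and generate a classical server key (Fedora 39):
sudo dnf install -y oqsprovider            # package name may vary
openssl ecparam -name prime256v1 -genkey -out server.key
openssl req -new -x509 -key server.key -out server.crt \
        -subj "/CN=localhost" -days 30

# Demo server with hybrid key-exchange groups:
openssl s_server -cert server.crt -key server.key \
        -groups x25519_kyber768:p384_kyber768 \
        -provider oqsprovider -provider default

# In another terminal, connect and trace the handshake:
openssl s_client -connect localhost:4433 \
        -groups x25519_kyber768 \
        -provider oqsprovider -provider default -trace

# For nginx, after activating the provider in the global openssl.cnf,
# the equivalent of -groups is a directive such as:
#     ssl_ecdh_curve x25519_kyber768;
# and you can test with an OpenSSL-based curl:
curl --curves x25519_kyber768 https://localhost/
```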
That currently doesn't work in Fedora with OpenSSL 3.1. And of course we are involved in the upstream work — OpenSSL, NSS, GnuTLS; we have identified some deficiencies and are working on fixing them. And, as I promised, there is an opportunity to be involved in that community work, so let me speak about SSH. For several years OpenSSH has implemented a post-quantum algorithm for key exchange. Unfortunately, it is not one of the algorithms chosen by NIST. There are no standards for it, neither from NIST nor the IETF — there is work in progress at the IETF level to write a formal specification for this algorithm — and there is no specification in RFC form for using the NIST-chosen algorithms in the OpenSSH key exchange. The OQS project has a version of OpenSSH which is currently frozen for lack of contributors. So if anybody wants to speed up the transition of SSH to the quantum-safe world, I think it's worth organizing some activity, both in development, in cooperation with the OQS project, and in writing a specification draft for the IETF. Thank you very much. Feel free to ask questions. Sure. Have you analyzed the performance difference between the classic implementation and the one with post-quantum? What is the performance impact? I didn't analyze it myself, but I expect performance degradation, simply because we have been implementing classical algorithms for decades, and the first post-quantum algorithms will be imperfect by definition. Sorry — the question was about the performance difference between classical algorithms and post-quantum algorithms. Sure? So everybody nowadays is using X.509 for services, and you mentioned that it's difficult to trust the new algorithms, and also impossible to trust the old algorithms. So did you do any experiments on dual implementations in X.509, and the impact of that? Because the certificate will be huge.
Yes, the certificate will be. So the question — do I understand correctly? — is how the post-quantum algorithms affect X.509 when used in dual combination with the old, classical algorithms. There are several concurrent documents on combining classical and post-quantum algorithms, and yes, the certificate will inevitably be huge, no matter which combination is chosen. There are some efforts to reduce the impact — for example, adding the intermediate certificates to the trust store instead of sending them on the wire — but that definitely has its downsides, because it increases the size of the root store. But yes, as I mentioned, network protocols will be seriously affected by huge certificates. Just to add on to this question: does that mean we need more computing power for our applications? No, it means we need to reinvent TCP and UDP. Sure. To provide a friendly user experience when communicating these keys from one device to another, we sometimes use QR codes, NFC and Bluetooth. Will that still be possible if we go to these sizes of certificates and keys? So: will the user-friendly ways of transferring certificates, such as QR codes, Bluetooth and so on and so forth, still be suitable for post-quantum keys, right? Okay — yes for QR codes, because a QR code is usually just a link to a URL. Don't know about Bluetooth, sorry. How much time do I have? Four minutes — sure, go ahead. Do you have any expectation of when we will actually have to deal with post-quantum signatures in the wild, in our products, or because of a server we interact with as a client? When do I expect it to appear in the real world, right? Well, I have expectations, but don't trust me too much.
There is a promise that the algorithms will be finalized in Q1, right? Presuming this, the IETF process, even for near-finalized RFCs, takes about half a year. So I'd say that the first attempts to introduce post-quantum certificates into the real world will not happen before 2025, especially taking into account that a real-world CA needs hardware capable of keeping post-quantum keys inside, and it will also take time to develop such hardware. You showed the hybrid mode — the hybrid of the post-quantum and the classical algorithm, right? Yes. What is its security level, let's say? Is that hybrid mode also quantum-safe, or is it not fully quantum-safe? It's quantum-safe — at least the current evaluation of this hybrid mode is that it's quantum-safe. As I mentioned, we have not studied the post-quantum algorithms enough yet. Go ahead. And how do we evaluate quantum safety in general? What are the approaches that are presumed to be quantum-safe? Which approaches are presumed to be quantum-safe? Sorry, I'm not a mathematician. I can say some words, such as lattice-based cryptography, hash-based cryptography, and so on and so forth, but please investigate what these words mean yourself, sorry. Okay, the last question — do I understand it correctly: will the quantum-safe algorithms be resistant to all types of quantum computers? We hope. Thank you very much. Thank you. May I take the question? Yes. Okay, thank you. Thank you very much.
Beyond passwords: secure authentication with passkeys
Alright, so I had a talk yesterday — you missed it. Today I'm going to talk not about Passbolt, the open source password manager, which we are building with my friends here, Kevin Clayton and Shmouty. I'm going to talk about passkeys, because I have the chance to be a FIDO Alliance member and to sit in and participate in the plenary conference and the CPSIG — you will see, FIDO loves acronyms; it means Credential Provider Special Interest Group, because we are a credential provider. So, what is authentication, just so that everybody knows a little bit what we are going to talk about? Authentication is something you know, or something you have, or something that you are — like biometrics — or something you do; you can even have behavior-based authentication. Authentication these days is generally a combination of one or two of these factors. You know that passwords and password-based authentication have a lot of issues: the user selecting a weak password, people being able to brute force, phishing — that's a big one — and all sorts of other issues. Generally you can implement countermeasures to make sure your authentication is good enough, but phishing is the one that is really hard to solve, because it depends on the user. You can solve, for example, weak password selection by introducing a credential manager, and you can prevent a little bit of the phishing with a credential manager, but you still have some room there. So, who has set up passkeys as a user in the room? Well, quite a few people. Who has, as a developer, implemented either an authenticator or passkey authentication on a website? Yeah, three people. So we can see that it's still a new topic. So what is a passkey?
You will see that passkeys mean different things to different people, so I'm going to try to give you a 10,000-feet view of passkeys, not going too deep into the protocols, the protocol options or particular implementations — just a high-level view of the landscape, something I would have liked to have when I started working on this, because it's really tentacular: there are a lot of options and a lot of different views. The official definition: passkeys are a password replacement. They are public-key/private-key pairs that are used for authentication using cryptographic signatures. Basically, a site gives you something to sign, you sign it, and you prove that you are you, using an authenticator. Passkeys are user credentials that are discoverable, so it is possible for the browser to know whether you have a passkey for a given website, for example. And because in the browser the JavaScript is served by the website, it means the website can also discover whether you have credentials. These passkeys are stored within applications or security keys, and they may be synced across devices. This is the new part: in the previous talk we were talking about device-bound passkeys, passkeys that sit on devices, but there is now a new class of passkeys that can be synced across devices. So you can see this is the lay of the land: depending on who you ask about passkeys, they may be thinking about device-bound passkeys, passkeys on physical devices; if you ask Google and Apple, they will talk about synced passkeys, which are basically keys that can be synced across multiple devices. You can, for example, have them on your laptop and on your phone, or you can transfer them, or do an attestation using your phone while you're trying to authenticate on your laptop.
These passkeys are supposed to be exportable and transferable, but in practice they are only transferable within a given ecosystem. For example, Apple will not let you export passkeys to Windows. So they are advertised as being, you know, interoperable, but because they are not coming from the open source world like we do, interoperable means different things for them. There is also another class of passkeys, called app-level passkeys, which generally live alongside device-bound passkeys, meaning they can be used for other things, or have additional properties added to them. Typically you'll see them in banks: a bank application will use passkeys to sign transactions, or will use additional signals to unlock a passkey — something that is not there with a classic authenticator. For example, they will check your location, or your working hours; you can use all sorts of different signals. And you can build a custom authenticator, with whatever UI you want; you don't necessarily have to follow the OS or physical device design. So there are a lot of different requirements — we've seen that passkeys mean a lot of things. On one side you have people working at the enterprise level, who want hardware keys that are very strong and cannot be exported. On the other side you have Google and Amazon, for example, who want a frictionless experience and are ready to trade a little bit of security for usability. When you're doing a checkout on Amazon, they want you to go through that checkout and pay as fast as possible; even if there is a security issue, they are okay with giving you your money back. But a bank does not have the same mindset. So on one hand you have passkeys that require certifications.
The bank will check: is it the authenticator that I gave you? Is that really your personal device? And on the other side you basically have a website like Google that just wants to show that, okay, you authenticated with a YubiKey, but doesn't really care which one it is — it's just to present to you: okay, you have these passkeys, and this is the kind of authenticator you are using. On the enterprise side, for a super-high-security setup, they will issue you a security token that is just for you. And you can see that this has some privacy implications. For example, if everyone was using a device that is unique to each of us, and we logged in on each website with this device, you would be able to do cross-domain tracking: you would be able to see, okay, this guy logged in there, and then he logged in there. And this is a privacy issue, obviously. So on one side you want privacy; on the other side you want basically no privacy, because you are in an enterprise setup. All of these very complex requirements fit in the same standard, so it's a little bit complicated to know what's going on. But the common denominator is phishing resistance — the fact that the passkey is domain-bound. You have one passkey for Gmail, one passkey for AWS, but you don't reuse the same passkey twice. And it always requires HTTPS; they made this choice, which is very wise — no support for HTTP. So the FIDO2 project is a project that works with the FIDO Alliance. The FIDO Alliance contains Google, Amazon, Visa, but also Thales — you know, people making security devices. And on the other hand you have the WebAuthn protocol, which is managed by the W3C. You'll see that the people working in the FIDO Alliance are also part of the W3C; it's the same people — for example, the person from Google is the same on both projects.
And together this is called the FIDO2 project. On one side you have the W3C, which manages the WebAuthn protocol, and on the other side you have the FIDO Alliance, which manages CTAP — sorry, the Client to Authenticator Protocol. So basically the relying party is the website you're trying to authenticate to; it uses WebAuthn over HTTPS. Then you have the client, which is basically your browser and the JavaScript application running in it. And you have the authenticator. The authenticator can be the OS platform; it can be a device, like a YubiKey — anything that is FIDO approved. And these days it can even be a credential manager: it can be Bitwarden or Dashlane or 1Password. The interface between client and authenticator is a bit more messy than WebAuthn. It works with Bluetooth; it works with what everybody calls monkey patching — basically, if you want to integrate in the browser in JavaScript and become an authenticator, for example as a password manager, you just hijack the JavaScript APIs and replace them with what you want. That technique is called monkey patching, and it's the only way for a browser extension, for example, to act as an authenticator. But you also have proprietary protocols — for example, when the Google Chrome browser wants to use the Google authenticator, you don't know what's happening underneath; they are using their own stuff. So I hope that's clear and gives you a high-level view of what we're talking about. There are two ceremonies: the attestation ceremony, which is the registration, and the assertion ceremony, which is the login. There are no other operations. For example, you cannot list which passkeys are available for a given relying party; you cannot delete passkeys. These are not part of the protocol and need to be implemented separately — they are not normative.
We will see that this causes some issues. In the attestation ceremony you have the client, which posts a username. This part, the POST to the WebAuthn attestation options endpoint, is not normative; you can do whatever you like as long as you send a username, and the URL doesn't matter. It's up to the relying party to decide what language it wants to use, what the URL is. Recently they introduced a WebAuthn .well-known file that you can place on your web server to say: OK, this is my attestation URL. The relying party replies with the PublicKeyCredentialCreationOptions, which include the RP (basically the ID of the relying party), the challenge that the user needs to sign, and some other options. For example, in the previous talk people were asking: do you check for user presence, do people have to enter a PIN? This is the moment where the relying party can say: OK, I want to use this algorithm and I want to check the user that way; I will require user presence; I will require user verification. And then the client does basically what it wants. From there, the client calls the navigator credentials API. There is no dedicated WebAuthn API; we use a JavaScript API, the Credential Management API, which can be used for other things but is mostly used for WebAuthn these days. Then we enter CTAP, or something else, for example a proprietary protocol, but here I put CTAP for clarity because it's the one that is best defined. Again, you send some data about the RP and the user, and the authenticator checks the parameters to see whether the crypto operation is supported: it's being asked to use a particular type of key, can it create such keys? Then it collects the user gesture, so either a touch or entering the PIN, and then it generates the credential and generates the signature.
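The creation options described above can be sketched roughly like this. Field names follow the WebAuthn vocabulary, but the concrete values and the helper function are illustrative only, not any particular server's implementation:

```python
import base64
import os

def make_creation_options(rp_id: str, username: str) -> dict:
    # Sketch of what a relying party might return from its (non-normative)
    # attestation-options endpoint. The challenge is random bytes the
    # authenticator must sign; COSE algorithm -7 is ES256.
    return {
        "rp": {"id": rp_id, "name": rp_id},
        "user": {"id": base64.b64encode(username.encode()).decode(),
                 "name": username, "displayName": username},
        "challenge": base64.b64encode(os.urandom(32)).decode(),
        "pubKeyCredParams": [{"type": "public-key", "alg": -7}],
        "authenticatorSelection": {"userVerification": "required"},
    }

opts = make_creation_options("example.com", "alice")
assert opts["rp"]["id"] == "example.com"
assert opts["authenticatorSelection"]["userVerification"] == "required"
```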
It returns the attestation statement and the authenticator data, and the client sends this information over to the relying party. The relying party checks whether the key is valid and verifies the signature: is it valid for that particular key? It also checks whether the RP ID matches, so that a client cannot reuse a request from another website; we keep the property of being domain bound. The assertion ceremony I'm not going to detail again, but it's pretty much the same thing, except you're not sending a new public key; you're just signing with a key that is already on the authenticator. Then what about account recovery? Obviously, if you lose your device or you lose your passkey, what do you do? There are two types of account recovery. There is account recovery on the RP side: basically, the solution to passkeys is more passkeys. Which is convenient if you have device-bound passkeys, because then you need to buy more devices; it makes a lot of sense when you're selling devices. And you can also use passwords. Generally a website like Amazon will let you keep a password but propose passkeys on top, so you default to passkeys but can still use your password, or a magic link, for account recovery. So passkeys are only as good as the account recovery mechanism; we're kind of back to square one. Unless you get rid of these fallback methods, you're not really changing your security posture, in my opinion. On the authenticator side it's a little more complicated. Apple recommends having several Apple devices, which makes sense, and you can also set recovery contacts, and there are custom procedures. For example, there is a procedure for what happens on iCloud if you lose all of your devices: it's actually possible to recover your iCloud account even after losing every device, and it's quite smart.
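One of the RP-side checks just described (the RP ID match) is easy to show concretely: the authenticator data begins with the SHA-256 hash of the RP ID, so the server can verify domain binding by recomputing it. This is a simplified sketch; real verification also covers the flags, the signature counter, and the signature itself:

```python
import hashlib

def rp_id_matches(authenticator_data: bytes, expected_rp_id: str) -> bool:
    # The first 32 bytes of authenticatorData are SHA-256(rp_id); a
    # response produced for another website will not match.
    return authenticator_data[:32] == hashlib.sha256(
        expected_rp_id.encode()).digest()

# Simulate authenticator data produced for "example.com"
# (the rest of the structure is omitted in this sketch).
auth_data = hashlib.sha256(b"example.com").digest() + b"\x00" * 5
assert rp_id_matches(auth_data, "example.com")
assert not rp_id_matches(auth_data, "evil.example")
```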
The problem we have in the open source world is that we don't have such a ubiquitous service where everybody has an account. For example, if we are Ubuntu and Firefox, we don't have the infrastructure to operate such an escrow mechanism, and that's going to be a challenge moving forward, I think. So how does it look from an authenticator point of view? It's a work in progress and there's a lot of change, so maybe by the time I put this slide together it was already outdated. This is an example on macOS and Chrome. You will see that by default, when you click continue, Chrome will prompt you to use the Google authenticator. You have the impression that you're using the OS-level authenticator, but you're actually using the Google authenticator, which is leveraging the OS APIs to provide the experience. It's not the Apple authenticator. So you can see it's already kind of sneaky: if you're using Chrome, they will prompt you to use the Google authenticator. But you have some other options; for example, you can use a phone or a security key. So you see there are more clicks if you want to use something that is not Google. Then you scan the QR code or press your security key and you get the same result. You can even do a cross-device ceremony where you scan the QR code, unlock your phone, the signature happens there, and it is exchanged over Bluetooth Low Energy. There is no pairing between the laptop and the phone, and you are authenticated using that mechanism. So it's possible, for example, to use an authenticator on an Android phone to log in on a Windows device. And if you use Firefox, you will start directly with the Apple authenticator.
It's the same if you're using Chrome: if you use that option that was on the previous screen, you switch authenticators. And I don't expect people who are not knowledgeable to understand what's going on. Likewise, depending on the options provided by the RP, your mileage may vary and the user experience will differ. So it will be quite confusing, I think, for the average user. It's the same if you want to manage your passkeys: they are buried. It's really hard to see how many passkeys you have and where they are registered. Same on iOS: if you want to manage your passkeys you need to go to Passwords, and you need to click on the password entry to see that it's a passkey. So, passkeys: we've solved the password problem, we can all go home, right? Mission accomplished, as George Bush would put it. But no, we still have a lot of issues to solve. There is the question of what happens when you lose devices, especially when there is no sync fabric common to different authenticators. And no real work is being done on passkey management and review, as we have seen. For example, as we heard in the previous talk, with quantum computers coming we will need to roll out new algorithms, maybe faster than we had to in the past. So we will need to revoke keys, and to revoke keys we will need to design an experience where the user understands: OK, this key uses an old algorithm that is no longer supported, you need to create a new passkey. We will need a passkey management user experience that is understandable for the average Joe, and we are very, very far from there. It's the same for developers: I think it's hard for developers to understand all the different options and what they mean when you're implementing as an RP.
It's quite sprawling, and you can't, for example, just copy Google's implementation, because Google does not care about user enumeration: you can already send an email to a Gmail address and see whether it bounces, so user enumeration is already possible, and they simply don't implement the best practices to prevent it. But for your use case, maybe it's important. So you can't even follow what the big players are doing; you will need to do your own research. I think this can lead to problems down the line, and we will need to do a lot more education about the security issues around passkeys. There are some other issues: as you've seen, the user experience is quite fragmented and will not be the same across different OSes and different authenticators. And there is an entry barrier for authenticators: there are very few open source participants in the FIDO Alliance, because it costs around 50K a year to be in the room when these things are being standardized, so it's basically a pay-to-play initiative. In my opinion, for something that is supposed to replace passwords and be that ubiquitous, that's an issue. I think Firefox even has a seat at the FIDO Alliance, but they didn't have the staff last year to be there, which I think is an issue. And there is a lot of proprietary protocol and monkey patching happening; we need to do much more standardization, and I invite you to get involved and be interested in this, because if you don't act on it, they will make the decisions for you. That's it. Do you have five minutes? Yes? All right. Yes? The complexity you just mentioned, is that also true for, let's say, a software service that wants to offer passkeys to its users? Do they also have to deal with all this complexity? Is there maybe a simpler way? Yes. So, do RPs have to do their homework to understand the issues around passkeys? Yes.
And the issues are not super easy to get. For example, let's say you're building a centralized service that authenticates people with passkeys for your enterprise: you do one passkey authentication on that domain and then switch to another protocol. Maybe you're using an iframe, and with passkeys in an iframe you may have UI redressing issues, so you need to take care of that, and for these things you need to read the specs. I think there needs to be more education and more accessible resources, and we need tools for developers so that they don't trip over these details. Same for which algorithms you should support: as we've seen, there are a lot of legacy websites. Maybe they start creating keys with a certain algorithm, but two, three, four, five years down the line, when post-quantum cryptography becomes something you have to do because the state is telling you to do it, then what happens with those keys? You said earlier that the relying party can require user presence, for example. How does the relying party know that the client actually did that? So: how does the client ensure that the authenticator is doing what the RP is requesting? I think it depends on the option, but for most of the options, the client can for example not enforce that the user is verified; however, the data sent back as part of the assertion tells you what the authenticator actually did. Did it verify the user or not? So it's up to the RP to verify: you tell the client, I want the user to be verified, and when you get the response you check: did the user actually get verified, yes or no? So you say, I want this, but you need to check whether it actually happened when you get the final assertion. Makes sense?
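The user-enumeration point above can be illustrated: an RP that cares about it returns an indistinguishable response whether or not the account exists. This is a common mitigation sketch, not something mandated by the WebAuthn spec:

```python
import base64
import hashlib
import hmac
import os

SERVER_SECRET = os.urandom(32)   # per-deployment secret (illustrative)
known_users = {"alice"}

def assertion_options(username: str) -> dict:
    # Return the same response shape whether or not the account exists:
    # unknown users get a stable fake credential ID derived with HMAC,
    # so an attacker cannot distinguish the two cases.
    if username in known_users:
        cred_id = b"real-credential-for-" + username.encode()
    else:
        cred_id = hmac.new(SERVER_SECRET, username.encode(),
                           hashlib.sha256).digest()
    return {
        "challenge": base64.b64encode(os.urandom(32)).decode(),
        "allowCredentials": [{"type": "public-key",
                              "id": base64.b64encode(cred_id).decode()}],
    }

real, fake = assertion_options("alice"), assertion_options("nobody")
assert set(real) == set(fake)    # identical response shape either way
assert assertion_options("nobody")["allowCredentials"] == \
       assertion_options("nobody")["allowCredentials"]  # fake ID is stable
```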
Yeah, but say you have a password manager: someone says, please make sure that the user is present, you sign the challenge and claim the user is present even though you didn't check, because you can do that. Yes. So there's no way for the website to know what the password manager actually did. Yeah, that's where FIDO certification comes into play. Nobody wants to be called out, so it's also a gentlemen's agreement that you're going to respect and implement it. But I suspect that some do-it-yourself people could make their own authenticator that does whatever they like. On the other hand, a bank, for example, may refuse an authenticator that is not FIDO certified. So it also depends on the RP. Because you assert which authenticator you are using during attestation, in the response do you know who certified it? Yes, you have information about the authenticator. There is one level, which is what they call the AAGUID, basically a globally unique ID that says, for example, you're using a YubiKey 5. But you also have another version of this where you have an actual certificate and a signature. This is stored in the MDS, the metadata service of the FIDO Alliance. So if you are, for example, a bank and you want to make sure it's not somebody pretending to be a YubiKey, you can actually take the root certificate and check the attestation against it. So there are two levels: one which is like a user-agent kind of thing that most RPs use, and another one which is more complex and involves cryptography and signatures. We've seen some examples with these ecosystems; what about the Linux ecosystem?
Is there anything like that for Linux? On the Linux side, to my knowledge, not much is happening. I think there is a talk where the GNOME team is going to present what's happening on the Linux side, but it's way behind compared to Windows Hello, Apple Keychain and Google. I've not seen an open source platform authenticator, but you have credential managers, for example Bitwarden or Dashlane, that can bridge the gap in an ecosystem where there's no OS-level support. Thank you very much. Thank you.
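The AAGUID-based policy mentioned in the Q&A can be sketched as a simple allowlist lookup. The AAGUID below is invented for illustration, and a strict deployment (the bank case) would additionally verify the attestation certificate chain against the FIDO Alliance metadata service rather than trusting the self-reported AAGUID alone:

```python
# Hypothetical RP policy: only allow authenticator models whose AAGUID is
# on an allowlist. The AAGUID value is made up; a real bank would also
# validate the attestation certificate chain against the FIDO MDS root.
ALLOWED_AAGUIDS = {
    "aaaaaaaa-bbbb-cccc-dddd-eeeeffff0000",  # pretend "certified vendor key"
}

def registration_allowed(aaguid: str) -> bool:
    return aaguid in ALLOWED_AAGUIDS

assert registration_allowed("aaaaaaaa-bbbb-cccc-dddd-eeeeffff0000")
assert not registration_allowed("00000000-0000-0000-0000-000000000000")
```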
Passwordless authentication in the GUI
Well, as Alexander mentioned, this is kind of a continuation of the previous talk. That one was about passkeys; this time it's not only about passkeys but about other passwordless authentication mechanisms too. Now I have more time, so I will introduce myself properly. My name is Iker. I'm a software engineer at Red Hat. I work in the identity management world, more specifically on the SSSD, shadow and Linux-PAM projects. I'm the upstream maintainer for shadow, so if you want to contribute there, we welcome new patches and fixes. In the past I also worked as a software engineer, first in the automotive sector and then in the 3D printing world for HP, on their industrial 3D printers. And something I didn't mention here is that I like swimming, so if you are a swimmer, we are on the same team. Regarding the agenda: I will give a little introduction to passwordless authentication and then deep dive into the current status of the passkey, smart card and external identity provider authentication mechanisms in the graphical user interface. I will also present the vision that we have right now, including some mock-ups and even a demo. And finally, I will wrap up with a conclusion. So let's start with passwordless authentication. What is passwordless? It's a way to authenticate a user without using a password. That's kind of the common ground. It usually involves multi-factor authentication and also single sign-on. On top of that, it strengthens security: you are using a public key instead of a password, you are not reusing it, and you are not vulnerable to a data breach, because whatever the attacker takes from the server where this data is stored is just a public key; it gives them no knowledge of anything else. And it improves the user experience, because users don't have to remember so many passwords.
They just go there and authenticate without passwords. Currently, the passwordless authentication mechanisms that we provide in FreeIPA and SSSD are passkeys, smart cards and external identity providers. For passkeys we use FIDO2; for smart cards, certificates (the example here is the Spanish national identity card); and OAuth2 for external identity providers. Now let's see the current status of passkeys in the graphical user interface. First of all, you enter your username and arrive at this screen where you can insert your passkey. Then what? I can tell you: you then press Enter. But that's because I know it; nobody else would. You can kind of guess, but you wouldn't be sure. On top of that, there's a text box. You are supposed to enter data into a text box, but we don't need any data there; we are just informing you that you need to insert the passkey, nothing else. Second, if you press Enter there, you arrive here. It's asking you for a PIN, but it could be that you don't need a PIN, that you don't have that extra user verification. So why ask for a PIN when you don't need it? If you don't need it, just press Enter and the process continues. But who knows that? And finally, you are requested to touch the passkey. The LED is usually blinking, so you touch it and you are authenticated. So, our proposal: two different flows. If you don't need a PIN, you will be asked to insert the security key; you insert it, press Enter, and you are done. But if you are required to enter a PIN, you get a text box where you enter the PIN. As you can see, it's much clearer what is expected from the user. On top of that, I didn't mention it before: passkey authentication can happen either locally or remotely. If you do it remotely against the server, you will get a Kerberos ticket.
But what happens if you do it locally? How do you know that you won't get the Kerberos ticket? You need to inform the user somehow, and we are still trying to figure out how to tell the user that they will not get the Kerberos ticket and will not be able to do single sign-on. OK, next one: smart cards. The state here is somewhat better. You see the available users, you select a smart card user, and then you come here: you are asked for the PIN, and that's all. That one was easy. But maybe not so easy: what if the user has more than one certificate? You select the user, you come to the second screen, and we have the same problem as with passkeys: there's a lot of text here, and it doesn't fit in the box. Maybe you know which certificate to choose; maybe you don't. You come to the next screen, choose one, and you are asked for the PIN. We also need to improve this: the user needs to know which certificate to select and why. Currently I don't have a proposal for this; we are still working on it. I will show you later where we are publishing the mock-ups. OK, last one: external identity providers. I will show you the current state in the command-line interface. You try to log in with su, and it says: authenticate with a PIN at a web page, gives you the PIN, and asks you to press Enter. If you go to that web page in a browser, input the code, and press continue, you come to the next screen where you are asked to authorize the request, because maybe it's somebody else trying to be you. You authorize it, press Enter, and you are authenticated. Nice. And what about the graphical user interface? It's not possible; you cannot use it. And here comes the first demo. Yeah. So I will input my username here. OK. You see.
You have your username here, and then the login button. If you press it, you are shown a QR code, a URL and the login code. The QR code will redirect you to the web page. On top of that, we haven't finished this yet, but we will also provide an embedded web browser alongside this, where you will be able to enter the login code and authorize the request. The thing is, first of all, embedding a web browser inside the login screen is maybe not a good idea for security reasons, and you also need to take into account that it's not easy. So we are still working on it; it's a thing to do. If I follow the workflow here: I enter the code, OK, I press continue. Now I need to authorize the request. I'm asked for the PIN, for the password, sorry. OK. Now you are supposed to press Enter and it will work. Well, I will tell you, we have a bug here, a problem I haven't been able to solve before the presentation, so it will fail and show another screen. But in reality that would be all: you press done, you are authorized, and you are logged in to the computer. So, what more? We currently have several authentication mechanisms, but how do you select them in the graphical user interface? It's not possible; there's no way. You are prompted for either the passkey or the smart card, and that's all; you cannot select it. But we already have a proposal for that too. I will come back here again. You have the web login here, and you have this small key icon. This user, apart from the web login, also has the password authentication option. You can press it and you are asked to enter the password. You can come back again and say: OK, I don't want the password, I want the web login. It takes some time, but it comes back. On top of that, suppose you are a user.
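The web-login flow in the demo has the shape of the OAuth 2.0 Device Authorization Grant (RFC 8628): the login screen shows a code and a URL, then polls until the user approves in a browser. A toy simulation of that polling loop (no real IdP involved; the endpoint names and values merely mimic the RFC):

```python
# Toy simulation of an RFC 8628-style device flow: the login screen shows
# user_code + verification_uri, then polls the token endpoint until the
# user approves in a browser. No network; the "IdP" is an in-memory dict.
idp = {"approved": False}

def device_authorization() -> dict:
    return {"device_code": "dev-123", "user_code": "WDJB-MJHT",
            "verification_uri": "https://idp.example/device"}

def poll(device_code: str) -> dict:
    if not idp["approved"]:
        return {"error": "authorization_pending"}
    return {"access_token": "tok-456"}

grant = device_authorization()
assert poll(grant["device_code"]) == {"error": "authorization_pending"}
idp["approved"] = True    # user entered the code and pressed "authorize"
assert "access_token" in poll(grant["device_code"])
```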
You authenticate the first time with a given method, let's say a passkey. You log out, and the next time you come back to the same login screen, it will ask you for the passkey, not some other method. It will remember that you tried to authenticate using the passkey and succeeded, so that the next time the graphical user interface asks for exactly the same method. Of course you can change it, but this way you don't have to re-select the method every time just because some fixed priority always picks the same default. So in conclusion, here is the software design. Roughly: GDM prompts the user for login, the user inputs the username, and GDM starts a PAM conversation. Once SSSD has resolved the username and obtained the available authentication methods and all the data related to them, it generates a JSON message and sends it back to GDM. So GDM then has all the information about the available methods and the prompts it needs to show. The user provides the information, and GDM generates another JSON message telling SSSD which method to use and what data the user entered. If you want to know more about this topic, here is a web page with the design we are currently writing. It's still a work in progress, but the external IdP part is more or less done, so we don't expect it to change much. As a wrap-up, these are the high-level requirements: the user should be able to select the authentication mechanism; they should be able to use the previously mentioned authentication mechanisms to authenticate; the previous attempt should be remembered, so that the next time the user comes they are prompted for the same authentication mechanism; and finally, the user interface shouldn't get in the way. It should be easy and simple; we don't need to do strange things.
The user needs to feel comfortable, and it should follow the same workflows they are used to in other applications or in the web browser. So, last slide: reference links. The first link is the design mock-ups that the GNOME team has prepared. You have almost everything there except the case of two or more smart card certificates; that's still a work in progress. The second link is the SSSD-GDM interface design that I mentioned two or three slides ago. And finally, there is a COPR repository if you'd like to test it. We are building it for Fedora Rawhide, so if you want to test it, I would wait one or two more weeks until we stabilize everything, especially for external identity providers, but then you should be able to test it. That's all. Thank you. Do you have any questions? OK, so you are asking what happens if you are not connected to the Internet and you try to authenticate. Well, if you try to do external identity provider authentication, you will not be able to, because you don't have Internet. Is there a way to connect to Wi-Fi? You can. You mean in the login interface? Yes, in the login interface. No, I don't think so. Good idea, yeah. But maybe you can use another method that doesn't require an Internet connection. Yes, but then what's the point of even having the web login? If I need to remember my password anyway, and I have the ability to use a password, which is less secure, then I just make the entire system less secure by adding another way of logging in, without actually improving anything. OK, so I get your point: you enabled the web login, and you don't want to fall back to a less secure authentication method just because you don't have Internet access there.
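The SSSD-to-GDM JSON message described a couple of slides back can be imagined roughly like this. The exact schema lives in the linked design page; all field names below are invented for illustration:

```python
import json

# Hypothetical shape of the SSSD -> GDM message advertising the available
# authentication mechanisms for a resolved user. Field names are invented;
# the real schema is defined in the SSSD design page referenced in the talk.
msg = json.dumps({
    "auth-selection": {
        "mechanisms": {
            "password": {"prompt": "Password:"},
            "passkey": {"prompt": "Insert your passkey and press enter"},
            "eidp": {"uri": "https://idp.example/device",
                     "code": "WDJB-MJHT"},
        },
        "priority": ["passkey", "eidp", "password"],
    }
})

parsed = json.loads(msg)
assert parsed["auth-selection"]["priority"][0] == "passkey"
```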
Then I would recommend using passkeys or smart cards, because the user is usually cached locally. And is there no NetworkManager access in the login interface? No, I don't think so. Because that would make it work. We can discuss it; it's good feedback. I mean, there are certain potential issues with NetworkManager: it also has access, for example, to VPN configurations, so you would have to create a special interface for NetworkManager that didn't expose secret information. So this is actually a common problem that has existed for quite some time. I was already talking about this at FOSDEM in 2016: the captive portals you get in hotels, where you cannot connect to the VPN before you go online, and you need to be online to solve the portal challenge first. It's still the same problem: you effectively need to run a browser before login. And running NetworkManager, with its access to potentially private user information, before the user is authenticated and identified, is another problem. We are not looking at that problem specifically within the context of this work, but solving the ability to run a browser before login will help us solve some of these problems. And I know that at least three major distributions are working on this set of problems, but it's a question of prioritization. Yeah, I think it should be solvable to run the browser securely. I think so, yes. I mean, we have things to work on, but it will take a while. Right. Yeah, first, thanks for working on this; clearly there's a lot of work to do. Have you done an accessibility review on this, for disabled people especially? And actually connected to that: I noticed that the UI you showed is a bit far from what people expect when they go to a website.
For example, you showed that to select the authentication factor there is a small icon at the bottom right with the key, right? That's not really what people are used to when they go to websites. So why is that, and how is it connected to accessibility? OK, so the question is whether we took accessibility into account, more specifically that when you are about to select the authentication mechanism, the icon is kind of small and on the right side. I know there have been people from the UX team working on this, and I'm quite sure that if you go to the first link there and provide your feedback, they will take it into account. There are GNOME issues in the design team section with these actual mock-ups, so you can add your comments, follow it, and provide further feedback if you don't like it. We are still working on this; everything is a work in progress, and the more feedback we get the better, because we'll provide a better product for our users. I just want to make clear that was a question more than a criticism. No, no, no. You were asking for the rationale; I'm not part of the UX team, so I don't have the exact details. Thank you so much. OK. First of all, thank you very much for your work as well as the presentation. There was a picture of logging in with a security key that looked like a mock-up. Is that already available? No. I guess you mean this one, right? Yeah, it's still a mock-up and we'll work on it. We started with external identity providers, and we'll continue with the other two methods. So I don't know when this will be available, but we are working on it. Philip. Thank you. Sorry, go ahead. So of course you're primarily implementing and integrating this into GDM first, because that's what you're working on.
But will you implement it in such a way that, once it is finalized and solidified, people who have different display managers can also implement it? Because otherwise we'll be forced to use GDM, or maybe SDDM. OK. So the question was whether other login managers, like KDE's or some other one, will be able to provide this authentication mechanism. The answer is yes. They just need to follow this design. The PAM conversation is part of libpam, which is kind of a standard nowadays, and the JSON message is defined in this design page. So the graphical interface just needs to follow this to implement it. It's not implemented yet, but if somebody wants to start implementing it already, we are fine with it; we don't have any problem. We are providing it for GNOME because we use that in Fedora, but anybody else can come and implement their part. So they don't need that tight level of integration? No, no, not at all. We just need a PAM conversation happening, and to follow this diagram. That's all. OK. Thank you. Go ahead. I wonder: for example, I have a laptop set up with both password authentication and a second PAM module for a Trezor, which is this USB thing; I can click the button and log in. That allows me to choose whether I type my password or press the button on the Trezor, and you have two different flows for the authentication. Do you think it would be possible to set it up so that you could either type your password or tap your smart card within a single flow, without the user selecting the flow?
If you have a smart card reader or an NFC device on your laptop, then when the password prompt is showing, it could be possible for the user to tap their smart card without having to click anything: I can type my password or log in with a tap. It's the same on your phone: you can draw your pattern or just use your fingerprint scanner; you don't need to choose. This is a bit more complicated in the case of GNOME and GDM, because GDM uses a different PAM stack configuration when it detects a smart card: it uses the gdm-smartcard stack, and that one explicitly expects a smart card to be engaged, so it will not use the stack that uses just the password in that case. If your device is supported by a separate PAM module, then it will work in the normal password-based authentication stack: the password module will just be skipped completely, because your module will handle it. That's how PAM works. The whole concept here depends on PAM basically being used for everything. So, time's up. Andreas, do you want to add something? Yeah. Do you plan to extend the PAM conversation, how to talk to PAM? Because that's still very limited. One problem, for example: if you have multiple domains and you want the user to be able to select the domain, that's a problem to present, so you would actually need to extend the PAM conversation first. You can use the same JSON format; it already allows you to do that. Can I define a primary module, which is part of PAM, that's not fully linked, so they can fall back to that? So it's not the classic multiple prompts. OK, time's up. I think we can continue this discussion outside.
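The stack behavior described in that answer (a module that handles the device runs first and, if it succeeds, the password module is skipped) can be simulated with PAM's "sufficient"/"required" semantics. This is a toy model of the control flags, not libpam itself:

```python
# Toy model of a PAM auth stack: "sufficient" success short-circuits the
# stack, so a hardware-token module can sit above the password module and
# the password prompt is effectively skipped when the token answers.
def run_stack(stack, context):
    result = False
    for control, module in stack:
        ok = module(context)
        if control == "sufficient" and ok:
            return True          # success ends the stack immediately
        if control == "required":
            if not ok:
                return False     # a failed required module fails the stack
            result = True
    return result

token_module = lambda ctx: ctx.get("token_present", False)
password_module = lambda ctx: ctx.get("password") == "s3cret"

stack = [("sufficient", token_module), ("required", password_module)]
assert run_stack(stack, {"token_present": True})    # token wins, no password
assert run_stack(stack, {"password": "s3cret"})     # password fallback works
assert not run_stack(stack, {"password": "wrong"})
```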
Automated Integration of FreeIPA with AD and External IdP
So let's start with the next talk. Please welcome Thomas Werner, talking about ansible-freeipa. Hello. So, let's begin. This talk will be about using ansible-freeipa for AD integration, so Microsoft Active Directory, and also to set up external identity providers. The plan was to have an online demo here, but there have been some issues, so there are slides, and there will be an online demo later on on the web page. For the automated FreeIPA deployment, I have been using the work from my colleague Raphael, and we had to change some things in the inventory to make it work, especially in my environment. So if you want to do it on your own, this is used as the base for the whole presentation. These are the steps to do, and one important thing: please fix time and time zone on all the machines, otherwise you will have fun with Kerberos - tickets that are not valid, tickets that are in the future or in the past and so on. Not fun. The first step was to get the Windows machine you want to use. There is nice documentation on this web page from Raphael: the different steps you need to do, where you can get the images, what kind of images are working and so on. The first thing we need to do in the Windows AD setup is to change the Windows AD setup playbook to disable IPv6, because if we do not, we will have lots of fun with DNS later on. This was one of the most important things. And then we're coming to the setup of the IPA server. For the IPA server, as we wanted to have a replica deployment later on, it was needed to also enable DNS and automatic reverse zone creation. Sadly, there is an issue with the automatic reverse zone creation later on, but it can be fixed manually. And there is another issue with DNS with Windows, so you need to disable DNSSEC validation. In the lab - that's a lab setting. Yep. You will find out if it's working for you or not. So then you can simply do the steps that are on the web page.
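Putting the settings from the talk together, an ansible-freeipa inventory for such a lab server could look roughly like this (a sketch: hostnames, addresses, and passwords are placeholders, and the variable names should be double-checked against the ansible-freeipa ipaserver role documentation):

```yaml
# Hypothetical lab inventory (sketch) -- names and values are examples only
ipaserver:
  hosts:
    server.linux.ipa.test:
  vars:
    ipaadmin_password: SomeADMINpassword
    ipadm_password: SomeDMpassword
    ipaserver_domain: linux.ipa.test
    ipaserver_realm: LINUX.IPA.TEST
    ipaserver_setup_dns: true
    ipaserver_auto_reverse: true            # automatic reverse zone creation
    ipaserver_forwarders:
      - 192.168.122.10                      # the Windows AD DNS server
    ipaserver_no_dnssec_validation: true    # lab only, as discussed above
    ipaserver_setup_adtrust: true           # needed for the AD trust later
```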
So, the first IPA setup. Then there is a nice test to make sure that DNS is really working on both sides. This is the nslookup test: it tries to find the Kerberos TXT records on the Windows side and on the Linux side, on both ends, so it verifies that everything is working here. And the last one is setting up the trust. I'm not adding the information here because it's completely unchanged from the script. And after we've done that, we can log in with the AD administrator on our Linux server, server.linipa.test. You can see I can log in, I have a ticket, I can get my AD ticket, and then I'm trying to do a change in IPA. And it says: hmm, invalid credentials. Okay. But we have a solution for that, also in ansible-freeipa, which was added lately. We can grant the AD administrator the rights to act as an IPA administrator. The first step is adding an ID override, which is needed to be able to use the AD administrator. And the second one is adding this override for the administrator to the admins group, to make sure that this user is able to do something, so that it has admin rights in IPA. And after we've done this, we can directly add a user, remove a user - you can do everything. Hosts, users, whatever it may be. The AD administrator is an IPA administrator. And the next part was: okay, let's try to do client deployment using this AD administrator. The inventory file needed to be changed a little bit here. So there's a client setup, and there is also a setting - I don't know if you know about it - configure DNS resolver. This is a feature of the ipaclient role to set up the client in such a way that the DNS server you're configuring here - this is the IP address of the DNS server - is used. So the first step of the ipaclient role is to set up NetworkManager or systemd-resolved or resolv.conf so that you are able to use this DNS server directly. So it's not needed to do this manually.
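The two steps just described (ID override plus admins group membership) might be expressed with ansible-freeipa roughly like this; module and parameter names are to the best of my knowledge and should be verified against the ansible-freeipa documentation, and the AD principal is an example:

```yaml
# Sketch: grant the AD administrator IPA admin rights
- name: Allow AD administrator to act as IPA admin
  hosts: ipaserver
  become: false
  tasks:
    - name: Add an ID override for the AD administrator in the Default Trust View
      ipaidoverrideuser:
        ipaadmin_password: "{{ ipaadmin_password }}"
        idview: "Default Trust View"
        anchor: administrator@ad.ipa.test

    - name: Make the override a member of the admins group
      ipagroup:
        ipaadmin_password: "{{ ipaadmin_password }}"
        name: admins
        idoverrideuser: administrator@ad.ipa.test
        action: member
```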
And also, if you do an unconfigure, it will remove it again if you set the variable. So it's doing this automatically. And the next two lines are to force it to use the administrator of AD. And there is one important thing here: you need to write it correctly. That means a capital A in Administrator, and also the domain capitalized. Otherwise, it will not work. If you log in it works without, because there is a rule for that, but here there is nothing, so you need to write it correctly. This is the first issue - why isn't it able to find the AD administrator otherwise? But okay. With this, we are able to deploy the client, and it's working afterwards. So you see here, the next one is the playbook to deploy the client. It's the normal thing you see in ansible-freeipa; there is a playbook for this, so you can simply consume it. Yeah, so the client was easy. What is next? Yeah, the replica. But for the replica, we ran into an issue with the command line and also ansible-freeipa. There is currently an issue in the replica connection check: it tries to use admin, and for sure the password is not valid, so it fails. We will find out what exactly the issue is so we can solve it. It's affecting the command line - so the FreeIPA package itself - and also ansible-freeipa. It doesn't matter which one you're using to deploy; they will both fail. But there is a temporary workaround: skip the replica conncheck. But make sure that everything is working - DNS needs to be working, and also the reverse lookup needs to be working, otherwise it will fail too. And the next step is simply to deploy the replica. And then we are there, and we have a working replica. We can use it also to deploy clients, also using the AD Administrator and so on. So we have some issues, and we will work on them in the coming time, but they are relatively small in my opinion. It could have been worse. And now we are coming to the second part.
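For illustration, the client and replica inventory pieces discussed here could look roughly as follows (again a sketch; hostnames and passwords are placeholders, and the exact variable names should be checked against the ipaclient and ipareplica role documentation):

```yaml
# Sketch: client deployment authenticating as the AD administrator
ipaclients:
  hosts:
    client.linux.ipa.test:
  vars:
    ipaclient_configure_dns_resolver: true
    ipaclient_dns_servers:
      - 192.168.122.20                            # IP of the IPA DNS server
    ipaadmin_principal: Administrator@AD.IPA.TEST # capitalization matters
    ipaadmin_password: SomeADpassword

# Sketch: replica deployment with the temporary conncheck workaround
ipareplicas:
  hosts:
    replica.linux.ipa.test:
  vars:
    ipareplica_setup_dns: true
    ipareplica_skip_conncheck: true               # workaround for the admin issue
```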
A colleague of mine wanted to present this part here, but he was not able to come. So, the second thing that we added in ansible-freeipa is the possibility to configure and use an external IdP. There will be another talk later on about external IdPs that will go into much more detail, so any open questions here might be solved by that one. FreeIPA has the modules for external IdP; there is a new module that was added to ansible-freeipa, and the use of external IdPs was also added to the user module and so on. So we can configure FreeIPA as an OAuth application on GitHub - this is the example that I will show here. So let's go directly to this one. We are creating a GitHub OAuth application in the first step, because this is needed to be able to configure the external IdP with IPA. The steps are: simply go to your GitHub, go to Developer settings, OAuth Apps, register a new application, and read the docs. If you do so, it will ask for several things: the application name, the homepage URL, which here is also the authorization callback URL - you should have the same in there; this is the IPA server URL. And please also add a description, to be able to find it later on. And enable device flow - this is needed for IPA to be able to handle this at all in the end, so this is very important to enable. Then click register application. And if you have done so, you will get a client ID and also a client secret. It's very important to keep those secret. But one thing: you need both of them in the next step for the setup of the external IdP, and there is no way to get the second one displayed again. So either write it down, make a screenshot, whatever - but in a safe way. And if you have those settings, you can go to ansible-freeipa. Here you see we are simply using them in text form, but you can also use Ansible Vault for that, so that you do not have the passwords here; this is just for simplicity. It's the same with the IPA admin password.
It's here simply to make it easy for us to see what's going on; otherwise it would be a little bit cryptic. So, this is simply creating, setting up the external provider. And in the next step, we need to retrieve the GitHub user ID. Oh, one thing that we should add here: the IdP user ID attribute is set to id here. There is another way, but this one is much better, because with GitHub it's possible to reuse names. So it's really good to use the numeric id here for authentication, because then you will not run into a possible name clash later on. And this is a common problem for many IdPs: if you delete a user, and after some time another user registers the same visible user name, that user basically squats the previous one. Many of those providers run something like a 90-day protection of the accounts - even if you delete an account, you cannot register it again. But eventually it expires, so somebody can squat your account this way. If you've configured your systems to trust whatever the user name in the system was, good luck - you will be hacked a year later. So taking these other fields into account is very important, and it's the administrator's job to design this. Unfortunately, all these fields are not visible in the UI, so a normal user cannot see this information; it's the admin who needs to discover it. Yeah. So, use it this way. So, we're retrieving the GitHub user ID; it's stored here, and in the next step the IdP user ID uses this retrieved user ID. The bad thing here is that, sadly, "IdP user ID" here and "IdP user ID" there are not the same: one is a user ID, so a number, and the other one is really a user name. So be careful and read carefully. The thing is, ansible-freeipa tries to use the names from FreeIPA itself, so you will see the same naming issues in ansible-freeipa that you see in FreeIPA itself. And after we've done this, the user is able to authenticate.
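The flow described here - create the IdP reference, look up the numeric GitHub user ID, then attach it to the user - might be sketched like this. Module and parameter names reflect my understanding of the ansible-freeipa IdP support and should be verified; the login and credential values are placeholders:

```yaml
# Sketch: GitHub as an external IdP for a FreeIPA user
- name: Configure external IdP
  hosts: ipaserver
  tasks:
    - name: Create the IdP reference using the GitHub provider template
      ipaidp:
        ipaadmin_password: "{{ ipaadmin_password }}"
        name: github
        provider: github
        client_id: "{{ github_client_id }}"    # from the OAuth app page
        secret: "{{ github_client_secret }}"   # better kept in Ansible Vault

    - name: Retrieve the numeric GitHub user ID for a login name
      ansible.builtin.uri:
        url: https://api.github.com/users/mygithublogin
        return_content: true
      register: github_user

    - name: Attach the IdP to an IPA user, keyed by the numeric ID
      ipauser:
        ipaadmin_password: "{{ ipaadmin_password }}"
        name: testuser
        idp: github
        idp_user_id: "{{ github_user.json.id }}"  # number, not the login name
```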
So, it needs to get the code, and with the code, it's possible to log in. And that's it. Thank you. So, we have about six minutes. Do you have questions? Yes, please. Please scroll to the beginning of your presentation where you describe the server setup. This one? Yes. Well, I hate DNSSEC myself, but why do you disable the DNSSEC validation? If you do not disable it, the IPA server gets a reply from the Windows DNS server and it's ignoring it. Maybe you should have said that this is about the lab setup. So, if you have a lab where you don't have DNSSEC configured in the Windows DNS server setup - which is the default, they don't set up DNSSEC - then your lab is basically disconnected from the Internet, and it doesn't care about DNSSEC. So, either your AD DNS has DNSSEC in that lab, or it doesn't. Thomas chose the easiest way to handle DNSSEC in the lab. But IPA configures DNSSEC validation by default if you use the internal DNS server. That's why it's forced into disabled validation, because it's enabled on the IPA side. So either that Windows setup in the lab needs to gain a DNSSEC configuration, or both of the sides need to drop DNSSEC validation. Sorry, I don't get it, because, well, if you implemented DNSSEC validation implying that the DNSSEC records are required, that's something weird. Otherwise, you just don't get any signatures, and you have nothing to validate. Now, the BIND server, which IPA uses as an internal server, has DNSSEC validation enabled by default. You cannot switch it off unless you explicitly say to switch it off. Yes, but if there is no signature, it should not reject anything. It does check, and it does reject the request; it rejects the unsigned answers. But this is just a lab story. Please do not disable DNSSEC validation in the wild, unless you know what you are doing and are ready to pay the consequences. Okay, thanks.
Yeah, I think the reality is that a lot of real AD setups don't have it enabled, because nobody ever talks to them directly. In many cases, people are using cloud-based DNS services, so it's not the DNS server in your infrastructure; it's one of the DNS servers provided by - I don't remember the names of those companies - and those typically do not have DNSSEC enabled for the whole zone that the company rents out of them. Okay, yes, just a second. Does this by any chance work with Samba as well? Does the external identity provider work with Samba also? I think I need to answer this. Thank you. One of them is using an AD user to manage IPA; this will work with Samba, because this is just a normal trust between IPA and Active Directory. The external identity provider is for IPA users, because IPA only authenticates users that are in IPA. AD users are authenticated by the Active Directory domain controllers. Microsoft's implementation of Active Directory does not have a Kerberos pre-authentication method that supports anything like that at all. Completely. Same with Samba AD: Samba AD built with Heimdal has no way to handle this. Samba AD built against MIT Kerberos has a theoretical way to handle this; it's not implemented, and it's in my and Andreas's plans to complete this work on the Samba AD side. There will be more about it in our talk, which will be the last talk of the day. More questions? I hope it was captured by the mic. I hope. Yes? What's the difference between this kind of integration and the one you spoke about in the morning? Wow, good question. Maybe I can answer this one. This is basically ansible-freeipa; here we have an AD integration. This ansible-freeipa setup is simply created to establish a trust with AD.
The one from my presentation is basically a container service that is capable of connecting to the LDAP server from AD, to make requests to LDAP - python-ldap, something like that. The one that was in the morning had, in the center, the client enrolled into AD or IPA and providing services to web applications, which is the key difference to this case. Any more questions? We have time. Thank you.
Connecting IBM AIX to Red Hat Identity Manager (FreeIPA)
Thank you. So, you know, if you work a lot with green screens, after some time you cannot distinguish between green and yellow. That's what happened to me, and we are talking about IBM AIX, which is usually a green screen. So we are here in the devroom, and first: I'm not a developer. I'm a classical system administrator. You can say I'm a DevOps engineer, because I can code - in C, in Google Go and Rust, in Python and Ruby - I use everything. And I ported a lot of tools to IBM AIX and to Linux on IBM Power and on OpenPOWER. I'm not an IBMer. Usually IBM has talks about IBM Power; I'm not IBM. I'm an IBM Champion - if you know what an AWS Hero or a Microsoft Most Valuable Professional is, this is the similar program from IBM. I'm not a Red Hatter either; I don't work for Red Hat. But as you see, I'm a Red Hat Certified Engineer and a Red Hat instructor. That's why some of this talk may sound to you like training material - it is not. And we're here at FOSDEM: beer, open source, hackers. What has that to do with AIX? AIX is a series of proprietary Unix operating systems, closed source. So this is my attitude to it: real hackers don't need source code. I was born in another time, in another country, and we mostly didn't have access to source code. And a real hacker is someone who can understand how a program works without looking into the source code, and who can change it without looking into the source code. So, who of the people here uses it today? Who uses AIX? Nobody? You are all wrong. Do you have a bank account? You're an AIX user. Do you have insurance? You're an AIX user. How did you come here - by car, by train, by flight? Everyone uses AIX. Retailers, manufacturers - if you bought something, it was done on AIX. Do you have this thing? No, you don't have AIX on it.
But in the back end, it is an AIX database - or a database on AIX - that processes all your orders and sign-ons and so on. So I added a marketing sheet, because nobody knows what IBM Power is. Sheet with a long e, not a short one. I just want to say that this is my favorite machine, the IBM Power E1080. It has 1,920 logical CPUs and 64 terabytes of memory, and you can have it all in one partition, in one virtual machine. The first time I did it with 640 CPUs in a virtual machine, and I wanted to look at what my CPUs were doing. And you know, even if you have, let's say, 40 lines in your terminal, 640 CPUs is - how many? - 16 pages of CPUs. So it takes some time to page through them. But there are some fun facts about Power which are not very well known. Fun fact number one: zero successful data breaches, as of 2022. I don't think it is because AIX is so secure. AIX is not that secure; it is like any other operating system, and most system administrators cannot secure it correctly. But the other side of the story: nobody knows about it, so it's difficult to break into it if you don't know how to use it. Fun fact number two: 14 years the most reliable major server - somewhere here there is a typo. It's really reliable. Bringing it down is nearly impossible. It just works; it can work for years, you can forget about it and it will work anyway. And I like this fun fact about performance - the P in Power is performance. You see here the IBM eServer p5, the fifth generation of Power servers, 2005. It did an SAP benchmark with around 8,000-something SAPS. Eight years later, a Fujitsu SPARC could do almost the same. Last year, the latest and greatest Dell PowerEdge with the latest and greatest Intel CPU outperformed it by just 1%. So you know which are the most powerful servers in this benchmark right now - the first three places are there.
Before going further, we have to talk a little bit about AIX and what we should understand about it - what makes working with AIX so easy and so difficult. It's a real Unix-standard operating system: everything is standardized, and everything that is implemented in AIX is implemented according to standards. But you know, standards can be interpreted a little bit differently; it usually depends on the developer who implements the standard. One of the most important things is binary compatibility. If you ask any AIX admin, they will say binary compatibility is the most important thing, because yes, I can run on my most modern AIX server a binary which was compiled 20 years ago - I did it with even older binaries. The other side of this binary compatibility: you don't innovate. Because you have it, it works - so why should you innovate, why should you do newer things? It's not BSD-based and not System V-based; it's OSF/1-based, if someone remembers OSF/1. It was the end of the eighties, beginning of the nineties, when IBM, HP and Digital united to make a new standard in the Unix world, and they made OSF/1. And of course, because not everything can be standardized, it has some unique features. So let's go to authentication. PAM - everyone knows it, everyone uses PAM on Linux. AIX has support for PAM. Everything is good? No. It originated in Solaris, yeah, and can be used on AIX. But AIX uses the old Solaris implementation of PAM from the end of the nineties. And it's a real pain, sorry, to port a PAM module from Linux to AIX. I tried to port Azure AD authentication to AIX, and I failed; after one week I said no, I will not do it, because of the differences between the APIs on AIX - the old PAM interface - and the newer interface. But AIX has something different, called a loadable authentication module. This is an original AIX idea for how to do almost the same thing, and it was done even before PAM.
Five years before PAM, I think, they did LAM. Almost the same, but a little bit different. It's AIX-only technology and very popular in the AIX world. Again, not because it is the best technology - usually because system administrators don't know anything else. That's why they use LAM; it is there by default. And the biggest feature of it: there is almost no documentation on how to use it and how to develop for it. The first time I developed my own LAM module, I used Samba source code to understand how it works, because Samba had a LAM module for AIX and IBM didn't provide anything. So it's not really PAM versus LAM - they work together. We can have application one using PAM and application two using LAM. That's flexibility. We can have user one using PAM and user two using LAM on the same system. We can do everything we want. Every user has 50 attributes, different attributes we can configure. It's not like on Linux, where you have home directory, password, user ID and so on - on AIX you have 50 attributes. You don't have to use every attribute, but you can, and you can configure different password policies for different users, based on different dictionaries and so on. But what's worse, you can configure PAM to use LAM for authentication, and configure LAM to use PAM for authentication. Usually it's good that AIX administrators don't know about this feature, because you can get into a real - I don't know how to say it - you will be waiting 20 years for your authentication to complete, because LAM will consult PAM and PAM will consult LAM. So now let's go a little bit into the details. The first thing we can choose in the configuration: do we use standard authentication, which is LAM, the loadable authentication module, or do we use PAM authentication? We configure it in the usual config file, and the standard value is standard authentication. In our case we leave it as standard authentication.
Next, in /etc/security/user we can configure different user attributes. There is one attribute, SYSTEM, which usually tells us which loadable authentication module should be used to authenticate the user. By default on AIX there are two variants, files or compat - they are not really very different - but you can install additional authentication modules and add many more. It works with the AIX-only function authenticate() - it is not POSIX, it's not Single UNIX Specification, it's just AIX. You just get the user name from the user and send it to this function. The function reads /etc/security/user, works with the SYSTEM attribute which we configured, and says: okay, let's use this prompt for the user. So in this case it can be that user one uses LDAP and authenticates against FreeIPA, for example, another user uses Kerberos and authenticates against Microsoft Active Directory, and a third user can use multi-factor authentication and authenticate through GitHub - all on the same system. But that's not all, because, as I said, sometimes the documentation lacks some information. There is another function, authenticatex(), in the newer versions of AIX. Yeah, IBM is like me in this case - they have a very big brain for naming functions; I also name my variables i, j, k, l, m and so on. The difference here is the state argument. Which of the two should be used, I don't know, but one of these two functions is used by login, and it then goes to the loadable authentication modules. And we have the configuration for the loadable authentication modules in this file, and this is the standard way. As you see, we have one module for 32-bit programs and another module for 64-bit programs. And this is again a problem for me personally, because I like to use modern programming languages like Google Go or Rust, and they are 64-bit only on AIX. But in this case you need to also have a 32-bit program, and that means you can use just C, and that's it, nothing else.
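As an illustration of the per-user SYSTEM attribute just described, an /etc/security/user fragment could look like this (a sketch: the user names are examples, and the module names in quotes must match what is registered in the loadable-authentication-module configuration on that system):

```
* /etc/security/user (sketch) -- per-user authentication method
default:
        SYSTEM = "compat"

alice:
        SYSTEM = "LDAP"         * authenticate alice via LDAP (e.g. FreeIPA)

bob:
        SYSTEM = "KRB5"         * authenticate bob via Kerberos (e.g. AD)
```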
And this is what comes by default: as you see, AIX delivers Kerberos and LDAP as default modules. They are just there; you need to configure them. But there are some pieces of information missing here. If you want to use LDAP, you must install the IBM Directory Server LDAP filesets. They are delivered with AIX, but they are not installed by default. Similarly, if you want to use Kerberos, you must install the IBM Network Authentication Service filesets; they are also delivered but not installed by default. In this particular case I use only LDAP, but you can also use Kerberos for authentication and LDAP as the directory service, to list users and get the directory information for users. Do I need to do something on the FreeIPA side? No, really, no. You just install and use FreeIPA as you usually do. I have such an installation at a customer site, and they just installed it; they didn't do anything special for AIX there. I usually do something: in my tests I usually set the ok_as_delegate flag. It's not for LDAP, indeed; it's more for Kerberos single sign-on, which I don't use here, but it works. And I create a separate ID view for AIX, because there are some gotchas with AIX: bash is not installed by default on every AIX - just on newer versions of AIX it is there by default; on older AIX versions we use Korn shell 93. And on AIX the standard user group has GID 1, not 100 as on Linux. That's why I do this using a view. And yes, just a small Ansible snippet of what I told you about - and again, thank you very much for the Ansible modules, they are magnificent. Now the AIX side. On the AIX side we just create the secldap client configuration with this one command, and we specify here our IPA server, the bind DN which we use to connect to FreeIPA, the password - you see I use a very long and cryptic password - and where to start searching. It creates the configuration for the loadable authentication module automatically, it creates the LDAP client configuration automatically, and it starts the LDAP client.
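The FreeIPA-side preparation described here - a separate ID view for AIX that overrides the login shell and primary group - might be sketched with ansible-freeipa like this. The module and parameter names are assumptions based on my reading of ansible-freeipa and should be verified; the view and user names are examples:

```yaml
# Sketch: FreeIPA-side preparation for AIX clients
- name: Prepare FreeIPA for AIX
  hosts: ipaserver
  tasks:
    - name: Create a separate ID view for AIX machines
      ipaidview:
        ipaadmin_password: "{{ ipaadmin_password }}"
        name: aix_view

    - name: Override shell and primary group for a user on AIX
      ipaidoverrideuser:
        ipaadmin_password: "{{ ipaadmin_password }}"
        idview: aix_view
        anchor: auser
        loginshell: /usr/bin/ksh93   # bash is not there by default on older AIX
        gidnumber: 1                 # AIX staff group is GID 1, not 100
```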
And as you see here, it really finds on its own where it can find users and groups. No magic, no rocket science. The only thing you can see here is RFC 2307. On Linux you usually use RFC 2307bis, and here there is no bis. Why? Because, as I told you, AIX is a standards operating system, and RFC 2307bis has been a draft of a standard for 20 years, but not a standard. So sorry, guys, it will not be implemented - this is the official answer I got from IBM, because it's not a standard. Everything else looks good. There are some configuration changes that, I say, must be done - well, in real life they should be done, not must. You want home directories for your users to be created when they log in. And you can say that by default all the users are in the LDAP directory. And I found that on the newest AIX these two parameters and the password policy don't play very nicely together with LDAP. Another feature is domainlessgroups: by default, if a user comes to AIX from an LDAP directory, it can have only LDAP groups; it cannot be in local groups. Switching on domainlessgroups says: okay, the user can also be a member of a local group. And one more: if you use views in FreeIPA like me, just don't forget to add the AIX view here and restart the LDAP client, that's it. So everything is configured, everything is working - that's not so interesting. If something goes wrong, you can use ldapsearch to check what FreeIPA gives to you, or what AIX sees on the FreeIPA side. And if something goes completely wrong, you can switch on debugging with these magical variables, almost nowhere documented. But be careful with them: first of all, they produce really a lot of output; second, you can find even your passwords in clear text in this output.
Okay, mapping. Sometimes you must change your mapping. This is the standard mapping which all AIX installations use, but you may find situations - I saw one bank - with a slightly different mapping, because they needed some additional fields in their case. You can change the mapping here, and it's rather easy, no rocket science. So, does it work? Yes, but there is always some "but". There are some wishes. First of all, as I told you, how a developer implements a standard depends on the developer, and AIX expects a slightly different password-last-change attribute than FreeIPA provides - different formats of dates. It's on the AIX side, and IBM development has been trying to fix the problem for, I think, a year or so. That's how it is when you work with a closed-source operating system: you can't fix it on your own, sorry. And I didn't find a way to make HBAC working. (I have the answer. - Thank you.) But I would like to have it working. And yes, FreeIPA is missing my favorite 50 AIX user attributes, and I would like to have them there - it's really one of the things I love in AIX. We have even more AIX-specific features, like role-based access control and trusted execution: if you switch it on, you cannot execute a binary on AIX without it checking the signature of the binary, and everything can be stored in LDAP. That's why it would be nice to have that in FreeIPA too. And yes, what AIX is missing is a native FreeIPA client - I would like to have one, and it is not there. If you can help me, feel free to ping me and we will talk. Thank you very much for the time. Do we have any questions? One minute. Yeah. Not a question, just a side note about the standards stuff. Well, it's sort of a catch-22: until the RFC is implemented, it can't become a proper standard, and it is implemented by numerous vendors. Yeah. So probably raise the issue with the IETF? The IETF is not interested in finishing this work.
The people who originally started this work are not interested either, because everything is working for everyone. That's the position. Okay.
Empowering FreeIPA: a dive into the modern WebUI
So, ready? Okay. Yeah, I have a huge disclaimer before I start: this talk I'm giving now was supposed to be given by Carla, who drives this effort at Red Hat. She explained a lot of things to me, a lot of things that I hope I will remember. And also, if you have challenging questions, they will probably need to get redirected to Carla somehow. So this is all about the web UI, the FreeIPA web UI. Okay. I just would like to go quickly through the agenda. I'm planning to do a very quick overview of the background and historical context. Then I'll talk about the main motivation for the change. We will see the technology stack transition from the current one to the new one - and this stack transition not only applies to the technologies or the components, but also has implications for the whole testing framework and documentation. Then, if there is no demo effect, we will have a look at the modern UI from a live and public instance that all of you will be able to access as well - if you have your laptop on, you can play with it as we speak. And finally, before the Q&A, we will have a look at the future roadmap, because this is a work in progress: we haven't implemented all the functionality yet. We have big pieces, but it's not ready yet, and we are working on it upstream. And, what is more important, we will also cover how you can contribute to it. And then, yes, this is the background. FreeIPA is an identity management solution combining Linux, 389 Directory Server, MIT Kerberos, NTP, DNS, and the Dogtag certificate system. One of the most important aspects is that it provides centralized authentication and authorization built on top of very well-known open source components, as you know. And it consists of a web interface, the web UI, and command-line administration tools.
And, yeah, the very first version was released in August 2007, so it's been a while. At that point in time, in the very beginning, one of the main goals was simply to have parity between the UI and the CLI commands, so that you get the same stuff from the web UI and also from the CLI. That was one of the main goals at the very beginning, a very long time ago. The first implementation was based on TurboGears, which is, as you probably know, a Python web framework consisting of several WSGI components, and it is still active if I'm not mistaken - there is still activity on GitHub. This web UI consisted of some sort of a tool wrapper capable of interacting with the IPA server through the IPA API commands. It was capable of sending IPA server commands, fetching the data, and modifying it before displaying. A very basic thing. As you can imagine, in those days there was no React and no Angular - they had yet to emerge. It was plain JavaScript in the end. And there were a lot of limitations: it was necessary to optimize the number of HTTP requests and also to minimize how many JavaScript files existed, and in the end we ended up using a lot of files for the web UI. The first significant evolution of the web UI was driven by the Dojo library. With the help of this library, it was possible to transform the FreeIPA JavaScript packages into AMD modules, so that it was possible to build them into a single file, and this helped a lot with performance - we managed to reduce the number of JavaScript files to be processed. And then, around 2014, all the styles and guidelines for the UI and UX of the web UI were adjusted. This was done to align with the Red Hat Common User Experience, RCUE, which later became PatternFly. The PatternFly that you know today comes from the Red Hat Common User Experience.
So basically, PatternFly is a set of best practices to give users a unified experience when navigating through the UI. And this is how the current web UI looks, the one that you can see today. So this slide summarizes why we are changing it and what the main motivations for this change are. First of all, the current web UI follows PatternFly 2, which is unsupported. This is blocking us from implementing some of the new features we need in the web UI, because we are still on PatternFly 2. It makes evolving the UI very difficult, and, what is more important, we want to stop using outdated tools and libraries, like the Dojo library I was talking about before, which is going to be replaced by React. So then, how are we doing this? This slide tries to summarize how we are going to do it. We are basically following some guiding principles, and the most important one for us is to minimize the disruption for our existing users. It's like when you go to the supermarket and the location of all the products has changed: every time you go, you are unable to find things, because they rearrange everything so that you spend more time in the supermarket and buy more. We are not going to follow that strategy. In the new web UI you will find all the things in the same place; we will focus more on improving the components. So now you can see the technology stack transition we are making. On the left you can see the current web UI, with all the technologies we are using there, and this is what we are moving to now. The first one is PatternFly, from version 2 to version 5. These are all the guidelines and best practices I was mentioning at the beginning.
Okay, then we are going to move from Dojo to React, which is more modular, is based on components, and is kind of the modern way. And another very important piece is how we test the web UI. We are currently using pytest, and we are going to replace that with Cypress and Cucumber. As you know, these two technologies follow behavior-driven testing; they are more human-readable. It is super easy to write tests with Cucumber and Cypress, so this is an important change as well. Okay, let's start with the comparison between the old UI and the new one. The most significant change, I think, is the navigation bar. Instead of having it at the top with sub-tabs, we now have it in the left bar, and it is very well organized in the sense that the sections build on each other: we have three levels now, and you can collapse and expand as you go. This is a huge improvement. It is always visible, and you now also have the ability to hide it to have more space for the content. Another big change in this comparison is the settings form, where we implemented a very cool thing: jump links. You have them available all the time, and when you click on them they move you to the correct view. Also, all these buttons are now floating buttons, and they stay there all the time. The tables have been refactored too, and they are much clearer now, as you can see. Another one is scrolling. This is important because in the old UI, when you were scrolling through the whole list of users, for instance, you lost the navigation bar; the whole UI was scrolling. Now this is not happening anymore: you can scroll and the navigation bar stays there all the time. So this is a very nice improvement as well. Yes.
So, let's continue with modernizing the UI. This is the architecture we're following. On the left part you have the front end: the modern web UI sits on React, and React is kind of an umbrella for all those libraries. From top to bottom: we have PatternFly 5, which we already talked about; the testing libraries, Cypress and Cucumber, which it connects to; then the React Router library, which provides the multi-page functionality. This is because React only builds single-page applications, so this library provides the feeling of moving through multiple pages. That is why we are using it. And as you can see, we have internationalization here, but with dots, because we haven't implemented it yet. It is not available, but it is in our plan, and we need to investigate what the design will be and how we can connect that library with React so that we can also cover different languages and things like that. Then the communication layer is one of the most important libraries we are using: the RTK Query library. We are using it because the connection with the IPA server is through RPC, and it is one of the best libraries to do so. Basically, in the modern web UI we collect the data from the forms, send an API call to the IPA server, grab the response back, and then process the data that we show in the web UI. That's roughly how it works. Basically, this RTK Query layer is capable of performing the operations over JSON-RPC. That's the whole idea. So let's continue. This is about the testing framework I was mentioning at the beginning. We moved from pytest to Cucumber and Cypress, using behavior-driven development. This method, as you know, is more human-readable, and implementing tests is now much easier.
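To give a feel for the behavior-driven style the speaker describes, a scenario like the "add a user" example might read roughly like this. The step wording is invented for illustration and not copied from the FreeIPA test suite:

```gherkin
Feature: User management

  Scenario: Add a user from the web UI
    Given I am logged in as "admin"
    And I am on the "Active users" page
    When I click on the "Add" button
    And I type "jdoe" in the "User login" field
    And I click on "Add" in the modal dialog
    Then I should see the "jdoe" entry in the data table
```

Each step is then mapped to a short Cypress step definition, which is what makes the tests both human-readable and automated.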
Super easy, as you are going to see with this sample, a side-by-side comparison. This is the old version, with pytest, and this is the new one. It's more oriented toward the user: you just describe what you want to do in the web UI, like adding a user. Okay, I open the side menu, I click on this, I click on that, and this is the test. All right. And even better, I think I have a video. It was made by Michal Polovka, one of my colleagues working on this. Yeah, it's working, let's play it. You are going to see now how one test is executed. This goes very fast because it's fully automated, you know. You see? And the good thing is that if something fails at some step, it will freeze there. It will wait until you have a look, and you will see exactly the step that is failing, with all the logs and things like that. So it's amazing. With this framework it's really fun to implement tests now. All these tests are obviously running in the GitHub project; we have GitHub Actions enabled, and it's really easy to implement this. Okay, so if we continue... if this is working... yeah, this is the same video, right? Okay. So now, yes, let's have a look. Before this presentation, I deployed a public instance on EC2 in Amazon Web Services. We have a trick to expose the modern UI, because we haven't implemented the login page for the modern UI yet, so we are reusing the old one to log in. Once you log in, you can access the new one. So let me try. You can try to access it with your laptop and start playing, adding users if you like, or whatever. On my end, I'm going to try to access it now. So, yeah, this is the demo instance, but it's loading the old one.
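Under the hood, the login and every action in this demo go through the JSON-RPC communication layer described earlier. As a sketch, the request bodies sent to the IPA server look roughly like this; the helper name is invented, but the `params` shape (positional arguments plus an options object) follows FreeIPA's JSON-RPC API convention:

```javascript
// Hypothetical helper illustrating the JSON-RPC payloads the web UI's
// communication layer sends to the IPA server. Not the actual project code.

let nextId = 0;

function buildRpcPayload(method, args = [], options = {}) {
  return {
    method,                   // e.g. "user_add" or "user_find"
    params: [args, options],  // FreeIPA convention: [positional, named]
    id: nextId++,             // lets the client match responses to requests
  };
}

// In the real app, RTK Query wraps a fetch of this body against the IPA
// endpoint and caches the parsed response for the React components.
const payload = buildRpcPayload("user_find", [], { sizelimit: 50 });
```

The same shape covers reads and writes, which is why a single RTK Query endpoint definition can serve most of the UI's pages.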
Because I need to log in here first. Do you know the credentials? I guess it was in the slides. Okay: admin and Secret123. So this is supposed to work. Yeah, this is the old one. Okay, so now I'm going to access the modern UI; I just need to change the URL here. No, wait... yeah, this is the modern one. Let me open it a little bit. One of the cool things that Carla told me to show is that this web UI is very dynamic, thanks to React and PatternFly. For instance, if I go here and inspect, you can see how dynamic the new one is: if we start reducing the window size, you see how all the things shrink. For instance, now the panel disappears to leave more space, and if you keep going, all the entries appear as some sort of cards. So this is gradually improving a lot of things here. As I mentioned with the mock-ups, in the navigation you can collapse sections, or hide the entire navigation if you like, and you have three levels, so you can go through more sections. So this is all available. And don't trust this instance too much, because I reset it every night with one of those nice automations in Amazon Web Services that destroys the instance and recreates it again. But you can play with it. So, let's continue with the presentation. Yes, I embedded another video where Carla is showing many more things than I am. It's available in the slides; it points to a YouTube video where she basically walks through a lot of content, which you can also watch if you like. She's using an instance with a lot of data populated, so you will see more content there. All right. So then, as part of the future roadmap, what is next?
I mean, the situation of the project: we have implemented two major sections in the new web UI already, but we need to keep implementing new pages and functionality. It is on the roadmap to improve the routing and figure out how to add direct links to pages; this existed in the old UI but not in the new one, and we need to investigate. We also need to explore performance, because the new web UI is capable of sorting and listing a lot of entry records, and we have limitations with LDAP there, so this is still open and we need to test performance. The internationalization I mentioned before is also still missing. And we would love to implement a new login page, and in this new login page we would love to enable new authentication types: for instance, if you have been in the other talks, passwordless, external IdP, all that kind of stuff we will implement in the new login page. There is also another topic, a bit further in the future once the project is more advanced: how are we going to adapt all those external plugins that users and the community connect to the web UI? We need to provide a solution for them as well. And the last thing is that we are driving this whole initiative on GitHub. We engage a lot with the community, and, as I can show you in the GitHub project, something we strive for is that it's super easy to contribute. If you like these technologies and would love to work with the Red Hat web UI team, you will be able to do so. This is the project, and it's super simple: you just follow what's in the README file, and in about one minute you can get the whole development environment on your own laptop. And, oh yeah, I see. Wow. I'm out of time.
Yeah, so everything is well explained; you can bring the environment up, and yeah. Sorry. I think that was the last slide. The next one is about how to contribute, but I already explained that; there are open discussions as well. So, yeah, that was it then. I guess we don't have time for questions, or... Yeah, that's a good question. The development is happening upstream, so it's very easy to follow what is happening there. Let's say there are three main sections in the web UI project, and we have implemented two of them. The most difficult one was the first one, because you need to take decisions and investigate things, but now almost everything is sorted out and we are speeding up a lot with the project. We believe that maybe by the end of the year we will see the new one; we will see. But we can speed up if we get more contributions. Thank you.
Your web app is taking up too much RAM. Let's fix it!
Hello everyone. Can you guys hear me properly? Nice, perfect. Yeah, today I want to start my presentation with quite a bold claim: your web app is taking up too much RAM, and we can fix it. This comes from something I noticed recently: if you look at your Chrome browser and hover over a tab, you will see that Chrome has started, for a while now, to tell users the memory usage of your app. For most applications, such as GitHub, even while looking at a pretty big diff, the memory usage is not that bad. I mean, 122 megabytes was a lot in the 2000s, but now it's not that much. But if you look at other websites that are a bit more expensive, such as Airbnb, you can see that if you load a pretty big page, the memory usage goes way up: we're talking about half a gig of RAM being used by the browser. And I was wondering: is it our fault? Is it the browser? What's in that memory that is being used? We can find out how much of that is actually used by the JavaScript virtual machine: by our variables, our functions, our code. The way to do that is by opening the DevTools, where there is a special tab called Memory, and for each JavaScript virtual machine that is running you can see how much memory it is taking up right now. In the case of Airbnb it was about 111 megabytes, which is not much, but it starts to be quite a bit, especially when GitHub was around 10 megabytes in comparison. And then you look at some more extreme examples: here I properly stress-tested Notion by loading a quite big table, and we got to 1.5 gigabytes of RAM used just by JavaScript variables. That was quite wild, because if you think about it, that's a lot. That's a lot for a web page.
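The talk reads these numbers from the DevTools Memory tab; as a scriptable approximation (an assumption, not something the speaker shows), you can also query the JS heap programmatically. Chrome exposes the non-standard `performance.memory`, and Node.js has `process.memoryUsage()`:

```javascript
// Rough programmatic check of JS heap usage. Chrome's performance.memory is
// non-standard and Chrome-only; process.memoryUsage() is the Node.js analogue.
function jsHeapUsedBytes() {
  if (typeof performance !== "undefined" && performance.memory) {
    return performance.memory.usedJSHeapSize; // Chrome only
  }
  return process.memoryUsage().heapUsed; // Node.js fallback
}

const before = jsHeapUsedBytes();
const big = new Array(1_000_000).fill(0); // hold a large array so the heap grows
const after = jsHeapUsedBytes();
console.log(`heap grew by ~${((after - before) / 1024 / 1024).toFixed(1)} MB`);
```

These counters are coarse (the GC can shrink them at any time), so they are a sanity check rather than a substitute for a real heap snapshot.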
And there are even worse, or I would say more difficult, examples, like the product I'm currently building. I'm building this web-based tool called Flux, which is a tool for designing electronics in your browser, and it is quite complicated because electronics is made up of a lot of different parts. It is built using TypeScript, React, Three.js, and React Three Fiber, so we use a bunch of technologies, and also a bunch of abstractions to make our life easier, and that had an effect on us. Because we wanted to be able to render very complicated documents with a lot of different shapes and text, and everything has to run at 60 FPS, you can see how holding a big project can take a lot of RAM, and that's something that backfired a bit. Why? Well, because originally we focused a lot on performance: we wanted everything to load very quickly, we wanted scrolling to be fast, and we just optimized for speed. We were like, yeah, memory is cheap, let's just use all the memory we have, so we optimized what the CPU profiler said, not what the memory profiler said. We actually did this because of an article from a while ago that said: if you're building React apps, just memoize everything, just cache everything you can, because it is not going to be an issue in most cases. Little did we know that we were one of those cases. And in fact you can see how, if you loaded a pretty big document, at least before this talk, the app would take too much RAM. But I can already hear someone say: well, okay, I have 16 or 32 gigs of RAM on my desktop, why do I care about memory usage, we're not in 1999 anymore. Well, there are still a couple of reasons why we care about this now, and one of them is out-of-memory crashes. If you're not optimizing memory usage, the browser will limit you.
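The "just memoize everything" advice can backfire precisely because every cached value is retained for as long as the cache lives. A minimal sketch of the difference (illustrative only, not the code from the talk):

```javascript
// Unbounded cache: every key/value pair stays retained for the app's lifetime.
function memoizeUnbounded(fn) {
  const cache = new Map();
  return (key) => {
    if (!cache.has(key)) cache.set(key, fn(key));
    return cache.get(key);
  };
}

// Size-capped cache: oldest entry is evicted, so old results can be GC'd.
// (FIFO eviction for brevity; a real LRU would also re-insert on cache hits.)
function memoizeCapped(fn, maxEntries = 100) {
  const cache = new Map(); // Map preserves insertion order: first key = oldest
  return (key) => {
    if (!cache.has(key)) {
      if (cache.size >= maxEntries) cache.delete(cache.keys().next().value);
      cache.set(key, fn(key));
    }
    return cache.get(key);
  };
}
```

With the unbounded version, a long session accumulates every result ever computed; the capped version trades some recomputation for bounded memory.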
In most cases, for example in Chrome, if you go over four gigabytes you will get this: an "Aw, Snap!" error code 5, which is an out-of-memory error, and there is no way to catch it, no way to handle it. The only thing you can do is prevent it from happening in the first place, because at that point you need to refresh the page to fix it. And on iOS it's even worse, because there the limit is sometimes lower, and no one is really clear about what the limit is. For example, on Safari on iOS the limit can sometimes go as low as 300 megabytes, and this is what you get: your browser loads the page, tries to load it, goes out of memory, refreshes the page, and enters an infinite refresh loop, which your users will report. That's when your product manager comes screaming into your office: why is the application not loading on my phone? Because you're using too much RAM. So yeah, clients might have a lot of RAM, but your browser doesn't care; it will not let you use it. Another thing we care about is garbage collection performance: the more you allocate, the more you will need to deallocate later, and that's something you have to care about because in some cases the garbage collection times can really hurt your performance. This is a bit of an extreme case, about one minute of garbage collection, but here is something a bit more realistic: we were debugging an event handler that was supposed to run on mouse move, so something on the hot path, and the major garbage collector took 0.5 seconds, which meant a sharp drop in frames per second just because the garbage collector had to kick in. So that's another thing you want to watch if you care about performance.
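One generic way to reduce GC pressure on a hot path like a mousemove handler (a common technique, not necessarily the exact fix from the talk) is to reuse a preallocated scratch object instead of allocating a fresh one per event:

```javascript
// Allocates a new point object on every call: each event produces garbage
// that the minor GC must later collect.
function toWorldNaive(x, y, scale) {
  return { x: x * scale, y: y * scale };
}

// Reuses one module-level object: no per-event allocation, so nothing for
// the GC to do. The caller must copy the values if it needs to keep them.
const scratch = { x: 0, y: 0 };
function toWorldReused(x, y, scale) {
  scratch.x = x * scale;
  scratch.y = y * scale;
  return scratch;
}
```

The trade-off is that the reused result is invalidated by the next call, which is fine for fire-and-forget work inside an event handler but dangerous if the value is stored.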
Also, memory is part of your performance optimization strategy. And another thing, as I showed before, is that Chrome now shows the memory usage of your website to your users. So if your users have a dozen tabs open, or if you are insane like me and have ten browser windows open with a thousand tabs each (yeah, I should start closing them, maybe tomorrow), the users will be able to see that it's your website that is taking up their entire RAM, and they will not be happy with you; now they will know which one it is. So yeah, we're in this situation, and in my case, for example: how do we solve this? How did we approach this problem at Flux? Well, first of all, it's important to figure out what is occupying memory. Once you do that, there are multiple strategies you can use to kill it with fire. And then we also want to make sure we're not making the same mistake again, i.e. we can set up some checks in CI, or set up monitoring, even with remote users, to check that the memory usage is not getting bad again. In today's talk I want to focus mostly on the first point, because that's already a lot to talk about. Before going into the tooling, I want to introduce some ideas about memory usage so we know what we're talking about. I like to make some distinctions when talking about memory usage; this is something I made up during my analysis. I noticed there is a pattern of having either static or transient memory usage. What are we talking about here?
Static memory usage is when you have variables that take up a lot of RAM but are long-lived: global variables, state that stays there and doesn't really change throughout the run of your application. That's basically what you would find in a heap snapshot; the easy case, for example a document that loads and takes up a lot of RAM. But you don't necessarily have a situation like that: sometimes you have a transient peak of memory usage, which means, for example, that the user clicks a button, and that button triggers a very quick operation which allocates an array with one million elements. You see it as a peak in memory usage at that point, and that can sometimes be harder to debug, because you won't find it in a heap snapshot: a heap snapshot just takes an image of what's in your RAM at that moment, and a peak of memory that gets deallocated immediately won't show up in it. So there are different strategies depending on whether you have the first or the second type of memory problem. Another thing I like to consider is the count and the size of stuff. Why?
Because you might have a very happy situation, analysis-wise, in which you're allocating a 500-megabyte string or a 500-megabyte array. That's very different from allocating millions of small objects, and depending on whether you have the first or the second situation, you need completely different approaches to analyze it: a giant object will show up in the memory profiler immediately as one very big object, while millions of small elements are much harder to analyze, because you will need to check what's inside those four-byte objects. Another thing I like to bring up is the difference between shallow and retained size. These are two terms you will see in the memory profiler, and the reason for them is that in JavaScript everything is a pointer. If you have an array of strings, it's actually an array of pointers to strings, so the array itself can be very small, on the order of bytes, but the stuff it points to can be giant: it could be pointing to a lot of one-megabyte strings. When we talk about shallow size, we talk about the size of that allocation itself, such as the array, which is small; but that array is causing other memory to stay allocated, because it refers to those one-megabyte strings. The retained size is instead the total amount of memory that that object or array is forcing to stay, preventing it from being deallocated. And there is one last topic that is also quite complicated: allocation types. In JavaScript there are multiple things you can allocate: you have different object types, you have code, you have strings, plain arrays, typed arrays, and also closures, and each of those behaves differently in memory. One cool thing you get from this is that, for example, functions are also something that can take up memory if you're not careful, because functions need to save
all the variables that are around them. So technically a function is an object as well in JavaScript, and this means that if you are creating functions in a loop, for example, that can become a memory problem, because it's the same as creating an array of objects. Sometimes you can just look into the V8 and Chrome documentation and find a lot of interesting things about how memory is used internally, but that's a topic for another talk, I would say. Instead, I want to look into tooling: if we are in a situation in which we have a lot of memory usage, what are some tools we can use to start analyzing what is going on and solve the problem? Well, the most famous one is the Chrome memory profiler, which is that Memory tab you probably saw next to Performance in the Chrome DevTools. It's quite powerful because it can work in three different modes; I think the most interesting ones are the heap snapshot and allocation sampling, which work in very different ways for different purposes. With the heap snapshot you can take a big snapshot of everything that is in your RAM, everything JavaScript is working with: imagine that you created a lot of variables in your code; with this you can save all of them and look at what's inside them, which is really cool because you can even see the values you have there. And for each allocation you can also see the retainers: why is this in memory, who created it, and who is holding references to it. That's useful to determine which function caused that thing to stay in memory. Heap snapshots are very useful if you want to check things like static memory usage, because they take a snapshot at a point in time. If instead you're more interested in transient memory peaks, as I said before, there is this other tool called allocation sampling, which works by
accumulating every allocation that happens. This means everything that gets allocated is recorded here, but you don't see the deallocations, which means you can't really measure how much RAM you're using; you can only measure who is creating it. In our case, we had not too many objects, but some of them were taking a lot of RAM, like 89 megabytes; that's a lot. We had one specific object taking a giant amount of memory, about 80 megabytes, and by looking at the retainers we were able to immediately figure out which function was allocating and retaining that stuff. That was one of the very first optimizations we managed to do, because this way we went into the code, into that function, and realized how we were basically creating a bunch of functions (this is React code) and a bunch of string UIDs and saving all of them in a Map, and apparently that's incredibly inefficient. It's probably not code that looks inefficient at first glance, but if you call it thousands of times, this was apparently taking up 80 megs of RAM. So how did we solve it? We refactored it a bit, using a Set instead of a Map; it's very experiment-based. With this we got about a 50 percent improvement in memory usage, which was huge, and it really made the difference between being able to load some projects at all or having documents that would just crash your browser. That was one of the first big wins we had, so we were like: yeah, okay, let's continue, eventually we will reach zero megs of memory used, right? No. Immediately after that we hit pretty much a brick wall: we were taking heap snapshots and seeing that we had two million objects taking a lot of space, and it's not that we had one big object to optimize; each of them was a couple of bytes, and the heap profiler really doesn't help you in those cases. And that's interesting
because that's pretty much the same situation you will find if you try to profile that same Notion page I tested before, or even Airbnb; it's actually the same problem. And unfortunately the answer is: the problem is React, kind of. We are in the same situation as Notion: those two gigs of RAM are just being occupied by a lot of small objects. So yeah, we hit a brick wall. But what do we do now? The heap profiler is very bad at analyzing this kind of stuff. Thankfully, we can export from it: we can export a giant five-gigabyte JSON from Chrome. Then we look at the JSON and see that it's in a format that is pretty much unreadable. But thankfully someone did the work for us: the folks at Meta built this beautiful tool called MemLab, which is a toolkit for exploring memory usage. It's very focused on finding memory leaks (it has an entire automation for that), but I think this part is even cooler: it provides a very powerful API for opening snapshots from Chrome and analyzing them. Basically, you can read the objects in memory and perform analytics on them. For example, we wanted to answer this question: which types of objects are taking up the most space out of the two million we found in a snapshot? This is some code that we wrote; I don't think we have time to go too much into it, but I can publish it. The idea is that we load the snapshot, so the current state of memory, find all the object types, compute the total shallow size for each type, and then sort and print the results. And the results were very cool, because for each object type, even including the keys of the objects, we could see how much memory it was occupying, and we were able to see that in the top two we have one object called FiberNode, which is
from React, and another object that had baseQueue, baseState, memoizedState, next... what is that? That is not something that came from our application; that's React again. It's the data structure used internally for keeping track of hooks, and when we went into the React source, there was exactly that data structure. In most websites that use React heavily nowadays, this is pretty much the thing that occupies the most memory. So we figured out that keeping track of hooks is expensive. But are we supposed to just tear out the 400,000 lines of React that we have in our app right now? That's a bit too far into the development. So we wanted to know precisely what we needed to optimize, and we used MemLab again, this time digging even deeper into this FiberNode data structure used by React, running a lot of statistics on it to figure out which React component was taking up the most memory, so that we could optimize that specific component first. And we managed to do this: we were able to divide the whole memory usage by React component and see, for each hook, how much memory it was using. With this we found one specific React component that was using a lot of memory, and we cut the memory usage down again, by 60 percent, which was pretty nice. MemLab really saved us here, because we were able to make our app work properly. It also made it possible to answer other questions, like: out of all the strings that we have in our app, how many are UIDs? Should we start optimizing UIDs and make them numbers? Well, no, because we used MemLab to find all the UIDs and found out they were about two megabytes in total, so who cares. It's also nice to know what not to prematurely optimize. So, to sum up everything I said: I think we can all agree that memory analysis is actually difficult, especially because it varies so much
between applications, between frameworks, between browsers. But it's important, even in a world like today's in which we have a lot of RAM, because for some apps it really makes a difference: the difference between being able to use Notion on your phone, or the app constantly crashing and never loading your data. And the thing is that the Chrome profiler is cool, but sometimes it's not enough; thankfully it can export, so at least you can perform your own analysis externally. So thank you for listening to my presentation. Are there any questions? I see a question here. Yes: you were talking about shallow size versus retained size; when would you ever be interested in looking at the shallow size? The retained size sounds like the more interesting one. So, he asked when we care about shallow size when we also have the retained size. Well, we care a lot about shallow size; in our case it was all about shallow size, and we had to write our own custom plugin for MemLab to analyze just the shallow size. Why? Because if you are analyzing very big objects, there are thousands of entries, and in that case you have to use tricks like virtual scrolling: instead of allocating all the DOM elements, you keep reusing the same ones. And if you think about it, that's like ejecting from React, because you are creating something with just JavaScript and the DOM and then building a React wrapper for it. So that's another thing it shows: React is good at orchestrating stuff, but when it comes to the performance-critical things inside your application, you need to start optimizing differently. Just a small remark before we continue with the questions: please, if there are spaces, try to squeeze in and not leave spaces in the middle; as you can see, we have hundreds of people waiting outside and here as well, and we cannot have that many people in the aisles, so please try to
squeeze. Don't save free seats for your jackets or something; put them on your lap. Thank you. And since we're starting to be quite a lot, if you're going to go out, please try to go out from the right side and avoid going out from the left side, so that it's easier for everyone. Thank you. We have a question here first. It's more of a comment than a question. The thing is that this four-gigabyte limitation on memory comes from the fact that Chrome compresses pointers, so that small objects take less space, basically. That's one thing. The second thing is that it's a security mitigation: when there is some bug in V8, it's harder to exploit. But I've also read on the Chromium bug tracker that there is, for example, a 16-gigabyte limit for fixed arrays, so there may be different limitations for different things. WebAssembly also has a different limitation, and supposedly Electron apps don't have limits. Yeah, that's very cool, thank you. I think that Firefox has pretty much the same limitations. Oh, he asked me if we also tried with other browsers. I'm actually mostly working on Firefox, and Firefox has very similar limitations, and sometimes it's even worse, because sometimes we notice that the app randomly takes more memory in Firefox for some reason. Some things are more optimized in Firefox, other things are more optimized in Chrome, so that's very complicated to answer, unfortunately, because it seems like the answer is either you look deeply into the source code of the browsers (I still haven't reached that point, unfortunately) or you do trial and error. Ah, and the tooling: Firefox also has tooling around this which, if I remember correctly, is more focused on analyzing the memory usage of the DOM elements, and it also has some facilities for analyzing heap snapshots. But since MemLab works with Chrome heap snapshots, we went with that immediately.
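The kind of external, per-object analysis described in this talk can be sketched in a few lines. Below is a hypothetical, minimal aggregation of shallow sizes per constructor name from a V8 heap snapshot; the function name and the tiny synthetic snapshot are mine, not from the talk, but the layout (a flat `nodes` array whose field order is declared in `snapshot.meta.node_fields`) is how the snapshot format actually works:

```javascript
// Sketch: sum the *shallow* (self) size of every heap node, grouped by
// constructor name, from a parsed V8 heap snapshot JSON. The field
// layout is read from the snapshot's own meta section, so the code
// adapts to the actual field order of the file.
function shallowSizeByName(snapshot) {
  const fields = snapshot.snapshot.meta.node_fields; // e.g. ["type","name","id","self_size",...]
  const stride = fields.length;
  const nameIdx = fields.indexOf('name');
  const sizeIdx = fields.indexOf('self_size');
  const totals = new Map();
  for (let i = 0; i < snapshot.nodes.length; i += stride) {
    const name = snapshot.strings[snapshot.nodes[i + nameIdx]];
    const size = snapshot.nodes[i + sizeIdx];
    totals.set(name, (totals.get(name) || 0) + size);
  }
  return totals;
}

// Tiny synthetic snapshot, just to show the shape: two FiberNode
// objects and one Array. Each node = [type, nameStringIndex, id, self_size].
const fake = {
  snapshot: { meta: { node_fields: ['type', 'name', 'id', 'self_size'] } },
  nodes: [0, 0, 1, 128, 0, 0, 2, 256, 0, 1, 3, 64],
  strings: ['FiberNode', 'Array'],
};
console.log(shallowSizeByName(fake)); // FiberNode: 128 + 256 = 384, Array: 64
```

A real snapshot exported from Chrome DevTools has the same structure, just with millions of nodes, which is exactly why this kind of scripted aggregation beats scrolling through the DevTools table.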
And how do you go about running this in CI? Oh, yeah, that's a complicated thing, because running it in CI is pure pain. You can use MemLab and run it in CI, because it uses Playwright or Puppeteer (I think Puppeteer), and with it you can orchestrate tests that open a page; it can even use some machine-learning algorithm to find memory leaks. The problem with doing that is that it's fine if your app is small. If your app starts to become bigger, then you will need a CI machine powerful enough to run your app and the profiler on top of it, which for us meant the CI time went up by something like 30 minutes, which was unacceptable. So eventually we removed it, but you can do it. Are there other questions? There's a question there: can you read the memory usage from the browser itself, or something like that? Yeah, that's another complicated thing, because if you are using Chrome (I don't think Firefox allows this, but Chrome does) there is a specific performance memory property that you can use, and you can check both the maximum allowed heap size and read an estimate of the current memory usage. In our case, once we read that, we constantly send the data to Segment, and then we analyze it in Amplitude, with which we can keep track of memory usage; we also do that for the performance timings. The problem is that we noticed the data very quickly becomes bogus, because it depends a lot on what the user is doing and on when the garbage collector kicks in. Sometimes memory goes up to four gigabytes and then, no problem, it goes down to 500 megabytes. So it's extremely difficult to capture memory usage, because you don't have a precise measure of how much of the total retained memory is active and how much is actually inactive and going to be garbage collected soon. We tried to do it, and we have some charts showing how much memory is being used, but it's very hard to make sense of them, unfortunately. Any other questions? You still have around five minutes for questions.
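The in-browser sampling described in that last answer relies on Chrome's non-standard `performance.memory` API. A minimal, hedged sketch (the function name is mine; the three property names are Chrome's, and the API simply does not exist in other environments, so the sketch guards for that):

```javascript
// Sketch: sample the JS heap in Chrome. performance.memory is a
// non-standard, Chrome-only API; elsewhere (Firefox, Node) it is
// absent and this returns null.
function sampleHeap() {
  if (typeof performance !== 'undefined' && performance.memory) {
    const { jsHeapSizeLimit, totalJSHeapSize, usedJSHeapSize } = performance.memory;
    return { jsHeapSizeLimit, totalJSHeapSize, usedJSHeapSize }; // bytes
  }
  return null; // not running in Chrome
}

// In an app you might sample periodically and ship the result to your
// analytics pipeline, as described in the talk:
// setInterval(() => { const s = sampleHeap(); if (s) sendToAnalytics(s); }, 60_000);
console.log(sampleHeap());
```

As the speaker notes, a single sample is noisy because of garbage-collector timing, so these numbers are only meaningful in aggregate.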
Unraveling JavaScript's Heart: Mastering the Event Loop for Peak Performance
So, our next speaker is Antoine Perret, who is one of our local superstars. He's the CTO of Rosa, which is a super nice company in the health sector; maybe he will tell us more about it. And he's going to talk about the heart of JavaScript, which is the JavaScript event loop. A big round of applause for Antoine. All right. So everyone here has probably heard that sentence: do not block the event loop. And you might have heard or read on the internet that you should prefer asynchronous code over synchronous code. So, a question for you: do you believe that as long as you're using asynchronous code, you're safe and you will never block the event loop? Who believes that async gets you there? No one. Okay, that is good. So today we're going to look at this and ask: what does it mean, not blocking the event loop? What does it mean, not using synchronous APIs, and does using asynchronous APIs keep us safe? I'm the co-founder and CTO of Rosa, and Rosa is building a patient application. We want to help people live healthier for longer. When you look at the kind of applications that we build, part of Rosa can be labeled CPU-intensive: there are some parts of our application where we do heavy computation. We have a calendar application, we have registries that are deployed at hospitals, and so on. And when you think about the kind of computation we do: yes, sometimes you have to compute recurring appointments, so occurrences of recurring appointments. Sometimes we have to compute hashes to store passwords securely. Sometimes we have to parse large files such as iCal. And sometimes, of course, we do diffing, because we have the schedule of a health professional and a list of appointments, and we want to know when that health professional is available. So today we're going to talk about the event loop.
We are going to talk about how not to block the event loop, and the questions we are going to ask ourselves are: does it scale? What if traffic does 3x, 10x? Is there a possibility of a denial of service? Because as soon as you block the event loop, you open the door to a denial of service. So: why was Node created, what is the event loop, then a case study of how we can hash secrets using bcrypt, thread pools to the rescue, and the metrics of the event loop. That's the agenda for today. All right. If you look at one of the first talks by Ryan Dahl, the author of Node.js, he talks about non-blocking I/O. He compares the situation in which you query a database with a blocking I/O system to the Node.js implementation, where you have non-blocking I/O. And the reason he wanted to build that is that most web applications are I/O-bound: they spend most of their time waiting for an external server to answer, waiting for the database to answer. So when you have a server with blocking I/O, what you do is create a thread for each connection, which means a memory overhead, because each time you create a thread, there is memory associated with it. And so you can't scale, because each time you need to handle a connection, you need more memory. The solution to that problem is to create an event loop. But as soon as you start to use an event loop, you require non-blocking I/O. So Node.js was born because most web apps are I/O-bound, and because the CPU and I/O live on two different scales of time. The CPU, with its gigahertz frequency, means that one cycle of the CPU takes about one nanosecond. And you have to compare that to a round trip between, say, California and the Netherlands, which takes approximately 150 milliseconds.
But it is kind of tough to build a mental model around that, so let's make it easier. We are developers; we all love to drink coffee, and we also love to watch some shows on Netflix. So, on this scale, a fast CPU operation such as taking a mutex lock is like making yourself a coffee: it takes 25 seconds. Watching a Netflix episode will last for an hour, approximately. That is the world in which a CPU lives. Now, if you compare that to the world of I/O, it is the equivalent of taking a five-week vacation, or of studying for five years at university. So the danger zone is when you take your CPU with you on vacation. Basically, the danger is when you block your main thread because you're performing a CPU-intensive operation, while Node.js was designed with the idea that you would drink coffee, not go on vacation. Keep those figures in mind for the rest of the talk. What the heck is the event loop? Who is familiar with this representation? Good. Philip Roberts gave an excellent talk about this 10 years ago, and you can play with his tool on the internet. It's getting a bit old, but it's still really, really good. So let's have a quick look at what it does. Here, what you see is code that you wrote, and we'll see what happens when JavaScript executes that code. It's quite simple: we have a setTimeout with a five-second delay, and we have a console.log. So obviously, we all know that "Welcome to FOSDEM" will be printed to the console first, and that after close to five seconds (because that delay is not a guarantee) we'll see the second message. Let's play that video, if it works... the video can't be loaded; it's not a problem. Basically, what the video shows is that it takes that code and puts it on the stack.
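The slide's snippet can be reconstructed roughly as follows (the exact message strings are my guess from the transcript, and the delay is shortened here from five seconds to a few milliseconds so the sketch finishes quickly):

```javascript
// setTimeout hands its callback to the runtime's timer machinery, so the
// synchronous console.log runs first; the callback runs only once the
// stack is empty and the delay has elapsed (the delay is a minimum, not
// a guarantee).
const messages = [];

setTimeout(() => {
  messages.push('timer callback (after the delay)');
}, 5); // the talk uses 5000 ms; 5 ms keeps the demo short

messages.push('Welcome to FOSDEM (synchronous)');
console.log(messages[0]); // always the synchronous message
```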
When it's executed, because we have a setTimeout, it registers the timer and starts the five-second countdown in the web API part. During that time, the console.log calls run one after the other. You can also build a mental model around the fact that the loop really has multiple phases: timers, pending callbacks, idle and prepare, poll (where Node.js asks the OS: hey, do you have any network connection for me? has any file been read?), and so on. And when you understand the different phases, you can answer questions such as this one: Promise.resolve().then(() => console.log('promise')) versus process.nextTick(() => console.log('nextTick')), which one will be executed first? Well, it depends on where each will be picked up at the level of the event loop, which phase of the event loop is involved. Node.js's architecture is inherently multi-threaded. We have all heard that JavaScript is single-threaded. What we mean by that is that there is one single thread to execute your JavaScript code, but it doesn't mean that Node.js itself is single-threaded. So how many threads or processes do we have in a typical Node.js application? One, two, three? Anyone? Four? What about five-ish? Five-ish is a good number. You have a thread for the main event loop, for the main stack. You can have threads or processes for garbage collection. You have libuv, and libuv also manages a pool of four worker threads. So you have at least five-ish processes or threads when you run your Node.js application. All right, that is a bit too complex for today, so we're going to simplify, and I've grouped the different parts of that architecture into blocks of the same color. We're going to look at the orange square: the main stack, which runs on the same thread as the event loop, in red.
Then we're going to have one single queue (we're not going to distinguish between microtasks and tasks), and everything else is going to be lumped together and called the Node.js API. That is what we are going to work with today. Good. So: prefer asynchronous code over synchronous code. Let's first look at what that means when we use core modules, and then at what it means when we use npm modules downloaded from the internet. The fs module, reading a file: when you use the asynchronous API, readFile with a callback, it is non-blocking; when you use readFileSync, it is blocking. What is the difference? The sync version runs on the main thread, while the async version runs on one of the workers of the thread pool. It doesn't mean that at the OS level reading the file is non-blocking, but from the perspective of Node.js it is non-blocking, because it doesn't block the event loop of the main thread. So the sentence "prefer asynchronous APIs over synchronous APIs" is absolutely relevant and true in the context of a core module, because the async version runs its work on a worker thread. But what about a pure JavaScript library? How does that work? To answer this question, we are going to use the example of bcrypt. Bcrypt is a way to create a hash in order to store secrets securely, and it is interesting because that operation can be quite CPU-intensive, depending on the number of rounds you perform, that is, how secure you want the hash to be. If you go on npm, you will see that there are multiple implementations of bcrypt: a pure JavaScript implementation known as bcrypt.js, and a C++ implementation known as bcrypt.
And here it is interesting, because both have sync and async APIs, so we can compare what happens with the pure JavaScript implementation and with the C++ implementation. Let's look at the pure JavaScript implementation: hashSync versus hash, basically the same kind of API as the fs module, right? What we are going to do is imagine two servers that each receive five requests. The first request is: you take your CPU on a five-year study at ULB to obtain a degree, so you perform a super intensive operation. Then come four requests that are quite fast: you just watch a Netflix episode. And we are going to compare what happens in both cases. Now, the trick is that bcrypt.js is a smart and well-implemented library, and its asynchronous API is implemented so that when it has to compute a large hash, a long operation, instead of doing all of it at once, it splits the work into smaller chunks. Good. So synchronous is on the left, asynchronous is on the right, and we look at those five requests: the big one and the four faster ones. At some point, the endpoint with the hash computation is called, and we put the computation of the hash on the stack. On the synchronous side, we have that big blue square that needs to be performed, and as the computation progresses, green fills in the blue square to show progress. On the asynchronous side, we have a chunk, a smaller square. At some point, the second request, the first red request, the first episode you're watching, reaches your server. It has a callback, it has operations to perform, and it gets queued. Notice that in the case of the asynchronous API, we're quite close to being done with the first chunk. The first chunk finishes, and then bcrypt schedules the second chunk to be run.
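The chunking strategy just described can be imitated for any long computation: do a slice of work, then yield back to the event loop with setImmediate so queued callbacks get a turn in between. This is a simplified sketch of the idea, not bcrypt.js's actual code:

```javascript
// Chunked CPU work: compute the sum 0..n-1 in slices of `chunkSize`
// iterations, yielding to the event loop between slices. Other queued
// callbacks (the "red requests" in the talk) are then delayed by at
// most one chunk instead of the whole computation.
function chunkedSum(n, chunkSize, done) {
  let total = 0;
  let i = 0;
  function runChunk() {
    const end = Math.min(i + chunkSize, n);
    for (; i < end; i++) total += i;
    if (i < n) {
      setImmediate(runChunk); // yield: let the event loop serve other tasks
    } else {
      done(total);
    }
  }
  runChunk();
}

chunkedSum(10, 3, (total) => console.log(total)); // 0+1+...+9 = 45
```

The trade-off is the one drawn on the slides: the big computation itself takes slightly longer overall, but every other request's latency drops dramatically.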
What that means is the stack will be empty; bcrypt uses the Node APIs to schedule the next chunk, and the Node APIs push another callback to compute the second chunk. So we go on, and now you see that on the synchronous side we continue to move forward with the computation, while on the asynchronous side we are executing callback one. We continue; same here: at some point the stack is empty, and because the stack is empty, the event loop picks the next task, puts it on the stack, and we continue the computation. As each chunk is done, or the callback is executed for one of those red requests, when the stack is empty we pick up the next task in the queue, and we go on, and we go on. On the synchronous side, everything is blue first, and only then do you get callback one, callback two, callback three, callback four. In the case of the async API, you chunk the work, and because you chunk the work, in between those chunks your server can handle other requests. Now, if you start to draw some lines and analyze the response time, the point of view of the user, this is what it looks like, and you get that kind of chart, where you can see the duration of the big blue request and the duration of each of the red requests. In the first part, each red request is delayed by the entire long computation. In the bcrypt async part, at the bottom, it is delayed by at most one chunk. That's why you have smaller timings for the red requests. What happens if you do the same exercise with the native C++ implementation? Because it is a native implementation, when you use the async API it behaves the same way as the fs core module: it is executed on a worker thread. And if it's executed on a worker thread, here's what the timings might look like.
You basically have a timing that corresponds exactly to the computation needed for each red request, and there's no delay at all. There is a small difference between the C++ and the JavaScript implementations (the C++ implementation will be faster), but what matters here is whether the code runs on the main thread or in a worker thread; comparing the speed of the two implementations is not the point in this case. Sometimes you do have to take your CPU on vacation; sometimes you do have to do a heavy computation. What if you do not have a native implementation, or you have a slow operation? Well, if you really have no other choice than to take some vacation, my advice is: be sure to have a pool. Take your swimsuit with you, because with libraries such as Piscina ("swimming pool" in Italian) it is possible to create pools of threads that can execute JavaScript code. The API is quite straightforward, and in the end it means that instead of having one stack to execute your JavaScript plus a set of other threads to execute native code, you can create other threads in which JavaScript code will be executed. For example, say you create two pools: one pool with four threads to compute bcrypt hashes, and a second pool to compute recurring events. In that case, when your code is pushed to the main JavaScript thread, the main thread communicates with the pool and says: hey, execute that computation for me. And the pool distributes that computation among the different threads it has created. So here is what it looks like when you use a pool. It's quite efficient, it's quite nice. Is it a silver bullet? Well, no, there are several things you need to take into account. You need to choose the number of threads wisely.
You need to determine when to use a pool, and make an analysis. You need to be sure that the machine on which you run your application has enough cores, because in some situations it can be counterproductive to create too many threads and have too many processes running. And of course, you will have to monitor and check the memory usage. All right, how do you know when you need to create a thread pool? For that, you need to measure how the event loop is behaving; you need to measure the health of your event loop. One of those metrics is, for example, the event loop delay; another one is the max CPU time. And there are tools to help you get there: I strongly recommend Doctor, from Clinic.js. It will give you a nice graph and show you when there is a delay in your event loop, when your event loop is blocked. Measuring it yourself is not complex; this is all you need to measure the delay of the event loop on your server or in your Node application. What you basically do is set an interval with a one-second delay, so every second you execute a callback that does a setImmediate, and you compare the time at which you scheduled it with the time at which it is actually executed. The difference between the two, the start and the end, gives you the delay of your event loop. Time to wrap up. Do not block the event loop: what we really mean by that is not about async versus sync; it is about not performing CPU-intensive tasks on the main thread. That is what you have to remember. As long as you do not execute CPU-intensive tasks on the main thread, your application will be fast and smooth. So here is a couple of pieces of advice: have some coffee, drink as much coffee as you want; enjoy the show; and take some time off from time to time. Thank you, FOSDEM, for having us today. Are there any questions? One question there. You have to speak up. How does Piscina compare to the Node cluster API?
So the question is how Piscina differs from the Node cluster API. My understanding is that the Node cluster API basically means you are going to have multiple instances of the same application, while with Piscina, one instance of your application will have multiple threads.
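The delay-measurement recipe from the wrap-up boils down to a few lines. This is a hand-rolled sketch (the function name is mine); for production use, note that node:perf_hooks also ships a built-in monitorEventLoopDelay() histogram:

```javascript
// Event-loop delay probe: schedule a callback for "as soon as the loop
// is free" and measure how late it actually runs. On an idle loop the
// delay stays near zero; on a blocked loop it grows.
function measureLoopDelay(report) {
  const scheduled = process.hrtime.bigint();
  setImmediate(() => {
    const delayMs = Number(process.hrtime.bigint() - scheduled) / 1e6;
    report(delayMs);
  });
}

// Sampling once per second, as suggested in the talk:
// setInterval(() => measureLoopDelay((d) => console.log(`delay ${d} ms`)), 1000);
measureLoopDelay((d) => console.log(`event loop delay: ${d.toFixed(2)} ms`));
```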
Codebase Conquest: How Nx Turbocharged Our React Workflow
Thank you all for being here and for waiting, and sorry for that. Our next speaker is Nicolas, a staff engineer with a lot of experience, and he is here to talk about Nx and an actual use case he encountered during his time at Hasura. A round of applause for Nicolas. So, does your build time keep getting longer? Well, maybe we can extract some packages into other packages. But then the extracted packages start to explode the dev time to work on and integrate into your app, and it's hard to keep the two versions in sync. Yeah, at Hasura it was the same. The build time was about 15 minutes for the frontend; the dev reload time was about 5 minutes, so you make a change, you wait 5 minutes, and then it's actually done. And tooling wasn't proper everywhere. So we had to make a change, and this is the story of that change. First, who am I? I'm Nicolas, a staff engineer at Pethitch. You can find my Twitter and my blog; this talk is also available in written form on my blog if you want to dig further. So let's get back to the topic. What was the setup? We had two code bases: the open source one and the enterprise version. What we did was extract some of the code from the open source code base into a bundle, through extra layers of webpack, and then install that into the enterprise application. Seems pretty standard, right? But the tooling wasn't the same everywhere. On one side we had TypeScript, Jest tests, Storybook, Chromatic, Cypress: a very good developer experience. On the other side (and let's remember, enterprise clients pay for that other side) we had JavaScript, no TypeScript, Jest tests, and that's it. No Storybook, no end-to-end tests, nothing else. Because it was so complex to work in this second part of the application, this was the end-result setup. But that's not it; it gets worse. We had a thousand lines of custom webpack config just to bundle part of the application into the other one.
Lock file management was hell: when you changed one thing in one place, you had to make sure the lock file (not just the package version, the lock file) was the same in the other place. Otherwise, things would crash in production, and without end-to-end tests you only find out in production, or when you test your dev environment. CI was very slow because of this whole system. So we wanted a monorepo tool: let's have everything inside a single monorepo, having the parts work together in union instead of in isolation. We made a wish list for the monorepo tool. We wanted task orchestration: saying, build this app before this one. We wanted dependency graph visualization, because right now we have two packages, but in the future we'll have more, and we want to see what is going on without having to guess by digging through packages and code. We wanted consistent tooling: the same Jest config and the same Jest version everywhere, because no, it wasn't the same version of Jest before, which is fun to deal with. We wanted project constraints: for example, the open source edition must not import the pro edition, because you don't want to give away for free things companies get paid for. We wanted distributed task execution, so we could scale the CI by adding more runners and say: run those jobs in parallel and deal with it however you want. And as a bonus point, we wanted code generation, so that scaffolding was baked into the tool and everything was done for us. So with this wish list we went into the ecosystem, looked at every tool that existed, and checked every one of them. First, a small disclaimer: this work happened about a year ago, and new tools have appeared since. Moonrepo didn't exist back then, so if you want, you can also look into Moonrepo. I also want to shout out all the engineers working on those monorepo tools.
They are amazing; if you ask anything, they are always willing to help. So kudos to them. So what did we look into? First one: Bazel. Bazel is made by Google to handle Google's monorepos. It's huge and complex; you can do a lot of things with it, but it's also very complex to use. We looked at Gradle, because yes, Gradle can do other things than just Java: it's tailored to Java, but you can do JavaScript, you can do Go, you can do whatever you want in it. We looked at Lerna, which is the historical, classic tool for managing a monorepo from the old days of JavaScript. We looked at Nx, because I had used it in the past, in the Angular days, when Nx was only an Angular plugin; and yes, it is now a real monorepo tool. We looked at Pants, which is mainly used at IBM but also in other places; it turns out it's pretty good if you want to experiment and give it a try. We looked at Turborepo, because of all the hype around it, so it was on the list. So those were the tools we looked into. Let's see. We wanted task orchestration: well, they can all do it, so that's good. We wanted dependency graph visualization: Gradle and Pants didn't support it, so those two are out. Then we wanted ecosystem tooling: Turborepo didn't support it, Lerna neither. So we end up with either Bazel or Nx. Project constraints: they both support it, amazing. Distributed task execution: they both support it, cool. And code generation: Bazel didn't support it. While we could have added code generation utilities to Bazel with extra code, Nx was also way simpler to set up. So in the end, Nx was the tool that met the needs we had at Hasura. If you want to learn more about those tools, there is a great resource.
It's open source and contributed to by many of the maintainers of these monorepo tools; you get a grid of all the main features that make up a monorepo tool, and each project is listed with what it can or cannot do. So we had our tool: Nx. But it turns out there are two flavors of Nx: integrated, or package-based. First, let's go into package-based. Package-based behaves like a pnpm, Yarn or npm workspace: you have many packages, they all link together, and it works pretty well. But it doesn't give you consistent tooling; you can do whatever you want in each project. The migration path is easy, because you basically just drop an nx.json at the root and it's done. But there are still build steps between the libraries. Let's remember why we are doing this: we want the builds between libraries to be way faster, so that we don't rebuild everything every time. So what is integrated? Integrated means that every tool in the workspace is unified, and the monorepo is treated as one unit. Every tool is consistent, because every tool has the same version and the same configuration everywhere; you can tweak it in a specific project, but the base is the same. The migration is more involved, because you need to decide how you want to migrate: do you want to align with Nx's conventions, or do you want to bend Nx to your will? You can do both. But thanks to this, build steps between libraries become optional, which means we could solve all our speed issues. And there is one more thing: plugins. What is a plugin? A plugin can do three things. It has generators, which let you scaffold the basics: nx new library, done; nx new application, done; nx new storybook, done. It has executors, which wrap a tool to make it simpler to consume. And the best part is automatic migrations: for example, a new version of Jest comes out and you need to update your tests to a new configuration for the timers.
Nx will migrate your code for you automatically, and it works 95% of the time; you won't have to do anything. This was really helpful for us, because the code base was huge, like a million lines of code, and it was hard to maintain. So that's all good, but we are engineers, right? Trade-offs: not everything is green. There are two big ones. The first one is the single version policy, which states that there may only be one version of a dependency or package inside the monorepo. While it adds extra constraints, it's also what is recommended within any monorepo, because if you have a library built with React 16 and another one with React 18, you cannot import the React 16 one into the React 18 one. The way I see the single version policy is a bit like buying versus a loan with interest. When you want to migrate React, if you buy, you just bite the bullet: you spend maybe a bit more time, but you do everything at once and everything is aligned. Versus, if you loan the migration, you spend time many times, doing many packages one by one, and every time you have to regain context: how do I migrate this again? Every single time you want to migrate to a new system, it takes way longer in the end, but buying is a bigger investment up front. You pick. Buying into the tools is another constraint: you have to wait for the tools, meaning that, for example, when a new Jest version comes out, you have to wait for Nx to update their setup so that it will automatically migrate for you. In enterprise software, waiting a day or a week for a new Jest version is not that big of a deal, to be honest. And it's way better now, because they work hand in hand with the actual engineers working on those tools, and some of them actually work at Nx now, so that helps a lot. And if you need it, there are plenty of escape hatches, so you can do whatever you want when you need to. So we know what we want: we want a monorepo, we want Nx, we want integrated.
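The project-constraints item from the wish list (the open source edition must not import the pro edition) is expressed in Nx through the `@nx/enforce-module-boundaries` ESLint rule and project tags. A sketch of the idea; the tag names below are hypothetical, not Hasura's actual configuration:

```json
{
  "rules": {
    "@nx/enforce-module-boundaries": [
      "error",
      {
        "depConstraints": [
          { "sourceTag": "scope:oss", "onlyDependOnLibsWithTags": ["scope:oss"] },
          { "sourceTag": "scope:pro", "onlyDependOnLibsWithTags": ["scope:pro", "scope:oss"] }
        ]
      }
    ]
  }
}
```

With tags like these, a pro library may import open source code, but an open source library importing pro code fails the lint step, which is exactly the "don't give away what companies pay for" guarantee.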
How do we proceed? Because we're not going to say: we freeze production for six months until we migrate everything. That's never going to work. So the goal is to migrate incrementally, without stopping the daily delivery work. We added some requirements for this migration. First of all, we wanted no code freeze during the migration: we had many engineers working on the code base, and we never wanted to say, stop working for half a day every week so that we can migrate stuff. That's not feasible. We wanted as few regressions as possible: nobody likes bugs, and neither do our customers. We wanted to adhere to Nx conventions, so that the automatic migrations would be as easy as possible, which meant less maintenance in the end. Furthermore, if we have standard tools, we have reusable skills: you can switch teams and everything is the same, which is nice, and for companies that do loads of re-orgs, that's a big seller. And we wanted to keep our seven years of Git history. Git history is sometimes the only way we can debug something, because of the JavaScript and such, so we wanted to keep it. So here was the situation. We had our current code base. We created a new Nx workspace, just created a fresh workspace, imported the code into it, and built it. Is it working? Yeah, everything is done! Except not. Things broke, obviously, because our code had many issues. So the next step was to identify what broke the build, fix it in the current application, and then start over again. The good thing about this migration path is that at every step of the way, we provided value to the developers working on the old system while preparing the new system. And at some point we identified some tweaks we needed to make to Nx, so every time we created a new workspace, we applied those tweaks beforehand.
And we did this cycle many times to make sure every step of the way it worked; we even had a cron job on a weekly basis to make sure everything was good. And I mentioned we had to make tweaks to NX. One thing we had to tweak was the TypeScript paths, because we had "@/" imports. And in the monorepo, "@/" means nothing because there is no root. There are only packages. But we tweaked it so we could make sure the migration was not blocking and didn't require a lot of work on the previous code base. We had to include Node.js polyfills because even though no Node.js code should end up in the browser, we all have Node.js code in the browser, like HTTP and such. We had to make some specific changes to the webpack config, like SVG and such. And we had to disable some ESLint rules because, well, our code wasn't up to standard, obviously. So that's what we needed to do. What about our code, right? So first of all, we had CSS modules without the .module.css extension. So they would behave like CSS modules, but we didn't have the extension. We had to fix it. We used the ability to import CSS in TypeScript. And it shouldn't have worked, but somehow it did. So thanks, Webpack 3, I guess. But we had to change this so that it worked with Webpack 5. Path imports relied heavily on the webpack config, so we had to change that also. We had to update TypeScript to a version that is compliant with NX. We had to update the entry points so that they only export a component and not mount the application. And this was the kicker. Turns out, somehow, the build compiled with a lot of circular dependencies. Like a lot. Like 150 loops of circular dependencies within the codebase. And this was like one of the libraries, not just the bootstrap of it. So we had to dig through and fix our code, basically. And we got it down to 95, and then Webpack was able to compile the application, and the browser was able to load it. So that was good. What did it look like in the end?
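Hunting down circular dependencies like the ones described here boils down to a depth-first search over the import graph. A minimal sketch (illustrative, not the actual tooling the team used) that returns one representative cycle:

```typescript
// Sketch: find circular imports in a module graph (adjacency list).
// Returns one representative cycle if any exists, else null.
function findCycle(graph: Record<string, string[]>): string[] | null {
  const state: Record<string, "visiting" | "done"> = {};
  const stack: string[] = [];

  function dfs(node: string): string[] | null {
    state[node] = "visiting";
    stack.push(node);
    for (const next of graph[node] ?? []) {
      if (state[next] === "visiting") {
        // Back edge found: slice the current path into a cycle.
        return stack.slice(stack.indexOf(next)).concat(next);
      }
      if (!state[next]) {
        const cycle = dfs(next);
        if (cycle) return cycle;
      }
    }
    stack.pop();
    state[node] = "done";
    return null;
  }

  for (const node of Object.keys(graph)) {
    if (!state[node]) {
      const cycle = dfs(node);
      if (cycle) return cycle;
    }
  }
  return null;
}
```

Feeding it a graph like a → b → c → a yields the loop [a, b, c, a]; running it repeatedly after each fix mirrors the "fix, rebuild, start over" cycle the talk describes.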
We had our pro application that loads the pro library that imports the OSS library. And the OSS application that loads the OSS library, and the end-to-end tests that import both the library and the application. This was, by the way, generated by the NX graph of the workspace. We didn't have to do anything. So all good, right? Everything is nearly ready. We just need to switch. And switching means keeping the Git history. So to keep it, we first made a commit to clean up the old workspace. Then we made a second commit to git mv everything to the other place. Then we made an archive for OSS because, given we are an open source product, we wanted to make sure contributions didn't end up broken because of this. For both commits, we applied the known tricks, and then we were in NX land. This way, the second commit could be added to the git blame ignore list, to make sure git blame doesn't pick up this commit. So we still kept our Git history for whatever we wanted. In the end, the total freeze time for this migration was three hours. From the beginning to the actual end of the migration, three hours total. It wasn't a freeze lasting a few months. And the three hours is because the CI was slow to run on the four commits that I mentioned before. So all good and all right. What about the results? We want numbers, for our users and our developers. First, for users: zero bugs in production. That was great. Because of this incremental approach that we took, we were able to see at every step of the way that we didn't break something, because otherwise we would have identified it in the app. The other surprise was that because everything is unified, the bundle size decreased quite a lot, from 43 megs to 13 megs. And the funny thing is when you get a call from a customer service representative: thank you, Niko, I can finally use the app locally without it being too slow to load. Thanks, I guess. It's a bit weird. They couldn't before, but still. So this helped with the load time.
We have the application loading like five seconds faster thanks to this. Okay, that's good for users. What about devs? Well, 30x faster local dev. Because we didn't have to have a build step every step of the way, we went from five minutes to ten seconds. This was life changing. Try to imagine when you debug something: you make a change and wait five minutes to see that the console.log you added shows up. Now it's like ten seconds, an instant for what we were used to. And the CI was about 60% faster in the worst scenario. In the best case scenario, it's about 80% faster thanks to caching and things like that. All right, good. Is it the end? Are we done? We are now in NX land. We have the packages. Are we good? It could be. It could be a step where you say this is good enough, we don't want to go further. But you could. One of these areas is architectural decoupling, where you say I want to make sure that my open source code doesn't import my enterprise code. And you can enforce that thanks to a lint rule in NX. You have a lint rule with dependency constraints that basically says that pro code can import shared, OSS and pro, and that's about it. Shared can only import shared. In a visual way, this looks like this, where you ensure that libraries in a scope can only import within the scope, or the scopes they are allowed to go to. This helped us heavily to ensure that open source code stayed open source and the enterprise code stayed enterprise, and that open source couldn't, through the tooling, end up importing enterprise code like the cloud enterprise code. Then the other thing where we went further is to unify our tooling. While in this migration we just adopted NX, we later generated new lint and test setups for our projects. And this cost us like 20 minutes to do. We now have Vitest in some of the new projects. And we also made our own custom plugin, because you can make your own plugin. It's relatively easy. And thanks to the plugin, we can create a new library.
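The boundary rules described here can be modeled as a simple allow-list. This is a heavily simplified sketch of the idea behind NX's enforce-module-boundaries lint rule (the real rule works with project tags and depConstraints in the ESLint config; the scope names below just mirror the talk):

```typescript
// Sketch: which scope may import from which scope.
const allowedImports: Record<string, string[]> = {
  pro: ["pro", "shared", "oss"], // enterprise code may import everything
  oss: ["oss", "shared"],        // open source must never import pro code
  shared: ["shared"],            // shared stays self-contained
};

function importAllowed(fromScope: string, toScope: string): boolean {
  return (allowedImports[fromScope] ?? []).includes(toScope);
}
```

With this table, an import from an OSS library into pro code would be flagged, exactly the guarantee the talk wants for keeping open source and enterprise code apart.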
I want a library with this scope and this type; put it in the right folder for me. I don't care. Do it for me. And the naming will be automatic. Everything will be automatic. In those cases, you can generate automatically the code owners, update the CI if need be, and so on. Because in the end, thanks to the plugin, you get the specificity of your tooling out of the developers' and engineers' minds and into automation. Because we all know this documentation that is never updated. And tooling is always updated because we use it regularly. So if we know it's outdated, we can look into it. So in the end, what I wanted to say is: coding on a large code base shouldn't feel like this. You are not sure you're going to break something. You are not sure what your change will affect. You have no idea what is going on. Instead it should feel like this. A happy dance. We just pass the ball around and have things moving in the right direction. Thank you for your attention. Are there any questions? So in this case, we didn't use NPM to share on the outside. However, it's supported in NX to be able to release packages. And thanks to the NX plugin, it can understand your workspace and create a package for your library to be exported publicly on NPM. Next week there is a launch event for NX and they are going to announce something that may be related to your question. Are there any other questions? Yes. Can you hear me well? Yes. My question is, what was the main reason for such a decrease of the bundle size? Is it because you removed all of those cycles in the code? So the question was why we ended up with such a large reduction in the bundle size. What happened before, as I showed at the beginning of the talk, is that we had one part of the code that we had bundled into a package. Sorry, there are a lot of slides. Anyhow, I think you remember close enough.
So what we did before is we exported a large part of the application into a package, and then we imported this package into the pro base. The first change now is that Webpack has a unified view of the whole system and does way better tree shaking. Because with this middle package right here, Webpack didn't understand what was actually imported into the end application and wasn't able to do tree shaking as powerfully. So that was one huge step that helped us on this. The second step was having updated Webpack configuration and tooling, which meant that we didn't need to target IE anymore. That alone removed like 5 megabytes from the bundle. And so both things combined, plus better CSS processing with, again, a unified view of the whole system, meant that we had this decrease in bundle size. Yeah. So if today I had to do a similar migration, I would use NX too. There is a new tool that I would investigate, which is called Moonrepo, which is similar in some ways to NX. However, to this day, for an enterprise-ready product, I would still use NX. Because the one thing they are moving towards is to also have a way smarter CI. Because if your CI can understand your workspace, it can also understand better what to do and what not to do. And so to this day, NX would still be my choice. In the future, I will still investigate Moonrepo to see if it could make sense. But if you have a huge scale, like 10,000 engineers, Bazel would make sense. Because you could have a team of like 20 engineers working on Bazel. So yeah, that's my answer. Yeah. So just to make sure: when you started with NX, you imported package by package. But you threw away the results in the end. Yeah. And you redid it in two hours. Yeah. So this way, we made sure the old system was being updated with the changes we needed to make. So this way, if for whatever reason we had to stop, we still provided value to the existing base.
So, on the question before: what do you think of Turborepo? Yeah. So Turborepo has some features that are integrated into NX, in terms of feature parity. However, it lacks some of the larger system that is required for an enterprise project. You don't have distributed task execution, for example. You don't have unified tooling. You don't have generators. And this means that, for me, Turborepo is in the middle between Lerna and NX. It's like a middle ground where you have it a bit better, because you can have things like task caching on the cloud, thanks to Vercel. But you don't have the full power of something like NX. So yeah. Yeah. If you compare Turborepo with the other ways of adopting NX, how would you compare it? So I'm going to have two answers for that. One which is related to next week's announcement and one for today. For today, NX requires a bit more configuration and tooling when you set it up. But stay tuned, because it will be even easier to adopt NX in an existing workspace: they are trying to make NX smart about understanding what your project is, so you have less friction to adopt NX. Yeah. Did you have any non-Node.js applications or services that you needed to integrate in this migration, or is NX only for Node.js-related projects? Great question. So by default, NX is agnostic. There is an ecosystem of plugins, supported officially by NX, that is very frontend-oriented and JavaScript-centered. However, you can do whatever you want. There are community plugins for Go, for .NET, for Java inside of NX, where for example for a Java project it will understand the pom.xml and try to understand whatever it can automatically. And one great thing about a polyglot repo like this is you can say: when your backend changes, we rerun the end-to-end tests for the frontend, because they are related.
Because you can say your frontend, like your SDK imports, is related to the backend because it is linked to the OpenAPI spec. Thanks to this, we trigger everything on the frontend. And this is where NX or a monorepo shines: it's one context, even if it's polyglot. Unfortunately we don't have more time for questions, so a big round of applause, please.
Can we simplify charting libraries?
Alexander has been a React developer since 2018 and he likes creating UIs that are nice. And he's going to talk about how we can simplify charting libraries. So a big round of applause for Alexander. Okay, so thank you very much everyone for joining. To give you a bit of context about what we will talk about today: I'm currently working at MUI, which, if you don't know, provides user interface components. You might know us because of this library. And as a kind of tradition, each year we ask users: what can we do for you, what can we improve? And the community is quite creative, which led to other libraries, for example Base, which is a headless library. But they are very creative. For example, Toolpad is a no-code application we are trying to build. And then there is the team I'm working in, which is MUI X. And we create the most complex components, for example a data grid and a date-time picker, which are a bit more complex than a button and a select. And a year ago, we decided to start the charts effort. And this talk is about how we proceeded, what we found and explored, and our current conclusion. So from the questions we asked users, what they wanted is nice documentation. That's the main thing they complain about in a chart library. And having a developer experience that matches what we do usually, for example for the data grid. So we'll see together if this is possible. Okay, so I started with just thinking, having a dream: what would be the perfect developer experience I would want? So for me, the best one is you have a wrapper. You provide it the information it needs to know, like its size. And each time you want to add an element, you just add a React element in it. It seems pretty basic. It should be okay. Up to the time you add more data. When you add more data, it overflows, and it totally makes sense, just because the x-axis needs to communicate with the plotting to say, hey, stop after 10.
But if you put larger data, you have another overflow issue, just because your line plot needs to communicate with the y-axis. So I started my journey with a dream and I ended up with an issue, because components need to communicate in all directions. And that is just one example, but it's the main issue of charts: data management is a pain. There is a second one, which is customization. Here you can see that for a button, we kind of all agree about what it can be. You can customize the color a bit, whether the background has a color or not. The most complex thing you can do is adding icons. Most of the time it's at the beginning or at the end. But for charts, you have many more elements. And the creativity of designers and mathematicians is endless about how you can add annotations. So we need much more flexibility. And currently, none of our developer experience strategies allows that. So we have two main issues. It's time to have a look at the past. These libraries have existed for more than 10 years, so they have a lot of experience to share with us. And it's a pleasure to work in open source, because you can have a look at why they made a decision and how the code is working. So let's start with Recharts. As you can see, it's composition. We just said at the beginning that composition is a pain. So how did they solve this data management issue? Basically, you have a wrapper, the line chart. And it looks at its children. So children is just an array of components. And it says, okay, which one is an axis? And it extracts all the data from its props to know from which point to which one it can display stuff. It does the same with all the elements that are plotting data. So here: line, mark, areas, and stuff like that. And then you do a kind of aggregation to render the components with the correct properties. The file that does that is 1000 lines. It's very hard to read. I assume it might be hard to maintain too.
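The children-inspection approach described here can be sketched without React at all. The following is illustrative only (children modeled as plain descriptors instead of real React elements; the names are not Recharts' actual internals):

```typescript
// Sketch of the children-inspection approach attributed to Recharts,
// with children as plain descriptors instead of real React elements.
type ChartChild = {
  kind: "axis" | "line" | "bar";
  props: Record<string, unknown>;
};

// The wrapper walks its children, picks out the axes, and extracts the
// data from their props so it can compute the plotting range for the
// other children before rendering them.
function extractAxisData(children: ChartChild[]): unknown[][] {
  return children
    .filter((child) => child.kind === "axis")
    .map((child) => (child.props.data as unknown[]) ?? []);
}
```

Even this toy version shows the downside the talk points out: the wrapper has to know about every child type, and custom children get data out of an aggregation step they cannot see.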
And when you want to add your custom components, you don't really know where the information comes from, because there is this black magic aggregation that will provide you some data. And debugging is a bit of a mess. But it allows a lot of flexibility. On the other side, Nivo has a much simpler approach. It's a single component. So for example, you want a line, you use ResponsiveLine. And you provide data. You can configure how all the axes look, configure the tooltip, et cetera. Each element has its props and a lot of options. So as I said, it's very straightforward. So one chart is equal to one data set, which changes according to your user, plus a set of options. But you get two main issues. For example, mixing charts does not really make sense, because you have two single components. You cannot overlap them in an easy way. And you cannot modify the features, because it's a single component. You have a finite set of options. And if an option is not available, you have to go inside the source code to add it. For example, supporting different axes for the left and the right, so having multiple y-axes for line charts, is not supported. And except by modifying the source code, you cannot do that in Nivo. So it's very nice if you want a simple chart. But once you hit a wall, there is no option. Then for ECharts, it's pure JavaScript. So as you can see, you select an HTML element, for example main, and you run the code. Of course, all the complexity is hidden here. And to give you a bit of the flavor, they kind of fixed the issue we've seen just before. The series can be of multiple types. So you can mix a line chart and a bar chart. You can even put a pie chart in the middle of a line chart. It does not make sense, but for the software, it's okay. And it's an old library, so there are a lot of options. So you can do most of the customization you want. Due to time, I will just skip this. So basically, this is the whole pipeline for rendering a chart.
And the main issue I see with ECharts is this one. The only thing you have access to is still the options object. So basically, you can provide the data. You can customize the options. But as soon as you want to render a custom element — you know, if you've tried to render SVG just using strings, it does not make a lot of sense; you need to have the components. So, now, let's save time. Nice. Just to resume: we have these two solutions, basically, single components or composition. And as we've seen, data sharing with composition is a nightmare. You can work around it, but you get into the black magic stuff. And for the developer experience, it's not good. And for adding elements, you need composition, because as soon as you only get options, you don't know how to insert something. For example, Nivo has an array that allows you to reorder the grid, the axis, the plotting. But you know that when you reach the state where you need to pass an array to order your elements, you will quickly be limited. So, it's time to go to the proposal. So, basically, we started with a single component. It looks a bit like Nivo. You want a line chart, you say line chart, and you provide data and options. But under the hood, it's composition. So, like for Recharts, you have a wrapper and all the rendering components. If you look closely, you might see that the way props are passed is not exactly the same as for Recharts, and there is a reason. Basically, all the data that needs to be shared and aggregated — so the axes, the series, and so on — is passed to the container. The reason is basically that we want to do this aggregation stuff in a neat way, to say: okay, you're using our components, trust us about how the axes and the series need to interact. You don't need to take care of that stuff. We'll do it for you. And then it's passed to providers, for example a series provider.
It takes care of knowing what is a bar series, what is a line series, what is a pie series. Same for the axis and interaction providers. The interaction provider, for example, will tell you the series with this ID is currently highlighted by the mouse, so display it accordingly. So now we can create the rendering part. So, for example, the bar plot will call the series provider and say, okay, give me the data about the series. If there is none, it renders null. If there is some, it asks the axis provider: okay, I have this bar with a value of 24. Can you tell me which coordinate I should associate to this value? So, it renders the rectangle, and it communicates with the interaction provider to know if the bar needs to be faded out, highlighted, or just in a normal state. With the same logic, you can create whatever you want. So, other kinds of series, other kinds of components. For example, we created the axis, legend, and tooltip, the basic ones. For the little story, the reference line has been created by a user just using the providers. And of course, you can create your own ones, and that's the main success of this approach. So, as a conclusion: a single component, for us, was a need, because most of the time, for example, you just want to put a sparkline or a bar chart in your application very quickly. So, you say bar chart, you get a few options, just what you need to get the correct bar chart, and you don't have to care about all this internal stuff, about how all the components communicate together. But as soon as you want to do something very custom, and the charts are part of the heart of your business model, you want it to be exactly as the designer implemented it, or to display very specific stuff. So, you need composition. The main failure of this experiment was the configuration feeling. I wanted absolutely to avoid this aspect of: I give you a bunch of options, deal with it.
It's not possible, because there is so much interaction between the axes and the series that you cannot split them into the options where they are needed — for example, axes in the axis and series in the series. You need to get them all together to do the computation. So you get this feeling, but okay. And the success is to empower developers to create their own subcomponents. And that is something I've never seen before, except if you go very low level on how to make charts. And to give you a flavor of how easy it is: okay, so this is a line chart. And there is a custom component in the middle, this horizontal line, that shows you, for your mouse position, what the value is on the left and on the right. So, this component is not very useful, but it demonstrates interaction and axis management. And to create it, you need two things. First, the bounding box in red and the mouse position. That's the easy stuff. And then you want what we call a scale. If you use D3, it's the same object that allows you to convert a value to a coordinate. And what will interest us is going from the coordinate to the value. So, let's start coding it. I promise it will be very quick. useDrawingArea calls the provider that returns where you plot the data. So, you just get the bounding box. And you use useYScale, providing the ID of your scale, and it returns you the D3 scale. Very easy. And that's all. That's all you need. After, it's boring stuff. You save a state. And you do your useEffect to handle mouse moves, stuff like that. You store null if you are outside of the SVG or the drawing area, so you render nothing. Otherwise, you render a path. So, quickly: you go from the left edge at the mouse position — a single point — and you draw a line of the width. That comes from the drawing area. And then you just have to use the axis scale's invert to get the value from the coordinate. You display it. And that's all. So, you've created a component that is completely custom.
And it interacts with your chart. And you can reuse it in any other kind of chart you build with us. Thank you very much for your attention. Thank you. Most of the time, people don't know, but there is an option on the website to send feedback about talks. If you have some, please don't hesitate. Otherwise, here are my contacts for later. Are there any questions? We have a few minutes for some questions. Yes. You mean rendering a custom element? Can we use a render prop to render custom sub-elements? But the issue is, for example, with SVG, the order of your components impacts which one overlaps which. And so the question is, where do you render this element? So, for example, this line: you can imagine that you put it on top of the line chart and below the mark plot of the line chart. And so you need to get access at the JSX level. How do you go from simple mode to complex? You can go from one component, and if you need more advanced stuff, you can compose. There is a single component for all the basic charts — line, bar, pie, and scatter. And if you want, for example, to compose a bar chart with a line chart, you need to recreate it. But we provide all the basic stuff. So basically, if you open LineChart.tsx on GitHub, you will see a chart container, the different plotting components, the axes — and basically that's all; you get between five and ten components to create your own one. How does it reuse the rest of MUI? It's kind of standalone. We reuse the theme, mostly so that it is linked with, for example, the tooltip, so that it gets the same color as the background of your application. But otherwise, it's SVG, so there is not that much in common. There is no button, for example. There is no select. We don't really need those user interfaces. It's more the theming and the way components are styled, for example, so that it follows the same developer experience and you can override the styling.
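To make the scale object concrete: a D3-style linear scale maps chart values to pixel coordinates, and invert maps a mouse coordinate back to a value — exactly what the custom component above needs. A minimal sketch (illustrative, not MUI X's or D3's actual implementation):

```typescript
// Sketch: a minimal D3-style linear scale with invert.
function linearScale(domain: [number, number], range: [number, number]) {
  const [d0, d1] = domain;
  const [r0, r1] = range;
  return Object.assign(
    // value -> pixel coordinate
    (value: number) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0),
    {
      // pixel coordinate -> value, used to turn the mouse position
      // back into an axis value for display.
      invert: (coord: number) => d0 + ((coord - r0) / (r1 - r0)) * (d1 - d0),
    },
  );
}
```

With a domain of [0, 10] mapped onto 200 pixels, the scale sends the value 5 to pixel 100, and invert sends pixel 100 back to the value 5.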
The question was about performance: have you checked how it behaves when you have a lot of data points, given the interaction between React and the data? No, we did not try, mostly because we are currently using SVG. And so we know that there is at least a wall waiting for us at a certain scale, just from the time to render the SVG. So we did not worry about it that much. It's part of next year's roadmap. Thank you all; a big round of applause for Alexander.
Building your own JavaScript runtime with Rust
So our next speaker is Leo, who is a developer at Deno, and he's going to talk about how to create a JavaScript runtime with Rust. Big round of applause for Leo. Hello, I'm Leo. As I was just introduced, I work at Deno, and I do various Rust. At Deno we do a lot of Rust, and we create a JavaScript runtime, but we want other people to be able to use it as well and make their own stuff with it. So we will explain the internals and how you can make a small JavaScript runtime by yourself. But first, what is Deno? Many people still don't know, so better to explain. It's a JavaScript runtime similar to Node, maybe similar to Bun if you've heard of Bun. It focuses on security, web compatibility, TypeScript support out of the box, and just a lot of built-in tools like a formatter, a linter, and doc generation. We also have compiling to a single executable, and a bunch of other tools. We are also not 100% fully Node compatible, but we're getting closer and closer by the day, and it's going quite well. And what matters for this presentation is the modular code base. We have a lot of building blocks that can be used individually to build your own JavaScript runtime, just with these Rust crates, or Rust libraries, to make your own one. Without too much effort, actually; we simplified this a lot. Yes, so first off we need to explain the internal structure of Deno, where everything is built on Deno Core. Deno Core is a layer above V8, which is the JavaScript engine that powers Chrome and browsers. And Deno Core is just a small wrapper around it that simplifies a lot of the utilities around it and makes it a bit more friendly to use. It's not always easy to use V8 directly by itself. And on top of that, we have various other functionality that's built on top of that. That's extensions. Extensions are individual libraries that can be used by themselves to implement individual APIs and functionality.
For example, a specific web API — let's say fetch; we have fetch and various others as individual extensions. We have an HTTP server, KV, and lots more. Basically everything is individual building blocks that can be — not copied and pasted, but imported — and just used without too much hassle. Usually adding an extension is like three lines of code, and then suddenly you have a massive amount of additional APIs that you can just use. Then we have Deno Runtime, which is a library built on top of a bunch of extensions that adds a bit more capability, including the permission system, which relates back to us being a secure runtime. We have various permission-based functionality and flags. Another additional feature would be the fact that we do some definitions of various global scopes and the Deno namespace itself. And web workers are only implemented in the Deno Runtime crate, because it's just not possible to have them as an extension — they need to interact with the extensions themselves. And then we have the CLI, which is what we compile and what people use. And that's not a crate. Yeah. And the CLI includes the TypeScript support and all the tools of the CLI, like the linter, the formatter, et cetera. And also the compile subcommand, as I mentioned before, for compiling to a single executable, testing infrastructure, benchmarking infrastructure, and doc generation. We have a fully static HTML doc generator that you can just use, and it will always give a relatively clean output. But what will we build today? We will build a JavaScript runtime that can compile TypeScript, has the functionality to make an HTTP request, a console.log, some file system operations like read and write, and deleting a file, I think, as well. And it's all in less than 20, 30 lines of Rust and JavaScript. We'll call it runjs. A warning: this will be a relatively technical topic, so there's going to be a lot of code.
First let's explain extensions more in depth. Extensions have various fields and options that can be set. Ops, which I will explain in a moment, are basically Rust functions that can be called from JavaScript — so you can just write a Rust function and that will then be callable from JavaScript. esm is for ES modules, so you can use ES modules, static imports, and dynamic imports (those work as well, I believe — maybe not). js files are just scripts, not ESM. They work differently under the hood, so we have these two separate options. And then deps is declarations of other extensions this extension depends on. This is not strictly needed; it's more of a safety harness. It just makes sure that you actually initialized the extensions in the right order, so you don't forget to initialize an extension that another extension relies on, and then everything blows up and you don't know what's happening. And there are some other, less relevant options like config, js (which, as I mentioned, is rarely used nowadays), lazy-loaded ESM, and state, which I'm not going to go into depth on. config lets you configure some options for a specific extension. If you want to have some special state, you can use the state option. Lazy-loaded ESM lets you lazy load extension code, but that's nothing we're going to go into depth on in this talk. And then ops. So ops are these functions that you declare in Rust that are then used in JavaScript. You can just call them like a normal function in JavaScript. And it uses this op2 macro. I hope I don't need to explain what a macro is — I hope everyone here knows. And then basically you define arguments and return types with these special macro attributes, like #[string]. And basically it infers the right type to map it from JavaScript to Rust, and vice versa, depending on the attributes.
And yeah, you just write a normal Rust function. For example, here we just use Tokio, which is the async executor that we use in Deno, and most of the Rust ecosystem uses it, and then we just read the string: we read the content of the file at the path specified and we just return it. And we return everything in ops as a Result, which is either an error or an OK value, because you might want to throw an error, for example, and that just gets handled under the hood too. So you just return an error. There are various other types that can be specified in ops. We have some more ambiguous types like a V8 value, which is just a generic JavaScript value. You can pass that in, and you can manually match and do some more specific handling if you need some weird function that behaves differently based on different types, which we usually try to avoid; we'd rather have separate functions that do more specific things. But we also have booleans, all the supported numbers, and strings, and array buffers are supported as well. And yeah, you can return and accept array buffers and it's all handled under the hood without issue. It's all been simplified as much as possible to make it as user friendly, or developer friendly, as possible. So it's really easy to just create your own functionality without too much difficulty. There is also this async that is defined up top in the macro. It makes sure that the function is actually async, and that it actually needs async functionality when you define it as async. If you don't do anything async, it will usually error out during compile time, because async is just more complication under the hood that makes it less performant to some degree. Then here comes the code. For this example, we're going to define a few ops, or Rust function declarations, that we can call from JavaScript. We have read file, write file, fetch, set timeout, and remove file.
So in read file, as we just saw, we read the file from the path given and return that. With write file, we specify a path and the contents, both as strings, and write that to a file on disk. And we return nothing, as per the empty type. And then the fetch one, which might be the most interesting out of all of these, basically uses reqwest, which is a Rust crate for doing HTTP requests. If we want to compare it to something in the JavaScript ecosystem, it would be similar to Axios, I think. Maybe not similar in API, but similar in functionality and simplicity. And yeah, we just do a fetch request, get the content of the body via the text method, and then just return the content. And then we have set timeout, which just puts the current thread to sleep for the specified duration, which is passed by the user via this function. And remove file just removes the file at the given path. However, we use a whole system called V8 snapshots, and I apologize, because it's a very complex topic. Not many people really know what it is or even how it works. But to simplify a lot: you take the current state of the JavaScript execution, store it in a file, and resume it later. That's the simplest way to explain it. It's not exactly like that, but for simplicity's sake, let's stick with that. So we need a build script, because we first need to do some setup. First we initialize our extension. We call it runjs, as we said earlier. And we have this ESM entry point, which I did not mention earlier, but basically it lets you specify the entry point that the runtime will use when starting up. And we specify our files: we have this esm option and we have this JavaScript file, which we'll see in just a second. And we have a path defined. We get the path of the current build script location, some more specific Rust shenanigans, but we get this path and we join it with this runjs snapshot filename.
It could be any path; we just need a common location where this build script outputs something that we can then retrieve during runtime. Now comes the fun part, which is this create snapshot utility function that we have made, which does all the snapshotting logic under the hood and tries to simplify it as much as possible. You have a few options, and most of them can be completely ignored. The only three important ones are these. The manifest dir, which we cannot infer automatically, so we have users always set this value with a macro call to the Cargo manifest directory. The snapshot path, which is the variable we defined earlier for where the output of the snapshot will be. And then we have extensions, which is the extension we created earlier above. And we just want to initialize the JavaScript code. It doesn't just initialize the ESM files, it initializes ops and ESM, but here we have not defined ops, because this is just the build script; we do not care about ops at this point in time. They will come into play in a moment. Next, we also want to support TypeScript. And this is just a small snippet of the code; there's some more boilerplate that is not necessarily interesting. It's just getting the path of the file and the media type of the current file, just to be sure that we actually transpile TypeScript to JavaScript and that the file types are all correct. For that we use this deno_ast crate, which is basically a wrapper around SWC. SWC is a Rust crate, a library, that basically implements TypeScript transpiling as per TypeScript's wants and needs, since there's no real specification, because TypeScript. It takes some options: the specifier, which is the path or the name of the file that we want to transpile, and the source code. For the text info, we just create this structure from the code that we got earlier from this read-to-string at the top, from the path that was passed to this function.
And some boilerplate, this media type that I just talked about. And then we just call transpile and magically we get the transpiled TypeScript as JavaScript. And we can just use it. Then we take the code and just create a module source structure, which is how it is represented internally, and we just return it. And this is all a trait. I guess the best comparison, if you're familiar with TypeScript, is an interface. We implement this trait, and it has a few methods, but only one method is really necessary, and that is the load method, which is what this is. There are a few more lines above, but again, that's just for the media type and some smaller error handling that is not of too much interest here, for simplicity's sake. Then, in the actual main script, we get the snapshot that we created earlier during the build script and include it into the binary itself. And then we have access to this runtime snapshot, and we will use it later on. And then we have the extensions, which we initialize again, but this time just with the ops that we defined earlier. This time we don't need the ES modules, because we defined them earlier and they were snapshotted, so they're part of the runtime snapshot from above. And it seems I forgot a slide. I can quickly, hopefully, fix it. This is not well prepared and I apologize. Let's do it the easy way. This is the JavaScript file with the internals defined. Basically we import the Deno core as a JavaScript module. The core has some utility functionalities, again, just like the Rust version; this is for interoperating between Rust and JavaScript. And we destructure Deno.core into ops. Ops is an object that can be used to access the functions that we defined earlier, as we see over here. I hope it's big enough, actually. Can the people in the back read it? Wonderful. Is this big enough?
Wait, then let's, okay. Just to quickly reiterate, we have this import of the Deno core and the destructuring into this ops object, which is used just down below here to call this op read file, which is the one we defined in the Rust file earlier. Under the hood it converts the values to the correct types matching the Rust side. And then whatever is returned from this op read file, which will be the file content, we just return from this function that we defined on this object constant. Over here above, we also have the console definition, which uses core.print, a utility defined in the Deno core again, one of a few more helpful tools. We just take all the arguments and use this args-to-message helper, which just stringifies and joins all the values; we don't need anything too complex for this example, and then it just prints to the console. And then we have the same for error, which just passes true as the last value, which says whether it's an error or not. So above, for log, it's false, and below, for error, it's true. Then further down we have the other function definitions, which are read file, write file, remove file, and fetch, which are just all wrapper functions around these ops. Technically this async was not needed, but that's a side note. And then we have the set timeout, which calls the set timeout op and then calls the callback, so it's relatively identical to the web API that we know. And this is assigned to globalThis, which is the global namespace; we also assign the console to globalThis, and we define a runjs object, which is the object we defined above with all these extra small functionalities. To go back to here: we defined this extension again, and the runtime snapshot, and then we have basically all the building blocks ready; now we just need to actually use it. And for that we need the runtime. This is again a bit more complex, but basically we define a function that takes a file path.
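The glue-file pattern described here can be sketched outside the runtime. This is a hedged, standalone version: `corePrint` is a stand-in for the runtime's `Deno.core.print`, and `argsToMessage` is the stringify-and-join helper the talk mentions; the names are illustrative, not the exact ones from the repository.

```javascript
// Standalone sketch of the glue file's console: argsToMessage stringifies and
// joins all arguments; corePrint stands in for the runtime's Deno.core.print,
// whose last argument marks whether the message is an error (stderr) or not.
const printed = [];
const corePrint = (msg, isErr) => printed.push({ msg, isErr });

function argsToMessage(...args) {
  return args.map((arg) => JSON.stringify(arg)).join(" ");
}

const console2 = {
  log: (...args) => corePrint(`${argsToMessage(...args)}\n`, false),
  error: (...args) => corePrint(`${argsToMessage(...args)}\n`, true),
};

console2.log("hello", 42); // recorded with isErr = false
console2.error("boom");    // recorded with isErr = true
```

The same shape, with `Deno.core.print` in place of `corePrint` and the object assigned to `globalThis.console`, is what the talk's JavaScript internals file does.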
The file path is the JavaScript file we want to execute, with the user's code that they pass. We have some utilities in deno_core that resolve the path against the current directory and give you a module specifier out, because that's what's used internally. The module loader is what is used to resolve a module, and any imports in it, from this user-specified file, and we have our TS module loader. This is the TypeScript transpiler that we built earlier; it's just the structure that we defined, but I did not show all of it because of the boilerplate. The startup snapshot is the snapshot that we got earlier from the setup, and then the extension we need to initialize with the ops that are defined, so that the Deno core and the JavaScript file that we designed can actually access these functions and load them up. We don't care about any of the other options. And then we have the actual usage, which is this load main module. Load main module loads the main module, the entry point. Let's say you run deno run test.ts: that would be the main module, and then it will work through the entire module graph, which is basically all the imports, one by one, recursively. And this is async; a lot of these operations are async, because ES modules are inherently async. And yeah, we evaluate the module, so we basically run it, and see if there was any output. Then we want to run the event loop, because there are going to be multiple polls, let's say: with async functions you've got to do multiple async calls, perhaps, or just stuff. We have some options that are not of interest; deno_core includes options for the inspector and for pumping the V8 message loop, which again are not of much interest here. At some point or another we just await this event loop running, and return the value of the result that we were calling earlier. So out of this run js function we get the result, which hopefully will be OK, and there's not going to be any errors, but there might always be some error.
A user might have defined incorrect variable names, or have invalid syntax, or something like that. And then we can do a small demo. I hope this is going to be big enough again. That's definitely not. We have this example.js file. Here we just call the set timeout that we defined earlier in the global scope and then just console.log. Let me make this bigger. No. So, live demos never go perfectly well, but hopefully this should be working. We just do cargo run, and we want to specify this input file, which we called example.js. And hopefully this will work. It first needs to compile, and yep, it prints, waits, and then the hello world that we call here. Now, this is just a set timeout; that's not as interesting as, for example, fetch. So we could just console.log the fetch output. That would be runjs, because we defined this global variable earlier as runjs, down here, and then we want to call fetch. I think we could fetch http://example.com, and since this is async we want to await it. Again, let's run this, and hopefully we'll get an unreadable wall of HTML output from example.com. It's usually not that long, and yes, we did a fetch request to a remote server. And we had the file system operations, so I could just call await runjs.readFile, and let's read, for example, this file itself. And let me clear my terminal quickly. Hopefully it should just print the same output, because we're reading the file itself. Yep, it reads. And the deleting and writing of files works as well; we're not going to go too in depth into that, it's relatively self-explanatory. And yeah, that's pretty much it. I know I went a bit fast. I hope people don't have questions.
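To make the demo's script shape reproducible outside the custom runtime, here is a sketch where the `runjs` global is stubbed with an in-memory map instead of the real Rust ops; inside the runtime built in the talk, these calls would go through the ops instead. The stub, file names, and contents are invented for illustration.

```javascript
// Stub of the runjs global from the demo, backed by an in-memory Map instead
// of the Rust ops, so the same script shape can run in any JS environment.
const disk = new Map();
const runjs = {
  readFile: async (path) => {
    if (!disk.has(path)) throw new Error(`no such file: ${path}`);
    return disk.get(path);
  },
  writeFile: async (path, contents) => void disk.set(path, contents),
  removeFile: async (path) => void disk.delete(path),
};

async function main() {
  await runjs.writeFile("out.txt", "hello world");
  const text = await runjs.readFile("out.txt"); // round-trips the contents
  console.log(text);
  await runjs.removeFile("out.txt");
  return text;
}

main();
```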
There's a QR code for the actual repository where we have this, if people are interested in checking it out. But also, we are always trying to improve the ecosystem and the common problems of the JavaScript ecosystem, and we have had problems with the dependency ecosystem of JavaScript and npm, and we decided that someone needs to solve this. As such, we also created a new general-purpose JavaScript registry that will work in any runtime. This was announced a few days ago by Ryan Dahl, and you can join the waitlist at the QR code or the URL. That's it. Are there any questions? Time for one or two questions. Yeah. Inside the Docker container, I have this input queue of jobs where I send the script that I want to run, then just execute it and get the output from it. Is there any downside, as long as I am only sending one single script that it needs to execute? No, I don't see any issue with that whatsoever. It should just work. Again, I'm not too familiar with Docker, though, but that seems like a relatively normal thing to do. Any other questions? What have been your biggest challenges in this project? This project has been going on since it was announced in 2018, and we have rewritten our internals many times. For example, extensions were called other things multiple times in the past. We renamed and restructured, not the entire structure of the code base, but there were multiple rewrites, just to be able to have more capability but also performance-wise improvements. Overall, it has been a challenge, but it was something we could always figure out. Rust itself has never been an issue; it's always been relatively good to use. It's not perfect, no programming language is perfect, but Deno was initially started as a Go project and we switched quickly to Rust, for performance benefits as well. I hope that answers the question. Anything else? Yes? Is it? On this one? Yes. Okay.
This could technically have just accepted u64 directly. This should actually have been u64 directly, just passed through and not casted, but that was probably just some oversight while writing this code. We casted it because Duration::from_millis accepts only u64, but this is just an oversight. I have one more question. Yes? How does the performance of the custom runtimes or extensions compare to the foreign function interface? I'm not too familiar with FFI, but we have optimized both FFI and these extensions a lot. Extensions are inherently going to be more performant, because it's not a foreign function. These ops, I guess, if you really look at them, are foreign functions, since they're calling Rust functions out of JavaScript, and there is some plumbing, but these have been optimized so much over multiple years that I would say sync ops are, maybe not no cost, but close to no cost. Async functions have overhead due to...
MessageFormat: The future of i18n on the web
So our next speaker is a very good friend of mine. I call him the big boss because he's a co-chair of TC39, which is the committee that manages the JavaScript specs. He's going to talk about MessageFormat, the future of internationalization. I hope I will say it right. A big round of applause for Ujjwal. Let's go. Thank you, Aiman, for the gracious introduction. Major thanks to Leo for setting up the stage for me. No pressure there at all. Well, I thought a lot about this. It's very anxiety-inducing to follow up a talk like that. But I realized that it's nothing to be anxious about, because why not follow up that very intense technical talk with something simpler? So I'll try to talk to you. It's mostly propaganda; it's a program that I need to induct you into. But yeah, I hope you're ready. Everybody ready? Let's start. Okay, great. So welcome. A little bit about me before we begin, because that's obviously the most important part of the presentation. I am Ujjwal. You might remember me from my username on the internet. It's not easier, but it's fine. I'm from New Delhi in India, and I live in A Coruña. It's a beautiful little city. How would I describe myself? Well, zealot may be a strong word, but trust me, I care a lot about open source software and believe in the web, which is what I'm here to sort of make you fanatical about as well. But I suppose, given that this is the JavaScript room, you all share a lot of these ideas as well. I love dogs, and video games that hurt me psychologically. And I work at Igalia. So, quick show of hands: how many of you know about Igalia? Wow. Only at this conference. Thank you. At Igalia, well, this is me trying to read a newspaper. Igalia is an open source consultancy. If you don't know about us, we are also a worker-owned cooperative. So that's neat.
We've worked across a lot of open source projects and ecosystems, mostly around low-level software like the Linux kernel, different things in the Linux user space, and in the multimedia space and graphics and so on. You might know about some of our contributions to the web platform. A lot of the browser projects have a lot of interesting work that has been put in by my colleagues, which I'm very proud of. And last, but not least, I suppose, we work in the compiler space, which is all about programming language design, right? It includes WebAssembly and JavaScript, but also stuff on LLVM. And that's what I need to talk to you about: about JavaScript, about TC39, as he said, which is a very descriptive name for an organization. Do any of you know about TC39? What does it do? Wow. That is surprising to me. But TC39, long story short, designs and develops this programming language called JavaScript that we all love, I hope, to some extent. It's a complicated subject. How would you describe JavaScript? What do you think of JavaScript? Do you have a definition for JavaScript? We know what JavaScript is. I would like to describe it, hopefully non-controversially, as a general-purpose programming language, and by now a truly general-purpose programming language, that is designed primarily for scripting web interfaces. Are you web developers? Are there any non-web developers that use JavaScript? No. See? While JavaScript is a growing language ecosystem, it is still very heavily influenced by the web ecosystem, and it owes a lot to the web ecosystem for making it what it is. And therefore, as the primary language for making web interfaces, it needs a lot of the tools that are required for web interfaces. But what is the web? I couldn't even find a logo or an image to describe the web, and I feel that the web is such a weird concept, because we all know what the web is, right? But could any of you define what the web is?
So I do feel that the web is hard to define or really put a finger on, because it's gone through so much. I always rely on my good friend Baudrillard for explaining what the web is. So we have, for example, the web as we know it, and then it just keeps growing. And now it's everywhere. The web is in your gaming consoles, it's in your cars, in your toasters, probably. Does a toaster need the web? Probably not, but I don't know. I don't trust toasters. But the point is that over the last couple of decades, the web has more or less emerged as the main platform for designing interfaces for people. This is not a controversial statement, I hope. But what is this web platform that I'm talking about? To define the web platform, I would say that it's an interactive, decentralized communication platform at scale, but that makes no sense. So: the web is a standard platform for making widely accessible, deployed at the scale of everyone, and rich user interfaces. We are not in the early days of the web; the interfaces that we use on the web have changed substantially. But this ambition of universality is built into the web and its ethos, right? It is supposed to be for everyone. It is supposed to be accessible by a vast group of people. And any platform with such ambitions is not just expected, but I would argue required, to be accessible and internationalizable and localizable, because how else do you reach anywhere near the target audience of the web, right? So the web has a responsibility as this platform. A quick note before we start. I've already started using these terms, but if you're unfamiliar: internationalization, basically, is the process of building an interface in such a way that it can be localized into various different languages and cultures to suit your users. And, as I mentioned, localization: localization is the specific act of, you know, modifying your interface to suit a target audience.
This is just a primer, because I'm going to use these terms a lot. I hope that was obvious. But let's talk about the early web, the early interfaces that started thinking about internationalization. Because as much as I would like to talk about how the web has revolutionized internationalization, it doesn't start there. The story starts at the beginning, because user interfaces were everywhere. People who were writing text-based video games in C were also very keen on things like internationalization. UIs are composed of string content, and these strings are what we are referring to as messages. So when I say formatting a message, that is what I mean. Manual localization is one way to do this. I'm sure everybody is familiar with Wikipedia. It's kind of possible to have a website work in that way, but it was quickly proving to be unmanageable. Not only because it's hard to do it manually, which it is, but also because it represents a very slim vision of what true internationalization means. Changing from one language to another doesn't actually mean internationalization. There are so many different axes within any given country, within any given language. There are so many different ways of expressing things that to reduce internationalization to merely catering to languages, or locales, is, well, locales can be complicated, but merely focusing on languages is too simplistic. So the actual diversity of locales can never be catered to using this simple approach. Imagine having a different version of your website for every currency that you support. And basically, to promote a better, cleaner, more modular approach for building interfaces that can be localized, C hackers first came up with gettext. Have any of you ever used gettext? It's everywhere, right? In all of your operating systems; it's actually part of libc, so basically everywhere. But let's talk about gettext. What was it?
And I know that we are in the JavaScript room, but bear with me; it all makes sense, hopefully. gettext was one of the two main internationalization systems that the early hackers cooked up. The other was catgets, but the presentation can only be so long, so let's not get into that. But gettext was the dominant system over time; catgets, as you might know, is now not used anymore. And it was not standardized. We're going to be talking a lot about standards here, and how they enable people to build stuff in a way that is reliable across boundaries. But gettext was never standardized. It was, however, standardized in a de facto way, by programmers using it across their tooling, and basically through documentation and education and so on. Its adoption by Sun and then by GNU, in glibc, basically made it so popular that it was standardized through popularization, in a sense. It was not as powerful as the internationalization systems we would use today. It mainly dealt with very static strings, and you could replace them with different strings in different languages. So already it's not perfect, right? But it was good for the time. It was what we had. It was better than nothing. But what it did was it went on to inspire an entire generation of applications, of interfaces, that were not only utilizing internationalization but were built with internationalization in mind. And this is one topic that I'm going to keep returning to throughout the presentation: giving power to your users is by far the easiest way, in my opinion, to, well, people do wacky things, that's the web for you, but it's the most interesting way, in my opinion, to allow people to innovate and come up with completely new paradigms that were unimaginable before. And we'll see how gettext inspired these things in its own right. So there's Python's gettext. It's basically gettext, but in Python. Python likes to keep things simple, I guess, in some way.
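The static-strings model described here can be sketched in a few lines of JavaScript. This is a toy illustration of the idea, not gettext's actual API: a source string acts as the message id, it is looked up in a per-locale catalog, and unknown ids fall back to the source string, just as gettext falls back to the untranslated text.

```javascript
// Toy gettext-style lookup: each locale has a catalog mapping the source
// string (the msgid) to its translation; missing entries fall back to the
// msgid itself, mirroring gettext's fallback behavior.
const catalogs = {
  de: { "Save file": "Datei speichern", "Quit": "Beenden" },
  fr: { "Save file": "Enregistrer le fichier" },
};

function gettext(locale, msgid) {
  return catalogs[locale]?.[msgid] ?? msgid;
}

gettext("de", "Save file"); // "Datei speichern"
gettext("fr", "Quit");      // falls back to "Quit"
```

The limitation the talk points out is visible here: the mapping is string-to-string, so there is no way to express plurals, gender, or any other grammatical variation inside a message.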
Java introduced, however, for the first time, the concept of MessageFormat. So, the context here, as you might already remember, well, it's on the same slide: Sun Microsystems, already big on this idea of interfaces and internationalization, of basically deploying their products and their users' products across markets, so to say, was very keen on internationalization. And therefore, basically through acquisition and innovation, they created the first beginnings of MessageFormat, in Java. It's so funny to think about it: Java was, to that generation, what the web is to us. Java was a cross-platform way of building interfaces that could then be deployed to a massive audience. Minecraft, I guess, fits the bill. ICU then, however, picked up MessageFormat. ICU was also formed by these organizations working closely together, but it was a much bigger effort. It was standardized, and it was developed in a way that was general enough for everyone to use and integrate in their apps. So ICU creates its MessageFormat. It was originally a very close copy, which is maybe a bad word, but a very close imitation of the Java MessageFormat. But in its own right, it added more features and more power. And we can see how that affects things very soon. But here's a quick and dirty example of how that works. So you have this message, and it has an expression inside of it, right? And this expression is basically selecting on a number. So if there are zero files, you can say there are no files. If there's one file, you say there is one file, and anything above one is, you know, there are X files. But this is just one message in one language. It needs to be translated into various languages. But finally we had a format to express all of this, right? If you're writing in Arabic, for example, you need a lot more plural categories. In Japanese, you need just one.
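The point about Arabic needing many plural categories while Japanese needs just one can be checked directly with JavaScript's Intl.PluralRules, which exposes the CLDR plural categories that ICU MessageFormat's plural selector is built on:

```javascript
// CLDR plural categories differ per locale: English distinguishes only
// "one" and "other", Arabic has up to six categories, and Japanese collapses
// every number into "other".
const en = new Intl.PluralRules("en");
const ar = new Intl.PluralRules("ar");
const ja = new Intl.PluralRules("ja");

en.select(1); // "one"
en.select(5); // "other"
ar.select(0); // "zero"
ar.select(2); // "two"
ja.select(1); // "other", a single category covers every number
```

So a translated message needs one branch per category of its locale, which is exactly what the plural syntax in the slide's example expresses.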
But, yeah, if you think about this, ICU's MessageFormat not only subsumed the original Java MessageFormat that it was built out of, it added more, and it experimented in its own space. Some important things: because ICU MessageFormat was separated from Java, it was able to rethink some of the design details and some of the processes that happened. And it was designed, almost for the first time, with massive feedback from implementers and translators at the forefront. Translators being a key word here, because I think it's not very controversial when I say, and it's kind of obvious from the documentation even to this day, that tools like gettext were primarily designed keeping programmers in mind. And finally, we had a lot of organizations who were very invested in translating their interfaces, and their translators were part of this process of designing this format. So things were really starting to pick up, at least I would feel so. What it ended up with is a much better, or, well, a much more powerful syntax, and that, as I mentioned previously, just opens up the space for innovation, as it did. So here's a quick example. It looks huge, but please bear with me. Basically, here we have a simple message. Whoops. Hello? Okay. One thing that I didn't focus on much was that ICU MessageFormat at the time allowed nesting. Messages could be nested, and this, as you might see here, opens you up to new use cases that you didn't think of before. So here we have a nested select statement of sorts. And what we're doing is we're selecting on the pronouns of the host who's hosting a party, and the number of guests that we have, to display a simple message. Well, simple at the end of it, but that's the amount of work that goes into making sure that you have a message that works for all the combinations. Right? And this was great.
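A hand-rolled sketch of that nested selection, with hedging: the message strings and function here are invented for illustration, and ICU MessageFormat would express the same thing declaratively as a select over the host's pronoun wrapping a plural over the guest count.

```javascript
// Nested selection by hand: first select on the host's pronoun, then on the
// plural category of the guest count, mirroring the nested ICU message from
// the slide (strings are illustrative, not from the talk).
const plural = new Intl.PluralRules("en");

function partyMessage(host, pronoun, numGuests) {
  const possessive = { she: "her", he: "his" }[pronoun] ?? "their";
  if (numGuests === 0) return `${host} does not give a party.`;
  return plural.select(numGuests) === "one"
    ? `${host} invites one guest to ${possessive} party.`
    : `${host} invites ${numGuests} guests to ${possessive} party.`;
}

partyMessage("Maria", "she", 0); // "Maria does not give a party."
partyMessage("Alex", "they", 3); // "Alex invites 3 guests to their party."
```

The combinatorial cost the talk mentions is visible even in this toy: every pronoun branch multiplies with every plural branch, which is exactly what the nested message syntax lets translators manage per locale.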
People were now doing more powerful things than they ever did. But what was going on with the web? So, talking about the good old days of the web, whatever that means: the early days of the web were very much dominated by either absolutely static documents or very static CRUD apps, basically. What does that mean? It's basically about some simple operations. If you think about it, early internet applications, so to say, even Twitter, for example, were built as very simple CRUD apps. There were very few operations that you could do there, right? And if you compare that with anything we do now, it is a completely different landscape. I mean, even just Twitter and how it works now. So the early websites, because they didn't have a very powerful medium on the front end, relied almost entirely on their server runtimes for their content, for whatever kind of dynamic things they did, and, guess what, for internationalization. So while the early web was not as powerful as it is now, there was still a lot of appetite for internationalization, and people did what they could. So they used Java's MessageFormat, via java.text. java.text has a number of other APIs as well. PHP had an intl extension; if you think that JavaScript was the first to do internationalization, that is not the case. And Ruby on Rails also had its i18n. I don't know why I keep mentioning this one; it's just the one closest to my heart, I feel, because that's what introduced me personally to internationalization, how it's important, and how it can really breathe life into your interfaces. And this allowed the early web developers to tinker with things like message formatting and more complicated internationalization use cases. So there was a lot of tinkering happening, basically with the combination of this message formatting style that we talked about and the popularization of templating languages.
You have to remember, this is the time when templating languages were really taking off, and HTTP content negotiation, which meant that the client would basically tell the server what languages it accepts, and then the server could localize based on that. We reached not a great point, but a very important point for internationalization on the web, because now things were starting to get serious. So we have a platform for building interfaces. It is being used by a vast community of people who are building websites, and they are utilizing internationalization techniques to make their content more accessible to a wide range of users. What about the pesky JS developers, though? What are they up to? Up to no good, I hope. But no, jokes apart, JavaScript also had its own parallel development during this time. Basically, for reasons that are not worth getting into in this presentation, JavaScript remained mostly not very popular. I mean, for a long time, people were hesitant to really do a lot in JavaScript, and it was not their fault. JavaScript was not as powerful as it is now. But things changed, again, on the theme of somebody giving power to a bunch of users and them taking it across the board. jQuery released in 2006. So that's almost two decades now. And how far have we come from there? It sparked a whole space of people experimenting and building more and more interactive web pages, basically. Fast forward some years, we have React, and the age of SPAs is dawning on us. Some of the most annoying websites were created during this time, but all in good time, and it all led up to something important, which is that now we have a very dynamic, very expressive web that people interact with in very different ways than they did with the early web.
Some of the interactions that we have with websites today probably weren't something the early designers of the web had any idea they were enabling, maybe for good reason. But TC39, which, as we mentioned, is the standards body tasked with designing the language, recognized at the time that internationalization is a growing concern, and that we need to enable JavaScript developers to make the best use of this area and the techniques within it. They formalized a task group, Task Group 2 in this case, to work on internationalization. So a lot of internationalization features were developed and deployed to the web. As of this day, you may know about the Intl object in JavaScript and the various formatters and other things that it has. Basically, modern JavaScript interfaces use these, well, not inline entirely, but in the context of interfaces: they use these APIs to localize for that context on the client, and then sprinkle the results across the interface. But it's not perfect, or is it? Well, the state of internationalization on the web is that, outside of what was being done in JavaScript, most of the work and effort in terms of internationalization was directed at very specific things, like supporting more writing systems, which is great. We need to support all the writing systems, but it was very limited in its scope and ambition. On the other hand, Intl grew and now has so many different features. Just to talk about the formatters: you can format all these different types of information, like numbers, including currencies and different kinds of numbers in different formats. You have dates and times that can be formatted in so many different ways. You have collation, and segmentation of text, which is something that is now supported by the Intl object, and then there is some selection.
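The formatter side of that list is directly observable in any modern JavaScript engine. A quick, hedged tour; the locales and sample values here are arbitrary choices for illustration, not taken from the talk:

```javascript
// Numbers, including currencies:
const price = new Intl.NumberFormat("de-DE", {
  style: "currency",
  currency: "EUR",
}).format(1234.5);

// Dates and times (time zone pinned so the output is stable):
const when = new Intl.DateTimeFormat("en-GB", {
  dateStyle: "long",
  timeZone: "UTC",
}).format(new Date(Date.UTC(2024, 1, 3)));

// Collation, i.e. locale-aware sorting:
const sorted = ["z", "a", "ä"].sort(new Intl.Collator("de").compare);

// Segmentation of text into words:
const words = [...new Intl.Segmenter("en", { granularity: "word" })
  .segment("Hello FOSDEM")]
  .filter((s) => s.isWordLike)
  .map((s) => s.segment);

console.log(price, when, sorted, words);
```

Each of these is a stable, shipped API; the missing piece the talk builds toward is the message-level layer that ties them together.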
So there's not a whole lot of selectors available to this date, but we have plural and ordinal selection. So if you have a number, you can basically select for the plural rules of any given language, which is cool, and gives us a lot of the building blocks that we need, but we're still missing important pieces. So, talking about the timeline of how we ended up with MessageFormat 2: the JavaScript group that was designing the internationalization API essentially realized they needed something of the sort. It was discussed over the course of many years, as you can see. It was iterated upon, but there was a general agreement among this group that message formatting on the web is something that is both needed and a necessary step to enable further use cases. Do you remember this slide? Like, this is not a great format, but that's what we had. So MessageFormat, the syntax and all the details of it, was standardized 20 years ago at this point, and it is not sufficient for the modern dynamic interfaces that we build, that JavaScript has enabled now. So why do we need a new thing? Well, first of all, as I just mentioned, we have so many more ways of interacting with our interfaces that just require new tooling, rethinking how the fundamentals work, and the outdated, outmoded tools don't exactly fit the bill. They are too imperative. They are too static to actually support our new use cases. There's also a sad lack of modularity and extensibility in the existing message format; a bit more on that later. And because it's a standard that is designed to be accessible to basically everyone who uses ICU, which, by the way, is a great thing, the unfortunate side effect is that you can't really deprecate anything. You can't just clean it up and move on without making a breaking change that would annoy every user. So there needed to be, kind of like there was a Python 3, I don't know, that's controversial, right?
But we needed a hard break, a MessageFormat 2 that would be designed from the ground up to deal with some of these things. And the diversity of locales that we know now makes it basically impossible to map any localization structures one to one. A great example is one that we just used: the plural selection example. Do you remember that? Yeah. English has two plural rules or modes, which is that things are either singular or plural. Zero is plural, for example. And that's simple, and that works for us. Well, Welsh, a language that is not so far away, you would assume, has five. Arabic has six, I think. I'm sorry if I'm... But you get the point, right? Japanese has one. When it comes to any of these things, you cannot map messages one to one. You need something that is more expressive than that. And basically, the design constraints of the old API made it very limited for the modern JavaScript ecosystem that we have. So not only do we need to take everything that the original MessageFormat did right and do it better, if possible, we also need to accommodate a lot of the innovation that was now happening outside of the standards space, right? In proprietary tooling or elsewhere. So we needed to really get our act together, and that was MessageFormat 2. So I am going to start on a quick and clumsy, as I said, intro of MessageFormat 2. Are you ready? Okay. Context setting done? Yeah. So for context... Well, okay, more context. A dynamic message string is a string that is not just a static string; as we perceive and use strings on the modern web, they can change. They can morph around each other. It's ridiculous, honestly. But the goal of MessageFormat 2 is to enable a lot of these complex use cases that we have covered while keeping the basics simple.
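As an aside, the plural-category differences just mentioned are observable today through `Intl.PluralRules`; the exact counts come from CLDR data and may differ slightly from the numbers quoted from memory in the talk:

```javascript
// CLDR plural categories, as exposed by Intl.PluralRules.
const en = new Intl.PluralRules("en");
const ar = new Intl.PluralRules("ar");
const ja = new Intl.PluralRules("ja");

// English cardinal plurals have two categories, Arabic six, Japanese one:
console.log(en.resolvedOptions().pluralCategories); // ["one", "other"]
console.log(ar.resolvedOptions().pluralCategories); // six entries
console.log(ja.resolvedOptions().pluralCategories); // ["other"]

// The same count selects different categories in different locales:
console.log(en.select(2)); // "other"
console.log(ar.select(2)); // "two"
```

This is exactly why a message with one plural branch per English category cannot be translated one to one: a translator targeting Arabic needs branches that the source message never had.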
Because at the same time, a very important goal is to lower the bar, to make it easier for anyone who at the moment, and I'm not saying incorrectly, rightly considers message formatting to be a complex thing to integrate. Lowering the bar would allow more people to experiment with it and to basically think of new solutions for their new, innovative problems. So, for example, it's text-mode first, which means that for very simple messages, it's very simple to get into. At the same time, it allows a lot of complex messages and expressions, which we'll get to. It also allows you to have declarations and annotations. So basically, we're getting dangerously into programming territory here. You can have variables and functions being called. And it's great, because you have this degree of expressivity that you didn't have before, right? And finally, talking of functions, there is the idea of extensibility and modularity. So now we have the concept of a function registry, which you can add to, and a bunch of built-in functions that you can easily utilize for common use cases. So yeah, quickly talking about the kinds of messages we have: we have simple messages, which is, well, a message. We can have expressions, where you can interpolate variables, kind of like you might be familiar with in JavaScript, but, yeah, different. Same but different. And then we have complex messages, like selectors here, where you can actually match on a particular value, kind of like a switch-case statement. So those are the various kinds of messages we have. Then, as I mentioned, you can call functions. So in this case, we're calling a formatter function to format a date, and then it's part of a larger message. There's support for markup elements. So if you have some messages that have markup built into them, a simple example would be text-based markup elements like bold and italic. More complicated things could be like this. And then there's support for declarations.
So you can have variables and play around with them. But there's more. There is an extensible function registry, which is a great thing. You can have your own functions. You can have private-use annotations as well. And there's support for popular built-in formatters and selectors, kind of like we already have in the JavaScript API, but specifically built into MessageFormat. This means dates and times, possibly durations; it's kind of not settled yet. There's support for formatting numbers and integers and selecting on them and matching them. Plural and ordinal rules are on the table, and possibly lists, formatting complex heterogeneous lists of objects. But yeah, as I mentioned, this is still something that is up in the air. This is one of the final pieces of the puzzle that we are yet to completely put in place. How do you do that? The point is that this needs to be settled. This needs feedback. This needs more data than we already have, because we know what is generally useful, but what mostly helps is getting actual data from people who use these things, or might want to use these things, about the things that matter to them. So that was part of the shtick. But basically, do you remember that slide? It's still here. This is how it looks now. So you can have a more complex matcher. And now it's matching on both the guest count and the pronouns of the host. And yeah, you have an expression. It's basically achieving the same thing, but with a cleaner syntax and hopefully something more manageable than what we had before. Why does any of this matter, you may ask? Where did I start? I kind of lost context. But as we said in the beginning, UI design has evolved a lot since the beginning of UIs. And the web platform was developed for one thing but ended up becoming essentially the most reliable, standardized way of deploying user interfaces. But a lot has happened since then.
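The updated slide isn't in the transcript either. As a sketch only: the MessageFormat 2 syntax was still being finalized at the time of this talk, but a two-selector matcher in the spirit of the draft syntax would look something like this (variable and function names are invented for illustration):

```
.match {$guestCount :number} {$pronouns :string}
one she  {{ {$host} invites one guest to her party. }}
one he   {{ {$host} invites one guest to his party. }}
one *    {{ {$host} invites one guest to their party. }}
*   she  {{ {$host} invites {$guestCount} guests to her party. }}
*   he   {{ {$host} invites {$guestCount} guests to his party. }}
*   *    {{ {$host} invites {$guestCount} guests to their party. }}
```

Compared to the nested ICU version, the selectors are declared once up front and each variant is a flat row, which is what the speaker means by a cleaner, more manageable syntax.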
For example, the UI space has done a lot of innovation around internationalization, around making more localizable UIs in a cleaner way, in a way that helps programmers and translators and everyone else, essentially in message formatting. On the other hand, the web platform has evolved substantially with JavaScript. Things are very different from what they were. To bridge this gap, however, to fill in the final piece of the puzzle, as I mentioned before, we need Intl.MessageFormat. Because we have developed a lot on the web, and interfaces are more complex and more dynamic than ever. At the same time, we have better tooling in every way when it concerns internationalization. But these two spaces have not yet benefited completely from the innovation in each other. So this is the idea. Not only is MessageFormat 2 built on top of a lot of the innovation that has happened within JavaScript, JavaScript is now also importing years of work that has gone into the internationalization space. So where is the message format in internationalization? That's where we were supposedly starting, but it got lost somehow. After talking about it in Unicode and coming to this format, we finally brought Intl.MessageFormat back to the committee, and it is now a proper proposal. So it's at stage 1, and hopefully it will reach stage 2 soon and could get deployed to various browser engines and non-browser engines. But it is built on top of the things that we know and that we have discovered, around familiar patterns that the internationalization built-ins use. For instance, formatToParts is this whole thing that we do in internationalization in JavaScript to allow people to have more control over their formatters, which is not really a concern outside the web, right? But that is a major design point for the proposal, among other things. This is how it looks. So you have an Intl constructor like any other.
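A sketch of what that constructor call might look like, going by the stage-1 proposal's README at the time; this is not shipped in any engine, and the exact shapes may well change:

```
// Hypothetical usage of the proposed Intl.MessageFormat.
// Note the argument order: message source first, then locale, then options.
const mf = new Intl.MessageFormat("Hello {$place}!", "en");
mf.format({ place: "FOSDEM" });
```

It follows familiar Intl conventions otherwise, including a formatToParts-style counterpart for callers that need structured output.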
But instead of what we do with most Intl constructors, where we have the locale at the beginning, here we have the message first, and then the locale and the options follow. And yeah, it works. This is very simple, but you get the point. I hope that this was convincing enough for you to feel that there is something important happening here. And if it is, then how can you get involved? Well, one thing I mentioned is the actual MessageFormat 2 syntax and data model and everything that is being standardized under Unicode. You can go to this repo and read all the issues, see what's been done, and give your feedback there. Help us out in any way you'd like. And then there is the JavaScript proposal. It's early-stage. It needs a lot of work. It needs a lot of feedback, and it needs people to be motivated about it, to write tests, to help us figure out the spec, to help us figure out the design details. And we'd really appreciate your help. You can find us, or me, on GitHub and Matrix and start from there. I'd be more than happy to guide you with this. And that's all.
All Things Astro
Hello everyone. So our next speaker is one of my very good friends, a BeJS team member and an Astro core team member as well. His name is Elian, and he's going to talk about, guess what... Astro. Hey, wasn't that a surprise? Alright, let me check that I'm not in the screen. Okay, hello everyone. Hope you're doing good. I'm doing good, I'm just a little bit tired. I just flew in from Poland over Zurich because I had a conference yesterday as well. So if I sometimes struggle with words, I'm sorry, I'm tired. Also, I'm in Astro core. Astro is this framework; I'll talk about that in a minute. But I'm also in the React Brussels and BeJS teams. I don't know if you've ever been to our conferences. That's here in Belgium. I was actually born here in Brussels and now living in Ghent. But also, those guys are the same ones that actually organized this dev room. So maybe let's give them a quick round of applause as well. Yes. And they actually both left, so they have no idea. But that's good. I also do my own meetups in Ghent. So if you live in Ghent or in Belgium overall, you're always welcome at our meetups. They're free. If you want to follow me after this, or want to ask some question that you didn't get time to, feel free to follow me online. It's @ElianCodes on all platforms. So that should be easy. Okay, let's address the elephant in the room. What is Astro? Who has heard of Astro? Oh, wow. That is a lot. I asked the same question yesterday; there were like three hands. Who has actually used Astro? Okay, that's also a lot. Who is on the latest release of Astro? Okay, still good, still good. And who is using Astro professionally? Nice. Okay. No, that's what I was expecting. That's fine. That's fine. Cool. Okay, so it's personal experience probably. Okay, that's good. Okay, cool. So we call Astro the framework for content-driven development. There are a couple of reasons that we say that, and I hope they will be clear to you after the talk.
See it as being a comparable framework to Next or Nuxt. It's a meta-framework, as we sometimes call them. There is a lot of discussion over whether we should call them meta-frameworks, but let's call it that for now. We can later discuss on Twitter, well, on X, whether it's actually called a meta-framework or not. This is what it looks like. This is the Astro syntax. Basically, everything that you want to write in JavaScript or in TypeScript, and we support TypeScript, goes in between the dashes at the top. That's always server-side. I'll explain that a little bit later. But it's a very familiar syntax. It's basically JavaScript at the top, or TypeScript if you prefer. And below, it's just JSX-like syntax. It's not really JSX. It looks like JSX, but you can use class. So it's an improved JSX. Why is it ideal for content-driven sites? Well, it is because it's better for SEO and for meta tags and all of that stuff, because we ship zero kilobytes of JavaScript by default. There are a few catches with that. We were one of the first frameworks to take this approach, but by now, we're surely not the only one. And sometimes a better tool fits a given use case better, or there is a different tool, and that's totally fine. If you want to discuss that, we can totally do that after this talk. Think of your traditional framework application approach. You write something in, let's say, Next.js or in Nuxt. It typically looks like this. It doesn't always; we now have React Server Components and stuff, but I'm not going to account for that. All of these components require JavaScript, or TypeScript that compiles to JavaScript. And that is actually really weird, because there is a couple of stuff here that is completely static and doesn't need JavaScript. For instance, the footer. It's just basic a tags, whatever. The header, maybe it's just an a tag that refers to your home page, or an image. Why do I need JavaScript to render an image? That doesn't make sense.
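The component syntax being described, sketched as a single `.astro` file; the fetch URL here is a placeholder, not from the talk:

```astro
---
// Everything between the `---` fences runs server-side (JS or TS).
const title = "Hello FOSDEM";
const items = await fetch("https://example.com/api").then((r) => r.json());
---
<!-- Below the fences: HTML with a JSX-like syntax, except `class` works -->
<h1 class="title">{title}</h1>
<ul>
  {items.map((item) => <li>{item.name}</li>)}
</ul>
```

None of the code in the fenced section ever ships to the browser; by default only the rendered HTML does.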
So what we do with Astro is basically compile it all down to static HTML, CSS, and JavaScript if you want to. More on that later. Basically, you have to remember: HTML first. So what if you need JavaScript? You probably want some interactivity, right? You probably want to add a button, a hamburger button, dropdowns, all of that stuff. What if you need interactivity? Well, of course, that is possible. We have a directive for that. It's called client. And that gives you a few options on how to control interactivity and tell the compiler when and how to hydrate components. I listed a couple; there are more. But I'm going to quickly just go over these. client:only is very easy. It just skips our compiler completely and ships JavaScript, as you would in React. And client:media will only hydrate a component when a given media query is met. Think of it like mobile-only buttons; hamburger buttons and all of those don't require JavaScript to render on the desktop side, because you don't even see them. We have client:idle, which will only hydrate components when the main thread is idle, when it's doing nothing. So basically, free for your CPU. client:load will just say, hey, I need JavaScript, send it to me. Then we also have a couple of others, like client:visible, which will only hydrate when a component is actually in the viewport. That makes sense. So what we can actually do in Astro, think of this as the basic HTML page I was talking about earlier: we can ship JavaScript to just a couple of components. Maybe an image slider, we need some things there. Maybe we need some header links, whatever, that are dynamic. We can do that. Of course, we are an open source thing, so you can build your own stuff. You can put that into Astro. And of course, you all know, as developers, if you let them free, they will come up with weird shit. One of those is the Astro "client when it's raining in New York" directive.
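The standard directives just listed look like this in a template; `Counter` is a hypothetical component used only for illustration:

```astro
---
import Counter from "../components/Counter.jsx"; // hypothetical component
---
<Counter client:load />    <!-- hydrate as soon as the page JS loads -->
<Counter client:idle />    <!-- hydrate when the main thread is idle -->
<Counter client:visible /> <!-- hydrate when it enters the viewport -->
<Counter client:media="(max-width: 768px)" /> <!-- e.g. mobile-only UI -->
<Counter client:only="react" /> <!-- skip server rendering entirely -->
```

Without any `client:*` directive, the same component renders to static HTML at build time and ships no JavaScript.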
This will basically, like it says in the name, hydrate your component, but only when it's raining in New York. Cool stuff. Ben built this. Ben of the Astro core team has an implementation to show off how it works. But it's possible. It's fun. It's cool. There is a lot of creativity to be explored here. We call that concept islands. Islands basically refers to a component that's completely isolated from your other components. But we come with one twist. We have seen the Astro syntax, but the components that you want on your client side, you can actually build in other frameworks. You say, add React to my Astro website, and then you can use React components inside of your Astro website. Or you want to use Vue, or you want to use Svelte, or maybe both of them together. That is possible. I won't say that it's a recommended thing to do. My thing disconnected here. Okay. It's not a recommended thing to do, but it is possible. But by default, without the client hydration, if you use a React component in Astro, it will still compile down to static HTML at build time. That's basically what makes Astro fast. There is, of course, a lot more. What I showed you now is basically only the static generation side of things. That's the default. But we have so much more. And just in 2023, that was a crazy year for us. We did a lot of stuff. We shipped three major versions. And we have reasons for that. And I'll go over them very quickly. I'll show you what we did and how we improved the life of Astro developers. So in January, I did my first real international Astro talk at JSworld. Amy, you were there, right? With Omar? Yes. We had just shipped Astro 2. Astro looked completely different from the Astro that it is now. We shipped more than just the features that I'm going to share, but basically, these are the important ones. We shipped the new CLI. Our CLI, I think, is crazy. It's crazy good. It's super clear. It's really easy.
We just ask you a couple of questions, and based on those questions, we set up a template for you. A couple of questions are, of course: do you plan to use TypeScript? Yes. What kind of TS config do you want? Do you want strict, strictest, loose default? Whatever you call it. You can do all of that. But also, since we are so open source minded, we have released that as a CLI library on its own. That's called Clack. That's built by Nate, one of our core members, built in a weekend. And now it's used in different projects, and it's actually amazing. Cool to see that there are a couple of different projects that came from Astro. We shipped content collections. That was actually one of the biggest ones. Content collections give you a type-safe way of working with Markdown, MDX, and all of the other Markdown flavors. Even Markdoc, for instance. This is probably very familiar to you. This is Zod. And Zod is this library that basically checks your types against a schema. That's what you do here. And because that's type-safe, we can also error-check way better, which I'm going to show you in a minute. This is how it looks. So you get all the IntelliSense goodies. You get all the auto-completion and all of that good stuff. We added hybrid rendering as well. And, as I was saying, it's super clear: you can instantly see what's wrong. In your blog, the astro-tutorial.mdx frontmatter does not match the collection schema. You instantly know what's wrong. What file is it? Oh, its title is required in astro-tutorial.mdx. You instantly know what's wrong and where it's wrong. You fix it, done. Then we launched Astro 3. I think that was in August, if I remember correctly. We shipped view transitions. View transitions are a super, super cool thing. Who has ever used view transitions? A couple of people, not too many. Who knows about view transitions? Okay, that's a couple more.
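As an aside on the content collections just described, a minimal schema sketch along the lines of the Astro 2 API; the collection name and fields are made-up examples:

```ts
// src/content/config.ts — hypothetical minimal content collection
import { defineCollection, z } from "astro:content";

const blog = defineCollection({
  schema: z.object({
    title: z.string(),           // a missing title becomes a build-time error
    pubDate: z.date(),
    tags: z.array(z.string()).optional(),
  }),
});

export const collections = { blog };
```

A Markdown file in `src/content/blog/` whose frontmatter doesn't satisfy this Zod schema fails the build with exactly the kind of file-and-field error message the talk shows.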
What's the reason that you didn't use them? Yell something. Time. Okay. Okay, yes, browser support. I was expecting that one. Yes, it's not supported by all browsers yet. But what we do with Astro is we polyfill a little, and then it works. At least the basics work. And what view transitions are, for the ones that didn't put their hand up, it actually looks like this. So Astro basically does this SSG MPA page thing. But with view transitions, you can make an MPA with basically all static HTML files feel like an SPA with client-side navigation, even though you're not shipping that to the browser. The browser will always do this on its own. Really simply explained: it takes a screenshot of your current page and a screenshot of your next page and transitions in between both of them. But you can do crazy shit with that, and I brought the demo with me. It's not built by me, but I have it with me. Can I do it like that? Okay, give me a second here. You can all see this? Okay, okay, okay. Switch page. Yes. So as I was saying, browser support is a hard thing, but you can do shit like this. So this is a multi-page application. Still, when I press North, look what happens. Okay, let me fix that. I wasn't expecting that to happen, actually. Will it work? Yes. Okay, now it's there. So if I go to the South page, it's basically south.html. Look what happens. All of that animation is coming from the browser. There's no client-side hydration happening here. This is insane. I don't know if you're as excited as I am. Yes, some people. Okay, okay. Not too much. It's fine. It's fine. But still, it works also with the Navigation API. So at the top, I don't know how well you know Arc, but at the top I have just the basic buttons, forward, backwards. That also should work. Yes. That's amazing. Okay, okay. Now let me go back to the presentation if I can get that back here. Okay, okay. There we are. And connecting. Yes.
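The setup behind a demo like this is, in the Astro 3 API of the time, roughly an import and a tag in the head of your layout (the component has been renamed in later Astro versions, so treat this as a period-accurate sketch):

```astro
---
import { ViewTransitions } from "astro:transitions";
---
<head>
  <ViewTransitions />
</head>
```

With that in a shared layout, every navigation between the static pages gets the animated transition, polyfilled where the browser lacks native support.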
The craziest thing about all of that is that for you as an end user, well, end users are typically the clients that use the website; I mean, as a developer that will use that feature, it's only two lines of code. It's really easy to implement, and we make it so easy for you to ensure that you have the best developer experience possible. A couple of other things: of course, if you build statically, you don't have middleware, you don't have all of this edge stuff. We added that as well. And the good thing is you can always create faster responses for your users anywhere in the world, wherever they are. But those are always the catchwords with edge stuff, right? It's also a little bit of a smaller runtime, so it's a little bit more difficult than that. But you get the point. Image optimizing. Images are hard. Can be hard. Can be really hard in the browser sometimes. What we did is we released a virtual module, which is astro:assets. And you basically just import your image, just like you would do with a component, then use it as a source, and it will automatically output an optimized WebP image. But of course, a lot of people came complaining and were like, where is picture? We need picture. We brought picture. And with that you can actually do formats. So if you want to use AVIF, because that's even faster and actually not supported in all browsers, but you have a fallback to WebP, which is supported in all browsers, then we'll take care of that for you. So it's really easy for you to define and optimize the small bits of your website that are lagging behind. Also, we did a major refactoring of our internals, the JSX internals. And because of that, we also got another 75% performance improvement, which is great. We also brought this. I don't know how many of you are familiar with fast refresh. It's amazing. If you don't see what's happening here, that's good, because then you're living a good life.
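As a quick aside on the astro:assets flow just described, a sketch of the two components; the image path and alt text are placeholders:

```astro
---
import { Image, Picture } from "astro:assets";
import hero from "../assets/hero.png"; // hypothetical local image
---
<!-- Image: outputs one optimized file (WebP by default) -->
<Image src={hero} alt="Conference hall" />

<!-- Picture: emits <source> sets per format, with automatic fallback -->
<Picture src={hero} formats={["avif", "webp"]} alt="Conference hall" />
```

The `formats` order matters: browsers pick the first source they support, so AVIF-capable browsers get the smaller file while the rest fall back to WebP.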
What actually happens is, has anyone ever built a dialog, for instance? You click on it, the dialog opens, then you change some text, and suddenly it's gone again, and you have to go through the whole flow again. That's the problem with state. What fast refresh does, for all JSX in our case, is it will actually retain the state. So while you're typing, the state will update, and you won't have to go through the flow all over again. So it's basically a quality-of-life upgrade for you as a developer. Page partials. It wasn't intentionally built for it, but of course we have all the htmx hype. And actually, this is possible now with Astro because of page partials. You just ship one thing: no html tag, no head tag, no body tag, just what you wrote in HTML, and that makes using htmx in Astro possible. Then we have Starlight. Who has heard of Starlight? Fewer people than Astro. Okay. But there were a lot of people for Astro. What is one thing that you can name about Astro that is good? Documentation. I knew you were going to say that. I just said it for you. Starlight is actually a, I want to say, theme slash library slash framework. It's basically a great theme for Astro. But one important thing is that it actually ships everything that we have learned from writing docs for Astro and brings that to a framework for other people. And I was actually talking backstage a little bit earlier with Nicholas, and he's using Starlight at work a lot and says it's amazing. You have all these built-in features that are taken care of for you, like the search. You can swap that for Pagefind or Algolia or anything you want. Really, it's very pluggable. It's really good. And of course, you have all the Astro goodies. You can use React, you can use Svelte, you can compile everything down the same way. You can do anything you want. But then we launched Astro 4. Astro 4 is cool. Why?
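As an aside on the page partials just mentioned, the opt-in is a single export in the page's frontmatter; the file path and markup here are made-up examples:

```astro
---
// src/pages/partials/result.astro — hypothetical htmx response target
export const partial = true; // strip the <html>/<head>/<body> wrapper
---
<p>Only this fragment is returned, ready to be swapped in by htmx.</p>
```

Because the response is a bare HTML fragment, a library like htmx can request this route and inject the result straight into an existing page.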
We have a dev toolbar now, and dev toolbars are something underrated sometimes. In our case, you can see your islands. You can see where your JSX is located. You click on the file, it will open. You can see that in this case it's not hydrated, or it is hydrated. What's the text? How does it work? You can see all of that just in the browser, without leaving the browser. But we also shipped accessibility tools. Accessibility is getting more and more important, and it is. And that's why we integrated that. So basically, you click on the audit tool, and it will tell you, oh, an image alt tag is missing; oh, these are misconfigured ARIA roles. All of that it will just show you. Really easy. But also, it's super pluggable. So, open source first: you can just write your own dev toolbar plugin and build it. For instance, we have the Astro Tailwind Config Viewer, which basically lets you see your whole Tailwind configuration inside of your Astro website, inside the dev toolbar. So basically, if you do this well, and there are a lot more features, you can actually just do everything inside the browser and never leave it, except for writing code. Then we built incremental content caching. A question I got yesterday, for instance, was: what if I want to use Astro with thousands of pages? Where are the pain points? And there are some, of course. If you want to use SSG and you're constantly pushing new files, then your build pipeline will just be very slow, because it's always building, and it's always building all of those pages, even though sometimes they never change. You change one file while building all the others. Basically, that's why. What incremental content caching does is it sees one file has changed and will only rebuild that file. That makes sense, right? With that, just for our own documentation, and it's still experimental but we tested it, of course, we had a performance gain for our documentation, which is like 3,000 pages, of 80%. That's a lot.
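Since the feature was experimental at the time, enabling it meant a config flag; this sketch uses the flag name as it appeared in the Astro 4 experimental options, which may since have changed or graduated:

```js
// astro.config.mjs — opting in to the experimental cache (sketch)
import { defineConfig } from "astro/config";

export default defineConfig({
  experimental: {
    contentCollectionCache: true, // only rebuild changed content entries
  },
});
```

With the cache on, unchanged content-collection pages are restored from the previous build instead of being re-rendered, which is where the quoted speedup on a large docs site comes from.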
The improvement is insanely good. And then we also moved our documentation to Starlight. Now it looks like this. I don't know if you've ever seen the previous one. It was also good, but it was also built in a kind of hacky way. We didn't have internationalization support before and such. We have all of that now, with Starlight, in the Astro docs. It's really great dogfooding for both projects. Then we announced the ecosystem fund. It's a really cool thing that I'm very proud of. What we do is we have dedicated the funding that we get, as in GitHub sponsors and things like that: we dedicated a hundred thousand dollars of that to give to other open source projects that are empowering Astro users. For instance, one of those that got a grant was Lucia Auth. If you've ever used Lucia, it's basically an authentication library. It's framework agnostic, but it also enables a lot of Astro users to build cool websites with authentication. And for that they deserve an award. Well, they deserve at least some money to keep working on it. For instance, we also gave 10,000 dollars to a theme builder. They create themes for Astro, and they put out like one theme per month or something. That means that a lot of users get drawn to Astro because there are so many themes. So that really makes it work. Of course, that's not all of it. This was basically a ramble of features and how they work. There is more, and there is more to come. And the question I always get is: but what is next? What is the next thing that we are going to ship? Well, I don't know. We have an open roadmap. So basically you decide. Our users decide. We have an open GitHub repository which is just a roadmap, and you can make an issue there. We'll comment on it, we'll discuss it, and then it gets into an RFC. It's accepted, and then we'll actually build the feature. And if you can help with that, that's awesome. Cool.
If you want to stay updated, you can go to astro.build, which is the website. If you want to join our Discord, where we are very active both in development but also in support and for questions you have that you can't pose here today, go there. There is probably someone super eager to help you out. That's astro.build/chat. And we also launched a newsletter, actually this week or last week, at astro.build/newsletter. Cool. Thank you. Questions, or is that another thing? If there are none, I did a good job. Did you try creating... does it hydrate only when it's raining in Brussels? Yeah. Because then it always hydrates. That would just be client-side. I didn't, but I should. You should. It would be easy. It's just an equals true. Big round of applause for Elio. Thank you.
How to Win 1st Place in the Kernel Patch Statistics - Tools and Workflows
First talk is by Uwe: how to win first place in the kernel patch statistics. Good morning. The soundcheck seems still good. I'll talk to you about how to get many patches into the kernel. The starter for the talk is the LWN patch statistics that are presented after each kernel release. But actually this shouldn't be your motivation to get patches into the kernel; this is just a nice side effect. But it was a good starter for the talk. First about me and my employer. I'm Uwe Kleine-König. I have worked at Pengutronix as a kernel engineer since 2008. I have several jobs in the kernel: I'm the PWM maintainer, but I have already contributed patches all over the kernel subsystems. You can reach me via IRC and PGP if you have questions after the talk. If you are interested in the tools I present: I didn't create a repository for them, so if you have questions or want to use the tools, just contact me. My email address isn't listed here, but you should be able to Google it. Pengutronix is a company that has existed a bit longer than I have been with them. We're doing embedded Linux consulting, mostly for German industrial customers. In the kernel, my colleagues and I are listed several times in the MAINTAINERS file. So we're working with our customers in the mainlining business; we're selling them that mainlining is a good idea. Yeah. So if you have a good idea of what to change in the kernel, this is the process you have to work with. You put your changes, in the end, into a mail and send it to the subsystem-specific mailing list. Then ideally you get prompt review by the maintainers who are responsible for the code. Then the patches are picked up and sent towards Linus Torvalds, who in the end creates a release from them. If you have a big series, the same things apply that you have to do for single patches, too. This is the usual, or a short, list of things you have to care for.
These are not very hard rules, but this is what I think is a sensible set. I use next as a base. linux-next is the integration tree for the upcoming kernel release. This is a good idea because if you send patches based on what is in Linus Torvalds's tree, you often get feedback that there is already some development happening and that your patch doesn't apply, so you have to rebase. If you use next, this is minimized. Even if you think you are a good kernel developer and you don't make beginners' mistakes, use checkpatch. This is a small Perl tool that catches the obvious errors you can make with your patches: you forgot your Signed-off-by, or there are spelling mistakes and such. It's much nicer to get these things said to you by checkpatch than if you send the patches out and people tell them to you. The same applies to build testing. Do build tests, ideally on several architectures, because even for trivial patches it's quite easy to break the build. The same reasoning as with checkpatch. For single patches, it's good to describe the change well. The idea is that you want the maintainer to understand your motivation and the things you are changing. You want to make it easy for them to apply the patches and to understand the benefit. This is even more important if you do massive patch sending, because you are adding much more burden to the maintainers. Also, address the right people. You don't want to miss the important people, obviously, but you also don't want to annoy the others. I once sent a 600-patch series to the kernel mailing list and several people were annoyed. Don't repeat that. For a big project, you have to pick something that applies to many drivers.
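The checkpatch and cross-architecture build checks described above amount to something like the following (to be run inside a kernel checkout; the cross-compiler prefixes are examples and depend on the toolchains you have installed):

```sh
# Check the top commit before sending it anywhere:
./scripts/checkpatch.pl -g HEAD

# Build-test on more than one architecture, since even trivial
# patches can break the build on an architecture you didn't try:
make ARCH=arm   CROSS_COMPILE=arm-linux-gnueabihf- allmodconfig all
make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-   allmodconfig all
```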
What I did in the past: the remove callback for SPI drivers returned an integer, but that value is ignored by the core, which resulted in many drivers returning an error code in the expectation that there is some error handling in the upper layers. That is wrong, and it resulted in several resource leaks. The same for platform devices. This is my current quest, which is a bit more massive because there are more than 2,000 platform drivers that I have to touch. I am approximately in the middle, so there are still a few more patches to come. I have a few further ideas, but I will come to those when I am done with this quest, because doing more than one such quest at a time is really hard. Usually it is not hard to find something new to patch. If you have touched all platform device drivers, you have seen quite some stuff, and there is always something you can fix. What is very helpful for generating the patches is the tool Coccinelle. It allows you to describe a patch in a very high-level form. For example, this is a small version of a semantic patch where I first try to identify platform drivers that have a remove function that does not return zero, which is the first step before converting them to return void. The syntax is just that you say: OK, I have any expression that is not zero, and I just want to patch that in all remove functions of a platform driver, changing the return value from that non-zero value to zero. This is just to find the drivers that are affected by the quest. It is very hard to create a Coccinelle patch that does the right thing for all drivers; there is always some handwork left, for example for indentation, which Coccinelle usually gets wrong. With Coccinelle, you then have a tree where all drivers are adapted in the end. If you have 2,000 affected files, you don't want to commit them by hand. You have to apply some shell scripting to make a commit for each file, which I think is the right thing.
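A heavily simplified sketch of the kind of semantic patch being described (this is not the speaker's actual rule; the real one additionally has to restrict the match to the remove callback registered in a platform_driver, and to exclude the literal zero case):

```cocci
// Hypothetical Coccinelle rule: inside a remove function taking a
// struct platform_device, rewrite a returned expression to 0.
@@
identifier fn;
expression e;
@@
int fn(struct platform_device *pdev)
{
	<...
-	return e;
+	return 0;
	...>
}
```

Running `spatch --sp-file rule.cocci --in-place drivers/` over the tree then produces the modified working tree that the per-file commit scripting operates on.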
Some maintainers prefer to convert all the drivers in their subsystem in a single patch. But at least for sending it out and for review, it's easier to have one patch per driver. What I then do is iterate over all changed files and commit each one. The challenge here is to pick the right subject prefix. In the first approach, I just put the file name there. Then I go over my branch several times and use git filter-branch to adapt the subject prefix. This depends on the subsystem and how they want it: whether they want a capital or a small letter here, and whether the separator is a colon or a hyphen. You have to check previous commits for the subsystem to get this right. I have a script that I keep in a scratch file; you see a short part of it, where for some common drivers I can adapt the subject prefix accordingly. This is much quicker than doing it by hand. Then here comes my usual workflow for formatting the patches into mails and sending them out. This is the usual format-patch call. I always put the patches in a sub-directory that I always call w; I don't know what it stands for. Then I have a script that I pass all my patches to; I'll come to that in a moment. I edit the cover letter, which is a quite important part of a patch series, where you have to describe the overall idea of what you want to do and show the benefit of the patch series. This is, I think, or I hope, the first thing that people will read about my patch series, so it has to be a good description to, again, make it easy for the maintainers to pick it up. Then I edit the list of recipients, add the recipients to the individual patches, and send it out. What is critical for tracking later: I note for every patch that I send out, in the commit, the message ID I used to send the patch out. This is important later: if a patch doesn't get applied, I can quickly find the conversation in my mail client to send a ping or to ask what's up.
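The iterate-over-changed-files-and-commit step described above can be sketched as a throwaway demo (the repository, file names and commit message are made up; in the real quest the first-pass subject prefix is then refined with git filter-branch):

```shell
#!/bin/sh
set -e
repo="${TMPDIR:-/tmp}/per-file-commit-demo"
rm -rf "$repo"
mkdir -p "$repo"
cd "$repo"
git init -q
git config user.email demo@example.org
git config user.name demo
printf 'old\n' > foo.c
printf 'old\n' > bar.c
git add -A
git commit -qm "baseline"
# pretend Coccinelle just rewrote both drivers in the working tree
printf 'new\n' > foo.c
printf 'new\n' > bar.c
# one commit per modified file, first-pass subject prefix = file name
for f in $(git diff --name-only); do
    git add -- "$f"
    git commit -qm "$f: Convert to void returning remove callback"
done
git log --oneline
```

This leaves one baseline commit plus one commit per changed file, each with the file name as its subject prefix.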
Then I put the commits I sent out in a dedicated branch near the top, to track all the patches I have already sent out. This is the L file. It is generated using the get_maintainer script, which helps you to identify the interested persons for a given patch. In the end it's a shell script. Usually you really have to adapt the list of people. For example, if I send out a patch series adapting several SPI drivers, I usually want the SPI maintainer to take the patch series as a whole. What get_maintainer gives you, however, is that for, in this case, the Atmel PWM driver, the Atmel maintainers are listed as contacts. What I do then here, well, one step back: this address-append is another script that takes a list of patches and adds the people listed with -t to the To: header in these patches, and the persons passed with option -c to Cc:. So what I usually do, this is a PWM series now, is I replace, with editor magic, all -t by -c, to first have them all on the Cc list, and then I change individual lines back to -t to address just the maintainer. And then I have a longer Vim command here to fix the syntax, because currently it doesn't work: I start a quote here, and I have the descriptions from get_maintainer here at the end, and the command above just throws away the parenthesized expression and adds the closing quote. Then I can execute it and have all the people in the right mails. What I'm also doing here is, for each patch, I add the cover letter, to ensure that each person or each list that gets a patch also receives the cover letter, to give the right context. This is also important if you have a patch series with dependencies, where I introduce a helper in the first patch and then it is used in the second.
It's a good idea to also at least carbon-copy the recipients of the second patch on the patch that introduces the helper, such that they can easily understand the second patch. Here is a short snippet of my git config which is important for sending out, or which I rely on. One is that I blind-carbon-copy myself on all patches, to make sure that I have all patches I send out in my index, to be able to reply to them later. The next one is a good idea if you use git send-email: it makes git send-email ask before sending out each mail. If you have a big folder of patches, you don't want to accidentally send it out, so this gives you a chance to look over the list of recipients again and maybe abort if there is a problem. This setting is important for the notes I added to the commits: if I rebase them to be included in my tracking branch, the information doesn't get lost, because the notes are copied on rebase. For sending patch series out and addressing the right people, it's beneficial to send one series per subsystem. That means not less: don't mix several subsystems in a single series, and also don't send several series with a similar or the same topic to the same subsystem. This is maybe a bit subjective. Some people, NetDev for example, say don't send big series: if you have, say, 30 patches, better use two or three series. It's a bit of experience to know this, but in general it's a good idea to do one series per subsystem. To save time and communication overhead, it's a good idea to be explicit about your expectations of how your patch series should be merged. For example, you can write into it: I expect this series to be taken by the SPI maintainer as a whole, even if there are maybe one or two patches which don't quite fit this topic. This isn't fixed and people can disagree, but it is better than getting no feedback, not getting your series applied, and then having to ask who will apply it.
So state your idea, such that people know what you think the best path should be. Another good idea is a slow start. What I mean by that is: if you have a patch quest and you have to address drivers in 50 subsystems, don't send them all out at once. Start with the first one, pick something actively maintained, and then take the feedback to improve what you send to the other subsystems. So first send out one, and then you can slowly increase your speed. The effect is that you get better descriptions: people ask questions about what they don't understand, and you can improve what you write to the next maintainers. Good. As I already presented, I have a branch for all patches in my quest. I base this on the latest -rc1 release. This is a bit smoother than basing it on next, where there's much more movement, and it's easier to rebase from one -rc1 to the next -rc1, because this is all linear and you know which patches are really in. Occasionally it happens that you get a patch into next and it is dropped again, and in such cases you would lose patches, because they fall out of your tracking branch as they appear to be included below, and then when you rebase the next time they are just missing. My tracking branch looks as follows: somewhere down below there is the -rc1 release, then I have all the patches I sent out, and the few top commits are a collection of the remaining drivers that I have to adapt. This is one commit for all remaining drivers. In this case there are two such commits, because some drivers are a bit more complicated: they are not correctly adapted by Coccinelle, so I track them separately to be able to take the necessary care. The top commit is where I rely on all platform drivers being converted, where I change the remove callback to actually return void, which is only possible after all changes are made. So it's the top commit, to keep the series bisectable.
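The git config snippet described a moment ago plausibly amounts to something like this (the Bcc address is a placeholder; the setting names are git's documented sendemail and notes options):

```ini
[sendemail]
	# Bcc myself so every outgoing patch lands in my own mail index
	bcc = me@example.org
	# ask for confirmation before each mail actually goes out
	confirm = always
[notes]
	# the ref holding the Message-Id annotations on commits
	rewriteRef = refs/notes/commits
[notes "rewrite"]
	# copy notes along when commits are rewritten by rebase
	rebase = true
```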
What is really useful is the --cherry command line parameter to git log, which marks all patches with a plus or an equals sign. The difference is that the patches marked with an equals sign are already included in the left-hand side of the expression here. So the mailbox patches were already applied in next, but not in the last -rc1 yet, so they are still included in my branch, and the macintosh patches are not yet included, so they get a plus. The work-in-progress patches obviously also have a plus, but that's less important. There's a similar option, --cherry-pick, which has the effect that it lists only the patches that are marked with a plus in this syntax. This is the one I usually go through if I want to track which patches need some more care, which need a ping to make the maintainer act on them. Below each patch I ideally have this marking I added, which I already talked about on an earlier slide, and with notmuch, which is a full-text mail indexer, it's quite easy to open a mailbox that contains the mail with the given message ID and all the mails in the same thread. So if I open the virtual mailbox (well, the thread belongs here actually, this is broken in a strange way), I see the patches I sent out, and in this case I can see: OK, there was no reply. Maybe it fell through the cracks at the maintainer, or I addressed the wrong person. In this case I see it's also nearly a month old, so maybe it's time to send a ping and ask: are there any problems, or what's the state of the series? It is very useful to have an easy connection from the git commit to your mail, and the notmuch integration of mutt really helps here. Occasionally it happens that you get feedback where you have to adapt things, where things are not so optimal. In this case b4 is a great tool that I really recommend using; even if you're not a maintainer, it's quite handy for collecting the Reviewed-by and Acked-by tags.
Occasionally it happens that you already did some restructuring of your branch, and then git range-diff is very useful: you can compare the two different histories, the one that you already adapted and the one recreated by b4 based on your previous submission. You can see the differences: where there are tags, where there are no tags, where you changed the code. This really helps to create a single series that has all the improvements you made on both sides. This is what I wanted to present to you. If you have questions, either here in the forum or later, don't hesitate to ask me, or after FOSDEM contact me by email or IRC and send me your questions. I'm happy to help you with your next quest. Thank you. We have time for questions. I don't know who was first. I was looking at my sent emails and saw that a lot of my patches were sent to you, because get_maintainer often collects your address. Is it a challenge for you now to deal with all these emails, given that get_maintainer, due to your many commits everywhere, will often collect your email address? This is indeed an effect I wasn't aware of. If you touched all 2,500 platform drivers, you'll get a massive amount of patches in the next few releases. It's not very helpful to send patches to a person who just did cleanup on the driver and doesn't have a real interest in it, which also applies to me: I don't have an interest in some obscure IDE driver that I only touched because it happens to be a platform driver and I changed the remove callback. On the other hand, it's also really hard to trim the list of people you get from get_maintainer. Don't hesitate to keep me on the list; I'm very good at ignoring emails. I just archive some, and it's quite usual, if you send patches to, say, 10 people, that you don't get feedback from at least 9 of them. So this is life, and I have a very big mailbox, but I usually can handle it. Thank you for your presentation.
You have described your send workflow, and you're not using b4. Have you talked to Konstantin, who develops b4, because you have some special needs about Cc and To handling and the cover letter and so on? No, I didn't. Mark already knows. I don't use b4 because with b4 you cannot individually change the recipients for the patches in a series. What I like to do, if I have a series that touches, again, SPI drivers, is that I don't want to send the patch touching the i.MX SPI driver to the Atmel SPI driver maintainer. So the list of persons is really hand-picked: which patch is sent to which parties. And with b4, at least the last time I checked, you can only define the recipients globally, so you have to send all patches to the same set of people. No, I didn't talk to Konstantin. I have little motivation to do that, because my workflow works, and I think it's a bit special to these big series. I'm not sure that there's a big benefit in extending b4 for that, because for most people what b4 does is the right thing, and the added flexibility for my use case results in a complication in tracking and usage for all people, which is questionable, I think.
Streamlining kernel hacking with mkosi-kernel
I'm very excited about this, because this is actually the tool that I've been using to build kernels for a while now, and it's made my life a lot easier. So thank you for that. Daan? Thank you. Yeah, so let's talk about kernel hacking. First a little bit about me. I'm Daan. I work at Meta on the Linux user space team, I'm a systemd maintainer, and I also maintain the tool that I'll be talking about today, which is mkosi. So, quick motivation for this talk. A little while ago I started looking into running systemd-journald, which I work on, for individual users instead of just on a per-system basis. But to make this work I actually needed a BPF feature, BPF tokens, that wasn't available yet. So I looked at the kernel source code and figured this is probably doable myself. So I got into kernel hacking. Once I had figured out the code, I wrote up my first patch. I of course had to test it, but there wasn't really a clear way, like: this is how you test your Linux kernel patch. So I started looking into what I could do. The first thing, of course, that I needed to fix: you can't test your compiled kernel on your host machine, because if it's broken then you suddenly lose your system. So you need a virtual machine or something, to avoid breaking your machine. I also wanted to make sure that this setup is quickly replicable to any different machine, because I started on my laptop, because that's what I do for systemd and it works great. But the kernel is quite a bit bigger than systemd, and it also compiles a lot slower. So I was quickly looking for a bigger machine with a lot more cores, so that my kernels could compile quicker. So it would be very nice if I could replicate the setup very quickly to another machine.
And ideally I'm not too reliant on whatever the host distribution of that machine is, because, well, I work at Meta and we can get very big beefy servers with a lot of cores to work on, but they might also be running some old version of CentOS without all the latest tools available. So ideally I still get those tools, but on the big beefy server with the old Linux distribution running. Of course I want it all to be fast, so that I have a quick turnaround time, so I can notice bugs, fix bugs, recompile everything and boot again without waiting too long. Everyone knows the xkcd with the compiling and the two dudes fighting; I wanted to avoid that. And then of course, when you hack on the kernel these days, it's not just the kernel that you're working on. There are very often some user space projects involved as well. A good example for file systems is xfstests, which is a separate project. I also wanted to be able to compile all those things and have them available in the virtual machine so that I can run them. So, because I work on systemd, and we use mkosi to do all of this for systemd, because systemd suffers from the same problems: you also can't really test systemd on your system, because if it's broken then you can't use your system anymore. So mkosi is basically my hammer, and kernel hacking is just another nail that I wanted to slam in. So what is mkosi specifically? It's a tool that Lennart Poettering developed to simplify his hacking on systemd. He had all the same issues, so he developed mkosi to fix them. What mkosi does is it builds you a Linux image. It invokes a package manager and installs packages.
It packages that up in one of various formats and then allows you to boot it either in a virtual machine or in a container, and then you can do whatever testing you want, and when you're done you just exit the virtual machine and it's like nothing ever happened. So mkosi has a general execution flow. Of course we have CLI options, configuration, all that. We install packages for the distribution; this is invoking dnf, apt, zypper, pacman for all the distributions that we support. Optionally we set up a boot loader and all that, if you're building a bootable disk image. We run various systemd tools that are helpful for configuring an image. If needed we build an initramfs; this is again when you're doing bootable stuff. We generate what's called a unified kernel image: this is the new systemd thing that allows you to combine the kernel command line, kernel image and initramfs all in a single file and then boot that in UEFI. Then we package up the entire thing as a disk image, and then optionally of course you can boot it in QEMU or in a container with systemd-nspawn. So how do you get started with mkosi? This is not the kernel-hacking-specific stuff yet; this is just if you want to make an image: you specify which distribution you want, you specify the packages you want, in this case the Linux kernel, and we're running on Arch. We have an autologin option to basically automatically get a root shell in the virtual machine, and then you say: I want to boot this in QEMU. That gives you something like this. We support this for Debian, CentOS, openSUSE, Arch, Fedora and Ubuntu, and there are a few other distributions, but they're all derivatives of these. So everything can be specified via the CLI, the settings as you can see here, but of course we also have configuration files. This is the systemd-style INI format that we all know and love.
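The invocation being described looks roughly like this (a plausible sketch of mkosi's CLI at the time of the talk, not copied from the slide):

```sh
# Build an Arch image containing a kernel, auto-log-in as root,
# and boot the result in QEMU:
mkosi --distribution arch \
      --package linux \
      --autologin \
      qemu
```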
So we more or less support the same stuff there, so you can also specify it all in the configuration file. Using mkosi for kernel development, and development in general: what I showed previously just installs packages from the distribution, and of course that doesn't really help us. We want to build stuff from source, either systemd or, in this case, the kernel. So you can specify a build script. The build script is responsible for building your software. Canonically we call this mkosi.build. When you define that, mkosi will pick it up, and it just contains the instructions to build your software: either make for the kernel or meson for systemd. You can specify build packages, which are just the packages that are needed to run the build script: compiler, build system and all that. You can specify a build directory so that everything is cached; this is important so that your incremental builds are fast. With the build directory we have the build cached, but we don't have the image cached yet, so we have the incremental setting for that, which will install all the packages once, cache the result, and then reuse that on the next builds, so that our image builds are fast as well. And then we have various settings that you can use to configure the image without invalidating the cache. So you can add extra files for testing, or to configure your shell in the image, or basically anything you might want that configures the environment to your liking. You can do that with the extra trees and the post-installation script, so that the testing environment is the way you want it. Whatever customization you want, you can pretty much do it. And then we have the runtime trees, for which we use virtiofsd, to mount extra directories into the virtual machine. So you can make the xfstests source code, for example, available for running xfstests, or you can make your home directory available in the VM if you want that. Whatever you want, with runtime trees.
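Pulled together, the settings just mentioned might look like this in an mkosi configuration file. This is a sketch: the setting names follow mkosi's documented INI format, but exact section placement varies between mkosi versions, so check the mkosi man page for your version.

```ini
[Distribution]
Distribution=fedora

[Content]
Packages=systemd,util-linux
# packages needed only to run mkosi.build (compiler, build system, ...)
BuildPackages=gcc,make

[Output]
# keep build artifacts around so incremental builds stay fast
BuildDirectory=build

[Host]
# cache the image with packages installed, reuse it on the next run
Incremental=yes
# mounted into the VM via virtiofsd at runtime, e.g. an xfstests checkout
RuntimeTrees=../xfstests
```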
You can modify the kernel command line in whatever way you want. And we want to specify the output format as a directory, so that we don't have to build a disk image but can just boot from the directory itself, also using virtiofsd. Why do we want to do that? Because it's faster; building a disk image takes time. And we're looking for this quick turnaround time, so we try to make everything go as fast as possible. So mkosi-kernel is really nothing more than an mkosi configuration in a separate repository that's specific to hacking on the kernel. We have a build script for the kernel, and then we have various other modules that are all just build scripts for user space projects related to kernel development. As of this moment we have, of course, a module for the kernel, and then we have other modules for btrfs-progs, because, well, I work at Meta and Meta works on btrfs; the Linux Test Project, which I added for Christian; and then some other testing projects like blktests and bpfilter, which is Quentin's project for hacking on firewalls, so I added that as well. You basically specify which modules you want, and then all of those get included. So getting started with mkosi-kernel more or less looks like this. You clone the repository. mkosi is pretty easy to install. You can also install it from your package manager, of course, but it's a pretty fast-moving project, so in this case we install it from source: you just clone the repository, you symlink the script to somewhere in your path, and that's all you need. You can then run it. By default, for mkosi-kernel, we download all the other tools we need on demand. So the only stuff it needs is Python and bubblewrap, and of course the package manager, and that's enough to get started.
Then we clone the mkosi-kernel repository, which contains the kernel configuration, a specific configuration, and then you can write a local configuration file that basically says which distribution you want to use to test, or to use with mkosi-kernel. We support Fedora, CentOS and Debian at this point, but it's easy to add more. The only thing that's distribution specific is basically which packages you need to do kernel development. So you just define the list of packages to build a kernel and to boot the system, and that's sufficient to add a new distribution. So it would be very easy to add Arch Linux here as well. And then, finally, we specify the modules, and we specify where our kernel sources live. This is what the BuildSources setting is for. Your kernel can be checked out anywhere on your system, and then you use the BuildSources setting to specify: here's the source location, and the target directory where it should be mounted when we run the build script. The target directory should always be kernel; the source directory can be anything, and it will be mounted in the right place. Then we run mkosi and it will do its thing. So, I hope this works with the internet here, but I made a video. This is with everything cached, as otherwise it would take a little bit too long for this talk. When we run mkosi qemu, we see the image is cached, and then we start running make. The kernel build is of course cached as well, otherwise it would take forever. So, not too much happening, but we get a new kernel image packaged up. Then mkosi does its thing, and then we boot, and then you're running in a VM that's running the kernel compiled from source, and you can do whatever testing you want, and then we shut down again. So, of course, to build the kernel we need a kernel configuration. We ship a default kernel config in mkosi-kernel itself.
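The local configuration walked through above might look like this (the file name and the relative path are illustrative; the path-colon-target syntax is mkosi's documented BuildSources form):

```ini
# mkosi.local.conf, next to the mkosi-kernel configuration
[Distribution]
Distribution=fedora

[Content]
# where my kernel checkout lives on the host, and where the build
# script expects it to be mounted: the target must be "kernel"
BuildSources=../linux:kernel
```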
This is just the minimal amount of stuff enabled to test various things, plus the necessary drivers to be able to boot in a virtual machine. So we keep the drivers to a minimum and the features to a maximum: anything that's related to kernel development can be enabled, so that it's available and you can use it for testing. We also enable a few debugging options so that it's easier to figure out what's going on. For example, on the kernel command line we configure it to panic on oops and things like that, so that when something goes wrong while you're testing you see it immediately and you don't have to dig through the kernel log to figure out whether something went wrong. We also allow configuring whether to build the kernel selftests if you want, and specifically which selftests: you can specify targets, or you can specify targets to skip — for example the BPF selftests, because those take absolutely forever to build. You can specify your own kconfig if you want, so you don't have to use mkosi-kernel's default one. And the interesting way we use this minimal config file is the `make alldefconfig` command, which basically says: take the config file we specify with KCONFIG_ALLCONFIG, use everything from that, and set every other kconfig option to its default value. So we specify what we want and give everything else a default value. And finally, while I said that mkosi can build an initramfs for you, building an initramfs is again more work, which means slower, which means slower turnaround time. So in this case, because we're building our own kernel anyway, we simply build the virtiofs driver right into the kernel, and that removes the need for an initramfs entirely, so we just skip that step completely. As I already mentioned, there are a few useful settings like RuntimeTrees and ExtraTrees to customize the image.
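The alldefconfig trick can be illustrated like this: the fragment below is a hypothetical minimal kconfig file of the kind described here; running `make KCONFIG_ALLCONFIG=fragment.config alldefconfig` would force these options on and let every other option fall back to its default. The specific options chosen are illustrative assumptions, not mkosi-kernel's actual config:

```ini
# Hypothetical kconfig fragment (option selection illustrative)
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_CONSOLE=y
# virtiofs built into the kernel, so no initramfs is needed to boot
CONFIG_VIRTIO_FS=y
# debugging aid so failures are visible immediately
CONFIG_PANIC_ON_OOPS=y
```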
Another one that's useful for filesystem development is QemuDrives, together with the QemuArgs setting. To add extra block devices to a VM with qemu you need both a drive, which is the host-facing side of it, and of course a device, which is the guest-facing side of it. mkosi can allocate the drive for you using a file that it creates itself on the filesystem and removes when the VM shuts down — that's what you can do with QemuDrives. You can specify the serial or the drive ID, you can specify the size, and you can specify all the extra qemu options you might want; in this case we specify that asynchronous I/O should be done using io_uring. And then of course you need to attach the drive to an actual qemu device, so in this case we specify an NVMe device, we give it a "btrfs" serial, and we specify that its drive should be "btrfs", which is the same as the ID we gave the drive. Like I said, we can configure the kernel command line, and if you want to do bootloader stuff — say you want to hack on the EFI stub code or anything related to that — you can also specify that we should boot in a UEFI environment. Usually what you do with qemu is use the -kernel, -append and -initrd arguments for kernel development, but when you start doing UEFI you might not have all of that available anymore. What mkosi does is set things up so that even when booting in a UEFI environment everything works the same: even though we might not directly use -kernel anymore and might be booting from a disk image, you can still append to the kernel command line and all of that is still supported. You can get some extra shells in the image as well: of course you get the serial console, but if you want extra shells you can get them with `mkosi ssh`.
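As a sketch of the drive/device pairing just described, under mkosi's [Host] section — the exact QemuDrives syntax varies between mkosi versions, so treat the field layout here as an assumption:

```ini
[Host]
# Host-facing side: mkosi creates a scratch file for the drive and
# removes it when the VM shuts down; async I/O via io_uring.
QemuDrives=btrfs:8G::aio=io_uring

# Guest-facing side: attach that drive as an NVMe device whose serial
# and drive ID match the ID given above.
QemuArgs=-device nvme,serial=btrfs,drive=btrfs
```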
You also have to enable the Ssh option to make sure the image gets configured for this, but we do that by default in mkosi-kernel. There's a very complicated diagram here that basically shows how we implement this in systemd, but the interesting thing about `mkosi ssh` is that you don't need your VM to have a configured network to be able to do this. For VMs there's an alternative socket family called AF_VSOCK, which allows for host-VM communication that doesn't rely on a network interface being up, running and configured. Using a bunch of new systemd features, what we're able to do is provision the virtual machine at runtime with your SSH public key, so we can put it in the authorized_keys file for the root user. Then, if there's a VSOCK device attached to the VM, in the next systemd release systemd will basically be able to automatically detect that a VSOCK device is attached, and if so it will generate a socket unit that runs sshd on port 22 of the AF_VSOCK family, and this allows you to connect to the VM over VSOCK from the host without needing a network. We also install a drop-in file for the host's SSH configuration — SSH now supports drop-in configuration — and we use an SSH proxy to basically take possession of the "unix/" and "vsock/" host name prefixes, so that you can use those to connect to VSOCK-enabled VMs. With all this set up, you can basically do `ssh vsock/<connection ID>` to connect to that specific virtual machine, all without going over the network. We don't use this stuff yet in mkosi — we have our own version, because the systemd side is very recent — but we'll be moving to this in the future once it's available everywhere.
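The generated socket unit could be sketched roughly like this — a hypothetical unit, not the one systemd actually ships, but it shows the AF_VSOCK idea: listen on VSOCK port 22 and spawn sshd per connection:

```ini
# sshd-vsock.socket -- illustrative sketch only
[Socket]
# systemd vsock address syntax is vsock:<CID>:<port>; an empty CID
# means "any CID", so this listens on VSOCK port 22.
ListenStream=vsock::22
Accept=yes

[Install]
WantedBy=sockets.target
```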
Running tests manually is all good and fine, but of course you want to move from manual testing to automated testing, so we support this as well. When you do automated testing you want to run the test and get an exit status. Usually this is very simple: you run the process in your shell or whatever, and you get the exit status from the kernel. When you run the test in a VM this gets a bit harder: there's not really an easy way to get the exit status of a process run in the VM and transfer it back to the host. If you're booting from a directory with virtiofs you can just write some files to the directory and retrieve all the information that way if you want, but if you're testing from a disk image, then you have to mount the disk image once the VM shuts down to access the information — and of course, to mount a disk image on Linux you need root privileges, so you have to start entering your password and so on; it all becomes a bit more complicated. So what we added instead, again using the VSOCK stuff, is a way for the VM to report its result when it shuts down. You use the two unit settings SuccessAction=exit and FailureAction=exit in a systemd unit; when that unit exits, systemd will shut the VM down, but it will use the sd_notify protocol — a systemd mechanism for sending notifications — to send the exit status over VSOCK from the VM to the host, and mkosi can pick up on this and exit with that exit status. So it seems pretty trivial to get the exit status, but there's a bit of work involved to get it out of the VM. And then of course we also want the logs. This isn't actually upstream yet, but we're looking to add another forwarding mode to systemd-journald so that, again using VSOCK, it can forward logs over an AF_VSOCK socket; then we can listen on the host, receive those forwarded logs with systemd-journal-remote, and write all the logs to a local
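The SuccessAction=/FailureAction= pattern might look roughly like this in a unit that runs the test suite — the unit name and path are made up for illustration:

```ini
# run-test.service -- illustrative sketch
[Unit]
Description=Run the kernel test suite
# When this unit finishes, shut the VM down; the exit status is
# propagated to the host via sd_notify over VSOCK.
SuccessAction=exit
FailureAction=exit

[Service]
Type=oneshot
ExecStart=/usr/local/bin/run-test.sh
```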
directory. That means we can access the logs on the host without needing root privileges: we don't have to mount the image, we just have the logs locally, we run journalctl on it, and we can see what went wrong with the test and debug further. Of course, I'm not the only project in this space — we do have some competition. The latest project in this space is virtme-ng, so I thought I'd mention it as well, because I don't want to claim everything for myself; there are more tools than just mkosi-kernel, so definitely take a look at virtme-ng too. virtme-ng is very focused on kernel development, so it has a lot more options to, for example, use the kernel from the host, and various other options, but it's very specific to kernel development. It also has its own init system that runs in the VM, which allows it to boot very fast, but you don't get a regular Linux system like you would otherwise — well, I don't want to claim that systemd is "regular", but you don't get systemd. So if you wanted to start doing stuff with devices or something like that, that definitely won't be running, so it gives you a somewhat more limited environment. Depending on what you're doing, one or the other might be more useful. That's more or less it on the comparison; if you want to know more about this, come talk to me afterwards and I can say a bit more about the differences between the two. Of course, I'll end with some reactions from users. As Christian already said, he is using it, which is very nice, and Josef from Meta, the btrfs maintainer, is also using it and is very happy with it. I hope it can be useful for more people than just them, so please give it a try, and I'm happy to answer any questions or implement more features if needed. Thanks for listening. Q: Hello, thanks for the talk. Two quick questions. First, what about cross-compiling?
A: That works. We don't have a specific environment variable in the build script yet that lets you specify a cross-compile, but we can simply add that. I already tried it just by hacking the build script and changing the architecture to compile for arm64, and that works. Christian — or I'm not sure who added it — also added support for compiling with LLVM if you want to. Q: And the second small question — maybe I missed this because I was late for the talk — what gets into the initramfs? A: mkosi-kernel by default doesn't boot with an initramfs; we do the virtiofs thing. If you do a disk image, then the initramfs is built with mkosi as well — I actually have another talk about this in the distributions devroom. But we just install regular RPM packages or whatever into the initramfs, and by default we copy all the kernel modules and firmware from the host. We have a suite of settings to include and exclude whatever you want, and we also have, like the other initramfs generators, an option to include everything that's loaded on the host. So you can configure a bit which firmware and drivers you want, and when you specify modules to be included, we pick up all their dependencies as well, so we make sure all of that is set up correctly and included. Q: I'm using the initramfs stuff — I'm building full images and I'm not using the qemu part, I'm using a different virtual machine manager — and it works really nicely, because the biggest thing for me was that it wasn't easy to build an initramfs, especially if you want to do it distro-independently, which was really annoying. Is this also useful if you want to run a mainline kernel on a new device where there's only some heavily patched vendor kernel? So you want to test whether your drivers work, and so you
need to test it, but you don't want to touch any non-volatile memory — you just want to start it somehow, with fastboot boot or something like that. A: Sorry, I don't think I completely heard the question. Q: This was all about mainline. If you want to test whether the kernel works on a new device where only vendor kernels are known to boot, you don't want to destroy the user space there; you want to test the kernel first, before you touch the user space, and you want to boot it only from RAM. Can it also be used in that way? A: It's very focused on virtual machines at the moment — well, mkosi-kernel specifically is — but mkosi can build images that you can then deploy on another device. You can run the stuff that's produced by mkosi on your laptop, or you can flash it to your disk and it will boot. But specifically booting without destroying the user space — we don't have anything specific to make that work. You could take the kernel produced and keep the user space the same, but it's not something I've really looked at before, so it probably won't work. Host: All right, I think if there are no more questions — thanks for your talk, and thanks for the tool.
Converting filesystems to support idmapped mounts
Hello, my name is Alex, I work for Canonical. I have the pleasure of working on the LXD project and doing a lot of container stuff in the kernel and user space. We have been working on this new stuff around idmapped mounts and support for them in some file systems, together with Stéphane and with Christian. So today I'm going to talk about the problems that we faced when we started to actually look into the network-based file systems and how to support idmapped mounts for them, because it's kind of hard sometimes. First of all, I'm not sure that everyone knows everything about this stuff, so I want to give some intro about how it currently works. And if anybody here was listening to our previous talk about the isolated user namespaces stuff, please forget that for the next 30 minutes, because that's a new feature; this talk is about the stable API that we have had in the kernel since, I guess, 5.11 or something. This is more about supporting more file systems; we don't do the isolated user namespace stuff here. First of all, we need to understand that we have three types of ID mappings in the kernel. The first one is the caller's ID mapping, which is effectively taken from the current user namespace. You can get the pointer to the user namespace from the struct cred, and you can get the pointer to the struct cred from the task_struct. So if you're calling any kind of syscall in the Linux kernel, you have a current task, and so you can get the current user namespace — we have a macro in the kernel for that. And even if you're not doing any kind of container stuff, even if you're not using user namespaces, you're always invisibly using this, because you're using the default mapping, which looks like "0 0 <big number>", where the big number is effectively the largest unsigned integer. And what does this mean? The first number is the user ID inside the user namespace.
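For reference, this is the /proc/<pid>/uid_map format those three numbers come from; the second line is a hypothetical container mapping added here for comparison:

```ini
# <ID inside the namespace> <ID outside the namespace> <length>
0 0 4294967295     # the default identity mapping
0 100000 65536     # e.g. a container: UID 0 inside = UID 100000 outside
```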
The second number is the user ID outside of the user namespace, and the third is effectively the length of the mapping. So this mapping is the identity mapping, which means that we effectively map zero to zero, one to one and so forth. The next thing that we have, when we are working with any kind of VFS stuff, is the filesystem's ID mapping. It's also represented as a user namespace, because it's the thing that we attach to the superblock of the filesystem. So when you create a new mount — let's say, for example, for an ext4 filesystem — you have a block device, you create a new mount, and if it is the first mount for this filesystem (not a bind mount, I mean), then the superblock gets allocated, and on the superblock structure we have a field called s_user_ns, and this field gets filled with the current user namespace. So when you do a mount, it takes the current user namespace from your current task and puts it into the superblock. That's the filesystem's ID mapping, which means that if you're, let's say, inside a container with some user namespace and you do a mount, your superblock will get this user namespace — effectively your container's user namespace. And that's pretty old stuff, actually, because I believe it dates from the beginning, when user namespaces were introduced many years ago. And the third thing we are talking about today is the mount's ID mapping. The mount's ID mapping is a slightly more high-level concept, because instead of being attached to the superblock, the ID mapping is attached to the mount. It means that you can, for example, create an ext4 filesystem on top of some block device, then do a bind mount, and you can do this bind mount with some ID mapping attached to it. And once you get any kind of I/O through this idmapped mount, you get an extra UID/GID translation layer inside the VFS — inside the generic VFS code.
And then all of that goes through the filesystem's ID mapping, and then all of that gets written to the disk. So that's how it works. It's important to mention that whenever you interact with the kernel from user space and use any syscalls like stat, getuid, or getsockopt — for instance with the SO_PEERCRED option, which allows you to get the PID, UID and GID of the peer socket — you get these values mapped in accordance with your current user namespace. So the caller's ID mapping is always taken into account, everywhere in the kernel. So yeah, those are effectively all the examples, and we also have the same in the /proc/<pid>/status file and all that stuff. So let's take a look at what happens when you, for example, call the getuid syscall, which is probably the simplest one. Inside the kernel we have a few helpers to convert between the user-space user ID that we work with in user space and the internal representation of the user ID inside the kernel, because inside the kernel we have two types: uid_t and kuid_t. uid_t is effectively the user-space one, because it's just a 32-bit thing. kuid_t is also a 32-bit thing — it's the same size, and usually they contain the same value — but kuid_t is the value that represents the user ID always in terms of the initial user namespace. Which means, for example, that if you are inside a container with a user namespace, and you have, let's say, user ID zero inside the container, and the corresponding user ID on the host is, say, 1000, then the kuid will always have the value 1000. But once you call the getuid syscall from the context of a task that runs inside the container — inside this user namespace — the function from_kuid_munged will be called. And the first argument of this function is the current user namespace, which is effectively the thing that represents the UID mapping.
And the second argument is the current UID, which will be the kuid_t value, which is equal to 1000. This function, from_kuid_munged, will try to effectively remap this host-visible value 1000 to the appropriate value inside this specific user namespace. It will be zero in our case, because, as I have explained, in this case we have a mapping of zero inside the container to 1000 on the host. And so you finally get zero, yeah? This function has a counterpart called from_kuid, and the difference between these two functions is that from_kuid is more of an internal one. If we fail to represent the internal kuid in terms of some user namespace's UID range, the from_kuid function returns -1, which means something is terribly wrong: we can't really represent that ID inside this user namespace — which is possible. For example, if you have a user namespace that maps only 1000 to zero, and you have a user ID of, say, 2000 on the host, you can't really represent that as any reasonable value inside, right? And if you call from_kuid, it will return -1. But from_kuid_munged does a trick: if from_kuid returns -1, it takes the overflow UID and returns that. That explains why we have this interesting behavior where, if you try to access, for example, a container's filesystem from the host — or from anywhere with another ID mapping — you see this strange "nobody" user. That's because this function is used everywhere: we can't really give user space this -1; user space always expects us to give a normal, reasonable user ID. And we also have a helper called make_kuid, which effectively does the opposite thing: it takes the user-space UID and creates the internal representation of it for the kernel. The same way, we need to plug the current user namespace — the current ID mapping — into this helper, together with the user-space value.
And that's what happens inside the setuid syscall. If you pass, let's say, the value one to that syscall inside the container, it goes like make_kuid(current_user_ns, 1). It goes to the UID map and tries to find what this one maps to, and if it fails to do that, then we get -EINVAL, and so setuid will not allow us to set this UID, because it's not mapped. But if you have a mapping like "0 1000 2", which means you have mapped zero and one, then it succeeds, because the kuid for that will be 1001 in the kernel, and everywhere it will be represented like that — until you do getuid or something like that. Now, what do we have for filesystems? For filesystems it's about the superblock ID mapping, right? We have two important helpers. One helper effectively takes the inode and gets the user-space-visible UID — the normal UID. This function is called i_uid_read, but in fact it is called on the write path. There is no mistake; that's perfectly fine, because we are reading the i_uid value from the inode — that's why it's "read", because we read this value from the inode. But of course it's called on the write path, because it's used when the filesystem driver wants to write the UID to disk or, let's say, send it over the wire, for network filesystems. We need to call this to get the properly remapped user ID that we can then send over the wire, put on the disk and forget. And we have a second helper called i_uid_write, which does the opposite: it takes the inode and the user-space-visible, normal, classical UID that we're supposed to work with, and does the same thing we saw in the setuid syscall — it calls make_kuid, but instead of taking the current user namespace, it takes the user namespace from the superblock, and the second argument is the value.
So, let's say you create a file on the filesystem from user ID one: it will take the value one and plug it in, and this kuid will be written into the inode's i_uid field. And finally we're getting to the point where we can look at the whole picture — how it works together with the mount's ID mapping. Okay, imagine that the caller's UID is 1000, and this caller wants to create a file on an idmapped mount, and we have these three ID mappings in place. We have the caller's ID mapping, which is what we have been discussing so far. We have the filesystem's ID mapping, which in this specific example is the identity mapping: zero maps to zero, one to one, two to two, and so on. And we have the new thing, the mount's ID mapping, which maps zero to 10,000 and has length 10,000. So we have 10,000 UIDs mapped with this shift — the second number is effectively the shift value: zero goes to 10,000, one goes to 10,001, and so on. And what happens in the kernel when we try to create the file? First of all, we create the internal representation for user ID 1000, which will be 11,000, right? A small remark: in the kernel, to be honest, we all the time work only with this kuid thing. So technically, when you call filesystem syscalls — let's say open with the O_CREAT flag — the first step is not actually going to happen, because we already have these values in the struct cred; but it's easier to think about it this way, just to understand how many different mappings we have in place, right? And the second step is that we need to apply the new concept, the mount's ID mapping: we take the mount's ID mapping and perform, effectively, the reverse operation.
We call from_kuid: we take the value that we got from the caller's ID mapping, and then we do this mapping in accordance with the mount's mapping definition. In this case, we remap the kuid 11,000, and what we get is 1000, right? Which is obvious. And then, once we want to create the file on the disk, we need to get the i_uid back, so we go through the filesystem's ID mapping, which is attached to the superblock, to get the i_uid that will be written on the disk. And in our case, fortunately, we have the identity filesystem ID mapping, which means user ID 1000 goes to 1000, and that's all. But let's think about another example. If we have, for example, a filesystem mapping like "u:0 k:1000" with a small range, we can fail to remap the value, because user ID 1000 is not in the range of this mapping; but with a mapping like "u:1000 k:0" we can remap it, because the corresponding user ID will be zero — in the first case, though, we can't. And what happens if the generic VFS code realizes that it cannot remap the value? It will give you the EOVERFLOW error. So that's one reason why you can get EOVERFLOW when you're working with ID mappings — but not only then. Even if you're not using idmapped mounts, just normal mounts, if you try, for example, to write to a mount from another user namespace, with another caller ID mapping that is incompatible, in terms of user ID ranges, with the mount's filesystem ID mapping, you can get this EOVERFLOW error. So that's really complicated behavior, but that's how it works; we have no alternatives, actually. So, you can create idmapped mounts using effectively these two options.
We already have a new feature that allows you to use the classical util-linux mount utility to create an idmapped mount, but in most distros I don't think it actually works right now, because it's too recent — it's like one year old or something like that. So I'm always using Christian's mount-idmapped utility to create idmapped mounts. Internally, it just uses the syscall called mount_setattr to set the ID mapping on the mount. And so you always need to specify this attribute with a user namespace file descriptor. At least these days, we always take the UID mappings and GID mappings from a user namespace, because for user namespaces we already have a way to set UID and GID mappings from user space, using the proc files — that's the reason. So currently we have support for all of these filesystems, but if you take a closer look at the list, you will notice that most of them are local ones: ext4, btrfs, XFS and so on. And recently we have been working with Christian and Stéphane on CephFS support. Christian did the major work a few years ago and created the first implementation, but unfortunately it got lost in discussions and wasn't merged, so I asked for permission to continue the work, because it was quite important for our container applications. I rebased the patches, and we also decided to use a slightly different approach to make it work — I will explain that a little later. So, starting from 6.7 you can use idmapped mounts with CephFS, and yeah, CephFS is the only network filesystem in this list. So, how do you port a filesystem? The very naive way to do that is to just go through the filesystem's code and find all the places where we have nop_mnt_idmap, which means the mount's ID mapping is not defined — there is no ID mapping.
You replace it with the idmap argument, which is passed to almost all the VFS API functions from the generic VFS code. And then you also replace current_fsuid(), which gives you the kuid of the current user, with mapped_fsuid(), which does the same but takes the ID mapping into account. And you also raise the FS_ALLOW_IDMAP flag on the filesystem's definition. But no, it's not that simple, because you need to be really, really careful with this stuff — otherwise you can really break things, or even open up some vulnerabilities or something like that. So for that reason, if you want to try porting some filesystem to support ID mappings, especially a network one, I would suggest you go through the code of ext4 as a really, really good example, because the ext4 filesystem is a very complex one with many features. For example, you can put overlayfs on top of it and use ext4 as one of the layers for overlayfs. And, for example, the rename callback in ext4 supports a really interesting rename mode called RENAME_WHITEOUT: usually, when you rename a file, it disappears from the old place and appears in the new place, right? But in this mode, at the old place where the file is supposed to disappear, it creates a so-called whiteout — effectively a character device with major and minor numbers zero. And that mode is enabled only when rename is called from overlayfs, and I guess it's only for that reason that this rename callback in the VFS takes the ID mapping as an argument, because in all the other filesystems, where we have no support for that, we can't really use this ID mapping in any case — we don't need one.
Yeah, you also need to pay attention to getattr, because getattr is what gets called in the filesystem driver when you call the stat syscall: it reads the attributes and fills the kstat structure in the kernel with all the data — size, user ID, GID and all that. And you will definitely need to take the ID mapping into account in this place, to get proper user IDs and GIDs reported to user space, right? There is also the permission callback, which effectively does all the Unix-style permission checking in the kernel, so you need to properly pass the ID mapping there as well. If the filesystem that you want to convert uses the generic permission helper, then you just need to pass the ID mapping, check that everything really works, and that's pretty much all. But sometimes that's not the case, because some filesystems — we'll see that later — use really, really weird machinery to check permissions. There's also the get ACL stuff, and that's pretty much all for the read code path. For the write path, the most important pieces are obviously the places where we create new inodes: mknod, symlink, mkdir, atomic_open and create. We need to take the ID mapping into account in all of these places, because we actually write the UIDs and GIDs. That's it. And setattr, which gets called from, for example, the chown syscall: as chown takes the user IDs and group IDs from user space, you need to properly remap them and write them to the attributes. So, for local filesystems, as I said, you really need to take ext4 or btrfs or something, carefully read the code, be absolutely sure that you understand how it works, and then go for the other filesystem that you want to support. Now, which problems can we have — and really do have?
First of all, some filesystems, especially network ones — obviously, network ones — do the permission checking on the server side, which is really bad, because what we want, idmapped mounts, is a local feature of the Linux kernel. We don't want to tell the filesystem's remote server to be aware of this crazy, interesting Linux-specific stuff, because theoretically the user may be on another operating system, right? So if the filesystem does some UID/GID-based permission checks on the server side, it means we would need to extend the on-wire protocol, pass all of this ID mapping stuff over the network, and write some logic there — so that usually doesn't work. Effectively the same goes for FUSE, which is not a network filesystem, but it's almost the same as the network ones, right? Because you have the user-space daemon and you have the kernel: the kernel is effectively the client, and the user-space daemon is effectively the filesystem. The client — the kernel — just takes the information from the syscall, does something with that information, produces a request, and sends it over the FUSE device, and the user space reads that. And so, if we want to do all the permission checks on the user-space side and we want to support idmapped mounts, we need to pass these ID mappings over — we need to extend the protocol that we use between user space and kernel space for FUSE, right?
Also, some file systems — this is also about FUSE, effectively — allow you to completely disable the standard permission hook, implementing it as an almost empty thing that just allows everything, and then do all the permission checks at the level of the inode operations. The problem — I remember seeing this while I was working on Ceph — is that in Ceph it's possible to set a configuration based on the path to a file and specify the UIDs and GIDs that are actually allowed to read a subdirectory. That means you get a combination of permission checking on the Linux kernel side and permission checking on the server side — the remote server with another kernel, which knows nothing about this stuff, right? And they do checks almost everywhere, even in lookup. Why is that bad for lookup? First of all, the lookup inode operation does not have an idmap argument, and it's not obvious why, but the reason is that the lookup operation is usually called from the slow lookup path in the kernel. If you have pre-cached dentries for some path, we won't go through this lookup callback; instead we just take the dentry. That means that if you have permission checks inside lookup, everything depends on whether you already have this dentry or not. If you don't, you go through lookup and do the permission checks. If the dentry is already cached for some reason — for example, because it was accessed from another mount by another user — then these permission checks simply won't happen, and that's bad, right? That's why, ideally, we want all the checks in one place.
And of course, some of you may say: okay, in this case we can do some permission checks in the d_revalidate helper, which always gets called to revalidate the dentry — but no, we don't want to do that, I guess. Also, a third case that I almost forgot: some file systems have a local feature, ideologically really close to what we have in Linux, that does UID/GID mappings at the level of the file system itself. That's also a problem, because I personally don't understand how to combine all of that together and make it work properly. In Ceph's case, what I found is that we effectively have a combination of the classical permission checks and the server-side checks. Speaking honestly, we decided to forget about that: we just decided that if someone uses ID-mapped mounts, we clearly say — okay, you don't want server-side permission checks in this case, just disable them, just trust the kernel, trust the client. Ceph really trusts the client anyway: if you have the key to interact with the MDS server, you can do anything. So there is no real reason to do additional checks, because if you do UID-based checks on the server side, the client can still present any UID, right? So it makes no sense to check that; this information is not trustworthy. We also have the lookup problem, which is okay, because it's only relevant when you have some additional configuration. And the last thing, I guess for historical reasons, is that Ceph uses current_fsuid() everywhere to get the current user ID.
But what we usually want is to take the credential structure from the file, because when you open a file descriptor, the credential structure from your current task gets stashed into the struct file. Then we expect that if you do, for example, a write or read syscall on that file descriptor, all the permission checks will be done relative to the credentials stashed on the file. You may ask why that's so important. It matters if you pass the file descriptor over a Unix socket, or if you open the file descriptor while privileged but then drop capabilities or do setuid or something, and effectively lose your privileges — that can be a problem. But, to be honest, I decided not to send fixes for that, because I don't want to break any real user space application; maybe someone relies on it. So that's technically not ideally correct, but you see. So, yeah, I've effectively covered that. What we decided to do: we just ignored these problems with the server-side permission checks, because we can't really do anything about them. And we were helped by the CephFS folks, the CephFS maintainers — thanks, by the way, to Venky Shankar and Xiubo Li for the help and the reviews, especially on the user space side, because I had to extend the on-wire CephFS protocol and add some extra UID and GID fields for the inode creation operations. Of course, all of that was done in a backward- and forward-compatible way, so as not to break anything. And what we are doing right now: we're currently working on FUSE — I have already sent a series of patches that enables ID-mapped mount support for FUSE.
Unfortunately, only for the mode where the default_permissions flag is set, because, as I said, if you have a FUSE mount without this flag, the permission callback is almost empty — it just allows everything — and in that case the FUSE file system expects user space to do all the permission checks, which is a problem because we can't handle that properly. Also, obviously, the FUSE protocol between user space and the kernel was extended to send these UIDs and GIDs over the wire, let's say. In addition to this series, I wanted to be absolutely sure that this really works properly, so I took three not-at-all-random file systems. fuse-overlayfs, as a good and relatively simple example — although for this specific case it's not simple at all. ceph-fuse, because I was already a little familiar with Ceph from working on it. And GlusterFS, which is the new one. For GlusterFS it's not an ideal implementation, because I unexpectedly found that GlusterFS also likes to do all the permission checks in user space by default. That was a bit painful, but I found a special configuration option that disables that and enables the default_permissions behavior for that file system, and that allows us to make it work. To do, in our plan: go further with the FUSE series to make it fully test-covered, to be absolutely sure that everything is fine, and then we want to convert 9pfs and virtiofs, which can be useful if you do some nesting stuff like a virtual machine with a shared directory from the host and then a container inside, for example — which is not a rare case. And yeah, that's all. Questions? Thank you. Hello, thank you for your talk. Are there any caveats with ID mappings and their interaction with LSMs? Like, if you're doing some checks in an LSM, what kind of UID do you get there?
Because I was confused. That's a good question, to be honest, because all of this ID-mapping work was done by Christian — thanks to him — because he did all of this great API work in the kernel, all of this preparation. I mean, the reason our isolated user space work, and how we managed to make it work with the file systems, became so small in terms of modified lines of code is that Christian did all of this crazy, complex, hard stuff in the kernel a few years ago. He effectively provided us with the two functions in the kernel that we can patch relatively easily, and so we get ID mappings supported for some new crazy case, right? And to be honest, I don't know much about LSMs, so I guess it should be integrated. Well, when I did the original work, I went through all of the LSMs. For example, LSMs like SELinux don't mess with UIDs and GIDs — they don't care about this at all — so most of these LSM functions don't get passed the path or UID and GID values at all. The only relevant hooks are things like security_file_open and so on. Then it's mostly Tomoyo and possibly some AppArmor stuff, and they are all patched to take the ID mapping into account. One caveat: I once tried to do some additional fixes inside Tomoyo itself, because it does kind of weird stuff, but the maintainer said no, we don't care. I mostly care about BPF LSMs, because the hook doesn't get the UID, but you can extract it from something. Oh yeah, they are aware of that — I talked to them. For example, if you write a BPF LSM, then in hooks like security_file_open you get the relevant ID mapping provided. And in other hooks, where you only have the inode, you don't have access — but there, for example, it's also not feasible.
No, there is no security hook in lookup, but there are certainly locations where we have security hooks — for example in the dentry cache — where you don't have any of that information available, and it's impossible to make that work. Like the lookup stuff you mentioned: there were two reasons why we didn't do it that way. First, in lookup you initialize an inode, and that always needs to take the global UID and GID into account — the ones you see everywhere — otherwise you end up with inode aliases, in a way, because you can't cache an inode per mount. That's one thing. The other thing is that lookup is called deep from within the dentry cache, which would have meant you suddenly have to pass mount information through the dentry cache, more or less, and that doesn't make any sense. Also, Al would have killed me. But that's another reason why we don't want this in these locations. For BPF LSMs, though, if they need that sort of information in specific hooks and it's doable, then we can easily extend the hooks — I don't have a problem with that; it's more of an LSM question whether they're ready to do it. I think for most LSM hooks it simply hasn't been done because the LSMs that implemented a specific hook didn't want this information, so it didn't make sense to provide it. If you have an LSM that wants this information, it's easy to extend. Well, I think the other point is that the LSMs should use the credentials behind the file; it's always tricky when you provide a policy from user space based on the current ID — you don't need to translate it for the LSM. Question? Yeah, you mentioned NFS real quick. How does it work with NFS? If I remember correctly, there's an upcall through the Linux keyring, right? So you get the translated...
What is Linux kernel keystore and why you should use it in your next application
All right, so the next talk is going to be about the Linux kernel keystore and why you should be using it in your next application. Thank you. Hello, my name is Ignat. I work for Cloudflare, and today we're going to talk about the Linux keystore. By the way, how many people here know that Linux has a keystore? Cool, many hands. James earlier showed us that it has a keystore, but probably not everyone knows that Linux actually has one. So, a little bit about myself. I do Linux at Cloudflare. I'm passionate about system security and performance. I like low-level programming: Linux, bootloaders, drivers and other stuff written in scary, unsafe languages. And I'm a die-hard Linux fan — that's why I'm presenting from a Mac. And probably, like most of you here, I'm a fugitive programmer, because the NSA banned writing in C and C++ in enterprises. Why is that? There are many reasons, but one of them concerns application keys in memory. By the way, here is the memo where the NSA recommends that organizations use memory-safe languages whenever possible. So what is the problem with application keys? By keys we mean cryptographic keys, right? To dig into that, let's review the Linux address space isolation concept. You have many processes running on your system, because Linux is a multi-threaded, multi-process system. But what do these processes have inside? Usually it's your code — compiled code, your business logic — some shared libraries, if your application uses them, and some data, like global data and the stack. I have the stack drawn separately, so it's data — heap and global variables — plus stacks, right? And then you have the kernel — everything runs on the kernel. In the kernel you also have the core code, static and dynamic data, and the drivers, which you load as modules.
And you also have a stack, or stacks if you have different threads, right? The idea of address spaces is that within each process, and even within the kernel, everything can access everything — it's one global space — whereas you can't access the memory of another process from one process, and you also can't access the memory of the kernel. It's separated. This is Linux address space isolation. If we zoom into one of the processes, let's review what can be in your data. It can be some internal state — applications can keep internal state in global variables. Your process can have user or customer data if it processes external inputs. And the most important thing: cryptographic keys. If your application does some level of encryption, it probably has keys in its address space. What if your application suddenly becomes compromised, either through your main application logic or through a library? Because it's all the same address space, it means your whole data section is compromised, right? But not all data is created equal. If your application's internal state is compromised, well, it can be good or bad — it depends on your logic. It can be bad if the attacker gains control of data that can, for example, change the control flow of your application: if you're verifying a password, they can flip true to false, or set some authenticated flag — that can be bad. Sometimes it's not as bad, if your application is simple, but it can lead to further compromise. If your user or customer data is compromised, it's much, much worse. Yesterday someone also mentioned Equifax, my favorite company.
Yeah, if your user or customer data leaks, it's a big problem, because it creates a lot of pressure on the company and you have to pay a lot of fines — it's very, very bad, but still more or less recoverable. Equifax is still in business to this day, unfortunately. But what about cryptographic key compromise? This is total game over, right? If your identity key is leaked, anyone can be you. If your main data encryption key is leaked, everyone knows your data. So it's a data integrity compromise, a full security compromise and a total identity takeover. So what are, at the thousand-foot view, the ways you can leak your application keys? First of all, untrusted inputs and out-of-bounds memory access. Imagine you have stuff written somewhere in your memory, and near that stuff, in the same memory, you can have a cryptographic key. Normal application logic should allow you to read only the stuff — but, as happened in Heartbleed, if you can make the application read past the buffer boundary, you can also read the cryptographic key, right? That's what happened with Heartbleed. Everyone remembers Heartbleed. If your application has arbitrary remote code execution, what else is there to discuss — it's game over. The attacker controls the execution of your binary, and since everything is in the same process address space, they can read everything and likewise write everything. Not much to discuss there; a recent example was Log4Shell. Everyone remembers Log4Shell. Who patched Log4Shell? I should have asked yesterday in the Java room, right? Stale buffer reuse can also leak a key. For example — this is, of course, a simplified program, specifically tailored to leak the key — but it illustrates the point.
So, for example, it has two functions, encrypt and log — and, oh no, we forgot to initialize the logging message buffer in the log function. If you actually execute it, you will see that it really does leak the cryptographic key. What happens is: you have the process's thread stack and your main logic. You call, for example, the decrypt or encrypt function, which gets the key from somewhere and may put it on the stack, depending on the implementation. The function exits, but if it doesn't clean up the stack, the next function's frame lands on top of the key and effectively has access to that cryptographic material, right? This is why all the compliance and security folks will tell you to always zero memory after key use — you have to clean up. Which is hard to do in many high-level programming languages, especially garbage-collected ones, right? Finally, you have the debugging tools. Logging can accidentally leak your keys, and core dumps, GDB, ptrace — everything that can access the memory of the application — can leak a secret. Well, let's just make our applications not crash and fix all the problems, right? We obviously can't fix all the bugs, so we have to do something about it. We probably can't make a completely secure application, but what can we do specifically for cryptographic keys? Because they are the highest-value data in our process address space. What some applications do is try to leverage the operating system's address space isolation: they create another process, which has a different data section, move the cryptographic keys over to that process, and write some very basic, very simple cryptographic logic — which is unlikely to have bugs — to handle these keys on behalf of the main process.
Then you create some kind of well-defined, tightened interface between the two processes, right? We call it the key agent model. You have two processes, the main process and the helper agent. The main process does not have the cryptographic material in its address space, and it communicates with the agent through a well-defined interface to perform cryptographic operations on its behalf. The agent usually doesn't process untrusted input — it's not connected to the network — and usually more scrutiny goes into its review. Some examples of this we all use every day: who here uses SSH? Who here doesn't use ssh-agent? You don't? Yeah. So ssh-agent, gpg-agent, stuff like that. But there are drawbacks to this approach. We need to develop and maintain two programs. We need to design this well-defined interface. We need to add communication — we need to think about how these processes talk: should we use a Unix socket, shared memory, something else, HTTP? And it's probably good to somehow authenticate the main process to the agent: if the agent is this thing that performs cryptographic operations, we don't want just anything on our system talking to it and being able to make signatures with our keys. This is where we get to the Linux kernel keystore. The official name is the Linux kernel key retention service. I call it the keystore. Some people say it's a keyring, but actually the keystore has many keyrings, so I think keystore is the most applicable term. What it does is basically take this agent model and, instead of process two, replace it with the kernel, right? And the well-defined interface is just system calls. Easy. So, in a nutshell, the Linux kernel key retention service stores cryptographic keys as kernel objects. And this gives us some flexibility. It was actually initially designed to share keys with kernel services themselves.
For disk encryption, for example, you pass a key to the kernel and the kernel uses it. But eventually it was extended to user space. The advantages: keys are now stored outside the process address space; you already have a well-defined system call interface to access and use the keys; and keys become kernel objects, so you can have associated access control lists and permission checks, like you have on files and other kernel objects. The nice thing about it is that the key life cycle can be implicitly bound to the process life cycle — for example, securely deleting a key even if the process terminates abruptly. And for a kernel feature, it surprisingly has quite good documentation. So what does the keystore look like? It's a collection of keyrings and keys. A keyring can have links to other keyrings and to keys, so you get a tree-like structure. Keys are just objects that contain actual cryptographic material, or a pointer to it. They can be read and written and used to perform cryptographic operations. There are several key types, which I'll get to later: user, logon, asymmetric, encrypted and trusted keys. They're kind of similar to files in a file system, but unlike a file, which can be in only one directory — if you don't count weird bind mounts or hard links — keys can be part of many keyrings at once. Keyrings are collections of links to keys, and they basically enforce the life cycle of a key: if a particular key is not linked to any keyring, it gets automatically destroyed. Keyrings can be explicitly created, or implicit special ones: thread, process, user and session. They enforce the key lifetime, and they're kind of similar to a directory in a file system. So let's see an example. By the way, all the examples I'm showing I copied from a real terminal.
So it's a demo which doesn't fail. In this example I'm creating a new keyring and linking it to my implicit user keyring. Each key or keyring is designated by a serial number, which you can see — it's a unique number for the object inside the kernel. Once I've created the keyring, I can add a key there, with the secret contents hunter2, to my keyring. Then keyctl show displays my keyring and key tree: we have the session ring, the user ring, my ring and my key there. You can see that the serial numbers match what we just created. Also, because I just created the key, I have access to it, so I can read the cryptographic material back and get the secret. One way you can use this is secret sharing between two users. You have Alice and Bob, two users on the system, and you may notice they have nothing in common: separate groups, separate IDs, everything separate — no common groups or permissions. Alice can create a secret, hunter2, and put it in her user keyring. Bob, for example, can create a new keyring for others — a recipient keyring — and set permissions on that keyring so that everyone is allowed to write to it; write means adding links to other keys. Then, if Bob communicates the serial number to Alice, Alice can just move the key to Bob's keyring, and we now see that Alice no longer has the key in her possession, and Bob can now read the cryptographic material, because Bob possesses the key. Simple. There are special keyring types, and these special keyring types determine the life cycle of a keyring. There are session keyrings, which are available to the current process and all its children.
For example, if you are systemd and you put a key in the session keyring, it will be available to every process on the system spawned by systemd. The process keyring is private to a particular process: every process has its own implicit keyring, which it can use to store process-specific credentials. There is also a thread keyring, which is specific to a particular thread. Say you write a web server which serves several websites, and each website has a different TLS key. If you serve one website per thread, you can securely store the TLS key for that thread — for that website — without other threads even having access to that key, which is really cool. There are also user keyrings, which are bound to the life cycle of a user: a keyring shared between all processes with the same user ID. And there is a user session keyring, which is similar to the user one but not important in this context. There is also a type called persistent keyrings, whose name is a little confusing, because they don't actually persist keys on disk — nothing to do with that. It's just that the life cycle of these keyrings is different: they're not bound to a process or a user, they're time-bound. If you don't access the keyring within a timeout, it gets automatically destroyed. This is useful, for example, in cron jobs, where you can't really bind a keyring to a user, because that user appears and disappears from the system, but you can put a time bound on it: while your cron job is running, your keyring will be available, and if for some reason your cron job stops running, the key will eventually be destroyed. So let's see a session keyring example. Let me add my favorite hunter2 secret to my session keyring. And imagine I'm on an SSH session to this particular machine.
I can see that my key exists, I can see its ID, and it's linked to the session keyring. What I can do now, in another terminal, is put a BPF probe on the user_destroy function, which is responsible for securely destroying keys in the kernel keystore. If I now just exit my SSH session — I log out — I can see that the probe fires and my key was automatically destroyed: my session ended, so my session keyring got destroyed, and all the keys linked to it were automatically destroyed as well. And if I log back in, I can see that my session keyring changed — it was destroyed and recreated automatically — and I don't have the key anymore. So what this gives you: if you select the appropriate keyring type, you can ensure that keys will be securely destroyed when no longer needed, and you don't have to explicitly clear the memory — it happens for you. For example, if you bind a key to a process keyring and the process dies, the key gets destroyed — regardless of how the process dies: successful exit, crash, core dump, whatever, the keys will be gone. Okay, now let's consider the different key types. We've covered the keyring types; of the key types, the simplest is the user key, which we just saw: you have the cryptographic material, you put it inside the kernel, and then eventually either this process or another process with the relevant permissions can read the secret back. There is also a special type called the logon key, which you can put inside the kernel but can never read back. This type is primarily used to share secrets with the kernel, for disk encryption or eCryptFS. On a relatively recent Linux distribution, if you dump your dm-crypt setup, you will see that some of your keys actually come from the kernel keyring, instead of seeing the key bytes directly.
There is also an asymmetric key type, which currently only supports RSA. You put an RSA key inside the kernel, and while you technically can't read it back, you can perform operations with it: you can instruct the kernel to sign data or decrypt something with the key. Here is a simple example with OpenSSL: we generate an RSA private key. The kernel understands only the PKCS#8 format for unencrypted private keys, so we have to convert it to PKCS#8, and then we can add it to the kernel, ask the kernel to sign something, and then verify with OpenSSL that the signature is valid. This is very useful. Everything I'm describing today, and more, is described in a Cloudflare blog post, where we have an example in which we completely replace the SSH agent — it's a proof-of-concept patch, but we patched OpenSSH and replaced the SSH agent with the kernel keystore. Instead of ssh-add you run our bash script, which puts your private SSH key into the kernel keystore, and if you run the patched SSH client, it works the same as if it were communicating with an agent — but you don't need any agent running on the system. Cool, this is all well and good, this is how you can use it, and surprisingly the keystore can be very useful as a big corporate key management building block. But the question remains: in all the previous examples you just saw, we still need to put the keys into the kernel. We don't want the secrets to be in the application address space, but we still need the application to put them inside the kernel, so even if the application cleans up after itself, there is a small window of opportunity where the application has the plaintext secret in its address space. So how can we provision application keys without the cryptographic material ever being exposed to user space at all?
For this we have two other interesting key types. One is called the encrypted key: in this case the process has not the plaintext key material but key material encrypted with some other key, and the kernel has the wrapping key. When the process inserts that key into the kernel, the kernel automatically unwraps it, and if we try to read it back, it gets automatically wrapped by the kernel again. But here we have a chicken-and-egg problem: how do you provision the wrapping key, right? So, as James showed earlier today in his demo, you can replace this with a TPM, and then you have a thing called a trusted key. Again you have a wrapped key, but wrapped to a particular TPM: you insert it into the kernel and the TPM automatically unwraps it, and again, if you read it back, it comes out wrapped. But this scheme is not really great, because, as James mentioned, TPMs are slow and there is only so much you can do with these operations — if you have thousands of keys, you don't want to continuously poke the TPM to unwrap them. So you can do a combined approach: you have some kind of provisioner — some HSM in the cloud or on-prem, whatever manages your cryptographic keys — and you provision a root key first. You wrap the root key to a particular machine, to its TPM, then insert it, and the TPM unwraps it. All the other thousand keys are encrypted with this root key, so a process receives a wrapped key and puts it inside the kernel, and then you don't go to the TPM: you already have the root key, which, being a software implementation, can easily unwrap all the other thousand keys.
But there are still problems with this approach. Even though the application never sees the cryptographic material in its process address space, applications are still responsible for receiving this wrapped cryptographic material from the centralized KMS or HSM service that wraps their keys. Who here uses Vault? Yeah, some people, right? So you need to know what your Vault endpoint address is, you need to speak the Vault protocol or the AWS KMS protocol, you need to integrate all of this into your code. And there is little administrative control over the created kernel key objects if you're managing a fleet of machines: applications can set invalid permissions when inserting a key. For example, if you set improper permissions on your RSA private key, any application on your system, even a malicious one, can use it to encrypt or sign data. And ideally you also want authentication here: the KMS or HSM, that remote service, needs to somehow authenticate each requesting application before it provides the wrapped cryptographic material. So how does the kernel try to solve that problem? It has two sets of system calls. So far we've been using the add_key system call via the keyctl utility: it adds a key with the specified payload to the specified keyring. The application is responsible for the payload itself, either plaintext or, in the case of a trusted or encrypted key, the encrypted payload: it gets it from somewhere and inserts it into the kernel. The payload is interpreted according to the key type. No interpretation happens for user and logon keys, because those are mostly symmetric keys which are random strings; it's a private/public key for asymmetric crypto, or a wrapped blob for encrypted and trusted keys.
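The permissions problem mentioned above is visible with keyctl itself. A sketch, assuming the keyutils package is installed:

```shell
# Insert a symmetric secret as a "user" key into the session keyring.
keyid=$(keyctl add user mysecret hunter2 @s)

# Inspect the permission mask; the nibble groups cover possessor,
# user, group and other.
keyctl describe "$keyid"

# Tighten the mask so only the possessor may view, read or use the
# key: 0x3f000000 grants all permissions to the possessor and none
# to user, group or other.
keyctl setperm "$keyid" 0x3f000000
```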
But there is another interesting API in the kernel called request_key. Instead of inserting the payload directly, applications can ask the kernel: just give me my key, passing an arbitrary string as an identifier. It's then on the kernel to satisfy that request, and obviously the kernel has no idea of your setup, like where it should take the key from, so this is one of the places where the kernel makes a user-space callback to a special helper program, which you can then configure to actually deliver your keys. It's a more centralized and transparent API. So how it works: instead of adding a key, the process requests the key from the kernel and provides the identifier, like "give me my cloud app key one". The kernel creates a placeholder, then it spawns a special callout helper process in user space called request-key. This one you can configure, and you can specify different routes for different key descriptions: for example, if I requested the cloud app key one, it will go to the cloud sub-handler. You can write these handlers in any programming language, by the way; it doesn't have to be C. You can write them in Go, or they can be simple bash scripts, which are basically responsible, if the path is "cloud", for contacting your cloud HSM, getting the wrapped cryptographic material, and putting it back inside the kernel. The kernel will then instantiate the key, and the application gets its key back.
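The routing the speaker describes is configured in /etc/request-key.conf (or a file under /etc/request-key.d/); each line maps an operation, key type, and description pattern to a helper program. A sketch of what such lines might look like, where the "cloud:"/"local:" description prefixes and the helper paths are illustrative, not from the talk:

```
#OP     TYPE    DESCRIPTION  CALLOUT-INFO  PROGRAM ARG1 ARG2 ...
create  user    cloud:*      *             /usr/local/sbin/cloud-key-helper %k %d
create  user    local:*      *             /usr/local/sbin/local-key-helper %k %d
negate  *       *            *             /bin/keyctl negate %k 30 %S
```

Here %k expands to the key ID being instantiated, %d to the key description, and %S to the requestor's session keyring, per the request-key.conf(5) format.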
So, the advantages of request_key: you have a single, centralized operating-system API for requesting keys from applications. There are no KMS or HSM connection strings or URLs in your configuration, just a freeform ID string, and your application is fully decoupled from the key storage backend: it doesn't care where the keys are stored or how they are distributed. And it's a more secure way to instantiate keys in the kernel: the callout process created by the kernel is special in the sense that it carries a credential enforced by the kernel, so even if you launch the same helper process yourself as root, it will not be able to instantiate the requested key, because it doesn't have the specific token from the kernel to do it. This callout process is also very useful in that it can be made trustworthy: you can perform additional security checks and implement arbitrary policies there. You can check the requestor's user ID, group ID, executable path, package name, whatever you choose: is this application even allowed to request the key in the first place? And you can immediately deny that request. You can support multiple key storage backends: local storage, a TPM backend, a cloud HSM backend, whatever. And you can even swap these backends transparently: if, for example, you migrated from an on-prem HSM to a cloud HSM, all you have to do is modify this helper process's config file, and applications will not notice. And then you have the nice property that you only need to authenticate this single helper process against your backend. And yeah, as I mentioned, the backend connectors can be written in any language, so it's very easy to extend. The nice thing is that with request_key, key management and distribution become a core service of the operating system itself, as they should be, versus every application having to deal with them on its own.
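Since the talk notes that these callout handlers can be simple bash scripts, here is a minimal sketch of the routing part of such a helper. The backend names are hypothetical, and a real helper would finish by piping the fetched payload into `keyctl instantiate` rather than echoing it; this only shows the routing logic, which has no kernel dependency.

```shell
#!/bin/sh
# Route a request_key description to a key-storage backend.
# (Illustrative only: a real callout helper is invoked by
# /sbin/request-key with the key ID, UID, GID, and keyring IDs
# as arguments, and must instantiate or negate the key.)
route_key() {
    desc="$1"
    case "$desc" in
        cloud:*) echo "fetch ${desc#cloud:} from the cloud HSM" ;;
        local:*) echo "read ${desc#local:} from local storage" ;;
        *)       echo "no backend for $desc" >&2; return 1 ;;
    esac
}

route_key "cloud:app-key-1"   # prints: fetch app-key-1 from the cloud HSM
```

Swapping backends then means editing only this one script's routing table; applications keep requesting the same freeform IDs.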
That's basically it for today. Here are some links to the kernel documentation and the keyrings man pages, as well as one last link: again, everything I told you today, and even more, is described in the Cloudflare blog post linked at the end. Thank you, and I'm happy to take questions. Thank you for the great talk. I recall there was an API for user space to protect memory from the kernel: a given page was unmapped from the kernel's view, so if you had an out-of-bounds access in the kernel, you couldn't reach that memory, although the kernel could of course map the page back again. My question is: are the keys protected in such a way in the kernel, and do you think it would make sense to do that? It would potentially minimize the exposure, in theory at least. I'm not sure about the implementation, but I would say no, I don't think the keys are protected like that. Actually, the guy who wrote it is right there. And what was the question? If you put a key of a user-space process into these areas, it will be more protected than otherwise; it still doesn't guarantee 100%, though. My point is that the kernel could also protect those keys from itself, and only map the page back again when you actually do the request_key call. But what's the point then? If the kernel needs the keys, it has to have access anyway, and remapping and unmapping is costly. The other thing is that the key store API is internally extensible as well: you can write other modules, and this is what I asked James about earlier, so you could technically write an asymmetric key implementation backed by the TPM. Then the keys would not even be inside the kernel, they would be in the TPM, but each operation would have to touch the TPM in the first place. Or you could design some kind of crypto chip, or an Arm TrustZone backend, whatever you want. There was some effort.
I don't remember exactly which areas it touched to do this sort of separation between subsystems, but I only heard about it once, and I don't know what they call it. Well, in the kernel, you mean between kernel subsystems? It's still a flat address space at this point, unless you're again using Arm TrustZone or enclaves or the like. My question is: you mentioned that we can do RSA operations, but not everybody is using RSA. Are there any efforts to introduce other kinds of asymmetric keys? In particular, I'd like to see elliptic-curve stuff. So, yes. The kernel currently also supports ECDSA, but only for signature verification; it was added for kernel modules. I've sent patches to actually support signature generation through the keystore API twice, but I didn't get any traction on them. I'll send them one more time, maybe. Because I also know that the kernel has its own internal crypto API with support for all of these operations; they're just not exposed through the key store. Well, for RSA yes; specifically for ECDSA, no: the kernel crypto API doesn't have the code for generating ECDSA signatures. So my patch set included both the crypto subsystem and the key store subsystem, so that the kernel can create ECDSA signatures and this code is reachable through the key store API. Okay, thank you. Very interesting talk. Thank you. I have basically the same question, but also: wouldn't there be some urgency to get some post-quantum crypto in there? Maybe, but we have to fix ECDSA first; we have to learn to walk before we run, right? James, can you pass it to the next person? So if we now add TrustZone to the picture, does the kernel have any kind of API to interact with it?
I mean, would the key store itself interact with TrustZone to get the key, or do we still need to go to the user-space helper, which then goes through the normal way of communicating with TrustZone, a secure monitor call, gets back the result, and hands the key back to the kernel? For TrustZone, I think there is some code, though I never tested it on an ARM system: similar to the TPM-backed trusted keys, there is an implementation of trusted keys backed by Arm TrustZone, the open source one. I saw the code, I never tried it, but it's there. So there is a reference implementation? Yes. OP-TEE, and there is in-kernel support for that. Yes, OP-TEE. Alright, anything else? Oh, yeah. If you shout, I'll just repeat. They're just wondering from which kernel version you can use this. Sorry? Which version is it available from, the kernel key store? I mean, it's quite old, I guess. What we did, I think from 6.1: again, we mentioned the crypto subsystem and the key store subsystem. It was already handy to insert an RSA key and perform operations with it, but you didn't have any ability to do the same with a symmetric key. So what we extended is the crypto user-space socket API so that it can be initialized from a user or logon key. So from 6.1 you can insert a symmetric key and then create a crypto socket based on that key to perform, say, AES encryption with that key without exposing the key to user space. Next question: if I recall correctly, you said that the persistent keys can expire after some time of being unused. Does listing the keys also count as using them? That's my first question. My second question is: what's the timeout for them to expire? I haven't used them widely enough to know those specifics. I think the timeout is definitely configurable, but I don't know whether listing the keys actually resets the timer.
I just want to answer the question from over here: it looks like the API has been available since 2.6.10, which is old indeed. Yeah. There is one person over there. Maybe you shout, I repeat. As a certified microkernel enthusiast: is there a reason why this approach was taken, rather than adding APIs so that you could do the same thing in user space and have the same benefits? The question was why we didn't do it in user space, but how would you add extra functionality to the kernel to give you the same benefits? I don't quite understand the question. The whole point is not to expose cryptographic material to user space. You're saying the benefits are, for example, that if a process dies, then you can immediately wipe the key from memory and that sort of thing. You could also add system calls so that normal user-space daemons could have those sorts of benefits. Why didn't you do that, rather than sticking extra things into the kernel? Because you can ptrace a process in user space, but you cannot ptrace the kernel. Just saying. Anyway, we are out of time. Thank you very much. I'm sure you can continue the discussion outside.
Packet, where are you?: Track in the stack with pwru
So hello everyone. I know this is the end of the first day, so thank you for being so many to attend the talk. I won't go too much into kernel details in this talk; it should be relatively easy to follow. And yes, I'm aware this is the kernel devroom; the talk is not about Go, so don't be worried about the logo. I also apologize if some of you attended Jeff's presentation yesterday on the same topic; today's presentation will be pretty similar. But still: what is pwru? The name comes from "packet, where are you?", and it is an eBPF-based tool to debug packets going through the Linux networking stack. We'll see why we wanted to build the tool in the first place, how pwru works, what some of its features are, and how we can actually use it in real life to debug real problems. So the problem is that nowadays we have a lot of things to debug regarding networking in general. When you use containers with namespaces, Kubernetes, all these kinds of things, you typically have packets arriving on an interface and then being forwarded to a pod through a veth pair, and there's that big thing in the middle, the penguin, which stands for the Linux networking stack. From the point of view of someone trying to understand what's happening, it often looks like a black box that's difficult to analyze and to understand fully. So how do we get some visibility into it? We've got a great number of things happening in the Linux networking stack, and it's very tricky to get to the right place. So, where is my packet? That's the problem we have. Usually when something goes wrong, we use tcpdump. Right, tcpdump is good; it's a great tool that's very useful. tcpdump works well here and there, and sometimes the interesting stuff happens at the interfaces, and that's great. Sometimes, though, it happens inside the penguin, and sometimes inside the pod as well. So what do I do then?
Can tcpdump help in that case? Not really, not so much. There are some other tools to debug things. There is printk, which comes with a number of drawbacks too: I need to recompile my kernel; it's quite slow to adjust every time I need to add new printk's; I may have to add a lot of printk's if I have no idea where my packet is going; if I do things wrong, my kernel will panic, which is not great; and how do I filter on specific packets? It's difficult to do. It's far from ideal. We've got some tracers too. perf, for example, is a good tool to trace kernel functions: I can pick a function and look at what's happening in there. But for networking, it's really hard to filter on exactly the packets that I want to follow. It's also hard to extract the network-related information out of everything else that perf returns. And in the first place, how do I know which function I'm interested in? Where is the stuff happening? Where is my packet dropped? Where is my packet masqueraded? Where are the interesting events occurring? So what if we could have something that finds all the functions in the kernel that will be processing my packets? And what if I could get callbacks and run programs when these functions are called? And what if I could also filter these callbacks to make sure that I only process the packets that I'm interested in? That's where we introduce pwru, which is based on eBPF. I assume most people in the room have some familiarity with BPF, so I won't go too much into the details. Just as a few reminders: it's an execution environment inside the kernel into which you can inject programs from user space. They go through the verifier, to make sure that everything is safe and won't crash your kernel, and through the JIT compiler, which turns these programs into native instructions to get good performance too.
And then you run your programs on some hooks, wherever you attached your program in the first place, with a diagram that looks something like this. We have a program here that we hook as a kprobe on ip_local_deliver, which is a function that takes an skb as an argument: a socket buffer, the structure that represents the packet in the networking stack. We compile it with LLVM (or with GCC nowadays, but mostly LLVM) into an ELF file that contains the BPF program as bytecode. Then we use a loader program, which can be in Go, in C, in Rust, whatever, to extract the bytecode from that ELF file and inject it into the kernel through the bpf() system call. Once in the kernel, the verifier runs to make sure the program is safe. Then we JIT-compile it. We don't have to, but usually we want something fast, so the program is compiled into native instructions. And when my packet comes in and ip_local_deliver is called, it triggers the execution of the program, and I can communicate with my agent in user space through BPF maps to store data: for example, to store metadata about my packet and retrieve it in user space to know what's happening. That's great, but how do we keep track of all those packet-processing functions? I have ip_local_deliver, but there are a lot of other functions doing packet processing too. That's where we leverage BTF, the BPF Type Format, which is a metadata format with type information, a bit like DWARF, but producing objects that are much smaller than DWARF and that target BPF specifically for a number of use cases. So we can have BTF information for one BPF program in one object file, and we can also have it for the Linux kernel image itself; that BTF object is exposed in the sysfs file system. It looks a bit like this. We have a very simple program, sorry, a very simple function.
It's a get-mark function that takes a socket buffer as an argument. I compile it into a BPF program and extract the BTF information from that object file; that's the BTF information on the right side. It works like this: it says, I've got a struct sk_buff with the offsets of its different attributes. It also defines another type, which is a pointer to that type ID, my struct. It also defines the prototype of a function that takes the pointer to the skb as an argument, and gives it a name, skb_get_mark in this example. And because I have the BTF information for the kernel image, and because that BTF describes all the functions in the kernel, I can process it in user space to extract a list of all the functions that take an skb as an argument. And that gives me the list of the packet-processing functions in the kernel. So now I have a list of all the functions that I want to hook onto. So that answers the three criteria we had: how to get all the functions, we can with BTF; how to get callbacks, we can with eBPF and kprobes in the kernel; and how to filter packets, well, BPF was a packet-filtering mechanism in the first place, so that's relatively easy to implement. So how does it look in practice? I've got two terminals. It's not a live demo, sorry. I use an iptables rule to drop TCP packets to 1.1.1.1, which is Cloudflare's DNS resolver, for example. And I call pwru: here I have pwru with the filter destination host 1.1.1.1 and tcp and destination port 80. After I call pwru, it tells me that it loads my program and attaches all the kprobes that I'm interested in: about 1,500 probes in this case. And then, in the first terminal, I type curl 1.1.1.1. What happens below is that I get a list of all the functions that process my packets. So I see a list on the right: ip_local_out, __ip_local_out, nf_hook_slow, and so on and so forth. Sorry.
Eventually I get kfree_skbmem, which is the function that is called once my skb is freed because it's been dropped by the iptables rule. The iptables rule I can also see through the call to nf_hook_slow. So that gives me information about what's happening in terms of functions. It gives me information about the process that created this packet in the first place, because in the middle column you can see that it's a curl process. I also get information about the skb, which is not useful by itself (this is the address of the skb), but it allows me to be sure that this is a single skb being processed in the list. If I had several packets in this output, they would have different addresses; and it allows me to filter by skb when I post-process this information. And once I exit my pwru session, it detaches all the probes it loaded. Okay. So what fancy features do we have beyond the basic usage? We have quite a number of options for pwru. This is pwru --help. I won't go through all of them, but through a number of interesting ones. Before we go into the options, you might have noticed that the way I told pwru to focus on packets with the 1.1.1.1 destination used just the same syntax as tcpdump: we do have support for pcap filters in pwru. The way this works is: if I don't pass any filter, things are pretty much straightforward, and the BPF program compiled from pwru is loaded into the kernel. Now, if I do have a filter, pwru turns it into some cBPF (classic BPF) bytecode using libpcap. cBPF is not exactly the same thing as eBPF, so I cannot use it just like this. So pwru uses another tool underneath, cbpfc, which turns the cBPF bytecode into eBPF bytecode, and then we take this bytecode and inject it into the regular program. Okay, we've got everything in place, we load it into the kernel, and that's it.
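The demo described above, as shell commands. This requires root, the pwru binary, and iptables; 1.1.1.1 is the drop target from the talk.

```shell
# Terminal 1: install the drop rule, then start tracing.
iptables -A OUTPUT -d 1.1.1.1 -p tcp --dport 80 -j DROP
pwru 'dst host 1.1.1.1 and tcp and dst port 80'

# Terminal 2: generate the traffic that will be dropped.
curl http://1.1.1.1

# pwru then prints one line per kernel function touching the skb
# (ip_local_out, nf_hook_slow, ...) and ends with kfree_skbmem
# once the iptables rule drops the packet.
```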
Some other features: we can trace the kernel itself, and we can trace kernel modules as well. We've got a few options to trace either a specific kernel module or all modules, so if you process packets with functions that take skbs in your module, you can also follow what's happening in them. We've got a choice of backends for pwru. There are two currently: regular kprobes and kprobe-multi. What does kprobe-multi do? You don't really notice it when using pwru, but it allows pwru to load a bunch of kprobes all at the same time. So instead of attaching your probes one after the other, you create an array of probes and you pass this array, with its size, to the bpf() system call, and everything goes in nearly at once. So it's faster. How much faster exactly? If I run pwru on my laptop with the kprobe backend, the legacy one, which has been available for a long time (the new one is for 5.18 and later only), it takes a few seconds to attach all the probes, seven seconds here, but one minute 37 seconds to detach them. That's not great. Now, with kprobe-multi, attaching everything is nearly instantaneous; there's no noticeable difference on that test, and the same again for detaching everything. So that's quite a lot faster; a good improvement. Here are a few other interesting features. They're all in the same box, but they are not exactly related to each other. We can also filter by namespace with pwru, looking for packets in one given network namespace and not the others. That's totally possible, and I think that's relatively easy to do from the BPF perspective, because I believe the namespace is directly available from the skb itself.
We can also trace TC programs themselves, which are not regular kernel functions like the ones we have in the networking stack. But because your TC programs can affect the packet processing, it's also interesting to follow what's happening in them. The way it works is by using some specific BPF programs, relying on what we call the fentry and fexit mechanisms, to hook directly onto those TC programs. So we're tracing BPF programs with other BPF programs. Yes, it works. We can also track skbs that change. When does an skb change? For example, when it is cloned or copied. The way we do that is, when the option is enabled, we hook onto skb_clone and skb_copy, at the end of the functions actually, and we say: okay, this packet was interesting when I entered the function, so when I exit the function, I mark the new skb as a packet of interest in a BPF map. So in addition to filtering for the packets matching the filter I provided in the first place, I also check, for each packet, whether it's present in the map of packets that I want to additionally follow. That helps me follow packets that may have changed. We've also got some interesting options for changing the display or adding more information to it. I can print metadata about the socket buffers, the full skb, the call stack, or the tuple for the packets. In this example, we have two functions that process my packets here, and below each function that is displayed, we have the full call stack for the function. So that's quite helpful to understand exactly what's happening in the kernel and how the processing goes.
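The display and tracking options mentioned above map to pwru flags roughly like this. The flag names are recalled from pwru's --help and may differ between versions, so check `pwru --help` on yours:

```shell
pwru --output-meta  'dst host 1.1.1.1'       # skb metadata (netns, mark, iface, ...)
pwru --output-tuple 'dst host 1.1.1.1'       # L3/L4 tuple per function
pwru --output-stack 'dst host 1.1.1.1'       # kernel call stack under each function
pwru --filter-track-skb 'dst host 1.1.1.1'   # keep following cloned/modified skbs
```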
So, two real-life examples that we've had when working on Cilium, trying to debug things on Cilium, which is a CNI for Kubernetes, with a number of things related to networking and sometimes complex cases. The first one is an MTU configuration error, which we had to debug at some point. We have a very simple setup with packets arriving on the interface, and the MTU on the node interface is not the same as the one on the veth interface. It was relatively easy to find out, in the output from pwru, that the MTU is lower than the length of the packets. The only thing I had to do to get this was to add the output option to get the information about the packet that comes in. Another slightly more complex example: this was in kind, so I had a Docker network in the middle. I had this configuration with a pod trying to curl to the outside and hitting an iptables rule leading to masquerading of the packets. So my packet gets masqueraded with the address of the node interface and goes to the internet. Okay, that worked fine. In the second scenario, we checked whether the packets were also masqueraded, or rather not masqueraded, when going to the other node. We have a second rule, actually, that was not displayed in the first case, which should prevent packets going to the other node from being masqueraded. So the packet should go straight to the other interface and should not change its IP address, but the packet never arrived. So what happened? If you read the title, maybe you have an idea already. We thought that the packet was not being masqueraded, as we expected.
We thought that the iptables rules were not being applied, and we could maybe have found the issue differently, but pwru helped us quickly confirm that in this case the masquerading was indeed occurring. That's what you can observe in that sample output: we can see that we're hitting nf_hook_slow, and we can also observe, for the same skb, that the IP address is changing. This is the same skb; I just trimmed the addresses of the skb because they were taking too much space, but they're the same. So once we had this information, once we knew that the iptables rule was indeed taking effect, that we hit the netfilter hook, we went back to the rules. We were supposed to exclude the traffic to the cluster's nodes from masquerading. It turned out that the ipset containing the entries indicating which nodes should be excluded from masquerading was missing the entry for the node on the left in the first diagram. So that kept me busy for some time a few weeks ago, but we did it. So, pwru in brief: it's an eBPF-based tool to debug what's happening inside the Linux networking stack. It hooks onto kernel functions processing skbs. It's very good at picking up where tcpdump falls short, in a way: you've got more visibility into what's happening directly in the stack and not just at the interfaces. We can use pcap-filter-style syntax to select the packets that we want, so we don't get everything; we just focus on the flows that we're interested in. We can trace TC programs, kernel module functions, and modified skbs, so that's quite flexible. We can display a range of information, including packet-level metadata and the call stack, and it's proven very useful to solve a number of complex networking issues that we've encountered so far.
A quick note on some other tools that are not exactly the same, but that also use this principle of creating a lot of probes to hook into the kernel and look at what's happening. There is retsnoop, which is really convenient for debugging what's happening in the kernel when doing kernel development, because it focuses on the return values of the functions you're trying to observe, or the return values of most functions in the kernel if you're just trying to detect which functions are returning errors. ipftrace2 is very similar to pwru; there are some features that differ between the two, but otherwise they do the same job of focusing on and tracking the packets. Tetragon is a tool focusing on security-event detection; it also supports these kprobe-multi and uprobe-multi mechanisms, and it uses eBPF to detect malicious activity on the system and to block it, for security purposes. So this is the end of the presentation. I'd like to thank Aditi and Martynas, who did a great presentation a few years ago at KubeCon on the topic; I reused some of their material, so I'm very thankful to them. Thank you to the pwru contributors, and thank you to everyone. Of course, thanks for attending the talk; I hope you enjoyed it. If you have questions and if we have time, I'm happy to take them. Thank you for the talk. Does it work well with GSO and GRO, the segmentation offloads, when packets are merged and split? The GSO and GRO functions should get the skb as an argument, so they would appear in the list of functions that you get from the output. So yes. Can you just print the skbs, or also inspect inside them? Like, for example: I've seen this particular value inside the skb that changes and causes some kind of bug. Can I trace it?
So you can get the skb; you can dump the full skb. I don't think we have a filtering mechanism in pwru to do additional processing on the skb and only trace when it has that value. What you could do is filter on your packet flow, dump the full skb, and then post-process to extract the ones that have this erroneous value, I suppose. But you can get the full content of the skb, so maybe that would help. Thank you for the presentation. Do you have an idea of the performance of your tool? Are you satisfied with that performance, and do you see opportunities to make it efficient enough to be able to use it in production? One clarification: it's not my tool, I've not really contributed to it. Well, I fixed two typos. I've not run any benchmarks myself. I know there is some impact due to the use of kprobes, because you're loading so many kprobes at the same time, so it does have some impact on the performance of the system. I don't think we've tried to use it in environments where performance was a hard constraint for us so far. How could we improve that? I'm not really sure; we haven't given it much thought at this point. There's obviously the issue of attaching and detaching the programs, which is greatly improved with the kprobe-multi interface, but that's separate from the runtime cost. Thank you for the talk in the first place. My question is: which behavior can I expect with packet rewrites, encapsulation, network address translation, and so on? Is the packet evaluated at every probe, or can I trace the packet even before the rewrite rule applies? So, for example, if I filter on the rewritten IP address, or I filter on the address before VXLAN encapsulation or whatever, or IPsec processing and so on?
The way I see it, if you use the option to track the SKB by its address, then even if the metadata changes, you should be able to keep tracing it afterwards. So if you filter on a given destination IP, you wouldn't be able to trace the packet before it gets that IP, because that would mean guessing what will happen. But after it changes, yes: once pwru has added the SKB to the tracking map for you, it will keep following that SKB. Does it also track the rewritten packet, so I can trace the original IP if it gets encapsulated with another destination IP? So, does it keep tracking even if it gets encapsulated and the IP changes? Well, yes, because it's the same SKB: if you're basing your tracking on the SKB address, then it doesn't matter if the IP changes. Okay, thank you. Okay, thank you.
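For reference, the dump-and-post-process workflow suggested in the answers above might look like this on the command line. This is only a sketch: the flag names are taken from pwru's documented options, while the filter expression, output file, and grep pattern are invented for illustration, so verify everything against your pwru version.

```
# Track the SKB by its kernel address so it is followed across rewrites,
# and dump the full SKB contents for one flow:
pwru --filter-track-skb --output-skb 'host 10.0.0.1 and port 8080' > trace.txt

# Post-process offline: keep only the records around the suspect field value.
grep -B2 'suspect_value' trace.txt
```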
The new Swiss Open Source Law: "Public Money Public Code" by default
Okay, let's welcome our next speakers on the new Swiss open source law. Yes, good evening everybody. It's a great honor to be here at this conference, at FOSDEM. It was many years ago when I was last here, but I'm glad to be back, and I'm very happy to present, together with Rika Koch, the new law that we basically achieved getting passed in Switzerland. It has been a long journey, and it's great that we can now present this. We are also very interested in your feedback at the end: whether something similar exists in other countries, and how we should continue on this journey. So briefly, our background: we are academics from Bern, Switzerland, but I have also been an activist for almost 20 years, since I wrote my master's thesis about open source community building. I'm very glad that we can now present this to you, and Rika will start. Good afternoon from my side as well. My name is Rika Koch. As Matthias mentioned, I'm a law professor at the Berner Fachhochschule in Bern, and I want to speak to you today about the regulation, the legal side, of open source software in Switzerland for the public sector. Here it is again. So, in the beginning, when we talk about regulation of open source in the public sector, there was literally nothing. I wrote "dark past" here, but we're not talking about the distant past; we're talking about 10, 20, or even just two years back. There was a strategy of the Swiss federal government that said, well, basically, open source software would be nice because it's economically efficient and produces good quality, but there was nothing in the law; we only had a strategy. And when nothing is regulated, you don't really know whether you can do it: whether the public sector is allowed to develop its software as open source and license it as open source, or not. In legal terms, there was a lot of legal uncertainty.
So developers, or the private sector offering software to the public sector, didn't really know whether it was possible or not. The crucial question here is: can the public sector develop and also distribute open source software, or do they have to do it closed source? If you ask the IT experts, probably all of them will tell you yes, of course they can. But in come the spoilers, the legal people, the lawyers, and they say: wait a second, it's not that easy. So the Swiss government was in this situation where the IT experts said "please do open source software" and the lawyers said "no, no, it's complicated, please don't." And what did they do? They paid a lot of money, of course, to legal experts for a legal opinion. Oh sorry, I hand over to Matthias to explain first why the pressure arose. Thank you. Basically, from a historical point of view this is interesting, because open source had been done by several state agencies for many years. From the IT side there were lots of open source activities on GitHub from different agencies; however, as Rika said, it was not clearly legally allowed. So when we started, from our group of parliamentarians, the parliamentarian group for digital sustainability, Parldigi, we began lobbying for open source releases from the Swiss government in 2011. Back then we had our first initiative, asking politicians to support the release of open source software by the government. It seemed like a natural thing, and although the federal government rejected this, saying there would be no additional support, it was basically clear to us that it would take place anyway. Even more: the Swiss federal court has been very open source minded for many years. They are completely based on open source software; they host their entire stack; they use LibreOffice, back then it was OpenOffice, and all the federal judges use LibreOffice for their work.
They even wanted to open source their court management system, called OpenJustitia. It still exists: openjustitia.ch is actually still online. But then something happened. A very small Bernese IT law company objected against this release of open source software by the federal court, because their business was jeopardized. Basically, they were afraid of government competition: they had a market of local courts to which they sold their proprietary court management systems, and when the federal court, the big court, released open source, they were obviously afraid that the government would destroy their market. What they did was ask other politicians to hand in a political question asking what the aim of the federal court was: did it want to destroy the market for IT companies and compete with small companies? And this was basically the beginning of this legal dispute over the last 10 years. Now I hand over to the lawyer again, and she will explain what the lawyers said about this story. Yeah, maybe you know the saying we have in German: two lawyers, three opinions. And it was exactly that. We had a first legal opinion issued in 2014, with the crucial question: can the government develop and distribute open source software? This legal opinion basically said no. Why? There is in the Swiss constitution, and I'm sure in other constitutions as well, the principle of competitive neutrality. This means that the government should not mess with the private sector; on the contrary, the government should create so-called favorable conditions for the private sector. And they said that this is not the case when the government publishes open instead of closed source software.
Now it gets a bit legal. They said that distribution of software by the public sector is a so-called economic act that would per se distort the free market. This would actually be allowed, but only if there is a sound legal basis and if it's proportionate, which also means necessary. And these two quite old law professors said: no, it's not written in the law. Okay, that was true at the time. And, they said, it's not necessary at all, because everything you want to do with software by the public sector you might as well do with closed software; closed software is even better suited to fulfilling the public tasks. So basically: keep it private. But luckily, some persons thought, okay, let's just ask another lawyer, and paid for another legal opinion three years later. And when I say "some persons", I look at Matthias. This second legal opinion said: well, we're not too sure whether publishing open instead of closed source software really is an economic, market-distorting act per se. We might also call it an auxiliary service: you make the software because it serves the fulfillment of a public task, and whether you do it with closed or open source software does not change the fact that you have the legitimation to do so. So they said: it's an auxiliary service, it does not distort the market, and we do not even need a legal basis. With this in mind, the government, after a long consultation, issued a law. Although you do not really need a legal basis, they thought: we might as well make one, better safe than sorry. And they negotiated the so-called Federal Law on the Use of Electronic Means for the Fulfillment of Governmental Tasks. And there they did not only put in the possibility to make open source software, but a mandatory requirement. This is the text; we will look at it now, and I will speed it up a bit. Here it says: the public bodies subject to this law shall disclose the source code of software that they develop, or have developed by
third parties, unless the rights of third parties or security-related reasons preclude this. So the first sentence is the principle: open source software is mandatory by law for software that the public sector develops, or for developments that they buy from third parties on the private market. Not for software that already exists: they can still buy pre-existing software on the free market. So there's the rule, and there's the exception. The exception is that you do not have to open software if rights of third parties would preclude that; that's clear, that's usually intellectual property rights, closed licenses maybe. Or for security-related reasons, and I personally really don't know what this should mean; I've been told by people with more IT knowledge than I have that it's not needed. So if any of you know what this could be, please raise your hand afterwards and give us this input. That was paragraph one, the principle of mandatory open source software. There are a lot of other paragraphs; I won't delve deeper into them, but just to show you how far we've come, from "open source software is distortive and against the market neutrality of the Swiss government" to this: paragraphs four and five say the public sector can develop open source software and can also offer supplementary services to other governmental bodies, if they charge for them; they usually have to take money, but they can provide other services, which is then not deemed market-distortive. Thank you, Rika. So basically here we have the issue behind these clauses: is the federal court allowed to do community building, to help other courts use their software, answer some questions, and participate in the community? This is now also a legal basis for community building at the federal level. Now we
have the law, so what does this mean? A law is not helpful if it's not implemented, and this is basically my activity for the next 10 years: making this law come alive. One aspect, and this is where Rika comes in again because she's a specialist in public procurement, is that we hope to see, over the next few years, more and more public tenders where the government, when procuring IT solutions, includes criteria which support the release of open source software and community building. This is a real excerpt from one such public tender. It actually predates the new law, but I think it can serve as a good example, because it says, first, that the software being built by the company providing the solution has to be open sourced, under an open source license, on the GitHub account of the City of Bern; then, that they have to use not just any license but a copyleft license, including the EUPL and GPL, and we heard about that, thank you Bradley; and it also requires companies with experience in open source software development and community building, with community management being one of the services provided by this IT supplier. From my point of view it looks quite nice; it's really something which should be used in the future as one of the open source role models, or at least good practices, for IT procurement. Nevertheless, we don't have as much activity in Switzerland compared to other countries, especially Germany, but we do have a few activities. One thing, during the pandemic, was that the federal IT administration released the COVID certificate app as open source software, which was then used by the Austrian government for their national COVID certificate. I think this is a very nice example of how governments can exchange source code. Another
example involves the Swiss mapping agency, swisstopo: they supported OpenLayers in the past and collaborated, with some institutionalized crowdfunding together with other agencies, on the further development of OpenLayers. Another example comes from a Swiss company, Adfinis: they started an open source project, Caluma, a workflow component framework, and they have since supported several cantonal and local departments in Switzerland in using this software, and they founded a community around it. This is another good example which shows that even Swiss people are able to produce open source software in a good way. And the last thing: you have all heard about the railway activities, the OpenRail Association; this was also partly driven by the Swiss federal railways, and I think it is another good example of how government companies can collaborate with others. So there is still hope for Switzerland and its open source activities, and there should be more activity soon. There's one monitoring project which we run, ossbenchmark.com; this is a hobby pet project of mine where we collect the open source repositories of organizations from the Swiss government and Swiss companies and look at how many repositories are released by which kind of organization. There you can see, across about 150 agencies and institutions, how much open source they are already providing on GitHub. What I hope will also help us in the future is the high-level political environment around digital sovereignty and digital sustainability. A few years ago we created a report on data colonialism, where we pointed out the danger of the big tech companies appropriating and privatizing data, and obviously software as well. Nowadays I'm working on a new report for the government's digital sovereignty strategy, where they have to
release some new recommendations by the end of this year, and we hope that this will again help open source software development in Switzerland. Now we are very interested in your feedback for the discussion. First, we would like to know whether other countries have similar laws, where releasing open source software is not just allowed but actually the default. Second, what are the potential and the challenges of this new Swiss law, and what could be implemented in other countries? And from the operational, implementational point of view: we know of activities in several other countries, but what in your opinion would be the best thing for our parliamentary group to do next in Switzerland? Because now we have the law, now we need to do the other things. Okay, that's it for the moment, and we're very interested in your feedback. Thank you very much. I was a little bit irritated that you mentioned that the tender demands copyleft, because for some organizations, if they want to contribute their own code to reduce their costs, having copyleft may actually be exclusionary: some people using code under a different, more permissive license may not be willing to put their stuff under copyleft. In other words, what's wrong with saying copyleft, or Apache, or BSD, or one of these other more permissive licenses? So, if I understand you correctly, the question is: why is a copyleft license being recommended, which excludes a number of organizations who may want to develop software under a much more permissive license than copyleft? Well, it doesn't exclude them; permissive code can be integrated into a copyleft final product. But maybe someone else can add to this. Do you want to respond to that, Bradley? Yeah, please. So what it sounds like you're saying is that if there is a "must" on
copyleft, then they might just have to upgrade the license of code that's under, you know, MIT, to be copyleft when they put the solution forward, right? Yes. As far as I understand, the end product, the final product, can include permissively licensed software; you can still use the software under the less restrictive license. Thank you for the great talk. You mentioned this argument that providing open source software is market distortion, as if the right of a private enterprise to make money on licensing fees stands above other things. Shouldn't there be an argument that the government should make the best use of taxpayer money, rather than blowing taxpayer money on licensing fees, and isn't this a good enough argument to reject that? Yeah, absolutely. I never understood that argument myself: merely enabling some companies to pursue a certain business model does not make the market competitive. On the contrary, enabling the government to make software that has the best value for money for the taxpayers should be the first public interest. And that's how the interpretation of the term "competitive neutrality" changed, luckily. You mentioned that the law came into force in 2017, am I right? Sorry, the law was in force since when? Like six years ago? No, no, the law started on January 1st, 2024, this year. Okay, because my question is: how often can the security reason be used? You mentioned there are two grounds for an exception: third-party rights and security reasons. So when it comes to, say, the ministry of defense, the software could perhaps remain proprietary. But the law is pretty new, so I don't know whether this applies or not. Yeah, this is exactly the point: the law is very new, so we don't know yet how strongly it will be implemented and fulfilled.
We know that the government is somewhat behind on releasing guidelines; they are providing some right now, but it will still take a few months, or maybe years, to really get going. On the security issue: it should be used very sparingly, right? But there are arguments that can be made, like Bradley mentioned. When you're looking for people who are trying to evade taxes, you can argue that if people know how the government looks for tax evaders, it becomes easier to beat those algorithms. So I think it's a good thing that it's in the law, while at the same time security by obscurity is a very bad thing and we should use that exception very sparingly. But I think it's good that it's in the law, just from a cybersecurity point of view. And then the actual question: we have the same argument in Austria, where it is said that the government may not publish open source because it distorts the market of companies profiting from proprietary software. Can you summarize what the other side of the argument was, what claims the lawyers made for why this is not an issue? You mean the pro side, how they debunked it? It's a good thing that you're from Austria, so I can just send you the legal opinion. But to summarize, they said only real economic acts can distort the market, and they compared it like this: whether it's open source or closed source is like whether you write using a pen or on your laptop; it's just a means to help you do your work, an auxiliary service. And even if it were market-distorting, you would just have to have a legal basis. Very quickly: I've seen laws in other countries that were open source by default fail, so it's good that you have already clarified that, for instance, security can be
a way to circumvent this. I wonder if there are some regional carve-outs: this is federal, but can a city or a canton avoid applying this law because it doesn't apply to them? Yes, I have to say this is binding only for the federal government. Sub-federal governments can still do whatever they want. But what I would also expect, at least in Switzerland, is that the cantons and the other non-federal players look at what the federal government is doing. So when they see there are benefits, and there obviously are benefits, otherwise we wouldn't be here, then I hope that people will become more used to procuring open source software and services and building communities. Okay, let's thank Matthias and Rika.
Welcome to the LLVM dev room
Welcome everyone to the LLVM dev room. I hope the microphone is working. This year we have three organizers; we'd just like to very briefly introduce ourselves. My name is Kristof Beyls. My name is Peter Smith. And my name is Marius Brehler. We thought we'd use the first five minutes to give a little bit of general information. It's an anniversary this year: this is the 10th LLVM dev room. The first one was in 2014, and we were here every year except 2021, when we couldn't find volunteers to organize it. Quite a few different people have helped with the organization over the years; I've put a few names on the slides, I'm not going to call them out, and I'm pretty sure I forgot someone, my apologies. This year is the first time there's also a GCC dev room, and I'm very happy that we're running them back to back, so I'm hoping that enables some cross-pollination of ideas across the two communities. That is very nice to see. Maybe a few words if you're interested in participating in the LLVM project but you're not entirely sure where to start, or if you're a newcomer. I've put a few links on the slides and will very briefly go over them. Most of the communication in the LLVM project happens on Discourse, which is a forum, or on Discord. If you want the links, go to the FOSDEM schedule page; you can download the slides there and just click on the links. The LLVM project has office hours and online sync-ups. Office hours are where an individual expert on some part of LLVM makes themselves available on a regular schedule; you can dial in, and any question goes as long as it's on topic. Just follow the link: I think about a dozen different experts volunteer to do that. If you're an expert yourself and you think this is a good idea, please consider volunteering some of your time, too. Online sync-ups are regular calls on a very specific topic; they're also all documented on the website.
We have a community calendar; I have a screenshot on the left there. You can't read what's in it, but it gives an indication that on pretty much any day of the week there's at least something going on where people can come together, sometimes on a specific topic, for an interactive discussion. Another way to get started is to have a look at the "good first issue" label in the issue tracker: this morning there were 148 open; we're now three hours later, so I'm not sure that count is still exactly correct. There's also a "Getting Involved" page (GettingInvolved.html) in the documentation, which gives you lots of starters on the technical details. LLVM takes part in Google Summer of Code and also in Outreachy. And if you would like to work on LLVM and get paid for it, there are always quite a few companies with job openings to work on LLVM. That's all.
Linker Scripts in LLD and how they compare with GNU ld
This talk is about linker scripts and some of the ways they differ between GNU ld and LLD. There are some bits where I've bent that definition a bit and gone through the differences in the internal linker script, because with some linkers, when you say you're not using a linker script, you actually are: the linker has just provided one for you in the background. The first slide is just some basics so that you can understand what I'll be talking about for the rest of the talk; apologies if you're already familiar with ELF and linker scripts, this will be a bit boring. Very quickly: the linker's job is to take the input sections you would have in your ELF object file, normally your .text, your .data, your .bss (which is the zero-initialized stuff), and combine them into one bigger blob; those are then called output sections. So I will use the term input sections for stuff coming from your object file, and output sections for what the linker combines them into. These output sections end up in program segments in your ELF file, and your operating system or loader then operates on a program segment. Right, so linker scripts, I guess more formally called linker control scripts, are a kind of domain-specific language that the linker uses. The majority of the commands are to do with image layout, that is, how you map input sections to output sections, but there are a few additional commands as well: for example, some commands load more files. You might be surprised to learn that on at least some systems your libc is actually a linker script, one that loads the actual files behind the scenes to make sure you get them in the right order. Yes, some details on the command line.
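The input-to-output mapping just described can be sketched as a minimal SECTIONS command. This is only an illustrative fragment (the script itself is invented; the section names are the standard ELF ones mentioned above):

```
SECTIONS
{
  /* Combine every input .text section into one output .text section. */
  .text : { *(.text*) }
  /* Likewise for initialized data and zero-initialized data. */
  .data : { *(.data*) }
  .bss  : { *(.bss*) *(COMMON) }
}
```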
GNU ld has a built-in linker script, and you can actually dump it with --verbose if you're interested in the horror of what GNU ld's internal linker script looks like. LLD and gold (and I assume mold as well) don't use an internal textual script; they kind of mimic it with command-line options, or just hard-code things. One interesting thing, not specific to LLD or GNU ld: if you use -T, which is the short form of --script, the script you provide replaces the internal linker script. But you can also just put a script on the command line as if it were an object file, and that won't replace the internal linker script, it will add to it, so you can add various fragments that way. Anyway, here's an example of a linker script, a very stripped-down one from an embedded system. I've used embedded systems for the linker script examples because, generally, if you're linking in user space on Linux or whatever, you really don't need a linker script most of the time, and the general advice is: if you don't need to touch linker scripts, don't touch them. The MEMORY command at the top lays out where the various memories are on the embedded system; they might have different properties, for example one might be flash and one might be RAM. Then you have these things called input section descriptions, the `*(.text*)` parts: those are what the linker filters against, so an input .text section will match against that .text pattern there.
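A sketch of the kind of stripped-down embedded script being described (region names, origins, and lengths are invented for illustration):

```
MEMORY
{
  FLASH (rx)  : ORIGIN = 0x08000000, LENGTH = 256K
  RAM   (rwx) : ORIGIN = 0x20000000, LENGTH = 64K
}

SECTIONS
{
  .text : { *(.text*) } > FLASH          /* code lives in flash */
  .data : { *(.data*) } > RAM AT> FLASH  /* initialized data: stored in flash, runs in RAM */
  .bss  : { *(.bss*)  } > RAM            /* zero-initialized, no flash image needed */
}
```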
You have symbol definitions that you can put in: the dot in, for example, `__exidx_start = .;` is called the location counter, and the linker fills it in with the address reached at that point. So at the end of .text a certain amount of address space will have been used, and at the end of that output section that value gets assigned to the symbol, so that your program can introspect itself using these symbols. There are built-in functions, for example ALIGN, and these `> FLASH` and `AT> FLASH` constructs are ways of assigning things to memory regions, which becomes important for other things later on. Anyway, on to GNU ld and LLD linker script handling. As mentioned in the GNU talk this morning, there's no specification for linker scripts; the closest we have is the linker script manual in the GNU documentation. Some parts are under-specified, some parts are implementation-defined, and GNU ld and LLD are also moving targets, so even if you decided to reverse engineer the source code, there would be no guarantee that by the next release it would behave the same. Generally, LLD tries to keep as close to the documentation as possible, but it has made a design decision to differ in a few cases where odd behavior has accumulated over time. These are not well-specified languages that have gone through a standards committee; they are accumulations, I wouldn't necessarily say of hacks, but they have been developed over the course of 30 years and have accumulated a lot of rubbish.
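The location counter, the ALIGN built-in, and the region assignment described in this section combine into the common start/end symbol pattern, sketched here with invented symbol names (the FLASH region is assumed from the embedded example):

```
.text :
{
  __text_start = .;   /* location counter: address reached at this point */
  *(.text*)
  . = ALIGN(4);       /* built-in function: round the counter up */
  __text_end = .;     /* program can introspect [__text_start, __text_end) */
} > FLASH
```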
Okay, so orphan placement. This is one of the areas where GNU ld and LLD differ slightly, but they give you roughly the same results. Going back to that previous linker script: it contained only fragments, not a complete specification of where all the sections go. Linker scripts do not have to be complete; you can give only a partial description, and if an input section doesn't match any of the input section descriptions, it's called an orphan. The manual basically says it is up to the linker to place the orphans, so the linker places them where it thinks is relatively sensible. If you're concerned about that and want to know what the linker has done, there is an option for orphan handling which can tell you where things have gone. And there's also an option called --unique: if you don't want the linker to mess about combining your orphans together, it will just put them all in their own individual output sections.
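A hedged sketch of the two options just mentioned on a command line (flag spellings as documented for GNU ld and LLD; the file names are invented, so check against your linker's version):

```
# Report every section that the script did not explicitly place:
ld.lld -T app.ld main.o -o app --orphan-handling=warn

# Keep orphans in their own individual output sections instead of merging them:
ld.lld -T app.ld main.o -o app --unique
```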
Okay, so here's an example of how a linker might place orphans. What it tends to do is try to match the properties of the section. For example, for the executable section in the assembly code there you have the "ax" flags: "a" means SHF_ALLOC, "x" means executable; "aw" would be allocatable and writable, that sort of thing. @progbits means there's something in the file; @nobits is runtime zero-initialized data. The linker basically says: okay, what have I already got in my linker script? Well, I've already got a .text section, that's also executable, so I'll place the orphan with a similar name after it. One of the interesting cases, which we'll get to, is where the linker places orphans when there are already symbol assignments: the linker has to be very careful not to break someone's carefully placed symbol assignments. Here's just a very quick textual detail for some of the things I've said, and in particular the example at the bottom: you've got this last one, .foo, containing a section called bar, and then someone has advanced the location counter by a thousand. If the linker inserts an orphan, it says: ah, .foo, I can place that in the output section .foo. But where? Does it place it after bar, or after the dot assignment? The rule the linkers take is to always put it after any of the expressions, because in general this is where programmers say "I want section start, section end", and if you insert something in the middle of that, you might have broken the program of someone who was, say, trying to make their own table of pointers to iterate through. Okay, so here's an example of where GNU ld and LLD differ, and it's actually a fairly simple one; it's actually quite hard to get them to differ in most cases. But LLD in its default linker script prefers to place read-only sections before executable
sections; GNU ld has the opposite preference and will place read-only after executable. So if there's no read-only section in the linker script, the linker has no information with which to say "here's my anchor to place it after", so they will make different choices, and there was a bug report about this saying the linkers did something different. But yeah, it's one of those known differences. Another thing, and this is more of a curiosity with LLD, is something I see when people port programs from another operating system: quite often someone will forget the "a" flag, and forgetting the "a" essentially tells the linker that this section is not part of the program; it's like a debug section, metadata. Now, it turns out that GNU ld will place the orphan in the same place, but LLD unfortunately uses it as an anchor point for all of the other sections. So in that particular case, bar will get inserted after foo there, but then all of the debug sections will get put after it, because it's suddenly the anchor point for all the non-alloc sections, which is a bit of a curiosity at that point. So yeah, main thing: if you're ever porting a program from GNU and something weird goes on, check your assembly, and the chances are you forgot to put an "a" flag on one of your sections. Okay. So, program header generation: this is where I'm going away from linker scripts and veering back much more to the user-space area. This is basically trying to explain some of the differences between the separate-code and no-separate-code behaviour in LLD and GNU ld. This is an ELF program header; this is what describes a segment, and the most important fields you need to look at here are p_offset, which is the offset in the file; p_vaddr, which is basically the virtual address that the thing will be loaded at; and p_align, which is a very, very strange thing: p_vaddr must be congruent to p_offset modulo p_align. And this is
I guess you could almost call it a trick: it allows basically the same page in the file to get mapped in two different places in virtual memory, and that can save some physical memory. Okay. So in what I call a System V system (think of that as something like Linux or BSD, that type of thing), this ELF file is actually memory-mapped using various mmap calls. This is actually quite different from an embedded system, because on an embedded system you probably wouldn't even load the ELF file anyway: you would basically objcopy the load bits out, and then you would have some bit of initialization code copy from various places to where they need to be. So in some ways, even though I guess linker scripts were designed before ELF, ELF is not really well designed for embedded systems; you're kind of almost misusing ELF to make it work for embedded systems in a lot of cases. But anyway, I'll go on from here. Okay, so the reason I'm mentioning program headers here is that you can be very explicit in your linker script and use the PHDRS command, but most of the time you actually want the linker to generate these things for you, because if you get it wrong then the program just won't work. So for a typical sort of link, the linker is going to look at this thing called the VMA-to-LMA offset, the LMA being the load address. This is really only important in embedded systems, where for example you want your load address to be in flash but your execution address to be in RAM, that type of thing. So if that offset changes, the linker will change the program header. You typically want all of your non-zero-initialized data before the zero-initialized data, because that's the only way an ELF program header can describe it. And of course, if you're changing properties, like from RO to RW, then whilst you could in theory merge them, you generally don't want executable read-write in most
systems. Okay, so here's a graphical example of some of the points I'm making. It's quite a complicated diagram, but this is where that alignment comes in. Think of your text segment there: I've deliberately made it just a bit smaller than the memory page, and I'm using a 64k page here. So you've got the data segment, which is not aligned to a page boundary in the file; if it were, there'd be a big gap filled with zeros. So what the operating system actually does is double-map that particular page: you end up with the text segment, plus part of the data segment, mapped into the first page read-only, and then you have the second bit mapped read-write into two separate pages there. So we've actually wasted one page of virtual memory, but we've saved one physical memory page. Now, the interesting bit is that the mapping for the read-write part is copy-on-write, so you can't write through into the read-only-execute bit; but what it does permit you to do is read towards the end of the execute segment, and you'll actually be reading the data segment, read-only. Now in theory, if you've not hardened your program, this probably doesn't matter that much. But if you have hardened your program against ROP and JOP attacks, there could potentially be gadgets in that read-only data, so if someone manages to redirect control flow they can find more gadgets in that same page. So there's an option called -z separate-code, which basically makes sure that the read-only and executable parts are separated by pages, so you never get this double mapping. And as you can see, for the GNU ld layout you've got some executable, then more read-only, and that can actually waste you quite a lot of pages on a small system, particularly on something like AArch64 where you've got a 64k base page. So there are controls for that: if you use -z noseparate-code then you end up
with them tightly packed, like I had before. So quite often the various distros will choose different values of -z separate-code; but if you do find "hey, all my binary sizes have suddenly got bigger", it might be because of -z separate-code. Now, GNU ld does something slightly different, in that because it normally prefers read-only non-executable before read-only executable, it doesn't quite have that sandwich of read-only executable between the read-onlys; by default LLD would give you the three-program-header layout. Okay, I need to speed up a little bit here, but anyway, that's just one example of differences in memory layout even without a linker script. Okay, so, program segments and embedded systems. As I mentioned before, you had this arrangement of flash and RAM: this is how you would arrange it so that the execution address for your data is actually in RAM, but your load address, the LMA, is in flash, and then some program will actually go and copy the contents from flash into RAM. The reason I'm mentioning this is that there are some slight differences between GNU ld and LLD here, and there are certainly some problems with LLD that we know about at the moment. LLD at the moment will assume that your output sections' virtual memory addresses are monotonically ascending, so you can break this with a linker script like this: because it's working top-down, it will just try to assign these sections into the memory region top-down, and unfortunately that second section, at plus 64, really should be after the other section in the file. GNU ld is clever about this and will actually sort the sections to make sure they are in ascending order, but LLD won't, and you'll end up with a bit of code that tries to work out the load address from the virtual memory address, and it basically wraps around and goes
negative. So that's one thing: a known bug in LLD at the moment that we'll need to fix. But the other thing I'd say is: if you are writing a linker script for embedded systems, please try not to make life difficult for your linker, and put things in ascending order. Okay, so, because I've probably only got one minute left, here are some other gotchas that you need to look out for: dot assignment within an output section. See that ". = 4" there? Now, you might think that means assign the location counter to 4, but no, it doesn't: there's a special case that if you do that within an output section, it's supposed to be relative to the start of that section, so it's actually saying ". = section start + 4". LLD has decided this is silly and doesn't do it: there it just means ". = 4". But it does mean that if you have got an old linker script, you can end up getting caught like this. There is a way of doing things that's compatible with both, which is to use ". += 4"; that also looks much nicer, and it actually lines up with what you probably intended to do anyway. So I think with that I just want to quickly mention some references and then I'll stop. If you do want to know what LLD does, MaskRay, the current LLD maintainer, has gone through and, often when he wants to implement something, will put up a blog post and write lots of interesting things about what he's found out. It's not documentation; it's definitely blog-post-type material, a snapshot of what's there at the time, but it is quite useful for getting into the internals of these sorts of things. And then there are some bug report links and things there, but I'd better stop because I'm probably out of time. Okay, we've got two minutes, so I might be able to take one, maybe two questions. So the
question is: is there an effort to standardize linker scripts? I think it's very much down to the community. What we've said in LLD is that if anyone wants to change the linker script format, go to the binutils mailing list and make sure you get it agreed with the GNU side; we definitely don't want to just pile extensions into LLD, so really it's just communication across the projects. The basic problem with standardizing it is that that means they can't change it, so I think they probably want to keep some of that freedom. Yes? So I know that if you use funky linker script stuff with LTO, like the things we talked about, constructor lists, then LTO doesn't seem to be able to take advantage of the layout caused by the linker script in order to do things like preload data that it knows should be in certain locations. Is that some sort of fundamental limitation of the architecture, or do you think that could be fixed? So yeah, the question was about LTO and linker scripts probably not interacting very well. There are some efforts going on within the LLVM embedded community to try to address this. I'll probably take too long to answer fully, but things like interprocedural optimization can break certain linker scripts: sometimes you want to say "here's this region of memory and that region of memory, and I do not want you to share things between these two bits of memory", but LTO basically just assumes it can do all of it. So there is some effort ongoing, but we need to work out what the actual rules are. Okay, I'd probably best stop there.
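The p_align rule from the program-header discussion earlier can be written down concretely. This is just a sketch of the congruence check a loader relies on; the offsets and addresses below are made up for illustration, not taken from a real binary:

```python
# The p_align constraint: p_vaddr must be congruent to p_offset modulo
# p_align, so the loader can mmap whole file pages directly into place.
def congruent(p_offset, p_vaddr, p_align):
    """Return True if this (offset, vaddr) pair satisfies the alignment rule."""
    return p_offset % p_align == p_vaddr % p_align

# With 64 KiB pages, file offset 0x2F30 may be mapped at 0x412F30
# (both are 0x2F30 modulo 0x10000) but not at 0x413000.
print(congruent(0x2F30, 0x412F30, 0x10000))  # True
print(congruent(0x2F30, 0x413000, 0x10000))  # False
```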
Patch-based test coverage for quick test feedback
All right, next up, Shivam Gupta. Okay, good afternoon, everyone. Today I will be talking about my GSoC project. It was mentored by Henrik Olson, and this GSoC project is about patch-based code coverage testing for LLVM patches. So in this talk, the agenda is: first we will introduce what the project is about; then the terminology we use, like how LLVM test cases are written; then we will see what LLVM source-based code coverage is, which is used to get the code coverage of a patch; then we will see how it is implemented (it is basically a Python script, so we will see what functions are used to implement this tool); and then we will see a demo on a patch that is already in the LLVM community, to see which lines are covered or not covered by that patch. So we will start with the introduction. LLVM regression tests are written in the lit format, and unit tests are written in the GoogleTest or GoogleMock formats. The goal of this project is to help developers create good test coverage for their patches, and it will also help reviewers to know whether the code being submitted has good test coverage or not. So this is the project, and to accomplish it we have created a Python tool of around 800 lines of Python code. It will take the patch as input and extract some information, like the source lines in the patch and the test-case lines in the patch; then we build the LLVM project with code coverage enabled, so it will instrument our binaries. So whenever we run a test case with such a binary, it will generate a raw profile file that is further converted and processed, and then the tool shows which lines of the patch's source code are covered or not covered.
So the LLVM test suite basically has two kinds of test cases for any patch. One is regression tests, and the second is unit tests. Mainly, regression tests are written for most patches; these regression tests are in .ll or .c format for the different tools, so mostly our focus is on regression tests. Some test cases are written as unit tests; those are tests for libraries, like the Support library or the ADT data types, and they test the features of those libraries. So that is a unit test case. A regression test is very small, but you can see at the top there is one RUN line which will actually run this test case. Then there is the unit test case, which uses the GoogleTest library, so it has some macros to check things. The details are not important, but these are the two kinds of test cases in LLVM for any patch. And then we will see what source-based code coverage is. Source-based code coverage consists of three steps. The first step is compiling the program with coverage enabled: to instrument a binary we use the -fprofile-instr-generate flag, and this will generate a foo binary which is instrumented. In the next step, when we run this binary, it will generate a raw profile file; that file contains the data for creating coverage reports. Next is the llvm-profdata tool, which is used to convert the profraw format to the profdata format, which is then used by llvm-cov to show the report of which lines are covered or not. In the next slide we have a simple test case and I have generated the report. It checks if a number is even or odd. If we pass, say, 5, it will say that the number is odd, and the even branch of the if condition will not run, so llvm-cov will show it like this; this is the report of llvm-cov for a program. Next is the implementation. For the implementation I have submitted two patches.
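The even/odd coverage example can be mimicked in a few lines of Python, with sys.settrace standing in for the Clang instrumentation: it records which lines of the function actually ran for the input 5. This is only an analogy for what the instrumented binary and llvm-cov do, not the real mechanism:

```python
# Toy re-creation of the even/odd example: trace which lines of classify()
# execute for input 5, the way instrumentation records line counts.
import sys

def classify(n):
    if n % 2 == 0:
        return "even"   # this line should stay uncovered for n = 5
    return "odd"

executed = set()        # relative line numbers of classify() that ran

def tracer(frame, event, arg):
    if event == "line" and frame.f_code.co_name == "classify":
        executed.add(frame.f_lineno - classify.__code__.co_firstlineno)
    return tracer

sys.settrace(tracer)
result = classify(5)
sys.settrace(None)

print(result)           # odd
print(2 in executed)    # False: the "even" return line was never covered
```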
The first one is a change in LLVM lit, the testing tool that is used to run the regression test cases in LLVM. Initially, whenever we ran a test case, it would generate profile data under some random name; we have modified that and given a proper name to every test case, so it generates a properly named file in a specific directory. So this is the categorization of the profile data. Next we have the main tool, which has all the functions that parse the patch, build the LLVM project, generate the data, and then process the data to show the coverage report to a reviewer or the patch author. Next, these are some of the functions implemented in the tool. The first two are just logging functions. Then, sequentially, as the names suggest: first we create the patch from the last commit, or take the patch itself; then we extract the source files; then we have a function that writes a source-file allow list, which is used to reduce the coverage data, because if we generated coverage data for all the files of LLVM it would be around 150 MB for each test case, which would be difficult to process later. So we have used the -fprofile-list flag, which restricts coverage data generation to only the files in the patch. Next we extract the modified source lines from the patch, and then we build the project, passing the instrumented-coverage build flag during the CMake invocation; when we pass this, the binaries built for the LLVM project will have instrumentation enabled. Then we run the single test case with coverage, and there is a helper function for the modified lit test case or unit test case: if the patch contains a lit test case, the regression-test function will run, and if it has a unit test case, then that function is called and the unit test runs.
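One of the steps above, extracting the modified source lines from the patch, can be sketched like this. This mirrors the idea of the tool, not its actual code; the diff content and file path below are invented:

```python
# Sketch: pull the added/changed line numbers per file out of a unified diff.
import re

def modified_lines(diff_text):
    """Map file path -> set of line numbers added or changed by the patch."""
    lines_by_file, path, lineno = {}, None, 0
    for line in diff_text.splitlines():
        if line.startswith("+++ b/"):
            path = line[6:]
            lines_by_file[path] = set()
        elif (m := re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", line)):
            lineno = int(m.group(1))          # start of the new-file hunk
        elif line.startswith("+") and not line.startswith("+++"):
            lines_by_file[path].add(lineno)   # an added line
            lineno += 1
        elif not line.startswith("-"):
            lineno += 1                       # context line advances position
    return lines_by_file

patch = """--- a/llvm/lib/Foo.cpp
+++ b/llvm/lib/Foo.cpp
@@ -10,3 +10,4 @@
 context
+new line A
 context
+new line B
"""
print(modified_lines(patch))  # {'llvm/lib/Foo.cpp': {11, 13}}
```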
And next we have a function that processes the coverage data, and similarly one that handles each coverage file. Then we have a function that actually prints the coverage details; we also have a log file, so it prints a lot of detail into the log file. And then we print the common uncovered lines. In a patch there is one source file, but there may be many test cases. If any one test case covers a source line, then that line is covered; but if none of the test cases covers a source line, it means that line is uncovered. So it prints the uncovered lines this way, and then there are some helper functions which are not important. This is the GitHub CI workflow, a file that is used to build the project on GitHub: it checks out the project, and at the end it runs a Python script, the git code-coverage tool (this is the file name). So it runs the Python code here and then prints the coverage result. I will show the format: it will show the common uncovered lines for the...
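The "common uncovered lines" rule described above amounts to a set intersection: a line of the patch is uncovered only if every test case misses it. A minimal sketch, with invented file names and line numbers:

```python
# A patch line counts as covered if any test covers it; it is reported
# as uncovered only when no test covers it.
def common_uncovered(patch_lines, covered_by_test):
    """patch_lines: set of line numbers; covered_by_test: test -> covered set."""
    covered_somewhere = set().union(*covered_by_test.values())
    return patch_lines - covered_somewhere

patch_lines = {10, 11, 12, 13}
covered_by_test = {
    "test_a.ll": {10, 11},
    "test_b.ll": {10, 12},
}
print(sorted(common_uncovered(patch_lines, covered_by_test)))  # [13]
```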
elfconv: AOT compiler that translates Linux/AArch64 ELF binary to LLVM bitcode targeting WebAssembly
This talk is about elfconv, an AOT compiler that translates a Linux/AArch64 ELF binary to LLVM bitcode targeting WebAssembly. So first, I will explain what WebAssembly, or Wasm for short, is, and why we use Wasm. Wasm is a virtual machine instruction set, and currently it is used on servers as well as browsers in production environments. Compared to existing application formats, there are mainly two features: portability and security. For portability, Wasm enables us to run applications on both browsers and servers without modification; and of course, Wasm doesn't depend on the CPU architecture, so we can run Wasm applications on computers with various CPU architectures without modification. For security, in the case of outside browsers, Wasm is highly isolated from the host kernel by WASI. WASI is an API that provides access to OS-like features, for example file systems, sockets and so on, and WASI is implemented by WASI runtimes, for example Wasmtime, WasmEdge and so on. Wasm has a Harvard-architecture design, so the memory of a Wasm instance is clearly separated into linear data memory and code memory, and Wasm code can access only the linear data memory, which increases security. However, there are some limitations on the capability of applications. First, Wasm can jump only to code that is determined at compile time; in other words, it is impossible to indirectly jump to code generated in the data memory. Second, WASI implementations don't cover all POSIX APIs, for example fork, exec and so on. So when you develop Wasm applications, you should consider these limitations. Now, many programming languages support Wasm, for example C, C++, Go and so on. However, it isn't easy to build for Wasm in some cases, as follows; mainly there are three cases. First, the programming language that you want to use doesn't completely support Wasm.
Currently many major languages have begun to support Wasm, but only a limited number of languages are available in production environments now. Second, a binary is available, but the source code of the binary is not available. Recently the number of open-source programs has increased, but several programs are still not published. And third, the case where it is time-consuming to build the environment: if the dependent libraries of the target program are not maintained, you might not be able to build the libraries, and in such a case it might take much time to build. So next, I show existing projects that run Linux binaries on Wasm. The first project is TinyEMU. This is an x86 and RISC-V emulator available in the browser, and the Linux kernel can run on the browser. The second project is container2wasm. This enables us to run the Linux kernel and a container runtime with emulators compiled to Wasm, for example TinyEMU, and it can run containers without modification, both on browsers and on WASI runtimes. However, these projects go through an emulator, so they come with emulation overhead; elfconv, in contrast, is an AOT compiler that compiles a Linux/AArch64 ELF binary to several binary formats. So next, I will show the demo of elfconv. Can you see? Okay, thank you. So, well, I have prepared a container image for the elfconv project, and now, in this terminal, the container of elfconv has already started. The target sample ELF binary to be converted is under examples, and this program outputs the first 100 prime numbers in ascending order. Okay, so we try to compile this ELF binary to Wasm with elfconv. In the directory there is one file, elfconv.sh, and elfconv.sh is used to drive elfconv to compile. Okay.
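For reference, the computation the sample binary is described as doing, printing the first 100 primes in ascending order, looks like this in Python. This is just an equivalent sketch, not the demo's actual source:

```python
# Equivalent of the demo program: the first `count` primes, in order.
def primes(count):
    found = []
    n = 2
    while len(found) < count:
        # n is prime iff no smaller prime divides it
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found

first100 = primes(100)
print(first100[0], first100[-1])  # 2 541
```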
So the target here is the browser, and the input is the sample ELF binary under examples. Now elfconv compiles it. Okay, great. Several files are generated, and we can execute the Wasm application generated with Emscripten. We run it in the browser. Okay, so the server for the Wasm application has started now. Okay, wait. You can see the output is correct in the browser. Okay. So now let's return to the presentation. So, in compiling the ELF binary to LLVM bitcode, two modules are used. The first is the elfconv lifter: this parses the ELF binary, maps every section, and operates the next module. And Remill is a library for lifting machine code to LLVM bitcode. As this figure shows, elfconv compiles the ELF binary to LLVM bitcode with these two modules. And next, I will explain how elfconv compiles the ELF binary to LLVM bitcode and a Wasm binary. Remill converts one machine-code function to one LLVM IR function: for example, as you can see, the function func1 of the machine code is converted to the lifted function _func1, which is an LLVM IR function. And also, one CPU instruction is converted to one LLVM IR block: as you can see, the mov instruction of the machine code is converted to the corresponding _mov block. Okay. So next, I will explain the details of the LLVM IR block converted from a CPU instruction. There are three steps in the converted LLVM IR block. The first step is the program counter calculation: this figure shows that %29 is the program counter of this instruction, and the PC is updated to the next program counter. The second step is the operand calculation: in this figure, this instruction uses the X7 and X3 registers, and in the operand calculation, X7 and X3 are loaded. Okay.
So, the third step is calling the function of the instruction-specific operation. For each CPU instruction, Remill generates a function that performs the instruction-specific operation, and the corresponding function is called at the end of the LLVM IR block, at this end here. So next: as I explained at the beginning, Wasm code can indirectly jump only to code that is determinable at compile time, and this figure shows how we deal with the indirect-jump BR instruction. In this figure, BR X7 indirectly jumps to the mov instruction. For the BR instruction, the address to jump to is stored in an IR value, and we jump to the dispatch block for the BR. After jumping to that dispatch block, we get the target label by calling a helper function, and after that, with the indirectbr instruction, we jump to the target block. And the indirectbr instruction requires all candidate labels as an argument; this array consists of all labels in the function. But in the current design, the array of candidate labels includes only the labels within the function, so elfconv doesn't support setjmp and longjmp now; that is a future task. And next, in converting the LLVM bitcode to Wasm, elfconv statically links the LLVM bitcode and the elfconv runtime. The elfconv runtime includes the mapped memory of the original ELF binary, that is, the stack and heap areas of the ELF binary. The elfconv runtime also includes the program for system call emulation. An existing compiler, for example Emscripten or wasi-sdk, compiles these two modules to Wasm. Okay. So, for the Linux system call emulation, there are two ways of implementing the emulation, and the way of implementing it depends on the libc implementation.
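The indirect-jump handling described earlier can be modeled roughly as a dispatch over a table of known labels: the runtime target address is looked up, and control transfers only to a block that was known at compile time. All names here are invented for illustration; this is not elfconv's real API:

```python
# Model of lowering BR: a function is a table of blocks keyed by address,
# and an indirect jump is a lookup in that table followed by a dispatch.
def make_function(blocks):
    """blocks: address -> callable returning the next address, or None to stop."""
    def run(entry):
        addr = entry
        while addr is not None:
            if addr not in blocks:            # like BR to an unknown label
                raise ValueError(f"no candidate label for {addr:#x}")
            addr = blocks[addr]()             # 'indirectbr' to the chosen block
        return "done"
    return run

trace = []
blocks = {
    0x1000: lambda: (trace.append("entry"), 0x1008)[1],  # br x7 (x7 = 0x1008)
    0x1008: lambda: (trace.append("mov"), None)[1],      # target mov; then stop
}
print(make_function(blocks)(0x1000), trace)  # done ['entry', 'mov']
```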
In case one, if the libc implements the target system call, elfconv just uses the libc function, as shown in this figure for the write system call. Okay. And in case two, if the libc doesn't implement the target system call, elfconv has to implement the system call itself, as shown in this figure for the brk function, which is not provided; so elfconv implements that system call. Okay. So next, I will show the performance evaluation of the generated binary. The target sample ELF binary is a simple prime number calculator: this program computes all prime numbers less than the input integer. And one thing to notice here is that in this evaluation we are using the x86_64 binary instead of the Wasm binary, because in the current implementation the system call emulation for Wasm runtimes is insufficient; so we use x86_64 as the output binary for the benchmark test, I'm sorry. And the comparison method is QEMU emulation of AArch64 on x86_64, so we compare QEMU emulation with AOT compilation by elfconv. I measured the performance in two cases: in the first case, the input integer is 10 million, and in the second case, 15 million. The performance evaluation is as follows: as you can see, in both case one and case two, the binary AOT-compiled by elfconv is faster than QEMU emulation, and therefore we can say that AOT compiling is faster than QEMU emulation, at least in some cases. So, okay, last, I will show future work. First, we will support the output of other binary formats: currently elfconv supports only Wasm and ELF x86_64 output binaries, so we will support other binaries as output. Second, we will support compiling ELF binaries of other CPU architectures: now elfconv can compile AArch64 ELF binaries, so in the future we will support other input binaries.
Okay, so, third, we will extend the system call emulation. Now elfconv implements a part of the system calls, and a lot of system calls are not implemented. Specifically, when targeting Wasm as the output binary format, some system calls are difficult to implement, for example fork, exec and so on; so I think that implementing those system calls is very valuable. And fourth, supporting dynamic linking: now elfconv can compile only statically linked ELF binaries, but dynamic linking is an important function and we will support it in the future. Okay, and fifth is the performance analysis of the Wasm target: now I measured the performance evaluation on x86_64, so I should measure the performance of the binary of the Wasm target. Okay, and sixth is making the generated LLVM bitcode more efficient. Then a question about the current implementation and translating to the 32-bit x86 platform: sorry, I think that the AArch64 ELF binary is mainly used in the world, so I think the support of the AArch64 ELF binary has a big impact. Next question: have you considered using rev.ng instead of Remill, if you know rev.ng? I'm a core developer of rev.ng, disclaimer. I'm sorry, could you repeat the question? Remill is a tool to lift executable code to bitcode; there's another tool which we developed, called rev.ng, that does something similar. Maybe have you considered that? Are you interested in that? I don't know it, sorry. Is it an alternative? rev.ng is an alternative library to Remill; have you heard of the rev.ng library? He was just asking if you'd heard of the rev.ng library, which does something similar to Remill. It sounds like you haven't heard of it.
That was my interpretation of the question, anyway; I think that will fly. Next question: when you measured the performance between QEMU and elfconv, what did you measure there? I didn't understand: was it the compilation or the running? Well, the compilation performance of elfconv is very slow; for this sample ELF binary, it takes about one minute to compile. Oh, so is it the compilation that is faster, or is it the running of the thing that is faster? I don't understand: are we measuring the running, like the produced result, or the compilation? Which is it? I guess that is for the running. So QEMU runs with a JIT that turns the code into native code, whereas you have ahead-of-time compilation for the Wasm that you run in a browser, right? So are you looking at the performance running on the browser and comparing that to QEMU, or are we looking at some compilation time? I just want to understand what we are comparing. Sorry, could you ask again after the presentation? I'm sorry. Thank you. Yeah. Next question: so, you compared the performance of emulated AArch64 versus an x86 binary. Have you also tried, after converting it with elfconv, to convert it back to AArch64 and benchmark that against the original binary? Like, what is the overhead of the lifting? So, the question is about the overhead of the binary lifting. Oh, yeah: in the program of this performance evaluation, the performance overhead of the lifting is very small; it takes maybe three or four seconds to lift the binary to LLVM bitcode.
Yeah, but what I meant is: if you compile the bitcode back to the original architecture, how is the performance of that binary compared to the original binary? So you are asking, from the ELF binary to the target architecture binary, about the performance overhead of going from LLVM bitcode to the target binary. Oh, sorry. I'll just follow up on that. So, if you just... I will just drop in directly, but from experience.
Map LLVM values to corresponding source-level expressions
Yeah, it's done. Yeah, well, we're about to start. You're already on it? Really? Where do you want it? Thank you. Hi, everybody. My name is Shivam, I work for KDAB, and this summer I also worked in Google Summer of Code with LLVM on this project: mapping LLVM values to the corresponding source-level expressions. But why? The challenge is understanding compiler optimizations. Compilers perform different sorts of optimizations, and it's not always possible for code to be optimized or, in particular, vectorized. Our motivation was vectorization first of all, because we wanted to improve the optimization remarks for the vectorization part. Your compiler cannot always vectorize your code: there can be data dependencies, which is why compilers cannot vectorize all the time. In those cases you have to emit good remarks, and I'll show you what clang currently generates as a remark. Understanding why and how these optimizations occur is not always straightforward; even the authors of the vectorizer don't always know what's going on when vectorization didn't happen. So consider this example. You can see there is a data dependency between A[i] and A[i+3], so clang will not be able to vectorize this loop. Okay. So see this remark produced by clang, which says the loop was not vectorized and that you can use #pragma clang loop distribute(enable), so the compiler tries to distribute the loop and might then be able to vectorize it in some sense. But just look at that remark.
It's not clear what actually went wrong here or where the data dependency is. The remark doesn't tell you where the data dependency actually was, so that you could improve the code itself. Right, so it's a bit unclear. Whereas if you could have a remark like this, nothing much, just two expressions, the dependence source and the dependence destination, then you would know there is a data dependency between these two locations, and if you are familiar with the code you can modify it, perhaps in a way that you know makes it possible for the compiler to vectorize. So you can fix the code just by looking at these expressions. So yeah, this would surely enhance the remarks: they would include the exact source and destination of the dependency and pinpoint the lines involved. Let's look at the impact of these enhanced remarks. Clarity for the developers: they can quickly see where the dependencies actually occur, improve their code, and probably make it vectorizable. And efficiency: they can save time by reducing the need for deep debugging to find where the data dependency actually was. You just look at the optimization remark and you get quite a lot of information: okay, there is a data dependency between these two loads and stores.
So, let's look at the approach we took to solve this problem. The approach is very simple: utilize the debug information that is available in the intermediate representation to recreate the variable and function names lost during optimization. The optimizations are actually a problem in our case, because we currently don't know how to rebuild instructions that are lost to optimization. For example, if you see a MUL instruction in the IR, the compiler might optimize it into a shift-left. The MUL was the information that actually matched the source code, but now we have a shift-left, so we just lose the context of what the actual source-level operation was. That's still a problem for us. We had a different approach for that: cloning the module, so we could keep a clone of the original IR and see what happened after each optimization pass. You look at every transformation pass, see what changed, okay, the original instruction was a MUL but now it's a shift-left, and you cache the expressions accordingly. But it performed very badly, so we set it aside. So let's see how to utilize the information that is available in the IR. LLVM uses a small set of intrinsic functions, if you are aware of them, which are provided for debug information. They carry different metadata as arguments, and they are named with the prefix llvm.dbg. These intrinsics help you track variables through the optimizations and code generation. If you dump the IR of code compiled with the -g flag, you will see llvm.dbg.value and llvm.dbg.declare calls; those contain everything related to the source level. They reference metadata nodes, and the metadata can give you a lot more information about what was actually in the source, for example variable names: when you trace the metadata you can get the variable name from the actual source. For us these two intrinsic functions were very important: llvm.dbg.declare and llvm.dbg.value. Let's try to understand them a bit. You can see i is allocated, and just below it a call to the intrinsic llvm.dbg.declare with three arguments. The first always represents the address of the variable. The second is metadata pointing at, for example, a DILocalVariable: a metadata node which contains the variable name, so you can see the actual name was i in the source expression; when you trace back the information, you can retrieve the name. So the second argument is the source-level information, like the name, and it can help us a lot. The third argument is a DIExpression, and a DIExpression is generally useful for complex expressions; if you have an expression like int a = b + c, a DIExpression can hold that sort of thing. That's llvm.dbg.declare, and llvm.dbg.value is very similar: it's just that when a value is updated, when i gets updated, that goes through llvm.dbg.value. So we now have enough information to at least try to build the source expressions, but only if the code is compiled with debug info on, that is, with the -g flag. We use the debug intrinsics as a bridge. Our focus was on memory accesses and vectorization, as I said. We really wanted this project for vectorization first, and we also have a plan to push it into debuggers later, so debuggers can use this information too, but the main goal initially was the vectorization pass. Vectorization is a transformation pass, and a transformation pass can always query an analysis pass; our work is an analysis pass, so the vectorization pass in LLVM can always query it. Okay, so the project's contribution is that we have built an analysis pass that can generate these mappings and provide better remarks for vectorization or anything else that needs them. Let's look at the implementation details. For us the points of interest are load and store instructions, because of vectorization: we want to analyze the memory access patterns so we can report them in the vectorization remark. For example, take a look at this C code. If you compile it with clang -O2 -g and emit the LLVM IR, just to show you what's going on, I think it should be visible: you can see the calls to the debug intrinsics, and we can build the expressions from them. As I said, the first operation was a multiply by n1, but we compiled with optimization on, so the multiply instruction went away and was replaced by a shift-left operator. That's why you see a shift-left here and not a multiply. And that's a problem for the accuracy of the expressions, because we still don't have a good plan for how to generate the expressions accurately when these constructs disappear due to optimization.
It has always been a hard problem how to debug when optimizations are applied; it's a classic problem which we still have to look at. So you can see from this example that computing the equivalent source expressions of interest involves walking through the LLVM IR and using the information provided in the debug intrinsics. Even though our current interest is loads and stores, we still have to handle every kind of instruction, because when you trace back from a load or store you may reach any instruction: a binary operation, a GEP instruction, anything, and we have to build expressions for those too. And as I said, optimizations can make it impossible to recover the original source expression: as you saw, 2 * n1 is optimized to n1 << 1, so recovering the original expression may not be possible every time. So let's look at how we proceed; it's just the basic algorithm I want to go through. We start by traversing the IR. We identify the operations of interest, currently load and store instructions, and look specifically for those in the IR. Then we trace their operands, which may be other instructions or may sit inside metadata; we retrieve information such as names from the metadata; and utilizing that metadata information we build and reconstruct the source expressions from everything we have gathered. That's about it. Now, the current state: it's not yet upstream in LLVM; the PR is here. What I need from you: anyone who has experience or is active in this area of optimizations, for example analysis passes or transformation passes in LLVM, I would like you to have a look at the patch, review the code if you have some experience, and give feedback, so we can proceed in more detail, because it's still a new analysis and still needs a lot of work, for structs as well. As I mentioned, we need more review on the patch, and some active work from me as well, and if any of you are interested, please reach out. And as I said, structs pose a unique challenge: when we tried to build the expression for a struct it was very difficult, because of how they are represented in the intermediate representation. It's very strange to look at them; I don't know how they actually end up like that in the IR, and it's not as simple as producing expressions for an array. So structs are still a problem, accurate source-level expressions in the case of optimization are still a problem, and there isn't always a one-to-one mapping between the source code and the IR. For example, if you see *ptr and ptr[0], the IR can be identical for these two patterns, and we don't know which one to pick. That's still a problem. One solution we discussed uses the fact that the debug information also contains information about the file; there is a DIFile in the debug info, so we still have the file path. What we can do is actually open the file, go to that line, and retrieve the ground truth of what was actually written. The second option was a fallback: since we don't know which was there, just fall back to either of them. The DIFile approach is actually quite easy, but it's not good performance-wise, opening the file and going to that line and retrieving it. So yeah, that's it for the talk, and thank you for listening. If you are interested in knowing more about this project and the algorithm, please reach out to me by mail or, for example, Discord. So yeah, thank you. Any questions? Yes: why do you need to rebuild the entire sequence of expressions for each of the values? Why not just report the specific values in the dependence and the line from the file? Can you repeat the question? You know, when you emit remarks, there's a tool called opt-viewer that puts everything inline; between that and what you have here, it seems you would get excellent results in terms of debuggability if you just did what opt-viewer does, plus specifying which of the values are causing the dependence, as you said, and the reason for the failed optimization. Okay, so the question is basically about using opt-viewer, right? Yeah, just emitting a more limited view, as you have here, and not trying to reconstruct everything. So we are not reconstructing everything; we are not focusing on mapping the whole IR to expressions, we are still focusing on those loads and stores, as I said. Right, yeah. We pick up the loads and stores and check whether there are any GEP instructions, because a GEP instruction actually contains a chain of instructions. But we still have to build the chain for loads and stores, and opt-viewer is not good at emitting those remarks; it's still very abstract in that sense, if I remember correctly. So I'm not sure how to do this with opt-viewer, but we are doing it for loads and stores and tracing back the information. Yeah. Nope, not sure, but one thing I can guess: opening a file is not something that is very good performance-wise, and then going down to that particular line, because there could be many lines of code in the code base, so you have to go to that particular line; it would be very bad for performance, I think. Okay. And is there no study of whether it would be more beneficial to tell the programmer that the error, or the suboptimal choice, was between lines 27 and 28, compared to generating some arbitrarily complex expression that might not be representative of what the programmer originally wrote? I'm not sure. Okay, yeah, I think it would be fine then: if you're choosing to emit such remarks, then you know this is not good for performance, so if you want the actually correct remarks, you have to pay for it in performance; then it would be possible. We have also been talking about preserving the metadata in LLVM as it goes through the passes, but in LLVM metadata is designed in a way that it can be dropped at any time, so we still cannot preserve the metadata information; that's still a challenge. Okay, yeah, thank you. Okay, thank you for joining; when you leave, make sure to take everything with you.
The Matrix State of the Union
Okay, so this is the Matrix devroom at FOSDEM 24, I guess; in case you are in the wrong room, take the chance now and leave. We have an afternoon packed full of information about Matrix. It's only an afternoon, so if you want to look up more information, there is an internet full of it. But if you are lazy and don't want to collect all the information yourself, then these wonderful people have collected it for you and will now give you a presentation about the state of the union. Matthew and Amandine, give them a warm welcome, and the stage is yours. Thank you, Jan. So we honestly weren't sure what to talk about, because if folks came to the main-stage talk in Janson this morning, basically the first 25 minutes was the state of the union of Matrix. So we have a bit of a question mark over the subject here. Also, Jan just promised that we will transfer the contents of the internet into your brains, which we also hadn't really prepared for. Anyway, if you don't know who we are: I am Matthew, the technical co-founder side of Matrix, day job CEO at Element. And Amandine, the non-technical co-founder side of Matrix, day job COO at Element. But we would like to at least try to tell you something new about what's going on here, and we actually realized that we have never done a brief history of Matrix, which begins before many of you were born, in the year two thousand and three. Now seriously, the actual backstory is that a bunch of us were at university together at Cambridge, and we were messing around with instant messaging on a project called Project Foxtrot. The idea of Foxtrot was that it was written in Java 1.3, fresh off the press at that point in the late 90s, and what it did was serialize chunks of Java and send them over TCP sockets, except it was end-to-end encrypted using manually written Diffie-Hellman and RSA exchanges.
So that is where I, at least, got the bug for Matrix and instant messaging, and after we either got kicked out or left or graduated from Cambridge, we ended up working at a little company doing APIs for the PSTN. So that's 2003. Fast forward rapidly to 2010. Well, my company was doing mobile app development, and Matthew's company and mine both got acquired, about a month apart, by a big telco vendor. You would find them in the depths of AT&T doing all their billing systems. So: small startups having fun getting into a very big company. After a few years of rattling around inside Amdocs, I'm not sure why we're not mentioning Amdocs by name, but it was Amdocs, we discovered a new-found desire to burn the phone network to the ground, annihilate it, and replace it with something open and decentralized and federated that anybody can join, rather than the cabal of the phone companies, where it's almost impossible to connect into them. And so that was where the idea of Matrix came from. We basically took the combined folks in Rennes and London, went to Amdocs and said: hey, a little bit of a crazy idea, but what if we build an entirely new communications protocol? And if we pull it off, then you, my friends at Amdocs, can go and sell it to AT&T and many other big telcos, and you can replace the PSTN. And meanwhile, at the same time, the rest of the world would get the big benefit of the existence of Matrix. And amazingly, they said yes, with no strings attached; they allowed us to go and switch the business unit from selling clones of WhatsApp and Skype to telcos to instead building out Matrix. And that's what we did, starting, depressingly, in May 2014. So we are a couple of months off having been doing this for 10 years. Not sure whether that's something to celebrate or not in the grand scheme of things. What happened in 2014? So in 2014 we all gathered in Rennes, sat down, had a big brainstorm on what this thing would look like, and ended up with mostly what Matrix is today.
Not much has changed in terms of the overall idea and architecture and those sorts of things. We started in May, and the goal was: September 2014, we're going to launch this. Four months to figure out a high-level working Matrix proof of concept. And we did it. Yeah, it was a disaster really, because we rushed at incredible speed. It was like the best gig possible: your day job suddenly tells you and all your mates at work that you can go wild and create something like Matrix, and everybody sprinted in slightly different directions, naming no names. We might have ended up with three different versions of Synapse at first: we had the bit that talked the client-server API, we had the bit that spoke the server-to-server API, and we had the bit in the middle that was meant to funnel stuff around the place. Each one had a different database schema. Each one had a different object model. They were all written in Python, which was honestly a win, but it's possible we might have sprinted a little too over-enthusiastically into this, and we spent about six years paying off the technical debt that we accumulated in those three months of run-up to launching Synapse. Worth noting the end-to-end encryption wasn't there on day one, but we did start it in 2015, and we always designed it as part of the protocol, because if you are going to replicate data equally over many, many home servers, obviously it needs to be end-to-end encrypted, such that if one gets owned, all the messages don't go out of the door. Then in 2016, it says, we launched Element. I'm not sure where I pulled the slide from, but it definitely wasn't called Element. Basically, when we launched, we were using Matrix Console at the beginning, and then we said: okay, we need a very glossy app to actually drive the usage of this. We launched something which became Element at some point but initially was called Vector. What was the second name? Okay, let's quiz the audience: what was the second name of Vector, before Element?
Yay, well done, and here we go. Element now is the flagship client and still growing. Eventually, in 2017, we set up shop as properly independent, both with the commercial company Element and also, a bit later, in 2019... Yeah, I think technically the Foundation was incorporated in 2018, but we didn't do anything with it until 2019, to try to make sure that there is a clear split between governance, the open source project and the protocol, versus us practically trying to fund the bloody thing, Element running around doing commercial stuff. But that was the point where things started to split properly into your classic open source foundation versus startup trying to build stuff on top. We eventually turned on end-to-end encryption by default in 2020, following Matrix 1.0, which I guess was June 2019, and then fast forward to 2023, when we announced the idea of Matrix 2.0, as showcased at last FOSDEM. And here we are today, in 2024, the year of mainstream Matrix. Who knows, maybe, if you saw the DMA bit of the talk earlier, it may or may not be happening; but yeah, we'll talk a bit about it, and Travis afterwards is going to have an amazingly, very deeply technical talk all about everything you wanted to know about the DMA. I haven't asked permission from any of the other people in this photo to put it up, but this is the original Matrix team on our way to Rennes from the London side, playing Magic: The Gathering or something. No, it was all of us. Yeah, at that point the French side was all in Rennes. Yes, because we hadn't got to France yet.
We're literally at some crappy Travelodge, I think in Luton or Gatwick or somewhere, on the way through to Rennes, and so yeah, basically that was the vibe at the beginning of Matrix back in May 2014. And more of a vibe is this, which was the whiteboard in the Jupiter project room in the offices in Rennes, where we basically drew up the possible architectures that we could use for Matrix. You will notice that there are four, if not five, architectures here. The simplest one is just client to server to client; this was almost just mapping out the various options we had on the table, but at that point we hadn't really decided how decentralized it would be. Then we had the one that, honestly, I came into this with, which assumed it would be a little bit like SMTP and IMAP, or just SMTP: your client would talk to a home server which would cache the rooms, which would talk to another home server, which would be a single point of failure, which would talk to a client. I mean, it's a bit like MAM in XMPP; sounds pretty easy. What I did not expect was for some of the folks on the previous slide to turn up looking really excited, saying: you know, I think we might be able to do it such that we can actually replicate this between the home servers, which I christened at the time "the distributed sync nightmare", an active-active replicated version of the protocol. And then there is another one down here where you've got two inboxes that sort of synchronize-ish together, but you basically have queues rather than DAGs. And you had this one, which has got lots of double arrows, and I have no idea... oh, it's a mix net, I think, is basically what that was about: you'd have a personal home server, you'd have a bunch of relays, which were trusted, maybe, or not trusted, I don't know, I can't remember, it was 10 years ago. But either way, that was the level of whiteboard diagram that we were playing with at that point.
So basically, as Matthew said earlier, fast forward almost 10 years: 2023 was very much about focusing on getting the basics to work well, thanks to the limits of funding, which is good sometimes: if you have a bit less money, then you do focus on the more important things. So we have paused a lot of things. How do you want to do this, do you want to go through the list, Matthew? No, you don't. Okay, I will do it. So the focus was very much on Matrix 2.0: Synapse, and the SDKs, the Rust and JS SDKs. Otherwise, peer-to-peer Matrix is on the side, pseudo IDs as well, crypto IDs, accountability; however, we still hope that very, very soon we'll be able to get back to all of this, low bandwidth as well, and some of the "done right" work funded by Element. The legacy Element apps, and the SDKs they are based on, are on bug fixes only, and hopefully we'll be able to switch everything to Rust soon. And libolm as well, now that vodozemac is taking over. And Third Room is on the side, waiting for someone to take it and bring it up to all the power it can have.
Yeah, Third Room is particularly frustrating. We got an email from the W3C after we announced that we had had to lay off the team at Element who were working on it, and that nobody had picked it up, saying: what, this is meant to be our promised land of WebSG, the Web Scene Graph API that we created; I thought this was how the future of the spatial web, as Apple would call it, is meant to be. And I said: well, I'm really sorry, but we literally could not find anybody to fund it whatsoever. Even people like Rolls-Royce, who promised that they really needed this and would fund it, then proceeded, well, first of all, to lay off the team that we were talking to on their side, and b, not fund it at all anyway. It's been a really fun year. That said, I'm going to disgrace myself, as you probably expect, by wanting to talk a little bit about the projects which are shelved, because it's really frustrating that an awful lot of work went into them last year until, around November, they got forcibly parked. One of them is pseudo IDs, MSC4014. This is the project to replace MXIDs with arbitrary identifiers per room. The reason for doing this is, well, first of all, GDPR: at the moment MXIDs get baked into the conversation history of your room, and they are things like @matthew:matrix.org, whereas if you had a different unique identifier on a per-room basis, that problem goes away, and it's up to me whether I want to publish a mapping of my Matrix ID onto the sender key or not. The idea of this MSC is that it works out of the box with existing clients, no code change needed, because the client-server API maps the sender keys back to MXIDs when it hands events to the client. However, this does not provide account portability; it's just replacing the MXIDs. It got implemented in Dendrite in June of last year, and if you're feeling particularly creative, go and turn on the feature flags in Dendrite and have a play with it. But as I said, unfortunately it is currently on ice. I'm not going to force Amandine to do the crypto IDs one just for the sake of alternating slides. So, crypto IDs is an extension of pseudo IDs, highly experimental. The idea is that your sender keys become your end-to-end encrypted identity, so we finally unite end-to-end encryption in Matrix with the idea of your MXIDs. The idea is that when you join a room for the first time, you get a crypto ID generated for that room. Interestingly, and perhaps controversially, your client then signs everything it does, or at least the events, with the crypto ID, so that you can basically prove that you own those events, and as you move between servers in future you can prove that an event came from me as an individual, Matthew, rather than being signed by your home server, which you don't really care about if you're migrating between home servers. This has the side effect that we no longer have cryptographic deniability, because by definition you would be able to see that a given client owned by a given user has sent a given message. So there's going to be an interesting trade-off there. Right now we do technically have cryptographic deniability, but practically speaking it really depends on the trust model, and I'm not sure just how useful it really is other than on paper, whereas this would obviously throw it away. Again, implemented in Dendrite, and it was just being drafted in the Rust SDK when it got shelved in November. The idea is that if you take pseudo IDs, add crypto IDs, and add some magic glue, which probably means storing account data in a room so it can replicate between servers, then you would have client-controlled account portability, also a prerequisite for peer-to-peer Matrix, which is likewise currently on hold. How am I doing on time, Jan? More than 15 minutes? No, don't worry. Can I do a demo then?
Okay so whilst we're talking about daily departed projects I know this is probably going to piss off a bunch of people but I really want to very briefly show the final bits that the third room guys did before they got killed so here's our third room using OIDC as oh that's a great start this is what happens if something is busy a bit rotting away let me try to sign into this using OIDC because third room was the first thing that we used to test out native OIDC we might have to wait a little minute for that server to wake up because I haven't logged in very recently so this is definitely a dangerous demo talk amongst yourselves imagine that a server is actually working here which it is right so where we left you last year at Fozden was that this thing had just launched and the next big thing was actually to make the whole thing scriptable and do fun stuff with it and it got to the point here where you could go and enter a world like this and this is just a matrix room stored in gltf with the sorry with the world data stored in gltf itself but what Robert and AJ implemented is if you press the tilt button at any point you go and get an inworld inspector up you can go and select things like buildings and you can do things like move them around and manipulate them in real time I think I showed this last year the next thing though was to make the entire thing scriptable by Wasm so you have a script editor now built in here which gives you a little bit of javascript what you can do is to go in and grab something like the buildings you just drop it straight in there and it right see the javascript to grab the buildings and then for every time the world updates you get a delta timestamp and absolute timestamp and I can go in there and do I do not know what this API is how back in this go let's assume it has a translation button and say that y is going to be what 10 units times or the sign of the current timestamp that will work right and if you head um save as run what it 
will do is compile the JavaScript down to Wasm using QuickJS — written by the amazing Fabrice Bellard — and reload the world, and there you go: the buildings dance. I think this is so cool. This is so cool. You can see why W3C got in touch afterwards saying, hang on, this is how the future of the web is meant to be — and where are the people? And it's like, well, this is what it is. So if you're watching this and you think this deserves to exist: well, first of all, I'm not sure I'm ever going to persuade the guys to work on it again, because they feel pretty pissed off, obviously, that the project collapsed. But the code is all still there, and it's so tantalizingly close to being absolutely amazing. Right, sorry — back on to what we were talking about. Crypto IDs, or — yeah, what's next? Matrix 2.0. I mean, who was in the talk in Janson this morning? Do I need to go through this again? Oh crap — yeah, only about half of you. Perhaps we should have done that at the beginning of the talk, twenty minutes in. Anyway. Right, so, Matrix 2.0, very quickly. First of all, this is not a spec release; this is a state of mind, a bit like Web 2.0. It's made up of various MSCs, and the status is: sliding sync — so instant launch, instant login and instant sync — kicks ass, but it's too fiddly. We are currently performing a slidectomy, which is the technical term for removing the sliding bit from sliding sync, and there is in fact a PR against the Rust SDK which basically shifts all of the ordering onto the client rather than doing it on the server. And this is all my fault, being stupid and over-enthusiastic, going and trying to do this over-optimized implementation where the server figures out the best possible ordering and then the client tweaks it at the end — and it turns out that having two different things fighting over control of the order of a list doesn't work very well. So we've basically said the client gets to order it entirely; the server does a very approximate, probably timestamp-based ordering, and
the good news is that it is just a subset of the current API, so it's not an all-or-nothing rewrite — it's just basically simplifying the API so it's easier to implement. Then you've got end-to-end encrypted VoIP, which again kicks ass. We demoed it in Janson and it worked this morning. We need to update the MSC, because it's on its sixth or seventh iteration now, and I think it's stabilized enough that we should actually spec it properly. Faster joins — so Synapse rapidly joining rooms on other home servers (and other home servers doing the same, for that matter, if they implemented it), incrementally lazy-loading the data in — would kick ass if we actually finished it. We got the hard bit done — the infrastructure — and made rooms non-atomic in Synapse, but then didn't actually get to the point where we made it go significantly faster. And then OIDC, which does kick ass, but it's going to be a big migration, as we need basically everything to support it before we start turning it on on matrix.org and so on. But there is lots of stuff in progress; if I have more time I'll try to show the QR single-hop login demo, which is super cool. Then the Rust SDK is the brave new world that wraps this all together on the client side, and as of Friday — as I mentioned in Janson — the JS SDK, and therefore Element Web and anything else using the JS SDK, now uses the Rust SDK for crypto. So we are finally at the point where the old libolm C++ library is in maintenance mode (and then some), whereas vodozemac, the Rust implementation, is our brave new future. And I spoiled that Damir has already produced a draft post-quantum PR for vodozemac using the Kyber primitives wrapped around, I think, Curve25519 — so a kind of hybrid approach which should be compatible with Signal and its PQXDH key exchange stuff. And what else are we doing in vodozemac? There was another big thing, but I can't remember what it was — another PR that landed. Basically we fixed all the crypto bugs in one place, and a huge, huge focus in the coming months is making the crypto finally suck a lot
lot less. Should I keep going on MLS? You can do the whole end bit of it. Okay, on MLS: people might be wondering, hey, you're not talking about MLS any more — what's that all about? First of all, we are still doing this; you can track the progress on arewemlsyet.com. MLS is the group encryption that scales much, much, much better than the normal Double Ratchet, and in vodozemac we have it largely working on Matrix. It has huge key bundles — you have to store the keys in the media repository, they're so big at the moment. However, there's been a lot of discussion on the MIMI side — which we'll talk about briefly, and Travis will talk about a lot more in a few minutes — in terms of: what if you actually used MLS to synchronize everything? So rather than having a Matrix DAG for tracking and synchronizing data between servers, what if you just chuck everything into MLS? TBD. So there's a bit of a debate going on about whether you put MLS over Matrix, or Matrix — or MIMI — over MLS. Right, your slides. Yeah, basically, as we said at the beginning, 2024 could really be the year where our prediction comes true — and the prediction was this. This is a slide taken off an investor pitch deck saying, in 2019, that in five years everyone will communicate over Matrix — that's why we did this, right? In 2019 it said ten years; now, because we're five years later, it says five years. Just saying. Also, this is written in R, and it's real traffic from 2019 showing, I think, the top 100 home servers talking to one another. Just saying: if your investor decks aren't written in R, you're doing it wrong. So: basically killing email and the phone network. So, the Digital Markets Act — you may have heard of it. It demands that the big communication services, called gatekeepers, actually interoperate with the rest of the world. Two of them have been named so far: WhatsApp and Facebook Messenger. Apple is pushing back on iMessage, saying no, no, no, we're not a gatekeeper — but let's see where it goes.
Business to business? Yeah — business to user. So while last year it was coming into force, on the 7th of March they will have to actually expose these APIs as production-ready, and anyone in here who actually wants to interoperate with WhatsApp — because they don't want to create an account there — will be able to go to them and say, hello, can I please integrate against your APIs to talk to your users? It's a little bit ironic, because it starts to look an awful lot like the PSTN: you have great big telecoms providers, and you go to someone at AT&T and say, hello, please can I talk SS7 to you so my little telco can talk to the big telco — and they make you sign a massive contract, and there's all sorts of back and forth. Obviously we can't say what that will look like with Meta, but there could be an entire spectrum between open federation and closed federation and everything in between, and we just don't know what will happen. Let's see in a month. And basically, yeah, we may get to a point where Matrix becomes the glue between all the communication systems, matrixing them together. Yeah — I mean, I'm not counting on it, honestly, on this architecture particularly, because everybody would need to agree both on the same dialect of the Double Ratchet as well as the content payloads within it. But you never know: if we get critical mass in some places, perhaps everybody will follow. So yes, as we mentioned this morning, there is still a lot, a lot, a lot to do, especially on the core — making sure the core is funded. We're trying to put out a big call for fundraising, and honestly the goal is really to get the big guys — who are actually using it for hundreds of thousands, millions of users without contributing a cent to the project itself — funding the core. Trying to raise the alarm.
At the same time, there is a public policy devroom where we're trying to figure out how we get open source projects actually funded, so I'm going to run there to try to solve that problem very shortly after this. Cool. Thank you, guys. So, this morning a lot of people actually — let's go — thank you to everyone who is supporting it, and everyone who jumped in live this morning during the talk to become a member of the Foundation, and thank you to all the existing — how do we call them — supporters, organizational supporters, in here as well. Yeah, honestly, if your organization just happens to use Element and Matrix as its comms system, it really doesn't cost that much to put some money behind the bar to keep it going. Like, we met XWiki on Friday and said, oh, how's it going, and they said, stuck notifications are the bane of my life. And we said, oh well, if you actually want us to have more people to go and work on stuck notifications, perhaps you could become a silver member of the Matrix.org Foundation — and that is why there is an XWiki logo and a CryptPad logo on the slides there. Seriously, it's meant to be relatively modest, but if we get all the organizations doing it, as well as the individuals, then if nothing else it's a lot easier to go to the really big people like the EU and say, look, we've already got 800 people supporting this; this is an important thing; it matters; therefore you should match it twentyfold, fiftyfold, a hundredfold. And as a narrative it may work. So yeah, meanwhile we have an awesome community — a lot, a lot, a lot of things are happening around it — and this is the menu for this afternoon, where everyone will be able to tell us a bit more about what they're working on. Looking forward to it. Thank you, everyone. Any questions? I'm allowed two questions, but they can't come from me. Kim. Excellent question. So the excellent question, which I shall repeat, is: where the hell is multiple-account support in Element?
Now, most of the rest of the clients out there have it already; however, we've never got round to it in either Element or Element X. There's not a good reason for it other than everything else taking slightly higher priority. We did have it in Matrix Console, the very first Matrix client that we wrote before producing Vector and Riot — or whatever it is now. Yeah, no good answer other than that we need to add it, and Element X would be a great time to do that: it's built with it in mind, we just haven't put it in the UI yet. Everyone can do it apart from Element, pretty much. Is there any indication that other assorted governments are looking at following on with something similar? So the question is whether other governments are going to take inspiration from the Digital Markets Act. There are some movements in the US around it — trying to remember the name; it's not the "interoperability bill", but something along those lines. So there is definitely — it's like GDPR, which has since been looked at by the US: Europe is leading on these sorts of things, and yeah, there is movement in that direction. So yeah, the next question is whether we are lobbying within the EU to make the APIs the gatekeepers offer open. So the DMA is forcing gatekeepers to open their APIs, and the big lobbying we've been doing in the last four years has been: please, please, please don't ask them to only open their APIs, but try to converge towards an open standard, so that small companies who want to integrate with these people don't have to build a polyglot messenger which speaks WhatsApp, Facebook Messenger, Google, blah blah blah, all of them in parallel. Please, please, please. So far we don't really know what this is going to be; in the text of the DMA it doesn't say you have to use an open standard, but basically we continue working with everyone — the European Commission, the gatekeepers and all the big corporations — trying to convince everyone that that's the best way to go.
I think the US equivalent of the DMA is the American Innovation and Choice Online Act — maybe one of Senator Wyden's initiatives, perhaps, but you don't need to go and look it up. And there was something I wanted to add to that, but I've forgotten what it was — the lobbying to try to get everybody onto a single standard. It would have been amazing if we'd persuaded the Commission to basically put into law that you have to speak an open standard. The reason they didn't is, first of all, that it's not really their job as politicians to dictate the actual technical implementation — they need to say what the outcome should be, like a more competitive environment without massive anti-trust behavior — but it's up to us, literally up to a lot of us in this room, to figure out what that should really look like. And the other big problem is that there isn't a standard that is suitable. Like, Matrix is great — it's looked after by the Matrix.org Foundation — but that's not an internationally recognized standards body. So if we'd gone through the IETF already, then perhaps that would have worked; it's not like they can just tap on the wall and say "use Matrix", even though it has some traction. I would say this is an amazing segue into Travis's talk right now — unless there are any other questions which I can cram in, but I'm not allowed to. No? Thank you very much.
Interoperability & Matrix
So there may or may not be time for questions. There's a lot of detail — this is a 60-minute talk compressed down to hopefully 22-ish minutes, so we will see how we go. But yeah, I'm here to talk about the technical details of interoperability. I'm Travis, if you don't know me: I'm the director for standards development at matrix.org, I'm also on the Spec Core Team, I run t2bot.io, and I work at Element on trust and safety. I have a few jobs. But good news: there's already more that we can talk about. So, Matthew had the talk this morning — if you haven't seen that, or the recap of it about 10 minutes ago, it covers the DMA and the timelines in a lot more detail. To recap, though: the DMA requires gatekeepers, or large messaging providers, to open up their APIs and their systems for interoperability. Encryption must be maintained between those providers — you cannot break encryption for the sake of interoperating; you have to maintain it. These messengers have three options. They can become multi-headed, similar to Beeper Mini, where you have all the networks available in your one client and you just kind of switch between them. They can create a bridge app, where the user downloads a third thing and then you bridge locally on the device — that works, but it's not great. Or they can speak a common protocol. We've been working on that for the last year, probably longer. And, oh yeah, they have to do all of this by March 7th this year. With that in mind, there are many projects involved as well. There is the More Instant Messaging Interoperability working group, or MIMI, at the IETF. They are trying to specify a standard that does this stuff. We are there very frequently — we are a direct contributor to this. I have written a MIMI protocol document in association with a few other people on the design team, to try and simplify a lot of the components — particularly what Linearized Matrix is.
Also, Linearized Matrix was originally created as this simplified version of Matrix, because it turns out that you don't necessarily need a ton of the fully compatible DAG stuff, or even message history, for interoperability. A lot of the existing providers just kind of want to throw messages around the place; they don't necessarily want to keep these things around. Obviously, of course, we have Matrix — hopefully everybody here is familiar with that — which is the decentralized and fully featured version of an interoperable protocol. So what parts of interoperability do we have to worry about? A few. There is encryption — this kind of fits into a weird L shape, with content format within it. With encryption, we have to make sure that all the messages are secure, and of course we have to make sure that it is consistent across the providers. The content format: what do messages actually look like? We have to make sure that that is the same, because the servers can't help us here; the clients have to agree on this, and that is more of a challenge. We also need an authorization policy, so people can be banned when they need to be, and because there are messages that people might not be allowed to send in certain rooms. Of course, we also have transport, which is just how the servers communicate. And then we have a room model, which is a combination of the encryption, the authorization policy and the transport, plus a definition of membership or participation — a little bit more on that in a minute — and how the messages are fanned out themselves. In the very simplest scenario, we have clients talking to servers, servers talking to each other, and encrypted messages flowing between clients, effectively. It gets more complicated when you add a third server, so we will do that later. Some of these problems are easier than others. Namely, transport: super easy to solve.
Pretty much everybody uses some form of HTTPS. MIMI wants to use mTLS; Linearized Matrix uses the same system that Matrix already does, where you have a signing key that kind of gets thrown around a bit. It is unclear what the actual format over HTTP would be. Matrix uses JSON; MIMI wants to use some form of binary — unclear what that actually is. We are also considering a binary event format specifically for this kind of thing; Protobuf and CBOR are kind of at the top, but that's to be determined. Clients would not be expected to consume that binary format yet — I should probably just add that in. But yeah, we will end up using some sort of binary-over-HTTPS mechanism, with the authorization for it exactly to be determined later. The other easy thing is authorization policy. MIMI does not define one; we have been working without one — we have just been assuming that people are able to send messages. Matrix obviously has one. Role-based access control is super popular amongst a lot of these discussions, and there are those two MSCs there: MSC4056 covers the decentralization part of RBAC, and then you also have MSC2812, which is basically roles as state events — an early form of RBAC. Linearized Matrix uses the existing authorization rules; the Matrix authorization rules clearly already work — people have been using them for almost a decade now, so they should be fine. We will figure out what MIMI ends up with eventually, hopefully. The harder part is encryption. Most messaging providers use libsignal, or something that is a Double Ratchet. We also have a Double Ratchet-like implementation called Olm. It was not interoperable with libsignal up until about 2 a.m. tonight: we now have Interolm, which has X3DH support as well as some of the other deltas you need to be able to support that sort of interoperability. Megolm is what we use in group chats to try and alleviate the load.
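To see roughly why Megolm alleviates the load, here's a back-of-envelope sketch in Python. The function names and the device/message counts are invented for illustration; this is an order-of-magnitude comparison, not a benchmark.

```python
# Rough comparison of encryption fan-out, for illustration only.
# With pairwise Olm, every message is encrypted once per recipient device.
# With Megolm, the message body is encrypted once; the rarer cost is sharing
# the session key with each device whenever the session rotates.

def olm_encryptions(messages, devices):
    # one pairwise encryption per device, per message
    return messages * devices

def megolm_encryptions(messages, devices, rotations=1):
    # one encryption per message, plus a key share per device per rotation
    return messages + rotations * devices

devices = 5_000            # a big room with many devices
messages = 100

print(olm_encryptions(messages, devices))      # 500000
print(megolm_encryptions(messages, devices))   # 5100
```

Same 100 messages, roughly two orders of magnitude fewer encryption operations — which is why pairwise-only encryption in a room the size of Matrix HQ would be painful.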
Otherwise, with Olm, you have to send a number of events proportional to the number of devices in the room, which obviously causes problems when you have multiple devices per user, or multiple users in a room — Matrix HQ would be a nightmare. The Double Ratchet does rely on existing infrastructure in order to send keys; it has no concept of membership. It does not know who to send the keys to on its own; you have to tell it who to encrypt to, and then also send those keys yourself. Some messaging providers, namely Google, have announced that they will be using MLS. We also obviously want to use MLS — arewemlsyet.com is where we're tracking that progress. MLS does have a concept of built-in membership, so it does know who it needs to send messages to. It obviously doesn't send the messages itself, but more on that in a second — namely, this slide. RFC 9420 is where the IETF has specified this. I have a really awful crash-course guide, because I am not a cryptographer, but there it is. There is a binary tree, so you have a root key and multiple nodes underneath it. With that, you end up with a concept of membership where only users or members that have certain keys can see other keys — that is how you get to know who to send the keys to, particularly the decryption keys. MIMI has refused to implement any encryption other than MLS. They are obviously considering something like the Double Ratchet, because we do need an on-ramp, but at the IETF they tend to get a little bit stuck in the RFCs. We are also considering MLS, obviously, and so we want to extend it; decentralized environments, namely Matrix, will have to use dMLS or similar. Membership: as part of the discussions with MIMI, we have been having some arguments, we will say, about what it actually means to define membership. We have decided that users join rooms and clients encrypt messages. Both MLS and the Double Ratchet deal with clients; when a user joins the room, all of their clients join as well.
This is hopefully not a novel thing, but it is written in stone now. So we need to synchronize these two concepts. We say that users have a participation state, or exist on a participation list, and that clients have membership. So: users have participation, clients have membership. We also have to make sure that these are atomic operations, because otherwise somebody joins the crypto state but is not part of the actual user state, and that causes issues. So MIMI has started proposing a bunch of MLS extensions to persist application state within an MLS group — because MLS has extensions where you can store arbitrary things, making the blob even larger, so you really must store it in the media repo. These are new as of a week and a half ago, but it is called AppSync. It is a generic mechanism, and conveniently it would basically map to state events in Matrix: you can add arbitrary information to the group, namely with a key and some sort of content, and then there are operations that apply, where you can add, remove, update — that sort of stuff. It is visible to servers — servers can't see the actual encrypted-messages part of MLS, but they can see that state changes are happening, and potentially what's inside those state changes, which is why they would map to state events in Matrix. The Double Ratchet and participation is a bit harder, because the Double Ratchet, again, doesn't have a concept of membership. It's not terribly difficult to map these; it's a little complicated sometimes. There are a couple of MSCs that list this sort of information, namely the crypto IDs Matthew was just talking about, and then, yeah, we translate these concepts to m.room.member state events, as well as device lists, on Matrix. But regardless of the protocol, we want to make sure that people currently on the Double Ratchet have a way up to MLS.
So it's a natural evolution of the application, rather than forcing somebody to effectively fork their own client. Which brings us to content format. Clients need to end up encrypting and decrypting the same thing; otherwise there are going to be issues, because if you send a text message to somebody and they don't know what to expect, they're not going to see anything. So we need some form of extensibility, because messaging has a ton of features and is constantly evolving. Servers can't help with this, because it's already encrypted. And of course it should be as small as possible and require minimal processing power, because not every client is a laptop — or sometimes the laptop is a bit slow. MIMI has worked on their own TLS-encoded multipart MIME format. It looks a lot like multipart email; it's not the greatest, but it is a notional format while we try and work out the exact details. But Matrix already has events, and you can already define your own custom event types and add arbitrary content. So what if we made that way more extensible? We introduce extensible events, or MSC1767. We use content blocks to persist information inside of an event; we specify the core blocks there. And then we also try to make sure that the client can render arbitrary event types that it doesn't know about. We lose a little bit of richness, in the sense that if a client does encounter an unknown event, it has to figure out how to render it, and it might not render in the same way for everybody — but it should at least render the same information for everybody, and that's the critical part. So an extensible event looks a little bit like this. This is just a basic text message saying "hello world". If your client supports HTML, it picks the HTML format; if it doesn't support HTML, it uses the basic format. But critically, you have a type of m.message and you have a content block of m.text.
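As a rough sketch of how a client might pick between those representations — the field names here approximate MSC1767 and should be treated as illustrative, not authoritative:

```python
# Illustrative sketch of choosing a representation from an MSC1767-style
# extensible event. Field names approximate the MSC; see MSC1767 itself
# for the real schema.

def pick_text(event, supports_html):
    """Choose the best m.text representation this client can render."""
    blocks = event["content"].get("m.text", [])
    if supports_html:
        for b in blocks:
            if b.get("mimetype") == "text/html":
                return b["body"]
    # Fall back to the plain-text representation (the default mimetype).
    for b in blocks:
        if b.get("mimetype", "text/plain") == "text/plain":
            return b["body"]
    return None

msg = {
    "type": "m.message",
    "content": {
        "m.text": [
            {"mimetype": "text/html", "body": "<b>hello world</b>"},
            {"body": "hello world"},
        ]
    },
}

print(pick_text(msg, supports_html=True))   # <b>hello world</b>
print(pick_text(msg, supports_html=False))  # hello world
```

Every client walks the same list and ends up showing the same information, which is the property the talk is after.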
So if we add a little bit more richness to that and create a fake schema for polls — which definitely doesn't exist; please see the MSC for a real schema — you have an event type that is unknown to some clients, namely org.matrix.poll.start. You still have that text content block, and then you also have this poll content block, which gives you a bit more information about how to render these events. So if your client knows what that event type is, it can go into the content, pull out the org.matrix.poll content block, render that in its UI, and then the client can interact with it normally. Otherwise, you end up with just the text, and that is suitably okay — it's not great, but you still have the same information from the poll. And so, yeah, currently extensible events are JSON, but again, you could make this a binary format in the future. More events get rendered by more clients, which is great; you can create more custom event types; you can do all sorts of fun stuff — exactly what all of this looks like is to be determined. We're still in the process of specifying all of the pieces, particularly the core content blocks, and also a registry, so you can actually implement a client that understands all of these things. So, a little bit on room models. The MIMI room model looks like this. When you add the third server, there's obviously a little bit more complexity. MIMI primarily uses a hub-and-spoke fan-out: you have one central server per conversation — not for the entire global network — that is responsible for distributing messages. So servers B and C avoid talking to each other if they possibly can, and talk through server A instead. Server A is responsible for sequencing, which is important for MLS — it has those characteristics in play. And then, yeah, the follower servers, as they're called, go through that, and encrypted messages still flow between the clients as normal. The servers can't see those messages.
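A toy sketch of that hub-and-spoke shape — the class and field names are made up for illustration. The hub assigns a total order (the sequencing MLS cares about) and fans messages out; followers never contact each other directly:

```python
# Toy model of MIMI-style hub-and-spoke fan-out with hub-side sequencing.
# All names and message shapes here are invented for illustration.

class FollowerServer:
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def receive(self, msg):
        self.inbox.append(msg)

class HubServer:
    def __init__(self, followers):
        self.followers = followers
        self.next_seq = 0          # the hub owns the total order
        self.log = []

    def submit(self, origin, ciphertext):
        """A follower submits an encrypted message; the hub sequences it."""
        msg = {"seq": self.next_seq, "origin": origin, "ciphertext": ciphertext}
        self.next_seq += 1
        self.log.append(msg)
        # Fan out to everyone except the origin; followers never talk to
        # each other, and the hub never sees inside the ciphertext.
        for f in self.followers:
            if f.name != origin:
                f.receive(msg)
        return msg["seq"]

b, c = FollowerServer("B"), FollowerServer("C")
hub = HubServer([b, c])
hub.submit("B", b"<opaque ciphertext>")
hub.submit("C", b"<another ciphertext>")
print([m["seq"] for m in b.inbox])  # [1] -- B only receives C's message
```

The payloads stay opaque bytes throughout, mirroring the point that the servers route and order messages they cannot read.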
So then we have the question: what does Linearized Matrix look like? It's exactly the same thing, just with different objects — which is particularly interesting given that it was rejected, because it uses just regular Matrix events. It's the same room state; it's the same Matrix event stuff. It's a stripped-down version of the server-to-server API, because you don't need all the DAG resolution stuff if you don't have a DAG — your DAG is now a linked list, so you don't have any state resolution to do. You have the same authorization rules. You can use the same extensible algorithms for encryption: MLS, the Double Ratchet, your own thing if you're insane enough to do that. And you have all of the same capabilities of Matrix — you have the history and all of that. But critically, you can support having a DAG-capable server in the room; you don't need to give up your decentralization. You can end up with a hub server that basically runs the linearization algorithm, and it also still persists the events and still distributes them. When you get into decentralization — namely, how Matrix works — you use a DAG. You have full-mesh fan-out, where each server contacts every other server instead of going through a central hub. Conflicts in the DAG are resolved through state resolution: if two people try to do the same thing, somebody has to win. And the good news is that state resolution can also be used to linearize the DAG. So through the use of a protocol converter — which may or may not be a dual-stack server — you can bring these centralized systems, even Linearized Matrix, into Matrix to route them further. So, protocol converters: they aren't bridges.
Bridges necessarily break the encryption, because when you're converting, say, Signal to Matrix — prior to our new interoperability capabilities — you end up decrypting on both sides of the bridge and re-encrypting. So you're only really encrypting to the bridge, and not beyond it. A protocol converter doesn't decrypt messages; it just converts the envelope from one format to another, so you can just keep sending your messages. This may also include translating some of the concepts: Matrix has to-device events, whereas some other protocols, namely MIMI, just send everything over what they call events. So we would have to translate those concepts into the appropriate Matrix APIs. Again, you can build this either as an appservice or as a dual-stack home server — so instead of having a multi-head messenger, you have a multi-head server. And then, yeah, use MSC3983 or MSC3984 to bridge the particular crypto concepts if your server doesn't support those key formats. So this is what it looks like — you may recognize it; I stole it from Matthew's slides. If you have a gatekeeper on the left there, you can do a protocol conversion, and that might be attached to a single server. It runs through Matrix, and then you run another protocol conversion to bring it into Linearized Matrix or MIMI, where you have that hub and spoke — namely, the bottom two servers there aren't talking to each other directly. Those two nodes might be the same physical server, just running dual-stack and not doing protocol conversion, but that's all right. So there are a few missing pieces. We haven't talked about anything to do with identity: how do you convert a phone number or a name or an email address into something routable? Who knows — that needs to be defined. We currently have identity servers in Matrix; they're a bit centralized. We're hoping that somebody in MIMI can actually solve this problem for us.
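Going back to the protocol-converter idea for a moment, a minimal sketch of the difference from a bridge: the converter re-wraps the envelope while the end-to-end-encrypted payload passes through byte-for-byte. The field names on both sides are invented for illustration.

```python
# Toy sketch of a protocol converter: it rewrites the envelope around an
# end-to-end-encrypted payload without ever decrypting it. Field names on
# both the Matrix and "MIMI" sides are invented for illustration.

def matrix_to_mimi(matrix_event):
    """Re-wrap a Matrix-shaped encrypted event as a MIMI-shaped envelope.

    Unlike a classic bridge (which decrypts and re-encrypts, so you are
    only really encrypting to the bridge), the converter never holds
    plaintext: the ciphertext is copied through untouched.
    """
    return {
        "room": matrix_event["room_id"],
        "sender": matrix_event["sender"],
        "payload": matrix_event["content"]["ciphertext"],  # opaque bytes
    }

event = {
    "room_id": "!abc:example.org",
    "sender": "@alice:example.org",
    "type": "m.room.encrypted",
    "content": {"ciphertext": b"\x00\x01opaque-megolm-bytes"},
}

envelope = matrix_to_mimi(event)
assert envelope["payload"] == event["content"]["ciphertext"]  # untouched
```

Only the routing metadata changes shape; the encryption boundary stays between the clients.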
We also have an interesting idea around consent. Presumably you don't want to receive spam, so how do you make sure that the person who is messaging you is allowed to message you? We also have anti-abuse: how do you report these messages across federation, or across servers? How do you make sure that servers can implement their own anti-abuse measures using whatever identifiers they have? MIMI also hasn't necessarily defined the exact identifiers that they want to use. Matrix already has user IDs, room IDs, aliases, that sort of stuff — but who knows, maybe something different would work. Then room metadata: where does the room name go? Who knows — we'll have to figure that out; Matrix state events would probably be fine. Same thing with ordering: MLS requires ordering, and there's a discussion around whether or not the clients also need that ordering. So, what's next? We have no idea. As Matthew mentioned — again, I'm just stealing from his slides — Linearized Matrix will probably get updated as an MSC, because currently the MSC is one version behind the IETF draft. The gatekeepers will have to publish their plans by March 7th; we'll see what happens there. The protocol converter concept will continue to be refined, of course, and MIMI will also make some form of progress — hopefully get refined as well. And yeah, funding the Foundation is the best way to make this work. So: questions. Yes. "Who are the stakeholders in MIMI, why are the different stakeholders not using the Matrix approach, and what are the different interests here?" Yeah, so the question is who the different stakeholders are and why we're going after certain approaches, I believe. There are several players in the MIMI space: obviously ourselves, and also Wire; there's Google, and I'm forgetting all of the other ones, but — yeah, Cisco, Wickr, Phoenix, and a few others. There are a few hundred people in the MIMI working group.
You can see their company association as part of the membership list. I would suggest going there. As for the different approaches, everybody wants everybody to use their thing. We're no exception. We just think that ours is better. But yeah, we've been doing this for a while. Matrix was originally built as an interoperable protocol. And here we are with a legal requirement to have interoperability. So surely Matrix is designed for that, is kind of our thought. We used to rely heavily on canonical JSON for signing. How does that translate to MIMI in particular and to interoperability in general? Yeah, so the question is, like, we've previously relied on canonical JSON; how does that translate to MIMI and just general approaches to interoperability? So, canonical JSON has all sorts of interesting issues with it. What happens if you have multiple keys? What happens if the keys use a weird form of UTF-8? That sort of stuff. It's a very complicated set of rules that can realistically never be fully defined. So with a binary format, namely, that's what MIMI's interested in, you don't necessarily need a canonicalization, because if you keep the signature for the event next to the event, rather than in the event, like we currently have in Matrix, you are able to just sign the series of bytes. And the bytes can be in whatever order. You can deserialize them, see them more easily, and then check the signature much faster. So that's kind of where the MIMI direction is going: we want to avoid a canonicalization algorithm, but we do need a more specific standard for what's contained in those bytes. Is this something to be supported throughout the chain? Yes; instead of trying to make everybody use the existing Matrix thing, I would suggest that Matrix kind of adopt more of that binary event signing instead. Yes. You had a slide with things you didn't talk about? Yes.
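The detached-signature idea can be illustrated briefly. This sketch uses an HMAC from the Python standard library as a stand-in for a real signature scheme such as Ed25519; the point is that signing the exact wire bytes, with the signature carried next to the payload rather than inside it, removes the need for any canonicalization rules.

```python
import hashlib
import hmac
import json

KEY = b"server-signing-key"  # placeholder; real systems use asymmetric keys


def sign(raw: bytes) -> bytes:
    # Detached signature over the exact wire bytes -- no canonicalization.
    return hmac.new(KEY, raw, hashlib.sha256).digest()


def verify(raw: bytes, sig: bytes) -> bool:
    return hmac.compare_digest(sign(raw), sig)


# Two serializations of the same logical event differ byte-for-byte...
event_a = json.dumps({"type": "m.room.message", "body": "hi"})
event_b = json.dumps({"body": "hi", "type": "m.room.message"})
assert event_a != event_b

# ...but since the signature travels NEXT TO the payload, the receiver
# verifies the bytes exactly as received and never re-canonicalizes.
envelope = {"payload": event_a.encode(), "sig": sign(event_a.encode())}
assert verify(envelope["payload"], envelope["sig"])
```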
In many places; primarily in the MIMI working group, that's where a lot of these conversations are happening, as well as on the design team for MIMI. But if you are interested in them, or you have ideas, feel free to pop by the Matrix spec room on Matrix, and we'll be happy to engage. Do I have time for one more question? All right. Yes. All right. So how do we avoid, basically, if you have two protocol converters, say they're both talking to the same network, how do you avoid message duplication? Good question. We'll have to experiment with it. We will be trying to figure out exactly what that looks like. We kind of have to wait until March 7th to see what the actual gatekeepers, namely WhatsApp and Facebook Messenger, have to offer for that certain capability. Thank you, Travis. Thank you.
Let's talk Matrix between Governments and Citizens
Hi, thanks for coming. Who of you is working in the government? Of all the people in this room, I bet it's only a few. Oh, maybe 10% or something. Yeah, quite a lot. Okay, so, hi, my name is Marco. So let's talk about who I am and why I'm here. I've been active in the FLOSS community for about 10 years now, with contributions to Signal and Dino and also to projects and tooling in the wireless mesh community. I have a background in IT security, and my current project during the last three years was building state-of-the-art infrastructure for the public administration in Germany, in a German federal IT agency. And yeah, we think a lot about how we can improve our infrastructure, especially in Germany, but we also try to, yeah, think out of the box, out of the border. And this is why Matrix is very interesting for us. But let's start with what public administration does in Germany and other countries. So the government provides a lot of services, ranging from healthcare services to social services. There's dog registration, for example, and there are housing benefits. And in Germany, there are 575 service categories with a total of 13,000 individual services that the government offers on different federal levels. So that's a lot. And also what we need to think about is that the government has a monopoly on these services. Like, if you want to have some housing benefits, probably the government is the only institution that will provide you with these services and with this support. So if you want to receive these services, you need to go to your local government. And this is why it's important to see how these services are designed, how they work, if they are privacy-friendly, if they're usable, et cetera. So in my opinion, it's very important to have a look at the tech stack behind these services and also at the privacy, usability, and accessibility aspects in there. So how do we apply for these services?
First, there's, yeah, the option that you don't have to apply for them at all. The government starts the process by itself. For example, sending you money for your child benefits. The government usually knows everything about your child when it's born, and then they could theoretically send your child benefits by themselves. That's what we call proactive government. That's usually a neat thing, but it doesn't really work in all cases. So there are use cases where this doesn't work. For example, for registration in the kindergarten, it's probably a good idea to ask the people where they want to bring their kids, or if they want to bring their kids to this specific kindergarten, instead of just, like, distributing the kindergarten places. That wouldn't really be great usability there. The second option: you could always, and probably have already, done this offline. You can go to your local city hall. That works for many people, but still, for many people that's kind of inconvenient. So the third option comes naturally: you can apply online for these services, for example via an app or via a website. And I'd like to look at this third option a bit in detail. Okay, let's start by requesting some government services via a web form or via a mobile application, for example. That's comparatively easy because, like, the government websites are public. You can just find them online. And the contact details of the government agencies are also public, including their private, sorry, their public keys; hopefully not their private keys, sometimes they are public, but that's not by intention. So you can just encrypt your application form and send it to the right government agency, and you're basically done. So that's comparatively easy. Then usually, hopefully, the government responds. But the person that applied for the government service may have already left the website, for example, or uninstalled the app where they applied for the service.
So that's a bit harder, because the contact details of these individuals, of us, are not publicly available, and that's also by design. We don't want that. But also, we don't want to force people to install some random application and keep it installed for a longer period, or even at all. There should be different ways to access these services. We can't just hope that the app is still installed and we can send a message to the people via this app, for example. So let's have a look at how the industry solved this problem. And here, for example, banks and some insurances put online mailboxes in place. That's usually very easy, because they just store the plaintext messages on their central service and provide a web interface, or an interface via an app, to retrieve them. That might be okay for some banks and insurance companies, because they already know everything about us anyway. That's their service; they're directly communicating with us. Still, it's not really end-to-end encrypted here, but the two ends are the bank or insurance agency and the people. That's okay in some way, but if we build this for all people and for the whole country, we definitely need encryption. So we have local government agencies that want to communicate with the people. And there's a huge amount of information, a lot of different services that are being provided. We don't want to have a central server that stores all this information about the applications and responses online on the server. So how did government agencies solve this? To summarize: mostly, they did it very badly. We've seen a lot of data leaks in the past years. And I think there must be a way that doesn't include any risk of data leakage. These are just some examples I found online; there are probably a lot more issues. And this is not a European problem, this is a global problem.
You can find governments on basically every continent that lost personal identification information of all the people in their country. So there must be a better way. So let's have a look at how the German government has solved this issue to date. We have a lot of different online mailboxes. There's ELSTER; those of you from Germany probably know it. It's a big application to pay your taxes. We have so-called De-Mail. That's a German email variant that should be super secure; it's basically some regulation on top of standard email protocols. We have BundID, which is like a central identification service that also contains a mailbox. Then in the justice context, we have a lot of different mailboxes that are somehow interoperable, but none of these really follows the security-by-design principles and the zero-trust approach. And this poses a huge risk to the privacy and security of highly sensitive data. So this might explode somewhere, sometime. In fact, there have actually been incidents in Germany too, of course, as in other countries. For example, since 2021 we have so-called digital health apps, and they got analyzed by zerforschung, a collective of IT security researchers in Germany, and they found that these apps leaked personal data of more than 20,000 people. That's especially problematic because, like, in the healthcare sector there is often very sensitive information that might get leaked. We also had a recent leak in the justice domain. That was the case of the justice mailbox leak last year: between October 13th and November 9th, a directory with personal identity data was publicly accessible due to a config error. So this shouldn't have happened at all. There should have been technical measures in place to, yeah, make sure that this won't happen. That's especially bad in this domain.
For example, if stalking victims use this mailbox to contact the courts, it's not really a great idea if their personal information, including their address, is publicly accessible. So let's talk about some solutions. And I brought this vision here: what if communication between governments and citizens was easy, reliable, and encrypted? And since we are in the Matrix devroom, yeah, let's take Matrix to the rescue. Matrix already provides end-to-end encrypted messages. It provides multi-device access from apps and web applications. It also provides access via third-party apps and services, so for example corporate IT services or e-governance apps, et cetera. This is all possible using the Matrix protocol. So why not build a Matrix-based secure communication channel between citizens and governments? And that's exactly what we are planning to do. We want to integrate Matrix into Germany's national identity system. So the first challenge here would be to build a proof of concept this year to demonstrate that this is technically possible. And we have some, like, technical things we want to discuss here. Also, for example, usability issues would be discussed here. And in general, when we do this, we of course want to have a great user experience. So what do we need for that? We need polls and multiple-choice questions. We need push notifications and status updates. We also need machine-readable data. For example, these polls would then make it easy to have bi-directional interaction with the public administration using machine-readable polls. That would be an interesting thing to look into. Also, image and document uploads might be a feature. And the neat thing here is that Matrix already comes with these features built in. So there's not really much to build on top. We can just use this and go from there. Of course, we also need a great developer experience. That's something most government projects don't really think about.
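As a rough illustration of what machine-readable polls could look like on the wire, here is a sketch following the general shape of Matrix's poll events (MSC3381); the exact field names should be checked against the MSC rather than taken from here, and the kindergarten question is just an invented example.

```python
# Sketch: building a machine-readable poll as a Matrix event, roughly
# following MSC3381. Field names are approximate; verify against the MSC.

def make_poll(question: str, options: list) -> dict:
    return {
        "type": "org.matrix.msc3381.poll.start",
        "content": {
            "org.matrix.msc3381.poll.start": {
                "question": {"org.matrix.msc1767.text": question},
                "kind": "org.matrix.msc3381.poll.disclosed",
                "answers": [
                    # Each answer gets a stable id a client can vote on,
                    # which is what makes the interaction machine-readable.
                    {"id": f"opt{i}", "org.matrix.msc1767.text": text}
                    for i, text in enumerate(options)
                ],
            }
        },
    }

poll = make_poll("Which kindergarten do you prefer?", ["Kita A", "Kita B"])
assert len(poll["content"]["org.matrix.msc3381.poll.start"]["answers"]) == 2
```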
But I think especially here it's very important for us to have some SDKs in place for developers that are working on IT systems inside the government, but also to help build apps and an ecosystem of apps and services, citizen- and company-facing apps, for example. That helps us here with development speed for government services. So again, what does Matrix offer us? We have great usability, especially compared to email-based systems. We have tried and tested security; this exists already, the protocols are known, and we don't reinvent the wheel here security-wise. It's interoperable, it's easy to integrate, and it's also ready to use in the real world. Many features are already there in the Matrix specification. Some strategic thoughts: end-to-end encrypted communication, in my opinion, is a key enabler for seamless e-government services. We need this anyway. It will also enable us to really build a privacy-preserving realization of the so-called once-only principle, which enables governments to reuse already submitted data and documents. We have all this data in a machine-readable, secure way. It also might support us in some wallet-like use cases, for example the attestation and presentation of attributes like driver's licenses. That also needs a secure communication channel, before we even think about all the additional cryptographic challenges we need to tackle there. All of these things need a communication layer as a starting point for interaction between people and governments. A broader vision: where might this journey go? We will start with a mailbox app, and later, if this works out, it might be a good starting point to provide the most common e-government services via this app. We would have an end-to-end process to apply for all these services. This will definitely help us with usability and user experience. This might be a neat thing to look into.
In Germany, there are very few government services that are already integrated into an easy-to-use app. Most of them are just huge web forms where you have to enter lots of data, and then you send the form and hope for the best. If we go further: finally, why not build a framework for any e-government service? Basically, the service that is integrated into the app is basically a config file. This would help us to scale, obviously, and give us an opportunity for modularly specifying the different services that we want to provide, by just providing a config file that defines how the UI in the app looks, for example. Putting it all together, this would give us a national privacy-first e-government app, which would be a neat thing to have. Maybe it will help us build up speed and get better in this domain. To conclude, let's talk a bit about infrastructure. The status quo is that we have different tech stacks for requesting services and also for replies. These are completely different infrastructures. For example, we are able to request services via a REST API, and then there's a SOAP API to provide messages back. This is completely different. Also, we currently have different tech stacks between different government agencies. These might be encrypted or not. That's obviously not good. What can we do about it? The obvious solution here would be to take Matrix as an interoperability layer. In my opinion, that would totally make sense, to have a basic common ground to communicate with different government agencies. Actually, that's what Matrix is designed for. We don't have only the chat application use case, but also the communication layer between different organizations or people. That might be an interesting thing to look into and build some prototypes here. Plus, it would also be very easy to integrate industry needs here. The industry is also, of course, a large customer, so to say, of the government.
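The "service as a config file" idea might look something like this. Everything here is hypothetical: a made-up service description plus a generic validator that any number of services could share, so adding a new service means adding a new config file rather than new code.

```python
# Hypothetical sketch: an e-government service described purely as config.
# A generic app reads the config, renders the form, and validates input.

SERVICE_CONFIG = {
    "id": "housing-benefit",
    "title": "Apply for housing benefit",
    "fields": [
        {"name": "full_name", "type": "text", "required": True},
        {"name": "monthly_rent_eur", "type": "number", "required": True},
        {"name": "comment", "type": "text", "required": False},
    ],
}


def validate(config: dict, submission: dict) -> list:
    """Return a list of validation errors for a submitted form."""
    errors = []
    for field in config["fields"]:
        value = submission.get(field["name"])
        if field["required"] and value in (None, ""):
            errors.append(f"{field['name']} is required")
        elif (value is not None and field["type"] == "number"
              and not isinstance(value, (int, float))):
            errors.append(f"{field['name']} must be a number")
    return errors


assert validate(SERVICE_CONFIG, {"full_name": "Erika M.",
                                 "monthly_rent_eur": 700}) == []
assert validate(SERVICE_CONFIG, {"full_name": ""}) != []
```

The same validator and renderer could serve all 13,000 services; only the config files differ.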
They are requesting, for example, building permits for wind parks. It would be nice if they didn't have to do this via paper, but via an easy-to-use API, and could integrate their own IT services into this ecosystem. Everything becomes easier for the government and for the industry to work together here. Okay, that's all I have. Thanks for listening. I would really like to continue the discussion, of course, via Matrix, if you like. Join the Matrix channel; it's matrixforgov. It would be really interesting to discuss with you there. I think we might have time for some questions. You already answered one of the questions online, about where the place to discuss this is. Another question online was: are there any plans to bring this together with the TI-Messenger communication from the German healthcare sector? Yes, of course. We haven't had any in-depth discussions on how to bring this together, but obviously we would then be using the same tech stack. From an architectural point of view, this is what we want to do. We have all these different mailbox infrastructures in Germany right now, and we need an interoperability layer between them to make it easy to use all of them and have one place for people to receive these messages or send messages to the government. This is one of the design goals in the long term: to have all these services using the same communication infrastructure, making it easier for people and governments. So the question was whether the GNU Taler project, which also has some origins in Germany, might hold some lessons for us. I'm personally not that involved in the GNU Taler project, but I'm looking at it with great interest, because I think that would be a nice candidate for a privacy-preserving payment system here.
That would of course integrate greatly into such an app here. So just yesterday I thought about this aspect of maybe looking a bit deeper into GNU Taler there. From the perspective of making it more interoperable in the European domain: we are looking into, and of course talking to, other European governments about whether this might also be a great thing for them. We have the Interoperable Europe Act, we have the Single Digital Gateway regulation in the EU, so it might be a good thing to maybe harmonise this not only on the national level but also on the European level. I think that's an important aspect when we build infrastructure, and I don't know any other standard than Matrix that has the potential to solve this quite nicely. We are talking to them. Next question. So the question was whether the requirements that government services have in terms of authentication would have any impact on what is needed with Matrix, and I think we're going in the right direction here with OpenID Connect, so this is like what government services already use. The thing is, this is not completely zero trust, and so we are not there yet with security and privacy by design, because if there were one central authentication server that provided identities for all people in Germany using OpenID Connect, this would be a huge attack surface, of course. So we are also thinking about how to maybe integrate the German eID system; I have a card in my backpack. So we have eID cards that can be used to authenticate people, and it would also be an interesting thing to look into whether we could deploy this privacy-preserving authentication system for these kinds of services. So that's a huge thing we are thinking about: how to reduce risks, security- and privacy-wise, when we build such a massive system that deals with highly critical personal data.
Yeah, so the question was if we provide any OZG services via this protocol. The OZG is the German government service accessibility law that requires governments to provide services online, and of course we have this thought, whether this would be possible at all. Right now we have different systems in place that are using different tech stacks, but in my personal opinion this would be the natural evolution: if we communicate with people via such an app or via the Matrix standard, we might also look into using Matrix to fulfill these services. But I think that's a long journey. There are some things you need to consider when we build this infrastructure, because it's not just the communication; you also have to think about which services, or who, can request these government services, you have to think about authentication, and about routing, which government agency is the right agency you want to address. So I think from a technical perspective this would work, but I think it will take time to think about it and maybe sometime build this. But yeah, we also don't want to build something separate from the services that are already in place, so I think the only natural solution here would be to transform existing services to, yeah, maybe someday using Matrix, and have a roadmap for developers and organizations on how to migrate from existing services to Matrix; otherwise this will probably not work and will create a lot of confusion, I think. Yeah, so the question was how to deal with backups and device signing and all that stuff, so how to handle private keys, basically. And yes, we are thinking about this, and we have some ideas how this could be done. Of course, we don't want people to, like, manually store some private key file on their laptop and take that burden on them. But this is definitely a thing we are thinking about, so if you have any input on this, I'd be very happy to hear from you in the Matrix chat. Thanks.
Yeah, so the question was could we use our German eID cards for this. The German eID cards are able to produce digital signatures; the problem is that currently the signing keys are not deployed on the eID card, so you would have to build some infrastructure to deploy the signing keys, the private keys and certificates, for every person. That's, like, a huge organizational thing. But yeah, maybe this might be an option to go for, but I expect it to be, I don't know, nothing that happens in the next one or two years; it takes a bit of time to build this. Thank you very much. Thanks.
Embracing Matrix for Enhanced Communication: Migrating the WordPress Community from Slack to Matrix
Hi, so this will be about migrating the WordPress community from Slack to Matrix. So first, quickly about me: I'm Alex Kirk, I'm from Vienna, Austria. I've been at Automattic since 2014 now. We run WordPress.com and others. I'm an engineer; I lead teams around localization and Matrix, and I'm sponsored to contribute to WordPress.org. And I've got some side projects, so if you have a WordPress blog, check out the Friends plugin for making your site your own hub for subscribing to others, and the Enable Mastodon Apps plugin if you want to use Mastodon apps with your site. So, quick thing, probably I don't need to tell you, but just to make sure: what is WordPress? It's a popular PHP CMS. Born in 2003, today it powers over 43% of the websites on the web. It has a block editor that allows you to edit posts, but also the whole site. It's well known for its plugin ecosystem, with plugins like Yoast, Advanced Custom Fields, WooCommerce and so on. And it's open source under the GPL. And just a step back, so that you understand what our needs are as a community: this is how we collaborate. We've got 22 Make teams in different areas, so one about accessibility, core, design, Polyglots, Meta, lots of teams, performance, sustainability. And they all work towards separate goals. But each team has a P2, a blog, where they post about new things that are happening, proposals, decisions that are being announced, lots more. This is like the asynchronous part of the communication. And then we've got sometimes weekly, sometimes bi-weekly chat meetings for sharing updates and coordinating. And these are quite important because they give people a definite time when they can reach collaborators on the project. So you don't have to enter a room and hope that the right person is there; you know that at this time, people who work in accessibility, for example, are available. And we've got meetups and WordCamps. Meetups are local to a city; they're like the smaller ones.
WordCamps are the next stage, where people travel there to meet. And then we've got the flagship WordCamps. For example, in Asia, coming up in March, EU in Milan, and US in Portland in September. And another aspect: we've got a project, an initiative called Five for the Future. There we encourage individuals and organizations to contribute 5% of their time or resources towards the WordPress project. So this means a 100-person company would have five people dedicated to the project; an individual would have, like, two hours out of a week. And organizations like that concept because they retain control over the person who contributes, and thus they're confident in pledging towards that goal. And if you want to hear more about that, there's actually a talk by my colleague Jesús in this room, Shaping the Future: investing wisely in long-term open source development with Five for the Future. And this is how a release of WordPress looks. These are the companies who contributed to a release: 640 people from 186 identified companies. This is the make.wordpress.org site. This is where we list the teams. And as you can see at the bottom, we list the next meeting that will happen, and not only in Slack but also in Matrix. And these are the meetings during a week. So every day a couple of meetings take place. And because of the distribution around the world, some meetings happen twice in a day, so that everybody has a chance to attend them. All right. So, our plan to migrate. It started in January last year. We announced that we'd create a subproject to evaluate migrating to Matrix. And then we would evaluate and create the environment that we need, migrate history and integrations, and then finally launch, finalize what needs to be finalized, and turn off Slack. All right. So what could happen? What things did we anticipate? First, people don't like change. We've been on Slack for a while. So we figured we need to prioritize something superior.
So where are the strengths of the new system, so that people will want to move? There is complexity around decentralized systems. Like, everybody knows centralized systems: you need to go to one address and that's the only way to get there. So people might not know what to do. And then we had Slack lock-in. We've got lots of integrations created over the years that make Slack nice to use for everybody. And that's why people like it, I suppose. So when you consider Slack in an open source community, there are actually a few things that are a bit tricky. One thing is that Slack sign-up is email-based. So when you join the WordPress Slack, you have to follow a guide. And typically we actually do this at WordCamps, where we have somebody there who will help somebody get onto Slack. It's pretty complicated. Then it's a commercial product. The free tier has a message retention limit. The data is siloed behind Slack's doors, so you need API keys to access it. But many companies use Slack, and it's easy to just add one more workspace to Slack. So for many people, the barrier to entry is quite low in the end. Comparing Matrix to that: of course, federation means everybody could join from anywhere, from any home server. But for the WordPress community, we would want to log them in through an existing authentication system. No retention limits, of course. And our WordPress community has multiple Slack workspaces for different countries, so this would be able to unite them in one place. And of course, an open source project should have an open source chat. All right. So we tried to make it easier to join Matrix. Number one, I already mentioned it: we created a way to use your WordPress.org account to access Matrix. And we created it in a way that anybody could install this plugin on their own server to use it to authenticate a user against, like, to join a Matrix server.
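For context, wiring an external identity provider like this into a Matrix homeserver looks roughly as follows. Synapse takes an `oidc_providers` list in its YAML config; it's shown here as a Python dict for illustration only, with a placeholder issuer URL, client ID, and secret. Check the Synapse OIDC documentation for the authoritative keys.

```python
# Illustration of a Synapse-style OIDC provider entry (normally YAML).
# All values below are placeholders, not the real WordPress.org setup.

oidc_provider = {
    "idp_id": "wordpress",
    "idp_name": "WordPress.org",              # label shown on the login screen
    "issuer": "https://example.org/oauth/",   # placeholder issuer URL
    "client_id": "synapse",
    "client_secret": "REPLACE_ME",
    "scopes": ["openid", "profile"],
    "user_mapping_provider": {
        "config": {
            # Map the external username onto the Matrix localpart, so
            # @alice on WordPress.org becomes @alice:the-homeserver.
            "localpart_template": "{{ user.preferred_username }}",
            "display_name_template": "{{ user.name }}",
        }
    },
}

assert oidc_provider["scopes"][0] == "openid"
```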
And with OpenID Connect fully landing in Matrix, this is a potential authentication provider. Yeah, so on WordPress.org, we've installed it. People can use their WordPress.org account. They will go through their WordPress login screen and just authorize the WordPress server to submit the information to Matrix. Number two, we created a Matrix client in a WordPress blog. So a WordPress page is made up of blocks, and one of those blocks can now be a Matrix chat. We call it Chatrix. And you can configure each block individually. So one thing that you can do is pre-define the home server, which we'll do. But you can also restrict it to a single room. It's based on Hydrogen. And we did some upstream contributions. So before we used it, you could only have one Hydrogen open in the whole browser, even between tabs. So we contributed something so that you can use it in multiple blocks on the same page. If you have multiple posts, typically they would all be put on one page, and that wasn't possible before. And we had a couple of bug fixes to use Hydrogen with SSO. I'm not sure how many people had used it with SSO before. And this allowed us to create team chat pages. So what does this mean? We can give a contributor a URL, a WordPress.org URL, where they should go for a meeting. They don't need to know this is Matrix; they just see it's a URL on WordPress.org. So for example, for Make WordPress Core, the team that creates WordPress Core, the address of the Make blog is make.wordpress.org/core. So the chat page is just /chat. Core has different chats; there's another chat. The design team has a chat, and so on. So this is what such a page looks like. On the top, you have a custom, like, it's a WordPress post, you can put anything there. We put there when the next meeting is, instructions on how to get there, and also instructions for how to get there if you want to use your own Matrix client.
And this is the Chatrix block, which shows the room at the time. For FOSDEM, my colleague Ashish created a small demo, and it uses the WordPress Playground, which is an interesting concept where you can run WordPress in your browser, and you can test any plugin in a sandbox in your browser. So I've recorded a demo video, to be sure, but it's real time, so as it loads, you can see it's pretty fast. So this now loads WordPress, and we've preconfigured it with a Chatrix block, and here it joined the chat. You can go there and enter a message, and all you have to know is the URL of this page. If you want to add such a block to a page, this is how you do it. You use the Gutenberg block editor. You add the block. You configure it. You set a home server. And if you want to lock it down to a room, you don't have to, but it can be practical to do it, you just enter the room name, and then the block loses the room list and just shows the room that you attached it to. And then you can... it's a block, you can add stuff before, after, as you wish. It's a pretty neat way of giving instructions to people, or putting, I don't know, meetup agendas, whatever. It's like, it's a post. Additionally, we created our own Element instance, just so you can preconfigure it with the home server, so that you don't have to tell people to enter this home server into the login screen. That's something where people might typically get lost already. And we also created a bridge. Since we control both the bridge and the Matrix server, we were able to create all the users on the Matrix server and use the Slack bridge with a slightly forked version so that we can use puppeting. So when you post something on Slack, your Matrix user will say the same thing in your name. And there are some upstream fixes, by the way, that could be merged. And... yeah, so that makes things quite streamlined. And another thing that we wanted to do: we didn't want to lose the history of Slack.
But it's been a bit tricky, because if you create a bridge, the bridge needs to start at some point, and you cannot really backfill messages. So we figured out this little trick: first create a room and bridge it, then create a second room and migrate the history into that room. Then we add all users to that new room. We import the old events in sequence so that we can backdate them using an app service. If a user is no longer in the room, we have to re-invite them, and so on. When we're finished, we copy over the events from the first room, which had already started to be bridged, and thus close the gap in the history: there is this period between exporting the data from Slack and starting the bridge, and this gives you a way to close that gap. Then we can change the room aliases, reattach the bridge, and delete the old room. We've got a room with all the history. So now we have a Matrix server for the community: it uses Synapse with a Slack bridge, OpenID Connect configured, and the app service. We migrated 3 million messages in 170 rooms, 45k users, 55 gigabytes of database size. During this process, we kept the community updated. We held weekly meetings, as is common in WordPress, and published meeting notes afterwards. And we got coverage from the WP Tavern: first in January that we're starting this, in April that this is how we're continuing, when we had figured things out about private and public messages, and then when we installed the Matrix bridge. So now to the migration. In November, we announced that we want to migrate to Matrix and how we'll do it. We'll ask people to use Matrix instead of Slack. Before the final migration, we'll post a message in every Slack room: Slack will be closed, this is where you need to go for instructions. And then finally, disable posting.
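The gap-closing step described above boils down to merging two event streams without duplicates: the backdated imported history, and the events the live bridge already delivered to the first room. A minimal sketch, assuming events are plain dicts with `event_id` and `origin_server_ts` (app services can set timestamps on imported events, which is what makes the backdating possible):

```python
def close_history_gap(imported_events, bridged_events):
    """Combine backdated imported events with events the live bridge
    already delivered, dropping duplicates by event_id and ordering
    everything by origin_server_ts. A toy model of the merge step, not
    the actual migration tool."""
    seen = set()
    merged = []
    for ev in sorted(imported_events + bridged_events,
                     key=lambda e: e["origin_server_ts"]):
        if ev["event_id"] in seen:
            continue
        seen.add(ev["event_id"])
        merged.append(ev)
    return merged
```

After the merge, the room aliases can be swapped and the bridge reattached, as described in the talk.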
It's actually quite interesting that it's pretty hard to just disable a Slack workspace, because in a way you want to be sure it's still around. The only way to completely shut it down is to delete it, which is a destructive operation. So what remains is that people could still DM. Well, OK. So, the feedback that we anticipated from this. People want the defaults, so we figured they would use Element. We knew that notifications in Element are not to everybody's liking. There's no dedicated threads-and-mentions view as in Slack; threads is coming, I saw it. There are a couple of things that people are used to from Slack that are a bit different. We anticipated that, and we felt people could live without them; people on Matrix have been living without them for a while. Search is a bit difficult in Element. And while there are many other clients, some of them miss important features like threads, or have implementations that are kind of different. I mean, I've tested some of them. Nheko, for example: it works, but it's different. And then, when you provide a home server to a community, it comes with all sorts of troubles. You cannot stop people from creating rooms on the home server, so people will create some spam rooms, whatever. You need to be aware of that. So we started to collect issues from the community. They said we are unable to enforce some things like we can on Slack: you cannot reduce the time allowed for editing messages, you cannot enforce room membership for federated users. Well, okay. With thread messages in Slack, you can say: I want this message to also be posted to the main room. That doesn't work. Other Slack features are considered essential. You cannot ping a group on Matrix, and you cannot ping @here as a room-wide mention. There is stuff that you can enforce when you have one central server that you cannot enforce in a distributed environment. And scheduling of messages, reminders: not there yet.
Through a bot maybe, but not well integrated in the UI as in Slack. Then, accessibility problems. There has been an initiative to improve Element's accessibility, but there are still gaps, like macro navigation; the VoiceOver experience wasn't super great. Then we had bridge glitches: out-of-order messages, duplicates, doubles, all sorts of small issues. User experience around threads management, obviously; we anticipated that. Then some things that don't work well with Matrix: failing to load, timeline positioning, lots of user join events that can make things pretty slow. So, what we did to address this. We implemented integrations and many fixes via bots. We used the maubot framework, so we could use the RSS bot and the GitHub bot. We tried to make it easier to migrate our own Slack integrations, so we have a post-to-room bot that uses a webhook to post messages to a room, and for the other direction, a way for something on our servers to react to something in a room. We implemented group mentions; not super great: if you post a command to the bot, it will post another message mentioning everybody in the group, and there are some very large groups, so there could be very long messages. And a watchdog, so that we can be alerted if spam rooms are created by community members. Also, because we had our own Element instance, we could ship fixes there while they were waiting to be merged. We provided a channel for the community where they could get help. We created documentation and guides. But we had to stop the migration. Matt called off the migration at the State of the Word, and then we posted about it. It turns out the accessibility problems were too big; we weren't able to merge the fixes in time. We submitted the patches upstream. And there was uncertainty around where our UI needs are on Element's roadmap, and what the effects are of the license changes that were announced for Synapse.
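The group-mention workaround described a moment ago essentially expands a group name into one long message that mentions every member. A hedged sketch of that expansion (the real bot's commands and output format may differ); the `m.mentions` field follows the intentional-mentions shape from the Matrix spec:

```python
def group_mention_message(group, members):
    """Build an m.room.message content that mentions every member of a
    group, mirroring the bot behaviour described in the talk: for large
    groups this produces very long messages. The body layout is an
    illustrative assumption, not the actual bot's format."""
    return {
        "msgtype": "m.text",
        "body": "@{}: {}".format(group, " ".join(members)),
        # Intentional mentions, so each listed user is notified.
        "m.mentions": {"user_ids": list(members)},
    }
```

This is the workaround for Matrix having no server-enforced group ping like Slack's user groups.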
And overall: do those changes mean that the ethos of the WordPress project is no longer aligned with the Element or the Matrix project? It creates a bit of uncertainty that was detrimental to the migration. So, the current status. The WordPress community remains a Slack community, but now with the Matrix bridge and all the Slack history, new contributors no longer need a Slack account to join the conversations. Turning off Slack is currently not planned, and we'll keep observing how the Matrix product develops. So in summary, the WordPress community didn't fully migrate in the end. But maybe the things that didn't work for us are not so important for you. We are a huge community with many voices; I could see those things not being as important in a smaller community. I hope this talk helped you identify what is important for us and decide whether you are suffering from the same issues or not. Along the way, we made a lot of open source contributions: the WordPress plugins I mentioned, and to Matrix we open-sourced all our bots and the migration app service, and submitted patches upstream. And that is it. Thank you. Check out the slides, there are lots of links in them. Yeah, so you mentioned that for the migration you wanted to fill your room with the Slack messages and their history. You don't really have to do that, because a lot of the bridges actually allow you to backfill the messages from the history. So the question was whether we were using the functions of a bridge to get back the old messages. In our experience, it wasn't possible to backdate the messages; that was the main issue there. I suppose it depends on the implementation on the Matrix server. Maybe there's no backfill support, okay.
Okay, so the question was whether we considered using an app service to change the push rules for a user in order to enable group mentions. No, we haven't considered that. Maybe it's a possibility. Basically you're saying that you would add a keyword for the user so that they would be mentioned, and you would configure it for them. Maybe that's a possibility. You were saying there were some accessibility problems which kind of killed this. Could you give us a little detail on what was actually missing, or what the problems were? Sure. The most important problems were around macro keyboard navigation, navigating between the bigger sections of the app: there's the sidebar with the spaces, there's the room list, there's the search menu and the messages. For example, you couldn't get to the message list using the keyboard, and if you somehow managed to get into that area, the VoiceOver read-out wasn't very useful; it repeated "profile picture" for every message, for example. Stuff like that had been annoying people. Matthew said the accessibility team has reviewed the patches that were submitted. Other questions? Yeah. Did you have to disable some of those integrations, or work around them? For example, I imagine @here mentions wouldn't work very well on both sides in the same way. Right, the question was whether we had to disable integrations. One interesting thing about the bridge is that it works both ways. So migrating an integration could be done in a way that you first create it on the Matrix side, and when it's ready, you turn it off on the Slack side and enable it on the Matrix side, and still both sides would be able to use the integration.
For the @here one, well, it only worked on Slack in the end. But it depends on the team: within the WordPress project there are so many teams, and every team has their own way of doing meetings. Some heavily rely on those group mentions, others don't; some need the @here mention, others don't. It's hard to make everybody happy all the time. That's probably part of such a big migration: you get so many opinions, and as with many communities, some are louder than others. Question from the internet: where can you find the tools you used for the migration of the room history? So the question was where you can access the tools. I recommend you look at the slides; in the slide where I talk about the app service migration, that's where it's linked. Is there any integration with Element Call? The question was whether we did an integration with Element Call; no, we didn't. There is no culture of using video conferencing regularly. Some teams use it, but they tend to use Zoom at the moment, I think. It depends on the team what they use. Slack Huddles, for example, as an alternative on Slack, are not being used as far as I know. Is there a possibility to complete the migration, or is it more of a licensing issue than accessibility? So the question was whether there is a possibility to complete the migration. I think it's certainly possible. There has been a bit of tension around implementing the migration fast so that people are not left behind: if you let the migration linger for a long time, people will never migrate, and at the end people panic and then do the migration, so the whole long period is wasted. That's why the initial plan was to keep it rather short.
But on the other hand, I think this current hybrid state is not as bad as you would imagine, because for new contributors we've got this easy onboarding. One thing I liked about the way we implemented it is that you can slowly upgrade your experience: you start with the chat URL, and if you use it a lot, you can upgrade to Element, the instance that we host, and then you can upgrade to another client. I think that's an interesting way of drawing people in. So maybe over time the number of Matrix users will increase so much that it becomes a request from the community. But as of now, we're waiting to see what the license changes do, and this hybrid state is one that I think is acceptable for the moment. Okay, no more questions? One last question. Matthew: I just want to know what it is about the license change on Synapse that is causing this. I'd invite you to talk to Matt. But basically, WordPress is on the GPL license, where you're able to modify software on servers without having to push the changes back, and also, contributing code back to the Element project and signing a CLA is something that makes people uncomfortable. All right, thank you. Thank you very much.
NeoDateFix - A solution to organising meetings in Matrix
Now we will have Milton, Nurjin, Ahmed and Mikhail. They will tell us about NeoDateFix, a solution to organise meetings. Thank you very much, the stage is yours. Thank you. I'm happy to see all of you here today. As Jan said, I'm Milton, and we're going to talk about NeoDateFix, previously known as Matrix Meetings; that was the starting name of the project. We'll start by talking a bit about who we are. We are four developers from Nordeck. We have been doing software development, specifically developing web applications on top of Matrix, in the context of the openDesk Sovereign Workplace project for the German public sector. We have built a suite of web applications that are embedded within Element; I'll explain a bit more about that later. We have NeoDateFix, which we'll present here today. We have NeoBoard, a real-time collaborative whiteboard, which is actually what you're seeing in this presentation: we built these slides and are presenting them with NeoBoard. We also have voting polls, the NeoChoice application, which is not spec-based, but I won't get into that. And if you were at the Fringe event last Friday, we were using the BarCamp application to manage the schedule, the speakers and the whole tracks there. So, what is NeoDateFix? NeoDateFix is a web application that allows you to create meetings, and video conferencing meetings specifically, within a Matrix client using the widget API. Currently, the only client that implements the widget API is Element Web, so that's what we have to work with, and it is a good thing. And what can we do with the application? We can create these meetings as meeting rooms, as I've said. The meeting rooms are created with a default widget layout, so we have the video conference widget expanded, front and center, with other widgets that you can choose.
Typically that could be a whiteboard or some other widget you want to set up beforehand; you can pre-configure this for usability and quick action when you get into the meeting. We can schedule recurring and non-recurring meetings and see them in the calendar view that we'll show. It also supports creating breakout sessions: if you have larger meetings and want to create sub-meetings and split people between them, you can do that. We also support users that don't have an account on the home server; we'll bring them in by creating them as temporary guest users, and they'll join the meeting. And we can integrate with third-party clients: specifically, in the openDesk project there's Open-Xchange, which is also a calendaring solution, and when you create a video conference call there, it will create a meeting room in Element with everything set up for that call. Finally, all of this is fully accessible and has multi-language support. Okay. Going to the widget part: if you're familiar with widgets, you sort of have an idea; if not, it's a way to embed web applications inside Element. It gives you access to the room events and room state events, and not much more, but that's the gist of it. The way we have built our applications is that we built a common layer, which we call the widget toolkit. It gives you, for example, a React component which will inject the widget API client into your React app, and you can start using it without having to do that integration yourself. It comes with Material UI components, so you can have a consistent look and feel, and you can change the theming. It also comes with some mocking components for easier testing. And finally, it comes with a base Docker image that you can use to quickly deploy your widget into your infrastructure. And it's not only a widget, but also a bot, because the widget API only gives you access to the room data.
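For a sense of what the widget API traffic mentioned above looks like: a `fromWidget` request asking the hosting client to send a room event has roughly this JSON shape. This is a simplified sketch of the MSC2762-style protocol; real messages carry more fields.

```python
def from_widget_request(widget_id, request_id, event_type, content,
                        state_key=None):
    """Sketch of a fromWidget postMessage asking the client to send an
    event on the widget's behalf. Simplified from the MSC2762-era widget
    API; field coverage here is illustrative, not exhaustive."""
    data = {"type": event_type, "content": content}
    if state_key is not None:
        # Presence of state_key marks this as a state event.
        data["state_key"] = state_key
    return {
        "api": "fromWidget",
        "widgetId": widget_id,
        "requestId": request_id,
        "action": "send_event",
        "data": data,
    }
```

The widget toolkit's injected client wraps this kind of message passing so widget authors never build these payloads by hand.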
We need to create the meeting rooms and set up all of these accessory workflows, and we use the bot to perform these. It's built using Node.js and TypeScript, with the matrix-bot-sdk, and NestJS for the API that we expose. And yeah, this was a broader overview; now Mikhail will talk a bit more about the internals. Hello, hello. Thanks, Milton. I'm Mikhail. I would like to continue with a high-level architecture of how it works. We have this NeoDateFix widget that is embedded in the Element Web client. It uses the widget API, with the toolkit, to send and observe state events and call some other actions that the API allows. It all passes through the Element client to the home server, and some of it goes to the NeoDateFix bot. The NeoDateFix bot looks for message events of particular types, and when they are received, it applies certain actions to the rooms. Besides the Matrix API, the NeoDateFix bot also has an HTTP API that is used by the NeoDateFix widget to provide widget lists and additional configuration that the widget may need. Additionally, that HTTP API can be used to manage the meetings from external clients, as I've already said. In addition to these components, we also developed several Element modules that simplify this setup a bit and add some optional features, like a lifecycle module and a guest module, but these are optional. Getting started is very simple: the user creates a room and invites the bot to this room; the bot will auto-accept the invite. Then the user needs to grant moderation rights to the bot. As soon as that's done, the bot adds the NeoDateFix widget to the room, so the user can see the calendar and create a first meeting, either a single meeting or a recurring meeting. It all ends up with one room per meeting. But the meeting room is a special room.
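The bot's dispatch on message-event types, described above, can be sketched like this. The handler and the exact event-type string are illustrative, loosely modeled on the `net.nordeck.meetings.*` naming; the real bot has many more handlers and side effects.

```python
# Hypothetical event-type -> handler dispatch, sketching how a bot can
# react to the widget's message events. Not the actual NeoDateFix code.
HANDLERS = {}

def on(event_type):
    """Decorator registering a handler for one message-event type."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register

@on("net.nordeck.meetings.meeting.create")
def create_meeting(ev):
    # In the real bot this would create and configure a meeting room.
    return "create room for {}".format(ev["content"]["title"])

def dispatch(ev):
    """Route an incoming event to its handler; ignore unknown types."""
    handler = HANDLERS.get(ev["type"])
    return handler(ev) if handler is not None else None
```

Unknown event types fall through silently, which matches the "look for particular types" behaviour described in the talk.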
It has the type net.nordeck.meetings.meeting in its m.room.create event. It also connects to the parent room, the calendar room, with m.space.child and m.space.parent events; there is a one-to-many relationship with the meeting rooms. The meeting room has widgets, and of course it has some other state events that are related to the meeting. Within the meeting room, the user can create a breakout session room. That is also a separate room, but with its own breakout session type, and it also has a connection to the meeting room where it was created. So we use message events and state events, obviously, from Matrix. All the message events are prefixed with net.nordeck.meetings; they are the events sent by the widget to manage the meetings: to create them, change permissions, widgets or participants, tombstone the meeting, or send some messages. The state events are used to store the state of the meeting. Mostly these are standard Matrix ones, but in addition there is net.nordeck.meetings.metadata, which contains the calendar information. So this is an example of this calendar information from the meeting metadata event. For a single meeting, it is a list of just one entry that has start and end fields with a date-time stamp together with a time zone; quite simple. For a recurring meeting, besides the recurrence rule, here FREQ=DAILY;INTERVAL=1, it can have exclude dates, to exclude particular dates from the recurrence, and it can have several overrides, to change particular occurrences of the recurring meeting. That's all regarding the slides, and I would like to hand over to Nurjin to show some of the features. Thank you, Mikhail. Hi, it's Nurjin. So, hopefully after all the talk you're teased enough to see some action, some demo. Here, Ahmed and I will quickly demo the basic features.
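The calendar metadata just described can be sketched as data plus a toy expander. The field names below approximate the iCal-like entry stored in the metadata event (the real schema may differ), and the expander only handles FREQ=DAILY;INTERVAL=1 with exclusion dates.

```python
from datetime import datetime, timedelta

# Illustrative calendar entry, shaped like the metadata described in the
# talk: start/end with time zone, a recurrence rule, and excluded dates.
entry = {
    "uid": "meeting-1",
    "dtstart": {"tzid": "Europe/Berlin", "value": "20240205T100000"},
    "dtend":   {"tzid": "Europe/Berlin", "value": "20240205T110000"},
    "rrule": "FREQ=DAILY;INTERVAL=1",
    "exdate": [{"tzid": "Europe/Berlin", "value": "20240207T100000"}],
}

def expand_daily(entry, count):
    """Return the first `count` occurrence start stamps of a daily rule,
    skipping EXDATEs. Toy expander: only FREQ=DAILY;INTERVAL=1."""
    fmt = "%Y%m%dT%H%M%S"
    start = datetime.strptime(entry["dtstart"]["value"], fmt)
    excluded = {e["value"] for e in entry.get("exdate", [])}
    out, day = [], 0
    while len(out) < count:
        stamp = (start + timedelta(days=day)).strftime(fmt)
        day += 1
        if stamp not in excluded:
            out.append(stamp)
    return out
```

Overrides (changing a single occurrence, as shown in the demo) would be applied on top of the expanded list.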
So first we need to create a new room; we call it a calendar room. We need the bot to be added to the room and to give it moderator rights, so the bot is able to configure the room for us and add the widget. Here we can see the bot added the widget to the room and pinned it in the middle. Now we can schedule meetings. Here you can see the information you can set for a meeting: you can add participants, you can allow or disallow messaging in these meeting rooms, and we have a set of widgets configured by the bot, which you can add or remove as you want. We will create an example of a single meeting and a recurring meeting. It's basically the same; for the recurring meeting we say whether it's an open-ended meeting, or whether it ends after a specific time or after a specific number of recurrences, for example. Here we first have the list view, where the meetings are shown as cards. Each card contains the information, with extra buttons for the participants and for sharing the meeting: we can share the meeting with a link or by email, or we can download it as an ICS file and import it into other calendar applications. We are also able to edit or delete the meeting, and of course go directly to the meeting room. Besides the list view, we also have this calendar view, with day, work week, week and month views. In each of these views, if we click on a calendar entry, we can edit the meeting. For example, here we can edit either the whole series or just one occurrence. We edit one instance, for example, and save. If we go back to the calendar, the changes are reflected there; this one now deviates from the others regarding timing. So we can also join the meeting.
We can see that the bot already configured the room with the widgets we chose and the specific layout configuration we set; here the NeoBoard and Jitsi are configured. We also see that the bot sends notifications to the room with every change that we make. Besides those, we have the NeoDateFix details widget: basically more detailed information about this meeting room. We can do other actions with it, edit it or delete it, and we can also go back to the parent room of this meeting room. As Milton already talked about breakout sessions: here in the meeting room, we can create as many breakout sessions as we want or need. They are divided into groups, named Group 1, Group 2 by default. We can distribute the users randomly, or we can select them manually, whichever we like. Here we can see the breakout sessions are created, also as cards, and in addition we can send a message to all breakout sessions; as an organizer you might want to notify all breakout sessions, say, "let's head back to the meeting room". So we can send it. We can switch to another user, and here we can see that he got all the invitations, for example this one for the daily meeting. The message of this invitation also contains the recurrence rules, information on when it occurs, and who you were added by. And you can see here, for example in the breakout session where Alice sent "hello world", the message was sent to the room, and the breakout sessions are also configured with Jitsi. I guess that's all, thank you. I will hand over to Milton. Thank you for the demo; I hope you liked it. To finish our presentation, we have a couple of interesting things that we found and want to share with you.
The first thing is that, as you can imagine, creating lots of meetings and temporary users creates resources; it's relatively cheap, but we want to keep things clean. So we have these additional features where you can clean up the temporary users using a Synapse module, and we have this somewhat hackish room reaper that goes through the finished meetings in the past. There's a field that tells the bot when they should be deleted, and it cleans up after itself, which is a good thing. We also have what we believe is a very good end-to-end test suite, because, as you may know, besides unit and integration testing, end-to-end tests allow you to script the full interaction: from the browser, to Element Web, to the widget, and how it then interacts with the bot. So we have a fully automated way to have the environments created, tested on, and then destroyed. This is obviously a precondition for our releases: these tests must pass, and they cover most of the features. So if you want to see a good example of using Playwright and Testcontainers for end-to-end testing, please check out the repo. There is obviously still room for improvement. We are just finishing, and should soon be releasing, support for encrypted meeting rooms and encrypted control rooms, i.e. the calendar room. We've had a slight issue here, because one of our clients requires us to deliver to the IBM Z platform, and there weren't any bindings for the Rust crypto for that platform; I think we have that in order for a release soon. Also: make the bot clean the rooms instead of that hacky script; support Element Call as well, once it becomes the default there; and have space-scoped calendars. In the demo you saw that there's a single calendar room, and it will create the meeting rooms at your top level.
If you could have it create them within spaces, so the meetings would live within that space, you could maybe manage different teams or different groups with different calendars; that would be a good thing. And finally, publishing meetings out to another calendar client would be a great thing to wrap up. Here are some of the resources, the links to our repos. These are open source, Apache 2.0 licensed applications, so be sure to check them out. I think we're ready for questions. Yes. There was one question on the internet: do you have support for encrypted rooms, and if not, are there plans to do so? Yes. As I said a few minutes ago, we currently don't support it, but we are releasing that soon; it's a matter of days. And the question was: do we support encrypted rooms. Yes. Yes, this is the NeoBoard; maybe we can show a couple of features. This is a widget that allows you to have a real-time collaborative whiteboard to draw on. It's an initial feature set, but if you participated in the Matrix Community Summit in Berlin last September, we did a full presentation there; it's online, you can check it out for more details. The next question is how we implement the invitation page that we show to the invited user, with the information about the meeting. So it works like this: first of all, in invitations there is the message itself, which is constructed inside the m.room.member event; there is invite text, which unfortunately we didn't show, but it's there. Besides that, there is this metadata event, which we configured separately in Synapse to be shared in the stripped state. So when you get the invitation, it's already shared, and you can already see it in the calendar: if you have the calendar open as the second user, you would see it there already.
So yeah, we added it to the stripped state. Next question: did I understand correctly that it only supports Jitsi meetings, and not in-app meetings? What do you mean by in-app meetings? Most of the clients have their own meeting functionality, and I just wanted to know if there's a chance to use that. The answer to that is: if there is a widget for it, we can support it. If there are other alternatives for video conferencing, it's a matter of developing a widget that supports it and setting up the bot configuration to include it in the room. So if it's not supported as a widget yet, in theory you can develop the widget with the toolkit and add it as another widget to the meeting. Okay, the question is what the Docker container part is that I mentioned. In order to deploy widgets, which are web applications, they need to run on a web server. So we have this Dockerfile template, based on nginx, that is already prepared for you to include as the base image for your app: instead of including a Debian or Node based image, you include the widget server image in your Dockerfile and just copy over the built release assets. That's the main accelerator: a ready-to-use base image for widgets. Any further questions? You can download an ICS file. The question is: can we integrate with Google Calendar and other calendar publishing platforms? We only support downloading the ICS file for a recurring or single-instance meeting. The inner format that Mikhail showed is stored in an iCal format, so the storage uses that, but we don't export any data out currently; that would be a good thing. It's open source, so the community can contribute support for that as well. If you go to the resources page... well, not the widget, but the NeoBoard.
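The ICS download mentioned here maps the stored iCal-like entry onto a standard iCalendar VEVENT. A minimal sketch (no text escaping and no VTIMEZONE component, which a production exporter would need):

```python
def to_ics(uid, dtstart, dtend, tzid, summary, rrule=None):
    """Render a single meeting as a minimal iCalendar file, similar in
    spirit to the ICS download the widget offers. Simplified sketch,
    not the actual exporter."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "BEGIN:VEVENT",
        "UID:{}".format(uid),
        "DTSTART;TZID={}:{}".format(tzid, dtstart),
        "DTEND;TZID={}:{}".format(tzid, dtend),
        "SUMMARY:{}".format(summary),
    ]
    if rrule:
        # Recurring meetings carry their RRULE into the export.
        lines.append("RRULE:{}".format(rrule))
    lines += ["END:VEVENT", "END:VCALENDAR"]
    # iCalendar uses CRLF line endings per RFC 5545.
    return "\r\n".join(lines)
```

A file like this can be imported into most calendar applications, which is how the "share as ICS" feature from the demo works.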
Yes, we have a live widget demo for that. Sure. The question is whether you can include and use this right now in your Element Web client. Because it's a widget, and there's the bot, you have to host it somewhere: you would need to download it and deploy it to some server or VM. It's not something that is hosted for you at this time. Thank you. Thank you very much.
MatrixRTC: The Future of Matrix Calls
Thanks for the amazing introduction. As Jan said, we're here to talk about the future of calling in Matrix, and we actually bring some pretty cool new things; there's quite a lot going on at the moment. We really hope that from now on there's finally good calling in Matrix, or at least that we're taking the first steps. All of this will be built upon MatrixRTC, which is the underlying protocol that powers all the calling in the future. That's what we're going to talk about: how this works, how it is structured, and how calls are built on top of it. MatrixRTC is actually something a couple of you have probably already encountered in the form of Element Call. Basically, this is a standalone web app where you can just have calls. It's very similar to Jitsi, but in the background it's actually running MatrixRTC. What it hasn't reached yet, though, is really being part of the federated system. What we have here, this single-page application, is very enclosed: it's running its own home server and it doesn't federate. You can't log in with your actual Matrix account; you have to have a custom account for this specific application. The change we're going to present now is that we have the same technology, but in our federated Matrix system. Before we start on the interesting new things, let's talk about why we even considered redesigning all of this. As probably all of you know, there has been calling in Matrix for quite some time already. It's in Nheko, it's in the legacy Element apps, and it's in Element Web. So why not just work on those? There are issues. For example, if you call each other at the same time, the calls sometimes don't figure out that two people want to talk to each other. Sometimes one of your devices never stops ringing. But why not just fix those? Oh, I see a lot of nods; that's actually super satisfying.
It's really good to see that people know what I'm talking about. So why not just focus on fixing those? Why rebuild something entirely new? The thing is, there are some pretty fundamental limitations. It's by design just one-to-one calls; that's just how it's designed, and the specification was never really meant for something bigger. It's very call-specific, so you can't build an arbitrary real-time application on top of it. It's just for calls, and that's something we think would be cool to change. And the signaling is done over room events. That's not necessarily a mistake, but it makes things a little slower than necessary, and it's really hard to get right, as we can see with ringing that never stops, or two people calling each other at the same time without it converging to an actual call. So this is basically our vision, what we want to achieve: we want calls to be a great and central part of Matrix via MatrixRTC. These four pillars are the core things we really want to get right. We don't just want to have calls; we want to think beyond calls and build an extensible system that motivates other projects too. We already had this, not with the exact stack we have right now, but something very similar, and people like Element built Third Room, and Nordeck built things like NeoBoard, which are also built on something very similar to MatrixRTC. We want to make MatrixRTC a thing where it's super easy to build those kinds of projects. The other pillar which is super important is a pluggable RTC backend. Currently that's LiveKit, and LiveKit is an amazing open project, so it really fits into Matrix from a culture point of view. It's an open system, and it really solves all the very complicated issues you hit when you use WebRTC for calling.
It even ships an SFU, and it's just a very decent combination: Matrix for the high-level signaling and LiveKit for actually doing the WebRTC shenanigans you need to go through. It gets quite annoying if you look into the details, and they just do an amazing job of getting it all nailed down. Then it has to support large group calls. Everything we want to have in the future shouldn't be just for one-on-one calls; I guess that's pretty obvious. And last, we want to make it as simple as possible for other clients to support the whole infrastructure. We already have two apps from Famedly, the Famedly app itself and FluffyChat, which support it, and we have the Element apps which support it. We also want to make it as easy as possible for others to add calling; we'll talk about this in more detail later. There's a widget path you can take, and LiveKit helps us here too, because they provide pretty decent SDKs. So if we want to build calling on Matrix, we really want to leverage all the good things about Matrix. Here's a very short overview, and I guess I can really do this quickly because everybody probably knows what Matrix is really good at: the things we have to carry over into this real-time infrastructure. One of them is that it's an open standard. That's one of the things I really see as the core of Matrix. It's super cool, and it's not really that surprising. Then we have Matrix encryption, which is really powerful, and it goes further than just encrypting for large rooms: it also has a very good authentication and verification system. And that's a thing which I think is super essential, that you can not only connect encrypted to other people, but it's also verified.
So you have the guarantee that, if everybody is doing the device verification correctly, all the participants are actual participants you trust, and you trust that their devices are not malicious. That is what actually makes it secure in the end: there is no weird third party in there who shouldn't get the data streams you're streaming. Then it's a federated system, so calling definitely has to go that path as well. And what Matrix is also really good at is persistent storage. It's not just exchanging data, it's also storing data and replicating the stored data over multiple home servers. But that comes at the cost that it's not real-time real-time, which is what we need for calling. It's more in the sub-second range, not the millisecond range. So, with those four pillars in mind, how can we now use Matrix to build a system that uses the best parts of Matrix while still succeeding at actual real time? This is done with three core parts: we have the Matrix part, then we have the client apps, which use the LiveKit SDK, and we have the RTC infrastructure, which is LiveKit in this case. Starting from the top, we have just a Matrix room, which can live on a federated system. The core problem Matrix solves here is that it basically stores which user is currently in which session. So if I'm joining a room and I read the room state, I can immediately tell who is in which session and how to connect to those people. I know if there's a running call and I know how to connect to it. And of course Matrix also does a lot more: sharing keys, providing the accounts and the verification. Next, in the center, we have the clients themselves.
Here we have a couple of clients which have only the green box in them, and then clients with the green and the blue box, and each of those boxes is basically one RTC application. To make this example more concrete, you could think of the green box as Element Call, or just calling in general, and the blue box as some shared-document real-time system, or Third Room, or whatever you have. Some of those members are in just one RTC session and some are in two, and this is also something that should be possible. Then, last, at the bottom, we have the RTC infrastructure, where we primarily want to use LiveKit, but it would also be possible to use full mesh, and we have this empty box at the end: it should be possible to use whatever new technology emerges. So if WebTransport at some point replaces WebRTC, you could implement a new infrastructure which does the same high-level signaling over Matrix but uses the new technology to get even better or higher data transmission, or whatever the advantage is. Now let's look in a little more detail at those room events. Before, they were at the top; now they're on the right. We have a room, multiple member events, and each member event has an array of memberships. We need this array because, as seen before, we could have a call and, at the same time, a real-time document. The top part of the membership JSON object here is the actual core MatrixRTC part. This data is just there so you know how to connect to this specific peer in the RTC world. It has this very central field, foci_active, which gives the type of focus, the type of connection you want to use, in this case LiveKit, plus all the necessary information to connect to it. This is the part which can actually be replaced with WebTransport or full mesh or whatever you like. And then there's another pretty important field, and that's the application.
So each membership has a specific application associated with it. In this case it's m.call, and that basically also gives the typing for all the other fields. For m.call we have a call ID as well, and a scope: whether it's a call for the whole room, or a breakout call, or whatever you want to add to the calling specification. But you can also imagine all kinds of other things. If we think about Third Room, one possible field could be, for example, a country or continent. Then when I look at the room state I can immediately tell who is in which country, and based on that I know whom to connect to. So we can do very high-level optimizations in this MatrixRTC world before we even connect to an SFU. What time do we have? Oh, it doesn't say here. Oh, okay, this is fine. And we can actually talk about this as well. This is kind of an interesting thing, and it's one of twenty problems I could have chosen that we encountered, but I find it really interesting for getting into the mindset of what those call member events are and what kinds of problems you hit in such a federated world. It's about call history. Whenever we have a call, it's of course super valuable to see afterwards in the room history that there was a call, how long the call was, and how many participants were in it. One very trivial approach would be that at the end of a call we just send a summary into the room, and the summary contains all the data: how many people there were, the duration, and everything. But then we hit issues which are very, very common in a federated world. Who creates this event? There has to be some kind of glare resolution, and maybe nobody feels responsible for it. Maybe the one responsible has a client which crashed at the moment they needed to send it. Maybe two people think they're responsible because some state hadn't resolved yet.
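To make this concrete, here is a sketch in Python of what such a member event's content could look like. The field names (foci_active, application, call_id, scope) and the m.whiteboard application type are illustrative, drawn from the MSC3401 family of proposals as described in the talk, not a normative reference:

```python
# Hypothetical shape of a MatrixRTC member state event's content.
# Field names are assumptions based on MSC3401-era proposals.
member_event_content = {
    "memberships": [
        {
            # Generic MatrixRTC part: how to reach this member.
            "foci_active": [
                {"type": "livekit", "livekit_service_url": "https://livekit.example.org"}
            ],
            # Application part: what kind of session this is.
            "application": "m.call",
            "call_id": "",   # empty string: the room-wide call
            "scope": "m.room",
        },
        {
            # The same member can be in a second RTC session at the same
            # time, e.g. a shared document (hypothetical application type).
            "application": "m.whiteboard",
            "foci_active": [{"type": "full_mesh"}],
        },
    ]
}

def applications(content):
    """List the RTC applications this member is currently part of."""
    return [m.get("application") for m in content.get("memberships", [])]

print(applications(member_event_content))  # prints ['m.call', 'm.whiteboard']
```

Reading just this one piece of room state tells a client both that sessions are running and, via the focus information, how to connect to them.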
It also would be redundant data, because every state event is of course part of the DAG, so it's already in the history of the room. By adding a summary event we introduce a possible conflict: if you look through the state history you see that the call was ten minutes long, but the summary says twelve minutes, because a client-side bug failed to calculate the proper call duration. This slide actually got broken; either way it's still visible enough, so it works. The cool thing is, if we look at the call member events we showed before, it's very easy to parse those events as join or leave events. On the left-hand side, with the green border, we can see that in the unsigned field we always have the previous state of the event. If the previous state was an empty array and the current state is an array with a membership, this can easily be parsed as a join event. On the right-hand side, with the black border, we have a previous content with a membership, so somebody was in some kind of RTC session, and now the current content is an empty array, which implies a leave event. So it's really easy to tag those events. On the next slide we have a visualization of a timeline: the left-hand side is the past and the right-hand side is the present. The red boxes are state-event changes which we tagged as leave events with the system described before, and the green boxes are state-event changes which we tagged as join events. Going through a very simple example: member three had no changes at all, so during the whole period shown on screen they were not a member. Member two had no membership in the past, then a join event, so from that point on they were in a membership, and then a leave event.
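The tagging just described, comparing an event's current content against its unsigned.prev_content, can be sketched like this. The field layout is assumed from the slide described in the talk, so treat it as illustrative:

```python
def tag_membership_change(event):
    """Classify a call member state event as 'join', 'leave', or None,
    by comparing content.memberships with unsigned.prev_content.memberships."""
    prev = event.get("unsigned", {}).get("prev_content", {}).get("memberships", [])
    curr = event.get("content", {}).get("memberships", [])
    if not prev and curr:
        return "join"    # empty array -> non-empty array
    if prev and not curr:
        return "leave"   # non-empty array -> empty array
    return None          # no change relevant to call membership

# The two cases from the slide: green border (join), black border (leave).
join_event = {"content": {"memberships": [{"application": "m.call"}]},
              "unsigned": {"prev_content": {"memberships": []}}}
leave_event = {"content": {"memberships": []},
               "unsigned": {"prev_content": {"memberships": [{"application": "m.call"}]}}}
```

Because the previous content travels inside the event itself, a client can tag joins and leaves without fetching any extra history.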
So if we now run an algorithm locally where we start from the present and go back, collecting all the leave and join events, we can basically recreate the call state. At each point we know who was joined and who wasn't, and we just loop through this until we find a point where nobody was joined; that is, of course, the start. So, as the slide with the green border indicates, we then have all the information we need: we have the start, we have the end, we even have the number of participants who joined. We basically even have a heat map of how many participants there were at each time. There's lots of data in there, and each client can decide on its own what exactly to do with it and how to render it in the timeline. So yeah, this is your part now. Whoo, on time? Thank you, Timo. So now we are going to look at implementing this, because client implementers also need help. And if you are one of those people whose client already has the WebRTC parts implemented, you might be thinking: ah, shit, I need to throw away all of the stuff I've already done. Not really. Timo showed this already, but there's this small RTC infrastructure bit which we are going to look into. This is MSC3401, well, kind of MSC3401: the m.call event has already been removed, because it caused way too many glares. If you want to know more about that, you should watch Timo's Matrix Community Summit talk about why the m.call event had no ownership and caused way too many glitches. The first half is just the MatrixRTC stuff Timo already talked about.
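The replay algorithm Timo described, walking back from the present and collecting join and leave tags until nobody was joined, might look roughly like this. This is a sketch of the idea, not actual client code:

```python
def reconstruct_call(events):
    """Recover (start, end, participants) of the most recent call.

    `events` are call member state changes ordered oldest to newest, each
    given as (timestamp, sender, tag) where tag is 'join' or 'leave'.
    We replay them from the present backwards, as described in the talk."""
    joined = set()        # who is in the call at the point we are looking at
    participants = set()  # everyone seen in the call
    start = end = None
    for ts, sender, tag in reversed(events):
        if tag == "leave":
            # Walking backwards, a leave means this member WAS joined earlier.
            joined.add(sender)
            end = end if end is not None else ts
        elif tag == "join":
            joined.discard(sender)
            participants.add(sender)
            start = ts
        if not joined and start is not None:
            break  # nobody was joined before this point: the call starts here
    return start, end, participants

# A joins at t=1, B joins at t=2, B leaves at t=3, A leaves at t=4.
events = [(1, "A", "join"), (2, "B", "join"), (3, "B", "leave"), (4, "A", "leave")]
print(reconstruct_call(events))  # prints (1, 4, {'A', 'B'})
```

With timestamps on every tag, the same walk also yields the participant-count-over-time data the talk mentions for rendering a heat map.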
The participants send the member events, so the room has a history of who joined when. Now, if you don't have an SFU, you could just declare the infrastructure, the focus, the backend in MatrixRTC terms, as mesh, and then you can potentially just use the P2P MSCs which you already implemented, or hopefully will implement, for a mesh call. A mesh call is basically a P2P call between multiple participants. It's just not as scalable as you would think, but now you can use your existing MSCs, your existing implementation, for mesh calls, and you don't even need an SFU. But if you are rich, and you do want to set up an SFU, then it gets much simpler. The SFU in our case will be LiveKit, and all of the signaling bits are now handled by LiveKit itself over WebSockets. The previous approach was over to-device Matrix events; the first half is the same, but basically all of the signaling is now handled by LiveKit over WebSockets. More about LiveKit: I'm going to keep saying that SFUs are cool, but SFUs are also very expensive, and if you don't want anyone else to use your SFU you probably want some authentication in front of it. So if you are a home server owner or admin and you also host an SFU, then you will probably also be hosting a JWT service, which basically takes an OpenID token from your Synapse server. You send it to your service, the service validates that you are the one who generated that token, and then it generates a JWT for you, which you can use to authenticate with the LiveKit SFU. Right now, I believe the Synapse check only verifies that you are the one who generated the OpenID token, but I think there's already work going on to check whether you are actually in the room, so that only people who are in the room, and who actually want to join that room's call, can get access to the SFU. Some fancy stats: the LiveKit docs say that with around a 16-core Google virtual machine you can have calls with around 150 members.
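The token exchange described above ends with the JWT service minting a LiveKit access token. As an illustration only, here is a minimal HS256 JWT signer using just the Python standard library. The claim layout (iss as API key, sub as identity, a video grant) follows LiveKit's documented token format, but the values are made up, and a real deployment should use the official LiveKit server SDKs:

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Base64url without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_livekit_jwt(api_key: str, api_secret: str, identity: str, room: str) -> str:
    """Mint an HS256 JWT like the one a home-server-side auth service
    would hand out after validating the user's OpenID token with Synapse.
    Sketch only: claim names assumed from LiveKit's token docs."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                  # LiveKit API key
        "sub": identity,                 # e.g. the Matrix user's identity
        "exp": int(time.time()) + 3600,  # keep tokens short-lived
        "video": {"room": room, "roomJoin": True},  # LiveKit video grant
    }
    signing_input = (
        b64url(json.dumps(header).encode()) + "." + b64url(json.dumps(claims).encode())
    )
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

token = make_livekit_jwt("devkey", "secret", "@alice:example.org", "!room:example.org")
```

The important design point from the talk is that the secret never reaches the client: the client only ever sees its OpenID token and the resulting short-lived JWT.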
That is, I believe, 720p, no simulcast, just raw 720p feeds from 150 members. From my personal testing, I used a Hetzner CAX21 (well, not personal, Famedly gave that to me), which is a four-shared-vCPU ARM thingy, and I could get around 70 participants with simulcast and 720p, everything optimized, I think. Ringing. You might not think ringing is important, but ringing is actually very difficult to get right. Mainly because native operating systems are not really friendly to you and will try to kill your app at every possible second. This started as a GSoC project, by the way, a GSoC 2022 project at Matrix. It's basically a three-month window in which you have to do one particular task. Well, my task was actually implementing the whole WebRTC thing, but I implemented the whole WebRTC thing in two weeks, and for the next two and a half months I had to fight ringing. You need to focus on three cases: your app could be in the foreground, in the background, or terminated. By ringing I basically mean that if your app is in one of these three states, you need to be able to somehow ring the application when you get a call. I pivoted three times, and we'll see the three approaches. This is a story, yes, but hopefully client implementers can learn from it. This is the coolest part, which I wanted to show at FOSDEM; I did not know you could do this. This is Android-specific. As far as I know, iOS only has one way you can do this: CallKit, which is the phone dialer framework on iOS. I think WhatsApp also uses that. But it turns out Android also has a way to do this. It's called the telecom manager, or the connection service API, and what you're seeing on your screen right now is the Samsung OEM dialer application. What the telecom manager allows you to do is put any VoIP call from your application into the dialer, so you don't really have to handle the OS killing your app and all that, because the dialer already handles it. Then you get this fancy UI.
You see all of these buttons: hold call, Bluetooth, even the merge button works, and I didn't have to do that. You also don't have to implement a new UI for holding calls, or for when you get another call while you're already in one. This was very cool, so why could it not be shipped? For this, you need to add your app as a calling account in your dialer app, and that is a very hidden setting. I could not find a way to do it programmatically, and in some regions it's just blocked; it's apparently a regional thing. So this could not get in. Frustrated by that, I went to try two: we just hack it. Apparently Android has two very nice thingies: "show on lock screen" and the "appear on top" thingy. What we basically do (apparently we're running out of time, so this is going to be super fast) is call the "appear on top" thingy, which brings your app to the top, and then you can use "show on lock screen". Even if your app is in the foreground or background and your screen is locked, you can potentially just hack the app awake. It does not work on terminated apps, and there's no way my coworkers would have let me merge this thingy. Try three: fine, we'll do it the right way. By the way, if you are thinking this is an obvious solution, it was not obvious for me, because Famedly and FluffyChat are written in Flutter, and when I get a notification I would have to start the right Android bits, then start the right Flutter bits, then decrypt the event and then show the ringing. Too much work. But it turns out, after two tries, that push notifications already do all of that for you. So we just abuse that now. You use the Firebase push thingy or the UnifiedPush thingy; they start a worker for you and bring up the Flutter engine. A Flutter engine is basically something which is attached to your Android activity. Once the Flutter engine has started, you can just hook onto that.
You can hook a VoIP listener onto it, then kind of abuse it to see if there's an invite event coming in, and then you show your own UI. That works. I hope that's the right way to do it; please tell me if it's not. By the way, like I said, I use the m.call.invite thingies for this right now. But that's not a thing with LiveKit, because all of the LiveKit stuff happens over WebSockets. So there's a new MSC for that. It uses intentional mentions, so you don't spam your whole room with notifications, but you can specify which user IDs you want to ring, and what your notification type is: it can be either a ring or a notification. S-Frame key sharing: no time, but SFUs need another layer of encryption, because WebRTC's transport encryption ends at the SFU, so the media itself is secured with S-Frames. Trust me, bro. Cascading: yes. Right now your calls are technically federated, so you could have a call inside one room with SFU one, and a call inside another room with SFU two. The main limitation right now is that all of the participants who want to be on one call need to be connected to the same SFU. With this you can also have secure deployments where you basically just have the left half, and all of your communication stays within your organization, just on the local network, etc. But in the ideal future, what we want is cross-SFU communication, where every home server has its own SFU and its own JWT service, all of the users from that home server connect to their own SFU, and the SFUs cross-federate: everything is federated, yay. This is already a thing, by the way, but it's a proprietary thing in LiveKit. So if someone from LiveKit is watching, please open source it so Matrix can use it. Probably not going to happen. And how do you implement this? There are two ways. The easy one: you can implement Element Call in widget mode.
I believe there are two SDKs right now, the Rust SDK and the React SDK, which already support widgets. So you can just use the iframe in your app (looking at you, Fractal people, do it already). If you unfortunately don't support the widget API, then you have to go the hard way: you need to implement it using the native LiveKit SDKs, and LiveKit has a lot of SDKs: Flutter, Android, Swift, Rust, obviously Rust is there. Yeah, that's it. Thank you. Demos. By the way, you can join this demo. Timo, I think they can use develop.element.io. Yes. Basically, maybe you go ahead and show the... Ah yes. Yeah, so they can sign in. You can either use... I should have written this down. You can either use develop.element.io or td-family.github.io/fluffychat. I promise you this is not a phishing attempt; I can show you the CI run from what I deployed. Once you go there, just type in this alias and you should pop up in a room where you can join a call with us. Could you repeat the URL? This is the URL. It is... Yes. Timo, do you want to start it now? Yeah. So basically, can people hear me if I talk without the microphone, or... Okay, then you just have to talk with them. I'm talking. Okay, perfect. Yeah, then I can also talk. So basically what I just did is start a call. And the cool thing now is that we really have the full new MatrixRTC stack implemented in Element Web, in Element X, and in Famedly or FluffyChat. So all of them are able to... Can you hear me? Yeah. I can hear some weird sounds. So all of them can talk with this new stack. You have to go to develop.element.io, and there is a feature flag there. Oh, to be in the camera, makes sense. But in general, this is the big new thing now: everybody can, without doing something highly crazy, just go to develop, activate the new group call experience, and then be able to use the new calls.
So basically what I just did is start a call, but I think I did a private call; that's why it did the ringing as well. So I am joining here. And TD now... Someone's already in the call. Yeah, this was me just joining there. And I think maybe Kim is in there already. Oh, there are multiple people. Interesting. Well, that's Element for you. You have been seeing this for months now, but now we go to the fancy thing, FluffyChat. This started a month ago, so it's probably riddled with bugs, but, well, if it works, yes! Kaboom. Nice. So this is really super, super cool, that TD managed, in record time, to get FluffyChat into a state where we again have a federated multi-client system with group calls. So yeah, this is one of the first few multi-client... I think it's the third time we've done it now: a multi-client federated MatrixRTC call. With screen sharing, apparently. Questions? Do you guys want to break it? How many people can still join? Oh, we are doing a test. Might as well. Yes. Yeah, are there any questions? Does LiveKit send any events back to say who's talking? Oh yeah, there's actually lots going on in LiveKit. Okay, so the question was whether LiveKit sends any signaling back to let us know who's talking, and probably also who's showing video. There are lots of things LiveKit does; it's actually pretty sophisticated in that regard. There are even things like: if I upstream video but nobody's consuming my video, say we have a conference of 100 people and everybody has me at the bottom, then LiveKit communicates to my client that I don't even have to upload video anymore. And that doesn't only work for uploading video versus not uploading video, it even works per resolution. So if lots of people consume me in just a tiny thumbnail, my client automatically notices that it only has to stream the thumbnail.
So there's lots of optimization happening, so that in the end, from a receiver's point of view, you basically just download what you actually see, and from a streamer's point of view, you only upload what people actually need to see. Yes? Who hosts this LiveKit? You said this is fully federated, but maybe I somehow missed the point where we talked about whose LiveKit service is used. Because in the previous iteration, with full mesh, I thought the cool thing was that multiple Matrix servers are involved, and also multiple SFUs or whatever are involved. Now it seems like it's maybe the LiveKit server of the first one who initiated, or something. Yeah, so basically, this is kind of two questions. The first part was who's hosting the LiveKit server, where are they coming from? If it's federated, there should be, same as with Matrix servers, multiple servers, and that's exactly what's happening. The idea is that in the future it becomes very, very common that next to your Matrix home server you also host a LiveKit SFU. It's similar to how lots of people host a TURN server right next to their Matrix server. The second part of the question was how we decide which SFU to use. Of course, with what TD presented at the end, where the SFUs talk to each other, you would just always connect to the SFU of your home server, and for federated participants the SFUs would figure it out between themselves. For now, there's a system where, exactly as you described, the first one who joins defines in their member event which LiveKit SFU to use, and then everybody jumps on that SFU. And since that means that if the first one leaves, and maybe others join, but they have made the mistake of putting a wrong or different LiveKit SFU into their member event, we even have real-time switching between SFUs.
So it's, I think, a one-second interruption you get, but it works really well: if the first one joins with SFU A, the second person has SFU B in their event, and then the first one leaves the call, everybody immediately switches to the SFU of the oldest participant. But I guess it's quite obvious this is mostly a workaround until we get to the point where the SFUs can exchange the streams directly between themselves; that would of course be much more elegant, and then we wouldn't need this anymore. But for now this is exactly how it works, so we can always guarantee, because it's a very simple glare-resolution algorithm (just take the oldest call member state event), that everyone is on the same SFU, which is quite important for a call, of course. Does that answer the question? Yes, always. Do you see any technical difficulties with having recording or transcripts? So the question is about recording and transcripts, and whether there are technical difficulties around them. Since this is Matrix, the easiest-to-grasp approach, or UX, however we want to call it, would be that those kinds of things just happen as bots. Recording would happen as a bot: you can easily have a recording bot, which is just another participant. It's part of the room, it gets into the key sharing, so it's very transparent for everybody that it's not just the participants but also the bot receiving the streams, and this bot would take care of recording. And since it's all based on LiveKit, and LiveKit is very good infrastructure already, with amazing tools for this, recording should be fairly straightforward. The transcript question, which was also asked, is basically an implementation discussion.
You could also have a bot, and the bot could stream the data into a data channel, or the bot could stream the data directly into the room, because it's part of the room. Or you could say you don't want any bot to get the data, and you want to run local systems which do the transcription, and then just do it locally. There are multiple solutions for this; I guess we'll see what the future brings. This is amazing. I think somebody just joined the room with a... oh, but it's just unmuted. I thought that was somebody who had already implemented recording and was now playing it back live. That would have been so cool. I just got super excited, but I guess this is just my echo. Any other questions? So basically the current state is that it's just on develop, but it is ready to try out. I think this is actually something... can you show the path to activate the new group call feature in Element? So if you go to develop, there is one feature flag. For now you will only get the option to do Jitsi calls and legacy calls, but if you want the new MatrixRTC calls, or Element calls, you need to go into the settings and then feature flags, and there's a flag called, yeah, "new group call experience". If you turn this on, on the sending and the receiving client, it should all work. And on Element X, the mobile client, Android and iOS, it should also just work; there you don't even have to activate a feature flag. You just go into the room, press join, and you should end up in the same room there as well. Actually, that's a part of the demo we could just do, right? Do you just want to join with that user? I think, yeah, this is actually also a thing we can show. It's easily possible to have multiple devices per user, which basically implies we have seamless continuity. So I was connected with this computer, and now I just connected here. Oh, I need to read this. It's dangerous.
So I'm connected here as well, and now you can't see any streams, right? It does show streams on my computer. Maybe they will recover. I mean, yeah, it seems not to work here, but it works on this computer. I can turn it around, so at least the first row can be convinced that it's actually showing the stream right here. So if I hang up here, I have basically used continuity to move the call from here to here. Oh, and this is also pretty interesting. I'm not sure anyone can see it, because it's just on this screen, but Paul has joined with an older version of Element X. Currently, if you're in an unencrypted room, you stream unencrypted media, and if you're in an encrypted room, you have per-sender encryption; that's a part TD kind of rushed over. On an older version of Element X this isn't considered yet, so since this is an unencrypted room, if you join with an older Element X, you still stream encrypted data, but my client doesn't expect encrypted data, and that's why it's giving me all kinds of noise. So basically, this is proof that it's actually encrypted. So what TD said is... Trust me, bro. It's always super hard to demo an encrypted call, but here we are. We managed to break it, and there you can actually see that it's encrypted. And the only reason this isn't an encrypted demo today is because we have different key-sharing implementations for that: I believe Element uses room events, and I decided to use to-device events, because why not? But this will get figured out once we start drafting MSCs and stuff. Exactly. Last question? All questions answered. Cool. Thank you so much.
The state of the Matrix Rust SDK in 2023
Hi, everyone. So today I'm going to talk about the state of the Rust SDK in 2023 — all the things that we've accomplished over the last year, and some of our future plans as well. So first of all, who am I and how did I get into the Rust SDK? Well, I'm Benjamin Bouvier. I'm a software engineer in the Rust team at Element. Prior to that, I worked at a game dev company on a game engine that was written in Rust and WebAssembly. And prior to that, I was a compiler engineer in the SpiderMonkey team, which is the JavaScript engine powering Firefox, where I did Rust and WebAssembly. So you can sense that there is a common theme here. And back in the days at Mozilla, we were using IRC. And so I wrote a few bots that were just pulling jokes from the internet and posting them in the channels. And then at some point, we decided to use this new cool thing called Matrix. And so I rewrote my bots so that they could also run on Matrix, using JavaScript at the time, because when you work at Mozilla, you have to bet on JavaScript all the time. And then a few years later, I decided to rewrite them in Rust because I like Rust. And I made this bot framework called Trinity that uses Rust for interacting with Matrix, and then you can actually write the bot commands themselves using WebAssembly, which is pretty sweet. I've experimented with using it in production; it's mostly a fun project. And that's how I started to use the Rust SDK. So what is the Rust SDK? Very good question. It's a Rust library implementing the client-server API, to allow you to implement clients easily if you want to use Rust in your project. The code is available on GitHub under the Apache 2.0 license. And it does all the things that you would expect from a Matrix client: logging in, logging out, sending messages, receiving messages. But I guess the most interesting thing is that you get end-to-end encryption for free.
And you don't have to worry about the, excuse my French, gory details, in the sense that you don't have to learn about Olm, Megolm, sending and uploading your keys, claiming keys, querying keys and all of that stuff, which is very finicky. All of that is handled for you. Some history for this Rust SDK. There was in the past a project called Ruma, for Rust Matrix, which modeled all the events that can happen in a Matrix room timeline, and also all the requests and responses to the endpoints. And the goal at the time, I think, was to try to create a home server in Rust. Eventually that didn't happen for the Ruma project itself, but people realized that it was a good idea to actually model all those events, requests and responses, and reuse them across other projects. And there was another Rust home server that started to be written, and that is Conduit. And in another timeline in the world, there was Damir, who is now the team leader of the Rust SDK team at Element. He was doing Rust in his free time, and he maintained a small plugin so that you can use WeeChat with Matrix. That was written in Python. And as he was trying to learn Rust, he decided to rewrite it in Rust. And the thing is, well, he did so. He searched for a library written in Rust to do that, and there was none. So he decided to start one, and that's how the Matrix Rust SDK started. From the outset, it used Ruma, because it made sense, and that allowed it to reuse massive amounts of code, which was very nice. And Damir, being a crypto engineer, also implemented the whole crypto stack, which was very sweet. That was at first inside the Matrix Rust SDK, and then all of that code was pulled out and extracted as an independent library called vodozemac, which apparently in Croatian means amphibian. It's a big pun across languages: Olm, Megolm and all of these refer to amphibians, it seems. And yeah, so that's how it goes. All right.
So why Rust, you would ask? Well, this is my minute for the Rust evangelizing taskforce. I mean, you're probably convinced if you're here already, but it's at the same time high level and super fast. It allows you to write code in a very fast fashion, without having to worry about lots of low-level details and issues. It is secure and memory safe, which is very nice for a library, because you want to have something very robust. It has amazing tooling and an amazing ecosystem: all the packages, the crates, that are published on crates.io give you all the things that you want to have. And cargo, the tool that does it all, is just wonderful. You can run tests, build the documentation and all of that. Also very important for the rest of this talk: it is compatible with foreign function interfaces, so you can call into other native languages that speak the C ABI. That's quite important, as we'll see. And one of the things that is maybe a bit undervalued in the Rust community is that it's also trying to empower you to write multithreaded code without you having to know too much about it, trying to make it very accessible. It's a value that was in the community first, and you can find it in all the places. It translates to all the places in Rust, from the error messages that just hold your hand and try to explain what you did wrong and tell you how to fix the problem that you ran into, et cetera, et cetera. So it's very sweet to use. And yeah, being a former C++ programmer — there was this notice in one of the offices where I worked before that read, "you must be this tall to write multithreaded code", and it was apparently at three meters high on the wall. This is something of the past. With Rust, you can just be fearless when you're writing multithreaded code, because there is this thing called the ownership model.
And that makes it really easy to also model concurrent implementations of anything, really. So that's really, really nice. So why the Rust SDK? Well, there was this story where we had three apps: basically the Android app, the iOS app, and the web version, which is also powering the desktop version. And they were all using a different SDK and a different crypto stack. So that means that if you are serious about your security, and you want to, for instance, audit your cryptography, now you have to do it in three places and make sure that every single implementation actually does what it's supposed to do, which is a bit of a nightmare. And you also have per-platform issues: you can have a bug in one stack, and then you need to check whether the other stacks also have it, et cetera, et cetera. Well, now we are saying no: we have only a single stack for the Element apps, and it's written in Rust. In particular, it's a single crypto stack. You have very high test coverage — as I'm speaking, it's more than 83% test coverage in the Rust SDK. The vodozemac library, the crypto stack, is being fuzzed as well, which is very important in terms of finding issues, security issues. So it's a single place where you can add features: code once, use everywhere — the old Java dream that everybody knows and loves. All right, who's using it? So there is Fractal, the GTK-based Matrix client, which is using it. There is iamb, a terminal UI client, if you like Vim bindings and all of that. There's the new generation of Element apps: the Element X apps are only using that, which is pretty sweet. And the crypto stack, since it could be extracted — there are specific bindings just for the crypto stack — could also be used in the current generation of Element apps. That has the codename Element R, and I guess you can imagine what the R stands for at this point: Rust. All right. So what happened since the last FOSDEM?
Well, the previous release of the Rust SDK was in October 2022. So we made a new release this year. Yay! At the beginning of this month. Thank you. It's still not 1.0, still quite experimental. We're breaking APIs all the time, but trying to do a better job at writing changelogs and all of that. And we'll see how it goes. So, new features. You probably heard about sliding sync last year. It's the new kind of synchronization that makes it so that logging into a new device and retrieving events is always instant, even if you haven't opened the app for months or years. So we entirely support that. There is the basic feature that you can subscribe to specific rooms, and lists of rooms of which we get a sliding window that is computed by the server — but we're getting rid of that, as Matthew said. And it also implements a modular design, in the sense that you have opt-in extensions for read receipts, typing notices and many other things. All of that is supported in the SDK. As you can see on the right, it's quite verbose because, well, it's a very versatile and general API, to give you the most control so that you can build higher-level primitives on top of it. We'll get back to that. And it's gated behind the experimental sliding sync cargo feature. We basically use it in production in Element X, so it's quite stable, actually. There's also support for OIDC, so OpenID Connect. It's a cross-stack effort, moving from the custom Matrix authentication to OpenID Connect. If you have a Matrix Authentication Service running — it's another service running on your server, alongside Synapse or your home server — it can act as an actual OIDC provider, or as a specialized proxy to an upstream provider. So if you have a GitLab instance, for instance, you can connect it to the Matrix Authentication Service and then have your GitLab users log into Matrix for free, like that. And so that's the server-side part.
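As a rough illustration of the sliding-window idea behind sliding sync: the client asks for a range of the server-sorted room list and only receives the entries inside that window. This is a toy sketch, not the real matrix-sdk or sliding sync API; all names here are invented.

```rust
#[derive(Debug, Clone)]
struct RoomEntry {
    id: String,
    last_activity: u64, // timestamp the server sorts by
}

/// Return the rooms falling inside the requested window of the
/// activity-sorted list (most recently active first).
fn sliding_window(mut rooms: Vec<RoomEntry>, start: usize, end: usize) -> Vec<RoomEntry> {
    rooms.sort_by(|a, b| b.last_activity.cmp(&a.last_activity));
    rooms
        .into_iter()
        .skip(start)
        .take(end.saturating_sub(start) + 1)
        .collect()
}

fn main() {
    let rooms = vec![
        RoomEntry { id: "!a".into(), last_activity: 10 },
        RoomEntry { id: "!b".into(), last_activity: 30 },
        RoomEntry { id: "!c".into(), last_activity: 20 },
    ];
    // Ask for the window [0, 1]: the two most recently active rooms.
    let window = sliding_window(rooms, 0, 1);
    let ids: Vec<&str> = window.iter().map(|r| r.id.as_str()).collect();
    println!("{:?}", ids); // most active first
}
```

The point is only that the client never has to download the full room list: it names a window, and the server does the sorting and slicing.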
The authentication service is also written in Rust, which is pretty sweet, because that means the requests and responses can actually be reused in the client, the Matrix Rust SDK. And the SDK implements all of that already. We are also using it in production in Element X. So it gives you all the things that you would like to do with OIDC: create and reload metadata, register your OIDC client, do the login flow in all its steps and all of that. It's also behind a cargo feature at this point. Among the big news, we have a new default storage backend. The storage backends are implemented using traits, which are Rust's take on interfaces. The previous default, when you wanted to persist things on disk, was sled. Now it's been replaced with SQLite because, well, pretty much everybody knows about SQL, and it's also much faster for our use case. We still have an in-memory backend if you don't care about losing state, and an IndexedDB backend that is used when you're compiling for the web, to WebAssembly. Some new cryptography features. There is this new thing called secret storage. It's mostly an implementation detail, but it gives you an encrypted key-value store, backed by the user's account data, where you can put any information that you would like to share across all your devices in a secure way. The server doesn't know about this information; it cannot peek into it and know what is in there, because it's also encrypted. On top of that, we implemented key backup and restoration. That means that when you're using Element X, for instance, it will store all the room keys that are used for decrypting room messages in encrypted rooms in the secret storage, and then another device can restore them, so that you can actually see the history of events from before you joined with that new device. Also, in addition to that, we made it so that cross-signing happens automatically and you don't have to worry about it at all.
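The trait-based storage backends can be pictured roughly like this. This is an illustrative sketch, not the SDK's actual store trait; all names are invented. The SDK codes against the trait, and SQLite, IndexedDB, or in-memory implementations slot in behind it.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a state-store interface.
trait StateStore {
    fn save(&mut self, key: &str, value: Vec<u8>);
    fn load(&self, key: &str) -> Option<Vec<u8>>;
}

/// In-memory backend: fine if you don't mind losing state on restart.
#[derive(Default)]
struct MemoryStore {
    data: HashMap<String, Vec<u8>>,
}

impl StateStore for MemoryStore {
    fn save(&mut self, key: &str, value: Vec<u8>) {
        self.data.insert(key.to_owned(), value);
    }
    fn load(&self, key: &str) -> Option<Vec<u8>> {
        self.data.get(key).cloned()
    }
}

fn main() {
    let mut store = MemoryStore::default();
    store.save("filter_id", b"abc123".to_vec());
    assert_eq!(store.load("filter_id"), Some(b"abc123".to_vec()));
    println!("ok");
}
```

Swapping sled for SQLite then only means writing a new implementation of the trait; the rest of the SDK is untouched.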
Cross-signing is what's used to verify your own devices and other people's devices, and some of the private keys are stored in that secret storage as well. And speaking of high-level primitives, we made a new crate, a new package, called matrix-sdk-ui. It is highly experimental and also highly opinionated, in the sense that we're enabling a few cargo features by default, and we are trying to implement the best practices in terms of user experience and performance. And it's as robust and tested as the rest of the SDK, which is very sweet. We use sliding sync as a foundation for all these new high-level features. One of these features is the room list service, which, as its name suggests, gives you a list of the rooms. It does so in a way that tries to show something to the user as soon as possible. That's why the app feels kind of instant when you open it: it will try to load just one event for all the rooms you were in — or no, a few of the rooms you were in — so you have something to display. And then in the background, once that's done, it will try to fetch more events. Also, you can configure it to say: this is the set of visible rooms in my app. Because when you have an app, you cannot show, like, a thousand rooms; you will only show a subset, right? So you can configure it to say, these are the ones that are actually rendered on the screen, and those are prioritized so that you get more events for those rooms. Another thing we added was the encryption service. It's basically a sliding sync instance that is just running encryption on the side, and it gives you more concurrency with the other one. Think of it this way: the room list service, the one I just talked about — when you're scrolling on a mobile app, it will change the list of rooms that are shown on the screen, right? So that means it's sending new requests to ask for things.
And if we did the encryption in the same request — it's getting a bit technical, but that would mean we would need to abort those requests and delay encryption. So now we have basically more concurrency and more performance, and we can do the encryption work in the background while you're still scrolling the room list, using this encryption service. We also have a notification service. That's a very specialized client that just handles push notifications. Given an event and a room identifier, we want to retrieve the event and maybe a bit of context: what's the name of the person who sent the message to you, and all of that. It's also using sliding sync for that. And it makes use of the encryption service, because in an encrypted room, of course, you would get a push notification for an encrypted event, and the server cannot know if it's a meaningful event, right? Maybe it's just a reaction, putting a thumbs up on one of your messages. So we decrypt the event in the client itself, and then we decide whether it's worth showing as a notification. The one fun thing, if you can call it fun, is that on iOS, if you want to modify the notification in case it's encrypted, that runs in a separate process. And that makes our life very hard, because even if you're just decrypting data, the state of the cryptography keys is mutably changed, right? So now we have mutable state that is global across two processes that are sharing the same database. We had to be a bit creative to solve that issue: we are basically enabling the write-ahead log in SQLite, and using some data in the database to indicate which process currently gets to read and write to the database — basically implementing a lock like that. All right. And since we added those two services, the encryption service and the room list service, we wanted to make it very simple to just fire up synchronization and forget about it.
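The cross-process coordination described above — a well-known value in the shared database recording which process currently owns write access — could look roughly like this. This is a toy sketch with a HashMap standing in for the shared SQLite database; all names are invented, and the real implementation also has to handle timeouts and crashes.

```rust
use std::collections::HashMap;

// Stand-in for the shared on-disk database.
struct SharedDb {
    rows: HashMap<String, String>,
}

impl SharedDb {
    /// Try to record this process as the current writer.
    fn try_acquire_write_lock(&mut self, process: &str) -> bool {
        match self.rows.get("writer") {
            // Another process already holds the lock.
            Some(owner) if owner.as_str() != process => false,
            _ => {
                self.rows.insert("writer".into(), process.into());
                true
            }
        }
    }

    fn release_write_lock(&mut self, process: &str) {
        if self.rows.get("writer").map(String::as_str) == Some(process) {
            self.rows.remove("writer");
        }
    }
}

fn main() {
    let mut db = SharedDb { rows: HashMap::new() };
    assert!(db.try_acquire_write_lock("main-app"));
    // The notification extension must wait its turn.
    assert!(!db.try_acquire_write_lock("notification-extension"));
    db.release_write_lock("main-app");
    assert!(db.try_acquire_write_lock("notification-extension"));
    println!("ok");
}
```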
So we made a nice high-level service that just wraps the other two services, the encryption service and the room list service. You can just build it and start it, and it will do all those things for you and implement all the best practices, and you don't have to worry about any of this. And then you can just attach listeners on that service and get information that is meaningful for rendering in a client. Now that we have a list of rooms and decrypted events, what do we do? Well, we want to display them, and we have an API for that called the Timeline API. It's basically a room view — MVC, so model-view-controller, on steroids. The thing is that in the Matrix protocol, events are actually atomic; it's an append-only database. So let's say you have a thumbs-up reaction to a message that is a response to something else: that would be two events, the reaction itself and the message itself. The timeline will aggregate all those different events into a single timeline item, which is much closer to what you want to render as a client on the screen. So it makes it much simpler to render a timeline like that. And it does a lot of things for you, too. It can handle local echoes: basically, when you're sending a message to a room, you want to show it even before the server has confirmed that it received it. So it will do that, and then reconcile the response from the server with the local state and all of that. So it's pretty sweet. And it's all observable, very reactive, so that's nice. As a user of that API, you just get a notification that one item has been added or removed or updated, and you can react accordingly. So how is this all used in Element X? We're using a Mozilla project called UniFFI. It will automatically create bindings for you for calling into Rust from other languages. At this point, we generate bindings for Swift on iOS and Kotlin on Android. It can also generate bindings for other languages, and we use that for Go, for testing purposes, I think.
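Going back to the Timeline API: the aggregation it performs can be sketched like this. This is a toy model, not the real API; names are invented. Raw room events (messages, reactions) are folded into one timeline item per message, with reactions attached to the item they relate to.

```rust
use std::collections::HashMap;

enum Event {
    Message { event_id: String, body: String },
    Reaction { relates_to: String, key: String },
}

#[derive(Debug)]
struct TimelineItem {
    body: String,
    reactions: Vec<String>,
}

fn build_timeline(events: Vec<Event>) -> Vec<TimelineItem> {
    let mut order = Vec::new();
    let mut items: HashMap<String, TimelineItem> = HashMap::new();
    for event in events {
        match event {
            Event::Message { event_id, body } => {
                order.push(event_id.clone());
                items.insert(event_id, TimelineItem { body, reactions: Vec::new() });
            }
            Event::Reaction { relates_to, key } => {
                // Aggregate the reaction into the message it targets.
                if let Some(item) = items.get_mut(&relates_to) {
                    item.reactions.push(key);
                }
            }
        }
    }
    order.into_iter().filter_map(|id| items.remove(&id)).collect()
}

fn main() {
    let timeline = build_timeline(vec![
        Event::Message { event_id: "$1".into(), body: "hello".into() },
        Event::Reaction { relates_to: "$1".into(), key: "+1".into() },
    ]);
    assert_eq!(timeline.len(), 1); // one renderable item, not two raw events
    println!("ok");
}
```

Two append-only events become one renderable item, which is exactly what a client wants to put on screen.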
UniFFI requires a bit of integration with the foreign language's runtime, and over the years we've contributed a few PRs to this project. We made it so that you can just use procedural macros for exporting your types and your impl blocks to other languages. And we also added, this year, support for async code, so you don't have to block when calling into an async function on the Rust side. It will just look like an async function on the Kotlin or Swift side, and you have actual concurrency and background processing happening, which is pretty sweet for performance. And reactive programming in Rust — how do we do it? Well, the principle of reactive programming is that you have some data and you want to make it observable, so people can subscribe to it and then get notifications. I mentioned the Timeline API, which will notify you when a new timeline item has been added, removed, et cetera. So we're using crates that we created ourselves: eyeball. There's also an extension that is diff-based, for collections, because when you have a vector with a thousand entries in it, you don't want to say, oh, now there's a new thing that has been pushed into the vector, here are all 1,001 entries of that vector. No, you just want to hear that there's a new entry, and what its position is, right? It also has some extra querying facilities. You can batch all these diff updates, so you don't have to cross the FFI language boundary too often — that has an inherent cost, some overhead, that we want to avoid. And for your batches — well, for your batches, to be quite precise, you also need transactions to say, this is the beginning of the batch, this is the end of the batch. And you can also do some filtering on these streams of events, limiting, sorting. So it kind of maps to things that you would do in SQL in general.
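The diff-based observation idea can be sketched like this. This is a minimal toy inspired by the description above, not the actual eyeball API: instead of sending subscribers a full copy of the vector on every change, only a small diff describing the change is emitted.

```rust
#[derive(Debug, PartialEq)]
enum VecDiff<T> {
    Push { value: T },
    Remove { index: usize },
}

struct ObservableVec<T: Clone> {
    items: Vec<T>,
    // Stands in for a real subscriber channel: each mutation logs one diff.
    log: Vec<VecDiff<T>>,
}

impl<T: Clone> ObservableVec<T> {
    fn new() -> Self {
        Self { items: Vec::new(), log: Vec::new() }
    }
    fn push(&mut self, value: T) {
        self.items.push(value.clone());
        self.log.push(VecDiff::Push { value });
    }
    fn remove(&mut self, index: usize) {
        self.items.remove(index);
        self.log.push(VecDiff::Remove { index });
    }
}

fn main() {
    let mut rooms = ObservableVec::new();
    rooms.push("!room-a");
    rooms.push("!room-b");
    rooms.remove(0);
    // Subscribers saw three small diffs, never a full copy of the list.
    assert_eq!(rooms.log.len(), 3);
    assert_eq!(rooms.items, vec!["!room-b"]);
    println!("ok");
}
```

Batching then just means buffering several of these diffs and handing them across the FFI boundary in one call instead of three.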
These querying facilities are pretty sweet, and that's what we're using, for instance, to filter the rooms in the room list immediately on the client side. All right. Some of the future work that we're going to do — well, I intentionally remain a bit vague here, but we're going to eventually support all the major features a Matrix client would expect. We are already working on post-quantum cryptography; as of today, I think there has been a PR against vodozemac to have something that is compatible with libsignal and with what they do. So that's pretty exciting. And there is a general theme of doing more things client-side. When you have end-to-end encryption, your server kind of becomes dumb sometimes, because it cannot peek into the encrypted events, and so you have to resolve a lot of things on the client side. If you get a new event in an encrypted room, does that trigger a notification? Well, the server has to push a notification, and it's the client that will decide whether or not it resolves into an actual notification. And even sorting the room list has to be done client-side. If you want to sort by room activity — just show me the rooms that have some activity — it's the same thing: if the event was encrypted, you don't know if it was just a thumbs-up reaction, which maybe doesn't justify putting the room at the top, or something meaningful like an actual message. So that means this task has to be done on the client now. And yeah, we're also computing the unread badges client-side in the Rust SDK. We are trying to be very careful not to get into stale-notification situations, because that's a pain for everyone, us included. And yeah, that's pretty much it. All right. Just a few things. Well, first, thanks to all the contributors of the Rust SDK, with a special shout-out to Kévin Commaille from the Fractal community.
He's done a bunch of work in the Rust SDK, including most of the support for OIDC on the client side, which was a massive PR. And if you want to be on this slide next year, you can contribute: we have a few issues that are tagged as good first issues, or help wanted, if you want. And I would like to take this opportunity also to thank Element for donating all of my work to the Matrix organization. You can also be a supporter of Matrix if you want, by following one of these two links. Thank you for listening, and I would be happy to answer any questions if you have any. The internet is asking: why have you moved away from sled? Why have we moved away from sled? That's a good question. So, in terms of performance — sled is, if I recall correctly (I wasn't there when that happened, so it's kind of hard to answer this precisely), an embedded key-value store. And the performance was not great, especially on mobile devices. And we just figured that using SQLite, which has been performance-tested and improved and tuned over the years, was the right thing to do. Also, the way you structure your data in a SQL database is quite different from the way you would structure it with a key-value store. So it's just slightly easier to perform requests when you have a SQL database, because you know all of that. Yeah. Any other question? The internet also asks: how is your developer experience when using UniFFI in general? Are there any hard edges? That's a good question. So — yes, when using UniFFI for calling Rust from other languages, have there been hard edges? Yes. There have been a few cases where we had an identified memory leak. Well, Kotlin uses the JVM, and the JVM has a garbage collector. And so we accidentally — and when I say we, I think it's the UniFFI group in general — introduced some leaks by having the equivalent of promises, or futures, leak sometimes.
So that was a problem, but usually — I would say 90% of the time it's stable. And the 10% of the time where there is an issue, it's high priority for us, because obviously it breaks our apps. So we try to fix it as quickly as possible, and we contribute back. But most of the time it works fine for Kotlin and Swift. The support and stability are also per language, I suppose, since you have to create bindings for each language. So yeah, I cannot speak for the Python or Go generators on the UniFFI side. But since Mozilla also uses UniFFI, they have to provide high stability guarantees as well, so they are pretty reactive in fixing bugs too. It's working well. Yes. I was wondering about the startup times. Yeah, so the question was: what about startup times for the Rust SDK? I think there were two questions. The first one was about just starting the SDK itself, and then, when you're syncing a list of rooms, do you get an instant response and all of that? Well, it's native code, so you don't have to boot up an entire VM for the SDK itself, so it's pretty fast. It will restore the state from the disk, so that can be a slow step. But even for users who have thousands and thousands of rooms open — and I'm looking at Matthew on the side of the room, our general benchmark runner — it's pretty fast. And for receiving a room list, we are also tracking that performance over time. Pretty much instant. And every time there is an improvement that needs to be done, we'll do it. Yeah. I mean, with the previous sync, synchronization times were about five to twenty minutes if you are a very heavyweight user of Matrix; now it's down to about three seconds. So consider that an improvement. Any other questions? Yes? What's the state of supporting extensible events in the Rust SDK? So I think that's a question for Ruma.
And since we're using Ruma for parsing the events — I'm pretty sure that the Rust type system is quite extensive, in the sense that you can have union types. And for each event that can be extended, I suppose there is a variant in that union type that says it's a custom event. If you're referring to a specific MSC, I don't know what it is, and I'm sorry about that. Was that a custom MSC, or? No. No. Okay. Just events in general. So yes, you will end up in this case where you match on this union type of the event, and it will say, well, it's something I don't know about, so I'm just handing it over to you, and you do something with it. Yes? Are there plans to use the Rust SDK on the web? I'll rephrase this question as: are there plans to use the Rust SDK for the web? Because it's not used there. So right now, as we are speaking — as of last week, people have enabled it by default for new logins on Element Web, I think, or maybe the nightly version: using the Rust cryptography for Element Web. We have a separate repository for bindings for WebAssembly, because there's no point in using UniFFI for that; we can directly compile the Rust to WebAssembly, so no need to have an intermediary in the middle. And I think the long-term goal is to use the Rust SDK everywhere, for the Element apps at least. So don't take my word for granted, but I think that this is going to happen. Yeah. Any other question? Yes? That's a very good question. So the question is: is search in scope for the Rust SDK, and what kind of features would be out of scope for the Rust SDK? To respond for search: that depends on whether you mean room search or message search, full-text search. Well, actually it doesn't depend, because the answer for both is yes; we're going to try to take care of that. For full-text search — there was a previous client made by Element called Hydrogen.
That was a web client, and it could do that: it had a fancy system to actually index the messages on your client and then share parts of the index with your other client devices. So we're probably going to reuse and reimplement some of that in the Rust SDK at some point. Yeah. In terms of what features are out of scope for the Rust SDK, it's kind of hard to tell, but I think that everything that is high-level UI related, like rendering widgets — not in the sense of the widget API, but actual UI widgets and stuff like that — is not something that we want to implement or provide. And then I think that the features that have been proven to be not very useful will probably not be implemented. It's not clear what's not in the roadmap at this point. Sorry, it's not a very satisfying answer. But yes, question here. So the question was: the Rust SDK can store a lot of data if you're listening to lots of events, and is there any way to limit the amount of data that is stored on disk? Well, as I was saying, the storage is implemented as a trait, so one could always implement a different version of the storage backend and decide to drop items at some point. One thing that we wanted to add is the ability to store events locally. And that's connected to the previous question: if you want to be able to do full-text search, you have no other choice but to decrypt all the events and store them locally, at least in memory for some time, to do the indexing. And then the indexes have to go to disk. And that means that, yeah, the size of the index can grow a lot. So we would probably have to implement some kind of garbage collection and say, well, we kind of forget about old data, older than a month or a year or something like that, and we only care about the most recent data. All right, thank you very much.
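The fallback for unknown event types mentioned in the Q&A can be sketched like this. This is illustrative only, not Ruma's actual type definitions: known events get typed variants, and anything else falls into a catch-all variant that is handed to the application raw.

```rust
enum RoomEvent {
    Message { body: String },
    Reaction { key: String },
    // Catch-all for event types the SDK doesn't know about.
    Custom { event_type: String, raw_json: String },
}

fn describe(event: &RoomEvent) -> String {
    match event {
        RoomEvent::Message { body } => format!("message: {body}"),
        RoomEvent::Reaction { key } => format!("reaction: {key}"),
        // Unknown event type: hand it over as-is and let the app decide.
        RoomEvent::Custom { event_type, .. } => format!("custom event: {event_type}"),
    }
}

fn main() {
    let event = RoomEvent::Custom {
        event_type: "org.example.poll".into(),
        raw_json: "{}".into(),
    };
    assert_eq!(describe(&event), "custom event: org.example.poll");
    println!("ok");
}
```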
A microkernel-based orchestrator for distributed Internet services?
[Inaudible.] Can you hear me? Can you hear me? Okay. Hi everyone, and thanks for being here. It's great to be able to speak here. So I'm Alex, and this is my presentation on some very high-level and pretty speculative ideas I've had for how we could use microkernels to build distributed systems and to host websites. I'm part of an association which is called Deuxfleurs, and at Deuxfleurs we have some infrastructure which looks like this. We have some very low-powered computers, like this one, which are hosted at home. So at home, of course, we have possible issues like power going down or internet being cut. So we have some machines at different locations in Belgium and France. And the idea is: okay, we have this infrastructure which is pretty fragile, but maybe we can just put all these nodes together and build a system out of it. And this is actually what the Deuxfleurs infrastructure is doing. We have email, we have websites, we have instant messaging and a few other things running on these very basic machines. So currently our infrastructure looks something like this. The idea is not to spend too much time entering into the details, but basically on the right end here we have the actual applications that we're interested in running. For instance we have Element for chat, we have Jitsi for video conferencing, CryptPad, other things. And to run all these applications we currently need this whole huge stack. It's based on a Linux OS, and NixOS for declarative configuration. And then we have this platform stack here, which is based on an orchestrator called Nomad, which we use. It's a bit like Kubernetes, but a bit simpler and, I'd say, probably easier to use. But still we have all these different components, which are basically [inaudible] storage systems. Garage is one that I'm building myself. And we basically pull these [inaudible] software.
And if we look more closely at what's happening on a single node, it's actually kind of a huge mess. So this is the operating system running on one of these machines. Here we have all these management tools. [Inaudible.] So yeah, from a conceptual point of view, these really are systems like... not to enter into too much detail. Let's say, for instance, we have internet traffic coming to our server to request some information. It's going to traverse a reverse proxy, which is going to do TLS encapsulation. Then it's going to go through an HTTP link to the actual backend, which is going to talk with specialized protocols to the storage layer. And basically we can describe all of these things with boxes, and arrows connecting these boxes. So the idea is that this model of boxes and arrows is actually the model of microkernels. Boxes are [inaudible] memory between different processes sharing the CPU time, and also controlling hardware access. So this is the fundamental thing that only the kernel can do: separate the resources of the computer, at the CPU level, between the different things that are going on. And then the microkernel will also provide some IPC mechanisms, like message passing or shared memory. [Long inaudible section.] I've made things, and connected things, very explicitly, only when they need to be. So this diagram is what's running on one node, but maybe we can include some form of network transparency to make this more into a distributed system.
[Inaudible.] There would be some impact on performance, and we would also need to be quite careful about that. Okay, so there's still time for some questions, comments, whatever. — Okay, I might have one question. The use case should always be the thing that dictates what the architecture should look like. So what do you have in mind in this area: something safety-critical, security-critical, or really just an average information system? — Yeah, we're doing this in the association, in practice. I mean, security is important because we're handling people's personal data, but I wouldn't say it's a security-critical infrastructure per se. But of course, one of the advantages of such an architecture is that security is easier to build in a robust way, because we have much more control. — Okay, thanks. So, a quite natural follow-up question; we've probably all seen it in discussions here. How do you persuade the average person to stop using their Linux distribution and start using your architecture? — I think it's going to be very long work before we can get to that point. But the hope is that this system is both more robust and easier to use, because we can probably get rid of some complexity. And if we get to a point where there's good tooling around this, where there are a lot of examples already running, and it's easy to get your own instance started, then I think we can really have something that attracts people. But yeah, of course, it's a long road before we get there. — Thank you. Any more questions or comments? I don't see anything, so thanks for the talk.
Using the NOVA Microhypervisor for Trusted Computing at Scale
Next talk is coming up. Udo Steinberg does not need a lot of introduction, especially in microkernel circles: he is the author of the NOVA microhypervisor, and I believe this talk is more of a status update. The stage is yours. — Thank you. Can everybody hear me fine? All right. So this talk is going to be about using the NOVA microhypervisor for trusted computing at scale. We will talk not so much about microkernels or microhypervisors; we will talk a little bit about scaling NOVA, and we will spend the majority of the talk on trusted computing. So the agenda: first I am going to give you a little overview of NOVA. For those of you who have not been in the microkernel devroom before, maybe a quick question: have you ever heard of NOVA before? Maybe one third of the people. I will explain a little bit what NOVA is and why it is a microhypervisor and not a microkernel. Then we look at what happened in NOVA in the last year, in 2023. The second part of the talk will be about using NOVA for trusted computing, for performing what is called a measured launch, to actually establish some trust in the platform. At the end, hopefully, we will have some time for questions. NOVA is used as the bottom piece, the green box, the microkernel that is used in the BedRock Ultravisor, which is a virtualization layer that sits underneath virtual machines. For those of you who are familiar with microkernels: the kernel is very small, and most of the operating-system functionality is implemented in a multi-server user-mode, deprivileged environment. All of these colorful boxes are actually deprivileged processes that run in user mode; they are isolated from each other and they communicate with IPC. This is what you would expect from a typical microkernel. The reason that NOVA is a microhypervisor is that it additionally provides a virtualization interface that allows you to reuse unmodified legacy operating systems in virtual machines.
NOVA basically relays all the VM exits to those yellow virtual-machine monitors, which then implement the virtualization functionality. The whole stack, all the colorful boxes, is in the process of being formally verified, and this is going to be important when we talk about trust. We will not talk so much about all these boxes; we will talk primarily about NOVA, the green kernel at the bottom, and a little bit about establishing trust between NOVA and the master controller, which is sort of the init process of the user environment. As for scaling NOVA: it originally started about 20 years ago as a research project at TU Dresden, and since then we have productized it to run on multiple architectures. On the left we have AArch64, which is ARMv8, and on the right we have the x86 architecture, primarily Intel, and we run on all these platforms and more that are listed on the slide. In the top-left corner you can see a variety of Arm SoCs, and all the ones in yellow are not using standard UEFI or ACPI interfaces, so they have proprietary builds: you get proprietary or board-specific binaries. But for some, like the Raspberry Pis or even AWS's Graviton cloud servers, the same NOVA binary works all the way from small embedded devices with just a handful of cores up to big cloud servers with, in this case, 64 cores. And we have the same in the x86 world: actually the same binary runs on all these platforms, whether it's the Atom SoCs in the top-right corner, the client platforms that you see up there, or again the largest cloud servers with over a hundred threads. So that actually required some infrastructure changes in NOVA, but before we get there: in the interest of time I'm not doing any live demos, but here you can see, or if you can't read it then look at the slides online, the output of NOVA booting on a Raspberry Pi 4 and 5.
So naturally we had an interest in making NOVA work on the Pi 5, and it just works out of the box if you use UEFI firmware. The top line, which is highlighted, shows that it's actually the same build, so the same commit ID and the same build timestamp, and you can see the differences in the cores: the Raspberry Pi 4 uses Cortex-A72 cores and the Pi 5 uses Cortex-A76 cores. And as I said, the same binary also runs in the cloud. If you take, for example, an AWS C7g metal instance, you can run that binary and it will enumerate 64 cores, Arm Neoverse cores actually, and it can also drive all of the PCI devices on the platform, in multiple PCI segment groups actually. I don't want to go into the details here. The same thing on x86, where the left side is the beginning of the log and the right side is the end of the log: we can actually run on machines with over 100 cores, with hundreds of PCI devices and tons of memory. So what did we have to do to make that work? I presented a similar thing in my talk last year, what I call an innovation timeline. We put out a new version of NOVA approximately every two months, so six releases per year, and some releases are more packed than others. About a year ago we added local APIC register virtualization and support for Atom SoCs to NOVA. But the more interesting work happened over the course of the next two releases, at the beginning of the year, where I implemented support for Intel TXT, which is Trusted Execution Technology, in NOVA. And also, to make NOVA work with really large core counts, we made the kernel memory pool extensible: the bootloader has the choice of giving NOVA little or very large amounts of memory, depending on how much a particular platform would want to use. Then in the middle of the year there were some minor adjustments to read-copy-update and capability management that we will not talk about here today.
And then at the end of last year, for the Christmas release basically, the TXT work was complete enough that we could actually extend the trust chain all the way to the master controller, this blue component in user land. And then again, for the first release of this year, which is going to come out at the end of February, you actually get even more functionality for the TPM, everything that's listed in bold, which is what I'll be talking about in this presentation. So why do we want to do something in the area of trusted computing? What problem does that solve? I mentioned in the introduction that we are formally verifying the entire Ultravisor stack. Once that is complete, you know that the source code that you have fulfills its specification, and maybe you have a qualified compiler that compiles this verified source code into some binaries. But even if you have that, things can go wrong. The binaries can be tampered with by an attacker, either during the installation process, during the boot process, or after installation, and you want to know that the binaries that you built are actually the ones that are running, or are being launched, on a computer. So you want to know that some remote computer is actually running exactly those binaries and not some modified version, before you give that computer some precious content, like your super-secret AI algorithm or some secret data. In order to understand what trusted computing and a chain of trust are, we have to look at the concept of what people commonly call secure boot. And secure boot is not a very precise term; the better term is actually verified boot. Verified boot works like this: you have some immutable root of trust, shown in green on this slide. That's the initial stage. It's immutable, and it's a root of trust because you cannot reason about its correctness: you have to assume it is correct, and it's usually implemented in ROM, which doesn't change.
And then every stage, oops, every stage measures the integrity of the next stage and verifies it against some policy. If the verification succeeds, then the next stage gets launched, and if the verification fails, then you fail the boot. This is basically establishing a transitive chain of trust, and the thing we care about, the NOVA hypervisor, is at the very end. This chain of trust only works if everybody before it gets everything right. And that's hard, because there are millions of lines of code living in all these boxes, and some of these boxes are actually very complicated and extensible. The E in UEFI actually stands for extensible. And the moment you make a change in any of those components, it could be that you add a new PCI card or you change the order of your boot devices, it changes the measurement. So keeping your databases of permitted integrity measurements, or denied measurements, up to date is hard. The industry learned this recently when UEFI was affected by the LogoFAIL vulnerability, which basically forced every vendor to deploy a new version of their UEFI firmware and to blacklist the old version in the DBX database. So it is not very flexible, and it is a very brittle thing. The green box here in the background shows that all of this stuff actually belongs to your trusted computing base: if any of these components modifies or trashes the binary, then even though you formally verified your source code, this binary is not going to do what you want it to do. So, can we do better? This is an open-source conference and we are not so much interested in DRM; we are interested in freedom. So we don't want to enforce boot policies; we want to instead use a concept called measured boot. And it works very similarly, in that a stage measures the integrity of the next stage, but then doesn't take an immediate decision on whether the next stage is good or bad.
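The verified-boot chain described above can be reduced to a few lines. This is a deliberately simplified sketch, not real firmware logic: the stage images and the allow-list policy are invented, and each "stage" here just hashes the next image and refuses to continue on a mismatch:

```python
import hashlib

# Invented stage images; in reality these would be firmware, bootloader,
# and hypervisor binaries.
stages = [b"uefi-image", b"bootloader-image", b"nova-image"]
# Policy database of permitted integrity measurements (e.g. UEFI's db).
allowed = {hashlib.sha256(s).hexdigest() for s in stages}

def verified_boot(images, policy):
    """Each stage verifies the next against the policy before launching it."""
    for image in images:
        digest = hashlib.sha256(image).hexdigest()
        if digest not in policy:
            return "boot failed"      # verification failure halts the boot
    return "launched"

print(verified_boot(stages, allowed))               # launched
print(verified_boot([b"tampered-image"], allowed))  # boot failed
```

The brittleness the talk describes is visible even here: any change to any image, however benign, produces a new digest, so the `allowed` database must be updated in lockstep with every component.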
Instead, this measurement simply gets extended into a TPM platform configuration register, which stores the value for a later attestation request, and then the next stage gets executed. There is still the problem that certain stages, like UEFI and the bootloader, are extensible, and that they leave a very hard-to-manage gap in this trust chain. But there is also the problem that typically the whole boot process is not protected against DMA. These components do not make use of an IOMMU or SMMU, which means that even if the software is correct, you could have a USB device or some FireWire device, some DMA-capable device, that simply DMAs into this memory and trashes the software that way. So again, the trusted computing base isn't really getting any smaller. Can we do better than that? Yes, we can, and this extends the concept of measured launch with a dynamic root of trust. The core idea is that you can't really change anything in this boot chain: you still have to execute all the firmware, you still have to load all your firmware drivers, you still have to make a boot choice in the bootloader, and you still have to initialize your memory controllers. But you can do all of this in a dirty environment. A dynamic root of trust lets the system boot into an initially untrustworthy state: we don't really care what happens up to this point, and only at this point do we want to bring the platform into a pristine state. And it is very interesting how this works, because the effect of what you do here, represented by this green bolt, is a disruptive event which feels a bit like a platform reset, but it doesn't reboot the machine. It just brings the CPU into a well-defined state (it's actually protected mode with paging turned off) and it halts all the other cores; you'll see that in a moment. And it forces the execution after this launch event onto a code path which has previously been measured and protected.
So we don't care about all this stuff in the red box anymore. It gets eliminated from the TCB, which is great, because it eliminates millions of instructions, and our TCB is now just this DRTM sequence plus NOVA. So what do we need for that? The technology that gives us this on Intel platforms is called Intel TXT. You may also come across the acronym CBnT, which is short for Converged Boot Guard and TXT: Intel has fused the static root of trust, which is Boot Guard, with the dynamic root of trust, which is TXT, into one technology. And TXT is the one we care about; it gives us the dynamic launch. You need a CPU that supports this, you need a TXT-capable chipset, and a TPM, preferably TPM 2.0, because TPM 1.2 is really old and can only do deprecated hash algorithms. And you need a SINIT module which matches your platform. The purpose of this SINIT module, a module that Intel provides and that you can download from their website, is to initialize and verify the platform so that it is securely configured. Once you do this, you can later do a remote attestation by asking the TPM what the measurements in all the platform configuration registers are. And then you can remotely take a trust decision: if this PCR contains some value, do I recognize this value as belonging to NOVA's December release or NOVA's February release? And who knows why there is this road sign of La Grande here? Intel develops all its technologies under code names, and the code name for Intel TXT many years ago used to be LaGrande Technology, named after a city in eastern Oregon. So what happens when you do this disruptive event? How does it reset the platform without rebooting it? It's very interesting. First of all, we have a number of processors; this slide shows four, so these are four lanes. And we have one processor, which we call the initiating logical processor; that's the one which initiates the DRTM sequence.
And we have, in this case, three responding logical processors, which may be in some arbitrary state; we don't know. They could be sitting in some idle loop, they could be executing malicious code; we simply don't know what they do at this point. But we also don't care. Some time before the disruptive launch event, the code for NOVA, which in this case is called the MLE, the measured launch environment, and the SINIT ACM must have been loaded into memory. And again, they could have been corrupted in memory, it could be the wrong version; we don't take a decision there. Then an arbitrary amount of time can pass: minutes, hours, we can do this a week later, it doesn't matter. At some point, some component executes this dynamic launch, GETSEC[SENTER], which is a specific privileged processor instruction, and what happens when you execute it is that everything resets. The chipset broadcasts an SENTER cycle on the interconnect, and the SENTER cycle basically puts all the responding processors into an SENTER sleep state. So we now know that all the other processors are not executing any instruction; they are sleeping. It then transitions control to this SINIT ACM, and the processor checks its integrity: it has a signature and a cryptographic hash, so the processor validates that this module is a valid Intel SINIT ACM, and it launches it. This module runs entirely inside the cache. It doesn't use any memory, because the memory might have been initialized wrong; the memory might have physical memory aliasing, where two physical addresses point to the same page. So this operates in a very constrained environment, but it is software that can validate that your platform is correct: that the processors are not overclocked, that there's no undervolting, that all the chipset registers that need locking are locked, and so forth.
And the final thing this module does, when it has convinced itself that the platform is in a good state, is measure and launch NOVA. It stores the measurement of NOVA in TPM PCR 17, and then NOVA gets control at its measured entry point. At some point later, after it has initialized enough of itself, it can rendezvous the other processors into the secure environment, so that by the time we get to the end of this, all four cores, or 128 cores, are in this measured environment. And should anything go wrong during this process, like a rogue CPU showing up that nobody knew about, or a CPU surprisingly leaving this environment, then the platform transitions into a TXT shutdown, which effectively resets the platform. So now let's talk a little bit about the TPM, because what we want to do is measure the next stage into a platform configuration register, a PCR. Whenever we measure a component, what we really mean is: we have a region of that component, of that image, that we care about and that doesn't change. You can call it an immutable region, which is typically the code and the read-only data. And you compute a cryptographic hash over it, like a SHA-1 or SHA-2 cryptographic hash; in the case of SHA-256 you get a value that is 256 bits long, a large number like that. The measuring entity then issues a command to the TPM. The TPM is a little chip, like the one shown up here, that sits on your motherboard, and in the typical case of a client platform it has 24 platform configuration registers. The measuring entity invokes an operation on the TPM that's called PCR extend. And the PCR extend operation is interesting in the sense that you can't write to a PCR directly. You can only extend a new value into a PCR, and what that does is take the existing value, concatenate it with the new value, and hash the concatenation. And this forms the new value of the PCR.
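The extend rule just described is small enough to sketch directly. This is an illustrative model of the TPM 2.0 SHA-256 PCR bank, not TPM driver code; the measured inputs are invented stand-ins for NOVA's image and command line:

```python
import hashlib

def pcr_extend(pcr: bytes, digest: bytes) -> bytes:
    """new_pcr = SHA-256(old_pcr || digest); a PCR is never written directly."""
    return hashlib.sha256(pcr + digest).digest()

pcr17 = b"\x00" * 32                              # PCR reset value (SHA-256 bank)
m1 = hashlib.sha256(b"nova-image").digest()       # measurement of the MLE (invented)
m2 = hashlib.sha256(b"nova command line").digest()  # measurement of the cmdline (invented)

pcr17 = pcr_extend(pcr17, m1)
pcr17 = pcr_extend(pcr17, m2)
# Both the values and their order are folded into the result: extending
# m2 before m1 would yield a different PCR value.
```

This folding is exactly why the next part of the talk needs an event log: the final PCR value alone cannot be decomposed back into the individual measurements.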
So the sequence in which you extend values into the TPM, and the values themselves, are all reflected in the hash; basically everything gets mixed together. And once you look at the PCR, you can read the value, but you can no longer recompute the original chain of extend operations that led to this PCR value, simply because the hash function is a one-way function. So how does a remote verifier work? You can ask the TPM for a quote: you go to the TPM and say, give me the values of those PCRs that I care about, and have the TPM sign that quote report so you know it's authentic. You can send that off to some other computer elsewhere, and it can look at all the PCRs and say: okay, if this PCR has a value that I recognize, then the platform has launched authentic software. But how do you know, if multiple extend operations have happened on a PCR, what the individual values were? Because the individual values represent the individual software components. For that you need the left side of this picture, where, in addition to extending a measurement into the TPM, it also gets stored in what's called a crypto-agile event log. This is effectively an auditable trace, a record of all the extend operations that happened. In addition to recording which PCR was extended and what the digest, the extended measurement, was, there is also some event metadata that says: the meaning of this extend operation is that I hashed the command line, or I hashed the RAM disk, or whatever it may be. So you have to send both of these things, the TPM quote and the crypto-agile event log, to a remote verifier, and it can correlate the two: it can use the event log for a particular PCR to recompute the value of that PCR. Then it checks whether the quote from the TPM lists exactly that PCR value, and whether it has been signed with an authentic TPM signature. And then you know what platform and what software is running on the platform.
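The verifier's side of this, replaying the event log and comparing against the quoted PCR, can be sketched as follows. This is a toy model, not the TCG log format: the log entries are invented, and `quoted` stands in for the PCR 17 value reported in a signed TPM quote:

```python
import hashlib

def pcr_extend(pcr, digest):
    return hashlib.sha256(pcr + digest).digest()

# Invented crypto-agile event log: which PCR, which digest, and metadata
# describing what was measured.
event_log = [
    {"pcr": 17, "digest": hashlib.sha256(b"nova image").digest(), "event": "MLE"},
    {"pcr": 17, "digest": hashlib.sha256(b"command line").digest(), "event": "cmdline"},
]

def replay(log, pcr_index):
    """Recompute a PCR by replaying the recorded extend operations."""
    value = b"\x00" * 32                    # PCR reset value
    for entry in log:
        if entry["pcr"] == pcr_index:
            value = pcr_extend(value, entry["digest"])
    return value

# Stand-in for the PCR 17 value from the signed quote:
quoted = pcr_extend(pcr_extend(b"\x00" * 32, event_log[0]["digest"]),
                    event_log[1]["digest"])

print(replay(event_log, 17) == quoted)      # True: the log explains the quote
```

If an attacker edits any log entry, the replayed value no longer matches the signed quote, which is why the log itself need not be trusted, only correlated.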
So I said the SINIT module measures the integrity of NOVA. And NOVA is a kernel that consists of, it's an ELF image that consists of, code and read-only data, but also some mutable data and some heap. Not all of it is immutable in the sense that it doesn't change. And while some people may think it's sufficient to do an integrity measurement just at launch time, when you boot the system, that's not the full truth, because you can also close the lid on your laptop, which will basically shut everything down and only keep the contents of memory alive. And when you resume the laptop, all your protections are gone. So on a suspend/resume cycle you actually have to repeat this integrity measurement, and by then this yellow section has actually changed. So not everything can be measured. How does NOVA tell the SINIT module which region of itself to measure? There is this MLE header, which enumerates the memory pages that NOVA wants to have measured. And NOVA is actually the entity that initiates the launch process. There's no bootloader that says "launch NOVA"; no, NOVA gets launched in this dirty environment and then decides itself to launch its second stage, and thereby tells the SINIT module what the to-be-measured region is. But before it actually gets measured, the SINIT module DMA-protects this entire region, so from the moment it is protected, no attacker can change it anymore, not even with a DMA attack. Then it gets measured, the measured value gets extended into TPM PCR 17, and then NOVA gets launched. And there are also some TXT heap data structures that NOVA's preamble code and NOVA's post-launch code use to exchange data via the TXT heap. One of the things, for example, that the SINIT module produces and stores in this TXT heap is information about how many processors really exist. And it also stores validated copies of ACPI tables there, so that no IOMMUs get hidden, or whatever.
When you write software like this that you want to measure, you have to think carefully about what should be included in the measurement versus what should be excluded. If you measure too little, then maybe something can be changed in a security-relevant manner and it will not be reflected in the hash. And the thing that immediately comes to mind: let's say you have command-line parameters, and NOVA has a few, among them one where you can say "don't turn on the IOMMU". This is basically a chicken bit for debugging, and when you execute NOVA with this command-line parameter, it's obviously less than fully secure. So you want that configuration change to definitely be reflected in the hash, so that the NOVA version that runs insecurely can be told apart from the normal version, which uses the IOMMU to its full potential. So the command line must be included in the hash. But if you have some data structures that, say, hold timestamps, you don't want to take them into the hash, because the hash would change the moment the timestamp changes. So this needs very careful consideration. And then the next question is: if you have a binary like that, built by a compiler that obviously emits instruction sequences of its choice, how do you know what integrity measurement to expect? You need some form of reference measurement, and when you run NOVA's build infrastructure and build a binary, at the end of the build process it will output all the reference integrity measurements. So it will say: the SHA-1 value for this binary is this, the SHA-256 value is this, and the SHA-512 value is this, and then you know what value to expect when you make a trust decision. So, extending this to user mode, what does it require? It requires NOVA to compute a launch integrity measurement of the root PD, which means we have to define what the to-be-measured region of the master controller, this root PD, is, and we have to actually do the hashing.
And for that we can do two things. We can either send the whole data over the LPC or SPI bus to the TPM and let the TPM compute the integrity measurement, or we can compute it in NOVA, in software, using the CPU. I originally thought that using the TPM would be a good idea, because the TPM automatically does it for all supported hash algorithms. But as you can see on the right side, the TPM is really, really slow, and the bus that connects the TPM to the system is also very slow. So to hash a binary two megabytes in size, the TPM actually takes almost 14 seconds, whereas NOVA takes 15 milliseconds plus two: the 15 are for computing the hash and the two are for extending the PCR. And then obviously NOVA needs to drive the TPM itself, because it needs to send commands to the TPM, and NOVA needs to append the entry to the event log, so all of that infrastructure had to be added. We actually have to measure the root PD before we launch it, because we can't have a process executing some, let's say, malicious instructions, and then saying: after I've done my malicious action, I'm changing my image to look innocuous, and now measure me. Then it would look correct even though it has executed something malicious. So before you even execute the first instruction of the next module, you have to measure it. The root PD therefore cannot tell NOVA which part of it to measure, so how do we define this? It's simple: we can actually use the ELF headers, the program headers, in the root PD, and we defined it as the first ELF segment that is readable or executable but not writable. That's the one that contains code and read-only data; that's the one we measure. And then NOVA obviously had to learn how to compute SHA-1 and SHA-256 and SHA-384 and SHA-512, basically the entire NIST FIPS 180 standard.
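The segment-selection rule just stated (first program header that is readable or executable but not writable) can be sketched without a real ELF parser. The segment table below is invented; the flag bits are the standard ELF `PF_X`/`PF_W`/`PF_R` values:

```python
# Standard ELF program-header flag bits.
PF_X, PF_W, PF_R = 1, 2, 4

# Invented program headers of a hypothetical root PD image.
segments = [
    {"flags": PF_R | PF_X, "vaddr": 0x400000, "filesz": 0x8000},  # text + rodata
    {"flags": PF_R | PF_W, "vaddr": 0x600000, "filesz": 0x2000},  # mutable data
]

def measured_region(phdrs):
    """First segment that is readable or executable but not writable:
    the immutable code and read-only data, suitable for measurement."""
    for ph in phdrs:
        if (ph["flags"] & (PF_R | PF_X)) and not (ph["flags"] & PF_W):
            return ph
    return None

region = measured_region(segments)
print(hex(region["vaddr"]))   # 0x400000
```

The writable data segment is deliberately excluded for the reason given earlier: its contents legitimately change at runtime, so including it would make the reference measurement useless.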
And that looked very complicated, but thanks to the beauty of C++ templates, function overloading and inheritance, the implementation of all these hash functions in NOVA is actually just 130 lines, and it can do all of these algorithms. So that brings me almost to the end of my talk. The last thing we had to add to NOVA late last year was support for the TPM. The TPM has two interfaces: the older FIFO interface, and the newer command-response buffer interface, and NOVA had to understand how to drive both; that adds another 250 lines of code. And then you have to send commands across that interface, and the TPM library specification is very large, thousands of pages, but NOVA only implements the subset of TPM commands that it needs for this measured launch: determining what capabilities the TPM has, how many PCRs, what algorithms, and then performing some PCR operations. That adds about another 500 lines of code, but covers both the old TPM 1.2 and the newer TPM 2.0. And then there is a table that lists how the TPM actually gets used by various parts of the platform. The TPM has different localities. Locality 4 belongs to the core root of trust for measurement: the SINIT ACM measures the next stage, which is NOVA, into PCR 17. Then NOVA, which drives the TPM at locality 3, measures the next stage, the master controller, into PCR 19, and the root PD measures the next component further up the stack into PCR 20. So this is the list of all the cool security technologies that we have in NOVA now, ranging from control-flow enforcement to total memory encryption with multiple keys, and the latest thing we added, which we just discussed, is Trusted Execution Technology and attestation.
So with that, thank you for listening, and I'm happy to take questions.
Is Toro unikernel faster for MPI?
Okay, if I may have your attention again, it's time for the next talk, by Matias Vara Larsen, this time about his unikernel and how it can run MPI code faster. The stage is yours. — So hello everyone, can you hear me well? Okay, thank you. I'm Matias Vara Larsen, and in this presentation I'm going to talk about deploying MPI applications using the Toro unikernel. This is exploratory work, an area I am still investigating, so at the end of the presentation feel free to ask me any questions, because I'm still benchmarking things and I'm not quite sure yet where I'm going. First I would like to introduce myself. I'm fascinated by operating-system development and virtualization, and I have been working at these companies. This is my email, and I have a profile if you want to get in touch or see some of my projects. This project is not related to my current work; it's something I do when I have some free time. I would like to start with my intuition of what an MPI application is. I am not an expert on MPI, so this is what I have understood in the two years I have been working on this. It is an application that is compiled together with an implementation of the MPI standard; there exist several implementations of the MPI standard. The standard defines a set of APIs to synchronize and communicate parallel instances of the MPI application. So, for example, we have APIs like MPI_Barrier, MPI_Bcast and MPI_Allreduce, to cite some of them. My impression is that only performance matters when we deploy MPI applications, and I have the feeling that virtualization is not very popular in HPC, at least that is my impression, because of the overhead it adds.
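For readers less familiar with the collectives named above, the semantics of an all-reduce can be sketched without any MPI library at all. This is an illustrative model only (ranks simulated as a list of per-process values), not how OpenMPI or Toro implement it:

```python
def allreduce_sum(per_rank_values):
    """MPI_Allreduce with a sum: every rank contributes a value, and
    every rank receives the reduction of all contributions."""
    total = sum(per_rank_values)
    return [total] * len(per_rank_values)

# Four simulated ranks contributing 1, 2, 3 and 4:
print(allreduce_sum([1, 2, 3, 4]))   # [10, 10, 10, 10]
```

The point of the talk is that, on a unikernel, such collectives can ride on cheap intra-machine communication instead of going through syscalls and the network stack.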
So my thought was that MPI applications may benefit from unikernels, because, for example, syscalls are expensive, and in a unikernel we remove them: we have plain function calls. Threads are cheaper than processes: as you may know, we are not switching the page directory every time we do a context switch in a unikernel. Depending on your application, you can completely remove the scheduler, because you are going to run only one thread per core, or something like this. You can rely on communication over shared memory, for example, in the case of unikernels. And, this is something I just added, unikernels sometimes perform better than a general-purpose operating system, I guess, and I say this because you can sometimes tweak the operating system to reach good performance, let's say. So, well, this is the diagram of the components that are involved when you deploy an MPI application using a general-purpose operating system. In this case I am assuming that the MPI application is running in a virtual machine, but the diagram is more or less the same if it is bare metal. What we have is your MPI application; it is compiled with an implementation of the MPI standard, for example OpenMPI, and OpenMPI uses syscalls to communicate with the operating system to get services like scheduling, the filesystem, networking and so on. So what unikernels propose is: well, let's take a look at the data. [Portion inaudible.] About the scheduler: the scheduler in Toro is quite simple, and, well, in a sense there is no scheduler; it is the way that Toro creates threads. You have a dedicated API called BeginThread, but it takes a parameter that tells where the instance is going to run, so you have to set which core you want that function to run on. Otherwise it always chooses the booting core.
The scheduler is quite simple; it is a cooperative scheduler. The thread does some work and then calls a thread-switch primitive, which invokes the scheduler. Each scheduler is independent of the others, so there is no communication between the instances: each core's scheduler is completely independent, and the algorithm is quite simple, it just chooses the next thread that is ready, no more than that. The idea behind this was to have instances of the kernel that don't require any mechanism to synchronize between them, so there is no spinlock or anything like it, and all access to kernel data is lock-free. I just talked about the scheduler; now I am going to talk about memory. In Toro, memory is also dedicated: when the kernel initializes, it splits memory into regions, and then all allocations come from the region belonging to the given core. The splitting is quite simple, it just splits by the number of cores, so if you have two cores you get two regions, three cores three regions, and for the moment this algorithm is simplistic and could surely be improved. Since each region is assigned to a different core, the way we implement the memory allocator doesn't require any synchronization between the cores, keeping the same idea that each instance runs independently of the others. So when a thread has to allocate memory, it always comes from the same region, again without any synchronization between the cores. The idea behind this is also to try to leverage NUMA technologies, where you have non-uniform memory and faster access to some regions. In general, all kernel data in Toro lives in per-CPU variables, which means no synchronization between the cores is required to access kernel data.
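As a loose sketch of the cooperative, per-core scheduling he describes, here is a toy Python round-robin scheduler where each "thread" is a generator that explicitly yields, standing in for Toro's thread-switch call. All names here are illustrative, not Toro's actual API:

```python
from collections import deque

def run_core(threads):
    """Toy cooperative scheduler for one core: pick the next ready
    thread and run it until it yields (a cooperative switch) or ends.
    No locks are needed because this core owns its own ready queue."""
    ready = deque(threads)
    trace = []
    while ready:
        t = ready.popleft()
        try:
            trace.append(next(t))   # run until the thread 'switches'
            ready.append(t)         # still ready: requeue it
        except StopIteration:
            pass                    # thread finished
    return trace

def worker(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"         # stand-in for a thread-switch call

print(run_core([worker("a", 2), worker("b", 2)]))
# round-robin order: ['a:0', 'b:0', 'a:1', 'b:1']
```

Because each core runs its own independent copy of such a loop over core-local data, no cross-core synchronization is needed, which is the property the talk emphasizes.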
We also get faster access to these per-CPU variables by using the GS register: we have a table, and access is faster through the GS register, which points to that table. This is an improvement that I did a couple of months ago; I don't remember the exact mechanics, but I wrote a blog post about it. And all the access is lock-free. The only moment we require synchronization between the cores is when we want to, for example, create a thread from one core on another; we need to synchronize the cores somehow to migrate a thread, something like this, but that is the only moment we need it. Otherwise all the instances are completely independent. To end the principles of Toro, I'll talk a bit about core-to-core communication. Even though as a user you can implement anything you want on shared memory, I decided to implement virtio over shared memory, so each core has a set of virtqueues that allow it to get data from a remote core and send data to another core. It was partly just for fun, to see if I could implement virtio like this. The idea is that the communication is core-to-core, so we don't have only one queue per core: you have as many virtqueues as you need to communicate one-to-one between each pair of cores. This means you don't require any protection to send or to keep exclusive access to these virtqueues, because each one has only one consumer and one producer. Relying on this mechanism, I could then implement MPI APIs like MPI_Gather, MPI_Bcast, and MPI_Scatter, which are functions that require communication between the cores, from the root core to the other cores and so on. Now I'll talk a bit about the benchmarks I have been doing. Feel free to comment on this, because I'm not really sure about the numbers I'm getting.
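The single-producer/single-consumer property he describes, one dedicated queue per ordered pair of cores so that no locking is needed, can be sketched in Python like this (the naming is mine, not Toro's, and a real implementation would use lock-free ring buffers in shared memory):

```python
from collections import deque

class Fabric:
    """One queue per ordered (sender, receiver) pair of cores.
    With exactly one producer and one consumer per queue, a real
    SPSC ring buffer needs no locking; deque stands in here."""
    def __init__(self, ncores):
        self.q = {(s, r): deque()
                  for s in range(ncores) for r in range(ncores) if s != r}

    def send(self, src, dst, msg):
        self.q[(src, dst)].append(msg)

    def recv(self, src, dst):
        return self.q[(src, dst)].popleft()

    def bcast(self, root, msg, ncores):
        """MPI_Bcast-like: root pushes msg to every other core's queue."""
        for dst in range(ncores):
            if dst != root:
                self.send(root, dst, msg)

f = Fabric(3)
f.bcast(0, "hello", 3)
print(f.recv(0, 1), f.recv(0, 2))   # both cores receive "hello"
```

Collectives like gather and scatter reduce to the same pattern, with the root core draining or filling its pairwise queues.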
What I did was to choose a set of well-known benchmarks, the OSU micro-benchmarks, which are used to benchmark different implementations of the MPI standard. I picked two of them, osu_barrier and osu_allreduce, which just stress one function each. osu_barrier stresses the MPI_Barrier function, which synchronizes the instances of an MPI application; it's just a software barrier, let's say. osu_allreduce stresses the MPI_Allreduce function, which sends a vector to the root core, processes something, and sends the result back to the other cores or instances. I compared with Linux bare metal and Linux in a VM. I picked a machine from Equinix, an AMD EPYC with 24 cores and 64 GB of RAM, and the host I used for the VM was Ubuntu with isolated cores. KVM/QEMU was the hypervisor; the host was Ubuntu and the guest was Fedora 38. In this particular case I used a huge VM with 16 cores. Maybe that's not the most common case; MPI people usually have several nodes instead of putting everything on the same one, but I was trying to play with this, so I decided to use one huge VM, let's say, and compare it with Toro. This is how I launch the benchmark, using 16 threads for example. I'm not an expert in MPI, and I'm not really sure whether mpirun is really using one core per thread; it would not be optimal otherwise, I think. I ran it for 1000 iterations, and these are the results for Linux bare metal. No, Linux in a VM, sorry. These are the numbers for osu_barrier. You can see there is quite a huge difference between the Linux VM and the unikernel.
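The osu_barrier measurement pattern, timing N iterations of a barrier and reporting the mean latency, can be imitated in pure Python with threading.Barrier. This is just the shape of the benchmark loop, not the OSU code, and absolute numbers from Python threads are of course not comparable to MPI:

```python
import threading
import time

def barrier_bench(nthreads=4, iters=200):
    """Time `iters` barrier crossings per thread, osu_barrier style."""
    bar = threading.Barrier(nthreads)
    results = [0.0] * nthreads

    def body(rank):
        t0 = time.perf_counter()
        for _ in range(iters):
            bar.wait()              # all threads rendezvous here
        results[rank] = (time.perf_counter() - t0) / iters

    ts = [threading.Thread(target=body, args=(r,)) for r in range(nthreads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return max(results)             # worst-rank mean latency, in seconds

lat = barrier_bench()
print(f"mean barrier latency: {lat * 1e6:.1f} us")
```

Reporting the slowest rank's mean is one common convention; the OSU suite reports average latency across ranks, so this is a simplification.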
But I still have to redo these numbers, I'm not really sure about them, because there is at least one order of magnitude of difference. At the beginning I was interested in comparing with Linux bare metal, because I think we can achieve something similar in a VM, but when I started to play with the Linux VM I saw there was already a huge difference with the VM. I was also comparing osu_allreduce, as I said before, with that size of vector, and again there is quite a huge difference with the unikernel. In both cases it's 16 cores, in the VM and in the unikernel too. And I think that's all about the benchmarks. While getting these numbers I also ran into some issues; I don't know if you have measured things in VMs, but in particular in KVM the TSC register is not fully emulated, so you have to be careful when you use it. When you are taking measurements, you have to check that the TSC is still reliable, because if you just take the difference it's not always going to work, so you have to be careful about that. That's all, I think. The question was why I'm not doing communication between VMs using this implementation; basically this implementation can only run on a single node, but people are using MPI on clusters with tens or hundreds or thousands of nodes. Why, and do I have any plans to extend that? Well, I'm thinking about it, because it's not the first time this has been mentioned. Maybe create an interface, I mean use virtio-net or virtio-vsock to communicate with other instances, so you would have multiple VMs running this. Maybe I will do it soon; for the moment I'm not really worried about it. Other questions? Which MPI implementation am I implementing? Because there are different implementations, OpenMPI or MPICH and so on; which one am I based on? I'm not really sure, because what I'm doing is just trying to reproduce the semantics of MPI.
I'm trying to implement it in code. The set of functions I'm implementing is based on what the benchmark needs; that's all, no more than that. Do I have numbers when I increase the number of nodes, how does that behave? I'm still working on those numbers. The difference is still there between the VM and the Linux implementation, still a difference in the sense that it's faster, let's say, but I'm still producing those numbers too. Do I understand why we have that difference? I don't know. There are a lot of ways to tweak Linux to make it more performant; maybe I'm lacking that, and if you tweak the configuration you might dramatically reduce that difference. I'm not really sure where it's coming from, but as I said before, these are still numbers that I'm working on. Okay, I think we are running out of time, so thanks again for the talk. Thank you. We have a short break for five minutes, and after that we will have the next talk. Thank you.
News from the Hermit Crab — From Soundness Foundations to GPU Virtualization
Go Martin, go! Okay, I guess. So let's get this started. Okay, thanks everyone for coming. I'm Martin from RWTH Aachen University, and I'll talk about the Hermit operating system. I'm here together with my colleague Jonathan, and a few students are also scattered around the room. Yeah, let's get started. These are the things that I'll talk about today. First, a general introduction to Hermit and unikernels, although if you've been in this room for the past few hours, you already know some of that. Then I'll cover some arguably interesting internals, structurally, and then talk about two applications, namely GPU virtualization using Cricket, and application and kernel profiling. Okay, we've been through this a few times now, but let's go through it again. Compared to a standard VM, where we have the hardware and a host operating system, which might also be missing if we have a type-1 hypervisor, and a hypervisor, we have this virtual machine. And this virtual machine runs a virtual machine image, which is just a full-blown operating system with its own guest kernel, user space, and everything else. Then we've also talked about containers before, which throw away the guest kernel and really try to minimize the image for the application. And we have unikernels, which then run in virtual machines again, but inside the unikernel everything is packed together as tightly as possible: the application, some user-provided libraries, and the library operating system, all statically linked together. What this gives us is an image that we can really specialize to the use case at hand. That means for the environment, namely the hypervisor, and for the application itself and what it should do. This leads to tiny images, only a few megabytes in size for Hello World, for example.
And since we only have one process in this whole unikernel image, we don't need any isolation between this process and other processes or the kernel. That means we can do this as a single-address-space operating system without any costly address-space context switches. We can run everything at kernel privilege level, have no privileged context switches, and system calls then become plain function calls. And that's pretty cool. Enter the Hermit operating system. As you can probably guess from the logo, Hermit is written in Rust, 100%... well, not 100%, but there's no C in there, at least, only Rust and a bit of assembly, of course. We mainly target Rust applications, too, so we have an official tier-3 Rust target that Rust applications can use. But we also have a GCC and newlib fork if you really want to run C applications, though that's not our primary focus. We have multi-core support, we are easily configurable, and we can now also compile on Windows. We also support stable Rust nowadays, through our own distribution of the Rust standard library, which you can check out here. Okay, let's talk about platform support. Once we have this image, seen on the left, with the application, standard library, newlib, and the kernel, we can run it on our own hypervisor, for example. uhyve is a specialized Hermit hypervisor dedicated to running Hermit unikernel images, which is Jonathan's focus. Its main target is Linux KVM on x86, though there's also some degree of support for macOS on both x86 and ARM. Also upcoming, though not yet merged, is Linux KVM support for RISC-V, which is something that Simon, Philipp, sorry, worked on. We can also target generic VMs through our Hermit loader, which then chain-loads the Hermit ELF image. We support Multiboot on x86, we support Firecracker, and there's also UEFI work going on, which will hopefully be there soon.
For ARM and RISC-V, we use the Linux boot protocol to be able to run on things like QEMU. Okay, that's all you need to know if you want to use Hermit. Let's take a look inside. This is the same unikernel image again, but from a different point of view now. The left stack is the application stack: the application, some user-defined libraries, Rust crates in this case, and the core crates of the Rust toolchain itself, so std, alloc, and core. On the right side, we have the Hermit kernel, which depends on some crates as well, plus alloc and core. These two things are compiled for different targets, though, because we don't want to use any floating-point operations in the kernel target, since that's costly to switch between. The user code is compiled for a special Hermit target, which does have floating-point support and also tells the Rust standard library how to communicate with the Hermit kernel. Together with the Hermit kernel, but compiled for the user target, we also provide some intrinsics such as libm for math functions, or mem intrinsics for things like memcpy, which really benefit from having this floating-point support available. One thing that I personally worked on a lot is soundness foundations; you can see unsafe and safe Rust on the right. We published a paper on that, called "On the Challenges of Sound Code for Operating Systems", and what this basically aims for is to make the Hermit target sound. That means any safety reasoning must not require context. That's extremely important, and the history behind it is that Hermit was once written in C, without much strictness around the locality of this kind of reasoning, and we put a lot of work into going forward and migrating to a more Rust-like approach here. One thing that came out of this is hermit-sync, a collection of synchronization primitives used inside the Hermit kernel.
Most of these are also independently published as single crates and re-exported through this crate, so you can also pick whatever you like for your own project. Another thing is count-unsafe, which you can use to count the amount of unsafe code inside your Rust project; we use it to analyze our progress there. The next thing I want to talk about is our evolving network stack. Originally, it was just a user-side thing: Rust applications would compile in smoltcp, a Rust network stack, and C applications would use lwIP, as Unikraft does. In 2022, we moved that from user space into kernel space, which is not that meaningful since everything is kernel space, actually, but we moved it into the distribution of the kernel. Then we implemented support for BSD-style sockets, because before we had a custom-made API for networking, and now we want to standardize and adopt these things, because that allows us to throw away the entire user-space network stack; both C applications and Rust applications can then use the kernel-provided smoltcp network stack. In 2024, we are going for poll support for async I/O, which would enable us to run a whole bunch of Rust networking applications, which usually run on Tokio or something like that, and work on this is already well underway. Okay, then let's talk about the two application-focused things. First, GPU virtualization with Cricket. A short introduction to Cricket, which is another project developed at our institute, ACS. It's basically just plugging networking in between some API. Classical CUDA GPU applications work as seen on top, where we have this CUDA app that calls CUDA APIs, a library from NVIDIA, which then performs the actual computations on the GPU. With Cricket, we plug a Cricket client next to the app and a server next to the CUDA APIs, and then just tunnel through all requests and answers.
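The interposition idea Cricket uses, a client stub that looks like the original API but forwards each call to a server holding the real implementation, can be sketched in a few lines of Python. The names here are mine, and real Cricket speaks ONC RPC over the network rather than calling an in-process object:

```python
class Server:
    """Holds the 'real' implementations (stand-in for the CUDA side)."""
    def __init__(self, impls):
        self.impls = impls

    def handle(self, name, args):
        return self.impls[name](*args)

class ClientStub:
    """Looks like the original API; forwards every call to the server."""
    def __init__(self, server):
        self.server = server
        self.log = []                      # monitoring comes for free

    def __getattr__(self, name):
        def call(*args):
            self.log.append(name)          # observe the call...
            return self.server.handle(name, args)  # ...then tunnel it
        return call

gpu = Server({"vector_add": lambda a, b: [x + y for x, y in zip(a, b)]})
api = ClientStub(gpu)
print(api.vector_add([1, 2], [3, 4]))   # routed through the 'server'
print(api.log)                          # the stub saw the call go by
```

Once every call flows through the stub, the server can live on another node, which is exactly what enables the remote execution, scheduling, and monitoring the talk mentions.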
That separates these two things, and we can move them wherever we want and control what's happening there. And we found it's not that high of an overhead. We can then use this for remote execution, scheduling, or monitoring of GPU applications, as seen here: we can have several nodes with virtual GPUs, which then run on another node for computation. We then adapted Cricket for unikernels, and published a paper on that. How we did this: Cricket is based on ONC RPCs, which came out of Sun way back when, and the reference implementation is old and complex and uses Linux-specific networking features, so it wasn't easy to port to our Rust toolchain, for example. As you can already guess, we ported it to Rust. Our user code is then run inside the unikernel, and only the server part serving the GPU is not run inside the unikernel. We did this for Hermit and Unikraft. For Unikraft we had to develop Rust application support first, but we did that and now it's working fine. The last topic that I want to talk about is application and kernel profiling. It's a project that has been dormant for a while, but we are reawakening it, getting it up to date and working again. It's called rftrace, for Rust Function Tracer. How this works: essentially we want to find out how much time is spent in which functions when we run software. Instrumentation does this by changing the code that is output by the compiler. We are essentially changing the program that we measure, which falsifies the results a little bit, but in exchange we get extremely reliable data, because we measure each and every time frame inside a function. It works like this. We have our Rust source, which squares some number. That corresponds to this assembly for Intel architectures. If we just append the corresponding flags for the compiler, the compiler nicely inserts a call to a special mcount function.
What this mcount function then does is inspect the stack to find out which function we are currently in. It can then take a timestamp, and it can also insert a return trampoline into the stack so that it also knows when we leave the function again. All of this together then lets us measure the time spent in functions, which is cool. In the image it looks like this: rftrace is just another static library, which is inside the whole image. It works for Rust programs, C programs, and also for mixed images, obviously. It is very encapsulated, so it exposes only a few symbols like mcount and then does everything internally. When we record such a trace, we can then look at it in a trace replay and really see which functions we enter and how long we spend inside them. We can also look at these graphically, of course; there are tools available for trace visualization. You could also create flame graphs out of this and then optimize the kernel. We are looking forward to using that for further optimizing the network stack, for example. All in all, I think that is all I have to say for today. That is a broad overview of the different topics that we covered last year. You can check us out on GitHub, you can say hi on Zulip. With that, I thank you for your kind attention. Thanks, Martin, for the talk. We have a working mic, so we can have some questions. Five minutes. Hi, my question is how do you instrument the Rust code, how do you actually get those function calls in there? There is a compiler flag for that. For C code, it is much simpler: you would just compile with GCC and then say `-pg`, I think. For Rust code, it is more complicated. Well, it is not more complicated, it is just more lengthy.
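A rough analogue of what mcount-style instrumentation produces, per-function entry/exit timestamps that can later be replayed or turned into flame graphs, can be built in Python with sys.setprofile, which gives the interpreter's own call/return hooks instead of compiler-inserted ones. This is only an analogy to the mechanism, not rftrace itself:

```python
import sys
import time

def trace_calls(fn, *args):
    """Record (event, function, timestamp) tuples, mcount-tracer style."""
    events = []

    def hook(frame, event, arg):
        # 'call' is the mcount analogue; 'return' is the trampoline analogue
        if event in ("call", "return"):
            events.append((event, frame.f_code.co_name, time.perf_counter()))

    sys.setprofile(hook)
    try:
        fn(*args)
    finally:
        sys.setprofile(None)
    return events

def square(x):
    return x * x

def work():
    return square(3) + square(4)

ev = trace_calls(work)
names = [(e, n) for e, n, _ in ev if n in ("work", "square")]
print(names)
```

Pairing each call with its matching return gives per-function durations, which is the data a trace replay or a flame graph is built from.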
I did not put it on the slide because it was two lines or something. But those features are available to us through LLVM. Work is on the way in Rust to make this easier, because it is not a stable thing exposed by the Rust toolchain, but by manually enabling the corresponding LLVM passes for the code, this works. Thank you. More questions? I had a similar question. We also have a track record on profiling and benchmarking in Unikraft. You are using instrumentation for profiling; are you also considering sampling profiling? For example, in Unikraft we are trying to tie in VMI, virtual machine introspection, which will be able to do some sort of snapshotting and the like. Also, in Unikraft you have gcov support now, because GCC 13 has embedded gcov support, so that makes things easier. Is the instrumented approach enough for what you have tested so far? Because you have to build the application, then run the instrumented one; maybe it is not similar in practice. Is this enough at this point? We will have to see. In general, we are not that automated yet compared to Unikraft. Our Rust application story is quite seamless, I think: you just enable profiling through a simple feature flag, then you run it, it gets dumped to disk, and you can look into it. This is also what Gaby is working on. Did you consider, I am not sure how ftrace or perf does it, but for example there is something called kprobes or kretprobes, which is a dynamic way of instrumenting the calls. What that gives you is that you don't have to have these items done at build time, so when you want to instrument the application, you can pass in some flags, and while you execute it, it replaces the function prologue with some sort of jumps. Interesting; that may be something interesting to look at. We are looking at that on Unikraft's side.
Is this like inserting a generic hook into every function and then dynamically changing it? Gaby knows a bit more about that. It is a bit of a rewrite of the function at load time: basically, you have a function that you want to jump into, and it gets patched so you can jump to the whole function that you want. Similar to that, just by hand, for some functions only, and switchable. Okay, makes sense. Still very cool with the flame graph. I mean, this is the most important item, because everyone does profiling, but having some sort of visual way of determining where time is actually being spent, that's really useful. Yeah. We have to switch to another talk, so Martin will be around for more questions. Thanks again.
Support Dynamically Linked Executables via Linux ld.so and Implement ENA Driver to Expand Application of OSv
Hello, everybody. Can you guys hear me? Hello. Cool. My name is Waldek Kozaczuk. I'm one of the few OSv committers, and I'm here to tell you about the latest enhancements made to OSv since my last presentation at FOSDEM a year ago. First off, I want to apologize for this very long title. Actually, most of my talk is really going to be focused on the first part, but I'll also try to mention a little bit about the other things. In today's presentation, I will talk about the enhancements to support statically linked executables and dynamically linked executables launched by a Linux dynamic linker. I will also briefly describe the implementation of the ENA driver to support AWS Nitro. In addition, I will preview the new Xconfig-based mechanism to allow further customization of OSv. Finally, I will talk about the upcoming 1.0 release and beyond. Most applications do not make system calls into Linux directly, as we know. Instead, they do it indirectly by calling libc functions that delegate to the syscall instruction, or the SVC instruction on ARM. On Linux, for example, dynamically linked executables are launched by the program interpreter, ld.so, which memory-maps the executable ELF along with the other ELF files it depends on, like libc.so, libpthread.so, and so on, then resolves undefined symbols like puts or pthread_create, and finally invokes the main function. On OSv, the dynamic linker built into the kernel plays the role of the program interpreter and performs similar steps as on Linux. But instead of loading the aforementioned libraries, it resolves the undefined symbols by pointing them to OSv's implementations of those. The OSv linker supports both shared libraries and dynamically linked executables that are either position-dependent or position-independent. The benefit is that programs interact with the OSv kernel using fast local function calls, without the overhead of the syscall instruction.
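The PLT-style lazy binding described in the next slide, a call slot that initially goes through a resolver and is patched with the real implementation on first use, can be mimicked in Python. Names are invented for illustration; a real linker patches GOT entries with addresses, not dicts with callables:

```python
# Stand-in for OSv's built-in implementations of libc symbols.
kernel_impls = {"puts": lambda s: f"OSv puts: {s}"}

resolved = {}   # our pretend GOT/PLT: symbol -> bound implementation

def call(symbol, *args):
    """First call resolves the symbol; later calls hit the cached slot,
    i.e. a plain local function call with no resolution overhead."""
    if symbol not in resolved:           # the 'PLT stub' path, runs once
        resolved[symbol] = kernel_impls[symbol]
    return resolved[symbol](*args)

print(call("puts", "hello"))    # resolves on first use
print(call("puts", "again"))    # now a direct call through the slot
print("puts" in resolved)       # the slot stays patched
```

The point OSv exploits is that after the first resolution the application is calling the kernel's function directly, with no syscall instruction in between.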
On the negative side, Linux compatibility is a moving target, because libc keeps adding new functions and on the OSv side we have to keep implementing them. This slide illustrates how dynamically linked programs traditionally interact with the OSv kernel. The drawing shows an executable's procedure linkage table (PLT) on the left side, and the dynamic linker and libc implementation that are part of the OSv kernel on the right side. In this example, after the dynamic linker memory-maps the program into memory, more specifically the ELF segments, it sets up the PLT to later resolve and replace the puts placeholder with the address of its implementation in the OSv kernel, which typically happens upon the very first call. Now, statically linked executables interact with the Linux kernel by directly making system calls and reading from pseudo file systems like procfs and sysfs. Initially, OSv implemented a fairly small number of system calls, around 70, to support running Go programs, which were interesting because they would call libc functions to create threads, for example, but execute system calls to do other things, like the socket API. But this was not enough to support statically linked executables. To make this possible, we had to implement some key new system calls like brk and clone, and add a substantial number of other ones to bring the total to 137 at this point. However, the trickiest part was adding support for the application's thread-local storage, so-called TLS. Dynamically linked programs that run on OSv in the traditional way share thread-local storage with the kernel and allow OSv to fully control the setup of TLS. Statically linked executables, on the other hand, want to allocate their own TLS and set the FS register on x86_64, or TPIDR_EL0 on ARM, to the thread control block address for each thread.
On x86_64, the solution was basically to use the GS register to point to a per-CPU structure with a copy of the application's TCB, and to update it on every context switch. On AArch64, we did a similar thing. Now, the point of this enhancement is that we improved Linux compatibility, because now we don't have to worry about cases where, for example, the application tries to call functions in libc that OSv doesn't implement. But the drawback of the system call interface is, obviously, that we pay the overhead of the syscall instruction every time, which I measured at around 110 nanoseconds on average on x86_64. This picture illustrates what happens behind the scenes. On the right side, the OSv dynamic linker actually still plays a small role: it still memory-maps the segments of the ELF and reads the headers, obviously, but then it just jumps to the start of the ELF. And from that point on, the interactions between the program and OSv happen simply through the syscall instruction. An exciting side effect of enhancing OSv to support statically linked executables is the capability to run dynamically linked executables via the Linux dynamic linker instead of the OSv built-in one. The Linux dynamic linker, ld.so, is itself a statically linked, position-independent shared object that is loaded and processed by the OSv kernel in exactly the same way as a static executable. On Linux, the dynamic linker is launched implicitly, simply by introspecting the INTERP program header. On OSv, we have to launch the Linux ld executable explicitly and pass its path along with the arguments, as you can see in this run.py example: we pass the absolute path to the Linux dynamic linker and then add the path of the executable and any arguments.
So obviously, just like with statically linked executables, there is the same benefit: we are now much more compatible with Linux, because one can take any application that works on Linux with glibc and it should work on OSv, since when we build the image, OSv will load glibc and use it like any other library the given application needs. The drawback is the same, because we are again paying 110 nanoseconds for every syscall instruction. And this slide again illustrates the interactions between OSv and the application. As you can see, on the right you have the OSv kernel; on the left, the application and the Linux dynamic linker, which is executed just like a static executable. It then loads the application ELF into memory using the mmap system call, executes the application itself, and loads any libraries, and from this point on, all interactions happen through syscall instructions. Now, to help analyze and troubleshoot statically linked executables, or dynamically linked ones launched in this new way, we have added a new diagnostic tool called strace, which is obviously similar to what one can do on Linux. In essence, one can specify all the interesting trace points using regular expressions. In this example, to monitor system calls, you just add `syscall*`, and you enable the strace thread, which prints all the trace point hits to standard output as the program runs. How many minutes do I have left? Seven minutes. So, to recap what I have talked about in the previous six slides: in the first two I described the traditional way of running dynamically linked programs on OSv, which benefits from fast local function calls, but may suffer from compatibility issues. In the next two slides, I explained the new enhancements to allow running statically linked executables.
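The strace-like filtering he describes, selecting trace points by a pattern such as `syscall*` and logging every hit, reduces to a small matcher. A sketch using fnmatch-style globs, with trace point names invented for illustration rather than taken from OSv:

```python
import fnmatch

TRACEPOINTS = ["syscall_brk", "syscall_clone", "syscall_mmap",
               "sched_switch", "net_packet_in"]

def enabled_tracepoints(patterns):
    """Return the trace points matched by any glob pattern, the way an
    strace-style option selects trace points by expression."""
    return [tp for tp in TRACEPOINTS
            if any(fnmatch.fnmatch(tp, p) for p in patterns)]

def trace(event, args, enabled, out):
    """Log a hit if its trace point is enabled (stand-in for the strace
    thread printing each hit to standard output)."""
    if event in enabled:
        out.append(f"{event}({', '.join(map(str, args))})")

enabled = enabled_tracepoints(["syscall*"])
log = []
trace("syscall_brk", [0], enabled, log)
trace("sched_switch", [], enabled, log)   # not matched, filtered out
print(enabled)
print(log)
```

The same mechanism generalizes: any subsystem's trace points become observable just by widening the pattern, without rebuilding the application.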
And finally, in the last two slides, I covered a new alternative way of running dynamically linked programs, launched by the Linux dynamic linker on OSv, which again may suffer from a small overhead for handling system calls, but benefits from much better compatibility with Linux. In essence, these new enhancements greatly widen OSv's applicability and should make it possible to run more programs on it. In addition to what I have talked about so far, we have also implemented a new driver for the AWS Elastic Network Adapter. In essence, we basically took the FreeBSD implementation by AWS and made it work on OSv, and we tried to minimize the changes so that we can backport any possible future fixes, and we disabled a lot of stuff that simply does not apply to OSv. The resulting driver costs us around 7,000 lines of mostly C code, and 56 kilobytes of added kernel size. The challenge, obviously, was testing it, because that can only be done on a running Nitro instance in AWS. So far, the driver seems to be pretty stable and seems to yield decent performance; I've tested it using iperf3, netperf, and a simple HTTP server application. As you may have guessed, the ENA driver implementation is enough to run OSv with RAMFS on a Nitro EC2 instance. And there's actually a script that I wrote to simplify uploading the OSv image, creating a snapshot, and creating an AMI. One thing, though: to run OSv on a Nitro instance with a non-volatile file system like ZFS, or hopefully ext in the future, we need an NVMe driver implementation, and there are actually two pull requests from the community for it at this point, but they haven't been merged yet; they need some love. In my previous presentation at FOSDEM, I talked about kernel modularization and driver profiles.
This year I will briefly describe a new feature that takes modularization to the next level, and which has been greatly inspired by Unikraft. In essence, the goal is to use the Linux kernel build configuration tooling (kconfig) to let the user select OSv components to be included or excluded, and various parameters to configure them. The makefile would then simply act on a generated config file, exclude the relevant object files, and pass any configuration parameters to the source files. This is obviously very much work in progress. And obviously, unlike Unikraft, where all the elements are effectively Lego blocks, with OSv we pretty much have to do the opposite: we have to basically sprinkle the source code with all these ifdefs. And this is just an example of what kind of modules or parameters can be modified. As an example of what can be accomplished with this new feature: by hiding all the symbols but those used by the application, excluding all unnecessary components, and changing values of various configurable parameters as listed on the slide, one can build a kernel image of 788 kilobytes in size, and run a hello-world app using 1.2 megabytes of memory. When I started optimizing the OSv kernel like five years ago, the kernel itself was at least 10 megabytes, and it required a minimum of 30 megabytes of memory. So it is almost a 10-fold improvement. It's surely not as small as Unikraft, but maybe we can squeeze it down to half a megabyte. As I am moving toward the end of my presentation, I just wanted to mention that we are also planning to cut a new release of OSv, which should include all the features that I've talked about. And I hope that we're gonna be able to implement the EXT file system, merge the IPv6 implementation branch, and potentially implement the NVMe driver.
I'm especially excited about the EXT file system support because I think it will make it easier to build images on Linux, and then introspect them, for example, if something happens afterwards. Beyond the upcoming release, we're planning to revamp Capstan. Capstan is effectively OSv's build tool, a bit like Unikraft's kraft, but it hasn't really been enhanced in any way, or even updated to take advantage of any recent features of OSv. So we're planning to basically revamp it and make it really easy to use, to help application developers use OSv. In addition, we're planning to work on some of the security aspects, like ASLR, which requires making the kernel relocatable, and some optimizations. And finally, we are planning to make OSv run on AWS Graviton, but that requires UEFI support and some other things. With that, I would like to thank the organizers for inviting me to this conference to tell you about OSv. I would also like to thank ScyllaDB for sponsoring my OSv work, Dor Laor for words of encouragement, and Nadav Har'El for being my mentor, reviewing hundreds of patches, and implementing other enhancements. And finally, I would like to thank all the community contributors to the project. On this slide, you can find some links about OSv. Thank you for your attention. I'm not sure if we have time for questions. Time for questions. We have time for one burning question, if there is one. You wanted? Yeah, go ahead. This is about your work on Linux compatibility: how are you handling new APIs, such as io_uring and similar? Your question was how do I add new applications? No, no, with the Linux API: how are you handling io_uring and similar APIs? So how am I consuming new Linux APIs? How are you handling applications which do make use of those?
So basically, this happens the way I described: typically, if the application is launched in the traditional way, OSv simply resolves all the application symbols, like the libc symbols, and simply redirects them to OSv's implementation of the libc functions. If I haven't answered your question, we can meet afterwards and I can address it better. Thanks again for the talk. Thank you.
[Protocols] Things we wish we knew before starting an IMAP library
Hi, thank you for being here so early to hear about such an old protocol. So we're going to talk about IMAP. We've both started writing some IMAP libraries and we want to share our experience with that. We've hit a few issues along the way, a few surprising things. Hopefully this can help you if you want to deal with IMAP as well. So, I'm Simon. I'm working on the Go IMAP library, and he is Damien. Hi, I'm the maintainer of imap-codec. Yeah. So the first thing you might wonder is what is IMAP useful for? Maybe some of you know that IMAP is used to fetch messages from a mail server. So if you have a mail client and a list of messages shows up, this is fetched via the IMAP protocol. IMAP lets you organize messages into mailboxes. Mailboxes are what regular people call folders. So inbox, archive, spam, drafts, all of these are mailboxes for IMAP. The main upside of using IMAP compared to older protocols is that it's possible to synchronize from multiple clients and devices. So for instance, I can start writing a draft on my laptop and then continue later on my mobile phone and send it from my mobile phone; that's possible with IMAP. What's the basic way you interact with IMAP? It sounds pretty simple at first. You open a TCP connection, ideally with TLS and without STARTTLS. And then you write a command, and then you get back some responses from the server. So it sounds simple. Here's a very simple example: a login command where you specify your username and your password. And then after that you get an OK response from the server if the password is correct and the login succeeds. So something interesting to note before going to the next slide is that... I'm sorry. I'm going to do this, no problem. So something interesting to note is that there's a CMD1 right before the login command here. This is what we call a tag, and it's used...
It's an arbitrary string sent by the client, and it's used to match up the server responses with the client's requests. So it's just a string echoed back by the server; the client knows that the OK response is for this particular login command it sent before. OK. Here's a more complicated example with a fetch command, which is used to fetch messages from the server. So here the client sends a fetch command and asks for the message flags and message envelope. The envelope typically contains the subject and the recipients and stuff like this. And then the server sends back some responses: here, the first message has the \Seen flag, so it's not unread, and it has been marked as important. The envelope is very big, so it's omitted here. And the second message has no flags. And when the server is done sending all the data, it ends with an OK response. Something worth noting is that here in the middle, you might notice that the command tag is not included; there's an asterisk instead. This will have consequences later: if you ask for data, it's complicated to know which command a reply belongs to, or whether it belongs to a command at all. We'll see more on this later. In the fetch command here at the start, you might notice the one colon wildcard (1:*). This is the way you specify which messages you want to fetch, and we'll see how we do this in the next slide. So how do we refer to a particular message? There are two ways. Both ways use a 32-bit unsigned integer. The first way is with something called UIDs. A UID is a unique ID which doesn't ever change, except when it does. It increases when a new message is added to a mailbox. So if the last message in the inbox has UID 42 and you receive a new one, then it will get UID 43. The second way is with message sequence numbers. It's an ordinal number.
So sequence number one means the first message in the mailbox, sequence number two the second email in the mailbox, and so on. And it's ordered the same way as UIDs: the oldest message added to the mailbox is the first one. So something interesting is that sequence numbers get reassigned by some operations. For instance, if a message is deleted from a mailbox, then the sequence numbers shift a bit. Here's an example of a mailbox with three messages, one with UID 4, one with UID 6, one with UID 12. If the message with UID 6 is removed from the mailbox, then the first message stays the one with UID 4, but the second message is no longer the one with UID 6; it's now the one with UID 12. So the meaning of a sequence number changes depending on the state. Another detail is that message data is immutable. If you fetch message contents, it will never change. If you want to edit a message, you need to re-upload it and then delete the old one. So this was how to refer to a single message; we can also refer to multiple messages with something called a set. The simplest set is just one message, so here just sequence number one. Here's another example with a colon: you can say messages 2 to 4, inclusive. You can specify multiple ranges like this, like 2 to 4 and then 6 to 10. And the last one is 1 to wildcard: it means 1 until the end, until the last message. That's it for the IMAP introduction. Now we can go into the meat of the presentation. Do you want the microphone? Is it on? Okay, so let's go through all these layers. The first layer is types. So what's there to tell about types? A few things. Probably your journey as an IMAP developer will start as either a client or a server developer. So it's kind of tempting to try to implement only half of the standard, and to a certain degree this is possible, because as a client developer you can implement command serialization and response parsing only, and as a server developer you can implement command parsing and response serialization only.
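The two addressing schemes just described (sequence sets like `2:4,6:10` or `1:*`, and sequence numbers that shift when a message is expunged) can be sketched in a few lines. This is our own illustrative code, not taken from go-imap or imap-codec, and the sequence-set grammar here is simplified relative to the formal syntax:

```python
def parse_sequence_set(spec, last):
    """Parse a simplified IMAP sequence-set like "2:4,6:10" or "1:*".
    `last` is the highest message number, used to resolve "*"."""
    nums = set()
    for part in spec.split(","):
        if ":" in part:
            lo, hi = part.split(":")
            lo = last if lo == "*" else int(lo)
            hi = last if hi == "*" else int(hi)
            nums.update(range(min(lo, hi), max(lo, hi) + 1))
        else:
            nums.add(last if part == "*" else int(part))
    return nums

# A mailbox is just an ordered list of UIDs; sequence numbers are
# 1-based positions, so expunging a message shifts everything after it.
def uid_for_seq(mailbox, seq):
    return mailbox[seq - 1]

mailbox = [4, 6, 12]              # UIDs, oldest first
print(parse_sequence_set("2:4,6:10", last=12))
print(uid_for_seq(mailbox, 2))    # sequence number 2 is UID 6
mailbox.remove(6)                 # the UID 6 message gets expunged
print(uid_for_seq(mailbox, 2))    # same sequence number, now UID 12
```

This is exactly the trap from the slide: the client re-using sequence number 2 after the expunge silently refers to a different message, which is why UIDs are usually the safer handle.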
You can kind of pick only half of the routines that you would need. But the IMAP standard has quite a bit of overlap between commands and responses. So there are many types, parsers, and serializers that you need to define, and you won't end up implementing 50% of the standard anyway, but more like 70, so to say. So my suggestion would be to structure your code so that you can easily extend it to the other side afterwards, for example using a shared module. And if you are lucky and someone provides the missing side to you, and you have parsing and serialization handy, you can do kind of cool stuff, because you can first generate a random message and then ensure that parsing and serialization are inverses of each other by doing randomized tests. So that's a pretty powerful kind of unit test. At least for me it helped a lot, as you can see at the bottom. Complicated stuff. Complicated bugs. Yeah, perfect. Okay, regarding syntax, oh my. I will quote Mark Crispin from the IMAP protocol mailing list, because I think it's not that bad, but you need to be in a certain state of mind when doing it. Alright, let me think, I'm a bit tired today. But first and foremost, the formal syntax should be your holy book. If any part of the specification disagrees with the formal syntax, you should ignore it in favor of the formal syntax. Your eyes will glaze over and your jaw will drop. You will start saying no, no, no. Just work through that stage. It's a steep hill to climb, but once you make it to the top you will see everything with crystal clarity. And remember, no matter what you do, do not try to implement any command or response by looking at the examples. That's what Mark said, so he's right. I would add that before reading the formal syntax you need to learn ABNF, and I mean you need to learn it by heart, because there are some subtle things you need to be aware of.
And regarding lexers and parsers, I think we agreed when talking about these things. IMAP in some places gives the impression that there are things like tokens, with errors saying "arguments invalid", implying that there could be some generic argument. I had a very hard time figuring out what a token should be. There are no words on what constitutes a token, and I think Simon in version one tried it and got away from this approach, or used a different approach in version two. So I don't know, maybe someone has a better idea, but for me, you cannot lex the IMAP syntax. Another recommendation: even the syntax has layers. First of all you have the ABNF core rules that are described in the ABNF standard and referenced in almost every rule. And then you have these IMAP strings, which make everything kind of messy. As an example, you see this is the LOGIN command, which looks kind of simple. And then you have this innocent-looking astring thingy in there, which here is for example the username and the password. And an astring is in fact one of three types and one of two protocol flows. An astring means either an atom or a string, more or less, modulo some IMAP quirks. And if it is a string, it can be a quoted string or a literal. And literals do require special care when implemented. As a simple example, we will start with "password". It uses only a very simple character set, so you can just write exactly these eight bytes as an atom. If you have a whitespace in it, you need to put quotes around it, and if you have a quote inside the quotes, you need to escape the quote. So it is similar to most programming languages. And if you have a literal, obviously if you have a newline in there, that would be the obvious case, you need to use this prefix here in curly braces, and then you just send exactly the bytes that make up your string after a newline. With a twist, as we will see. What we will gloss over today are ambiguities and defects, and I had a few discussions already about this one.
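The three astring encodings just described (atom, quoted string, literal) can be sketched as a small encoder. This is our own simplified sketch: the atom character set below is an approximation of the real rule in the formal syntax, which is stricter:

```python
# Simplified set of characters that disqualify an atom (real IMAP also
# excludes 8-bit and a few more); controls, CR and LF are always special.
ATOM_SPECIALS = set(' (){%*"\\]\r\n') | {chr(c) for c in range(0x20)}

def encode_astring(s):
    data = s.encode("utf-8")
    if s and not any(ch in ATOM_SPECIALS for ch in s):
        return data                              # plain atom
    if "\r" not in s and "\n" not in s:
        # quoted string: escape backslash and double quote
        quoted = s.replace("\\", "\\\\").replace('"', '\\"')
        return b'"' + quoted.encode("utf-8") + b'"'
    # anything containing CR/LF must go as a literal: {size}CRLF + raw bytes
    return b"{%d}\r\n" % len(data) + data

print(encode_astring("password"))      # atom: exactly these eight bytes
print(encode_astring("pass word"))     # quoted, because of the space
print(encode_astring("two\r\nlines"))  # literal, because of the newline
```

The literal case is the one with "a twist": as the talk goes on to show, the bytes after `{n}` may only be sent once the server has acknowledged the announcement.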
So I would very much ask everyone: if you find some defect in IMAP, please report it to us. We really want to start a collection of all of these things. And one thing I finally wanted to say: I quoted Mark Crispin from this thread, but if you now go to the internet you won't find it. The IMAP protocol mailing list archive, at some point it became unavailable, due to reasons. For me the only lucky thing that happened was that someone I know, the maintainer of the Meli email client, had this super cool online interactive WebAssembly demo, and he used the dump as test data. That was the only reason I could read it. I guess the thing I want to say here is: let's try to be aware that knowledge is disappearing, and maybe try to resurrect the IMAP protocol mailing list, because it's awesome, it's like a treasure trove of information. Okay, then let's go back to framing. So... Oh, everything hung up. Yeah, I'm back again. So we're going to continue to talk about some higher level layers. Flow and framing: by flow and framing we mean how one splits the IMAP stream into separate commands and responses. This seems pretty simple at first. Here's a simple example, similar to what we've seen: a login command at first, and then the server replies OK, and then the client sends a select command, and then the server replies with some data and then replies OK. So one may think, yeah, it's pretty simple, you just need to split on newlines and each line is a message, basically. And then literals happened. So here's a slightly more complicated example where the client sends a login command with the username, and then the password is passed as a literal. So first there's the number of bytes, and then on the next line there's the contents. What's interesting here is that these two lines are a single logical message. The second line sent by the client is still part of the login command.
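The framing rule just described, where a line ending in `{n}` pulls the next n raw bytes (plus at least one more line) into the same logical command, can be sketched as a toy framer. This is our own sketch: it assumes all the bytes are already in hand, so it ignores the server acknowledgement that the talk turns to next, which a real parser must not do:

```python
import re

# A literal announcement: the line ends with {n} (or {n+} for LITERAL+).
LITERAL_RE = re.compile(rb"\{(\d+)\+?\}\r\n$")

def split_commands(stream: bytes):
    """Split a raw byte stream into logical commands. Splitting purely on
    CRLF would be wrong: literal bytes are consumed as-is, never line-split."""
    commands = []
    pos = 0
    while pos < len(stream):
        command = b""
        while True:
            end = stream.index(b"\r\n", pos) + 2
            line = stream[pos:end]
            pos = end
            command += line
            m = LITERAL_RE.search(line)
            if not m:
                break  # no literal announced: the command is complete
            n = int(m.group(1))
            # the next n bytes are raw literal data belonging to this command
            command += stream[pos:pos + n]
            pos += n
        commands.append(command)
    return commands

stream = b"a1 LOGIN {5}\r\nalice {6}\r\nsecret\r\na2 NOOP\r\n"
for cmd in split_commands(stream):
    print(cmd)
```

Note how the five lines of input collapse into just two logical commands; the `alice` and `secret` bytes never reach the line splitter.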
Another interesting thing is that in between, there's a plus sent by the server. This is because the server needs to acknowledge literals. So when the client sends the first line, it says, hey, I want to send a literal of six bytes, and then the server has to reply with this plus, yeah, you can go on, with an optional comment after that. The client needs to wait for the acknowledgement before sending the literal data. Okay, so that's interesting. Let's try to look at only one side of the connection. So here, let's look at only the client side and see what happens. We can still make sense of everything here: login with the literal, and the next line, and a NOOP. Is this valid, by the way? This looks a bit weird, right? The client sends the username and then announces the literal, and then on the next line it sends a completely different command. It's not the password or anything. Is this even valid IMAP? It turns out that yes, it's completely valid IMAP, because if the server replies no to the first line the client sends, then the client cannot send the literal; the server says, I don't want your literal. So basically, what I'm trying to say here is that it's not possible to parse IMAP by looking at just one side, because you can't tell the difference between this case and this case here, where the server rejects the literal. So you need, in your IMAP parser, some kind of feedback from the other side of the connection to know what happened. And so one may think that we don't really need to wait for the server to acknowledge the literal; we can just send the command and the literal in one go and forget about it. The server will probably acknowledge the literal in any case. So here's an example of what could go wrong if you don't wait for the server acknowledgement. Maybe you have a web form on a page which lets the user save a draft in their mailbox.
And maybe the literal contains some text like this, which happens to be valid IMAP commands. So if the server rejects the literal, then these lines are interpreted as regular IMAP commands by the server. And these lines delete everything from your mailbox. So that's not great. And this can potentially be inserted into an HTML email, hidden in the HTML on a single line. And yeah, if you reply to the email, you just lose everything. So yeah, it's pretty scary. So to recap: something I haven't mentioned is that literals can appear basically anywhere. We've seen it in the login command, but it can happen in the search command too. There can be many literals for a single command; it's not limited to one. So literals completely interrupt the regular syntax. You have to pause the parser, on the server side or the client side, if you receive a literal, then wait for the other side to reply, yeah, go on, and then you have to resume the parser. And the literal can be nested in a list or nested in something else. So it's kind of complicated to do, especially if you're using, for instance, a parser generator or something. So we can't parse IMAP just by looking at a single side of the connection, as we've seen. And it's important to wait for the server to accept literals before going on, for security reasons. So another aspect of the flows we want to talk about is commands such as AUTHENTICATE. AUTHENTICATE is a command that lets the client use SASL authentication. SASL is a binary protocol, and to authenticate in a modular way, you have several mechanisms. So here's an example of the PLAIN mechanism, which is a simple one with username and password, but there are others as well. So basically, the idea is that you take a binary message, encode it to base64, and then send it over. And the interesting thing here is that the client sends the AUTHENTICATE command, and the server says go on, you can continue the AUTHENTICATE command.
And then the client sends just base64, like, what? This is not a regular IMAP command. This is just base64. There's no tag. There's no command name. It's just the base64 data as is. It just interrupts the regular IMAP syntax with something completely different. And IDLE does something similar to this, where the client sends IDLE, the server says go on, and then the client can just send the ASCII string DONE, like the four bytes D-O-N-E. And it's not an IMAP command or anything. It's just an ASCII string. STARTTLS and COMPRESS are kind of similar in that when you run these commands, the regular IMAP stream is interrupted and wrapped up in TLS or a compression mechanism. So these are fun to implement as well. So in summary, for the flow section: IMAP demands that you conflate your parsing and business logic with higher level details. You cannot have a pure parser in its own little module isolated from everything else; you need to wire it up with the rest of the IMAP library. It's kind of special in this regard compared to other protocols. Okay, now on to operations and semantics. So let's talk about fetching messages again. There are multiple things you can request from the server when fetching messages. A basic example is the envelope, which we've already seen. BODYSTRUCTURE is when you request the MIME structure of a message, a tree of nested parts, if you have attachments for example. And then to fetch the message body, you can use BODY with square brackets. If you just request BODY[] like in this example, you get the full message body. So here's an example, a very simple message with two header lines and then a simple text. If you fetch BODY[], you get everything. If you want to fetch only the header, you can use BODY[HEADER], and then you get only the first two lines. And you can request only the text of the message, so the body part here, with the TEXT modifier. But you can do more complicated stuff as well. Oh my.
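Going back to the AUTHENTICATE flow for a moment: the PLAIN mechanism message mentioned above is just the authorization identity, the username, and the password, separated by NUL bytes and base64-encoded. A tiny sketch of ours, using only the standard library:

```python
import base64

def sasl_plain(username, password, authzid=""):
    """Build the base64 payload the client sends after the server's "+"
    continuation: authzid NUL authcid NUL password, base64-encoded."""
    raw = "\0".join([authzid, username, password]).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

print(sasl_plain("alice", "secret"))  # → AGFsaWNlAHNlY3JldA==
```

This is exactly the "just base64, no tag, no command name" line from the slide: on the wire it is this opaque blob on a line of its own, interrupting the normal IMAP syntax.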
Yeah, maybe I'll go very fast on this one. You can fetch particular header fields. You can fetch sections, bytes, substrings of the results. If you have a multi-part message, we have an example with two parts: the main part, the first sub-part, the second sub-part with an attachment. Then you can fetch only the first part here, the Content-Disposition inline one. Or, this one is interesting because it returns nothing: HEADER actually doesn't work in nested parts. You have to use a special keyword called MIME, for some reason. And then if you have a message attached to a message, there's a section of the RFC dedicated to this particular use case. Like, something everybody does every day, I think. Messages inside messages, like Russian dolls. The last thing I want to talk about is unilateral server data. Here's another simple example of a fetch command where you want to fetch the body of message one. And the server replies, yeah, here's the body of message one. So everything's fine. Now let's say another client happens to mark the first message as important. The way this works in IMAP is that the next time you execute a command, the server replies here in the middle: hey, by the way, the flags of message one have changed. Even if you didn't ask for it, just before completing the command, it sends this data. So what happens if another client changes the flags of message one and you happen to send a fetch command right after this happened? Then you get something like this, where the server first replies with the body of the first message, like hello world, like before, and then you get something interesting: another fetch item for the same message, but something you didn't ask for at all. So... Yep. So it's not possible to think of IMAP as: you request some data and you get back that data. It doesn't really work like this.
You can think of it as: you request some data, and then the server pushes some data at you, whether you want it or not, and you have to deal with it. And as a client, if you ignore all but the last reply from the server for the fetch you asked for, then you won't get the body here. So it's something to look out for. Okay, last topic: extensions. These are a bit interesting. In go-imap v1, I tried to implement extensions as a very modular thing, which you can plug in. But extensions turn out to be more like amendments: they fundamentally alter IMAP syntax, flows, operations, everything we've talked about. IDLE and COMPRESS are examples that add completely new flows. IDLE switches to a completely different mode, and you need to send the DONE ASCII string to switch back. And COMPRESS, yeah, just wraps the connection in something else. And then you have another kind of extension, like extended LIST, which modifies the existing LIST command and adds some arguments in the middle to add more options for the clients. The extension for extended search changes what the reply looks like: you send a regular search command and then you get a completely different kind of reply. And then the LITERAL+ extension completely changes how literals work; you get a new syntax that you need to parse. So yeah, this doesn't work at all if you try to implement it as a modular thing. IMAP is completely monolithic; if you want to implement extensions, implementing everything in the same repository will help a lot. All right, that's about it. Unfortunately, we don't have time to talk about everything we wanted, but it should be a good start, I hope at least. Any questions? Thank you very much first. I see a first arm. Hello. Thanks for the talk, and thanks for the library too. I think we're using it quite a lot. Oh, okay. Yeah, yeah.
My question is: you said that sometimes you get responses from the server that you didn't even ask for. Does the server also send data without you asking at all? So, it kind of... I mean, it will only send data right after a command, sorry, let's go from the start again. It will not send data on its own if you don't send any command. You have to send a command, and then the server replies to the command and adds its own unilateral responses to it, which can be a bit arbitrary. It can be anything, really. Usually, just before the OK response, you get some extra data and you have to somehow distinguish it from the regular data. Yep. Oh, yeah, I'll just add to that a little bit. So the IMAP standard is quite specific regarding this, and it says you need to be able to receive any response at any time. So it's in the standard, but doing practical things, what we learned is that you should not trust anything that's in the standard, and to the best of my knowledge, most servers don't do this. There are exceptions, for example the BYE response, the untagged BYE, like when the server does a shutdown. Yeah, as was answered, maybe you can explain a bit more, but to the best of our knowledge, most servers don't do it, because at least when we tested some clients, many clients, and I mean most of the clients, they crashed when we sent this. So I think there's a reason why it's not so common in the real world. Okay. I just wanted to say that if you consider the client-server interaction more like the client holds a view of the server, and the server updates that view whenever you send a command, then it starts to make a bit more sense. Yep. But it can be hard to architect a client against this IMAP concept. Sometimes you don't want this kind of thing.
But yeah, it's a good mindset for sure. All right. Regarding IMAP as a cache-fill protocol, where the client has a view and the server fills in the client's view, is the only way to write an IMAP client that will preserve your sanity over the years. If you try to act as though this were a web server, each new server will surprise you in some way. Painful. Don't ask me how I know. All right. Thank you very much. And thanks again to the two presenters, and we come to the next talk.
[JMAP] JMAP: Getting Started
So now, after we dove a little bit into the old specs and standards and the details of IMAP, we are going to hear a lot about JMAP. We'll be talking about JMAP, which is a new set of standards that has been engineered in the last couple of years by some very engaged people, who also in parallel have been contributing a lot to IMAP and still do. And we are very happy that we have one of these persons here, a representative from Fastmail, which has been a company very instrumental in putting a lot of effort into this new set of standards. Rick, the stage is yours; let's learn about JMAP. All right, applaud fast, because I've got a lot of slides and I've got a little bit of time. So we are going to talk about JMAP. Funny story: I was pitched this talk where I was going to talk about JMAP, like what is it, how does it work, why is it so great, how can you use it, how does Fastmail use it. I covered everything, it was a really good talk, it was like an hour long, and then I looked at my email as I was coming here and it said you get 15 minutes. And it had to be in PDF, so this is just the absolute-minimum.pdf of my slides, and if you want to hear the whole thing and see all the builds and all the animations and everything about IMAP and JMAP, that can be arranged very easily; talk to me later. That's me, I work at Fastmail, I'm not going to talk about myself, we don't have a lot of time. Let's talk about IMAP, sorry. Who was here earlier for the talk on what they wanted to know before writing an IMAP library? You? Okay, well, you missed a lot of horror stories, but I'm going to give you some now. This is IMAP, and I'm going to be real brief about it.
What you're seeing here is the server in white, the client in yellow. We log in, it says yeah, you logged in, and now we select an inbox. Okay, this is the IMAP protocol, very basic, but here's already the beginning of the parts of the grammar you need to parse, and it's a bunch. And if you were here earlier, you saw lots and lots more stuff: weird literals, weird ways the interaction with the server changes how you parse the response, synchronizing and non-synchronizing literals. It's a complicated protocol, and it's not like other protocols that you're using, and there's a really simple reason for that, which I'll get to. Oh yeah, right: this is the protocol to do stuff, and then the payload of the message is MIME, which is like another thing nobody wants to deal with. It works great and it pays my salary, but I mean... Say what you want about HTTP and JSON, but at least it's not this stuff, right? You probably all know how to use HTTP even if you don't know how it works under the hood, and you probably know how it works, because it's really nice and simple. So I had lots and lots of slides talking about how weird IMAP is, and I would love to tell you about it, but I'm just going to tell you about this one thing, and this was touched on earlier. Blah blah blah, server and client are talking, and eventually the client says I want to mark message 12 deleted, store the flag deleted onto that message, and the server says great, you have fetched this information. And this is where people get really confused, and it comes down to something that was said earlier: the only way to understand IMAP is that IMAP is a cache invalidation protocol. It's a protocol that tells you what to do with your cache.
So you've got a server and you've got a client, and the client can send basically the commands you expect, like I want to fetch or update or create or delete messages, and the server's response is: in response to that, here is how you should update your cache. And if you don't think about IMAP that way, you're going to have a bad time. Everything works this way. If the client says I want to work with the inbox, it says select inbox, and the server says there are 172 emails and these flags exist, which is a way of saying here's how to initialize your cache. When you say I want to look at my new mail, the client says fetch these things, and the server says you fetched these things, which means put these in your cache. When you say I want to mark this mail read, you say store this flag, and the server says put this in your cache. That's how it all works, and you have to start by understanding that even to understand IMAP. I'd love to talk much more about IMAP. Okay, there is one more thing though. This is another fairly basic IMAP conversation where we're saying we want to come up to date, and coming up to date is really important. See, at the beginning we say QRESYNC, which means we want to quickly resynchronize our offline IMAP storage. So we say QRESYNC, and that our client state is 123. That just tells it what the state was the last time we synced. And we get told: great, your next sync state is going to be 130, here are all the changes to apply, and when you're done you'll be at state 130. Without this, IMAP kind of sucks. I mean, it's better than POP, but one of the great things about it is you can synchronize, go offline, come back later, and quickly get up to date no matter what else has been going on. Okay, now you understand IMAP. Good job, everybody. Yep. Who wants to go implement it? Yeah, these four freaks. Okay, the good stuff is good, but the bad stuff sucks, and there's so much bad stuff. So, good stuff: you can resynchronize from a previous session. Great.
You've got a domain-specific model; IMAP is built around email. Really nice. How about the bad stuff? Okay, the data format sucks, the transport layer sucks. The code that's out there is mostly not great. The key features of IMAP aren't in the core protocol sometimes, so you need to make sure you've got the right extensions loaded, the right capabilities available, or you're implementing to the worst common denominator. And there are way, way too many parentheses. Okay, so this is why we built JMAP. JMAP is the JSON Meta Application Protocol. It's just JMAP; it's IMAP plus one, right? This is what it looks like. So already I hope people are feeling better; you know what this stuff is, right? We're posting a request to the JMAP endpoint and we say I want to get these emails. Great, right? So just like everything else, it's a RESTful protocol, kind of. Here's what you get back in response. You said you wanted to get emails one, two, three, four; here's one of the ones you might get. You did an Email/get, you're getting a list of messages; this one has ID one, and there's its subject, and there's more stuff. But it looks like this; you can parse this. Everybody knows what this means. Here's a bigger context of it, so you can see there's an ID and there are parts of the body and the subject, but the thing I want to call your special attention to is it's got one simple date format. Yeah, I mean, you could stop there and it'd be a pretty good improvement on IMAP and MIME. But we're going to keep going. Here's another thing. When the server responds to you, it can say yeah, you did just get these messages, and by the way, your email collection is at state 616. It's just like that QRESYNC thing. It's going to let you say later: I've got mail, and it's all cached up to state 616; hey server, tell me what changed since then. And the server replies, and what it says is: here are the changes. You were at 616, you will be at 717.
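A JMAP request like the one on the slide is just JSON posted over HTTP. This sketch builds an Email/get request body; the account id, message ids, and method-call tag "a" are made up, while the two capability URNs are the real ones from RFC 8620 and RFC 8621.

```python
import json

# Sketch of a JMAP Email/get request body as described in the talk.
# "using" declares capabilities; "methodCalls" is a list of
# [methodName, arguments, clientChosenCallId] triples.

request = {
    "using": ["urn:ietf:params:jmap:core", "urn:ietf:params:jmap:mail"],
    "methodCalls": [
        ["Email/get", {"accountId": "u1", "ids": ["1", "2", "3", "4"]}, "a"],
    ],
}

body = json.dumps(request)       # what you'd POST to the JMAP endpoint
decoded = json.loads(body)       # round-trips cleanly, unlike IMAP syntax
```

The whole point of the slide is that any JSON library can produce and consume this; there is no bespoke grammar to implement.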
These two IDs were created and this one has changed in some way. And then you can decide to do what? Update your cache. What do you do? Maybe you refetch those messages. Maybe you just invalidate the local storage, but you know how to change your cache. It's just like IMAP. JMAP is a cache management protocol. It's just easier to use. Here's another example. Email/query is basically what we call search; it's what happens when you search your email. So we're going to search for mail that's been flagged and that's from me. Really simple. And the response to that will look like this. You did an Email/query; here are the IDs that result from that. And the reason that it gives back IDs, it's about managing your cache. You should have messages cached. If you don't have these, well, now you can fetch them. But if you did have them, why send you the messages back? You should have a cache with these messages. If you didn't, you would go ahead and say great, Email/get these messages: I didn't have them but I want them, so you get them now. And it works great. It makes sense. You can think about this really easily. But we should talk about IMAP again. So in IMAP it works the same way. You say I'm going to search flagged messages, and it says here they are, and then you say I'm going to fetch those. Right? Makes sense. Same thing. IMAP and JMAP look the same in a lot of ways. This is what you don't always see in these diagrams: where the round trips come in. Right? First we search; it goes to the server. The server computes the answer, sends it back. Then we say I need those messages, give me those messages. It goes to the server, the server finds the answer, the server sends it back. You're waiting for the speed of light back and forth twice. That's what happens here too. Right? You say I want to do a query, I get the answer, I ask for those messages, it goes to the server again and it comes back. So the same waits sit here. But you don't have to let them sit there with JMAP.
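The cache-first pattern behind returning only IDs can be sketched in a few lines: after Email/query, the client fetches just the ids missing from its cache. The cache contents and ids here are invented.

```python
# Sketch of the cache-first pattern: Email/query returns only ids, and
# the client issues Email/get only for the ones it doesn't already hold.

def ids_to_fetch(query_result_ids, cache):
    """Return the ids the client still needs to Email/get."""
    return [i for i in query_result_ids if i not in cache]

cache = {"m1": {"subject": "hi"}}              # already cached locally
missing = ids_to_fetch(["m1", "m2", "m3"], cache)
```

If the cache is warm, the second round trip shrinks or disappears entirely; that is why the server sends ids rather than full messages.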
Because when you write your query, you can write this: I want to do a query and a get. And what is the get going to fetch? I don't know the answer yet. That's okay. You tell the server: the IDs came from another thing I asked you to do. So get the IDs by looking at A, which should be an Email/query; get the IDs out of the response that you compute before you send anything back to me, and do the method call with those. It's called a back reference. And you can have a whole bunch of method calls that back-reference one another, to let the server do all the work and only do a round trip back to you once. So you get one wait state. Really good. Okay, a couple more things. This is a larger section of a JMAP query; I've put in some more things I've been skipping on these slides. Mostly you've been seeing this stuff, actual method calls, but what's up here is good too. This is called the using block. It tells you what capabilities you want to use. This one's really simple. If you squint, you can see we're using core, which is like, yeah, I'm speaking JMAP; and mail, again, I'm looking at mail. But you didn't have to squint, apparently. But you can have lots of other capabilities. At Fastmail we have contacts and calendars over JMAP, and those are going through the IETF now; they'll be RFCs. And we have lots of other stuff too. What that means is if your server supports mail and contacts and calendars and other stuff, when you come back from offline, you can synchronize everything with the same request. Not just the same protocol, but: hello, I'm back online, please get all the changes since my offline state and fetch the updates to me all at once. You can also write your own custom data types for whatever appeals to you, whatever your business needs to use; add it to your implementation. Because even though the data types in JMAP are domain-specific, we let you build your own.
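A rough server-side sketch of how a back reference could be resolved, assuming simplified stand-in handlers and treating the reference path as a plain key rather than the full JSON Pointer the real spec uses. All ids and handlers are made up.

```python
# Sketch of JMAP back-references: a later method call pulls ids out of
# an earlier call's response via resultOf/name/path, so the server
# resolves everything in one round trip.

def run_method_calls(calls, handlers):
    responses = []
    by_call_id = {}
    for name, args, call_id in calls:
        resolved = dict(args)
        for key in list(resolved):
            if key.startswith("#"):             # back-reference, e.g. "#ids"
                ref = resolved.pop(key)
                prior = by_call_id[ref["resultOf"]]
                resolved[key[1:]] = prior[ref["path"]]
        result = handlers[name](resolved)
        responses.append([name, result, call_id])
        by_call_id[call_id] = result
    return responses

handlers = {
    "Email/query": lambda a: {"ids": ["m7", "m9"]},
    "Email/get":   lambda a: {"list": [{"id": i} for i in a["ids"]]},
}
calls = [
    ["Email/query", {"filter": {"hasKeyword": "$flagged"}}, "a"],
    ["Email/get", {"#ids": {"resultOf": "a", "name": "Email/query",
                            "path": "ids"}}, "b"],
]
out = run_method_calls(calls, handlers)
```

The client wrote the get before knowing any ids; the server filled them in from call "a" before replying, which is exactly the one-wait-state win described above.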
Anybody can build their own just by describing how those methods will work. I'll talk about it just a little bit. Fastmail uses JMAP for mail filters, your preferences, your credentials, your DNS, your files and billing, all kinds of stuff. We just do it over JMAP because it's great. Okay, getting close to the last things. We also give you EventSource. EventSource is a long-running connection; I'm old enough that I still call it Comet, right? Like, you connect to the web server and you say, tell me when things change, and you stay connected. And every once in a while, the server sends you a little blob like this saying, oh, there's an update to your email state; oh, email and contacts have changed. And when that happens, what does your client do, sitting there connected? It invalidates the cache. It can refresh things. It can update the screen immediately. IMAP has this with something called IDLE, but CalDAV doesn't, CardDAV doesn't. And when you do this on your mobile phone, IDLE is not going to help you much, because Apple sure as hell is not letting your phone sit there with a live TCP stream connected to your IMAP server all the time. So people build these interstitial servers, instead of getting a web push which would just directly send your phone a message. And JMAP supports web push. So you can just get real-time updates from all these protocols. So this is JMAP. We get rid of just about all the bad stuff and add all this good stuff. JMAP is HTTP and JSON, which anybody can use. Avoiding round trips by combining requests. Putting lots of data types in one place, and real-time synchronization. And the cost is that not everybody's using JMAP yet. It's growing, but it's still pretty early, and there are way too many squiggly braces and double quotes. But that's a price I'll pay. Okay. So what now? You want to know how this works? The first thing you should do is go look at this repository, fastmail/JMAP-Samples.
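A sketch of what a client might do with a push blob like the one described: compare the pushed per-type states against its cached ones and invalidate whatever changed. The "@type": "StateChange" and "changed" shape follows RFC 8620; the account id and state strings are invented.

```python
import json

# Sketch of reacting to a JMAP StateChange push payload (EventSource or
# web push): figure out which locally cached data types are now stale.

push = json.loads("""{
  "@type": "StateChange",
  "changed": {"acc1": {"Email": "s617", "Contact": "s12"}}
}""")

cached_states = {"Email": "s616", "Contact": "s12"}

stale = [dtype for dtype, state in push["changed"]["acc1"].items()
         if cached_states.get(dtype) != state]
# for each stale type the client would now issue a Foo/changes call
```

The push carries no message data at all, only state tokens; the client decides what to refetch, which is again the cache-management model.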
It's code that just does some real basic stuff with JMAP, and you won't understand it yet, but it's going to give you an idea of what JMAP use looks like in its simplest form. Then it's time to read RFCs. Yes. Don't worry, they're actually pretty good RFCs. You should look at these if you want to play with JMAP. The first one is 8620, which is going to tell you what the basic methods are, and then 8621, which tells you the data types. So 8620 is going to tell you things like how you get, how you set, how you do changes; just what those are, and they work on any data type. 8621 is going to tell you the specific data types that we use, like mailbox, thread, email and so on. Everything else, you just learn more data types, in calendars and contacts and more; that's basically how the protocol works: you learn the data types on top of the core methods. Some highlights from the RFCs. Yeah, okay, I've got a minute and 18 before questions. Email is the most complicated data type in JMAP, for obvious reasons. Emails are big and weird and complicated. JMAP does a great job of making them easy to deal with. Here's an Email/get. When you do a get, you can also say which parts of the thing you want to get. Don't get every property, just get pieces. So I might say I want the from, to, subject, preview (the little snippet you see in your mail client), and its mailbox IDs. So what do you get back? This. Great. The to and from come back as structured objects that have parsed the email headers for you. Nice. The subject comes back decoded. That's ASCII, so that was a poor choice of string, right? But it comes back decoded. The preview is decoded, and mailboxIds is this weird set thing. Why is it an object instead of just the one mailbox ID? Because a message can be in multiple mailboxes. And if you hit me up later, I can tell you about labels mode, which is what we use this for. It's really nice.
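The "don't get every property" idea can be sketched as a server-side filter: return only the requested properties, plus id, which is always included. The stored message here is invented.

```python
# Sketch of server-side property filtering for Email/get: the client
# names the properties it wants and gets nothing else back.

def email_get(stored, properties):
    """Return only the requested properties of a stored message."""
    return {p: stored[p] for p in ["id", *properties] if p in stored}

stored = {
    "id": "m1",
    "subject": "Hello",
    "preview": "Hi there",
    "from": [{"email": "rjbs@example.com"}],
    "bodyValues": {"1": {"value": "a very large body..."}},
}
partial = email_get(stored, ["subject", "preview"])
```

For big messages this matters: the client can pull the list-view fields (subject, preview) without ever transferring the bodies.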
So, the headers: you could fetch the subject, but you could also fetch the header called Subject. And when that happens, you get back the quoted-printable, the literal thing. But if you want, you could instead say: give me the subject as the literal bytes, or give me all the headers (because maybe there are multiple subjects), or all the headers but decode the text. You can get anything like that. I've got no time left, but I'll show you this. When you fetch the body, you can get the blob ID. Don't do that. That's where you have to MIME-parse it yourself. Instead, you say you want to fetch the text bodies and all their values, and you get something like this: here are all the bodies you need to display the full text of the message. There's no MIME parsing, there's no remembering what to do with multipart/alternative and multipart/related. How does that work? No. Just do that. Okay. Yep. Time for Q&A. The first thing I will say is you can ask me for more later, use Fastmail, blah, blah, blah. How about questions? All right. Same here. Hi. Thank you very much. So, one quick question about adoption. Did you reach out to... because when looking at this protocol, and I've been playing around with it for some time now, it looks fairly similar to whatever Google and Microsoft do. I'm not familiar with those companies. Yeah, yeah. So is there any chance that these guys would be interested in adopting this? Yeah. Yes. I mean, I think I can just say that. You can imagine Microsoft, Apple and Google all standing around a well in a spaghetti Western with their guns at each other, like, who's going to move first? Right? Apple's client is by far the most popular mail client in use. Google's servers are the most popular servers. If either one breaks, we're in.
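The raw-versus-decoded header distinction can be illustrated with Python's standard library: decoding an RFC 2047 encoded-word subject the way a server would before returning the plain "subject" property. The example header is invented.

```python
from email.header import decode_header, make_header

# The talk contrasts fetching the literal header bytes with fetching the
# decoded value. This is the decoding step a JMAP server does for you,
# so the client never sees encoded-words.

raw = "=?utf-8?q?R=C3=A9union_demain?="   # the "literal thing"
decoded = str(make_header(decode_header(raw)))
```

With IMAP, every client re-implements this decoding (and its many edge cases); with JMAP the server hands you the decoded string unless you explicitly ask for the raw header.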
And I've spoken with people at these companies, and they're interested, but of course it's a huge amount of work on something that, even though it's clearly technically superior and a big win, is a gamble. It hasn't won yet. I'm pretty optimistic that we're going to see things happen, but I don't have any secret knowledge. Yeah. Thanks. Hi, thanks for the talk. What about JMTP? Yes, JMTP. Yeah. So replacing server-to-server communication is a much more fraught problem than replacing what your client does. Are you asking about submission? Okay. So mail, MTA to MTA, right, the full exchange of mail between different servers, the Fediverse of email if you will: that's going to be SMTP, as far as I know, forever. I'd love to see JMTP replace it, whatever the hell that is. But submission, where your mail client says I want to give this message to be sent: JMAP supports that, and it's really, really good. It has lots of really nice features. It has the ability to tell you, oh, by the way, that mail you sent bounced. It has the ability to tell you how many people it has been sent to. And the way that you create messages as a client author is much, much, much simpler. You don't have to think about constructing MIME bodies yourself. You can just say: here are some attachments, here's the text and the HTML, and the server can do everything for you. So it does replace that. Also, because it's one protocol, you're never like, I can fetch mail but I can't send mail, because one server's up and one server's down. It just always works. What do you do about encrypted messages? So you mean like OpenPGP or S/MIME sort of things? Yeah. So what do we do about encrypted messages? Punt. Well, so there are some RFCs about S/MIME and handling S/MIME messages, I think all by Alexey, if not mostly by Alexey, that I would say are optimized for the server having access to your key material, right? Is that a fair way to describe it? Yes. Yeah.
And there have been discussions about how we would deal with encrypted messages when the server doesn't have your key material and only the client does. We've talked about it; it's complicated, and I think there are interesting things we can do. But generally, JMAP is built around the idea that whatever the server can see, you can see. And encryption, as usual, makes things less convenient. All right. Thank you again very much, Ricardo. I think he will be around.
[JMAP] OpenXPort JMAP: a PHP library for Data Portability
All right, we head on with the next talk. All right, the floor is yours. So, let's wait for the room to cool down a bit. So I'm one of the lucky ones, having only five minutes for my talk, so I'm going to keep it very brief. I hope you can hear me. Good. So I'm Joris. I work at audriga. We do quite a lot of work on data portability, and that's how we came to JMAP. So Ricardo already did a quite good job of presenting what it's all about. For us, the main thing we wanted it for is having a unified API. So I think there was one slide where he said we add files and calendars and contacts and whatnot, with own extensions for that. And it's really good for that, actually. Yeah. So I'm just going to skip that slide because I don't have much time. Yes. So in the end, one thing that was not mentioned in the previous talk is that JMAP Calendars and JMAP Contacts build upon the world of CardDAV and CalDAV, which themselves build upon iCalendar and vCard. So there is a modern replacement for iCalendar and vCard, called JSCalendar and JSContact, and a modern replacement for CardDAV and CalDAV, which is called JMAP Contacts and JMAP Calendars. And that's what we are mostly using, heavily, in addition to a bunch of other data types that we also added. So the work that we did: first of all, we have a client and we have a server. We move data from one service to another, data portability. The client is a Java client, so we collaborate with Daniel Gultsch here. We have added a lot of features to the library already. We still need to work out how to combine that well with what is already there, because we would also like to see the JMAP Java library become the go-to library for JMAP in the Java world. And on the other side, on the server side, we have our own software. It's called OpenXPort, which basically makes it very easy, or is supposed to make it very easy, to add a JMAP API to PHP-based systems.
We already added support for quite a lot of data types, or verticals: files, calendars, contacts, and so on. So it can also be used to lift files that are on a... it's an ongoing project where you can attach a JMAP API to files that are somewhere on a server, and then you can migrate those away. And obviously, we support JSContact and JSCalendar. The RFC for converting between JSContact and vCard already exists, and another one is work in progress for converting between iCalendar and JSCalendar, to make it easy for developers to start with those formats. Yeah, so basically that's what we extended. Right now we have a JMAP API for Nextcloud, Roundcube, the ancient system SquirrelMail, and Horde, which is more or less an ancient system too, I would say. We already use it in large-scale migration projects with a lot of users. Yeah, so let's finish with the last slide; I'm out of time. There's also a JMAP Dart client from Linagora that we are currently extending, and we are building a JMAP CLI around that. Yes, and there are also other specifications that you could read up on. I didn't finish quite in time, I'm sorry for that. Oh, fine, thank you. Looking around, here's one: how many lines of code is your Java JMAP client, and what does it require in direct dependencies? So we might even relay that to the next speaker, I think. Yeah, our client is quite big, actually, but the library that we're using is quite lean, I would say. Now I don't feel bad at all. Any further questions? Otherwise, I think the next speaker may come, which is actually Daniel Gultsch, the author of the aforementioned JMAP Java library and some tools.
[JMAP] Intro to Ltt.rs a JMAP client for Android
It's fine. Anyway, good morning everyone. My name is Daniel. Today I'm going to take a few minutes to tell you a little bit about a JMAP-only client for Android that I've been working on for a while. But first, a few quick notes about myself. I usually work in instant messaging. I'm an XMPP developer. I am on the Council of the XMPP Standards Foundation. I develop an XMPP client for Android called Conversations. And yeah, JMAP is a long-term side project of mine. I checked yesterday: I registered the ltt.rs domain in 2017, and I think I've been working on it for even longer than that; somewhere on my hard drive there's an implementation for the pre-RFC JMAP thing, the old Fastmail draft. And yeah, these days I develop the aforementioned Java library and the Android client, Ltt.rs. So why JMAP? As someone who's starting from scratch, I think you already got the sales pitch for JMAP. You have a sane set of extensions. You can do send and receive over the same protocol. JSON parsers are readily available; you don't have to do whatever IMAP is. On top of that, you don't have to do any MIME parsing. If you ever wrote a MIME parser, you know how much of a relief it is not having to do that. It has built-in push support, so especially if you're targeting the web or modern mobile phone operating systems, it's good to have vendor push. And yeah, essentially just see Ricardo's maybe-omitted slides on how bad or how weird IMAP is, and you pretty much know why I went with JMAP. So, a little bit of the architecture. The way Android applications are developed has changed quite a lot in the last 10 years. Google has released a set of libraries they call Jetpack that make application development a lot easier, and Ltt.rs tries to use a lot of them. For example, there's Room, which is a database abstraction layer where you basically define how your UI displays the information in the database.
And then whenever you write to the database, your UI automatically gets updated, and only those things that have changed. So the way I implemented it is that my JMAP library has a generic storage backend that's then implemented with Room. We write data to Room, and then, magically, our UI gets updated, and we don't have to do anything. And also, because my main job, again, is developing Conversations, which by now is like 10 years old and quite legacy, Ltt.rs is also a sort of playground for me to work with new Android APIs, such as Material You, which is the new design language, or predictive back, things like that. So you already heard that both IMAP and JMAP are essentially cache management protocols, and that allows us to have great offline capabilities in Ltt.rs. So all queries, whether you view a certain mailbox or even do a search, are cached. So if you retry or redo a search when you're offline, you still see all the search results. And all user actions are handled by another Jetpack library called WorkManager, which automatically retries those actions when the user comes back online. Yeah, while the app is in the foreground, we use WebSockets and EventSource to listen for server-side changes and refresh the UI. And when the app is in the background, we have a fully open source web push implementation. We don't actually use the Google Play Services library; we talk directly (our code is open source too) to Firebase, or the Google Play Services, to retrieve a web push URL. You can actually trick Firebase into giving you a web push URL instead of doing the application-server thing that you might be familiar with from other Android apps. But that requires VAPID, voluntary application server identification, which JMAP currently does not support, and I'm in the process of writing an RFC for that. And yeah, because we have native web push, we can also hook in other push implementations that are not bound to Google.
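The WorkManager pattern described here (queue user actions locally, replay them when connectivity returns) can be sketched abstractly. This is an illustration of the idea only, not WorkManager's API, and the action strings are made up.

```python
# Sketch of the offline-action pattern: user actions go into a local
# queue and are flushed to the server only once the device is online.

class ActionQueue:
    def __init__(self):
        self.pending = []   # actions not yet sent
        self.done = []      # actions acknowledged by the "server"

    def enqueue(self, action):
        self.pending.append(action)

    def flush(self, online):
        if not online:
            return          # stay queued; retry later
        while self.pending:
            self.done.append(self.pending.pop(0))

q = ActionQueue()
q.enqueue("mark m1 seen")
q.flush(online=False)   # offline: nothing happens, action stays queued
q.flush(online=True)    # back online: the queued action is replayed
```

Because JMAP responses tell the client how to update its cache, replaying a queued action late is safe: the cache converges once the response arrives.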
For example, UnifiedPush. And the way that works is, for example, that the JMAP server can tell my XMPP server to tell Conversations to wake up Ltt.rs, and then Google is not involved at all, and I can self-host every part of that. We also have native, enabled-by-default Autocrypt support. No plug-in required; it just works. You see a lock icon on your compose screen if the other party supports it too. During account setup, we ask for key import; for previously created setup messages, just refer to the Autocrypt spec on how that works. But server devs, please allow us to search for arbitrary email headers, because we need that to discover the setup message. That's it. Thank you for your attention. You will find the code of the JMAP library and the Android client on Codeberg. If you want, follow me on Mastodon; I'm daniel@gultsch.social. The source code for my slides is also online. Yeah, thank you. Any questions? Thank you. Any questions about Ltt.rs or JMAP? Come on. So you said there's no need for a MIME parser. Is there really never any reason to have a MIME parser yourself? Yeah, I didn't want to put that on the slides, but as soon as you do PGP encryption, you do have to do MIME parsing. That's what it was: oh, damn, now I have to deal with MIME parsing. But the MIME in most PGP messages that I encountered is a lot saner than what you might encounter from wild email servers. So yeah, that's a relief. All right. Any further question? For the push, I want to know: do you use UnifiedPush to receive the notifications, or does it all go over JMAP? Yes. Yeah, so JMAP has built-in web push support, which is an RFC as well. And then you can either speak web push towards Google and let Google relay your messages, or use UnifiedPush. And you best go to unifiedpush.org if you want to learn more about the self-hosted version of UnifiedPush, because that's too complicated a topic for a five-minute Q&A session. All right.
Any further question? Otherwise, thanks again to Daniel. Thank you.
[Servers] Aerogramme, a multi-region IMAP server
Hi everyone. So I will present Aerogramme, which is a multi-region IMAP server, and the goal of this talk is to discuss this multi-region thing. But before starting, some context. My name is Quentin, I have a PhD in distributed systems, and this talk will be a lot about distributed systems, because that's something I know. And I try to work as much as I can for a collective. It's called Deuxfleurs, and we try to build a low-tech, ethical internet. If you want to know more about the things we are doing, there was a talk yesterday about Garage, where the self-hosted, geo-distributed infrastructure we have is presented. Aerogramme is part of the strategy and the project of this collective. And also a very nice thing: it is supported by NLnet, and they are very nice; I have to mention it. So, first, the problem we want to solve. I like to say that with email we want to make communication with other people possible when it would otherwise be impossible due to distance. We can achieve this goal only if the underlying system is working. And so this talk will be about distributed systems, but also about availability and reliability. I have three main ideas that framed the decisions when developing Aerogramme. The first is that we should not trust cloud and hosting providers, because they can fail, and when they fail your service is not working. The second aspect is that we think there is some space, when it comes to IMAP server designs, to study and try new designs, new trade-offs. There is no perfect solution; we don't have a magic solution, but we can try new ways and new designs. And in the third part, I will try to convince you that this new design can work in real life. So first: don't trust your provider. Since the title of this talk is multi-region, I think the first step is to define what a region is when you talk about a cloud or hosting provider. So this is the Google Cloud Platform region Paris.
Its name is europe-west9 and it's made of three data centers. And last April the whole region, all three data centers, was unavailable for three weeks. Not totally, but the outage lasted for three weeks in some parts, and it was due to a fire in one data center. And due to some tight interconnections between the data centers and many pieces of software, the other data centers were unable to work, not due to hardware failure but due to software problems. So: three weeks without email. You can imagine that it could be very hard when you use it for very important stuff like, I don't know, paying taxes and looking for a new job and so on and so forth. So the idea, and it's not new, is that you should move to reliability-first design. You should think about reliability in your service and not rely only on your provider. The book is named Cloud Native Patterns, but we could have named it Distributed Native Patterns, and it has the same kind of example with a region, this time Amazon's in the US. The author of the book studies three services, Netflix, IMDb and Nest, and only Netflix took the effort to deploy multi-region, and it was the only one still working when this one US region was not available. I think it's the secret sauce of Google when it comes to Gmail or Google Search: it works despite data center failure, despite region failure, because they design their services reliability-first. So it's easy to say that we should design our services reliability-first, but in fact it's hard, like many things. And something which makes it hard is that when you are in the same region, latencies are very low, like one or two milliseconds, but when you consider a multi-region deployment — I made a test between Paris and Warsaw in Poland — we jump to 30 or 40 milliseconds. It's not a lot, but when you have distributed protocols, this latency is often amplified, and there were such examples in yesterday's presentations too.
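The latency-amplification point is simple arithmetic: a protocol that needs several sequential round trips multiplies the base RTT. The round-trip counts below are illustrative, not measured.

```python
# Back-of-the-envelope sketch of latency amplification in a distributed
# protocol: total wait is base RTT times the number of sequential
# round trips the protocol needs.

def total_latency_ms(rtt_ms, sequential_round_trips):
    return rtt_ms * sequential_round_trips

same_region = total_latency_ms(2, 4)    # ~2 ms RTT inside one region
multi_region = total_latency_ms(35, 4)  # ~35 ms Paris <-> Warsaw
```

A protocol that feels instant at 2 ms per hop can become user-visible once every hop costs 35 ms, which is why consensus-heavy designs struggle across regions.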
So we know that it's hard, but it's even harder in the context of email systems, and the Apache James documentation summarizes it very well. The hard problem is about monotonic UID generation. If you were here at the beginning of the dev room, UIDs in email were explained. And so they say you have basically two solutions: either you choose weak consistency, and then you risk data loss, or you choose strong consistency, and strong consistency is very sensitive to latency, so it will be very slow. So currently the answer of the Apache James developers is: you should not deploy Apache James, the Cassandra-based part, in a multi-data-center setup; you should pay for consulting. Okay. So if we make a wider review of the existing work — maybe I have missed something, let me know — you have some leader-follower designs, which are for example Cyrus or Dovecot, and you have some consensus or total-order-based designs, like Stalwart, Gmail, Apache James, WildDuck, and so on. This consensus or total order is often outsourced to the database: for example FoundationDB, Cassandra lightweight transactions, or MongoDB. There was also a research project named Pluto, and they tried to design a mailbox server on a CRDT design. It worked very well in a multi-region setup, but the implementation is incomplete, because they do not support monotonic UIDs; they only support sequence identifiers. So yes, it's interesting: if we don't implement the whole IMAP protocol, we can do multi-region way more easily. Our solution: we wanted to implement the full IMAP protocol, and so it's a trade-off. It's not a magical solution, but we decided to live with conflicts. In fact, in IMAP you can have conflicts as long as you detect them and you change a value that is named the UIDVALIDITY. It's not free, it has a downside: it will trigger a full, expensive resynchronization for the clients.
So for example, we see two processes (you can imagine they are two Aerogramme processes), and at the end, for UID 4, the two processes assign the same UID to different emails, and when the other one learns it, there is a conflict. In our implementation, assigning a UID is a log entry, and we have an event log that is not totally ordered but only causally ordered, and we have a proven algorithm to solve conflicts and compute a new UIDVALIDITY. There is a proof in our documentation; if you want to read it or review it, we are interested. And we try to be as clever as possible when we synchronize this event log, to reduce the conflict window. And so you might say we are cheating, because we are changing the problem: we don't try to have monotonic UIDs, but we try, this time, to handle conflicts correctly. And yes, it's true, but I have two arguments. Often people are tweaking Raft and they are doing bad things, and I have two examples. In Kubernetes, an issue was opened like six years ago and it's still open, because they are violating some invariants due to caching on top of Raft for performance reasons. Another one is a post-mortem from GitHub, where they also use Raft, which is a strongly consistent algorithm, and they show that they have done some optimizations that break some invariants of the protocol. And you can reduce the risk of conflicts as much as you can. The most important thing was to have correct solutions. If you want, you can put a multiplexer in front of Aerogramme and redirect the same user to the same server, and so you will reduce even more the risk of having a conflict. So, talk is cheap, show me the mail server. I will be quick on this part, but I've tried the deployment in France, in the Netherlands, and in Poland. You have some screenshots, and you can check the IP addresses there are IMAP servers listening on. And in each region, this is the deployment: it is connected to Postfix through the LMTP protocol.
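The tolerated conflict can be sketched as a merge of two replicas' UID-assignment logs: if the same UID was given to different emails, bump UIDVALIDITY so clients do a full resync. This illustrates the idea only; it is not Aerogramme's actual algorithm, and the ids are invented.

```python
# Sketch of the conflict the talk describes: two replicas each assign
# the next UID (4) to different emails while partitioned. On merge, the
# conflict is detected and UIDVALIDITY is bumped, which forces every
# client to throw away its cached UIDs and resynchronize.

def merge_logs(uidvalidity, *logs):
    assigned = {}        # uid -> email id
    conflict = False
    for log in logs:
        for uid, email_id in log:
            if uid in assigned and assigned[uid] != email_id:
                conflict = True          # same UID, different emails
            else:
                assigned[uid] = email_id
    if conflict:
        uidvalidity += 1  # the expensive but correct escape hatch
    return uidvalidity, conflict

uv, had_conflict = merge_logs(1000,
                              [(4, "email-A")],   # replica 1's assignment
                              [(4, "email-B")])   # replica 2's assignment
```

The trade-off is exactly as stated in the talk: correctness is guaranteed without global consensus, at the price of an occasional full client resynchronization.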
We have implemented LMTP in Aerogramme. And Aerogramme is stateless software, and all the data is managed by Garage, which is in fact doing the magic behind the scenes with its geo-distributed design. Yes. And I have a demo, so I will try to show you. I'm just using something like netcat to connect and show you that there is an Aerogramme server listening behind the domain name. After that, I have configured this IMAP server on my phone, and you can see that I have a mailbox. And now, this is the Gmail web UI, and I will send an email to this server, to this multi-region server. So the email is sent, and now we wait until it's received, both on the phone and on the computer behind. And that's it. So that's the conclusion. We started with three ideas, and this is the answer. Aerogramme is designed from the ground up for reliability; that was the most important thing to us. We decided to tolerate UID conflicts instead of trying to enforce monotonic UIDs, and so we try to handle them correctly and minimize them. And finally, we want to prove that Aerogramme already works in real environments. But Aerogramme is still a technology preview, and it's not yet deployed in production, so be very careful when using it. Don't use it for real workloads. During this year we will deploy it on our infrastructure for real users, and one piece of future work is to do as much user testing as we can, because we don't want to lose important information for people. We also plan to implement CalDAV and CardDAV, and maybe, in the end, envision Aerogramme as groupware. Something that's also important is performance measurement and improvement, and I can say that many design choices we have made will result in Aerogramme using a bit more CPU or memory than your regular email server; you have to take this fact into account too. So thanks for listening, and I can now take questions if you want. Thank you very much.
I see one question over there — the gentleman in red. So first, thank you very much for this design. I've been working on distributed email for quite a bit, and UID generation is part of the story. What is your approach to keeping the IMAP session synchronized — especially the modification-sequence-to-UID mapping, IMAP IDLE, and other things like that — with such a design? OK. So we handle the rest of the synchronization in the IMAP protocol. We have a view that we maintain, and, as I've said, we have an event log. Each Aerogramme session watches the event log that is stored in Garage, and when there is a change, we compute the difference. All right. Further questions? Last call? OK. So. Ah, there's one. Can you say it again shortly: what is Garage exactly, in a few words? Can you say a bit about this? So, we say that Garage is a distributed data store. There is one API that is S3, which we often call object storage. It's like a file system, but with way, way fewer features, which makes efficient distributed deployments possible. Garage is inspired by a research paper from Amazon entitled Dynamo, which describes the design of a key-value store. And Garage has a second API, named K2V, which is very similar to Riak KV — if you know Basho: it was a company, and they don't exist anymore. So Garage is really about replicating your data and making it available. You have this object storage API, but we also have this key-value API, and so it's really the foundation of your data layer. And that's a new way of doing things, I think, and that's what we wanted to prove with Aerogramme: we can design applications a bit differently and use Garage not only for binary blobs, but also as a lightweight database. So, I think I understood from the website that you also encrypt data at rest, but you haven't mentioned that at all. Yes. You're doing it, right? Yes, we are doing it.
It's in the code, and it's a choice — maybe we are keeping it for next year, probably. But sure, yes: all data stored in Garage is encrypted with a key that is derived from your password. So the data stored in Garage is always encrypted, and it is in plain text only in the Aerogramme process memory. But it's not really ready; we still have many things to finish, but we have many ideas about that. All right, thank you very much again. And I think we will head over to the already-mentioned Apache James.
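The data layout described above — big immutable blobs behind Garage's S3-like API, small mutable records behind the K2V key-value API — can be sketched with plain dictionaries standing in for the two APIs. The bucket/key naming here is invented for illustration; it is not Aerogramme's real schema.

```python
# Sketch of the two-API storage pattern the talk describes.
# Dicts stand in for Garage's S3-like blob API and K2V key-value API;
# the key naming is hypothetical.

object_store = {}   # stands in for the S3-like object storage API
kv_store = {}       # stands in for the K2V key-value API

def store_email(user, uid, raw_message):
    blob_key = f"{user}/blobs/{uid}"
    object_store[blob_key] = raw_message      # big immutable blob
    kv_store[(user, "index", uid)] = {        # small mutable metadata
        "blob": blob_key,
        "flags": ["\\Recent"],
    }

def fetch_email(user, uid):
    meta = kv_store[(user, "index", uid)]
    return object_store[meta["blob"]], meta["flags"]

store_email("alice", 1, b"Subject: hi\r\n\r\nhello")
body, flags = fetch_email("alice", 1)
```

The point of the split is that flag updates touch only the tiny K2V record, while the large message body is written once and never rewritten.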
[Servers] Apache James: Modular email server
I am working with Apache James. First, a few words: I'm working at Linagora. Our mission is to promote data sovereignty and especially to give organizations the tools to communicate together without depending on the big tech platforms. We are working on a suite called Twake Workplace, with Twake Mail for e-mail, a chat that relies on Matrix, and also file sharing. As part of this development effort, we were looking, back in the day, for an e-mail server that is easy to scale — at the time we had not yet heard the talk about Aerogramme. We were looking for a modern e-mail protocol — hopefully we had already heard about this standard called the JMAP protocol — and we also needed to be able to do deep integrations inside the mail server. So we started with the protocol. I am sorry, I am a bit frustrated that I did not get to speak about JMAP, so we will take one minute to do so. We started implementing JMAP in Apache James back in 2015, before even the normalization effort started within the IETF. We are big fans of JMAP. We implemented the Twake Mail client in Flutter, and we reused the Dart dependency to write a JMAP CLI, for instance. Basically, we are able to take a mobile team that is not expert at all about e-mail and get them to implement a mail client. The thing works fine, works fast, synchronization is easy; most of the pains of IMAP are lifted. Twake Mail works on multiple platforms — iOS, Android, Web — and it is also used on top of other mail servers, like Stalwart Labs. So, about the mail server itself, because this is a track about mail servers: Apache James is part of the Apache Software Foundation. To my knowledge, it is the only e-mail server that is part of the foundation and has an open governance model.
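Since JMAP only gets a minute in the talk, here is what makes it pleasant for client teams: everything is JSON over a single HTTP endpoint, and one request can chain several method calls. A minimal request body of the kind a JMAP client might POST — querying the ten most recent messages of a mailbox and fetching their subjects in one round trip via a back-reference — looks like this (the mailbox id `"inbox-id"` is a placeholder):

```python
import json

# A minimal JMAP (RFC 8620/8621) request: Email/query feeds its result
# ids into Email/get through a "#ids" back-reference, so both run in a
# single HTTP round trip.
request = {
    "using": [
        "urn:ietf:params:jmap:core",
        "urn:ietf:params:jmap:mail",
    ],
    "methodCalls": [
        ["Email/query",
         {"filter": {"inMailbox": "inbox-id"},   # placeholder mailbox id
          "sort": [{"property": "receivedAt", "isAscending": False}],
          "limit": 10},
         "q"],
        ["Email/get",
         {"#ids": {"resultOf": "q", "name": "Email/query", "path": "/ids"},
          "properties": ["subject", "from", "receivedAt"]},
         "g"],
    ],
}
body = json.dumps(request)
```

Compared with IMAP, there is no connection state to manage and no wire-format parsing for the client: this is a large part of why a non-specialist mobile team can implement a mail client against it.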
It started back in 2003, from the Jakarta Project, so it is kind of a cousin of Tomcat and projects like that. It is surprisingly influential in the Java world. The mailet that I will present later is kind of the servlet of mail — a generic way to write e-mail processing. Some of the important people within the Apache Software Foundation did actually contribute at some point to Apache James. And as for the Netty network library, which is very influential in Java: Norman Maurer is a previous contributor of Apache James. Regarding the overall setup, what I actually recommend is the distributed setup of Apache James, where basically we host metadata in Cassandra, big binaries in S3, distributed search with OpenSearch — there was a little licensing problem with Elasticsearch — and, last but not least, RabbitMQ for messaging, for things like IMAP IDLE and stuff like that. Of course, we orchestrate everything and run it on top of Kubernetes, and we are integrated with metric systems like Grafana. So now let's look inside the code. This is more or less the classical e-mail server architecture: you've got protocols on the left — SMTP, IMAP — which call the mailbox where the mails are being stored, and you submit emails to a mail queue and apply mail processing. What's important to notice here is that you've got green dots — I did not update the slides, but now you've also got a green dot here — which are extension points that allow you to depend on simple interfaces in Java, write Java code in a completely separate project, compile it, embed it into Apache James, and configure it. You have a set of extensions that already exist; you can use James APIs; you can inject your own components; and then basically have your code run inside the mail server, enabled by switching a single line of configuration, without touching the mail server itself. So, sorry, that might be complicated to see from the back of the room.
I did not think about that when I copied and pasted those rectangles. But basically, in the mailet container, you take things from the mail queue, and the overall design is to have mailets — actions — applied conditionally by matchers. So you have two little interfaces that you work with: the matcher represents a condition, and you organize pairs of a matcher and a mailet inside a processor, which is a stream of execution. You have a specific mailet that allows switching to another processor, and a couple of various basic implementations. All of that is defined in XML and is fully customizable. I will give you a little example: a hello-world mailet that is kind enough to look up the language and print hello world based on that. So a mailet gets the mail and applies an action to it. You can modify the mail, you can trigger some external APIs, and so on and so on. All I need is to depend on the mailet API. From there, I compile my project, I get a jar, and I just register it somewhere in my XML configuration, put the jar into the external-jars folder, and go. So it's actually quite powerful, and you can connect the different sets of extensions together. We've been speaking a bit with Daniel about push. We received a contribution lately to have an IMAP extension for push for an iOS application: basically, you are able to plug in a mailbox listener that listens to the mailbox events, register it through an IMAP extension, and you get push working like that. So that's quite powerful. James is written in Java; you have interfaces everywhere — everything has an interface — and we rely on inversion of control with a library called Guice, which means that basically you can assemble your Guice modules the way you want.
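The matcher/mailet/processor model above can be reduced to a small sketch. The real interfaces are Java (the mailet API), so this Python version is conceptual only, with invented names: a matcher is a condition returning the recipients it matched, a mailet is an action on the mail, and a processor runs the pairs in order.

```python
# Conceptual sketch of the mailet-container model described in the talk.
# The real API is Java; these names are illustrative, not James's.

class Mail:
    def __init__(self, sender, recipients, headers):
        self.sender = sender
        self.recipients = recipients
        self.headers = headers

def sender_is(domain):
    """Matcher: returns the matched recipients, or [] for no match."""
    def match(mail):
        return mail.recipients if mail.sender.endswith("@" + domain) else []
    return match

def add_header(name, value):
    """Mailet: an action applied to the mail -- 'the servlet of mail'."""
    def service(mail):
        mail.headers[name] = value
    return service

def run_processor(pairs, mail):
    # A processor is an ordered stream of (matcher, mailet) pairs.
    for matcher, mailet in pairs:
        if matcher(mail):
            mailet(mail)

processor = [
    (sender_is("example.org"), add_header("X-Internal", "yes")),
]
mail = Mail("bob@example.org", ["alice@example.com"], {})
run_processor(processor, mail)
```

In James itself, the equivalent of `processor` lives in the XML configuration, which is what makes the pipeline customizable without touching server code.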
And of course you can reuse existing modules, which means that you can make your own tailor-made server with Apache James. As an example: because we need to follow the Apache way, we need to be in open governance, so at Linagora we decided to clearly split the project. There is Apache James — that's where open standards go, that's where the distributed mailbox is, that's where everything related to modularity and extensibility is — and we reuse that as a framework to bundle our own Twake Mail servers, which have a couple more extensions, things like autocomplete for email addresses and stuff like that that are not part of the JMAP standard. So we reuse it to actually build our product. Here is a very nice contribution that we got back in 2020, to give you an idea of how you could use James. The idea is to validate GPG keys: basically, using the Web Key Directory protocol, I would submit my key to that modified Apache James, which would send me an email encrypted with the public key that I just uploaded. I would reply to that email, which would validate the key and serve it there. It's a proof of concept — it has not been merged into James — but it shows that you can really play and do interesting things with deep integrations. Who is doing POP3? There's one guy in the room doing POP3. POP3 is an awesome protocol, because you don't have UIDs and it's really, really, really simple. So, in France, when you go and see a practitioner, you get a repayment order that is sent to the national healthcare insurance, and that of course transits by email. And every insurer gets a mailbox receiving millions of emails a day. And of course, you need to have the damn thing geo-replicated on three different locations, and so on and so on. With IMAP, the latency would go crazy — at least we don't use Aerogramme. The volume is big.
And of course, they have a very crappy specification of homegrown custom formats that you need to support — that slide doesn't do it justice; it's actually a couple of thousands of lines of code to get all of that fitting into Apache James. The point here is that, when I arrived on the project, they were actually able to write tons of mailets, matchers, listeners, and so on themselves and plug it all together. We were also able to rewrite the storage engine, and we had a somewhat different design in order to live with some Cassandra restrictions on tombstones and on listing millions of emails. Another project we did was to integrate with MSSanté — that's the mailing system for French health practitioners. It has some specific security restrictions attached to it, so we were able to do some specific integrations for that customer too, like uploading received attachments directly into their drive. So basically, we have quite a bunch of extensions and modularity going on in there. And surprisingly, even things like banking applications — that's also email, and it's very specific: they have millions of users with very, very, very tiny mailboxes, it needs to be cheap, and they have custom SOAP APIs to access the messages. That's also the kind of thing that you can do with Apache James. So, I did not cover much of the technical details. I did a hands-on session back in 2019 at the Apache conference in Berlin, so if you are interested in getting more information on the code and watching some hopefully-live coding that did not go too wrong, the talk is online. Thank you very much. Do you have some questions? Thank you very much. Okay. Let's see — a first hand. Thank you. So, are there any pre-existing modules for spam filtering directly with Apache James? You need to speak louder, because I did not understand the middle of the question.
Are there any existing modules for spam filtering that you can use directly with Apache James? So basically, we are integrated with SpamAssassin and Rspamd — especially with Rspamd — and because we have mailbox listeners, we are able to live-train your spam filters based on the way you move messages. So my answer is yes, there are already some integrations. All right. Further questions? Here's somebody. Yeah, I have a question. You were talking about these examples from the health system and from banking, and I'm not sure if I understand it correctly. It looked to me like this is using email as sort of an API, in a certain way — for very specific procedures and processes. If that's somehow right — you may correct me anyway — do you also do special processing of these emails? I mean, is there any special MIME parsing involved? Maybe you can say a few words. So, first, your understanding is correct: Apache James is very modular, and of course it works as a regular email server, but you can use it for all the various corner cases that could be hard to handle with other technologies. Regarding MIME parsing: I'm also the maintainer of the Apache Mime4j parsing library, so of course you can do some pretty complicated MIME parsing within Apache James. Does it play a role in these use cases, the medical or banking ones? Yes. All right, let's see — two more hands. Maybe first the other guy, and then you. Yes, related to the previous question: are the emails handled by the healthcare system encrypted? So, they are encrypted, and it is mostly transparent to the work that we are doing with Apache James for them. Okay, so is this transport-encrypted or payload-encrypted?
It depends, but there's a lot of things going on with S/MIME. Oh, okay, thanks. Have you seen any mailets created in programming languages like Scala, Groovy, Clojure — those ones based on Java? So yes, we have a couple of examples of Scala mailets. We use Scala in some parts of Apache James; for example, the JMAP stack is completely written in Scala — so yes. All right, we would still have time for a quick question if there is any. One here. Oh, sorry, I didn't — ah, sorry. Yes, okay, a misunderstanding of mine. You mentioned POP3, which is very nice, but I suppose you have IMAP as well. Is it ready for standard IMAP usage, or do I have to —? Sorry, it's a misunderstanding. POP3 is a horrible protocol, but for that one given use case — needing a highly available protocol that can be multi-datacenter — it's so simple that it fits the bill. Okay, and IMAP is separate? We support IMAP, with a big range of IMAP extensions; IMAP is fully supported, and we also implement JMAP as a protocol, so a very wide range of protocols is implemented. Okay, fine. Thank you, and thank you again, also Benoit. I hope I didn't miss anything. Thank you. And yeah, we have one more talk in the servers session, which will be Mechiel about Mox.
[Servers] Mox: a modern full-featured mail server
So, good afternoon. My name is Mechiel Lukkien. I'm a freelance software developer from the Netherlands. Last year, here at FOSDEM, I first announced Mox, a modern, secure, all-in-one e-mail server. As you may know, running your own mail server has a bit of a reputation for being hard to do — but what if I told you that running a modern mail server can be easy? All right. So, thank you. The goal of Mox is to make it really easy to run your own mail server, so that you actually do it, and then you can stay in control of your data and help keep e-mail decentralized. Now, Mox is an entirely new implementation, written in Go. That's a lot of work, and you might ask: why would you do that? Because we have so many open source components that you can just use — and that's true. For the past decade, I've put many of those components to good use. But a few years ago I had to reinstall my machine, so I got a completely new one, and I just felt a bit reluctant to install the same software again that I'd been using for the past decade — for at least two reasons. One is C, the language where small mistakes have big consequences. Don't get me wrong, I like C as well — maybe in the past. The software written in C is very high quality, but I wanted a new machine that would last for another decade, and I think C is not really going to be part of that at some point. But the bigger problem is basically the complexity. Over time, as e-mail has grown, new protocols and new extensions have been added, and new software components have been added as well. So, to make a fully modern e-mail system, you need many components and you have to make them all work together. I think many self-hosters, at least, stop halfway, so they have a semi-modern e-mail setup. You can make it easier to get all this configured and set up with a distribution or a Docker image or something, but you still have all these components working together.
There are many integration points, a bit of friction, some data loss. Sometimes there are security issues when, you know, message headers added by some component are seen as authoritative. So, I think what happened is that with all this complexity, some people just stopped running their own mail servers, because it was too much work, and they migrated to the cloud, centralizing e-mail — and that's not a great development. So, what we need is an easy-to-use mail server, and that needs quite a set of features. Mox tries to deliver many of them: IMAP4 for reading your e-mail; SMTP for sending and receiving e-mail; SPF, DKIM, and DMARC for message authentication, because just SMTP is not enough — but that's also not enough. You need TLS, of course, for encrypting your communications, but TLS for SMTP delivery between servers is unverified. So you want MTA-STS and DANE to check that you're talking to the right machine; Mox implements both, for incoming and outgoing e-mail. Then there's ACME for the management of TLS certificates — you want to make it easy, with no manual TLS fiddling. Junk filtering is part of Mox: based on historic messages and their junk/non-junk classifications, Mox will reject or accept incoming mail — more about that in a moment. Then internationalization, so you can have Unicode in your e-mail addresses and your headers, both in your domains, with IDN, and in your local parts. Autoconfiguration, in its various flavors, is all supported by Mox, to make it easy for mail clients to find the right server settings for new accounts. Then we've got a webmail included in Mox — we'll have a quick look at that in a moment as well — and an admin web interface. All configuration is in files if you want the full power, but you can use the admin interface to quickly navigate and make some changes, like adding or removing an e-mail address, an account, or a domain.
A web server is included. It may sound a bit crazy, over the top, but modern e-mail basically requires an HTTP stack — MTA-STS, autoconfig, JMAP soon — so it's already part of the deal. What I've noticed is people trying to run Mox and a web server on the same machine; that's really annoying, because configuration gets complicated. Instead, I just added some web server functionality to Mox, for static file serving and reverse proxying, so that problem is also solved. Prometheus metrics and structured logging make operations a bit easier. Then the Mox quickstart — that makes all this stuff easy to do. Installing Mox: you take a new machine, you've got a domain, you run the quickstart, and you pass it an e-mail address at your new domain. The quickstart will generate a configuration file, DKIM keys, etc., create a new account, and print all the DNS records that you copy-paste into your zone file — or you have to manually enter them in the web interface of your DNS operator; that's not so great. On Linux, the quickstart also generates a systemd unit file, so you just enable that and start it. And then you've got a fully working modern e-mail system. All of this is MIT licensed, so you can do whatever you want, basically. Then, as developers, a little bit about the code. As I said, it's a new codebase — a modern, coherent codebase, all of it in the same style. It's very self-contained, with few dependencies. It's about 73,000 lines of Go and 21,000 lines of tests — mostly unit tests, a bit of integration tests, and some fuzzing tests. There are 11,000 lines of TypeScript — very strict TypeScript — for the webmail and the admin interfaces. The code is cross-referenced with the RFCs to make it, not easy, but somewhat more maintainable: you can look back and see why you did certain things.
Of course, Mox is written in Go, so it brings a whole bunch of advantages, like memory safety and standalone binaries — completely statically linked, also including a few assets, so it's really just one file that you need. Fast compilation times, great for developers. Dependency management is pretty much solved in Go. You get reproducible builds out of the box, and that also works with cross-compilation, which is trivial in Go. Now, there's not much to see about a server, but there is a webmail that I can show you. It's not pretty, but it looks mostly like a standard email client, I think: mailboxes, message list, message view. Let's open up a mailing list — there's some threading in there. You can select multiple messages; I'm using keyboard shortcuts as well. Mark some messages unread, and mark them read. Then there's HTML support, with or without external resources and tracking pixels. Then there's a little example of Unicode addresses. The search is easy to use; we've got some quick filters on that side. We could send a message, but I'm sending one from another mail client that should be arriving — there it is. Select some text to quote, as civilized people do, and send a response. That's the webmail. It's not pretty, but it mostly works for my needs of sending and reading email. Then, I would like to say many things about lots of features, but I'll limit myself to one thing: spam filtering in Mox. The analysis of incoming messages is based on historic messages in an account — on their junk and non-junk flags. It's always per account: whatever one account does has no influence on how an incoming message is handled for another account. Of course, this means that for this to work, you need to have the proper flags on all the messages, or on as many messages as possible. Email clients don't always help with this, but Mox does, because in the default setup you get an account where messages moved to the Junk mailbox get the junk flag.
If you move something to an Archive mailbox, it automatically gets the non-junk flag; likewise if you move it to the Trash mailbox. Also, if you're in the webmail and have a message open for five seconds, that's probably long enough for it not to be junk, so it also gets the non-junk flag. That means most of the messages in the store will have these flags set properly. There's a difference in how Mox handles known senders versus first-time senders. Known senders are recognized from the sender address, or just the domain of the sender address — maybe it's another person at the same company — or we look at SPF or DKIM signals in the message, or at the IP address of the remote server, or at various subnets of that IP address. If there are recent historic messages from that same sender, we look at the junk and non-junk classifications of those messages: if the recent ones were junk, we reject the message, and otherwise we accept it. But if it's a first-time sender, we don't know enough about that sender, so of course we do something else. Bayesian analysis is also part of Mox. It's essentially a reputation of words: you look at the words in the message, then you look at historic messages and their words and their junk and non-junk classifications. If there are too many spammy words in the message, you reject; if there are enough hammy words, you accept. Then you can also configure a DNS blocklist in Mox, but it's off by default, for a few reasons. One: these DNS blocklists are often centralized services — we don't want to rely so much on them, and you would be sending the remote IPs of those you communicate with to some central party, which is also not great. And we don't want to break existing email flows, which is also one of the reasons why it's only applied to first-time senders.
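The word-reputation idea above can be sketched as a toy naive-Bayes classifier. This is only in the spirit of what the talk describes — per-account training from junk/non-junk flags — and is not Mox's actual implementation; the training data and threshold are invented.

```python
import math
from collections import Counter

# Toy Bayesian word-reputation filter, in the spirit of the talk's
# description. NOT Mox's real code; data and names are illustrative.

def train(messages):
    """messages: list of (set_of_words, is_junk) from flagged history."""
    junk, ham = Counter(), Counter()
    for words, is_junk in messages:
        (junk if is_junk else ham).update(words)
    return junk, ham

def spam_score(words, junk, ham):
    """Log-odds that the message is junk; > 0 leans spam."""
    score = 0.0
    for w in words:
        # Laplace smoothing so unseen words don't blow up the ratio.
        p_junk = (junk[w] + 1) / (sum(junk.values()) + 2)
        p_ham = (ham[w] + 1) / (sum(ham.values()) + 2)
        score += math.log(p_junk / p_ham)
    return score

# "Training set" built from an account's junk / non-junk flags:
junk, ham = train([
    ({"win", "prize", "now"}, True),
    ({"cheap", "prize"}, True),
    ({"meeting", "agenda"}, False),
    ({"lunch", "agenda"}, False),
])
spammy = spam_score({"win", "prize"}, junk, ham)   # positive: reject
hammy = spam_score({"meeting", "lunch"}, junk, ham)  # negative: accept
```

Because training is per account, the same word can be spammy for one user and hammy for another, which is exactly why the flags have to be maintained per mailbox.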
So, if you've been communicating with someone for a long time and suddenly someone puts their mail server on a blocklist, you can keep communicating with them without anything breaking. Only if that person really starts spamming you all of a sudden do you mark a few messages as junk, and then, in the future, the mail filter will just adjust. Now, Mox being an all-in-one mail server really helps with this, because during the SMTP transaction all this historic data — the messages, flags, and words — is available for analysis. Then there is special handling for messages from mailing lists and forwards: essentially, most of the analysis is disabled and DMARC policies are not enforced. Now, what do you do with an incoming junk message once it's classified? Well, one does not simply deliver it to the spam mailbox. That's not friendly for users — neither for recipients nor for senders, I think — because the sender thinks that the message has been seen and doesn't get a reply, and the recipient may be expecting some message and doesn't get it, so they wait, or they constantly check both the inbox and the spam box. I think it erodes trust in email. I understand that it's done so as to not give spammers feedback about their spam runs, but users should come first. So instead, Mox rejects the message at the SMTP level, while it's coming in, with a temporary error code and a very generic message. The generic message means that the spammer doesn't know for sure why it's being rejected, and the temporary error code causes the sending server to try again a few times and, at some point, tell the original sender that the message cannot be delivered — and then they know they can find another way to communicate. So you no longer have this problem of lost messages in the spam box. But, just like with the spam mailbox, Mox has kind of the same thing, but different: the Rejects mailbox. Anything that's rejected is still stored in this special mailbox.
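The reject-at-SMTP-time behaviour described above can be sketched as follows. This mirrors the talk's description (temporary code, generic text, copy kept in a bounded Rejects mailbox, retries deduplicated by Message-ID), not Mox's actual code; the constant and names are invented.

```python
# Sketch of rejecting junk during the SMTP transaction, as described in
# the talk. Illustrative only: REJECTS_MAX and all names are made up.

REJECTS_MAX = 3        # fixed-size Rejects mailbox: oldest entries fall off

rejects = []           # list of (message_id, raw) in arrival order
seen_ids = set()

def handle_incoming(message_id, raw, looks_like_junk):
    if not looks_like_junk:
        return "250 2.0.0 ok"                # accepted, delivered normally
    if message_id not in seen_ids:           # dedup retries of the same mail
        seen_ids.add(message_id)
        rejects.append((message_id, raw))
        del rejects[:-REJECTS_MAX]           # keep the mailbox bounded
    # Temporary code + deliberately generic text: the sending server will
    # retry, then report failure to the original sender, so nothing is
    # silently lost in a spam folder.
    return "451 4.7.0 temporary failure, try again later"

r1 = handle_incoming("<a@spam>", b"...", True)
r2 = handle_incoming("<a@spam>", b"...", True)   # retry: still rejected
r3 = handle_incoming("<b@ok>", b"...", False)
```

The key property is the one the talk stresses: because the sender receives a definite failure, the recipient does not have to keep polling the Rejects mailbox the way they would a spam folder.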
It's a fixed-size mailbox; old messages are automatically removed. So say you're waiting for some kind of transactional email — maybe you signed up to a website — and it's not coming in. Then you can check the Rejects mailbox for the message, because maybe the sending website used infrastructure with a bad reputation. I can just move that message from the Rejects mailbox to the inbox, mark it as non-junk, and the next time, because of the history-based filtering, messages from that sender will be accepted. The important point is that you don't have to keep checking the Rejects mailbox, because the sender knows you didn't get the message — and that's different from the spam mailbox. This seems like a good approach to me, but if you have ideas on how to improve on this, let me know. Then, a bit about the roadmap — there's still a lot to do in Mox. I want to implement a simple HTTP-based API for sending messages and also receiving some feedback, just so web apps, for example, can send some emails with a simple call. If you know of any standardized ways of doing this, let me know. Yeah, okay — I said simple, really the dumbest thing, but I guess maybe it can be that simple. Then I want to add calendaring. It's not email, but users — myself included — expect it to come with email. I need some more SMTP and IMAP extensions, and JMAP will be coming at some point. So far, I have focused on IMAP, because all my mail clients were using IMAP and I wanted to have a working mail system — but, you know, JMAP will be coming. I want to encrypt all data at rest; that's not currently done. I want to be able to have a second Mox as a backup MX and a backup instance. In order to do junk filtering on the second instance, I will need all the data as well — the historic messages — so I want to synchronize everything to the other one. And once all the data is there, you can also use it as a failover machine.
So that will be nice. Forwarding to external addresses is not yet done, because it gets complicated quickly — I think modern email is not really set up for that anymore; Mox has a different way of applying rules to incoming messages. Then there's lots more on the list — too much for today. So, final slide. It's been a year since I first put out the Mox code, and I've gotten quite a lot of feedback. So thanks to everyone who sent in bug reports, made feature requests, or sent in patches — very helpful. Also thanks to NLnet: they've been funding continued development of Mox since August last year, and that's been instrumental to being able to keep working on this. Also thanks to everyone who wrote all those RFCs about email — they're excellent, and they match practice quite often. So, my call to action today: if you're not doing so already, start running your own mail server — stay in control of your data and keep email decentralized. You have many options ready, and now there's just another one, called Mox. So give it a try. Send me an email — it's a great way to communicate. Thank you. Thanks. Oh, I saw you first. You only have three minutes. First of all, I think it's a quite incredible project for one person, and I was wondering: how many third-party libraries do you use, and how much of the code did you write directly to implement all this? Yes. So, I think the main external library is called bbolt, which is a fork of BoltDB. The messages are stored in files; the database layer is built on bbolt, and it's pretty much a key-value store. But anyway, that's the main external dependency. There's something for Prometheus, and then there are a few dependencies that I wrote myself, so those are not really all that external. Otherwise it's mostly the Go standard library and the extended Go standard library — so, very few external things. So yeah, it feels a bit like not-invented-here syndrome.
So, I wanted to rewrite everything. But it has been very instrumental, because sometimes I've made sweeping changes, and there's no one I have to make pull requests to or try to convince to do something that suits my needs — so I can do whatever I want. It has really sped up development, I think. Fantastic project. I have a quick question regarding the database. I don't know if it was answered already, because I heard about the database — whether the data is in a sort of database, or could be changed, and whatever it is, could we use normal Unix tools to just go through it? No. No, you cannot use normal Unix tools. What I really don't want is, say, a maildir that someone else also makes changes to, because then I would have to do lots of work to make sure that I notice and synchronize those changes as well. So I've chosen a simple approach: messages are just stored individually in the file system at the moment, and there's one database per account that holds the index for all the messages in that account, and which also stores the message flags, etc. So the database is essential, basically, for all the history and all the data. I could talk for a long time about the database library, but — okay, a quick one. What is your experience with scaling this up? How many users does a Mox instance handle? I've not tried. You caught me there, because of the per-user Bayesian filtering — I have no idea where the limitations are. I would like to try to see where it breaks, but I don't know at the moment. I've only run it small-scale, really targeting self-hosting and the like, not tens of thousands of users or something. So, I see many hands, which is great. We have a little more time, since we have the switch of sessions anyway. When people leave in the meantime, maybe be silent so we can use the time for a few more questions — let's try to see how many we can get. I didn't see the order, so forgive me. Thank you. Do you have any plans for LMTP support?
No. But why would you use it? Why would you need it? I'm writing a small... Oh, you need the microphone. You need the microphone. Sorry. I'm writing a small Mandrill clone, now that they shut down, and for that I need to be able to put an email message into the server. Yeah, okay. So maybe a better solution would be to put it in the Go code and make like a fork or something. From what I've seen, LMTP is almost like SMTP, it just has this improvement of getting reply codes per recipient... It's just simpler. It's lightweight. It's just a dumbed-down version for mail drops. Did I get that right? You reject mails but still deliver them to the rejects mailbox. Yes. Whoa. Wow. Scary. Yes. About the reject: so I think that's basically like greylisting, if you are... It's basically like greylisting, except — will you continue to reject them, or do you do anything special if they come back? Yeah, so if they come back with the same message, I deduplicate based on the Message-ID, or the hash of the entire message if there's no Message-ID. But they will still be considered rejected? Yeah, it will still be rejected. Yeah, yeah. And does this interact with the junk/not-junk flag from Thunderbird and other IMAP clients? Well, so there are the flags $Junk and $NotJunk, and as far as I can see, Thunderbird sets them without the dollar. So it's not useful. But it would also interfere, I guess, because Thunderbird does it on the client side — it would work, but it's kind of duplicated then. So I disabled it. I disabled the automatic classification in my Thunderbird setup and I just let the server basically do it itself. So I now don't get a lot of junk. The filtering is okay. I still get a few, perhaps one a day, and I just junk it and then it's okay. Okay. Thank you for your questions. There's still a Matrix chat. Mechiel will be around. Thank you, Mechiel. Thank you.
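The deduplication rule described above — keyed on the Message-ID header, falling back to a hash of the entire message — can be sketched roughly as follows. This is a sketch in Go (the language mox is written in), with illustrative function names rather than mox's actual API:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/mail"
)

// dedupKey returns a stable key for an incoming message: the Message-ID
// header if present, otherwise a SHA-256 hash of the entire raw message.
func dedupKey(raw []byte) string {
	if msg, err := mail.ReadMessage(bytes.NewReader(raw)); err == nil {
		if id := msg.Header.Get("Message-Id"); id != "" {
			return "msgid:" + id
		}
	}
	sum := sha256.Sum256(raw)
	return "hash:" + hex.EncodeToString(sum[:])
}

func main() {
	seen := map[string]bool{} // keys of messages we have rejected before
	msgs := [][]byte{
		[]byte("Message-Id: <abc@example.org>\r\nSubject: hi\r\n\r\nbody\r\n"),
		[]byte("Message-Id: <abc@example.org>\r\nSubject: hi\r\n\r\nbody\r\n"),
		[]byte("Subject: no id\r\n\r\nbody\r\n"),
	}
	for _, raw := range msgs {
		k := dedupKey(raw)
		fmt.Println(k, "seen before:", seen[k])
		seen[k] = true
	}
}
```

A redelivery of the same message maps to the same key and can be kept marked as rejected, which is what distinguishes this from plain greylisting.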
[Clients] Introduction to Thunderbird for Android
All right. Welcome everybody to a new round of the Modern Email devroom. And now it gets a little more user-friendly, more user experience, because we are in the email client session. And yeah, we start with a very interesting new development. Many of you might know there was K-9 Mail, which might turn into, or will turn into, Thunderbird for Android. And we are happy to have K-9's main developer — I think I can say that — here, and he will give us some first-hand insights. The stage is yours. Thank you. Yeah. So half of the talk was that. Just kidding. My name is cketti. I will tell you a little bit about Thunderbird for Android. First we'll start with a little bit of history, and it's about K-9 Mail. Like you mentioned, it will eventually be renamed to Thunderbird for Android. Our journey starts in 2008, when the first Android version was released. Jesse Vincent bought an Android device and tried to connect to his self-hosted mail server. Back then that was more common than it is now, I guess. And it wasn't working. That's because the email app that shipped with Android wasn't really great. And he figured, it's part of the Android Open Source Project, so he'll just fix it and it will work. He did fix it, but then he found out you can't just install an update to a system app. So, back to the drawing board: extract the code of the email app from the AOSP source tree, build it as a separate app, give it a different name. And then it was working, and he figured, if he made all this work, other people might like to use it as well. So he uploaded it to the Android Market — that was the name of the app store back then — and he released the source on Google Code, which was a thing back then. And since it was really early days, most Android users were nerds like us, many of them developers. A lot of them realized that the email client that shipped with Android was really crap.
And so a lot of people ended up finding K-9 Mail, hoping the bugs were fixed there. Most of the time, because it was forked from the original email app, the bugs were still there, but at least they could fix them easily. A lot of us found K-9 Mail that way, myself as well. I joined in 2010, or late 2009, depending on how you count. And because neither K-9 Mail nor the email app were working with my provider, I had to fix K-9 Mail. Unfortunately, we don't have a lot of time, so I can't talk about all the awesome people that contributed bug fixes and features, even in the early days, to make K-9 Mail as popular as it is. We'll have to skip forward a little bit to events that are relevant for Thunderbird for Android. The next one is that Jesse made me the project lead, because I kept fixing bugs, even ones I wasn't affected by. In the end, I was doing releases and stuff like that — so basically everything a maintainer does, as Jesse went off to start a startup. Doing keyboards. Fast forward a couple more years. I was contacted by Ryan Sipes from Thunderbird. And he was like, we have lots of users, but they also use mobile devices, and they want clients for that — so Android and iOS. And he was talking to lots of people, trying to find out how they could do that. How can there be a Thunderbird for Android, a Thunderbird for iOS? One of their ideas was to use one of those cross-platform frameworks where you can write JavaScript — because Thunderbird on desktop also has a bit of JavaScript — so you only have to write the code once. I was like, I have no experience with that, but it sounds like a horrible idea. I offered to ask my friends in mobile development that have used those frameworks before, and everyone was like, yeah, that's nice if you have super simple apps that maybe do some REST calls and display some data from a web server. But for everything that goes beyond that, you probably don't want to use it, especially if you're trying to write an email client.
So I told them that, and Ryan went off and talked to other people, trying to find out how to do it. What I took away from the conversation was that Thunderbird was asking for donations and funding their development that way. I was like, I can probably do that as well, right? So I wrote a blog post: What's up with K-9 Mail? At that time, it was a difficult period for K-9 Mail. The last stable release was September 2018 — that's one and a half years before the blog post. And the next stable release was maybe another one and a half years off, because we were doing a big UI rewrite. The Android platform had changed underneath us; we had to do a lot of catching up so we were able to run on modern Android versions. I wrote a blog post outlining all of this and asking people for donations. And that kind of worked, but not really. I mean, maybe you have tried it for your own project: if you just write a blog post that nobody reads, you don't get a lot of money. At the end of that year, I ended up with not even 6,000 euros, which is nice for a hobby project — you can probably buy a new laptop — but you can't pay rent with that. Nevertheless, I tried again the next year, for I Love Free Software Day — or Valentine's Day for regular people, I guess. Wrote a blog post: K-9 Mail is looking for funding. So, not making it about the stuff that we can't do, like new releases; lead with asking for money, and then basically still outline all this stuff — like, we can't do releases because we still have to do a couple of things. That one spread really widely. And in February alone, over 18,000 euros in donations came in, which was very nice. I figured, if that continues, I'm getting rich. Spoiler alert: it didn't. Donations went down, but at the end of the year it was still 51,000 euros, which is not quite a salary for a seasoned senior developer, but it's enough to live on. And I was talking to the Thunderbird people on and off during that period.
At some point, they were like, maybe we can just fork K-9 Mail. I'm like, you could, but I mean, it's not in a great state. If you really want to, I can help you a little bit, but I'm not sure you really want to do that. And then at the end of that year, I was contacted by Ryan again, and he was like, we have to do this now. We need a Thunderbird for Android client. How about we just use K-9 Mail, rename it to Thunderbird on Android, and be done with it? And I'm like, okay, this asking-for-donations thing is nice, but really, donations went down, and you have to constantly remind users to give you money. And if you're a maintainer of an open source project, that's just one more task on top of the huge list of tasks you do anyway. And I'm like, okay, if someone else could do that, and I could just work on the project, that would be nice. So I asked Jesse if he'd be fine with his old project becoming Thunderbird for Android, and he's like, yeah, sure, go ahead. And that was basically the start. Still, it took a couple more months until we actually announced that K-9 Mail will be joining the Thunderbird family. And I was basically hired by the company that pays developers to do Thunderbird development — so, I guess, the first full-time employee working on K-9 Mail. And the idea was that K-9 Mail will be renamed to Thunderbird for Android eventually, because we wanted to work on some features that Thunderbird for Android should really have by the time it is released with the Thunderbird stamp of approval. So the next thing we did is we hired a second Android developer, because two people get more work done than one. And we started informing the community about our progress. And if you've read those blog posts, you will know that we haven't released Thunderbird for Android yet, even though the plan was to do it by mid-2023. That kind of didn't work out.
And then we figured, okay, maybe do it by the end of the year, cut some features — and then we decided, no, we don't really want to cut features, we want those in there. So, well, there's no Thunderbird for Android yet. You will ask yourself, well, when is it going to be released? And if you were hoping I will say now: I'm sorry to disappoint you. The answer is very open source — when it's done. Like I mentioned, we want to get the features in there. And you will ask yourself, well, what are those features? Roadmap, asterisk — so that's the current plan. I mean, we've changed it before, we might change it in the future. That's the plan for now. The new account setup we've been working on for what feels like forever, but it's almost done now. The latest beta will probably be the last one, or the penultimate one, before a new stable release. If you're on the beta channel on Google Play, you can already use the new account setup. Material Design 3 is not something I would have chosen, but our users really like new shiny stuff, so we put it on the list. Improved folder management. Conversation view is what modern email clients need nowadays. Thunderbird Sync — that's also part of the asterisk thing. It's something we really, really want, but there are a lot of technical problems, or open questions, on the infrastructure side, and also on the client implementation side. But the basic idea is to sync settings between instances of Thunderbird, be it the mobile version or the desktop version. Then, polish: we have existing functionality that needs a bit of tweaking to make it more user-friendly. And of course, Android keeps changing stuff, so that's also something on the list — I guess this year Android 15 will be released. There have been no new APIs announced yet; I think that starts in March, but who knows when we'll actually get to the release. All right, what about K-9 Mail? A lot of users have mentioned that they really like the brand and the icon and stuff like that.
And so we decided, well, we'll keep it around. We wanted to change the application ID — the identifier Google Play uses — anyway, to something more Thunderbird-like. So Thunderbird will be a separate app, and we will just keep K-9 Mail around. Of course, we don't want to maintain two code bases that then diverge, so we will build two apps from one code base. Hopefully — we haven't started on that work yet. And the difference is really meant to just be visual stuff, so icon, name, and the theming. Yeah, and since we are now in the client section, we can also have screenshots. That's something the server people can't really do, right? So, yeah, it's not too large of a screen. It's also really boring. It looks like every other email client, basically: you have a message list that contains a list of messages. If you tap one, you get a message view, which displays the message contents. What could use a little bit more love in K-9 Mail is the compose screen, but still, it works for simple messages — hopefully, in the future, also for more complex stuff. Then, in the first screen, if you tap the hamburger icon — not to be confused with a kebab menu — you get an account switcher at the top, and then a list of folders. Improving this to make it look nicer is also on the list of folder management stuff. Right, and since we're an open source project, if you really want to, you can contribute. The slides are on the FOSDEM website, so you don't have to type down the links. We are hosted on GitHub, so we are not doing the Thunderbird thing, like using Mercurial, using Bugzilla, and stuff like that. Translations are on Weblate. We have a ton of them, but could really use help for some of the more obscure ones. One of the blog posts goes into details on which ones need help. We have a support forum where mostly users help each other, which is really nice, so we don't have to do a lot of stuff.
But also, we developers monitor that, to get an idea of what users have problems with and fix it. We also have user documentation, which is not very often found these days, I find. It's also very outdated — it turns out maintaining user documentation is work. So if you want to help out the project and you're not good with code, maybe you're good with words and screenshots, so you could help out with documentation. All right, the one-minute sign goes up, and I'm also done. Thunderbird has a stand here in Building K, level one. If you want to talk to some of the desktop people, they are probably also here. Maybe hands up — who's working on Thunderbird for desktop? Well, Kai is here. He's on the floor, so you probably don't see him, but there are also some people at the stand. Right, and with that, I'm happy to answer questions if you have any. Thank you very much. Sorry, don't ask for the release date — he probably won't answer that one. Do we have any questions? You need to help me, probably — there are so many people, I don't see. Ah! So, how is the funding going, basically? Where is the money coming from? Oh, I see. There was a talk about that; you can probably find the video recording on the FOSDEM website. Ryan talked about how Thunderbird is making money. The summary is: we are asking users for donations, and that has worked out really well. Last year we made over 8 million — I don't have the exact number — so, a lot of money. That funds all of Thunderbird: the desktop app, mobile, and some other projects we're working on. The plan is to have 45 people working by the end of the year. Any further questions? While people are thinking: you were mentioning the iOS topic in between, but somehow that got lost in the rest of the presentation. Can you say anything about that? The first idea was, maybe there's an existing client — so, the same story as with Android — and we could do that, but we didn't find any.
So the idea is we'll start from scratch, and we're currently looking for someone to start that off. So it will come eventually. Which year? Probably not this one. You know — when it's done. Any further questions? Thank you. I have just a quick question regarding the forum and the issue tracker. Will you keep them around? Thank you. The issue tracker, yes. I mean, people report bugs, yes. And we don't want to switch to Bugzilla. No offense to people that have to work with that, or even like it, but no. The support forum — maybe. It depends on, you know, how it works out. If there are a lot more users but not more volunteers helping out, like answering questions and moderating, it probably goes away, but for now there are no plans to abolish it. All right. Last chance. I see no more hands. So thanks again for that nice presentation. Thank you.
[Clients] Taking care of Roundcube Webmail - current status and future prospects
Welcome, my name is Anna and I've been with Nextcloud since 2020. I focus primarily on backend development and I am responsible for the Roundcube maintenance at the moment. I'm also on the security team at Nextcloud — a bit more about that later. So, first things first: this is the question we have gotten most in all of the help forums, blog posts and everywhere. No, we won't merge Roundcube and Nextcloud Mail. Both products will stay independent, as they have been, and they will receive independent development and independent loving care. So don't worry about that one. Yeah, let's get into the development aspect of things. We have hired a dedicated engineer for Roundcube — a person that will be responsible for the maintenance, for issues on GitHub, for contributions on GitHub. The thing is, the project itself hasn't gotten that much love. There are like 50 open PRs and like 300 issues at the moment which haven't been — I'm not saying not triaged, but it's hard for the community contributors to look into everything, of course. I mean, it's not their main job, and we appreciate what they do — so that's what we want to take care of. What we also want to do is regular security and bug fix releases. This is really, really the main focus at the moment: to get us up to date on security stuff, up to date on bug fix releases. There is one person who has been doing a lot of development for Roundcube, and that is Alexander. He has been doing most of the feature development for Roundcube at the moment, but he is not working for Roundcube; he's working for somebody else. So we want to help him get feature development done and do feature releases in tandem with him. We really want to make sure that we're not edging out any contributors. We really, really, really appreciate what they're doing for the project.
So please don't worry if you're a contributor or somebody who wants to contribute; we really, really would love for you to put more energy into this project. I know a lot of you love Roundcube and have been using Roundcube, so let us know what you think, let us know what features you want to see on GitHub, and we promise we will take care of them and look into them, and actually give you a response on GitHub as well and not just leave it out there in the open. Yeah, as I said, community-driven development is always appreciated — as with every open source project, I'm sure you all have the same kind of thing there. Yeah: more care, more features, more love for Roundcube, because it is an amazing product, it is really cool, and, I mean, it's been around forever, so let's keep it going. Another thing that's changed is how we handle security issues. Since I'm part of the security team at Nextcloud, we already have an existing process for this, so we're using HackerOne. We haven't discussed yet if we're going to pay a bounty for this, but it is a possibility that we will actually pay you to pen-test Roundcube, and the advisories will be published on GitHub in the future, because right now there is no established mechanism for this, so you don't really find security issues all in one place, with CVEs and everything. Yeah, that's pretty much everything from me on Roundcube. I still have two minutes to go — that is a very short presentation — so yeah, let me tell you a little bit about how it feels to take over this project. Actually, it's really scary, because I know a lot of people love the project and don't want to see it in a drawer somewhere, unmaintained. As a developer, it's also a challenge to get into a new code base, obviously, because we have different coding standards at Nextcloud than Roundcube has. There are different expectations of how the community works with us, or we work with the community.
Of course, there's implementing the email standards, which is not easy, as everyone knows. There's IMAP, which is an old protocol, and it has its challenges, but it also has its cool stuff. It's a challenge, it's an exciting time, it's a scary time, and I'm really looking forward to working more with the project. Yeah, let me know your questions. That's basically it. That's me done. I see you have something. I cannot decide who was first, so I'll start here, because you're just closest. I noticed, as the developer of SnappyMail, that more people are integrating SnappyMail into Nextcloud because of the slowness of the Nextcloud Mail app. They also want Roundcube — yes, there is a Roundcube app. Will there be better integration with Nextcloud? We haven't discussed it yet. I really can't tell you; that is for project management to decide. Me personally, I have worked on the Mail app and I am partial to the Mail app, because it has seen a lot of blood and tears as well from the developers. But yeah, there is a Roundcube app for Nextcloud, and as the code base of Roundcube improves, the probability that the Roundcube app for Nextcloud is going to get better is very high. Does that answer your question? Okay. Okay, so this question we solved here. Are there any more questions? Yes? Sorry, I think that... Have you already gathered some experience with HackerOne, and how is it? Yes, I've worked with HackerOne for two years now. We handle all internal, or like, Nextcloud security issues via HackerOne. It has produced some good results. Obviously, it's not always easy, because there are duplicates and stuff like that. But for how the reports are structured, how you can evaluate a security issue, it is actually pretty decent. Yeah. And it offers an integration with GitHub, so it's not that much work to copy it over to GitHub and then publish it. Yeah. Is there any interest from commercial ISPs in supporting Roundcube — to use it as a webmail app for their own purposes?
As far as I know, Alexander actually works for an ISP. So I think they might be paying him for that, but you would have to ask him yourself if that is true or not. I know that a lot of ISPs have forked Roundcube and have their own kind of version of Roundcube that they maintain. There is — Hans, I think you mentioned this — a project from the French government that has its own kind of Roundcube implementation. So I'm sure there is interest, because it is a powerful tool and it works really well. People like it. It's easy to install. I've tried it myself; it was very nice, very easy to do, even when Docker wasn't doing its thing. Yeah. So I hope there will be interest, and I hope there will be interest in the community as well, to get the product back and make it a bit more popular again. That is my goal for this, at least. Yeah. Any more questions? Anyway, one thing that came to my mind when thinking about it: we have seen, for K-9, there is a list of actual features somehow blocking the big renaming. So I wonder, is there anything you discovered on the roadmap or on the bucket list for Roundcube which you would particularly like to address in the near future? Not yet, no. We haven't done any sort of project management evaluation yet, because we didn't have the developer for it. Now that we have hired a person for this, I'm hoping we can actually get some project management up on GitHub as well. We're using the boards at Nextcloud, so that would be easy. We can actually sort issues into swim lanes and then work through them. Since it's only one person, progress will not be as fast, but on the other hand — I mean, it's not like nine people can carry a baby to term in one month; that's the old line from project management. So things will hopefully be getting done quicker than now. I also really hope to get through the backlog of PRs. We have 49 PRs open; the first one is from 2015. So there are some bug fixes in there, but I've also seen some features.
I have seen left-to-right and right-to-left text support, so you can switch for right-to-left languages. That would actually be a really nice feature, because a lot of people use right-to-left text. If that could be merged, that would be great, but there is a problem with the CSS classes for the different theme implementations. So that would need to be thoroughly checked, and that is probably something that the developer should and can do: try all the different themes, see how well it works. And yeah, also sync with Alex on this; he's had some input. So yeah, maybe you'll see some right-to-left text support soon. All right. No more questions. So, yeah. Thank you. Thank you.
[Security] Thunderbird Email Security, plans and challenges.
So, welcome. My name is Kai Engert. I have been working with Mozilla and contributing to the Mozilla code since 2001, also including email. And I've been a full-time employee of Thunderbird since 2019. And today I want to talk about Thunderbird email security and some of the plans and challenges on that topic. We all know, yes, there are some creatures who could read our email. They sit on the servers — some robots scanning, some mass surveillance monsters, and cybercriminals. Okay, we don't like that. So, the problem is that there is no protection while emails are stored on servers. We do have some TLS transport security in the infrastructure, but it's not enforced. So I think we need more than TLS transport security; we heard about that earlier. Of course, we want and need end-to-end security, for both encryption and digital signatures. Thunderbird supports two separate technologies. There's S/MIME — I worked on that in 2001, before Thunderbird was born — and it's still supported. And we also have OpenPGP, which was previously supported using the Enigmail add-on, and which has been fully integrated, using integrated code, since 2020. I want to briefly mention some of the things we did in the recent past. We implemented unified status feedback, so you get similar UI for both S/MIME and PGP emails when reading an email. When you compose an email, we also have similar controls to enable or disable encryption. We have made it a bit easier to resolve the problem when you want to send an email but you're missing the recipient's PGP key: we have some interactive UI code to help you find the missing keys. We also added some reminders — when you start composing an email and Thunderbird detects that it can encrypt, it will ask if you want to enable it.
And just most recently, in the new version from last summer, 115, we have added a long-asked-for feature, which is: you can optionally tell Thunderbird to enable encryption automatically — if it sees that we can encrypt, just enable it. Some people have also asked to automatically disable it, but I think it's necessary that the user pays attention there, so we have the option to show some warnings to the user. Other things we did: activists, and people who share their computers with others, have asked that we support an individual passphrase for the secret keys. We did that. There are some parts missing; we need to make it more convenient by adding a cache. We also implemented the Autocrypt-compatible key distribution mechanism, which simplifies group conversations by including the keys of all participants of an email conversation — that's called Gossip. We added that recently; I think we will have it in the stable version soon as well. And we added support for publishing keys to keys.openpgp.org. I want to also mention a few general challenges that we've just recently seen. Since some providers now add S/MIME in the server-side infrastructure, we are now seeing messages which mix the two technologies. So people complain: a user has composed a PGP message to them, and now the whole thing is suddenly wrapped in another S/MIME layer. That's a challenge for the user interface presentation — how do you deal with that? One idea I have is: if it's just a signature layer outermost, maybe you just ignore that one, but I'm not sure it's the best solution. So we are still open for discussion if you have better ideas. Then there was the discussion: what should we do if a message arrives with a digital signature that we cannot completely validate as being good? Currently, we say, well, this has a bad signature, but some people say maybe that's not worse than a plain text email.
Maybe we should just not show any bad status at all, and treat it the same as a plain text email. That's also a pending thing we should do, because there was some agreement on that in a recent OpenPGP community meeting with other developers. Another big unresolved area is combining digital signatures with email content that is nice and shiny HTML and CSS, which many users want to have. The problem is that HTML can be used to manipulate what's shown on screen, so the sender of the email might have seen something different when composing than you, the reader, see. That can lead to confusion; researchers have shown that. So what should we do about it? I don't have a good solution, because nobody agreed to my suggestion to just revert to plain text whenever we have signatures. So maybe we should show such signatures as weaker. I'm looking for ideas here — if you have ideas, please, please send them in. Now let's look at a broader scale. We have the problem that only a small portion of all emails use S/MIME or PGP at all. They're not used much because there are barriers to entry, like Tobias presented: you have to get a certificate, and it's difficult. And then, when you have keys, it's complicated to manage them. And using email encryption at all can have unexpected consequences. If you just set it up on one device, you maybe have a problem accessing your encrypted email from a secondary device. Users can lose their secret keys; they will then also lose the archive of encrypted email. So I think it's still necessary that we involve the user. That means the user must be willing to accept the consequences, and the user must be willing to take care of the secret key file, or lose their archive. So what should we do? How could we get many more people to use email encryption and signatures? I think full automation is not possible, because we have a heterogeneous ecosystem and we need the user to be involved.
That means, I think, that we must assist users better. And that leads me to the question: which technology is easier to use? For the past five years, Thunderbird's focus in that area was OpenPGP, because that was necessary — we had to integrate it to ensure it's still usable. But now the question is: is it still a good idea to continue to focus on PGP? As we heard from Daniel, there are currently some disagreements about what the future of PGP should look like. Daniel has presented a very optimistic outlook for the future, and I agree, many of the things he said would be nice and great to do. But we have the problem that there is a group in the PGP ecosystem which is difficult to ignore. And that's the problem. Daniel suggested maybe everyone should do both. But, well, that would also require that client applications support keys from both specifications, and I see that as a big complication for users, having to manage different keys for different recipients. And I have suggested — I've tried to bring the groups together with many discussions, and I've suggested even introducing a common key format. But there have been no positive reactions to that. Well, from the IETF side, I usually get lots of good ideas and willingness to discuss. But both sides would have to agree, and I don't see much openness to these ideas from the PGP side right now. So, I don't know what the future will bring, of course; no final word has been said. But at this time, I have the worry that the future of PGP is a little uncertain. There are conflicting specifications. There might be incompatible implementations. And I don't know how much hope there is for a unified specification. I still hope for it — I think it would be best and we really should see it — but it's not clear whether it will happen or not.
And if that's the case, I'm worried that PGP might become less interoperable and more complicated to use in the future. And with that, is it the right way to go right now to invest more in PGP, when we don't know what the future will bring? My suggestion is: maybe we should wait a little and see how the developments on the PGP side go, and whether there will be more agreement in the future. Maybe Thunderbird should wait. I think what we have right now is working. Both specifications have a common base. So PGP is working and you can interoperate right now. It's just that I'm not sure how quickly we should jump on these new ideas and implement them. Maybe it's time to wait. And I suggest Thunderbird should continue to support both PGP and S/MIME. But here is one idea. I'm presenting it as a suggestion; I'm not saying we will do that, and I'm looking for your feedback. Please provide feedback afterwards. We could try to make S/MIME easier to use for everyone. We could try to eliminate the barriers of entry that are currently there. And we could say maybe S/MIME is an OK technology for users with a limited threat model, and OpenPGP is more targeted at users with a broad threat model, who, as a consequence, will currently have to accept a slightly higher complexity. Well, why is that? Let's look at S/MIME. I think it's more widely available in email applications. And if you, as a user, trust that the certificate authorities do their job right, then S/MIME is easier to use than PGP, because you don't have to do manual checking of keys. And we don't have the transparency stuff yet that was mentioned; maybe we can do that in the future, but right now it's not there yet. And it might be appropriate for people with a limited threat model. S/MIME protects against passive reading. There is a remaining risk of falsely issued certificates; we have seen DigiNotar in the past.
But CAs are regularly audited, and of course they don't want to lose their reputation. So I think the risk of falsely issued certificates is not that big. Also, we have Certificate Transparency, making it even harder. So I think that remaining risk might be acceptable for many. But in order to follow that idea, we really would have to find a way to get certificates to everyone for free. We would need, like Tobias implemented in his demo, a way to automatically obtain and refresh certificates from inside the email client. And then we would also need something better for looking up the certificates of your correspondents. Maybe we could implement something Certificate-Transparency-like for S/MIME certificates, where we maybe even protect against spammers. I'm not fully up to date on whether the specifications already redact email addresses, but yeah, maybe that would be necessary. And if we have some kind of cloaking with a hash, then we could maybe implement a certificate directory that is like a key server and that could consume the information from the transparency logs. And maybe we could use that to make discovery of correspondents' certificates easier. And PGP could be declared the preferred technology for those who don't want to accept that remaining risk of falsely issued S/MIME certificates. They could still do the manual key verification, at the cost of having a slightly more complex technology. So if we get a positive reaction to that idea, maybe making PGP easier to use in Thunderbird could become a little lower priority, and we'd rather focus, for PGP, on the security improvements and interoperability parts.
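The "cloaking with a hash" idea above can be sketched in a few lines. This is a hypothetical scheme, not any existing directory protocol: the directory would store certificates keyed by a hash of the normalized address (similar in spirit to how WKD hashes local parts), so the published log doesn't double as a plaintext address list for spammers.

```python
import hashlib

def directory_key(email: str) -> str:
    """Return an opaque lookup key for an email address.

    Hypothetical scheme for a CT-like S/MIME certificate directory:
    publishing only a hash of the (normalized) address lets clients
    look up certificates without exposing a harvestable address list.
    """
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# A client resolving a correspondent's certificate would query the
# directory by hash rather than by plaintext address:
key = directory_key("Alice@Example.COM")
print(key[:16])  # stable, opaque lookup key
```

Note the normalization step: without it, `Alice@Example.COM` and `alice@example.com` would hash to different keys and the lookup would fail.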
And rather focus on making S/MIME easier to use. I plan to post some suggestions in the near future to the Thunderbird email discussion list, where I'll present this idea in more detail, and I will be looking for your feedback. Thank you. Thank you very much. I just want to add one more thing. I somehow expect that there will be many questions. So after this is finished, I will go outside and wait for your questions, to have follow-up discussions outside as well. Hi Kai. May I ask why you are still using RNP over Sequoia's alternative, Octopus, as the crypto library? Well, your question implies that I should prefer one side or the other. I don't prefer one side or the other. I don't want to give any of these conflicting specifications an advantage. In my opinion, Thunderbird should remain neutral. The conflicting parties should get together and find a unified specification, and I would like to wait for that. And switching implementations doesn't give me an advantage, because I don't know what Sequoia's intention is. Will they fully support both specifications? I don't know. All right. Are you saying that if we implement V5, then you'll use Sequoia? I'm not making any promises. I'm just saying that the other alternatives currently are not viable, and if things change, we can re-evaluate our thinking. You mentioned that S/MIME can have a lower barrier of entry than OpenPGP. To my understanding, the primary problem with encryption is that the user loses the key and cannot read their email anymore. I don't see how S/MIME has any advantage over PGP in that sense, because I can just as well lose the key, and the certificate authority cannot regenerate my key, unless you want them to hold the key, which you'd much rather not. I don't see the advantage. So I think the problem exists with both technologies; that's the same. But yeah, maybe we could introduce a key encryption key.
Maybe we could introduce a concept where Thunderbird generates a key encryption key for users, and they back it up with a passphrase, maybe 20 or 24 words written down: just a randomly generated symmetric key which we back up in paper form. And then maybe Thunderbird could encrypt all the user's private keys with that single symmetric key. A possible idea that could probably be used for both technologies. Yeah, that's a general idea which we could do, which would help both technologies. Alright, any final question in the room? Have you looked at the SecureJoin standard, and do you think it might be an option for Thunderbird users to have guaranteed end-to-end encryption with verified fingerprints in a very user-friendly way? I have not seen the project you mentioned yet, so you would have to point me to it and we can have a follow-up discussion. Alright, so thank you again.
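The backup idea above, a randomly generated key encryption key written down as words, might look roughly like this. The 256-entry placeholder wordlist is an assumption for illustration only; a real implementation would use a curated list such as BIP39 or the EFF Diceware words.

```python
import secrets

# Placeholder wordlist: 256 entries = 8 bits per word. A real client
# would ship a curated, human-friendly list (e.g. BIP39 has 2048 words).
WORDS = [f"word{i:03d}" for i in range(256)]
INDEX = {w: i for i, w in enumerate(WORDS)}

def new_backup_phrase(n_words: int = 24) -> tuple[bytes, list[str]]:
    """Generate a random key-encryption key plus its paper backup.

    24 words at 8 bits each gives a 192-bit symmetric key, which the
    client would then use to encrypt all of the user's private keys."""
    key = secrets.token_bytes(n_words)
    return key, [WORDS[b] for b in key]

def key_from_phrase(phrase: list[str]) -> bytes:
    """Reconstruct the key-encryption key from the written-down words."""
    return bytes(INDEX[w] for w in phrase)

key, phrase = new_backup_phrase()
assert key_from_phrase(phrase) == key   # paper backup round-trips
```

The point of the word encoding is purely ergonomic: writing 24 words on paper is far less error-prone than transcribing 48 hex digits.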
[Security] Email Autoconfiguration, and 2FA for email
So, hi. I'm Ben Bucksch. I've been working for 25 years on the Thunderbird project, pretty much exactly by now. I was a member of the Thunderbird Council, the leadership committee, but I'm speaking here for myself and not for the Thunderbird project. I've been a consultant for a long time, helping many companies use Mozilla products and code in their products. And, more to the point of this talk, I've been involved in four different OAuth implementations, specifically for mail clients. One was for the largest German Internet provider, two were products of my own company, and I was a reviewer of the OAuth 2 implementation in Thunderbird. I've been implementing the IMAP, POP3 and SMTP authentication logic in Thunderbird, meaning server capabilities, authentication methods, password prompt, and so on. And the account creation dialog, including the auto-configuration and the protocol behind that. So, this talk has two parts. First, I'm talking about auto-configuration, and then multi-factor authentication and email. Autoconfig is a protocol which allows the user, when setting up an email client, to only enter the email address and the password, and the email account is set up completely automatically. He doesn't have to enter anything else, just email address and password, and we find out the configuration completely automatically from that. The user doesn't have to enter any hostname, authentication method, CRAM-MD5, whatever. How do we do that? The email address is ben at example.com. So, example.com is the email provider. So, we just take example.com and get the config from that. Email is supposed to be federated, so we ask the email provider directly: do you have a configuration for us? That's a well-known URL, autoconfig dot the domain, slash .well-known, slash something. And we ask, is there a config there? And it's simply a static file. So, you can just create the static file once, manually.
You put it on your web server at the specific location, and all you need is a web server. You don't need any other server-side software for this. But of course, not all ISPs in the world are going to support that. Google, Microsoft, Yahoo, et cetera, they don't have this config. So, as a fallback, we have a database. This database contains pretty much all ISPs in the world. I had to go and search the configuration for every ISP manually in the support documents, put it in machine-readable format, test it when possible, and put it in a database. Other people helped, of course, with that. So, the result is, in this database you find the configuration for almost all ISPs in the world. If the ISP has more than 0.1% market share, it's most likely in there. But that still doesn't cover corporate domains, company domains, custom domains. So, my example.com is not covered by this. So, what we do is an MX lookup. We ask: what is the MX for example.com? And we find, oh, it's mx.dreamhost.com. So, we look up the configuration for DreamHost, or Office 365, and we get the config this way. This protocol is an Internet-Draft right now. The goal is to make this an Internet Standard. There is a draft for it; you find it at this URL here. And if your email client implements this protocol, and you follow it as it's written down there, you can set up more than 80% of the email addresses fully automatically. So, more than 80% of your users don't have to do anything else than just enter an email address and password. How does it look? This is how it looks. And the first reaction I will get is: oh, yeah, XML? Today I would use JSON, of course; that was 15 years ago. It was a good choice at the time, and it still has some advantages. But this is roughly how it looks. We're going to look at the details. So, it starts with the email provider. This is the domain.
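The lookup order described above (ask the provider at a well-known URL, fall back to the central database, then retry via the MX domain) could be sketched like this. The exact paths are defined by the Internet-Draft and Thunderbird's ISPDB, so treat the URLs here as illustrative:

```python
def autoconfig_candidates(email: str) -> list[str]:
    """Places a client could look for a configuration, in order.

    The MX fallback (step 3) would resolve MX for `domain`, take the
    MX host's base domain, and repeat steps 1-2 for it (not shown)."""
    domain = email.rsplit("@", 1)[1].lower()
    return [
        # 1. Ask the email provider itself at well-known locations.
        f"https://autoconfig.{domain}/mail/config-v1.1.xml",
        f"https://{domain}/.well-known/autoconfig/mail/config-v1.1.xml",
        # 2. Fall back to the central ISP database.
        f"https://autoconfig.thunderbird.net/v1.1/{domain}",
    ]

print(autoconfig_candidates("ben@example.com")[0])
```

Each candidate is a static file behind a plain web server, which is why a provider can support this without deploying any server-side software.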
Here it's microsoft.com, but normally that would be the part after the @ of the email address. And you can have multiple of those, depending on how many domains are hosted by this email provider. For Yahoo, it's all the countries: yahoo.de, yahoo.fr, yahoo.it, yahoo.com, et cetera. And the MX servers are also in here. So, if you make a lookup, we take the domain part of the MX server, and we put it in here as well. And this is how we can find the configuration. In the database, we map that the other way around: there's going to be an entry for this and this, and we can easily look that up. And even on the server, this is a static file, so it's really fast. There's also the display name here. There's a long and a short version, you don't see that here. It's completely optional, but if, as a client, you want to use the name of the provider as the account name, for example, you can use that. This is how the config looks. The incoming server here. We specify four different types: IMAP, POP3, JMAP, on special request from Fastmail, and Exchange. And you can have multiple of these incoming servers, and they are ordered by preference. So, all other things being equal, the client is supposed to take the first configuration in the file, but the client has the choice to use another one. For example, in Thunderbird, if the configuration shows that there's both an IMAP and a POP3 configuration in there, Thunderbird gives a choice to the user: what do you want, IMAP or POP3? There are trade-offs, and the user can make the choice, and the client is then going to take, for example, the first POP3 configuration listed in this file. Funny fact: half of the Thunderbird users have a POP3 account configured. Of course, there are more IMAP accounts, but still, half have a POP3 account. I was really surprised about that. It's really popular.
I thought nobody uses that anymore, but it's popular. There are Exchange servers, and of course you have the SMTP server as outgoing server as well, and the structure looks pretty much the same. You have the authentication twice here. All of them have to work. This is not the same as the IMAP capabilities, where the server just lists something and it might not even work. These both work. It's just: does the client support OAuth? If yes, it's supposed to use this. If the client doesn't support OAuth, it can go on to the next one that it does support. And there's the format of the username in there. It could be ben, it could be ben at example.com, it could be ben backslash example, like the Windows domains. Oh, and by the way, in the database, we always prefer SSL/TLS over STARTTLS. And if there's a plain-text configuration and an SSL configuration, we don't bother listing the plain-text configuration. And there were situations where Yahoo or others said: no, we don't want you to list the SSL, that's only for paid customers. We said: we don't care. If there's an SSL configuration and it works, we put it in there. And this is the way Thunderbird protected the users years before other clients did, because we knew these configurations and the ISP didn't advertise them. You can also configure address book and calendar sync and file share. Thunderbird doesn't implement that, but it's possible. So you just enter email address and password, and boom, you have it all set up. Email, calendar sync, file share, contact sync, all there in advance. So, like I said, there's a specification out there. I would appreciate your support, as in verbal support: expressing support in the right forums so that this actually moves forward to standardization. And if you're writing an email client, please support the specification. It's really helpful. And of course, if you have an email service, it's always appreciated if you support that as well.
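Putting the pieces together, a client's selection logic over such a config file might look like the sketch below. The XML here is a simplified, abridged sample in the spirit of the format described above, not a verbatim config file; element names are illustrative.

```python
import xml.etree.ElementTree as ET

# Simplified sample config: servers ordered by preference, each
# listing its authentication methods in order of preference too.
CONFIG = """
<clientConfig>
  <emailProvider id="example.com">
    <incomingServer type="imap">
      <hostname>imap.example.com</hostname>
      <socketType>SSL</socketType>
      <authentication>OAuth2</authentication>
      <authentication>password-cleartext</authentication>
    </incomingServer>
    <incomingServer type="pop3">
      <hostname>pop.example.com</hostname>
      <socketType>SSL</socketType>
      <authentication>password-cleartext</authentication>
    </incomingServer>
  </emailProvider>
</clientConfig>
"""

def pick_server(xml_text: str, supported_auth: set[str]) -> dict:
    """Take the first listed server (they are ordered by preference)
    with the first authentication method the client actually supports;
    unlike IMAP capabilities, everything listed is known to work."""
    root = ET.fromstring(xml_text)
    for server in root.iter("incomingServer"):
        for auth in server.iter("authentication"):
            if auth.text in supported_auth:
                return {
                    "type": server.get("type"),
                    "host": server.findtext("hostname"),
                    "auth": auth.text,
                }
    raise LookupError("no usable server configuration")

# A client without OAuth support simply falls through to password auth:
print(pick_server(CONFIG, {"password-cleartext"}))
```

A client preferring POP3 could just as well filter for `type="pop3"` first, which is the user choice Thunderbird offers.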
All right, second topic. This is a less happy topic: multi-factor authentication and email. We all want that. The ISPs want that. We want that. Unfortunately, it's not that easy. Right now, the situation is that only if you're Google, Microsoft, or Yahoo, you can do OAuth. The rest pretty much can't. There are a few smaller ones, but if you're not part of the select few which are hard-coded in Thunderbird or the email client, you can't do it, because the client doesn't have any way to discover the OAuth server. So which options do we have? We can use OAuth as it is specified right now, or rather not specified right now. Or, I'm making a proposal which I'm dubbing MAuth, OAuth for Mail: it's OAuth, but we nail it down further. The things that OAuth doesn't specify, we nail down and specify for mail so that it works well. The third option is passkeys. Could you please... thank you. Okay. Thank you. So, OAuth was originally designed clearly for websites. Like, Zoom wants to authenticate with Google, and they have a relationship, and this is what the spec was written with in mind. It clearly shows, because if you try to implement it for mail clients, you run into all sorts of problems. Most of the problems are related to the fact that OAuth is not really a specification, it's more like a framework. It says: if you want to do that, you would do it this way, but it's up to you. The server decides about everything. It can do something, it might not do it. There's no way to know whether it's going to do it or not. That causes real problems for the implementation, because as a client, I cannot rely on anything at all. Everything is optional; I don't know what's going on. Even for the same server implementation, it all depends on the configuration, and on what that specific IdP, ISP, whatever, has put in there. This is what works and what doesn't, and I cannot rely on anything in my code.
The problem is that users always blame the client, no matter what the reason is. In my company, we're offering a little email add-on, and in support we get this all the time. The user says: my email doesn't work anymore, it worked yesterday, it doesn't work today. There was no software update between yesterday and today. What could it possibly be? Of course, it has to be that the administrator changed the authentication server, changed something; something changed at the company, his own company. And there's no way for us to know, there's no way we can fix that, there's nothing for us to do. We cannot even test it, because we don't have a login for that company. There's nothing we can do. But the customer doesn't understand that. He says: hey, but it works in this client, I want it to work here, you are broken, bye-bye. And we lost so many customers because things don't work at the OAuth level, and there's nothing we can do about it. One of the big troubles is expiry. OAuth is all about expiry and refresh, this is pretty much all that OAuth does, and none of that is specified. There's a lot about how it should work, how it almost always works, but it's always optional. So the expiry: I have no idea, is it 12 months, or is it five minutes? I have no idea, and it makes a big difference for how I implement my client whether the user has to log in every 12 months or every five minutes. I have to structure my code accordingly, my UI accordingly. But when I write this, I have no idea what's going to happen, because it's all configurable. Same with the refresh token. Usually I'm getting a refresh token, but it's optional. So what I'm proposing with MAuth is that we nail this down. If you're going to expire the token, please tell us when it expires. It's in the spec, so that I have a chance to refresh before it expires. So please send this expiry time.
Please send the refresh token. Most servers do that, but it's optional. We would nail this down, saying: if you want to use this for mail, you have to send a refresh token. You have to actually refresh the refresh token, so that if the user continuously checks mail, it is not going to expire. All these little details are not specified in OAuth, and we would need to nail them down for mail to work properly and reliably. And on the server side, this is just a matter of configuration. We don't need to write any new software; it's just a question of how the ISP configures it. So all we would have to say is: if you, the ISP, want to use OAuth for mail, you have to configure it in this way for it to work. And for us, this configuration is a question of working or not working, because the users are going to complain if they have to log in all the time, and they're not going to use our product. I'm going to skip over error handling. There are pretty much no error codes; all you get is "access denied". I don't know whether the password failed or the user canceled. I don't know how to react to that. Should I show the prompt again or not? We need to have some proper error codes. And client registration is my biggest worry. The OAuth specification requires that the client sends a client ID. And the specification says: the way the client registration works is outside the realm of the spec. It's explicitly not specified. And even worse, this IETF RFC specifically states: you may have to sign a contract in order to get a client registration. You may have to sign a contract. What does that mean? Right now, I have to sign a contract with Google and with Microsoft in order for my email client to work with OAuth. That's the situation right now. That is a problem even right now between the big ISPs, because they're not always at peace with each other and they can block each other this way. That's the problem right now.
Even worse, for me as a little guy, I have absolutely zero chance standing up to Google with this contract. If I want to offer an email client, I have to sign this. I don't have a choice. So Google can force legal conditions on me, contracts on me, and put in there whatever they want; I will have to sign that. That is a legal nightmare. A huge liability. So if this were an IETF spec for mail... I think this is fine between websites: if Zoom wants to authenticate to Google, they can make contracts. But for a client-server protocol, I don't think that would ever pass IETF standards, because it's pretty much by definition not open. This is actually worse than patents, because a patent I might ignore. A patent might not apply to me. A patent might not be valid. But this is a contract between me and the other party, and it's definitely valid. This is much worse. So there's one proposal in the room to use dynamic client registration. There is a specification for dynamic client registration. The ISP can offer to any client, any instance, to just connect and say: give me a client ID, and the ISP would return a client ID. Apart from the fact that it makes the whole system useless, I don't know of any implementation of that. There is a spec for this, but I've never seen any client implementing it. I've never seen any server implementing it. I've never heard of anybody who has ever seen any implementation, ever. So there is a spec for that, but we would have to write the server software. It needs to scale for the big ISPs. We would need to write all the client software. And once we've done that, we've added complexity, but the whole thing is pretty pointless, because the client registration doesn't actually do anything security-wise. I could just open up outlook.exe to find the client ID and secret, and theoretically just use that. So security-wise it's snake oil, and it doesn't serve any purpose.
So if I want to know what the client name is, for whatever debugging or help purposes, I can just look at the user agent, because all of this is HTTP. I look at the User-Agent string, and if we put a proper value in there, I know what the client is, but I get around all these legal things. So as far as I can see, the only advantage of the client registration is to force a contract on clients, which is exactly what we don't want. So there's a simple solution to that. We don't need any new software. Very simple: in this MAuth thing, we just specify that the client ID is going to be "mail", M-A-I-L. Problem solved. A hard-coded string. And you don't need any new software for that. The ISP just needs to configure that client ID in their software. If they want my clients to connect, they have to configure this ID. Problem solved. That's what I'm proposing for MAuth as a solution to this. There's another big problem with OAuth. It inherently depends on a web browser. So, I want to implement an email client. I already have to deal with HTML email, but there I don't want JavaScript, I don't want cookies, I don't want video, I don't want any of that when I render HTML emails. When I want to do OAuth, I have to have all of that. OAuth won't work if I don't have JavaScript. It won't work if I don't have cookies, and the cookies need to be persistent. It has to be a full web browser. So I'm probably going to use WebView or something, but then it's going to depend on the Android version that the user has, which WebView version he has. That's going to be a support nightmare. And I need a specific API, because I need to track when this login sequence is finished, which URL he is on: he's done now, he's logged in now. I need a specific API for this embedded web browser. That's an extra API, which most embedding APIs don't have. It's extra complexity. It's already difficult. I don't know how many email client implementers are in the room.
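The fixed-client-ID proposal is easy to picture: every mail client sends `client_id=mail` in a standard OAuth 2.0 authorization request (RFC 6749) and identifies itself via its User-Agent header instead of per-vendor registration. A sketch, with illustrative endpoint and scope values:

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_auth_url(authz_endpoint: str, redirect_uri: str,
                   state: str) -> str:
    """Authorization request under the proposed MAuth convention:
    the client_id is the fixed string "mail" for every mail client,
    so no contract or per-vendor registration is needed."""
    params = {
        "response_type": "code",
        "client_id": "mail",       # fixed string per the proposal
        "redirect_uri": redirect_uri,
        "scope": "mail",           # illustrative scope name
        "state": state,            # CSRF protection, per RFC 6749
    }
    return f"{authz_endpoint}?{urlencode(params)}"

url = build_auth_url("https://auth.example.com/authorize",
                     "http://127.0.0.1:8765/callback", state="random123")
assert parse_qs(urlparse(url).query)["client_id"] == ["mail"]
```

The ISP-side change is, as the talk says, pure configuration: whitelist the client ID `mail` in the existing OAuth server software.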
I don't know how you feel about putting a web browser, mandatory, in your email client. I don't know how you feel about that. But that's the situation right now. There's another option: I can just launch the system browser. So I launch a URL, go to the system browser. That's actually what Google wants. The problem there is that I've interrupted the flow. The user left my email program at this point. He was in the middle of setting up the email address, and now he's in the browser, and he finds the news, and he starts reading the news and cat pictures, and I've lost him. Maybe he's going to come back; I don't know. The flow is interrupted. And in order to know when the user finished, I need to redirect to HTTP localhost. That's a web server that I have to implement in my email client. So I have the choice now: I can either implement a web browser or a web server in my email client. Those are my two choices, and I don't like either of them. So that's where we are. You could argue that we have to implement OAuth anyway, because we're dependent on OAuth for Google, for Microsoft, and for Yahoo, which is true. However, the problem right now is still contained. It's really these three that really need it; all the others don't need it. My hope is that we can contain it there. If we open up the floodgates and open it up to all the ISPs, we have a legacy that we will never get rid of. You heard the talks about IMAP, and what a legacy that is. If you've implemented email, you've probably scratched your head for one reason or another: there was a reason 30 years ago why they did it this way. I don't want to be the guy who puts OAuth in email and creates legacy that people have to deal with 20 years from now. I don't want to be that guy. This is why I don't feel at ease with putting OAuth into email. There's another option. It's called passkeys. We talked about MAuth; now passkeys. Passkeys are the new cool thing.
Google, Apple, Microsoft are fully behind that. They implemented it in record speed. It's supposed to be super secure. You can bind it to the biometrics of your phone. You don't need this two-factor code thing, and it's still secure. The big ISPs really, really want this. They're really behind this. This is a big advantage, because maybe we can contain the OAuth problem and migrate users to passkeys this way. We could also allow that for all other ISPs as well. The other advantage is that it's a very simple protocol. It's a challenge-response protocol, which means the server sends some kind of information, some blob, some JSON or string, to the email client. The email client sends this to an operating system API. The operating system pops up a dialog: do you want to log in to this-and-this website, or this-and-this domain? The user can approve or disapprove. He might have to use Face ID or a thumbprint, depending on the settings. There we have the two-factor authentication with biometrics. It's secure. Then the operating system generates another piece of information, we send it back to the server, and the server says: okay. So it's simple, we just pass information back and forth. It's very simple on the protocol level. I don't need a web browser, and it doesn't have all of these issues that I just mentioned with OAuth. So, I don't know too much about passkeys, but my gut feeling tells me this is the better way forward and something that is much easier to support in the future. This is an open question. If you know something about how this would work with passkeys, and have proposals or want to get involved, it's an open question right now. Let's discuss this. So, questions? Thank you.
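The back-and-forth described above can be simulated in a few lines. To be clear, real passkeys (WebAuthn/FIDO2) use asymmetric keys held by the platform authenticator; the HMAC with a shared secret here is only a stand-in so the message flow fits in a short sketch:

```python
import hmac, hashlib, secrets

# Toy simulation of the challenge-response *flow* only. A real
# authenticator signs with a per-site private key; this shared
# secret merely stands in for that key material.
SECRET = secrets.token_bytes(32)

def server_challenge() -> bytes:
    """Server sends an opaque blob to the email client."""
    return secrets.token_bytes(16)

def authenticator_sign(challenge: bytes) -> bytes:
    """The client hands the blob to the OS API; in reality the OS
    would prompt for Face ID / fingerprint approval here."""
    return hmac.new(SECRET, challenge, hashlib.sha256).digest()

def server_verify(challenge: bytes, response: bytes) -> bool:
    """Server checks the returned blob and says: okay."""
    expected = hmac.new(SECRET, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)

ch = server_challenge()        # server -> mail client -> OS dialog
resp = authenticator_sign(ch)  # OS -> mail client -> server
assert server_verify(ch, resp)  # no web browser involved anywhere
```

The point the sketch makes is structural: the mail client only ferries two opaque blobs between the server and an OS API, so no JavaScript, cookies, or embedded browser are needed.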
[StructuredEmail] Structured Vacation Notices and Structured Email for Roundcube
All right. So this time I have the pleasure to introduce myself, and somebody else needs to take care that I don't overuse the time. So yeah, my name is Hans-Jörg. I'm from audriga. I have two hats, or two histories. One history, my main email history, is in migration and portability; you've seen some of our JMAP work earlier today. But actually, I have an earlier history in semantic web technology. I was a semantic web researcher. I did some stuff on Semantic MediaWiki in the past, if some of you are aware of that. And this is a new project, actually, where these things tend to converge. Some people who read their email on the console typically don't like at all what it proposes. But yeah, I'd like any feedback on it. Some people might even like it, because it maybe fixes something that HTML email broke. And the whole idea is structured email. I'll present you a reference implementation for Roundcube and a particular application, which is a structured vacation notice, which is probably appealing to email people in particular. So first of all, a claim: email is sort of your personal API. But you're a little bit of a mechanical Turk in there. You need to read it, you need to understand it, and you need to act upon what people, or other services, ask you to do. Second: email is underappreciated. I think everybody here in the room would probably agree. And so one of the ideas here, in order to bring these things together, is to make email content, maybe not in general, but for parts of emails or certain emails, more machine-readable, so that the tools you develop might help people do certain tasks more efficiently, or even do novel tasks. And the very rough idea is: just like you have multipart/alternative with text/plain and text/html in an email, you also embed structured data in RDF, which is a W3C-specified knowledge representation language, according to certain so-called data models.
So schema.org is a very popular data model, which search engine vendors have set up; basically, you find it in websites: this is a movie, this is a song, this is an article. And the idea is to also allow users or tools to include that in emails, so that email clients can make sense out of what is in that email. So yeah, that sounds quite abstract. How could that look in practice? Actually, this is not something I invented from scratch. Gmail, Yahoo, and some other vendors, Web.de in Germany, are already doing it. So if you fly with Lufthansa or a certain airline, and you have a Gmail account, and you have opted in, these airlines might already send that schema.org data inside the email. And you might notice there is a special display within Gmail that shows you certain information on the flight, allows you certain actions, might automatically import it to your calendar, or at some point also into your Google Assistant, and so on. That's nice. The problem is: currently, this is only for select senders. You basically need to register with each vendor to have this happen. It's only there for very few select use cases, like traveling and maybe ordering on the web. And it's unidirectional: it's only from a service to you. You cannot use it yourself. So it's a little bit against the idea of email, right? I mean, obviously, I would not send a flight to somebody, probably, but maybe something else. schema.org alone has over 800 concepts, and what Gmail supports is like six of them or something like that. But actually, there are already very nice use cases for even this travel information, and there will be a talk just after this by Volker, sitting in the background, so I won't talk too much about that. A second example would be link sharing. There is "share by email", right? And this is how it looks in K-9 Mail. I'm not blaming K-9 for it, but you basically get a URL sent.
And this is what you receive. And in this case, basically, you are stuck with Spotify. You click on it, you have the Spotify song, but K-9 doesn't know this is a song, and you are with Spotify. OK, you can listen to the song, but if you're on Apple Music, it's up to you to deal with that. With structured email, the idea is you could take some metadata, which in the case of Spotify is actually already embedded on the Spotify link, so nobody needs to do manual annotation, and put that into the email alongside the link. So your email client would not just have the link; the email client would know this is a song, Bruxelles Je T'aime, by Angèle. And it could even match it, for instance, with your local media player, if you have that as an MP3 or something like that. So you could basically dereference the kind of content that got shared, and you have a much better user experience, a little bit like in instant messaging when you send a link, where WhatsApp and the like extract the Twitter cards and that kind of stuff. Another use case, maybe even more fancy, is location sharing, or even live location sharing. Many instant messaging tools allow you to do this, but within their ecosystem, so you're bound to their implementations and their privacy rules. And it only works if you send to another fellow WhatsApp user, so it's also not really open and decentralized. So we built a prototype where you send a location based on a JSON-LD snippet, and we have a prototypical implementation where the client on the mobile can push updates of the location to a URL with a secret UID, which the receiving user can actually use to refresh it. So if your receiving email client supports it, you could get this user experience. This is an example which we did. And of course, you can also have a fallback, so you can get an HTML email.
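For the link-sharing case just described, the metadata is typically already published on the linked page inside a `<script type="application/ld+json">` tag; a sharing client could fetch the page and reuse it. A minimal stdlib-only sketch (the page snippet and the artist name in it are placeholders, not real Spotify markup):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects the JSON bodies of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.items.append(json.loads(data))

# Stand-in for HTML fetched from the shared link.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "MusicRecording",
 "name": "Bruxelles Je T'aime",
 "byArtist": {"@type": "MusicGroup", "name": "Example Artist"}}
</script>
</head><body>...</body></html>"""

parser = JSONLDExtractor()
parser.feed(page)
song = parser.items[0]  # the structured description of the shared song
```

The extracted object could then be embedded in the outgoing mail next to the URL, so the receiving client knows it is dealing with a song rather than an opaque link.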
Of course, then it's not the live location, but you can do something like the fallback you have in some newsletters: click this link, go to the browser. Even though this is, of course, not the best user experience. And then another very familiar use case for you: vacation notices, out-of-office messages. This is typically something you enable for your email account while you are traveling, like at FOSDEM. Maybe it's a weekend, not so many people will write to you, but maybe you arrive back in the office on Tuesday. So you say: I'm staying in Brussels till Monday, please contact my colleague in the meantime, and so on. It's still something you need to act upon manually. But it would be interesting if your email client could actually understand that this is an out-of-office message, recognize the date, and probably the person I could redirect the mail to if I chose to. And this is basically what we did. We wrote an IETF draft for this to specify the process a little bit. And basically, you can even leverage most of the user interface and data you have from the Sieve vacation extension. This is how we implemented it in Roundcube: we just take the date fields which you fill in there anyway, and the reason, and put this into the structured data. And if the receiving email client is capable of understanding it, it may store this information for the time the user is away, and it can highlight it. And you can even, as the user going on vacation, choose to include it in emails prior to your vacation. So you could say: even if I go on vacation tomorrow, include that metadata already in just any regular email, if you want that. And so recipients can already see: ah, Michel will be on vacation starting tomorrow, even though he wrote me this mail just now. And I might hurry up answering him, or something like that. I'm not suggesting this is how it has to be; it's just illustrating that you can do additional things which you could not do with a regular out-of-office right now.
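To illustrate the receiving side of the vacation notice idea: once a client has stored a structured notice, it can check the dates when you compose a mail to that person. The field names below are hypothetical placeholders for illustration, not the vocabulary of the actual IETF draft.

```python
import json
from datetime import date
from typing import Optional

# Hypothetical notice as a client might have stored it from an earlier mail.
raw_notice = json.dumps({
    "type": "VacationNotice",          # placeholder field names throughout
    "start": "2024-02-06",
    "end": "2024-02-12",
    "substitute": "colleague@example.com",
})

def away_hint(notice_json: str, today: date) -> Optional[str]:
    """If the sender is away today, return a hint the composer could display."""
    notice = json.loads(notice_json)
    start = date.fromisoformat(notice["start"])
    end = date.fromisoformat(notice["end"])
    if start <= today <= end:
        return f"Away until {end.isoformat()}, contact {notice['substitute']}"
    return None
```

A client could call `away_hint` whenever the user addresses a mail to someone with a stored notice, and show the hint, or nothing, depending on the date.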
And yeah, what is the current state here? For these examples I've shown you, there is an IETF working group that formed very recently; last November was the first meeting. There is a mailing list, so even for those of you not familiar with the IETF, please join that list if you're interested in the topic. Any feedback, any questions, everything is very appreciated. There was already quite some good feedback from the community. For instance, the Thunderbird Council made a decision that if there were an RFC, they would probably be willing to implement this, or to merge it into their code. The first drafts already got adopted in the working group; it will still take some time until they reach the form of an RFC, but things are moving. We are working on a reference implementation, for which we gratefully received money from NLnet and the NGI0 program. This is being published right now during FOSDEM, so you can go to Packagist; not sure if it's already on Packagist, but on Monday at the latest. We will provide some guidance so that you can use our Roundcube implementation as a blueprint for your own webmail, probably, and even some reusable code so you don't have to write everything on your own. And there are even first adopters. For instance, I got in touch with the developer of FairEmail, and he implemented the very first beta of it within a day, which was quite an awesome experience actually. If you hear this: I really appreciate it. And that would be a really great thing. So finally, maybe a little bit of an overview of how this currently works. These are the URLs where you'll find more information. We have one library currently where we do the extraction of the structured data from incoming emails. This could be reused on the server side of your application. We have two libraries which are basically template libraries. It's a little bit user-experience-ish, so we are still searching for people that are really keen on CSS and HTML design stuff.
So if you know somebody, please help us, because we think it makes sense to have at least a simple example of how to render these cards for very popular kinds of information, so that not every client needs to decide on its own how to render a Spotify song, or a music song in general. Even though, of course, every client could opt to do so, we probably want to provide some examples here. And we do it both for the actual rendering, but also for this HTML email which we want to send as a fallback for those that don't have the fancy client yet. And then, as I said, there are two Roundcube extensions. One is for structured email as such, where you can do the Spotify thing, for instance, or receive these kinds of things. We are also working on the Nextcloud Mail thingy, where you can actually interact with the Nextcloud Cookbook app, so you can import recipes that you receive by email. And there is a separate plug-in for the structured vacation notice. That's all, actually, for the moment. Thanks for listening, and I look forward to feedback and questions. So, yeah, maybe somebody... Yeah. All right. Did I see a hand? Question? Is there a concern... I mean, to me it seems, we've had this discussion, that this is just kind of in the background of a mail message; as long as it's not an overwhelming data size, it doesn't really matter to people. But the question would be: is this the kind of thing where, maybe with a client that's not displaying it really great, all of a sudden you start having all these random attachments that confuse a user? Because they can't do anything with this themselves; this is all meant to be machine-readable. Let me repeat, so I'm sure I understood correctly: your question is, what are the ideas around trying to prevent confusion of users if a client doesn't know how to handle it?
Two things. First of all, you can see it as multipart/alternative; if the email client doesn't understand it, it just won't get rendered. And also, it's metadata: it will never be shown if the client doesn't know about it. So you can use it with existing clients already. Actually, you probably receive such emails personally already, because Lufthansa might include it even in mail sent to OX; you just don't do anything with it. I assume you're reading my emails. Sorry? I'm joking. Yeah, yeah, OK. And actually, even the opposite is interesting. Because we had people coming to us who had exactly the problem where you get a PGP key or an email signature attached to an email, and the email client doesn't even know what that is. You could actually use this structured data to provide additional information about what certain email attachments are about. So you could even help email clients provide a better user experience in that case. What's the incentive for any provider to actually send structured emails? Because it seems that the incentive is actually the opposite. They don't want to, like Spotify, they don't want to send what song it is; they want people to go to Spotify and nowhere else. And same with the Lufthansa thing. They want to publicize their brand, they want to upsell services. They don't want to send just a generic message with no possibilities for that. So the incentive is to not use this. OK. So you say: what is the incentive? There is no incentive for either Spotify or Lufthansa to send this. Point one: Lufthansa does it, actually. You can try. I'm not sure about Lufthansa in particular, but airlines do it with Gmail. And the very reason is that Gmail gives them a preferred visualization, and it might actually even strengthen their brand appearance, because they might get a special presentation.
There is research showing that the click rate gets even higher when you have that special presentation. So that's at least one theory. I'm not saying I'm spreading the truth here, just giving you an idea. For Spotify, I was not claiming that Spotify itself sends it. What I was saying is: you share it. You are in your web browser, for instance, or within Spotify, and you say "share with", and you go to the email program. And Spotify does have that data on their website, in the metadata. The incentive there is search engine optimization: they have it because they want to rank very high in Google. And we just piggyback on that data by using it in email, in that sense. But you said the "share with" feature. The "share with" feature is controlled by the Spotify client, which is controlled by Spotify. Oh, no, no. It doesn't matter, anyway. Because it's obviously a URL; they want to have that for WhatsApp, so they won't change that. But with the URL, we can actually pull the metadata from the website, like the Google crawler does. So you want to hijack that thing and then put it in? In a way. Which is fair. OK. Thank you.
[StructuredEmail] When is my flight? - Semantic data extraction in KMail and Nextcloud Mail
Okay. Okay. So, yeah, then we'll continue basically right where Hans-Jörg left off. I'm Volker from KDE, and I'll talk about how we do the semantic extraction in KMail, specifically focusing on the travel use case. Many of you probably traveled here, so you might see why this could be useful. If you book your flight or your train or your hotel, you get the confirmation as an HTML monstrosity full of advertisements and fine print, and somewhere in between is the information that you actually care about. So you need to find that and transfer it into your calendar or your travel app, and if you do that manually, it's tedious and error-prone. So why can't we have that automatically? And that's basically the point that got me into this topic. I was on the way home from a conference, needed to find my departure gate, and it was written in light gray on white, in that style. So I did what you would do in that case: you read the email source code, because that's easier to read. And I stumbled upon a nice, compact summary of the trip, and that was the schema.org JSON that Hans-Jörg mentioned. So, just showing that in our email client, that should be easy. Six and a half years later, I'm now standing here and still talking about that subject; so, as things usually go. So, Hans-Jörg showed us already: it's the schema.org JSON-LD, something that I think Google proposed 10 or 15 years ago for websites and for HTML email, meanwhile managed by the W3C, so it's a proper open standard. As an ontology that tries to model the complexities of the real world, it has all the fun involved with that, but generally it is sane and something we can work with. Then, however, we got in touch with the harsh realities out there, because there's not just that nice JSON format; there is also a commonly used microdata representation that basically embeds that tree of information in the structure of the HTML email.
Technically, that's possible and still well defined, but it then basically puts HTML parsing into your problem space, with all the fun that that entails. Well, okay, so we implemented that as well. Then we discovered a third variant of encoding that information: basically, syntactically invalid JSON. Comma-less JSON is particularly popular, so we ended up adding workarounds to the JSON parser to deal with all of that. Then we found the actually much bigger problem, and that is semantically incorrect data. I think the most extreme case was Air Berlin. They had the arrival and departure times for flights in the local time zone of the airports, as you would usually do it. But then they added the UTC offset of what is presumably their server location. So if you travel to the US, an eight-hour difference, you probably notice that something is wrong. If you travel from here to Finland, a subtle one-hour difference: super dangerous, you're at risk of missing your flight. Another common problem: there's an address and there's a geo-coordinate, and they mismatch, and not just by a few meters. We have to deal with that as well. Then of course the other big problem: this is by far not as widely used as we would wish. You find it with some airlines; some of the hotel and event booking platforms have it. It's super rare for trains; I think in Europe it's only a single operator. In general, on a scale from Silicon Valley startup to 100-plus-year-old European national railway, it's clearly biased towards the former. It seems to be even less common in Asia than in Europe. That isn't really satisfying, but at that point we were hooked and we really wanted those features, so we started to look where else we could get them from. There's actually a lot of stuff that we can extract data from in such emails. One particularly useful thing are flight and train ticket barcodes, which then moves PDF parsing and image processing into our problem space. It gets worse.
That thing is an entire world on its own; I spoke a bit about that last year in the Railways and Open Transport devroom, so I'll skip that here. Another thing commonly found in booking emails are Apple Wallet passes: ZIP files containing JSON. Part of it is machine-readable, part of it is visual representation, but at least for location and time and the barcode, that's a good starting point. Then of course there is the whole unstructured, human-readable part. For some of that we were able to build generic extractors. Take an airline boarding pass: they might look very different from a visual and layout point of view, but they can all be very reliably identified using the barcode. The barcode only contains very basic information, like the day of travel, not the year or the time, and only the airport codes, but not the gate, and so on. All of the information that is really relevant for you is in that human-readable text somewhere, and it's possible to identify that and match it. For everything else we have provider-specific extractor scripts. That's usually a few lines of JavaScript with regular expressions or XPath queries on the HTML. Not pretty, but it gets the job done. With all of those ways of getting data out, we still have the problem that the data quality isn't really on a level that we can work with. In particular, we care about the very exact time, including the time zone. By time zone I really mean an IANA time zone ID, not a UTC offset, because if you have a delay over a daylight saving time change, and yes, that does happen, then you really need the exact time zone to know when your new departure time is. And the other aspect that is really important is the precise location, as a geo-coordinate. That in turn also helps with determining the time zone, but we want to have features like routing to your departure location or your hotel.
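The point about IANA zone IDs versus UTC offsets can be shown in a few lines of Python: shift a departure by one day across the European DST change of 31 March 2024 and compare a fixed-offset timestamp with a zone-aware one. The times here are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# The same departure stored two ways: with a frozen UTC offset, and with
# the IANA zone the airport actually lives in.
offset_dep = datetime(2024, 3, 30, 10, 0, tzinfo=timezone(timedelta(hours=1)))
iana_dep = datetime(2024, 3, 30, 10, 0, tzinfo=ZoneInfo("Europe/Brussels"))

# A one-day delay pushes the flight past the switch to summer time.
delayed_offset = offset_dep + timedelta(days=1)  # still claims +01:00
delayed_iana = iana_dep + timedelta(days=1)      # offset recomputed to +02:00

# The fixed-offset version is now a full hour off in absolute time.
error = delayed_offset - delayed_iana
```

Only the zone-aware value knows that 10:00 local time on 31 March is CEST, not CET; the frozen offset silently points at the wrong instant, which is exactly the one-hour trap mentioned above.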
And in order to improve on the input data, we use some external data sources like OpenStreetMap or Wikidata to resolve airport or train station identifiers and get to the exact location. And we have a few things that apply domain knowledge. For example, if your email refers to a flight from Brussels to Stuttgart and mentions a flight time of about an hour: there are two airports with Brussels in the name. They are both in Belgium, so we know the country and the time zone. There are also two airports with the name Stuttgart; one is in southern Germany, the other one is somewhere in the US. But based on the flight time, we know exactly which one is possible, and thereby we may have uniquely identified the other airport, and so on and so on. And then in the end, we have some validation and plausibility checks, because there is still either incomplete data or nonsense coming through. So if you would require time travel to make that trip, then it's likely wrong somehow. And that's how it looks in the integration: we run the current email through the extractor; if it finds something, it shows a summary of that and offers to add it to your calendar or to your travel app on the phone. This is in KMail. Originally, the extractor started as a library for KMail, but it's also available as a standalone command-line tool by now. That's how we did the integration in Nextcloud. Same thing: we show a summary of what we found, and you can add that to your calendar. There used to be a Thunderbird plugin, but Thunderbird changed the integration API and since then that has stalled a bit. There's a lot of demand for it, so it would be nice to resurrect it at some point.
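Going back to the Brussels-to-Stuttgart example above, the flight-time disambiguation can be sketched as a simple plausibility check: compare the great-circle distance between candidate airports with the distance a plane could plausibly cover in the stated time. The coordinates, cruise speed and tolerances below are rough illustrative values, not what the real extractor uses.

```python
import math

def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

# Approximate coordinates, for illustration only.
BRU = (50.90, 4.48)              # Brussels Airport
STUTTGART_DE = (48.69, 9.19)     # Stuttgart, Germany
STUTTGART_US = (34.60, -91.57)   # Stuttgart, Arkansas

def plausible(origin, dest, flight_hours, cruise_kmh=800.0, slack=0.5):
    """Does the distance roughly match the stated flight time?"""
    expected = flight_hours * cruise_kmh
    # Generous tolerance: short hops spend much of the time below cruise speed.
    return abs(distance_km(origin, dest) - expected) <= expected * slack + 200

candidates = {"DE": STUTTGART_DE, "US": STUTTGART_US}
match = [k for k, c in candidates.items() if plausible(BRU, c, 1.0)]
```

With a stated flight time of about an hour, only the German Stuttgart survives the check; the Arkansas one is thousands of kilometers too far away.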
And then there's of course the dedicated travel app, Itinerary, that we built out of all of this, which Hans-Jörg already mentioned, where you get a timeline of your trip, and it then fills the gaps with local public transport, looks up the weather forecast, and reminds you to bring a power plug converter if you're traveling to a country where you need that. And I mean, that is exactly the kind of high-level semantic features and workflows that we can build if we actually understand what you're dealing with in your emails or in your documents. So if you produce any kind of transactional email, you most likely have a machine-readable representation of what it is about, so please add that to the email in some form as well. Ideally in the format Hans-Jörg is working on, but as you have seen, we are not particularly picky in extracting: anything that isn't regular expressions on human-readable text would be a big help already. And then finally, I haven't mentioned that yet: all of this of course runs on your device. Unlike Google, Apple or TripIt, we don't read your email for this. That, on the other hand, means we have far fewer training samples than they have, so we entirely rely on people donating travel-related emails to us in some form; that is one way to help. Yeah, and that's it. Thank you. Thank you. All right, again, we have our number one question to ask today. Do you have any statistics on the signal-to-noise ratio? Essentially, how many times is the information wrong? Do you do any reviews or testing in terms of, you say that incorrect information is better than no information, but does it ever get confusing for a user, for example? I mean, we try very, very hard to detect stuff that is not plausible, or to filter out anything that we at least can detect. How much gets through that is not detectable, and then confusing?
I don't actually know, because the samples we have either work or are filtered out. But at least we don't get a lot of bug reports saying "I missed my flight because it showed something wrong", and usually it is individual providers, and they are consistently wrong, so we can add workarounds for them to filter them out and not show anything, for example. But there is the risk with providers that we don't know: if they send out something that we can't detect, we might show you a wrong departure time, and that is a problem. But you could, instead of not showing the possibly wrong information, log it somewhere, to make those statistics. I mean, log in a way that we get the information. Yeah, because it's not a website. That would go against the whole privacy idea that we are very... But if, I don't know, users agree to send that kind of... We don't have a data donation feature built into the app right now. That might be an interesting option. But some people send this to us manually, basically. Yeah. Before I give the mic to Arndt, I might just comment on that, because we already talked to people on the sender side, and there are a lot of email senders, right? In general, there is some interest from them to support this in a way. So I have a strong assumption that if there is such faulty data, there might be ways to incentivize at least the big senders, the big brands, to do it right. So I'm not so concerned about that. Yeah. Asking people to send bug reports is okay, but if you ever get a mail client to send something to you, to log it, you're going to get information about people's sex lives. No matter what you try to get, you're going to get that. It just happens, trust me. And then you have GDPR problems, because, well, you thought it was the name of an airline, but it actually was the name of a person.
Yeah, I mean, that is one of the motivations why we are so focused on doing this locally and on keeping control over this. Because your personal travel is already quite sensitive, but if you combine it with everybody else's, the amount of patterns you see... I mean, all of us travel to Brussels on the first weekend of February. If that happens once, that could be by chance. But if it happens the next year as well, and after two or three times, that is not random; then there is some relation between the people involved. And that allows you to do some scary network analysis. If you're looking for structured data that's already there: there's the OpenTravel Alliance. First it was in XML, horror; now it's in JSON. So maybe that can be implemented in the final structure. OpenTravel Alliance? Yeah, I don't know that one yet. No, it's international. Everything is in there: the planes, the trains, boats. Okay. Yeah, from the schema.org stuff, we support flights, trains, buses, events, restaurant reservations, and ferries and boats. But there's certainly more that can be done. One quick final question. I wanted to remark that anonymization of data fields is possible without being able to trace it back to an individual human being. Because airlines are enumerable, so you can get to the proverbial SHAs, whereas user names or people's names are not. And so you could hash everything up the wazoo and still recognize whether or not you should have recognized the field differently than what you've actually rendered in a client in this case. Right. Yeah, but anonymization has turned out to be rather tricky on input data like PDFs, where we also rely on the proper structure. As soon as you start to modify it, it's not sure that the extractor still detects it in the same way.
And we often don't know what kind of sensitive information is even in there, or what the fields in the barcode would mean when we start with a new format, so it's very hard to predict what we need to strike out. Sure, yes. But I thought we were talking about the JSON. Once we have the JSON, sure. But the JSON alone is not really enough to fix the extractor. We need the source document in its original form, without modification, to see where it goes wrong in the extraction. So if there is proper JSON in the source, then yes, the JSON is enough. But if our source is a PDF document attached to the email and the barcode in there, then I need the full thing to debug why the extractor failed. I'm interested, but we'll take this offline, I suppose. Yeah. Yeah. Right. A short technical question: is Bogo in the room? Ah, right, there he is. Great. All right. So thank you very much for that lively discussion. Thank you, Volker, for the presentation. Applause once again.
Welcome to the Monitoring & Observability devroom
When Prometheus Met OpenTelemetry
So, hello everyone. I'm Pavol. I'm very excited to be here, and I will speak about Prometheus and OpenTelemetry, and especially how we can use the OpenTelemetry project to scrape Prometheus metrics and what the challenges are with this setup. Quickly about myself: I'm Pavol, a software engineer at Red Hat. I mainly work in the distributed tracing space. I'm a contributor and maintainer of the OpenTelemetry Operator, the Grafana Tempo operator and the Jaeger project. If you would like to reach out to me, you can do that on Twitter or on the CNCF Slack. So, today I would like to do some introduction into the metrics ecosystem so we better understand what projects we can use, and then talk about the differences between Prometheus and OpenTelemetry from the data model perspective, how they do things. Then we'll talk about what Prometheus components we can find in the OpenTelemetry project, both on the API/SDK side and in the collector. The second half will be a live demo: we will deploy a very simple Golang application instrumented with the Prometheus client, and we will gather those metrics with the OpenTelemetry Collector. All right, so why are we actually here? We are here because the ecosystem for collecting metrics is fragmented. There are different projects that provide different capabilities. There are projects that can store metrics, some projects that only define a protocol for sending metric data, and some projects that can be used only as an API and SDK, something that developers use. Prometheus sits in between: it provides a kind of end-to-end framework for collecting, sending, storing, visualizing and alerting on metrics. Prometheus is very well adopted, it's very robust, and people know how to use it. On the other hand, there is the OpenTelemetry project, which is kind of new, and for metrics it provides a more limited set of capabilities compared to Prometheus.
People still want to use OpenTelemetry for collecting metrics, because they can use it as well for collecting other signals like traces and logs, and it integrates better with third-party vendors and SaaS observability solutions. So where is the overlap? It's in the API and SDK: Prometheus has client libraries, OpenTelemetry has an API and SDK. And then there is the protocol: Prometheus has its own metrics protocol, and OpenTelemetry has the OTLP protocol. On top of that, in OpenTelemetry there is the collector, which competes with the Prometheus agent. The agent doesn't store metrics; it can just scrape them and send them to Prometheus, not via OTLP, but via Prometheus remote write. What I would like to highlight is that OpenTelemetry also has auto-instrumentation libraries, which are not present in Prometheus. I think it's a great innovation in open source, because those libraries, as we saw in the previous talk, help you very quickly instrument your application without any code changes or recompilation. So I think it lowers the bar for adopting telemetry in your organization. So that's the ecosystem. Then we should think about how we can use these systems together, because we want to combine the feature sets they offer. So before we go into the demo, let's take a look at the differences between Prometheus and OpenTelemetry. First of all, the most obvious one is how the protocol works. Prometheus will pull the metrics from your process, while with OpenTelemetry you push the metrics into the collector. That's not a big deal; one protocol might be better for some use cases. For instance, push might be better if you have short-lived processes and need to quickly offload the data before the process shuts down. On the other hand, pull works very well in Kubernetes. I don't think that's a blocker for using these two systems together. However, the second point, data temporality, I think is kind of a big deal.
Prometheus uses cumulative temporality, which means that the last observation contains the previous observations. So if you have a counter in Prometheus, it will contain the sum, the aggregation of all the previous values. In OpenTelemetry, we can use cumulative temporality as well, but we can also use delta temporality, which means that the observations sent over the wire are just deltas. So if people are coming into this room, it will just send one, one, or maybe two if two people entered at the same time. And Prometheus cannot ingest delta temporality metrics, as far as I know. So that's a problem. The third difference is the histograms, or the exponential histograms. As far as my research went, I think they are almost compatible. However, in OpenTelemetry the histogram also contains min and max values, so in Prometheus you can potentially lose some precision of what was observed. The next difference is the resource attributes. In OpenTelemetry, when you collect data, there is a resource object that contains information about the process that is sending the data, which is a pod. It contains the pod label, deployment label, ReplicaSet label, node label, and all those things. In Prometheus, this concept doesn't exist; all the labels usually go on the metric. There is a workaround to put these labels into the target_info metric and then do a join. However, that kind of complicates the user experience, because you need to do an additional join when querying the data. The next difference is float versus int. Prometheus uses floats, and OpenTelemetry can use float and int. I don't think it's a blocker, because with floats you can represent all the metrics very well. And the last major difference is the character set that the systems support for metric names and label names. In OpenTelemetry, we can use UTF-8; in Prometheus, only a limited set of characters.
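To make the character-set difference concrete, here is a toy sanitizer that maps names onto Prometheus' legal character set; the real collector normalization has more rules (units, type suffixes, and so on), so treat this as a simplified sketch.

```python
import re

def prom_name(otel_name: str) -> str:
    """Map an OTel metric or label name onto Prometheus' legal character set
    ([a-zA-Z_:][a-zA-Z0-9_:]*): replace every other character with '_' and
    guard against a leading digit. A simplified sketch, not collector code."""
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    if re.match(r"^[0-9]", name):
        name = "_" + name
    return name

prom_name("http.server.request.duration")  # -> "http_server_request_duration"
```

This is why an OTel name with dots shows up in Prometheus with underscores: the dots are simply outside the allowed character set.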
So what happens is that when you are sending OTel labels, they get corrected to a form that Prometheus can ingest. So if there are dots, they get substituted with underscores, for instance. So, as I said, I was working in the distributed tracing space for a long time, and then I started doing some metrics. And when I did this research, I was even wondering whether these systems work together at all, because there is kind of a lot that can go wrong. And I think the delta temporality might be the biggest issue. So I started looking into how I can solve this problem. In the OpenTelemetry SDKs, the OTLP exporter that exports OTLP data can be configured to translate delta temporality metrics to cumulative, with the environment variable that you can see on the slides. You can as well set it to delta, if your metrics system supports delta, or to lowmemory, which will use even more deltas. You may as well ask the question why we have two temporalities, right? There is cumulative and there is delta. As far as I understand, delta temporality can be more resource-efficient when you are instrumenting your process, because the SDK doesn't have to track the running sum; it just quickly sends the deltas to the collector, or whatever process is collecting the data, and doesn't have to do the processing that a cumulative metrics store is doing. Okay. So the temporality, okay, it's a problem. The Prometheus exporter in the OpenTelemetry ecosystem will do some delta-to-cumulative temporality translation for you. However, if you are using the Prometheus exporter in the OTel SDKs, they will most likely drop delta metrics, so that's something to watch out for. Okay. So what are the Prometheus components in the OTel ecosystem? In the SDKs, as I mentioned, there is the Prometheus exporter; however, if your metrics are delta temporality, they will most likely be dropped.
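Conceptually, translating delta to cumulative just means keeping a running sum per unique series (metric name plus label set). A toy sketch of that bookkeeping, not the actual SDK or collector code:

```python
from collections import defaultdict

class DeltaToCumulative:
    """Accumulates delta data points into cumulative values per series.

    A toy version of the translation a Prometheus-facing exporter has to
    perform before a cumulative-only backend can ingest the data.
    """
    def __init__(self):
        self._totals = defaultdict(float)

    def record(self, name, labels, delta):
        # The series identity is the metric name plus its sorted label set.
        key = (name, tuple(sorted(labels.items())))
        self._totals[key] += delta
        return self._totals[key]

conv = DeltaToCumulative()
conv.record("door_entries", {"room": "UB2.252A"}, 1)
conv.record("door_entries", {"room": "UB2.252A"}, 2)
total = conv.record("door_entries", {"room": "UB2.252A"}, 1)  # cumulative: 4
```

The catch, and the reason the SDK mentions a lowmemory option, is that this state has to be kept for every series for the lifetime of the process, which is exactly the cost delta temporality tries to avoid on the instrumented side.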
As far as I could tell from going through the code and looking at the exporter implementations, maybe it's not the case in every language, but I was looking at Golang and Java, I think, and that's what I saw. In the OpenTelemetry collector, there are three components: there is the Prometheus receiver, which we will see in a demo; then there is the Prometheus exporter, which will try to handle temporality correctly; and then there is Prometheus remote write, which will most likely drop your delta temporality metrics.

Okay, so let's try what I prepared. It's a very simple hello-world-style application written in Golang, instrumented with the Prometheus client. Then we will have an OpenTelemetry collector with the Prometheus receiver scraping those metrics, and exposing the same metrics on the collector's /metrics endpoint through the Prometheus exporter. So we have a receiver and an exporter. In addition to that, we will print the metrics to the standard output of the collector, and we will compare whether they are correctly propagated.

So let me jump back to my console. I guess it's too small. I'm not sure I can change the color. That's better. Okay, just for reference, this is the app. It's just a main class. It uses the Prometheus client, defines a gauge for tracking the version, a counter for counting requests, a histogram for the request duration, and some HTTP endpoints. The app is running; I will just port-forward the endpoint and refresh to make a request. It's a hello world, nothing special. We're going to see the metrics: we get a histogram, a counter and a gauge, and not many labels.

As a next step, we're going to deploy the collector, which is again a very simple setup. We are deploying a Deployment, and then we have a Prometheus receiver with a static configuration. In a collector config you can have multiple receivers of the same type, so I have two Prometheus receivers: one is called static, one is SD. We're going to use the static one, which will scrape the Prometheus example app service.
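A minimal collector configuration along the lines of this demo might look like the following; the receiver names, the service hostname and port, and the choice of the debug exporter are assumptions, not taken from the recording:

```yaml
receivers:
  prometheus/static:
    config:
      scrape_configs:
        - job_name: app-job
          scrape_interval: 10s
          static_configs:
            - targets: ["prometheus-example-app:8080"]

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"   # re-exposes scraped metrics on /metrics
  debug: {}                     # prints metrics to the collector's stdout

service:
  pipelines:
    metrics:
      receivers: [prometheus/static]
      exporters: [prometheus, debug]
```

The `prometheus/static` naming is how the collector distinguishes two receivers of the same type, which is the pattern the speaker describes.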
And as you can see, this config is very similar to what you see in Prometheus, so you can potentially copy-paste your Prometheus config into the collector config for the Prometheus receiver and it should work. As the last step, we're going to enable the receiver in the metrics pipeline to make it active, and now I'm going to deploy it. As you can see, the collector is up and running, and I will port-forward the metrics endpoint again, now on the collector. We see kind of the same metrics, right? Here it's 18, here it's 19, because the Prometheus receiver scraped the endpoint, which increased the counter. What has changed are the labels: now I see the instance label, which is the service name, and the job which I defined in the collector config, called app-job. And then, yeah, we see the same metrics: the histogram, the version gauge and the request counter.

Okay, as a next step, we're going to make it a bit more automated. We're going to use the Prometheus service discovery in the second receiver, so we need to define the Kubernetes SD config. In this case, we're going to scrape all the pods that have the label that our app is using; our pod defines this label. We're going to enable it by just overriding the name of the receiver. It's the same functionality that Prometheus supports, right? I'm just using it in the OpenTelemetry collector. It should restart. It's up and running. We're going to port-forward, and now, again, the same metrics. What has changed are the labels: the instance is the pod, which makes more sense if we are configuring the service discovery for pods. The job name changed to kubernetes, which is what we defined. In addition to that, now we get the target_info metric, which carries the additional labels the receiver discovered. So here I see the namespace, the node name, the pod name. I think it's readable.
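The service-discovery receiver just described could be sketched like this; the job name matches what the demo shows, while the pod label and its value are illustrative assumptions:

```yaml
receivers:
  prometheus/sd:
    config:
      scrape_configs:
        - job_name: kubernetes
          kubernetes_sd_configs:
            - role: pod           # discover every pod in the cluster
          relabel_configs:
            # Keep only pods carrying the app's label.
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: prometheus-example-app
              action: keep
```

This is plain Prometheus `kubernetes_sd_configs` syntax, which is the speaker's point: the receiver reuses Prometheus's own service-discovery configuration.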
So what I can do right now is write a Prometheus query that will do the join and get all these labels associated with the metric. Or, in the collector, I could write a configuration that puts these labels from target_info directly into the metric labels, which will simplify the query; however, it will create more time series in Prometheus.

Okay, and as the last step, we're going to use a pod monitor for the pod that we deployed, and we're going to use the collector to pick up this pod monitor, configure the receiver, and scrape the metrics. The way this works in the OpenTelemetry operator is that we have an additional component called the target allocator. When you enable it, it will watch all the pod and service monitors in your cluster, or a subset of them, depending on the label selector. It will get the scrape targets and then distribute those targets across the collectors that you deploy. So if you deploy 50 collectors, it will distribute the scrape targets across those 50 collectors so that all the collectors get the same load. How does it work? The operator deploys the target allocator and the collector, rewrites the Prometheus receiver config with the target allocator's service name, and then the collector connects to the target allocator to get its targets.

Okay, so we're going to just enable the target allocator. For that, we need to change the deployment mode to statefulset and enable the target allocator. Now we don't have to do any config in the receiver; we can just leave the scrape config empty, as an empty array. However, we need to define just a single Prometheus receiver, because there is a convention that the operator will find this receiver and change its configuration. Okay, apply the manifest. And yeah, it's crashing. It's a demo. But it's just waiting for the target allocator to be running and then it will start properly. Sometimes it just takes some time. Okay.
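The manifest changes just described correspond to something like the following operator resource; this is a sketch against the `OpenTelemetryCollector` CRD, and field names may differ slightly between operator versions:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: collector-with-ta
spec:
  mode: statefulset            # target allocator requires statefulset mode
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true            # watch PodMonitor/ServiceMonitor objects
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs: []   # left empty; filled in by the target allocator
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```

The single receiver named exactly `prometheus` is the convention the speaker mentions: the operator looks for that receiver and rewrites its configuration to point at the target allocator service.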
It's up and running. Now, if I refresh the same metrics endpoint on the collector, I see the labels changed again, because now the instance is the pod IP again. The job name is what the Prometheus receiver uses by default, and then there are labels like namespace and pod directly on the metric. However, target_info should also contain the metadata from Kubernetes, like the pod name, the namespace name, and so on.

Okay, so what we saw is that the Prometheus receiver works pretty well. We can use it to scrape Prometheus metrics, there shouldn't be an issue, and it uses the Prometheus configuration as well, so if you are familiar with Prometheus, you can just directly copy-paste the config into the OTel collector. However, what we haven't seen is that if the process is instrumented with the OTel SDK, the delta temporality metrics will most likely be dropped if you are using the Prometheus receiver. However, if you are using the OTLP exporter from the SDK and you set the temporality correctly to cumulative, those metrics will be correctly propagated to the collector and then to Prometheus. So be careful with delta temporality. The OTel SDKs should use cumulative temporality by default, so that shouldn't be an issue, but if you are using something custom, be careful with metrics using delta.

To wrap up: we saw the Prometheus receiver. It essentially embeds the Prometheus configuration. However, the dollar signs in the OTel config are substituted with environment variables, so you need to escape them with two dollar signs; that's one difference. In the OpenTelemetry collector and operator, there is no support for the Probe and ScrapeConfig CRDs, and for service and pod monitors in the OTel operator, we don't support TLS. So there are limitations. Where do we want to go with Prometheus and OpenTelemetry? Prometheus is planning a 3.0 release, and they want to improve the OTLP ingestion endpoint.
So you can now ingest OTLP metrics into Prometheus, which is great. However, if you are using delta temporality, those metrics will be dropped, and they want to improve the support for it, along with other features. So yeah, feel free to help us build this thing and make it more robust. On the OpenTelemetry side, there are two projects where you could contribute to improve Prometheus support. In the collector, there are the Prometheus receiver that we saw, the Prometheus exporter and remote write; there are a lot of issues on GitHub where you can help. And on the operator, we are planning the next CRD version, alpha 2, and we want to create a dedicated target allocator CRD that will expose more of the Prometheus config. That's also something we are working on, and we are very happy to accept your contributions. Okay, and this is all that I prepared for today. Thank you. Do we have any questions? No questions? Going once? Okay. Thank you once again.
Unifying Observability: The Power of a Common Schema
So, up next, we have Christos and Alex on Unifying Observability: The Power of a Common Schema. Okay, thanks everyone, and welcome to our talk. In this presentation we will talk about the convergence story of two schemas: OpenTelemetry's and the Elastic Common Schema. But let's first introduce ourselves. My name is Alex. I'm leading the OpenTelemetry initiative at Elastic and I'm a co-maintainer of the OpenTelemetry semantic conventions project. Hi, I'm Christos. I work at Elastic as well, and I'm a software engineer focusing on observability and specifically OpenTelemetry, where I am a contributor and approver on the semantic conventions project. Okay, we would like to start with a quite easy and simple question: how many of you know exactly what OpenTelemetry is? That's great, I can skip some slides later. How many of you know what semantic conventions are about? That's what I expected. And how many of you know what the Elastic Common Schema is? Okay, thanks everyone.

So let's dive a bit into the history of open source tools and standards in observability, to give us a picture of where the standards come from. Let me... okay. No. Does that work? Okay. Do you hear me? That works well. Okay. Around, or a bit more than, ten years ago, when microservices emerged, that also changed the observability market and industry. That's when big tech companies started building their own open source tools for collecting observability data. Tools like Zipkin and Jaeger for distributed traces emerged, the ELK stack for logging, Prometheus for metrics; we heard a lot about this in previous talks. And based on these de facto standard tools, actual standards emerged: OpenTracing, and later OpenCensus, for distributed tracing (OpenCensus also covered metrics), OpenMetrics as a derivative of the Prometheus format, and Elastic has its own ECS, which defines the semantics of structured logging data.
Since we will talk a bit more about ECS, a quick introduction to what it is. ECS stands for the Elastic Common Schema, and it's basically just a definition of a set of fields that describe the semantics in structured logging data. So for example, if you're collecting a service name with your observability data, the common schema tells you that you should put this value into a field that is called service.name, not app.name or application.name. So you have common names that you can later search for, and this also allows you to correlate data across different signals.

Now, as you can see, we already have at least four standards here that are partially competing, partially complementary. Plus we have all the tools that also create some de facto standards for collecting data. So, it's ridiculous to have so many standards, right? We need one more that covers all of them. And usually what happens is that we get one more that is competing with all the others. And yes, we have one more standard for observability: OpenTelemetry. We will come back to the comic later. This is the slide that I can skip based on the poll. So OpenTelemetry provides not just a standard but a full ecosystem and framework for observability: for collecting data, for sending it, a protocol. One thing that I want to highlight here: there is a specification in OpenTelemetry that defines what data you can collect, like traces, metrics, logs, and an OpenTelemetry working group is also working on a profiling signal. And what we will talk more about in this presentation is the semantic conventions. Semantic conventions are very similar to what I've shown for ECS, and basically define attribute names and their semantics. Let's have a concrete example of how the data structure in OpenTelemetry looks, here with some logging data. This is a very simplified view; it's a bit more complex. But let's say we have a set of log records, right?
The OpenTelemetry protocol defines the core structure of that signal, with fields like severity text, which is basically the log level, and body, which is basically the log message. In addition, you can collect additional context information with your observability data. This is usually represented in so-called attributes, and that's where semantic conventions come into play. The semantic conventions define which attributes exist, their names, types, and also the semantics behind them. For example, if you're collecting an HTTP access log and you want to capture the HTTP request method, this is the attribute name that you would use for it. Now, observability data is usually also captured in the broader context of some resource, like a concrete service, a host, or other resources. That's why OTLP wraps the actual observability data into a resource wrapper, and a resource again has a set of attributes, so-called resource attributes, that describe the resource: something like the service name, host name, and so on. So this is the structure in OpenTelemetry for collecting observability data, and semantic conventions are basically just about the attributes and their meaning in this data.

Now let's come back to our timeline of standards. There's one important thing I didn't mention before: OpenTelemetry, and we heard this in the previous talk, is actually the result of a merger between OpenTracing and OpenCensus. OpenTelemetry also supports Prometheus metrics and OpenMetrics, which we have heard about in some of the previous talks, and just last year, Elastic announced the donation of ECS into OpenTelemetry. So coming back to this, the question is: is it really the case that we have one more competing standard? I would say actually not. With OpenTelemetry we have fewer competing standards, and OpenTelemetry really succeeds in reducing the number of competing standards and becoming the one and single standard for observability.
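The log-record structure described above can be sketched like this; it is a simplified, hand-written rendering of an OTLP log payload (not the exact protobuf), with invented example values:

```yaml
resource:
  attributes:
    service.name: checkout        # resource attributes: who produced the data
    host.name: node-1
scope_logs:
  - log_records:
      - severity_text: INFO          # protocol-level field: the log level
        body: "GET /checkout 200"    # protocol-level field: the log message
        attributes:
          http.request.method: GET   # semantic-convention attribute
```

The protocol fixes the outer shape (resource, scope, record), while the semantic conventions only govern the attribute names and meanings, such as service.name and http.request.method.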
Now, as I said before, Elastic announced the donation of ECS into OpenTelemetry's semantic conventions project. Why? Because there are great benefits to this. First of all, there are complementary parts and strengths in both schemas that we can now merge into one single schema. And second, we grow two different communities by merging them, providing a bigger network effect. So it's a huge win, I think, for the community. But there are not only benefits, there are also challenges, right? First of all, the overlap between the two schemas is a potential source of schema conflicts, and resolving these conflicts might mean that we need breaking changes in the one schema or in the other. We have seen the structure of observability data in OpenTelemetry, which consists of the protocol with its nested structure plus the semantic conventions. It's quite different from how ECS defines the fields, because ECS is just a plain definition of fields without nested structures; so there is some difference, and resolving that is a bit of a challenge.

Another interesting thing that we discovered when we started merging ECS is that in OpenTelemetry, before the merger, attributes were many times defined in a concrete context. For example, we have here an HTTP server span, and the attribute http.route is defined under the semantic conventions for HTTP server spans. The problem is that if I want to use the same attribute in a different context, say HTTP access logs, there was always a means to reference the other attribute, but it feels sort of weird, because in the one context it is a first-class attribute and in the other it is just a reference that overrides some semantics.
So, learning from ECS, what we already achieved with the merger is that we now have a dedicated attributes registry in OpenTelemetry, which serves the purpose of just defining attributes, with their types and their meaning, and in the different semantic conventions and their use cases we are just referencing those attributes. So we have a clear separation between defining attributes and using them in a concrete context.

And finally, another challenge is metrics. The metrics format in OpenTelemetry follows the TSDB model. So we have a concrete metric name, like system.disk.io in this case, with a type, with a unit, and we have a set of dimensions modeled as attributes, in this case direction, for example, for disk I/O read or write. In ECS, previously, metrics were basically modeled as numerical fields on documents, and you can have multiple numerical fields in a document, so you can have multiple metrics. That's the reason why, on the ECS side, some of these dimensions that we have in OpenTelemetry are often just encoded into the metric name. So we have things like disk read bytes or disk write bytes. This is quite a big difference in modeling. This is a case where we are learning from OpenTelemetry and adopting this at Elastic now, also with Elasticsearch supporting TSDB. So we see we are learning from both sides, which is a great thing, and we are coming to the best solution possible for the community. And Christos will tell you how this merger is actually happening in practice. Thank you.

Can you hear me? Okay. So, as Alex mentioned, there are a lot of things going on, so the question is: when is it time to celebrate that the merger has been completed? And the truth is that we are not there yet. There are things that need to be done. Actually, in the beginning everyone believed that once the merger was announced, that was all, that there was nothing to add; but the truth is that the actual work started right after the merger was announced.
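The two metric modeling styles contrasted above can be sketched side by side; the OTel side follows the documented system.disk.io convention, while the flattened field names on the ECS side are illustrative of the approach, not exact ECS definitions:

```yaml
# OpenTelemetry / TSDB style: one metric, the dimension is an attribute
metric: system.disk.io
unit: By
attributes:
  direction: read      # or: write

# ECS-document style: the dimension is baked into the field names
document:
  disk.read.bytes: 1048576
  disk.write.bytes: 524288
```

With the TSDB style you can aggregate or filter by the `direction` attribute in a query; with the flattened style each dimension value is a separate field, which is what the merger has to reconcile.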
So, let's see some examples of how the merger is happening and how things are moving forward. I have some real examples here from the upstream repository on GitHub, with issues and pull requests. This one, for example, is trying to add some new resource attributes for container images, specifically the digest of the image. As we can see, that PR was filed on the 4th of July, I think, yes, and it took some time to get it in. It took us many review cycles, more than 20 blocking comments actually, so lots of back and forth, lots of discussions, but it was finally merged after almost two months. Another example is about a very important attribute, the IP of the host, host.ip as we call it, and this one was really unique, really interesting actually, because this PR was filed by a non-ECS contributor. That contributor works for a company that is, I would say, completely unrelated to the ECS project, but it was quite nice because in that case the existence of the ECS project was taken into account; there were very interesting conversations, and it took us almost three months to get it in.

So it's quite obvious from these examples that the merger was not something trivial, not something straightforward that can happen from one day to the next, for example by writing a script that transfers everything from one project to the other. So we have decided to take an approach of moving, let's say, not so fast, paying attention to the details, and having the proper people work on specific areas so as to leverage their expertise, to be sure that what we are merging into the final project, which is the semantic conventions of OpenTelemetry, will stay there and everyone will be happy with it in the future. So, these are more or less the areas of the semantic conventions.
We have areas about databases, cloud, containers, Kubernetes, HTTP, system metrics, system resource attributes, and many others. And we have started focusing on specific areas. Some examples: there is the effort we are doing in the system metrics area, where we have a working group focusing on the stability of the area; we are in a really good position now and are moving towards stability really soon. The same for the process namespace, the process area, the process resource attributes. The same for the container area, where we are close to achieving 100 percent convergence, with a recent ongoing PR that will add the final attributes, the final metrics, excuse me. Same for the HTTP and network areas, where we have good coverage; the HTTP semantic conventions were declared stable really recently, so now we are adding on top, which is quite nice. And we have work in progress in the databases, mobile, cloud and Kubernetes areas, where working groups are getting started and focusing on them.

Over the past months we have been focusing on making the project as good as possible, in a community-driven way. So we, the ECS contributors donating this project, are not only focusing on the merger itself; we also want to ensure that the semantic conventions project will be there and can serve us in the future. So we are also focusing on other things as well, like improving the tooling of the project and working on the guidelines. This is quite important, because there are many times when the guidelines of the one project are in conflict with the guidelines of the other; in that case, we need to take a step back, reconsider the guidelines, and see what we want to have there as a final result. We also worked on restructuring the project: before, the semantic conventions within the project were grouped by signal, logs, metrics, traces and so on, but now we have a better organization, and we group the attributes by topic. And, as Alex
mentioned already, we have introduced the global attributes registry. It's actually a very big list with all the attributes, and then within the actual specification you can reference the attributes from there; that's quite useful. We're also working on adding a new concept from ECS, which is attribute nesting, or reusing namespaces. That means that if you have a namespace, for example os.whatever, you can nest it, attach it as it is, under the host namespace for example, and you don't need to redefine it again. So these are some examples from upstream; most of them are closed, some of them are, let's say, really close to being completed but have some small blockers, but the work is moving forward, that's the point.

And how is the community organized around this? As I mentioned before, we want to have the proper people working on specific areas, leveraging their expertise, so we have working groups working on each area. We are first trying to declare the areas of the semantic conventions as stable, which means that all the semantic conventions we have there will be stable, and then we can use them in the actual implementations. So the next step is to tune the implementations accordingly, which means essentially the OpenTelemetry collector and the language SDKs. Some examples: the system metrics working group, the working group around databases, a security semantic conventions working group which is getting started now; we also have approver areas for mobile, containers, Kubernetes and many others that I don't mention here. And the process looks like this: first, when you want to create a working group for a specific project, you propose the working group area and you mention there which issues you want to work on. Then you will have people expressing their interest to join the effort, you will need to find a sponsor from the technical committee, and once
everything is decided, we have a specific project board, we have regular meetings, we have people getting assigned to the issues there, and the work happens like this.

Regarding the merger itself, technically we follow this process: when we want to either introduce some new fields, some new semantic conventions, or move something from ECS to the semantic conventions of OpenTelemetry, we first check, obviously, what we have in these two projects, and we also check what the implementations have so far, essentially the OpenTelemetry collector and the SDKs. Because there are cases where, for example, the collector already uses some metrics, some semantic events, some resource attributes, but those are not yet part of the semantic conventions of OpenTelemetry; in that case we also check what is there, because we might find something interesting that we can use. Once we have considered everything, we have a final proposal; we raise an issue or a pull request directly, and we start the discussion within the community, particularly focusing on minimizing the breaking changes, because you can imagine that we want to avoid bringing frustration to our users on both sides. That's a really important thing to consider. Then we go through the review process, and once we have a conclusion, we merge, and then of course we need to handle the breaking changes, because they are there most of the time.

The summary for today is that the merger is happening. Feel free to join us; contributors are more than welcome. Everything happens upstream, so if you are interested, please join, and you will find that you can have real impact from day one. The goal of everyone is to make the semantic conventions of OpenTelemetry the one, unique and straightforward standard for observability and security that will be there for the future. So,
yeah, with that, you can find us on the CNCF Slack channels or via our GitHub handles, and in some project meetings: on Mondays we have the semantic conventions working group meeting; the next day, Tuesdays, we have the specification SIG meeting at the same hour; and on Thursdays we have the system metrics working group at 5:30 Central time. And with that, do we have any questions? I think we're almost out of time.

Hi, thank you for the talk, this was really interesting and clarified some things for me. I have one question about the benefits of these semantic conventions in terms of the front-end tooling that we are using. Because I know there's this idea in the OpenTelemetry project that you have semantic conventions and common attributes for different signals, and then we collect all this data, in all these different signals, in some observability tools, and I imagine that in a front-end we could automatically correlate different signals if we have these common attributes. I'm not up to date with the current state of this area, so this is my question: what are the main benefits of following these semantic conventions?

Yeah, I would say there are two, actually. One is, I mean, OpenTelemetry is an open source standard, right, and there are many vendors adopting it, so we need common semantics of what the data represents to build features, higher-level features, on top. This is the first thing. And the other one is correlation, as you already mentioned, across different signals, also through the resource attributes, for example, so you can drill down on different signals into the same resource. And also cross-signal correlation not only through resources but through things like trace IDs, to have them on both logs and traces, and later maybe in profiling data, this kind of thing.

Okay, thank you. So are you doing something like that in Elastic, like in the front-end at the moment? Is
there any work going on in this area, like correlation of different signals? Yeah, of course; I think that's the goal for every observability vendor, to bring all these different signals together. Okay, great, thank you very much. Any other questions? Going once? Okay, cool. Thank you.
Linux load average and other silly metrics
We'll look at something very basic, the load average, the thing that you have at the top of top when you look at the performance of your server. Very basic, but with a lot of misunderstanding, and the goal is really to understand whether it's useful or not, and at least how it works. I usually do this as a live demo, but I'm not sure about the Wi-Fi; I think I've lost the connection, but I have some recordings. Basically, we will look at what we have in top. This is not moving because I lost the connection, but we will see it later in the recordings. You can start to think about it: I have run something that you can see in the processes there. I have two CPUs. I have had a load average of 32 for a long time. I don't know if you care, but I have 99% wait I/O. Basically, my question to you is: do I have a problem or not? Am I bound on a resource or not? And if I'm bound on a resource, am I bound on CPU, or I/O, or memory, or whatever? For this simple question, I see a lot of people who cannot really explain it. The goal of the presentation is to tell you that you can mostly ignore the numbers at the top of top, because those are about the system, the processors; what you care about for your application performance is more the tasks that are running, and that is probably more useful.

Going back to the slides where I have the recordings of all the demos, so we will not try to reconnect to the Wi-Fi. Also, there is a screenshot of what we have seen. For people using the cloud: cloud providers like to provide nice graphs about performance, and usually they put the load average first, then the CPU usage. Typically, I have two processors, I have a load average of 30, and my CPU is doing nothing; memory is at 100%. What do they want to tell us with that? Because most systems will have memory usage at 100%, and that's probably fine. We will look at that in the next 20 minutes. First, this is the recording of what I wanted to show you; it was running exactly the same thing.
You see the load average, the number of CPUs, the wait I/O there. What do you think about it? Who thinks I'm bound on CPU? Who thinks I'm bound on I/O? Who thinks I'm bound on I/O because I have wait I/O? Fewer people. That's already good. Here, we see a high wait I/O, but maybe I can advance the recording. What I show in this case, when people think that I have a problem with I/O, is just to run something else. Let me check where it is in the recording. If I have the wrong recording, I will just explain what I usually show. Sorry, maybe it's in the next recording. What we see is the load average, high wait I/O, but most important, what I really care about, is this: the state of the tasks. Who thinks I am bound on I/O because of the D state? For me, this D state gives me a clue that most of my processes are waiting on I/O. Probably. We will see that it's not such an exact science, but it's something that can give some clues. I'm lost in my slides; this is the next one.

I'm running yes. You know the yes command? It displays "yes". I'm still running the same I/O there, the same throughput, I'm doing exactly the same thing, and my wait I/O has decreased. This is how to solve wait I/O: just run something else. I show that to explain that this wait I/O is not about what your tasks are doing; it's about the CPUs. When you do I/O, you don't need the CPU, so you wait. If no one else wants to do something on the CPU, then the CPU state just remembers: okay, I'm idle because someone is doing some I/O. Now I'm running something else that uses this CPU, so this CPU is not idle. This wait I/O just means idle, and idle because the last task did some I/O. The only information I get from wait I/O is that the CPU could be used for something more useful than waiting; it doesn't really tell me that I have a lot of I/O, because depending on the other workload, I will see it there or not. The state doesn't lie: my processes are all in the D state.
At least they are not in the R state, the runnable state, so they are not using CPU. In the next one, what I do to understand better the kind of I/O I'm doing, the kind of system call that puts this D state, is just run strace on my processes. I did strace -c to count the calls, and you see that most of the system calls are pwrites. That's actually what I'm running there: I'm doing writes with the pwrite system call, with direct I/O. That's basically what I have there. If I want to understand what is really behind a state that is not the R state, the runnable state, I can trace the system calls to know exactly why. I will explain why I'm looking at that, because even if D looks like disk, you can do some I/O that is not in D state, and you can have D state that has nothing to do with I/O. So it can be misleading. The D state is for uninterruptible calls. Your process has something to do that is not on CPU, and it does it in an uninterruptible state. Depending on the system call, it can be uninterruptible or not. Often I/O like pwrite uses this, but there are other kinds of I/O. Any questions so far? Any remarks? Okay, next one. I will run something else, if I remember exactly what I'm doing here. I will run fio. The difference is that I'm not calling the pwrite system call; I'm calling libaio, the asynchronous I/O library. Basically I'm doing the same thing, writing to the disk with direct I/O, and you can see the throughput is mostly the same. However, I'm not in D state anymore. So there is some I/O that puts you in D state, but there is some I/O that just puts you in the sleep state, which is not uninterruptible. Very misleading when you see those things and try to guess what happens. If you strace, there is no guessing: you know exactly the system call. And I think this is what I do just after. If I strace, I see that most of the I/O calls here are io_getevents, and there is some io_submit. This is how asynchronous I/O works.
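The per-task state letter that top shows (R, S, D, T, ...) is the third field of /proc/&lt;pid&gt;/stat. A small sketch, with the stat line hard-coded; the comm field is in parentheses and may itself contain spaces, which is why we split after the last closing parenthesis.

```python
def task_state(stat_line):
    """Return the state letter from a /proc/<pid>/stat line.

    The comm field "(...)" can contain spaces, so split on the last ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]

# Hard-coded sample: a task blocked in uninterruptible sleep (D state).
sample = "4242 (fio worker) D 1 4242 4242 0 -1 4194560"
print(task_state(sample))
```

Reading this field across all PIDs is essentially what top aggregates into its task-state summary line.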
pwrite just asks the kernel: I want to write these blocks, and it waits for that to complete. With asynchronous I/O, it tells the kernel: I will need these I/Os. That's the submit. Then it can work on something else, come back and say: do you have my I/O? If not, I will wait. The submit goes into the D state, but it's very short because it's just a submit. io_getevents, if it waits, goes into the sleep state, the S state, and not the D state. So depending on the kind of I/O, you will see the D state or not. And the wait I/O there depends on the state but, more important — I don't know if I can go back; well, I'm sure I can go back if I replay it — I guess that the load average was lower when I was running that, because the D state counts in the load average and the S state doesn't. It means that some I/O counts in the load average and some I/O doesn't, which means that with the load average, you don't really know what happens. Okay. In the next one, I'm running something else. Those were direct writes, bypassing the buffer cache. Here I'm running reads, and now I set direct=0 in fio. fio just simulates different kinds of I/O. Typically I work with databases. I'm a developer advocate for YugabyteDB, a distributed SQL database compatible with Postgres. I've also been working a lot with Oracle. They do those kinds of I/O: Postgres does not do direct I/O, it goes through the buffer cache; with Oracle you have the choice. So it really depends. Here, what I would like to show you — I don't see it from here, but I'm probably in the running state. Yeah, it was not sorted. But here, I'm mostly reading from memory, from the cache, from buffers. And this is why you see it is much faster. And there is a difference: I'm using more CPU there. You access memory more than you access the disks. And this shows up in CPU usage, the kernel part of the read. I mean, my application is doing the same thing, just an I/O call. So the user-space CPU is still low. But in system time, in the kernel, what Linux does is read from memory.
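As a minimal sketch of the synchronous pattern being contrasted here: os.pwrite is Python's wrapper for the pwrite(2) call the fio run used, and the caller blocks until the kernel has the data. O_DIRECT is deliberately omitted, since it requires aligned buffers and filesystem support and would not change the calling pattern shown.

```python
import os
import tempfile

# Synchronous positional writes: each os.pwrite blocks until the kernel
# accepts the data (with O_DIRECT, until it reaches the device).
fd, path = tempfile.mkstemp()
try:
    os.pwrite(fd, b"block-0", 0)      # write at offset 0
    os.pwrite(fd, b"block-1", 4096)   # write at offset 4096
    data = os.pread(fd, 7, 0).decode()  # positional read, no seek needed
    print(data)
finally:
    os.close(fd)
    os.unlink(path)
```

The asynchronous libaio pattern (io_submit followed later by io_getevents) splits this single blocking step in two, which is why the wait shows up in S state rather than D state.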
And this is where you have some system CPU there. That counts in the load average also. In the meantime, I ran strace to see the reads there. So I have preads, the same system call. What is different is what is behind it: it reads from the buffer cache. And I don't know if you have seen it: when I was attaching with strace, the state here was T. That's the state when you attach. And of course, it has a little overhead; you do that to troubleshoot. The important thing is the runnable state. It says that either I'm running on CPU or I want to run on CPU, and I don't know which one from those metrics. That's the point. I have only two CPUs, so I know that I cannot have more than two tasks running on CPU. The others are runnable: they are waiting in the run queue to be able to run on the CPU. Top will not show that figure. The load average adds up those waiting and those running. If you want to see the difference, you need to look at the statistics from the scheduler in /proc/schedstat, or at vmstat, which shows you the run queue. I'm saying that because I've seen a lot of people comparing the load average with the number of CPUs, as in: if the load average is higher than the number of CPUs, I have a problem. Maybe not, because if the load average is due to I/O, you don't really care about comparing it with the CPU count. And if the load average is high because you have a lot of processes in the run queue, then probably you have a problem, because you have tasks that need to run something on the CPU and just cannot, and are waiting behind. So we have seen different kinds of I/O, and they look different. Many times I've seen, especially with databases, different teams: the Linux team looking at the system and the DBA team looking at the database. And in many companies, they don't really talk to each other. So one is guessing what the other is doing, and there is a lot of misinterpretation in all that.
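One place where the run-queue figure is directly visible is /proc/loadavg itself: its fourth field is "runnable/total" tasks at that instant, which is often more telling than the averages. A sketch parsing a hard-coded sample value matching the demo (load of 32 on a two-CPU box):

```python
def parse_loadavg(text):
    """Parse /proc/loadavg: three averages, runnable/total, last PID."""
    one, five, fifteen, queue, last_pid = text.split()
    runnable, total = queue.split("/")
    return float(one), int(runnable), int(total)

# Hard-coded sample content of /proc/loadavg.
load1, runnable, total = parse_loadavg("32.10 31.50 28.00 3/412 12345")
ncpu = 2  # the two-CPU box from the demo

# Load of 32 on 2 CPUs looks alarming, but only 3 tasks actually want a CPU
# right now: the rest of the "load" is uninterruptible sleepers.
print(load1 > ncpu, runnable)
```

Comparing `runnable` (minus the CPUs actually running) with `ncpu` tells you about CPU saturation; comparing `load1` with `ncpu` mixes in D-state tasks and can mislead.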
It's very important, if you look at the numbers from the system, to understand what the database is doing. And it's also very important for the database administrator to look at the system, because many things in the database metrics will be different if the system is overloaded. I'll give a quick example: in Oracle, you have wait events where you can know exactly how much time you spend on I/O. But it's not exactly how much time you spend on I/O; it's the time between the timestamp taken before the I/O and the one taken after the I/O. If your process is in the run queue, the database thinks that it is doing I/O, but maybe the I/O is done and it's just waiting to get back on the CPU to read the timestamp and update the counter. So that's also the message. I say that to database administrators, but it applies to applications too: if you run on a system that is overloaded on CPU, then probably all of their metrics, because they require CPU cycles to get the numbers, are wrong. So why did I call these silly metrics? I didn't come up with this. If you want to understand what load average measures, Linux is open source, so just look at the source for it. You can look at the code, but more interesting are the comments, which explain the intention of the function. In Linux, the load average is computed in kernel/sched/loadavg.c, and the comment says: this file contains the magic bits required to compute the global load average figure. It is a silly number, but people think it is important. So you see why it is first in top? It is silly, but some people think it is important, so let's give them something. And they went through great pains to make it work on big machines with tickless kernels. The load average idea comes from Unix systems, where it was really measuring the load on CPU, and where it was easier to measure because you just counted the ticks in the scheduler. Linux works differently, which means that it is difficult to measure, and maybe it doesn't make much sense.
So yeah, it's good to know why this metric is there: just because people coming from Unix were used to having this single figure showing the load, and to comparing it with the application and what is done in the application. But if you don't look at the state of the processes, it can be misleading. It's easy to understand exactly why we see these states, these I/O calls, in the load average: it's just the way it is calculated. There are two interesting things in the way it is calculated. First, it is an average, and that's also a problem. If you look at the load average, you will not see a peak of activity of five seconds, because it is averaged away. The other thing is that it counts the number of active tasks, so the running state — which is really the runnable state, because if you are in the run queue you are not really running — plus the tasks in uninterruptible calls, just because they thought: if we show only the CPU load, is it really the load of the machine? For example, you run a database doing a lot of I/O; then we would say that the load is low if everyone is waiting on the disk. So let's add uninterruptible, because in many cases those I/O calls are uninterruptible calls. But they are not always, so it can be quite misleading. It doesn't mean that you don't have to look at it, but if you look at it knowing what is behind it, then it can give you some clues, like the clue about I/O, to be confirmed by looking at other things. But more interesting is the process state. A process can have something to run on the CPU, and then you look at the scheduler statistics to know if it waits for the CPU or if there is CPU available; and when it has some calls to do, they can be done in D state or S state, and they will be accounted differently by the load average. Any questions so far? Okay, the next one is more about memory, just because it's another thing that is misleading in some cases.
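The averaging itself can be sketched numerically. Roughly every five seconds the kernel folds the current count of runnable plus uninterruptible tasks into an exponentially decaying average (the real code in kernel/sched/loadavg.c uses fixed-point arithmetic; this floating-point version is a simplification):

```python
import math

def next_load(load, active, period_s=60.0, tick_s=5.0):
    """One load-average update: decay the old value, blend in the new count.

    `active` = runnable + uninterruptible tasks at this tick.
    """
    e = math.exp(-tick_s / period_s)
    return load * e + active * (1.0 - e)

# 10 tasks permanently runnable-or-in-D-state: the 1-minute average
# climbs toward 10, but only asymptotically.
load = 0.0
for _ in range(60):          # five minutes of 5-second ticks
    load = next_load(load, active=10)
print(round(load, 2))
```

This also shows why a five-second burst barely moves the number: each tick contributes only a fraction `1 - e^(-5/60) ≈ 0.08` of the instantaneous count to the 1-minute figure.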
I think it is quite clear in top that you can look at the available memory, but I see cloud providers showing the used memory or the free memory, and here I just want to explain, for those who don't know: if you do buffered I/O, like I did with direct=0 — okay, I thought, we have five minutes now. Okay, perfect. So I will finish quickly on that. Do not look at the free memory. I'm just showing that if I do some I/O, it will take some free memory, but that memory is easily freed if needed: look at the available memory. That's the memory that is available to your processes. But also think about what available means: you can use it, but if you use it, then another process doing buffered I/O may not find its data in the cache. So available doesn't mean that it's free from any impact on the others. Okay, I'll just put up the last one while I'm talking and taking questions. The idea there was just to show a really silly program doing vfork, which has nothing to do with I/O, just to show that it will go into the D state and increase the load average. That's a case I've seen on some systems, where the load average was in the thousands on a database having its files on NFS, with network issues, and those uninterruptible calls increased the load average, but without any consequence, because they were doing nothing. The only thing is that it's ugly when you look at the load average, and the other thing is that they are uninterruptible: you cannot kill them. So you want to restart the system to have nicer numbers, but of course you wait for the right moment. So just be careful: the load average accounts for some I/O and for some CPU, and there is some I/O that you do not see there. Okay, do you have any questions, remarks? Thank you. What about pressure stall information? Very good question. In the first screenshot, I was showing pressure stall information, which in my opinion is a better picture.
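The free-versus-available distinction comes straight from /proc/meminfo. A sketch on a hard-coded fragment (values are made up for illustration): MemFree is small because the page cache is full, but most of that cache is reclaimable, so MemAvailable is large.

```python
# Hard-coded sample of a /proc/meminfo fragment (values invented).
sample = """MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:   12288000 kB
Buffers:          256000 kB
Cached:         11000000 kB"""

mem = {}
for line in sample.splitlines():
    key, value = line.split(":")
    mem[key.strip()] = int(value.split()[0])  # values are in kB

free_pct = mem["MemFree"] * 100 // mem["MemTotal"]
avail_pct = mem["MemAvailable"] * 100 // mem["MemTotal"]
# "Only 3% free" sounds like an emergency; "75% available" is the truth.
print(free_pct, avail_pct)
```

A dashboard alerting on MemFree here would page someone over a perfectly healthy box; MemAvailable is the kernel's own estimate of what can be claimed without swapping.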
Pressure stall information is a counter telling you, for example over the last 10 seconds, not how many, but whether there were some processes under pressure: waiting to run on CPU, to get I/O, or to get some memory. So it really gives you an idea about the pressure itself. The only issue I have with pressure stall information is that in most of the kernels, the distributions I've seen, it is compiled into the kernel but not enabled by default. And because it's not enabled by default, I've not seen it used a lot. I think it's a good idea. Each time I used pressure stall information, it gave me the right picture, but that's just a subset of the systems I've seen, because it's not the default. Maybe there are some cases I don't know of where it's not perfect, but I try to encourage people to enable pressure stall information: instead of looking at all of that, you just see that you have some processes that could be faster if they were not under pressure on RAM, I/O, or CPU. Okay, I think we are just... Another question? If it's okay? So, looking at a very generic use case, if you were to redesign the cloud providers' graphs, what would you change them to? Could you list maybe the five most important metrics, for a generic use case, that you would put on a dashboard? On a dashboard, I think pressure stall information can be really nice, because you can show that to users. Users running on the cloud, for example, want to know if they are under pressure on CPU or on I/O, because they pay for that. So those I would put there. Load average, maybe, with a clear description that it is CPU plus some I/O. And memory: available memory, not used memory, because a system doing buffered I/O will always use all the memory in Linux. Maybe we have...
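The PSI files (e.g. /proc/pressure/cpu, /proc/pressure/io, /proc/pressure/memory) expose exactly this "was anything stalled?" view. A parsing sketch on hard-coded sample content; on real systems PSI may need `psi=1` on the kernel command line, as the speaker notes.

```python
# Hard-coded sample content of /proc/pressure/cpu (the "some" line).
sample = "some avg10=12.50 avg60=3.10 avg300=0.80 total=123456789"

fields = dict(kv.split("=") for kv in sample.split()[1:])
# avg10: share of the last 10 seconds during which at least one task
# was stalled waiting for this resource. total is cumulative stall
# time in microseconds.
print(float(fields["avg10"]))
```

Unlike the load average, this number answers the question directly: a non-zero avg10 on /proc/pressure/cpu means tasks wanted a CPU and could not get one, regardless of how much I/O was in flight.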
Fast, Cheap, DIY Monitoring with Open Source Analytics and Visualization
Welcome to the next speaker, a big applause for Robert, who's going to present on fast, cheap, DIY monitoring with open source analytics and visualization. Okay, thank you. This is wonderful to see you all here. Big shout-out to FOSDEM. This is the first time I've ever been at FOSDEM, the first time I've ever done a talk here. It's totally awesome. So thanks a bunch. All right, we've got 20 minutes to talk about do-it-yourself monitoring with ClickHouse. Just a little bit about my qualifications to talk about this: my day job is running a company called Altinity. We're a service provider for ClickHouse; we've been around for a number of years. But more particularly, I've been working with databases for a really long time, and I've been working with ClickHouse as a user for about five years. Our company also does a fair amount of work in the ecosystem around ClickHouse. Among other things, we are the maintainers of the community data source plugin for ClickHouse in Grafana. That's one of our projects. It's quite popular; it has about 14 million downloads. So let me jump in. Pretty much everybody here knows what — how many of you run monitoring systems here? Okay, it's easier to ask who doesn't. How many people use Grafana? Okay, excellent. That'll save some time. Okay, monitoring. You could say it answers questions. Stuff goes wrong; you want to figure out quickly what is going on. Or maybe things are going fine, but you want to do things like capacity planning. Monitoring helps you answer those questions. I'm kind of old school, so I say monitoring. Nowadays, people tend to say observability, but I'll use the word monitoring just because I'm used to it. So back in the days of old — I've been around for a while; the last talk covered inheritances from Unix — when things broke on Unix, you would kind of go in and lay hands upon the machine and run vmstat or iostat and try to figure out what was going on. Well, those were the bad old days.
Nowadays, we tend to do things graphically. And there are a couple of reasons for that. One, it's way better, because if you're trying to find a problem — like, hey, somebody is just blowing out the CPU — you want to see durations, you want to be able to see a bunch of metrics together, and it's much easier to see these things graphically. The other thing is that, as you all know, we're now working with systems, for example, that are based on containers, and you can't actually go in and lay hands on them very easily. Moreover, they can restart, and then you lose what's on the local file system. So graphical monitoring, graphical display, is critical. What we're going to do in this talk is I'm just going to give you a few clues about how to build one completely from scratch. Basically, when you're talking about monitoring, you have a number of parts, and this picture shows them. You start with metrics, logs, and traces. I probably don't have to explain what those are; they are the standard types of information that come out of systems. You're going to ingest them into some sort of database, which you can then run queries on, and it will store them over time. And then you have some system for doing visualization — that's thing number one — and alerting, to tell you when bad stuff is happening. We'll mostly be talking about visualization today, but alerting is another important part. The core of these systems is your store: something that holds onto these metrics and allows you to do analyses of various kinds on them. And for this talk, we're going to talk about ClickHouse. How many people here have heard of ClickHouse? Excellent. Okay. Great. By the way, we have some of the core committers for ClickHouse here, so if you have any questions, I'm sure you'll get answers. So ClickHouse — you could think of it as a bit like MySQL or Postgres, but for analytics, in the sense that it runs pretty much everywhere.
It's open source, but it also has all the architectural features designed to read vast amounts of data very, very quickly. I won't bore you with the specific marketing things here, but instead show you a slide that gives you a little better idea of what it is about ClickHouse that makes it so fast. First: like most analytical databases, it stores data in columns. You can think of every column as being a big array that's split up across a bunch of files, but can be read very efficiently. ClickHouse can also make replicas of data very easily. By having more replicas, you have more things to query, so you can handle more read load on the system. Another thing that ClickHouse is extremely good at is compressing data and then reading it in parallel. All of these things taken together allow us to scan data — often hundreds of millions of rows — very, very efficiently, and I'll show you an example of that later in the talk. So then, Grafana and ClickHouse go together hand in glove. We don't have actual stats on this, because you just never know what people are using — it's all open source — but Grafana is probably the most commonly used visualization tool with ClickHouse. It's been available for years, and just about everybody uses it, particularly for operational metrics. And I love Grafana. There are many tools out there, but with Grafana, the level of interactivity — the fact that you can drill in, look at different time ranges, bounce around between different metrics easily — it's really great to use. All right. So I'm going to build an example here in about five slides. I'm going to just pick vmstat, which I showed on that previous slide. What we're going to do is crunch the data, load it into ClickHouse, and then show it in Grafana. So the question is how to do that.
Well, I'm just going to do it from scratch. The first thing I'm going to do is collect vmstat data and turn it into a format that I can load into ClickHouse. This is a little Python program that does it. The important thing to note is not the details, because they're probably not that great — I'm not a particularly good Python programmer. The key thing is that there are 14 lines of code here that crunch vmstat, and every five seconds it will burp out some JSON. And what that JSON looks like is this slide right here. Pretty beautiful, right? This is data that we can get into ClickHouse really easily. How do we do that? Well, first of all, we're going to make a table inside ClickHouse to hold it. Relational databases want things to be in tabular form, but it turns out ClickHouse has a pretty expansive idea of what constitutes a tabular form. In this particular example, I'm going to do it the simplest way: I'm just going to create a table with a column for each of my data values. So what you see here is a table definition that maps pretty much directly to the values that you get out of vmstat. And just a little background, if you haven't used analytic databases before or aren't a deep database person: we tend to think of these column values as being one of two types — dimensions, which are characteristics of the thing that we're measuring, or the actual measurements themselves. That's important because, generally speaking, when we're scanning the data, we will group by the dimensions — collect all the data by host, for example, over a certain period of time — and then aggregate the measurements: we take averages, we take max, we take min, and so on and so forth. So that's just a quick, 60-second introduction to data modeling inside an analytic database. The third thing we need to do is: we've got the table, we've got the data.
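A hedged reconstruction of what such a collector might look like (the speaker's actual 14-line script is in his GitHub repo; the field names below just follow vmstat's default header row). Parsing is split out as a pure function; a real collector would wrap `vmstat -n 5` with subprocess and feed each output line through `parse_row`.

```python
import json
import socket
import time

def parse_row(headers, line, host=None, ts=None):
    """Turn one vmstat output line into a JSON document keyed by header name."""
    row = dict(zip(headers, (int(v) for v in line.split())))
    row["host"] = host or socket.gethostname()   # dimension
    row["ts"] = ts or int(time.time())           # dimension
    return json.dumps(row)

# Sample header row and data row as vmstat prints them (values invented).
headers = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()
line = "2 1 0 512000 256000 1100000 0 0 120 340 500 900 25 10 60 5 0"

doc = json.loads(parse_row(headers, line, host="db1", ts=1700000000))
print(doc["r"], doc["wa"], doc["host"])
```

Each emitted document maps one-to-one onto the table definition on the slide: `host` and `ts` are the dimensions you group by, the vmstat counters are the measurements you aggregate.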
We need to make the data go to the table. ClickHouse has a bunch of different ways that you can load data. One of the simplest is its HTTP interface: you can simply push SQL commands, and the data to go with them, using curl — there was a great talk on curl about two hours ago. This is an example of the code. At the top level, if you're familiar with SQL insert commands, that's an insert command. ClickHouse has this kind of interesting variation on it where it has input formats. Instead of reading a bunch of values that look like tuples, in this case I'm actually going to read some data which — for example, if I'm doing a POST — will just be in the payload. It will actually be a bunch of JSON documents, each of which will turn into a row in the database. And this thing down here just shows you how to execute exactly that command using curl. By the way, one of the things about this talk is that everything I've done is in an example out on GitHub. If you want to find it right now, you could just Google it — actually, I've got the link at the end; I'll show it to you. So anyway, we can load that data up. In the examples, you'll see that I actually wrote a little script that I can just run and put in the background, and then it will collect this data on each host that I'm measuring and stick it into ClickHouse. Then what we do is build a Grafana dashboard. Pretty much everybody in the room knows how Grafana works. What you're going to do is, if you haven't done it already, install Grafana. You will need to add a plugin to talk to ClickHouse. There are two of them out there: there's the Altinity plugin that we maintain — that's the community plugin that's been around for years — and there's a new one from ClickHouse, Inc.
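A sketch of the curl equivalent in Python: POST rows to the ClickHouse HTTP interface using the JSONEachRow input format. The host, table name, and lack of credentials are placeholder assumptions; `prepare_insert` only builds the request, so the example runs without a server.

```python
import json
from urllib.parse import quote
from urllib.request import Request

def prepare_insert(rows, table="vmstat", host="http://localhost:8123"):
    """Build (but do not send) the HTTP POST equivalent of:
       curl "$host/?query=INSERT INTO vmstat FORMAT JSONEachRow" --data-binary @-
    """
    query = f"INSERT INTO {table} FORMAT JSONEachRow"
    body = "\n".join(json.dumps(r) for r in rows).encode()
    return Request(f"{host}/?query={quote(query)}", data=body, method="POST")

req = prepare_insert([
    {"host": "db1", "ts": 1700000000, "r": 2, "wa": 5},
    {"host": "db1", "ts": 1700000005, "r": 1, "wa": 0},
])
print(req.get_method(), len(req.data.splitlines()))
# To actually send it: urllib.request.urlopen(req)
```

With JSONEachRow, each newline-delimited JSON document in the payload becomes one row, which is exactly the shape the collector script above emits.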
Pick a plugin, install it, put your connection parameters in, and then you can write a few queries. Within a few minutes, if you're at all familiar with Grafana, you can create a dashboard that looks just like this. This literally took about 15 minutes to create. And then go crazy. So you've got data loading — loading it up using curl, maybe putting a little script around it so that it can reliably load the data — and then you can go look at it. But the cool thing here is that once you're in a database, you have incredible freedom to use the data any way you want. ClickHouse is not a database like Prometheus, which is basically designed to hold metrics; ClickHouse is a general-purpose analytic database. You can ask virtually any question of that data that can be expressed in SQL. It may go fast, it may go slow, but this is an example where we're asking a question like: how many machines had a certain amount of load — over 25% load for at least a minute — in the last 24 hours? And we can sum the number of minutes. This is just an arbitrary question, but it's just a few lines of SQL. Moreover, if we have Grafana attached to this, we can turn that into some sort of display in Grafana that will show it graphically. You then have something that you can really play around with. And this is just the effect of running some popular commands — stress is a great command if you want to see something hog memory. What you can see in this graph here: that big blue part was the OS buffer cache, which was actually filled with a bunch of pages from previous processes. You can see that the buffer cache was pretty much blown away when these stress runs began. Up above, you see the result of running the sysbench CPU command; you can see the effect on the CPU usage. So at this point, this is a very simple example, but you actually have a fair amount of insight into what's going on in these machines.
So the next thing is to take a drink and scale this up a little bit. What I just showed you is a toy. Anybody can do it; it's just a few lines of code. As you start to think about scaling this system — making it work across a bunch of hosts, a bunch of different types of metrics, maybe adding logs — the first question that probably comes up is: hey, I love open source, but do I have to write everything? And the answer is no, not if you don't want to. There are a couple of projects that, if you're going to go and do this, you should probably look at. One is Fluent Bit. How many people have used Fluent Bit or Fluentd? Okay, a fair number. We use Fluent Bit quite a bit in our cloud stuff. What Fluent Bit does is it basically has a bunch of plugins which will sample different kinds of metrics, turn them into a data format, and then put them somewhere else. So, for example, you can get CPU metrics similar to what I just showed you: there is an input plugin for Fluent Bit which will grab those, and then you can turn around and post them to ClickHouse. It works as a daemon, so you can just bring it up and let it run. As a result, you don't have to figure out how to parse all these different formats. Moreover, you don't have to worry about basic things like posting to HTTP; they take care of it. So that's a really useful project to look at and one that you should consider. The second thing is OpenTelemetry, which, if you were here two talks ago, is basically trying to create a common data model for all observability data: metrics, logs, and traces. That gives you sort of a universal broker, if you will, that can handle data coming from all kinds of different sources — maybe Fluent Bit, maybe custom stuff that you build — and then push it into databases like ClickHouse. And in fact, OTel, as it's often called, has an exporter for ClickHouse, which I believe is in the alpha stage.
Some people are using it. It still has some performance issues, but one of the things it takes care of is building the data structures that you use to store your data, and doing that in a kind of rational way. So in fact, to answer question one fully: if you were going to build this out, you might have a metrics pipeline that looks kind of like the following, where you have Fluent Bit, perhaps feeding an OTel collector — that's the broker that OpenTelemetry provides — pushing it to ClickHouse, and then reading it into Grafana. I'm not saying this is the only way to do it, or even that it's the right thing — that's what you have to decide — but you can do this. The pieces are out there, and they're all open source. Second question: this is FOSDEM; you yearn to use Postgres or MySQL. Why don't we just use Postgres for this? Well, actually, for the little example I gave, you could use Postgres and it would make no difference whatsoever. But as you start to scale, it becomes really important to pay attention to how the data is actually represented and what kind of query power you have on top of it. The key thing to notice here is that Postgres and MySQL are row databases: they store the data as rows. There is a plugin for Postgres, I guess, that is beginning to change that. But in general, if you read anything out of a row — if you touch anything in the row — you have to read the entire row. Whereas ClickHouse is columnar, so if you read, say, three columns out of a 109-column table, you only touch the data for those three columns. Moreover, by putting things into columns — effectively arrays of a single data type — and taking advantage of things like sorting, say time-based sorting, the data tends to compress extremely well. You can literally compress a lot of these metrics by a factor of 100, so they will compress down to 1% of their previous size. Let me just show that graphically.
The effect is that when you run queries with ClickHouse, they can easily be a thousand times faster than Postgres and MySQL, because of columnar structure, because of compression, because of parallelization. And this illustrates it. This was a sample, literally reading three columns out of 109, scanning 200 million rows. The amount of data you would read in Postgres or MySQL was 59 gigs — everything. In ClickHouse, first of all, you read three columns, so that was literally about 3% of the data: you're already up 33x. Now those columns are compressed, so you're not just up 33x: the amount of data that you're reading has been reduced by a factor of almost 3,000 at this point. And then you can spread those reads — say, if they're coming off an SSD — across eight threads, so you could argue that that actually reduces the amount of work for each thread by a factor of approximately 23,000. So this is not just an order of magnitude, but orders of magnitude less I/O. This is the reason why ClickHouse is great for this kind of problem. Third thing: how to handle data from a lot of collectors. For those of you who know ClickHouse, you saw that example and thought: this guy's an idiot — he is adding five rows at a time. The flip side of using columns in data warehouses is that you touch a lot of files every time you load data. So when you load data, you want to buffer it into big blocks — maybe a million messages at a time. There are many ways to do this. ClickHouse has what are called async inserts. But in general, for a system like this, if you're starting to receive data from many collectors, and you're getting into levels of traffic like 100,000 or 200,000 events per second, up to millions of events per second, what you want to consider is introducing Kafka or Redpanda or a similar event stream, so the collectors write to Kafka.
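The arithmetic above can be worked through explicitly. The 83x compression ratio below is an assumption chosen to reproduce the talk's approximate ~3,000x overall figure; the real ratio depends entirely on the data.

```python
# Talk's approximate numbers: 200M rows, 109 columns, 59 GB in a row store.
row_store_read_gb = 59.0
columns_read, columns_total = 3, 109

# Columnar: touch only 3 of 109 columns (~33-36x less data).
columnar_read_gb = row_store_read_gb * columns_read / columns_total

# Those columns are compressed on disk (assumed ratio).
compression = 83
compressed_gb = columnar_read_gb / compression

overall = row_store_read_gb / compressed_gb   # total I/O reduction, ~3,000x
per_thread = overall * 8                      # spread across 8 parallel threads
print(round(overall), round(per_thread))
```

The compounding is the point: column pruning, compression, and parallelism each contribute a modest factor, but multiplied together they turn a 59 GB scan into a few megabytes per thread.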
This then breaks the connection between your producers, which are gathering monitoring data, and ClickHouse, which is the consumer. Moreover, ClickHouse has very good integration with Kafka. There are multiple built-in types of integration, but the most commonly used one is the Kafka table engine. That basically wraps a Kafka topic so you can do a select off it, and it reads off the queue. And then there's a trick where you use materialized views to do that read automatically and stick the data into a table in the database. As your architecture gets larger, you definitely want to include this, so that you can read large blocks of data very quickly. Final thing. I talked about this basic mapping of the data: I modeled it as a table which exactly matches the data that's coming out of the JSON, but that's not the only way that ClickHouse can do this. There are actually a number of options. One of the most common ones — and one that I was playing around with just as part of this exercise — is to have a column that just contains the JSON as a string. Then there are functions in ClickHouse to pull the values out one by one, but they're kind of painful to use; there's extra syntax. So what you can also do is, as the data loads, just pull out particular values and turn them into regular columns. This is very simple to do with materialized views. What that means is that if you're doing a demo for a talk like this, and it's 3 a.m. the morning before the talk, you can basically load the entire JSON document, pull a couple of columns out to make your queries work, and leave the rest there. One of the secret powers of ClickHouse is that it does schema management — schema evolution — very, very efficiently, online. So as time goes on and you see more stuff in that JSON string you'd like to read, you can just pull it out into new columns.
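The "buffer into big blocks" advice can be sketched on the client side too. This is a generic batching pattern, not a ClickHouse API: `flush` is a stub that, in practice, would POST the block over HTTP or produce it to Kafka.

```python
import json

class BatchWriter:
    """Buffer rows and flush them in large blocks, so each insert into a
    columnar store touches its column files once instead of per row."""

    def __init__(self, flush, batch_size=1_000_000):
        self.flush = flush          # callable taking one newline-joined block
        self.batch_size = batch_size
        self.buf = []

    def add(self, row):
        self.buf.append(json.dumps(row))
        if len(self.buf) >= self.batch_size:
            self.flush("\n".join(self.buf))
            self.buf = []

# Demo with a tiny batch size: 7 rows, blocks of 3.
blocks = []
w = BatchWriter(blocks.append, batch_size=3)
for i in range(7):
    w.add({"ts": i})
print(len(blocks), len(w.buf))  # two full blocks flushed, one row still buffered
```

A production version would also flush on a timer and on shutdown so the tail of the buffer is not lost; putting Kafka in the middle, as the talk suggests, moves that durability concern out of the collector entirely.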
Those are commands that can be executed instantly. All right, we're down to zero seconds; we're almost done. So there are other options. If you have key-value pairs, you can use pairs of arrays, an array of keys and an array of values, and ClickHouse has very efficient functions to match those up and process them. You can also use maps, which are like hash tables. These are other ways you can represent JSON. ClickHouse has a JSON data type, but it is experimental and actually due for a rewrite; it's on the schedule, I just found that out last night. So it's going to be re-implemented in a much better way. But in the meantime, there are lots of ways to process JSON, and ClickHouse has many other features that make JSON very handy to work with. Okay, where can you find out more? Lots of sources about this stuff; I'm listing them here. The sample code that shows everything I did here is up on GitHub. If you just Google "GitHub ClickHouse SQL examples", it will probably be the first thing that pops up, and there's a directory in there called open source monitoring, so you'll see that. And the official ClickHouse docs are very, very complete. There's also the Altinity Grafana plugin for ClickHouse, Fluent Bit, OpenTelemetry. And we do lots of blogs and YouTube videos about how to use this stuff, including for monitoring. And that's it. Thank you very much. If you want to get in touch with me, you can connect with me on LinkedIn. I'm on Slack: CNCF, ClickHouse, Altinity, Data on Kubernetes. Or send me email, whatever. So, any questions? Do we have any questions? Could you please stay a few minutes for the Q&A so that people can hear the questions? Any questions? You showed us data coming from a script. Can we do the same thing with data coming from web applications, for example? Oh, from a web application?
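The paired-arrays and map representations mentioned above can be illustrated in a few lines of Python. ClickHouse would use functions such as arrayZip over Array columns, or a Map column; the metric names and values here are invented.

```python
# Key-value pairs stored as two parallel arrays, as in a ClickHouse row
# with columns like `keys Array(String), values Array(Float64)`.
keys = ["cpu", "mem", "disk"]
values = [0.42, 0.61, 0.09]

# Matching them up positionally, analogous to arrayZip(keys, values):
pairs = list(zip(keys, values))

# Or the map / hash-table representation (a Map column in ClickHouse):
as_map = dict(pairs)

print(pairs[1], as_map["disk"])
```

Both shapes carry the same information; the arrays keep insertion order and duplicates, while the map gives direct lookup by key.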
Yeah, absolutely. Anything that can generate metrics, you can push to ClickHouse. And in that case, if you're generating metrics, I would definitely recommend going and looking at OTel, the OpenTelemetry project, because it has SDKs which you can embed in your application to generate metrics that will be translated into a standard form, and then they can get pushed to ClickHouse. So did I understand correctly, there's an SDK for doing this? Yes, it's part of OpenTelemetry. That's one way, or you can do it yourself; there are multiple solutions for this. Hi. First, thanks for the presentation. My question is: does it make sense to use ClickHouse for tracing, storing traces and such things, like traces and logs? Yeah, tracing, absolutely. Processing tracing is actually one of the major use cases for ClickHouse. What's kind of interesting is that you could use a time-series database or something like Prometheus, but the cool thing about putting it in ClickHouse is that once you get it in, because you have a full SQL implementation, you have much more flexibility in what you can do with the data. That's thing number one. We also have users we work with that have just standardized on ClickHouse as a sort of one-size-fits-all, which reduces operational complexity. Third thing: ClickHouse is just really good at anything that's time-ordered, including logs, practically any kind of data that's emitted over time. So it's a very powerful engine for this kind of thing. I focused on metrics because it was just one use case, but traces and logs also work very well. We have time for one more. Thank you. I'm looking for a tool to do reprocessing. Once I ingest a lot of data into ClickHouse, I want to run, sort of on demand, some fancy complex calculation that's a bit too complex for a query, and then store the result again in ClickHouse.
Is there a good framework or tool for that? Yeah, I don't quite understand the use case. So you're saying you're going to push data in, run a calculation on it, and put it somewhere else? Yeah, so I gather 100% of the data, but I need only 1%, and on that I need to do some complex business-wide calculation. Right, okay. There are a couple of options. Within ClickHouse itself, I would recommend you look at materialized views, which are basically like insert triggers that will run a query and then put the result in another table. The most common use case is pre-aggregation, but actually they are a generalized transform mechanism. Anything that comes in as an input block, you can run calculations on it using a query, mix in data by joining on other tables, do anything you want, and then dump it to a target table. And that's a very efficient mechanism: every time an insert comes in, the query runs on that block of data. It sounds like you have something a little bit different, though. Yeah, I can ask you. Yeah, come by later and we can talk through it. Thank you. Okay, I think that's a wrap. Yeah, thank you.
Implementing distributed traces with eBPF
Thank you so much. My name is Nikola Grcevski, and I'm here with my colleague Mario Macías. I think I pronounced your name right. Yeah, you pronounced it very well. We work on an open source project at Grafana called Grafana Beyla; we're both software engineers. We didn't practice this presentation much because we live on two different continents, so you get what you get. It's usually not too bad, but yeah, we'll give it a shot. Let's go. So we will first do a very quick introduction to what distributed tracing is. I know most of you already know, but just to get a common mindset, even for people that are new to observability or to distributed tracing. Then we will explain a bit how it is implemented, and how we implement it in Grafana Beyla using eBPF. So if you want to instrument a server, you might add an instrumentation library, for example the OpenTelemetry SDK, and insert some instrumentation points in your server to get, for each request, a span containing data like the start and the end, or some extra information about the request, like the client ID, the path of an HTTP request, the response, etc. Then you can send that to an OpenTelemetry collector and visualize it. If we have a distributed service in which one service calls another, gets responses and so on, you could still do the same: instrument each point and then send the spans to an OpenTelemetry collector. But while the spans individually give information, separately they may lack a lot of context. If you just get a bunch of frontend, database and backend spans separately, it will not be as useful as, for example, knowing for each span which request invoked it, so you can see everything in context. This is what we name distributed tracing, or context propagation. In OpenTelemetry concretely, we use the W3C standard, which uses a traceparent header in the request.
So into your request you can insert headers with the trace ID and the parent span ID, and then the services receiving those invocations can read this traceparent and add it to their own requests. That way you can always track the context. This is not any real SDK or any real language, it's just an example of how you could do it: you have a service, and on each request you read this traceparent, create your span as part of the trace, and when you have to call other services you add this traceparent in the headers, and then in the span. This can be done manually in code, via an SDK, or it can be injected by your instrumentation agent, like the OpenTelemetry Java or OpenTelemetry .NET agents. Beyla follows a similar approach, especially for services written in a language that is not so easy to instrument via an external agent. I'm thinking, for example, of compiled languages like Go, Rust and C. In that case, Grafana Beyla can be deployed on the same host as the services you want to instrument, and it will use eBPF technology (we will talk a bit about it later) to hook into and inspect the runtimes and libraries of your application, or the functions of your application, as well as some points of the Linux kernel. It then composes metrics and traces and forwards them to your OpenTelemetry collector. What is eBPF? I mentioned it before. It's a just-in-time virtual machine that is shipped inside the Linux kernel. It allows you to efficiently hook programs to multiple events in the kernel, in libraries and in user space programs. For example, Beyla can hook every time an HTTP request is received in the instrumented application: Beyla can immediately execute a piece of code, a probe, and then inspect and even modify the runtime memory of your process, or even the kernel.
This way it is able to know when a service request starts and ends, and even inspect some arguments of the request. Beyla has two ways to produce span information. One is to inspect at the language level. At the language level we currently only support Go: it hooks uprobes into the Go runtime and the Go libraries to inspect them. To support other languages, compiled languages but also Python, Ruby or other interpreted languages, it hooks kprobes into several kernel functions and libraries to know when connections are started, to read the arguments of the requests and the responses, and so on. In Go, we are currently inspecting HTTP, HTTPS, gRPC, HTTP/2 and soon SQL. At the kernel level, at the moment we are inspecting HTTP and HTTPS, but other protocols will come at some point. That covers how we produce the spans; Nikola will talk about how the context is propagated with Beyla. I think you can hear me here. You can hear me, right? Yeah, this is working. We showed a previous example where that logic was done manually in the program: reading the trace information coming in on a request and then sending it onward. This is effectively what most of the OpenTelemetry SDK instrumentations do; the agents in Java or .NET do that injection for you automatically. But we do it with eBPF, so you don't have to add an SDK to your packages, for languages where one doesn't exist, or where maybe your library dependencies don't quite work with the SDK because of different versions, or it's not up to date, or whatever the reason. We hook into the program, like Mario mentioned, in different ways, and when a request starts we actually read the memory with eBPF and see what is in that traceparent. If there isn't one, we'll generate one according to the W3C standard.
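The read-or-generate-then-propagate cycle just described can be sketched in pure Python. The IDs below are invented, and real SDKs additionally handle validation, flags and sampling, so this is only the mechanical shape of the W3C traceparent header (version-traceid-spanid-flags).

```python
import secrets

def make_traceparent():
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)   # 16 random bytes, hex-encoded
    span_id = secrets.token_hex(8)     # 8 random bytes
    return f"00-{trace_id}-{span_id}-01"

def propagate(incoming_headers):
    """What a service does on each request: read the traceparent (or
    generate one if absent), keep the trace id, and mint a fresh span
    id for the outgoing call."""
    header = incoming_headers.get("traceparent") or make_traceparent()
    version, trace_id, _parent_span, flags = header.split("-")
    new_span = secrets.token_hex(8)
    return {"traceparent": f"{version}-{trace_id}-{new_span}-{flags}"}

incoming = {"traceparent": "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01"}
outgoing = propagate(incoming)
print(outgoing["traceparent"])
```

The trace ID survives across hops while each service contributes its own span ID, which is exactly what lets a backend stitch the spans back into one trace.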
Then what we do next is notice an outgoing call, and in that outgoing call, if we can find the information about the headers, we will inject the outgoing trace header just like the SDK would do. This is what happens in Go currently with Beyla; this is exactly what we do. Now, how does this all work internally? Well, say an incoming request on a server hit something like /ping, and the server made an outgoing request to /ping-me-too. We need to track that this incoming request matches this outgoing request, but the call may be async. Maybe somebody wrote a library and said, well, I don't want to wait for this request, I just want to do it async, for whatever reason, I'm using some reactive library. In that case, for Go, we essentially track goroutine creation and termination. Because the Go runtime and the standard libraries are very standardized and everybody uses them, we're able to do this kind of stuff. It doesn't need to be the first argument, it doesn't need to be the context, none of that stuff: we just track goroutine creation, and we're able to match things up later on. That's how we propagate the context. Now, for the other languages we thought, well, how are we going to do that? People use any number of libraries. How do you do this for compiled languages, or just-in-time compiled languages? It's kind of hard. For that we wrote additional support that does something more sneaky, or if you will, something more interesting. When two servers, or two processes, talk to each other over HTTP, for example, they have a unique pair of information that identifies every connection. A client opens a remote connection to a server: it has a source port, which is typically ephemeral, and a destination port, which is the server port. When we see that connection pair, we use it as a unique key and store it in an eBPF map.
Then, when the server on the other side gets that request, we look up that map and ask: I have this connection pair, does it match any client that made this connection? It does require that one single Beyla instance monitors both processes. If that is true, then we can actually tie these requests between servers together without using the traceparent propagation at all. For languages where we haven't written the additional support to inject the header information, we use this as a backup option. This context propagation correlates requests internally, through the kernel. Here's an example. We start the client call. It may read the traceparent information that was present from a previous call, but if there isn't one, it's just going to be generated right there in eBPF, and that information is stored. Then later on, when a server request happens based on the client call, we read that map, read the traceparent information, and create the spans, just as if that traceparent logic had flowed through the HTTP headers. More or less the same. There are restrictions, of course. Obviously, for this to work we have to be on a single node. Now, these eBPF maps can be shared on a volume, and maybe there's a way to use that, but we don't do that or support it right now. This is also not released yet, it's just in the main branch; it's one of the newer things we added. But with this, I'm more of an "I'll believe it when I see it" person, so we want to try to do a demo. Everything's running off the laptop that Mario has here; we're not going to connect to any cloud services. What we want to demonstrate is a few HTTP services. And gRPC also; they're using gRPC in this case. They're written in Go. We're going to have one Beyla instance looking at all of them.
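The connection-pair trick described above can be sketched in Python, with a dict standing in for the eBPF map shared by the single Beyla instance that observes both processes. The addresses, ports and trace ID are invented for illustration.

```python
# A dict stands in for the eBPF map shared by one Beyla instance
# that observes both the client and the server process on one node.
conn_map = {}

def on_client_connect(src_ip, src_port, dst_ip, dst_port, trace_id):
    """Client side: record the trace id under the connection 4-tuple
    (the source port is the ephemeral one the kernel assigned)."""
    conn_map[(src_ip, src_port, dst_ip, dst_port)] = trace_id

def on_server_accept(peer_ip, peer_port, local_ip, local_port):
    """Server side: the same 4-tuple, seen from the accepting socket,
    looks up the trace id without any header having been injected."""
    return conn_map.get((peer_ip, peer_port, local_ip, local_port))

on_client_connect("10.0.0.5", 51234, "10.0.0.9", 8080, trace_id="t-42")
found = on_server_accept("10.0.0.5", 51234, "10.0.0.9", 8080)
print(found)
```

Because the key is the connection itself, nothing in the request bytes is modified, which is exactly why this works for protocols and languages where header injection is impossible, and why it is limited to a single node.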
We're going to use this little tool that Fabian made, a little Docker Compose setup with the OTel-LGTM image, which has the full Grafana stack with all our open source products, with an OpenTelemetry collector set up so it can ingest traces, metrics and everything you need. Very convenient for testing, or for spinning up your own Grafana stack at home; it's just one Docker image with all of it. I also wanted to mention, because we didn't say it: the presentation is obviously about distributed traces, but Beyla supports metrics too. HTTP metrics were included from the start of the product; distributed traces are some of the newer stuff we're working on. Okay, so for this demo we will show a simple distributed application. It's a synthetic application: just a frontend sending a request to a backend, and the backend distributing some load to workers and then getting a response. Do you need to hold that? No, it's okay. Thank you. I have put everything into a Docker Compose file, just to make the demo easier on my laptop. So we have this OpenTelemetry collector, which is the OTel-LGTM container that Fabian did, and we just drop Beyla in as a container. You can run Beyla as a host process, but for convenience it's a container here too. We need to give it access to the PID namespace of the host, because it has to instrument all the processes on that host, and also privileged access, because loading eBPF programs requires administrative privileges. Then we set the OpenTelemetry endpoint here in the standard configuration; Beyla accepts the standard OpenTelemetry configuration variables for setting up many values. And we also provide a configuration file. Basically, here we say how to group the HTTP routes. For example, there is a route that calculates a factorial, and in the request you will pass "factorial" and the number to calculate.
We don't want a cardinality explosion: we don't want to create a different route value for every number we calculate. So we say, okay, just group all the URLs matching this pattern under a single factorial route. And then we tell Beyla how to discover the services to instrument. We have a frontend, a backend and a worker container, and we pass those names. This accepts any regular expression, so if we just say a dot, it will try to instrument all the processes on the host. But in that case it would also instrument some parts of the Docker API, the Docker Compose API, so to not generate noise we just provide the services we want to instrument. Let me then run this Docker Compose file. Okay, this application is a very simple application: a huge-factorial calculator. I will just write a number, and it will calculate the factorial. And if you feed it bigger numbers, okay, it calculates. Boom! That's an error introduced on purpose, because I also use this application to test error tracking in Beyla. But it usually works. Meanwhile, Beyla was already running, so we have been generating some traces. Let me go to the local Grafana. Let's see. I go to, for example, Explore. Here I select Tempo, and let me search for all the traces. Okay, beautiful. It's strange, because here we can see that Beyla... Oh, yeah? Okay, let me check. No data. Okay, it happens in the best families. No, but we have this... I mean, it is able to... Okay, I don't know what happened. But... For sure, it's a bug in Grafana. So I have here many, many requests, or many traces. Let me pick this submit trace, which is the one that triggers the backend and the workers. If we enter here, you will see the trace information: how the frontend invokes the backend. You can also track the internal status of the request, like how much time the request is in queue or being processed.
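The route-grouping idea (collapsing every numeric URL into one route label to avoid a cardinality explosion) can be sketched like this; the pattern and the "{num}" label are invented for illustration, not Beyla's actual syntax.

```python
import re

# Collapse /factorial/<any number> into one low-cardinality route
# label, in the spirit of the route grouping described in the talk.
ROUTE_PATTERNS = [
    (re.compile(r"^/factorial/\d+$"), "/factorial/{num}"),
]

def route_label(path):
    """Return the grouped route label for a request path."""
    for pattern, label in ROUTE_PATTERNS:
        if pattern.match(path):
            return label
    return path  # ungrouped paths pass through unchanged

print(route_label("/factorial/12345"))  # -> /factorial/{num}
print(route_label("/health"))           # -> /health
```

Without this, every distinct number would create a new time series or route value in the metrics backend, which is exactly the cardinality problem the configuration avoids.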
And you can see how, for example, the backend may invoke the worker multiple times. So we got distributed traces automatically. We can even see the node graph of all the requests: the relation of all the traces as a graph, how the frontend as a server (because we instrument both server-side and client-side spans) invokes the backend, the backend invokes different workers, and so on. I just want to add something here. If you look at the spans that Beyla produces, we produce these two spans for some of the server requests: "in queue" and "processing". For most people that's like, what are these two things, why are you tracking two times? Say you have a typical application server written in Go: you accept the request, and as soon as that happens, Go will launch a goroutine for it. But how long before this goroutine gets scheduled on a physical thread, which is an M in the world of Go, and how long before this physical thread actually gets CPU time? With traditional instrumentation, you instrument the handler of the server request, and what you see is the time the handler started running, not the time the runtime accepted the request coming in from the kernel. Well, with eBPF, because we're at a low level, we can actually track that: we can see when the request actually came in from the kernel, when the goroutine was launched, and when the handler finally got to run. So in a situation where you have a server which is overloaded and not able to serve requests, you'll get the actual request time, much closer to what the client sees on the other end, rather than the fake time which is what the application server would see normally. Okay, so that was the demo. Let's summarize: using eBPF, you can capture distributed traces, as Nikola explained, with some limitations.
The advantage is that it requires almost no effort from the developer or operator, in the sense that you don't need to reconfigure your service, you don't need to change the code, you don't need to redeploy: just drop it in and get whatever Beyla can get. Another conclusion is that combining this black-box tracing with language-level support is what allows Beyla to get those distributed traces. So if you like it and want to give it a try, Beyla is freely available to download and test. You can go to our GitHub page, and there you will see instructions and links to the documentation and the main open source page of Beyla. And on the GitHub page we have a link to our community Slack if you want to chat with us, and we're also soon going to start organizing a community call: once a month we'll have a call where you can just join in and chat, or yell at us, for whatever reason. But yeah, that's it. Thank you. Thanks a lot. Oh, so many questions. I'm running. You said that when you're tracing in Go, you are tracing the goroutines that are handling requests. But in Go you don't have IDs for these goroutines, and you don't have the relationship between them. And to make it worse, the Go runtime actually reuses goroutines for something completely different. So how do you do that without constantly tracking pretty much all the goroutines, all the time, in order to get your trace? Yeah, okay. So with eBPF you get superpowers. From a regular Go developer's perspective, you never actually have access to this information; for whatever reason, they won't give it to you. But with eBPF, I attach to the Go runtime, so the address in memory of the goroutine is my ID. Now I can tell when the goroutine starts and when it gets parked again; it can be reused for something else, and that's fine.
But at that time I clear out all the information, because I know the goroutine is done. Because, like, superpowers. Hey, thank you for your talk. I'm one of those guys that manages a lot of infrastructure as code in general. And when you say, hey, you just have to add this and it just works out of the box, it kind of scares me, because potentially it can cause problems. One of the issues that we saw with these kinds of solutions is that if you inject a tracing header into a request, the request might be changed. And some protocols sign the request, like AWS Signature Version 4, for example, and they don't really like you injecting headers in the middle of a request, especially at a lower level. If you have some kind of agent in the code itself, you can work around that by disabling tracing on specific endpoints. But if you do it at a lower level, you don't really have the visibility to disable that, or to recognize that you are making a request to such a backend. How do you envision working around those issues in the future? Because this is one example, but this will happen many, many times. Yeah, yeah, that's true. So if something is signing the request and not letting you change the header information, then disable that feature: don't use what we do right now for propagating via the headers, use the black-box approach. The black box is sort of the fallback. We've been toying with the idea that maybe in the future we'll let it work with an external storage of some kind, so we can get past the one-node restriction we have with the black box right now. But that's the very reason we designed it: in so many environments, injecting the header information is just not possible. If I'm dealing with an interpreted language and no compiled methods, no dice, I can't do anything for you there. Thanks. Good question. Thank you. Thank you.
What’s possible in observability when we have frame pointers
All right, so yeah, "what's possible in observability when we have frame pointers" is the talk. But let's start out with an actual use case of observability. So we have these workloads, we can graph the CPU cores, and we can see some things happening, and we might be wondering what's actually happening at these spikes. We can use profiling to figure out what happens at these individual spikes, just to understand: okay, in this scenario this was happening, at another time something else happened. We can get profiles manually and compare them, or we do something called continuous profiling, where we just profile all the time, with hopefully low overhead so we can even do it in production. And not just hopefully: it's a reality, we can do it in production. So we can store all of these profiles and, over time, ask questions in retrospect whenever we want, and we don't have to worry about missing data points. We have the security, or the ease of use, of just clicking on some spike and getting a flame graph, or in this case an icicle graph, because it's top-down and not the other way around (we call those icicle graphs). You can see all the stack traces and introspect very nicely what's happening. I don't have a slide for this, but we can also diff these flame graphs, and then we see in red where things got worse and in green, usually, where things got better. And it's pretty obvious most of the time: if you have such a big spike, that's exactly the point where we need to look in such a flame graph, where we need to check out what's happening in the code. So yeah, that's a pretty good use case for observability, right?
But yeah, what are frame pointers? Before we come to that, a quick introduction. I'm Matthias Loibl, I'm a senior software engineer at Polar Signals. I work on Parca, which is the open source project doing a bunch of these things, but I also work on Thanos, Prometheus and lots of other open source monitoring projects. Yeah, and hey everyone, I'm Jon Seager, I'm VP of Engineering at Canonical. I had a kind of interesting journey to open source, but at the moment I'm leading the development of Juju and a whole suite of enterprise apps which we call charms. So if you want to get access to, like, the best Postgres on your infrastructure, or the best MySQL, or the Grafana stack, or Parca, or you want to build an identity stack with Ory and with OpenFGA and products like that, that's the effort that I'm leading. The orchestrator is called Juju, it's been around a really long time, charms are all written in Python, and we're building out a big catalogue of operators that allow you to not just deploy those things but actually compose them all together and integrate them in a really common way, irrespective of whether your infrastructure happens to be bare metal or Kubernetes or VMs or on EC2 or on Azure, or some combination of the whole lot. So that's what I'm up to at the moment.
Awesome, yeah, and I'm looking forward to hearing more from you. But before we do that, let's talk about profiling again, or what profiling data is made up of. You can see these points in time, T1, T2, T3. At each point in time we basically look at the current stack trace, what the program state looks like. And we can see that at T1 we had A, B, C, D; at T2 we had A, B, C and E, so slightly different; and at T3 we had the same thing as T1 again. So, just for the sake of the example, one stack was seen twice, so maybe it was executing for 20 milliseconds in total and the other one for 10 milliseconds. We count how often we see these stacks, and from that we can make an assumption about how much each one is running. This is a sampling profiler: it only looks at these stack traces every so often, but over time we can really nicely see the big picture of what's happening. The good thing is that because it only samples so often, the overhead is pretty low, which, as I touched on earlier, is pretty nice for our use case of figuring out what's going on. So how do we get these stack traces? How can we see the stacks that we then get all the memory addresses for, so that we can nicely format them, using the function names, in the icicle graphs? The best case, and that's kind of the whole point of the talk, is frame pointers. Looking at this bit of C code, it's hopefully not too daunting in a monitoring and observability room: we have the main function at the bottom, and it calls a function, and so on, the functions call each other, and then at the very top it just goes into an endless loop.
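The sample-counting just described can be sketched in a few lines of Python. The stack contents and the 10-millisecond period per sample are invented to match the T1/T2/T3 example.

```python
from collections import Counter

# Stack samples taken at T1, T2, T3, as in the example in the talk:
samples = [
    ("main", "a", "b", "c", "d"),   # T1
    ("main", "a", "b", "c", "e"),   # T2
    ("main", "a", "b", "c", "d"),   # T3 (same stack as T1)
]
SAMPLE_PERIOD_MS = 10  # assumed wall time each sample represents

# A sampling profiler just counts identical stacks...
counts = Counter(samples)

# ...and attributes period * count to each unique stack, which is
# what a flame/icicle graph visualizes.
for stack, n in counts.items():
    print("->".join(stack), n * SAMPLE_PERIOD_MS, "ms")
```

The estimate gets more accurate the longer you sample, while the per-sample cost (and therefore the overhead) stays constant.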
The important part in all of this is looking at the assembly on the right-hand side. I've omitted the main function and the "a" one, but in "b" we can see that at the very beginning we are pushing and moving some registers around. Those are the instructions that push the frame pointer onto the stack before calling the next function; we push those registers so that once the next function is done executing, we can come back to exactly that previous function and continue executing. One thing I want to mention here: in the past there were a couple of discussions about the overhead of using frame pointers. We have the push and move instructions, and once the function is done it needs to pop that frame pointer, so there are a couple of extra assembly steps involved. Especially on 32-bit systems it wasn't great performance-wise, but unless you are a really, really special case, it should be fine for almost all workloads, even in production, and that's kind of the point of this talk. So, looking at our binary on the left-hand side: we set up the frame pointer first, that's the first thing our assembly executes, putting the frame pointer onto the stack, before going and doing the actual call to the next function. Before that call, we add the return address to our stack, so that once the function we are calling is done, we know where to continue in our current function: we need to know where the code we should execute after the call lives. That's why we have the return address. We then run the function preamble and the function itself, and eventually we return: the saved address tells us where to go back to, so the function that we called eventually returns, and we go back to the original function.
However, we then continue executing after that function call. Previously (can you see my mouse? no) we were over here, and now we've returned one step further on, because we don't want to call that function again and go into an endless loop; we want to continue afterwards. But we also want to know who called us. Basically, whenever we have a stack, we want to know which function called us, and we do that all the way down until we end up in the main function, so we know all the functions that are on the stack up to the point where we are now. That's walking the stack. And the really, really cool thing is, we can do this in eBPF. I don't know how many of you attended the previous talk; eBPF is kind of a hot topic right now. For us it's really cool, because what we can do is write a small program in a C dialect, get it through the verifier, compile it into eBPF code, and load that into the Linux kernel. The way it works is that we tell the Linux kernel to run this snippet of eBPF code every so often, and in it we do the same stack walking I told you about two slides ago. Essentially, we start in eBPF, we get the context and the current stack pointer, and we look at the leaf of the stack, the very top, the currently executing function. We can then read that instruction pointer and, from there, get the frame pointer. The special case here is that the instruction pointer has to be the return address minus one, because of the thing I just told you about: that's how we know where we were called from. We keep doing that until we get an instruction pointer that is zero.
then means basically we reach the end of the stack and we know we can terminate or we reach the end of that stack trace. In between for profiling you can see over here we do something with the stack with the frame and what we actually do is we kind of like just get the memory address of that executed function and we basically have an array of all the frames that were executed at the end and have the memory addresses and those memory addresses we can then use to get the function names for that function. So having frame pointers in ebpf makes regular profiling super easy and we can then do profiling super simple we don't have to worry about like special compiler configurations because we can just assume that frame pointers are here for us to then basically use them to figure out the entire stack of the currently executing function. There are ways to do exactly that without frame pointers and shout out I think it was in this very room one year ago there was a talk by Javier and by Charlie who were talking about stack unwinding without frame pointers using Dwarf I highly recommend it it's really really interesting but yeah something for another time and then obviously not only like the profiling use case but if we have frame pointers in the executables in those executing stacks we can also use all the other debugging tools right not only for profiling we can use the bcc tools bpf trace perf etc and they also have the kind of same benefits. 
So essentially what that means is that the possibilities become a lot broader, because unwinding is only these two memory reads per frame. For example, in bpftrace we can use the one-liner here to build a really simple but working profiler that uses ustack to unwind the user-space stack and count how often it sees each stack, and that's super cheap. The Go execution tracer actually traces everything that's happening, and because unwinding has so little overhead, we can also do things like that. And once we have continuous profiles, on the performance side we can do something called profile-guided optimization. Just making profiling this cheap is something where I think a lot of innovation is also going to happen in the future. As an outlook, there are some super new papers on context-sensitive, sample-based profile-guided optimization, something we are super excited about, because it will allow a lot more things to happen as well. But maybe another FOSDEM talk is going to happen about that in a year or two. So, bringing frame pointers to the masses: I'm super excited to have John talk. Hey, all right, so I'm here to tell you, now that we've seen all of the cool stuff you can do when you have frame pointers, how we at Canonical are going to make this available to all of you much more easily. If you didn't see this on our blog a couple of months ago: we have decided that from 24.04 LTS we are going to enable frame pointers for the entire Ubuntu archive on 64-bit platforms.
The caveat on 64-bit is because back in the day, 32-bit CPUs obviously had far fewer registers, and so sacrificing a register to hold the frame pointer came with a much higher performance overhead. In reality, these days with 64-bit you're looking at on average less than 1%, unless you're in a very specific group. So if you're doing turbo, pants-on-head HPC stuff, or high-frequency trading, or real-time things where that 1% could really, really matter, perhaps this isn't for you, and we can make exceptions in the archive for those packages. But in general, for 24.04 you can expect to see frame pointers for the entire archive, through main and universe, etc. This is pretty exciting because the LTS, I probably don't need to tell you, is going to be installed on many, many millions of machines, and then supported for at least 10 years by Canonical, so this is going to make a big impact for people who need these things. This stuff is often already enabled by the hyperscalers: people like Amazon, Netflix, and Microsoft are already doing this in production, and now you get it for free as well, just by using Ubuntu. So, I mentioned there will be some, you know, pretty much negligible, barely noticeable for nearly all use cases, performance impact. We're willing to wear that, because what it actually enables in the medium term is for us to do a lot of work on our distribution. We're in the process now of running benchmarks on a pre-frame-pointer Ubuntu and a post-frame-pointer Ubuntu, ready for the release, and that will hopefully help us identify any outliers. If we hit certain packages where we feel the performance hit is too much, then we will disable it for the first release, for 24.04, or we will try to work out what other optimizations we might make to that package to make it work better with frame pointers enabled.
So this will really, really help downstreams, I think, to gain the benefit of frame pointers and optimize their own workloads. If you are someone who just uses Ubuntu as a platform and you build your own code, let's say in Python or Go or Node.js or whatever, suddenly those big holes in your flame graphs are just going to disappear when you move to 24.04, without you having to do anything. This is really just the start, in which we want to make 24.04 a release really focused on performance engineering and performance itself. So what does that actually mean? Having the frame pointers is one thing, but you also need the tooling to actually utilize the frame pointers and inspect the stack. The folks at Polar Signals with Parca are one part of that, but we are also looking to include tools like bpftrace and sysstat and the perf tools by default in Ubuntu. Not in every single image: those of you that are about to scream at me because you use the minimal image, or you ship 100,000 container images a month and don't want to ship bpftrace in all of them, don't panic. We are essentially going to enable all of these tools by default anywhere we ship a kernel. So an Ubuntu server image, a full-size server image; that doesn't include LXD images, it doesn't include OCI images, but if you install Ubuntu on a server or in a VM, you will have bpftrace by default, you will have sysstat by default. Essentially, a huge majority of the tools that Brendan Gregg describes as crisis tools will be there by default. The reason that is super important is because if your system is in crisis, it doesn't matter whether the tools are in the archive. If your system is right on the edge, and then you hit it with a whole bunch of network I/O and disk I/O to go and get a package from the archives, that is potentially going to push the system over the edge.
It may not even work in production; the system may not have access to the package archives, and so you just need those tools to be there, and we are going to make sure that happens. For places where we don't ship a kernel, all of these tools will get wrapped up in a new meta package, so if you do want them in your LXD containers, in your container images, in your debug images, then you will be able to get them really, really easily with a single meta package. We are looking at what other compiler optimizations we can make across the archive as well, so this might look like rolling out GCC -O3 for a huge part of the archive. We are not going to do that in one big-bang go, because there are some trade-offs there. We are also looking at essentially not maintaining both a low-latency kernel and a generic kernel, and just shipping the low-latency configuration by default. None of these are firm, 100% definitely going to happen in 24.04; these are the goals we are working towards before the release in April. Finally, some of you may have seen we have been doing some work on how to get Ubuntu and the archive to take advantage of the newer instruction sets, AMD64-v3, AMD64-v4, and so on. We actually have a build of the entire archive that uses AMD64-v3; you can get it in a PPA and test it and benchmark it. It is faster, TL;DR, but we need to do a bunch of upstream work in apt to work out how we can essentially multiplex that, so that you still just go to ubuntu.com/download, download an AMD64 ISO, and it does the right thing, without you having a massive long list of different instruction sets to choose from for AMD64. That work is coming, but probably won't land for 24.04. We also continue to introduce new patches into things like GNOME; we are still trying to get the GNOME triple buffering work landed ready for 24.04, which gives a much smoother experience on the desktop as well.
This runs really from Ubuntu Server right up through to Ubuntu Desktop, and these tools will be available to desktop users too. You as a developer on Ubuntu should have access to the same debugging tools that you find in your production workloads, in our opinion. On a side note, we are trying to do this at a really big scale at Canonical. We are hiring practice leads that will sit in a central team to build processes and tools and essentially give advice across our 40 or so products, and we are also hiring dedicated performance engineers for every single team, whether that team is doing Go, Python, Node.js, C, or whatever. If you are interested in that, talk to me afterwards, or check out canonical.com/careers; there are a couple of Canonical folks in here as well who you can talk to. If performance is your thing and you want to come and make use of frame pointers and make Ubuntu blazing fast, then that is always an option for you. Finally, from my side, we have done a bit of work with Polar Signals, who have been helping us along this way. We have snap packages and charms available for Parca, both for the agent and the server. On any Ubuntu machine you can set this up in a cloud-init file with a single line: you can snap install the Parca agent, give it a single config with a token, and start continuously profiling out into Polar Signals cloud, or you can host the server infrastructure yourself, on machines, on Kubernetes, on containers, whatever it is, with Juju. We will continue to make improvements to that over time. It is a super easy way to get hold of this nice continuous profiling hotness in Ubuntu. That is it, get in touch. Thank you very much for that. Looking forward to the Ubuntu release. Are there any questions? Questions, anyone? Once, twice, nobody? Okay, then thanks again, and next up we have Quickwit, I think, in 20 minutes. Thank you, bye. Cheers.
Introducing Observability to an airline
Hello everybody. Can you hear me? Is it working? Awesome. This is my first talk, so I'm just going to do this: hello, FOSDEM! Yes! Great to hear some energy in here at this time. Right, so my name's James, and my major client, my only client currently, is a major European airline. Did I get that right? I wanted to talk to you today about some of the challenges we're facing in introducing observability to that client, a framework I've put together to overcome those challenges, and some thoughts I have overall about observability. This talk should be applicable to any big organization; there's not really anything specific to an airline. But if you think about the scale, not only the size but the number of different tasks an airline does, and the kind of vintage of most major airlines, you'll get an idea of what we're talking about here. By the way, just as an idea: who here works for a company with more than a thousand people in it? Okay, fair enough. And how many of those people are actually using observability at any scale? Okay, some of you, awesome. You should be doing this instead. In this talk, I want to walk you through three steps I'm taking to introduce observability, what I'm calling an observability transformation. We're not going to be talking about anything too technically exciting here, and we're certainly not talking about introducing observability to the cockpit or anything like that. This talk is about helping you get your company, or client, or whoever else, on board with observability. It's about making that transition successful and making it sustainable. And of course, the associated love and adoration of your peers for making their lives a whole hell of a lot easier. So, the first thing I want us to do is align on what observability is. That'll be easy.
Does anyone want to... I tell you what, we're running late, so I'll just tell you what observability is. Firstly, I think what we've got to remember when we're talking about observability is that a lot of people don't really know what to think of, but they're probably thinking of something like this: a big ten-foot view of everything that's going on. Obviously most people in here won't think that that's observability. Why not? Can anyone say? Is this an observable system by our definition? This is what I think of when I think of observability, and when I speak to anybody who may be lay-technical or non-technical, this is the kind of thing I'll introduce to them. I know I'm putting a definition on something, and that's a little bit controversial, but this is what I think of, so it will help ground what this talk is about. So, you can imagine, as we went through that previous slide, there's this cake being made. And I can describe quite easily that previously, with a monitoring process, we would monitor, get the metrics and the logs from, each individual component of that system. But now what we're going to do is follow the request for a cake through that system. And this has some clear value once we start talking about it this way. There's this other way of talking about it, that observability is how we understand a system's internal state from its outputs (I can never quite say it), but it doesn't really get across the value to people who may be a bit skeptical about this, and I think this framing does. So, let's just pocket that idea for a second. This idea basically describes observability as recording work done to satisfy a request.
So, a request is completely observable when you can see all the work done for that request, and a system is completely observable when you can see all the work done for all requests moving through it. This to me is much more tangible. It does tie it specifically to requests or events; however, I'll note that when we talk about making long-running processes observable, most people try, arbitrarily or otherwise, to find ways of cutting them up into individual traces anyway. So I think this is fairly close to how we're doing observability in practice. In my view, an observability transformation fits alongside the other transformations which, when done right, lead to much more productive organizations. With Agile, we moved from waterfall to more incremental development. With DevOps, DevSecOps, all of that, we moved from silos to more cross-functional teams. With cloud, like it or loathe it, we moved from buying things up front and hoping they were the right things to buying things on tap as and when we need them. With observability, we're really talking about rotating everything 90 degrees: instead of observing individual systems, we're going to observe requests as they go through them. This should also act as a warning. Who here has gone through an Agile transformation? Keep your hand up if you think it went really, really well. Yeah. I'm using this word very, very specifically, because this is another thing I want us to pocket as we go through: you do need to think of this as a transformation, and you need to think about the pitfalls of other types of transformations and how to overcome them if you want to introduce observability to your company, client, whoever. Okay. So, we're all aligned, hopefully, on what observability is. We know we want it, but we don't get to decide. So, we need to think about who we need to convince.
Although you could probably get away, especially in smaller or more agile companies, with just convincing a couple of people and going ahead, often with this sort of thing you're going to have to convince a lot of people. So this is me capturing three broad groups of stakeholders that you're going to want to convince if you want to bring people along with this observability transformation. And you want to get everyone on board, because if you only get, for example, the C-suite, the higher-ups if you like, on board, then engineers will just make your product, your transformation, fail so they can get back to their work, like with any other thing. And then management will just say, right, I've lost a load of productivity; we can solve this by getting rid of this observability thing. Similarly, if you get your engineers on board and they keep pushing towards it, you'll end up with them burnt out, because they're not being given the time and the resources they need to actually make it work. So it's worth thinking this through. Very quickly here, being wary of time (I could spend ages on this slide, by the way, because thinking about stakeholders is really, really interesting), I'm just going to pick out a few highlights. As an example, would anyone here describe themselves as an observability skeptic? I'd imagine, maybe... do you have any reasoning? No? That's fair enough. But it's worth noting that even in here, and I think there's lots of people outside, the thing I compare it to is transforming towards test-driven development. A lot of places will introduce test-driven development, and the way they'll do it, for example, their experience will be that some manager somewhere insists on 100% test coverage.
So, they've gone through that, they've had to do all these ridiculous things, jump through hoops to get this transformation to be complete, and then they come out at the end of it saying, well, test-driven development's crap, we're not doing this. They managed to get rid of it and dump it. So, you might think that of these three groups, the engineers would be the easiest to convince, but there are lots of people out there that have gone through three or four of these now and really need to be sold on whether this is going to help them. So, really, don't think they're going to be automatically on your side just because you're convinced. Also, I'll note all the disagreement we have, just in this one conference, about what the best tooling and the best approaches are anyway. Quickly, on management: management will want to be convinced that it's not going to hurt productivity. One example I'll give, when we're looking at the higher-ups like the C-suite: you're going to be asking them to spend money, because you can't just say, oh, we're going to do this; you want to actually resource a team. With my client, what we did was go through the outages over the last 12 months and do some estimates; we said they are estimates, and we caveated what the caveats are. We worked out how much time we think would have been saved on each of these outages if they'd had good instrumentation of their code and we could have identified the issues more quickly. They could go away and calculate that as a cost, which they could use to justify it. So, don't forget about your stakeholders. One thing you didn't hear in all of that is what tool to use. That's because, sorry, everyone that makes a tool, it largely doesn't matter at this point.
People want traces because they want less downtime, they want more clarity, they want to capture lost revenue or whatever else. And you can do that with pretty much any observability tool right now. So the one thing you don't want to do as part of convincing people is try to sell them on a specific tool. That can come later. In my engagement, we're focusing on Tempo. The reason we're doing that (I'll introduce some of the other reasons in a bit) is mainly that we already use Grafana, we already use Prometheus, and it slots right in, so we don't really have to discuss it much. There's another thing, which is that because Tempo is open source, we don't have to involve a new vendor and new commercials and stuff like that as part of selling this project. So, open source to the rescue there. But really, you want to get your project approved so you can go and start instrumenting code. The last thing I'll say on this is team topology. This is an example of the sort of team I'd expect to go and start an observability transformation. I prefer smaller, more agile teams. So, you might look at this and think, well, based on my business, I might need two or three of your software engineers, two or three of your operations engineers. That might be an anti-pattern; you can go and look up all the reasons why bigger teams tend to work more slowly. I'm not going to cover that now. So, I'm looking at a kind of crack team. Software engineers are going to get in and instrument the code. We've got an operations engineer that's going to make sure we clear the pathways to actually get those spans out into tracing databases. And finally, we've got somebody in a product-owner position that's going to protect that team and make sure they're not answering inane questions all the time.
They will also be working with the business and with the other product delivery teams and the platform team and whoever else, to make sure that concerns are raised, that they're heard, and that the team pivots when they realize they've made a mistake. So that's an important role as well. But remember, this is a transformation and we're trying to do new things; we're changing cultures here. So you need to be responsive to feedback and responsive to feelings. Otherwise, your engineers here are going to build the best system that never gets used, which is another pitfall of transformations. Okay. Those are my thoughts on convincing people to do an observability transformation. Now let's imagine you've got the thumbs up, and let's move towards implementation. The most important thing is to not get bogged down in the details of the infrastructure; you need to move to instrumentation. But you are going to need some sort of tracing database, some sort of tooling. If you have something already, for example if you're already using a provider of some sort and they have it, then great, consider that. However, one of the ways you can make these things move faster is by moving your tracing database to where the data you're collecting already is. Think about big, old companies (big and/or old companies). They get really nervous when you say, right, we're going to collect all this data and go put it in this cloud provider over here. That can take months to agree. So what you can do is short-circuit that: start that process, start discussing how you're going to do this, but at the same time put your tracing databases into, say, the accounts or the cloud provider that has actually already been agreed for this use.
There is a downside here, which some of you might be thinking: well, doesn't that mean, James, that you'd maybe have multiple tracing databases, which means you wouldn't have all your spans in the same place? That is true, but it means you can move on to instrumentation. It means you can get to the point where you have maybe two traces somebody has to look at, and then you can get other people in the business to say, hey, wouldn't it be useful if... and then you can start having the discussions. Don't try to boil the ocean on these things; we're being pragmatic here. As an example, if your client is in AWS, you can get Tempo up quickly: there's a good article on the Grafana website on deploying Tempo on Fargate, which means you can get that up nice and quickly. So again, that's an advantage of using these things. More importantly, you can deploy it, find out it's the wrong thing to do, and go do something else. That's the great thing about using these open source tools: you can really work it out as you're moving. With that in mind, get instrumenting. And know that to start with, the team I put together earlier is going to be doing a lot of the work themselves. Automatic instrumentation is your friend. Get your software engineers to go and find the code bases across the system, especially on your hot paths, and start raising PRs to auto-instrument them. You know how best to do this in your company: some companies want to start the conversation with a PR, some want to start with a meeting or something like that. But getting auto-instrumentation into these code bases means you will start being able to build up the shallow layer of these traces. Then, if any teams start becoming interested in this, opportunistically pair your software engineer with those teams. Pairing and mobbing are a great way to share knowledge.
Remember, a lot of these software engineers will not have done this kind of thing before, and doing it is kind of hard if you don't know how. You don't want them to get frustrated and throw in the towel and say, no, this is dumb, this is hard, this is not the way we used to do things. Whereas if you put your software engineer in with them, in a pairing or mobbing situation, they will have happy times and everything will be lovely. Also, make sure you point out the value when you see it. It's very easy for us to see these things and go, oh, it's great, and so obviously it's great. But this is new to people. So point out the 10% of their queries that have this weird choke point; point out all the advantages you're getting from this instrumentation and from all these spans as you're collecting them. When there's an issue, when there's downtime, get your team to see if they can race the people doing incident response to finding where the issue is, based on the tracing. Once these teams realize they can see through walls with this stuff, they'll soon start instrumenting their own code. But you need to get them to look. Another trap is getting bogged down on the problems that are harder to instrument. Airlines and banks and other places have a bad habit, and that bad habit is Fortran. Or Zidark or some mainframe thing or whatever. Has anyone here, just put your hand up, done any development in COBOL, Fortran, anything like that? Awesome, awesome. If you go and instrument something like that, please come and give a talk about it. That sounds awesome; that sounds like a talk a lot of people would come to. I'd be fascinated by it. But if you're doing this kind of project, now is not the time. Correct me if I'm wrong, anybody out there, but I don't think there's any instrumentation for Fortran code or anything like that.
Treat it as a third-party system. And also, don't try to instrument other people's code. I've seen this happen: people will go, right, okay, we've got this third party, and it's third-party code that we deploy; how are we going to instrument that? Do not. Instrument the stuff that is yours, and then accept that you're going to get to a point where it rolls over to logs and metrics. If the tool you're using allows you to connect logs and metrics up to your traces, that's really handy, because remember, in these big organizations you might never reach the golden sunlit uplands of traces for everything, so you're always going to have to go back. You can think of it sort of like fast-travelling through the infrastructure: you're not necessarily going to get to the exact point, inside the Oracle database you're really trying to kill, where the problem actually is. But you will be able to fast-travel to the bit in the code that makes a query to that Oracle database, and then you'll know which logs and metrics to look at. So the goal really is wide coverage, especially of hot paths. And that brings us to another thing, which is culture change. So, you've been working on this for maybe six months or so; it's a fairly short project. You've got traces, you've got end-to-end on many of the request paths through the systems, and people kind of get observability now. So those three people should go and recruit a few others and build an observability engineering team, right? I would say that for most organizations, that's the wrong way to do things. There are companies for which standing up a separate observability engineering team does make sense. But for most places, you're really going to be looking at creating this kind of... this is one of my favorite slides ever, which is weird. I have weird favorite things.
But this talks about a DevOps transformation where what you do is create a DevOps team, and the best DevOps team disappears after six or twelve months, because what it's done is create this culture where dev and ops come together. And this is a valid way of doing things for observability as well. Ultimately, you may have an enablement team; however, instrumentation should be done by devs as part of their day-to-day work. The tooling needs to... oh, five minutes. Okay. Enablement should be sharing best practices and doing training and such. The tooling really needs to be absorbed into an existing platform team. And this is the really cool part: now, if you think about it, you've got to the point where you've got all this instrumentation in your code, and you can start thinking about what kind of tooling makes sense for your organization, whereas when you started, that was very hard to do. That wasn't five minutes. Okay, I'll stop. So, yeah, if you've done your job well, hopefully these people won't need you anymore, and you can absorb back into teams and call that project complete. You might keep some kind of enablement team. But as I said, that wasn't meant to have a question mark at the end of it: go and effect change. I'm going to end it there. There wasn't much time; I've got so much more I want to talk about on this subject, so I might do a follow-up. If you have any questions, I'd be happy to answer them, and you can find where to find me at that website. Thanks, James. We still have five minutes for Q&A. Some questions here? Okay, there is one. I answered almost everything. Hi. How long has it taken for you to convince a big org, an old company, to move from no observability to some sort of observability? Completely convinced? I'll let you know. So, I joined back in May with this client and was helping them with a previous project that was getting wrapped up.
So I'd say it's been eight or nine months, working on other things and identifying this as a need as it's been working through. Yeah, it can take time, because, as I said, go back to that stakeholder slide (I could have spent a whole 20 minutes just on that), you've got to get everybody aligned. I've done lots of meetings, I've shown things off to people, I've shown off all these slides and stuff, gotten everybody on board. And I'd say that by the time everyone's actually in lockstep... probably about now, actually. I should say, though, that we didn't just not do anything until that point; there have been lots of opportunities to seed things as we've been doing other work as well. So, yeah. All right. More questions? No? Then thanks, James. Thank you. Thank you.
Debugging HTTP/3 upload speed in Firefox
Okay, I think we can roll it. And we are moving now to debugging HTTP 3 upload speed in Firefox and I'm more than happy to welcome Manuel Buschard for it. Hello. I'm Manuel Buschard. I'm Manuel. I'm working at Mozilla in the networking team called Necro and we work on Firefox networking. And in this talk I'm going about our debugging of HTTP 3 and HTTP 2 upload speed. And for this I'm going to give you some background information first. Then I'll cover the HTTP 2 upload speed problem that we investigated last year. And afterwards I'll go over to the HTTP 3 upload speed problem that we investigated afterwards. So yeah, first to the Necro team. We in general focus on security, privacy, but always also on performance. And our protocols that we work on is mostly HTTP but also DNS, web socket, web transport. And we also own the caching and the proxy feature. So this is what we generally work on. And when we think about networking performance, we usually think about it in terms of how long does it take from clicking a link to seeing the result. And for this we usually just need download speed. For other use cases like uploading large files like videos, we generally also need the upload speed. And in this talk I'm going about the HTTP 2 and HTTP 3 upload speed. Those protocols are more in focus. They are relatively newer than the HTTP 1. They got introduced in the past decade. And yeah, so for HTTP 2 upload, first what's the difference in HTTP 2 to HTTP 1. In HTTP 2 we allow to make multiple HTTP requests via one TCP socket. And this TCP socket is handled by the operating system. And real quick, the bug in our HTTP code that caused the slow upload was that we configured the socket to have a fixed size buffer of 110 to 80 kilobytes. And this fixed size buffer became a bottleneck in high bandwidth situations. And yeah, for the fix we just needed to adjust this TCP socket to not set the fixed size buffer and let the operating system handle the buffer size. 
And this shows that the operating system is responsible for the upload speed or the performance of upload. And this is a stark difference to HTTP 3 upload. And with this fix of just not setting the fixed size buffer, we can take a look at Chrome upload speed, Firefox before the fix, in red and in yellow Firefox after the fix. And we see that in certain configurations like high bandwidth and also from low to higher round trip times we have upload speed improvements up to like four times the speed. So we only have to wait a fourth of the time. And we are on par here with Chrome, which is using all the bandwidth you can use for the upload. So with this fixed last year, we took a more in depth look at upload speed in general. And we also had bug reports about slow HTTP 3 upload and with HTTP 2 seeing very good results, we made it a high priority for us as well and took a look there. So for the fix or seeing how much it changed, we introduced some high level telemetry. And these are person tiles of user reported upload speed. We have different versions, 114 on the left side is around one year ago. And in 115 to 16, we rolled out the HTTP 2 upload speed fix and we can see the improvements in the high level telemetry about upload speed. It's an improvement like in the higher parts, it's roughly doubled and not quite, but it's very visible. So now two difference to HTTP 3 upload. HTTP 3 upload is widely different. HTTP 3 uses a different transport layer. We don't use TCP anymore, but Qwik was standardised alongside with HTTP 3 and just relatively recently. So the standardisation was finalised in 2021, which is two to three years from now, right now. And Firefox also included HTTP 3 upload around the time in 2021. The work started in 2019, which is all relatively recent in comparison to TCP and HTTP 2. HTTP 2 is around a decade now, all. And the problem is different here because the operating system is responsible for the TCP stack. 
It is responsible for sending all the data performant and in Qwik we have to implement the same congestion control in Firefox and the Firefox application, so it's not the responsibility shifted to Firefox or the application. TCP is already decades old, it was done about 50 years ago and it's operating since 30 years and got a lot of eyes on it. And our Firefox implementation is really new and we were kind of the first ones to look into upload speed performance here, so we had a lot of low hanging fruits here to work on. And I wanted to visualise this a bit, like we have HV 2 and HV 3, which are very similar. In HV 3 we rely on Qwik and Qwik is also implemented by us. In HV 2 we have TCP and TCP is provided by the operating system. So I want to go into a few findings that we had in our presentation, in our IO graphs and other tooling that we took. The most useful tool for us was IO graphs where we just printed within the application, like with logging, when we send packets, when we receive packets, how big our congestion is and everything. So the first problem we have, what does this graph show? So this graph is our congestion window over time. What is the congestion window? So the congestion, well, I would like to go over this. We don't want to overload the network. And overloading the network is called like congestion control, well, not overloading the network. And this is the responsible of the transport layer, which is TCP or Qwik. And all the bugs that we had were in this congestion control, or most of the bugs. And the congestion window that we have is the estimate of how fast we can upload right now. And this changes over time. With every packet that we receive, we think we can upload more. So we have a graph like here where we steadily increase the congestion window over time with all the packets that we receive. And when we detect that packets got lost, we are assuming that the network is overloaded and we reduce the congestion window by half. 
And this is like one of the early graphs that we had. And orange are like the bytes in flight that we have. They circulate from top to bottom. Increasing again, blue is the congestion window. And what we see at the drop points is that the congestion window doesn't half. We would expect it to half during a congestion event. Instead it drops almost to zero. And this was one of the bugs that we had. We just dropped to zero. Each packet that we detected was lost, half the congestion window. And normally you would only do this once, but we did it multiple times for each packet. So essentially we dropped to almost zero on all congestion events here. This was one of the first fix. Later... Yeah. Later. This is the same graph. With the congestion window problem fixed, we had to investigate further. There were more problems. Here, like all these drops of packets going down, we want to stay with our bytes in flight as high as possible, with our upload speed as high as possible. But we dropped down quite some times. And if we... Yeah. For this problem, we need a bit of background information. And this background information was this slide, which I apparently put a bit later. And I'll go back to the background about Cric first before going over the next problem. So Cric got introduced. Sorry about the mix up here. Cric, the new transport layer protocol. What is Cric? Cric is on the same layer as TCP, but conceptually you can have multiple TCP connections at once over one quick connection. And we have other benefits, like TLS being integrated, so that the connection setup phase takes less time, only one round trip time instead of two round trip times. Yeah. And now we get back to the introduction of the concept of congestion control. Traction control is for us handles like not overloading the network from all participants in the internet. So everyone makes sure that we don't overload the network and keep it usable for everyone. 
And the congestion window is one of the concepts that we looked at the first graph and also in the second graph. This is our estimation at how much can we upload at a time. What is our upload speed to the destination server? And so our estimation depends on us receiving packets. And we want to increase the congestion window only if we are sure that we are using the congestion window. Like we are sending as much data as we have in the congestion window because otherwise we are not sure if our estimate is correct if we are sending less data than what is that we estimate we could. And this detection on whether a packet was sent during the utilization of the congestion window like sending as much data as we could. This had a bug as well and made us mark packets as not utilizing the congestion window for 50 to 75 percent of the packets which meant that we didn't increase the congestion window as fast as we could. This is another simple incremental fix for our HPE 3 upload speed problem. And after fixing this the graph looks like this. It has a steeper curve, steeper line. Here we also see that the first problem that we had got fixed. We don't drop to zero with the congestion window but have it halfway here. With these steady increases we can also see them in our high level telemetry that we introduced for our HPE 2 upload speed problem. In HPE 3 in the higher network bandwidth we have already an increase of three times. We are three times as fast as before tackling the problem from around 31 megabits per second to 93 megabits per sentence. This is the 95 percentiles. This is a network speed of better than 95 of all clients. Also visible from the high level telemetry. For the current state we are still working on this. We have more bugs that we are aware of and are also in contact with or in collaboration with contributors who can upload or request logs from them to have a look at their network condition. 
This is the diagram from before but from the contributors log where we can identify which problems are present from our machines in comparison to their network location. With the logging mechanism which we also included in Firefox this became a bit easier about logging. A few of the further works that we are currently still aware of is that the upload has a few CPU bottlenecks. Mostly profiling. The QuickStack made us aware that not the cryptography part of Quick is taking most of the time but some other parts which is unexpected. We have already identified a few code tests that can be improved and are improving these. We will also continue with profiling this. We also have similar to the HDP case we have a fixed size buffer. This will get to be a problem at some point at much higher bandwidth than with HP2 upload speed problem because we have a buffer that is 8 times as high, 1 megabyte instead of 128 kilobyte. We are also aware of the problem that when we are in package reordering networks we detect these package reordering as losses too frequently. There are ways around this in TCP specifications like REC or Forward Egg that we are taking a look at and investigating which one we want to implement and which proves to be the best of the options. We are also setting up CIS to catch regressions in the future and also have a detailed view from different networking conditions, how they look. We have seen where we got improvements in HDP3 already, it is now at a similar level to the HDP2 upload. It is looking already a lot better but we are still on it and we are aware of a few bugs and we will investigate further. We want to make it as good as we can to see all the benefits that HDP3 can provide for us. A lot of this was in cooperation with contributors reporting bugs. One specific bug is the HDP3 upload speed bug. If you want to take a look at our work there you can follow the investigations there. 
You can reach us at the Metrics channel if you want to get in contact with the NECO team. We have a NECO specific documentation also about creating logs. If you are interested in the NECO team we are making ourselves a bit more transparent by providing our meeting nodes and having a blog. If you need help with fixing bugs or want to get in contact like contributing, we also are going to provide office hours where you can talk to us directly and get in touch. Thanks for listening. Thank you. We might have time for one or two questions. Hi, thanks for talking. I just wanted to ask if there is any chance of Quick being brought into the Linux kernel or Windows kernel or wherever else Firefox runs. The question is whether Quick is going to be implemented in the operating system with the Shokut APIs. I am expecting that it will be implemented at some point. We do have, I have seen some TLS integration. This is one of the stoppers probably that TLS has to be integrated into the kernel as well. Quick is so new that it didn't have time to be integrated into the operating system. I think as soon as operating systems provide APIs we will start using them. They are not here right now but in the future I would assume yes. Two years of being standardized is like nothing. TCP is like for 30 years already. Last question, I see a lot of people coming in and for sure Manu will be available outside, no? Yes. Making promises is your name. My question is just which congestion control did you implement in Firefox? Yes. We are using Qubic by default. We have also implemented new Reno and we are looking also at BBR because this is also exciting for our lack. It's better for lower latencies. We didn't have a plan to implement it right now but it's like in the future we probably will tackle that too. Thank you so much Manu. I didn't count some of them but I took photos and we can count at different rates. I saw the sticker you have on your water bottle. Yalazila. Yalazila. Yalazila. 
Yalazila. Yalazila. Yalazila. Yalazila. And I think also this is a no? Yes. No. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes, yes. Yes. This is all about the politics. I'd say about the politics. Call me to be disturbed. I'd like to work with you. Yeah, except... I used to... my mother told me how to... You know, like when they had to mess up with this. Yeah, I mean, it's funny when we're not having a discussion. Do you think there's a few... ...sci-fi sites? No. Yeah. I don't know about that. Sorry, it was on. Yeah. Yeah. Yes. You're stuck with me for two minutes or so. Carmen needs another adapter and we are looking over it. It's coming. We're fixing everything. Is anyone that already got the t-shirt from the booth or the collapsible mug? Oh, good. You're saving the world. I'm there for it. I also like it. Someone else is doing steps today. This is how you stay in shape. You moderate the dev room, you run. All good. We are first. Also a big thank you to Konstantina. She organized the booth, by the way. Konstantina and Mozilla, if you want to... Yes. Yes. Do you want to? She brought more stickers here, especially this stock is related to MDN. We have the sticker here, if you want. And we have the cute llama. I heard it. Gold. It's here, waiting. One is mine. And if you want to learn more about llama, the project, not the animal, we have a guide here with all things related to Mozilla and AI. Grab the... And the first one is from the first talk about support. Grab these papers and you can have more information. Are we ready? Let me go then. Without further ado, ladies and gentlemen, Chris will introduce us to the MDN curriculum. Yay! Hi. APPLAUSE So, hello everyone. Nice to see you all. Thanks for coming. My name is Chris Mills and I'm going to take you through a new MDN project soon to be released called the MDN Curriculum. Take you through a little bit about who I am to begin with. 
I describe myself as a death metal hippie. I love documentation and I love the open web and I love tinkering with open standards. I used to work for Mozilla. For quite a while I was the content lead and team manager for MDN. But I left and did some other stuff and now I've come back as a contractor and this is the current thing that I've been working on with the MDN team. Another thing to add is that I'm a heavy metal drummer so if you want to ask me a question later on please speak slowly. A little bit about this talk. We are going to talk about, first of all, some of the problems that myself and I was perceived with Frontend Development in 2024, particularly in terms of education and the skills that new web developers are bringing to the table when they come and get jobs. I'm going to take you through the thoughts of how a curriculum, a new curriculum could solve some of these problems and some of the research that we did to try and prove out some of our theories about this. I'll then talk to you a little bit about the actual curriculum that we came up with and kind of its structure, its approach, some of its goals and then I'll talk to you about possible next steps, some of the things that we can then go on to do with this curriculum as a basis. Now, first of all, I'm going to talk to us about, talk to you about something that we're very good at in open source communities, problems and complaining. Yay, Mr. Brexits, back in the UK government, I'm so pleased. No, not those kind of problems. Really, we're talking about problems with Frontend Development, kind of what skills are new web developers missing when they come into the industry? What's the effect of web, what's the state of web education, what kind of effects are these problems having on, you know, the web in general and the quality of sites that we build? 
One thing that I've talked to quite a lot of hiring managers about, and this will also be mentioned in the research that I'll talk about later, is just lack of general core principles of new developers coming into the industry. I mean, a short anecdote that I'll share with you is a couple of years ago, a friend of mine called me into his company, he worked for a large agency at the time and he wanted me to talk to all of his Frontend teams and he wanted me to talk to all of his Frontend teams about accessibility. Really basic accessibility, you know, just kind of use headings and paragraphs and use alt text, that kind of stuff. And I went in there and did a 20 minute talk and I was thinking, do I really need to talk to these folks about this? And it was like a revelation to them. They were all like, whoa, so that's why you have to do this stuff? I was just blown away. I was like, I thought we'd kind of largely won this battle and moved on. It kind of blew my mind about how little they knew about this stuff. And I kind of feel that with a lot of the new developers community industry, you know, they're not really learning core languages and old school standards as much as just kind of, well, I want to get a job so I'm going to learn React and I'm not going to turn this into a massive winch, but you know, that kind of results in not knowing these core principles and best practices quite as well as perhaps they could. The next thing to talk about is lack of core language skills. This is another thing that hiring managers have talked to me about a lot. So people learn React and other frameworks, but they don't maybe take the time to learn the core JavaScript language as much as they could. So, you know, they can build websites that work great and have a good look in UIs, but maybe their problem-solving skills aren't quite as good as they could be when they suddenly need to get brought onto a problem that requires not writing some code inside a framework. 
And also, we kind of worry that maybe this is not so good for people's long-term employability, because, you know, if they just learn React, what then happens if all of a sudden the company goes, well, now we're going to do all of this stuff in a different framework, or, you know, another framework suddenly becomes really popular and every employer wants to use it on their projects. This is probably the biggest one that I've heard from employers is just general lack of soft skills from UIs. So, and I know, you know, you could make the argument that this kind of stuff comes of experience, but it really would be great to try and promote that learners spend more time thinking about skills such as research and kind of basic critical thinking and problem-solving and also working on having this constant learning mindset that you kind of need to have to succeed in this industry, because things are just always changing all the time. So, who's to blame for any of these problems? Well, not really anybody, I would say. I'm not going to point the finger at anyone in particular, because, you know, you've got all of this ideological thinking that says everything should be accessible all the time, and this should happen, and then this should happen. But actually, people just want to get a job, so it's no wonder that people go, well, all of these job adverts are saying, I need to know this framework, so I'm just going to take the quickest path I can to get employment and be able to pay my mortgage and buy food. 
Coding boot camps that I've reviewed largely tend to focus on this kind of stuff, you know, and again, I'm not blaming them, I'm not saying it's a terrible thing, but they tend to be, the attitude tends to be, you know, we will take you from nothing to getting your first job in three months or six months or whatever, and that's a perfectly reasonable way to frame what you're offering to people, but there is the problem that maybe the best practices and the background skills aren't maybe being as taught as well as they could be, and of course, courses become out of date very quickly. Particularly, this tends to be a problem with university courses that I've come across. I know a lot of lecturers that really struggle to kind of keep up with all of the stuff that they've got to do, which isn't just learning about technology, they struggle to put the time in to keep their skill set current with all of the stuff that's going on in the industry. So, I think that's a good point. I think that's a good point with all of the stuff that's going on in the industry. And then, I'm also going to just say a few things about interview processes, and again, this definitely isn't the fault of the actual learners trying to come into the industry. But because people don't tend to have a consistent set of skills, a lot of interview processes tend to kind of be, well, we're looking for this kind of unicorn that knows these ten things really well, that are all really complicated. And all of the people that we're talking to have kind of got about four of these things definitely shown up on their CV. So, we've got to do a whole bunch of whiteboard interviews and coding interviews and huge, long, convoluted interview processes to check whether this person can do this job that we're trying to hire for. 
Another interesting thing to make mention of AI, which has already been talked about today, is it fascinated me that in the last maybe six months to a year or so, I've started to hear multiple hiring managers talk about the fact that oh, we had to put a load of extra processes in and the interviews have become even more complicated now because a lot of our candidates are trying to cheat using AI. I've literally heard about people having chat GPT open in another window whilst they're doing an interview and just typing all of the interview questions into it and then parroting back the answers to the interviewers. And it's like, that's a bit nightmarish and it's difficult to really think about what to do about that. But I kind of think, well, if these people were maybe more confident in their skill sets in the first place, maybe they wouldn't have to think to rely on that quite as much. Another interesting thing is that something that we're sort of looking to do with some kind of curriculum would maybe to have some kind of industry standard benchmark certification, eventually. This is kind of pie in the sky, often the future. But maybe this certification could kind of say, you know, anybody that's got this certification, it's a trusted certification, you know, in the same way that industries such as law or architecture have trusted bodies who have these certifications that everybody gets to prove that they know what they're talking about. But we don't really have that for our industry. And employers don't really trust some random certificate from some, you know, whatever boot camp, you know, I'm not saying those boot camps are bad or not trustworthy, but employers just have a hard time trusting them. And as makes perfect sense, they value demonstrable experience and portfolios a lot more.
Firefox: Good things come in .deb packages
Hello. Hi, everyone. My name is Gabriel. I'm a senior release engineer in Mozilla, and I work on shipping Firefox on several different platforms. And today I'm here to talk to you guys about the new debt package that we are shipping to our Mozilla app repository. So first I'm going to talk a little bit about the journey from Nile to stable builds, and then I'm going to elaborate on some of the reasons why we thought a native package might be useful for people on debt-based distributions. Okay. So early last year we started talking about setting up an app repository for Mozilla product builds to help us offer better support on Linux and stuff like that. It's really challenging to support distribution builds for us because they're built with different compiler, compiler versions. We can lead to some issues. Yeah, so first around October we started shipping a Nile package. And it was mostly for Nile community. This offered them some benefits like they didn't have to create a desktop file. It also made it easier to update the binary. We have some data that actually suggests that we keep people more up to date on these debt packages. I think probably because people update the whole system components in the app store or... Yeah, I wonder if they did other stuff. Yeah, so I made a blog post about that. We got a lot of feedback from the community about developer addition debt package. So we shipped that. And now as a Firefox 122 we're shipping stable builds to the repository. So we want you to be able to use Firefox how you want. And we know browsers are complex applications that support many different use cases in people's lives. So we wanted to offer a native package in addition to SNAP, some flat packs. So this package, it's built in Mozilla infrastructure from the Firefox source code without any modification. And the builds are supported by Mozilla directly. 
Another good thing about the package is that we spend a lot of resources in optimizing the builds using PGO and things like that. And we wanted people to be able to get those benefits without having to install our tar balls but rather getting packages from this repository. I like this one. The updates are faster in case of chem spills. So the new app repository is plugged in directly to the Firefox release automation. So when we ship Firefox we upload the build directly to this repository as soon as it's available. Which is nice in case of security patches and things like that. And here's a slide about how to install it. So you can visit that. The website right there and follow instructions, as I said. It's easy. It's just about adding the Mozilla app repository and installing the package. The package is not perfect, surprise. So if you have feedback, if you actually try it out, you can join our matrix channel and let us know if the package is working for you, if you're having issues. And Mozilla will offer support. Thank you. Thank you. Thank you. Thank you. Yes. Can you provide arm 64 builds? Not yet. Can you reply the question? Yeah. So here's if we offer arm 64 builds. Not yet. We've been talking about it. Yeah. Working on it. How are the bindries constructed for these packages? Do you use the native devian tooling to build it? Or do you basically repackage the same binders you would put in a snap file? It's the same binary that we've been shipping as a tarble forever. And yeah, we use the native devian tooling to repackage that into a dev package. And that's what we put in the app repository. I saw a hand over here. I lost the person. Someone had a question. I don't have any questions. Oh, OK. You're tricking me now. Yes, I see. Coming. For those new in the room, this is a challenge for me to have 10K today. So don't hesitate to ask questions. Hello. 
Do you envision the dev package being like a stepping stone towards the future where flat packs and snaps work well? Or it will be like a permanent offering going forward? I envision it as a permanent offering, just like an additional option for people that like the packages. Yeah, makes sense. Why did you specifically choose dev as a packaging format? I didn't quite understand that. Over like something like flat pack and have a custom repository for that. Yeah, we already have a flat pack. So this is just a different option for people that want to use that packages. There's already a Mozilla flat pack repository? Yep. OK, I didn't know that. Good to know. Yes, go ahead. Sorry. The microphone is coming. So that we mentioned dev and flat pack, so the next question is what about other packaging formats like OS native, for example, RPM or any other, like, I don't know, like something for ARCH or anything else? Yeah. I mean, that would be cool. There's definitely been conversations over RPM. Yeah, we're thinking about it. My question will be if you're supporting dev now and all these packages, is it not a burden for you to continue supporting more? Don't you have a plan to focus on something more straightforward? It is true that we're taking on more work by supporting these packages, but I think we wanted to offer that support to the Linux community. That's, yes. Thank you. Are you going to be working with projects like Debian to promote the Mozilla repositories for their, like, stable user bases? Yeah, we, the Debian package maintainer is a Mozilla employee, he helped us out with this. Yeah, it's just a different package, it's a different set of trade-off, right? There's a lot of different guidelines when you build the package for the distribution and different infrastructure limitations and things like that. So it's more like an alternative package for people that find it useful. Thank you.
ipt_geofence: Protecting Networks using GeoFencing, Blocklists and Service Analysis
Today I'm going to talk about an open source project designed for protecting the network. In particular using some concepts such as geofencing, block list and analysis of service networks. I will tell you a little bit the idea behind this project. So in a sense this work started mainly when the war in Ukraine started and we have seen a really spiking in attacks towards servers. And so we wanted to do something that was easier to maintain than what we had in the past. So we said let's do something. And this is a work that we have started to develop with some students at the university. And we said to ourselves we need to handle a little bit IP firewalling a little bit better because you know a firewall is great if it is static. So in the sense that you put some rules inside the firewall and they stay in the firewall for a while. It's not continuous, add or remove, add or remove. This is not a nice thing. And a typical example is geofencing. Geofencing is the ability to block traffic from specific countries. We know that it is not a solution. Definitely it is not a solution. But for specific protocols it makes sense. Let's make an example. Suppose that you have a VPN server that it is used by your company to simply to connect you to your corporate site. So what do you want to leave it open for everybody? You can claim that today you are in Belgium maybe tomorrow or after tomorrow you will be back to your home country. So probably it is not a good idea. But in general some sort of geofencing is necessary. If you manage a service on the internet you will see that most of the logs inside your Pashi.log or Postgres or whatever are generated by people that are not really interested in what you do but are interested in breaking you or trying to find a way to break your system to go somewhere else. So it makes sense for us at least to circumvent. When I say geofencing it doesn't mean a specific country. It doesn't mean Belgium for instance. 
We might say Europe, for sure, but not South America. Another problem is encryption. Encryption is a very nice thing to have; we are not going to remove it, and it should be present everywhere. But if you look at this from the network side, and we are in the network room, you will see that encryption creates a problem, and not just for the analysis of the traffic per se. You can look at the initial handshake, at the way people are contacting you, at whether the fingerprint of this client is a good fingerprint or one used by a previous attacker. These things are very nice to have, but they are not enough, because after the handshake you don't know what this client is asking you to do. So this is another problem, and this has been the motivation. The idea was to create a single open source tool, because today you can use many tools to do all of this (I will show that in this presentation), but the problem is precisely that you need many of them. We wanted something simplified, in one place, and also something that enables the next step, which I will show in a second. So, geofencing. This is a typical example of how you can block a country on Linux with a simple script. You have a country you want to ban, and you download, from a well-known place, the list of networks that belong to that country. You have to make sure this list stays updated, so you refresh it from time to time, because networks are sometimes sold to, or acquired by, companies belonging to one country or another. In essence you download the file and then load its contents into the firewall, and you have to repeat this every week or so. And if you have several countries, country one, country two, whatever, you lose track of which rule belongs to which country.
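The kind of ad-hoc script described above can be sketched in a few lines. This is a minimal illustration, not the actual tool: the country code and networks are made up, and a real deployment would use ipset rather than one iptables rule per network. Tagging each rule with a comment is one way to keep the per-country bookkeeping the speaker says gets lost.

```python
# Sketch of per-country blocking as described above. Assumptions:
# the country code and CIDRs are invented; a real setup would feed
# an ipset instead of emitting one iptables rule per network.

def build_block_commands(country, networks):
    """Turn a downloaded list of CIDRs into iptables commands.

    Tagging each rule with a comment records which country a rule
    belongs to, which is exactly what the ad-hoc scripts lose.
    """
    cmds = []
    for net in networks:
        cmds.append(
            f"iptables -A INPUT -s {net} "
            f"-m comment --comment geofence-{country} -j DROP"
        )
    return cmds

# Example: two documentation networks standing in for a country list.
commands = build_block_commands("XX", ["198.51.100.0/24", "203.0.113.0/24"])
```

Because every rule carries a `geofence-XX` comment, removing or refreshing one country later is a matter of filtering on that tag instead of flushing everything and starting over.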
So you have to remove everything and start over. That is how I came to understand that configuring this is complicated, even though solutions exist. Again, geofencing is not the way to solve cybersecurity problems, but it is a nice way to mitigate them, at least for specific protocols. So this was one of the motivations. The second thing, as I told you before, is the so-called blocklist. A blocklist means: I want to block specific people based on something, okay? It is usually based on a blacklist, a set of IP addresses that, as I said, have been pre-labeled (to use the machine learning word that is very popular today), meaning hosts that, for some reason, some security people have put in a list because they did something nasty to other people in the previous days or week. Usually these lists are created by putting honeypots on the internet: when you see people trying to break into those honeypots, those hosts get labeled as bad guys. The nice thing is that there are several blacklists available on the internet, but this is not always good news, because some of those blacklists are run by volunteers: some of them are good, some of them are not maintained. In particular, think of the cheap VPS services, the ones you can buy for five euros a month: their addresses are constantly changing hands. Today this IP is bad, tomorrow it is not bad anymore, and the reverse also happens. Reputation is something dynamic, so you need to find blacklists that make sense. For instance, one of the blacklists we are using these days is the Stratosphere IPS blacklist, which is a very good one. But unfortunately, on some days they end up blacklisting, for instance, Google, or GitHub, or 8.8.8.8 and 9.9.9.9, the public DNS servers.
And if you don't have some regional knowledge, something about the place where you live (meaning not the town, but the network operators, your neighbors), then, to put it in other terms: if you take a blacklist that comes from the US, it will be maybe 70% effective for you, but it will not protect you from the rest of the problem. So we need blocklists that are created here. Also, some of the blacklists (please read the paper) use very large CIDRs, and I don't believe that everybody is bad. Maybe in this room there is somebody who is bad, but I don't think 99% of the people are. Yet they put a /2 or /3 in the list, and this whole room and everybody in it is declared bad. All this is to say that blacklists are good if you use high-quality ones; just grabbing whatever you find on the internet, the longer the better, is the wrong way of doing it. We also have the problem of attacks on services. The problem with services is that most of the time the traffic is encrypted, and if it is not encrypted, very often it is becoming encrypted, which in itself, again, is good news. So the blacklist is a way to prevent known nasty people from contacting you, but the rest of the crowd, which should be the majority, can still create problems for you. This is what I wanted to say: blacklists are not the solution. They are a nice measure to put in place, but they are not enough. If you look at server logs (this is an example from a log server I checked when I prepared these slides two days ago; it is full of entries like this), most of the log lines are attempts like these: authentication error, authentication error, too many attempts. This is email and this is web; if you look at WordPress it is even worse, people trying to break in. So, as you will see, we tried to put a single config file for everything: you put everything in one place.
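The "very large CIDR" problem above lends itself to a simple sanity filter: before loading a third-party blacklist, drop any entry whose prefix is implausibly broad. A minimal stdlib sketch (the /8 threshold is an arbitrary choice for illustration):

```python
import ipaddress

def drop_broad_prefixes(blocklist, min_prefix=8):
    """Keep only entries whose prefix length is at least /min_prefix.

    A /2 or /3 entry bans a huge fraction of the internet, so a
    sanity filter like this guards against low-quality lists.
    """
    kept = []
    for entry in blocklist:
        net = ipaddress.ip_network(entry, strict=False)
        if net.prefixlen >= min_prefix:
            kept.append(str(net))
    return kept

# The /2 entry covers a quarter of IPv4 space and gets discarded.
filtered = drop_broad_prefixes(["192.0.2.0/24", "64.0.0.0/2", "198.51.100.1/32"])
```

The same pass is also a convenient place to deduplicate entries or to whitelist well-known addresses (public DNS resolvers, for example) that occasionally slip into volunteer-run lists.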
And because it is designed for security, we have also put in some security features that span from the network side to the service side. We have something we call watchers: tools that watch log files and search for anomalies. They can block people, very good, and they also refresh the blacklists automatically, so you don't have to maintain complicated scripts that sometimes break when you add several countries to them. The results can be shared, first of all through Telegram, so we receive a message wherever we are when something is wrong; you can execute actions; or you can send events through ZeroMQ (and we are adding additional brokers, for example MQTT) to distribute them and collect them in a single location. The config file is very simple. First of all you specify the markers: these markers mean drop or pass, but a marker can also mean slow down, so you can rate-limit certain traffic based on that. Then you specify the policy, and for the policy, what you want to do. In this case, if the policy is drop, you allow these countries, or this continent, North America, say, meaning everybody from those places can connect. If the policy is, for instance, pass, it is the other way around: allow everybody except those countries. You can also specify which ports you want to monitor, so which ports you want your rules enforced on, and which ports you don't want to look at at all because they have to stay open, something to ignore, for instance HTTP. In addition, you can configure some honeypot ports. The idea is: these are my services, good; but if somebody connects to another port, one that is not on the list, why are they doing that? Is it a mistake, done once, or is it an attempt, a scan, network discovery, as we have seen before?
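The configuration just described might look roughly like this. The key names here are illustrative guesses, not the tool's actual schema; the point is how markers, policy, monitored/ignored ports and honeypot ports fit together in one place.

```python
import json

# Illustrative sketch of a geofencing config of the kind described.
# Key names and values are hypothetical, not ipt_geofence's real schema.
config = {
    "markers": {"pass": 1000, "drop": 2000},  # packet marks meaning pass/drop
    "policy": "drop",                         # drop by default ...
    "allowed": ["IT", "DE", "NA"],            # ... except these countries/continents
    "monitored_ports": [22, 25, 993],         # enforce the rules here
    "ignored_ports": [80, 443],               # must stay open to everyone
    "honeypot_ports": [23, 3389],             # nobody legitimate connects here
}

def is_honeypot_hit(port, cfg):
    """A connection to a honeypot port is treated as scanning."""
    return port in cfg["honeypot_ports"]

serialized = json.dumps(config)
```

With everything in one JSON document, a reload just re-reads this file instead of re-running per-country shell scripts.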
So once we have made the decision, we mark the traffic as good or bad, and Linux does the rest: no more packets are sent to user space. So there is very little CPU usage and everything happens inside the kernel (on BSD it is a little bit different). The watchers are spawned by the tool itself, so when you start and stop it with systemctl, everything is handled for you, and reload is automatic. It also refreshes the blacklists, or blocklists, every night, so you don't have to do anything. And this is a typical example. Look at the timestamps; again, this is from when I made my slides a couple of days ago, and these are just two servers. It is always like that: most of the problems are created by people who spend their time doing things they should not do. Our servers, for instance, are in the Netherlands, and sometimes we see very simple patterns of attackers moving from one server to another. So this is all, but what's next? What is the idea? As I said, one of the motivations was not just to create a simple administrative tool so that everything happens in one place. We wanted to create something that can be used to secure services for everyone, with one single, very simple config file in JSON. But we would like to go a step further. What we are creating is a sort of cloud: in essence, we would like the deployed tools to be able to talk to each other. Cloud doesn't mean that you send everything to Amazon, handing your data to third parties when you don't know what use they will make of it, and who could pollute your data with IPs that are wrong, maybe simply because they want to block other people.
The idea instead is that you run your own server, especially if you are a service provider with many services, or you can even put this on a laptop, a Linux laptop that you bring home. If you see something bad happening, you report it to the central server. And this is what we want to do.
Testing iptables firewall rules with scapy
Yeah, hello everyone, so we would like to start. Today's topic is testing iptables firewall rules with Scapy, to comply with the cybersecurity requirements from UNECE R155. Okay, turning the on button helps. So, first of all, our agenda: we want to introduce ourselves and our employer shortly. Then we want to ask the question: why test your firewall rules? Afterwards we will talk a bit about the basics of the netfilter subsystem and iptables, and then we will show you why we chose Scapy as the tool to test the firewall rules, together with a short look at the tool landscape that is out there. About us: Elektrobit has been doing this for over 10 years. Elektrobit offers the EB corbos Linux distribution based on the Yocto Project, but nowadays we also have a version based on Ubuntu, and we are cybersecurity management system compliant. Now, coming to the question: why should we test our firewall rules? Well, the answer in our case is that we had cybersecurity requirements for our embedded Linux distributions for the automotive industry. This means complying with UNECE R155, which demands that you take care of the cybersecurity of the software that goes into your car. Starting in just a few months, all new vehicle registrations, so vehicles that are new on the market, will need to comply with that, and as we build distributions for such cars, we basically also needed to certify and test our firewall rules, to sum it up. So what is the overall situation? We have a packet filter, and it inspects our traffic in the networking stack. Of course we have different use cases for this packet filter, like firewalling, traffic statistics, or logging. In our case we have netfilter in kernel space, and in user space, because the project had already been going on for several years, we actually still have iptables and not nftables. So that's our overall setup.
And just that much about that. The hooks are prerouting, input, forward, output and postrouting, plus ingress and egress; that's just to give you an overview of how netfilter looks in general. iptables and ip6tables are user-space programs, I think you all know them, but just to repeat it so that we are on the same page: they are used to interact with netfilter. Netfilter is organized in tables, like the filter, nat and mangle tables you see here. A table consists of chains, each of which is basically a list of rules that can match a set of packets, and a rule specifies what to do with a packet that matches: you can say, for example, drop it, return somewhere else, accept the packet, or take some user-defined action. If a packet doesn't match, the next rule in the chain gets executed. So, as I've already said, in the Docker example, traffic that is not destined for Docker itself is sent to the DOCKER-ISOLATION-STAGE-2 chain, and otherwise we return; in DOCKER-ISOLATION-STAGE-2 we drop all traffic going to the docker0 interface and also return, and therefore we are back in the original chain. Now I want to give a short overview of the tool landscape, but I will only highlight why we chose Scapy, because we only have 25 minutes. Here you can see some of the most common tools that are able to craft custom network packets, such as hping, Nemesis, Netcat and Scapy, and there are many more out there. They all have their pros and cons, but I want to go into detail on why we chose Scapy. So, why Scapy? It is a Python-based interactive packet manipulation library; Python is very common, I think everyone knows it, and it has a very low barrier to entry.
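The first-match chain semantics just described can be modeled in a few lines. This is a didactic sketch of the traversal logic, not how netfilter is actually implemented:

```python
def traverse_chain(packet, rules, default_policy="ACCEPT"):
    """Walk a chain: the first rule whose match function accepts the
    packet decides the verdict; otherwise the chain policy applies."""
    for match, verdict in rules:
        if match(packet):
            return verdict
    return default_policy

# A toy INPUT chain: drop telnet (port 23), accept ssh (port 22).
input_chain = [
    (lambda p: p["proto"] == "tcp" and p["dport"] == 23, "DROP"),
    (lambda p: p["proto"] == "tcp" and p["dport"] == 22, "ACCEPT"),
]

v1 = traverse_chain({"proto": "tcp", "dport": 23}, input_chain)
v2 = traverse_chain({"proto": "udp", "dport": 53}, input_chain)
```

Here `v1` is the verdict of the first matching rule, while `v2` matches nothing and falls through to the chain's default policy, mirroring how an unmatched packet in iptables is handled by the chain policy.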
Yeah, with Scapy you can define, send and receive completely custom packets, and you can manipulate them across different layers very easily; on the slide you can see the link, network and transport layers. It is very easy to create custom packets because the barrier is very low: you have a very easy entry into crafting your first packets, sending them and receiving them. What was also a reason we chose it: in the integration department we already have a test framework running that is completely Python-based, so we needed a Python-based solution. So how do we test the system? We have the ingress path, where we send a packet towards the application layer as you can see there, and we have the egress path, where we send packets from the application layer outwards. In the demo we will show later, we use QEMU and send packets from the host to QEMU and from QEMU back. One additional thing to keep in mind: Scapy has its own network stack, which runs beside, for example, netfilter. Here are some basic examples to show you how easy it is to craft your first custom network packet. The first example is just a TCP packet with destination port 80 and the SYN flag set; this one-liner, when you copy it into your Scapy console, is your first packet, and then you just have to send it to a specific interface or address, whatever. It is as easy as that. The second example is a bit more advanced: you can see that you can also use random IPs with various other options, here for the UDP protocol, and it is very easy to understand what is happening, from my point of view at least. The other example we have is an ICMP packet: we just have a destination IP, which also works, and then the ICMP protocol with type 3 and code 0.
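A Scapy one-liner like the first example (roughly `IP(dst=...)/TCP(dport=80, flags="S")`) hides the actual header packing. As a stdlib-only illustration of the fields involved, here is a bare TCP header with destination port 80 and the SYN bit set; a sketch of the wire format, not a complete, checksummed packet:

```python
import struct

def tcp_syn_header(sport, dport, seq=0):
    """Pack a minimal 20-byte TCP header with only the SYN flag set.
    (Checksum is left at 0; a real stack, or Scapy, fills it in.)"""
    offset_flags = (5 << 12) | 0x02      # data offset = 5 words, SYN flag
    return struct.pack("!HHIIHHHH",
                       sport, dport,     # source / destination port
                       seq, 0,           # sequence / acknowledgement numbers
                       offset_flags,
                       64240,            # advertised window size
                       0, 0)             # checksum / urgent pointer

hdr = tcp_syn_header(sport=12345, dport=80)
```

Scapy does exactly this kind of packing for you, layer by layer, which is why the one-liners stay short.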
Scapy also provides ready-made solutions, for example for neighbor discovery with IPv6. Back to the firewall rules: we have a rule here for TCP, which we have covered; then a certain port, which we can also easily specify; then a time-to-live of 8, which again is just a parameter of the IP packet; and finally we are also interested in certain flags, so here we use, for example, the SYN flag, and then we already have our fitting packet and can just send it out with Scapy. On the other side we can sniff for it with Scapy: we take a look at what is arriving at, say, TCP port 1234, and then we can check: okay, the packet and the flags did match, so it should be accepted by the firewall; otherwise the flags didn't match and it should be rejected. Or, to take a different example, here we see a firewall rule that should match in the INPUT chain on the source IP: okay, again we just put that source into the packet; we have a certain destination port, again with TCP; we craft our packet in Scapy as planned and then just send it. On the other side we again sniff with Scapy and check whatever is arriving on eth0. We can also specify a filter in our sniff function: we check whether it is TCP and whether the source is as expected, and then we wait for one packet. When this packet arrives, we execute the packet-check function, where we check whether the port is correct (we already sniffed on the correct interface and checked the IP address in the filter), and if we received a TCP packet matching the filter, it matches the firewall rule and we can say it should be accepted; if the port is not matching, say it were 23, it should be rejected. And here is one last example, basically the same situation: we craft a fitting packet with TCP and port 100 here, and
then we again specify the interface and of course our IP address (I think you can see it very well from the arrows), and then you just send it out. The other side basically looks as before: we sniff on the tap0 interface (this is motivated by our QEMU setup), then check the filter, do some sanity checks, execute a function, and again check the port; if this matches, the packet should be accepted by the firewall, and otherwise it should be rejected. And I think we now have time for a demo. So, as you can see on the screen, hopefully visible for everyone: in the right window you can see the QEMU instance that is running. We have already loaded the firewall rules we want to test, and now I start the sniffing so that we can do the ingress test. And yeah, the iptables counters have gone up, so everything worked fine on the ingress path. Now we can also show the other way around... yeah, I think we can stop the demo; I think I have a mistake in one of the rules, so now I'm not allowed to send a packet. This is the output packet we showed before that gets redirected. Sorry for that, but in the slide deck you can see a link to the GitHub repo for the demo, where we have stored everything with a README, so you can test it on your own and extend it if you want. So we are already at the end of our talk. As a summary: you now know why you should test your firewall rules, and it's not only about UNECE R155; you got an overview of netfilter and iptables; we showed you a very short overview of the tool landscape and why our solution was Scapy; you hopefully got some basic insights into how you can use Scapy for your own test cases; and you saw that you can test iptables
firewall rules with it, if everything is set up right. With that, we want to thank you for your attention, and if you have any further questions, please feel free to ask now; you can also find our contact information on the slide. Thanks also from my side; sorry about the screwed-up demo, I promise it will be fixed, and if not, we are fixing it today still. So, questions? Hi, great presentation, thank you. I have searched occasionally over the last several years for something that can simulate iptables rules without needing a full network stack, since, for example, the iptables-save utility can dump entire iptables configurations and the syntax seems simple enough. Is there something non-obvious that prevents us from simulating rules entirely in memory, without the need for a full network stack, sockets, tun/tap devices or virtualization? I think I cannot fully answer your question, so let me answer from our perspective on why we test the firewall rules. Our use case was to test the firewall rules we have in our system. In our company we run the tests directly on our target, on the hardware, because we want to know the behavior directly on the system with the complete firewall config loaded, so that we can ensure everything works: we have requirements, and we need to test whether those requirements are fulfilled by the firewall in the system. And yes, Scapy supports that kind of testing, and we are aware of it; so the basic answer is yes. How you then focus on this or that case cannot be generalized, because it depends on the concrete firewall and the concrete use cases you want to cover, but Scapy supports it, and that's also a great part of it, I would say. I think with that we can wrap up; we are exactly on time, so sorry, but please have a nice day.
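The accept/reject decision logic in the sniff callbacks described above can be written independently of Scapy. A minimal Python sketch (the packet field names are illustrative, standing in for Scapy's layer attributes):

```python
def packet_check(pkt, expected_src, expected_dport, required_flags="S"):
    """Mirror of the check function described in the talk: a sniffed
    packet matches the firewall rule only if its source address,
    destination port and TCP flags are all as expected."""
    if pkt.get("src") != expected_src:
        return "rejected"
    if pkt.get("dport") != expected_dport:
        return "rejected"
    if pkt.get("flags") != required_flags:
        return "rejected"
    return "accepted"

# A packet matching the rule, and one aimed at the wrong port (23).
ok = packet_check({"src": "192.0.2.1", "dport": 1234, "flags": "S"},
                  "192.0.2.1", 1234)
bad = packet_check({"src": "192.0.2.1", "dport": 23, "flags": "S"},
                   "192.0.2.1", 1234)
```

In the real setup this function would be passed to Scapy's `sniff()` as the per-packet callback, with the BPF filter already narrowing down interface, protocol and source.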
iputils project introduction
I wonder whether you have used ping, traceroute or tracepath, some of those implementations. And does anybody use arping? Okay, you are network administrators, I guess. And clockdiff: has anyone used clockdiff recently? No. That's a nice question, thank you. iputils is a very old project. It was started by Alexey Kuznetsov in 1999. He was a Linux kernel network upstream developer, and he was also the iproute2 upstream developer at the time. He ported some BSD sources to Linux and wrote some other tools for iputils, and he maintained the project until 2002. He used the netdev Linux kernel mailing list. Hideaki Yoshifuji was the next maintainer; he was also a Linux kernel network developer, doing IPv6 at the time. Hideaki improved the project a lot. He started to use Git, so we have some history now. He moved the project to SourceForge.net, which was popular at the time, and he still continued to use the netdev mailing list. He introduced uClibc support, so it was not just for glibc. Although he made his last release in 2015, the last widely adopted release was probably the previous one, from 2012. Because iputils development slowed down, David Heidelberg forked iputils and moved development to GitHub in 2014. The initial goal was to upstream various patches from Linux distributions. At that time he also added musl libc support and other things, because the tools were very old. A license cleanup was done, which people from Linux distributions approved of, or were happy about. There were other people at the time, for example Jan Synáček and Pavel Šimerda, both from Red Hat. Pavel improved and modernized the code a lot; he started to use the newer C functions, such as getaddrinfo instead of the old ones that were specific to IPv4 or IPv6, and there were other improvements. Sami Kerola was the next maintainer, starting in 2017. He modernized the code a lot, and he also introduced the Meson build system.
There were other people at the time, such as Noah Meyerhans and Yuri Chornoivan, who still maintains the localization. One could ask: who needs localization for tools like ping, really? I guess not many people, but I have been approached by people who really like the localization, so I kept it. I came in 2017, and obviously there have been many people in the Git history: there are nearly 140 contributors, and there was history before that too. Now to the current tools. iputils currently contains ping, arping, tracepath and clockdiff. ping sends ICMP ECHO_REQUEST to a network host. It is very old code, from 1983, and I think it is the most important iputils tool. It supports both kinds of sockets: the raw socket and the ICMP datagram socket, which is more secure. Unfortunately, not all distros use the latter. I see some people from Debian here, so I would recommend to stop using the raw socket. The reason why it is still used is systemd, which is not used on all systems; Debian, you know, supports other init systems, and that is the reason why ping wouldn't work there by default. Below we have an example of pinging suse.com, a very basic example. ping obviously supports a lot of functionality, so there are loads of switches; this is just a simple example. arping sends ARP requests to a network host. It was written by Alexey Kuznetsov, and it supports IPv4 only, because the ARP protocol itself is IPv4-only. Again, a basic example. tracepath traces the path to a network host, discovering the MTU along the way. Again, it was written by Alexey Kuznetsov. There is a small example, tracing the path to suse.com. And clockdiff: that is again very old code, from 1985, from an unknown author, and it supports IPv4 only. We removed some obsolete tools in 2021. Those tools used experimental protocols that never became relevant, or there were much better implementations in other projects, so there was no point in maintaining something that is not really used, or is kind of buggy.
Because the tools we have in iputils are basic network tools, written a long time ago, there are obviously other projects implementing similar tools. Just to highlight some of them: fping is a much-enhanced ping. It is written in modern C, it can ping any number of targets, and its output is designed to be parsed, so it is good for use in scripts. It also doesn't perform reverse DNS lookups by default, which in some cases is faster. mtr, "my traceroute", is a tool that combines traceroute and ping. It uses a GUI and ncurses, and it also runs on FreeBSD; a very nice tool. The next two projects are collections of tools. BusyBox targets low-power embedded devices; it has many tools, among them ping, arping and traceroute. It is somewhat compatible with the tools from iputils, but implements only part of the functionality. Inetutils is an old GNU project which also has rsh and the like, so a very old project, not that active nowadays, and it also has ping and traceroute. So, iputils' future: what should we do? We should rewrite the code in modern C. We concentrate mainly on ping, so the other tools are neglected. I wonder whether we should keep clockdiff at all. Tracepath is also questionable, because mtr is much better, and there is traceroute, the original project, which is also better than tracepath, so it is a question whether to keep it. The project would need reviewers and real network developers. We should write tests: we have CI tests, but we don't have functional tests, so sometimes a regression slips in. The tools could also have JSON output and color output. So that's it. Do you have any questions? Sorry, I didn't quite understand how systemd, or the lack of it, can force the use of raw sockets. There is sysctl, which handles kernel parameters for networking. The ICMP datagram socket is by default allowed only for root, so if you want ping to work for normal users with the safer ICMP datagram socket, you need to set something (the net.ipv4.ping_group_range sysctl).
And that is set in /etc/sysctl.conf, or however that file is called, and this works differently with systemd than with other init systems. So if you use BusyBox's init system, you would lose this configuration. Mainly there should be a solution that simply doesn't block this; there is a Debian bug report about it, but no one is working on it. Any other questions? Hello, I have one question: what is the future of the iputils tools? What is the next feature or roadmap that you are working on, say for five or ten years? Well, those tools are very old, so one would say the work has been done. But the problem is that there are bugs, and there are improvements which can, you know, introduce regressions. My motivation to join the development was to keep ping working, because I need it for kernel network testing. So I would say there is no big future, unless someone finds it interesting to rewrite the tools into modern C as an exercise, because the code is terrible; it is 40 years old or so. So no real future, but I think JSON output would be a good feature, and color output would also be good. Some of those, but mainly maintenance mode.
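The raw-socket versus ICMP-datagram-socket choice discussed above is typically a try-and-fall-back. A sketch of that logic (not iputils' actual code; the socket factory is injected so the fallback can be exercised without privileges):

```python
import socket

def choose_icmp_socket(make_socket=socket.socket):
    """Try the privileged raw socket first, then fall back to the
    unprivileged ICMP datagram socket, whose availability on Linux
    is gated by the net.ipv4.ping_group_range sysctl."""
    try:
        return "raw", make_socket(socket.AF_INET, socket.SOCK_RAW,
                                  socket.IPPROTO_ICMP)
    except PermissionError:
        return "dgram", make_socket(socket.AF_INET, socket.SOCK_DGRAM,
                                    socket.IPPROTO_ICMP)

# For illustration, a fake factory that denies raw sockets, the way
# the kernel does for a normal unprivileged user.
def _unprivileged(family, type_, proto):
    if type_ == socket.SOCK_RAW:
        raise PermissionError("raw sockets need CAP_NET_RAW")
    return "fake-dgram-socket"

kind, sock = choose_icmp_socket(_unprivileged)
```

This is why the sysctl matters: if `ping_group_range` excludes the user's group, the datagram fallback fails too, and ping needs either root or a capability on the binary.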
ZeekJS: JavaScript support in Zeek
Hello, I hope you can hear me well. Thanks. My name is Arne, I work for a company called Corelight, and I work on the Zeek project. A quick question: who of you is using Zeek? Anyone? Three, maybe. I want to talk about JavaScript support in Zeek, but first, since there may not be many people who have heard of Zeek: it is a passive network security monitor. It has existed for a long time; development started in 1995. It is open source and BSD-licensed. It was called Bro until 2018; Bro isn't really a name you should use for a project anymore, so it was changed. If you look at it from a high level, you feed it packets at the bottom, either from live network traffic, a live interface, or from a PCAP file, and what you get out at the top is a set of logs that describes what kind of activity is in your network traffic. If you look under the hood, there are a few more details: it is an event-driven system, it has a custom scripting language, and we have something we call Broker, a messaging interface for talking between separate processes. To give you a flavor of the logs that come out at the top, these are single entries for single connections. On the right-hand side there is the conn log, which is the most central log. There is the identifying five-tuple (we also support IPv6, but that is an IPv4 example); the service field indicates which protocols Zeek was able to discover within that connection; and at the bottom is statistical information, like packet counts and duration. On the left-hand side you see a more protocol-specific log, in this case the quic log, which was recently added. There you can see, for example, the server name from the client hello: if Zeek is able to decrypt the initial packet of a QUIC connection, it forwards the crypto payload to the TLS analyzer, which can then extract that kind of information, and we put it in a log field as you see.
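A conn-log entry of the kind described (five-tuple, service field, counters) looks roughly like this in JSON form. The field names follow Zeek's conn.log schema; the values here are invented purely for illustration:

```python
import json

# Illustrative conn.log-style entry; all values are made up.
conn_entry = {
    "ts": 1706972400.0,
    "uid": "CExample1234",          # Zeek's per-connection UID
    "id.orig_h": "192.0.2.10",      # the identifying five-tuple ...
    "id.orig_p": 54321,
    "id.resp_h": "198.51.100.20",
    "id.resp_p": 443,
    "proto": "tcp",
    "service": "ssl",               # protocols Zeek detected in the flow
    "orig_pkts": 12,                # statistics at the bottom of the entry
    "resp_pkts": 10,
    "duration": 0.42,
}
line = json.dumps(conn_entry)
```

One such JSON line per connection is exactly the shape of data that gets shipped to a search backend for analysis.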
That is the sort of data you would push to Elasticsearch or Splunk and then do your analysis there; that is not Zeek's job, we just produce logs. Okay. It is a fairly old system with a custom scripting language, and it looks like this. This is just a sketch, it is not actually going to work exactly like this, but it sketches how the quic log entry is created. There are two event handlers. One is for initial packets: whenever there is an initial packet, that event is raised and we create an info record, which in the end represents the quic log entry. Then there is another event, ssl_extension_server_name, that is raised whenever there is an SNI extension in the client hello; you can handle it and enrich the quic log entry with the server name, or with the first name in the list, which is just a heuristic here. At the bottom is a log write call where we actually produce that JSON entry. So yeah, it might look a bit unusual in the beginning, but it is a fairly powerful language that has some network-domain-specific features, which also allow you to write detections with Zeek and build advanced analysis within that scripting language. What is not so great is the interaction with the outside world. That log write, for example, is a thin layer above the whole C++ logging framework: it is not implemented in Zeek script, you have to do that in C++. And usually, for any extension you want to make, you have to resort to writing a plugin in C++. If you don't go the C++ route, we do have support for asynchronous HTTP requests, but if you look a bit under the hood, the implementation spawns external processes and works by writing the requests and responses into files in a temp directory, then reading them back and handing them to the script. So it is a really scary implementation of an HTTP request.
So the idea was: well, why don't we use a language that maybe does provide all that stuff, has a rich ecosystem, and is well known as well. And particularly Node.js, because of the libraries and the NPM ecosystem; that was sort of the idea. And as a twist, we are doing this as a plugin and not by patching the Zeek source code base. We just want to build something external to add support to Zeek to also use JavaScript. So quickly about plugins. They're basically shared libraries that Zeek loads at startup, and within that plugin you can access Zeek's C++ API or also hook into certain execution paths. For example, whenever new connection state is created, you can implement the hook setup_analyzer_tree and attach something to that connection, usually analyzers, a protocol analyzer we would say. There are also ready-made components which are basically implemented against an interface. There's no component for a whole scripting language, so we sort of resort to the first two to implement the JavaScript support. Okay. So the top hopefully doesn't look too unfamiliar if you know some JavaScript. There's an object on the left that is called zeek, sort of a global object. There's a well-known on function where you register an additional function for a certain event name, so that it looks more like the Zeek script example. And as an addition, there's the https module from Node, and there's also an example of how you could post the connection UID and those SSL server names mentioned before to an HTTP endpoint, just from within Zeek. So we want to get there. And the first step is to prevent Zeek from interpreting .js files as Zeek script, which it would do by default.
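As a sketch of what the JavaScript side described here might look like: the `zeek` global, its `on` registration function, and the event name follow the speaker's description, but the exact API of the real plugin may differ. To keep the snippet runnable outside Zeek, a minimal `zeek` object is stubbed in; the real plugin provides that object itself.

```javascript
// Minimal stand-in for the `zeek` global that the plugin injects.
// In a real ZeekJS script you would NOT define this yourself.
const handlers = new Map();
const zeek = {
  on(eventName, listener) { handlers.set(eventName, listener); },
};

// Record the first SNI name seen per connection UID,
// mirroring the Zeek script heuristic from the talk.
const serverNames = {};
zeek.on('ssl_extension_server_name', (c, isClient, names) => {
  if (isClient && names.length > 0 && !(c.uid in serverNames)) {
    serverNames[c.uid] = names[0];
  }
});

// Simulate Zeek raising the event for one connection.
handlers.get('ssl_extension_server_name')(
  { uid: 'CHhAvVGS1DHFjwGM9' }, true, ['fosdem.org']
);
console.log(serverNames); // prints the collected mapping
```

From there, the Node `https` module could POST `serverNames` to an endpoint, as in the speaker's slide; check the ZeekJS documentation for the actual event signatures.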
And you can implement the hook load_file and basically check if the file name that Zeek is attempting to load ends with .js; returning 1 basically says, well, don't bother about it, I'm taking over, and we are stashing away those JavaScript files. And that works for files on the command line and also those loaded with directives, so the @load directive. Step two is to initialize the whole JavaScript engine, the V8 engine and the Node.js environment. There's documentation about that; there's a link here. This is sort of a sketch. It's a bit complicated, but there is good documentation about it. What is happening at that point is also that we are loading the JavaScript files, and so the top-level zeek.on calls are actually executed. So we need to provide this zeek.on call already. So I'll say this is just step three. I need to slow down a bit. Just for myself. So step three: the call to zeek.on basically gets an event handler name and a listener function. And with that event handler name we can use C++ APIs to look up the event handler object, which is a Zeek-specific object belonging to that event name. From that we can get a script function, which usually has a list of bodies, and each of the bodies contains a statement list, and then there are further statements. So usually the script execution is interpreted: it just runs down all those statements and executes them. What the plugin can do is add another body into that list of bodies and provide a custom statement subclass which, when executed, really just calls into JavaScript and executes a V8 function. So when this first happened it was really exciting. You see a hello printed from Zeek and a hello printed from the console. It was nice to get that done. What was not so nice is that you need to map types between those two languages. So there are different types on the Zeek side, and JavaScript has other types.
For example, the address or subnet type on the Zeek side we currently just map to strings in readable form. It's not the most performant, but it was nice to have JSON.stringify and have IP addresses like that. I'm not going to talk much more about this. The last step was to integrate both of the IO loops. Zeek has its own IO loop that is kqueue-based, and Node.js has an IO loop which is libuv-based. Usually the Zeek IO loop is blocking on an event call, waiting for a packet to be served, or a block of packets, or a timer to expire, or something else to happen, and then acts on it. What the plugin can do is register something called an IO source, and in the case of libuv the plugin takes the backend file descriptor of the libuv IO loop and installs it into the Zeek IO loop. Which means that whenever something has to be done on the Node.js side, like a client connecting on a listening socket, the backend file descriptor of the libuv loop becomes ready and the Zeek IO loop wakes up, recognizing this is the Node.js file descriptor that became ready: I need to transfer control over to that loop. And the plugin runs the Node.js loop non-blocking until there's nothing left to be done, and control is then transferred back to Zeek. Yeah, that was the most tricky part of the whole plugin. I didn't talk much about the picture before, the architecture, but where I would position this, and it's not completely technically correct, is that we have extended the event engine a bit with a Node.js event engine down there, and also the Zeek script language. So we have extended everything with being able to also use JavaScript instead of the Zeek script language. As a summary, I find it really impressive that we could do that without actually patching Zeek. Everything was in place to pull this off, which is a testament to how Zeek was built over the years, really. We're not going to replace the existing Zeek scripts with JavaScript; that is not the plan.
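The address-to-string mapping mentioned at the start of this section can be illustrated like this. The record shape is invented for the example (it is not the plugin's actual conversion code), but it shows why string-typed addresses make JSON.stringify pleasant:

```javascript
// Per the talk, Zeek `addr` and `subnet` values arrive in JavaScript
// as readable strings, so a log-like record serializes with no custom
// serializer at all.
const entry = {
  orig_h: '192.168.1.10',   // Zeek addr   -> JS string
  resp_h: '2001:db8::1',    // IPv6 addr   -> JS string too
  resp_p: 443,              // plain number in this sketch
  net: '10.0.0.0/8',        // Zeek subnet -> JS string
};
const json = JSON.stringify(entry);
console.log(json);
```

The trade-off the speaker notes applies here: converting every address to a string on each event costs cycles, but it composes directly with the whole JSON-speaking ecosystem.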
For integrations you want to build, or maybe just proofs of concept of things for which you previously needed to quickly use C++ and find some C++ library to do whatever: you can now tap into the NPM ecosystem or JavaScript and try it with that. That plugin is coming with Zeek 6.0 by default, so if you have libnode installed and you compile Zeek, it will just be supported, really. And our container images also have it built in by default as well. Any questions about that? Any questions? Hi, Arne. Have you evaluated the performance of this? Does it impact performance a lot? I would say it runs slower than just Zeek and its interpreted scripting, mostly because we need to translate between those two type systems. I would also currently position it to not necessarily run JavaScript in the packet path, unless you are really adventurous. We also have Zeek processes, like the proxy and the manager, that don't do packet processing. They have a lot more cycles there. If you run JavaScript there and do, say, pulling in IOC information, that's one use case that you can do in a node that is not in the packet path. We would be interested in performance numbers. Thanks. Have you explored other languages as well, apart from JavaScript? Not explored; I sort of have in my mind Python as a proof of concept, but JavaScript is asynchronous, it's non-blocking. That's the paradigm there, and that's what we needed as a replacement for Zeek script. Thanks. Any more questions? Thank you very much.
Bringing routes to Kubernetes nodes via BGP: introducing frr-k8s
Can you hear me? Okay. So, today I'm going to talk about a project that we started more or less this summer, which is FRR Kubernetes. Some quick words about me. I'm Federico. I work for the networking team at Red Hat, in charge of making the OpenShift platform suitable for telco workloads. And because of that, I managed to touch many network-related projects: the SR-IOV operator, some CNI plugins, our primary CNI, and lately MetalLB, which I'm currently maintaining. Those are my handles for Mastodon, Twitter and Gmail. If you need to reach out to me, ask questions, I will try to answer. So, it's funny because this talk has something to do with the talk that I gave last year here at FOSDEM, where I presented how we in MetalLB replaced the native Go BGP implementation with FRR. And first of all, what is FRR? FRR is a Linux Internet routing protocol suite. It implements a lot of protocols. It's very stable and well supported, and it supports BGP and BFD, which were a couple of protocols that we were interested in. What is MetalLB? Is anyone using MetalLB? Nice. So, I'm the one to blame if something is not working. MetalLB is the load balancer implementation for Kubernetes clusters using standard routing protocols, including BGP. BGP allows us to announce our load balancer services across our network. If you are using Kubernetes on bare metal and you need to expose your application, there is a good chance that you need something like MetalLB. It's not the only alternative, but it's the one that I maintain. This is more or less the architecture. We have the Kubernetes API on one side, expressed in terms of services and MetalLB configuration. We have some code that takes all these resources, munges them together, and produces an FRR configuration that an FRR sidecar container processes; that sidecar then handles the BGP implementation. Last year, in this very conference, I got this question: can I run MetalLB together with my FRR instance on the cluster nodes?
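For context, the MetalLB configuration he mentions is itself expressed as Kubernetes resources, roughly like the following. This is a minimal sketch based on MetalLB's CRD-based configuration; the names, addresses and ASNs are placeholders, so check the MetalLB documentation for the exact fields:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.10.0/24
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: example-peer
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64513
  peerAddress: 172.30.0.3
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: example-adv
  namespace: metallb-system
spec:
  ipAddressPools:
  - example-pool
```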
This is something that I keep hearing a lot. Not only that. What I keep hearing is: hey, but now inside MetalLB you have FRR, so you can also do this and this and this and this. No, because MetalLB is about announcing services, not, for example, about receiving routes and injecting them into the node, which is a common request. Why is that? On the cloud, everything is easy. You have one single interface to the node, one default gateway. The client who wants to hit your load balancer service gets to the node, enters the CNI, goes to the pod, the pod replies, and then the reply goes to the node and then exits via the default gateway and reaches the client. All is good. But on bare metal, we have users that want to have different networks for different classes of traffic, for example, and you have clients that are not on the same subnet. So what happens in this scenario is that your client reaches your secondary network and, guess what, the traffic will try to exit via the default gateway and will not reach your client. Or even worse, you will be bitten by RPF and you'll have a bad time trying to debug it. I've been there a couple of times. So this was more or less the request: how can I have something that is able to configure the FRR running on my node together with MetalLB? There are a few alternatives. The easiest one, at least the easiest for me, was to run two FRR instances on the node. So I don't have to do anything in MetalLB. The user can configure its own FRR instance, but that comes with a few issues. You have duplicate sessions; you have duplicate resources consumed on the node. You have to use custom ports to let the router know how to connect to MetalLB and how to connect, on those custom ports, to the other FRR. The other option is using two FRR instances in cascade. This might work, but FRR wasn't able to peer with localhost until recently, and it limits the flexibility of MetalLB, because MetalLB has a lot of per-node configuration knobs.
You can say: I want to peer with this BGP peer only from this subset of nodes. In this case, that would affect only this one session, which is useless. And also, what about the BFD implementation in MetalLB? It would establish BFD only through this path. So the next one, which is the one that I'm going to talk about today, is to have a shared instance between MetalLB and the rest of the world. So the external configuration can scale. We can have something that is distributed across all the nodes. We don't waste resources, because across the same BGP session towards the router, we can do what MetalLB needs to do, but also other stuff. The cons were: this was a lot of work, and getting the right API was tricky. It wasn't clear how to handle conflicts, how to merge all this stuff together. But eventually, this became a design proposal in MetalLB, and it converged, and we started working on it. And this is how this new project was born. It's a Kubernetes-based daemon set that exposes a very limited subset of the FRR API in a Kubernetes-compliant model. I wrote this description, so it's nice. This is the new architecture of the new thing. Basically, we stole what we had in MetalLB. We have this new FRRConfiguration resource, and it does basically what I already described about MetalLB before, but now we have a different API and a different way to configure this thing. How to deploy it? It can be deployed as a standalone thing, and this is something that I want to stress. We can use it together with MetalLB, and we just released a MetalLB version that uses this one as a backend, but you can also deploy it as a standalone component. So you can use it for your own purposes, regardless of the fact that you are using MetalLB or not. Now I want to talk a bit about the API. There was a good amount of discussion on this; we were not sure whether we should have exposed the raw FRR configuration to the external world or something that was higher level.
Because there were some issues with this. How do we merge configurations? How do we allow two configurations, produced by two different actors, to become the same FRR configuration? How do we intercept configuration conflicts? If it was the raw configuration, that would be a royal mess. And also, the way MetalLB configures FRR is very opinionated. It gives certain names to route maps, it gives certain names to prefix lists, and if we wanted to extend that with a raw configuration, that would have become part of the API, and it would have been something that we couldn't change. Eventually we ended up with something high level in terms of a CRD, which is FRRConfiguration. And this is how a configuration looks. It has a BGP section in the spec, because we are anticipating that we might need other protocols. We have a routers section; we support multiple routers, but they need to live in different Linux VRFs. We can configure the neighbors, and we can say what prefixes we want to advertise to or receive from those neighbors. And this is how advertising looks. We can say: I want to advertise all the prefixes that I configured in my router, or only a subset of them. And the same goes, more or less, for the receiving part. We can say: from this peer, I want to receive only the prefixes that are matching this selector. Or we can say: I want to receive all of them. And we have a node selector, so you can say this specific configuration applies only to a subset of the nodes, which is always useful. And of course, because we know that there will be a lot of configuration that we don't cover, we also allow a raw configuration, for experimenting or for covering special needs, and there is a priority field; basically this gets appended to what is rendered inside the configuration from the API. And of course we have BFD, communities, local preferences, and all the stuff that MetalLB is currently exposing; it's covered in this API.
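Putting those pieces together, an FRRConfiguration might look roughly like this. This is a sketch from memory of the upstream examples, not the slide from the talk; field names such as `toAdvertise`/`toReceive` and the exact apiVersion should be double-checked against the frr-k8s documentation:

```yaml
apiVersion: frrk8s.metallb.io/v1beta1
kind: FRRConfiguration
metadata:
  name: example
  namespace: frr-k8s-system
spec:
  bgp:
    routers:
    - asn: 64512
      prefixes:
      - 192.169.10.0/24
      neighbors:
      - address: 172.30.0.3
        asn: 64513
        toAdvertise:
          allowed:
            mode: all        # advertise every prefix configured on the router
        toReceive:
          allowed:
            prefixes:        # accept only prefixes matching this selector
            - prefix: 192.168.1.0/24
  nodeSelector:              # apply only to a subset of the nodes
    matchLabels:
      kubernetes.io/hostname: worker-0
```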
And now I'm going to talk a bit about how multiple configurations are merged, because this was a pain point. You have multiple actors throwing configurations at the cluster, and those need to be merged together in order to produce one single FRR configuration. And there were some guiding principles in this. We wanted a given configuration to be self-contained, meaning that you can't have prefixes on one side and say that you want to advertise those prefixes in another resource. A configuration can only add to an existing one, meaning that you can add neighbors, but you can't say: I want to remove this neighbor applied by this other configuration, because that would interfere with the work of other actors. And a more permissive configuration can override a less permissive one, meaning that if you have receive none, receive some will override it, and receive all will override receive some. And this is how we can merge two different configurations. We have one neighbor on one side; we have two neighbors on the other. These two configurations are compatible. And then on one side we advertise only a set of prefixes, and on the other side we advertise all of them. And these are two compatible configurations that can be merged together. Another thing: you apply all the configuration, and nothing is working. It happens a lot. We have validation webhooks, but given that the configuration is composed of multiple configurations, we know how Kubernetes works, and some things might slip. So we are exposing the status. We have three fields. One is the last conversion result, which means that if you have multiple incompatible configurations that make it to the controller and the conversion fails, this is where you will see the error. Then there is the status of FRR loading the generated configuration, and there is the configuration running inside FRR. So it's something that can be used to inspect the status of the thing.
With MetalLB, again, now with this new implementation, we have the same Kubernetes API on one side. MetalLB will generate an FRR configuration. It's going to be read by this new daemon, which will talk to the router. And this is how a MetalLB configuration gets translated into this new one. So we have the routers, we have the neighbors, and we have a selector. So each speaker will generate a configuration for the node where it's running on. Yeah, this is what I just said. And this is when we add the service. So we will start advertising those prefixes related to the load balancer service, and things eventually will work. And of course, this is something that can be expanded by providing your own FRR configuration that gets merged with the one generated by MetalLB. I have a very quick demo. It's my first time doing a live demo, so fingers crossed. Very quickly, the demo environment is a kind cluster. We have the daemon running on each node. We have an external FRR container that represents more or less the external router. And now I'm going to... Okay, so here I have the kind cluster and a bunch of configuration. We have here the external container. It's peered, or it will want to peer, with each of the cluster nodes. And also, it will try to advertise a couple of prefixes. And I can go on the configuration side and look at this. This is what I just stated: we want to advertise only one prefix. I'm going to apply it. And hope. Okay, so the session is up with all three nodes. And we have the single prefix advertised by the three nodes. And now I can look at this other one, which says advertise all. And I can apply it directly. And it's going to be merged, hopefully, with the other one. And then now we have two prefixes advertised. So it's working. We have CI, so it shouldn't be a surprise. Now I can do something on the receiving side. Here we want only one prefix out of the two that the external container is publishing. And this is a session inside the node.
And eventually, yeah, here we have it: the last one is the route that is published by the external container. Yeah, what else can I show? I have five minutes. Oh, I can do this. So this is a pod running on the node. And if I try to ping it from outside, it's not going to work. So what I can do, for example, is apply the configuration that receives that prefix. No pressure. And it pings. So again, another nice example. Thank you. Okay. So I also have other examples, but I think I stressed my luck already enough, and we still have five minutes. Okay. So what's next? I don't know what's next. FRR provides a lot of opportunities. This is more or less a subset of what MetalLB offers, plus something that was asked for by a lot of MetalLB users. But of course, you can come and provide feedback, suggest new features, open issues, or even contribute to the project, hopefully. The good thing is we have a framework that we can expand and grow, implementing new FRR features. A few resources. We try to keep the documentation aligned. So we have an upstream README. We have the MetalLB documentation. There is the MetalLB channel on the Kubernetes Slack, which is where I live daily. And of course, the FRR community is super vibrant, super helpful, and always open to provide feedback and give help to us. And with that, if you have any questions, I'll be happy to answer. Thank you. Why did you keep using the FRR configuration files, which are quite painful to merge, as you said, instead of using the northbound APIs? Can you raise your voice a bit? Is it better like this? Yeah. Why do you use FRR configuration files, which are, as you said, quite painful to merge, instead of using the northbound APIs, which have a NETCONF thing? Because at the time, that was declared as experimental. I don't know if things changed in the meantime, but, okay. So then we can... You should. Okay. Yeah. But we had all this mess already in place, so it was easy at the time to recycle it.
But yeah, if there is a proper API, I'd be happy to start moving to that. Thank you. Any other questions? Okay. Thank you.
Multi-network in Kubernetes: No batteries included
Check one two, check one two, all right. Thank you everybody for coming to our talk about multi-network in Kubernetes and how there are no batteries included. My name is Doug Smith. I'm a maintainer of Multus CNI, which is a multi-networking plugin, and also part of a working group related to it, and I'm joined by Miguel. I'm Miguel Duarte. I'm a software engineer working for Red Hat, particularly in the OpenShift Virtualization networking team. I'm also a member of the Network Plumbing Working Group and, yeah, sometimes work with Doug on this kind of stuff. Awesome. So, we've got to rip through this pretty rapidly, and it's a pretty complex problem space, but we're going to run you through it as quick as we can. So we're going to look at what exactly multi-networking is in Kubernetes and kind of show you what the problem is that we're looking at. There's also the current set of solutions, and then also future solutions that we're looking at as well. And even if you're not necessarily interested in the multi-networking problems in Kubernetes, we kind of hope that you're going to be interested in the problems that we've identified, which we think are really common to a lot of engineering problems in general, and especially for open source communities. We also have a demo for you to watch at home, because we are short on time. So the first question we should be asking is exactly what is multi-networking in Kubernetes. So the thing is, it kind of isn't, because it's not something that Kubernetes actually is interested in solving. What do I mean by this? The Kubernetes networking model pretty much says that any pod on the cluster can reach any other pod in the system. Cool. How does it do it? One interface on each pod, connected to the same network. One interface. If you need more, well, it's outside of Kubernetes. The community pitched in together and implemented that out of tree. But first, why do you want multiple networks in Kubernetes?
For instance, network isolation: let's say you need to meet compliance requirements, like you need to separate traffic not only in software but also physically in the network. This kind of thing happens every day. And for that you need multiple interfaces. Or, for instance, you want to implement a firewall or something. Well, you'll need at least two interfaces. So this is a reality. There's a need for it, and Kubernetes does not do it on its own. So that's the problem: you don't have batteries for this. You can do it; the community has provided ways for this to happen, but it's out of tree. And you need to deploy a bunch of stuff for this. So you need to deploy controllers on the nodes; you need to add more and more pieces. So it's solvable, but it's not in-tree. It's not native to Kubernetes. Furthermore, while it works, its user experience is challenging, to say the least. It's cumbersome to use. It feels clumsy. There are a lot of ways for you to get it wrong, like if you just put in an attribute that does not exist or make a typo in it. Well, it depends on the implementation what will happen. And at the end of the day, if you have something that is error prone, a lot of people are going to make errors with it. In one word, this is pretty much arcane knowledge that needs to be used. So the current solution for it is Multus CNI. Multus CNI is a CNI multiplexer. CNI is your container network interface. It's an API that allows you to specify how you are going to run plugins that talk to this API in order to plumb your networks, how you're going to connect the network interfaces in your pod to the rest of your network in Kubernetes. What Multus is designed to do is to multiplex CNI plugins. So you use custom resources, which are extensions of the Kubernetes API. They're not natively part of the API; they're a way that you extend it. And they give you a way to, quote unquote, kind of trick the platform. So you add Multus CNI into your cluster.
You populate these custom resources with CNI configurations, but CNI configurations are JSON and Kubernetes resources are YAML from a user perspective. So you kind of mix both of those, and I'll give an example of that in a moment. But we also have an effort that's ongoing for Kubernetes-native multi-networking. So what this would do is take this concept that we have out of tree and get these pieces natively into the Kubernetes API. So we would actually extend that. And probably, as a building block, we may actually implement them as custom resources, which is a detail here. The one thing to keep in mind, though, is that this will be an extension of the API without a native implementation. So it will actually still require an add-on itself, which is also a bit of a challenge. But we really like the idea of extending the API. If you take a look here, you're going to see this is a Kubernetes pod spec. You've probably seen them before, but we use an annotation. And those are freebies: anyone can add an annotation. We have a specification for how it should look. But it's got JSON in there. So if you're walking through this object in the API, you hit this. What do you have to do? You have to parse it, which is no fun, no fun in the least. Now with the Kubernetes-native approach, you are going to just have it all in YAML. So if you're writing a Kubernetes controller and you're using client-go, you're just going to walk through this easy as pie. So it should be a lot easier. But we have to ask ourselves: what does the future look like? It's kind of a complicated scenario. Number one, we'll probably still have Multus CNI. It's out there. People use it. They're going to continue to use it. And then we might also have Kubernetes-native multi-networking. So we've got these two things. But there's a bunch of other projects in the space that may be up and coming as well. So CNI 1.0 has been around for quite a while.
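The JSON-inside-YAML mix described above looks like this in practice. This is a minimal sketch of the out-of-tree approach; the macvlan config, interface, and names are placeholders, not taken from the talk's slide:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "ipam": { "type": "dhcp" }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-net
spec:
  containers:
  - name: app
    image: alpine
    command: ["sleep", "infinity"]
```

Note how the CNI configuration is an opaque JSON string embedded in a YAML resource, and the pod references it via a free-form annotation: exactly the mixing of formats the speakers are pointing at.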
And CNI plugins run on disk on the host; they're not actually containerized. CNI 2.0 would make a step forward to being able to containerize those and would also give us an opportunity to update this API. We also have in the works the Kubernetes Network Interface, KNI, which is a proposal that would bring CNI and Kubernetes potentially closer. CNI itself is container-orchestration agnostic. It doesn't actually relate specifically to Kubernetes; it was actually invented in parallel to Kubernetes. So before Kubernetes won the container orchestration engine war, there were a bunch of different container orchestration engines, and CNI tried to fit the needs of all of them. But Kubernetes is the winner, and we kind of need a way to get a little bit closer. So let's go over the lessons that we have learned. The first one is that sometimes you have political problems. We want to extend the pod network and the pod object to get these items natively in there. And maybe it's not so much a political problem as it is a people-and-intentions kind of problem: what are we trying to solve here? And not everyone sees this exactly the same way. And this is a very core part of Kubernetes. If you've ever used Kubernetes, you've definitely spun up a pod; you've definitely touched a pod object before, or an abstraction of it like a deployment or whatnot. So to extend this, to add this network, is hotly contested. And there's more than that. First, as Doug said before, the thing is that APIs are forever. Not in the sense that you have to keep stuff backwards compatible, but pretty much that Multus exists and solves the problem. And if you want to have multiple interfaces in a pod, Multus is already doing that. Are you actually going to update all the manifests of your deployments, of stuff that is running in production, to comply with this new API? Well, maybe not. So there's this to keep in mind. Next, scope creep.
Everybody wants to solve a different problem, and it's very hard to get focused on, let's say, the least common denominator of the problem space. I mean, just doing that has been extremely challenging over these last six, seven, eight months, a year; I don't know, I lost track of it. And finally, handling a technological problem is a lot simpler than dealing with people and opinions. It's very, very, very easy to clash on those. It's hard enough to choose a restaurant between your four friends going out tonight. So it's even harder to get people to agree on what the API would look like for something so critical and so, let's say, central and paramount as the pod spec, for instance. And here we would really like you to take a look at this demo, but again, yeah, better do that at home. Just scan this or hit the link there. It's short, a couple of minutes, and you'll see how the current effort for native multi-networking looks from the user's perspective. And yeah, that's pretty much it. So any questions you have, well, fire away. Any questions? Can these additional interfaces be used to connect devices that are outside of the data center, via VPN, for example? Which is a problem I've been trying to deal with and couldn't find manageable solutions for. Yeah, thank you. Okay. Whether these interfaces can be used to connect the pod, for example, via VPN to external networks. Oh, yeah, absolutely. Something that you can do is use these to connect to existing resources that are already in your network. So I guess a VPN absolutely could be an example. And oftentimes the reason that people use these additional networks is that they have existing infrastructure, and they go to deploy Kubernetes to take more of a cloud-native approach, but they have legacy systems that they need to integrate with.
So if you've got existing networks, this would be a reason to do that, kind of as a sidecar, so that you could go out to a network like that. Great question. Can you talk a bit more about KNI and how it relates to the multi-network problem? So that is an excellent question. I would say that one thing that we're trying to solve is that the way that, for example, Multus CNI works is somewhat inefficient. You have this flow to create a pod, and in that flow is a call to CNI, which today you would usually use Multus for. When Multus is called, it stops that creation of the pod and goes and queries the API, the Kubernetes API itself. And KNI may be one opportunity to make this flow linear, in order to pass information directly to some type of multi-network solution that already has the information from the API, for example, instead of having to interrupt that flow to call the API. So that's one possibility. Another possibility is that, at least from my perspective, as Miguel said, APIs are forever. So Multus, I'm a maintainer of it, certainly it'll be around for a while. But as we may get Kubernetes-native multi-networking, it may also be a layer to do compatibility between kind of the new way of thinking and the old way of thinking. Is that helpful? Thank you. Thank you.
Declarative Networking in Declarative World
So welcome to the next talk in this track. My name is Mateusz, and I will be talking about declarative networking. Yes, that's very good. We've already spent quite some time talking about how networking is done in Kubernetes. I'm very glad the people from Multus took the hard part of explaining multi-networking at the level of containers. I'm also glad they didn't say anything about host networking, because that's what they don't do, and that's what I do. So we are very smoothly moving down the stack. I work at Red Hat, like they do. I'm based in Switzerland, and when I'm not touching computers I do farming. I actually like it much more, but it doesn't pay my bills so well. Here we are. And I don't do AI; everyone does it, but no.

I will skip the question of why multi-networking, because Federico was talking about this, and if there are reasons for you to do multi-networking, you know that you need to do it. And if you don't, then you don't. It all started because clouds never cared about multi-networking. You go to AWS, GCP, Azure, you pick your three letters. You get a VM, it has a single network interface, and that's it. But at some point you realize you need more network bandwidth and all that kind of stuff, and you just start doing bare metal; it won't fly anywhere else. And once you start doing bare metal and network configuration, you've probably seen, more than once, the very ugly NetworkManager config file. It's just a static file, and the syntax is somewhat opinionated. It's okay once you learn it, but it's still a static file. It flies if you have one server. It flies if you have three servers. But does it still fly if you have 300 servers? I'm not sure. And one problem is that those are all files, and they don't apply changes automatically.
So you modify your file, and until you restart the interface or the machine, you may not even notice that you've made a mistake. You may have some configuration that has flown for the last five years, but in reality it shouldn't, and the only reason it does is that you've never rebooted. There was another talk about this earlier: you shouldn't have your servers run for two years at a time, but that's another story.

So what has been done to change this, so you don't need to modify the file manually? NetworkManager gives you a command, nmcli, and you can modify those configurations using a somewhat nicer syntax. You can say: modify connection, IP address, yada yada. And it has slightly better error handling. As you can see in this example, I never distinguish slash and backslash, so sometimes I will write what I think is /24, but it's not really a slash, it's a backslash, and I will see an error: invalid IP address. That's super easy. But then I fix that, or I think I do, and I'm setting an IP gateway which is not in the subnet of my IP address. That cannot fly; this configuration is utterly wrong, but syntax-wise it's perfectly fine, and the system will allow me to do it. So is that really the state we want to be in? Well, we could discuss. We have some basic protection against some basic bugs, but we could do better.

So we got this tool, nmstatectl. We still live in the realm of NetworkManager, but we want to try to be a bit more declarative now. We want to change the syntax, because in the end we do this for Kubernetes, and Kubernetes has this very nice notion of APIs: everything is well defined, everything is declarative. So let's try making host networking declarative too. How about we create some API which would look almost like a Kubernetes CR and allow changing this?
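The gap described above, a configuration that is syntactically valid but semantically broken, can be sketched in a few lines. This is a toy illustration, not how nmcli or NMState validate anything; the addresses are documentation examples.

```python
import ipaddress

def validate_static_config(address_cidr: str, gateway: str) -> list[str]:
    """Return a list of problems with a static IP configuration.

    Catches both the syntactic error from the talk's nmcli example
    (a malformed address) and the semantic one the tool misses:
    a gateway that is not inside the interface's subnet.
    """
    try:
        iface = ipaddress.ip_interface(address_cidr)
    except ValueError:
        return [f"invalid address: {address_cidr!r}"]
    try:
        gw = ipaddress.ip_address(gateway)
    except ValueError:
        return [f"invalid gateway: {gateway!r}"]
    problems = []
    if gw not in iface.network:
        problems.append(f"gateway {gw} is not in subnet {iface.network}")
    return problems

# The backslash typo is a syntax error and is caught immediately...
print(validate_static_config(r"192.0.2.10\24", "192.0.2.1"))
# ...but a gateway outside the subnet is syntactically fine and needs
# a semantic check like this one:
print(validate_static_config("192.0.2.10/24", "198.51.100.1"))
```

The point of the sketch is that the second call only fails because we model intent (the gateway must be reachable from the subnet), which is exactly the kind of validation a declarative state description makes possible.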
So here's a YAML in which you define your state, and I think this is the biggest improvement over the previous file: you define how you want your network to be configured, and then let's invent a tool which takes care that this configuration really works. I don't want to dig into the details of this example, because it shows some basic configuration: IP address, routes, DNS server, in general the things you always need. I claim that this syntax is much nicer than the syntax of that file. We can argue afterwards; I'll still claim it's nicer. And at this moment there are no containers in the game. We are talking about vanilla Linux; you can do this and never have heard about containers.

But now, how about we wrap it in an API and a kind, and take it to Kubernetes? Let's make a CRD out of this and use everything we built in the previous three minutes to have something that is declarative, something that Kubernetes will be reconciling. In this scenario, and I think it's a pretty descriptive use case, you have multiple network interfaces and you want to bond two of them. Doing that with all the static NetworkManager yada yada is ugly. So how about you just write that you want to create a bond, and let something else make it happen, and let something else make sure that this bond is there all the time, no matter what you do, without you SSHing into your Kubernetes nodes and all that kind of yada yada. Let it be the safeguard that once you define this configuration, it stays there. When you define a pod, you may delete the pod, but if you have a Deployment, a DaemonSet, all that kind of stuff, something is there to recreate the pod. Why can't we have something similar for the networking? Well, we can, so let me do a very short demo. Here is what I have; we will go through the examples in a moment.
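The reconciliation idea above, declared state versus what is actually on the host, can be sketched as a tiny diff function. This is a toy model of the loop a controller runs behind a CRD, with made-up interface names; it is not the real NMState operator.

```python
from dataclasses import dataclass, field

@dataclass
class DesiredState:
    # Hypothetical miniature of an NMState-style spec:
    # interface name -> list of addresses it should carry.
    interfaces: dict = field(default_factory=dict)

def reconcile(desired: DesiredState, observed: dict) -> list[str]:
    """Compute the actions needed to converge the observed host
    network state onto the declared state."""
    actions = []
    for ifname, addrs in desired.interfaces.items():
        have = set(observed.get(ifname, []))
        want = set(addrs)
        for addr in sorted(want - have):
            actions.append(f"add {addr} to {ifname}")
        for addr in sorted(have - want):
            actions.append(f"remove {addr} from {ifname}")
    return actions

desired = DesiredState(interfaces={"eth1": ["192.0.2.10/24"]})
# An admin deleted the address by hand; the next reconcile puts it back,
# which is exactly what the demo later shows.
print(reconcile(desired, {"eth1": []}))  # ['add 192.0.2.10/24 to eth1']
```

Running this loop periodically (or on a kick, as in the demo) is what makes the configuration self-healing instead of a one-shot apply.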
First of all, and this is something I didn't mention: Kubernetes CRs, and the Kubernetes API in general, try to protect you from doing stuff that won't fly. You have very extensive validations at the level of the Kubernetes API, and it's super amazing; I would like to have something like that here too. For example, I will try to configure, on one of the workers of my Kubernetes cluster, some stupid DNS server that simply doesn't exist. For people not familiar with IPv6: I defined here a link-local IPv6 address, so there is a very, very low probability that something like this actually exists in your network. On the other hand, I have this worker; let's just look at the DNS configuration. I'm running my default setup, it's there, and I will now try to apply this configuration, which I told you is wrong; you should trust me that there is no DNS server behind this address.

Okay, so we created this CR, and, okay, it's not quite what I said, because we see in /etc/resolv.conf, which we are watching the whole time, that this address appeared. But that is only because we do a kind of reconciliation of it, and I set a timeout of 20 seconds on this controller. You can see that the CR is in the state Progressing; now 20 seconds have passed, so: failed to configure, and my configuration is reverted. I won't go into the logs of what happened, but you need to trust me that this server doesn't exist, so it makes no sense for my real server in the cluster to get this configuration. So that's it: the tool reverts the change, and you get the feedback that, sorry, we cannot apply this configuration because it's nonsense.

Apart from this, what I can also do: I have another file in which I simply take some additional network interface and add an IP address. Very simple; we do this very often when provisioning servers. Maybe you just got some additional network interfaces installed, or whatever. Doesn't really matter.
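The revert behavior in the DNS demo follows a generic commit-or-revert pattern: apply, probe for a while, roll back if the probe never succeeds before the timeout. Here is a minimal sketch of that pattern with a fake in-memory "state" instead of real DNS; the function names and timings are illustrative, not the operator's API.

```python
import time

def apply_with_rollback(apply, revert, healthy, timeout=20.0, poll=0.5):
    """Apply a change, then probe (e.g. 'can I still resolve names?')
    until the timeout; revert the change if the probe never passes.
    Returns True if the change was kept, False if it was reverted."""
    apply()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if healthy():
            return True          # change verified, keep it
        time.sleep(poll)
    revert()                     # this is where the CR would be marked Degraded
    return False

# Toy usage: a bogus DNS server that never answers gets rolled back.
state = {"dns": "10.0.0.53"}
kept = apply_with_rollback(
    apply=lambda: state.update(dns="fe80::1"),    # nonexistent link-local server
    revert=lambda: state.update(dns="10.0.0.53"),
    healthy=lambda: False,                         # resolution always fails
    timeout=0.1, poll=0.02,
)
print(kept, state["dns"])  # False 10.0.0.53
```

The window between apply and revert is why the bad address briefly shows up in /etc/resolv.conf during the demo before the controller pulls it back out.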
At some point you want to configure the next network interface. So, this server: the output is big, but you want to look at this part. On this interface we don't have an IPv4 address; we only have the link-local IPv6 one, because you always get that. So I'm going to apply this configuration now. That should not be magic: the address appeared. But that's boring; you apply some configuration and it's there. What I will do right now is manually delete this IP address on the server, and then the controller, which sits behind every CRD in Kubernetes, will reapply the configuration so that the IP address is back. Because if I define something via a CRD, I don't want some admin going around my servers and changing the configuration. If we agree that we configure host networking via the Kubernetes API, let it be like this.

So I'm deleting this. We don't have it; we're back to the previous state. Now I will do a small hack for the purpose of this demo, because I realized the timeout is set to five minutes, and we would need to sit here for five minutes to let the controller realize that something changed. So I will just kick the controller. We were on worker two, which is this one, so I just kill the pod. And note that the only thing I did was delete this pod; it's not like I somehow magically applied the configuration again. And we see that the IP address is back. If we just sat here for five minutes, drinking or whatever, the result would be the same. So that's it. Also, for the sake of completeness, I have a setup with a proper DNS server. Well, I already applied that one, so there is no point in doing it again, but you've seen the wrong one, so you'll have to trust me that the good one would also be configured there. And we're back at the slides. That concludes the demo. Some final remarks, because, yeah, this was really a lightning talk.
So, all this stuff I showed you: the backend, which is NMState, is written in Rust, because why not, because we can. It uses NetworkManager as a backend, which is pretty obvious, but we could discuss that, and this is something that could come later. Today it works using NetworkManager, because that's what I do and what the customers I have behind me want. But if there is someone without NetworkManager, with a strong reason not to use NetworkManager, who would still like to have this, we can discuss it, and I would be very happy to. Of course there is a Kubernetes operator behind this, because that's what I just demoed. And you are not bound to using this as a CLI tool and that kind of stuff; there is a usable API, so you can get a Rust crate, a Golang library, a Python library, probably something else, but those are the most popular, and I assume those three should make everyone in this audience happy.

And we have a moment for questions. If you want to talk more about this, you can find me on the Kubernetes Slack. My personal wish would be that Kubernetes, and we know this from the previous two talks, never really cared about managing host networking; no one really wanted to take it into the realm of Kubernetes. It's not that I expect we'll get this API into upstream Kubernetes now, but I wish we would. So yeah. Maybe we have time for just one question.

With networking you can do the worst thing, which is to pull down the very network you manage the node over. What if, for example, you misconfigure the IP address of a node and the node becomes unreachable from the controller? Can all of that be fixed? Yeah, so this is exactly what I showed with the DNS example.
I could have shown it with an IP address: if you create a CR that configures, for example, an IP address and gateway, and applying that configuration would make your node unreachable, then we revert the configuration exactly like we reverted the DNS. That's the purpose of the operator: it has safety checks, so after it applies a configuration, it checks whether all the connectivity still works as before. In this case we had DNS, so it applied the new DNS and was checking in the background: can I still resolve names? After 20 seconds it realized: no, I cannot; I'm reverting this change, and the CR is marked as Degraded. Exactly the same would happen if you set an IP address and lose connectivity there. All right, thank you. Great, thanks.
Remediating thousands of untracked security vulnerabilities in nixpkgs
Okay, up next we have delroth. He's going to be talking about improving security in nixpkgs. Thank you. Is the microphone actually working? Yes, good. So, I'm delroth. I've been working on nixpkgs for a few years now, I've been involved in some security remediation efforts, and recently I've been working on NixOS infrastructure.

It all comes from a story, so let's start with a story. Sometime last year, this vulnerability dropped kind of silently. Chrome released an update saying: hey, we released an update, and you should really update today, because people have actually been exploiting this in the wild, and it gives code execution. The interesting thing is that we actually patched that in Chrome really quickly, but then folks in nixpkgs and NixOS, and some other distros, started realizing that it's not actually a Chrome vulnerability. It was reported everywhere as a Chrome vulnerability at the time, but it was actually a vulnerability in a dependency of Chrome, namely libwebp, which is an image parsing library. So we patch libwebp, not just Chrome, and everything is solved, right? Everyone depends on libwebp, so when its version is updated in nixpkgs, everything gets rebuilt and picks up the update. Everything is magical and we don't have to do anything else.

Except it then took about a month of work to actually make this happen. This is the tracking issue, linked at the bottom, for trying to fix this vulnerability in nixpkgs: not just in Chrome, not just in libwebp itself, but in everything else that's in nixpkgs. I've highlighted part of it here: some applications bundle their own version of libwebp, and each of these needs to be updated separately by nixpkgs maintainers. And that's not just Chrome; that includes some other web browsers, for example. It didn't include Firefox, but it did include Thunderbird, because the packaging was slightly different, et cetera.
"See below for a list of all the known applications that need an update, and their status." So this is that list. And yeah, as you can see, this was about a month of work. I'm not going to go through the whole list, but there's a lot of stuff, and this list is probably not even complete, because, as I'll get to, we lack tooling on this, we lack statistics, we lack data. So this talk is trying to give an overview of this problem, to bring awareness to it, and to bring up a few solutions for how we could actually do things better.

So why is this happening? Why did we have to fix so many things? This is a phenomenon known as vendoring. Vendoring is when a piece of software decides that, instead of depending on a library it gets through, say, pkg-config, or through the general build environment provided to it, it just copies the source code of that library somewhere into its own source tree. Then it's easier to build, because people don't have to install dependencies. The problem is that, since the version of the dependency is now pinned, whenever an update needs to happen, it doesn't just need to happen in the dependency; it needs to happen in every project that has pinned the version of the dependency by copying it into its source repository.

To some extent, vendoring also happens with lock files. Lock files aren't exactly the same thing, because technically you're not copying the source code; you're just enforcing that your software can only be built with one specific version of a library, sometimes even pinning the hash of the source code or binaries that must be used to build the software. In practice you're not copying the source code, just its hash, but you're still pinning it and making it basically impossible to do any kind of update.
So, for this specific libwebp vulnerability, we spent about a month on this, and we still did not fix everything. That's the sad part: tens of people, I don't actually have a full list, but tens of people spent probably hundreds of hours combined on this, trying to fix software in nixpkgs whose maintainers weren't super active, or chasing upstream. Because if upstream has copied a version of libwebp into their source code, you need them to actually go and fix the problem, or you need to apply patches, but patches are fragile; they'll just break on the next update, et cetera. And even though we spent hundreds of hours on this, we did not actually fix everything. We fixed, I think, about 50% by count of the number of packages. What we did do is spend some time categorizing: okay, these are the actual high-risk things that are likely to get exploited, that are connected to the internet and parsing untrusted input; and then there is the rest, where maybe we can get away with not updating it now, and sometime in the future upstream will realize they're actually vulnerable to something and maybe fix it. But even if you look only at the things we categorized as high risk, we had some packages in there that did not get fixed. We had to mark them as insecure in nixpkgs, because even though they are internet-facing software that parses untrusted image files, email clients for example, they did not get an update within a month for a critical vulnerability that the Chrome people were saying was exploited in the wild.

So, let's play a little game, maybe with some audience participation. How many libwebp copies are in nixpkgs?
So I've counted. I've actually been building some tooling as part of the remediation for this libwebp vulnerability, so we now have a better idea of how many packages copy libraries that we already have in nixpkgs. For libwebp, we had about 116 different packages. By "package", and we had a whole talk about what a package is, I mean something that is built by Hydra: so nixpkgs content that is not non-free and not marked as insecure, and I'm counting only one architecture, grouping by package name. So we have about 116 libwebp copies, but libwebp is actually fairly recent; WebP is a modern image format. What about libpng, which is significantly older? 203. Or libjpeg, which is maybe even more common than libpng? 205. And zlib? zlib is a really small C library that people have been using for maybe 30 years to decompress gz files and zip files and stuff like that. We have about 761 copies of it throughout nixpkgs.

So let's say there were, for example, a vulnerability in libpng. How do we go and fix it? Well, given that it took about a month for the 116 packages with libwebp, and we got about 50% of them, I guess libpng would take maybe two months, and we'd also only get about 50% of them. Not really a great outcome.

So, is this actually a problem? How often do these libraries actually have vulnerabilities? And okay, we have copies of them, but maybe they're being kept up to date and it's not actually that bad. Well, here is, for libpng, a grouping by version. It turns out we have enough information to figure out which version of libpng is embedded in all of these packages across nixpkgs. And you'll see that the top of the distribution is very much recent versions: 1.6.37, 1.6.39, 1.6.40. That's actually pretty good; nixpkgs, unsurprisingly, is at 1.6.40 right now.
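The grouping-by-version and release-date analysis described here is a small aggregation step once you have the scan data. A toy sketch with entirely made-up package names and only illustrative release dates:

```python
from collections import Counter
from datetime import date

# Hypothetical scan results: package -> embedded libpng version string.
embedded = {
    "app-a": "1.6.40", "app-b": "1.6.40", "app-c": "1.6.39",
    "app-d": "1.6.37", "ancient-app": "1.0.15",
}
# Illustrative release dates for a couple of versions (not authoritative).
released = {"1.6.40": date(2023, 6, 21), "1.0.15": date(2004, 1, 1)}

# Version distribution: how current are the vendored copies overall?
dist = Counter(embedded.values())
print(dist.most_common())

# Flag packages embedding a version released more than ~10 years ago;
# unknown release dates are treated as "not known to be stale".
stale = [p for p, v in embedded.items()
         if released.get(v, date.max).year < 2014]
print(stale)  # ['ancient-app']
```

The same two-step summary (histogram by version, then outliers by age) is what surfaces things like a libpng from 2004 still being built today.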
There's also a 1.7.0 in there; I don't know if it's a git version, but it's definitely not as used as the others. What I've also looked at is the release date of some of these versions, and some of them were released more than 10 years ago. We actually have two packages in nixpkgs right now using a libpng version from 2004, which is kind of impressive. Was x86-64 even a thing at the time? I don't know. But somehow it works. Actually, I've not tested it, maybe it doesn't work, but it's in there: you can nix-build it and you'll get a binary with a libpng from 2004. Does it have vulnerabilities? Yes. There are about 12 different critical CVEs against it that give code execution, buffer overflows. Some of this might be mitigated these days, because we have vulnerability mitigations as part of the operating system, as part of compilers, and so on, so it's not exactly clear how many of these vulnerabilities still apply to such old versions. Another thing is that there isn't really anyone right now who finds a new vulnerability in libpng and says: oh, I'm going to test it on this version from 2004, just to see if it actually applies. So a lot of the vulnerability databases are out of date and don't contain the right information to even check against that.

I've mentioned lock files. Lock files are kind of a new problem. It turns out, if you go back 10 years, we didn't have software in Rust and Go and JavaScript, at least not as much as we do now. Java kind of did lock files a bit with Maven, even back then, but it's mostly a new phenomenon. The good thing with lock files is that it's actually really easy to get the full transitive list of all the dependencies, because they're in the lock file. That doesn't mean people are any better at actually managing their dependencies, unfortunately, even though there is good tooling to do so.
For example, for Rust there is this tool called cargo-audit. cargo-audit takes a Cargo.lock file and tells you all of the known vulnerabilities that apply to it. So I used some tooling that I wrote to go through every single Rust package currently in nixpkgs, which means looking at every derivation that has a cargoDeps in it and extracting the lock file from that. What we find by doing this is that 62% of all Rust packages in nixpkgs right now have at least one vulnerable dependency pinned in a lock file. I'm describing this as a nixpkgs problem, but it's not entirely a nixpkgs problem; we're just fetching from upstream. It's upstream that is locking the dependencies; we don't really have control over that the way we do for Python and C/C++ dependencies, and upstream is just not doing as good a job as distributions were doing. That's around 1,100 vulnerable locked dependencies, and about 750 of them are of high or critical severity based on CVSS score, which is an imperfect metric, but it's about as good as we have. So yeah: if you pick a Rust package in nixpkgs at random, there's a 40% chance that one of its dependencies has at least one known high or critical vulnerability. That doesn't mean it's exploitable, but let's say even one percent of these are exploitable; that's still seven packages in nixpkgs with exploitable high or critical vulnerabilities. That's still not good. And one percent is just a random number I picked.

Yeah, so as I mentioned, it's a general open source ecosystem problem. I don't know that this specific lock file thing is something we can fix on the nixpkgs side. nixpkgs does share some fault: we have some Rust software that's just out of date, for example, and then the lock file is also out of date.
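What a scan like this does is mechanically simple: compare the exact versions pinned in a lock file against an advisory database. Here is a toy model of that cross-reference; the advisory entries, package names, and versions are all invented for illustration, and real tools like cargo-audit match full semver ranges rather than a single "first patched version".

```python
def parse_version(v: str) -> tuple:
    """Turn '0.9.2' into (0, 9, 2) so versions compare numerically."""
    return tuple(int(x) for x in v.split("."))

# Made-up advisory DB: (package, first patched version, severity).
ADVISORIES = [
    ("image-sys", "0.9.3", "critical"),
    ("tiny-http", "0.12.1", "high"),
]

def audit(locked: dict[str, str]) -> list[tuple[str, str]]:
    """Report (package, severity) for every locked dependency whose
    pinned version is older than the first patched version."""
    findings = []
    for pkg, patched, severity in ADVISORIES:
        if pkg in locked and parse_version(locked[pkg]) < parse_version(patched):
            findings.append((pkg, severity))
    return findings

lock = {"image-sys": "0.9.2", "tiny-http": "0.12.1", "serde": "1.0.195"}
print(audit(lock))  # [('image-sys', 'critical')]
```

Because the lock file gives the full transitive dependency list, this check is cheap; the hard part, as the talk stresses, is that fixing a finding requires upstream to regenerate the lock.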
But from the ones I've manually inspected, that's not the majority of cases. In the majority of cases, nixpkgs is packaging the latest version from upstream, and it just contains insecure dependencies.

What is causing vendoring in nixpkgs? A few things. We don't actually try to prevent it. I've checked, and I was really surprised: nixpkgs does not have any documentation, any policy against vendoring. There's nothing that says: if a piece of software has an option to use the system libpng instead of its own bundled version, we should prefer that option. A lot of people do it because it's good practice, but not everyone does. And we don't really have a way to prevent it for the newer language ecosystems, Go, Rust, JavaScript, as I mentioned. You don't have a choice there; you just have to vendor, because we don't have Rust libraries in nixpkgs, we just have the leaf software. Same for Go, same for JavaScript. Well, for JavaScript it's "now": we used to have node2nix for a while, which kind of generated Nix derivations for libraries, but it was automatically generated anyway, so it's not like we could really do much about it. And finally, until recently we didn't have any tooling to detect and measure this problem, so it was just hidden below the waterline. We couldn't go and say: oh hey, there's a new derivation being proposed, a new package being added to nixpkgs, is it actually vendoring anything? People would have to go and check manually, and nobody was doing that when reviewing packages, because it's just a lot of effort. This is potentially something we could now do automatically with some of the newer tooling I've been writing.

As I mentioned, we don't have policies against vendoring, and it's even worse than that in nixpkgs.
We don't really have policies requiring even building from source. It's preferred, but there's actually no preference expressed anywhere; I checked again today and could not find it. So people just go and fetch AppImages, for example. Upstream ships an AppImage, it's too complicated to build, so let's just fetch the AppImage and run patchelf to fix the paths to dependencies. The problem is, you don't really know what libraries, what dependencies, upstream used to build those AppImages, and it's usually not great. This is something we hit with libwebp: Anki, for example, the flashcard software, we were just using the AppImage for it, and it was vulnerable because it was built in some build environment from 2018 or something, which was of course not receiving any security updates. We fetch deb files. People are very creative about how to get binaries. We fetch tar.gz files, we fetch static Go binaries. Let's not even talk about JavaScript, where you can just fetch a tarball and unpack it somewhere, and it's fine, because when would JavaScript software ever have vulnerabilities? Yeah.

Some of the distros have famously strong preferences for building from source. Debian has really good policies regarding vendoring, which have always been kind of the gold standard in the distribution world, I think. We should probably do some of that. How do we address Rust, Go, NPM, et cetera? I don't think we can; I think it's an upstream problem. But what we could probably do is make it clearer to users that they're actually using insecure software. It's not really a problem that other big distros have been hitting as much, just because nixpkgs is much bigger in scope. We put everything in nixpkgs. We don't have an AUR. I mean, there is the NUR, which some people use, but the bar for what goes into nixpkgs is very low, right?
We don't actually have many policies like: let's just keep this out of nixpkgs because it's not well maintained by upstream. We don't really do that much. So, by being a huge package set, we have the problem that we have pretty bad software that's not really being kept up to date; we have stuff that's just not maintained by upstream anymore. And I feel the way we should fix the lock file insecurity problem in nixpkgs is by making sure that if upstream isn't actually maintaining the lock files, we inform the users and make them aware of the risks. We currently have this knownVulnerabilities bit that we can put on a package. The problem is that it's extremely crude and extremely annoying to work around: it stops evaluation. It's not a warning, it's a critical error: you're using an insecure package. So what people do is just allow every insecure package, because that's the easiest way to work around the error. So, yeah.

Tooling: as I mentioned, until recently we didn't really have any way to detect this. I have actually written a bunch of things to try and detect vendoring. One tool is called grep-nixos-cache, and it does what the name says: it greps the NixOS cache. It takes a list of store paths, which we get from Hydra, and it goes and fetches every single store path that Hydra has built, usually a few hundred thousand, and runs some signature matching on them. It looks for strings that appear in the implementation of specific libraries; if a library has been vendored or statically linked, you will find the string in there, and sometimes you can even extract version numbers and things like that. And another project I've been working on is...
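The signature-matching idea can be sketched in a few lines: many libraries embed an identifying string, often with a version, in their compiled output. This is a toy model only; the regex patterns and the fake "store path" bytes below are invented, not the actual signatures grep-nixos-cache uses.

```python
import re

# Hypothetical signatures: library -> pattern matching an identifying
# string (with a version capture group) embedded in binaries.
SIGNATURES = {
    "libpng": re.compile(rb"libpng version (\d+\.\d+\.\d+)"),
    "zlib":   re.compile(rb"deflate (\d+\.\d+\.\d+) Copyright"),
}

def scan_blob(blob: bytes) -> dict[str, str]:
    """Return {library: version} for every signature found in a blob,
    e.g. the contents of a built store path."""
    hits = {}
    for lib, pat in SIGNATURES.items():
        m = pat.search(blob)
        if m:
            hits[lib] = m.group(1).decode()
    return hits

# A fake binary with a vendored, ancient libpng statically linked in:
fake_binary = b"\x7fELF...libpng version 1.0.15 in this image...more bytes"
print(scan_blob(fake_binary))  # {'libpng': '1.0.15'}
```

Run over every Hydra-built store path, this kind of scan is what produces the per-library copy counts and version distributions from earlier in the talk.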
I've tentatively called it the nixpkgs vendored vulnerability scan, because I wanted it to also cover the binary vendoring detection above, but right now it specifically does lock file analysis: finding all of the lock files for Rust, JavaScript and so on, and doing automatic vulnerability detection based on them.

Conclusion. We have new tooling. We have a better idea of how vendoring goes in nixpkgs, and it's not great. And it's a problem, because we cannot actually fix security vulnerabilities in base libraries right now. We tell ourselves that we did, by fixing the library itself, while we still have 100 different instances of that library unpatched. How to fix it? Well, awareness: now all of you know about this, and when you review new packages, maybe look at whether this is happening. More discipline: I think we should have policies about this. I have not thought about the exact policy we should have yet, but we should probably have one. And better reporting for the cases we cannot fix ourselves, which is the insecure lock files most of the time. Yeah, here we go. If you have any thoughts or comments, please ask questions now. Otherwise, here is my contact info and some links to the tooling. Thank you.

Thank you for that. That was terrifying for someone who's come from the Debian world, and exciting as well. There's a really simple social approach we could take: add another tick box to the default pull request template that says, have you checked that there's no vendored crap that could be using a system library? I think it would help.

At some point, if we just keep adding checkboxes, people will just ignore them; people are already ignoring a lot of them. Has anyone actually checked the sandbox box in the pull request template anytime recently? I see two people raise their hands; the rest of us have never touched that checkbox.
Yeah. I'd prefer if it was automated through tooling, if we could detect some of it automatically. And I'd prefer if we at least fixed the policy first, and maybe figured out the actual edge cases, before we start asking people to look for stuff without being accurate about what to look for. But yes, we probably should be doing some variant of this.

One of my favorite things to do is package, archive and preserve old software, and some of that work has been done as pull requests in nixpkgs. Sometimes it doesn't get merged because it has an old dependency on, like, Qt4 or something, so they say: oh no, we can't merge this. That does keep some software out of nixpkgs, but there is still a lot of software in there that managed to sneak in. And since we don't have a policy, it's kind of ad hoc: some things get in, some things don't. Some people launch crusades against old Python versions because they don't want them there. It's kind of messy. So what do you think about it? Because I think there's a real value proposition in archiving old software, because tarballs.nixos.org will archive the source code; it will be around forever, and you'll be able to reproduce it in 20 years. So what do you think about banning old stuff and only striving for perfection, versus keeping everything in nixpkgs and just accepting everything?

Yeah, I don't think we should necessarily be striving for perfection.
Like, I don't think we can anyway, but the problem right now is, so for the case of old software, for example, usually one of the things that's blocking old software is when they use library versions that, you know, when they have so many dependencies in nixpkgs that having this old software will induce costs on other maintainers, because they have to care about this old software that will never actually be updated to, like, use a new API or something like that in the library. And so I think that we should figure out a way to include this old software, or this, like, you know, less-maintained software, in a way that doesn't use up the bandwidth of all the maintainers. Right now, we don't have a way to distinguish this software from, you know, stuff that more people care about, more people use, which means that whenever there is security remediation that needs to happen, the people doing the security remediation don't have a way to distinguish these things. And we use our bandwidth on stuff that maybe is, like, your old software and stuff like this. That's why I think that we should have better categorization, better ways to inform the users about which category a given piece of software falls into. We don't really have any of this right now in nixpkgs. And I think it's, I don't know how we've managed to survive that long without having such a system. I think we just burn a bunch of maintainer time on stuff that, really, we shouldn't, and we should just accept to be broken. Hi. As you went around interacting with upstream to have these sorts of issues fixed, I'm pretty sure some of these things were also things that other distros were also dealing with and working on. As you encountered them, what sorts of things did you see in terms of those interactions with upstream, where you had maybe requests coming from other people in kind of a similar position as you, but just from other distros? Yeah.
So a lot of the cases where I've actually had to contact upstream myself were stuff that was not actually packaged in other distros, just because, you know, Debian doesn't actually package .NET software, for example, or doesn't package much Go software, surprisingly. Like, if you want to get Grafana from Debian, I think they still won't have that in their repositories. I mean, it's not free anymore, so they have a good reason now, but, you know, they never did, right? Do they even package Prometheus? Right? Like some pretty base software that people would expect to be able to apt-get. You have to use some external repositories because they don't have the right tools to package Go code. So I think because nixpkgs is much broader in scope, we have a lot more things to care about, and we've had to do a lot more of the talking to upstream. There is stuff that has been useful to all the distros, and, like, you know, other distros have contacted upstream before us in some cases, and usually when we do contact upstreams, they are receptive to this. The problem is when they just don't reply. So, for example, for WebP, we had the issue that the main library that people are using to use WebP in Go was just unmaintained, and we filed the bug, and the maintainer has just still not replied to it to this day. So you have 500 users of this library that indirectly have a vendored, vulnerable WebP version. What do we do? So we had to go and manually contact some other users of this and say, like, hey, you use a library that's not actually maintained anymore. You should fix that. And suddenly it's like, you know, the tree of things you need to contact grows and grows. It's a complicated problem. But does it add value? It does. It does add value. Yes. I mean, it adds value to, like, the whole software ecosystem. It's just not a nixpkgs thing. But it's tiring, right?
You know, it's not feasible that we would be the only people caring about this for every single vulnerability. Great. Let's have another round of applause for Delroth. Thank you. Thank you.
Nix for genetics : powering a bioinformatics pipeline
We have now up next Alexi talking about Nix for bioinformatics pipelines. So thank you everyone for coming. For five minutes I will try to make a kind of different presentation and try to say how Nix can help save patients. It's not a clickbait title, I promise. So I am a doctor in training, but I also have a background in computer science, so it's a kind of mixed presentation, and I'm working in France in Besançon Hospital. So when we are dealing with patients we want basically three things. First we want to give accurate results, because for these patients a diagnosis can be life changing. Second we need to be reproducible, because all the doctors trust us with giving accurate results every time. Finally we want to be as fast as possible, because there is a high demand for results. I'm working in a rare disease setup where obviously things are rare, so it's hard to find, and how do we do it? Well, it's a mix of computer science and expertise and state of the art technology. So here is a very rough scheme of how everything works. We start from a blood sample of a patient and we try to extract the DNA and sequence it on this machine thing. Unfortunately the machine doesn't do everything and we need some bioinformatics in there. And also the bioinformatics doesn't do everything either. We need a human at the end of the pipeline, which is why there is a CSV file that a human has to read. And basically what the bioinformatics setup does is that it figures out a list of candidates for diagnosis and tries to filter down the results. For example it can go from one million candidates to a thousand. If it filters too much we can miss the diagnosis. If it doesn't filter enough, well, the human will have a really hard time trying to parse the CSV. When you say pipeline, it's a really fancy word for just a set of command-line utility tools, but we also have databases in there that are, in our setup, just text files, compressed.
And when I say pipeline, we just feed data from one CLI tool to another. And now, how can Nix help with this? Well, as a medical lab we have to be reproducible. It's like in the law. So Nix is a perfect fit, because we can pin the software and its dependencies, and the dependencies' dependencies, like, byte by byte. So that's done. Then it would be great if we could run on the high performance computing cluster. And in our region, the folks running our cluster agreed to install Nix. And now we can run our current production with Nix there. Two things we didn't do with Nix. One was to manage the whole workflow. There is actually a tool for that in Nix, but it's more like a niche thing, so we preferred to use a more common tool. And the final thing, what we could do in Nix but didn't, is to manage these large databases, because in our setup they live in a different folder from the Nix store, so we cannot install them that way. But it's doable in Nix. Last thing. I really enjoyed the community. It was a really nice interaction. I'm sure everyone knows. But it's also kind of a slow process, because I tried to package something myself, which is not easy at the beginning. And as you know there are like 5,000 pull requests on GitHub, so feedback can sometimes be a bit slow, and also I'm working on this in my spare time, so it can also take some time sometimes. But for example the support for large databases has been added after a few conversations on Matrix. It was really fast. I hope you take some key points from this, but if you want to know more you can send me an email and I'll be glad to answer. Thank you.
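As a concrete sketch of what such a pinned environment could look like: a single Nix expression gives every collaborator, and the HPC cluster, the exact same pipeline tools. The tool choice below is an assumption for illustration (bwa, samtools and bcftools are common sequencing tools packaged in nixpkgs), and the pinned channel URL is a placeholder.

```nix
# Illustrative shell.nix: every CLI tool in the pipeline comes from one
# pinned nixpkgs revision, so everyone gets the same binaries.
let
  pkgs = import (fetchTarball {
    # Pinning by URL; the revision here is a placeholder, not the
    # hospital's actual pin.
    url = "https://github.com/NixOS/nixpkgs/archive/nixos-23.11.tar.gz";
  }) { };
in
pkgs.mkShell {
  # Hypothetical tool selection for a DNA-sequencing pipeline.
  packages = [ pkgs.bwa pkgs.samtools pkgs.bcftools ];
}
```

Entering this environment with `nix-shell` then yields the same tool versions on a laptop or on the cluster, which is what makes the legally required reproducibility practical.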
Automatic boot assessment with boot counting
Hi, can you hear me? Up next we have Julien with automatic boot assessment. Okay, hello everyone. So today I'm going to... my name is Julien Malka and I'm a PhD student at Télécom Paris. And today I'm going to talk about automatic boot counting — automatic boot assessment with boot counting, sorry. And so what I will talk about is why we need automatic boot assessment, what automatic boot assessment is, and one implementation of it, which is systemd-boot's boot counting, and I'll show a demo. So why do we need automatic boot assessment? Because we are using NixOS, we have something I call the NixOS benediction. It's very difficult to break your system. You really have to want it, to break your system. And even if you mess up your NixOS configuration, you can just roll back to a past generation and be saved by the NixOS magic. But sometimes this benediction has limits. Let's say you are the administrator of a remote server, and you perform some kind of server update, let's say a kernel update. And you mess up: you choose a kernel that cannot boot your root partition. And at the next boot, what is going to happen is that it's going to fail to boot. And if you don't have any physical access, like a BMC, then you will need physical intervention to revive the server. So this is the kind of problem we solve with automatic boot assessment. So boot assessment is any kind of technology, really, that can automatically assess whether a boot entry is bootable or not. And we have one example, which is systemd-boot's boot counting. So boot counting is a feature of, as I said, systemd-boot. And the idea is the following. Each boot entry has a counter when created. Each time systemd-boot tries an entry, the counter for this entry is decreased by one. If the entry is booted successfully — and I will define what booted successfully means — then the counters are removed permanently.
But if the counters for an entry ever reach zero, then the entry is marked as bad, and it's sorted at the end of the boot menu. Let me go just a little bit more in depth into how this works. The counters are embedded in the entries' file names. So you have the file name, then you have the plus separator, then the number of remaining trials, then the number of failed trials. So this is generation nine: it has four remaining trials and one failed, and it had, at the beginning, five trials set. Counters are decreased by systemd-boot when it boots the entries, by simply renaming the file. And you have to define some definition of a successful boot, by scheduling whatever you want: your unit needs to be started successfully before the boot-complete target. So when the boot-complete target is reached, the entry is renamed by the systemd-bless-boot unit, which is going to remove the counters. And we are done with this entry; we consider it good forever. Okay, let me show you a demo. Right, so here I am in a VM, I am booted, and I'll show you that in the configuration.nix, I have enabled the feature and set the number of trials to be two for any entry. The VM is booted successfully, but I will make a massive mistake — I am emulating a mistake. I have a bcachefs file system and I will redeclare it as ext4. So it means that now this partition will definitely not get mounted, and when I rebuild, it will even change the kernel to a kernel not supporting bcachefs. So now it's rebuilding my configuration. You see, when I am done rebuilding, I get no error, no nothing. I think everything is good. I show you the boot entries. You have five boot entries, and the last one has a counter: you see the -5+2 — two trials for this entry. And now I will reboot this VM. So what happens when I reboot? At the beginning, everything is fine. My generation five is sorted first. It will try to boot it. Kernel crashes. It reboots.
Now it's still sorted first, because we have two trials for this entry. Again, kernel crashes, reboots. And now you will see it's sorted last, and we are going to boot generation number four. And of course we are going to boot it successfully, and that's it, that's the feature. It's available currently as a PR. It will be merged very soon and be available in the next NixOS stable release. Thank you.
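The "booted successfully" criterion above relies on systemd's boot-complete.target: the counters in a file name like nixos-generation-9+4-1.conf are only stripped (by systemd-bless-boot.service) once every unit ordered before that target has succeeded. A hedged NixOS-module sketch of such a criterion — the service and its check command are made up for illustration, and the option for enabling the counting feature itself comes from the PR and is not shown:

```nix
# Hypothetical NixOS module: the boot is only "blessed" if this
# one-shot check succeeds before boot-complete.target is reached.
{ pkgs, ... }:
{
  systemd.services.my-boot-check = {
    description = "Custom successful-boot criterion (illustrative)";
    # A failure here keeps the entry's counters, so after the configured
    # number of tries systemd-boot will sort the entry last.
    requiredBy = [ "boot-complete.target" ];
    before = [ "boot-complete.target" ];
    serviceConfig = {
      Type = "oneshot";
      # Placeholder check standing in for a real health test.
      ExecStart = "${pkgs.coreutils}/bin/test -d /etc";
    };
  };
}
```

Any real check (network up, service responding, filesystem mounted) can be slotted into `ExecStart` in the same way.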
Typhon: Nix-based continuous integration
Hi everyone. So today we're going to speak about Typhon, our software for Nix-based continuous integration. Let's say, for the sake of the argument, you're a Nix enthusiast, and you're asked to set up CI at work. So what do you do? You convince your boss to use Nix, because that's great. And you install Hydra. It's the de facto software for CI with Nix. So your job is fantastic using Nix, but soon you realize that not everything is perfect. Because first you need to install the thing, and it's not easy — it's painful — and you have to make some upfront choices. Then you need to configure the plugins, and each time you change the configuration, you need to redeploy the thing. Also, it's hard, because when you want to change a plugin, you actually need to write Perl scripts, and you need to redeploy it again. Last thing, when you want to do deployment, all you get is this RunCommand thing, which is a bit hard to use and not super stable, and which you don't really like. So you start to dream about something much more simple, something declarative maybe. Maybe you want your plugins to be user defined, basically, with Nix maybe, and you would like some better deployment, more in line with the Nix philosophy, with declarativity and reproducibility. Okay, so in this dream, what does it look like to configure CI for a project? Well, at first it looks a lot like it does in Hydra. You set up an attribute set of derivations which are going to constitute your jobs. But then you write a Nix expression for your project that looks a lot like this one. So here the makeGithubProject function takes all the information that's needed for a GitHub workflow: the repository, of course, and some arbitrary deployment rules. And of course you're going to need secrets like GitHub tokens and SSH keys to set GitHub statuses and do remote deployment. This expression is fed to Typhon through the flake URL.
And once Typhon has spawned your jobs, it's going to use the project expression to build actions. So actions are scripts, which are user defined and Nix-built. They are run in a sandbox and triggered by Typhon on various occasions to provide the features that would be provided by Hydra's plugins. For instance, the most important hooks triggered by Typhon are before and after every job, to set statuses, of course, or do any kind of deployment. In a little bit more detail, an action is sandboxed, with only access to the store and to the Internet. It does not have access to the local machine, so for instance it does not have access to secrets for other projects. It takes JSON as input, containing the decrypted secrets and of course contextual information about your job. And it outputs JSON to communicate with Typhon. Thanks to actions, Typhon is completely forge-agnostic. Actually, all the communication between Typhon and the forge is done through actions, meaning Typhon can fit a lot of different workflows. But how do you write actions? Well, of course, you use Typhon's Nix library that lives in Typhon's flake. It would be quite frugal at the beginning, but soon it would grow to fit a lot of different forges and various kinds of deployments. And the goal would be, of course, to have an ecosystem of actions like we do for GitHub Actions, but much better, and using Nix instead of YAML. A few words about how you would code something like this. Of course, you would use Rust, to get some cool technologies like Actix and Diesel for the back end, and a nice web app using Leptos. And so you would start coding, and soon you would have a prototype. Soon the prototype would run CI for itself. So it would be time to present the project to the Nix community at FOSDEM and tell people to try it. You would still warn them, though: it's still a prototype.
Everything you talked about today is maybe not yet fully implemented, but still, it's ready for beta, and you're waiting for feedback, for issues — a lot of issues — maybe even contributions to the actions library. And all that would be left for you to do is to thank everyone for listening to you.
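To make the shape of such a project configuration concrete, here is a hedged sketch of what a Typhon project flake could look like. The attribute and function names (makeGithubProject, typhonProject, typhonJobs) and the input URL follow the talk's description but are assumptions, not Typhon's confirmed API:

```nix
# Hypothetical Typhon project flake (all names are illustrative).
{
  inputs.typhon.url = "github:example/typhon";   # placeholder URL
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-23.11";
  outputs = { self, typhon, nixpkgs }: {
    # The project: forge information, deployment rules, and secrets.
    typhonProject = typhon.lib.makeGithubProject {
      owner = "example-org";
      repo = "example-repo";
      secrets = ./secrets.age;  # encrypted GitHub token, SSH keys
    };
    # CI jobs are a plain attribute set of derivations, as in Hydra.
    typhonJobs = {
      build = nixpkgs.legacyPackages.x86_64-linux.hello;
    };
  };
}
```

The point of the design is visible even in this sketch: the forge integration and the jobs are both ordinary Nix values, so changing "plugins" means editing an expression, not redeploying the CI server.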
rix: an R package for reproducible dev environments with Nix
Alright, hello everyone. My name is Bruno. I'm a statistician and data scientist, data janitor, whatever you call it, in Luxembourg. Are there some people that use the R programming language here? Statistics? I do see some of you. Okay, cool. Maybe this will interest you then. So what is R, very quickly? R is this programming language that's been around for 30 years. It's like a FLOSS implementation of S, and it's mainly and mostly used for statistics, machine learning, data science and that kind of thing. And it comes with all these built-in objects that we like very much when we work with these things, which is data frames, matrices, formulas, models, etc. So that's all built into the language. There is like a little hello world. You can, with the base language, do linear regressions, so you can load data frames or CSV files very easily. You have formulas that define your model very easily, and you can do that with the base language. But you can also extend the language with packages, and these are really called packages. So you have dplyr, you have tidyr — these are very popular packages for data manipulation — but there are many others. And this here is like a typical data manipulation pipeline in R. So you start with your data frame and you keep passing functions to it with arguments, and you do your aggregations, you do whatever you want. And so we have, as of writing, around 23,000 packages that are available through the two biggest main package sets, if you want: CRAN and Bioconductor. I wrote that all are available through nixpkgs. I don't think that's entirely accurate; I think not all packages are available, but most of them are. Personally, I've never found a package that wasn't available through nixpkgs. So what this means, then, is that we could use Nix to set up an environment with R, with the packages that we need, etc., and use that to work.
But that's not really a thing in the R ecosystem, this per-project environment. If you use Python for data science, very typically you will see people start with a virtual environment with a specific version of Python, specific versions of packages. That's not really a thing in R. At most, what R users do is per-project libraries of packages, right? That's a thing. And if you need more, people would typically use Docker, and there's been the Rocker project for that, which really popularized the use of Docker in the R ecosystem. That being said, with a colleague called Philipp Baumann, I wrote the rix package. So rix is itself an R package which provides this really familiar interface to R users. It's a standard function. You can specify the R version that you want. You can specify the packages that you want. These packages can come from CRAN, can come from Bioconductor. They can come from GitHub as well, if they are hosted on GitHub only. You can set up TeX packages as well, typically a thing that R programmers want. And system packages — we called it like this, maybe it's not the best name, but this would be kind of other tools: if you need Git, if you need whatever, you can add it there as well. And you can specify IDEs, because for RStudio, which is a popular IDE for R programming, there's like a wrapper that needs to be installed as well. So this would take care of that. And it generates that expression that I'm not going to show you, but it's like a Nix expression that will install all of these things. It will automatically look for the right revision. And if you put in Git packages as well, it will also generate the hash for you, because there's like a little server that we set up that downloads the package there, computes the hash and then sends it back to the user. You can also use the with_nix function within R. So you could execute any function or any R script inside like a sub-shell with a specific version of R.
And you could then, within the interactive session that you are currently running, get that result back and continue working with it. So this is useful if you are doing a reproducibility study and you just want to execute one particular function from a paper, for example, and you just want to get that result. So you can do that as well quite transparently. If you're interested, there's this website that you can check out. It's still not released on CRAN, but we are aiming at doing that in a couple of weeks. Thank you for your attention. Thank you.
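The generated expression is, in spirit, close to what you could write by hand with nixpkgs' R wrapper. A simplified hand-written sketch — the package choice and the pin URL are illustrative; rix's actual output pins an exact revision and also handles GitHub packages and their hashes for you:

```nix
# Sketch of an R environment in the style rix generates, simplified.
let
  pkgs = import (fetchTarball
    "https://github.com/NixOS/nixpkgs/archive/nixos-23.11.tar.gz") { };
  # rWrapper bundles R with a chosen set of packages from
  # pkgs.rPackages (the CRAN/Bioconductor set in nixpkgs).
  rEnv = pkgs.rWrapper.override {
    packages = with pkgs.rPackages; [ dplyr tidyr ];
  };
in
pkgs.mkShell {
  # "System packages" in rix terms: extra tools next to R itself.
  packages = [ rEnv pkgs.git ];
}
```

`nix-shell` on this file then drops you into an R session where `library(dplyr)` resolves to the pinned version.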
Preparing a 30 year-long project with Nix and NixOS
Hello everyone, my name is Rémi Nicole, I'm this dude on the internet, and I work for the CEA, which is the French commissariat for atomic energy and alternative energies. But the CEA is quite big, so I should say technically that I'm CEA, DRF, IRFU, DISC and all the way at the bottom. What do we do? Well, we do control systems for big physics experiments like particle accelerators. And so what is a particle accelerator? Basically, it's a bunch of hardware. There is a plasma chamber that just produces protons, and then you need to give the protons some energy, you need to control them and you need to do some diagnostics. For example, if you want to make the protons turn, you need to have an industrial power supply and an electromagnet. And so you need to control the power supply to control the strength of the magnet. And so we use a framework which is called EPICS — they like acronyms in this space. So it means Experimental Physics and Industrial Control System. And it's quite old software. I'm showing you the old logo because it explains quite well what it does. It's a single protocol, which is represented by the line, and some clients and servers. So we have, for example, the input-output controller, which does the control of the power supply, for example. And we also have some graphical clients, an alarm system and some archiver. And so what do you do when you're a Nix fan? Well, you package it with Nix. And so you can see the logo of Nix kind of eating the EPICS logo. And I'm not going to talk too much about that, because chances are you don't have a particle accelerator at home, so you won't really need this project. To be fair, someone did use EPICS to control a beer brewing system. Yeah, beer people are weird. And so what, in terms of network? So you have a network as isolated as possible, so you don't exactly need to do that many updates. And usually you don't want to update anything.
If something works, you don't want to touch it, because it takes a lot of money to restart the accelerator. And so what you need usually is a good resilience of the system. You have a lot of assumptions to rethink. And we could be asked to modify some software 10 years after it was put in production. And so what I'm going to present is how we use Nix and NixOS for this kind of resilience. So the first thing is, we use flakes for pinning projects, which is good because anyone can just pick the project back up and it should compile and work. There are some exceptions that you have when you have such a large time scale. For example, some software might not be available in 10 years. Maybe GitHub went down, because Microsoft or something. And what we have as a solution is to do a lot of CI and use our own cache server extensively. And by caching, I mean caching really everything. So usually what you want to cache is the runtime dependencies, but what we want here is to also cache every build-time dependency. And so what we should have as a system is that even 10 years after it was deployed, we could modify anything down the stack and we could pick any project back up. We also need to cache flake inputs, which is a bit weird to do. We also need to cache Nix itself, because maybe in the future Nix will have some deprecation and won't evaluate the old Nix code. And so the system that we have — thank you Maurice for working on this — is that we have a CI server, in our case GitLab CI, which will build our derivation, and we also build a build-time derivation, which depends on all the build dependencies of the software. And then the CI calls a webhook on the cache server, and the cache server will actually pull all of those dependencies.
And why do we have a separate cache server? It's that with this system we can use profiles, because over time the cache server will fill up, and so we need to figure out which old versions of the software we need to clean. Yeah, I have hopes that Nix can be used for building resilient systems. And yeah, if you're curious, here are some links. And for the build-time derivation, there's some example code here. Thank you.
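A hedged sketch of the "build-time derivation" idea mentioned above: a trivial derivation whose output references all of a package's build inputs, so that pushing it to the cache server also preserves the compilers and libraries needed for rebuilds years later. The helper below is illustrative, not the CEA's actual code:

```nix
# Hypothetical helper: given drv, produce an artifact whose closure
# contains drv's build-time dependencies, so caching it caches the
# whole build environment.
{ pkgs }:
drv:
pkgs.runCommand "${drv.name}-build-deps" {
  # Collect build inputs; as environment variables these become
  # space-separated store paths.
  buildTimeDeps =
    (drv.buildInputs or [ ]) ++ (drv.nativeBuildInputs or [ ]);
} ''
  # Writing the paths into $out makes them references of this output,
  # rooting the build environment in the cache alongside drv itself.
  echo "$buildTimeDeps" > $out
''
```

Pushing both `drv` and this companion derivation to the binary cache is what makes "modify anything down the stack in 10 years" realistic, even if the upstream sources have vanished.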
Running NLnet on NixOS
Alright, thank you everyone who moved and made some space. We can now start the next talk. Jos is going to talk about using NixOS at NLnet. Hello everyone. So, yes, my name is Jos van den Oever. I'm an employee at NLnet. NLnet is a Dutch foundation. And who here has heard of NLnet, by the way? Are there any hands down? Wow, this is amazing. That's very cool. Yeah, so it's an honor to work at NLnet. And this talk is going to be about how we use NixOS there. There were so many hands. NLnet is the organization which here at FOSDEM might be known for, you know, spamming stickers everywhere. We have the stand in the K building with so many stickers. And each of these stickers here is a project we have supported. But not all of the projects that we have supported have a sticker, because, you know, command line tools might not have a logo all the time. As you can see, NixOS is up there as well, as well as many other projects. I'm wondering, who here has ever had funding from NLnet? See, that's fewer hands. But we have funding for open source projects. So if you have good ideas, if you're part of a community that has these tenacious bugs that nobody is coming around to help fund to fix, or if you have a protocol which has not been implemented in your particular library, or whatever good idea you have, just look on our website at what other projects we've been funding and, you know, write your own proposal. Proposals to NLnet are not difficult to write. It's one form. You say who you are, you say what your plan is, what the outcome is, what it's going to cost, or what you think it's going to cost, and then you press send. And every two months, there's a new call. So this is the tagline that we use since this week, actually. We have a PR person now, and she, you know, she says the message should be simple, it should be clear, and it should be to the point, and so she tried to fit it into one line.
We fund those who contribute to the open internet, you know, because that's what it's all about — why are you here at FOSDEM? And, yeah, we're just very happy that we can help there. So what do we mean when we talk about the open internet? Well, we should be able to communicate directly, right? Get rid of big tech, which is in between our communications. No dependencies, no lock-in, just get the source, compile it yourself, and that way we can have a good democracy, we can be independent and not have to live in fear that some service is going to be taken away from us, because, you know, we can run it ourselves. So self-hosting is a thing that we very much promote. Yeah, free software, free society. And this logo here, Next Generation Internet — that's the thing that has me standing here, because that's the fund by the European Union that is providing over 90% of the funding that NLnet is able to give out. We have been giving out money for decades now, but we were always a very minor operation, until the EC decided that, you know, there's so much software in this world, we're running on it, we're depending on it, we should, you know, also be the owner of it and invest in it. So that's what the EC is doing now, and we are one of the facilitators that, you know, seek out the right projects that are to be supported. So we fund open software, hardware, standards, documentation. When you submit a proposal to us, it has to be something that you can deliver, that you can, you know, get pushed or publish somewhere — not, for example, server maintenance, or having a meeting; for that you have to go elsewhere. We like to, you know, check what the money is being spent on, and that's also what we have to report to the people that give us the money, though we try to keep the bureaucracy very low. Yeah, self-hosting. So self-hosting, of course, means system administration. Who here likes system administration? 50-50. Yeah.
So, yeah, it doesn't always go well with system administration. You're in the basement in some organizations. In the Netherlands, we're only small, so I get to sit with the other people. It's not all that bad. Once a year, you know, you have System Administrator Appreciation Day, which is awesome, right, if people remember it, and, you know, if they're not on holiday, because it happens to be in the middle of summer. So, yeah. Not everything's perfect. Okay. How do you use NixOS in a small organization? That's what this talk is about. In the Netherlands, we're currently ten people; when I started, we were four, so we're growing. Also, when we started, we were running a bunch of different systems, with backups sometimes, no commits of the configuration, so no history of what was running. Mail, for example, was running on a BSD system with ZFS, so it had snapshots, so that was pretty good. And our requirements are really not that crazy. We need mail, website, telephone, you would think. But then if you drill down, there's quite a lot of stuff that you need to keep running, actually. So here's what we have that is free and open source software, and what is not. So a website, obviously — it's run by nginx. Our email server is self-hosted, mailing lists. We have our own code forge. Well, what makes us tick? Our grant management system. That is running using open source components. And chat, video, micro-blogging since a short while — we are also hosting that ourselves. But not everything. For example, our router, which we could do, of course; we haven't gotten around to that. Printer — open hardware for printers, that's not worth it right now. We have some people using Apple devices, so it's not completely open there either. BIOSes and chips. I mean, we support people designing chips. We're not yet at the stage where we can also dogfood those. But we have quite some components that we do ourselves.
So when we chose a system to get rid of the whole collection that we had before, what options were there? Well, there's NixOS, there's Guix. We could go to a closed cloud, but obviously that would be very bad for our image. Or we could go to an open cloud hoster, of which there are more and more now. But we said, well, we are funding projects. Projects are sending us their code. It would be great if we could also try to keep our knowledge about all these systems up. So let's try to do it all ourselves. And then NixOS has quite a lot of advantages, also some disadvantages. The declarative part is, yeah, it takes some getting used to. But it's really useful, right? It's just nice static files. It's mostly reproducible — and mostly here means 99.99%, for the stuff that we use at least. Extremely many packages, as you've seen in the talks just before now. You can mix versions of stuff; I'll show you a bit later how we actually need to do that. The Nix language, well, there's always a lot of discussion about it. But personally, I really like it. So you have to get it, but then it's great. It's familiar to us because, before we decided to switch all the systems to it, we were already using it on our laptops. So there's a bias over there. The flake.lock is very important to us, because we can lock down the dependencies and we can be sure that whenever we update, it's a conscious choice to do so. Proprietary packages are packaged, but they're disabled by default, so we don't have to worry that by accident we start to depend on closed software. Yeah, there are some downsides as well from our perspective. The community is organized on a proprietary system. A lot of open source projects these days are. And we really promote self-hosting, so if a project is self-hosting, that's a plus in our book. Another thing: not everything is as polished as it could be. I'll show you that we are using an officially unstable feature. So yeah, and there's no storage handling.
And what that means, I'll get back to as well. So there are a lot of green flags there. Full disclosure: NixOS is a partner of ours. When people get funded at NLnet, they also get services, so they get free packaging, and NixOS is providing that. So we are a bit prejudiced when choosing NixOS. For me, I've been using NixOS a long time, but I always found it very difficult to write the packages, until one day I had to explain to a colleague of mine how these files work. And I was sitting there and suddenly it clicked: yeah, everything is a function. I mean, it's called a purely functional package manager, but still, somehow it hadn't clicked. But then I had to explain to him what these brackets at the top with the colons are. And yeah, those are the arguments to the function, and the rest of the file is what comes out of the function. There are probably many Nix developers thinking, wow, we have a newbie here, and I feel a bit embarrassed to say it, but once that clicks, it's really a very nice system, because like Jsonnet or Haskell and other functional languages, it's very predictable in what it does once you get it to do what you want. So is just Nix enough? How do you deploy it to many systems? There was a talk by Solène Rapenne a few years ago on all the possible options there are to deploy NixOS to a number of systems. There's a whole list here, and in her talk she explained what the pros and cons of each of these systems were, and that was very helpful to us. That's why I wanted to highlight it here; that was really amazing work that she did. And in the end, what we chose is to keep it simple and do everything with nixos-rebuild. That's the basic command that everybody is using when you're using NixOS, and it turns out you can just manage your servers with that. So all of our systems are defined in one Git repository. They're all defined in one flake.nix file.
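To make the "everything is a function" lightbulb from the talk concrete, here is a minimal, entirely hypothetical package expression in the nixpkgs style; the attribute set before the colon at the top is the argument list, and everything after it is the function's result:

```nix
# Hypothetical package, for illustration only: the { ... }: header is
# the function's arguments, the derivation below is its return value.
{ lib, stdenv, fetchurl }:

stdenv.mkDerivation {
  pname = "hello-example";
  version = "1.0";

  # lib.fakeHash is a nixpkgs placeholder; the real hash is filled in
  # after the first (failing) build reports it.
  src = fetchurl {
    url = "https://example.org/hello-example-1.0.tar.gz";
    hash = lib.fakeHash;
  };

  meta.description = "A made-up package illustrating Nix's function syntax";
}
```

Calling such a file with `pkgs.callPackage ./hello-example.nix { }` supplies the arguments automatically from nixpkgs, which is why every package file starts with that bracketed header.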
Each machine has a configuration.nix and a hardware-configuration.nix, but there are a lot of placeholders there for stuff that we import from another directory, where most of the services are configured. And we try to keep it simple and readable for everyone. We use a JSON file that has sort of the structure of our setup in there, and that's imported and readable as variables further on in the system. So if you do a nix flake show, and flakes are the not yet completely stable part of Nix that we are using, then you will see nixosConfigurations, which has five servers in our case. And what we do to deploy is we type nixos-rebuild switch, and then we say here's the flake for this server, and it should go to that server. So that's how our deployment system works; it's just built into Nix. And this is our machines JSON, so it tells us what the IP number for the different machines should be, what name servers they should talk to, where the secrets are. Secrets management is really done with rsync by us: when the machine reboots, we don't store the secrets in the Nix store, we just copy them into the /run directory with rsync. And yeah, here's the flake. So we are mixing an old version of nixpkgs, because we haven't completely switched yet (I'll explain later why), with the current NixOS. You can just do that, you can put it together. So these are the inputs, and here is the function that defines the outputs, where these things come in. And this is a very simplified version of how we define each of our machines. We have a function called makeSystem which takes the hostname and the definition, and we define our systems by looping that function over all the machine definitions. It's a bit more complicated because it has to know which inputs to use on which machines, but this is sort of the magic that makes us able to just use nixos-rebuild.
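The pattern just described, one flake plus a machines JSON plus a makeSystem helper, might look roughly like this. The talk doesn't show NLnet's actual repository, so the file names, module paths, and option choices here are assumptions:

```nix
# Hedged sketch of the setup described in the talk: every host is read
# from machines.json and turned into a nixosConfiguration by a
# makeSystem helper mapped over the machine definitions.
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-23.11";

  outputs = { self, nixpkgs }:
    let
      # machines.json holds per-host data: IP address, name servers,
      # secret locations, and so on.
      machines = builtins.fromJSON (builtins.readFile ./machines.json);

      makeSystem = hostname: def:
        nixpkgs.lib.nixosSystem {
          system = "x86_64-linux";
          modules = [
            ./hosts/${hostname}/configuration.nix
            { networking.hostName = hostname; }
          ];
          # def (the per-machine JSON) could be handed to the modules,
          # e.g. via specialArgs = { machine = def; };
        };
    in {
      # Loop makeSystem over all machine definitions.
      nixosConfigurations = nixpkgs.lib.mapAttrs makeSystem machines;
    };
}
```

Each host is then deployed with the stock tooling, along the lines of `nixos-rebuild switch --flake .#server1 --target-host root@server1`, and a single flake input can be refreshed with `nix flake lock --update-input nixpkgs`, which is essentially all the deployment machinery the talk relies on.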
Now, when you're setting up your system, this is the thing I think is most important: the alerts. The computer has to do stuff automatically for you, and you would like to make sure that it continues to do so even while you're sleeping or while you're giving a talk at FOSDEM. So I'm very happy that this box in my mail folder has not had any unread messages for a very long time now; that's great. Our alert board is green most of the time. We have a very particular alert here which is called "NixOS flake committed". If somebody deploys without committing first, it goes red, because then it's undocumented what our system is doing. This was a zoom in, but I think it was good enough to read. Yeah, so backups, that's the second most important thing for your system. We use Borg for backups and btrbk to do snapshots every hour, and here's a small point of critique for NixOS, or actually a feature which is not really there at the moment. When you do anything with software, it also needs data, and you have to say where the data is. Everything is declared in Nix, except the folders have to be written by hand, or they're set by defaults in the services. Doing backups, there's no enforcement that there is a backup, or an easy way to do the backup. In the setup of your backup system, you have to repeat all the directories again, or you define them at the top level and then use variables for those directories everywhere. This is a thing that might be a bit more polished; it's an opportunity for a new system, a new extension. So, mail: who here is hosting their own mail? Wow, that's not enough. We need more people hosting their own mail. It's so important; email is still the backbone of all your communication. We really want to self-host, and we were self-hosting, so when setting up a new system it would feel like a defeat to stop doing that, so we continue doing it.
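Since Nix has no built-in notion of where a service's state lives, the workaround the talk alludes to, declaring the directories once at the top level and reusing that list, can be sketched like this. The paths and the Borg repository are illustrative, not NLnet's actual configuration:

```nix
# Illustrative only: declare state directories in one place and reuse
# the set for the backup job, so nothing is forgotten when a service
# is added. Paths, repo, and secret location are made up.
let
  dataDirs = {
    mail = "/var/vmail";
    git = "/var/lib/gitea";
  };
in
{
  # The same attrset can also be used to set each service's dataDir,
  # keeping service config and backup config in sync.
  services.borgbackup.jobs.state = {
    paths = builtins.attrValues dataDirs;
    repo = "borg@backup.example.org:state";
    encryption.mode = "repokey";
    encryption.passCommand = "cat /run/root/borg-passphrase";
    startAt = "hourly";
  };
}
```

This is exactly the duplication the speaker criticizes: Nix enforces nothing here, so the link between a service and its data only exists because the author chose to route both through one variable.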
And NixOS has a project which is called Simple NixOS Mailserver, which ties together Dovecot, Postfix, LDAP and Rspamd. It didn't use to tie in LDAP, but we needed that, so we paid a contractor to add this support and upstream it. So that's what we're using right now. However, as announced, we're funding a lot of projects. We're also funding Stalwart, a simpler, all-included Rust implementation of a mail server, and we're supporting mox, a Go implementation of a mail server. And we're soon going to try out Stalwart on a less important mail domain of ours. Yeah, and then you get these wonderful 100% scores. If you fiddle around long enough; well, actually we didn't have to fiddle that long, because the Simple NixOS Mailserver really configures your mail properly, and this wonderful website, internet.nl, is what you can use to check whether your mail server is actually configured correctly. One highlight of NixOS that we really value is the testing. Testing two computers working together is made very easy in NixOS, because there are Python scripts that you can call: you set up both computers, you tell them how to talk to each other and what the expected outcome is, and many of these scripts are just part of nixpkgs. So you can read how this testing is done, and for your own setup you can also write those scripts, and that's great. And we run that in CI via flake checks. Well, sometimes something can go wrong. You don't have to be a genius to see what's going wrong here: we are sending the configuration of server one to server two. And this is where the system that we saw earlier, how to fix your booting, comes in handy, because this really killed a deployment one time.
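The two-machine tests mentioned above use the NixOS test framework from nixpkgs, where declared virtual machines are driven by a Python script. A minimal sketch, with the nginx-and-curl scenario invented purely for illustration, looks like this; wiring such a test into a flake's `checks` output is what makes `nix flake check` run it in CI:

```nix
# Minimal sketch of a NixOS VM test: two declared machines and a
# Python testScript describing the expected behaviour.
{ pkgs, ... }:
pkgs.testers.runNixOSTest {
  name = "server-reachable";

  nodes = {
    server = { ... }: {
      services.nginx.enable = true;
      networking.firewall.allowedTCPPorts = [ 80 ];
    };
    client = { pkgs, ... }: {
      environment.systemPackages = [ pkgs.curl ];
    };
  };

  # The test driver exposes each node as a Python object.
  testScript = ''
    start_all()
    server.wait_for_unit("nginx.service")
    client.wait_until_succeeds("curl --fail http://server/")
  '';
}
```

The framework boots both machines in QEMU on a shared virtual network, which is why the script can simply refer to the other node by its hostname.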
So when I say we keep our system simple and we try not to build on top of stuff: here we decided that it would be a good idea to make a small alias script that only takes one argument, so you don't confuse the two servers with each other anymore. We recovered from this in five minutes, so it wasn't that bad, but I did get a big fright. How do we do updates? I'm just putting this command here. It's not that interesting, but I want to have it documented somewhere because it's a bit long. We have a number of inputs to our flake, and if you want to update just one input, which is often something we need to do, for example when one of the software packages that we write ourselves updates, then you can update only that flake input with this command. So, conclusions. We like to keep it simple. We just use the basic tools of NixOS, and we try to move as much of the configuration as we can into JSON files so that it's easier to read. So technically, NixOS is really great for NLnet. However, for the average office, it's probably quite complicated to do this. So I think there's an opportunity here for open cloud providers to use a system like this and make it more user-friendly. And in fact, there is a project called NGI Fediversity where the EU is funding us to help create a new hosting stack that will be using Nix. That has just started; we are in the planning phase. So if you're interested, look it up, or talk to that guy over there. This will probably be a talk next year. And with that, I'm done, and I'm open for questions or tips, because there are many people who are more expert than I am. Thank you. Do we have any questions? Hello. Thank you very much. I'm just wondering, you said that you are rsyncing your secrets to the run directory. Why are you not using something like agenix or SOPS for that, which would do it for you, so you don't need to do it manually?
So the reason we're not doing that is there were so many options, which made me confused. And also, some of them were putting the keys encrypted, but nevertheless in the Nix store. And I just felt more comfortable doing it with rsync. That's the whole explanation for it. Hi. You said something about Nix not being aware of storage locations. I didn't really understand that. Could you explain a bit more what that means? Yes, so Nix defines where all of your software is coming from and how to compile it, and it puts it all in the Nix store. But of course the software is interacting with data, and there's no sort of, you know, type or class which defines where the storage is, so that you could say: I'm doing backup now, just back up all of my systems. If you pass a directory into a service, that directory is an object which has been defined elsewhere; it needs to be in the file system, the file system needs to have, you know, a type, it needs to be mounted. All of those things are something that you have to take care of. And because Nix is declarative, you know, once you hammer it down, it's fine. But it would be great if at compile time you got an error for that. Any other questions? There's a question in the back. I just wanted to react to the storage location thing, because it's interesting. So in NixOS there is a problem: you want to declare things, you want to be declarative. But when you deploy software, software often comes with automatic migrations, so it performs operations on your state, on your files, at every new deployment. And this breaks the rollback system, because if you roll back to a previous version, you don't roll back the data, you just roll back the configuration. And what could be done here is that the NixOS modules themselves could learn about where the state is. What does it mean to back up an application? What is a dependency on the PostgreSQL database, or whatever.
And it would start to provide a solution for the problem you're mentioning. Yes, exactly. Databases are a whole extra level of complexity and possible, you know, data corruption. I think we can take one last question. Hi, thanks for the talk. I might have missed it because I joined a little bit later, but in the configuration, do you have a way that you are happy with to pass secrets? Yes, so the way we pass secrets is we have a top-level JSON file, and there we declare all our secrets. So for root, it needs these secrets: WireGuard needs a private key, the mail needs a password. These are files that have to be under /run/root. And when Nix evaluates it... so Nix doesn't do anything with that, it just writes in the configuration where that file is supposed to be. And then when the machine starts, some services will say, hey, I'm missing my password. So I copy the password in there and then restart that service. But that doesn't happen very often. We are a fairly small office, so we don't have 100 machines, and automating it more seemed like complexity and overkill for our situation. Okay. Thanks. Thank you very much.
Dune 3D - the making of a maker's tool
Hi, I'm Lukas. When I am not writing CAD software of some kind, I usually do hardware projects, some of which I've shown here. As you may see, they're pretty much all the same thing: a circuit board in a 3D printed case. So for designing them, one needs basically two pieces of software: CAD software for the printed circuit board, and CAD software for the 3D printed case. What both of these things have in common is that CAD is pretty important there, since what you draw in CAD is what you're going to get. When you're doing, for example, woodworking or metalworking, if you need an extra hole, well, you just drill it. But that obviously doesn't work for PCBs or 3D printing. So yeah, it's pretty important to have proper CAD software there. For the first thing, for PCBs, I solved that issue for myself a couple of years ago by writing Horizon EDA, but that's not what I'm going to talk about today. For the 3D stuff, I found myself oscillating between FreeCAD and SolveSpace, since both do some things great, but neither of them covered everything I needed. So let me elaborate on that. FreeCAD has pretty much all the features I needed, some of which are STEP import and export, and support for chamfers and fillets to make things look prettier with little effort. But it falls short because of the peculiarities of referencing stuff, the sketcher being modal, and not being able to easily make constraints in 3D. For SolveSpace, it's pretty much the other way around: it has significantly fewer features, but those features work really well, and I found it really pleasant to use. At first I dismissed it, since it doesn't do STEP import and export, but everything else works really well. So is there anything that does all of that? Unfortunately, I didn't find anything, so I thought, well, it's not the first time I've written a CAD software, so maybe let's try writing a 3D CAD. So after all, what do we need to make a 3D CAD?
So first of all, we need to show something to the user. For that, we need a 3D viewport with all of the usual stuff like shading, navigation, and selection. Fortunately, I had already done more or less that for Horizon EDA, since Horizon EDA has a 3D preview, so basically all of the OpenGL boilerplate was already done. So we have that. Next up, we need a geometry kernel that takes care of all of the Boolean operations, extrusions, and stuff like that, and for that there is, as some of you might know, OpenCascade, also from the talk before. It has some warts, but I had some experience with it from Horizon EDA, and it works okay. And it's also pretty much the only game in town if you want to have chamfers and fillets and proper STEP interoperability. So that one's there as well. Next up, we need a solver that takes care of solving all of the constraints and entities and stuff. And for that, there's also something we can use, in particular the solver from SolveSpace. The solver from SolveSpace is available as a library, but with a small asterisk: the library itself is a C wrapper around the C++ internals of SolveSpace, but the wrapper is pretty limited, so I ended up not using the wrapper and using the internals of SolveSpace directly, and they are pretty easy to use, actually. So we have that one as well. And last but not least, we need a user interface of some sort, with all of the boring stuff like the preferences dialog, a way to select tools, the general tool handling, and all of that little but important stuff, such as the axes lollipop that shows which axis goes in which direction. Fortunately, I already had all of that in some way or another from Horizon EDA. It's a 2D CAD, but well, undo, redo, and stuff like that pretty much doesn't care whether it's a 2D or 3D CAD.
So yeah, then I realized I had all of the building blocks to make a 3D CAD, so I started with it. That was back in August last year, and now I'm here to talk to you about Dune 3D, a parametric 3D CAD. It took about six months to get from basically a blank window in GTK to where we are right now. As probably expected, it's written in C++20, it's about 33,000 lines of code, and it uses gtkmm 4 as the GUI toolkit. Using gtkmm 3 would probably have been a slight bit faster, since I had already used that for Horizon EDA, so I would have been able to directly copy-paste code. But gtkmm 4 was the latest version at the time I started, so I went with that. That's probably a topic I could write a book about, since there were quite a few things that were a bit annoying about gtkmm 4. And same as Horizon EDA, it uses UUIDs for everything and JSON as the data storage format. So I pretty much reused all of the concepts that worked well in Horizon EDA for Dune 3D. And just a couple of days ago, I released version 1.0. It's already packaged as a Flatpak, and for the Windows folks there's an MSI installer. The good thing was, well, it wasn't the first time that I had to take care of all of the packaging, so the packaging stuff was pretty much just copy-paste from Horizon EDA again. So what does it do? It has a parametric 2D sketcher that has all of the usual stuff like lines, arcs, circles, and constraints to draw these lines and arcs. There's a convenient all-in-one tool that handles lines and arcs, so one can draw arbitrary outlines in one tool, and there are also some convenience tools for drawing an axis-aligned rectangle or regular polygons, as they're needed, for example, for hex nut inserts. To make things 3D, there's extrude and lathe; lathe is basically a 360-degree revolution, and revolutions that are not 360 degrees aren't supported yet.
And to repeat things, there's linear and polar array, and to combine multiple solids there are the usual operations from OpenCascade: union, difference, and intersection. For that, I basically just had to expose to the user what OpenCascade offers. There are also constraints such as distance, angle, and point-to-plane distance; that's useful, for example, when you want to make a hole that stops at 3 mm from the last edge: you can just use a point-to-plane distance of 3 mm, and that's it. For the STEP import, I basically copy-pasted the code from Horizon EDA that turns the STEP model into a set of triangles, and I also reused the code for extracting the reference points, since the idea is that you want to import your circuit board and add some reference points, and then you can reference these points in the geometry, for example if you want to fit your case around the circuit board or make cutouts for connectors. The last important point is fillets and chamfers. These are basically just calls to the OpenCascade functions to add a chamfer or a fillet to an edge, but unfortunately, the way it's implemented right now is subject to the topological naming problem, since all of these edges are just referenced by index. So if one changes the geometry in a way that adds extra edges, it breaks, but well, I was used to that from FreeCAD, so it was okay. So how does it all fit together? In the middle there's the document, which consists of all of the Dune 3D-specific data structures like groups, entities, and constraints. These are presented to the user with the renderer and canvas, where they are turned into primitives that can be rendered with OpenGL, and the user uses the tools, same as in Horizon EDA, to interact with the document. To take care of the solid model, all of the entities get transformed into something that OpenCascade can understand, and that's again triangulated and rendered.
And to take care of solving things, there's the interface to the solver in SolveSpace. Probably, as is to be expected, the hardest part of implementing all of this was these interfaces between the OpenCascade and SolveSpace parts, since that's where the impedance mismatches are: I had my data model and the data models from OpenCascade and SolveSpace, and it all somehow had to fit together. So what's next? Of course I have some plans, mostly basic things like measurements, revolutions that are not 360 degrees, or stuff like copy-paste. But the big distinction, from the project point of view, between Dune 3D and Horizon EDA is that with Horizon EDA I at least have the aspiration that one might eventually be able to do really big and complex parts, but I want Dune 3D to be and stay a small, easy-to-use CAD software that doesn't have the ambition to cover everything. It should just be a tool to make simple 3D printed, laser cut, or CNC machined cases for PCBs, and probably other things, but it already does pretty much everything I need for my use case, so it'll mostly stay as is, with of course some bug fixes and UI enhancements. But don't expect anything big to happen there in the future. And I think that's it for the presentation, so now for questions, I think. So, questions? Thanks for the talk. Very impressive for this time scale. You were talking about having 3D constraints, and then you just showed an extrusion size, but that's something you can also do in FreeCAD, right? Do you have any other possibilities to do more complex constraints in 3D space? Yeah, sure. Okay, so the question was whether there are any more complex constraints in 3D space. There are some, such as angles or point-to-point distances, and there's also what one can do since 2D and 3D can work together by means of workplanes.
One could, for example, construct a workplane in the same group as the extrusion that's perpendicular to the extrusion, then do whatever one needs there, and then constrain the extrusion to that. Or one can constrain the extrusion to another sketch, so one can put the extrusion in a workplane and do stuff there; it's all projected into the workplane itself. So there really is no limit to what one can do, but yeah, that's also the way that SolveSpace works. Thank you for your talk and impressive effort. Do you think that CAD programs, CAD suites with this level of complexity, could be a good stepping stone for beginners and maybe even children, from very simple drag-and-drop programs like Tinkercad towards something more parametric that they can manage to use when they start to grasp the basics of these kinds of suites? So I think, yeah, it definitely has a learning curve, since one needs to grasp the concept of constraints, degrees of freedom and such, but I think that's pretty much the same in every parametric CAD. There are some idiosyncrasies in terms of the user interface, and it's driven by a global menu that unfortunately has some discoverability issues, but yeah, I think it's something that one can also try with children. But I don't have any experience in the education space. Yeah. Great work indeed, especially for the time you spent on it. So in the beginning you showed these tables with check marks, but you didn't explicitly conclude that you had all the check marks for your software. Yeah, so let's go over it. STEP import and export is pretty much done by OpenCascade, since OpenCascade does the import, i.e.
the triangulation and extracting of reference points, and export is just calling a couple of methods to take the TopoDS shapes and write them to a STEP file. Chamfers and fillets are just methods to call in OpenCascade, and all of the three bottom things are basically the same as in SolveSpace, since overall Dune 3D is pretty similar to SolveSpace in terms of operation. There are groups, constraints, entities, and if one knows and likes SolveSpace, they'll probably also like Dune 3D. Right, and another question, thanks. If you had spent the same time on either SolveSpace or FreeCAD, could you have improved them to meet your needs? Yeah, I was pretty sure that question would come up. So, let's go over it. For FreeCAD, I've looked at the code sometimes, and I find that there's really a lot of code, and I think changes like having a non-modal sketcher would probably have been way more work. And SolveSpace has its own geometry kernel, probably for good reasons, and from a project conceptual point of view, I think OpenCascade and SolveSpace are pretty much diametrically opposed: SolveSpace has this really nice self-contained thing without that big OpenCascade dependency hanging off the side. So that's why I concluded, well, it's probably easier to write my own, and I also noticed that I really like writing CAD software. Okay, we have time for one more question. I use CAD software to create 3D models to render on PCBs, and I found that programs like SolveSpace are missing color support for faces. Does your Dune 3D support this? Right now, it doesn't support colored faces. I'll have to look into how to accomplish that with OpenCascade. These are always the topics that are a bit tedious, and yeah, well, it's OpenCascade, and as mentioned in the talk before, it has a rather cryptic API. But the good thing is there's FreeCAD, and FreeCAD is pretty much the best OpenCascade documentation there is. Okay, thank you.
Okay, thank you very much, Lukas.
Comprehensible Open Hardware: Building the Open Book
Good morning everyone. As said, my name is Joey Castillo, and I'm here to give a brief talk on, I guess, comprehensible open hardware, which, for the record, I'm not a maker of open hardware tools. I'm just a humble user of them. This talk comes from the perspective of someone using the tools made by the folks in this room to learn from open hardware designs and make some of my own. One of the first things that I wanted to build when I got into open hardware was called the Open Book. This is an open hardware e-reader, more or less. I wanted to make this for a long time, way back in 2018, 2019, but when I really wanted to make something like this, I didn't actually have the skills to make something like this. So to get there, I went online to steal as much as I could from folks like Adafruit, who make open source hardware. In opening up their designs for things like this e-paper driver board, they let me copy a lot of what they did for their gadget into my gadget. But I'm getting ahead of myself here. The Open Book is the thing that I wanted to make, and I had some goals for the device. Those goals were pretty simple, or simply stated: I wanted to use it to read books. As I pitch it to new acquaintances, it's like a Kindle that you build from scratch. I wanted it to support reading text in all the languages of the world. And I also wanted it to be affordable, accessible, and, for lack of a better term, kind of DIY-able. So just to give you an idea of what the device is and what it does, here is a short video of it in use. Here's a listing of books and short stories on the device, and I can launch this short story by Leo Tolstoy, which of course renders in Russian. The center button goes back home, where we can select a different work, like here the Tao Te Ching, rendered in Chinese. So I think it's pretty fun as projects go, but the fact is, that's only half of it.
The other half is I wanted the Open Book to be comprehensible to the person who builds and uses it. Like, it's through open hardware that I learned to build open hardware, and I really want to pay that forward to the people who have their own Open Book. To explain some of how I tried to do that, I have to flip it over to the back side. So there are a lot of issues with this first revision of the Open Book, but I'm showing it first because of this sort of dense silkscreen text that kind of became my trademark. Back when people were on Twitter, multiple folks at various times called me the Dr. Bronner's of PCB design, for this habit of filling every millimeter of my board with text. Up here, I'm narrating the entire soap opera of an ideal diode circuit giving five volts to a regulator, which is interesting, I guess. But why? Why should I pack my board silkscreen full of this kind of stuff? To answer that, I need to briefly lay out my ideology of why I got into open hardware. The problem, as I see it, is that closed tech, especially as shipped by these big tech companies, fails to serve users of the technology. I tend to look at technology through the lens of power. Like, take this Kindle, for example. Who is this technology designed to empower? And while, yes, it does allow you to read books, I'd argue it is designed to empower Amazon. It's designed to push you into dark patterns that make you spend more money with Amazon. It's designed to surveil your taps and profile your habits for Amazon. It's designed to steal your attention and monetize it by selling ads for makeup or toasters. Meanwhile, the end user just wants to use the device on their terms, without ads for toasters, and is prevented from doing so by the platform owner. The big question for me: why does big tech get away with this? And the answer that keeps coming back to me is that the technology is fundamentally incomprehensible to the end user.
A device like this arrives fully formed, as a slab of glass and plastic, and it's meant to be used in the ways the platform owner sets forth. It's not meant to be understood or hacked or made to better serve the user. So what can we do about it? Well, I don't have all the answers, but in my practice at least, my goal is to make tech that folks can understand. My theory is that if we can design well-documented open hardware that people can build on their own and understand, at least in the broad strokes, we can teach them that they don't have to accept technology that wasn't made with their best interests at heart. There is this fantastic quote from bunnie Huang in a blog post about hacking his Garmin smartwatch. He writes: the point of open hardware is not the ritual of compiling our stuff from source. It's the awareness that technology is not magic, that there's a trail of breadcrumbs that any of us could follow to liberate our digital lives. So with that in mind, what are the breadcrumbs? What trail am I laying down for users of my objects to follow? Over the course of a few years, I've had the opportunity to design several different versions of the Open Book, and I think I've found three different sets of breadcrumbs for three different contexts. The first has to do with helping the user understand how the gadget works, the second with helping them understand how to build the gadget themselves, and the third with explaining how to make use of the gadget. Let's take the first one first. This is one of the earliest Open Book prototypes, and my vision back then was to use the silkscreen to narrate what each component on the board does. This has some benefits, I think. On the plus side, this could demystify the tech for someone who sat down and actually read the silkscreen. On the downside, I have to say, space is limited, and I'm honestly left wondering if this is the most useful information to give the user.
Like, I want to demystify the tech, I want them to feel like this is something they can understand, but is understanding how a MOSFET works the best way to do that? The best answer I can come up with is maybe. Still, there were a couple of bigger issues with this version of the Open Book. The parts are kind of small. These are 0805 passives, which are pretty small for the average folk. They're fine-pitched parts. There are also parts like this microSD slot, which has its solderable connections hidden underneath a shield. I borrowed that footprint from Adafruit, which is open hardware and great, but they design for manufacturing, not hand-building. There are also honestly just way too many parts on this board. It's trying to do too many things, and it's overwhelming to someone trying to build and understand it. So, yeah, this realization led to a new design that I called the Abridged Edition. This version cut the part count down considerably and tried to make it as simple as possible for people to build themselves. I used bigger passive components, 1206, and I picked parts with pins that are easily accessible, like this new microSD slot. Instead of making folks solder down a fine-pitched microcontroller, I used the Raspberry Pi Pico module, which has a super-friendly 0.1-inch pitch. Yeah, some parts like this flex connector I could not buy on a module, but then I realized I can make my own module and have it preassembled. This little castellated module, the green part, includes the e-paper connector as well as the whole supporting boost circuit. I ordered dozens of these for a few bucks a piece, and I offered them alongside my main PCB. This meant that DIY makers only had to plop down one module to get the display working, instead of a dozen densely packed fine-pitched parts. I also decided, rather than using the silkscreen to explain things, I could use it to explain how to build the thing.
Adding step-by-step instructions alongside each of the parts on the board is a different trail of breadcrumbs, but the upshot was you could follow the instructions literally counterclockwise around the board, and if you followed them all correctly, you would end up with a working device. Okay, so things I like about this set of breadcrumbs: well, it is super effective. Since releasing this design, dozens of people around the world have assembled their own Open Book boards without any of my involvement. Like, these photos are community builds that I never touched. I didn't even send them a board. These are people going in on group buys and part swaps and having enough success with the build that they've moved on to hacking on the firmware, which is exactly what I wanted to see. I'll also say we did a workshop at Hackaday Supercon in 2022, very ad hoc, not a formal workshop. We just sat on the floor of Supplyframe HQ and I guided a dozen people through building their own Open Books, and every single one walked away with a working device. Like, the plan worked. Still, after the Abridged Edition and doing these workshops kind of hands-on, I realized that to make the project scale to more people, I couldn't rely on everyone soldering it together themselves. I would have to have most of the thing done for them. This means I'm no longer using the silkscreen to tell folks how to solder the thing together. Still, I did want to use it to do something useful. I still wanted to encourage that comprehensibility that we were talking about earlier. In this case, I kept something from the original Open Book: arranging the components in functional blocks, even if I can't fit room for narrative text to describe what they do. These blocks match what's in my schematic and how the components are grouped over there. This still gives people an overview of how the device works. You can see this is not a pile of parts arranged haphazardly.
These parts work together in ways the user can understand. This is the battery charger. This is the power supply. Still, there is the question of what to do with the rest of the board space, and I can't leave it blank, so. The trail I'm finding most useful these days is the trail that leads to making use of the device. For this latest version of the open book, I'm including pin assignments as well as notes on how to develop firmware for the device right there on the circuit board itself. So, I'm going to be honest with you. I use this a ton when I'm writing my own firmware. Like, I am lazy, really. Sometimes I don't want to search my own documentation. Sometimes I didn't even write the documentation. If I don't want to open my schematic to try to decode what I was thinking when I designed this thing six months ago, what are the odds a user is going to go to all that effort themselves? Having the docs right there on the board is an affordance for people making use of my device, and as I found out, I am one of them. This also works on boards of many shapes and sizes. This is the circuit board for SensorWatch, the Casio wristwatch mod that I'm wearing on my wrist. Also, a shout out to Lucas, who is just up here. I learned everything about making a Casio board swap from your open hardware Pluto watch, so thank you for that. SensorWatch owes its existence to your project. Anyway, you can see here we're on the backside. This board is less than one inch in diameter, but we're still able to include notes about which pins are which, the capabilities, and even which on-chip peripherals I expect you to call on to make use of those pins. Self-documenting circuit boards like these attach relevant information to the hardware you already have physically in hand. This board doesn't just have pin labels. It has a narrative of how you wire it up. It doesn't just have component designators. It tells you what they mean for the device configuration. Oops. 
It creates a self-contained artifact. This is a prototype of a new version of SensorWatch. I'm still working on the pin assignments, and they may change before it's final, but even if I put this down and pick it up in six months, I don't have to cross-check a revision number with a schematic and a datasheet to get hacking. All the relevant information is literally in hand. Moreover, that information becomes available to the end user as well, unlike closed-source objects, which you have to painstakingly reverse engineer. Putting this information on the board itself makes the object hackable by default. We're throwing the doors open to the end user without forcing them to do so much as a web search, much less a deep dive into my repo. Also, just as a side note, this technique pairs very nicely with code that makes use of the same names. If your silkscreen says you named a pin BTN_ALARM, and the headers for your board support package also name that pin BTN_ALARM, you've made everyone's life easier, including, actually, maybe even especially your own. Once again, I am not someone who invents open hardware tools. I am just a humble user of them. And I don't have all the answers when it comes to making, or helping folks to hack on, the stuff that we make. Still, these are some of the ways that I have tried to make some of my stuff more transparent. And I just want to close with some questions that I can ask myself and we can ask ourselves as we finalize our designs and send them out into the world. Questions like: how would I imagine someone using this device? Am I offering affordances that make it likely they'll achieve what I hope that they'll achieve? What kind of information would I want to give a user of the device, both at a basic level and at a more advanced level? And also, I didn't put it on the slide, but what would I want to know if I'm picking this up after six months and I've forgotten most of my design choices?
Most of all, can I tell the story that I'm trying to tell, the story of the device, in a way that makes sense? Because if I can figure that out and print it right there on the board, both the artifact and its backstory will live together forever. Anyway, that is what I wanted to share today. So I'm going to put up my info, and I would love to take questions if we have any. Thank you. So first of all, I love the product and I love your philosophy on open source. Does the Open Book support EPUB files right now? It does not. So the Open Book uses an ESP32 microcontroller. It's a very resource-constrained platform. At this time, I'm supporting plain UTF-8 text. That is my file format of choice. That might also be a bit of an ideological choice. I like the idea that a plain text file can represent a literary work. Plain text feels powerful. I think if space aliens come and see the ruins of our civilization in a millennium, they'd probably be able to figure out the UTF-8. I'm not sure if they'd be able to figure out the plethora of things that go into... EPUB is just a zip. Yeah, having said that, folks ask this question a lot, and now that people are hacking on the firmware, I think it's entirely possible. The ESP32 is a capable microcontroller, and I'd be curious to see what folks come up with in this space. So while it's not something I'm working on myself, this is the ethos of open source, right? Throwing the doors open to folks. Awesome, thank you. Thank you. Yes, but for me the problem is my vision, in a sense, because there are a lot of SMD, surface-mount devices, and for me it's not practical. You need a whole setup to do this kind of precision soldering, so it's not for everybody, this kind of thing. So I know hardware, but hardware in the past was easy. Now it's very difficult. Also, the book format itself is little.
I prefer a large format, and the chance to make annotations and so on. So what do you say about that? I think you're absolutely right, and I think this is the reason that I'm starting to move toward getting it PCBA assembled, and maybe the experience of building your own book is taking a circuit board that's already assembled and putting it into a case of your choosing, or 3D printing a case, and maybe those are the larger things that you're putting together. But I totally understand not everyone is going to be able to solder these fine-pitched parts, and yeah, I think maybe my appetite for DIY got ahead of my understanding of everyone's capability or desire to DIY. So I think you're totally right, and yeah, I'm probably moving toward more PCBA in the future. Can you pass the microphone back to Andy? Hi, thanks. It's really great. A couple of questions. So the silkscreen can mess up your board now, if you make a mistake in the documentation? That is correct. I think it makes me very diligent about triple-checking things before I send it off, but no, I will not lie, that has happened to me before. No way of automating that? I'm very curious, and maybe some of the folks in this room have ideas, but I do like the idea that if I know I want to annotate, for example, a pin on my microcontroller symbol, I want that to be on my silkscreen. That would be very interesting, to see if there are ways to link those things together. I haven't yet run across ways to do that, but if anyone in this room knows tools that can help me with that, I would love to do more of that. Can you do field substitution in KiCad in the silkscreen? Okay, cool. I've been told I can do field substitution in KiCad in the silkscreen, so that is awesome. As a user of KiCad, I will check that out. Any plans for having a camera on the book, so that you can scan the board and show the schematics and the documentation on the ebook itself?
Interestingly, not only on the ebook itself, but I have a colleague who's working on using kind of QR-style codes to get a better sense of the assembly of various devices. I think there's a lot of possibilities there. I also had a slide about the idea of putting things like QR codes that contain text, not URLs. If I could put a basic readme in a QR code, and you scan it and you get the full text of a pinout or a description, that would be very interesting. But yeah, possibilities. I see one more in the back. There's a question online about the ETA. So, yeah, the question is, am I planning to offer the Open Book online, or is there an ETA? And I hope to do a Crowd Supply campaign at some point this year. I'm just, yeah, it's hard to find the time to do all the things I want to do, but hopefully by the end of the year, hopefully in the next few months, I'll have a pre-launch page up and we'll be able to, yeah, put something out there. Okay, so thank you, Joey. Thank you. Thank you all.
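On the field-substitution point from the Q&A: KiCad's project text variables, defined under Project Settings, can be referenced from any text item, including silkscreen text. A rough sketch of how that could keep on-board documentation in sync, with invented variable names:

```text
Project Settings -> Text Variables (stored in the .kicad_pro file):

    BOARD_REV   = rev C
    PINOUT_NOTE = BTN_ALARM on pin A2, see repo for full pinout

A silkscreen text item on the PCB can then read:

    Rev ${BOARD_REV} / ${PINOUT_NOTE}

KiCad substitutes the values wherever the variable is referenced, so
a note printed on the board can be updated in one place instead of
being retyped on every text item.
```

This covers project-level text; whether a given symbol field can flow all the way to a silkscreen annotation depends on the KiCad version and how the footprint fields are set up.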
FreeCAD - state of the union
So, we have Yorik and Aik-Siong, who are going to talk to us about the state of FreeCAD. So please give a warm welcome to Yorik and Aik-Siong. Hi. Hi. Is the microphone, yeah. Hi. Hello everybody. So this is what we're going to talk about today. I will make a short update of what's going on, what has been done recently with FreeCAD, and Aik-Siong will talk more about the assembly side of things. How do I go to the next one? Page, huh? The next page, yeah. Oh my. Oh, it's hard to use other people's software. My goodness. We'll get there. Wow, it's good, in the browser. We'll get there. Basically, I will already talk while we are there. Yes. Okay. So the main thing everybody is always asking us is: when is version 1.0 ready? And I have good news for you. It's almost ready. I know we've been saying that for like five years, but I swear it's almost ready. We basically have two more things that we think we cannot call it 1.0 without: the toponaming problem and assembly. That's basically what I'm going to talk about briefly. And I will talk about these other things too. That's basically all that we are doing now, and what you can already see if you use a development version. Can you go one further? I hope so. Something, something. So, um, oh, proprietary platforms, you know what that is. It's not responding. I go back. Okay, let's just do it in the browser. Yeah. So, um, there it is. A lot of things. Woo. Thank God. I don't know what to say. So toponaming, basically, is the main curse of FreeCAD. It's a problem we have had in FreeCAD for a long time. Basically, when you transform geometry... any geometry in FreeCAD has named components: this is Edge1, this is Edge2, that is Face1, Face2.
And when you change the geometry, the OpenCASCADE kernel of FreeCAD reconstructs the thing, and especially with sketches, where a computation is needed to know which edge depends on which one, the order is shuffled. And this prevents, or this hinders, all the naming. When you reference one edge by name, for example, if you say, I'm building upon one edge, Edge1, and then your Edge1 is somewhere else, your model breaks. Everybody who has worked with FreeCAD knows that problem. That's our main curse. And it's basically about to be solved. Hopefully. Crossing fingers, we'll see when it's there if it holds its promise, but it's basically a fork of FreeCAD, the LinkStage3 branch. It's working there already. So we have an example of it working, and it's working really reliably. So basically there is a new engine that keeps track of the names and renames things as needed. And the last piece of this is being merged currently. So in the next few months, who knows, this is done. So this is basically everything I just explained. Assembly: Dr. Koh will talk about it in a moment. We have lots of new stuff in the Sketcher. That's basically where most of the new stuff is being done now. We have auto-constraining, which is much better. We have lots of people working on user interface things, making the workflow in the Sketcher more streamlined, removing some unnecessary operations. Now you can draw a rectangle, a triangle and other shapes, and you already have the right constraints put in for you. You just need to edit them if you want to change anything. We have on-screen inputs, so you make a rectangle, you click two points, and you can enter the dimensions right there. You don't have to go click anywhere else. So those are like small UI improvements.
But we are beginning to get into UI and UX, and trying to address all those little things and see what we can do to make things flow better. Lots of theming and UI work is being done too. There are new themes coming, trying to make all these little widgets a little bit better, a little bit less different from each other. There are light themes for those who like light themes, dark themes for those who like dark themes. Everybody is having fun with this. It's surprising how everybody uses them and says, oh wow, this is so much better, and it's just theming. But you can do so much with that, in terms of having your application behave more the same across workbenches, etc. Before I let Aik-Siong talk, just a last word about what's happening around FreeCAD. We are in a kind of exponential growth now. Things are pouring in everywhere, and new developers are coming from all sides. We have a non-profit; it's our second year now. The FPA is the FreeCAD Project Association. And we are finally beginning to learn how to spend money. You would think it would be hard to earn money, but it's actually much harder to spend it in an open source project. How do you do that responsibly, transparently? How do you make the best of the money? It's really something we had to learn, and it took more than a year to begin to learn about it. Some FreeCAD veterans have also created a commercial company around FreeCAD, which is Ondsel. Its aim is basically to sell commercial solutions to people who need commercial stuff: companies who are not comfortable with taking on an open source software. The idea of Ondsel is to bring them a commercial package, with all the stuff that companies need around an open source software to feel more comfortable. And as always, Blender is our main inspiration in all this: how they managed to get to higher commercial levels while it's still a community project. And that's basically what we're trying to do here.
Like, maintain FreeCAD as a community project and try to wet our feet in that commercial world, but without getting lost on the way. So I'll let Aik-Siong talk about the assembly system now. Testing, testing, can you hear me? Alright, thanks. Okay, I'm going to talk about what is called the Ondsel assembly solver. I'm part of Ondsel, and assembly will be LGPL. And the work starts from my work. Oh my. There are your slides now. I know, but... How do I get out? Get out? Escape? There's an escape? Oh my. Can you go just page... Oh yeah. If you go that way, you have yours. Alright. So basically, my work starts here. Goodness, what is so tough suddenly? I have a very little introduction while you're looking at it. There was at the time a second FreeCAD project. When FreeCAD started, there were two open source projects named FreeCAD. One of them was ours, the one that we know. And the other one was something that he created and launched. And now, after like 15 or something years, Dr. Koh is putting his FreeCAD inside our FreeCAD. And it's like a cosmic combination. Both FreeCADs are joining into one effort. Right. I started out doing this in earnest in 1991. And this project here, I called it FreeCAD and I launched it in the year 2000, so predating FreeCAD.org. Anyhow... So the important thing here is this software is multi-body dynamics software, which is equivalent to, say, Adams, for those who are into multi-body dynamics, which is probably the premier multi-body dynamics software used in industry, by Boeing, Ford, Caterpillar and so on. So I was into the theory of multi-body dynamics in the 1980s. So I really got absorbed into it. But the license for Adams was, of course, ridiculous for a professor. And I just said, okay, I know how to do it, so I decided to make it. So the next question was, you know, do I program in FORTRAN? Do I program in C? And I said, if I did that, I'd be 20 years behind.
And I was just saying, you know, let me look and see what is more productive. As it turns out, in the 90s, Smalltalk was the big rage. Smalltalk had just invented the GUI. Smalltalk had just invented the mouse, integrated IDEs and so on. So I used Smalltalk for this program. And within a year, I was able to get a simulation going. And then I spent about, you know, three to five years, depending on how you count it, to get what you see on the left side. Speak closer. All right. On the left side. And I tried to commercialize that. Indeed, we did form a company. And we got bought up by Adams. I decided to leave to start my own. And I put this... as you can see, the graphics here are, of course, pretty bad. Just extrusion. 2D is extruded. That's it. So I made add-ins for SpaceClaim. And the motion simulation is mine, still running in Smalltalk. But the simulation now works with SpaceClaim. And the system is quite capable. It can do systems like this. And hopefully, FreeCAD will be doing this in the future. Certainly the kinematics and assembly, that will be all LGPL. The multi-body dynamics side, the dynamics side, we are undecided about at this point. Okay. So hopefully soon FreeCAD will be able to simulate systems like this. Okay. All right. Back to my slide. 15 minutes left. Okay. Good. So the theory, if you're interested in the theory, the theory is right there. Okay. It's in OpenOffice format. And this is just a summary page in PDF. And for those who, you know, want to get into the details, it's all there. Okay. And hopefully it's reproducible too. All right. So this is the open source version, which is in C++. All right. In order to make it work in FreeCAD, I had to translate it from Smalltalk to C++. And that was an exercise that really taught me something interesting too. So translating from Smalltalk to C++ would be similar to translating from Python to C++. So, all right. A little bit more on the theory. I'm supposed to make it technical.
I guess they want me to impress you guys with the technicality. So. It's basically: you have the world frame, or the inertial frame. And then from there you go to the assembly frame, and then from the assembly, you go to the part frame. All right. And then on the part frame, you have marker frames. All right. And then from the marker frame, you go to the end frame. And the end frame can be on a point, or it could be on a curve, or it could be on a surface. All right. And the point itself could be moving relative to the marker, and the curve itself could be changing shape relative to the marker. And why would I want to do that? For example, if I have an actuator moving a piston, a hydraulic piston, out as a function of time, I want to be able to describe the movement of the E marker relative to the M marker. And that would be the purpose of having things move. So once you have the kinematics, where the positions of things are, the next thing is to use constraints to make sure that they connect in an interesting way to make your mechanism. All right. And the constraints are basically absolute constraints, Euler parameter constraints, at-point (that means they are at the same point), in-plane, lines perpendicular, at a certain distance, constant velocity in couplers, and so on. So if we solve the equations in the right, mathematically exact way, you get kinematics and dynamics quite nicely. So right now we have a lot of assemblies in FreeCAD, but I don't think any of them solve the full kinematic equations completely in 3D. So with that, you can create joints like this: rigid, prismatic, revolute, parallel, cylindrical, spherical, and so on. And hopefully almost all modern-day mechanisms could be solved using a combination of these joints. So let me share something that I think is most interesting, that I discovered in this practice. So I started in Smalltalk, which is very Python-like.
So I think that could be shared with you guys: going from Smalltalk to C++, or Python to C++. I was a bit worried, you know, how do you do it? And, you know, C++ is of course a terrifying language to me. But I realized that actually I don't need to use all the bells and whistles of C++. I just needed to get into C++ so that I could get into FreeCAD.org. So I wrote C++ in a way that is Smalltalk-like. And once I was able to do that, miraculously, the Smalltalk and the C++ could stay side by side and look very similar. And as a result, I was able to do the translation in about six months, when I was worried, wow, it may take quite a long time. So I want to pass this on to you, because Python people may want to do that too. You have developed something in Python, it's nice, but it's slow. You now want to put it into C++ to make it fast. And you have never wanted to do that because you're just terrified of C++. But now I think there's an opportunity. It's not that difficult. Okay, so you make the Python and the C++ look alike. So what's the secret? I don't worry about protected and private. Everything is public. All variables are public. Functions are public and virtual. Okay, and the secret is to use smart pointers, shared_ptr. If you do that, then the C++ objects behave very much like Python objects. Okay. In the past, you were afraid to use pointers because you worry about memory leaks. Okay, you worry about when to use new and delete, and if there's a mismatch, you are guaranteed to have memory leaks. And then the other thing is, in C++, you have copy by value or copy by reference. But if you use shared pointers, those things you don't have to worry about. The only slight additional worry is circularity. You have one shared pointer from A to B, and if B points back to A, that would be a circular reference. All you have to do is use a shared pointer from A to B, and then from B to A, you just use a plain raw pointer.
And that circularity for shared pointers will not be a problem. If you go around in a bigger circle, you just have to break one of those with a raw pointer, and you should be in good shape. So that's my, I guess, extra bonus message to you guys about Python to C++. That's it. Just one last thing to add is that there is already a version of this working in FreeCAD that you can get: Ondsel has a special build of FreeCAD that they issued which already contains part of this. So that's testable in their build, and soon in main FreeCAD. Okay. Are there any questions? Hello, thank you very much for the very nice talk. I'm very curious, with all this new powerful functionality, what your thoughts are about possible changes to the user interface. If it will be easy to integrate, if there need to be new concepts in the user interface, maybe if there would even be visual programming. Like, how do we envision that users can use all the functionality that you introduce? Assembly has been around for a long time, so I don't think there's anything unusual. So we are developing something at Ondsel, but I believe that any of the assemblies can use my solver to get good assembly solutions. Hello, can I ask if there's any roadmap for the Path workbench, and in particular about using the fourth axis, the availability, if that is on the roadmap at all? Yes, what do you know about that? Sorry, could you repeat? Brad Collette is the, I guess, originator of Path. From talking to him, he wants to make it as professional as possible, or as solid as possible. So I'm sure it's in our roadmap, but I don't know exactly what that roadmap is. I mean, it's certainly a high priority. Let's put it that way. So my question is, is there going to be a default assembly workbench, and is that the case with 1.0? Good question. You want to? I think we have Pierre Boyer creating one at Ondsel, and we call it an integrated assembly.
But like I said, for everyone to work together to create one nice one for FreeCAD, that's definitely our goal. So we'll put our integrated assembly out, and as usual, FreeCAD is open source, and people can give a lot of input, hopefully constructive input, and then we can move on. Yeah, just to add something: we hope that this will be the one that unifies everybody. But it will still have all the others; there are many paradigms, and other people who want another sort of assembly, and those will stay. We just hope there will be a good default that most people will want to use. There's a question online. Once toponaming is solved, would you allow adding custom names to elements, faces, edges? For example, instead of only having generic names like Face1, Edge2, etc., a user could attach custom strings? I'm not sure I understood exactly the question, but the thing is, yes. The translation engine that we're... there is a kind of mechanism that maps the old Edge1 to the new Edge1 and keeps track so that it's always the same edge that gets named Edge1. That engine is also able... you can use custom names there. So instead of Edge1, you could change the name to left edge. And so, yeah, I think that's what the person is asking, if you can begin to customize things and give labels and names to things. And yes, that's already in the engine. Thank you very much. Super news about the topological naming problem, but I think the one most interesting to me is the change in the UX. I think that's one major, major wall for beginning users. Another question I have: I was recently listening to a podcast from Opulo, who make an open source but commercial pick-and-place machine for electronics. And what I found very interesting is they said, we want to build with FreeCAD just so that our machine can be the best. How much emphasis do you put on building features in FreeCAD to support this community open hardware approach of contribution, like building...
So for code, it's very easy, because you just add lines of code, and you have a Git diff. What about a Git diff in FreeCAD somehow? Yeah, we're thinking about that all the time. Of course, it's very hard to obtain. It depends a lot on what your use case is, what you want to see in diffs. But if you look at the FreeCAD forum, you have several threads about that, people looking for possible solutions. I would say none of the other software...
KiCad Status Update
There's quite a few seats down here in the front row. If you've got... if anybody's looking for seats, right here is four, three. Okay. So, let's please give a warm welcome to Wayne Stambaugh. Thank you. Thank you. Thanks. I'd like to start out by saying thank you to everybody for attending. It's great to be back at a live FOSDEM again. This is the first time I've been back since 2020, pre-COVID. So it's great to get out in front of the KiCad user base and get to talk to people. If you didn't show up at the booth yesterday, it was a lot of fun. We sold our swag out much faster than we thought we would, and all the stickers were gone before lunchtime. So hopefully next year, now we know, we'll bring a little bit more with us. So there's a lot to talk about. I know some of this is going to be kind of fast. I'm going to get through it quick, but because there have been quite so many changes that are going to happen in the upcoming version eight, I'm just going to blow through them. The talk slides are available online. There are some animations in them that I'm not going to have time to let play all the way through. So if you are curious about how some of the new features work and how you access them, you can just download my talk and then play through it on your own. So let's talk about what's going on. I'm only going to talk about what's happened in the last year, because it would be too much to go all the way back to 2020. Unfortunately, that first line should have said KiCad 8 was already released. Well, we ran into a few issues. We're going to have to spin an RC3 here, probably in the next day or so. I expect eight to be released sometime in the middle of February at the latest. Fingers crossed, but I think we're pretty close. Last year, when we ran the version eight end-of-year campaign, we raised over $200,000 in donations and donation matches from other companies.
So that was a really, really successful donation campaign, and it has allowed us to pay developers to continue to contribute to KiCad. So all these new features that are in KiCad v8 and moving forward are largely due to having those funds available to help pay our team to continue to contribute. That's been really beneficial. We had our first conference since the original one in 2019 in Chicago: there was a KiCon Europe this year in A Coruña, Spain. For those of you who didn't get to attend, it was a much smaller, more subdued event, but it was really well done and I think everybody had a good time. Interestingly enough, spun out of that, there's a company called Huaqiu — NextPCB's parent company — who is one of our platinum sponsors. They also decided to throw a kind of impromptu KiCon Asia. And so Seth and I were in Shenzhen in November for the first ever KiCon Asia. So what else is going on in the team? In the last year, we've added three new lead developers. Huaqiu/NextPCB, who sponsored the KiCon Asia event, have actually hired people to work full time on KiCad. There's one full-time individual now; once we release eight, he'll probably be the next member of the lead development team. They also hired a second person whom he's bringing online and getting used to the KiCad code. So now we're going to have some additional resources that help the project. The biggest improvement, though, has been in the library team. The library team has grown tremendously. For a long time, we had this huge backlog of symbol, footprint, and 3D model contributions that kind of got stale, because the people who were running the libraries way back either changed jobs or had to go do something else. Life got in the way. But in the last year, we've added six new members to the library team, and there are eight now. Is it eight? Sorry, I probably didn't update it. They've cleared a huge amount of backlog.
I'll go through the statistics at the end here just to show you how much that's improved. We actually have a technical writer now. We have one individual who spends a lot of time on it; it's all he does. They always say there's no I in team, but for Graham, he is truly a one-man team. So our KiCad documentation does not lag as much as it used to. It used to be that the documentation always lagged quite a bit. For version eight, it's going to be relatively up to date. There'll be a few things that are missing. One of the other things that was interesting is that Würth Elektronik out of Germany had contacted me about providing their footprint, symbol, and 3D model libraries to KiCad. And so they've slowly started to come online and contribute stuff. Their goal is to get their entire product line into the stock KiCad libraries. That's a big company, and it's quite a few parts. So thank you to them for stepping up and basically providing their own symbol, footprint, and 3D model libraries for KiCad. One of the things that's interesting — I'll also talk about this in the statistics part — is that in the US and in Europe we have quite a bit of market share. We actually have a very large presence. But in Asia we've kind of lagged behind. In the last year, though, looking at the download numbers, Asia is really starting to ramp up. And you know, that's a really big market, so it's neat to see KiCad making penetration in that market. And I think one of the main reasons, especially in China, is that we now actually have quite a few people who are full time either translating the application or translating the documentation. So they have really good native-language support for KiCad, and I think that's what's helped. We now have five platinum sponsors, and if you're not aware, platinum in KiCad is $15,000 a year or more. So we now have five of those.
And I know it's a little limited right now, but the KiCad store is open. So if you want to get some KiCad swag — there are not a lot of items there yet, but as time goes on we will add more and more items to the KiCad store. So head to store.kicad.org and get your latest KiCad swag. And of course we always like to give a little love to our platinum sponsors. I see Felix from AISLER is around here somewhere. The newest sponsor is DAI; they're a consulting firm and our latest platinum sponsor. Then DigiKey, and Huaqiu/NextPCB — Huaqiu is the parent company of NextPCB; you are familiar with them, they are a PCB and PCBA manufacturer. And of course the KiCad Services Corporation, whose goal is to continue to support the KiCad project from the commercial side; everything flows down from there into KiCad proper, which we all get to benefit from. So what did we add in version 8? There are a lot of things in version 8 that are not in version 7. We made a bunch of SVG exporter improvements. Some of the primitives used to export as line segments, and now each exports as its own primitive. There's now a startup splash screen, but it's been disabled — more about that later. It's there for things like rebranding: if somebody wanted to make their own variant of KiCad and wanted to put their own splash screen up there, they can do that. There's now — oh, it's slow. Oh, it went too far. Page up. Come on, laptop. There we go. So, hotkeys: here you can see the animation playing. You can assign multiple hotkeys now to a single action. We have ARM64 builds. So for those of you who are running Windows on ARM64, we now have native binaries for you. One of the big contributions this year was an EasyEDA project importer. So your EasyEDA and EasyEDA Pro projects will import directly into KiCad — the whole project, schematic, everything. So that was a nice contribution.
We introduced the command line interface in version 7, but there were bits and pieces of it missing. Now you can run DRC and ERC from the command line. So if you want to build a CI tool where, every time somebody commits a change, you run the ERC or DRC to make sure it's clean — and if it's not, you can automatically ping somebody: hey, you broke this, you broke that — that'll be available in version 8. Yeah, we all like those CI tools. Keep people on their toes. So here's something interesting that happened this year. This is one of the things that KiCad Services did: there was a customer who needed this — it was a request, and they paid us directly to integrate it into KiCad. So everybody gets Git support in KiCad now. Not everything in Git is supported, but most of the basic things you would need to keep your designs under version control are now built into KiCad. The properties panel in seven was only available in the board editor. It's now available in all editors. So the schematic editor, the symbol editor, and the footprint editor now have the little panel: you select an object, the panel comes up, and you can modify the object's properties without having to open and close a dialog. Here's another — oh, come on, you can do it. There you go. Here's another one. We have customers that do really complex designs, and they requested this. When you highlight a net, sometimes the design is so complicated that it's hard to find where everything goes. If you have a deeply nested hierarchy, it's cumbersome, because you've got to walk up and down the hierarchy stack. Now, when you highlight a net, there's a navigator: you see all the elements that are connected to that net in the bar on the left. Click on one, it opens the sheet and takes you to that element directly.
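The CI workflow he describes — run ERC/DRC on every commit and fail the pipeline on violations — could look roughly like the sketch below in a GitLab CI config. The file names, Docker image tag, and exact flags are assumptions on my part, not an official KiCad-project recipe; check `kicad-cli sch erc --help` on your KiCad 8 install.

```yaml
# Hypothetical CI job: fail the pipeline when a commit introduces
# ERC or DRC violations. Image tag, file names, and flags are
# illustrative assumptions.
check-design:
  image: kicad/kicad:8.0
  script:
    - kicad-cli sch erc --exit-code-violations my-project.kicad_sch
    - kicad-cli pcb drc --exit-code-violations my-project.kicad_pcb
```

With an exit-code-on-violations flag, the job fails as soon as either check reports a problem, which is exactly the "hey, you broke this" ping he mentions.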
So if you do really complex designs, it's a time saver. That was also paid for by somebody else. This wasn't something that was even on our radar; it was just something that a paying customer requested. They paid for it, it goes into KiCad, we all win. There are now search panels in all the editors. There's a global search panel which allows you to search for all kinds of different objects. You click on a result and it takes you to that object in the view. Compared to the old find dialog, this is a lot more convenient and a bit more useful — you can see what's available. There's now an internal BOM tool. In version 8, you no longer have to generate your BOMs using a script. In the past, we always scripted it out, because everybody would argue about what a good BOM is. Now we provide a tool. It's obviously not as flexible as the scripting is, but if you just want to export a simple BOM, there's now a tool for that built into KiCad. We also have contextual object grid alignment. You know how sometimes you want pins on 50 mil grids and your text on 25 mil grids? You can set that up contextually, and when you're using that kind of object, it will automatically pick that grid spacing. So instead of constantly changing grids back and forth when you're connecting pins versus moving your text around to get it all nicely lined up, there's a tool that handles that automatically for you. There's now nested symbol inheritance. In version 7 the derivation level was 1; now the derivation depth is unlimited. So you don't have to keep redefining the same fields over and over again. You can define a set of fields, then build a symbol on top of that, and a symbol on top of that, and keep stacking. So that's available in v8. Oh, God.
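The nested inheritance he describes can be pictured as field maps resolved along a parent chain: a derived symbol only stores what it overrides, and lookups walk up the chain. This is a toy sketch of the concept, not KiCad's actual data model; all names here are made up.

```python
# Toy model of nested symbol inheritance: a derived symbol stores only
# the fields it overrides; resolving walks up the parent chain, with
# the nearest definition winning. Illustration only, not KiCad's format.

class Symbol:
    def __init__(self, name, parent=None, **fields):
        self.name = name
        self.parent = parent
        self.fields = fields

    def resolve(self):
        """Effective fields: parent fields first, overridden by ours."""
        merged = self.parent.resolve() if self.parent else {}
        merged.update(self.fields)
        return merged

# Three levels deep -- in v7 only one level of derivation was allowed.
base = Symbol("R_Generic", footprint="R_0603", tolerance="5%")
precise = Symbol("R_Precise", parent=base, tolerance="1%")
part = Symbol("R_10k_1%", parent=precise, value="10k")

print(part.resolve())
# {'footprint': 'R_0603', 'tolerance': '1%', 'value': '10k'}
```

The payoff is the same as in the talk: `part` never has to restate the footprint, and `precise` only restates the tolerance it tightens.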
Yeah, this is a pretty big one: there's now a tool to check for diffs against the library. So you're working in your schematic and you don't know if there's a difference. Say you run an ERC and you get a diff error or a diff warning that your symbol doesn't match what's in the library. There's a tool now that will show you the difference between the two objects, so you can decide whether you want to pull in the change from the library or just ignore it. Okay, next: you can now directly import Altium, CADSTAR, and Eagle symbol libraries into KiCad, like you could with some of the other formats, instead of having to convert them and then bring them in as KiCad libraries. You can just bring them in directly. There's a preview — is it going to show up? Yeah. So there's now a previewer for the library: as you mouse-hover over entries, instead of having to click each one and see it in the editor, you can just hover and say, yep, that's the one I want. Just some handy convenience features. Next, library file watching. I don't recommend anybody doing this; it's just for demonstration purposes. Here I'm hand-editing a symbol library file, and as you watch, it's updating: we now watch all the files in both the symbol and footprint libraries. If they change, it will tell you, so you don't save over the top of existing changes. Say you've got two people working on the same library at the same time — one of you is going to get a warning: hey, you're about to overwrite somebody else's changes. So we implemented that. We now have a simple single button: say you have a bunch of libraries that you've imported from Altium or Eagle or wherever; with one button you can now just save those as KiCad libraries, because one thing we can't do is write anybody's proprietary formats.
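At its core, the library diff check compares the properties of the symbol instance in the schematic against the library copy and reports what differs. Here's a minimal sketch of that idea; the field names and dict structure are illustrative, not KiCad's file format.

```python
# Sketch of a symbol-vs-library diff: report fields whose values differ
# between the schematic's copy and the library's copy. Field names and
# structure are made up for illustration.

def diff_fields(in_schematic, in_library):
    """Return {field: (schematic_value, library_value)} for mismatches."""
    changes = {}
    for key in set(in_schematic) | set(in_library):
        a, b = in_schematic.get(key), in_library.get(key)
        if a != b:
            changes[key] = (a, b)
    return changes

local = {"value": "10k", "footprint": "R_0402", "datasheet": "~"}
library = {"value": "10k", "footprint": "R_0603", "datasheet": "~"}

print(diff_fields(local, library))
# {'footprint': ('R_0402', 'R_0603')}
```

A result like this is what lets the user decide per field whether to pull the library's value or keep the local one.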
So it's a read-only library; but if you want to edit them in KiCad, you just save them as KiCad libraries and go on about your work. We now have differential cursors in the simulator. There were quite a few changes in the simulator; it's much improved. There are a lot of LTspice-like features that people are used to having in a SPICE simulator. Oh, come on. We can now directly import LTspice schematics. There's a caveat with that: you have to have LTspice installed, because the importer needs to go into the LTspice installation and get all the LTspice symbols. So if you want this to work correctly, you do have to have LTspice installed; you can't just take your LTspice circuit and import it without the rest of LTspice, because it references its own internal stuff. We go into the LTspice installation and extract all of that out to build the simulation, but it works very well. I mentioned earlier we got a bunch of SPICE simulator improvements. We have FFTs. This is a really bad oscillator here — it's fun to make. In a perfect oscillator you should only see that one spike there at the beginning and none of the other ones. But I did this one just for fun, because I know how to make bad ones. I've experienced that. And S-parameters and Fourier analysis. So most of the features that have been available in ngspice that we just hadn't exposed in the simulation UI are now available. One of the more requested features — come on, you can do it, there you go — is editable power symbols. Before, in KiCad, if you wanted to create a custom voltage, you had to go into the symbol editor, copy a symbol, and then change it. You can now do it on the fly right from the schematic editor without having to create a new symbol.
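The FFT feature he demos boils down to transforming a time-domain waveform into its spectrum: a clean oscillator shows a single dominant spike, a bad one sprays energy into extra bins. A minimal pure-Python DFT illustrates the idea (a real simulator would use ngspice's built-in FFT, and real code would use an O(n log n) FFT):

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform -- O(n^2), fine for a demo."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

# A clean "oscillator": a pure sine completing 4 cycles over 64 samples.
n = 64
clean = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
spectrum = [abs(x) for x in dft(clean)]

# Only bin 4 (and its mirror at n-4) carries energy: one spike,
# exactly what a perfect oscillator's spectrum should look like.
peak = max(range(n // 2), key=lambda k: spectrum[k])
print(peak)  # 4
```

A distorted waveform (say, a clipped sine) would additionally light up the harmonic bins 8, 12, 16, ... — the extra spikes Wayne's "really bad oscillator" shows.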
We vastly improved the importers for SVG and DXF — or, I'm sorry, we now actually have SVG and DXF importing in the schematic editor. It was only in the board editor and the footprint editor before, but it's now in the symbol and schematic editors too. We can export to Cadence Allegro PCB Designer, so for those of you who don't want to use our board layout package, you can now export the netlist to Cadence. And we've switched printing over: a long time ago we switched board editor printing to Cairo, which can do things like alpha blending — that's more important for the board editor — but we've also just recently switched over the schematic editor. So if you have bitmaps with alpha, instead of being printed with no alpha, they'll be printed with alpha blending. Okay, that's it for the schematic editor. It got the bulk of the love for version 8, but there were still a lot of changes in the board editor. We have the same tool that we have in the schematic editor to check footprints on the board against the library. You get a visual diff, just to see what the difference is before you accept any changes. You can now — come on — you can now directly import Altium footprint libraries. Before, the Altium board importer would import them and automatically convert them to KiCad formats; now you can just import the library as it is. SOLIDWORKS PCB files: we now import those too. That was fairly easy, because they're basically Altium PCB files, but we can now import SOLIDWORKS PCB files. There's a do-not-populate (DNP) flag for the board editor now, so that when you export your position files, it won't export the parts you don't want your pick-and-place machine to populate. We now allow connectivity on any arbitrary shape.
So you can draw a shape on copper — any shape you want — assign a net to it, and it becomes a trace or a zone; you can basically draw any arbitrary shape and give it a net name. We've added some major improvements to the interactive meander tuning; here's the new properties dialog that allows you to set the parameters when you're doing your meander tuning. There are a bunch of STEP export improvements, including, if you really want ridiculously large STEP files, you can actually export the pads and the traces and the vias. Your STEP files will be gigantic, but you can do it now. That was a feature people requested, but be prepared: you're going to have some big STEP files. The properties panel, again, is now in the footprint editor; you can see down there, I click on an object, I get the properties, and I can just edit them in the properties editor. We also have the hover preview — the flyover preview — in the footprint editor as well. We now export to IPC-2581. I know this isn't supported by a lot of manufacturers yet, but we're now in a position where, when it becomes more widely supported, we'll already have it in KiCad. And there was a whole host of 3D viewer improvements — things like the visibility panel, so you can turn layers on and off, and a bunch of other stuff in the 3D viewer that was massively improved. I'd like to thank Roberto, because I shamelessly stole this from his presentation at KiCon. This is a matrix of the changes for the importers for third-party tools. Everything in blue was in seven; the orange is now included in eight. We have a few gaps — the gEDA and P-CAD importers — and we still need to do project support for Altium. So right now you have to import the schematic and the board and sync them up together so that KiCad is happy, but I think by nine we're going to have project support for Altium.
So here's some fun statistics. Between version seven and now, the source repo — and it's actually more than that now — had 4,500 commits by 15 different authors. KiCad sits at 1.63 million lines of code without translations, and another 176K lines of comments. So we're rapidly approaching 2 million lines of code. The library team has just been busy: in the last year we've added 1,207 new symbols, for a total of just over 20,000. I'm sure it's more since I made this slide — Kliment, who's not here, is shaking his head. In the footprint library, we added 713 for a total of 13,454. And just to give you an idea of how significant that number is, I was informed this morning that one large, well-known component distributor doesn't have that many footprints. We actually supply more footprints than some of the distributors do. I don't think I have permission to say who it is, so I'm just going to throw that out there, but it shows you how massively improved the KiCad libraries are. We added 238 3D models for a total of 6,700. We did slip a little bit in our language translations: for v7 we had 17 languages that were 99% translated; we only have nine for the v8 release, but I'm hoping that situation will improve as v8 gets out there. I don't know how many people saw this, but Felix at AISLER posted KiCad usage numbers, and this actually shows the growth from 2020 to now. In 2020, I think we were roughly in the mid-20s percent of their orders, and we've continually grown; now we're at 42%, but I've heard recently somebody say something like 50% in the last month. So you see all the other EDA tools going down and KiCad going up. That's a nice trend. I like that trend. And OSH Park also shows similar trends: they're seeing KiCad usage go up from their customers. Now, that's not universally true.
There are other board vendors, I'm sure, that have different statistics, but most of the board vendors that we interface with directly are seeing those kinds of numbers. So that's really good. I'm going to blow through this quickly — I apologize for not having a lot more time — but here's what's coming in v9. We're going to have IPC support, for inter-process calls. One of the things that KiCad has always had an issue with is our Python scripting. People call it an API; technically, it's not an API, it's a wrapper around the internal KiCad API. So any time something internal changes in KiCad, we rebuild the Python bindings and we break stuff. We are now working on an IPC interface that will act as a go-between between any high-level language, including Python, and a running KiCad instance. As you may know, you can actually bring KiCad down with a rogue Python script. We're going to try to fix that in nine as best we can. And at some point in the future, we'll deprecate the scripting stuff and make everything built on top of the API, just to eliminate those kinds of issues. So you'll have a stable interface: when you write a Python script using the API, it's not going to break the next time you compile and rebuild KiCad. One of the things that's been requested is a customizable interface, including toolbar layouts. We're going to try to get that done in nine. There's some talk about doing a visual diff/merge tool for Git, so you'll see the visual difference between versions of the schematic or the board before you change or merge. Say you have a merge conflict with somebody else: you can look at the diff and say, oh yeah, I want mine, or I want theirs. We've been getting requests for embedding licenses in project and library files, so we'll implement that. Support for barcodes. Multi-user editing is something we've been discussing, whether or not that happens.
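The point of an IPC API is that scripts run in their own process and talk to the application over a socket, so a buggy script can't take the application down with it. The sketch below shows one generic way such a channel can be framed — length-prefixed JSON messages — demonstrated with a socket pair in a single process. This is NOT KiCad's actual protocol or API; the message shapes are invented purely to illustrate the decoupling.

```python
import json
import socket
import struct

# Generic length-prefixed JSON framing: the kind of transport an IPC
# API can use so scripts live in a separate process from the host app.
# Not KiCad's real protocol -- an illustration of the idea only.

def send_msg(sock, obj):
    payload = json.dumps(obj).encode()
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock):
    (length,) = struct.unpack("!I", sock.recv(4))
    return json.loads(sock.recv(length))

# Simulate host <-> script with a socket pair in one process.
host, script = socket.socketpair()
send_msg(script, {"cmd": "get_board_name"})   # the script asks
request = recv_msg(host)                       # the host receives
send_msg(host, {"result": "demo.kicad_pcb"})   # the host answers
response = recv_msg(script)                    # the script receives
```

Because the only coupling is the wire format, the host's internals can change freely without breaking scripts — which is exactly the stability problem the talk says the SWIG-style wrapper has today.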
That's a big one, but it's something that we're looking at. And of course, I talked about the PADS and gEDA importers. One of the things that people have asked us about is being able to save in the old file formats. Historically, KiCad has not allowed that, but we're actually thinking about it. And also forward file compatibility: right now, if an older KiCad opens a newer file, it just says, well, I don't know what to do with that, so it doesn't do anything with it — and if you save it, you're going to overwrite and lose things. With forward compatibility, the old version of KiCad will still be able to open the file; it's just that some of the things won't work. A lot of other applications do similar things. For the schematic editor, there's actually a merge request now for a tool that synchronizes sheet pins and the hierarchical labels in the schematic they reference. So you can do this updating bi-directionally. We're also going to replace something: right now, we allow sharing of schematics between projects. We're going to stop that and go with reusable design blocks, because that particular feature causes us so much grief that we decided it wasn't smart to continue to support it. Design blocks will give us the same capability. We're going to do variants for schematics in nine. I don't know if we'll get board variants, but at least we'll get schematic variants. A Bézier curve editing tool: people know there's Bézier curve support in KiCad, but there's no tool to edit them. So we're going to do that. For the board editor, there's also now a zone manager — that's also a merge request that's ready to go as soon as we release eight. It allows you to edit all your zones in a single interface: instead of having to open each zone individually, one dialog at a time, there's now a zone manager for all of them. Multi-channel designs — that's in progress. Pad stacks.
We're really hoping that one gets in, because that's one of the feature-parity issues we have. When we try to import from other tools that support pad stacks, we can't import them; we have to make an assumption based on a best guess, and you get the pad stack that KiCad can support. So we're going to do pad stacks. Guard rings — that's a feature for those of you who do high-impedance stuff and want to guard your high-impedance circuits against leakage currents. Those are useful. Right now our router doesn't really make it easy for you to design a guard ring. And we're also going to do the Bézier curve editing tool in the board editor. Somebody's working on a table tool right now. That'll be in the schematic and maybe the board editor; I don't know whether that's actually going to happen, but hopefully it does. It'll just be a table like any table in your favorite document editor — native table support. We want to embed 3D models into the footprint, so you don't have to keep your 3D models external to the board. They'll just be embedded in the board, and when you take the whole board with you, you've got all your 3D models. ODB++ export — this one's also already in progress. Our friends at Huaqiu/NextPCB are working on that, because their infrastructure uses ODB++; that's what they use when you order boards. They prefer it over Gerber, so they're going to provide ODB++ support. So if your favorite board manufacturer is an ODB++-only shop, you'll be able to export that. Okay, that's it. Just a quick wrap-up here. I get to stand out in front of the team as the project lead, but it's an incredible amount of work, and I always want to say thanks to all our developers who contribute to KiCad. It's really gotten impressive — the amount of contributions just keeps going up. It's really, really encouraging, and it's fun for me as a project leader to see that happen.
Thanks to all our sponsors and donors. If you've contributed to the KiCad donations, thank you very much; that sustains the continued growth of KiCad. Thank you for your continued support of the KiCad project, and to everybody who uses KiCad: thanks. We really like the fact that you use KiCad, and we hope we can continue to support your needs as a project. Anybody who's ever organized a dev room or anything like this knows it's a nontrivial amount of work, so thanks to Seth for organizing this; it doesn't happen by itself. I hope I get to see everybody here next year, and I hope I get to see everybody at at least one of the KiCons this year. So keep an eye out. The one in Europe this year is going to be in Germany. We don't have a date or a venue yet; we have people on the ground who are working on it, and as soon as we have that information we'll put it up on the KiCad website and on the forum, so keep your eyes open — hopefully we can see as many of you there as possible. Early September, in Bochum. Is that when it is? Early September? Okay. And I'm not 100% sure we're going to have a Shenzhen/Asia one this year, but I suspect we will. Has Hubert committed to that? We are going to have KiCon Asia; it's just waiting on the timing — we're coordinating with Maker Faire Shenzhen, so we're going to be on the same weekend as Maker Faire Shenzhen. So yeah, if you want to go, that's a great dual hit, because if you've never been to the Shenzhen Maker Faire, it's really impressive. You should go if you get a chance. Okay, I'm open for questions. If anybody has any questions — no? Thank you. I had a question about the libraries. Are there plans to move the libraries into a Git repo, so that a project could import the libraries directly from it? Yes. Not right now? So the question is: are there plans to allow importing library objects directly from a Git repo?
Because all our stuff is basically saved in a Git repo, because that's where we design it. You can import the project libraries, but not the globals. Yes — not the globals. So yeah, with the Git support, obviously, the libraries that are already in your project will be part of your Git project. But externally, no, we don't have anything at the moment for that. But I mean, if somebody wanted to spin up a Git plug-in, that wouldn't be... no, I wouldn't turn you down, because I think other people would probably like similar things. Do you have any plans to integrate some sort of mixed-signal, real-time interactive simulation — kind of like Multisim, basically Multisim? Well, okay. On the simulator front, we've had a lot of fits and starts. I wish I had a rosier outlook to give you on that. We had some people working on EM simulations: we were going to take the board, break it down into its 3D representation, and then do EM and maybe a power solver. But there are several things on that front that make it difficult. The most difficult thing is finding the manpower, because that's a pretty specific kind of skill — you have to have pretty good knowledge of how to do that. The other problem is that a lot of the libraries that do that in the open-source world — because we are an open-source project, obviously, we're not going to use something like MATLAB — don't necessarily build well or play well on all platforms. And one of the things about KiCad, if you're not familiar with it, is that we don't make second-class citizens: all the major platforms are considered equal. If I can't provide a feature on Linux or on macOS, I'm not going to do it on Windows. It's got to work on all three. So that's been a bit of a sticking point; we've had a little resistance there. I don't think that problem is insoluble — I think it is solvable.
So the person who's implementing that has not only got to do the hard part — the solvers — they've also got to get all the libraries and dependencies they need to integrate into KiCad building on all three platforms. And that's a bit of a load. So I do think it's going to happen at some point. Obviously, it's never going to be as fast as I want it to be, but it is there in our big wish list of things we want to do. It's just whether we get the manpower to do it. Any other questions? Am I done? One more. Yes, go ahead. Okay. Congratulations on all the amazing work — thank you — to the contributors and maintainers. I want to ask about one of the planned features for the next release you talked about, the Git diff/merge tool. I think it would be amazing if the command line tool could export a diff image. Export? You mean the command line tool would export the diff? Yeah — like, I don't know, a GIF animation, something like that. So when somebody comes with a pull request, you could see what's changing without needing to download or open anything. I mean, just an idea. That's not a bad idea. What, like a PNG? We'll open an issue for it and we'll see about that. Yeah. If there are any more questions, Wayne will be out in the hallway to answer them. So thank you once again, Wayne. Thank you. Thank you.
LibrePCB Status Update
Hello everyone, my name is Urban Bruhin. I'm the founder and main developer of LibrePCB, and today I will give you a short update about the LibrePCB project. For those who do not know LibrePCB yet, it's an open-source EDA software to draw schematics and design PCBs. The main goal is the same as KiCad's, but there are some differences. It is of course cross-platform — it runs on almost every computer: Windows, Linux, macOS and more — and its main goal is to make creating hardware easy, efficient, and more or less foolproof, with an intuitive user interface and powerful architectural concepts. While the intuitive UI is especially helpful for beginners to get started easily with PCB design, it's also intended for professional users, for example those who care about things like a sane file format or a command line interface to automate some tasks. So let's take a look at what happened in the past one or two years, because there is some great news. The end of 2022 was an exciting moment, because I started to work full-time on LibrePCB — I've now been doing it for a bit more than a year — and of course this leads to a lot more progress than in the many, many years before. In addition, the LibrePCB project has been approved by the NLnet Foundation to receive funding through the Next Generation Internet programme, which helps a lot to keep the full-time development going. Then our fabrication service got PCBWay as a new manufacturing partner, so if you order PCBs through LibrePCB Fab, you can now choose between AISLER and PCBWay. Also, I'm very proud to have several new sponsors on board from last year: Bittele Electronics, NextPCB, Partstack, PCBGogo, and WinSource. Last but not least, there are many individuals supporting the LibrePCB project with donations or other kinds of contributions, for example translations or creating libraries and so on. With these sponsorships and the donations, the LibrePCB project raised around $8,000 in 2023.
In my opinion, that's already quite amazing for this still relatively early state of the project. So at this point, I want to thank all the supporters and contributors for your trust in the LibrePCB project. This really makes me happy, so thank you very much for this support. I take it as a sign that LibrePCB is on the right track, so I hope it's okay to continue this way. Nevertheless, it's still a very long way to go until we have stable funding for the full-time development, so I hope this support continues for many more years. Other things which happened besides the application development are a completely new website with much more content, a new documentation system with more documentation, and for a few months now we also have official video tutorials on YouTube. Not complete yet, but at least a few of them now. But now let's take a look at the application. In September last year, version 1.0 was released, which was a very exciting moment. And besides many new capabilities in the board editor, like thermal relief pads and so on, this release also added a 3D board viewer with STEP model import and export, which is not only fancy, but also a great way to review the design before ordering the PCBs. But actually, a 3D viewer is known from many other EDA tools; probably every EDA tool is able to show such a preview. I'm actually especially proud of two features which make generating production data really a pleasure. First of all, we have introduced comprehensive support for assembly variants and manufacturer part number management. So, MPNs can now be stored in libraries, so you don't need to add them to every new schematic where you need them. In the schematic editor, you can even assign multiple MPNs to one component, to export them as second source parts to the BOM. I mean, who didn't experience any supply chain issues in the last few years? So, it's nice to actually be able to specify second source parts. 
And you can even specify different parts for different assembly variants. For example, assemble a 10 kΩ resistor in one assembly variant and a zero-ohm resistor in another assembly variant. And to actually make generating these BOMs and any other output data a matter of seconds, we introduced output jobs as a new, unified way to export any data. These output jobs can be configured very flexibly and are stored within the project, so exactly the same output files can be reproduced on a different computer; you don't need to configure anything again. And since there is a command line interface, it's also very easy to fully automate the production data generation, for example if you like to use a continuous integration system. So now, a short demo is worth more than a thousand words, so I would like to quickly show you a few of the features. I hope this works. Okay. On my screen it looks completely different, but okay, I think you understand what should be there. Right? The first is the 3D viewer. Let's see if it actually... yeah, more or less. Okay. Strange. So, I just want to show you that the 3D feature is very, very easy. Actually, you don't need to care about it. You just add a resistor or whatever to the schematic, and our libraries have the 3D models built in, so you don't need to care about them. Then you add the part to the board editor, let's say a THT variant, and it immediately appears in the 3D view with a 3D model. And it's even possible to switch between different footprints, for example a different pitch, and the 3D model is automatically updated to the new footprint. Or, for example, a vertical mounting variant. So you actually cannot even do anything which isn't compatible; it's always assigned to the footprint you choose. So, yeah. Now let's take a quick look at MPN management. I mean, in the most simple use case, you just want to add some component. 
And you now have the option to actually choose a concrete MPN, because they are now stored in the library. So you add a component by its MPN, and let's quickly also add it to the board to actually make it appear in the BOM. And when you export the BOM then — I think it was LED3 — it immediately appears with the MPN you just assigned. So it's very easy to generate high-quality production data. And another use case, for example, as I mentioned before: if you want to add a second source part, you can just choose a different part, let's say from a different manufacturer, and add it to the same component. It is listed as an alternative part now. And if you export the BOM now, you have a new column with the second source MPN. So there is no need anymore to manually adjust the BOM after generating it, before sending it to the assembly house. You can generate it completely finished; no manual rework needed anymore. Then, to actually generate the BOM, you can use the output jobs feature I just mentioned. Every job means one or more files which are generated. For example, there is one job to generate Gerber files. And if you, for example, like to send the Gerber files in a zip file to the manufacturer, you can just add a zip output job and choose that you want to have the Gerber files in the zip file, maybe also the assembly PDF. The output jobs are stored in the project, so you have to configure this only once. And now you can generate production data, for example for single jobs — just double-click the job, and the files are generated and opened — or you can generate all data at once. And you get, for example, the zip file you just configured, containing the Gerber files and the assembly PDF, just like you want to have it to send it to the manufacturer. So no manual file editing or archiving is needed anymore. 
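The BOM structure just described — one extra column per second-source MPN — can be sketched in a few lines of Python. This is a hypothetical illustration: the column names and part numbers are made up and are not LibrePCB's actual CSV schema.

```python
import csv
import io

# Hypothetical BOM rows: each component carries a list of MPNs, where the
# first entry is the primary part and any further entries are second-source
# alternatives (as demonstrated in the talk).
components = [
    {"designator": "LED3", "value": "LED red", "mpns": ["LTST-C190KRKT", "KP-1608SURCK"]},
    {"designator": "R1",   "value": "10k",     "mpns": ["RC0402FR-0710KL"]},
]

def export_bom(components):
    """Render a BOM as CSV with one extra column per second-source MPN."""
    max_mpns = max(len(c["mpns"]) for c in components)
    header = ["Designator", "Value"] + [f"MPN {i + 1}" for i in range(max_mpns)]
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    for c in components:
        mpns = c["mpns"] + [""] * (max_mpns - len(c["mpns"]))
        writer.writerow([c["designator"], c["value"]] + mpns)
    return buf.getvalue()

print(export_bom(components))
```

With a second source assigned to LED3, the exported table gains an "MPN 2" column and needs no manual rework before it goes to the assembly house.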
So, if you make any change to the project, one click and you have all files updated. But of course, not everyone likes to manually generate output files, even if it's that easy, because there is an even easier option available. If you don't like to care about all these things, just start ordering your PCB right from within the application. It's uploaded to our fabrication service website. You even get ERC warnings if you didn't resolve them in your project yet. You can choose your manufacturer and are just forwarded to the manufacturer you like. And without handling any files manually, you have your project... okay, I was too fast. You have your project ready to be ordered. Just enter your shipping address, payment information and so on. That's it. So, let's switch back to the slides. Okay, so now what's the overall state of the project? Generally, LibrePCB is fully functional and can be used productively for projects which are not too complex. Not too complex, because hierarchical schematics and buses are not supported yet, and the trace routing tool — and actually the board editor in general — is still rather rudimentary, so from time to time it might be a little bit inefficient. And of course, the parts library is always a problem. It's not very comprehensive yet, but at least with LibrePCB it's very, very easy to create the missing parts by yourself. So, a quick outlook now. The upcoming release will contain an EAGLE project importer, so it can import complete EAGLE projects. And there's also some work ongoing currently to integrate live part information into the application: when you add a component to the schematic, you should immediately see the part lifecycle status, stock availability and the price. So, this will be very useful. 
So, I hope we can make it happen. And yeah, it's clear that from time to time some technology updates are needed, for example switching to Qt 6. And for the long term, as I mentioned, the trace routing tool needs some improvements, and also hierarchical schematics and buses — I think these are a must-have. Yeah. So, if you like to support my effort in creating an easy and powerful EDA software for everyone, I would be very, very thankful for a donation, to keep the full-time development ongoing as long as possible. And there are also many other ways to contribute; just check out the link here. And if there is any Wikipedia author here, please let me know: we are looking for some help to publish a Wikipedia article. And please let us know your feedback in the feedback survey. So, yeah. The slides are online. Here are some links to get easily started with LibrePCB. That's it. Thank you very much. Thank you. Thank you for the presentation. I'm using Altium Designer and KiCad, and I work at a shop where Mentor was used. How is the state of the import of Altium, KiCad and Mentor? It doesn't exist yet. Do you have plans to implement either of those? Any plans to implement these imports? I think KiCad import would be quite obvious. For the other ones, I don't know yet how much effort is needed, how well these file formats are known or not known, how to read them. I think some day we will look at the imports, but it's of course not a high priority. So, did you encounter any problems with patents or something during your development? Because I'm developing a clone of a commercial software where I'm dealing a little bit with some patents that I might violate. Sorry, I didn't understand. Patents. Did you have any issues with those, like registered patents of companies? So far, I didn't have any problems with patents, but I'm not an expert in this area. 
So, I just tried to take care of the licenses of things I use, to hopefully not do anything against the license terms. Any other questions? Okay, thank you, Urban.
ngspice circuit simulator - stand-alone and embedded into KiCad
I translate. No, no, I directly plugged into the laptop. Okay, so we are going to continue on. The stream going out that is being recorded looks nice, so the rest of us, we're going to suck it up and just listen to what we are here to learn from Holger Vogt. So please give a round of welcome to Holger. Yeah, okay, so many thanks. ngspice, a circuit simulator: this talk is about stand-alone use and embedding into KiCad. Well, I give a short introduction to circuit simulation, then talk about what's new in ngspice, talk about the KiCad-ngspice interface and give some simulation examples, and conclude with what is next. Yeah, why circuit simulation? You emulate electronic circuits per software. It should be cost-efficient and time-saving. That's it. Some details: of course, you can check functionality without making hardware. That's very important if you do IC design, because fabricating an IC with a defective circuit is very expensive. You can check for parasitic elements. You can make variants very easily. You can change some device parameters and see what is happening. You can evaluate new concepts without too large an effort. You can cross-check against automatic circuit generation as a final simulation test. You can anticipate reliability and make degradation simulations. And it's a good learning experience, because you can look into a circuit without using hardware to do so: you can see the voltages and currents in the different branches. Very interesting. Yeah, ngspice, what is it? It's a circuit simulator that numerically solves equations describing electronic circuits. It can also be other types of circuits, for example thermal, could also be mechanical. And you are mostly interested in time-varying signals; in electronics, that's currents and voltages. It's the open-source successor of the venerable SPICE3 from Berkeley. Okay, we have a circuit. This is a very simple circuit, an inverter with two transistors, and this is the entry to ngspice. 
So ngspice is a command line input tool. Many people say, ooh, command line. But as we've just learned, the command line is very nice; KiCad has got a command line, and other software also. So we are not too bad with that. Okay, you have the netlist, the SPICE netlist, which contains the circuit description, power supplies, transistors, some simulation commands to run the thing, and some model data. The output is graphical indeed: a time axis and a voltage axis, the ideal green input — yeah, it's still green — and the simulated output; you see the inverted signal. Yeah. This is the ngspice user interface. On the input side, you put in the circuit netlist, the circuit description. You put in models or model parameters for the devices you're using in your circuit, and you put in simulation commands. The output could be data tables, or tables to file of course, or graphical plots. We use the venerable X11 interface or the native Windows plotting capability, or you can plot to PostScript or SVG, or use Gnuplot or other tools for output. Yeah, what's new in ngspice? The current release is ngspice-42, released on December 27th last year. I will talk about these things a little bit more in detail in the following. We have a new matrix solver in addition to the venerable Sparse 1.3. We support Verilog-A coded compact device models. We allow co-simulation for mixed-signal simulation with Verilog digital circuit blocks, so mixed-signal digital-analog parts within ngspice. We allow co-simulation, again mixed-signal, with C-coded digital blocks: there is a way to translate C code into ngspice-readable shared libraries. And we are benefiting from the vastly improved graphical user interface that KiCad, especially the upcoming KiCad 8, is offering for using ngspice. Well, the matrix solver. What is the circuit simulator doing? 
The circuit simulator, if you look inside: ngspice gets the circuit, makes a setup, parsing the netlist, reading the model files. And then, if you do a transient simulation — simulation versus time — you have this inner loop here between model equation evaluation, where these data go into the matrix, and solving the matrix. And then you go for the next time step, and you repeat this until the time is over, and you look at the output. The model evaluation is already running in parallel in ngspice. We use OpenMP, so if you have a multi-core processor, as you typically have today, you benefit from that. The matrix solving is not parallelized; these sparse matrix solvers are difficult to parallelize. So we had been looking for a long time for an additional matrix solver. We used Sparse 1.3, developed in 1986, and now we have an additional, optionally selectable KLU matrix solver, which is under ongoing development by T.A. Davis and his co-workers. And with KLU you get a speed-up of the simulation by a factor of 1.5 to 3 if you have large circuits, especially if you do circuits for IC simulation. And this is, of course, an advancement. We allow Verilog-A compact device models in ngspice. Compact device models, these are the model equations describing modern transistors, for example. These complex, tiny things like FinFETs have 500 parameters or so and lots of differential equations to describe them, and people do the development in Verilog-A. So we had a real need for an interface to this Verilog-A, because it provides access to modern devices like BSIM-BULK, which is for ultra-short channels, or BSIM-CMG, which is for FinFETs, or models for gallium nitride devices, power devices, high-speed bipolar transistors, and so on. Yeah, and we got this set up in cooperation with the company SemiMod, who did this open-source development. We have the Verilog-A model description. 
We compile this model with the open-source compiler OpenVAF, compile it directly into a shared library, and this shared library can be read by ngspice, which has got the OSDI interface. So we are reading the compiled Verilog-A model directly from a shared library or DLL. Yeah, we make use of this. For example, as I maybe mentioned already, open-source PDKs for IC design are upcoming, and one of these is the IHP open-source PDK. This is a 130-nanometer CMOS process with integrated ultra-fast bipolar transistors; ultra-fast means 500 GHz or so. The model used for the bipolar is the HICUM model, which has been integrated into ngspice for some years now, and the MOS model is the PSP model, currently developed, I think, by Leti in France. This is Verilog-A, and we translate this and put it into ngspice, and so we can support this open-source PDK with simulation. This is just a simple example: a 19-stage NAND gate ring oscillator. We have 19 NAND gates in series, feed them back, and it starts to oscillate, and we have here a frequency — this is an FFT of the signal — a frequency of 600 MHz, and you divide by 19 and by 2, and then you get an inverter delay of 280 picoseconds. Okay, yeah, we allow digital Verilog circuit blocks in ngspice. It looks a little bit more complex, but isn't that complex. We have a Verilog digital circuit block. We compile this with the open-source compiler Verilator into some intermediate C code, and then we compile this intermediate C code, with some C templates in addition, which are always the same, with GCC or MSVC into a C-coded shared library. And this shared library is read by ngspice. ngspice has a so-called code model interface, and we have written a code model, d_cosim, which directly interfaces this shared library. So we can now run a simulation with a standard ngspice netlist, which may contain lots of analog, plus digital blocks. 
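As an aside, the ring-oscillator arithmetic mentioned above follows from the period being two full trips of the signal around the ring, so f = 1/(2·N·t_d) for an N-stage ring. A small sketch with generic numbers, not tied to the exact figures quoted in the talk:

```python
def stage_delay(n_stages, osc_frequency_hz):
    """Per-stage delay of an N-stage ring oscillator.

    One oscillation period is the edge travelling twice around the ring
    (once inverting, once restoring), so T = 2 * N * t_d and
    t_d = 1 / (2 * N * f).
    """
    return 1.0 / (2 * n_stages * osc_frequency_hz)

# Example: a 19-stage ring oscillating at 100 MHz (generic numbers)
td = stage_delay(19, 100e6)
print(f"per-stage delay: {td * 1e12:.1f} ps")
```

Dividing the measured oscillation frequency by the stage count and by two is exactly this relation rearranged.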
This is an example — it's just a demo, not a productive simulation. This is a successive approximation register analog-to-digital converter, six-bit. And this uses a digital SAR block written in Verilog, with the analog part, which is a capacitor array with some switches. Okay, and even if things look complex, using this is not very complex. You need two commands. You have this command, ngspice, and ngspice calls a script written in the ngspice control language, and you enter the ADC Verilog description. It compiles the Verilog code, runs the GCC step, and then you call the SPICE netlist with the standard command, ngspice adc.cir, which contains the analog part and the simulation control, and then you get this kind of thing. Okay, I just enlarged it a little bit. You see that it's a successive approximation. This is the ramped-in voltage, and the x-axis is time. And here is a new start: we try to get the value of this point here. It starts with the starting value and then successively approximates the input. With a certain delay — 8.5 microseconds here, which is the time you need for the conversion — you are here in the stable phase, and this, the red line just shifted by 8.5 microseconds, is the output signal. Yeah, so digital plus analog. Okay, you can also do this with C-coded digital models. You have C-coded independent processes. You compile them with GCC, for example, or with any C compiler. And these communicate with ngspice via another code model; this digital interface is called d_process. Well, this was developed by Uros Platise from Isotel some time ago, but for the recent version we have adapted it a little bit, modified it so it will also run under MS Windows. And now we can simulate a circuit which has some circuit blocks from C code. This is again just a simple example. The C code you see here is a Gray code generator. 
This Gray code generator is compiled and loaded into ngspice, and this is the output. The plotting here is by GTKWave, because this gives a nice digital plot. Yeah, and you can use these kinds of blocks. So you define these compute functions with data out and data in and some others, and the time or the clock signal going in, and you can run C-coded digital circuits. Okay, so now I want to talk about schematic entry for ngspice, because this is under continuous development and it's a nicely usable thing. Why do we want to have such a graphical user interface? Well, a netlist as input quickly becomes confusing. You need schematic entry. You need to see circuit schematics, and then have an interface to the simulator. You get better documentation, of course, if you group inputs and outputs. This is not an ngspice development — we, for ngspice, don't develop these graphical user interfaces; we make use of existing ones or support their development. And of course you need one, because most of the other simulators have one, so you have to offer one. There are three of these interfaces currently under development with which we cooperate. There is a tool called Xschem, whose main focus is on IC design. There is another one, Qucs-S; this is a very universal interface which specializes a little bit in RF simulation. And then, okay, we have KiCad. I wouldn't say that KiCad is developed to be a graphical user interface for ngspice — no, the other way around: you have heard about this PCB design and layout tool, and it offers simulation, and the simulation engine is ngspice, to support the circuit designer. So of course I can then make use of this beautiful interface. Okay, the slides just show these interfaces in strange colors. I won't talk about these; I want to talk about this one, again in strange colors. Okay, but you can imagine that it could look nice. 
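The Gray code generator from that C co-simulation demo can be mimicked in a few lines. The classic binary-reflected encoding n XOR (n >> 1) is assumed here; the talk does not show its exact C source.

```python
def gray_encode(n: int) -> int:
    """Binary-reflected Gray code: adjacent values differ in exactly one bit."""
    return n ^ (n >> 1)

# The 3-bit sequence a generator like the one in the demo would step through
codes = [gray_encode(n) for n in range(8)]
print([format(c, "03b") for c in codes])
# → ['000', '001', '011', '010', '110', '111', '101', '100']
```

The single-bit transitions are exactly what makes such a signal pleasant to inspect in a digital plot like GTKWave.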
This is the Eeschema window with some circuit — a simple phase shift oscillator, oscillating at 4.2 kHz. And down here you see the FFT. Of course, you see it's not a super clean sinusoidal signal, but okay, this is the 4 kHz peak here. Yeah, so what does the interface look like? Eeschema does the schematic entry. Eeschema generates the SPICE netlist, and Eeschema also does the graphical presentation of the results. It sends the circuit netlist to ngspice, it sets model parameters in ngspice and the simulation commands, and it gets back the simulation results. ngspice is used here as a shared library in the KiCad process. Yeah, I would like to make a live demo. I don't like these colors, but let's see if we can survive somehow. Okay, this is my starting template — I do not start from zero because it takes too much time. This should become an operational amplifier circuit, a simple thing, an amplifier by a factor of 10. Okay, what is missing is the operational amplifier. I try to grab it from the library. So, we just load the library — it takes a little bit of time, but only the first time, then it gets faster. I know that it is in the library Simulation_SPICE, and here is the op-amp. I grab it, and I move it, and hopefully it fits, because, yeah, it did last time. Yeah, it does. Okay, so this is how you place additional elements. Very simple. But now, let's stop; we don't need any more, I hope. Yeah, and now we do the simulation. This is a real live simulation. I go to Inspect, Simulator, and I get this simulator interface. Well, black is green, and pink is white — okay, I'm sorry for that. What do we want to do? We want to do the transient simulation. The transient simulation shows output and input versus time. Okay, and so, what is our input? Let's go back and have a look. The input is a sinusoidal signal with an amplitude of 0.1 volt and a frequency of 1 kilohertz. 
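Since this is a gain-of-10 stage driven with 0.1 V, the expected small-signal numbers are easy to check by hand. A sketch; the pole frequency below is an assumed value for illustration, not something stated in the demo:

```python
import math

gain0 = 10.0     # amplification factor of the op-amp stage
v_in = 0.1       # input amplitude in volts (from the demo)
f_pole = 100e3   # assumed single-pole corner frequency — made up for illustration

# Expected output amplitude in the flat region: 10 * 0.1 V = 1.0 V
v_out = gain0 * v_in

def gain_db(f_hz):
    """Single-pole response: flat at 20*log10(10) = 20 dB, then -20 dB/decade."""
    mag = gain0 / math.sqrt(1.0 + (f_hz / f_pole) ** 2)
    return 20 * math.log10(mag)

print(v_out)                   # output amplitude at 1 kHz
print(round(gain_db(1e3), 2))  # ~20 dB in the flat region
```

Well below the pole the gain sits at 20 dB, and a decade above it the response has fallen by roughly 20 dB, which is the rolloff shape an AC analysis of such a stage shows.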
Okay, back to the simulator window, and I just click to start the simulation, and here is our simulation. The input is the small one, and the output is... the red, which stays red. That's great. The input is the red signal. Okay, so this is the transient simulation versus time. We could have another simulation — to be honest, I have prepared this. This is the so-called AC simulation, a small signal simulation versus frequency, so you see the frequency behavior of this kind of circuit. Yeah, we again run the analysis, and you see that the amplification is 20 dB, so it's 10, and it's constant, but the operational amplifier has one single internal pole, and so it goes down. Okay, so very quickly you just see what's going on. I think I have time to make an additional change. I put an additional capacitor in here. I select my capacitor, and I transform it because I have to rotate it. I put it just in here — let's do it in here. And I have to give it a value; I guess I take one microfarad. Yeah, and then we go back and do the AC simulation again. Oops, something changed. We still have this low-pass behavior, and now we also have some high-pass behavior for the low frequencies, due to this input capacitor. Yeah, so very quickly you do a small change, and with a simple click, we are there. Okay, so this is what I wanted to show live. Let's go back to the slides, and I'll give some more examples. Yeah, the first example — again, why do you want to simulate? This is a 2.5 kilowatt class D audio amplifier. And you would say, this is strange? No — go to Amazon and look for these kinds of amplifiers: 300 bucks. You can get a kilowatt amplifier today, because it's a digital amplifier. And, okay, what did I do to get this simulation? I made a symbol myself for this audio driver circuit, just drawing the symbol. And this audio driver circuit is also something I created myself, because it has the analog input. 
It has a pulse width modulator — this is the translation from the analog signal to a pulse-width digital signal. It needs something more: it needs complementary outputs, because we have two transistors here, and it has a dead time generator to avoid shoot-through. Because what will happen? You have minus 100 volts here, plus 100 volts here, and if you manage to open both of these transistors at the same time, you will see the result in the form of smoke. And so you have to avoid this. And, okay, there are some simulation commands in here. The input is 2 volts, again at 1 kilohertz. You see the power supply. The output load is a 2 ohm resistor. Well, and this is the output. This is the input signal, and this one is the output signal. Okay, and with double the frequency, you have the power signal, the blue one here. And if you do an RMS over this output power signal — you see here it's in kilowatts, up to 4.3 — you will get an output power of 2.6 kilowatts. The simulation has a great advantage: nothing explodes. You can just do it, and you can investigate the output filters and check loudspeaker models and everything, just by simulation. Of course, you can also do real amplifiers. Tiberio Vecol has made this Q17 amplifier, derived from the famous Quad 405 audio amplifier. You see lots of transistors in this thing in the output stage; the input is an operational amplifier — this is the modern contribution to the whole thing — and some voltage generators here. Well, yeah, and you can of course simulate this, and similar to our 2.6 kilowatts, here it's 100 watts, and what you see here is that at 300 milliseconds we switch the output load from 8 to 7 ohms automatically, to check what the output load would mean, and you see a little increase in output power. So you can model all these things and model the influences and so on. Okay, ngspice allows mixed-signal simulation. 
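The output-power figures in these amplifier examples follow from basic RMS arithmetic: for a sine of amplitude V into a load R, P = V²/(2R). A quick numeric check with generic values (the 100 V amplitude here is illustrative, chosen to land in the power class quoted for the class D amplifier):

```python
import math

def rms(samples):
    """Root-mean-square of a sampled waveform."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# One full period of a sine with 100 V amplitude into a 2 ohm load
amplitude, load_ohms = 100.0, 2.0
samples = [amplitude * math.sin(2 * math.pi * n / 1000) for n in range(1000)]

v_rms = rms(samples)            # amplitude / sqrt(2) ≈ 70.7 V for a sine
power = v_rms ** 2 / load_ohms  # ≈ 2500 W
print(f"{v_rms:.1f} V rms, {power:.0f} W")
```

Running the same RMS over a simulated output waveform is exactly the post-processing step used to read off the 2.6 kW and 100 W figures.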
Mixed-signal simulation means you have analog and digital circuits in the same simulator. You could also simulate the digital part like the analog part, but that takes a lot of time, and if you have more than a few gates it would be much too slow. So ngspice includes an event-based simulation, which is very fast, and this is a mixture. Well, this is the venerable 7400 series of devices. You have flip-flops here, you have some output decoders and some NAND gates, and you have some XOR and NOR gates. Yeah, and you can simulate this whole thing together. You see that this is mixed-signal, because we're using the digital output here for a delay line: we have an RC delay and another RC delay, and we have the original signal, and so this gives an output pulse of a specific width. This is the clock signal generated in this circuit. And this circuit here, which is shown, is for a rotary encoder — an encoder which gives optical signals when it's turned one way or the other. This is the digital output, again plotted with GTKWave, and you see here that the Q1 signal comes before the Q2 signal, because in the rotary encoder these two detectors are shifted a little bit, so you know that it is turning left, for example. And here the turning changes to the other direction, and you see Q1 coming later than Q2, and this is detected by this circuit. You have here the pulses, let's say for turning left, and then the turning switches from left to right, and you see the output pulses here for turning right. So this is mixed-signal simulation, and it is effective, because the whole simulation takes 25 milliseconds, so it's ultra-fast: click, and it's there. You can even run this on this computer here, which is not the fastest machine. And we can have pure digital simulation. I made a symbol for this up and down counter. 
You have the input clock, you have the input up/down signal, and here it's a 3-bit, 8-state counter. Inside of this is a state machine, and it's a very, very simple state machine. You have here the states from 0 to 7, so the 8 states. Here are the signals, you see, from 0, 0, 0 up to 1, 1, 1, and here is how the states are switching. If we are at state 0 and the input is 0 — input 0 means backward counting — then the next state is this one here. Or if the input is 1 and we are at state 0, then we go to state 1. If we are at state 1 and we count down, we go back to state 0. If we are at state 1 and we count forward, we go to state 2. So you can do very simple programming inside one of these code models used by the digital event simulator of ngspice. Well, and here is just the clock signal, and this is the up/down signal: we count up and count up, and then we switch to down, and then we count down, and we switch up again. So a very simple simulation, and the simulation time of this whole thing is a mere 37 milliseconds, so it's very fast. Okay, so much for the examples. What's next in ngspice? Here are listed some ideas, some more or less fixed plans, and some actual activities. We will do more tests with the open-source PDKs, supporting the SkyWater PDK and especially the upcoming IHP PDK, to support analog, mixed-signal and RF simulation for these kinds of designs. We will improve the RF capability by adding harmonic balance with a specially effective method, for example to simulate intermodulation of signals and so on. We will support reliability and degradation simulation. Well, nothing lasts forever — chips don't last forever — and people sometimes want to know how long they will live, and so you can try to model that, and this will be done here, hopefully within a funded project; this is very interesting. There has been the request for transient noise simulation. 
This is a difficult task, because we don't want to rewrite the complete simulator; we have to figure out ways, and again, it would be very difficult to do that. If somebody is interested in integrating this into ngspice, please let me know. We will improve the usability of the KiCad-ngspice graphical interface. Continuously, people are requesting things, and we are detecting things, and we can try to simplify things; we can try to support more of what ngspice is offering internally right now. For example, the digital simulation should be supported by having digital basic blocks as input, and digital plotting, for example, as output. And we have to enhance compatibility, because somehow we are competing against commercial simulators like LTspice or QSPICE, or PSpice, or HSpice, and whatever else. We cannot do this in full, but the basic things should be compatible. All four I have mentioned have slightly different input languages and slightly different models, and so you have to take care of this somehow. Yeah, that's it. That was the information I wanted to provide you with; here are some support websites if you need more details. Thank you. APPLAUSE So, while we are taking questions, the video team is going to try to repair the video locally, so your questions will not be able to refer to the slides. Hi Holger, you said something about the degradation of semiconductor devices. Would it be possible to simulate degradation caused by radioactivity? Yes, this is included in these development plans. Thank you for the presentation. A quick question: how do we input the state machine in the component? Is there a special window where we come and type it, or must the state machine be written in a .c file or something that we give to the component? Yeah — the question is how we can code the state machine in ngspice. The simple state machine I have shown is just a text file. 
This text file is loaded: you put into your SPICE netlist a single line with a specific model, and this model loads the state machine. That's it for the simple things. For complex ones, you could of course write state machines in C code if you want to; then you have to do this translation. My question is maybe a bit naive, but would it be at some point feasible to include the tracks or geometry inputs from KiCad in order to mimic the links that you place between your SPICE components? Please, it's a little bit... We have the track width, and we also have the PCB stack-up. Would it be somewhat feasible, from these geometry inputs, to associate a kind of approximation of the S-parameters of each line between the components? Yes, there is some work ongoing. It's not that intensive to use an EM solver — it's called Sparselizard — to extract these data from your lines in KiCad. I think it's a lack of manpower to make this a real tool. KiCad has added IBIS simulation, so you have IC output and IC input — only the output and input signals — and many semiconductor vendors offer these models. Then you could basically have a transmission line or an RC line in between to simulate the signal integrity. The problem is, as you said, to get these data from your PCB. Slowly, slowly moving on. Basically, yes, but this is a KiCad or Eeschema thing — it's KiCad work, it's not ngspice. ngspice takes the transmission line parameters, or takes the parasitic capacitances and resistances, and then does the simulation. So the EM data would have to come from KiCad? Yes, exactly. The EM data has to come from KiCad. I wanted to ask if anybody has used the C interface to, for example, make simulations of existing microcontrollers or things like that that you could have in your design. There has been some activity on this, very scarce. I think it's Uros Platise, from Isotel. Just look up his website, Isotel, and you can find some information on that.
There has been another guy — I think he has used Arduino interfacing to ngspice, but I don't know much about this work. Are there any dynamic languages that could be used for a model, or is it just compiled languages that have to be loaded? If you don't care about simulation time, for example, would it be possible to use any scripting language? Yes, there are various ways of making models. You have the very old approach, but this is compiled and static — it's compiled, it's there. You can do models with ngspice-internal nonlinear voltage sources, for example, and these are very dynamic. And many power semiconductor device makers make so-called subcircuit models, which are comprised of SPICE commands. These can be very complex, difficult to debug, but then you can do whatever you could imagine. Is it possible to perform simulations over PVT, so over process, voltage and temperature variations? Yes. And would it be possible to do this without changing any of the models themselves? Yes, this is the typical content of modern semiconductor PDKs when you think about IC simulation. The worst-case simulation or corner simulation is typically integrated. It's different model parameters: the model stays the same, but certain parameters are changed. So we have a question from online. Just a heads-up, we're still working on the video, so lucky for us, Holger is able to continue answering questions for the foreseeable future. Online they are asking: is any post-processing of waveforms, such as THD, FFT, etc., possible? FFT is standard in ngspice and is standard in the KiCad-ngspice interface right now. It's more or less two clicks and then you have it. ngspice has a very powerful scripting language — well, another language. It's not Python; it's a language which originated in 1990. So we keep it up and have more than 100 commands available, and you can do a lot of data processing with this scripting.
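To illustrate the kind of waveform post-processing mentioned here, below is a small sketch in Python with NumPy — not ngspice's own control language — computing an FFT and a crude THD estimate. The waveform is synthetic; all signal parameters are made up for the example:

```python
import numpy as np

# Synthetic "simulation output": a 1 kHz fundamental plus a smaller
# 3rd harmonic, standing in for a waveform exported from a simulator.
fs = 100_000                        # sample rate in Hz (assumed)
t = np.arange(1000) / fs            # 10 ms of data: integer periods per FFT bin
v = np.sin(2 * np.pi * 1_000 * t) + 0.1 * np.sin(2 * np.pi * 3_000 * t)

# Magnitude spectrum; bin spacing is fs / N = 100 Hz.
spectrum = np.abs(np.fft.rfft(v)) / len(v)
fund_bin = int(np.argmax(spectrum[1:])) + 1   # skip the DC bin

# Simple THD estimate from the 2nd and 3rd harmonic bins only.
harmonics = spectrum[[2 * fund_bin, 3 * fund_bin]]
thd = np.sqrt(np.sum(harmonics ** 2)) / spectrum[fund_bin]
```

With the amplitudes above the estimate comes out around 0.1 (10 %), since only the 3rd harmonic contributes. The same kind of binning and spectrum arithmetic is what the internal scripting language offers without leaving ngspice.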
So for example, classification into bins: if you do Monte Carlo simulation, you can run the Monte Carlo simulation and classify the data into bins. You can do a lot of post-processing internally in ngspice. Well, of course, if this is not enough, or you want to use standard interfaces, there are Python-ngspice interfaces available, so you can use all these Python libraries which are there for data processing. So there's a lot of action possible, but the action has to be done by you. Okay, we have time for one more question. Ah — we are actually out of time. Okay, so let's give Holger a round of applause. Thank you very much. Okay, so we're going to check.
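Going back to the up/down counter from the demo earlier in the talk, its state-machine logic can be sketched in a few lines of Python. This models the logic only — it is not ngspice's actual code-model file syntax:

```python
# Toy model of the 3-bit, 8-state up/down counter state machine
# described in the talk. Input 1 counts forward, input 0 counts backward.

def next_state(state: int, up: int) -> int:
    """Return the next of the 8 states given the up/down input (1 = up)."""
    if up:
        return (state + 1) % 8   # state 7 wraps around to 0
    return (state - 1) % 8       # state 0 wraps around to 7

def outputs(state: int) -> tuple:
    """The three output bits, MSB first (0,0,0 ... 1,1,1)."""
    return ((state >> 2) & 1, (state >> 1) & 1, state & 1)

# Count up three clocks, then down two, as in the demo waveform.
state = 0
for up in [1, 1, 1, 0, 0]:
    state = next_state(state, up)
```

After the five clocks above the counter sits at state 1, matching the up-up-up, down-down sequence shown on the slide.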
Modos: Building an Ecosystem of Open-Hardware E Ink Devices
Hi folks, thank you. So my name is Alexander Soto — I go by Alex — the founder of Modos. And yeah, we're building an ecosystem of open hardware E Ink devices. There's a box with some PCBs — a red one, a green one — and also our prototype, our paper monitor. Please don't plug it in. If you're interested, I can show a little demo in the back. And please give it all back — I'd love to get my PCBs and the monitor back. But yeah, I trust you all. Please check it out, and I'll do a little bit of a live demo a little bit later. We'll do it here, but I think it'll be a little complex. But yeah, please pass it all around and check it out. So, a little bit of backstory. In 2021, at the height of the pandemic during lockdown, my bedroom kind of transformed into a workspace, and from morning to night I was just constantly being distracted, having to refocus and then be distracted again. And yeah, being in front of a computer for 13 hours, I got to the point where the device that I'm using to do my work is also the same thing I use for leisure, the same thing I use for entertainment. So can we have technology that is different, that is more calm, more humane, more aligned with our well-being? So our focus is reimagining personal computing with a focus on creating calm, inclusive and humane technology. We'll upload the slides, but here we have two links to some earlier videos where we show off our electrophoretic display controller. It runs at about 60 hertz, it's open hardware, it uses an FPGA — I'll get a bit more into the details of the hardware specs in a little bit. And this is the team that turned that vision into reality. So Wenting has been the lead designer of the electrophoretic display controller. He's been working on it for quite some time — and I'm going to wait a little bit so we get the whole image back on the screen. Oh no. Hey! Thank you, awesome. His presentation is read by design.
You didn't believe me. Yeah, so to recap: Wenting has led the design of our electrophoretic display controller. Brody has worked in particular on the CAD and the manufacturing of the chassis. Michael has had many conversations thinking about what it would look like to create a software architecture, a software stack, that's tailored to E Ink as a medium. And I'm kind of the guy that does everything in between, supports everyone, thinks about this nonstop and tries to make things happen. Alright, so lastly I also want to say thank you to our community and also the NLnet Foundation. We've had about 300-plus people who want to be in our private program, about 5,000 people on our mailing list, and also about 3,000 testimonials — and in those testimonials, a lot of feedback; we learned a lot. And also — sorry — thank you to NLnet. We're an NLnet-sponsored project; you can look up our Caster project there. And yeah, thank you for your support and for helping us really get to finish our prototype. Okay, so on to the community survey findings. We did a community survey asking folks what the particular use cases are that they use their computer for, and what they would like to use an E Ink device for, and the overlapping categories were reading, writing, coding and focused tasks. Those were the majority of the categories that people mentioned.
I had a general idea of what most people would be interested in, but I think where I learned a lot — the takeaway — was that the same problems I was experiencing myself, being distracted, being stuck in rabbit holes, or having to use a computer for an extended period of time, came up a lot from different people in the community, who expressed similar concerns. And folks also discussed problems related to eye strain — people who have tried other solutions, filtering glasses and such, and it's still a problem. So overall, there are these general categories. One: people who are looking for a more balanced digital life — reducing screen time on social media and entertainment, unplugging, seeing the sun, being outdoors, being away from a screen — but also people who are looking for less visually stimulating digital environments and trying to reduce digital clutter. So that was one group, one demographic. The other one is folks who experience some form of visual impairment, or maybe some form of light sensitivity. For example — I always mention this; I need to look up the specific person who filled this out — there was an engineer writing on behalf of his wife, who experiences epilepsy, and she has tried all different types of solutions and is just trying to find something so she can interact with her digital devices. That comment has stayed engraved in my mind and is a big motivator for me. But there are other health issues that people reported: things related to myopia, epilepsy, light sensitivity, headaches, migraines, traumatic brain injury — and post-concussion syndrome was quite frequent as well, which to me was also very much a surprise. So here's my pitch, I guess, if you want to look at it that way. I think there's a need, right?
I think there's a need to create technology that satisfies our essential needs but also protects our well-being. I think we can redefine the role of our devices to foster a healthier, more balanced life, and hopefully — starting with the display controller — create a new class of devices that are built from scratch to embody these principles of humane technology, both through hardware and software design. As Alan Kay said, people who are really serious about software should make their own hardware. So, hopefully the monitor — I have no idea where the monitor is, but hopefully it's being passed around and people are taking a look at it. Great. So, that's our monitor, that's the newer revision that we have. We built it using KiCad and also FreeCAD. We have a bit of a block diagram here for folks who want to know a little bit more of the details. We recently updated our repository, which has much more documentation on the specifics of how it all works, so feel free to take a look at that; I have some excerpts from there as well. And yeah, we're using a Spartan-6 FPGA, a USB Type-C port for the DisplayPort input, and we also have an HDMI or DVI video input. We use a Raspberry Pi RP2040 for USB communication and for upgrading anything related to the firmware and waveforms. And this is the Caster block diagram for our FPGA. Good — take a look at that. Again, I would redirect folks to the documentation on our GitHub, which goes into a bit more detail than I could possibly do in this one presentation. But some of the features of the display controller: it works with screens from 6 inches to about 13 inches; it works with the black-and-white electrophoretic displays from E Ink; it also works with color displays and also DES panels; and it has extremely low processing delay in the video path — I'll show that in the live demo.
Yeah, we've got four-level grayscale, and 16-level is working as well. Let's see. Yeah, it's optimized for the four-level grayscale. If you've ever used a commercial E Ink monitor, they have these buttons on the front that switch between particular modes. We also have that. I don't know who has the monitor right now, but on the back there's a little button — a little blue button. It doesn't work right now — no, no, it needs to be connected to a laptop and such — but I'm just saying, with that button you can cycle through and switch between different modes, and the different modes are for particular use cases. So if you want to focus on typing and reduce input delay, there's a mode for that, or if you want to use it for looking at black-and-white images in grayscale, it switches — and that's all happening locally through the host software on the hardware itself. And a little bit more about how it's driven — I'm not going to read through all of that. Let's see, I want to do more slides. So: the pixels are arranged in a 2D array; the refresh rate is between 50 and 120 hertz; it's a bi-stable display, so pixels maintain their state after the electric field is removed. And the frame-buffer driving mechanism uses two frame buffers to determine each pixel's color, with the pixel value changing between 0 and 1, and there's also a global counter used to track the frame duration. So this is just a little bit of the basics when it comes to electrophoretic display controllers and E Ink screens in general. I'm going to create a better version of this slide — it's a bit oversimplified, so I apologize. So, let's see — grayscale. When it comes to grayscale, you're often switching between black and white in order to get it.
So you're constantly switching between zeros and ones, switching between these particular modes. And then lastly, one of the optimizations we've done is that instead of having one global counter, we allow for individual updates per region, so we're updating each pixel independently. And we also have this early-cancellation method. I could talk about it more, but I just want to leave it there. And I think the last thing is next steps. So — how much time do I have? Okay. So for next steps: we've been working on this for about two years now, and I think we finally have the prototype more or less finished. So we want to do a crowdfunding campaign, most likely on Crowd Supply, this year. There's a link here where, if you want to be notified when the crowdfunding campaign happens, we can send you an email. We're also a relatively small team of about three or four people, and there's a separate link for folks who want to contribute various skills — if you want to support with documentation, with CAD, or by getting more involved with the display controller, we have a link there for more information. I think that pretty much wraps things up. I could talk a little bit more, but I'd rather leave room for questions. And that's it. We have some — okay, let's see. Questions — right in the middle. Thanks for the talk. How do you deal with the waveforms being proprietary? We've generated our own waveforms. Do you put any work into updating those, improving them? Yes. Sorry — the question was related to waveforms and how we generate them. I'll say that we generate the waveforms ourselves, and there are certain similarities or patterns across different displays, regardless of size. So we are maintaining and updating the waveforms that we have right now for the 13-inch panel and for the 6-inch panel.
So you focus mainly on the hardware, but doing focused tasks, for example, requires quite a specialized way of displaying things. Are you also providing some kind of solution for that in software? Yes. So the question was: we've been focusing on the hardware, but what would things look like on the software side? Yeah, we've spent quite a bit of time looking into that. One approach we've looked at is to use Wayland protocols, and in particular damage tracking, in order to do partial refreshes. For example, if you have two overlapping windows and you drag one window over, it would recognize — via damage tracking — that this is the area that has changed, and would only update that particular region rather than doing a whole full refresh. So there are things we can use with Wayland and the Wayland protocols at the higher level of the software stack; one way to look at it is that it abstracts away the idea of waveforms, and you let the higher-level software stack take care of that with the display manager. So one of the hopes and dreams is that we can work with SORSA, with Drew DeVault and a few other folks — raising funding for that — to have this be part of something we can do, which would allow us to, one, create applications that are native to E Ink, and also have backwards compatibility. Yeah, just one thing. I have no... Yes, oh, hi. I am so sorry. Yes, hi. Yes — why did you choose the Spartan-6 as a platform, since it's quite old by now? It's what we were most familiar with and what we had access to and experience with. Sorry — the question was why we chose the Spartan-6. It's what we were most familiar with and had experience using. There is another person from NLnet, Victor Suarez, who is also interested in porting our work to other FPGAs.
So I think it's not tied to the Spartan-6; it's just where we started. Yeah, so this is regarding the FPGA only: do you have plans to upgrade to, let's say, faster FPGAs, or was this just the original decision, because you were more hands-on with that particular family of FPGA? Yeah, so the question is whether we plan to use more modern FPGAs. It goes back to that — it was the one we were most familiar with, and it's the one we're going to keep using; maybe at some point in the future we'll switch. The Spartan-6 FPGA is doing the job right now. People are more than welcome to contribute and port it to other FPGAs — that's one way, and we welcome that as well.
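The frame-buffer driving mechanism described in the talk — two frame buffers, bi-stable pixels, and a counter bounding each transition — can be sketched as a toy model. All names and the frame count here are illustrative; the real Caster controller does this per pixel in FPGA hardware:

```python
# Toy model of the two-frame-buffer E Ink driving scheme described in the
# talk. DRIVE_FRAMES is a made-up number; real values are display-dependent.

DRIVE_FRAMES = 10   # frames needed to fully move the pigment particles

def drive_frame(current, target, counters):
    """Compute one frame of per-pixel drive values.

    Returns -1 (drive toward black), +1 (drive toward white) or 0 (no-op):
    bi-stable pixels need no drive once they have reached their state.
    """
    drives = []
    for i, (cur, tgt) in enumerate(zip(current, target)):
        if cur == tgt:
            counters[i] = 0
            drives.append(0)          # already settled: leave pixel alone
        else:
            counters[i] += 1
            drives.append(1 if tgt == 1 else -1)
            if counters[i] >= DRIVE_FRAMES:
                current[i] = tgt      # pixel has finished transitioning
                counters[i] = 0
    return drives

current = [0, 0, 1, 1]               # frame buffer: what is on the glass now
target  = [0, 1, 1, 0]               # frame buffer: what we want to show
counters = [0] * 4                   # per-pixel counters (the optimization
frame = drive_frame(current, target, counters)   # over one global counter)
```

Keeping one counter per pixel, rather than one global counter, is what allows independent region updates and early cancellation: a pixel whose target changes mid-transition can simply be redirected.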
Jumpstarter: Open Hardware In The Loop for everybody
Next, we have the Jumpstarter project with Miguel and Ricardo. This is going to be a rather interesting bit of open hardware. For those of you who don't know, Miguel is a long-time contributor to the KiCad project — an alumnus, if you will. So welcome. Thank you. Hello. Hello. Yep. Thank you very much for attending this session. My name is Ricardo Noriega. I work in the office of the CTO at Red Hat. And I'm Miguel. Yeah, I've worked with him for the last year and a half. And we are going to talk about the Jumpstarter project. We will go through these slides and hopefully a live demo as well. So let me introduce you to Pnat. Pnat is a developer. Pnat develops applications for embedded systems and edge computing use cases, and he uses all the modern development tools that we know. He develops locally on his laptop, pushes code to a Git repository for version control, and uses IDEs, testing frameworks, virtualization, containers — basically all the tools that we use for developing services in the cloud. The problem is that Pnat, after some hours of coding, really needs to test a release candidate — or some code that he thinks is ready to test — on the real target platform. Let's say he uses an NVIDIA Jetson device, for example. The problem with this is that he needs to connect the power adapter from the Jetson device to the plug, then an Ethernet cable maybe to get some connectivity, then an HDMI cable or a serial cable. Then he takes a USB stick, puts an operating system image on the USB, installs it on the device, and at the end he needs to get the application onto the device somehow, via SSH or whatever. So by the end of the day, poor Pnat is completely exhausted. And this is when we started to think: okay, we need to tackle this. We cannot afford to do the same thing every day. And this is how Jumpstarter came to life. We saw that developing applications for embedded devices comes with unique challenges.
There's a huge lack of standardization — every device is different. We see in big companies that enrolling these devices into CI systems is rare, or sometimes very expensive. We want to keep quality high in our code, in our applications, and testing — especially automated testing — is a key aspect of that. So we thought: okay, what are our testing goals? We would like to test our application on the target platforms for every pull request that we push to the repository, or for every merge request. We would like to test a release candidate on all the models of the platform that we are going to run in production — say, for a point of sale I have five or ten different models, so I want to test my application on all of them. So we need some kind of automated testing and, if possible, something that is hands-free, with no manual intervention. And this is why we created Jumpstarter. It's not a device management system; it's basically a testing tool. So what is it? I know this is the open hardware room, but Jumpstarter is basically a software project. It's written in Go, and it has the concept of devices — the devices under test, the embedded systems that we want to test our software on — and the concept of a driver. A driver basically exposes the capabilities of a hardware connector. We will explain more later, but a hardware connector is a piece of hardware that allows you to enroll these embedded systems into CI platforms. We have built a scripting language based on YAML that allows you to automate some of the onboarding process. And Jumpstarter allows you to remotely control these systems, with functionality like power management, control signal management, storage, and console management. It works with the major CI platforms: GitLab runners, GitHub Actions, Jenkins, Tekton pipelines. We are also developing a Kubernetes device plugin to be able to schedule these Tekton pipelines on Kubernetes nodes.
And at the bottom of the slide, you can see how, when you use the Jumpstarter CLI, you can list the devices that are connected. So as an example with GitHub Actions: if you want to enroll your embedded devices into GitHub, you just need to run a self-hosted runner service per available device. Then you can add a tag, for example jumpstarter-raspberry-pi-4, and whenever you want to run a job, you just select which tag — which platform — you want to run it on, and it should work. We have created a reference design for a driver, for this hardware connector that I mentioned before. We call it DUTLink, and Miguel will explain more later. But with DUTLink, you just need to connect it via USB to the GitHub runner, then create your GitHub Actions workflow and run it. This is an example of the GitHub Actions workflow: for example, list a device, download an operating system image, prepare the image, mount it, change some configuration, inject some application, and it's ready to use. And then we can use the scripting language that we have created to automate the onboarding of the device. Just a disclaimer: if you use, for example, GitHub Actions, it's better to change the default settings, because first of all, Jumpstarter requires full root access to the runner. So if someone has privileges to push a PR, they could compromise the system. So this is how the scripting language looks: you can give the script a name, a selector for the target platform, and then a set of steps that automate the onboarding — power off and on, write an image to disk — and then we can also control the console. We'll see that in action later. So, as I said, we have designed Jumpstarter with modularity in mind, with a driver-based model. The DUTLink is our reference design, but we have also developed other kinds of hardware connectors, just to show you how easy it can be.
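A GitHub Actions workflow of the kind described might look roughly like this. The runner label, image URL, and the `jumpstarter` invocation are assumptions for illustration, not the project's documented syntax:

```yaml
# Illustrative only: the runner label, image URL and CLI invocation below
# are guesses based on the talk, not Jumpstarter's documented interface.
name: test-on-device
on: [pull_request]
jobs:
  hardware-test:
    # One self-hosted runner per physical device, selected by tag.
    runs-on: [self-hosted, jumpstarter-raspberry-pi-4]
    steps:
      - uses: actions/checkout@v4
      - name: Download OS image
        run: curl -LO https://example.com/os-image.img.xz
      - name: Flash and boot via Jumpstarter
        run: sudo jumpstarter run-script tests/boot-test.yaml
```

Note the `sudo`: as the speakers warn, the runner needs root access, which is why the default "run workflows from any PR" settings should be tightened.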
And if there is other hardware that you can use to enable this, please write the driver, and you can leverage all the benefits of Jumpstarter. So we have the DUTLink driver. Driver B could do the same with an SD card multiplexer, plus a smart plug, plus a serial cable. So, yes, as Ricardo explained, when we started the project, we didn't find a proper test harness. Along the way, we found some others, and we will be adding drivers for those. And if you have something and you want to add a driver, I'm super happy to help. This is what — it's very obvious — at least now it's not pink; this morning we had an issue. So what our test harness does is switch a storage device between the testing host and the device under test. You can access the storage device from the testing host and write your image very, very fast, and then you can connect it back to your device and power it on. Then you can talk to the device via the console, and we have some control pins. So far it's very basic control — there are no analog interfaces — but we have the next revisions; we have taken a lot of feedback to add extensibility to the platform. So yeah, this is how version 1.1 looks. We did it in a mini-ITX form factor, so you can put it in racks in a data center or in boxes, like the one we brought. You can control power via a barrel connector: on the backplane, you have the inputs for power, and down here you have the outputs. You can put your storage device in here, and you can mount your device under test on top if it fits. You can switch up to five amps, and you can provide the power via USB PD. Yeah, so we have, as I said, the USB storage multiplexing, and this is how it looks if you mount something on top. This is a StarFive VisionFive 2, and yeah, we are running some tests with that. And then, one of the best features of this is the speed: you can get five gigabits per second, so you can write images really fast.
And it makes it very interactive: when you are working, you get feedback very quickly on whether things are working or not. So that's really nice. About the hardware: the design is made with KiCad. You have the repository here, and here the evolution of our prototypes. So we made version 1.0 last summer — it was around $80 per device, and we just made five, I think. Then we made 1.1, and we added some additional EMC filtering to the power. We moved the storage device inside, because initially it was outside, and that was okay, yeah. And we added some connectors for extensibility, so we have an SPI and I2C connector — if you need a daughter board to talk something specific to your device, you can do that. We have version 2.0, which we could not produce yet because of company policies: if I want to place an order for the prototypes, it's going to be beyond the maximum without a purchase order, so I need to register the vendor and so on, and it's complicated — so, eventually. But that one, instead of requiring two USB connections — this one needs one for control and one for the storage — will only need one. I think I have a picture here, yes. So in this one, the connection from the testing host comes here, into this USB 3.1 hub, and you can connect additional devices. Maybe you need to add a camera, or a logic analyzer, or a CAN bus adapter — you can do it via USB, and in the software we can detect, via the USB topology, where those devices belong. So the idea is that you could have one testing host but ten Jumpstarters. DUTLink — we changed the name at some point to make it clear what was the software and what was the hardware. And yeah, we also added a connector for APX. Yeah, I will go a little bit faster so I can do the demo. The DUTLink board has a controller chip, and the firmware is written in Rust.
It has a nice console that you can talk to if you want to do things manually, but that's handled by the driver in Jumpstarter. For people who make hardware with USB, the fwupd project is super interesting: you have it in almost every Linux distribution, and it allows you to update your firmware in the field. So you can publish your firmware to fwupd — I mean, you create the descriptors and so on — and fwupd on Linux systems will realize that you have a device that is in the firmware database, and you can get updates through the network. This is how it looks. Sadly, I couldn't take one of Jumpstarter, but this is how it looks, for example, if you're running on the desktop; or if you do it on the console, you see something like this. So, yeah, we're releasing every version on GitHub with all the production files, so you can just download the production files and take them somewhere — probably you will need to adapt them to the vendor, but you can get that. And, yeah, Seeed asked me if I could talk about them: we are talking with them to see if they can pre-make a batch of devices and make them available in their co-create program. Normally that is meant for — I mean, if you are a creator and you want to make money on your device, there are programs like this: they handle the production, and you just take care of your design. But what we did in the meanwhile — I don't know if that will work or not — is we just provided the links to the, how do they call it, the Fusion Gallery. So, when they make the prototypes, they give you a link that you can share, and they will repeat the prototypes for others. And now, hopefully, small demo time. So, okay. We prepared this demo repository, which is actually connected to this; this is registered as a runner on that GitHub repository. Hopefully it's connected — we'll see. And this is the device under test that we have: a Raspberry Pi 4.
And we are building an image and testing it on the device with two different distributions. So, yeah, these are some of the previous runs, which all passed. We can look at them and see — how do I see that? Checks. We can see that they were tested with Raspbian Lite and Fedora Rawhide. So, in the process, it will download the latest version of the image, prepare the image and test it on the hardware. So, we go to one of those; you can see the steps of what happened. Okay, this one was a simpler test. Yeah, we can see previous runs. And we can see an example here of: okay, what happens if I break the construction of my image? In this case, we are testing a TPM module that is connected to the Raspberry Pi. So, if I remove the device tree overlay in the config for the Raspberry Pi, it should not work. So, when we go to the checks, we see: okay, Raspbian Lite stopped working, and we can see that it failed at the TPM interactions. We can see, yeah, that when the image was being flashed onto the device, it was not working. And I can show you if this is all working — maybe I need to make a bigger font size. In this case, this is what the runner is calling: I can list the devices, or I can run stuff on them. I can do things manually. For example, I can power on the device — I need to tell it which device I want to power on. So you can, if it's working, yeah, power it on. You can power it off. You can request a console, or you can run the scripts that we run in CI. For example, here is this other device, which is an SDWire — I don't know if any of you are familiar with this. It has an SD card and then a connector that looks like an SD card, so you can plug it into a device that boots from the SD card. So if you connect these to the Jumpstarter software, you can see that it's working. And — it also needs a serial console to talk to the device, otherwise it's going to complain.
So if I list the devices, I should see, okay, I have the SD Wire with this serial number. In this case, I cannot make all the associations with the tags and so on, which are stored in the hardware. So I have a config file that matches the serial number, and then at this point I can just flash one or the other, and that's set-disk-image. For example, if I set the Raspberry Pi 4 image. So this is the same process that you can do in the scripts, but you can also do it manually. For this, I need privileges. We want to split this part of the executable into a separate one, only for that purpose, with lots of filters, to make sure that it will not break anybody's server. Yeah, this is the nicest part: how quickly it goes. Yeah, we need to be a little bit cautious with the data on Linux, because even if you request the system to eject the device, sometimes it tells you, okay, everything is all right, but the cache is still flushing in some part of the subsystem. And yeah, I think we're okay. Thank you. From the software side of things, have you looked into LabGrid? No, I have not. But, sorry. So yeah, the question is whether I have looked at LabGrid, and I have not, but I will. Yeah, because it does something similar. It is open source, so maybe we should work with them. Hi, thanks for the talk. I would have asked about LabGrid too, but maybe one other thing: do you know about the automated testing conference call, the monthly one, that's coming out of the eLinux project? There are already people there talking about this kind of stuff. Maybe that's interesting to share there too. Yeah, so what's the name of it? It's the automated testing conference call around the eLinux project. Yeah, I can come talk to you afterwards. Okay, yeah, thank you. That's great. Yeah, that happens sometimes: big companies, you have people working on things from different places. Thanks for the talk. I wanted to ask, how do you actually specify tests?
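The caching caveat the speaker raises, where the kernel reports the eject as done while data is still in flight, is exactly what an explicit flush-and-fsync guards against. A minimal sketch in Python (the function name is mine, not part of Jumpstarter):

```python
import os

def write_image(image_path: str, device_path: str, chunk_size: int = 1 << 20) -> int:
    """Copy a disk image onto a device node (or any writable path) and make
    sure the bytes actually reach the medium before reporting success.
    Returns the number of bytes written."""
    written = 0
    with open(image_path, "rb") as src, open(device_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
        dst.flush()              # drain Python's userspace buffer
        os.fsync(dst.fileno())   # force the kernel page cache out to the device
    return written
```

Only after `os.fsync` returns is it safe to switch the SD Wire mux back to the device under test.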
How does the test work with this? Do you have the... Yeah, so. Is that the YAML syntax thing? Yeah, exactly, that is the YAML syntax. So far, it's rather simple. So for example, if I go, and this is available in the repository, if I go here to the demo, and I go to the Raspbian Lite one, for example, you can see this test, TPM, on the latest raw image. So we assume that the image is already built, and we just test it on the image. And it's just a series of steps so far, interactions with the serial console. So it's send-and-expect style control. And yeah, we want to add also integration maybe for other types of devices that are not Linux-based, maybe ones where you just want to flash them. Something that I did not explain is that we also do power metering. So one of the things that we want to do is provide a report, maybe from this point to this point: okay, how many milliwatt-hours did I consume? So you can check if your software is consuming more or less on your hardware. Really cool project, I'm really excited. You said there's an external USB for connecting modules. Is that so I can test something externally? Let's say a very simple system where I'm turning on a light switch: I can plug in something with a lux meter and check that the light actually came on, rather than the board going, yeah, that came on, and I have no idea. Yeah, yeah, that is the idea of version two. You could connect anything and have a way to associate the device under test with those devices, because you know where they are on the bus. My question is a similar question, because there are, you know, literally thousands of pieces of various test equipment out there. Most of them these days probably run on either a network interface or a USB interface, and they're, like, classic: they've been translated from the old HP-IB, GPIB, SCPI commands.
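The exact schema of those YAML tests is not shown in the talk; from the description (a series of steps, send-and-expect interactions over the serial console, run against a prebuilt image), a test might be sketched like this, with all field names hypothetical:

```yaml
# Hypothetical sketch of a send/expect serial-console test.
# The real Jumpstarter schema may differ; field names are illustrative.
name: tpm-smoke-test
image: latest-raw-image
steps:
  - send: "tpm2_getrandom 8 | xxd\n"
    expect: "00000000:"
    timeout: 30
  - send: "echo RESULT=$?\n"
    expect: "RESULT=0"
```

The power-metering report mentioned afterwards would presumably attach to a span of such steps, from one point in the sequence to another.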
But for things that are really powerful test tools: is there any plan to be able to integrate stuff like that? Because maybe I have an analog board that's really high precision, and I need a really precise meter, you know, a digital multimeter, to measure it; I'm not going to measure that with a simple 8-bit A-to-D. Are there any plans for being able to integrate stuff like that? We want, I mean, we want to be able to enable that, but every tool is different. So we want to make it as modular as possible. We haven't thought exactly how to do that yet. But the first thing, I guess, is we need to figure out which USB devices are related to that, or maybe have another config file in the system, so we have consistency, saying, okay, this serial number has these two and these two and these two associated. And then, when you call the software that talks to that tool at some point in your script, you can talk to it. There is even more interesting stuff: sometimes you need to test different parts of your system in parallel. Maybe it's not one hardware piece; you have several, and they need to talk together. So at some point we want to be able to run multiple devices in parallel, and the rest. But yeah, let's see how far we can get. Okay, and thank you so much for your presentation. My question was around the YAML files and the specification: although it is not available now, CAN bus communication and FlexRay and other protocols. Do you have a roadmap for putting those into your specification, with a standard way of sending them? Because, okay, I want to benefit from the openness of the tool, but I want to create new hardware and use the same protocols as yours, so that it stays compatible with the second version that you are developing over the next year. I don't know if you have that planned on your roadmap.
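No SCPI integration exists yet in the project; as a sketch of the kind of glue that bridging such instruments involves, here is a small parser for the ASCII replies that SCPI instruments typically return to queries like `MEAS:VOLT:DC?` (the function name is mine):

```python
def parse_scpi_reply(reply: str):
    """Parse a raw SCPI ASCII reply into Python values.

    Instruments answer queries with ASCII fields, often comma-separated,
    e.g. '+1.2345E+00,+9.9120E-01' or 'KEYSIGHT,34465A,...' for *IDN?.
    Numeric fields become floats; anything else stays a string.
    A single field is returned bare, multiple fields as a list."""
    values = []
    for field in reply.strip().split(","):
        field = field.strip()
        try:
            values.append(float(field))
        except ValueError:
            values.append(field)  # non-numeric fields (model names, flags) stay text
    return values[0] if len(values) == 1 else values
```

The transport underneath (USBTMC, raw TCP, or VISA) varies per instrument, which is the modularity problem the speaker describes; the reply format is the more stable part.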
It's not on the roadmap. But yeah, one of the things that is driving this, and probably one of the reasons why it's getting into the embedded space, is automakers. And yeah, they use CAN buses or automotive Ethernet. So at some point, yeah, we will need to figure out how to do that. We haven't thought about it yet, but yeah. Okay, thank you very much for the presentation. Thank you for having us.
Automated Documentation for Open Source Hardware
Okay, our next talk will be from Pieter Hijma. This will be on automated documentation. Please give a warm welcome to Pieter. Yeah. Thank you for being here, and thanks for the great programming, because my talk really follows up well on the last talk. So my goal for this presentation is to tell you how to automatically generate IKEA-style assembly instructions for open source hardware. But before I do that, let me explain a little bit about myself. I've just started as a self-employed software developer slash researcher, and my target is to make a living in open source software development. I've just had my first contract with Ondsel, which has been talked about before, and this allows me to work on a really nice project in FreeCAD that's close to my heart: managing data for models, so improving parametric design. I'm also a co-founder of the Open Toolchain Foundation, which is a foundation that aims to support the whole toolchain for going from a design to a physical product, and we've seen great examples of that. I come from academia; I've been a researcher in the area of high-performance computing, but a couple of years ago I got a chance to work on my actual passion, open source software and hardware, and I was in a project in Hamburg in which my colleagues from InMachines Ingrassia were creating the OpenLab Starter Kit. The idea was, in the span of a year and a half, to create eight machines for FabLabs, 3D printers, laser cutters, CNC machines, in three versions each, for rapid prototyping, to basically populate FabLabs, all open source hardware. But of course there is one problem: documentation, which we heard about in the last presentation. Documentation is really crucial for open source hardware; without documentation there is no open source hardware. And it's for replication, but also for collaboration, informing people how to improve things.
But as we probably all know, documentation is very labor-intensive, and it's essentially always out of date. As for related work, there are basically two approaches. Document after the fact: you design your machine, or whatever you design, and then you start documenting, with the potential problem that you miss important things, for example the design decisions that collaborators need to know about. The other basic approach is document while doing, but this has the problem that you may document much more than you actually need. The current state of the art in documentation for open source hardware, I think, is GitBuilding, and the difference between our approach and GitBuilding is: GitBuilding is, I would say, text first, images second, while our approach is images first, and we try to minimize the amount of text. And that helps us with something that we find very important: a semantic relation between the source of the hardware, the CAD files, and the documentation. This is difficult to do if you have text describing a machine, for example. So our goal for this research was to integrate the design and the documentation process, generate assembly instructions automatically, and support design evolution: if the design evolves, then we hope that we can just push a button and regenerate the manual. So my colleagues at InMachines Ingrassia created the Fabulaser Mini, and they spent many months, with three persons, a graphic designer, a CAD expert and a machine designer, creating a very high quality IKEA-style assembly manual. This was really nice, but also a huge effort, and I believe that when the instructions were done, they were already out of date. So this was our starting point in trying to automate this process.
So an overview of our approach: we have a CAD file that we annotate in a CAD-like manner, for example with layers and something that we call layer states, and we have a textual specification, and the textual specification refers to the CAD source. I will show you that later on. We combine that information and generate PDF assembly instructions. We created a dedicated workbench in FreeCAD to help us annotate the CAD file. Typically for us the input is a STEP file, and for example we have a button that allows you to select some screws, and then you press the button and they will automatically explode and show this red line that you can see on the screen. Another thing to highlight here: what I circled is one of the step layers, 'step 1 detail', and that is something that we can refer to from the textual specification; I will show you later. And another thing is that for the window that you see here with the CAD model, we created essentially a button that allows you to take an SVG screenshot, a high quality image of exactly what you see here, and we can remember the camera position. So you can move and rotate your model and think, okay, this is how I want to show this in the manual; you save the camera position and you can generate an SVG image out of it. So let's go into the textual specification. At the left we have a domain-specific language that helps us describe the manual, and at the right we see the output. We can specify the title, and we have a command for a bill of materials, and this is the only thing that we need to specify; then we get the visual bill of materials that you can see at the left of the assembly instructions. We can specify which main image we want and what kind of highlight we want, and this is where I asked you to remember 'step 1 detail': the image that you see here comes directly from that layer state that we defined in the CAD program.
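The exact syntax of that domain-specific language is not shown in the transcript; from the commands described (a title, a bill-of-materials command, a main image with a highlight referencing a layer state such as 'step 1 detail'), a specification might be sketched like this, with every keyword hypothetical:

```
title "Step 1: Mount the frame profiles"
bill-of-materials
image main
  camera    step-1-overview     ; saved camera position from FreeCAD
  highlight step-1-detail       ; layer state defined in the CAD file
remark "Do not tighten the screws yet"
how-to connect-profiles         ; reference to a dedicated how-to page
tool "5 mm hex key"
```

The point of the design is that every name on the right-hand side resolves to an annotation in the CAD file, which is what gives the semantic link between source and manual.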
This is where you can see the semantic relation between the information in the textual specification and the CAD-like annotations in the CAD file: everything that is underlined in red is basically a reference into the CAD file, and that allows us to create this page that you see. We also have annotations, remarks, and a how-to command that references a dedicated how-to page that tells you how to do things, for example how to connect two profiles, and we have commands to add tools to the page. So what was the result? Hopefully you are of the same opinion that the original and the generated version are very similar. There are some details where we don't have the same flexibility as a graphic designer; for example, in the text at the top you see there is a red dot in the text. We cannot do the same thing, because we generate the page and we just don't have that flexibility. This is another page, and again here is a problem with the flexibility that a graphic designer has: for example, at the bottom you can see there are two options for the same tool, and that is difficult for us to do when we generate these pages. Because the original manual was developed over a course of months, with three persons, without any time tracking, it was really difficult for us to measure scientifically what the cost benefits in terms of time are, but we tried to give at least an indication. We didn't have the resources to recreate this whole manual, but on a small model of about six steps, a small vice and a vertical lathe, creating this manual took about 25 minutes. It was us who wrote the software, so let's make it a factor of two; I think it's still pretty good for creating these kinds of manuals.
In terms of design evolution, going from version N to N+1, which we heard about in the previous talk: minor changes in the model, let's say replacing screws with smaller ones, do not require any action at all; you just push the button and the new manual is generated. If the changes are larger, then we can show that, because of the abstractions that we use, the changes that you have to make are all limited in scope. For example, if you make a larger version, then you probably have to zoom out in one of the views and store that camera position, but after you've done that, you can just push the button and it generates the manual again for you. The biggest change that you can encounter is if you split up or merge assembly steps; then you basically have to go over all these abstractions. Before I conclude this talk, I would like to acknowledge the people that I've worked with: the co-author of our paper, J.C. Mariscal Melgar at the Helmut Schmidt University in Hamburg, and the team that created this example manual that was a huge inspiration, Daniel Ingrassia, Markola and Liana Sayuri Honda of InMachines Ingrassia. This project has been funded in the context of the Interfacer project, funded by the EU, and to make the software a bit less research software and a bit more professional, I received some funding from the NGI open call as part of the Open Know-How project. So my conclusion is that our research proposes a novel solution to the documentation update problem in the collaborative environment of open source hardware. I'm happy to take any questions. Hi, thank you for the presentation. I wish this had existed two years ago when I was writing documentation. The question is: is there the capability to also add a photo, or an external picture, to this tool?
Yeah, so we don't prefer photographs, because they tend to get out of date, so we prefer 3D renders, but I think it would be possible to add that to a layer, yes; it would need to be customized a bit. Okay, because I do not want to model glue in FreeCAD when I need to. Right, thank you. Is there a reason why the steps are explicitly numbered rather than automatically enumerated? Because if you add a step two between step one and the current step two, you need to rename all the things. Yes, no, there is no reason; you can name it whatever you want. We just did this to make clear that these things represent steps, but no, you're right. Hi Pieter, thank you for the great talk. Can I use this not only with FreeCAD designs, but with any STEP file, some STEP assembly that I have, and take it apart, or is it limited to FreeCAD? No, actually it works better with STEP files than with, for example, assemblies in FreeCAD, because for us it was difficult to choose which assembly format, because there are so many. And the hardware designers in our case used Fusion, and the CAD expert took the STEP file from Fusion, drew in many more things and used Rhino, so for us the input was STEP. I have one more question. I saw the point the camera was looking from was in your code. How do you select which parts will be taken out, like the screws or the hinge? Is this done by clicking and selecting, or how do we do that? Very good question. The idea is to create layers in FreeCAD, and we have a layer of parts for each step. You have your model and you can select things, and then these parts will automatically go into the layer that you just selected and will disappear, so this makes it very easy for you to basically go down your model and select everything in the right step, and that's the basis for the bill of materials.
Then we have layer states that define which layers are on or off, so you can basically go from one layer state to another layer state. The positioning, for example the positioning of the screws, is also stored in a layer, and if you turn that layer on or off, it switches straight between the assembled view and the exploded view, so you can switch very fast between all those views. Are the parts taken out manually? Are the screws taken out manually, or automatically? The question is whether the screws are taken out manually or not. No, it's automatic. You just select the screw, you hit a button, and we can do it. The technique is: we take the center of the bounding box and the center of mass. For a screw that is a bit offset: if the screw is here, with the head there, the center of mass is a little bit toward the head, so we know in which direction to take it out. So it's automatic. Okay, thank you very much for the very interesting talk. Thank you.
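The bounding-box-versus-center-of-mass trick can be written down in a few lines. This is my reconstruction of the idea as described, not the workbench's actual code: the heavier head pulls the center of mass off the geometric center, and the part is pulled out along that axis, toward the head.

```python
def explode_direction(bbox_min, bbox_max, center_of_mass):
    """Infer the axis-aligned direction to pull a fastener out, given its
    bounding box corners and center of mass (each an (x, y, z) triple)."""
    # Geometric center of the bounding box.
    bbox_center = [(lo + hi) / 2 for lo, hi in zip(bbox_min, bbox_max)]
    # Offset of the mass center from the geometric center: for a screw,
    # the wider head shifts the center of mass toward the head side.
    offset = [m - c for m, c in zip(center_of_mass, bbox_center)]
    # Pull out along the axis with the largest offset, toward the head.
    axis = max(range(3), key=lambda i: abs(offset[i]))
    direction = [0.0, 0.0, 0.0]
    direction[axis] = 1.0 if offset[axis] > 0 else -1.0
    return direction
```

For a screw standing along the z axis with its head on top, the offset points up, so the exploded view moves it upward, exactly what the red extraction lines show.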
Sharing parametric models as web apps with replicad
Hi, I'm Steve. I'm a Swiss software engineer. I like to tinker. I like to make things. I like to share the things that I like and the things that I make. This talk is a lot about these kinds of things. And the story starts, as many stories started today I think, in 2020, 2021. For some reason lots of people started to pick up new hobbies, and I was no exception. So I started to do 3D printing. And it was a lot of fun. I bought a cheap Chinese printer and tinkered quite a bit with it. I must admit the hardware part was not really my thing; I was more into the modeling part. But yeah, lots of fun. The thing is, it was not as easy to share with friends. A lawyer friend is not going to tinker as much as I do, so it's more difficult to share. But the machines are getting better nowadays; they're getting closer to being appliances. So I can share them with friends, and I can try to share the hobby, generally speaking. And the good thing is, for the modeling part, people are not going to model. I assume that these people who are potentially interested in 3D printing are going to go on one of these websites; if you don't know them, they're repositories of 3D models. And they're going to apply a very simple workflow. They're going to download the model. They're going to slice it: you use software where you tell it what the model is, what filament you use, what your printer is, and it magically spits out a file. Then you print. That is just it. It's very simple, and if you're not technical, that's perfect. And I've been using this workflow for different things. If you look on the left, you have this thing that you might be using; it's a way to make snowballs that are perfectly shaped. And remember, I'm a Swiss engineer, so I like my snowballs to be perfectly shaped. The other ones are not beer crates, because I don't have a printer that big; they're for batteries.
Anyway, they're great models. They're simple models. They're fun to print, fun to share with people and to give away, and all things like that. And this is not what I'm going to talk about; these are very well modeled and shared as just a single file. What I want to talk about are things that are more like this. This is a very good project that you can find on Printables called the Honeycomb Storage Wall. It's a way to do pegboards with 3D printing. You have this base plate, a honeycomb that you put on your wall, and then the community has gone wild and done a bunch of different attachments. You can attach anything. Here it's probably in someone's office, but I've seen people using it in their bathroom, in their kitchen; people model attachments for everything. It's just great; there's a big community around it and all these kinds of things. So I'm not going to talk about the attachments and modeling the attachments here. I'm going to talk about the plates. Because what happens is, these things are made for 3D printing, and 3D printers tend to have different sizes and different beds. And this is what you can see in this file: you have different sizes of base plates that correspond to popular printers. They're not going to cover all of them, but you can get quite far. Then you have people who want nice borders because, you know, perhaps it's for their kitchen and they want the kitchen to look really nice. And so the community has provided. But then you get into this explosion of combinations, and you don't necessarily have it covered by the community. And I can see people in the back just screaming: parametric models. And yeah, yeah, I know. And this is what we usually think about: parametric modeling software. Anyway, I don't think these are the best answers.
This is one of the best ways that we have now, but I think we can improve on it, and I'm going to show the limits first. The people making the Honeycomb Storage Wall project are really good, and they have shared the files that they used to build it. They built the thing with Fusion 360, and some people in the community have also re-implemented the model in OpenSCAD. And I'm just going to walk through the simple workflow I showed at the beginning, you know, download, slice, print, and what it looks like if you were new to 3D printing and you want to change the size of your build plate. So you download the model; this part is the same. Then you have to find the hobby version of Fusion 360, or I don't know what it's called now. And if you try to do that, you know what I mean, it's not easy; they kind of hide it. And then, you know, they try to get some money out of you. If you figure that out, then you download it, sometimes a big file. Then you just have a professional tool in front of you. Personally, I'm comfortable with Fusion 360; this is what I used to actually learn CAD, so it's quite nice. But you just have a huge program in front of you and you don't know what to use. Perhaps you're like me and say, oh, it's a challenge, I'm going to be interested and watch a lot of videos and learn how to CAD. But perhaps when you're done, you've forgotten what you were trying to do. Or, what is more likely to happen, you're going to give up. You're not going to customize it, or perhaps you're going to ask your friend who's more technical to do it for you. With OpenSCAD, you have something similar. I'm going to go faster. So download the model, then you download OpenSCAD. And the thing is, it's an open source thing; you don't know exactly what it is. It doesn't look as professional as the other tools.
And, you know, with the computer telling you, oh, this is unsigned software, are you sure, perhaps you're just going to give up there because you don't trust it. Then, it's code CAD. I love code CAD. But if you're a lawyer, code CAD is not your thing. And perhaps you're going to try anyway, but you're going to change the wrong line and add the wrong type of code and all these kinds of things, because not everyone is a programmer. And so you're going to fail, you're going to give up, and you're not going to have the thing that you want. So what do we want? We want to lower the bar for the end user, to make these parametric models accessible to everyone. And the solution that I propose is to have something that works fairly similarly to what we had before. You just add a configure-model step and then, you know, download, slice and print. And how can you do that? You have these web generators and configurators. I don't know if you know them, but typically, if I want to create a QR code, I just Google "QR code generator", I skip the first five results because there are probably just lots of ads in there, and I know the good one. I don't remember it right now, but you have these kinds of things. It's single-serving, for a particular purpose. And it's just great. And there is no reason, or there might be a little bit, why we shouldn't share our parametric models that way. It's just software to do one thing, you know, the UNIX philosophy. And so what I have is a QR code for you. You don't need to go there, because you can see it here. It's a configurator that creates the Honeycomb Storage Wall thing. In the middle, you have the model. At the top, you have something to configure it: the number of rows, the number of columns. And here you can just download. So you can see: configure, download.
I don't have the slicing and the printing because that's another tool; I don't want to implement a slicer and a printer in my software. It's just something very simple. And you have a couple of things where you can edit, you can see the code and all these kinds of things because, you know, we share stuff. The code is open. Just go. What you're thinking is: I don't want to build my own configurator, because you have to maintain a server, you have to pay for it. You have many reasons. Oh, I just went a bit too far. But, yeah. Yeah, you don't want to pay for it. Many reasons that you might not want to do it, right? Perhaps you're more of a back-end person and you don't really care about building the UI. Perhaps you don't want to touch C++, or you don't want to compile some stuff on servers. Many reasons you don't want to do that. And so, the thing that you've already seen, because there was a bit of a spoiler: we want to lower the bar for the maker as well. And the way that I've been trying to do that is with this project that I've been building. It's not its first purpose; the first purpose is going to come later, just a bit of suspense. But with replicad, you can, as someone who is interested in code, make a web configurator very easily. So what is replicad in that context? First, it's an online workbench for code CAD. So if you want to do code CAD, that is, drawing with code, you can just go to the workbench, and you code on the left and you have your model on the right. It's something that was probably originally done by OpenSCAD; you have many, many different examples now. You have CadQuery; you have similar things. And in terms of something purely online, there is something called CascadeStudio that exists. So it's nothing completely new, but it's there; it's something that exists.
You can do your model. Then, something that is a bit different: it's done in JavaScript/TypeScript. And perhaps some people are asking, why would you choose JavaScript? Many reasons. It's a great language now; you should try it again. And the second one is, if you are actually new to code CAD, you might want to learn to code, and there are lots of resources for JavaScript online; it's a bit everywhere. So there are lots of resources to learn it. Also, npm exists. If you want to do some Voronoi stuff, there's a library for that. So it's also quite nice to just use a general language and not have a specific language for what you do. It uses the OpenCascade kernel. So it means that you can make fillets as well as you can make them in FreeCAD, which, you know, means what it means. I mean, it's a powerful kernel. It's not perfect, but it's very powerful, so you can do lots of things with it. And the last thing, which comes back to why I was introducing replicad in the first place: we have a built-in web configurator. So you draw your thing, you can download the model, or you can click on something to share it, and you generate the link. And the thing you get directly, if I perhaps didn't leave you the time to open the configurator I had before, is just a bunch of parameters that you expose and a way to download the model. It's very simple. But it's not everything. The first thing: as software developers, we build things for ourselves. So I'm saying that I'm lowering the bar for the end user and, no, no, I'm doing it for myself. But it also means that perhaps you're a bit like me, and you're a web dev, and the bar is also lower for you to build things with this. So perhaps in the list of things that I had before, some of you say, well, it's not that bad to build your own UI. Perhaps I want to.
And actually, replicad is a library, and so it means that you can. You can just import it. You build your own front-end project and use it as a library. And as an example, I'm going back to my conference-driven development project that I had before. Here, you can look in the corner, and it's just parameters, right? It's not that great, because what I want is distances. I don't want rows and columns. So what you'd have to do is actually do a bunch of math beforehand, which is not very good either. So I did another project that you can look at, which is what you expect: an online configurator that generates the same model as the other one did, but with my own UI. It shares some resemblance with the other one, because I made both, and so I have a bit of a style, or I try to. And the thing is, now you have distances. You don't need to have rows and columns and do the math, etc. You have an undo button, because it was already there in the thing that I copied, but you might want undo in your particular project. But it's something that I built just for that. And there is a viewer, and actually it's more responsive than the other one, because I spent five minutes on the responsiveness and all these kinds of things. So this is what you can do with this kind of thing. And my aim was to use CAD as a web API. Nowadays, the browser is just an amazing piece of software. You can do audio in it. You can do 3D rendering, which I actually use. But drawing stuff with CAD is not something that is there by default, which is probably a good thing. But you might want to do CAD in the browser anyway, and perhaps use replicad for that. That's kind of my point. Actually, going back to why I did it: this is something that I made. Another thing that I'm into is making boxes for board games. Don't ask; people have some niche hobbies.
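The "bunch of math beforehand" that turns user-facing distances into the rows and columns the parametric model wants is a one-liner per axis. A sketch of that conversion, where the 21 mm cell width and the hexagonal row pitch are illustrative assumptions, not the Honeycomb Storage Wall's real numbers:

```python
import math

def grid_from_size(target_w_mm: float, target_h_mm: float,
                   cell_w_mm: float = 21.0) -> tuple:
    """Convert a desired plate width/height (mm) into the (columns, rows)
    a honeycomb parametric model expects. Cell size is an assumption."""
    columns = max(1, math.floor(target_w_mm / cell_w_mm))
    # Hexagonal rows interlock, so the vertical pitch is sqrt(3)/2
    # of the cell width rather than the full width.
    row_pitch = cell_w_mm * math.sqrt(3) / 2
    rows = max(1, math.floor(target_h_mm / row_pitch))
    return columns, rows
```

Hiding this computation behind a distance-based UI is exactly the improvement the second configurator makes over exposing raw model parameters.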
And so I made a specific UI for making boxes for board games, and it generates the box. But, you know, someone might want to generate documentation, or to take a first step without having to install FreeCAD for that. So it might be a tool for that. Or, you know, if you have some specific needs, perhaps not just for hobbies but for work, it's something that you could use. That's kind of my point. And so, as we get towards the end of the talk, let's think a bit about what we have learned. The first thing, which I didn't really mention but want to stress because it's quite important: I've said, oh, perhaps it's not great how we share things, you know, parametric models. Actually, there's no wrong way to share stuff, right? Let's just try to be better about it. What we can do is lower the bar: make it easier to share parametric models, especially as configurators. And then, if you are a bit code-curious and want to play with these kinds of things, perhaps have a look at the workbench. You know, there's a bit of a community on the GitHub repos; you can ask questions, there are discussions, people are interested. So have a look. And if you are a web dev and you want to do something more involved, you can think of Replicad as a library to work with. And so this is all I had for you. I hope you had fun. Do you have any questions? You make this sound so easy, which is wonderful. But where were the dragons? I mean, part of the thing was learning. I rely a lot on a project called OpenCascade.js, and to me, the dragons are there. Compiling C++ is not something that I want to do, and so I could avoid them. Then it was lots of fun, you know, building the thing, trying to figure out the different technologies and things like that. And then it was about learning OpenCascade.
And, you know, one of the things that Replicad does is try to handle memory as well as possible, because OpenCascade is C++, and you sometimes have to manage the memory. And there are definitely memory leaks in my project, but, you know, you're welcome to find them and fix them. But yeah, it was about trying to find ways to do that. At some point, I designed the API to handle it, and then I found a better way to just have it magically disappear. So when will you buy a laser cutter, so you can also make laser-cut boxes? Actually, I don't have a laser cutter, and I've been, you know, resisting buying one for a while. Before getting into 3D printing, I bought a Silhouette machine, you know, to cut paper. And if you go to the website, the deck-in-the-box website to make boxes: before making the 3D printed boxes, I made boxes in paper. So you have the die lines, and you have the same interface to generate die lines to cut things from paper. But yeah, I'm resisting as much as I can. And perhaps there is also one thing that is a bit of a dragon, a rabbit hole that I partially fell into, which is the 2D part, because OpenCascade is not that great for 2D Boolean operations. And so I started to implement them myself, and I'm, you know, starting to build a 2D CAD kernel, and it's getting a bit out of hand. Great talk, thank you. Have you ever thought about FreeCAD import, or some kind of connection to FreeCAD, as modeling in FreeCAD might be easier than coding it? I'm not sure exactly how that would work. I mean, you can import STEP files, but then you have, you know, the whole model the way it is; I'm not sure that it would be easy to do.
And part of the thing with code CAD that, you know, makes it easier: typically, the topological naming problem that they're currently solving, you don't really have it, because instead of selecting by the number you have in the array because you clicked on it, you can say, oh, I want to take the edge that is at this distance from that, because you know this when you model. Yeah, you have to do a little bit of math to figure out what the distance is and things like that, but normally it's basic geometry, and you're going to get it wrong and do it again, but it's going to be okay. And so, no is the answer, the short answer, sorry. Okay, thank you, Steve. Thank you.
WebAssembly, WebComponents and media filters all at once: a proposal to open the Web to variety of formats
I'm going to talk about WebAssembly, WebComponents, and media filters all at once: a proposal to open the Web to a variety of formats. That's it. So thank you very much. I'm really excited to present this project today, since it's the first time I'm presenting the open source aspect of the Bevara project. So I'm Jerome Gorin. I'm a lecturer and researcher at an engineering school, and I've been quite active in many open source projects. But the work I'm presenting today is the fruit of research I conducted ten years ago. At that time, the proposal of my PhD was more theoretical than practical. But since then, many technologies have been integrated inside the browser which let us deliver all types of content on the web. So with my associate, Maya Bistrom, we created a company called Bevara, which means "preserve" in Swedish. And with this company, we want to promote this technology and we want to speed up its adoption. So the talk I'm going to give today is to show you the open source aspect of this project and show you how you can contribute to it. But to dive into this project, let's start by thinking back a decade ago, to the time when browsers didn't include the ability to embed multimedia playback. At that time, you had to use plugins and web extensions like Flash from Adobe and, later, Silverlight from Microsoft to add the ability to play content. But these kinds of extensions are not used anymore; they faced many issues, like poor HTML and CSS integration, security issues, and accessibility issues. So to fix this, HTML integrated new tags, the video tag and the audio tag, to allow the browser to natively support multimedia content. These new tags allow a wide variety of usages, like rich internet applications, social media, and video sharing. But they also restrict the media formats and content to a handful of codecs.
So there is no guarantee that a format will be supported across all browsers. The formats which are supported, you can count on one hand: it's MP4, MP3, FLAC, and so on. So on the next slide, what I show is that, for instance, Ogg and Theora are not supported across all the browsers. And what is also concerning is that JPEG XL has recently been dropped by Chrome. So what this realistically means is that a lot of people put a lot of effort into developing useful formats and useful codecs, but they are restricted from wide adoption due to this gatekeeping by the browsers. So what we are proposing is a kind of deconstruction of this Babel tower of containers and formats, in order to let people freely use and develop their own formats. We are only using W3C standards and open source technologies to fix some of the problems of this gatekeeping, to make it easier for everyone to integrate new formats, to innovate, to deploy stuff, and also to give the ability to support legacy formats like AC-3, DNG, MKV, EPS, and so on. So let me now turn to the details of our solution. There are three parts, but mainly two, which are the Web Components and the WebAssembly; and we also use media filters. So for the first part, the Web Component part: instead of creating new scripts or creating a new tag, we just extend the usual tags, the audio tag, the video tag, and so on. We are using the `is` attribute, which is something standardized by Web Components, and we add internal logic to this video tag. So this is where Web Components are used. Then, for the WebAssembly part, we are creating a new attribute, the `using` attribute, with which you point to an external library which is compiled with WebAssembly.
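The described markup might look something like the sketch below. This is only an illustration of the idea from the talk; the element name, attribute values, and file names are assumptions, not Bevara's exact syntax.

```html
<!-- Illustrative sketch only: a standard <video> tag extended via the
     Web Components `is` attribute, with a `using` attribute pointing to
     a WebAssembly-compiled decoder library. Names are hypothetical. -->
<video is="universal-video"
       using="decoders/ogg-flac-vorbis.wasm"
       src="talk.ogv" controls>
</video>
```

The point of the design is that the page still uses the native tag, so existing HTML, CSS, and accessibility behavior is preserved; only the decoding logic is supplied from outside.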
So for instance, in this example, we will point to the libraries from the Xiph foundation, which include Ogg, FLAC, and Vorbis, to decode the input source, which is in Ogg and maybe in Theora. So if we think about this solution, it can be a bit overwhelming, because, I mean, we know the format of the source, and we include quite a lot of code for it. So what we added as syntax is the web attribute. We developed what we call a solver, which is based on the open source project GPAC, which has been presented many times at FOSDEM. And this solver is able to create a media pipeline to adapt from an input source to an expected output. So for instance, in this example, we have as a source an MPEG-1 program stream that embeds MPEG-1 video, which is mostly not supported by browsers, and we will transcode it to an H.264 file, which is supported by all browsers. So what the solver does: by itself, the solver is able to check, amongst all the libraries which have been provided with the web attributes, how to do the transcoding, so as to adapt the video to the user's browser. So in this example, we take a portion of the code of an MPEG-1 decoder and of libx264 to do the transcoding dynamically. With this same principle, we are supporting audio and we are supporting images. So now let's do a demo. I have delivered it on a website, so you can test it with your desktop or with your mobile phone if you just flash the code. I have created two web pages with raw contents delivered on the website: a raw J2K image, a JPEG XL image, a Dolby Digital AC-3 sound, and an MPEG-1 video. On one page, we will not use the universal extension that I have presented, and on the other, which is the main page of Bevara, we will show the results with this universal extension. So I will do the demo live, and I hope that it will work.
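The solver described above, finding a chain of filters from the source format to something the browser supports, is essentially a shortest-path search over conversion steps. Here is a minimal sketch of that idea, assuming nothing about GPAC's or Bevara's actual implementation; the filter names and format labels are illustrative.

```javascript
// Sketch of a "solver": given filters that each convert one format to
// another, find a pipeline from the source format to a supported one.
// Breadth-first search yields the shortest such chain.
function solve(source, supported, filters) {
  const queue = [[source, []]]; // [current format, chain of filter names]
  const seen = new Set([source]);
  while (queue.length > 0) {
    const [format, chain] = queue.shift();
    if (supported.includes(format)) return chain; // pipeline found
    for (const f of filters) {
      if (f.from === format && !seen.has(f.to)) {
        seen.add(f.to);
        queue.push([f.to, [...chain, f.name]]);
      }
    }
  }
  return null; // no combination of the provided libraries can adapt this
}
```

For the MPEG-1 example in the talk, with hypothetical filters `{from: "mpeg1-ps", to: "mpeg1-video", name: "ts-demux"}`, `{from: "mpeg1-video", to: "raw-frames", name: "mpeg1-decoder"}`, and `{from: "raw-frames", to: "h264", name: "x264-encoder"}`, the solver would return the three-step chain ending in H.264, and an already-supported source would return an empty chain (no filters needed).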
So if we go to Safari on the left, this is the page without the universal extension. What we can notice is that Safari itself supports quite a lot of formats and containers: it supports J2K images, it supports JXL images, it supports MPEG-1 video, but it doesn't support AC-3. Now let's see the situation with Firefox. In Firefox, you can see that J2K is not supported, JXL is not supported, AC-3 is not supported, and MPEG-1 is not supported. And with Chrome, it's the same situation. So now let's see the real main page of bevara.com that includes this universal extension. And so J2K is supported, and if you just dive into the source code, this is the semantics I have shown you: you have the `is` attribute to say that we are using the universal tag, we add the internal logic, we have the solver, and we are using the open source library, which is OpenJPEG. Same for JXL, and for AC-3 and MPEG-1. Now, if I switch to Firefox, you see that you have exactly the same result. And on Safari, everything is as it was, but you now also have the AC-3 audio supported. So on all the browsers, you have the same results. We also released an SDK in open source, and we released an IDE based on Visual Studio Code that you can use. The goal of this IDE is to help web developers find the right combination of filters for a given content. So I'm going to show you the IDE. To install it, it's quite simple: you go to the Visual Studio Code store and you type Bevara. On my computer, it's already installed. And then you can open any content in the file explorer. So for instance, let's use the JXL file. You first have the preview, which means that a library exists to decode this content. In the graph part, you will see the media filter pipeline that has been used to decode this content. You can also view the source of each filter. On this one, the filter is based on libjxl. So I think the connection is quite slow.
And in this part, on the Accessor tab, you have the script to be integrated inside your HTML to support this given format. So now, I mean, the connection is quite slow today. This is the source that has been used for the graph. So this is this one: I use libjxl, and this is the code from one filter. As you can see, it's using libjxl, and then you have some semantics just to adapt the open source library to the input format and the output format, to help the solver find the right combination. Yeah. So, where to find us. This is the end of my presentation. At the moment, we are just starting; we only support a handful of media codecs and containers. However, we are adapting new libraries, and we are working on support for new formats and new types of documents: 360 videos, 3D objects. I mean, everything which is multimedia can be constructed with the solver, you know. Everything is open source. What I've shown you is open source, so you can check the code of the editor on GitHub. The test interface and the SDK are on bevara/filters. Everything is in LGPL, so you can contribute to it; you can take this code. I've kept this presentation short to leave you time to ask questions, or you can also find me later in the audience. So that's all for me. Any questions? Yeah. I mean, it depends on which... sorry. Okay, so: how expensive is it to transcode in the browser? It will really depend on the type of filter you use. I mean, we have some filters which use WebCodecs, so they use the acceleration of the browser to do the transcoding. By using WebCodecs, if it's supported by the browser, it's very fast. The other thing is that you don't always need to transcode: if you are using a canvas, then you just decode and display it on the canvas. So it means that there is no extra delay when you open a file, when you are using the canvas.
So it really depends on the complexity of the encoder, on the technology you are using, and on how you want to integrate your video. But we are using WebAssembly, so there is no overhead imposed by WebAssembly itself; you have close to native performance, you know, in your browser. Yeah. These are static files, so why do you want to transcode ahead of time? Because, if it's delivered like that, why don't you just use static files? That's a good question. Why? Because, first, you can adapt depending on the browser, because for instance some browsers do support a file like JXL natively, you know. And the other thing is that there are a lot of files which have functionality embedded in the container. Take for instance DNG files, the raw files that are used by cameras and professional photographers. If you use the DNG itself, you will be able to view the raw format: you can view the preview, and you can play with, you know, everything that you can do with Photoshop, for instance, like having this HDR range of color and so on. And for 360 videos, you will have, for instance, the interaction; for documents, you have to add this interaction. So by playing the native file, you will not lose any functionality that the container initially has, you know. Yeah. Do you want to make a browser plugin, so you can view the websites that don't have the tags? Yeah, there is also a browser plugin. Yes. There is a browser plugin already that is able to detect whether the format is supported or not for a given content. But I think that the best approach is to trust the web developer himself, because he knows he wants to use a specific file, he wants to use native content, so he will integrate, I mean, the functionality that he requires for his website.
If the web developer asks the user to install something in his browser, then I think he will lose a big part of his audience. It's better to prepare everything for the end user than to ask him to install something. You know, I don't want to come back to the situation we had with Flash and Silverlight, with all those kinds of issues. So, I guess for the web pages that still have JPEG XL, an extension would make sense if you want to do that. Yeah. Can you repeat again? Yeah, yeah, of course. This extension does exist, I mean, but it's less useful than the first approach that I'm presenting, you know. Yeah. Is the MKV file format supported? I'm working on it, actually. Ah, yes, sorry: is MKV supported? I'm currently working on it, you know, because one of the brakes we have is that a lot of formats have patents on them. So what we can distribute freely are the royalty-free formats. MKV is one of the royalty-free formats, and you have the open source code to be used. So the adaptation is quite easy, and I think that will be my next work, you know. Something that I forgot to present, which is really important, is that the plugin extension already has a store. So if you want to try a combination, you just have to add a library to it. For the moment we have PNG, JPEG, JXL, OpenJPEG. We have the full FFmpeg, and then we extract some parts of FFmpeg, just to reduce the size of this big project. But MKV is really the MKV support inside the Xiph decoder, so I will extract it and I will work on it. And then if you click on add, for instance, let's say that this one was JXL, and I'm adding OpenJPEG to it: you will see that it will be a candidate, and for the preview it will check amongst JXL and OpenJPEG and see that OpenJPEG itself is not needed for this one, so it's unused.
Streaming live using RIST On Demand to thousands, how you can have your cake and eat it too
All right, good morning. Again, I'm Sergio Ammirata. I'm a board member of the RIST Forum, and an active member of the RIST committee that writes the specs for RIST. And I'm also the maintainer of the librist open source project. So today, we'll be talking about how RIST can support end-to-end live streaming with packet recovery. But in particular, I will explain how we can support this in a broadcast scenario, meaning streaming to as many users as your bandwidth can support. So we'll cover the topic in two sections. First, we'll provide a roadmap, or an update, on the RIST specification and the librist project itself. And then we'll go to a practical application and show you how you can do live streaming at a large scale with the open source tools provided. So, on to part one, the development roadmap. The last time I gave an update at FOSDEM regarding RIST was February 2020, a few days before the pandemic shutdown. Now, four years later, we will explore what has happened since. I guess if I had waited one more year, I could have blamed the Thanos snap for the delay. So let's do a brief recap of the beginning of the protocol. In 2017, the VSF (Video Services Forum) created the RIST activity group for the purpose of creating a unified, interoperable protocol for transmission of IP data over lossy networks. The requirements were that it needed to be based on the UDP protocol, and it needed to include negative acknowledgment retransmission requests. So one year later, after a successful multi-vendor interop event, the simple profile specification was published; you can see that at the bottom. The RIST activity group then proceeded to add multiplexing and encryption capabilities and published the main profile specification in early 2020. It was at that time that the librist library open source project was first published. And you can refer to the talk I gave back then, where I go into a detailed explanation of what the simple profile does, the main profile, the differences, et cetera.
So as you can see on this slide, the RIST activity group has been quite busy adding features to the protocol to accommodate all possible use cases during the last four years. What started with the simple profile, the first release, as the desire to add packet recovery to an RTP stream with an MPEG-TS payload, has now turned into a rich protocol that works with any payload and which includes multiplexing, encryption, and authentication. So librist, the open source project, currently supports simple profile and main profile, and we're working on adding support for the advanced profile. In addition to the core specifications for the protocol, the RIST activity group has also published a series of recommendation, or best practice, documents. These are documents that extend the protocol into specific applications, into specific niches, and a library that wants to be compatible with the specification needs to consider them. So librist, when applicable, has been made compatible with these recommendations: the clock synchronization, the relay, et cetera. So, enough history about the protocol and the specification documents; those are all publicly available. They're not behind any paywall. The VSF documents, the PDFs, can be downloaded, and you can look at the specs and all the recommendations. Let's talk about the librist open source project itself, right? In case you are not familiar with what RIST is, we can define it with just one simple sentence, like you see up there: it's a new protocol for transmission of IP data across lossy networks using UDP with NACK-based retransmissions. Before getting into anything else, I'd like to clarify the three most common misconceptions people have about the RIST protocol, which have come up in talks and conversations. People tell me: oh, well, RIST is only for MPEG-TS. False. Advanced profile includes support for any payload, with clearly identified payload types in the header now.
There's even a registration for payload types, with support for binary payloads, et cetera. And misconception number two: you need a large buffer, and therefore the latency is large, a second or more, in order for you to use RIST. False. You really need two to six times the round-trip time, the RTT, between the two endpoints you're trying to send the data through. So the shorter the RTT, the shorter the total buffer required, and you can talk about 10 milliseconds, 20 milliseconds total latency; it just depends on what network you're deploying it in. In addition, and this is a major misconception on that second point, RIST also supports real-time data channels with no added latency, lossy channels, so that you can have data going back and forth in real time and send data that cannot wait for these buffers. Misconception number three: you can only use RIST for transmitting in one direction, right? You send data over there, it does packet recovery, you're done. That's false. The protocol allows for bidirectional transmission, both with and without packet recovery, in both directions. The limitations are usually introduced by the implementer of the protocol; the specifications are broad enough that each implementer has the freedom to add or remove features at will. So let's talk about the librist development roadmap. How do we determine where to go next? We divide it into three categories. The first one is that we want to improve the reach of the library, and by improving the reach, we mean improving the adoption of the library by client applications, so that users can have it available on every device. librist, of course, adds support for all the different specs, like I showed before, and all the recommendations, so that all these features that give the protocol more use cases are available immediately in the librist library. The second is distribution, right?
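The buffer rule of thumb above (two to six times the measured RTT) is simple enough to sketch. This is an illustrative helper based only on the numbers stated in the talk, not librist's actual sizing logic; the function name and the default multiplier are assumptions.

```javascript
// Sketch of the stated rule of thumb: the retransmission buffer should
// be roughly 2x-6x the round-trip time between the two endpoints.
// All values are in milliseconds.
function recommendedBufferMs(rttMs, { multiplier = 4 } = {}) {
  // Clamp the multiplier to the 2-6x window from the talk.
  const clamped = Math.min(Math.max(multiplier, 2), 6);
  return rttMs * clamped;
}
```

With a 5 ms RTT on a local network this lands in the 10-30 ms range the speaker mentions, while a 200 ms transcontinental RTT pushes the buffer toward a second, which is why total latency depends on the network rather than on the protocol itself.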
We make sure that our library compiles on every platform, so that it can easily be adopted by anybody, and so that it makes it, when possible, into open source applications like FFmpeg, VLC, OBS, etc. As a matter of fact, when running within the VideoLAN CI servers, it compiles on all 21 different architectures that are predefined in their CI. So we're pretty confident that if somebody wants to use it, they can. On the distribution aspect, we also have it in the major distros now, available in Debian, OpenBSD, etc. And the third aspect of determining the roadmap is that we do timely enhancements and timely bug fixes, very quickly when they come about. So on the feature set, I think the most important recent addition, the one that allows the library to be used in this broadcast market, the one-to-many media service scenario, is the EAP-SRP-6a authentication protocol. It was introduced in 2022, and what it allows you to do is this: instead of the normal model, where you have a pre-shared key that you have to share among two endpoints, which is very insecure, because if the communication of that encryption key gets compromised, your entire network is now compromised, this protocol allows you to use a username and password, a unique username and password for each of the connected clients. And part of the protocol, while doing that username and password exchange, which is different for each of them, includes the negotiation and the exchange of the pre-shared key. So there's no risk anymore of that pre-shared key ever being compromised. Other features that allow broader adoption: we're working on a one-way satellite application, we're working on multicast discovery, and a few other things. So, the third aspect, distribution: many FOSS projects already have librist compiled in by default or have it as an option.
If you know of additional projects, please drop me a line; I'd like to keep a database of which projects already include it, if possible. RIST is also part of my own day-to-day operations, which gives us the advantage of finding bugs before they are found in the wild, and we fix them very quickly. Okay. So, performance enhancements over the last few years: we now have the ability to automatically configure based on the network conditions. The librist library measures the RTT with a new packet that was released, the echo packet, 10 times per second. What that does is let us measure, with a UDP ping, not a regular ping, the network conditions between the two endpoints. We know the inter-packet spacing, the variance, min and max; we know the latency; we know all these things. And with those values, with those parameters, if you use the default configuration, the library will auto-adjust its buffer to the perfect buffer for that link, without you having to guess or know anything about the network. It will also adjust the initial buffer, the reordering buffer, based on your jitter on the network: your inter-packet spacing jitter, gaps, and maximum jitter, making sure your reorder buffer is at least that much. And because we've made these very large improvements, we realized we need better metrics, so we've added support for Prometheus and other things straight out of the library, so that you can grab that and, you know, plug it into third-party tools and immediately create a dashboard that gives you proper visibility into the connections. And the last release was just a couple of months ago. So the top priorities for 2024 on the development roadmap: we want to add support for DTLS encryption and authentication, and we want to fully support the new advanced profile that adds, you know, the new header ID with the special payloads.
And we want to try to see if we can get support for the library backported into VLC 3.0. So, the goal of the original project, as we mentioned before, was an interoperable standard for this type of transmission. There were half a dozen or a dozen different methods, and there still are, of doing UDP with packet recovery, each vendor-specific, et cetera. Our goal was to create an interoperable standard with multiple implementations, and I think we've achieved that at this point, at least at the higher broadcast level, with tier one and tier two companies and a lot of the open source projects that support RIST: they all talk to each other, even if it's not the same implementation. So now to part two, right? Let's look at RIST as a live streaming platform, and particularly at a model that does one-to-many. How do you use RIST, and librist in particular, to do an end-to-end streaming chain, like the one we're doing here, for example, or any one-to-many scenario with lots of viewers? So let's diagram a simple scenario here. We have three components: sources; the sender, which is a RIST device; and many receivers on the bottom, where the box here at the bottom symbolizes a single one of those receivers. So we see the logos up there for FFmpeg, VLC, and OBS Studio. It could also be GStreamer, any source, any encoder, it doesn't matter: anything that has the ability to generate a compressed or uncompressed video stream. Really, we need a binary stream of some kind pushed to the library. librist in particular doesn't care what the payload is: you can push anything in the payload and we'll deliver it to the other side. Even though the specs for simple profile and main profile say that you're transmitting MPEG-TS, the library doesn't look at the payload or restrict it in any way. Okay. So the source is sending a UDP, or RTP, media stream into the input.
We buffer it so that we have it available for retransmission, and the minute the buffer is full, we put the sender in what we call listening mode: it opens a UDP port and starts listening for receivers that want that stream, right? So the minute a receiver wants to connect to us, the handshake happens. I'm obviously oversimplifying the handshake process; the SRP-6a protocol is quite complex, and it would take a talk just to go through the details of that handshake and everything that happens, so this is only symbolic. The handshake happens: the username is sent to us, and we check for that username within our database of usernames and passwords. It's not really a database of passwords, but of password hashes, to keep everything safe. If the authentication succeeds, then we send, as part of the SRP-6a protocol, the pre-shared key, so that the receiver can now decrypt the data. Once the data is decrypted, that's it: we have an end-to-end transmission from a source to hundreds of destinations with just the RIST protocol in between. So with proper planning and setting everything up correctly, you can have 300 milliseconds glass-to-glass, one to hundreds of listeners. You need a good network; like I said, the latency depends more on the RTT between the endpoints than anything else. I mention 300 milliseconds because in our large-scale deployments, we've done this anywhere within the U.S. with 300 milliseconds glass-to-glass. When you have to expand it and have users that are across the ocean, or with bad networks or Wi-Fi, the latency will auto-adjust; the protocol will auto-adjust, and for those players, suddenly they get 500 milliseconds. We notice, as a rule of thumb, that somebody on Wi-Fi automatically gets a penalty of another 200 milliseconds. So how do you do this from a practical point of view? The librist project includes some command-line utilities that allow you to send, receive, and relay.
The rist2rist utility is the one... if you want to do a relay application, one-to-many, this is the ideal scenario. You can also do it with a RIST sender, to be honest, but rist2rist is effective because it acts as a relay: it doesn't encrypt or decrypt, doesn't do anything but receive data and send data, both in the RIST format. You can put this in a CDN, your sender anywhere, and you configure in rist2rist a listener with authentication, and then you push your stream from anywhere, your source, like from here, to that endpoint. Then you configure the other end, the one that's going to send to all the viewers, with a database of usernames and passwords, and now you have the full authentication. It adds no additional latency in that process; the only latency is what you decide to add as buffering. As far as quality and quantity, the sweet spot seems to be between 3 and 5 megabits per second, resolution 720p or 1080p, whatever codec you're using gives you better or worse quality, and that seems to traverse all the different VPNs, corporate networks, et cetera, without any issues. Quantity: a rist2rist instance can handle 100 simultaneous connections. The number seems low, but because of the threading model and the fact that it has to do retransmissions, beyond that the retransmissions get compromised. The way you scale is that you can instantiate multiple instances of the same rist2rist application within the same machine, and in our case, we have 1500 simultaneous viewers going off of this type of transmission 24/7. The RIST password utility is also a command-line utility available in the project; it allows you to create the username and password combination hashes, just like the htpasswd file in Apache, which has a similar format; that's why we created it this way.
You run the utility, put in a username and password, and it outputs the username with a hash; then you append that to a file, and the sender can grab that file and use it as an authentication database. If you want to scale that to a much higher level, you integrate directly with the library and use the library callbacks to do the authentication yourself against your own databases, and you can scale that to thousands of users. The command-line sender covers the typical scenario I was using in the diagram: as input you take any type of UDP stream, you encrypt it, and in the output URL, rist://, you add an @ before the address, 127.0.0.1 and the port, just like you do in FFmpeg or VLC and that type of tool when you want to listen instead of send. That creates a listener on that port, and that's all you need to do to create a sender, and to use the sender as a relay as well, just for one stream. On the receiver side, you want a player, for example, where you can put the username and password, right? You put rist:// followed by the address into FFmpeg as the input, or VLC, or any player of your choice. In our case, we did a custom VLC application on a Raspberry Pi, where we were doing this for 1,500 at the same time: Raspberry Pis running VLC 3.0 with a librist implementation inside. The transmission of the secret in this case, which is the password of the username-and-password pair, should be handled outside the scope of the protocol, the same way you share passwords for any account today, and that's it. Then it becomes very simple to create a large-scale network with this. So, in summary, the key feature here is this new type of authentication, which makes a secure implementation possible at large scale, and it gives you better, lower latency than the equivalent HLS or DASH, with a security model that's built into the protocol.
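The credential-file workflow described above (run the utility, get a username-plus-hash line, append it to a file the sender reads) can be sketched as a toy. Note this is only the workflow, not the real file format: the actual librist utility produces SRP verifiers, not plain salted hashes, so the hashing scheme below is an illustrative stand-in.

```python
# Toy credential-file generator in the spirit of the RIST SRP password
# utility described above. The real tool emits SRP verifiers; this sketch
# only mimics the htpasswd-like "one line per user, append to a file" flow.
import hashlib
import os

def make_entry(username: str, password: str) -> str:
    """Produce a 'username:salt:hash' line to append to the auth file."""
    salt = os.urandom(16).hex()
    digest = hashlib.sha256((salt + password).encode()).hexdigest()
    return f"{username}:{salt}:{digest}"

def verify(entry: str, username: str, password: str) -> bool:
    """Check a presented username/password against one stored line."""
    user, salt, digest = entry.split(":")
    candidate = hashlib.sha256((salt + password).encode()).hexdigest()
    return user == username and candidate == digest

entry = make_entry("alice", "s3cret")
print(verify(entry, "alice", "s3cret"))  # True
print(verify(entry, "alice", "wrong"))   # False
```

The point of the format is exactly what the talk describes: the sender never stores plaintext passwords, only per-user verification material it can check a login against.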
It's no longer the browser, or the DRM inside the browser, handling everything; the protocol handles the entire DRM. We have a really solid roadmap for the future. We're looking for additional contributors, people who want to help add the next set of features, and we're looking for open-source projects that want to implement the library; we'll help you put it in. And that's it. Thank you very much. Thank you. Okay, the question is: what if you're pushing your stream to Africa with a really bad connection, what is the acceptable packet loss? I'm not sure what you mean by acceptable packet loss. To me, zero is the acceptable packet loss, and the protocol is capable of achieving zero if you give it enough buffer. Give it a one-second buffer with a 200-millisecond round trip, and you will get zero packet loss. We've done tests and transmissions from Australia. Just two weeks ago I was doing a demo, a transmission from Australia to Madrid: 16 cameras at 10 megabits per second each were being transmitted in real time using RIST and used in Madrid for a production of the event. The transmission didn't lose a single packet, and it was all done across the open internet. We used a one-second buffer there because the connections were relatively good, but if your transmission is really bad, just increase your latency and the protocol will recover. As part of our CI process we have tests that add 50% and even 75% packet loss. You see spikes in bandwidth, but we recover every single packet if you give it enough buffer. Does it support simultaneous bit rates? Does it support simultaneous bit rates? Yes, we support multiplexing. In all these examples I've used just one UDP input, but you can configure the library and the command-line tools to ingest multiple UDP inputs, give each a different ID, and then demultiplex them on the other side.
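The buffer arithmetic quoted a moment ago (one second of buffer against a 200 ms round trip) can be sketched with a simple model: the buffer determines how many retransmission rounds fit before a packet's playout deadline, and, assuming independent loss per attempt, residual loss shrinks geometrically with each round. This is a simplified model, not the protocol's actual scheduler.

```python
# Simplified model of why "enough buffer" drives packet loss to zero:
# each round trip that fits inside the receive buffer is one more chance
# to retransmit a lost packet. Assumes independent loss per attempt.

def retransmission_rounds(buffer_ms: float, rtt_ms: float) -> int:
    """How many retransmission attempts fit inside the receive buffer."""
    return int(buffer_ms // rtt_ms)

def residual_loss(loss: float, buffer_ms: float, rtt_ms: float) -> float:
    """Probability a packet is still missing when its deadline arrives."""
    rounds = retransmission_rounds(buffer_ms, rtt_ms)
    return loss ** (1 + rounds)  # original send plus each retry must all fail

# One second of buffer over a 200 ms RTT allows 5 retries, so even a brutal
# 50% loss rate leaves only ~1.6% residual loss; realistic loss rates leave
# effectively zero, matching the CI tests at 50% and 75% loss.
print(retransmission_rounds(1000, 200))  # 5
print(residual_loss(0.5, 1000, 200))     # 0.015625
```

This is also why the advice for bad links is simply "increase your latency": every extra RTT of buffer multiplies the recovery chances.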
I assume that's what you mean, maybe having different bit rates within the same stream. So the camera sends it on the fly, adapting to network conditions? Correct, yes. One of the specifications you saw in the recommendations is called source adaptation. It was written precisely to accommodate that scenario: the best-practice recommendation on how to do source adaptation, reducing and adjusting the bit rate based on network conditions. It's all documented as part of the spec as well. So for non-MPEG-TS payloads, as you mentioned, is there already a mechanism in place to define the mapping of the different payload types? Absolutely. For the advanced profile, there's a GitHub repository that has the mappings already; we have a dozen or two dozen of them. I'm one of the administrators of the repository. All you need to do is go in and put in an MR for whatever binary payload you want to define. All right, thank you. I have another question here. Is it also possible to multiplex and demultiplex subtitles? Is it also possible to multiplex and demultiplex subtitles? Yes. The protocol itself doesn't care what you put in; we consider each input a binary payload of some sort, and you're the one who determines what the format of that payload is. You have this pipe, you put in multiple UDP streams, and one of them can be your WebVTT payload or closed captions or whatever you want to put in, in whatever format you want. We don't define or control the format of what you put in. We do the multiplexing and demultiplexing, and we give you the capability to assign IDs so that on the other side you can map those IDs to different outputs when they come out. Thank you. But that means you don't do any timing, right, between the different streams? That's all user-side? Well, no. When you give us... The question is whether that means we don't do any timing or synchronization.
On the contrary: because we take care of the multiplexing, when we ingest all the different UDP streams, the timing is guaranteed. The minute we receive a UDP stream, in our library implementation we grab the timestamp at the network card: this stream came in at this time. Then we reproduce that exact timing on the other end. We reproduce the spacing and the pacing, and we make the latency fixed, so it is not variable. That means that when you multiplex many things into the same tunnel, you're guaranteed they're in sync on the other side, or at least as in sync as they were when they came in. What are the use cases of the protocol trending towards? What's the current adoption on endpoint devices, mobile devices, browsers? Okay, the question is what the use cases of the protocol lean towards: point-to-point, point-to-multipoint, browsers, et cetera. This is the last question we have time for. The original idea was to just do point-to-point transmissions; that was the original scope when we created the first version of the spec. That has changed. We achieved that, and now we've gone beyond it. Now we want to tackle distribution: the one-to-many case, the media servers. We actually have a project going on with MistServer to add a lot of this functionality and scalability as part of the project itself, so that there is at least one media server that supports this in a very scalable way, where it becomes very simple for an application like VLC, or FFplay, or GStreamer to hook up to the media server and start playback immediately using the RIST protocol. Thank you very much.
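The multiplexing and fixed-latency pacing described in this answer can be sketched as a toy: each packet is tagged with its stream ID and its ingest timestamp, and the far end demultiplexes by ID and releases each packet at arrival time plus a constant delay, so relative timing between streams is preserved. This is only symbolic; it is not the RIST wire format.

```python
# Toy sketch of multiplex-with-IDs plus fixed-latency replay, mirroring the
# behavior described in the talk: timestamp at ingest, reproduce the exact
# spacing on the other end, shifted by a constant latency. Illustrative only.
from collections import defaultdict

def ingest(stream_id, payloads, arrival_times_ms):
    """Tag each payload with its stream ID and ingest timestamp."""
    return [(stream_id, t, p) for t, p in zip(arrival_times_ms, payloads)]

def replay(tunnel, fixed_latency_ms=300.0):
    """Demultiplex by ID; schedule each release at arrival + fixed latency."""
    out = defaultdict(list)
    for stream_id, t, payload in sorted(tunnel, key=lambda pkt: pkt[1]):
        out[stream_id].append((t + fixed_latency_ms, payload))
    return dict(out)

# Two video packets 33.3 ms apart, plus one caption packet on another ID.
tunnel = (ingest(1, [b"v0", b"v1"], [0.0, 33.3]) +
          ingest(2, [b"a0"], [5.0]))
out = replay(tunnel)
# Every packet is released exactly 300 ms after it arrived, so the spacing
# within each stream, and the offsets between streams, are both preserved.
print(out[1][0])  # (300.0, b'v0')
```

Because the added latency is the same constant for every stream, anything that was in sync at ingest is still in sync at playout, which is the guarantee the talk describes.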
Livepeer Catalyst and The Conspiracy to Solve Video for Everybody Forever
Hello everybody. My name is Eli Mallon and I'm here to talk about the conspiracy to solve video for everybody forever. It's top secret. Don't tell anybody. Definitely don't live stream this talk or anything like that. This is a conspiracy; we don't want to let anything out. Like I said, I've got my socials there. I'm a director of engineering at Livepeer, actually very close to my five-year anniversary at Livepeer, which is longer than I've done anything. Today I want to talk about what motivates me and what I think we can do to make video better for everybody in the world. I'm also going to talk about Livepeer Catalyst, which is software we put out, a media server that I think is going to help us toward this end. I've been working in decentralized video since 2016. I quit my job to try and start my own company, and all the experience I gained from that led me to Livepeer. This is the best life advice I've ever gotten, from Dominic Tarr, who founded Secure Scuttlebutt, if anybody is familiar with that project. I asked him how I could contribute to Scuttlebutt, and he said: figure out what you are uniquely suited to do, and then do it. Since that tweet, that's approximately what I've been doing: I happen to know a lot about video tech, and I thought I could think of ways to improve the lives of people using it. That's what motivates me, and that's where the mission to solve video for everybody forever comes from. We've had a lot of nice state-of-the-union talks in this room about how things are going; this will be the state of the union for our conspiracy to solve video for everybody forever. There have been some setbacks. Video is not yet solved for everybody forever. I'm going to talk about a few of them here. The first I want to talk about is corporate centralization. This is an analysis of game streaming platforms, actually of all live streaming platforms of a certain category.
You can see we sort of live in a world where, to send out video to people, it would seem you need to be a megacorporation. We've got YouTube owned by Google and Twitch owned by Amazon; add up every other competitor and they don't come close to those two. While these platforms have enabled a lot of people to get started with video, there have also been some problems. In order to make these platforms work at this scale, across all the different content producers, they tend to be extremely strict about things like copyrighted music. Even for a clip that you uploaded years ago, if it gets reported, you can get your Twitch or YouTube account permanently banned. This has happened to a lot of people, and it basically wrecks their entire careers; they built entire careers on top of Twitch, on top of YouTube. These backroom deals that Amazon and Google have had to sign with the big music labels cause them, in order to keep the copyright holders happy, to permanently ban these streamers. I want to point out this isn't a law or anything like that. The rules say you're not supposed to use copyrighted content, but in terms of permanently deleting things, that's something they just negotiated amongst themselves. Yet given this chart here, it basically has the force of a law: that's people's entire livelihoods going away. This is one group of people for whom we have not solved video for everybody forever. Another would be, of course, Twitch streamers in South Korea, where Twitch has said, nah, it's too expensive to operate there, after some contract dispute between local Korean ISPs and Amazon.
You just can't stream on Twitch in South Korea anymore, or you've got to go through VPNs and that sort of thing, which is probably fine for the people in this room, but not fine for most people who actually want to run a live streaming channel. YouTube, same deal: if you get three copyright strikes in 90 days, per the contracts they've signed with the big record labels, you're just gone, gone forever. Here's another one that came up in my research just recently. I used to do a little bit of Twitter live streaming, because it was a fun way to show off, especially the coding I was doing. I've got a local archive of all my stuff, because I'm a video engineer and that's how I operate, but most people wouldn't. So this is Twitter saving a little bit of money after the acquisition. And then, of course, Tumblr's ban on adult content, where the algorithms just ended up nuking a lot of people's old Tumblr libraries, things they had spent years cultivating and building, and that's all gone now. I would contend that video has not been solved for everybody forever for these people. Oh yeah, here's a fun one. A lot of people, if you've got a PlayStation and you buy an episode of Mythbusters, because you like Mythbusters, then you have Mythbusters and you want to watch it. Yeah, it's just gone now; you don't get to watch it anymore. Why? Some agreement between Sony and Discovery that happened in some boardroom somewhere that we will never be privy to, and it's just gone. The concept of buying a piece of digital content apparently is not something that really exists. Here's one in the news recently: deepfakes. Taylor Swift was briefly blocked from being searched for on Twitter/X, which is kind of funny, because there were a ton of explicit deepfakes being made of her, and they didn't have any way to get that under control other than blocking her from search entirely. This is an emergent problem.
This has actually gotten worse. Some of this other stuff is long-standing, but deepfakes, as the technology continues to develop, are an emerging problem that's making video less solved for people, so that's not very encouraging. This next one is just a little pet peeve of mine, but have you ever had a video on your phone and wanted to put it up on a TV? Yeah, that's basically impossible. The best ways to do it are Chromecast, which of course means having this proprietary Google framework built into every app on iOS and Android to make it happen, which is not super impressive. There have been some very heroic efforts to reverse engineer it, but I don't think it's quite to the point where it would work in all cases. Or, if you have an iPhone, you can pay Apple 130 bucks for an Apple TV. Think of the cumulative video knowledge in this room: that TV knows how to play an H.264 video, I guarantee it. It shouldn't be a problem, but to this very day it is. This next one is maybe my personal pet peeve, but I've worked in video-adjacent startups for a while, so I'm going to take it out on B2B SaaS a little bit. This is the fate of every video startup if you don't actively fight against it. You might start with high ideals and all the stuff you want to open source, a really accessible product for everybody, but then the big client comes along with specific requirements and specific demands, and your company basically just turns into doing whatever they want. I've been around the video industry for a while; this is a very common pattern that's happened to friends of mine, and I think it's sunk a lot of other really promising products. More on that at Livepeer in a little bit. That's the depressing part of the talk. Let's talk about some successes, things that have gone really, really well. One thing that really took off during the pandemic is NASes and Plex servers.
I don't know if anybody here has a Plex login or a Synology login for a friend's NAS or that sort of thing. My group of friends got really into this in the pandemic, where anybody could access it. Shout out to my mom, who I think has ripped more DVDs than anybody in America. I bought her a NAS for Christmas a few years back, and they talked her and my stepdad through setting up a RAID on the NAS to get more storage on there. That was maybe the best Christmas present I've ever given. These don't go away, like Mythbusters on Discovery does when you buy it on PlayStation. Take a DVD, put it on this, and it keeps working. This next one is a little self-explanatory. This works; it has worked since it came out. BitTorrent is arguably the most successful decentralized project of all time. It works consistently, you can get everything on it, and your content doesn't get yanked when some weird contract negotiation happens. This next one is maybe a popular one, maybe an unpopular one, but I did some work with video NFTs at Livepeer, and these still work. On OpenSea, the video is all backed up on Filecoin, you can play it there, and there are entire crypto platforms where your creator library is an NFT collection associated with your crypto address. They could decide not to mirror that content for you, but in terms of having your own content library, compared with the YouTube, Twitch, and Tumblr cases where the company could just yank it from you: well, you have to have some faith in the blockchain you're running on, of course, but other than that your library is your own. So those are some good ones. There are also some emergent projects I want to talk about that I think are going to help us solve video for everybody forever. The first is that decentralized social is going to become a thing. I put some of my favorite ones up here.
I am a big fan of Bluesky and the AT Protocol. I actually got my start in decentralized tech, pre-crypto, working with Scuttlebutt from Dominic Tarr, who you saw earlier. There's some stuff on Farcaster and Lens in the Ethereum ecosystem. ActivityPub and Mastodon I sort of give half credit. I have a lot of love for people in those ecosystems, and there's a lot of incredible work happening, but I think a lot of them would acknowledge that the weakness there is the lack of a portable user that can move between servers: the concept of a user with a key that can sign data and own what they publish, which allows them to own their own media library and not be nuked when a corporation signs some deal or whatever. It's consistent with what I call the fundamental principle of decentralization, which is that user actions are sovereign. If I upload a video, I should have access to that no matter what else happens. They could decide to take my videos down from their servers, but the definition of my content library, which videos are associated with me, shouldn't be something that anybody can take away from me. Some of you may be familiar with this next one: there's a group called the Coalition for Content Provenance and Authenticity, the C2PA. It's a bunch of companies now; it was started by Adobe and the BBC. This does a couple of things. It's these companies getting ready for a world full of deepfakes: in a world where these models proliferate and anybody can generate a video that says whatever, how do you ever trust the video you're looking at? The answer here is actually pretty similar to the blockchain answer, which is that you do so with signing chains. You've got a C2PA-enabled camera, which has a little encoding chip with a private key associated with it. That signs the video as soon as it's created.
That goes to some editing software that makes some transformations and signs that it did so, and by the time it gets to a user on the web, they can theoretically click in the corner and see: okay, this video was created at this time and edited by this person. It's not perfect, but especially with the provenance chain back to the original camera, you can have more confidence that you're looking at the right thing. This isn't in wide adoption yet, but I'm very excited about it, and I think it has an answer to some of the things I'm talking about here. The goal in my mind, leveraging this in the social case, would be: I could livestream on Bluesky, transcode that video on the Livepeer network, more on that in a second, and then somebody else could post a clip to their Mastodon. You could look at that clip and say, oh, I'm looking at a clip from Eli's livestream, through the magic of signing and deterministic transformations and that sort of thing. This is a really crude example we came up with for how that could look: a little button you can press in the corner of the video to get information on it. So that's me talking about some problems and solutions in video as a whole. Let's talk about what Livepeer has contributed to this. I'm going to give a really fast overview of Livepeer and the Livepeer network. Livepeer's mission is to build the world's open video infrastructure. We started out doing that with a decentralized network of video transcoders. Most people in this room have probably run into the fact that video transcoding is very expensive; doing this kind of processing can cost as much as a dollar per hour of video on different cloud services.
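The signing-chain idea behind C2PA, as described above (camera signs the capture, each editing step signs its transformation on top of the previous signature), can be sketched as a toy. The real C2PA format uses X.509 certificates and CBOR manifests; the HMAC below is an illustrative stand-in, and the actor names and keys are invented for the example.

```python
# Toy provenance chain in the spirit of C2PA: each step signs a record that
# includes the previous step's signature, so a verifier can walk the chain
# back to the original camera. HMAC stands in for real certificate-based
# signatures; all names and keys here are hypothetical.
import hashlib
import hmac

def sign_step(prev_sig: bytes, actor: str, action: str, key: bytes) -> bytes:
    """Sign one provenance step, binding it to everything that came before."""
    record = prev_sig + actor.encode() + b"|" + action.encode()
    return hmac.new(key, record, hashlib.sha256).digest()

camera_key, editor_key = b"camera-secret", b"editor-secret"
s1 = sign_step(b"", "camera-01", "capture", camera_key)   # capture step
s2 = sign_step(s1, "editor-app", "trim", editor_key)      # edit step

def verify_chain(steps) -> bool:
    """Recompute the chain; any tampered step or wrong key breaks it."""
    sig = b""
    for actor, action, key, expected in steps:
        sig = sign_step(sig, actor, action, key)
        if not hmac.compare_digest(sig, expected):
            return False
    return True

print(verify_chain([("camera-01", "capture", camera_key, s1),
                    ("editor-app", "trim", editor_key, s2)]))  # True
```

Because each signature covers the previous one, a viewer who trusts the camera's key gets exactly what the talk promises: confidence in where the video came from and which transformations it went through.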
The Livepeer network instead has this group of decentralized orchestrators, basically people running video cards in lots of unconventional places, not in data centers as people might be accustomed to. Because of that, we can offer this radically low-cost transcoding solution that's much cheaper than existing approaches. But we got caught in some of our own traps here. We had the decentralized network, and it worked really well, but as we stepped into the video industry, people didn't want to hear about loading up an Ethereum wallet with funds and transcoding on the network and that sort of thing. They were like, hey, can I log in with an email and password? So, sort of because it's what you do, we ended up building a SaaS product, which I'm really proud of; we did a lot of really good work. Shout out to the MistServer team: MistServer provides the core streaming engine for the live streaming parts of this. But it got to the point where engineers at Livepeer couldn't even boot this up on their own laptops. It was just so chaotic, so many different microservices, and with a globally distributed team it tended to be that one person had their own service that only they knew how to operate. My work for the last six months in particular started out as something called Livepeer in a Box, where we take all of this and put it in a single Docker image with a single point of configuration management, and that product became what we now call Livepeer Catalyst. We've been using the phrase "GitLab model" a lot: there's the Livepeer Studio hosted product, and there's Livepeer Catalyst, the self-hostable version that anybody can run on a laptop. I'm just going to give a really quick demo of this. Let's see if it'll cooperate. There we are. There are instructions for this on docs.livepeer.org.
I intended to have this running when I started, but I ended up having to restart this laptop. Good, good. We should get a ridiculous amount of log spam here; shout out to MistController booting everything up. Okay, okay. As I mentioned, there are lots of different services, all crash-looping until everything is set up and running. We'll go over to... this is localhost; I just have it as my URL here for the purposes of putting a TLS cert on it. This is a full self-hostable version of Livepeer Studio. I can step into a stream, and hopefully this will work: I can go live right from here. Nice. We support WebRTC ingest, RTMP, SRT, and then RIST as soon as the MistServer project that was mentioned ships it; we'll have that. I'll stop the broadcast here. It's got scalable live streaming that can be fanned out to as many nodes as you want. We've scaled up to... we tested most recently with 200,000 concurrent viewers all over the world. We've got assets; there's one I recorded when I was testing this, my live stream from when I was sitting there finishing my slides, stored as an asset associated with that stream. We've got lots of different features here: multistreaming, so you can push out to Twitch and YouTube, plus things like webhooks and signing keys. We're going for a freely distributable, full-featured server here. I want to leave some time for questions, so I'm going to jump back over here. Concluding thought: what do you get when you put all these things together? We've got the rise of decentralized social. We've got things like the C2PA, which is going to provide content provenance for video, so we know where it came from. We've got this freely distributable, MIT-licensed server with all these different capabilities. What do you get when you put all of those together? The answer is: I don't know yet. This is all emerging very quickly. I can see the future a little bit.
I can see a world where some of these decentralized social projects start to take off, and of course any social thing is going to have live streaming eventually. To make all of that work, it cannot just be assertions by a particular server; we want to have signing mechanisms and that sort of thing. I'm looking forward to that world, and I'm looking forward to building it. If any of you are interested, you can join the conspiracy here; that's the Livepeer Catalyst community page in the QR code on the right, if anybody wants it. We're hosting a party this evening at Market Bar at 7 p.m. Feel free to stop by, and we'll talk about solving video for everybody forever. I've got a couple of minutes for questions, if anybody has them; otherwise, I can just give you some time to scan some QR codes. How can people contribute, was the question. There's a landing page over there talking about how to get started. Actually, if you go to docs.livepeer.org, there's a whole Catalyst section now, covering both how to boot up your own Catalyst node locally and how to develop on it, if you want to make changes to the internal microservices and that sort of thing. We also have, every other week (I didn't do it yesterday because I was here), the Catalyst Hackers Group on the Livepeer Discord: discord.gg/livepeer, or you can just google it and figure out how to get there. All the people helping us build this coordinate there. Cool. Thanks, everybody.
StreamCrafter - In-browser broadcasting
Hi everyone, my name is Marco. I'm one of the maintainers of MistServer, and today I want to talk a little bit about the StreamCrafter, which is a broadcasting studio that runs in any modern web browser. Any questions about that? Well, it looks like there are still a few minutes left, so I'll go ahead and give a bit more context about the StreamCrafter. First of all, this is developed by the MistServer team, so I want to talk a little bit about what MistServer is. Then I'll move on to the StreamCrafter itself, do a quick demo, and talk about what's next on the roadmap. First of all, what is MistServer? Well, it's a media server. It's completely open source and public domain right now, and it has very broad support in terms of ingest protocols, delivery protocols, codecs, and containers; it remuxes on the fly, and it's very efficient in memory and CPU usage. We think it's a fairly cutting-edge media server, so to say. Hopefully we'll get some more contributors in the long term to work on the StreamCrafter with us, because we're all backend engineers, and making the interface nice and snappy is not our strongest suit, but we're working on it. So what's the StreamCrafter? Well, let's say you're developing a social media platform and have to deal with a whole bunch of users who don't know anything about codecs or about configuring OBS to get lower latency or higher quality. Basically, they just want to drag and drop their inputs and have a go-live button, and that should be it for them. So what we want to create here is something very intuitive to use, where the system integrator can set up which delivery protocol they want to use and just drop it into their platform.
It's a drop-in React component, and it's also a compositor, of course, so the user can add cameras or screen shares. In the future we also want to add the ability to pull in streams from MistServer, or pull in the video and audio feeds from a WebRTC conference call, so that the streamer can composite all of these video feeds, use the audio mixer to their liking, and just broadcast that. Cool. So this is a slightly outdated overview of how the StreamCrafter started. As you can see, it's not that complex: you have a way to add inputs, then a way to mix all these inputs together and process them, add an overlay or a sound effect or whatever, and then you need ways to broadcast it. Now, one thing you don't see in this image is how the media data gets from the input to the compositor. Right now, by default, it all happens on the main thread, which is not ideal, because with a whole bunch of inputs it slows down a bit. So we're moving to a web worker mode, which is already implemented, but the web APIs aren't quite there yet to make it work really well. At the moment, during a broadcast the web worker asks the main thread for new frames, and the main thread sends individual frames back to the compositor, and it works. It's not ideal, but in the future we hope to make use of the MediaStreamTrackProcessor API, which isn't available in all modern web browsers yet. It would allow you to transfer the entire video buffer into the compositor, which can then do all its work in a separate thread and broadcast directly. So let's move on to the most exciting part, which is going to be the demo. Let's move this a bit to the side. As you can see, the interface is not our strongest part, but it's usable, so we're going to just add a scene and add a couple of inputs. It's a bit difficult to use from... oh, that's not a good start... oh shit.
Looks like my mouse isn't working on the big screen. Cool, so it looks like we won't have a demo of this, but feel free to visit the website video.strong.rocks and play around with the broadcaster. Basically, you can just drop sources onto the canvas; you'll also have a way larger canvas on your own monitor than this little screen over here, and it should stream in low latency, and you can just share the link with a viewer anywhere else in the world and they'll be able to view it. There it is. Yeah. Here's the inspiration here. Oh, yeah. Thank you. Cool. Can you add a second window, and we can also put the player side by side? I'll let you. That's green. So, second screen. Well, let's just start streaming first then. That's fine. Yeah, so we've added a scene. Now let's just add a tab screen share. And then you just drag that on top of there. The scaling is a bit off, but it should be better on your own monitor. You can crop layers if you want: if you only want to share this part of the screen, it will automatically crop the input to fit inside the layer, so that people don't have to worry about the input getting stretched and looking off. Yeah, and then you just click start, share the link with your viewers, and they can view it instantly. Cool. Well, it looks like it's not responding anymore. So, what's next? Well, as you can see, the UI needs a bit of an overhaul; it needs to scale better for mobile devices and low-resolution screens. Secondly, a code refactor, because it's currently a bit of prototype-grade code; we want to make it extensible and easily maintainable. We want to have a plugin feature, so people can add their own processing to the video layers, for example. And lastly, integration. Currently you can broadcast in WebRTC, which is fine for low-latency workflows, but maybe you want something with a bit higher quality at the cost of a bit of latency.
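The automatic crop-to-fit behavior mentioned in the demo (crop the input to the layer's aspect ratio so nothing gets stretched) comes down to a small piece of geometry, sketched below. This is illustrative math, not code from the StreamCrafter itself.

```python
# Illustrative center-crop math for fitting a source into a layer without
# stretching, as described in the StreamCrafter demo: crop the source to the
# layer's aspect ratio, then scale. Not taken from the actual implementation.

def center_crop_to_fit(src_w, src_h, layer_w, layer_h):
    """Return a crop rectangle (x, y, w, h) in source pixels whose aspect
    ratio matches the layer, so scaling into the layer never distorts."""
    src_ar = src_w / src_h
    layer_ar = layer_w / layer_h
    if src_ar > layer_ar:                 # source too wide: crop the sides
        w = round(src_h * layer_ar)
        return ((src_w - w) // 2, 0, w, src_h)
    else:                                 # source too tall: crop top/bottom
        h = round(src_w / layer_ar)
        return (0, (src_h - h) // 2, src_w, h)

# A 1080p screen share dropped into a square layer keeps its full height
# and crops 420 px off each side instead of squashing the image.
print(center_crop_to_fit(1920, 1080, 500, 500))  # (420, 0, 1080, 1080)
```

The same rectangle works for any layer size with the same aspect ratio, which is why the user can freely resize layers without ever seeing a stretched input.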
And so we're thinking of a tight integration with MistServer, so you can stream media data, for example in Matroska format using HTTP readable streams, directly to MistServer. That way you will get a bit higher quality, without the low latency of WebRTC of course, but it should look a bit better. Cool. Are there any questions? What format is the video from the StreamCrafter to MistServer in this case? If you're using WebRTC it will be... sorry. So, what format is being streamed from the StreamCrafter to the media server? Well, if you're using WebRTC it will be Opus audio, so you might have to do a bit of audio transcoding to get to AAC. If you're streaming with the other options which we'll be adding at a later date, you will be able to transmit in any other audio format which is supported by the browser. What's the video codec being transmitted? It will be VP9. But this is also something which you can modify if you want to use a different codec. Regarding the video and audio, what are the limits of the codecs, what you can do with them? Is it machine-specific, browser-specific? Where do the limits come from? So, what limitations are there on the video and audio codecs, given that it's all running inside a browser? Well, I think the biggest limitation depends on which browser they're using, a modern browser or an old browser; modern browsers have very wide support for basically anything you want. Of course, anything that's happening on the main thread can get a bit slower over time. That's why we want to move all the compositing to a separate web worker thread, to keep the main thread nice and snappy. Because if you add lots and lots of inputs, you can notice the UI starts to slow down a little bit. That's what we're trying to prevent there. So in a new browser, if it has AV1 support in the future, you'll be able to just... Yeah.
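The "Matroska over HTTP readable streams" idea above could look roughly like this in a browser. This is a sketch under assumptions: the ingest URL and the WebM container choice are illustrative, not a shipped MistServer API, and streaming request bodies (`duplex: "half"`) are Chromium-only.

```javascript
// Record the composited MediaStream and stream the chunks to the server
// over one long-lived chunked HTTP request, instead of going through WebRTC.
function broadcastOverHttp(mediaStream, ingestUrl) {
  const recorder = new MediaRecorder(mediaStream, {
    mimeType: "video/webm;codecs=vp9,opus", // WebM is Matroska-based
  });
  const body = new ReadableStream({
    start(controller) {
      // fetch() request streams must carry byte chunks, so convert each Blob.
      recorder.ondataavailable = async (e) => {
        controller.enqueue(new Uint8Array(await e.data.arrayBuffer()));
      };
      recorder.onstop = () => controller.close();
      recorder.start(250); // emit a chunk roughly every 250 ms
    },
  });
  // The request stays open for the whole broadcast; the server reads
  // the container as it arrives.
  return fetch(ingestUrl, { method: "PUT", body, duplex: "half" });
}
```

This trades WebRTC's low latency for quality: the encoder can use higher bitrates and the transport is reliable, matching the "a bit higher quality at the cost of a bit of latency" option described in the talk.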
...not AV1? AV1 wouldn't be supported right now. I don't think it's supported at the moment, but maybe in the long term. You're saying you did back-end engineering, but this is all done in the browser, right? Well, I consider it a bit of back-end engineering, because you have to transfer media data to the web worker, you have to overlay things, there's a little bit of math there. It is a little bit of back-end work, but of course the UI presenting it is all front-end work. Is there anything that should be running on a server? So, this is all running inside the browser. It would be cool, of course, to offload it to, for example, a MistServer and do the compositing in the background, because then you can maybe do a bit more fun stuff with it. But the idea is that you can just drop this into your existing platform and your users can start going live without any other setup required. Is the rendering hardware accelerated, or does it need to be? It is currently not hardware accelerated. What framework are you using for the UI? Currently, it's all written in React. We do want to move some of the processing into plain JavaScript, maybe, but currently it's all hooks and components. I'm assuming it's open source, but where can I find it? Yes, sorry about that. So the question is, is it open source? Yes, but we're still working out which exact license we want to use. We do have a repository; it's not public yet. I was supposed to do that before the talk. So, yeah, probably later today we'll have the GitHub repo up with the full roadmap and demo link. How many people are working on this? Currently, it's mostly just me. It's a product by the MistServer team, and hopefully, of course, in the long term we'd like to have more contributors, because it is an open-source project.
It would be cool if we could have other people work on some nice plugins for more video or audio processing. But this was all written by me at the moment. Sorry, what would be the next steps for this project? What will the next steps be? So, we do have a full roadmap of other features we want to add. Of course, the UI is the most important thing to fix, because that's what the end user will be interacting with the most. We also want to have more publishing options, because WebRTC does degrade the quality a little bit; we want the option for the integrator to say, you want the full quality, maybe at a higher latency. And also more input options: currently it only has screen shares and the video and audio devices accessible by the browser, but it would be fun if you could add way more inputs than that. And we're not really focused on having advanced editor controls, because that would maybe be too daunting for a normal person to use. It's more about giving the system integrator the flexibility to choose these kinds of inputs and these kinds of outputs. So the question is, how is this being uploaded? Is it being recorded or not? It's a broad question because I do these solo jam sessions, and if I can just use my phone to do it and people can watch it, that's fine, but I also want to get the video afterwards. Yeah. So at the moment, it does not do recording. Of course, you can send it to a media server, because it does support any WebRTC-based media server as well as MistServer's own signalling protocol, and the media server can do the recording on the server side. But adding recording from the browser directly wouldn't be too hard a feature to add, I think. It would be a nice one to add to the roadmap. So, thank you. I have a couple of questions. I don't know, do we have time? Yeah, no, that's fine. I think we have some time. One more question.
Is there a commercial app that is similar to this? Sorry, can you repeat the question? Are there commercial applications similar to this? Well, Restream, of course, they have a web-based broadcasting suite, but they don't have the compositing in the browser; the user can choose a layout and add a few cameras and things like that. But it does look really nice, and we want to get some ideas from their user interface. I don't think there are many competitors in terms of what we do exactly. Of course, you have OBS, but that's a client application which users have to install and configure themselves. Restream, I'll have to check it out. Cool. Thank you.
PipeWire State of the Union
Alright, okay. My name is Wim Taymans, I work for Red Hat and I started writing PipeWire some seven years ago, I don't know anymore, way too long. I gave a talk about PipeWire last year, so this is basically a follow-up on that, a bit about the things that happened in the last year. For those who don't know what PipeWire is, it's basically a multimedia sharing and processing engine. PipeWire was originally built to send video frames from Wayland to applications, because screen sharing in Wayland was completely unimplemented in anything, so there needed to be some way of funneling those frames around. It went through a whole bunch of iterations to make that happen. It started with GStreamer, then some custom implementation, then version 0.2, which is something that sort of worked, and then it sort of evolved into an audio framework, because people think PipeWire is for audio, but it's actually more for video. So it evolved into an audio framework, and here we are now. The core of PipeWire is to link applications and hardware into a graph. It's very similar to what GStreamer does: you make a graph of processing elements. In PipeWire's case, this is distributed, so it's an IPC mechanism to funnel multimedia around between apps, devices, and so on. There's a whole bunch of multimedia that you can funnel around: cameras, screen sharing, but also audio. PipeWire tries to implement all of the APIs to make that possible. So there is support for Video4Linux, there is support for Bluetooth, there is a compatibility server for PulseAudio apps for audio, and a compatibility library for JACK applications. So you get all of these things, all sides covered, and you can also run JACK next to it, but in essence, it funnels data around. It's built on the same principle as GStreamer, so it doesn't exactly know what the data is, it just funnels it around. And it does so very efficiently, or tries to. So that's basically where it is now.
We managed to build a whole bunch of stuff on it and replace PulseAudio and the JACK daemon on most desktops now with PipeWire. So 1.0 was released last year, that was a major milestone. Very happy about that. For that to happen, I wanted to have at least as good latency as the JACK server, so that we could actually replace pro-audio use cases with PipeWire without having to sacrifice latency or performance. That took a while, but it eventually worked, and now we are on par with JACK regarding latency. It's using quite a bit less CPU for large buffers, and it's getting almost a little bit better than JACK for very small buffers. So that's pretty good. One of the reasons for that is that JACK is more efficient at lower buffer sizes, but PipeWire is more optimized in its conversion and in funneling samples around. So that's the compromise, I guess. Compared to last year, we now also have support for NetJACK with Opus. I think it was a question last year, why don't you have that? Well, now it is there, so you can actually do NetJACK between JACK and PipeWire; they're compatible. One thing that doesn't work very well yet is FireWire devices. The problem is that I don't have a FireWire device. You can't really buy them anymore, so somebody needs to send me something. They are like €1,000, you can maybe buy one, I don't know. It's also professional audio, so you need cables to connect; it's not just plug and play. So that's still a bit of a gap. What else are we working on right now? AES67. It's basically RTP, used by various professional hardware that does audio over IP, so you can interface with Dante devices and so on. It requires a shared clock with PTP and all of that. So we have worked on that in PipeWire: you can run the graph with PTP clocks, it syncs and all of that. People are testing that. Very specialized hardware; I don't really have any of these things.
On the other end, we are now past the audio stage and we are going back to video, because last year some things fell into place to make that possible. For example, video modifier support was added. It requires a multi-step negotiation: I have these modifiers, do you support that? Yes, I do, I do all of them, but then what video formats and what resolutions? And you need to go back and forth to arrive at the video format that the compositor, in this case, and GStreamer, for example, or any other application like OBS, can use, so that the most efficient video frames get negotiated. We also added support for compressed audio formats. For Bluetooth, we are still tracking LE Audio, which is a draft. There is development in BlueZ, which is the D-Bus service that runs and handles all the connections with devices. It exposes a D-Bus API that an audio server such as PipeWire can use to talk to the Bluetooth devices. So there is development there, and we are trying to track it and match it to make that work. Some small things were added that we don't actually know what to use for yet. The interesting thing that is happening is the video support, and I hope this year this will continue going forward. We added PipeWire camera support in Firefox. That means that instead of Firefox going directly to the Video4Linux device with ioctls, which is not so nice in sandboxes, and which also doesn't work with newer cameras, because newer cameras need much more setup, they need media controller setup and all of that. There is a newer library called libcamera that also handles these new kinds of cameras that you are supposed to use. So instead of porting Firefox over to libcamera, it's better to port it over to PipeWire, because then you get all these new cameras, but you can also do other things, like send video frames between applications into Firefox.
I was going to try to demonstrate that, but the camera support in OBS is still a pending patch, so maybe next year. So there is also camera support there. OBS is an application for making screencasts and YouTube videos and things like that, so you can compose some things. There is also a thing called the virtual camera: OBS can export its scene, and it looks like a camera that PipeWire creates, and then you can consume that feed in Firefox, and you can start chaining, just like you would chain audio processing elements, but with video. That's hopefully something that we will try to make work this year; there's some more work needed to get that going. On the audio side we are bug fixing and doing small improvements, because there is nothing really big to be done on audio: we know it should work, and all the remaining problems are, in my opinion, driver issues, timers that don't work so well, unpredictable delays in drivers. So I think that work needs to be done somewhere else; no immediate plans to fix things there. So all the work goes into the video side of things: video routing. We're working on video converters, so that we can convert between formats; if you want to implement certain shaders that work on one format and not on others, this should be made possible. Also processing filters with Vulkan shaders. So, now that Firefox and OBS use PipeWire for cameras, we need to start thinking: this is now going to work in Flatpaks without having to open up the whole socket. But then we can also start adding security, like the pop-ups: do you want to allow this camera, yes or no, or take away access to the camera if you don't want it anymore. There are some talks about making that better. This is currently implemented with the portal. But there are other use cases; for example, we don't have any access control for audio in browsers at all.
But that is something that we'll hopefully flesh out this year. Another thing: explicit sync support. Again, if you do video processing, it's better to queue up as much work in the GPU as you can and then have the GPU itself synchronize all the buffers, waiting for rendering and things like that. Explicit sync would transfer buffers along with a file descriptor that you can use to wait for completion of the buffer data. So that's also something that we want to try to do. And then tooling and docs, the things we continue doing. So I was going to show you a little bit of what the video side looks like; everybody knows the audio side. Also a little bit of the tools here, I don't know if you know any of these things. So there's a top-like tool, pw-top. This is interesting. It doesn't do anything because there's nothing going on. But you can also get a graph view; I'm showing that now, because then you can see the cameras as devices as well. So if you, I don't know, let's see if this is going to do anything. It probably does, but it's going to the HDMI output. Anyway, you can have a little look at what's going on here. This is a tree view of the graph, basically. So you have the audio driver iterating and pulling in samples from another tool, paplay. You can also see this as a graph view. And all of these things, you can link them together to other things. So each of these devices and nodes is in a graph. You can visualize the graph, you can change the links between these things, and so on. So for example, for OBS, this is kind of what it is. Well, I made a very stupid scene, but you can make some interesting things; you can put some backgrounds there and place yourself there. So this is using screen sharing from one of my windows.
I think the terminal, but it could be anything using PipeWire, and also the camera capture, which is a new thing using PipeWire. And these things here, the microphones, they are still a bit PulseAudio. You can look in the graph here; it's becoming a bit more complicated, but you can see these yellow boxes here light up. So, GNOME Shell, that's the screencast stream that sends video to this one; that's the camera going to OBS. Yeah. I was going to show some Firefox things, but there is no export button here. Normally in OBS, you can now start streaming and send all of that to one of the hundreds of destinations that are supported, but you can also start a virtual camera, and then you could consume that camera, or this composition, in other PipeWire apps. So if we enable all these PipeWire apps and make them as efficient as possible with all of the video modifiers and all of the tools that we get from Vulkan, we should be a step closer to the ultimate goal. Yep. Another interesting thing that I haven't shown yet is what's called the filter chain. You can do this: you make a small little file. Wait, let me see where I put that again. Yeah, this one. It's a config file. It's not very easy to write, and I can imagine GUIs that generate these things, but nobody has written any of them yet. But you can basically make a little graph of LADSPA and LV2 plugins, link them together, and then tell PipeWire to make a new sink out of that. That's the input for applications to use, and then there's the output of this filter. So I'll use some debug output here. Okay. You can run this graph, and if all goes well, you should also see a new sink here. So this is this new thing that appears. You can just stop this program again and take away the sink. So this is interesting. And did I quit?
I can do it again. And here is this new volume sink. So you can create and remove devices on the fly if you want. It's a bit like PulseAudio with loading modules, but in PipeWire's case, you don't need to load them all into one daemon; you can have separate programs starting and stopping them as they go. This filter chain, for example, is used for implementing things like sound correction for speakers and all of that. We haven't done any of these things on the desktop yet; maybe something we can do. For example, the speakers on Apple laptops sound so great because they do a lot of filtering to match the frequency response of the speakers and all of that. If you don't have that, it sounds very thin, and a lot of laptops need some extra processing to make them sound great; sometimes that's why they sound a lot better on Windows. We don't do any of these things yet. So that's also something we can do with these filters. Alright. Something else that I don't mention here, because it's actually another project: the session manager. One big component in all of this is the session manager; we normally use WirePlumber. That one orchestrates all of the things that happen in the graph, the devices that appear: if a player comes, where is it going to be linked, how is it going to be linked, is it going to do mixing or down-mixing, or does it need some filters before it does that. All of these rules are external to PipeWire, in the session manager. So a lot of work is also happening there; it's a separate project. There's, for example, a 0.5 version coming out where all of the config files are rewritten in a different way. So that's also a change, interesting things that are going to happen there. For the PipeWire daemon itself, I think it is what it is; no new plans. Okay. Yep. So the usual. Yeah, we worked a lot on our documentation too.
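For reference, a filter-chain config file of the kind shown in this demo looks roughly like the following. This is a sketch: the node names and the `bq_peaking` control values are illustrative, so check the module-filter-chain documentation for the exact syntax on your version.

```
# ~/.config/pipewire/filter-chain.conf.d/eq.conf (illustrative sketch)
context.modules = [
    { name = libpipewire-module-filter-chain
      args = {
          node.description = "Speaker EQ Sink"
          media.name       = "Speaker EQ Sink"
          filter.graph = {
              nodes = [
                  # One built-in biquad peaking filter; LADSPA and LV2
                  # plugins can be chained here in the same way.
                  { type = builtin label = bq_peaking name = eq
                    control = { "Freq" = 120.0 "Q" = 1.5 "Gain" = -4.0 } }
              ]
          }
          # The new sink that applications will see...
          capture.props  = { node.name = "effect_input.eq"  media.class = Audio/Sink }
          # ...and its processed output, linked to the real device.
          playback.props = { node.name = "effect_output.eq" node.passive = true }
      }
    }
]
```

Running a separate `pipewire -c filter-chain.conf` instance picks up drop-in files like this and creates the sink on the fly; stopping that process removes the sink again, exactly as in the demo.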
There's a lot more stuff there. The wiki also has a whole lot of stuff. It's a bit difficult to organize all of these things, of course. This is weird, I didn't start the browser. Well, I could do that, I guess. We've got tons of information on the wiki; all of the stuff should normally be documented somewhere or another. The problem is that there is so much configuration and so many options that people get lost. I tried to do some simple guides: how do I enable multiple sample rates, and you literally have to make this file and put that in it, that's it. Alright. And GitLab, that's where we are. So yeah, questions. Yes. Speaking of docs, I was looking at them just the other week. I assume you have the ability to use your own event loop, rather than what the basic tutorial says, which is to create the PipeWire one. So the question is, can you use your own event loop, or do you have to use the PipeWire one? You can use your own. You can create the PipeWire loop, get the file descriptor from it, and add that to your own loop. For example, GNOME Shell does that; it uses the GMainLoop. KDE as well. Is there some project which hooks speech recognition into the audio part and creates subtitles on the fly for what's in the stream? The question is, is there an application that hooks into the audio stream and generates subtitles on the fly? No, but it would be a great project, I think. There's also the case, for example, of keywords, listening for keywords like, hey Google, okay Google, or something like that, or, I don't know, hello GNOME. Yes. When you talked about consuming the virtual camera, would you be able to send those sources to multiple destinations? Yes, so the question is, can a camera be sent to multiple destinations? Yes. There can be multiple consumers of one camera in PipeWire. I can actually show that, just to show what's going on.
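As an aside, the "multiple sample rates" guide mentioned a moment ago really does boil down to one small drop-in file, roughly like this (the rate values are just an example):

```
# ~/.config/pipewire/pipewire.conf.d/rates.conf
context.properties = {
    default.clock.rate          = 48000
    default.clock.allowed-rates = [ 44100 48000 88200 96000 ]
}
```

After restarting PipeWire, the graph can switch between the allowed rates depending on what the active streams ask for, instead of resampling everything to one fixed rate.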
How am I going to do that? I can, for example, start OBS; that's one consumer of the camera. And there's also, let's say, I think there's an example here, in build, examples, I think it's called video-play. Other way around. No. The thing is, of course, the second one has to use the same resolution as the first one; there's no conversion going on immediately. There's a way to reorganize the negotiation and all of that, but that is a policy thing for WirePlumber, I think, so it's not immediately implemented. Yeah. I was curious about the network capabilities. I know that there is an AES67 plan, and I was wondering if there is the same thing for video, maybe SMPTE 2110 or NDI or things like that. So the question is, RTP or network support for video? Completely unimplemented, at all. Only done for audio. Yep. What's the current state, have you been involved with people who are using the AES67 support? Yeah, I know people are testing it. There's an issues page about the state of it; I'll have to look up what it is exactly, but you can find it if you look for AES67 in the issues. You find all of the hardware that people test with, the tweaks they have to do, and we try to collect all of that, so that's ongoing. I have to go over here. Yeah. Thank you for making it, because I switched to PipeWire like two years ago and it was just a very pleasant experience, because it just worked. And I've also been using it in music-related stuff, and it replaces JACK for me as well. Yeah, it's great. Cool. That was the plan. If I have to repeat the question: it wasn't a question, it was just praise. Yeah, we have more questions. I have two questions. The first one is about WirePlumber: does it have a GUI for using it, or is it just command line? Command line. So the question is, does WirePlumber have a GUI? No. No GUI.
So you can, for example, have several applications and all the sources you can.
Open Source Community Updates
Good morning everyone. I'm going to speak about a few things happening in different open source multimedia communities, and I'm going to speak about what happened last year, since we're at the beginning of the year, which is why we are talking about 2023. For those who don't know me, I'm the president of the VideoLAN nonprofit and an active member of a few open source multimedia communities. I'm also doing things outside of the open source multimedia communities, but I won't talk about that here. I also have a few companies doing open source multimedia consulting. So I was here last year, and quite a few things happened, mostly releases of FFmpeg, dav1d, VLC and others. The good thing is that last year I came and made some promises, and we actually delivered on them. Well, people did; I did nothing except make some slides. So FFmpeg 6.0 was tagged just after FOSDEM last year. It was quite a large release, and there was a lot of discussion about what it should be, because we are trying to move FFmpeg to a one-year release schedule. So what is a large release and what is not? I started doing stats, because for some reason no one did stats in the past, and seeing that one release gets 200 people around it is quite large. Not just people sending patches, but actually getting them merged, because the FFmpeg community still has issues merging all the patches that we receive. Major changes: the beginning of the work on multithreading, mostly on the muxer side at the beginning; RISC-V optimizations; hardware AV1 decoding; some work on FFTs; new APIs. Well, see my talk from last year, because that's mostly it. A lot of new codecs and filters. I did the presentation a few days before the release; the release happened, and it was a big success, I hope. So then the next major release was 6.1. It was a bit difficult to get out; we were quite late compared to our initial schedule, which would have been summer.
It was more autumn, like October, November. And this one was supposed to be a small release, not a major release, yet you still see that it's 150 contributors, and the number of lines changed is insane. Of course, the largest contributor is Anton; you should see his talk this afternoon. A ton of work on the multithreading of the FFmpeg CLI. It's not completely activated in 6.1, right? But still, all the commits went through, because Anton knows how to do small commits instead of big, major patch sets, which are easy to review. A lot of things happened on Vulkan decode, hardware acceleration, mostly the work from Lynne. Maybe it's the one API to rule all the new hardware APIs in the future, right? Yeah, okay, no. It's another one. But at least this one is supposed to work cross-platform. I pushed a lot on enhanced FLV and enhanced RTMP, which is basically extending RTMP and FLV for new codecs. So if you're not happy, blame me. I was deeply unhappy about all the new things that were supposed to replace RTMP, whether they're called RIST, or SRT, or Rush, or things like that. Oh, it's going to be great. Yes. But RTMP is here, RTMP is everywhere, RTMP is on devices. And "oh yeah, we're going to do a new standard in 10 years" was not really what I liked, also because that never happens, right? See the XKCD about that. So we're extending RTMP to support multi-track audio, multi-channel audio, and new codecs, so now you can carry AV1 and also HEVC over RTMP. Is that a good solution? No. Is it a pragmatic solution? Yes. New decoders like RivaTuner, but also vMix, which I quite care about, and quite a few patches for decoders that are coming afterwards, but they got into 6.1. And the beginning of the RISC-V optimizations. And for those who care, on Linux there is now an AV1 VAAPI encoder. 7.0. 7.0 is out soon™. It's a very large release, probably one of the largest.
EVC, for those who don't know, is the Samsung codec that was standardized by ISO, which supposedly has fewer patents than VVC. I say supposedly because of course that's not true, right? Because probably Sisvel will do another patent pool around it. The major part is VVC. It's mostly done by a few people, some in China, some around here, and it's probably one of the largest decoders that we've seen in FFmpeg in the last few years. Because, as you know, the AV1 decoder was done in dav1d, mostly for licensing reasons: we wanted to have an AV1 decoder under the MIT and BSD licenses, see the essay by Stallman about why that's okay. And now we have a VVC decoder, right? It's probably the largest piece of work that we've seen in FFmpeg since the HEVC decoder, and it went a lot better. It's still going to be marked as experimental, because it's not fuzzed enough, so we don't know exactly how secure it is. What's interesting is that it's around 18,000 lines of code. It supports almost the whole of VVC; there are a few features missing, so I'm not sure how many will be in 7.0. And it's reusing some of the assembly from HEVC, but also some assembly from dav1d, which is something I did not expect. But we'll talk about that next year, I guess, because we'll have a lot of VVC assembly going into FFmpeg this year. QOA, more RTMP, more AV1 work, and lately AVIF support is coming. I hope RealVideo will come, because there was a patch on the mailing list; I think it was forgotten. No one cares, but I just like it; for old guys like me, having RealVideo 6 would be cool. And I hope Lynne can finish xHE-AAC, else it will go to the next release. Stats. I did two sets of stats, compared to 6.0 and compared to 6.1. If you look at it for the major release, it's 180 contributors, 2,500 files touched, and more than 350,000 lines of code changed in one year. That is huge compared to what we did for FFmpeg 6; a good 50% increase.
Of course, half of it is from Anton. No, okay, maybe not, but if you've not seen the talk from Anton this afternoon, or the one from Anton at VDD, you have to, because basically it's much better for everyone, mostly people who are using the FFmpeg CLI directly. If you want to do an ABR ladder, multiple encodes, multiple protocols, and so on, you no longer need to write a new tool based on the APIs; you can do it directly with the CLI. Of course, a lot of cleanups and API changes, because it's a major release. A lot of thread safety work, because otherwise the multithreading work would not work. A lot of things on ARM assembly, mostly for HEVC, but also a few others, so good, better speeds. On the API changes, there are lots of new codecs and profiles because of the ones we added. Quite a few things about HDR metadata, IAMF headers and the related channel mapping changes. There is a new thing called Stream Groups, which we're going to use for IAMF, maybe for enhancement layers like LCEVC or other things like that, some Dolby profiles, 7 or 8, I don't remember, but some of those. Lots of discussion about side data, including the new packet side data, and things like Direct3D12, so we can have Direct3D12 acceleration. And of course, because it's a new major bump, a lot of deprecations, including the final YUVJ deprecation. Yay, we've only been talking about that since 2013. And multithreading, see the talk, you have to. So that's mostly it for FFmpeg. I'm now going to speak a bit about dav1d. A lot of things happened on dav1d in the last year. We had quite a few releases. They look small; they're not. There was a ton of work in February, in May, in September. But what's interesting is that we did basically all the optimization for everything that you care about today. All the NEON is done, all the SSSE3 is done, 32 bits, 64 bits. AVX2 is done.
We finished all the intra tools, Z1, Z2, Z3 — really the stuff that, except when you care about still images, is very small in terms of runtime. But all of that is done, right? So for normal people, the work on dav1d is done. Well, I'm not sure we are normal people. So now there are things happening on AVX-512, mostly by Henrik, right? And the good thing is — contrary to what people have been saying for a long time, that you cannot use AVX-512 to be faster than AVX2 because of the issues with TDP and clock throttling — it's actually faster, right? In many cases. And also because now we have other chips, which are not made by Intel and are quite competent, you can have AVX-512 without slowing down the whole CPU, right? So I think AVX-512 is mostly done; we will not match the full coverage of AVX2 with AVX-512, because in some places it's not worth it. But this is some of what's happening in the next release, which is happening next week. Martin, maybe. There is RISC-V, which was started by Nathan. So we started the RISC-V port; mostly the inverse transforms were done. Hopefully more people will help this year. And from nowhere, from China, some people arrived with LoongArch support and they did a ton of things, right? A lot of the inverse transforms, some loop filter, some loop restoration, and MSAC, right? So that's quite useful, but still a bit more niche than the normal mainline users. Interesting things were done on reducing memory usage, because some people — I think Meta — complained about that. And it was just like, oh, okay, yeah. One particularity of dav1d is the way we do the frame threading, which is why dav1d is so fast. But one of the downsides is that it can use a bit more memory. So we looked at that and did some fixes for it.
The next release — I don't know exactly when it's out, because there are some security issues, integer overflows, that I think are exploitable. So I need to discuss with the Chromium people to be sure they know before I do the release. But that's mostly it on dav1d. We are looking at dav1d hybrid decoding on the GPU, but so far I don't have much to say about that. A bit about VLC. We did quite a number of minor releases of VLC this year, mostly on 3.0 — lots of security issues. The last release was 3.0.20, three months ago. And we've had a large number of downloads: 150 million downloads in three months, which is around 50 million per month. So that's very steady. And you know that I care about download numbers on a single release, because it helps estimate the install base of VLC. The good thing is that we are soon going to beat Firefox in terms of users — not because we're getting bigger, but because... But yes, what's interesting is that the number of downloads of VLC is actually increasing. In five months we always used to get around 220 million; now we're seeing it's getting a bit more. Usually in three months we had 120 million — that's what we had two years ago. So we're at least getting bigger. A lot of those users are of course on Windows and macOS. What we're seeing is that the number of users on macOS is increasing, which is worrying for me. But VLC 4 — a lot of work is happening on the clock. We have a lot of difficulty stabilizing the new clock, which is one of the big pieces of work in VLC 4. And the cool stuff we've been doing that is finally out: VLC on Unity and VLC on Unreal, so that you can use open source tools directly to output video, and real-time video, inside 3D engines. And of course we did some work on VLC in the web browser, because it's actually working now. But most of the things that happened this year on VLC are related to the Android and iOS versions.
I don't talk often about those, because usually I don't have time — so this time I will. We improved Android Auto quite a bit, which is different from Android Automotive. Well, it's Google, right? They find a great name, then they fuck it up. So Android Auto is for your normal car: you can basically play something that is on the phone. We have had major work on Android Auto, so the app is actually usable, not just something done by a few nerds. And at the same time, we got Apple CarPlay — which is not Apple Car, because that's for 2028. So actually now people are using it, because it's usable. Most people use it for music, of course, and not really for video, because you shouldn't watch video while you drive. Some people are laughing, but you know that bigger cars now actually have screens in the back for the kids to watch directly. Anyway, versions 3.5 and 3.6 of VLC for Android got a big jump on foldables, right? Because we're back in the '90s: now you can have flip phones that you can open. Quite popular in the US, weirdly, when we look at the stats. No one else cares. Support for Android 12, 13, 14 — because, well, they need to justify new things. And of course it's breaking the UI and breaking the permissions for absolutely no gain for the users. But mostly, we back-ported — or forward-ported, I don't know what you call that — the web server feature that we have on iOS, which is extremely popular, to Android. So you can basically upload files directly through a web browser, because MTP over USB is now completely broken on modern Android versions — they decided that, yeah, you can't use that anymore. On iOS, a lot of things that were already in the Android version came over — the other way around, we tried to match them: playback history, and everything like the network library features, right?
So you can use your Plex or DLNA server, or SMB server, and still have continuity, history, and so on, right? External subtitles. And for some reason, Felix did CDG support. Where is Felix? Why? People asked you to add CDG karaoke? Who are they? Why? Okay. But the last interesting thing is that we now have support for visionOS. So if you have 3,000-something euros, you can buy one, right? And it seems that Apple has no idea why we would have support for visionOS — the SDK is completely broken, nothing works — but you can run VLC on it and watch your favorite movies directly on visionOS. Yeah. Now I'm just going to speak a bit more about the community. We did a great VDD in Dublin, thanks to Anil and Vibhuti. We did, as usual, crazy stuff at night in Dublin. People thought we were crazy. We are. But it was quite a good VDD. A lot of VLC and FFmpeg folks were there, so that was pretty cool. It's important, because our communities are sometimes a bit difficult. So on the VideoLAN side, we organized VDD 2023. It was important because we hadn't done enough VDDs because of COVID. And so we did some elections. We have a lot of things that we're going to change in the non-profit, mostly on the infrastructure side. We need to buy new servers — we run most of our infrastructure ourselves, and our newest servers are now 10 years old. We did an NAB booth, which was completely insane, with our big Julien. It was quite fun. On the other side, in the FFmpeg organization, there were lots of discussions about community management. One of the reasons is that when we decided on the General Assembly elections, the way we would update the list of members was not specified precisely enough, so there was a lot of discussion. The problem is: how do you re-bootstrap based on something that was not done correctly in the first place? We should have used Lydia earlier to have a proper organization. But anyway, this got fixed this year.
So we now have a good General Assembly, and we managed to elect the TC and the CC — the technical committee and the community committee — so that we will be able to settle our discussions, or at least decide on them. We've also been doing FFmpeg technical meetings: the last one was at VDD, we did one in June in Barcelona — or was it the year before? I don't remember — and one at FOSDEM, right? So we're trying to do on the FFmpeg side what we've been doing on VLC, which helped a lot. The last part — and that's for a lot of people watching, not so much the people in the room — is that the FFmpeg community needs more support, more corporate support, more money, right? It's now a core infrastructure project, and it's one of the only ones not supported by the Linux Foundation and the CNCF and all those people who have a lot of money. The only two companies really supporting it are YouTube and Meta. Otherwise it's very difficult to get a single cent, because some of the big GPU and chip providers plead poverty when I ask them. I would suggest, if you have time, look at the talk from Kieran at Demuxed, where all those issues are explained — we really need help on all those things. And I think that's it. Thank you, everyone. And because for once I'm not rushing, I even have time for questions. No questions? Yeah? What about the LTS? So in theory — okay, so the question was — someone asked Anton the question before, and I think Anton skipped the answer. "Yeah, I didn't answer the LTS part because I forgot." Yeah, yeah. You forgot, or you did not like the question? If we follow the plan, 7.1, which will be at the end of this year, will be LTS. That's at least supposedly the plan. Are we going to match our plan? Yeah, I think so. The plan is to have 7.1 as the LTS. Yeah, there we go. No, 2027, that's a target. Other questions? Yes? Yeah, so Unity is a piece-of-shit company.
They are using open source tooling, and at the beginning they were completely based on the work on C# on Linux, right? So it's basically a C# shop. They use a ton of open source libraries, including LGPL and GPL libraries, in their tooling and so on. But if you do extensions on their store — like what we're doing — they now refuse open source, and not just GPL or GPLv3 like Apple or Microsoft: even LGPL is completely off, right?
FFmpeg VVC Decoder
Good afternoon all. I'm here to talk about the VVC decoder in FFmpeg. I'm going to introduce VVC — I imagine if you're in this room you're already somewhat familiar, or at least interested, but I'll refresh some of the coding tools and some of the objectives that it has — and talk about where FFVVC, the FFmpeg VVC decoder, fits into that. Then, what new coding tools VVC introduces; a bit about the threading model, which is one of the more interesting things for those of you who already have some experience with FFmpeg; then performance, and how that compares to previous codecs and to the other VVC decoders out there. I'll conclude by talking a little about the Google Summer of Code program this summer and the next steps for FFVVC. First of all, a disclaimer: I did not write very much of this code at all. The credit should go to Nuo Mi in China, who unfortunately couldn't be here today. Who am I? I am Frank Plowman. You can find me at frankplowman.com; there are various other contact details on there. I was one of the Google Summer of Code students working on this project this summer. As you saw in the agenda, we'll talk a bit more about what that involved later. Going into the introduction, then. VVC — or H.266, not H.265, that should read — is a new standard from the JVET. It's succeeding H.264 and HEVC, so quite big boots to fill. It's got two main objectives. It aims for 50% lower bit rates than HEVC at the same quality of video. And as the name suggests, versatility is the other main objective. That involves a lot of new coding tools for things like screen content coding, adaptive resolution change for things like video teleconferencing, and independent sub-pictures. Versatile applications underlie a lot of the decisions made in the design of VVC. The open source landscape of VVC: for encoders, you have VTM, which is the reference software — you're not really going to want to use that for practical encoding.
You have VVenC, which is developed by the Fraunhofer institute. That is a practical encoder, very fast. Finally, you have uvg266, which is an open source project developed by the community. Then on the decoder side, you again have VTM. You have the dual of VVenC, VVdeC — I believe there's a lightning talk on that in a little while — which is a very fast, very good decoder, also developed by the Fraunhofer institute. You have OpenVVC, which is a community-project VVC decoder, relatively performant on a single core. Unfortunately, that has now been abandoned — I don't think there's been a commit in about two years. Finally, we have what this talk is introducing: FFVVC. The state of FFVVC: the C code was merged at the start of the year — I believe it was exactly a month ago now. As Jean-Baptiste said in his talk a little while ago, we believe it will be in FFmpeg 7.0, but possibly under some sort of experimental flag. The inter-prediction assembly was merged about a week ago. We have some other assembly that has been written and is in the review process. It's important to note, though, that FFVVC is not yet feature complete. There are some coding tools missing. The big one we've heard about from the community is intra block copy, which is not yet implemented. There is a patch set for that in the works; I'd be doubtful it will make the 7.0 release, though. Most of the other missing features are a bit more exotic than intra block copy: things such as wraparound for 360-degree videos, independent sub-pictures, reference picture resampling — some of the more exotic stuff — but that will all come in time. This slide shows the assembly status: what has been written so far, what we're prioritizing, and what we've been able to reuse from HEVC. The things we've prioritized so far are largely low-hanging fruit.
For inter-prediction we were able to reuse quite a lot from HEVC, for good gains. SAO is entirely identical between HEVC and VVC, so we've been able to lift that directly. Inter-prediction and ALF are both big contributors to the decode time in C-only builds, so they're high priority. One of the GSoC projects last year was working on the ALF stuff — we'll talk about that a bit more — so that's on its way. For inter we've managed to take some bits out of dav1d for the more generic stuff, like averaging functions. That's been effective in getting a quick speed-up there, but we need your help with this: there aren't many of us working on it at the moment, and there's a lot of assembly to write. That's going to be key to performance, as we'll see in the performance section later on. Decoder size. I believe it's now the biggest decoder in FFmpeg in terms of lines of C. I'm not sure how it compares to dav1d, but even being the biggest decoder in FFmpeg, it's still much smaller than OpenVVC and VVdeC, as you can see here. How did we manage that? By being in FFmpeg, basically: we're able to reuse parts from previous codecs. We're able to use the CBS bitstream reader, as you can see there, and reuse whole swathes of code — also parts of the binary, so it's kind of hard to measure, but you get more bang for your buck in terms of the size of a compiled, delivered codec. In the future, I believe we may also be able to use some aspects of the hardware decoder APIs to do the DPB reference management. So we managed to be much, much smaller, and that's one of the main reasons motivating putting this inside FFmpeg — the other being FFmpeg's vibrant community, we can say, which hopefully will help maintain this into the future. Moving on to what's new in VVC: there's a dizzying amount of new coding tools. As you can see here, you could talk for an hour — and many people have — about even a subset of these.
As I say, we haven't implemented them all yet, but there's loads to play with, which gives the ability to make much smaller bit streams and more versatile video content. What FFVVC introduces that's new for FFmpeg is a stage-based threading model. Lots of previous codecs have the frame and slice threading models, which do well for low numbers of cores but hit a ceiling at a certain point. FFVVC uses a much more fine-grained threading model, which allocates threads based on the decoding stage of individual CTUs, and, as that says, it means we're able to utilize higher core counts much better. With our C code, with no assembly, we're able to decode 4K at over 30 fps on a relatively high-end desktop processor — I think that's really impressive. This threading model would be possible to implement for HEVC, though FFmpeg's HEVC decoder does not use it; I think it's also possible to do stage-based decoding in AV1, but it wasn't a factor in the design of dav1d. The way it works is you divide the decoding of each CTU into several stages — they're all listed there — and the key thing is that each stage depends only on the current or previous stage of the neighbouring CTUs. So you can start the deblock of one CTU before you've done the parse even in the, like, top-left corner very far away — sorry, before you've done the intra; I think you do have to do the parse for everything first. The effect you get is a sort of wavefront of each of the different stages progressing across the image, and it allows you to use many more cores. To allocate those cores, we've had to introduce a new AVExecutor utility, which has been made available in libavutil, so it can be used by other parts of FFmpeg. It's a really simple algorithm at the moment, but centralizing the control of thread allocation — you know, not repeating yourself — means we now have one location where we can make improvements.
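The dependency rule described above — a CTU stage may run once the same stage has completed in its left and top neighbours — can be sketched in a few lines of C. This is a toy illustration, not FFVVC's actual scheduler; the names and the 2-D progress array are hypothetical, and the neighbour check is deliberately conservative (real decoders track per-stage dependencies more precisely).

```c
#include <assert.h>
#include <stdbool.h>

#define W 4  /* CTUs per row (illustrative grid size) */
#define H 3  /* CTU rows */

/* progress[y][x] = number of stages CTU (x,y) has fully completed.
 * CTU (x,y) may start stage s once it has finished stage s-1 itself
 * (progress == s) and its left and top neighbours have completed
 * stage s (progress > s).  Applying this rule repeatedly yields the
 * wavefront of stages sweeping across the picture. */
static bool can_start_stage(int progress[H][W], int x, int y, int s)
{
    if (progress[y][x] != s)                /* own stages 0..s-1 not done */
        return false;
    if (x > 0 && progress[y][x - 1] <= s)   /* left neighbour not past s */
        return false;
    if (y > 0 && progress[y - 1][x] <= s)   /* top neighbour not past s */
        return false;
    return true;
}
```

With this rule, the top-left CTU can advance to later stages while CTUs further down-right are still parsing, which is exactly the behaviour the talk describes.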
It's a really simple algorithm — based, I think, on some of the early executor implementations, Python's and Java's executor structures or whatever they call them — but the point is having that one thing in one location that can be used throughout FFmpeg to improve multi-threading. On to the performance section. At the moment it's pretty slow compared to previous codecs. This is to be expected to a certain extent: VVC is just a more complex codec than previous-generation stuff — it has to be in order to achieve higher compression. The SIMD column here, false and true, is for FFVVC, and this is with stuff that's not yet in FFmpeg master — it's the current state of the development staging repo. You can see we are already getting over a doubling of speed for FFVVC, but there's a long way to go, as you can see from dav1d's really impressive assembly speed-up there. Our multi-threading picture, though, is quite different. That shows you the effect of the stage-based multi-threading: we're just much more easily able to use higher numbers of cores. Note that this is using hyper-threading, which is why you've got quite the knee there at six threads; below six threads it's really not far off the ideal — you add a core, you get the same multiplicative increase in speed-up. Comparing it to VVdeC, then: VVdeC uses the same stage-based threading model, so you get very similar performance between FFVVC and VVdeC. OpenVVC uses the conventional frame- and tile-based multi-threading techniques, so the figure on the left-hand side is quite useful for comparing the effect of the new threading model. And on the right-hand side, the single-threaded, C-only performance of FFVVC and VVdeC is pretty much on par.
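The "one central executor" idea can be illustrated with a miniature task queue. This is a hypothetical sketch in the spirit of the AVExecutor utility the talk describes — callers submit prioritized tasks and a single component decides what runs next — not libavutil's actual API; the real executor dispatches to worker threads, while this sketch runs tasks inline to stay short.

```c
#include <assert.h>
#include <stddef.h>

/* A task: a callback, its argument, and a scheduling priority. */
typedef struct Task {
    void (*run)(void *arg);
    void *arg;
    int priority;            /* higher runs first */
    struct Task *next;
} Task;

/* The executor keeps a priority-ordered singly linked list of tasks. */
typedef struct Executor {
    Task *head;
} Executor;

/* Insert a task so the list stays sorted by descending priority. */
static void executor_submit(Executor *ex, Task *t)
{
    Task **p = &ex->head;
    while (*p && (*p)->priority >= t->priority)
        p = &(*p)->next;
    t->next = *p;
    *p = t;
}

/* Run all queued tasks in priority order; returns how many ran.
 * (A real executor would hand these to a pool of worker threads.) */
static int executor_drain(Executor *ex)
{
    int n = 0;
    while (ex->head) {
        Task *t = ex->head;
        ex->head = t->next;
        t->run(t->arg);
        n++;
    }
    return n;
}

/* Demo helper: records the order in which tasks were executed. */
static int run_order[8], run_count;
static void record(void *arg) { run_order[run_count++] = *(int *)arg; }
```

The benefit named in the talk is exactly this centralization: scheduling policy lives in one place, so improving it (priorities, work stealing, thread counts) improves every decoder that uses it.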
VVdeC has quite significantly different performance on different operating systems, but the average between the two is pretty much the same. On 4K it's a similar picture, just more pronounced: OpenVVC is slower, and the speed-up we get from using more threads matters even more for larger videos — you can see that effect here. But we're still lacking on the assembly front. VVdeC already has a lot of assembly, for quite a few different architectures, and you can see it really pulling ahead once you enable the assembly. Theoretically, the FFmpeg VVC decoder should have a somewhat higher ceiling, because FFVVC's assembly is handwritten, whereas VVdeC's uses intrinsics — and on some architectures SIMD Everywhere, a portable SIMD library, which introduces some overhead. So with enough time, hopefully FFVVC can be even faster, but we've got a long way to go to catch up at the moment. Wrapping up with the last couple of things: the Google Summer of Code program in 2023. There were two Google Summer of Code students contributing to the VVC decoder this summer: myself and Sean Liu. I worked on a lot of the stuff that was added in version 2 of VVC — that includes the support for 12 and 14 bit, which needs the range extension, which changes various things in the entropy coder at higher bit depths. I've also been working on AVX2 optimizations for the inverse transforms; they all had to be written from scratch in the end — there's not very much you can share between HEVC and VVC, due to the way the HEVC transforms are written in FFmpeg. And Sean Liu has been working on assembly for the filters, some of which is in the process of being upstreamed at the moment, I believe.
So, next steps. As I'm sure the performance figures and what we've been working on have shown, we've got a very solid baseline with the C performance and the multi-threading, but we need lots more assembly to be able to compete with existing decoders. That means upstreaming what we've already got, implementing more functions in assembly, and also covering more architectures: Arm is going to be a Google Summer of Code project this summer, potentially also RISC-V — there's a lot of work on RISC-V assembly in FFmpeg at the moment, so we'll want that in time. Then polishing off the remaining conformance — implementing those missing features I mentioned earlier, particularly intra block copy, which is a high priority. Thread optimization for 32-plus cores: we may be able to improve the executor utility for higher core counts, if there's sufficient demand. And a GPU-based decoder: a lot of VVC is really well designed, particularly the separation of stages we saw earlier, which means it's well suited to decoding on the GPU — so that's something on the far horizon. Concluding: FFmpeg now has a VVC decoder. I've introduced the new threading model and shown some of its benefits, talked about the C and multi-threading performance and how that compares with VVdeC, and given an update on the status, including the optimized assembly we're currently working on. We'd love help with this, especially with the assembly — there are just very few of us, working only in our free time, so progress on that front has been relatively slow. So yeah, patches welcome. Thank you very much for listening.
If anyone's got any questions, I'll be happy to try to answer them as best I can. As I said in the disclaimer, I did not write very much of this code — I just did the bits I've talked about, and I've worked on bug fixes. One thing I forgot to mention: part of why we're going to have to be experimental is OSS-Fuzz. We've only recently started being fuzzed, since we went into FFmpeg master, so we're getting a lot of reports at the moment that we're trying to work through before we go into a normal release. But I'll try to answer any questions as best I can. Yes? So the question was: have we considered trying to use C intrinsics — as a step between fully C code and handwritten assembly for everything? It's not the FFmpeg way; in FFmpeg, everything is handwritten assembly. I think there's a little bit in libswscale, I believe, but FFmpeg is in the process of removing that tiny bit of C intrinsics that we still have. So we're probably not going to do that, just because you can go faster with handwritten assembly — and if we're trying to match and even beat VVdeC, I think it's the only way to go, really. Okay, there are no more questions — thank you very much.
S2S: PeerTube instance dedicated to Sign Language
Hello, welcome to Brussels. We are glad to be at FOSDEM with you, and we wanted to talk to you about our project, which is called S2S. The aim is to tell everyone who has signed videos that we can put them all in the same place. There are many very good videos on the internet that are signed correctly, and we would like to get a copy of all of them in one place. We use PeerTube to do so, and if you go to our website you will only find videos that are available in sign language or with subtitles made specifically for deaf people. You know these subtitles: with colors, for example, and where things that happen in the movie but that no one speaks are written down — for example, a door closes and makes some noise off camera; that can be written, so we like those videos too. The best videos are those with both sign language and subtitles, because then everyone can understand everything. This project has no need of funding. It is a very cheap project, because it's just hosting a website, and we simply use the PeerTube software to host videos — it's about 100 euros per year to run, so we really don't need money. What we need is people who do the same thing in other countries. We would be glad if someone did the same in Germany, because we could federate our PeerTube instances and have much more content available. And we also need many people to know about this project and post more videos, so we can collect many, many more. Do you have any questions about our project? Can you show the next slide? You can go to the slides — there's one right there; Chrome is already open. Sorry. Okay. Thank you very much.
5G-MAG Reference Tools: Bringing 5G Media to Life
All right, let's go. So, who of you knows anything about 5G-MAG? Can you raise your hand? 5G-MAG? OK, that's why we are here, right? So let's start with who we are. 5G-MAG is an international not-for-profit cross-industry association. What we do is basically apply global, internet-based, 5G-based access technologies and global APIs to media — to multimedia applications in the domains of, for instance, production, contribution, and news gathering, but also streaming, broadcast, and new media like XR. Applying all these technologies means taking care not only of media, but also of network capabilities, network features, transport protocols suitable for this, and then architectures for streaming, for CDNs, for broadcast, multicast, network assistance, satellites — non-terrestrial networks — and so on. We decided to launch a development program not just to talk about technology, but to actually build stuff, right? We have established a community of developers that is sponsored by 5G-MAG but is open to anybody. And what we do, together with these companies and our non-5G-MAG members, is develop reference implementations of all the technologies I've been describing — for validating standards, for building demos, for testing and experimenting. And that's, obviously, for everybody: network operators, service providers, broadcasters, and so on. What we do in the reference tools development program is actually build this whole series of technologies. You can go to the website and you will find more information there. In particular, for instance, we have our own set of CDN nodes. We can get metrics in terms of quality of experience and consumption reporting — who is consuming my player, my OTT player. That's information useful for service providers, broadcasters, and so on.
We have also developed our own end-to-end system for something called 5G Broadcast, which lets you broadcast radio or TV to the OTT app on your phone, based not on the internet but on broadcasting. And we are currently onboarding a new project for AR, MR — let's say XR — applications, and we have started developing a series of tools where you can actually do this: you can create content for XR devices and put, for instance, your TV channel in front of users on such a display. This technology works: we have been at IBC, for instance, showcasing 5G Media Streaming, which I explained, and also 5G Broadcast. If you wish to participate, this is fully open to everybody with an interest in these technologies for production, contribution, and distribution. We have a GitHub repository where you can find all the projects, with documentation; at the moment there are 30-plus repos with different technologies. We accept code under the license terms contributors feel comfortable with — that means we have repos under OSI licenses, but also under other kinds of licenses; please check if you would like to contribute. Anybody is welcome to participate: there's a Slack channel, we have public calls every Friday for developers, for academia, and for the industry in general, and we also have a Google group with announcements, releases, release candidates, testing periods, and so on. You can find all the information at 5g-mag.com/community. So if you have an interest in 5G media production, uplink video, streaming, 5G Broadcast, multicast, beyond-2D volumetric video, et cetera, please have a look at 5G-MAG. Thank you. Any questions?
dublang, a multi-language live coding system
Hello. Thank you, first, to the organization for accepting my talk, and thank you for coming. In this talk I will present software that I have been developing for, I think, two years now, named dublang. It's a multi-language live coding system. I will do a very short presentation here, with a small video demonstration. First, a bit about my profile. I work as a research software engineer in a project named the Cortex Platform, in France, at the Université Gustave Eiffel. I am also a collaborator of the Software Heritage project as an ambassador, and a Debian contributor. And as a hobby, I am a live coder and visual artist — I am very interested in live coding to produce sound and video, and that's why I created this tool, to support my interest in the subject. First, the name of the project — I think it's important to mention where the inspiration comes from. The dublang name is inspired by the musical style dub: dub consists of remixes of existing music, and dublang consists of remixes of existing software. One of the goals of the dublang tool is to have a single text-based live coding interface to manipulate and use multiple different tools in the same source code, in the same session. Then, how it is designed: the dublang system has a client-server architecture. On the client side I am using the Neovim text editor, because I found it very easy to extend using the Lua language, which is a really nice scripting language that fits the purpose of this tool very well — as the purpose is to mix different tools in the same environment, a scripting language like Lua works very well. On the other side I have the servers, which are managed as systemd services. Here is an example of what dublang source code looks like.
Here is an example where I have two different languages: a region marked with a hashtag and an exclamation mark defines a region for a specific language, so I can have, in the same source code, different regions with different programming languages. For each language I have to implement an extension inside the dublang system through a plugin: the architecture is pluggable, and I can create new plugins to integrate new languages or new tools. Let's see if I can play this video here as an example. I hope the sound works. It doesn't? I don't have sound yet at this moment... You can try plugging this into the audio output; if you're lucky, you might get sound. Oops, what happened? I clicked the wrong button, sorry. It won't full-screen apparently, but there you are. Let me go back one or two seconds. Here I have more or less the same code I showed previously. Where is the sound? I lost the sound. Ah, here it is. When I evaluate this, it's executed by the SuperCollider server. And then, in the same source code, I'm going to add some... I think I have run out of time, so just to finish: this "bam bam" part is executed by the Tidal Cycles language. So there are two different servers, and the client sends each region to the proper server. Sorry for extending my time. Thank you for your attention. No time for questions, I suppose. Thank you.
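As a rough illustration of the idea (a minimal sketch; the exact "#!name" region-marker syntax here is my assumption based on the talk's description of hashtag-plus-exclamation headers, not dublang's real grammar), a client could split such a source file into per-language regions before dispatching each one to its server:

```python
# Minimal sketch: split a dublang-style source file into per-language regions.
# The "#!name" region-marker syntax is an assumption based on the talk's
# description; the real dublang syntax may differ.

def split_regions(source: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs, one per '#!' region header."""
    regions = []
    lang, lines = None, []
    for line in source.splitlines():
        if line.startswith("#!"):          # start of a new language region
            if lang is not None:
                regions.append((lang, "\n".join(lines).strip()))
            lang, lines = line[2:].strip(), []
        elif lang is not None:
            lines.append(line)
    if lang is not None:
        regions.append((lang, "\n".join(lines).strip()))
    return regions

example = """\
#!supercollider
{ SinOsc.ar(440) }.play
#!tidal
d1 $ sound "bd sn"
"""

for lang, code in split_regions(example):
    print(lang, "->", code)
```

Each (language, code) pair would then be routed to the server that handles that language, which is essentially the dispatching role the talk describes for the Neovim client.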
VVdeC<>Arm: Optimizing an open source VVC decoder for Arm architectures
Okay, so yeah, I'm the, uh, Fraunhofer guy that asked the intrinsics question. My name is Flo, nice to meet you. Last year my colleague Adam and I had a talk here at FOSDEM about the open source VVC decoder VVdeC and the encoder VVenC, and I optimized the decoder for Arm architectures in my master's thesis, which I will talk about now. Basically, that's this on the right: you can see the SIMD optimization of VVdeC. VVdeC is optimized for SSE 4.1 and AVX2; no SSE 4.2 was needed, so 4.1 was enough. To also be able to run VVdeC on Arm architectures, the open source project SIMD Everywhere is used, which in this case ports the SSE implementation to Arm, either by using built-in functions or, because it's Arm, NEON intrinsics; it can also fall back to scalar implementations and tell the compiler to vectorize them automatically. So, a combination of these. My goal was to make it faster on Arm. For that, the first thing I did was identify the hotspots. I profiled VVdeC using Instruments, since I was using this M1 machine here. I divided the profiling into three steps. First, I identified the most time-consuming functions. With these, I checked the performance on Arm versus the performance on x86. And the third part: since VVdeC implements every SIMD function as a non-vectorized version as well, I wanted to know how much speed-up the SIMD implementation generates. With all this information I chose the four most promising functions, which basically means I wanted to get the biggest bang for the buck, and I chose to optimize these four functions. On the left you can see the names of these four functions; don't mind the names, the only thing that is interesting is the speed-up.
This graphic shows the manual optimization, so the optimization I did, versus the automated optimization from SIMD Everywhere. I visualized this for one of the JVET video sequences at a quantization parameter of 43. You can definitely see that two functions show a really nice acceleration compared to the SIMD Everywhere implementation, in this case the applyLut SIMD function and the xGetSAD function. But generally speaking, you can see that SIMD Everywhere does a decent job in comparison to optimizing by hand with intrinsics. After having a look at the single-function accelerations, I also wanted to know the impact of optimizing these four functions on the total acceleration of VVdeC. So I measured 11 JVET video sequences, two times each, obviously, since I need to compare them, and averaged that for the common quantization parameters. The range is between 3% and 9%. What is definitely noticeable is that with decreasing quantization parameter the speed-up gets lower. This is because the bit rate is higher at lower quantization parameters, and because of that the entropy decoding gets more complex and takes a bigger piece of the cake. So that was basically my master's thesis in a nutshell. After that, I also used SIMD Everywhere to port the AVX2 implementation to Arm, which also led to a contribution to SIMD Everywhere, since there were some errors in the ported code, which was pretty nice. And right now, since there is also an encoder, I'm repeating the optimization for VVenC.
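To give a sense of what a hotspot like the SAD function computes (a sketch for illustration only, not the VVdeC code; the sample blocks and the 8-lane chunking are my own), the sum of absolute differences between two blocks of samples, which SIMD units accelerate by processing many lanes per instruction, looks like this:

```python
# Illustrative sketch of a sum-of-absolute-differences (SAD) hotspot,
# NOT the actual VVdeC implementation. SAD measures how similar two
# blocks of pixel samples are; codecs call it constantly.

def sad_scalar(a: list[int], b: list[int]) -> int:
    """Plain scalar loop: one absolute difference per iteration."""
    total = 0
    for x, y in zip(a, b):
        total += abs(x - y)
    return total

def sad_lanes(a: list[int], b: list[int], lanes: int = 8) -> int:
    """Mimics a SIMD register: process `lanes` samples per 'instruction'.
    A 128-bit NEON register holds 8 16-bit samples, so each vector op
    does the work of 8 scalar iterations."""
    total = 0
    for i in range(0, len(a), lanes):
        chunk = [abs(x - y) for x, y in zip(a[i:i + lanes], b[i:i + lanes])]
        total += sum(chunk)
    return total

block_a = [10, 20, 30, 40, 50, 60, 70, 80]
block_b = [12, 18, 33, 40, 45, 66, 70, 79]
print(sad_scalar(block_a, block_b))  # same result either way
```

The point of the lane-chunked version is only to make the data-parallel structure visible: on real hardware, each chunk would be one vector instruction instead of a Python loop body.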
And in the future, we might also optimize directly for the Scalable Vector Extension, or the Scalable Matrix Extension. So yeah, thank you for joining. If you have any questions, feel free to ask; you can also ask me at the, I don't know, post-FOSDEM drink-up. I have one: does it translate across all the speed presets, when you do the encoding, the decoding improvements? The presets? Sorry. When you do the encoding, you have different presets, right? So after the encoding, when you decode, does it translate across all the presets? Because every preset may not use all the tools. Yeah, that's true. So the question was: there are different presets in the encoding, which affect the functions called in the decoding. This is true. I tried to get a general overview of which functions were used, by profiling several settings, and tried to figure out which functions were used most, and averaged that, basically. There's a bigger story behind the profiling, obviously, since this was only a five-minute talk. Does this mean that I can use a Raspberry Pi now to decode it? Have you tried ARM devices? Okay, so the question is: can I use a Raspberry Pi to decode it? The Raspberry Pi is based on ARM, right? And I would say yes, obviously you can, because you could do it before as well: SIMD Everywhere was included, and SIMD Everywhere ports the SSE implementation to ARM. It doesn't do it perfectly, obviously, but some colleagues of mine actually submitted a paper to the Mile-High Video conference.
I can probably even put it up on the FOSDEM site, maybe, if you want to see it; they measure the performance of SIMD Everywhere on ARM there. As for which platforms are supported, the supported platforms are visible on the GitHub repository. This is also on the FOSDEM website: when you go to my talk, the VVdeC repository is linked, and there you can see it. All right, there's another question: why not write the SIMD by hand instead of relying on the ported implementations, for performance? I mean, obviously we are still the best, right? We are still the best when it comes to decoding and encoding. VVdeC and VVenC are performing pretty well in comparison to other VVC coders, I would say. Yeah, that's obviously true; of course we have a head start, but let's see, right? I think there's nothing better than a healthy competition. Yeah.
Cosma, a visualization tool for network synthesis
Okay, good morning everyone. My name is Arthur, I'm an assistant professor in Lyon, France, and I'm here today to acquaint you with a little program called Cosma. I'm going to present the design choices behind it, touching on mostly two points; it's going to be a short presentation. First, the architecture of the program, which may interest you if you're working on interactive publications. And second, the features, which may interest you if you're a scientist or working with scientific data and you have information management needs. So that should be every scientist. I'm presenting on behalf of the team: first and foremost my developer colleague, Guillaume, who is not here because he's on hiatus for a very happy family reason, and also my senior research colleagues, who have a lot of knowledge and have advised us on the design of the application since the beginning. Cosma came about as part of a research program on Paul Otlet. I'm very happy to be mentioning Otlet here, because he was born and died in Brussels; he was a famous Belgian figure, a pioneer of knowledge organization, recognized today as a precursor to information science. He was a pacifist, an internationalist, a feminist. He also had some flaws: he was a utopian, and he sometimes had a bit dated views on some topics, but he's a very interesting figure. He's the one who popularized the word documentation, so there's that. Otlet's main idea was to go beyond the book. What he wanted to do was extract all the facts from publications and organize them into a universal encyclopedia. The idea was that universal access to knowledge would bring peace; that was the utopia. He worked all his life on tools to achieve this, including bibliographies, classification schemes, index cards, and so on. There's a museum dedicated to him in Belgium, in Mons, so if you're in Belgium for a few days, I encourage you to go and visit it.
In 2018-2019, we worked on a map of Otlet's professional network. It was our take on an idea that had been done before, which is to combine a graph view with a card view, a little index card with metadata about the node that you're currently selecting in the graph. And one day I asked Guillaume: can you make that for my research notes? Because at the time I was accumulating files that looked a bit like this: a bunch of plain text files with notes on specific things. These aren't actually my notes; I just borrowed Andy Matuschak's notes for this presentation. Andy Matuschak is a researcher working on tools for thought, non-linear writing, etc. The idea is that you have files which reference each other with internal links, just like in a wiki: double brackets around the title, or around an identifier if you prefer using identifiers. So what Guillaume made was a prototype, which eventually became Cosma, that renders these files into an HTML file. Yet another graph application, after all the graph applications we've seen, like in the previous presentation about Gephi Lite. So this is an HTML file which contains a graph view, a rendering of each file in HTML, and also a few navigational tools: an index, a small search box, etc. This could be anything, any kind of knowledge base. It could be a glossary of terms; it could be a network of people, of concepts, of events; it really doesn't matter. It's like a commonplace book, or a wiki, or a Zettelkasten, if you're familiar with that word; even a mind map, to some extent. Conceptually it's a bit like that. What distinguishes Cosma is the architecture and the fact that we designed it around scientific writing needs. So I'm going to describe the architecture point briefly, and then I'll describe the features a little more. It's purely a visualization program: you cannot edit data with it, it just reads plain text files.
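To give a concrete feel for the double-bracket linking just described (a minimal sketch under my own assumptions; this is not Cosma's actual parser, and the note titles are invented), extracting wiki-style links from a set of notes and building a backlink index could look like this:

```python
import re

# Minimal sketch of wiki-style [[link]] extraction and a backlink index,
# for illustration only -- this is not Cosma's code.

LINK = re.compile(r"\[\[([^\]]+)\]\]")

notes = {
    "evergreen notes": "Evergreen notes should be [[atomic]] and densely linked.",
    "atomic": "One idea per note.",
    "zettelkasten": "An [[atomic]] note system popularized by Luhmann.",
}

def backlinks(notes: dict[str, str]) -> dict[str, list[str]]:
    """Map each note title to the titles of notes that link to it."""
    index: dict[str, list[str]] = {title: [] for title in notes}
    for title, body in notes.items():
        for target in LINK.findall(body):
            if target in index:
                index[target].append(title)
    return index

print(backlinks(notes)["atomic"])  # notes citing "atomic"
```

The contextualized backlinks described later in the talk would additionally keep the surrounding text of each match, so the reader sees not only where a note is cited but how.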
Most of the features are actually located in the exports. So this is actually Cosma: it's a command line application, and you use it to generate these HTML files. If you're familiar with TiddlyWiki, it's a bit like that: a single HTML file which contains everything, except that in TiddlyWiki you can edit the data, whereas this is read-only. So it's less like a web application and more like a sort of augmented document. You can share this file, obviously; it's just an HTML file, and people can open it in their browser. The idea was that I was familiar with software like Gephi, and I always wanted to be able to share graph visualizations with colleagues or students, not as static images, but as interactive things. There are lots more options now that exist to do this; we just did it for little markdown files. So that's the brief point about architecture. The features, as I mentioned, are related to information management needs. Everything is designed to encourage knowledge organization: categorizing things, classifying, indexing, tagging, relating things to one another. It's basically a memory aid, actually. It's not for graph analysis; it's more for network synthesis, so to assemble document graphs about things. And the way it encourages knowledge organization is to provide a few features that reward this knowledge work. For example, if you assign types to your notes, colors will appear and you will have filters to modify the display, so you can toggle, for example, one type; here I've toggled the insight type, which was in orange. It also, and mostly, encourages link-based knowledge organization, that is, using links as a way of describing the relations between things. And the way it rewards that is by providing contextualized backlinks. That's the thing at the bottom right here: these are the incoming links.
So you see here where this note has been cited and, most importantly, how it's been cited, because you have the context, the surrounding paragraph, right here. That's a contextualized backlink. Not an idea that we invented; we borrowed it from the web pioneers, actually. It's been going around for a long time. In recent years there's been a wave of tools-for-thought text editors in which you can create little notes, link them together, organize them, and they pretty much all have this feature. We just wanted a way to have it for scientific writing, and also to be able to share it easily. Now, the big thing that we did is we have the same feature, but for citations. So if you're working with bibliographic data, you have maybe a raw JSON file, or more likely you're working with a reference manager like Zotero, EndNote, Mendeley, etc., and in your notes you're citing works. For example, here on the right I'm quoting the two references that you can see stored in the file on the left. Well, Cosma will generate a bibliographic note: that's the dark gray one here. I haven't created a text file for this note; it's been generated automatically. And most importantly, it will show me the backlinks for the citations as well, so I can see where I've been citing which work, and how, in which context. I want to close very quickly on the idea of network synthesis, from my dissertation. What I argue is that linking, the simple act of relating two things to one another in hypertext, is a knowledge organization process. That expression is actually a thing in the knowledge organization literature: classifying, indexing, tagging, basically any process that organizes knowledge. And linking is a way to do a lot of that. Linking can be a way to index, to classify, to tag, to assign things to one another. And most importantly, by composing with links you can express new ideas, just like Lego.
If you have a note on one concept and a note on another concept, and you just bring the two together in a sentence, "this relates to that because of this", that becomes a new idea; you express it in a new note, and that's ideation, the basic process of research. I'm going to skip very quickly over all these examples that I had added, because it would be nice if there were some time for questions. Just to say that this process of synthesizing knowledge is why I titled the presentation "a tool for network synthesis". Obviously, in the process of research, the first step is analysis: you start with an object, a phenomenon, and you try to decompose it to see the fundamental building blocks. But the goal is to take those fundamental units and sort of mash them together again to produce new things. And this tool is just that: a tool to help with this process of knowledge synthesis, which is to assemble, and expand over time, these little document graphs. I'm saying document graphs because there's the expression "knowledge graph", but a knowledge graph is usually a set of descriptions in a database, and these are just little documents, hence "document graph". Right, I'm going to end here, if you have any questions. Thank you. Do we have any questions in the room? We have four minutes. There's a question about graph-based and markdown-based tools. So the question was: can we use this application to visualize notes that have been created with applications such as Obsidian or Logseq? A colleague actually wrote a little Obsidian-to-Cosma converter, because we have a data format which is close to, but not quite the same as, Obsidian's: you have to have a YAML header, the links have to be written a certain way, etc. So if you have notes written with Obsidian, there is a converter out there to transform them into the format. I don't know that there's such a thing for Logseq.
It's possible, because it's just plain text, markdown, YAML; it's very easy, I think, to write a custom parser and convert it. Do we have time for one more question? Thanks for an interesting presentation; a tool I'd really like to use in combination with Obsidian. I was wondering about the format of the notes. You mentioned the Zettelkasten, which has a specific format and way of linking: there are permanent notes, there are evergreen notes. Could you elaborate a bit on what type of notes would work well with this network synthesis way of working? Yeah, repeating the question: what type of note format would be ideal to work with Cosma, since there are many formats out there, like the Zettelkasten? The type of notes: atomic notes. I've shown Andy Matuschak's notes; he writes a lot about evergreen notes and the principles behind them: things should be atomic, densely linked, and the titles of the notes should each describe one thing and maybe work almost like APIs; a title can be a sentence that describes the idea. So that's the best sort of mental model. It's less suited for a daily log, for instance, than for a sort of conceptual knowledge base, again, where you try to relate events, concepts, people, etc. I hope I was clear. Thank you so much.
From the lab to Jupyter: a brief history of computational notebooks from an STS perspective
Hi everyone. So, no demo for me; I'm just here with some food for thought, and I will talk as a social scientist about a specific case. What I want to do, and I have very little time so I will move very fast, is two things: first, a very, very short history of Jupyter notebooks, and then a sort of plea for better knowledge of the way scientific software is made, and of its history, because I think that is lacking in our area. My starting point is the question: where are our stories of scientific software right now? I mean outside specific events like this one, and globally, in the mainstream scientific arena. Because there is software everywhere around research, ranging from bespoke ad hoc code to international stars, but there are very few stories of how it has been made and how it evolved. The social sciences have rarely looked at this software, and when they do look, they show that there are very specific dynamics going on. Research software projects are open-ended, looking toward uncertain goals; researchers are usually not specialist developers; and there are very specific funding constraints on how the software is developed. And there are specific consequences to the way these particular kinds of software evolve: the code can have some brittleness, there is a lot of intertwinement with scientific activity, and it has led some researchers to become specialized in software engineering and development. It has also led to a lot of specific software trajectories; we have seen one with Gephi Lite just at the beginning of this day, in this room. So I want to take a step back, because there are a lot of open questions here. First, how can we tell the stories of our scientific software, and how can the social sciences tell them? Because there are different journeys, especially in open source, and there are different steps in the history of each piece of scientific software.
Sometimes it stops; sometimes it continues for years and years. On a broader level, there is much intertwinement between open source and academia. In particular: what are the links between open source and science, and how is the connection made between academics and software engineers? Just to quote Christopher Kelty, in Two Bits, about UNIX: in fact, UNIX spread first to university computer science departments, and not to business, government, or non-governmental organizations; and it then also became part of the pedagogical practices of a generation of programmers and computer scientists. So there is something connecting open source and open science. In my very little time, I want to work with a specific case, which is the case of Jupyter notebooks. To say it in one sentence: it is an innovation that went from research to becoming a worldwide infrastructure of data science. Notebooks were released in 2011-2012 and spread everywhere, and Jupyter won the ACM Software System Award in 2017. It is the perfect viewpoint from which to see how a piece of scientific software emerged, how it progressively became more and more abstracted from its starting point in the laboratory, and how it diffused within and outside academia. If you want the long version, in French, there is a paper on HAL, but I will keep it very short here. I'm not here to advocate for Jupyter notebooks. I use them, I love them, but I won't try to convince you, and I'm quite sure there are a lot of people against them around here; if you are not against them but want to see why people are, just have a look at the Joel Grus talk. I'm assuming that you know approximately what Jupyter notebooks are, because I have no time to discuss them now. What I just want to tell is a very quick story.
It starts with a PhD student, then a specific script, which became IPython; then notebooks appeared; and finally we got Jupyter as we know it currently, which is basically an infrastructure for interactive data science with different kinds of languages. You can see this evolution through the main IPython releases, with the progressive emergence of notebooks around 2010 and the appearance of Jupyter. Let me go back over those different steps. So let's dive into this history. The important part is to have the context of the turn of the millennium. We are at a moment with a lot of achievements from the free software movement and open source development. Around the laboratories there is the paradigm of literate programming, from Knuth. And for people coming from computational science or mathematics, there is a lot of proprietary software specialized in interactivity with programming, like Maple, Mathematica, or MATLAB. At this moment there is also the beginning of the scientific Python community, which was just starting to develop, with the first SciPy workshop organized in 2002 in Austin, Texas. In this context, Fernando Pérez, who was at the beginnings of IPython and then Jupyter, was a PhD student in his fourth year, trying to finish his dissertation; he wanted to move from proprietary software to open source and Python, and needed something more interactive to do his work. The script which would become IPython was a simple personal fix for a problem in his own workflow, really grounded in his common sense as a researcher in physics and computational science. He wanted something that made sense of programming with interactivity. And this was the idea, the value inside this moment, that would unfold in the years that followed.
In the IPython case, the SciPy community, the scientific Python community, was quite an amplifier, and there was a very quick reception by this community: Enthought, the company which backed SciPy, posted IPython on their web page, and the project got a lot of support from this community, feedback and contributors. Quickly after this start, other contributors joined the project, especially Brian Granger, who jumped in in 2004. They managed to secure the financial possibility to continue: it was sustained by the postdoctoral grant that Fernando Pérez got at Colorado Boulder, and then thanks to the support of a team in Berkeley, which joined in 2008. So IPython is something really well grounded in academia and the SciPy community. If you look at the main contributors of IPython, almost everyone had a PhD, and some of them held academic positions even later, after the emergence of the software. Notebooks, in this context, were just a feature of IPython which appeared later. Between 2004 and 2011, the project developed, a lot of support was given by the Python community, many features were added, and they tried multiple times to add a notebook feature, because it was something already present in other software. There were five failed attempts before they were able to make a first viable version of notebooks, because some technology, especially on the browser side, was not yet available. So in 2011-2012, a new release of IPython included IPython notebooks. It was the beginning of the history of Jupyter, and it worked pretty well: it was very quickly adopted by the SciPy community, well outside the initial specialty frontiers of the developers of IPython. And in 2021, Nature could list IPython notebooks as one of the ten computer codes that transformed science, sort of a huge thing inside the SciPy community.
But progressively, the notebooks became something more important, and this led to an abstraction of what a notebook is and of the way researchers use programming in their work. There were two dynamics: on one side, a movement of abstraction out of the Python community, and on the other, a strengthening of software engineering practices in the project. This allowed the project to make a split and move from a very specific IPython tool to something more general, more abstract, which became the Jupyter project, backed with six million dollars in grants from foundations that support open science. It was a huge move, because it meant refactoring the code, changing the philosophy, and reworking the relations within the whole project, and there was a lot of money involved because it required a lot of hiring of software engineers. At this point, Jupyter became something which escaped the academic world and saw worldwide adoption. Notebooks became a standard of data science, and they were integrated into a lot of services, like Google Colab, or used in third-party tools that already existed, like Visual Studio Code. So it was a turning point, where this initially scientific project became something much bigger than the scientific community. And I will stop here, because it opens a lot of questions. Of course, for the research community, the questions are: who are the current users of computational notebooks, what kind of work are they doing, and how does this change the way we program? But at this point, the question I want to carry here is: is the Jupyter project, or its software, still scientific software? How does something which was created within the scientific community start to take on another dimension, and become something bigger, no longer just a research tool?
So, just to wrap up, because I am getting to the end of this presentation: I want to stand for more historical documentation, not only documentation of code, but historical documentation of how these specific pieces of software are associated with scientific specialties, institutional backgrounds, funding possibilities. We need to take these specific dynamics seriously, of course for computational notebooks, as we are trying to do with colleagues in different projects, and there is a GitHub repo if you want to add some archives to the story, but also for all the other tools that are inside our laboratories, inside our daily routine as scientists, because they are a huge part of the way we craft knowledge, and they don't have the same history as other, more material artifacts and scientific instruments, such as telescopes or particle accelerators. That's my point; I finish here, thank you, and sorry for the speed. How can we define scientific software? Very neat question: can I, and how can we, define what scientific software is? I think the only way I can answer is: software crafted within the context of scientific research, built not to make a complete tool but to answer a specific research question at some point in the advancement of knowledge. And there is a whole literature about the way scientific software is really different: it doesn't take really seriously into account, at least at the beginning, versioning or unit tests; it tends to skirt the good practices of software engineering, at least at the beginning, and then, if the software is still around a few years later and gains more users, it starts to integrate those good practices. So somehow there are two universes, different in more organizational and social terms, and I would say scientific software is defined by the
Beyond Ratings: Empowering Communities through Wikirate for Transparent Corporate Impact Research and Analysis.
Thanks. Hello. My name is Vasiliki and I'm a data engineer with Wikirate International. I'm going to talk about how Wikirate empowers communities for transparent corporate impact research and analysis. But before we get into the details of what Wikirate is, I would like to talk a little about the problem with environmental, social and governance data on companies. When it comes to ESG data, we can say that it is expensive, exclusive and inconsistent. There are a lot of datasets hidden behind paywalls, so individuals need to pay thousands of euros per year to get access. Additionally, there are a lot of ratings; there are a lot of organizations producing ratings about companies. The problem is that they don't provide access to the underlying low-level datasets, so it's difficult to really understand what they are rating; and they don't make the methodologies or the sources transparent either. Finally, in the last few years companies have started reporting more ESG data in text format, in sustainability reports, but the problem is that a lot of company reporting is not standardized, and that hinders large-scale analysis and comparisons between companies. So what makes open research so important in the context of corporate accountability? It fosters transparency in corporate practices and empowers different stakeholders, especially people that don't traditionally have access to those data, that aren't, let's say, investors, and don't have the money to pay for access to this ESG data. It encourages collaboration on a global scale, promotes data-driven decision and policy making, and drives positive change. So Wikirate is an open source, open data platform that brings corporate ESG data together in one place, making it accessible, comparable and free for all.
It's a wiki, which means that anyone who has a passion for sustainability and ESG data can come to the platform, contribute to the research and the available data, and organize their own research as well. Our community is mainly composed of civil society organizations, academics, university students, and data and sustainability enthusiasts. We strongly believe that in research on companies everything starts with a question and ends with an answer. So I would like to give you a short overview of the structure of the data on WikiRate. These research questions we call metrics, and each metric can have several answers. Each answer is linked to a specific company, a year of reference and a specific source. Here we have an example: did Airbnb UK Limited produce a statement in relation to any modern slavery legislation or act in 2022? The answer in this case was yes: it produced a modern slavery statement under the UK Modern Slavery Act, and there is a source and citation linked to this answer that leads to the actual modern slavery statement of the company. In addition to research metrics, WikiRate also provides calculated metrics as tools for calculation and analysis. We can say that the research metrics are the building blocks for analysis, and the calculated metrics are built on top of research metrics and allow users to run calculations. Namely, we have WikiRatings, score metrics and formula metrics; formula metrics allow users to run their own calculations in CoffeeScript, so they can be quite complex or not, depending on what the users want to do with the data. These calculated metrics help to bring transparency into ratings. Here we have an example with the Fashion Transparency Index, which is a rating that scores fashion companies based on how transparent they are on different sustainability topics.
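The metric/answer structure described here can be sketched as a tiny data model. Everything below — the class names, the sample figures, the toy scoring rule — is illustrative only, not WikiRate's actual implementation (whose formula metrics run in CoffeeScript on the platform itself):

```python
from dataclasses import dataclass

# A research answer links a metric (the research question) to a specific
# company, a year of reference, a value, and a source, as in the talk.
@dataclass
class Answer:
    metric: str
    company: str
    year: int
    value: object
    source: str

answers = [
    Answer("Modern Slavery Statement", "Airbnb UK Limited", 2022, "Yes",
           "https://example.org/airbnb-msa-2022"),  # illustrative source URL
    Answer("GHG Emissions (tonnes CO2e)", "Airbnb UK Limited", 2022, 120000,
           "https://example.org/airbnb-esg-2022"),
]

# A "calculated metric" is built on top of research metrics. This toy score
# just awards a point per "Yes" answer for the given company and year.
def transparency_score(company, year, research):
    relevant = [a for a in research if a.company == company and a.year == year]
    return sum(1 for a in relevant if a.value == "Yes")

print(transparency_score("Airbnb UK Limited", 2022, answers))
```

Because each answer carries its source, a rating computed this way stays auditable down to the underlying documents — which is the transparency point the talk makes.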
We have a partnership with Fashion Revolution, which is also an NGO; they do this analysis and research, and we're helping them make the research, the data, the ratings, the analysis, everything, transparent and available to the public. One source of data on WikiRate is, of course, data coming from the ground, from civil society organizations, but there are also a lot of data in the public domain. It's easier to bring structured and semi-structured data into WikiRate by building data integration pipelines, but of course we have the challenge of unstructured data and how to bring those kinds of answers into the platform. For that reason we run research projects and call for volunteers to research those reports and find answers to questions on specific topics like modern slavery, greenhouse gas emissions, etc. So how is the data used? One use case of WikiRate data is building data dashboards that are used to advocate for change. One example is Fashion Checker, which was developed in partnership with the Clean Clothes Campaign and advocates for worker rights, especially in the supply chains of fast fashion companies. We have the Beyond Compliance dashboard, a partnership with the Walk Free Foundation: a living data dashboard that assesses modern slavery reporting, tries to highlight gaps in it, and pushes for new legislation and new policies. The data is also used for writing news articles, and it helps CSOs (civil society organizations) produce reports and make research findings and analysis transparent. It's also used for writing research papers. WikiRate data are free, under a Creative Commons license, so anyone is welcome to use and explore the data, through the API and through the user interface.
We have a RESTful API and several wrappers that allow users to pull data from the platform, and also to contribute data to it if they want. We also have a GraphQL endpoint that allows users to form more dynamic queries based on their needs. So, where to start with WikiRate? If you're interested in contributing data, please start with the guides, as most questions are answered there; but of course, if you have more questions you can contact us directly. We have several projects in need of contributors, and you can help us improve the data: we have verification tasks, and we ask the community to help us with this process. If you are interested in volunteering, the links are available in the slides, and you can contact us if you want to share ideas, form partnerships or get support. As I said at the beginning, WikiRate is an open source project written in Ruby. You can check out our GitHub repository, get started with WikiRate and Decko, and also create your own data dashboards if you're interested in ESG data. So I think that's all from my side, maybe it was too fast, but thank you. Thank you. We have maybe four minutes for questions. Hello. Hi. I have a question about whether AI has helped you in any of these processes, for example while manipulating or getting data from the public domain. Yeah, so the question was whether AI helped us in any way in obtaining data from the public domain. The answer is that we are now considering using AI and LLMs to extract more structured answers from text reports, but we are still in the testing process, so I hope that answers your question. Yeah. How many companies are covered by the data set? Are there specific industries you are targeting?
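To make the GraphQL option concrete, here is a hedged sketch of what a query and its response might look like. The field names (`answers`, `company`, `value`, `source`) are assumptions for illustration, not WikiRate's actual schema, and the response is mocked rather than fetched over the network:

```python
import json

# A hypothetical GraphQL query asking for all answers to one metric in one
# year. In practice you would POST this string to the GraphQL endpoint.
query = """
{
  answers(metric: "Modern Slavery Statement", year: 2022) {
    company
    value
    source
  }
}
"""

# A mocked response, shaped like a typical GraphQL reply, to show how the
# result might be consumed once it comes back from the server.
mock_response = json.dumps({
    "data": {"answers": [
        {"company": "Airbnb UK Limited", "value": "Yes",
         "source": "https://example.org/src"}
    ]}
})

answers = json.loads(mock_response)["data"]["answers"]
for a in answers:
    print(a["company"], a["value"])
```

The practical appeal of the GraphQL endpoint, as the talk notes, is exactly this: the client names the fields it needs instead of paging through fixed REST resources.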
Yeah, so — sorry, yeah. The question was how many companies we cover at the moment on the platform. We cover about 140,000 companies. The biggest research focus is on the biggest companies, so more data can be found on the very popular companies, and because we have had a lot of projects on the fashion industry we have a lot of data about fashion companies. In total, at the moment, we have five million answers. Any other questions? Yeah, sorry. So it's open to contribution, as far as I recall, and you mentioned some verifiers. I want to ask how you make sure the data is consistent, and how you go through the checks to see if the data is reliable. Yeah, so the question is how we check that the data coming into the platform is reliable. It's a fair question, because we are doing crowd research and sometimes people do not have expertise on ESG topics. What we do is have different verification levels: we consider an answer verified when more than two people have come to the same conclusion, and on the platform you can see both steward-verified and community-verified answers. Stewards are usually members of the community who have more expertise on the specific topics the research is about. I'm always the person who says let's squeeze in as many questions as possible, so really rapid questions now. Very quickly: I was just wondering if this could be expanded to cover other types of data rather than ESG. The question is whether this could be expanded to cover other types of data. Yes, it can. It's again about environmental data, but one use case that comes to the top of my mind: now we have companies, but you could have something similar for countries, highlighting for instance electricity or water usage or CO2 emissions per country instead of focusing specifically on companies. Thank you so much.
From Grassroots to Standard Practice: how an Open Science society shaped university initiatives
I'm going to go to the next slide. Let's see. Our next speaker is now going to present "From grassroots to standard practice: how an Open Science society shaped university initiatives". Here you go. Hello, everyone. I'm here to talk about my experience at the University of Surrey in the UK, where I did my PhD over the last four years; I just finished my PhD. Thank you. Is this working? Next. This society was founded in 2019, which is when I started my PhD, but I wasn't part of it at the beginning; sometimes it's difficult to find out what's going on at the university. It was founded by Marta Topor, who was a student at the time, and Emily Farran, who is staff, a senior lecturer I think, at the university. They wanted to create this society to tackle the reproducibility crisis in the field of psychology, with the aim of expanding to all the other areas. It was a society run by young researchers and postgraduate students. It was open to any students, but undergraduate students are usually not interested in this kind of society. At some point we had over 100 members, which was very successful for this kind of society. We were inspired by different grassroots initiatives. There is the UKRN, the UK Reproducibility Network: a national peer-led community of researchers who want to tackle reproducibility, trust in science, and improve scientific methodology. We also have other initiatives, like ReproducibiliTea, a journal club where people get together to read papers about methodologies and how to improve methodology and analysis. The RIOT Science Club (Reproducible, Interpretable, Open, and Transparent) is also a science club, started I think at King's College London. Those were our aims:
To integrate open, transparent and reproducible methods in science, and to help with rigour, quality and the crisis of trust in science. As I was saying, we were inspired by the UKRN and the RIOT Science Club. We had a lot of meetings, discussions and workshops, and we also ran conferences at the university. It ran for two years, maybe three, and that was very interesting because it brought the whole university together, so we could talk with different people: not only students from different faculties, but also staff and researchers. And out of these different events, in January 2020 we started the monthly mini-hacks, led by Daniel Curtin — she's here and will be giving a talk later. This was when we started putting a greater focus on the computational methods of the social sciences: not only reproducibility, but also giving people the coding skills that are maybe not so easily accessible when you are in the social sciences or other non-computational fields, but that are very important for those fields. So it was about cross-disciplinary collaboration, skill sharing and hands-on coding: someone would give a 15-20 minute talk about a topic and then there would be collective programming time. It's always better to learn these things in a group, right? You can ask questions, instead of just watching a YouTube video or reading something online with no one there to guide you. And it's also useful to encourage and promote best coding practices.
I'm sure many of you have had this problem: you come to a research or academic software project and you have no idea what's going on, it's all spaghetti code. So we wanted to help people become better at coding, also as a way of training, improving your research and your employability, and building skills that may be helpful in the future outside the university. And there's the isolation and the learning curve, which is worse for some people depending on their field, their background and what opportunities they've had before. With the pandemic a lot of things were going on, some of them very bad, but in a way we had new opportunities: because we all went online, there was more of a global opportunity for sharing beyond the university. I guess we always had this option, but we weren't really used to doing things online. So we created the mini-hack consortium, which took the mini-hacks initiative outside of the reproducibility society — they were still interlinked, but it became its own thing — and we started running online courses every month, bringing in people from different universities. We had people from universities in Spain and Germany, but also in Latin America; we had a collaboration in Colombia, a two-day hackathon on natural language processing. We brought in people who were experts in different things to talk about their experiences, or about things they weren't experts in but had learned about. For example, if you've done a PhD in some science you know LaTeX, but when you're starting in research you have no idea how to write LaTeX and it all looks very strange and difficult to understand. It went very well at the beginning; we had a lot of people interested.
More people joined the workshops: when they were in person it was perhaps more difficult to advertise them across the university, but when we went online we could reach a larger audience, so that was very successful at first. These are some of the tools we were using, both at the reproducibility society and the mini-hacks. We keep all our files and data in an Open Science Framework repository, so if we had any presentations we would record them and upload the video, together with the slides and anything required for the workshops, so you can go there later. We then publish all the slides and any outputs of the workshops in F1000Research, which is an open science publisher, so you also get something for your CV and your record, which helps you in the future. And obviously we publish all the code in open repositories, for everyone to use. For advertising the workshops we used Eventbrite, which is not open, but it was very useful for reaching a larger audience: it's not only the people you know and who already know you, it also promotes your events to people with related interests, so a lot of people who had no idea who we were joined just because they saw the topic was related to their field. And this brings me to what happened at the University of Surrey. We created all these events and workshops, and then through Emily Farran, who was part of the staff, this was pushed to become policy at the University of Surrey, and more people started joining from the staff and the research side. I think they adopted the open badges, yeah, to show that your —
I don't know what I was going to say — that your project meets all these open data and open research standards. They created a working group that was leading this change within the university, and a community with forums and chats where people could share these things across the university, not just in psychology but across all the other faculties. They created a research handbook; they actually created a module people can take about open science, learning about open data; and they created an open research annual lecture, which is basically the continuation of the conference we were running as students. So, to finalize, this is what happened to us. We were very successful at the beginning — we had more than 100 people as members of the society — but with the pandemic, apart from going global, we also had other problems. People were meeting less, online and in person; people weren't going to the faculties and the office to work, so it started falling apart. And because students come and go, people were leaving the university and there weren't new people joining and putting in the time this type of initiative requires, which is a lot. So I was left alone with the mini-hacks, and it completely faded in the last year of my PhD, when I couldn't get more people to give talks or workshops — I just couldn't find anyone else. But at least the university policy is in place now; maybe more students will come in the future and want to take this on, and we learned a lot, so we already have that going for us. I don't know if we have any time for questions. We do, so let's take some questions, thank you. Thank you. So it's a very nice story, a pretty good journey — but is what you just presented documented anywhere? Yes, so the question is whether all the work we've done is available anywhere so people can take over from it. Yes: everything we've done, not only the workshops and the events
that we did, but also our policies and our documents on how to organize and run everything, is in our Open Science Framework repository — the first point here — and people can go back there and see how we ran the mailing list, what resources we used, and what the flow of organizing a society was, which is helpful even if you want to create your own society rather than continue this one. That was the second question — second and third. Yes. Thank you very much for the presentation. What's your point of view on this initiative with respect to the European Open Science Cloud — is it similar to some of those efforts? I don't know, I'm not familiar with it. It would be something similar to this, but at the European level. Okay, so I am not aware of the European Open Science Cloud, but I think it would be very interesting to connect all the different initiatives that have the same aims, and that would probably also help to get people to collaborate on this, thank you.
Bridging contributor's knowledge and the technology of The Turing Way, an open guide for data science
Okay, great. After the last speaker, I think my sticker game isn't quite up to standard. I'm Jim, I'm a research software engineer at the Alan Turing Institute and a volunteer core member of the Turing Way team. The title I submitted was "Bridging Contributors' Knowledge and the Technology of The Turing Way, an Open Guide for Data Science". I thought that was a bit long and vague, so I had another go and came to "a personal perspective on the interface of infrastructure and people", which is at least shorter, though I'm not sure it's much better. I'm going to be talking about connecting the people who contribute to your project with the infrastructure, which means the technology, but also the processes and people that control what gets into the project and how decisions are made. The way I like to pitch it: we all contribute to projects for a reason. We're making something for a purpose, and there are different ways we can measure that — number of contributors, stars, downloads, engagement. But the common thing between all of these is that we need the contributions; that's how we make progress. So what I'm going to suggest is that maybe the most important thing you should be thinking about, and measuring yourself against, is how well you facilitate the people who are contributing to your project. A bit of scene setting for the Turing Way: The Turing Way is a handbook for reproducible, ethical and collaborative data science. It's developed completely in the open on GitHub and is openly licensed under Creative Commons. And there's quite a large number of contributors; the last time I looked it was a bit over 470 in total. When I work on the book, that's what my screen looks like.
My background: I have a chemistry degree and PhD, and during that I was exposed to Linux and open source software through computational chemistry and became really passionate about it. Now I work as a research software engineer, so I'm not really a chemist, but I'm not quite a computer scientist either — I'm somewhere floating in the middle. The Turing Way is a big project where I work. I started off just making a few tiny contributions, fixing typos and links, and late last year I became one of the co-leads of the new infrastructure working group, where we think a lot about CI and automation and how we help people get stuff into the book. So what's the actual problem I'm talking about? If you're a maintainer of a project, one of your key tasks is maintaining quality, and to a certain extent that means putting up a bit of a barrier to contribution. You need to keep standards, you don't want to break things. That can be tricky, because it involves pushing back against people, maybe giving critical feedback or not accepting certain changes, and you need to strike a balance between encouraging people and getting stuff in — particularly when your contributors are, I always say non-technical, which really means not software engineers, I suppose. Your contributors may come from many different backgrounds and have different amounts of experience. Some might be more or less into tech, and you can't assume every contributor is going to be the same: they might need different levels of assistance, and what makes sense to one might not make sense to another. In the Turing Way in particular, the community is incredibly diverse in terms of educational and professional background, the languages they speak, the time zones where they work and live, and their lived experience.
And most of the contributions — most of the data that's actually in the project — are prose, not software. People contribute their ideas and knowledge in the form of text rather than working on code. The people are generally not software engineers, which means there's quite a lot of additional support needed: you know, how YAML is formatted, why Markdown works the way it does, why CI doesn't pass. But the important thing is that all the people who contribute make valuable and important contributions to the book, and their technical ability — how well they understand the build process — isn't a good measure at all of the value of their contributions. So we focus a lot on how to enable people to contribute to the book. Here's our approach at the moment. Probably not surprising, but everything is version-controlled in Git and the project is on GitHub. I think there's an interesting question of why we do that — why not just have a wiki, if it's mostly text? The simple answer is that the advantages of version control are just too strong. You can go back in time; handling multiple contributors is really easy, even when they're working asynchronously on different branches and you have to fix conflicts. And because it's a guide about open and reproducible data science, there's an element of practicing what we preach: we've got to demonstrate the culture we're trying to create, and that means doing everything in the open, as reproducibly as possible. There is a community handbook, and I always love how meta this is: it's a book within a book, a book which tells you how to contribute to the Turing Way. There's a contribution guide, a code of conduct, style rules and things like that. So yes, I love that the book tells you how to write and contribute to the book.
And because it's part of the book, it's completely open: if you think those rules should be changed, adjusted or clarified, you can contribute to that too. Recognising contributors is really important, and we try to recognise all types of contributions, not just text and not just code. One of the ways we do that is with tools from the All Contributors project, where you can tag people with the types of contributions they've made, and you get a nice table of people and their contributions, which is also displayed in the book. On the Git repository, people are encouraged, if they feel they've done something and put in some effort, to suggest that they be added for a certain contribution type. More recently, we started using a GitHub Action which ties into the Crowdin API — Crowdin is the platform used for making translations — to better recognise the translation efforts that go on in the Turing Way. There's a lot of support. As I said, a lot of people are not super technical, so we might need to work quite closely with them. We like to think of pull requests as a chance to work and collaborate with people and make connections, not just a barrier to stop things you don't want merged. And that support goes even further: there are different types of events and co-working sessions to help people get contributions in. There's regular co-working, and there are sprint-style events called Book Dashes. So there's a lot of stuff which adds a social element, but it's also about helping people work together collaboratively to get things into the book. And we lean on CI and automation a fair amount; the focus there is to remove burden from the users. We don't expect people to build the book or run tests themselves — everything is done in CI for you, so you don't really need to know how the book gets built; you can just focus on writing the Markdown.
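For readers unfamiliar with the All Contributors tooling, the recognised contributions live in a `.all-contributorsrc` file at the repository root, from which the table is generated. A minimal sketch — the file format and keys are the All Contributors project's, but this particular entry is hypothetical, not taken from the Turing Way repository:

```json
{
  "projectName": "example-book",
  "contributors": [
    {
      "login": "example-user",
      "name": "Example Contributor",
      "contributions": ["content", "ideas", "eventOrganizing", "translation"]
    }
  ]
}
```

Each string in `contributions` is one of the project's named contribution types (each mapped to an emoji in the rendered table), which is what lets non-code work like event organizing and translation be recognised alongside code.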
So here are my sort of unoriginal lessons for how to support contributions. Building a community takes effort: if you write some code and put it on GitHub, it doesn't necessarily mean people will engage with you. You need to be quite proactive in reaching out to your community and assisting them, and that means you need to know who your contributors are, so you can figure out what would help them — the thing to think about is what you can do that would most enable them to contribute. Leaning on CI is great: you can say "what CI says goes". It's a fair way to compare people's work, all the tests are done in one consistent place, everyone is being marked to the same standard, and you avoid arguments of "well, it works on my machine". This sort of goes without saying: version control is not optional. It's brilliant, do everything in version control, and if that means you need to support and onboard people in how to use it — which we definitely do — it's worth the effort and the pain. You should be flexible. It's better to bend the rules a little to get a contribution in, and you shouldn't let the perfect be the enemy of the good. However, you do need to know when to be strict, and here are some suggestions of where your red lines should be — things that aren't acceptable to merge. But even when you're doing that, keep in mind: be kind and respectful, and problems are actually an opportunity to get to know someone, help someone and teach someone. So thanks very much for listening. I'd just like to thank a few people: the infrastructure working group on the Turing Way — that's Brigitta, Danny, me and Sarah — and Ale and Anne, who are the project and community managers and provided a huge amount of support in getting the working group started.
And Scriberia, who made all the brilliant illustrations you've seen — without them you would have been looking at a lot of bullet points, which wouldn't have been as fun — and absolutely everyone who has contributed to the Turing Way. If this has sounded interesting to you, the book — a guide to open, reproducible data science — is there to read. Here are some ways to join the community, the social aspects, and we've got more definite, clear ways to get involved: you can read the community handbook, we've got good first issues, and there are events you can join to start making contributions. Thanks very much. We have one minute, so you can take one question if you please, while we welcome the next speakers if they are here. Yes. So you mentioned how you're recognizing contributions and that people can nominate themselves. I'm curious whether you have a list of what those categories of contribution are, because I assume many projects struggle to recognize non-code contributions. I'd like to see that you have a record, but I was curious: do you have a list of how you recognize these other types of contribution? Yes — I'm going to repeat the question: we say we want to recognize all types of contributions, and earlier in the slides you could see a little emoji to say what those are, so is there a specification for them? The All Contributors project has a list of contribution types and their emoji, and I would say they are loose suggestions: some of them are infrastructure, content, thinking, event planning, things like that. So we roughly follow what that says.
We're not super strict about saying this emoji relates very specifically to this kind of work. The approach — I think it's written somewhere in the community handbook — is more like: if you feel that what you've done is, for example, event planning, you should find the closest type to that and add yourself. So we use it quite loosely, and take the attitude that a contribution of any size, if it's a measurable contribution, is worth adding an emoji for. But yes, the All Contributors project has a table of what those mean, with some suggestions. And I think it's quite nice that another project can use them a bit as they want — maybe some of them are more or less relevant depending on what your project does. Thanks.
The CarpentriesOffline: Teaching Foundational Data Science and Coding Skills with Little or no Internet Access
Hi everyone. I'm Abhishek Dasgupta, a senior research software engineer at the University of Oxford, and I'm presenting Carpentries Offline today with my colleagues Jannetta and Colin. Carpentries Offline is about teaching data science and coding skills in low-resource settings, in places where you don't have internet — a way to run the courses without internet access. So who are we? Jannetta is at Newcastle; Colin is at the National Oceanography Centre; I'll let them introduce themselves later. And there are other people as well: we have collaborators at the University of Florida, and also Stella and Varsha at Durham. What is the Carpentries? For those of you who do not know, the Carpentries is a non-profit organization built to teach foundational coding and data science skills to researchers. Its vision is to be inclusive and to make software skills teaching accessible to as many people as possible. There are various kinds of carpentries: Software Carpentry; Data Carpentry, which is focused on data science; and Library Carpentry for library and information sciences. And there are various roles: anyone can go through instructor training and become a certified Carpentries instructor, and Carpentries workshops go through the approved Carpentries curriculum, which includes things like introductions to Git and the shell and introductory courses in R and Python. Our course notes are all open source and online; we use Etherpad and Google Docs for shared notes, and everything is on GitHub. So how did Carpentries Offline start? It was at the Software Sustainability Institute's Collaborations Workshop hack day. We hit upon this idea: what if you do not have internet? Of course, a lot of the instructions in the current Carpentries curriculum require internet access.
They will say: download this from PyPI, or download Python or RStudio and install it. What happens when you do not have access to the internet, or maybe internet is very expensive, or maybe you want to work on the Eurostar, where we found out that the internet does not really work; the Wi-Fi is there but it is a sham. So we came up with the idea: why not use Raspberry Pis, because they are cheap and available. I think this was before they suddenly became unavailable, but basically the idea was to use some sort of low-cost single-board computer to host the Carpentries infrastructure and allow people to access it offline. It won the hack day, and there was also an SSI Fellowship on it in 2022. So the first part of it is actually getting the data from the internet onto the Raspberry Pi. That is a package developed by a team at the University of Florida in collaboration with us, called offlinedatasci. offlinedatasci is a component of Carpentries Offline, but you can use it outside it. So if you for some reason want to get R and Python and the common packages, you know, pandas and NumPy and all that, cached onto your computer so they can work offline, you can do that. By default it will install a certain set of packages which are customized for the Carpentries, but you can add your own packages to that, and it's available on PyPI so you can install it. What we mirror is the latest installers of Python and R, and we also use partial mirrors of PyPI and CRAN. You can customize the packages, so you can specify your own packages to download, and once you download it you can set your local pip or CRAN to get data from that. Then we also mirror the Carpentries online material, of course, and installers like RStudio. So I'll hand over to Colin. Thank you. There have been three threads to our project in terms of building hardware that people can actually go and use. The first is to put Carpentries Offline onto a Raspberry Pi that can be booted up and run at a workshop.
The second is to build a bootable flash drive that can be used with an old laptop, perhaps, and the third and latest one is to build a miniature HPC, a high-performance computer, that can be used for teaching HPC lessons. So for the first option we have the Raspberry Pi. I have here a Raspberry Pi Zero, one of the cheaper ones. I think these, when they're in stock, cost about 10 or 15 dollars or euros, and they can run as a Wi-Fi access point mirroring everything. Some of our lessons require us to have GitHub, so instead of GitHub we have a program called Gitea, which kind of has all the functionality of GitHub but is self-hosted on the Raspberry Pi. We can also run an Etherpad server, mirrors of all the lessons, and CRAN and PyPI mirrors. We can use the Pi Zero or we can use the bigger Pi 1, 2, 3, 4, 5 now. And when this boots up, and I will try and do a little demo, you should see an access point called Carpentries Offline that will be accessible. If you were to then join that access point, go to your web browser and type in carpentriesoffline.org or 192.168.1.1, you will see the web page that is listed there, and that will then enable you to get onto Carpentries Offline. Do we dare to do this? Where's the Wi-Fi chooser? It should be at the bottom there. Oh, it's on there. Good. Is that joining? Connecting. You can see this on the screen. It says we're established, so let's see if the demo gods are in our favour. And there we go: the web page being served from this little Raspberry Pi Zero. If I was to click, for instance, on Data Carpentry and go to the R Ecology lesson, there is a complete mirror of the R Ecology lesson that we can then teach from. And we also have Gitea, so you can log into Gitea and have a very similar experience to GitHub, or if you want to download some software, say you need to install R, there is an R Windows and an R Mac package available for you to download.
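Before teaching from a Pi like this, it can be handy to check that its services are answering. This is a minimal sketch in Python; the address 192.168.1.1 comes from the demo above, but the port numbers (80 for the lesson mirror, 3000 for Gitea, 9001 for Etherpad, their common defaults) are my assumptions, not a documented Carpentries Offline layout.

```python
import socket

def service_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical service list: lesson mirror, Gitea, Etherpad.
services = {"lessons": 80, "gitea": 3000, "etherpad": 9001}
for name, port in services.items():
    status = "up" if service_up("192.168.1.1", port, timeout=0.5) else "down"
    print(f"{name:9s} {status}")
```

Run from a laptop joined to the Carpentries Offline access point, this gives a quick up/down report before learners arrive.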
We haven't got it listed on this page, but you can then point your CRAN or your PyPI installation at this server and install things from those mirrors instead of having to get them off the internet. And I will put this back on so we can keep going with the slides. Oh, okay. So I don't need the internet. It's beautiful. It's not working, so I'm just going to keep going from there. So one of the problems we had trying to build the image for the Raspberry Pi is that we initially started with a set of instructions: boot up your Raspberry Pi with an image you've just downloaded, type this and type this and type this, and eventually you'll have a Carpentries image. We then moved to having a shell script that could run all of that automatically, so that it was a bit easier to reproduce. Then at a hackathon we went to, someone suggested what seemed like a brilliant idea: we could run this in GitHub Actions and do it all in the cloud, and have that spit out a Raspberry Pi image for us. Many hard months of work later, we realized this wasn't quite so easy, because the emulator for the Raspberry Pi is really slow, and it turns out that GitHub Actions actually has a six-hour time limit, which wasn't enough to do all of our installation. We had a few hacks to speed things up. One of the things we found is that not only is the computation slow, but the network access out of the emulator is really, really slow. Downloading anything inside the emulator was much slower than downloading outside, so we actually download all of the offlinedatasci stuff outside of the emulator, mount it in a virtual drive, then copy it into the emulated image and build the emulated image, and now we've got it down to about two hours and it pretty much works.
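Pointing pip and R at a mirror like this usually comes down to a couple of configuration lines. Here is a minimal sketch that generates them, assuming the mirror is served at http://192.168.1.1 and exposes /pypi/simple/ and /cran paths; those paths are my assumptions for illustration, and the actual URLs on the Pi may differ.

```python
def pip_conf(mirror: str) -> str:
    """Contents for pip's user config file pointing pip at a local mirror."""
    # Extract the bare host for trusted-host (the mirror is plain HTTP).
    host = mirror.split("//", 1)[-1].split("/", 1)[0]
    return (
        "[global]\n"
        f"index-url = {mirror}/pypi/simple/\n"
        f"trusted-host = {host}\n"
    )

def rprofile(mirror: str) -> str:
    """A line for ~/.Rprofile so install.packages() uses the local CRAN mirror."""
    return f'options(repos = c(CRAN = "{mirror}/cran"))\n'

print(pip_conf("http://192.168.1.1"))
print(rprofile("http://192.168.1.1"))
```

Writing the first string into pip's user config file and the second into ~/.Rprofile is typically all a learner's machine needs to install packages offline.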
The one snag I'm currently having is that GitHub is not allowing us to upload the final image. I think we're hitting GitHub's maximum file size limit and need to find somewhere else to host our images, but there's a link later on to our GitHub if you do want to go and download the last image we managed to get on there. As a kind of side effect of doing that, I started testing builds in the cloud, just natively on an AMD64 system, and realized that we could also build a Docker container containing all of this. That was actually kind of useful during workshops, because sometimes some of the Carpentries infrastructure would go down midway through a workshop, and the last thing you want halfway through a workshop is telling everyone to go and download something and finding that the website they want has gone down. So we found that we could also replicate all the infrastructure in the cloud, or on our own server if we have a server, and it meant we had a backup version of everything needed to run a workshop. That has saved me on multiple occasions now in workshops where we lost access to something, and I've now got an almost one-press solution with a Docker container that I can deploy out to a cloud, which also works very nicely for testing stuff. At the tail end of the COVID pandemic, Raspberry Pis got really, really hard to get hold of and really expensive. It was always the joke that Janetta had all of them in her house, but I don't think she was the sole cause of that; the chip shortage certainly didn't help us. And so we started looking for alternatives to the Pi. Do you want to take over at this point? Do I want to do this bit? I don't want to. Where are we now? On to option two. Oh, okay. So this option two is also because some people say: but I already have a laptop, so why do I want to go and buy a Pi? Especially if we look at countries where you don't have access, that's already a big problem, apart from the Pis being too scarce to find.
So we've come up with the idea of doing exactly what we're doing with the SD card for the Pi: creating an image, but an image on a flash drive, which can also be downloaded and written to a flash drive. It has basically exactly the same software on it as the SD card, and any laptop that can boot from a flash drive can boot from it, which turns it into the same server as we have here. Oh, okay, so now let's put this on. So more or less last April, I was running a workshop, an intro to HPC, and everything that could possibly go wrong did. So I started to think, well, I also had a shedload of my fellowship money left and I had to spend it on something, and I thought, hang on, if we extend this project to cover the intro to HPC, that would be a really cool thing to do, and there were a few things we could do with a mini HPC. The hardware is more visible: people don't really know what an HPC is, this massive thing hidden in the cloud somewhere; they don't really ever get to see the HPC that they work on. It's also quite easy to mimic the resource limitations, which you're not going to do during a lesson on a big HPC, but on the mini HPC you could actually do that: you could hit limitations, see what happens, and teach people how to cope with that. You also, and this was the big thing: people get this email saying register for your account on the HPC, they don't do that, they show up on the day, and you've got to jump through all those hoops of getting people registered, which is not a quick thing to do, because sometimes there are loads of things to be done before you get your actual HPC account.
Another nice thing is that it doesn't interfere with the real HPC. Users can get quite afraid that they're going to break something on the real machine, so in this case you can assure them: it's here, we're not going to break anything of national importance or international security or something. And of course if you don't have access to a real HPC, then that's also covered. Also, on the day when we ran the workshop, one of the problems was somebody running something on the login node, and you know what that's like: you can't even log in, let alone run your scripts. So you won't have that, and you also won't have problems with network access, because it's all local. So like Colin said, the reason there was a scarcity of Raspberry Pis was because I probably had them all in my house. So although I got the money for building a mini HPC, I decided to go for these Rock Pis, which are sold in the UK by a company called Okdo and by RS, and it was an absolute disaster trying to get these things ordered, and the time I had to work on this thing was passing because these guys couldn't get my order sorted out. So I collected all the Raspberry Pis in my house and built this one that you see there in the picture. That consisted of three Raspberry Pi 4 compute nodes and then one for the login node; in the meantime I've added, I think, two more nodes or something. It's running Raspberry Pi OS Lite, the 64-bit one, and the head node also acts as a Wi-Fi access point, so everybody can just log into that.
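A cluster like this is driven by a batch scheduler (the software list on the next slide mentions scheduling). Assuming the scheduler is Slurm, a first job a learner would submit looks roughly like this sketch; the helper function and its defaults are invented for illustration.

```python
def job_script(name: str, command: str, nodes: int = 1, minutes: int = 5) -> str:
    """Render a minimal Slurm batch script as a string."""
    return (
        "#!/bin/bash\n"
        f"#SBATCH --job-name={name}\n"       # job name shown in the queue
        f"#SBATCH --nodes={nodes}\n"         # how many Pi nodes to use
        f"#SBATCH --time=00:{minutes:02d}:00\n"  # wall-time limit
        f"#SBATCH --output={name}-%j.out\n"  # %j expands to the job id
        f"{command}\n"
    )

print(job_script("hello", "hostname"))
```

Saved as hello.sh and submitted with `sbatch hello.sh`, a job like this lets learners see which compute node the scheduler picked, which is exactly the kind of thing the mini HPC makes visible.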
In the meantime I did manage to buy the Rock Pis, but we are still in the process of setting those up, because in the end the Raspberry Pis would probably make more sense, although the Rock Pis are of a slightly higher spec. The idea in the end is to produce two images that you can, again, just download from the website, and we want to be able to build up this operating system with scripts. So we still need to do the scripting; at the moment I'm still doing everything manually. Then you can download an SD card image which you can just write to a card, so that the people who end up with this mini HPC don't need the knowledge to set all of this up. And this is basically the software that we want installed on it, because this is what the lesson covers, what most people will be using, and what we need to actually do what we want to do: for the networking, for the scheduling, et cetera. And then I've got to get to the point where I credit everybody, so all our credits are here; if you want to know where I got the STLs for the 3D printing, there are a lot of people I do have to give credit to, because everything is their work. I forgot to put this picture on last night: this is sudo, for executing bash commands. I forgot to add it then; it's the important one. And then there's the small one called rm. And so here are some more links and credits. The Raspberry Pi image that we have at the moment, that Colin's been talking about, can be downloaded from that link there. You can find us at carpentriesoffline.org, and what's not on there will hopefully soon be on there, because it's a work in progress all the time. We also have a Slack channel in the Carpentries workspace, so we can be found there; if you have any questions or anything you can get us there, and that QR code will lead you to carpentriesoffline.org. And I think that's the end. Thank you. We have time for questions.
Can you share more about how people are using Carpentries Offline outside the team? Okay so, I forgot to repeat the question: how is Carpentries Offline being used outside the team? So one of the things so far, because we're still working on this image, is that we've got a lot of people who keep saying they're interested, but nobody has really taken it on yet. Especially with the mini HPC, I'm still working on it, but I hope to soon be able to run my first workshop on it, and then hopefully it'll take off from there, because then I can say to people: okay, we've done it. I've also kind of used it myself: I went to the University of Strathclyde to run a workshop, and the power went down and the internet went down, and I was able to use it there, just from my laptop actually that day. No, I didn't use the Pi that day, even though I had it all there. But we've not been able to get people onto it, because I feel it's not in a state where I can set somebody off with no experience; we also need to develop an onboarding lesson. So we hope to have more hackathons and more work sessions where we can work on these things and get it ready for other people to adopt. So if anybody here wants to go off and adopt it, please let me know; you know where to find me. Actually there was another workshop that I ran in South Africa, where we wanted to test this, and we ran into a limit of eight people connecting at that point, but I think Colin has sorted the problem out. The newer Raspberry Pis don't have that limit, but the Zero, I think, and some of the older ones do, and there was a firmware fix for that.
A typical Carpentries workshop would be between 10 and 30 people, so we'd be aiming for that sort of size. We had limits with the older hardware, but we think the newer hardware doesn't have that problem and we should be able to get 30 people on it. Another question? Yeah, I rattled through it a bit faster because I didn't know how much time was left. And this hat makes me warm. At least we didn't get the hot hats. Five minutes. If there are questions, please ask away. If not, it's cool. You have all been informed that we have a social event tonight. We are going as the organizer team of this room to the Tavernier bar, so you can check the QR code on the left; that's the way to the Tavernier. We are also organizing an event next week that's going to take place online, with some other talks, not only those that couldn't make it into the tight schedule today, but also from people who couldn't travel to Brussels. We wanted to be inclusive of those people too, so that's the QR code in the middle, and the website is the last QR code, at the back of the room there on the right. A very important question: where can I get a hat? I've got some more at home. Next time I go touring I'll bring one. These hats are all representative of the Carpentries. This is the software one. This is the Library Carpentry one. We also have a protractor, but we are figuring out how to get that on the bed. Thank you very much. Thank you.
Updating open data standards
Okay, so thanks for bearing with me. I'm Sara Petti and I'm going to talk to you today about updating open data standards, based on the journey that brought us from the Frictionless specifications version 1 to version 2, which is currently ongoing. So briefly about myself: I'm the International Network Lead at the Open Knowledge Foundation and also the Frictionless Data community manager. I love the digital commons and I'm based in Bologna, Italy. I left here some ways you can contact me: via email, on X, formerly known as Twitter, and on GitHub as well. So before we start, I just wanted to give you a quick introduction, for those of you who might not know the Frictionless Data project and the Frictionless standards, to the Frictionless Data Package, which is the core Frictionless specification. The Frictionless Data Package is basically a standard to package your data. It's very simple and very easy: you package your data together with a descriptor containing your metadata and a schema for your data. There I put a link; if you ever want to explore all the specifications of Frictionless Data, you can just go on that website. So the Frictionless Data Package was released in 2016 in its version 1. Some years have passed meanwhile, and it has actually gained a lot of traction in research communities and academia, but it has also often been mentioned in the open data guidelines of governments and public administrations, and it's often used by data wranglers as well. So we started to think about what actually made Frictionless successful, why this standard was so successful, and these are some of the things that we came up with. The first thing is that the Frictionless specifications were not developed alone in a room; they were really the outcome of more than 10 years of iteration with communities of practice and stakeholders, and a full engagement with issues around interoperability, data analysis, and data publication.
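To make the "descriptor plus schema" idea concrete, here is what a minimal descriptor can look like, rendered from Python. The package name, path and fields are invented, and this is a sketch in the shape of the version 1 spec rather than a normative example.

```python
import json

# A datapackage.json descriptor: metadata plus a schema for one CSV resource.
descriptor = {
    "name": "example-package",
    "resources": [
        {
            "name": "observations",
            "path": "data/observations.csv",
            "schema": {
                "fields": [
                    {"name": "site", "type": "string"},
                    {"name": "count", "type": "integer"},
                ]
            },
        }
    ],
}

print(json.dumps(descriptor, indent=2))
```

Saved as datapackage.json next to the data, a file like this is all it takes to turn a folder of CSVs into a data package, which is why the standard disrupts existing infrastructure so little.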
As you've seen from my slide before, the specifications are very simple. I think that's also one of the keys to their success. Basically, because they are very simple, they disrupt as little as possible whatever existing infrastructure is already there. When thinking about the Frictionless specifications, we always had in mind as an example CSV, which is a standard for tabular data, and we think that the key element of CSV, and why it is so widely adopted, is that it is so simple that everybody can use it. It's maybe not the most adaptable to specific use cases, but it's still adoptable by almost everyone, and so it's one of the most used tabular formats right now. The simplicity of the Frictionless standards also means that they are extensible and customizable by design. So they are designed for tabular data, but we have a lot of people in the community who use them for other data as well. We have metadata standards that are machine-usable, because of course we have to bear in mind that data must be FAIR, but we also keep in mind that there are humans who might want to manipulate that data, and so the metadata standards are also human-editable. Another thing that was very important for us was not to reinvent the wheel, so we tried to reuse as much as possible existing standards and existing formats for data. And last but not least, we tried to build as much as possible something that was language, technology and infrastructure agnostic. Once that was done, we started thinking about the adoption of the standards, and one thing that became clear quite quickly was that a standard alone was sometimes not enough, and that you also need a technical implementation of those standards.
And it's funny because I was talking yesterday with someone from the Frictionless community who was telling me exactly this: it's so great that we have basically built libraries on top that you can use to perform a number of things on your data, for example validate your data or extract your data, and those are present in a number of programming languages. So I work at the Open Knowledge Foundation, where the core Frictionless team sits, and we developed for example a Python framework, which is the first link that you see there, but the community that uses Frictionless also developed other libraries in other programming languages that perform some of the same functions, so we have Frictionless R for example, and Frictionless JavaScript, and those all form what we call the Frictionless universe. There's a website that I'll definitely encourage you to go and have a look at if you're interested. So okay, it's all very nice, everybody adopted the standard and it gained traction; why did you need to update it then? Well, of course since 2016 issues started to accumulate in the GitHub repository. So basically last year, with the core team at Frictionless, we started having conversations with the community, and we started to go through all these issues, trying to triage them and see those that were most requested, those where there was more conversation ongoing, and those that made more sense because of requirements that came up during the years, and so we decided to start a draft roadmap for version 2. And then the second part was: okay, now that we have decided to update those standards, how do we coordinate this update? That was probably the part that took most of my time as community manager, and here I've tried to summarize the key elements of this update and the things it was important for us to take into consideration in coordinating it.
So the first thing is of course: don't do it alone. Right from the beginning it was very clear to us that we had to take into account and bring in people from as many backgrounds as possible, because as I said before, the Frictionless Data standards are very simple and adaptable to many different use cases, but if you want to build something so simple you need to hear a lot of people and have in mind a lot of use cases, because they can actually help you build a common data model that will then fit the needs of everyone, or at least help you find some minimal common ground. And so when we started our Frictionless Data specification working group, we brought in people from research institutes and universities from different academic fields, but also libraries, open data cooperatives for example, and engineers as well. The other thing is: be clear. The first thing the working group asked us was: okay, very nice, you want to do this, but please let's define the overarching goals of this project, let's have a roadmap, and let's have it somewhere that's easy to find. For us it was quite easy: we have a project website, which is frictionlessdata.io, so there we published a page announcing the specs update, detailing the goals and the deliverables, and from there we also linked to the roadmap, which is actually on GitHub, because that's where the technical discussion with the community is happening, in all the issues that you see there. The third thing, which was in the beginning a bit taken for granted but actually needed some thinking as well, was to decide how to decide, because okay, we sat down with the working group and everybody was like: yeah, okay, we'll do this with consensus.
But then we clearly realized that it needed some definition as well, because not everyone understood what consensus really meant: does everyone need to participate every time in the discussion, even if maybe it's some part of the specs that's not really important to them? And so we basically decided that a PR can be merged into the specs only if two thirds of the working group has participated in the discussion and has a favorable opinion about it, and we understand consensus as reached when we have a solution that everybody can live with. And it's in the announcement blog post if you want to go and have a look. So that's it. So just to give you a view of where we are now: at the moment we had 36 issues that were part of our first roadmap. 10 out of 36 are now closed already. Of the remaining 26 open issues, 11 already have a first PR proposal, and 23 of those 26 already have an ongoing working group discussion. What we also decided to add, as a kind of information for the working group but also for the broader community, is a public live tracker, on GitHub as an issue; basically you can go there, and we update it on a weekly basis, to have a place where people can monitor the progress. And our aim, by June 2024, is of course the release of the Frictionless specs version 2, but we would also like to release a small Python metadata mapper and some integrations in external systems like CKAN and Zenodo. To conclude, I just wanted to mention that this update is made possible thanks to the generous support of the NLnet Fund, NGI Zero Entrust. There are a lot of fantastic opportunities out there that maybe could be useful for you as well; they fund a lot of open source projects, so I encourage you to go and have a look. And then I wanted to thank you for listening to me today. I left there a bunch of links that might be useful.
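The two-thirds merge rule described above is simple enough to write down; this is my illustrative reading of it, not the working group's actual tooling.

```python
def can_merge(group_size: int, participated: int, in_favor: bool) -> bool:
    """A PR is mergeable once at least two thirds of the working group
    has participated in the discussion and the outcome is favorable."""
    # Integer comparison avoids floating-point edge cases: 3p >= 2n <=> p/n >= 2/3.
    return in_favor and 3 * participated >= 2 * group_size

# With a 12-person group, 8 participants meet the two-thirds threshold; 7 do not.
print(can_merge(group_size=12, participated=8, in_favor=True))
print(can_merge(group_size=12, participated=7, in_favor=True))
```

A definition this explicit is exactly what removed the ambiguity: everyone can check whether a given PR is mergeable without debating what "consensus" means.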
The first one, frictionlessdata.io, which I mentioned a couple of times, is the project website where everything is linked from. So if you want to find, for example, all the GitHub repositories, you will find them there, but also the different project pages. We have a community chat on Slack, but if you prefer to use an open protocol, you can access it via Matrix as well. I left the website of the Open Knowledge Foundation and also the Twitter handles of the Frictionless Data project and the Open Knowledge Foundation. Thanks. APPLAUSE We have time for one question. Yes, thank you. A very short question: how did you agree on a common data model? Asking as someone who has repeatedly failed at that. Thank you. So the question was how to agree on a data model, because that's very, very difficult to agree upon. I think for us the key, and it's of course something very difficult to do, is basically to take away all the layers of complication and all the specifics of some types of data. Instead, what we did was basically collect all the kinds of data that we wanted to support, try to understand what the common things were, and basically start from there. Of course, again, it is very simple and adaptable, but it is also something that is focused on tabular data. It is extensible to other kinds of data as well, but of course you have to have a data type in mind. I don't know if that answers your question. Thanks again to Sara. APPLAUSE
Wikimedia projects and OpenStreetMap as an Open Research Infrastructure
The aim of this presentation is to show how Wikipedia, Wikidata, Wikimedia Commons, the Wikimedia projects, and also OpenStreetMap and other resources can be used as open infrastructure for research. We're talking about websites that are based on an open infrastructure, so they're based on open and free software, and of course all their content is available openly. What is also interesting about this ecosystem is that it's incredibly multilingual: you have a really wide community of contributors in over 300 languages. Even more, it's one of the biggest existing online communities, and this is obviously a feature if you want to collaborate with citizens, which is one of the aims of open science: working and collaborating among people and institutions. Something else that is valuable is the fact that we're talking about resources that can host different kinds of content. It can be data, but it can also be images, audio, or documents, with a community that can contribute in different ways to improving this content: it can be restoration or improvement of images, it can be adding captions, it can be transcribing documents. There are many advantages in those projects. Some of those are very well known. The visibility is probably one of the biggest: for Wikipedia, we're talking about 28 billion views per month. We're talking about the visibility that Wikipedia and Wikimedia Commons have provided to collections like the Met collection, the Metropolitan Museum in New York: it moved from a collection that was viewed two million times to one viewed 10 million times. So the visibility of those projects is very impressive. But we're also talking about an international community, a community that also has chapters around the world, and a desire to enlarge the community, with policies and funding that have been created for that.
We're also talking about reusable resources, so resources that really provide content, information, and data that are available also to people who don't have particularly technical skills. And there are other features, like the FAIR data principles that are applied on all those resources, but also an attention to new ethical principles, the CARE principles, and the synergy with open government and with GLAMs, so with cultural institutions. Those resources are already used in research. Wikidata is probably one of the major examples, and the beautiful project Scholia is one of the examples you might access; it provides information about researchers and topics. But there's been a lot of work related to how to use those resources as a research infrastructure, and I'm just quoting some of the papers related to this, focusing on Wikidata. Daniel Mietchen has done an incredible job in this. He was also a Wikipedian in Residence for the Open Knowledge Foundation, from which we just heard a presentation. He was the first Wikipedian in Residence, and he worked extensively on open access, improving content on Wikipedia related to it, and also improving the communication of the project within the open science system. And in 2015 there was this project on working with Wikidata as a virtual research environment, which is very promising. It was not funded, but it gives an idea of how the infrastructure can be used, and is already used, in this direction. Furthermore, there are studies that are highlighting how, sorry, I need to breathe; you've all noticed that it's something that I sometimes forget. So going back: in 2019 a study about Wikidata came out. It shows how Wikidata is already extensively used, but it also talked about how the arts and humanities and social sciences are not very present in the field.
And research about how the arts and humanities use Wikidata shows that there are projects that use the data, but only a few that collaborate on the data, that is, that create a community which actually uploads research data as well as using it. So I'm just going to present to you three positive elements and three challenges that I encounter in my work related to the arts, humanities and social sciences that I think might be interesting to highlight. For the advantages: the large-scale use of all those resources combined together, so not only using Wikidata, but really taking advantage of the different formats that those resources allow you to upload. A second element is the broad interest in heritage and museums, so the existing and real attention that those projects receive. And the last is the possibility of visualizing and monitoring content. I'll breathe a moment and then get back to you. For the challenges, a major one is copyright and the restrictions on the public domain, then of course the difficulties of collaborating with a community, and also the challenge of scaling up and working with different skills. So the first element, the possibility of using the whole infrastructure, is particularly interesting for the humanities, arts and social sciences, because it allows you to really bring in research resources and data. And in the humanities and social sciences you also have a lot of qualitative data: you have interviews, photos, you have site explorations, you have artworks, you have content that comes from archives, for example. And on those resources you find the possibility to upload it. Also, working with OpenStreetMap allows you, for example, to enter data that Wikidata would not allow. So the combination of the two really allows you a broader work on those infrastructures. This is an example that comes from the upload of data from the Ticino region, a region in Switzerland.
The upload was done on Wikidata but also on OpenStreetMap, with the upload of images to Wikimedia Commons and the creation of articles on Wikipedia. The second element is related to heritage. So far, 97 nations have participated in the contest called Wiki Loves Monuments, and they have uploaded an incredible number of images, but they have also worked on creating one of the most incredible databases of heritage sites around the world. This content enriches the existing resources, but it can also be used to evaluate the existence of images and the presence of heritage in different countries. This is a visualization we've been working on that also allowed us to create research based on the analysis of those data. Another focus of the community is working on content coming from GLAMs. GLAM stands for Galleries, Libraries, Archives and Museums, so we're talking about the broad network of cultural institutions. Consider also that universities have libraries, collections, archives — so it's very strange how research institutions sometimes perceive the GLAMs as separate, and there is sometimes a great difficulty in bridging the two. A lot of research in the humanities and social sciences comes from those sources: you work on documents, on images, on collections, and this is really a centre of interest for researchers in those fields. And the Wikimedia projects, particularly after 2006-2008, have invested a lot of energy in encouraging institutions to become open access and to upload content to Wikimedia Commons, also with synergies with Wikidata. In Italy we did a project in which we contacted all museums. We created the best existing database of museums in Italy; it was done in collaboration with ICOM Italy, and we uploaded national statistics about museums.
So on Wikidata you can really access all the available data about museums, and museums in Italy are quite numerous, as you can imagine. They also started collaborating and opening up their content, to make sure that museums were engaged in checking their data and contributing with authorization — a topic that I will touch on shortly. We created a form that allowed a museum to upload an authorization for its content; in Italy there are restrictions even on the public domain. This form was also developed with Daniela Scasciatratte, who might be here — one of the developers — to facilitate institutional contribution to the projects, which is one of the problems with Wikimedia Commons: you need to be an individual to contribute to a Wikimedia project. So you need an external interface, or a system that associates with a user an authorization giving that user the authority to upload content for an institution. It is a step that is still missing. Those data allow you to produce research: you can monitor museums in a country, you can see if they have a person in charge of communication, what their collection is like, whether it's digitized or not. And here we enter the third positive aspect of the Wikimedia projects and OpenStreetMap: you can really visualize content in amazing ways. Visualizing content doesn't simply mean I have a statistic and I see what is there. It also means visualizing knowledge, because what is on Wikipedia and what is in Wikimedia Commons very often tells you what is available as knowledge — for example, images of heritage. In Italy we had a discussion with the Ministry of Culture because images of certain areas of Calabria were missing. So the community actually negotiated the data with the government, and then produced content that is now accessible also to the government.
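As a concrete illustration of querying those museum records, here is a minimal sketch against Wikidata's public SPARQL endpoint. It only builds the request URL, so it runs without network access; the item and property identifiers (`Q33506` for museum, `Q38` for Italy, `P31` instance of, `P17` country) are assumptions worth double-checking on Wikidata itself.

```python
from urllib.parse import urlencode

# Sketch: count museums located in Italy on Wikidata.
# Assumed identifiers: Q33506 = museum, Q38 = Italy,
# P31 = instance of (with P279 subclass closure), P17 = country.
QUERY = """
SELECT (COUNT(?museum) AS ?n) WHERE {
  ?museum wdt:P31/wdt:P279* wd:Q33506 ;
          wdt:P17 wd:Q38 .
}
"""

def sparql_url(query: str) -> str:
    """Build a GET request URL for the public Wikidata SPARQL endpoint."""
    return "https://query.wikidata.org/sparql?" + urlencode(
        {"query": query, "format": "json"}
    )

url = sparql_url(QUERY)
print(url)  # paste into a browser or fetch with any HTTP client
```

The same pattern extends to the monitoring use cases mentioned in the talk, e.g. adding an `OPTIONAL` clause for a collection-digitization property and grouping by region.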
So monitoring what is available there somehow provides an image of what is actually available on the internet, to anyone. Monitoring knowledge is also interesting if you are contributing to it and modifying it. If a museum or a researcher is improving content related to, say, the architect Paraviccini, you can really see how you made an impact on that knowledge. And it's quite incredible to visualize this impact, because normally impact in research is measured with completely different criteria. Improving knowledge is, of course, part of the mission of a museum — otherwise, if their mission were just to have a lot of visitors, they would offer beer, which would make it a bit easier. But it's really a way of changing the perspective on how you create research that is really available and visible. Now I'll breathe for a moment and then move to the challenges. Okay, the first one is copyright. This is an issue that is present in all humanities and social science research — obviously a very well-known challenge. You would expect that, for example, you take a photo of a monument and you upload it. Actually, things are a little more complicated than that. You need to identify what is heritage and what is not. You need the rights of the photographer, obviously, but there are other issues related to property and to the rights of the author of the building. If you live in a country that has freedom of panorama, you can take photos of everything you can see outside, so you're fine. But many, many countries do not have freedom of panorama, and it's a right that unfortunately was not made accessible to everyone with the European copyright law. So in those countries you need to ask the authorization of the architect, if they have not been dead for more than 70 years, or of the artist who produced an artwork. This is a layer of complexity.
Furthermore, there are layers of complexity added to public domain content. This is tricky; maybe it will change, because in theory with the European copyright law we are moving in the right direction. But in Italy you need to pay a fee for every commercial use of content in the public domain, and this is obviously a very complex block. Those restrictions create layers of complexity and make it more complicated, of course, to upload content to the Wikimedia projects, in particular because those projects really want content that is clearly open and accessible. I still have a lot of time, so I should relax — I want to make sure I tell you everything I might know. We did a project to explore the impact of culture on safety in Africa. We did it in three countries, and in Cameroon in particular we worked a lot with authorizations. Douala, Cameroon, has had a great production of public art since 1991: there are artworks disseminated across 13 neighbourhoods, and it's quite an incredible project, because they've been commissioning artworks from international and local artists, and you can see the transformation of the city through those artworks. So what we did: we uploaded images to Wikimedia Commons; we created data on Wikidata, connected of course to the categories; we created a list of artworks on Wikipedia in English and in French; and we uploaded text, because all the production of the research project was under CC BY-SA, with authorization. For every single institution and every single author we created a permission that was then sent to the system recording permissions for Wikimedia Commons. This allowed us to actually upload the content. Since it was done in Africa it was a bit more complicated, so we had printed forms that the artists would sign; we scanned them and sent them to the permission system, which recorded them and registered a ticket.
Of course, I took an example that is particularly complicated, because public art in Africa, with living artists and no freedom of panorama, is probably the worst you can get. But it's feasible — complicated, but possible. It does require a lot of changes in procedure, and the creation of processes that allow the upload of those authorizations and facilitate this connection between institutions and rights management. The second challenge is related to collaborating. Now, I don't know how many of you have contributed to Wikipedia. How many had their content deleted on Wikipedia? This is an experience that I think everyone... So, contributing to the Wikimedia projects is not easy. It's a little easier sometimes on Wikisource and Wikiquote, which I would recommend as a first step — if you want to go on holiday, those projects are quite fun. Wikivoyage can be challenging too. Those projects have a lot of rules and policies, and collaboration is never easy: everybody who collaborates knows the challenges of involving other people and of creating processes that are transparent. There are some specific rules of the projects that researchers need to pay particular attention to: of course, no original research — which is also an advantage for a researcher, because you quote the work of everyone and you source everything — conflict of interest, declaring why and how we are contributing, and the neutral point of view for the encyclopedia. And of course Wikipedia is an encyclopedia, so it doesn't provide space for everything. But it's true that for museums, cultural institutions, heritage, and for improving articles related to territories — which are very connected to topics of architecture and art — Wikipedia is perfectly suitable. You also need to consider how research in the humanities relates to dissemination.
Sometimes the boundary is between storing information and disseminating it. Scholars would sometimes like the way they store information to be beautiful and accessible, because it's something that might interest a broader public. So it's sometimes not sufficient to store a folder on Zenodo; you would like an interactive map that lets you see the building and gives access to all the attached documents. And the Wikimedia projects can somehow provide this infrastructure. The last issue is how to make this scalable. Of course, working on licensing, working on CC0 for data, is an issue. But the upload of content to the Wikimedia projects requires a certain expertise, and what I saw in the past is that very often projects worked when the community was involved — people who were already experts in those projects. So this joint work, and maybe also the model of the Wikipedian in Residence, could be an interesting approach for the Wikimedia projects and OpenStreetMap. Finally, I wanted to mention that I'm working on a landscape analysis of research infrastructures for the social sciences and humanities. I started on Meta-Wiki — that is where we always start. Here you find a list of research infrastructures, to make sure that Wikidata has those resources. But the truth is that at the moment there are two problems. The first one is that local infrastructures and collection databases are not connected, and they're not perceived as research infrastructures, because they are too small and don't have national relevance. Having the possibility of bridging those resources — and maybe Wikidata can really provide a landscape analysis of this — would be very valuable. Making sure that we know about them is also very useful, because those are resources that can nourish those websites.
And finally, there is the problem that government investment in research infrastructure normally focuses on implementing and maintaining the infrastructure, while populating the infrastructure is another issue. There's also going to be a presentation about OpenRefine, which is very important and relevant for this, because obviously you need tools that really allow you to nourish and connect those infrastructures. I'm done. Thank you. So now I'll stand here and take some questions. I told everybody I know not to ask questions. Yes. The question was whether we have an idea of how much data from research feeds Wikidata and is accessible on Wikidata. I think the 2019 study I mentioned might give some insight into it, although I think it was more focused on models rather than actual data. The data is sourced, so I presume it's information that is possible to view on Wikidata; that would be feasible. It's true that the taxonomy of properties — so the possibility of actually getting full access to the information — is not always obvious. Also, for research infrastructures, one of the challenges is that one thing is called a virtual library, another a digital library, another a repository; combining all those broader terms makes it a bit complicated to get a full picture. Thank you. Another question. Thank you, I enjoyed the talk. One thing I was wondering: there is a link with an earlier talk here, because I enjoyed the talk about an open infrastructure finder, and now you're talking about something similar. I wonder whether there is any dialogue at all. Yes. So the question was, if I understood correctly, how to connect the possibility of finding open repositories with this one.
It is important to note that a lot of libraries and existing repositories are already collaborating with Wikidata, so there is a desire. Europeana, which is one of the biggest repositories — a repository of GLAMs for open research — has a very strong collaboration, just to mention one of the most well known. There are lots of connections with repositories that provide information about researchers or papers; this is something that is implemented on Wikidata quite nicely. But it's true that in general the investments are not in something like Wikidata: investments go either to repositories by topic or to the national level, and I never saw an investment directed at Wikidata itself — it is rather in maybe creating some interconnection. Of course, I'm also here to stress this: I think we should collaborate more with Wikidata; that would be valuable, useful and efficient. So that's all, thanks. If you have any more questions, we welcome them. Thank you.
Unlocking Research Data Management with InvenioRDM
Thank you. Hello, everyone. My name is Karolina, and together with Havi today we will tell you how InvenioRDM is unlocking research data management. But before we start, I would like to ask you if you see any connection between these three images — is anyone able to answer that quiz? And Luisa says that you have three seconds to do so. Those are cats. Those are cats? No, no, sorry. So what about now? Sorry? Yes, you're close. So the connection between the images is CERN, actually, where the World Wide Web was invented. It's located in Switzerland, so fondue and chocolate. And the funny cat pictures of the internet exist thanks to the invention of the World Wide Web. But that's not the only thing we do at CERN. We are housing the biggest machine in the world, the Large Hadron Collider, and many more machines which the experiments are using. We are also sharing our knowledge and welcoming visitors, so if you are ever in Geneva, Switzerland, please pay us a visit. We do much more than only physics at CERN. We also do open source projects — like, for example, the World Wide Web, which was given back to the public. But that's not the only one, and this is what we are talking about today: it's Zenodo, which I have been told some of you know already, though you probably don't know InvenioRDM yet. But I will start with Zenodo. Zenodo is an all-purpose research repository where any researcher around the world can just go and store their research results for free. It is hosted at CERN, and will be for as long as CERN exists. So the question is, why do we need such a place? And this is the answer: crucial scientific data, many years of research work, can be lost. We don't want to allow this to happen ever again, so we provide a safe space for researchers to store data. But not only researchers — we also have an integration with GitHub.
So you can cite your software stored in GitHub. The advantage of storing it also in Zenodo is that GitHub allows you to delete your software, but it will be preserved in Zenodo. And we have received many questions about the platform — whether it's possible to take it and install it as-is in another institution. Up to a point, it was not possible. But we received so many questions that in the end we developed another platform, InvenioRDM, which is now the engine of Zenodo. Now it is possible to easily upgrade the software and install a new version, and we are basically supporting the underlying engine. So, Havi, if you were to characterize InvenioRDM with one word, what would you say? That's a good question. If I have to use one word, I would say that InvenioRDM is fair. And when we talk about the concept of fairness, I'd like to quote our former Director General, who once said: why do I like Zenodo? Because Zenodo is fair — fair in the sense of lower case and FAIR in the sense of upper case. The most conventional use of fairness, equitable or just, was already covered by the first part of the presentation. Now let's see how InvenioRDM embraces and promotes the FAIR principles — an acronym that stands for Findability, Accessibility, Interoperability, and Reusability. Starting with Findability: when we upload our research, one of the key things is that we want a link that we can be sure will resolve over time, that is not going to break. For that purpose we have DOIs. A DOI is a digital object identifier: a globally unique and persistent identifier. We encourage people to use their own DOI if they have one; otherwise one will be automatically generated and registered with DataCite. It's just as important to have good metadata.
That's why we adopted the DataCite metadata schema, which is a simple yet powerful format to describe nearly any research output: datasets, software, as she mentioned, journal papers, anything you can think of. And of course, to find all this data we need a good search engine, with capabilities such as filtering options, search suggestions, and a powerful query syntax that will allow you to find the data even without the identifier. These are key aspects not only for humans finding data, but also machines. Continuing with Accessibility: a very common use case is that we have our data and we want to keep it restricted, but we still want people to find it. For that purpose, you make your metadata public, and if people want to access the data, they have to request access via a simple form, and then you can choose whether to grant access or not. In the same way, you can also share different links with different permission levels, allowing people to view the record and its unpublished versions, or even edit it, to make collaboration easier. Now, about Interoperability: one key thing is to follow standards. That's why we follow the DataCite metadata schema I mentioned, which includes things like common vocabularies, so that we use the same concepts to describe data as other people and other machines do, and everyone understands it in the same way. Another important thing is that when we upload our work, we link it properly with other data that is also uploaded, and you can do that very easily as well. And as for how machines exchange data, we provide a strong REST API that allows you to build your own integrations on top of InvenioRDM, and we also have an integrated OAI-PMH server, which is a standard for how systems exchange data.
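Both machine-access routes just mentioned — the REST API and the OAI-PMH server — are plain HTTP. Here is a minimal sketch that only constructs the request URLs; the `zenodo.org` host and the `/oai2d` path are the ones Zenodo uses, and other InvenioRDM instances may expose different hostnames or OAI paths.

```python
from urllib.parse import urlencode

BASE = "https://zenodo.org"  # swap in another InvenioRDM instance's host

def search_url(query: str, size: int = 10) -> str:
    """REST API search over published records (GET /api/records)."""
    return f"{BASE}/api/records?" + urlencode({"q": query, "size": size})

def oai_pmh_url(metadata_prefix: str = "oai_dc") -> str:
    """OAI-PMH ListRecords request for bulk metadata harvesting.
    /oai2d is Zenodo's endpoint path; other instances may differ."""
    return f"{BASE}/oai2d?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    )

print(search_url("open science"))
print(oai_pmh_url())
```

Fetching either URL with any HTTP client returns JSON (REST) or Dublin Core XML (OAI-PMH) that downstream tools can harvest.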
As for Reusability, I think one of the key aspects is that when people use our work, we want them to cite it correctly. Here you have different styles of citation that will always include a DOI. The DOI is very important also to track the impact of your work. If you remember, she talked about software citation; we know that 85% of all software citations are on Zenodo. And of course, having clear licensing information is also important, so that people know how they can use your data and under what conditions. I want to stress the metadata again: having rich, comprehensive metadata is very important not only for people to reuse your data, but maybe also to reproduce it in the future. And since we are talking about reusability — do you think there is something else we can reuse? Yes, we can reuse the whole software entirely. These are examples of how InvenioRDM was reused by other institutions, by our partners, and as you can see, it's very customizable: those interfaces are very different from each other. So it's quite flexible, if you would like to join this sizable community, which is still growing — we have many partners around the world. If you would like to install an institutional repository at your institution, you can learn more about InvenioRDM under the QR code on the right side. You could also pass by our booth in building K, 2nd floor, and if you are a developer who would like to contribute to an open source project, you can check out our community on Discord as well — we answer questions, and you can see a growing community there. So thank you very much. Are there any questions? Thank you very much for the talk; I already know it, and I like it, by the way. But I have one specific question: you said you have plans to support the process with mixed licenses.
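The pieces just described — DOI, license, rich metadata — can be combined mechanically into a citation line. A sketch over a simplified, hypothetical record shape (the field names here are an assumption for illustration, not the exact DataCite schema, and the DOI is a placeholder):

```python
def cite(record: dict) -> str:
    """Render a minimal citation string from record metadata.
    The field names are a simplified assumption, not the exact
    DataCite metadata schema."""
    authors = "; ".join(record["creators"])
    return (f"{authors} ({record['year']}). {record['title']}. "
            f"https://doi.org/{record['doi']} [{record['license']}]")

record = {
    "creators": ["Doe, J.", "Roe, R."],
    "year": 2024,
    "title": "Example dataset",
    "doi": "10.5281/zenodo.0000000",  # placeholder DOI
    "license": "CC-BY-4.0",
}
print(cite(record))
```

Real citation styles (APA, BibTeX, and so on) are generated server-side by the repository, but they draw on exactly these fields, which is why the talk stresses rich metadata.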
Software is usually not just one license; there are SPDX expressions and things like that. Okay, I will just repeat the question for the stream. We were also told that it's like our repository — I think that's worth mentioning. So the question was whether we plan to support more licenses. I think we went very fast over this slide: there are already many standard licenses that you can find and that are available, but it can also be customized, so whatever license you need, you can add to the software. But if there are multiple licenses — say you have a data file under CC BY 4.0 and code under MIT — then you cannot simply say from the outside that this is only MIT or only CC BY; you need a list, or "CC BY AND MIT" or something. Okay, you mean if there are multiple values for the licenses attached to one record, do I understand correctly? If I remember correctly — you can have multiple licenses. Yes, you can have multiple licenses, but you cannot map them one to one: you cannot say this license is for this file and that license is for the metadata. That's not there. Okay, thank you. I think the next question is: if I archive software in Zenodo, for how long is it preserved? So the question was, for how long is the software preserved in Zenodo? The answer is: as long as we have a data center at CERN — as long as CERN exists. Okay, but what is the commitment of CERN, in terms of how long it will last? Well, in terms of contract — for now we say forever, but let's see what the future holds. We'll see if the sun goes out. Sorry? We'll see if the sun goes out. Yes. Hello. Sorry — compared to other data repositories at CERN, is it more specific to certain scientists' research? So I think the question is whether it's targeted at one area of research, is that what you meant? Yes. So it is not targeted, because, like we said, it's very reusable.
We have, for example, universities installing the software and keeping it as an institutional repository, and these universities differ in domain. It might be, for example, Northwestern University, and they host many domains; they do a lot of research. We also have installations at CERN — one is Zenodo and one an internal institutional repository, which we are in the process of migrating right now to upgrade the version of the software. There are many more usages, so it's not targeted at a single domain. Okay. Thank you.
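The mixed-license situation raised in the Q&A — a CC BY 4.0 data file next to MIT-licensed code in one record — can be sketched as a per-file mapping composed into a single SPDX expression. This is a hypothetical data model for illustration, not a feature InvenioRDM offers (as the speakers noted, per-file license mapping is not there today):

```python
def combined_spdx(file_licenses: dict) -> str:
    """Compose one conjunctive SPDX expression from per-file
    license IDs. Hypothetical helper: InvenioRDM currently
    attaches licenses to the record, not to individual files."""
    unique = sorted(set(file_licenses.values()))
    return " AND ".join(unique)

files = {"data.csv": "CC-BY-4.0", "analysis.py": "MIT"}
print(combined_spdx(files))  # CC-BY-4.0 AND MIT
```

Such an expression at least tells a reuser "all of these licenses apply somewhere in this record", which is the honest summary when a one-to-one mapping is unavailable.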
Making OpenRefine more reproducible
Okay. So we welcome Antonin Delpeuch, if I'm correct. And yeah, the floor is yours. Thank you. So I'm Antonin Delpeuch, I'm a developer on the OpenRefine project, and I'm very happy to be back in this devroom to give you a few news about OpenRefine. In particular, I'm going to focus on what I'm working on right now, together with Zoe Cooper, a designer on the project, to make OpenRefine more reproducible. I will first explain what OpenRefine is, because I'm not assuming everyone was here four years ago. And if you were, don't worry — there are some differences that you might be able to spot, and I'm very keen to know if those differences look good to you. I'll also explain what I mean by reproducible in this context. So what is OpenRefine? It's a data cleaning tool. You can import tabular data, mostly, into it, and then it lets you do all sorts of cleaning operations on it. Let me give you an example. This is a database of filming locations in Paris — every time you film something in Paris, you need to register it with the city, and then they publish this dataset. One thing I can do here is say: let's match all of those films with an external database. We call that reconciliation. In this example, I'm going to reconcile against Wikidata, which we've already heard about earlier today. And because reconciliation is a bit of a tricky process, we have various options to let you configure how we're going to match your data to Wikidata, so that we don't rely only on the names, but also on other attributes that we have in this dataset. We then have various tools to help make that a little more efficient and to let you review the results of the reconciliation manually. For instance, here I can hover over this and get a link to the Wikidata item it could link to. So that's a sample of one type of operation that people do a lot with OpenRefine.
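Under the hood, reconciliation as demonstrated here is backed by a small HTTP protocol, the Reconciliation Service API, where a batch of queries is POSTed to a service such as Wikidata's reconciliation endpoint. A sketch that only constructs the query payload (the type identifier `Q11424`, assumed here to mean "film" on Wikidata, is worth verifying):

```python
import json

def recon_queries(names, type_id="Q11424"):
    """Build a reconciliation batch in the shape used by the
    Reconciliation Service API: one entry per candidate name,
    keyed "q0", "q1", ... type_id Q11424 (film) is an assumed
    Wikidata identifier."""
    return {
        f"q{i}": {"query": name, "type": type_id, "limit": 3}
        for i, name in enumerate(names)
    }

# The batch is sent as a form field named "queries" containing JSON.
payload = {"queries": json.dumps(recon_queries(["Amélie", "La Haine"]))}
print(payload["queries"])
```

The service replies with ranked candidate entities per query, which is what OpenRefine surfaces as the match suggestions you review manually.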
You can then manually match things if you want to go through the entire dataset yourself. Let me show you something else. First, once you've done this reconciliation, you can pull some data from the target database. In this example I could, for instance, do something quite simple: let's just add a new column with the URLs of those entities in the database. That's something I can do quite quickly, and you get your new column. You could also pull more information from Wikidata — identifiers in other databases, things like that. Let me show you another sort of operation you can do in OpenRefine. This is the column with the directors of those films, and I can try to cluster them. What does that mean? Well, we basically look through all the values in this dataset and try to detect whether they might refer to the same entity. When that's the case, you often want to normalize them to one consistent spelling — that's very useful, typically, as a first step before reconciliation. So these are samples of the canonical values you could use; let's say I want to take all of those suggestions and accept them as valid clusters. Okay. So those are the sorts of things you can do in OpenRefine. Now, what do I mean by making this tool more reproducible? Imagine you're a researcher working on some data that you've collected. You're cleaning it with OpenRefine as part of your research process, and at the end you want to publish a paper about what you did and make your research process transparent. You want your fellow researchers to be able to inspect what you've done in OpenRefine, and ideally even reproduce it on a similar version of the dataset. So what can we do for now? The best thing we have for this so far is our Undo/Redo tab. As you can imagine, it's primarily designed for undoing things, but it also happens to list all of the operations you've done so far with OpenRefine.
So you could try to copy and paste this into your research article as a way of saying: this is what I did. Now, this is not exactly ideal, so we are working on improving this part of the tool. Before we even get to reproducibility per se, there are already a lot of usability issues with this interface, and that's where it's been very interesting to work with a designer on this project who was not familiar with the tool before she came on board. She was really able to come with a fresh eye and identify things that I couldn't see anymore, because I've been looking at this for so many years. For instance, it might not be clear to everyone that you can actually click on those previous steps to go back to them. We don't have any undo button in OpenRefine — we only have this weird Undo/Redo tab where you can't really click on "undo" or "redo", things like this. So it's been really eye-opening. What else can you not do? Say I realize that this match here was wrong and I want to undo just this operation, but keep all of the following ones. There's no good workflow for that, but it's very often requested. Now let me show you what we can do with the Extract and Apply buttons here. I'm going to roll back here, and if I click Extract, I get this interface where I can select the operations I'm interested in, and then I get some code for them. This big blob of JSON is something I can copy and share as the representation of those operations, and I can also reapply them later on this project or another one. Now, the problem is that it's very hard to work with this representation. It's very unreadable, and it's also very brittle: for instance, if the column names of your new dataset do not exactly match the columns of the original dataset, you will get horrible errors and it will be very hard to do anything with those operations.
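The JSON blob produced by Extract is a list of operation objects. A sketch of what a single entry can look like — the field names follow the general shape of OpenRefine's operation history for a text transform, but treat the exact keys and values as an approximation that may differ between versions:

```python
import json

# Approximate shape of one OpenRefine history operation;
# exact keys may vary between OpenRefine versions.
operation = {
    "op": "core/text-transform",
    "engineConfig": {"mode": "row-based", "facets": []},
    "columnName": "Director",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": False,
    "repeatCount": 10,
}

blob = json.dumps([operation], indent=2)
print(blob)  # this is the kind of JSON you Extract and later Apply
```

Note how the `columnName` is baked into the operation — which is exactly the brittleness described in the talk: rename the column in a new dataset and the replay breaks.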
So that's the core of what we're trying to solve: providing a better representation for those operations, so that you can understand what they are and also reapply them reliably. As a summary, the main goals of this project are: make the basic undo/redo functionality more usable; make reproducibility easier and effective, because we want those representations of operations to be reliably applicable; and add advanced undo functionality — undoing not just the latest operation, or modifying the parameters of an earlier operation. Those are the main goals. And what do we have so far? Well, you might have already noticed some differences in this prototype, but let me show you another one. So far I've been working on making OpenRefine operations aware of which parts of the dataset they modify. The problem is, if you want to let people undo an operation deep in the history, you need to be able to detect which of the following operations can be kept, or whether they need to be recomputed because the data they were working on has been touched. Now that we have this capability of scoping operations a little better, you can, for instance, run reconciliation on multiple columns and it will run concurrently, which is something that wasn't possible before. You see the reconciliation I started earlier is only 7% complete — it's a very slow operation, and Wikidata reconciliation is particularly slow — and I can already start reconciling the other column. As you can see, we already get some results, although the first one hasn't completed yet. So that's already one win. It's not directly about reproducibility, but I hope this will be welcomed by users, because it should save people a lot of time. On top of that, we've done some research on how other tools represent pipelines or their undo/redo functionality. This is a screenshot from Talend, another data cleaning tool we've been looking at.
And in those sorts of data cleaning tools, you design your pipeline explicitly on a canvas. So it's a very different sort of user experience. But we've also been looking at Excel, how they let you track changes, or basic undo/redo functionality in Google Sheets, things like that. So that's been also very interesting in trying to get some sort of user experience that our users are already familiar with. So as you can see, this is all work in progress. This is what I have just here, a prototype. We don't have full answers to all of those questions yet. But we're working on this, and we are very keen to hear from you. So if you're interested in those topics and would be happy to test out some ideas with us, we're running some user testing sessions. So you're very welcome to sign up for those. And that's basically the state of the project. And I also have some OpenRefine stickers if you happen to organize some training events in various places. So do also get back to me if you want some. Thank you. Thank you. We can maybe take one question. Thank you for the presentation. So it's an interesting piece of software. But what exactly is the target audience? Because I mean at some point, if you have the data and a script, that does the job. I mean, don't get me wrong, it's interesting. But just to know who exactly you are targeting. So the question is, what is the target audience of OpenRefine? So it's a broad range of communities. I would say it's generally suited for tasks where you can't really just write a script upfront which will do your cleaning. And it's not really about whether you like programming or not. It's just some tasks where you need to be looking at the data while you're doing the cleaning. As you saw with reconciliation, it's a messy thing. You can't really just come up with the parameters and make the matching. You need to be looking at the data. Same for clustering.
So it's a mixture of interactive data cleaning and a little bit more automation than you would have in Excel. So basically the real point here is the point-and-click aspect of the operations for the user. Let's thank Antonin again.
Qadence - A library for Digital Analog Quantum Computing
All right, folks, we're going to start. David, it's you. Hello. Hi. I'm David, or Yoric. I work at Pasqal. I'm going to tell you a few more words about that in a minute. And I am here to tell you about an ongoing work at Pasqal called Qadence. And as you can guess from the name and possibly from the logo, it's related to quantum computing. So before I proceed, I would like to stress one thing: none of the things I'm going to tell you about are my work. For one thing, I joined Pasqal recently, with a background in programming language theory, compilers and things like this. And this project has not reached the stage where we can use programming language theory or compilers just yet, but maybe someday. So a few words about Pasqal. What do we do? We build qubits. More generally, we build quantum computers. We build quantum algorithms. We build quantum tools. We build quantum teaching materials. I forgot to mention we are a private company, but we are a spin-off from several laboratories, so there is a strong research background at Pasqal. And importantly for today, we build open source tools related to quantum computing. And if you're interested in knowing what the inside of a quantum computer looks like, well, that's part of the inside of one of ours. I think this one is called Fresnel, but I'm not sure. You can see lots of lenses, which suggests that lots of lasers are involved. Yes, lots of lasers are involved. We're not generally allowed in this room because of the class 4 lasers. Way too dangerous. Still, cool to have. So if you're like me, you might have a question: what the heck is quantum computing? I mean, we all hear about it. A little bit. Well, I hear about it every day, but I'm paid for that. But we hear about it in mass media and everywhere on LinkedIn, etc. It's still not clear. At least it wasn't clear to me. It might still not be entirely clear yet what quantum computing is all about.
So the first thing is: quantum computing is about computing with qubits, not with bits. An important part of it is that quantum computing is very much research. You may have seen many announcements, each of them informing us nicely that the last few problems in quantum computing have been solved. I'm sure that we are going to keep seeing these announcements for the next 5 to 10 years. Quantum computing is currently a very active research domain, but it's a research domain. And while there are companies that are actually building quantum hardware, we are not there yet. It's not something you can buy at the local shop, or even if you go further down the road. And it's probably going to be a few years before we can do anything really useful, except in a few domains, which I'm going to mention a bit later, with quantum computers. Still, it's extremely exciting. And when I say it's open research: it's open research for the hardware, it's open research for algorithms. And these algorithms most of the time are designed based on mathematical models of quantum computing. There are a few algorithms, but not many, that actually run on quantum hardware. And there is lots of research on compilers and tools, but again based on mathematical models, usually, and simulators. Lots of hype too on quantum computing. On the upside, it means lots of credit for quantum computing, lots of funding, which is why companies such as Pasqal and a few others can do their work. It's also thanks to this that a number of academic laboratories can do their work. And it's a good time to be working on quantum in general and quantum computing in particular. It makes things a bit complicated when you have to read a press release, though: it's a bit hard to understand whether the new problem that has been solved on a mathematical model has been reproduced in labs or is actually ready to go into production. Why do we care about quantum computing?
Well, we do care about quantum physics in computing anyway, because CPUs need to deal with quantum phenomena on a daily basis. One of the reasons why we cannot make CPUs that are much faster anymore is that we have hit some physical limits. I'm not exactly sure which ones, I'm not a physicist, but they exist. So we want to go for the next generations of hardware, and at some point you can either continue fighting quantum physics or try to embrace it. So that's one of the reasons. Another reason is that there are hopes that quantum computing will be faster. I mention hopes because, despite some papers, including a famous paper by Google a few years ago, we don't know yet. There are good reasons to be hopeful that for some classes of algorithms we will have something very fast, but we're not sure yet. Similarly, we hope that we can be energy efficient. I'm going to show you some algorithms later during this presentation. And there are good reasons to be hopeful that we could possibly someday replace entire data centers working on very specific algorithms with something much smaller. Again, this needs to be validated in labs and on industrial hardware. We're not quite there yet. And also simply because we don't know how to build new hardware at the moment. If you look at what's needed to train ChatGPT, or at least an old version of ChatGPT, I assume it's worse now: if I recall correctly, they were using 10,000 boards, each of them carrying I don't know how many GPUs, each of them carrying I don't know how many cores, for the training part. And I don't know how long training lasts. So how we do it at the moment is we expend as many resources as we can, which is not something that can last forever. Again. So I mentioned bits: 0, 1, easy. Qubits: three-dimensional, more complicated.
Plus you have the question of whether the qubits are 0 or 1, which is a complicated phenomenon, its measurement, and I'm starting to have a few intuitions about it, which probably means that I'm wrong. So there are two flavors of quantum computing. The first flavor is digital quantum computing. This is a program in digital quantum computing. If you look at it, you'll see something that looks very much like a circuit. Well, that's why it's called a digital circuit. You have quantum data coming from the left, conveniently. All these rx, ry, rz are gates, which operate on the qubit; all the ones prefixed with r are rotations on the sphere. The x and z (and I could have had y too) are symmetries on the sphere. There are other gates, but these are the ones that I had an example to use. And at the end, you might be able to do some measurement, and in practice, you'll have to run your experiment many times, because what you end up with is probabilities. So you need to measure probabilities by taking pictures, essentially, which means you have to take many pictures. So as I mentioned, a program is a circuit. And for almost 10 years, I think, there have been programming languages designed to create those circuits, or at least to give a syntax to the circuits and possibly to do modeling and simulation on those circuits. But the big snag is the hardware isn't there yet. One of the big difficulties that digital has is noise. I know it's not the only difficulty, but that's the one I remember, which is already good for me. Again, I'm coming from a different field; adapting is complicated. On the other side, you have analog programs. This is an analog program. This is actually part, I believe, of the test suite of one of our computers. So the test here is: hey, can we make a program that looks like our logo? Needless to say, it's probably not a very useful program.
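To make the rx/ry/rz gates and the "many pictures" measurement step a bit more concrete, here is a tiny plain-Python sketch (not Qadence) of an RX rotation acting on a single qubit that starts in |0>, and the measurement probabilities you would estimate by repeated sampling:

```python
import math

def rx(theta):
    """2x2 unitary matrix for a rotation of angle theta around the X axis."""
    c = math.cos(theta / 2)
    s = -1j * math.sin(theta / 2)
    return [[c, s], [s, c]]

def apply(gate, state):
    """Apply a 2x2 gate to a single-qubit state [amplitude_of_0, amplitude_of_1]."""
    a, b = state
    return [gate[0][0] * a + gate[0][1] * b,
            gate[1][0] * a + gate[1][1] * b]

def probabilities(state):
    """Probabilities of measuring 0 or 1: squared magnitudes of the amplitudes."""
    return [abs(amp) ** 2 for amp in state]

state = [1.0, 0.0]                     # the qubit starts in |0>
state = apply(rx(math.pi / 2), state)  # quarter turn around the X axis
p0, p1 = probabilities(state)
print(round(p0, 3), round(p1, 3))      # a pi/2 rotation gives 0.5 0.5
```

On hardware, each run returns a single 0 or 1; the probabilities are recovered by repeating the experiment, which is the "taking many pictures" the talk describes.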
But we need to manipulate things at a very fine level. So in practice, when you're dealing with analog, a program is not a circuit, although it's also called a circuit, and some parts of it we will model as a circuit. But in practice, it's geometry and pulses. It might be different for other kinds of hardware support, but I think the ideas are generally the same. When I say pulses, I mean laser pulses, so you have to set up a frequency, a shape, and things like that, which is a bit complicated. I'm not going to claim that I have any understanding of how it works. And why do we care? Well, there are two reasons. One of them is that this actually takes advantage of the hardware. It maps extremely naturally to hardware constraints and to some classes of problems. So off the top of my head, there are a number of graph algorithms that map very naturally to this. I showed you a two-dimensional representation, but it could also be three-dimensional. And so graph algorithms, a number of optimization algorithms. I'm going to show you a little bit of an example later. And if we have a problem that maps naturally to an analog circuit, the big advantage is that this is something that you can mostly run today on some machines. Not everything can be run, but we're much closer to this than in the digital world. And one thing I should mention: if you are familiar with the history of computing, well, every computer nowadays is digital, but before World War II, there were already many computers and they were pretty much all analog. So if you look at the battleships of the UK, US, French, and German navies, they all had onboard computers that were electromechanical and that were used for aiming precisely. So they were computing ballistic trajectories. It worked before we knew how to do digital, and it worked because this specific problem that they wanted to solve had a very nice physical, electromechanical representation. In the end, they disappeared.
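As a rough illustration of "geometry and pulses", an analog program can be thought of as atom coordinates plus a laser pulse sequence. All the names below are invented for illustration and are not a real API; real tools (for instance Pasqal's Pulser library) describe pulses with similar parameters such as duration, amplitude, and detuning:

```python
from dataclasses import dataclass

@dataclass
class Pulse:
    # Field names are illustrative, not any library's actual schema.
    duration_ns: float   # how long the laser is on
    amplitude: float     # drive strength (Rabi frequency)
    detuning: float      # offset from the atomic transition frequency

@dataclass
class AnalogProgram:
    atom_positions: list  # (x, y) coordinates: the "geometry"
    pulses: list          # the global pulse sequence

program = AnalogProgram(
    atom_positions=[(0.0, 0.0), (5.0, 0.0), (2.5, 4.3)],  # a triangle of atoms
    pulses=[Pulse(duration_ns=200, amplitude=1.0, detuning=0.0)],
)
print(len(program.atom_positions), len(program.pulses))
```

The point of the sketch is only that the "source code" of an analog program is physical: where the atoms sit determines how they interact, and the pulse parameters determine what the laser does to them.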
It took a few decades for them to disappear, replaced by digital, because digital was so much more generic, but it took lots of time for digital to catch up with analog. So this justifies why we are interested not just in the digital, which is going to be much easier to program once it works, but also in the analog, which might give much better results in some specific cases and which is much closer to being actually something that we can use. Of course, the problem is: how do you program that? I mean, that logo was not very intuitive. Well, it's easy. Well, no, really. And I apparently accidentally removed one of my slides, which was a big differential equation, which showed on one side the interactions between atoms and on the other side the interactions with the laser itself. I have no idea how someone can go from this differential equation to actually writing an algorithm, but some people succeed and they have my complete admiration. Anyway, that's why we, and when I say we again, I mean they, have devised Qadence. Qadence is a toolkit. It's designed for experimenting. You can experiment both with digital circuits and with analog circuits. You can mix them. Once you have written your circuit, you can simulate or execute it. When I say simulate (the word is a bit overloaded, but simulate), I mean an emulator running on your CPU or GPU that's going to pretend that it's doing quantum physics, usually at a fairly deep level. You can pick a level. Or execute: well, if you end up in the subset that actually runs on the machine, the one you need big glasses and to be very careful to look at, the one we have in the basement. We have a few of them. They're not really in the basement, but we do have them. So if you end up with this, you can compile your program to essentially a sequence of laser pulses and then send the laser pulses to the computer for execution. We do that because there are many experiments that still remain to be done. We're not quite there yet.
One of the reasons, and I'm putting it first because that's the one I'm most interested in, but it's not necessarily the main reason, is that this is the kind of thing that can help us find out how to design a programming language that is both usable, ideally by human beings, and also executable on the hardware, which is something that doesn't really exist at the moment. Another thing is, even without that, just having some abstractions on top of laser pulses, for instance, we have libraries of geometries; well, that makes life easier when you don't have to actually solve that differential equation all the time. An interesting aspect of simulating and executing circuits is that we can run optimizations, for at least two different meanings of optimization. One of them is how we deal with noise. Noise is a big problem with quantum computing: if you put your atoms too close to each other, they're going to interact; if you put them too far away from each other, they're not going to interact. How do you send exactly the data you want, and not the data you don't want, from one to the other? So that's the kind of thing we can simulate using Qadence or lower-level tools, or possibly other tools, but anyway. And the other thing is something I'm going to show you very soon; again, it still might work. So at some point, I assume that some people will ask questions; don't be surprised if my answer is: I have no clue. Okay, so let's look at a few demos. So this is an example of a graph. Let's re... yeah, okay, this is a random graph. We want to solve the MaxCut problem. It's a well-known problem in graph theory. The detail is not extremely important. We want to find the best places to cut the graph according to some criteria. So this can be reformulated as maximizing this value. And someone, I was sure I had written my sources somewhere... okay, so someone has devised an algorithm to do that. Sorry, I didn't sort my sources.
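For a graph this small, the MaxCut problem the demo optimizes can be checked classically by brute force: try every way of splitting the nodes into two sides and count the edges that cross the cut. A minimal sketch (the graph here is made up, not the one from the demo):

```python
from itertools import product

def max_cut(n_nodes, edges):
    """Brute-force MaxCut: try every 0/1 assignment of nodes to the two
    sides and count edges whose endpoints land on different sides."""
    best_value, best_assignment = -1, None
    for assignment in product([0, 1], repeat=n_nodes):
        value = sum(1 for u, v in edges if assignment[u] != assignment[v])
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment

# A small example graph: 4 nodes, 5 edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
value, cut = max_cut(4, edges)
print(value, cut)
```

The quantum algorithm in the demo instead shapes the circuit so that good cuts become the most likely measurement outcomes; brute force like this only works for tiny graphs, since the number of assignments doubles with every added node.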
So this starts by waiting... yes, after the wait. So we derive a circuit from the graph. So there are as many nodes as edges, if I recall correctly. And we do a number of operations whose objective is to eventually make some configurations more likely than others. So I couldn't tell you exactly how it works. Many operations, many, many operations. Yeah, and in the end, we can measure stuff. So once we have this, we can represent the quantity we want to maximize as an optimization problem for one of the many different... what? Okay. Demo effect. Hop. And so this code is basically PyTorch, for people who are familiar with PyTorch. And then we can run what we call training in that case. So we can run the optimization problem. So what we're going to do is iterate. So there is a theorem in the paper, which I forgot to cite, that shows that this computation is eventually going to converge. There's no guarantee that it's done after 100 iterations. But in practice, for a demo, it seems to work. And if we pick the configuration that was most likely, again, there is this problem with the cat which might or might not get out of the box. If we pick the configuration that is most likely, it happens to map to the solution that we're looking for. And here, so we need to cut in such a way that something, something. I don't remember exactly how to read this result. But the interesting part is: hey, quantum algorithm, give me the grants. So that was a digital algorithm. I'm going to show you something that has a very similar outline. We want to fit a curve. So we're just going to take the curve x maps to x², and see if we can teach a quantum circuit to basically represent this curve. For this, we're going to use an ansatz-based quantum learning algorithm, which exists. And basically, we're going to try and optimize a number of parameters, a number of angles here, and see what we can do.
So again, let's finish our circuit. What is going on? It was working this morning. Yes. Yes, no more error messages. Okay. Okay, so this is the initial state of our quantum circuit. The dots are the samples that we want to approximate. And the curve is the initial result. As you can see, it's not exactly a perfect match just yet. So we're going to run a few steps of the learning algorithm. So this one is just pure PyTorch, just regular optimization. And usually it works. Normally it works. I'm going to pretend that it has worked and I'm going to press start. Yep. What the? Yeah. All right. So after a few steps of learning, this is what we get. We have an orange curve that, while not absolutely perfect, actually matches the blue dots fairly well. So okay, it's not time to call the Nobel committee for that. But this has applications. Of course, this is a very simple example for a very simple curve that we want to fit. But if you look at it with a little tolerance for approximations, this is kind of the thing that neural networks are doing. The learning phase is something kind of like this. In fact, there is an entire subdomain of quantum computing: quantum machine learning. And this is, I believe, one of the simplest algorithms of quantum machine learning. If you look at the API documentation of Qadence, you will actually see a QNN module: quantum neural networks. And this is a very active subfield of an already very active field. Because if the models we have of energy use and computational power are correct, this means that hopefully we could replace those tens of thousands of cores used by ChatGPT, or whatever its competitors are named, with something that consumes way less energy and hopefully runs at least as fast. So, time to reach conclusions. What do we have?
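The training loop in the curve-fitting demo can be mimicked classically: replace the parameterized quantum circuit with a small parameterized function and run plain gradient descent on the squared error. This hand-rolled sketch (no PyTorch, no Qadence) fits samples of x maps to x²:

```python
# Training data: samples of the curve x -> x^2 on [-1, 1].
xs = [i / 10 for i in range(-10, 11)]
ys = [x * x for x in xs]

# Model: f(x) = a*x^2 + b*x + c, with parameters starting at arbitrary values.
a, b, c = 0.0, 0.5, 0.5
lr = 0.1  # learning rate

for step in range(2000):
    # Gradients of the mean squared error with respect to a, b, c.
    ga = gb = gc = 0.0
    for x, y in zip(xs, ys):
        err = (a * x * x + b * x + c) - y
        ga += 2 * err * x * x
        gb += 2 * err * x
        gc += 2 * err
    n = len(xs)
    a -= lr * ga / n
    b -= lr * gb / n
    c -= lr * gc / n

print(round(a, 2), round(b, 2), round(c, 2))  # should approach 1.0, 0.0, 0.0
```

In the quantum version, the parameters are rotation angles in the circuit rather than polynomial coefficients, but the outer loop (evaluate, compute the error against the samples, nudge the parameters) is the same shape.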
We have a toolkit designed for exploring the design of quantum circuits, both on hardware that already exists, on hardware that we believe is possible and might come into labs or out of labs within the next five years, and on purely hypothetical hardware, because why not? Experiments are interesting. We have this mechanism of circuit optimization, which I've shown you. I showed you how it could be used to solve problems or to approximate curves. It also has other applications, such as the problem of noise. I mentioned noise between atoms, for instance. Sometimes you want to optimize based on noise models and make your things work, because you know that your model isn't perfect, or at least your high-level model isn't perfect, and you want to go to a lower-level model. And again, it's not a programming language, but I hope that maybe someday it could serve as the beginning of one. There is ongoing work on enriching everything: writing libraries for domain-specific problems, for known algorithms, for geometries, etc. There are many questions. There is ongoing work on compilation, on the subset that we already know how to compile and on larger subsets. And of course, we're trying to make this easier to program. And when I say we, of course, I mean them. There was a paper recently accepted and presented at PLanQC. If you're interested, it's on the last line here. And all the documentation and the source code are on GitHub. So thank you for listening. APPLAUSE We have like four minutes for questions, my friends. Sorry, can you repeat that? Was there any attempt to implement the circuits that you mentioned on an actual problem? Let me repeat the question for the mic. Yes, the question is whether these particular circuits have been implemented on hardware. The answer is: I have no idea, I'm sorry. LAUGHTER I believe... No, sorry. I'm not going to say random crap. I don't know. Right now, the main use case is experimenting with this.
But again, for the second algorithm, for instance, if we can manage to make it scale to very large... to a large number of curves and more complicated curves, there is a potential application to basically machine learning in general, not just artificial intelligence. And for the former one, I can't think of any specific example, but I know that graph algorithms are very interesting for many things, because, for one thing, there are good reasons to believe that they can be executed on existing or almost-existing hardware. And there are many important problems that can be modeled as graph algorithms. For instance, we are in an energy crisis at the moment, and all the energy distribution problems, for instance, are graph problems. I've heard of people who want to work on it. I have no idea whether they actually work on it. Also, for modeling car circulation in cities, things like that. I couldn't tell you more than that. Okay, I think we should stop here. Thank you very much. Thank you very much.
Welcome to the EU Policy Workshop Devroom
So good morning everybody, welcome to the Open Source in the European Legislative Landscape devroom. I have a confession to make, which is that we applied for this devroom two days before the closing deadline, and we have made it up as we went along after unexpectedly being awarded a devroom. So the whole day is very organic, but it has a very important purpose. We've discovered that the European Union has noticed that devices contain software and that the software needs regulating. And they have started doing an amazingly effective job at writing software into regulation. So one of the people we have with us today, Benjamin Bögel, wherever Benjamin is; I know he's here but he's hiding. He was involved in writing the NIS 2 directive and then he went on to write the CRA, and he is surprisingly expert if you have a low opinion of EU policy officers, or unsurprisingly expert if you know that they're all generally brilliant people. However, we discovered that the EU's model of what open source is, is that it is low-quality components full of defects that are created by hobbyists in their basements. And the regulations rather reflected that. And so we found over the last year that it was very valuable to engage with the regulators. Today what we want to do is not talk about the technical details of any regulations, but rather gather the feedback of the open source community, so that we can document the reflections and outlooks of the community for the benefit of the Commission as they go forward in regulating within their digital agenda. So we've arranged for there to be four workshops today. The first workshop, starting in six minutes, is a workshop on the consequences of the Cyber Resilience Act and the Product Liability Directive. Then the second workshop, which starts at 11.15, is going to look at how we engage with policy makers as a FOSS community.
The third workshop, which is at 1.20, is going to look at how we can assist in getting more free and open source software in use by public administrations. And the fourth workshop is going to look at how the free and open source community can come alongside the task force that is implementing the DMA and the DSA and promote interoperability, given that the best path to interoperability is not standards but rather the implementation of standards in shared open source packages. So that's our agenda for today. We have some ground rules that you'll see again during the day. First of all, we encourage you, if you are like me and you talk a lot, to maybe talk less and to encourage and leave space for other people to express their opinions. We encourage you to always be holding the microphone when you speak in a session where notes are being taken, and that is all of them, because today we have four rapporteurs, one for each of the workshops. The rapporteurs will be listening to what's said, noting down the substance and writing a written report for us to send to the Commission after the workshop. When you do start speaking, please make sure every time that you indicate who you are and, if you have an affiliation, what your affiliation is. Please note that this is a very complex topic, and we know that it's a very complex topic, so please be open to new ideas. When we run into an intractable problem, let's note it and move on to something we can fix rather than obsess about the obstacle. And finally, there's two ways of looking at this: please observe the FOSDEM code of conduct, or, if you prefer, let's have fun and make new friends.
CRA & PLD: [begin workshop] How will the open-source community adapt to the new EU Cyber Resilience Act and Product Liability Directive
I'd like to hand over to the chair of the first panel, which is Maarten Aertsen from NLnet Labs, who is going to lead what we do next. Maarten. Thank you, Simon. So welcome to the first block of the day, which is about the CRA and PLD. You just heard from Simon how the structure generally works. I will say a couple of words about how this block will work right now. So an important person during this session will be our rapporteur, who will be writing down all the things that the speakers say, but also perhaps the things that you will bring in, because the idea of these sessions is to actually have some interaction. So for this session, that will be Merco. Merco will be our rapporteur, and at the end of the block, he will summarize what he learned today. So for the agenda of this particular block: we will have two lightning talks. We will have a panel, a workshop bit where you can actually do something yourself, if you haven't already, by asking questions. We will have a third lightning talk, and then we will close with the rapporteur's summary. So that's our agenda until about 11.15.
CRA: 40 new ways the CRA can accidentally harm open source
Hi, so my name is Tobie Langel. I run a small consulting firm based in Geneva, Switzerland. And I have kind of straddled open source and standards throughout my career. So people thought it was a good idea to bring me in to talk about this. So this lightning talk is called 40 New Ways the CRA Can Accidentally Harm Open Source. And that of course references the 40-plus harmonized standards that are going to be written in the next couple of years to essentially make it possible to implement the CRA. So the first thing I want to say is: the CRA has landed. It could have been really, really bad. A lot of us were really, really concerned. And it turns out that it isn't. The first thing is, the open source community rose to the occasion. And I think that's really amazing and it was beautiful to see. A lot of people put a lot of work in, and I think we should all be very thankful for the work they have put into helping us. And then also, policy makers actually paid attention, listened, and considered the input from the community. And also for this, I think we ought to be really thankful. So thank you to both sides for making this happen. In the process, we avoided harming open source pretty seriously. And we also avoided harming the EU's ability to leverage open source, which was another one of the potential risks of the original versions of the CRA. So we do now have a lot more clarity. There's an asterisk there because lots of people still have lots of questions, myself included. My key takeaway from the last version of the CRA is that the responsibility falls in the right place, i.e., with the people monetizing open source, the companies monetizing open source. So for me, this is really important, and it's great that this is spelled out really clearly in the last version.
And then the other thing that I thought was really interesting is the open source stewards, this new notion of open source stewards, which really institutionalizes the foundations that have been playing an important role in our space. And it's also, I believe, a really smart instrument for the EU's ambitions around sovereign tech. That said, it's going to have industry- and ecosystem-wide impact. I think companies will be a lot more cautious. I will certainly advise my clients to be more cautious. And a lot of projects will move to foundations, and I think they will do so earlier. And then the conformance requirements are going to climb up the dependency tree. And so essentially, I'm suspecting that pretty quickly most of the ecosystem will actually be subject to some parts of the CRA, probably the lighter version that is for open source stewards. And I do have a question, which is this: this is going to create a lot of financial and work overhead, and I'm still kind of wondering who's going to be paying for it. So I think this is a question that will need to be dug into a little more in the future. So to meet the CRA, there are essentially going to be two options. Either you demonstrate conformity by yourself, so the burden of proof is on you, or you essentially follow a set of standards, the harmonized standards, and this is going to provide presumption of conformity. So in fact, the standards are going to be how the CRA impacts open source, because that's what everyone's going to do: essentially follow the standards so that they can be presumed to be conformant. And so 40-plus standards, that's 40-plus ways things can go wrong. If you believe that the standardization process is less opaque, easier, or more open-source-community-friendly than policymaking, I have bad news for you. And so essentially the same kinds of misunderstandings, the same kinds of risks that we saw with the CRA, are probably going to carry through 40 different standards.
Actually sitting in 40 different rooms to make sure that 40 different standards don't harm open source in weird and unexpected ways is a lot of work. So I mentioned the opaque standardization processes. Also, open source has special requirements. Things have to happen in the open. There cannot be patents around the standards. And not every standards organization functions in an open-source-friendly way, to put it mildly, when it comes to how they deliver the standards and how unencumbered by patents these standards are. So that's also something that will be incredibly important: to make sure that the open source community can actually have access to those standards and be able to implement them. The two last points: there's a huge diversity of open source stakeholders, a lot of whom were very poorly represented in the CRA process, even though the open source community was there. So the stewards were there, obviously, and they were very much involved. Hobbyists: it's very hard to actually represent hobbyists, right? Small commercial open source startups that are going to be incredibly impacted, including in the EU, because they will be considered manufacturers, rightfully so, probably don't have the resources or the know-how to be involved in the process. And the last point is interop with other jurisdictions. One of the huge strengths of open source is the fact that licensing is essentially standardized worldwide, and the MIT license means the same thing here and there, roughly, sufficiently that it's okay. And if we start having security standards that are different across different jurisdictions, it's going to be a huge burden on open source maintainers and open source developers. We want to make sure that if you comply with whatever the EU comes up with in terms of standards, it's fairly similar to what NIST is coming up with in the US, etc., etc. And that's it. Thank you very much.
PLD: When software causes harm – who pays and why?
Okay, so I'll advance to the next lightning talk, which is about the second big legislative effort that went on during the past, well, couple of years, really, smiling at one of the people that worked on it in the European Commission, which is the Product Liability Directive. And with us today is Rob Carolina, who is the General Counsel for ISC, makers of BIND, who's going to give you an introduction into product liability in five minutes. So, take it away, Rob. Martin's original idea was do the product liability thing in three minutes, and then you can do some other stuff for two. So what I'm doing here is giving you a reading test, and I'm trying to condense down to two and a half minutes a topic that we spend about 40 to 60 hours on in law school. So the reason that I'm giving you this reading test is because I want you to be familiar with this fact pattern. I'm going to tell the story in reverse from how I usually do it. This is a story about an automated car that hits a pedestrian in Ireland, Pat Victim. That car has on board a piece of software called Bravo Drive, which has included within it a piece of software called Open Sesame. The car was imported by Exotic Imports. The car was manufactured by Einstein Motors in California. Einstein Motors got the Bravo Drive software from Bravo Bits BV in the Netherlands, and Bravo Bits BV got the open source Open Sesame from Firefly ApS in Denmark. Terry Dastardly hacked into the automobile because of a weakness in the authentication package, provided a few inputs, and the next thing you have is a car that runs over Pat Victim in Ireland. Don't worry about Terry Dastardly. He or she dies in a horrible paragliding accident, or is without money, or is run over by a bus. Just take them out of the equation. The question that product liability seeks to answer is, in a situation like this, when we have an injured victim like Pat Victim, who pays for their injuries? Two slides that look like this.
This slide is designed to teach you the difference between two different legal theories on how you sue people who manufacture things. The left-hand side is the law of negligence, at least as it's practiced in common law countries. I would not come to a civil law country and teach people about the Napoleonic Code. However, I will talk to you a little bit about common law and suggest that the two are not worlds apart. As you can see from the chart, when our victim tries to sue all these various parties, Exotic Imports, Einstein Motors, Bravo Bits, or whoever, the victim is in a little bit of difficulty, because the people who manufactured and imported the car did everything reasonably. They selected good components. They selected trustworthy producers of things. They did not act rashly. Whereas the error in the situation came from a software vendor called Firefly, and maybe, just maybe, we could establish that they owed what's called a duty of care to the victim, if someone like Pat Victim was a foreseeable victim when someone wrote this authentication package in Denmark. But as you can see, it's going to be difficult to establish that. Now, in the reading test that I gave you one slide ago, I did put in there that the folks at Firefly had a bad week. The problem with their package was because someone made a coding error, and the QA people were kind of asleep that week, because we're going to get that in a forensics report from an expert who's going to come to trial. The right-hand side of this slide is designed to teach you a different area of law that was adopted in the U.S. in the 1960s and in Europe in 1985, which says: what do we do in situations like this, where everybody acts reasonably but Pat Victim still has injuries? And the answer is we don't look for people who did things unreasonably. We don't care how careful they were, how cautious they were. We look for people who manufactured and put into circulation a dangerous product.
We tried really hard to make it safe? It doesn't matter. If it's dangerous, you're liable. That's why it's called no-fault liability. And as you can see, because the automobile manufacturer and the importer, and this is the law as it exists today in Europe under the 1985 directive, because they were dealing with a product that is dangerous, they will be strictly liable, but the software vendors will not, because software has not been deemed to be a product. Enter the PLD, which changes things on the right-hand side of this chart. And as you can see what happens here, one of the design characteristics of the PLD, and the origin of these slides, by the way, was a talk I did at ETSI five years ago, which said this is coming. So I keep using the same slides for five years, and they're still accurate, is that we recharacterize software as a product, and now we can attribute liability to Firefly, because they distributed a dangerous product, a piece of authentication software that didn't work properly. We'll just leave it at that for right now. And since we're running a few minutes ahead, I have one last slide that I'll show you, and I'm just going to hold on this for 60 seconds while you read it. If you're looking for a copy of this, I just posted it half an hour ago on X and on LinkedIn. So whatever the answer is depends on what questions we're asking. I know the question I'm asking; I'm the guy on the left. It appears the questions on the right were the questions asked by the European Commission. And that's how we have the answers that we're talking about today. Thank you. Thank you, Rob.
CRA & PLD: panel
Okay, so welcome back to this session on the CRA and PLD block. We are having a panel with some of the people that directly wrote the pieces of legislation we're discussing in this block. To my left we have Benjamin Bögel, who is working for the European Commission as Head of Sector for Standardization and Product Security. I almost did it right. Next to him is Cheuk Ting Ho, who is a Director for the Python Software Foundation, and we really wanted to get a community perspective on this panel, which is what Cheuk will provide and also what Cheuk will challenge you to help us provide, because that's kind of what we are trying to do here. And finally we have Omar Enaji, who is Policy Officer for DG GROW, who has worked on the Product Liability Directive for multiple years now. My name is Martin and I will try to ask some questions. You will be asking the really clever ones; I will be asking the other ones. So let's get started. I would like to ask our panelists to do a real quick introduction, specifically to answer the question: what does implementation of these laws mean to you? Because we've been over the proposals, we've had the negotiations, they're about to be confirmed by Parliament. So what this panel is really about is looking forward. We're not doing the negotiations over. We're now looking at when these will actually hit Europe and what's needed to get there. Thanks a lot, Martin. So for the Cyber Resilience Act, I mean, the text isn't final yet, right? So we don't know exactly when it will enter into force. As I said yesterday, sometime around the middle of 2024, maybe a little bit later. And then we have a three-year transition period. So manufacturers, hardware and software manufacturers, will have to start applying the rules roughly around June 2027. So that gives us three years during which we can prepare for the implementation. We just had this fascinating presentation on the 40 standards, right?
So that's going to be a huge part of our work, helping the European standardization organizations with the standards. We will also have to produce guidance, of course. And thank you actually very much for inviting us here, because I think these are the venues where you get all the tricky questions that need to be answered in the guidance, right? Because of course the CRA is a high-level piece of legislation. It will not provide an immediate answer to every edge case that you may have. So I think this is where the guidance really comes in. And we want to be inclusive in this process. We want the community there: open source, single vendors, everyone. And we're really looking forward actually to this process. Thank you, Martin, as well. So for the PLD, it's a bit different from the CRA, because it's a directive and not a regulation. So it requires transposition at national level in each member state before the law will actually be applicable in each member state. That will be 2026, in theory around June or July; it will depend on exactly when the Parliament gives the vote. And by then the liability rules will be kicking in. So yeah, that's roughly it. Would you mind spending a couple of seconds more on the difference between a regulation and a directive? Because we appreciate that a lot of you may know a lot about software, and we also think some of you may not know a lot about EU lawmaking. So can you? Yeah, so I mean, just to give a quick legalistic view: at EU level you have three types of acts. A regulation, a directive and a decision. A regulation and a decision are basically directly applicable at national level, and the law remains the same everywhere. A directive requires transposition, and the transposition is basically incorporation into the national law. You will have 27 different laws that would basically say the same thing.
But because of the particularity of the directive, it would also require changes in some other parts of the national legislation. A directive requires implementation along with the incorporation, while the regulation only requires implementation: it's directly applicable as a regulation in the national laws, while the directive needs to be incorporated to be applicable. So you will have the central piece of legislation, but for the rest you will have national laws that will tell you or give you the answer. And the role of the Commission during the transposition, the two-year transposition, that's why there is a deadline for that, is to check each legislation to ensure that there are no mistakes, that it doesn't go against what the main legislation says. So that's the big picture. So my next question is about your personal experience trying to express FOSS into law, or maybe to interact in the EU policy space whereas you may have previously focused on the developer space. So a different question for each of you. For you, what was it like to work on a policy topic as someone who is very knowledgeable about software development? And for each of you, what was it like to work on a topic with the nuances that open source has in your policy work? So first of all, again, my background is very similar to a lot of developers. I'm closer to a software developer than a policymaker. So for us, I think we have a lot of concern about whether we will be liable. I mean, maybe I've created some fun stuff. I publish it as open source because I want to share it, but then you have no control of who is taking it and doing what with it. For example, the car example: maybe at the beginning when I created this project, I'm not expecting someone to use it in a car and then the car will hit someone. So that is something that I think a lot of developers have in their mind.
There's a bit of worry that now, if this happens, will we not be publishing anything anymore? That would affect the open source ecosystem a little bit more. And also, for example, if you're working for a company, maybe your company would tell you to not do it, because the company doesn't want to get involved in your hobby project that may get into trouble. So there is a lot of concern, I think. And also, software is very different from hardware, right? You can't make hardware in your backyard and then, in fact, take it into production. But the power of software is that individual developers can still develop a piece of software that is very applicable in a lot of applications but is maintained with very limited resources. I think that makes hardware and software hugely different in terms of scale. You don't have enough resources to massively produce something in hardware, but with limited resources you can still massively produce things in software that a lot of people use, right? So that's the concern from a developer perspective. Yeah. Thanks. Yeah, I mean, for us, I think it was a huge challenge to adapt the existing European framework for product legislation, the New Legislative Framework, as we call it, or the CE marking that you're familiar with, to software and to cybersecurity, right? Because, I mean, software is not a tangible good. It's different, and cybersecurity is also very special. It's not the same as safety. Usually we've always regulated safety. Now, for the first time, we are regulating security, and I found that to be a huge challenge. I think we managed to get it right, but it was a challenge. What I really liked about engaging with the open source community is that you meet a lot of passionate people who really care, right?
So when we regulate other areas, you get to meet lobbyists who are simply paid to defend interests. Of course, you're also defending your interests. But on top of that, I mean, you meet people that actually really care about the things that they work on, and you see it's more than just a job, it's a mission for them. And I really appreciate that. Well, I mean, for me, it's a bit different, because the Product Liability Directive is about any type of product. So what I had is basically everything. Even this seat, maybe it's a defective product. The idea is basically how to deal with the perfume industry, with the car industry, with tables, with chairs, with vaccines, with pacemakers, with hats, with whatever you want, all of those industries. With the PLD, we didn't have a specific sector. We had all of them at the same time. And what we actually needed is basically to have people that could represent each of those sectors, to hear the concerns and what could work and what could not work. And I have to say that with the open source software community, it was maybe a bit harder to achieve that, because of the fact that you are all individuals; there is not really someone that represents you. You need to speak a little bit louder, because this is not a mic, it's only for the recording. So what was really complicated with the open source software community, for me, is basically that I could not have a single voice that could tell me what the full concerns were; I had different voices. And to be totally honest, the ones that were talking the most about your issues were, let's say, the bigger ones, which I'm pretty sure do not represent you. And so that was the main difficulty for us from the PLD perspective: to get what the real concerns are and how we reply to them. But at the same time, to be totally honest, the PLD is a piece of legislation made for victims, which is basically all of you, all of us.
So we needed to find the right balance: not to put too much pressure on the one that creates the product, but also not too much pressure on the person that actually suffered the damage. And that was what we needed to achieve. And where we need to find the good balance is when we have your inputs; this is where we can actually find the perfect balance, in a way. So I will be giving the crowd an opportunity to ask questions. So if you have one, raise it and I'll get to you. I'll ask two questions to Benjamin so I can have a look around. So my first question, Benjamin, is about stewards: how can a steward know they're a steward? And my second question is, suppose they find out they're a steward, but they're not in the EU, who is the supervisory authority they are supposed to be talking to? Okay, so I mean, you find out if you're a steward by looking into the law; the law defines the concept of steward, right? It says, if you're a legal person that supports a project on a sustained basis, and this project is ultimately intended for commercial purposes, you are a steward. The regulation also gives a few examples, such as foundations. I mean, not every foundation will be a steward, but if it meets those criteria, it's a steward. And so you can look it up in the law. As I said before, there will be cases where it's maybe not as clear-cut, right? We hope that with the guidance, we can also address those cases. So I'm quite confident that at the end of the implementation process, people will usually know if they're stewards or not. Now, if you're outside the EU: the CRA is indeed a regulation, yeah? It means it applies across the entire single market in a uniform manner, and all the market surveillance authorities are responsible for you, essentially, yeah?
If your product is, or if your software is, published and accessible across the entire internal market, then all the market surveillance authorities will also be responsible for supervising you. So I will be walking into the crowd to get a question. I will be off camera, which is fine. So please state your name and affiliation and a question if you have it. I'll hold the mic. Okay. My question is about Debian. There is a Debian foundation in France, and there is Software in the Public Interest, but these foundations only handle financial issues. They have nothing to do with code in any way or form. Are they going to be considered stewards? Yeah. So unfortunately, I cannot give legal advice on individual projects, right? Because if I get it wrong now, then it's a huge problem. So you will have to check for yourselves. I mean, what I can tell you is, we put some indications into the law of when you could be considered a steward. So for instance, when you are hosting the collaboration platform, if you are to some extent governing the project, if you take decisions on the project, or if you do steer the development process, then you would be considered a steward. Taking another audience question. So please state your name and affiliation and then the question. I'll hold the mic. Thierry Carrez, Open Infra Foundation and the Open Source Initiative. You mentioned the chilling effects on development and engagement from the open source community. And I think the main fear we have is that whatever legislation is created, it would prevent or discourage people from participating in the open source commons. And I think it's linked to the fact that any uncertainty will be interpreted in the worst way. So how are we going to, with 40 standards on the CRA side and transposition in every country, 27 countries, on the PLD side, how are we going to have enough certainty for those people not to have this chilling effect on their participation? Thank you.
I'm going to Omar first. Well, I think you can send an email to one of us. That's basically the first thing. I mean, we are open to having a discussion with anyone that has an issue on the ground, because we are not on the ground. This is how it works for every unit in the Commission: basically everyone has legislation or a policy, and we receive feedback from people. Someone, for example, for transposition would say, well, I'm in Spain and this is how the law applies in Spain, and I'm pretty sure that was not the main idea, because when I looked into the main piece of legislation, it says something opposite. Well, then it's the work of the Commission to realize that something goes wrong there, and then we enter into contact with the national authorities. That's for the transposition part. But if there are issues during the years of application of the directive, then we have what we call a review clause in each piece of legislation. Every three years or five years, you will have someone from the Commission, usually one of us, that will do the review with a study, having interviews, taking all the evidence and proof; you collect all of it and then realize, okay, there is an issue that was not foreseen at the beginning. How do we solve it? There was a gap. How do we fill it? That was actually the same thing that happened with the PLD, the PLD that dates back to '85. It took 40 years to review it. Before that, we started the collection of the reviews and the proof, and we collected all the opinions, and this is where I say that maybe your community was the one that was not really involved in that, because of how the process is, but everyone has a voice and a seat there. Sorry, I want to ask a follow-up question. So I know that sometimes the White House will have some open call for suggestions and comments. Will you plan to do something like that?
Well, first of all, we need to apply it, but that is for sure for the next review, which will happen. So it's two years plus four years; it will be in six years. In six years, we will do a state of play of how it's applied, and then obviously we'll have to collect a bit of information, and we will have to check with people from the industry and the communities to see what their experience is and if there are things that work or don't work. So that's how we will have to do it. I cannot tell you exactly right now, but I'm sure that there will be one, because that's how it works for these kinds of things. Yes, so I would like to fork Omar's answer. I would like to add that, I mean, I don't think there will be a chilling effect on open source coming from the CRA, to be honest. I mean, let's be frank, open source is essentially outside the scope. Of course, there will be cases where manufacturers will try to place requirements upstream, right, and talk to upstream developers, but for the most part you are not covered by the Cyber Resilience Act. If you want to make sure that the transition goes smoothly, I mean, please do reach out to us. I think we've proven over the last year that we are a very approachable bunch here. We are taking your concerns seriously. We are going to do our utmost to find solutions, and we are even legally obliged to in the CRA. There is a specific provision that requires the Commission to consult the community. I mean, we would do so anyway, but you even have that reassurance that we have to do it. And yes, please do reach out. Just one thing on the chilling effect, because with the PLD we have experience with that. Forty years ago, I could show you the newspapers that were going all around Europe, from manufacturers saying that if this piece of legislation entered, there would be no products anymore in Europe. I'm pretty sure that this is not the case anymore.
What the PLD did is basically give people trust. When they buy something, they know that if something goes wrong, at least they will have their back covered. That is the idea of the entire piece of legislation. One practical comment I would like to make with respect to the question that was just asked is that after the panel, we'll have a workshop. And one of the mechanics is that we want to ask you about your fears, but also hopes and perhaps your solutions. So if you're listening to this and think, hey, I have these corner cases that I'm really worried about, make sure to remember them for like 20 minutes more and then put them to paper, because we're actually trying to collect these. I saw multiple hands. I'm first going to ask a question myself and then I'll return to the audience questions. So it's related to the PLD. In December, a political deal was reached on the PLD, and one of the things that was publicized by the MEPs looking out for open source specifically was that open source would not be in scope if it was not a commercial activity. And it was delegated to the technical level to implement this idea along the principles of the CRA. Now, when the text of the PLD became public at the end of January, what we saw was that there was a single, or maybe one and a half, PLD recital, and the CRA has seven, eight maybe. So I'm asking: is the PLD team that much better at writing recitals? Can we somehow use the nuance that was expressed in the CRA in the PLD, or are you going to offer guidance? Because I was a bit surprised. I was expecting more nuance, but maybe I'm wrong and you're the expert. So maybe a bit of a tough question for you, Omar, and I'd like to hear from you. So I mean, I will ask you a very short question: how many products exist in the world? Because what you as a community got is basically one full recital out of 47, while the PLD applies to millions of products. So I think in proportion you got quite a lot, actually.
The difficulty for the PLD is basically that, yes, there is a CRA that gives an explanation about open source software, but you will also have the AI Act for that. You will also have all the types of legislation that will get to the open source point, and we have to cover all of them at the same time. We cannot copy-paste from one single piece of legislation, because we apply to all of them at the same time. So the difficulty was really to find the right wording. I think, as you said, the MEP you quoted said that the main idea is the commercial activity. This is applicable for any product: any product that's actually been developed or supplied outside of a commercial activity, mostly supplied, is out of the liability regime. And that's what we wrote in the recital. It's basically restating the fact that if it's outside of a commercial activity, then you're out. But if you're in, that's where the PLD applies. We cannot create a specific regime for open source in the PLD itself, also because of the nature of the legislation, which has to be neutral; you cannot have very specific provisions about one single product, because each provision has to apply in the same way to any other type of product. And the CRA would apply for cyber vulnerabilities, but then you will have the AI Act that would apply also for open source. And for us, we need to cover all of them, so that's why it's done in this way. So I have a question that relates to work that will be a little bit out of your hands. For you, Omar, it's about the 27 member states that somehow need to take the work you did, make their own, and somehow understand the nuance of what open source is about. For you, it will be about embarking on 40 standards.
What will you be doing for Cheuk and me and all the other people writing software, to apply what you learned in the past 12 months, or maybe longer, to help the people doing that work understand the nuance of what is essentially a niche of a niche but also rules the world of products with digital elements? So what will the Commission do to help the community in these stages of the process? Okay. Yeah, I mean, so the Commission is not writing the standards, right? I mean, that is how it works. I think you also would not want us to write the standards. So that's probably a good thing that we are not writing them. It's the European standardization organizations. They are made up of national delegations from the national standards bodies. These standards bodies often send representatives from manufacturers and from others. The Commission has basically three ways of being involved in that process. First, we are the ones drafting the standardization request, which is the basis on which the ESOs, the European standardization organizations, are going to work on those standards. So in the standardization request, we can already express our expectations of what the standards should look like. Then, although we are not going to be writing the standards, we are going to be there all the way, right? So we will be in all the meetings. We will listen to the conversations. We will give our views. We will answer questions on how things are to be interpreted in the CRA and so forth. And at the end of the day, we also have to rubber-stamp the standards. They have to be cited in the Official Journal of the European Union, which gives them this power to give presumption of conformity. So what I can reassure you of is that we are going to be there all the way. We are going to look at the process very closely. We are also more than happy to engage with those parts of the open source community that do have expertise in standards, right?
To find solutions to the issues that you may have. So again, I already said it a couple of times, please do reach out and let's discuss that in more detail. Thank you. Well, I mean, my work is not done yet. As I said, the transposition will kick in as soon as the co-legislators have officially voted, which should happen either in June, July or September, in any event. And then after that, we launch the transposition period, which means basically that we will be receiving the 27 pieces of legislation piece by piece, or sometimes just the entirety of it. And we will have to work closely with each single member state to ensure that the legislation reflects exactly the directive. What we have as a tool in the Commission is what we call the infringement procedure. So when the Commission realizes that a member state does not conform with the new legislation, we can bring the case to the court to ensure that the member state applies it or does it properly. As a small background, the first PLD took, for some member states, more than 20 years to properly transpose. So I hope we're not going to be there, but this is how it works from our side. And then once it's transposed, in any event, we will have to check constantly if there is good application, because it's not only the transposition by the member state, but also how the jurisdictions will be applying the law. A national court is also a representation of the member state at EU level. So if there is a misapplication at that level, we would also have to intervene to ensure that it is done in conformity. Thank you very much. I will take two audience questions, one here and one there. And then we will continue with the panel if there's time. I will be holding the mic. Please state your name and affiliation. Alistair Woodman, representing two 501(c)(3)s doing open source projects.
As far as the PLD is concerned, do you anticipate that the market will support insurance policies to deal with this sort of quenching thing? Or is it a non-goal or a goal to encourage insurance in this particular regard for non-malfeasant behavior? I think that was for you, Omar. So the PLD does not have any requirement about insurance. So everyone is free to do whatever they want. Basically, you just need to calculate your own risk, and once you know your risk exposure, you will know whether you need one or not. But it's not something we impose from outside. And to be totally frank with you, as I said also yesterday, most of you here will never have a claim under the PLD. I mean, this does not happen every single day for each type of product. We have a few cases that can happen; you can have access to all of them. It's true that for software it's a bit more rare that this happens, because you have something that traditional products don't: you can correct the piece of software before something wrong happens. You know that there is a vulnerability, you know that there is maybe something defective inside, and then you correct it with an update and you avoid having any issue. That's a bit more of a facility for you. And we will not impose, from our end, an insurance for that. That's a bit of the approach. Audience question. Please state your name and affiliation. Hi, Olle Johansson. I'm an open source developer, also active in OpenSSF and OWASP. The problem with those 40 organizations that create standards for us open source developers is that ECMA, CENELEC, all of them require quite huge fees. Who will pay them so we can take part in the standardization effort? I think this will be for Benjamin. Yes, I don't think I necessarily have a satisfactory answer for you, right? Yeah. So I will take note of your financial needs. But indeed, the CRA is just one of many pieces of legislation, so we do not shape the standardization policy.
We just use the standardization process for the CRA. But indeed, I mean, this is an important question and we are more than happy to look into that. Thank you. So we're slowly nearing the end of this panel. I'm going to ask a number of questions in succession and then we'll see if there's more time for audience questions. So to Omar, I'll ask: do you know about any other related legislation that is coming for this community that we should be waking up to? So take a moment to think about it; I'll get to you for the answer. For Benjamin, I would like to talk to you about the guidelines. Can you be very specific about how people can contribute to the process of writing them? And there is this delegated act possibility about voluntary security attestation programs. Can you talk about what your intentions are, maybe how people can help? So my goal with these three questions to the two of you is to give people in the room a clear view of what they can do, should they have the time, the money, etc., to be involved in EU policymaking. So now I'll hand over to Omar. Well, I have no idea. I have to be totally honest with you. We are many, many directorates. But it's simple and it doesn't really require any money. It just takes some time to check what the commission is doing. The various directorate channels, I mean, mostly DG CONNECT, but it can also be DG GROW, or just wherever the directorate is. And then if you have a question, you are wondering something, you are... I mean, don't say that to my other colleagues, but you can send an email to the units. And this doesn't cost any money. They will happily reply to you and give you any answer that you're seeking. There is stuff that you don't understand in legislation. There is information that you would want to bring to the attention of the commission itself. Our emails are open for that. And this is also our role, to have a look into what happens on the ground. 
I mean, as I already said, we are legally obliged to consult. But we would do it anyway, of course. As the commission, I mean, we are very likely going to organize conferences where people can attend and bring their ideas to the table. We are likely going to have some form of expert group or a similar body for people that want to be involved not just ad hoc but in a more structured manner, where they can engage with us. And, of course, you will be seeing us at conferences. You can invite us to your events. We're happy to attend, maybe not always physically, but online. So there will be plenty of opportunities to engage. As regards the voluntary security attestation programs, so, yeah, I mean, the idea is basically to give those projects that are not directly in the scope of the CRA a chance to provide some form of assurance that the projects are secure, right? We know that many of these projects don't have financial resources. So the provision is quite open in that regard. It does not require the ones that develop a project to also pay for that program; other people can step in. So, for instance, integrators that take an active interest in a certain component because they need it for their own due diligence, they need the assurance that it's secure. They could also team up and pay for that assurance. Now, for these attestation programs, there is only a so-called empowerment in the law. That means that the commission is empowered, we are allowed, to flesh them out. So they are not there yet. We don't have these assurance programs at this point in time. But the commission will be able to work on this. And for this, we will also need your input, so that we can shape these programs in a way that they are useful for the integrators or users to have the assurance that they need, but also take into account all the specificities of open source projects, because they are often so different, right? 
The way they are structured and the promises or commitments that they can make compared to more traditional, manufacturer-based projects. So I think we'll take one more audience question and then I'll ask Cheuk if she wants to do any reflection on what this means for Python, maybe. Let's see. I think there was a hand pretty early on. So name and affiliation, please. Hi, Vittorio Bertola from Open-Xchange, which is a 300-person German open source software maker. So the question is, well, first, this is creating cost, of course, not for security, because we have a flawless security record; we already spend all the money that's necessary on security. But for the bureaucracy that now you are introducing for compliance. So this is making us less competitive, and all our competitors are from outside Europe, including Google. So how is this going to be compensated? And maybe ours... we are a pretty big company, we can cope with it. But for the French foundation for Debian that has to hire a lawyer, there are going to be costs. Are you going to put some money onto this, maybe to fund developers to cope with security issues or to fund the bureaucracy? And also, how are you going to avoid the international players gaming the system? I mean, it's way too easy. I can see this happening with the likes of Google and Apple. They create some initiative, which is a non-profit, they put the code into that, and it gets outside of the CRA scope or maybe gets the light regime, whatever. And then they don't have to support the cost of compliance, while we still compete with the same piece of software and we have to pay the full cost of compliance. So do you have any thoughts on this? Do you want Cheuk to go first, or do you want to answer the question first? Yeah. Yeah, I mean, it's true, of course, that there will be some bureaucracy. I mean, no law has ever been created that doesn't create some bureaucracy. Okay, maybe the PLD doesn't, because it only hits you once something happens and not before. 
But usually, of course, there is a certain compliance cost that's quite unavoidable. I think the competition concerns there may be a bit overstated, because the CRA does not only apply to European companies or manufacturers or open source projects; it applies to anyone who is bringing, publishing or putting on the market those products in Europe, right? And we all know that Europe is a big continent. It's quite relevant. There are probably very few manufacturers in the world that do not place products on the European market. So they will all be subject to the rules. We do have some facilitations, actually, for small manufacturers, when we talk about actual manufacturers. So there is a provision, again an empowerment for the commission, that allows us to create a form for simplified technical documentation for small companies. So that means that small companies will only have to fill out one form, essentially, and the length of the form is somehow going to inform the expectations of how much information you're going to provide. So I think that can help a lot; one single form makes your life much easier. And then we also have some funding calls. Actually, there are funding calls ongoing right now, until the end of March, that also aim at helping small companies deal with the implementation of the CRA. Thank you, Benjamin. So I think we are at time. I would like to thank our panelists for the courage to come here to talk to us, to have this conversation. They're not leaving yet, but I will ask you for a round of applause before we continue.
CRA & PLD: workshop
So a quick apology to the people on the live stream. What we just did was workshopping. It was, in my idea, real fun and also real messy. You couldn't see it on camera, but now the program will resume on camera. And I'll hand back over to Tobie, who will take you through it. Wonderful. Thank you. Well, I thought that was fun. I hope you enjoyed it too. So now each station, each group will have about five minutes to talk about some of the key findings and share them, so that the live audience actually knows what people are concerned about, what their hopes are, and what kind of solutions we want to see. So I'm going to invite, why don't we start with the, well, the PLD. Is the PLD fine? Yeah, I think that's fine. Good luck. Thank you. Do you want me to go through quickly? Yeah. Okay, I think the good word is trust. That's the main one. The idea is really to bring trust in new types of products, at least. "Alignment" between the pieces of legislation is also good wording. I mean, as I explained from the beginning, the PLD applies to any type of product, so it needs to be aligned in that sense. And I think some people got the idea with the DMA, because of the interoperability that the DMA has introduced now. And it's a bit of the idea also of the PLD: when you have hardware and you can install different types of software, you also need to pinpoint the liability of the person. That is a bit of the opening as well to these new possibilities. So for the fears, we have several, but at least two main ones: still the clarity of the scope for the commercial activity, and the chilling effects that it could have on the community. I mean, I can talk about the chilling effects, but, again, we will have to see exactly how it will work and how it will affect the community directly. I cannot tell you in advance. I know that there are fears about that, but that's also the reality. 
And as I said, what we did is only to clarify the situation. We did not change the situation drastically. I know this is not what you will hear all the time, but the scope was like this already before; it has now been clarified. So we will have to check for that. For the scope itself and the commercial activity, again, I cannot tell you for each case how the commercial activity applies. What I can tell you is that the commercial activity is not always the same, or at least the scope is not the same across the pieces of legislation. The PLD may still be applicable to certain products that the CRA does not apply to, to put it in another wording. It's not because you're not covered by the CRA that you will be outside of the scope of the PLD. You might actually be covered. So, just to make it as clear as possible. There was a good question on the open source silicon chips. If I can just remember who put that question, I will then get back to it, just to understand a bit more. But I would just go to the solutions. There is a good call for guidance. There is something that we call the Blue Guide that applies to every product in the union. The Blue Guide is just guidance, not a piece of legislation, that can help market surveillance authorities to apply a legislation, definitions, etc. It could be a good way also to update it for software, and specifically for open source software, to make it more flexible and more adapted to that. I think that's a very good point. I would just use this time to give a short note on contractual liability, because there were some points about limiting the liability. That's not possible. There is no way that you can escape the Product Liability Directive, even by putting in any type of clause. I'm sorry for that. The piece of legislation is made for protecting the most vulnerable person. So you cannot sell a product, or you cannot provide a product, while limiting it. 
But what you can do, if you are a micro enterprise and someone wants to take your open source software, is to agree with them that you will not take over the liability in case something wrong happens. So basically, if a victim goes to court against you, you will pay; when I say you, let's not imagine yourself, but you, as I said, can just basically compensate, and then you have your own agreement whereby you get back the compensation from the integrator. And the other way around: if the victim goes against the integrator, the integrator then will not be able to go against you to claim part of the compensation that he gave. So that's something that exists as a possibility in contract. Thanks. I think we're good. Yeah. Wonderful. Thank you. Should we go next? Do I talk or do you talk? Do we do both? Half-half? That was a bad answer. Hi everyone. So in the workshop on the CRA standards, we had, I think, a lot of participation, a lot of thinking heads and a lot of fears in the beginning, for sure. Some hopes, which was nice to see. And it was also nice to see the connection between the hopes and the solutions. So I think with time we got to some solutions, and I'll let the other moderator explain the solutions. But in terms of how we organized the hopes and fears: there are things that the open source community should probably do or address, there are things that the standardization development organizations need to address, and there are things that the regulator needs to address. And I think trying to figure out how we can collaborate and cooperate within this triangle is going to be key moving forward. So thanks very much for being here, and I hope we can continue the conversation. Thank you so much. And sorry to put you on the spot. Don't go away right now, because you have not been introduced. And so: Filipe Jones Mourão. He is the standards person for the CRA at the EU Commission. 
And so this is the person that you will be able to bug about these issues the most, right? So thank you very much for being here, and thank you for joining me in organizing this session. This was great. So in terms of solutions, we grouped them in a few topical clusters. I think the one that really stood out is open standards. Everyone's really concerned about the process, but also about access to the actual output of those standards. So I think this is going to be really critical. Community organization is one that comes up fairly regularly too. So how do we do this? How do we structure ourselves, as all of these different stakeholders in the open source community, to participate in this? That is something we need to work on. Then there are requests for good EU guidelines. I think this is great. And also requests for EU funding to help with, well, probably organization, and sending people to actually participate in these efforts would for example be great. If I can shamelessly plug into that: there's actually a call that was recently accepted and a project starting called Cyberstand. Their aim is to support the participation of experts in standardization efforts, plus some other auxiliary tasks around standardization, the development of the standards for the CRA. So I really encourage you to take a look at that, and if you're interested in participating, then that's probably an avenue for you to do that with some support. Thanks. So where can we actually reach out for this? The name of the project is Cyberstand. So the answer is: search for Cyberstand. Or get in touch. And lastly, yeah, one of the points that came up is better access to policy makers, while we're doing this right now. And you know, he's a very accessible person. So thank you very much. And then, well, EU-US mutual recognition of standard compliance. 
So I care about this. I think that's great. And then a request that we focus more on actually doing the work of making the security of software better. And also being able to do self-attestation and have the integrators or the manufacturers fund some of the compliance effort, which, since I believe open source sustainability is very important, I'm all for. Wonderful. Thank you very much. Last station. Do you all want to come here? Can we do this? Oh, beautiful. So in those two corners, it was closer to the text and to specific aspects of the legislation. Over in this corner, we were coming from the other end of the chain. We were talking to people about where they are at the moment and their perception of what is happening, what could be better. And so we looked at all the hopes and fears, and hopes and fears are kind of the same thing: you can hope that something will go well or you can fear that it will go wrong. And so we bunched them together and we made a few different categories. In general, there was a lot of discussion of funding, but we found this is a complex topic. And so for funding, we have to think of specific suggestions rather than just increasing funding: make funding available to more smaller entities, remove the costs for certain things, for example having to buy a copy of standards. And then we noted that participation in projects is also as good as funding projects. So there are multiple ways to think of funding. A second thing is procurement, and procurement by the EU institutions can be good for supporting projects. It can also be good for increasing the awareness between the EU institutions and our ecosystem. The third thing is that there is a lot of funding that is currently available and there is little awareness of it. There was the open science cloud, which has one billion of EU funding, and only one person in our group was aware of it. 
So there is also an information gap there that we as an ecosystem and the EU institutions can work on. There is the issue of being the first mover. In some of these pieces of legislation, the EU is the first to regulate in a big way the way software is distributed. And so we have to keep in mind that when we do this, it may be copied by other parts of the world. So on one hand, it's useful to do it well for our own people; it's useful to provide a good example. And we should probably also try and work with other parts of the world to ensure that they don't bring similar topics into legislation with completely different requirements. Otherwise we will have a different set of requirements in every region of the world, and developers will have a big headache. FOSS awareness within the EU institutions is a big topic. Basically, the more people are aware of what FOSS is, how it is funded and how it's developed, the better we will be treated as FOSS. So we need to increase awareness of who develops FOSS and increase involvement. And the last thing is international coordination: working more with the UN, WTO and ITU. I've got two minutes left, and I said to Benjamin that I'd give him the last two minutes to give his impressions of what he heard in general. So if you would like to. Thanks a lot. Yeah, so first impression: everyone wants money. I'm not sure we will have all the money that you want. We'll make an effort to support the community through calls, of course. Yes, I think I've noticed that a lot of the hopes that you have are actually the same hopes that we have, right? So you are hoping that the manufacturers will contribute more back. That's also our hope. I think the CRA is doing a lot in that direction and I think it can be a game changer. And yeah, I mean, you are also hoping that the community will step up and also provide solutions such as templates and so forth. And I mean, that's of course also our hope. 
I mean, the CRA cannot succeed without you. It cannot be just the commission doing all the work. We also need you. So we are really hoping, especially for those players that are a bit better equipped in the community, that they will help draft, I don't know, cybersecurity policies that other, smaller stewards can maybe copy-paste and take over, or that you will maybe start projects for tooling to help companies or open source developers test their products and ensure that they are secure. So we also really count on you. That's also one of my messages for today. Thank you. And thank you for all the participation. I will hand back now to Martin. Thank you to everyone who wrote down their hope, fear or solution. Thank you to the people from the commission who were willing to put up with our chaos, trying to run a workshop in a venue intended for talks. And thank you in particular to Tobie, who made all of this work.
CRA & PLD: CRA conformance for Open Source Projects
Our next speaker for today is Marta Rybczynska, and I probably didn't manage to pronounce her last name, so I will be asking her to do that again and show how I was almost right but not quite. Marta will be talking about CRA conformance and the thinking that Eclipse has been doing around this. She has a background of a number of years developing different solutions, but she has also closely followed the CRA. You may have seen her article on Linux Weekly News some months back, which was a very good summary of where things were at at that time. So without further ado, this is Marta; enjoy her talk. Thank you, and you pronounced my name quite correctly, in fact. My name is Marta Rybczynska and I'd like to do a test implementation of the CRA in five minutes today. So let's go. The example open source ecosystem is a quite standard one, with a physical product to make things easier. Starting from the end, we have the final product that is sold to customers, and we have the device manufacturer of that product. That device manufacturer is assembling multiple open source and proprietary elements, adding their own software to the whole thing, to build their product. This device manufacturer can of course have multiple products, and they are not integrating just one open source project: they are integrating upstream project A and, of course, a hundred other open source projects. Upstream project A develops a project under an open source license, and they have dependencies. They have a dependency B that is another open source project working in a similar way. So, okay, here enter open source stewards. You have probably already seen the definition; I highlighted the parts important to me: a legal person that has the purpose or objective to provide support to open source. Okay, so what comes out of it? Where do stewards pop up in the whole thing? They pop up for the dependency B. They pop up for the upstream project A. That's pretty expected. And then a few remarks on that. 
Very likely, stewards will be foundations, especially if they hold the trademark to the project name. That is a quite obvious situation, but we also have situations that are a little less obvious, where we can't decide between steward, manufacturer or none of those: for example, for-profits that are supporting projects that are not critical to their income, like open-sourcing CI scripts, or open-sourcing programming tooling for their board. Things that are absolutely not critical, that they are absolutely not monetizing. And we also have consulting companies; not giving names, but there are many consulting companies that have been contributing to open source projects in a sustainable way for years. So how do they qualify? And when we add to this: can we have multiple stewards for a single project? If we just take that definition of a steward, why not? There may be a foundation, and there may be a company that actually donated the code to the foundation and is still contributing. If they are not monetizing, why not? And then, an interesting case: stewards have a definition, but stewards also have some obligations, and what happens if the steward cannot force the project, or they want to force the project but the main developers say: I'm not going to implement that, pay someone to do that work. What do you do? Question mark. Okay, and then we finish adding the CRA elements to our scenario. We add the due diligence that the device manufacturer should do about the open source projects they are integrating. We have the conformity assessment that they should do when releasing their product, and we have the final user documentation that they are expected to release. And, well, mostly we have some challenges for the conformity assessment. Challenges and opportunities for the open source world: a final product usually includes dozens to hundreds of open source projects. 
So manufacturers quite often use the same project in many different places, and many manufacturers use the same open source project in different places. So what makes sense, and what is logical, is to do the conformance work, to do the paperwork, all together in an open source way, and release it under an open source license. Oh, there's an alternative: the big ones will be able to pay for the whole work on their own. The small ones, I'm absolutely not sure, if they include a hundred projects. So that will be it for me. Thank you.
CRA & PLD: rapporteur playback
So I'm trying to summarize what we have learned in this session. We had a great opening from Tobie, who explained to us the importance of these standards that will be written to accompany the legislation. We had a lot of discussion about the 200 pages of text that is the law. And we will then have fun reading the 4,000 pages of text that are the implementation standards; a lot of the details will be in those standards and are still to be developed. He also pointed out that the CRA has landed now and is not catastrophic, which I think is an important point to take. It is also an opportunity for the FOSS ecosystem to step up here and to play a leading role as stewards. We have a lot more clarity with the separation of roles. We also have, for the first time, a major law in a major economic block talking about free and open source software and describing a specific role for open source software stewards. So I think that's a win. Next we had Rob walking us through the wonders of the hardware and software supply chain and how liability, especially strict liability, can work out in that, and highlighting that the approach in the EU PLD is one of strict liability. We had a panel where, besides the really interesting questions, which I will get to in a second, we also had a very symbolic picture here in the front: the Python Software Foundation sitting in the middle, squeezed between the Cyber Resilience Act and the Product Liability Directive. And I really thought when I saw this, this is kind of the picture of what we're seeing, because we have a group of people making free software available to the world, trying to do their best, essentially working in the public interest, and trying to see how we can make this work in the environment of our own regulatory frameworks. Those are the highlights here. 
I found it indicative, in the question to Ben about who is going to be a steward and what happens if the steward is outside of the EU, that he first said: look through the law, which is the right recommendation, but he also said this is to be clarified further down the road. Yes it is, but that is just indicative of this uncertainty that we are currently in. So it's the right answer, but it means that we need to stay on this topic and we need to get answers to these questions. I also thought that the question from the audience about what if we have a very decentralized open source community, maybe with a legal entity in France, but one that's not really controlling anything that the developers do, it is just coordinating the work, was very pertinent, because this is exactly the gray zone between being an individual developer, being a loosely organized community, and being a well organized, more centralized community that clearly qualifies as a steward. And I think there it's not just on the lawmakers to make this clear; this is an area where the free and open source software community will have to sharpen our own governance norms, to make it more clear which of these we are in a given situation. So there will be implications for how the communities operate; that was one of my takeaways that was clear in this discussion. We focused a lot on the Cyber Resilience Act because it was such a pertinent topic recently, so I was really glad to see the Product Liability Directive here. A key takeaway that I took from the discussions is: you cannot escape the Product Liability Directive, I hope I'm quoting that right, and I think Omar also was able to say why. If a law protects the most vulnerable person in the chain, then it cannot easily exclude anyone. Who is the most vulnerable person will probably always have to be assessed in a concrete case. 
He also pointed out that an important aspect of EU law making is that all these laws have review cycles; they're not written once, to collect dust while we live with the consequences. They will be reviewed, and he encouraged us to engage in the review process and to provide our feedback. This is probably one of the most important takeaways today for the people in the room: let's stay engaged here, basically. How am I doing on time? Two minutes, okay. We talked a little bit at the end of the discussion period about how the European Commission will engage with the standards. Omar pointed out that the European Commission does not write the standards, what the process will be, and that the open source community is encouraged to actively participate in this. That's a big takeaway that we need to take. There was always the question of who pays for the additional bureaucracy, who pays for the fees to participate in some standards development organizations. I think we didn't really get good answers here today, but it was made clear that this is an open issue, because our organizations are almost exclusively non-profit organizations and that's additional cost. Additional cost requires more fundraising. I think the question of "who pays for this, who pays for this, and who pays for this" got repeated a couple of times, including in the workshops. Let's go to, if we have a little time, a couple of highlights I summarized. One is a big shout out to our lawmakers here. The EU is approachable. It was here in this room. It was in multiple panels. It was willing to answer questions from angry developers. I really congratulate them on this attitude, so thank you for being here. 
Another takeaway is why this room is so important: I think it was Tobie who pointed out that we have a very diverse set of stakeholders, and normally in these processes the stakeholders that are well heard are the ones that have the means to be, the big foundations, the larger projects, and we need to have the hobbyist community, the small one-person enterprises and all those involved as well. That happened, I think, more in this room than in the discussions until now. I want to point out one thing, and this is in response to Omar, who said software is only one thing out of a million products. That's true. But software is also in every product, and every product consists of, you know, 80% software. That means we're not one in a million. We're 40% of the overall market if you divide it 50-50 between hardware and software. So I think we are totally worth being especially considered in the law, and I think we do have the impact that justifies that. Regarding engagement in standards, last statement here: the EC made a very direct offer to engage. I think we should take that, especially the foundations; I speak for the Linux Foundation here. We will engage. There's one really positive signal here: we've recently been appointed to the multi-stakeholder platform for ICT standardization, which is kind of the consulting group to the commission, and we will use our influence there to bring more free and open source software players into standards development. Regarding that as well, keep in mind that standards development is also a national activity in the member states. There will be representatives of your member states appointed into those standards bodies, and it's a great way to engage through where you live and get somebody into that. With that I would like to close. A big thank you to the panelists, to the moderators and to all the participants. Thank you very much.
FOSS policy engagement: a CRA retrospective.
So Martin and I, I think we met seven months ago, six months ago, eight months ago, something like that, when you started to get more and more interested in policy, and I was at Eclipse trying to make something out of the CRA so that we could solve all the issues that we had been discussing during the first session. And we quickly realized that we had backgrounds that are very complementary, in the sense that he obviously has this whole open source background, and I have this policy background in advocacy and how to advocate in terms of policy. So, with our backgrounds combined, we decided to combine our efforts. Here today I'll discuss what it's like to do advocacy, and then Martin will explain what it's like, from an open source perspective, to be new in Brussels and try to do something about it, and how we tried to organize the whole thing. I think what we need to do here is share the information about how policy making is done in Brussels. In Brussels you have several institutions gathering together to create those policies. The main three, not going into the details of all the policy-making procedures, are: the Commission, the one that drafts the proposals, for example the one that drafted the PLD and the one that drafted the CRA; the two policy officers who were here this morning are the ones who did it, together with their teams. Then you have what we call the co-legislators. One is the Parliament, the one that we directly elect, and the other one is the Council, the Council of the EU, which is representative of all the governments of the EU. We're talking France, the Netherlands, whatever country you're from in the EU: your government is there. Then the question is, how do you actually influence those policies? There are, I'd say, several things to keep in mind as a community to do that.
If you want to influence policy, the first thing is to get interested, so that you gather the knowledge on the policy, the reason why this policy is happening, and the actual details of the text or the issue that the policy makers are trying to address. The second one is to get organized, to actually have an impact, so that you have credibility and the policy makers don't see just one citizen coming to them but a group of citizens that is organized enough to represent a part of society. The other one is to just write things down, so that you have clarity. Then you have to identify the different elements within the policy-making process that allow you to get involved. Here I'm talking about contacting policy makers, the right ones, being able to identify them properly. I'm also talking about getting support from your network, from the companies that you know, because the open source community, in the case of the CRA for instance, was only one part of the overall challenge for policy makers; they also have to talk with industry, car companies, large tech companies that are closed source as well, and all of this needs to be addressed. I'll hand over to Martin now to give details, and then we'll be exchanging throughout the presentation to see what didn't work, what worked, and what could have worked if we had acted differently; and then if you have questions, we'll try to take those as well. I'll start with a quick promise: I was here last block, and this will be the last time you see me up here; then I'll just sit down and other people can talk, because we take this rule seriously. What I will be sharing with you is my personal story of how I got here, because this was not really my plan. My role is to work between policy and technology at NLnet Labs, which is a small R&D organization from Amsterdam working on DNS and routing.
Because this is not about me but about the lessons I learned, I'll give you the lessons first; then you can plug your ears for the remainder of my talk, and that's the deal. My lessons first: I think we were too late, and if you're too late, if the Commission has already made a proposal, then you're chasing the train, right? We did a lot of train chasing, and that means you need a lot more effort than you would have needed if we had been in front of the train, at the station, discussing where the train should be headed. I also learned that we cannot expect FOSS to organize the way an industry organizes, because if it turns out that it is nicely organized like a trade organization, you're probably not talking to the whole community but just to the industry part of it. And because of these facts, I think the digital dossiers in the European Union need to change a little, or we need to figure out mechanisms to do advocacy from the community, because for all the digital dossiers software is relevant, and for software open source is relevant. I'll try to illustrate why I think these are the lessons, because I think I, and we all, got quite lucky on this one. Two last lessons: I think we should be talking more to Parliament as a community; they should be the people that are most accessible to us. I personally didn't talk enough to Parliament. And it turns out that even if you don't have any EU policy experience, you can figure this out if you have enough time, and if you're lucky you can make this work. So now to the story, and you can plug your ears if you want to, because you just had the lessons. In September 2022 I found out that there was this thing called the CRA, and I read it and I thought, hey, this is weird: open source is mentioned, which is great, but it's also clustered under the whole idea of non-commerciality, and we all know that there's also a lot of open source in commercial products, right?
So I sent some emails, and I got lucky real quick, because I reached out to the Dutch digital civil rights organization Bits of Freedom; they connected me to a law professor, and they connected me to a wonderful recent graduate who had a lot of context on this law, because she had recently interned with the team that actually wrote it. And I want to thank Francine, because she delivered the mental model, to me and to a lot of people I worked with later, of how this actually fits in with the NLF, and that's thanks to her. In October I contributed to the first blog by ISOC; they were the first, I think, to mention, hey, there's this thing coming. I wrote a little newsletter saying, oh, I'm spending some of my time reading this stuff, and I sent a little tweet, and then I got lucky again, because the tweet was noticed by the Commission, by Benjamin specifically, and he said, hey, maybe you should come and talk. That was kind of a surprise; I mean, you write one tweet. So I was planning a visit to Brussels; I bought a t-shirt, super relevant detail, yeah. I was there because my organization works in DNS, so I was attending a DNS conference in Brussels, and I decided to drop by the Commission because I was invited. I learned a couple of things: they have really nice rooms with really nice flags, you are received by friendly people, and then you get kicked out because that room was for the boss.
What also really stuck with me, and Benjamin repeated this this morning, is that one of his colleagues said, oh, it's so refreshing not to talk to a lobbyist for once, and I was like, yeah, I've no idea what I'm doing here. But it was really constructive, and I'm not saying this to please people. This was November 2022: we actually had conversations about whether compliance work would increase the security of open source, and I was arguing it probably wouldn't, and we got the question, is there anything we can do that will increase security? So we got into the conversation about: if a vendor is obliged to report a vulnerability, maybe they should also be obliged to send the patch if they have one. That was just a conversation, and I don't know how it got into the law, maybe Benjamin will tell us someday, but I think at that point they were already thinking, oh, maybe we can tweak this a little, which I think was great. I also learned, and this is about talking to Parliament, that Benjamin and his colleagues were very insistent that I should talk to the co-legislators, and I was like, what? I'm here now. So, co-legislators? And it turned out they were actually done already: they had made the proposal, they could explain the proposal, but they were not the party making changes at that point in time. So in December I visited Brussels again, and I came with a list of examples, like, hey, these are the ways that people write software that might be interpreted as a commercial activity, and people told me, yeah, it's great, but you need to talk to the co-legislators. I was like, oh, great, maybe I'm doing this wrong, but since I'm talking to you anyway: maybe you should come to FOSDEM.
So last year we had, in Janson, the session with Omar and with Benjamin, and I think that was the first time we did EU policy at FOSDEM. Some questions were raised, because Benjamin told us, it's on camera, you should be talking to the co-legislators, and Alex, who is chairing this room today, was in the audience and asked, so what is your plan, because you're talking to the wrong people. And he was right, right? But I didn't know; we were just trying to get going. What it did get us, though, was that it started building alliances, because a lot of people started interacting with me and with others, Open Forum Europe became very active, and I think that was what FOSDEM did for us. Now, remember that blog? It did three things. I got in touch with people who did know how Brussels worked, and that was useful, because I didn't. I got an email from an aide to the Dutch senate; they were interested to understand what this was about, so they started writing questions to the Commission and to the Dutch government, and these questions are obligatory to answer, right? So it created some pressure, and we got a visit at NLnet Labs from the Dutch delegation working on the CRA from the national, that is, the Council, perspective. It turned out that the Dutch were very pro-CRA, but they also grasped FOSS, so that I think was a win, because at that point in time we, or at least I, could start to talk to the right people instead of the friendly people, because that is kind of how this works, right? The questions from the senate helped the civil servants to actually show up, because they need to focus their time where the relevant problems are, and I got some help from people that are more experienced in working with the Dutch government, so I want to thank Bert.
Then came some silence, because FOSDEM was over and I had talked to the Dutch, and everything moved back into its own room, or just silence. I got a rejection from Parliament, because I had applied to attend the hearing, and they said, yeah, we don't know you; there's limited seating, and we're talking to the people we've heard from before. That was a bit of a disappointment; I mean, it was a very kind rejection letter, but it didn't help me as a random Joe trying to get in. But Open Forum Europe also built up steam. Open Forum Europe was the place where a lot of us met on a weekly basis; Kiran made an agenda every week, and he basically made sure that people were actually discussing the same topics. And we, or maybe I, got lucky again, because about every couple of weeks I got, from random people, leaks from the Council discussions and from what the Parliament was discussing, and I learned how to analyze these and just share, like, oh, this is what is being discussed. I also learned how not to write a position paper, because we wrote some; they were very lengthy, and I think they were completely ignored, so that was a good thing to learn. And we learned that the policymakers' perception was that we were just trying to get out of the scope, so I, and I think a number of people, shifted focus to challenging some specific assumptions instead of arguing about the scope. Over the summer I started emailing some MEPs, the Commission, the Council; there was a lot of silence and a bit of despair. But then, in the October-November time frame, suddenly communication started flowing: we got emails, not just me, anyone, asking for reviews; we got asked for input; there was a proposal floated about an open source steward that no one saw coming. And I think at that point we started to be in the position where we should have been before the train left: there were people engaged on a topic, talking to the right people about the policies they were making. And I think the policy makers delivered, in Parliament, in the Commission and in Council, by actually having conversations about these topics with the people working on the specific aspects. So this is how I got to these lessons learned: we were too late; we cannot expect the community to organize like industry; and I think digital dossiers need a way to get us involved at a stage, and in a way, that actually works for the community. Oh yeah, and you should talk to the co-legislators. That's it for me, I'm handing over, and since I did a lot of things weirdly, I'm happy to hear about it. So I think it raises the questions. From the very beginning I said, this is how it normally works: you get interested, you get organized, then you start writing stuff, and then you speak to the co-legislators. But the question is, and Martin raised it very well, he got lucky; I wish I could be that lucky in my life. How are we going to get organized? Who is going to be the person in the open source community interested enough that we get the information at the right time? Who is going to write the stuff? How are we going to agree on the things that we want the open source community to say? All those questions we need to figure out, figure out on our side, because, as Martin just said, the co-legislators and basically all the institutions need to figure out a better way to interact with us, but how do we also get better at interacting with them? That's the question that we're probably going to have to discuss today during the workshop and then the fishbowl. And that's, I think, the question of today in this specific session, the two questions: what do we want the co-legislators to do so that they can come to us more often and in a better manner, and how can we step towards them as well, so that we also get better at interacting with the co-legislators and the Commission? Thank you.
FOSS policy engagement: The impact of the NGI Open Source projects on EU policy and values
I trust them. However, now we want to hear about... is this the first slide? Oh, sorry. Like this. OK. Jean-Luc and Clementine, who starts? Good morning, everyone. I'm Jean-Luc Dorel. I work for the Next Generation Internet unit. It's not a new unit. Together with colleagues, and I have one of them here, we are supporting the so-called Next Generation Internet initiative, NGI. I don't know how familiar you are with NGI, but I will introduce it in a few words, and then Clementine will give you some insights on what we are currently doing in analyzing the impact of this initiative. We have heard about legislation this morning, the CRA, the PLD; we'll probably hear about the DMA and DSA a bit later. The European Commission is the executive body of the European Union; we have the right of initiative, and we give initiatives and suggestions to the co-legislators. And we have, to simplify, two main streams of action. One is legislation; one is money, funding. I'm representing the funding part, and other colleagues represent the legislation part. In the funding part you have many, many things, like agriculture policy, regional development, infrastructure, and you have something called research, and more precisely the Horizon Europe programme. This is where the money for research is channeled; it's the main instrument. By the way, it's a law, so it's also legislation, so the co-legislators matter even in this part. And as part of Horizon Europe there are a lot of actions. You can imagine, it's the whole research spectrum; it's a little bit less than 100 billion euros over seven years, that's the budget. And if you scroll down, at one point you will find something called Next Generation Internet; that's us. It's been operational for five years now; we are completing the fifth year. We mobilize between 20 and 25 million euros per year; that's the envelope. And what we do is fund open source.
I don't know how familiar you are with this, but there are a lot of people funded by us at this FOSDEM event, these two days. There was a session yesterday, there was a stand from NLnet. The way we work as the Commission is that we don't give the money directly to the funded party; we give money to intermediaries. Why? Because in the context of open source, we believe that we don't have the instruments to fund open source communities: at the Commission, we have instruments for big consortia, 2, 3, 4, 5 million euros, a lot of participants, et cetera. So what we do is give money to intermediaries, and we have projects, and some of them are here. They, in turn, take this money and run open calls. You can find them on ngi.eu; there is a section on those calls, which are open more or less continuously. We just awarded 27 million euros of funding for the next three years, a quite significant project called the NGI0 Commons Fund. This notion of commons means that we are funding community-based and community-governed development of open source. There are some nuances in open source that we will not dig into in detail, but check the eligibility if you're interested, and you will find that it's very open. Well, five years means that we need to step back a little bit and understand what we have been doing. It's 1,000 funded projects; again, you can find the catalog on ngi.eu. We need to understand the dynamics of this portfolio, because the next steps of funding are coming, and it is co-legislated, it is a policy, so we have to understand the impact, so that we can come to the decision makers and tell them: in these five years we have done this and this, with this and this impact. So we have contracted Gartner, and Clementine, who was also involved in the early discussions, more than five years ago, that settled the rules and the principles for this initiative.
And I'll leave you the floor for detailing precisely the benchmarking of these five years of NGI and its impact. Thank you, Jean-Luc. Good morning, everyone. I am Clementine Vellier. I used to be a researcher here at ULB, I studied for a master's in IT at ULB, and I have been coming to FOSDEM for the past 20 years. Five years ago, when the European Commission launched the bid for creating and shaping the Next Generation Internet initiative, at Gartner, we are humbly in love with technology, and of course our understanding was that the next generation internet had to be about digital commons as well, and that we would have to reach out to the existing community. So what did I do? I called NLnet, I called Michiel, and that was my lucky moment. We won the bid. And that year at FOSDEM, when we finished the study, we had one of the keynotes of FOSDEM presenting the NGI study with the policy officers at the time. So that's also, for me, I think, the first time the European Commission came to FOSDEM, and I'm glad that they've been here every year since. The reason why we thought it was also interesting to present the NGI in this track, as Jean-Luc was explaining, is that the NGI initiative, with its thousand projects, has an impact on EU policy as well, because it actually helps implement it. We've been trying to collect data points fast, to present them at FOSDEM today, and these data points come out of the initial analysis of a survey of these thousand NGI open source projects. We looked at how they implement EU law, how they carry and foster European digital rights, how they provide alternatives, how they link to standardization and what standards they implement; and the topic of sustainability is also analyzed through different dimensions. So, basically, there are six European digital rights and principles.
We have data points here that show how many projects implement how many digital rights: over 20 to 25 percent of the projects in NGI implement up to three of these digital rights, and we can cite this number, which I think is really interesting: three quarters of them actually state that they increase safety, security, and the empowerment of individuals. These EU digital rights are really important. I encourage you to follow the link and keep them at heart; I think they sit close to the open source community as well. So this was one of the initial findings. The next one is about how they fuel the uptake of EU policy. We have cited many different EU policies, those that were mentioned by Jean-Luc, so I won't go through them in detail, but I think what's interesting to see is that, let's say, 90 percent of the projects see themselves as implementing EU policy. To be honest, I think the open source community somehow understands EU policy, because they have similar values and they understand what they're striving for. We're getting there. Another data point that's really interesting, of course, is the fact that more than half of the projects state that they're implementing an alternative to existing large monopolistic solutions. And as these alternatives are open source and available, they will, of course, be a good source for driving digital sovereignty in Europe. The standards. Standards are also very much on the European Commission's radar, for many reasons: for one, they fuel innovation, but they also help create, let's say, scale and breadth in the market, and interoperability as well. Obviously the NGI projects are rooted in the standards community, mostly the IETF, W3C, ISO and IEEE, but we've also heard other standardization organizations cited, linked to geospatial aspects.
And this is key as well, because the NGI is, of course, going to be part of the Web 4.0 that will take place worldwide, and geolocalization will be a key part of that. What about sustainability? We've seen in these last data points that there's a strong footprint in those EU digital rights, values, policies and standardization, but what about the outcome? How do these projects follow through? Well, actually, three quarters of them follow through. It's a significant data point for many reasons. For one, it shows that there is a concrete outcome: either they follow through because they are part of a larger effort, an open source community or a business, and so those contributions funded by public money have been integrated back into the community solutions. The second interesting point is that 8% of them have, as a follow-up, created a company or foundation. To build on that point, we can extrapolate the data and estimate that the NGI so far, in those five years, has probably generated around 80 companies, organizations or foundations. What about those companies? We asked what their business models are, and interestingly, and probably not surprisingly, the main business model, for almost 50% of them, is SaaS: hosted solutions. This is an interesting data point, because we understand that it takes away the struggle of implementing other people's open source tools and solutions, and the technical hurdles if you're not a techie. So reuse is also important: how are these tools reused, and how are they made available to end users and a wider consumer base? Another very interesting data point concerns the developers. What about the developers? What about the open source community? Well, the NGI definitely attracted the community and the talent.
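The "around 80 companies" figure above is a straightforward extrapolation; a minimal sketch, assuming the survey's 8% rate applies uniformly across the roughly 1,000 funded projects:

```python
# Extrapolating the survey result to the whole NGI portfolio.
# Assumption: the 8% "created a company or foundation" follow-up rate
# observed in the survey holds across all ~1,000 funded projects.
total_projects = 1000
followup_org_rate = 0.08

estimated_organizations = total_projects * followup_org_rate
print(round(estimated_organizations))  # prints 80
```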
In that sense, we were very happy to see some of these data points, because the bet was to actually reach into the existing community and inject the funding into the one that's actually making the internet run, to let it pursue its maintenance, its updates and its innovation. In terms of community, we have 40% of projects that have a small community; we have 15% of projects that have a large community, which we call over 50 people; and we also have six of these projects that are running in a community larger than a thousand people. If we extrapolate that data, we can estimate that 80,000 people are actually contributing to the NGI through code, testing or bug reporting. This is a very important data point for us. It's also interesting to see how the NGI solutions are channeled through the open source community via the mainstream distros; this means that the user base is also expanding, and we're looking at how to make sure that these solutions are reused. And last but not least, I think one of the main lessons is the notion of agility of funding. The mechanism that Jean-Luc explained at the beginning of this session is about funding innovation, and that was also one of the recommendations of our study: how to do this. You have these notions, of course, of failing fast and of incremental funding when you fund innovation, and the fact that three quarters of the projects pulled through means that a quarter of them are recognized as failing fast, and that's fine when you fund innovation. My take today, from this morning's discussions, which were really interesting: just as we managed here to build an agile funding scheme for the open source community, the European Commission and NLnet have achieved, with the CRA discussions, almost an agile policy-making approach. And I think there are lessons to be learned from that; I think there is value in iterating, in iterations.
And I think the open source community has taught a lot as an unorganized organization that you can't really tap into formally, but that somehow always finds a way to innovate and always finds a way to get its messages through. So for me, this was a really insightful morning, and I'm really happy to have presented these data points with you today. Thank you. Thank you. Yeah, let's hope we can continue with these interesting conversations. With this, I'm going to hand over to Kiran, who is over there.
Public services interoperability: [begin workshop] Free/open source and Interoperable European public services
Okay, good afternoon everybody. We're going to start our next session. Everybody has found a seat, so we know how many people can come in; that's up to the people at the door. My name is Gijs Hillenius. I work with the European Commission Open Source Programme Office. This is for the recording. Okay, I will amplify myself. I will quickly run through the ground rules, which you have seen already. We would like to make sure that everybody gets a chance to speak; Lina and I will be handing around the microphone. This is for the people online. We want to focus on finding solutions for the problems that are coming to you soon. This is the third workshop; the title is Public Services Interoperability. It is made up of two parts: we will first discuss the Interoperable Europe Act, and we will then have a presentation on the Commission's open source strategy. The reporter is Axel, who I think is still outside, so somebody should get him in. And with that, we are almost ready for our first session. Welcome. Okay, so hello everyone. Just a very quick introduction. My name is Lina Ceballos. I work on policy at the Free Software Foundation Europe. I just want to give a little bit of an introduction to this session. If you have been here before, you have seen that we are trying out different formats: workshops, fishbowls. This format we imagined more as a discussion, like what we have had, so you don't need to move, you just need to raise your hand and I will bring the microphone to you. I also wanted this to be not technically a Q&A, but more like: let's chat and let's try to find common ground.
Public services interoperability: The Interoperable Europe Act; the challenges and opportunities for the free and open source communities.
Yeah, so about the topic. This session is on digital public services and interoperability; there will be a session later on that is not focused on digital public services. Because we're talking about this, we're going to have two parts. One focuses on the Interoperable Europe Act, to serve as an example of what's happening at this moment when it comes to digital public services. This is an act that happened last year, and I personally have to say I had the feeling it was a little bit overlooked: not many people paid attention to it, and I think it is a very crucial piece of regulation, because we're talking about interoperability in digital public services. So, to start the discussion, let's talk about what interoperability is. The definition is the ability of information systems to speak with one another, and now we want to make use of this capability to have public administrations deliver public services. This can have so many examples in practice.
One example that I always like to give: imagine you're going on a road trip from here to Paris; you go with your car, you want to park on the street, you get to the machine, you want to enter your car's plate number, and it turns out it doesn't recognize your plate because it's from Belgium, so you end up in another country where you cannot park. So we're talking about things that affect us all in the EU. It has to do with freedom of movement, but also with education, with health; there are so many aspects that are important when we're talking about interoperability and digital public services. And of course, for these to work, we're talking about critical infrastructure, and this is where free software plays a huge role. That is the reason why we were trying to be active last year, and are still trying to be active, to make decision makers understand the role that free software and open standards have in this regulation. I also have to say that the Commission proposal already acknowledged this; the Commission proposal already came with some of this wording. And there were different points where we were trying to get active. I guess we're going to learn more about this from Issa, from Calvin, from the Commission. But there is a very interesting inclusion of a governance structure; we will learn more about it, and we were trying to push to be there, because that's where the decisions are going to be made. And so on. And of course, I don't even know why I had this slide, but anyway. Then, in the second part, we're going to focus on the European Commission's efforts in the direction of free software, and for this we're going to have a discussion with Gijs and with his colleague, also from the European Commission OSPO.
We're going to discuss all these efforts, because some of them go back years; just to give examples: the open source strategy, the decision that came after it, and the code repository that the European Commission now runs. So we're going to learn what has been done. But what I want us to keep in mind, and I think along the day it has been a very fruitful discussion on how we can get engaged and how we can interact with decision makers, is that it is also important that we keep an eye on implementation. Of course it is important to advocate for free software with decision makers when it is possible. But once the text is done, such as with the Interoperable Europe Act, we had the opportunity, we did what we could; we will discuss this during the session. Now we have this piece of regulation, and we need to make sure that it is implemented. Let's make use of these words that we struggled and fought so much to put in there, and let's help them as well to figure out how to implement this. So let's look at the future. Let's look at examples. Let's see what has worked well and what hasn't. And let's try to find ways where we can all collaborate and make use of all these efforts that are happening at the moment. So, yeah, again, the structure: we're going to have this presentation, and of course we're going to open the floor for questions, comments and so on. Remember that you have to wait for the microphone, for the live stream. And it would be nice if you could introduce yourself, your affiliation and so on, so decision makers also know who they're talking to. I think that's all from my side. I really hope we can learn a lot from this conversation. And again, let's try to keep an eye on the future and how we can monitor the implementation of all these amazing regulations happening. 
And yeah, without any further ado, I'll hand over the floor to Issa. I'll bring this. Oh, cool. You got it. Okay. No worries. Yeah, thank you very much, Lina. And thank you very much, thanks for the slides. Issa is with us on behalf of her unit; she's one of my colleagues in the Commission, in DIGIT B2, our recent new unit. And Issa is also very well informed on open source. I'm happy that she's here. Thanks. And me too, I'm very happy and thankful to be here, because I think it's a great opportunity. I'm also very thankful to Lina for introducing the act as something that is important and should get more attention. Because I'm completely unbiased after working four years on this. But of course, I also believe that this is an important piece. For now, though, it's a piece of paper, and it will only come to life when it is filled with life, and it will also be on you to fill it with life. I think there are some opportunities in there, and this is what I'm going to focus on today. And then I'm looking forward to the discussion and to your ideas. So let's get started. What's this act about? The first, maybe disappointing, news for you: it's not an open source law. And it's not that we can just say now, yeah, it's clear that it's not an open source law; this is something that has actually been discussed, from the very early start, when we did the impact assessment: should we have a law on GovTech, on open GovTech? But then we saw that there was no majority for this path. So the interoperability policy evolved into the Interoperable Europe Act. And maybe just quickly, why no majority: this actually came from both sides. From those who are very much fans of open source but said, if we want to do an open source law, it should be made in a different way and shouldn't build on this interoperability work. Then it's too linked, too much in one direction. 
We should do it properly; let's not make an open source law through the back door. Those were the fans of open source who were against making this an open source law. And then there were the ones that said, yeah, but we are not there yet in the public sector; this might hinder the digitalization of the public sector. So we put something in the middle. That's the compromise that we found, and this is what I'm going to present now. What was the main objective of the act? It was to help EU and member state administrations deliver connected digital services to citizens and businesses across Europe. I think I don't need to tell you that open source can contribute to this; I just strongly believe it. And I also don't need to talk to you about how interoperability is linked with this objective and with open source. What is maybe interesting is that European interoperability policy has always focused not only on technical interoperability, but has four dimensions, adding the legal, semantic and organizational ones. In the negotiations we talked a lot about how we can actually build stronger mechanisms to help legal and organizational interoperability too, and you might find this again when I now go into the components of the act. The act is structured around four components: a structured and co-owned EU cooperation governance, mandatory interoperability assessments, recognized and reusable interoperability solutions, and strengthened interoperability support. Now I'm going to tell you more about what we mean with these keywords. I start with the solutions. One of the mechanisms that will be established involves the public administrations in Europe; they are the ones the act is actually addressed to, from EU level down to member state local level. 
All the binding requirements in the act go towards these public administrations; they don't go towards private parties. And those administrations will decide together on certain solutions, where "solution" is a very broad term: a solution can even be a framework, an architecture, or guidelines, and can of course be software. The definition in the act is quite broad. The public sector will agree, in this interoperability governance, that certain interoperability solutions become Interoperable Europe solutions. Nobody will be obliged to reuse them, but everybody will at least need to look into the reuse of such solutions. And this, I think, is an opportunity for the open source community that wants to know where the EU public sector is moving: looking into the Interoperable Europe solutions might help them see, okay, if this is in the catalogue, then if I build my software so that it fits with those Interoperable Europe solutions, it might be easier for public administrations to reuse it. Another thing around the solutions that we've managed to get into the law is that sharing is the default. For now, between public administrations, sharing is not the default; proprietary solutions are the default. And if you want to share a solution, then the IT people go to the lawyers of this world, and I'm one of them, so I can talk about them, and they will ask those people: can we share? There's somebody else who wants to reuse it. And then the lawyers come and say, yeah, but this is problematic, and there might be this and this problem. So we hope that by simply putting into law that, normally, if somebody asks you, you have to share, the lawyers will instead need to argue why you can't share. 
So we hope that changing the default might already be a push in the right direction, making public administrations, which are and also have to be very risk-averse, more friendly towards taking this risk and being brave. And what came in during the negotiations, and that's what Lina also talked about, is that there is also a small provision about giving priority to open source solutions when you are choosing interoperability solutions. This is something that is very new in the text, and public administrations will need guidance and help on how to implement it. This is also, I think, where the community has an opportunity to help and guide them on how we can really make this work and put it into practice. As a second component, there is the governance. The governance is composed, as Lina already said, of the board and the community. With the community, we were thinking of you as part of it, and of course also the community of local public administrations. In the board, actually, only the member states are represented, but IT is sometimes very federal and scattered across all levels of public administration, so the community is really about putting many different actors in the field together, giving structured input, and channeling this input, maybe also with digital platforms and digital tools, towards the board, which takes the decisions, so there is always a sounding mechanism towards the board. Another thing that I hadn't seen when we were drafting the law, but that open source people I talked with said was interesting for them, are the clear points of contact: with this law, we now have responsible contacts at member state level and in the EU institutions who will need to implement the law. So when you have anything that relates to the law, you know who to reach out to, and this is something that is actually important for the community. 
The third component, the mandatory interoperability assessments: I think this links back to the discussion I heard here before, in the fishbowl, where you said it needs the right processes, and how do we get into the policy making? It's very hard to bring the two worlds together, actually. The interoperability assessments mean that when the public sector is setting any binding requirements for its digital public services in the future, before taking a decision on these binding requirements, it will need to carry out and publish a mandatory interoperability assessment. I think this interoperability assessment report will create a lot of transparency for the community and actually help them engage in policy discussions afterwards, because the policy text is already translated into concrete requirements. They might not always be technical requirements; if you look at the law, they might be requirements at the business level, and now we are looking into how to really make this work in practice. But I think it's a very interesting tool, and that's why we put it in the law: because it's about legal interoperability, bringing IT and policy together, and helping to have this conversation early in the process, not when the law is already written and you sit there with requirements that are not implemented in a user-friendly, citizen-friendly, engaging way. So that's definitely something we need to talk about in the community and develop in the coming months. 
The fourth component, which is also in the law, is around GovTech cooperation; there are innovation projects, and it's the first time that there is a definition of GovTech in the law. Even if there are no strong legal requirements around it, I think this recognition is an important step, because it helps to argue that this is important: it's in a law. Then we have a spot on regulatory sandboxes. For me, they too are an enabler of legal interoperability. They have to involve the GovTech actors, and they create a way, when you do innovative stuff, to have your legal questions channeled towards the EU level and have the EU regulators already pre-discuss how we can tackle this, and how we can have this innovation rolled out in a legally safe environment. The last point is upskilling of the public sector. This might sound a bit vague and lame as a thing to have in the act, because it's not something where the police can come and say: you have to upskill. But it is mentioned in the act that public administrations have to skill their staff on interoperability, and I think you can't skill them on interoperability without talking about openness, without talking about open code. So I think a lot of these trainings will be created in the coming years; together we can then look into what messages we can push, and afterwards you can talk in the public sector to people who know about open source and are not afraid of open source. So for me this is an opportunity. 
And one opportunity that I don't have on the slide is that this act is going to be evaluated. So for everything that is not in it today: if we get the right questions into this evaluation, it might be that in the next version this can also get into the text. This is the entry point, and we will see where it takes us, but it also needs early involvement; it needs the community to reach out as soon as possible and say: these are our ideas for where it should go, and these are the questions we should ask. So, as I said, now maybe you understand why I think it's very relevant that we are here and that we start this conversation. Looking forward to the discussion. Thank you.
Public services interoperability: workshop Interoperable Europe Act
Public services interoperability: Open Source efforts in and around the European Commission; and how about a next EC open source strategy
Let's start our second session; let's now focus even more on the European Commission and its efforts in the direction of open source. We have two experts from the OSPO here to tell us a little about what has been done. And again, let's try to remember to look to the future, see how everything has been done, what we can learn from it, and let's also try to bring up new ideas on how we can make this work. So, all yours. Thank you very much. We also have stickers; I think that's important. Please pick them up as you leave the room, when you walk away thinking, oh, this is not going where we wanted it to go. There's a flyer, and outside you already saw we have a roll-up to make a bit of advertisement for the OSPO. So: Gijs Hillenius, I'm Dutch. Saranjit Arora, I'm confused: Indian originally, but I have been in the UK for 40 years and in Belgium for the last six. And the two of us will try to do a stand-up comedy show that will tell you a little and bring you up to speed on what is happening in open source in and around the Commission. We've been here already for quite a few hours and you've heard a lot about the policy developments. The OSPO was started three years ago to remove legal and organizational barriers so that the Commission could become faster at doing and sharing open source. And what this first slide, our staircase diagram, should show you is that we've been in open source forever. He's right behind the camera now; he's one of those who worked at the Commission and kick-started what is now the Apache project. That's before this timeline started. And the Commission has always used open source in its infrastructure, right from the start. I spoke to the people who installed the first LAMP stack, who said: okay, that's where we kept notes on which server was doing what, what the passwords were and who was doing things, so we could hand it over when we rotated around the organization. 
And pretty soon other DGs (because Saranjit and I are talking from DIGIT, the Directorate-General for Informatics, which acts like an internal service provider), pretty soon other services, the ones who are doing policy or putting things in place, were calling DIGIT saying, I need a server: I have a pesticide database system, it ran on a LAMP stack. And so the second layer, the use layer, really exploded quite fast. And so in our data centers, I think the numbers are 60, 70 percent these days are Linux machines, replacing other Unix types by the way. And then around 2003, 2006, in our Directorate-General people started realizing we need to work with the member states, because this stuff is reusable, it's going to save us pots and pots of money. And the interoperability program was started that led to the Interoperable Europe Act. Thank you very much.
I've been trying to observe this change in the Commission ever since 2007, when I started writing for the European Commission on the open source observatory. And so I've seen it going down this path. And I asked the Commission if I could help revamp the open source strategy. In 2020, we announced an open source strategy that came with 10 action plans. They are in the strategy; the strategy is open, it's out there. Except that we never published the details, that was far too much. The shopping list was like this. And the first, being recursive: start with yourself, set up an open source programme office. That was really quite easy. But we had two big barriers. The first was that the Commission had a big problem in making its software available as open source, because we had a red-tape process that would take six months of a project person just chasing signatures to get permission to make the software available as open source. In the room we have one of the IP specialists who helped us a great deal with removing this barrier. We now have the default that if you want to share your software, it is open source. If you want to share it as proprietary, you go into this paperwork process that will take you six months, which means no project will do this, because we don't have time. No development project has a project developer that can spend 10 minutes every Friday afternoon chasing signatures. Now when the project team says this is a useful software solution, it shares it as open source. That came with code.europa.eu, because we needed a place to share our software. And the other big thing is that we moved our internal teams to sharing software amongst themselves. Previously, when you created a project on our internal repository, it was closed to you and your team, which means that I as an external person, not in that team, could not see what the code was.
I didn't have access; I only saw the name, and with a bit of luck you could find the project owner and ask for permission to see the code. We've changed this default and we already see the rewards, because teams are now starting to reuse each other's work. Not so much literally the code, but they're starting to build on each other's preparations. Open sourcing: I mentioned it, I talked about it already. We have 400 projects, 2,000 users. It's growing. You're all welcome at code.europa.eu. It's GitLab-based. You need to log in, but that's just a small step. The legal barrier was removed; I talked about that. We have labs, an important way to onboard new open source tools for our users. So people from other services can come to us and say, you know, we would like to experiment with HedgeDoc, for example. Well, now we can do that. Here are the ones we're currently running: Jitsi, OpenTalk, HedgeDoc, CryptPad, Discourse. And we're always looking for others. Outreach is what we're doing here and in other places. We successfully integrated open source in our IT governance, which means that when a new project is proposed, we are asking questions about open source. So we are forcing projects, from the start, to compare and look on the market to see if there's an open source alternative to their proprietary product, and whether they are willing to share whatever they're doing as code. And I will hand over to Saranjit for the important pilot and preparatory action. Thank you. So I joined, as I said, six years ago on this project, EU-FOSSA 2. So we have this cycle: we may start some initiative, the European Parliament gives us some money, and we start what's called a pilot project. If that's successful, we move on to what's called a preparatory action project. So hence the number two after the initiative. And in the EU-FOSSA 2 initiative, we actually created an inventory of our open source. We didn't have one.
Not ashamed of that, because I don't think many organizations have one even today. And it's a fast-moving, changing scenario. And then from that software inventory, we were able to identify which was our most important software. And then we decided to run bug bounties and hackathons to improve the security. So a lot of things were done under EU-FOSSA, many for the first time, like bug bounties and hackathons. And then we thought, why are we not doing this on a European scale? Really, we need to cooperate on open source and bridge these islands that exist. And actually it's the right time, because open source has matured to a very high degree in many member states. So it makes sense to cooperate, and on a number of areas. So basically, in terms of specific initiatives: FOSSEPS is about Free and Open Source Software for European Public Services, and it's about cooperation. Now breaking that down into specific work packages, one of them is to create a catalog of solutions. So the French government, the Italian government, all member states have built wonderful open source systems and solutions. Why are we not reusing them? Often we are not reusing them in our own countries, let alone across Europe. So giving visibility to solutions built by and for public administrations is a wonderful idea. And we've had great successes with national catalogs and regional catalogs, saving lots of money and time and increasing interoperability. So with FOSSEPS, we already have an MVP and we're going to expand it to many more member states. And hopefully we'll have a rich catalog which will save time and money. Can I steal back? Yeah, please. Yeah, because we're being flagged for going over time in our own session. I just wanted to round off with one thing: we are trying to prove that we can reuse software. So the catalog system is actually thanks to the Italian government's Developers Italia team. We're starting to reuse their tool, and the publiccode.yml standard is also developed by them and others.
And this is also something we're implementing. And there's the author, there we go. So thanks for that; I think you deserve the credit for this. And we are now in the process of launching our internal efforts to revamp the strategy once more. Because in the staircase diagram, the first ones were internal strategies; the last one, the most recent one, was an externally communicated internal strategy. And we're going into this process again. And so the next round here is where we would like to discuss with you: we would like to have ideas from you, we would like to know what else we can do as an OSPO. And I think the whole day so far has given us a lot of stuff to work with. But that's what we'll do in the next session.
Public services interoperability: workshop Open Source strategy at the European Commission
Okay, yeah, perfect. Thank you very much for the time and for the insights here. So yeah, I mean, I guess some of you were already familiar with the strategy and so on. But now that we know what has been done, I guess you guys have some questions and comments. And just to keep it a custom practice now, I would like to kick it off with a question, actually. And I mean, since I read the strategy, I was kind of struggling with this whole thing of inner source. So I really wonder, now that the strategy is almost over and you're kind of renewing it, if you think this is the direction to go. First of all, what do you guys mean with inner source, to put everybody on the same page? And also, is this actually the direction the Commission should be going, or are you thinking about something else? Sorry, I forgot the very important microphone. Inner source is really maybe not the best label for this thing. But the goal was to make our software development ready; the Commission has like 6,000 software developers working for it at any given time, doing all kinds of projects, and some of them were in the room. And what we wanted to achieve is that, on the road to making the whole thing as open source as possible, we needed a first step. We needed to have the existing projects, of which there are many, realize that they needed to get ready for reuse. And you can't go to a project that's been running for 15 years or for five years and say, oh yeah, by the way, we're going to make your stuff available next year. Because these guys are like, whoa, but it's full of passwords. It's full of internal references to machines and stuff. So the code has to be matured.
And so we've gone through bug bounties and hackathons internally to make sure that projects were ready to move from being, you know, in the basement with just these five people that have been working on it for 10 years, to being shared with the other colleagues, with the other professionals whom we can rely on not to abuse the information that is often in these systems. And the other thing is that we're making it as easy as we can for projects to go open source, but we're not forcing them. If there are reasons within the team, within the DG, within the service to say, this is so niche, why should it be open source? Or this is so secretive, why should it go open source? We're not going to force them. But we already see a lot of projects going open source, because this is where most of the developers want them to go. And they'll talk to their project managers to say, we should go from inner source to code.europa.eu. So the numbers that were on the slides are very good; give us another decade. Maybe that's too slow. It might go faster. All right. Yeah. Hi there. My name is Romer Adams. I'm working at Red Hat, building a solution for the public sector. Two quick comments. First of all, great that you discovered that there are responsibilities with the CRA and the PLD from a code perspective. So that's good. The second part is that inner source, taking this into consideration, is one of the best things you can do when you are in a very disconnected environment and you cannot share anything with the outside from a security perspective, because you can apply all the constructs of open source to build software. And at some point, when you clean everything up, you can share that with the other member states. The last part is more of a question. When we're thinking about the open source catalog that you have built, I guess we're speaking also about blueprints.
Will there be any incentive for member states to get those blueprints treated as first-class citizens in tenders, versus trying to build something from scratch? Hot potato. Thank you. Good question. So I think the idea of reuse is warmly welcomed by almost everyone. Why would you build something twice if it's already working in municipality A? Unless it's not good enough; then you could amend it, you could improve it, and that's the idea of an open source solution. Alluding to your question about making it mandatory or incentivizing: it's not for the Commission to do that. I think open source is by its very nature a very cooperative, open culture. So we are hoping that procurement will come on board and say, why do you want to build this? Show me that this doesn't already exist in the European catalog. So I think it'll happen eventually, because it just makes financial sense. So I think that's where the solutions will become blueprints. Hi, my name is Dali Boran. I work for the Commission services, so I'm a user. But I didn't work there all my life; I also worked as a developer in industry and in education, and coming back after several years, I saw great progress in open source software on my computer. So thank you for that. On the other hand, as a practical user, I have problems with it, because I've seen things moving in another direction. We are forced now to use Office 365, as you know, then MS Teams for internal and external communication, not to speak about the applications for some jobs where you can fill in the form only using Microsoft Windows, or some testing for EPSO, which can likewise only be done on Microsoft Windows computers. So I cannot do anything about it. I mean, I've been trying to use the available open source software as much as possible, LibreOffice and even some other solutions. But I think you might.
So I urge you, if not to look for some long-term or radical solutions, at least not to allow this regression of the current situation, which I see coming in six months to one year. Well, it depends. I mean, internally, what do you use for your team communication? Please answer that question. You need the microphone as well. We use everything, including the tools that come with the machine when we switch it on. But we also use all the alternatives. And if you were a software developer for the Commission, then the good news is you can have a Linux laptop at your disposal, running Ubuntu mostly. And yeah, we see that pain point, but I can already tell you that the OSPO is not there to make that switch. What we can do is translate the demand from our users, which includes you, to say: look, there is a need for an alternative. And there are many teams that interact with the outside world that cannot use MS Teams. There are many teams and projects of people working on legislation who are, in fact, asking us: do you have anything? Because I cannot use Office 3-whatever; it won't work. And so with them, we're trying to figure out what's the next thing we put in our lab, and how we can make it accessible to legislators in the member states so that they can work on the next generation of laws. So we're trying to find ways to do this. I wish we could show a lot of progress. I think it'll take a bit of time because, yeah, it's entrenched. Yeah. So I think the first thing to say is that we absolutely acknowledge what you're saying. That's the truth, right? Now, the question is how do we solve it? It's a collective 'we'. As Gijs outlined, as the OSPO we are doing things like using the labs and some of the other mechanisms, right? We're pushing open source. So there's another movement, right? Open source is expanding throughout the Commission, and so is awareness of its benefits.
And yet there is an entrenched Microsoft culture. So I think it's just a question of time, and a question of continuing to pass on the messages that you're mentioning. And we do that as the OSPO. So I think that's the answer. Sorry. It has to change. I agree. The best moment to start that change is now, and we're trying to bring those experiments. Exactly. We're doing what we can. I'm sorry. I would say let's go for a question and then maybe you can comment. So, well, thank you. But unfortunately, the regression is going in the other direction. And unfortunately, it consumes more work time, it consumes more money, bandwidth, energy, so it's not compatible with the Green Deal. And also, please think about this: if we had, you know, an open source working environment for all the administrations in Europe and for the politicians, maybe they would think differently and then support open source more. Okay. I agree. And then I'll take a comment after that. No, no. Let's say it's a question here. Okay. Yeah. My name is Ingen and I'm just a citizen. I'm trying to put this question somewhere; I'm not sure whether it fits. While we were busy with the CRA, another regulation sailed by which maybe implies a problem for open source, which is the European ID. So eIDAS. How can we assure that this European ID will work on open source systems of all kinds? So on free smartphones, free Androids, on Sailfish, on Linux smartphones, on Linux desktops and everywhere. And how can we ensure that the ID application itself is open source, which I think is mandatory for a fundamental thing like this? You want to leave the room now? No, I can give part of the answer. And I'm going to loop in a few other things, if you don't mind, because there's so much going on. And as I was trying to say, there are good developments happening in Europe.
And I want to first acknowledge there's a circle on this diagram, and I'll put it back on screen: these are the OSPOs in the member states. And four of them are in the room; five of them a few minutes ago. And we're working with these peers to figure out, A, the citizen interaction. It's clear that this is a goal for us, a task for us, that we didn't identify three, five years ago. And B, the Commission is doing a lot of open source reference implementations. So when there is a big regulation like this, like eIDAS, for example, there is an open source reference implementation. I do not know if that one would immediately run on your mobile phone, but it will run on a Linux machine; that's what it was made on, that's what it was made for. And this is also how we help the member states test their implementations, because they can either use the reference implementation and install that, or they can test theirs against our reference implementation. And we do this with many other tools. So you'd be surprised to find Commission-commissioned, sorry for the strange formulation there, software built for the Commission, either by software developers working directly for the Commission or through companies in the industry everywhere. If you do an import/export declaration these days, you're touching software that is made available as open source and built through the European Commission. If you're in the steel industry, you're most likely doing things with open source commissioned by the Commission. If you're doing e-signatures, you're most likely using open source libraries that were developed as open source through the Commission. So this is one of the good things that is happening. And then I would just like to point out that the OSPOs are doing good stuff in the member states too, and we're trying to build a network that reinforces itself. So let's take two more questions, because we're running out of time.
Can we get the OSPOs to stand up, just so we know who they are? Thank you for the great talk. We spoke a lot about infrastructure and tools, which are very important. But is there a Europe-wide plan to somehow replace Mastercard and Visa? These are essentially payment clearing systems, and they are leaking hundreds of billions of euros' worth of economic value out of the system. If there could be an open source, trustworthy solution built for it, which every bank issues to retail customers, that would be great. Wonderful. I think that's the answer: we're not the right people, and these are very large initiatives. There are many, many large things that could be done. We are working at the grassroots level in terms of open source: propagation of open source internally, connecting with OSPOs externally, and doing projects like FOSSEPS. And check out the NGI funding framework, because there are super interesting projects on this topic too. Okay, so one last question. Thank you. Hi, Paolo Vecchi, board member of The Document Foundation. Well, I would like to say thank you, guys, to the OSPO team at the European Commission, which is now 150 people? I think most of it is in this room. Well, a handful of it, at least. Because I've actually seen the progression of what has been going on lately, and you did a very good job. In a way, you said, I think you commented, that it's probably not your job yet, maybe, to also promote some open source platforms within the European Commission, or at least the parts the users will mostly see. So we have the example of LibreOffice, or maybe also a Linux desktop, or something like that. I suppose that is probably going to become one of your tasks later on, maybe when you're going to be more structured.
But in a way, I hope that we're going to get to a point where the European Commission is actually going to be the example for the rest of Europe, where an organization that manages enough people to fill a small town will show how to do things, how to implement open source, so that from small towns to nations, everyone is going to be able to do it quite easily. And another thing regarding, well, something that I'm very biased about, LibreOffice: there's been a bit of an effort, a lot of effort, in trying to get LibreOffice updated in the application catalog of the European Commission. It is there, at least, so in theory you should be able to install it quite easily. And it would be nice to have more feedback and see how many other open source applications you would like to see on your desktop, so that the rest of the community can say: hey, guys at the European Commission, can you please add this one, because it actually makes sense and it helps other people switch to open source. That is a collective effort. Thanks. Okay, perfect. So any comments on that? Should we wrap up? Yeah, we can wrap up. Okay. I'd just like to tell you that LibreOffice is on the Commission's laptops, so... Yeah, I think the general answer is the small army that can fill a town is already using open source, right? That's why open source is on our list of software to use, officially. So there's a lot more usage than might be visible, right? But sometimes the external tools like Teams and things and Word... we have to connect with everybody in the Commission, right? So it's increasing. Thank you. Okay, perfect. I'm sorry, we need the microphone. Okay, so now we're going to have a wrap-up from the rapporteur on the insights and inputs that we got. Thank you very much, everybody. But let's take around five minutes. Only five minutes and then you can go.
Public services interoperability: rapporteur playback
Thank you. That's all right. Okay. Oh, yeah. I'm sorry. Okay. All yours. I'll do a gesture thingy. Hello. Okay. So for the few of you that might already have forgotten what we heard about in the last two hours, I'll just do a quick wrap-up. But first of all, thank you all for putting forward so many questions. We're very happy to see so many people interested in the public sector, and having so much interest in what our speakers presented. I think maybe some of the most interesting comments were first on the presentation of Issa before that, which related clearly to the question of the implementation of this act, and how to be part of this question of the interoperability assessment. Now, it was also made quite clear that this legislation is not an open source law, and that there was a lot of pushback against having it made an open source law instead of having it just as an Interoperable Europe Act. A lot of you also raised the question of open documents, of open formats, and how this will be integrated into this regulation, which was very nice to hear. And on the question of standardization as well, it's been quite clear that the board will be one of the main actors in the implementation. As for the second presentation and the question of the open source strategy, you raised the question of inner source and how to move away from inner source. I think it was also quite clear that it was a first step and that it was really helpful.
And on the question of incentivizing or having mandatory use of open source: the strategy used was clearly to try to understand how open source works and why people adopt it, and to avoid having a counter-effect by making it too mandatory or too strong. Yeah, and maybe just to finish, one last point. Oh, yeah, thank you. One last point was really interesting, on the question of eIDAS, thank you for bringing that up, and other regulations like this. So we talked about very specific policy papers and regulations today, but there's a lot of regulation at the EU level that concerns open source. And I think it was really interesting to learn how it can actually be shaped by these reference implementations and so on. So, yeah, thank you all. I will write a better report about that later on. And good luck to those who stay. We will start, I think, in a few minutes. So, yeah, thank you. Thank you.
Digital Services Interoperability: Intertwining EU telecom law, the DMA, internet devices and Free Software
Thank you. My name is Niko Rikken and I'll give it to Lucas to start off. Hello everybody and welcome. I'm very glad to be here. Thank you to the organizers for inviting us to this talk. I'm very happy to see that there is a lot of interest in the DMA, because we have been working on this for some time, and I would like to contextualize all the problems that we already started hearing about here, about interoperability, about security, about having access to infrastructure, from the telecom perspective. Because together with Niko I would like to share our experience of advocacy on routers. And I think that contextualizing this example of routers and router freedom in Europe can help us understand a little bit better what awaits us when we start dealing with smartphones from the DMA perspective. So let's talk about end-user control of devices, this contextualization of the DMA and telecom, and then I will give the word to Niko so he can tell us about our experience with router freedom. So the first question I would like to ask us: do we have control of our devices? Devices are becoming ubiquitous, we are using them for everything in life, but I have the feeling that we are losing control over our devices. We cannot change the battery, we cannot uninstall programs, we cannot even install programs. Today people call it sideloading, but on a laptop we don't call that sideloading, because we are just downloading and installing. But well, big tech now tells us that if we want to install something outside the app store, we need to sideload. This is not good for software freedom, but let's talk about that. So I think that we are losing a little bit of control of our devices, and here are some key aspects of gatekeeper control of our devices. They are imposing online accounts on us: if we want to use our devices, they say first you need to create an account with me.
And I think when I bought my Android phone, the first thing shown on the screen was: you need to create an online account. Then when we use our smartphones we are already trapped in vendor lock-in, because we have no access to third-party repositories and app stores. And this is really key, because these repositories are where we can find apps and exercise our software freedom in order to populate our devices with our software. And last but not least, we are not even free to uninstall software that comes with our devices. And we see sometimes on Android devices or iOS devices that there is a list of apps there that is draining our battery and doing stuff; but it's proprietary and we don't have access to the source code, so we have absolutely no clue what is happening there. So based on these facts, I think that we are losing control of our devices. And therefore, some questions that I would like to put to the audience, and that perhaps in the coming moments of the workshop we need to answer: how can we re-empower users to have control of our devices? We already heard that the DMA is a very important piece of that, and I believe so; I think the DMA is crucial, but we need to go further. First, we need to recognize that device ecosystems are mostly proprietary. The two largest smartphone operating systems, Android and iOS, are proprietary. And since they are so large, we call them gatekeepers due to their monopolistic power over termination bottlenecks. So basically, everything that we need to do with these devices, we need to go through this company. That's why we call it a monopoly over the key features of these devices, such as, for example, operating systems, browsers and app stores.
And of course, as we heard, since they have this power over devices they can hinder interoperability: they exercise tight control of APIs, apply the proprietary standards that we heard about today, hamper functionality, and block access to drivers and hardware. So in the FSFE, the Free Software Foundation Europe, we have been working on a concept that we call device neutrality. With this we want to re-empower users and give control over devices back to them, through software freedom, eliminating vendor lock-in, and giving end users control over data. And last but not least, since I'm a lawyer, I would like to quickly point to what is happening nowadays in the EU. So for 10 years we have had the Open Internet Regulation, also called the net neutrality regulation. And this regulation has very clear rules for internet access devices: it applies to routers, modems and other internet access equipment. Then in 2018 there was a reform of the telecom law in Europe, called the European Electronic Communications Code. And it implemented some rules for operating systems and apps and also for network operators. But now comes the DMA, with rules for devices, operating systems and apps. And in order to contextualize all these challenges, I would like to give the word now to Niko, so we can learn a little bit how the DMA can connect to that from the perspective of routers. Okay, as a case study, yeah. Router freedom. Well, my wife and I were really excited: we got our first house, we were moving in together, so we had to prepare for the move, and one of the things we had to do was get an internet contract. And we didn't think much of it; we were doing some comparisons and said okay, we'll take this contract. It was some all-in-one provider, and you could recognize it from the box: it was an all-in-one box for TV, telephony and internet.
But besides being a box that did everything, it was a modem and a router, and it did so badly. It failed quite a few times; at a certain point it failed entirely, and we had to wait three days without any of the services to get a replacement. After another failure, getting dropped out of an internet call, I said: okay, this is it, I'm going to get a router myself, so I know I can trust it and it's reliable. But I found out that this internet provider didn't really support that. That was odd to me, because previously, when we were on a telephone network connection, they were even advertising that you could use your own router and modem, and some of my friends were doing so. Also, if you would call them for support, they would ask: are you using your own modem? They would just assume it was something you could do. But not with this provider. So that was when I learned about router freedom, and of course there are a lot of benefits, here on the slide: some personal, but also some in the grander scheme of things, like competition and creating a healthy ecosystem of devices. Now, internet providers put up quite some barriers to prevent you from exercising router freedom, the ability to choose your own modem and router. Some of them actually have some technical merit. For example, the telephone network is laid out differently than a coaxial network. If your modem is doing bad things on the coax network, where the lines are shared, you might interfere with the devices of your neighbors. But that, of course, is why we have standards. If your device is compliant with the standard, you can get one from the store, plug it in, and it will work, and there's no reason to deny you those freedoms. Actually, one of my FSFE friends said: oh, it works just fine, I have these and these devices running at some friends' places, I can really recommend using your own router and getting the freedom you want. So it wasn't really a technical barrier. 
Now, at the FSFE we've been at this for 10 years, as Lukas said, and we keep an interactive map tracking all the member states, and work with regulators to ensure that router freedom is actually achieved. One thing, in 2015, is that we got the EU net neutrality regulation, and it says end users should be able to replace or change their device. So you'd think: okay, router freedom, everything is good. But not unless the regulators regulate. That's the main thing, and that's why we have this map, to keep up with the regulators and the states. Now, about two years ago in Belgium there was a consultation about modemvrijheid, free modem choice, basically the ability to choose your own modem. There was a consultation, and I saw this and thought: okay, we have to get in on this. Even though I'm from the Netherlands, I care about this, and I can do it in Dutch. We got another volunteer, someone from Belgium, and together with Lukas we responded. We were quite alone initially, not having other parties that had the technical knowledge to go through this legislation and have the community behind them. But eventually, through a survey, we were able to actually engage the crowd, counter the arguments, and we achieved router freedom. So, shortly, here are some examples of people in our community using their own routers at home, also establishing the practice of router freedom. There's also the benefit of free software on routers, but that's something else. And myself, I'm now happily using fiber with my own router. Lukas, if you want to wrap it up. Yeah, so yes, it's a big win when we have router freedom, and we fought against the operators, who always came and said that interoperability is a problem, security is a problem. But with router freedom we could prove that this doesn't have to be a problem, and I hope that in our discussion of the DMA we can bring our experience and say that we can overcome this problem as well. Thank you very much.
EU Policy Devroom Wrap-Up
So we're going to close the session for today and for the whole day. I'll ask the devroom managers to help us organize, if they could come up as well, and then we'll let Simon close everything. So thank you, everyone, for staying right to the bitter end here. The devroom managers who have been organizing the day for you today are: Enzo, over there; Enzo was allowed to do this by clips. There's Deb here; Deb broadly did this because he wanted to. There's Heath here, from the European Union; it took him about a week to get permission from his boss to do this, so I thank him very much for that. Yeah, yeah. Let me see. We've got Alex, who is from the FSFE, not in the room at the moment. We have Axel from OpenForum Europe, who is out wandering the estate. And Martin, without whom we probably wouldn't have done it at all, right from the beginning. So thank you very much, Martin. So what are we going to do with all this? Well, the reason we've had a rapporteur in each of the four workshops is because we were told that if we want to get any traction at the European Commission, we need to give them a report. So we've taken notes in all four of the workshops, and we're going to construct a report that gives the essential feedback from each of the elements. We're going to make it look nice, and we're going to work out how to subdivide it so that it can be used in each of the directorates where it will be a useful tool. And hopefully that will be a way that there will be lasting change, and not just a great weekend at FOSDEM. I am also very grateful to Alistair Kergan from the FOSDEM organizing team, who has been our guardian angel in making this happen, and making the keynote session that we had yesterday happen. Without him, we probably wouldn't have got it, and also last year. So we're very grateful to him as well. 
And I'm very grateful to so many people: the people who are here now and the people who have been here all day, who have been so positive and encouraging and engaged so well with the European Commission staff. And I want to especially thank the European Commission staff, who have given up a weekend day, and in the case of Omar and Benjamin two weekend days, to come and meet 8,000 friends they didn't realize they had before. I'd like to encourage those of you in the Commission to treat us as your friends. We're not lobbyists; we're subject experts. So please refer to us whenever you're preparing legislation, in the way that you would refer to a subject expert. Many of us are freely available to you whenever you write. We're on Signal. And on Matrix. So, behind the scenes we've also had some support. You haven't seen anyone from the Council here today; we did try to reach out, but we didn't actually find anyone who was free this weekend. And we're grateful to the people from the Parliament who supported us. And it's really very good to have had all three parts of the trilogue present here. I've been coming to FOSDEM since 2006, and it amazes me that it's taken until last year, 2023, for this to show up at FOSDEM. But we are going to try and make sure that it remains an important instrument in creating end user agency and software freedom for people throughout Europe, going forward and in perpetuity. And with that, thank you very much. And there is a closing session in Janson.
Reimagining Personal Computing with E ink: Community Insights and Design Challenges
Hi everyone. Thank you for having me, and thank you to the organizers and volunteers who have been putting on this event and all the other ones as well; I appreciate it. So my name is Alexander Soto, or please just call me Alex. I'm the founder of Modos. Just a bit of backstory. In 2021, at the height of the pandemic, my bedroom transformed into a workspace. I was spending most of my time, from morning to night, just in front of a computer: using the computer, battling distraction, refocusing, and then trying to get back into my work. And the same tool that I have to use for work is also the same tool that I'm using for leisure, for entertainment. This is kind of what brought about the idea of trying to re-imagine what it would look like to have computing that is calm, inclusive and humane. So at Modos, we're reimagining personal computing with a focus on creating calm, inclusive and humane computing. And we're doing that by creating a high refresh rate electrophoretic display controller. In the slides that are uploaded, we have videos where you can take a look at that. I also have our prototype here, so if you see me, or if you want to try it out, I'd be happy to demo it so people can check it out. And this wouldn't be possible without the team; this is really the team that's turning this vision into a reality. Wenting, or Zephray, has led the development of our electrophoretic display controller. Brody has done amazing work on CAD and manufacturing for our paper monitor chassis. And Michael, with whom I've had many conversations about what the possible software architecture would look like. And last but not least, I want to thank the community, really. We did a community survey, and about 3,000 people filled it out. 
We had about 300 different contributors and people who are interested in joining our pilot program, over 5,000 people on our mailing list, and a special thanks to NLnet. We were working on a prototype, we were getting close, but we needed some support, and NLnet came through; we're an NLnet-sponsored project. So a really extended thank you for your support. If you haven't, please check them out; they're amazing. And a little bit about that survey. Oh, a little jump, sorry. A little bit about that survey. In the survey, multiple responses were allowed, and we had about 3,000 people fill it out. Some of the most popular use cases for why people want to use E-ink were reading, writing, coding, and in general being able to do focused work. Not too surprising, but what I learned the most from was the feedback. What I had experienced myself, being focused, being distracted, refocusing: other people shared similar concerns, as in this quote: "I lose hours and days and weeks of my life getting sucked down rabbit holes." I'm not so sure who hasn't experienced that before, in this room at least. And, you know, with entertainment and content, missing deadlines; and then other people expressed concerns about eye strain and accessibility. Other folks have been using computers since they were a young age, have tried the different solutions that are available, and still haven't really had any success. So I've read every one of the comments in all the feedback that I got, and there are some overarching patterns and themes. There's a desire for living a more balanced digital life: reducing your screen time on social media and entertainment, being able to kind of unplug and be outdoors while staying connected online. 
Being able to connect, be outdoors, and also be in a less visually stimulating environment, kind of reducing digital clutter. So that's one particular group. The other group I learned more about is people who experience eye fatigue and strain, but also very specific health issues: people experiencing myopia, epilepsy, some level of light sensitivity, headaches, migraines, traumatic brain injury. And I think one comment that's been engraved in my mind was an engineer who was completing the survey on behalf of his wife, because she has epilepsy and has tried all the existing solutions, and they weren't working. So that's one specific comment that's engraved in my mind. Kat Holmes is the author of Mismatch: How Inclusion Shapes Design, and this particular quote is one that speaks to me: all of us are temporarily able-bodied, and at some point in our lives we'll face new kinds of exclusions as we age; when we design for inclusion, we're designing for our future selves. I'm getting older, I'm neurodivergent, I have a host of other health issues that I'm aware of and unaware of. This quote, and working with Modos and the work that we've been doing, spoke to that, for me at least. And overall, I think there's a need for creating technology that satisfies our essential needs while protecting our well-being. How can we redefine the role of our digital devices to foster a healthier and more balanced life? And at least the vision is: can we create a new class of devices, built from scratch, that embody these principles of humane technology in both hardware and software design? So what about software? How can we be productive in a calm world? Can the work that we do be synced to the cloud, follow our focus? Can we collaborate without notifications? 
Can we scale a minimalist UI to more than just reading and writing? If we start with the basics, the common use cases were reading and writing. We have a mockup of an example of a writing application, a simple text editor that will allow you to write prose and code, and also a simple reader with which we can browse the web using Gemini. This is how you would explore and use information. So how can we possibly scale something like this? Modellix. Modellix is a core framework in development that's designed to tailor applications and documents to various devices and end users' needs. The way that I like to think about it is: imagine responsive design, but taken much further, addressing not just particular screen types, but a longer range of different device types, really catering to the needs of individuals. So here we have the source of a particular application, a minimalist reader; and then, say you have a reader who has low vision, or perhaps is using some assistive device, the application itself will adapt to that. So Modellix adapts the interface to users' preferences, making adjustments depending on their needs. There would be a semantic model that guides the adaptations across different modalities, screen types, operating systems and devices, enabling developers and users to extend and enhance the application for each of the modalities. Some of the approaches here are to restructure the complexity of the interface and to move the complexities associated with the particular modalities. I have a little bit of a mock-up here explaining how Modellix and modalities would work: we would have a Modellix-aware application, which would present to the user a visual and/or possibly an audio interface. 
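Since Modellix is still in development and no public API exists yet, here is a minimal sketch of the semantic-model idea described above: the application describes its UI semantically, and a renderer adapts that description per user profile and modality. Every name in this sketch (`SemanticElement`, `UserProfile`, `adapt`, the role names) is hypothetical and invented for illustration, not the actual Modellix interface.

```python
# Hypothetical sketch of a Modellix-style semantic UI model.
# An app emits semantic elements; the framework decides how each
# element is realized for a given user (visual vs. audio modality,
# scaled text for low vision). All names here are illustrative.

from dataclasses import dataclass

@dataclass
class SemanticElement:
    role: str      # e.g. "heading", "body", "action"
    text: str

@dataclass
class UserProfile:
    low_vision: bool = False
    prefers_audio: bool = False

def adapt(elements, profile):
    """Turn a semantic UI description into concrete render instructions."""
    rendered = []
    for el in elements:
        if profile.prefers_audio:
            # Audio modality: emit speech instructions instead of drawing.
            rendered.append(("speak", el.text))
            continue
        base_size = {"heading": 24, "body": 12, "action": 14}[el.role]
        scale = 2.0 if profile.low_vision else 1.0
        rendered.append(("draw_text", el.text, int(base_size * scale)))
    return rendered

doc = [SemanticElement("heading", "Chapter 1"),
       SemanticElement("body", "It was a dark and stormy night.")]

# A low-vision profile doubles the text size; an audio profile
# produces ("speak", ...) instructions for the same document.
print(adapt(doc, UserProfile(low_vision=True)))
```

The point of the sketch is that the same semantic document drives both outputs; the adaptation logic lives in the framework, not in each application.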
And we're also looking into, well, I missed a slide there, but some of the challenges are the complexity in the representation of the user interface, and also how we can be backwards compatible with existing applications. So here we're also looking into possibly using large language models to support applications that were not made with Modellix from the beginning. I need to expand on this a bit more, but we'll do that later. So, next steps. We've been working at this for about two years now, with our display controller and our prototype. We are pretty much done, and one of the things we want to do is a crowdfunding campaign later this year, to be able to make the devices, or the boards, available to people. We're also a small team, about three or four people who've been working on this. So if anyone wants to get involved, anywhere between documentation and design, we'd love to have you all. We also have a link here with different ways to be involved. And I think that mostly wraps it up. I don't know if I sped through that too fast; I have 15 minutes left, and I have a demo, and I'm happy to answer any questions. So that's it for me. Let's take some questions first and then we'll see. Yeah, I could do that; I would have to disconnect it from the display here, but we can do it afterwards, not a problem. Any questions? Okay, I see one down here. Please repeat the question. Yes. So, sorry, the question was: am I working with multiple different devices or displays? Is that correct? So initially, the motivation for the project was to make a laptop. We initially had an investor who was interested in supporting it; sadly, they backed off a little bit. So we continued working, completely bootstrapped, on the display controller. 
So the first device that we want to make available to people is, one, the board itself, for hardware engineers, hackers, people who want to work on it, to make it accessible. The second one is the monitor itself; it'll have a nicer case. So that'll be the second device, and we're actively working right now on identifying what the third possible device would be, whether a reader or maybe a dedicated typing device for authors and writers. This talk today was more focused on software and what we envision: what would it look like to rethink this as an inclusive technology? I have another talk tomorrow that goes more into the hardware side of things, but the long-term vision is to create a whole new class of devices that use our controller hardware and an optimized software stack. So hopefully next year, at FOSDEM '25, we'll have maybe a reader here. But yeah, that's kind of what it is. And part of it is that everything's open hardware; it's in our repositories, there's a README. So if you're a hardware hacker or engineer, please feel free to get started. Hi. Have you been in contact with any of the larger desktop environments? Like, for integrating: one of the challenges you mentioned was not being able to easily tell what part of the interface needs to go where, or needs to be changed in a certain way. Yeah, so the question is: have I spoken to larger projects, for example GNOME or KDE? And the second part of it, just want to make sure, was in regards to the changes. I haven't spoken to any of the larger projects; if anyone is in GNOME or KDE, I'm happy to talk. I did speak with Drew DeVault from SourceHut, thinking about what it would look like to create a dedicated software stack at the higher levels. 
And we see promise in using Wayland protocols, specifically things related to damage tracking, which would enable applications to realize this idea of an E-ink native application. For example, if you've ever used a reMarkable or any other dedicated device: being able to create ink applications that are natively built, with the support of Wayland protocols. So that's the extent that I've gone to, but I haven't spoken specifically to the GNOME or KDE projects; happy to, though. This is what I've been thinking about and working on for two, three years nonstop, so I have many ideas. Yes. Sorry, the question was: what is the refresh rate that we are aiming for, or targeting, with the E-ink displays? We're able to hit a 60 Hz refresh rate right now with our controller. It's as good as anything that's on the market right now, and we strongly believe we can improve upon that, both with optimizations at the hardware level and with a dedicated software stack: by implementing Wayland protocols, we could create a more native-like experience. So yeah: open up the hardware stack, a dedicated, tailored software stack; I think we can see a lot of optimizations. Yeah, in the back. So the question was related to the 60 Hz: that it is excellent, but there are typically problems related to ghosting, and have we found a way around them? To answer your question: yes, we found a way around it. There are ways you can modify things at the hardware level in order to do that. But still, coming back to the earlier question about the Wayland protocols and the larger projects, this is where we would need support, to be able to create software that's tailored to that medium. But I have a general belief that the display controllers that are available on the market, which a lot of existing companies and products use, are not optimized very much. 
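The damage-tracking idea above can be sketched as a small scheduling policy: only the regions a compositor reports as damaged get repainted, small updates use a fast partial refresh, and a full refresh is occasionally forced to clear accumulated ghosting. This is an illustrative sketch only; the class, the thresholds, and the waveform names are invented for this example and do not describe the actual Modos controller or any Wayland API.

```python
# Illustrative sketch of damage-tracked E-ink updates: small damaged
# regions get a fast partial refresh, while a large update, or too
# many partial updates in a row, forces a full refresh that clears
# accumulated ghosting. All thresholds here are invented.

FULL_REFRESH_AFTER = 8   # partial updates allowed before a full clear

class EinkUpdater:
    def __init__(self, width, height):
        self.width, self.height = width, height
        self.partials_since_full = 0

    def submit_damage(self, rects):
        """rects: list of (x, y, w, h) damaged regions from the compositor."""
        damaged_area = sum(w * h for _, _, w, h in rects)
        if (damaged_area > 0.5 * self.width * self.height
                or self.partials_since_full >= FULL_REFRESH_AFTER):
            self.partials_since_full = 0
            return "full"          # slow, flashes, removes ghosting
        self.partials_since_full += 1
        return "partial"           # fast, may leave some ghosting

disp = EinkUpdater(1872, 1404)
print(disp.submit_damage([(100, 100, 200, 30)]))   # prints "partial"
```

An ink-native application built on this kind of policy could keep typing latency low with partial refreshes while still clearing ghosting at page turns or after a burst of edits.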
I think, if we look at E Ink themselves, they have primarily focused on e-readers and digital signage, but not so much on other form factors or devices, let alone software that's tailored to them. So I think there is an unmet need in the market, so to speak: these controllers could be pushed further, but they haven't been. So with the right combination of optimizing at the hardware level and the software level, probably having specific guidelines for how an E-ink native application would render (this is the Modellix piece that I mentioned before, the tie-in), I think we could hit 60 Hz and also reduce ghosting and things related to that. Any other questions? Do I have two more? I'm going to go with this person first, yes. Yeah, the question is: what would it take to make a Modellix-aware application? A UI library, or a whole different design system? We're trying to figure that out; it's in development. I think part of it is a combination of everything you just stated: part of it would be philosophy, part of it would be design, and another part is figuring out the right software stack for it too. We've gone down a lot of rabbit holes trying to figure out what would be an appropriate stack, and the way that I look at it is: we need to get it into the hands of people, we need to get the board and the monitor into the hands of people. We have some ideas and some level of direction, but tapping into the much wider community, I think, is where that lies. There was a question over here, yes. You're covered? Okay, awesome, thank you. There was another question, yes. So, you're trying to change the user experience, the UI and so on. 
I remember hearing calls for this for the last 15 years, starting with One Laptop per Child and lots of other things. On the one hand, you might have it a bit easier now because there is more open hardware, like the Framework laptop; on the other hand, there is much more fragmentation, so that's one thing you need to fight, and then you need to convince people to start using this UI in their applications. Yes, so the question was: what is our focus, mentioning projects like the OLPC, mentioning the fact that this has been around for about 15 years, and whether we are focusing on hardware or software. Overall, I see Modos, where we're at right now, as more of an enabler: enabling other folks to take E-ink devices in whatever direction they would want. We as Modos have this particular vision of creating this idea of a humane technology, and creating hardware and software that's tailored to it. We might just disappear overnight, right? But the fact is that it's open hardware; there's a repository there, you can go ahead and use it. I think we now have a foundation where we can start: if someone like Pine64, or a project like OLPC, would want to use our display controller to take it in a particularly different direction, that's more than welcome. Given our capacity right now, we just want to first focus on the display controller, build the community, and then see where the community really takes it. So, prior to my work, I worked a lot with universities and other educational institutions. 
One thing that's really been happening recently, especially with students with extra needs, is that universities have been really pushing for assistive technology. I was wondering whether that's an area you've considered looking into: working with organizations like DSA in the UK, or similar organizations in the US, finding tools for students with extra needs and almost building a foundation there for other developers to then jump into that market. Is that something you are looking at? Yeah, so the question was whether we have looked into assistive technology, in particular the needs of students in the education space. Does that summarize the question well? Yes, I've been looking into assistive technology. I don't know enough; I need to do more research on it. I've looked into it, at least back in Boston: there are these SBIR and STTR grants that are available, but they're complex in the sense of being able to apply for the grant; the path to actually getting such a grant is a substantial lift. So I have looked into assistive technology. I think it's a different market; it requires different needs. I don't know enough about it, but I've looked into it as possibly being a base to start from. I'd love to talk to you more about it and learn; this is why I'm here. We're going to take one more question, just back there. Sorry. Also happy to talk afterwards. Thank you. I have a question about the benefit that you mentioned in the beginning, of E-ink being a source of less eye strain, which I think can help with particular eye conditions. How do you make sure that your design process matches those needs? The question was related to E-ink addressing issues like eye strain and fatigue, and how we can make sure that our design process is aligned with that. Does that summarize the question? Well, we're figuring it out, still working on it. 
The challenge when it comes to E-ink and other displays is that there are also a lot of mixed studies there. I can't conclusively say that the problem is related to blue light, for example. But also, our eyes and pupils change throughout our lifetime, and everyone here has different levels. So I don't necessarily have the funds to prove this is what it is, but I have heard an overwhelming amount of feedback from people with light sensitivity and related issues. So, to your question: yeah, I think that's something we need to take into consideration in our design, and that we'd want to get right. There have been some folks who have some level of sensitivity, or some problem, who've expressed interest, and I'd want to involve them from the very beginning as part of that design process. I think now that we have our prototype done, that will enable us to answer more questions like this and investigate further. Okay, thank you so much. Thank you. Really cool. Awesome. Thank you.
Bad UX is Bad Security: Adventures in Qubes OS UX Design
Thank you. All right. So I would like to start with this sometimes very controversial notion: I want to convince you all a bit that the sentence up there, that bad UX is bad security, is actually true, because I often get people who tell me that's complete bollocks. I will later also talk a bit about Qubes, but I don't want to start with that; I would like to start with the general principles. So why does UX matter for security? The thing is, very often when I talk with hackers about security, people come to me like: but we don't actually need usability; people can figure it out; if you care about security, you will figure it out. And that's not a good approach. One thing is, of course, that security and privacy are not things you should have to deserve, to work for; it's not that only the smart people deserve them. The other thing is: it doesn't matter if it's the fault of the user or the fault of the software. If we get compromised, if we get harmed, the harm is done, and I would personally like there to be less harm, less damage to the users. And that means making things more usable for people, taking into account how humans work, how human brains work. This is of course sometimes a controversial concept, but we are all human here, and we make mistakes. User errors are a real vector of attack, and a very important vector of attack. When we read about compromises of, for example, big corporations, very often the initial vector of attack was: oh, somebody clicked on a link, somebody answered the phone, somebody talked to somebody and said what they shouldn't, somebody made a stupid password. So we cannot just say: well, I did the tech side, all the remaining problems are user errors, not my department. This is not a good attitude. It's like if the UX for the door, or for the door control process, is terrible, and you end up with: oh, nobody can remember the code, just put the sticker next to the door. Then the person 
who designed the security system failed. Yes, people shouldn't put a sticker with the password next to the door, but the person who designed the process also did a bad thing. This is not good. And also, we are not the mothers and fathers of our users. We should not be like: oh, you have to deserve this, you have to work harder, why are you not paying attention, tsk, bad user. We need to treat our users seriously, like adults who also sometimes have different priorities than our programs, not like children. Because the thing is, humans make mistakes. This is a truth universally acknowledged: we all do, we will make mistakes. We may have other priorities than using the software perfectly; very few people just want to use the software as well as possible, they want to use the software to do something. And the problem is, our brains were not exactly optimized for using computers. Also controversial. Our brains have a lot of heuristics, a lot of shortcuts that they take. All the fascinating optical illusions tell us this: our brain is not perfect at perceiving the universe and reacting to what's happening. We have a lot of iffy things in our brains, and this is something that we, as people who make software, need to take into account. People also take shortcuts; they want to do things fast. And if you keep noticing that your users keep taking a shortcut that is, for example, less secure, there is a need to do something. We cannot just be like: well, stop doing that, this is bad, bad user. No. If people keep walking on the grass, then they probably need to get somewhere, and maybe that's not how this square should have looked. You have to take into account that people will want to get to their goal, not necessarily in the way that we would like them to. And again, even the smartest person in the room can be in a hurry. You can have a bunch of brilliant engineers, brilliant physicists, and they 
may make stupid decisions, and they may sit there and be like: yeah, it can't be that bad, right? Just this one time, what could possibly go wrong? It's not that terrible. And then something explodes. Oops. We have to take this into account. We cannot make the software we make with the assumption that people won't make mistakes, that we will get perfect users. This is just impossible; that's not how humans work. One of the big things that I find very important for designing security-related processes is inattention. That is, we generally just notice the thing we care about; we don't notice everything that happens in the background. This is not a bad thing; it's very useful for our brains. It's called the cocktail party phenomenon: most humans can understand a conversation in a very busy room, at a cocktail party, because our brain is very good at going: this thing I care about; all the rest, not important, not my thing. But this is very annoying when you are trying to design a good process for security, because it means that a small red blinking light may be ignored, and the error message may not be read, because the person just cares about one thing. And I really like to refer you to a psychological experiment that demonstrates that this is how humans work. It's called the invisible gorilla. In the experiment, people were asked to watch a short film in which a bunch of people were passing a ball, and were told: count how many times the ball is passed. At the end of the short film, people were asked: okay, how many times was the ball passed? Cool. Did you notice the man in the gorilla suit walking around? And 50% of the participants did not notice the man in the gorilla suit, because they didn't care about it; they were told to count the passes of the ball. What gorilla? There was a gorilla? And that's how humans work. We cannot design our secure processes thinking that people will pay perfect attention to everything, all of the time. That's just not 
how our brains work and I like to show it on the example of the error message this is liberal office error message this is what a designer program a sees there's an error message as an explanation what happens all very useful things and this is what a lot of users see because what they want is to get to the file and there are some words and they're annoying because they are stopping them from getting to the file please give me my file so it's just a bunch of annoying red stuff and a big button that says oh go do my thing and then the person opens the file and be like I cannot edit it something's wrong what happened is there an error message and this is I know this is annoying when we are designing things and making things is just like just read the error message why are you not reading the error message people want we have to think about communicating things not just in the error messages because a lot of people would ignore them because they don't care about them in the moment the error appears okay so this is my introduction this is my introduction on human brains complicated what is the thing I'm working on this is keeps us a reasonably secure operating system we don't we don't say it's perfectly secure because nothing is perfectly secure don't use computers if you want perfect security and cubes is a fairly complicated thing it's sort of a meta operating system which means that it has a bunch of virtual machines talking to each other everything's isolated this is my virtual machine that has my devices this is my virtual machine with my work everything is compartmentalized and the thing is we are trying to make it actually usable for people because you could have done the thing of partitioning things into virtual machines manually but it would be such a pain to actually make it work cubes provides the layer that allows you to actually use it to get all the security of really strongly isolating the things you're doing but also being able to use it without writing 
pages and pages and pages of shell scripts. This is a slightly cut-off but mostly visible diagram of how Qubes works. You can see a lot of different virtual machines, called qubes, because we are funny like that. Generally, for the user, there is a bunch of system qubes that do all the important system things, and there is a bunch of user qubes: this is my qube for work stuff — I have my browser, my LibreOffice, whatever — and this is my social media qube. And those two qubes, those two virtual machines, don't know about each other. They can talk, they can share things, but if I click on a stupid link in my Facebook account, it won't compromise my work, which is very nice. So that whole idea of providing this separation is very, very nice, but it leads to a very complex usability situation, because you don't have just one operating system — you have a bunch of them smushed together. That's not easy. That's why we are providing a lot of interesting tools to make the process of using those things together a bit easier, while still maintaining security. And I want to discuss two things we are doing that I think show, in an interesting way, how this can be done — how you can make things usable while also thinking about security.

The first thing is copying and pasting. In a normal system — Linux, Windows, whatever — you select text, you press Ctrl+C or select "copy", the text goes to the clipboard, you press Ctrl+V, and the text goes to the new place. This is of course terrible from a security standpoint; there is a bunch of attacks on your clipboard — attacks that steal things from your clipboard, or put things into your clipboard that should really not be there. Qubes makes it a bit more complicated (sorry for the slight cut-off, some technical problem). First you copy text, but it lands in the clipboard of the virtual machine you copied it from, and the other virtual machines don't know about it. To actually move it to another virtual machine — because, for example, on your private Facebook you found a fascinating link that you have to share with your co-worker — you press Ctrl+Shift+C to copy to the global Qubes clipboard, and then Ctrl+Shift+V to paste it into the other VM. This is a bit more complex, and yes, we theoretically could have done it more easily. We could just always copy everything — but that brings all the security problems, all the issues where one thing can steal the clipboard from another. That's not what we want. And the introduction of this separate step also means that when people are copying and pasting things in Qubes between different virtual machines, they have to stop for a moment and think: do I need to do that? Is this what I want? Why am I doing this? This is something that forces you to stop and pay attention for a second, and that leads to slightly better decisions with respect to security. Of course it's not perfect. Some people get very used to it; it becomes automatic for them — yet another step, just press the keys very quickly. And that means further security is still needed; we have to provide more layers of configuration and information about what's going on. We do have a whole complex policy that allows the user to configure it. And yes, there's a lot of text here, and a lot of you will say "nobody reads that." Yes — that's why we put it in the settings, so only the people who want to customize what's going on actually go and read it. The other people probably won't, because they don't care. But if you care enough to want to learn a bit about what's going on, you go to the settings, read it, and then you can specify, for example, what can be copied where, and how to control it. So we are making the process of copying and pasting a bit more secure by adding this additional step, leveraging two mechanisms: a
technical one, and a human one — making people stop and think for a second about what's going on.

The other thing we are doing that I think is very interesting — this is current work — is devices. Things you connect to your computer are evil. A lot of them can be very malicious. You never know what actually happens inside the thing you are connecting to your computer. Maybe it is actually a USB stick, or maybe not — maybe it's some more malicious device that's just masquerading as a USB stick. It's very complicated with them. And even the devices that are not evil can often do far too much. For example, a microphone or a camera: these are very powerful things, and they can record a lot of things we really would not like them to record. Of course our browsers, our programs, swear to us that nothing malicious is ever happening, but some people don't think that is a sufficient level of security. For many people, well, attacks can happen, and we would like to be protected against them. That's why Qubes OS isolates all the devices in their own qube, and the user can decide: okay, my camera — I want to connect it to this qube, this virtual machine from which I'm making calls, but not to the one for work, because I want my boss to have absolutely no chance of seeing that I'm working in my pajamas. Or my microphone can only be connected to this qube, not the other one. And the problem with devices is that the initial user interface for handling them was made by engineers, and it's not very friendly. The elements are small; there is a list of stuff, a lot of complicated technical details about what's coming from where. For example, one USB stick can appear multiple times, for very good and sensible technical reasons, but it's very annoying when you have to figure out which one of them is the thing you actually want to use. You have a list of qubes to connect to, which is also very small. So I ended up with this, and I decided to ask my users: okay, does this work for you? Is it good? And a lot of people said: no, this is terrible, because I keep making mistakes. I want to connect my USB stick to my development qube, but I keep connecting it to my work qube, because those things are very small and it's very easy to click on the wrong one. And yes, it's a user error. It's not the fault of the system that the user clicks on the wrong thing. But we would like the errors to be less common. I know it's a user error, but I still think we could make it easier for users to make fewer errors and better decisions. That's why we're working on redesigning it, and I think this is a decent example. This is not yet in Qubes; it is incoming and will happen very soon, once I finish working on it — so extremely soon. We are changing things to, one, provide more information, which is another thing that a lot of users told me when I started talking to them, actually doing user interviews: "yeah, I know I should know that, but I have no idea which of the devices I see listed is my camera, because they all have names that consist of random numbers and letters." Maybe we can actually show people which of the things is the camera and which is the microphone — that's why there are icons that show what's going on. That's why there is much more space between the different options. And that's why the options are actually described — not "guess what is going to happen"; now I'm using actual full sentences to describe what the thing is going to do. And yes, this is basically a visual update. This is not a technical change, not a deep dive into the back end of how Qubes handles the USB stack. But it is a change where a lot of people, when they saw it, said: "oh wow, now I think I will make fewer mistakes, this fixes a bit of my problem" — and in the end we will have a more secure system; everything will be better for the user, even though it's just a visual change. Of course, some people say: "this is terrible, too big, why does it take up so much space?" — but unfortunately, you can never make everybody happy. It is basically the same information.

Okay, so as a final word on these two examples, and generally, I would like to say a bit about how to design with security in mind, if you're a designer or a programmer making things that are supposed to be secure. Design for human error. Design for mistakes, not just for success. Take into account that people will do things badly, that people will be in a hurry. If you ever want to design a process that's supposed to be used by a human, imagine that your user currently has their six-month-old baby yelling on one side and their cat puking on the other — and you want to design a thing that will not completely compromise them even if that unpleasant situation does happen. The things that are secure should be easy; doing things insecurely should be harder. The shortcut — the easy way — should be the secure way, because people will sometimes go around things. Also, we are open source people; we like to go around things sometimes. So the going-around, insecure way should be the harder one. Design for actual human beings. Don't think that if it's a user error, then it's not our fault — because unfortunately, a user error is also our fault, not just the user's. Thank you.

Five minutes for questions, please. Yes?

"Isn't this creating more friction in the process? Rather than adding more layers to force people to read all the security information, why is there not a focus on displaying the error messages in a way that makes the user actually read them?"

Okay, so the question is: why add more friction instead of just making better error messages? Two reasons. One reason is that sometimes it's difficult to tell apart a user error from what the user wanted. If I copied what I wanted into the wrong qube, that is a user error, but it's not an obvious error that can
be detected by the system. And the other thing is that friction is not always bad. We like to think that friction in design is always a bad thing, but friction also forces people to stop and think for a moment. Sometimes, when we design a system so that people have to make certain choices, we give them a large variety of choices — but there are some choices we have to give them a real chance to actually make, and friction allows for that stop, for making a choice. I don't want to add friction to every copy and paste: within a single VM there is no friction. It's only when you're going outside, and that friction is by design — also to show that this should not be a common operation, to cut down on the shortcut-making. Yes?

"Do you have some methods to encourage users towards secure behavior? For example, what prevents me from logging into social media in my work qube?"

So the question is how to prevent users from making bad decisions security-wise, for example logging into social media at work. In short, we don't have a technical solution for it. We have the solution of describing things — tutorials on how you can use it, sharing the setups of the developers and of core users — so, educating people, encouraging people to use different colors for different environments. Also, if you want, you can limit yourself by limiting the network access of different qubes: okay, this one goes through the firewall and cannot access Facebook, or whatever. We don't have a good system-wide solution; this is still a decision that the user can make, and has to make, also because users need to divide their work into those virtual machines themselves. That is something the user generally has to do.

So the question is: do I have any favorite examples of security UX? Oh, this is a very difficult question. I don't know... I'd say that I really like how those USB tokens for U2F authentication work. I really like that process, which adds just the perfect amount of friction with the need to press a button. So I think this is my favorite example.

We have to finish — thank you!
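The configurable copy-paste policy mentioned in the talk can be expressed as a Qubes RPC policy file. The sketch below is illustrative only: the file name and the specific rules are made up for this example, and the syntax shown is the newer (Qubes OS 4.1+) policy format — check the Qubes OS documentation for the authoritative reference.

```
# /etc/qubes/policy.d/30-user-clipboard.policy  (hypothetical file name)
# Format: <service>  <argument>  <source>  <target>  <action>

# Never allow pasting the global clipboard into a highly trusted qube:
qubes.ClipboardPaste  *  @anyvm  vault   deny

# Prompt before pasting into the work qube:
qubes.ClipboardPaste  *  @anyvm  work    ask

# Everything else: allowed, but still gated by the explicit
# Ctrl+Shift+C / Ctrl+Shift+V two-step described in the talk.
qubes.ClipboardPaste  *  @anyvm  @anyvm  allow
```

The point is the same one the talk makes: the default path stays usable, while the riskier flows get extra friction or are blocked outright.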
Penpot 2.0 is here!
Change the slides. Magic. So we have Pablo from Penpot, talking about Penpot 2.0. It's here. It's here. So excited, so excited to be the last talk of the day. Also, there's some nice free chocolate — not a call to action yet, just free chocolate there for your way out. But we will have a bit of a birthday party here with all of you, because we're turning four today. So yeah, we have some waffle here from Brussels instead of a paella from Spain. But basically, this is very important, very exciting. Every time we come to FOSDEM — I think my first FOSDEM was 2005, 2006 — but it was only four years ago, in 2020, that we announced this was going to happen. So every year we come here and say there is something new: first we had Alpha, then Beta, then 1.0, and now 2.0. Very exciting. I'm going to take a bit more water because of the excitement. So we're going to discuss Penpot 2.0, and then it's time for a hands-on demo. We'll see how it goes — the staging server and the Wi-Fi.

For those of you who might not know about Penpot: Penpot is an open source platform for design and code collaboration. And we like to discuss — and this is very, very relevant for the open source design track — design and code collaboration. Perfect talk by Ariel earlier; a perfect takeaway also for Penpot 2.0. We believe we bring design freedom to product teams, and we do so in various ways. The fact that Penpot is open source is definitely a key ingredient: it gives you privacy, security and customization. You can hack it, you can do whatever you want with it. You can use the cloud, or you can self-host it. We are pro open standards, which means everything is SVG and CSS native. We make sure we're not creating yet another proprietary format. We want sustainable design and sustainable collaboration with code. And we do believe it's important that whatever tool we build brings something that was not present in existing design and prototyping tools: this collaboration between designers and developers. Some code tools are not welcoming to designers; similarly, design and prototyping tools were not welcoming to developers. So, what if we fixed that? That was the whole idea behind creating Penpot: the next generation of design tools should be about collaboration between design and code.

So this is the basic intro on Penpot. But we are here to discuss Penpot 2.0. This is a major release. You could call it Penpot 2.0 or Penpot 10.0, because it's just a massive change in just one year. We're going to cover the UI redesign — we're very proud of that — the new component system, with wonderful new inheritance and overriding and all that stuff, and CSS Grid Layout, and some other stuff. So let's see how the Penpot 2.0 UI redesign looks. Like this? No, that was 0.2. But that was only four, five years ago. It is elegant, it is simple, it is wireframe-y. Of course, the reference — anyone gets the reference in the picture? No? Willow? Willow fans? No? Ah, yeah, that's my age. So no, this is Penpot 2.0. Look at this. It's very fancy, right? I would like to have the light theme, you know, but perhaps at this time that would not be smart for me to ask. This is just wonderful. I mean, this is a design being created with a beautiful interface — because open source and beautiful go along. What was behind the whole UI redesign? Well, this is a design and prototyping tool. It has to be interactive, it has to do real-time collaboration, it has to be multiplayer, and it's a productivity tool after all. So we needed to reduce the cognitive load. It's so tempting to make many things achievable in many different ways.
So, in terms of real estate and how you would achieve goals, we reduced the cognitive load through heuristics and through research — and sometimes just intuition. By the way, the picture you are looking at is a portion of our design system, which is completely available; I will show it in a minute. We also improved accessibility. We are strong believers that accessibility should be a de facto standard for everything we do. It is absolutely challenging to include all of accessibility in a design and prototyping tool, since it is very visual, it's a very complex tool, it has a lot of micro-interactions — and we already discussed cognitive load. But we try our best for the size of the team that we are, you know, just 15 people in the broader team. And still, we do want to pursue that. So, major work here was color, of course, and typography and sizes and relative shapes. Pretty basic, but still, I think, worthwhile, and we will continue to do that. Of course, you should be able to use Penpot to design accessible UIs, but here we are discussing Penpot itself as an accessible tool. And I think it is beautiful. I honestly think it is beautiful — probably one of the most beautiful open source tools, but also one of the most beautiful tools, period. Sometimes it's just about pride, and why not, right? So, here we are showing just a crop, just the theming: dark theme, light theme — in case you are fans of one or the other. It's not important what we are showing; it's just so you can see how different Penpot looks now that we have support for both dark and light themes. And of course, you could create your own theme, whether it's a corporate theme or just some other theme, because now we have the possibility of having n themes; we just created the two most common ones. Before I go into that — okay, that's for later.

So, you can actually enjoy our design system as a library, if you want. It was meant for developing our own UI, but if you like it so much that you would like it to inspire your UI, why not? We have many libraries and templates available, thanks to our great community that continues to provide amazing stuff for everyone to reuse. This also will be available, and I think it's pretty cool. It follows the typical design system patterns, and all that.

Okay. New component system. A ton of requests basically had the underlying theme of a new component system. For those of you who are not familiar: it is now a thing in design — well, not just now, but in the past few years — to make everything highly reusable, similarly to how we developers have thought about code. Part of this design work has borrowed terminology and abstractions from the code and engineering world into design, because it works. Design is also a science, so it is easy to borrow those concepts. What we wanted was to make it easier for everyone to build the main components — the original elements that are like the ideas, the ideals, of the components — and then very easily track the copies of those ideas. Penpot 1.x did not have this metaphor. It was much more abstract, and you had some trouble finding where the ideal component — the master component, the parent component — was. Now it has a kind of physical representation — sorry, not physical, but you know what I mean — and it is easier to track those components: the main version, and then its copies. And that comes with all sorts of very cool ideas about inheritance, overriding, overloading, and also using a copy to reset the ideal — if you are so happy with a copy that you think every other copy should now follow this copy. The way you do that is that you basically reset the main component through that copy. So, by the way — raise of hands — who here is a developer? Okay. And who here is a designer, or does design? Okay. Both, both. Yeah — the question was not exclusive. So then I have a call to action for you developers in the room. The proprietary design tools are coming for you. They're coming for you because you represent ten times — well, here, much more — but you represent ten times the market size of the designer world. So it's now obvious to the proprietary design tools that you are next in line for being milked. I hope you have strong opinions about that. Also, with the updating workflow it is now much more obvious what's going on. The synchronization — and I hope we'll be able to show that during the demo — is obvious. I mean, when you are synchronizing things that are right there, simultaneously, that is very obvious; but also — and I don't think I'll have the time to show this during the demo — when things are synchronized behind the scenes, you get notifications and updates, and you can decide to dismiss some synchronization and perhaps apply it later, when the time is good. So that has been improved. And then we also have new capabilities — very obvious, very tangible ones. Annotation: okay, I'm going to document this component, whether it's the main component or a copy of it. But also swapping, quick swapping — because when you have everything as a component, you sometimes want to swap that component for another one that is also capable of taking its role within that context. So, here's a very simple example: a very simple landing page, where the main component is the one top left. You know that because it has a specific legend on top, so it's very easy to spot.
And then the rest are copies, and the synchronization is instantaneous. This is really capturing what someone is doing on the Penpot canvas. A very simple example, of course, but it's good for an animated GIF in a presentation. So far so good, right? Yeah. This is the component swap I was discussing a minute ago. Here we go from "image gallery" to "image gallery with title and description". Basically, someone decided to have different components that could fit in — in this case, this looks like an app — and what if I try this one, or perhaps in a different context I want to show different stuff? For whatever reason, you should be able to have your components easily swapped. And of course, this is easy to navigate. If you pay close attention, you see there's "content" — "content" is basically an arbitrary categorization that the designer or the user chose — but you can then go back one level and find everything in your component library. This was just to show a small list, okay? Very good.

And — wow — we have CSS Grid Layout, or Grid CSS Layout. Not the old grid layout — that, of course, we had from probably 0.2; that grid layout is the print media standard of columns and rows. This is the CSS one. Why is this so important? Because this delivers on a promise. We said: if we really want to unite designers and developers around one language, what if we were able to bring the code language — the expressiveness of declarative programming, that is, of CSS — natively into building a design, without using code, just using a user interface? I see some people saying "aha". This is a complex theme. It would probably deserve its own — I wouldn't say FOSDEM track, but perhaps its own talk — declarative programming, declarative design. If you want to read more on this, just search for declarative design. And declarative versus imperative is about expressing rules for getting to a point, but not exactly how to get to the point. CSS is perfect for that, because the browser understands the rules and tries to reach the goal its own way. When you're designing for the real world, one could argue that imperative design is problematic: it's not fluid, it's not reactive, it is limited. But declarative design is able to be okay with a fluid, uncertain world. And CSS embodies that very keenly — finally, after the specs for CSS Grid layout came in 2019. So it's very recent, and for an open standard, that's very recent. We started with Flex Layout, which is about alignment and was present earlier in Penpot. Flex is about one-dimensional alignment, but Grid is bi-dimensional. With both — and you can combine them the way you want — you have almost total freedom. You can do all sorts of compositions: Flex, Grid, and you can nest them the way you want. And Penpot was able to build that natively. For the first time ever for any design tool, we decided to trust the code standard instead of creating our own interpretation of how design should be created, with new vocabulary and terminology. So this is very opinionated software building here. Here you are seeing edits in a grid — again, cropped and very simplistic for the sake of it — and if you're familiar with CSS, you're basically seeing CSS, visually. And you can see how the code next to the UI is automatically updated, because it's synchronized by design, in a way. We actually started with the code and created the user interface on top of it, so it is trivial for us to impact the code. This code is part of Penpot's user interface: you can go to "inspect code" and see it; it is pasted there synchronously. So it gives Penpot users the possibility of declarative design, which is amazing. And all those YouTube tutorials — "hey designers, you know about CSS? this is the code tutorial for you, it's easy, just follow this code" — no need for that anymore, because you can just use your visual language, knowing that it is expressed as code, instantly. So I would like to ask for a round of applause for the team for getting us to this point.

Demo time. So this is — I hope you can see it, yes — a very simple design. I'm going to just select this and make it like this. This is a bento design — it's trendy; that's not important. This is a grid, and I'm going to actually edit it. I'm going to just go and add a column to the right. Okay. Notice that we are using fr units by default. You could use whatever you want — auto, pixels, it doesn't matter. It's fine like this. And by the way, I forgot to duplicate this file, so I'm messing with someone else's file by now — well, we have unlimited undo, so, yeah. And then what I want is to pick this element and just put it here. It automatically understands the slot. With this one I'm going to do something different, because I'm going to create a component out of it — so, Ctrl+K. And I'm going to duplicate this component — Ctrl+D. Now I have a copy, and I'm in Components. You can see that — sorry if you cannot see it very precisely, but there are different legends there on my canvas. So what I'm going to do is just move it here. And you notice it doesn't really react to the fact that there is more space available — and this is a reactive design, so I want it to do that. That's easy: I just select it — this is a copy — and I can go here and just... no, no, no, this is not going to happen to this demo. Okay. Just the mouse. Just the mouse. And I'm using a trackball; it should be easy. Everybody stop breathing. Okay. This requires a certain level of, you know, precision.
So here I will go for "just use the space you have". Okay. But notice — there's more, there's more — notice that the main component did not react, because I overrode this attribute, which is fine. But if I go to the main component and change — perhaps let's go for something silly, like the fill here — okay, and I change that, then the copy does react. This is the synchronization I was talking about. So I'm going to use something like this, I don't know. That's it. And there's more, because — and this is something that happens a lot — I go to the copy here. You can of course navigate all this, but if I go and select the button and change the fill — yeah, like this, let's say, something like that — and now we all pray to the demo gods, okay? — I can decide: okay, I like it so much that I'm going to update the main component. And I update, and that happened. And now the main component — if it had copies across not only this file but elsewhere, if I used this as a library — all those copies would get the notification that the main component has changed: do you want to apply those changes to your copies? This is very nice, right?

And to finish, because I know I'm out of time, one last thing. I have here a CodePen — I always like to end with something like this, outside Penpot. I can take this, go to inspect, go to code. All of this is there for you to enjoy — everything. So I'm going to copy the CSS, just copy this, okay? It's going to take a while, because there are a lot of images, and it depends on the Wi-Fi. We now have HTML on top of SVG, so you pick what you want — we don't care, you know; as long as it opens up, that's fine. We copy that. So what this is doing is — if we are telling the truth, you should be seeing, the moment it downloads all those base64-encoded pictures, the design. Let's see. Yeah, that's what I meant. I have Wi-Fi, it works. Well, I'll send you a link.

But basically, this is all you need: the HTML and the CSS, built exactly to the standards — because nothing I did was impossible to express in CSS. So there's no way you're going to mess it up. It is a one-to-one perfect match. So that lost-in-translation, back-and-forth issue that designers and developers typically — and very frustratingly — say they're having, doesn't happen with Penpot. And of course, this is real-time collaboration; I'm just in single-player mode here. So, quickly, to finish: we saw the UI redesign and the new components — that is right now. We have some other cool stuff going on. And the question is: "when, Pablo, when do I get Penpot 2.0 to try it all?" It's coming, it's coming. Wouldn't it be nice if we had it today? We have a staging server. If anyone is interested, come to any of the Penpot team members; we can give you the secret URL, which is basically quite simple, and you can try it out. But it's in the next few weeks, basically. We're aiming for February — so it's still FOSDEM month. Very, very soon. So thank you a lot to the team, the community, and everyone. You can find more stuff there. Thank you everyone for staying until now, and I hope you enjoy all the work that came from the Penpot team.

And now, before we leave, before the whole track ends — anyone have a lighter? I do not have a lighter. So... may I steal the light? Yes. You didn't sing happy birthday? Yeah. Okay. Hello. How's it going? Oh, yeah. All right. So exciting — this is our anniversary; it's basically how we were born, so it's very exciting to do this. I wish that everyone wishes something nice for their open source project, for Penpot, for FOSDEM, for the community. So it goes like this. Yeah! Take it. It is chocolate. It is chocolate from Penpot. Thank you very much. Thank you.
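The demo steps above — fr-sized columns, an element told to span the extra space, Flex nested inside Grid — correspond to plain declarative CSS along these lines. This is a hand-written sketch with made-up class names and track sizes, not Penpot's actual generated output.

```css
/* A 'bento'-style board: three columns sized in fr units. */
.board {
  display: grid;
  grid-template-columns: 1fr 2fr 1fr; /* adding a column = adding one track */
  grid-template-rows: auto auto;
  gap: 16px;
}

/* "Just use the space you have": let one item span two column tracks. */
.hero {
  grid-column: span 2;
}

/* Grid and Flex nest freely: a one-dimensional flex row inside a grid cell. */
.card {
  display: flex;
  align-items: center;
  gap: 8px;
}
```

Because the rules state what the layout should satisfy rather than where each pixel goes, the browser resolves the actual sizes — which is the declarative-versus-imperative point made earlier in the talk.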
Open Source Firmware status on AMD platforms 2024 - 5th edition
Welcome to my presentation about the open source firmware status on AMD platforms, at FOSDEM 2024. It is the fifth edition of this presentation already. For those who don't know me, I'm a firmware engineer at 3mdeb. We are based in Poland. We do open source firmware stuff. I'm mainly interested in coreboot, advanced hardware features, security, stuff like that. I'm a maintainer of a few platforms in coreboot. Sorry, we're full. You have to stop. We don't have the seats. Two of them. So we are full. Two of them. Yeah. So yeah, please have a seat and we will continue quickly. But a few people, I think. Yep. There's one more here. One more here. Yep. Okay, so yes, unfortunately. There are like two people in the corner. Yeah. That's it. Sorry, excuse me. Okay, come on, come on. So, for those who don't know 3mdeb yet, we are doing various coreboot stuff, UEFI, fwupd, Yocto. So you may also find various contributions from us in those projects as well. And the platforms I will be mentioning throughout this presentation are mainly on this slide. This is kind of a glossary for the terms I will be using throughout this presentation. Those processors, or microarchitectures, are either currently supported in coreboot or were supported in coreboot up to some point in time. So if you need to, please go back to this slide. I have uploaded the slides already onto the system, so you can always check on that. And let's start. So, a little recap from last year. In January 2023, we had another release of coreboot, which happened to deprecate a few more platforms based on AMD silicon. They were not fulfilling certain requirements about code quality, and drivers and interfaces were also being deprecated by this release. So we lost a couple of platforms, like the PC Engines APU1, the Lenovo AMD laptop, the G505S, and others. However, since then there have been no more removals of any AMD boards yet.
So that's kind of promising, because all that is left right now is quite modern hardware, so I don't think it will be dropped very soon. Okay, to also recap the recent status of AMD mobile processors in coreboot: last year I talked about the patches that were sent for review by Star Labs. Apparently they had their own design on AMD processors for laptops. However, since then there have unfortunately been no updates, and I haven't received any information from Star Labs about any plans or status of it, unfortunately. For those who also track the other developments, like the AMD Chromebooks: Mendocino and Phoenix are still in development, but the FSP binary, which is responsible for the whole silicon initialization, has been published for Mendocino, but not yet for Phoenix. And the publication intervals indicate that it may happen quite soon, because the interval between Cezanne and Mendocino was about five months. And right now about that much time has passed... so it should happen quite soon, but I'm not sure about the release dates. But yeah, the difference is that Mendocino is a Zen 2 architecture while Cezanne was Zen 3, so it's also not so straightforward about the release dates, because Zen 3 is the newer architecture, but here there seems to be some kind of update to an older architecture. So let's continue with the coreboot status on a little bit older and newer platforms. We also had the initiative to bring back the ASUS KGPE-D16 platform. We have been trying to upstream the code that we rebased onto a newer coreboot revision. However, we received a response that it would be too much work to get it back, and probably there is no manpower to actually review the whole of the code. So we decided to try to redirect the funds we received from Immunefi for the KGPE-D16 revival to offer some additional features based on the Dasharo release that we made for the KGPE-D16.
However, there was no response from them, so this project is kind of stalled. But yeah, let's leave the bad news behind us and move on to some more positive news. There is also an initiative by an individual with the nickname Hanetzer. His name is Marty Plummer, and he decided to port AMD FSP to a desktop board. He is doing it in his free time, as a hobby. He hasn't had success with the Cezanne FSP; however, he has had success with the Picasso FSP, so the older microarchitecture than Cezanne. He could sort of boot the platform, but of course there are still some problems to solve. AMD FSP, for example, can only initialize soldered-down memory, so if you have a platform with typical memory modules, it is kind of problematic. When he tried to use some kind of newer processor with the Cezanne FSP, he faced problems where the FSP had CPU IDs hard-coded in there, so he possibly couldn't get past the initialization for processors that were not intended for use with this FSP. Also, there is a problem with the PSP binaries that are actually published for the Chromebooks, so the mobile processors. These PSP binaries are specially crafted for Chromebooks and verified boot, and they might not work well with hardware that is not a Chromebook, because apparently there are some configuration fuses that distinguish a Chromebook device from a non-Chromebook device. But we also have some new initiatives, which are much more promising than hacking with FSP on old platforms, and what I mean is servers. Something that many probably considered almost impossible not so long ago: to have open source firmware on servers. There were moves from Intel to make it happen, and we saw some FSPs being released, on EDK2/TianoCore. We have had efforts from 9elements, who were porting some servers on Sapphire Rapids, for example, but that still uses the old, well-known Intel FSP.
What AMD thought up is an entirely new approach. Because the FSP model is very, very costly: they would have to port the UEFI reference code for their silicon into the Intel FSP format, which is just constant work of rewriting, adapting the code, testing, and, to be honest, it is not a maintainable and scalable approach. So what they came up with is openSIL, which stands for open Silicon Initialization Library, and which is fully open-source silicon initialization code for AMD servers. This project was announced at the OCP Summit in Prague last year, and the initial plan was to show a proof of concept on a Genoa platform, so the current generation of AMD EPYC server processors. And we also have a working coreboot proof of concept, as well as EDK2 reference code. If you want to know more about openSIL, I also encourage you to watch the presentations from the OCP Summit or from the OSFC conference; they cover in more detail what openSIL is. So let's try to summarize how the current state of openSIL coreboot looks. I did a quick round and tried to build the Genoa reference board coreboot binary with these few little simple steps. And just to show you a few statistics: there are of course still some blobs that are needed, like the PSP, there is no way around that, and they are still quite heavy, like four megabytes, as you can see, of the APU, the AMD APU firmware. But compared to Intel, where, let's say, a current-generation desktop has a one-megabyte blob of microcode, a four-megabyte blob of Management Engine, and another one and a half megabytes of FSP, that's already a much better situation. But at the end of the build, we are informed about a missing blob, which is the APCB. The AGESA PSP customization block, if I recall correctly, is the input information for the PSP: how to train the memory, what the topology is, how to find the training parameters, and stuff like that.
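The "few little simple steps" from the slide are not reproduced in the transcript, but a generic coreboot build for a reference board usually boils down to something like the following sketch of the standard coreboot flow (not the exact commands from the slide; the board selection and any blob paths would have to match the openSIL proof of concept):

```
git clone https://review.coreboot.org/coreboot && cd coreboot
make crossgcc-i386 CPUS=$(nproc)   # build the coreboot cross-toolchain
make menuconfig                    # select the AMD reference mainboard
make                               # produces build/coreboot.rom
```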
I later checked that these APCB blobs are present somewhere in the openSIL repository, I think. So I don't know why coreboot hasn't integrated them; maybe they already did, because this presentation is like two weeks old, so things could have happened in the meantime. So I think it is doable to get those APCB blobs from openSIL, for sure. So we have a revolutionary approach from AMD for open source firmware on their silicon, but what can we expect in the near future? According to official AMD information, openSIL is going into production mode around 2026, with the server processors that will be released that year. Right now it is only proof-of-concept code, so it is for evaluation only, and you can use it for personal purposes. But what is more important, AMD plans to expand openSIL to all market segments of their silicon. So in the coming years we will see all possible platforms that could run openSIL. So basically we return to the golden era of, I would say, the 2000s, when AMD was also releasing full documentation for their platforms, and everybody could actually make a fully open-source BIOS firmware for AMD platforms. So that is very reassuring and exciting news. I have also got information from an AMD employee about a new library, something like that, which is called the AGESA compatibility layer reduced, which is a wrapper on the original AMD UEFI reference code that can be integrated with TianoCore EDK2 to boot a Genoa server platform using UEFI firmware. It is very Genoa-specific, so building it might be a little bit tricky, even for experienced developers. It is quite fresh, so don't expect rocket-high quality of it. It is just an initiative done, probably in their free time, by one of the people who sits together with us today. Feel free to try it. I haven't had time to look at it yet, but it is there. It is a public repository on GitHub.
Okay, I will speed up a little bit because I am running out of time due to those disruptions at the beginning. PC Engines: probably most of us know what these platforms are and what this company was doing. They were supporting coreboot for many, many years. And we see more interest in this platform. We are going to launch a product based on coreboot with Dasharo, where we will offer the standard features that we offered with the standard PC Engines firmware, but we will try to use UEFI, so we will provide more security: we will have Secure Boot, a setup password, TPM 2.0 support, measured boot, verified boot, stuff like that. It will also be available as a subscription, so anybody can donate to support us and make the development happen for this platform, because for the past few months it was quite neglected, as PC Engines ended the official open source firmware support. There are also efforts by Felix Held from AMD, who recently did some work upstream for this board in his free time. So this platform will still be alive for those who are fans of PC Engines. TrenchBoot: we also have a dedicated talk for that, so I will only briefly mention it. We are expanding the possibilities of launching operating systems with a dynamic root of trust for measurement on AMD platforms. We will cover the UEFI boot mode and booting Xen with Qubes OS using Anti Evil Maid. So I encourage you to come to Maciej's talk, which will be at 20 past 4 p.m. in this room; there you will get more details about this initiative. And I mentioned Dasharo, but possibly not all of you know what this is. This is our downstream coreboot distribution, which aims to make the firmware more approachable for end users and regular people. We aim to provide validated pre-builds of the firmware that are known to work, and to help spread open source firmware usage that way.
We also offer the subscription model, which allows the subscribers to interact directly with the developers, for example, to request features, and to get the most recent updates earlier than the regular public builds. Also, they are given special treatment in terms of newsletters and stuff like that. So, a little bonus at the end, which might also be interesting: AMD also published the AMD Secure Encrypted Virtualization (SEV) firmware to GitHub. This is firmware that runs on the PSP, so it is a very revolutionary publication, I would say, because until now nobody has released any parts of proprietary co-processor firmware to the public. I haven't heard about Intel releasing any parts of the Intel ME code. So again, AMD number one. And I would like to give special thanks to Paul Grimes, who is here with us from AMD, and to Felix Held, for the insights, review and suggestions for the presentation. And yeah, I'm open for questions if we have time for that. Thank you. Do we have time for some questions? Yeah, three minutes, I guess. Please. Okay, so the question is whether the approach taken by Oxide, to bring up a modern platform using just the PSP blobs, could be adapted to coreboot. Technically, yes. But what Oxide did still involved some bare initialization of the I/O buses. They needed to program certain registers to get PCI Express and stuff like that. So it's not entirely, let's say, independent from the code that runs on the host. They still needed some small portion of initialization code, and to extract the secret bits needed for that initialization. So what is right about it is that the PSP actually can bring the RAM up and boot the platform, so you have at least half of the BIOS's responsibilities less to do. That's true. But adapting that thing is, I would say, not so scalable, and it would result in a very feature-limited result, right? Any other questions? Please. Do you have a plan to support coreboot with openSIL on Framework laptops? On our Framework laptops?
Yes, there's a plan to support it. It's going to be a proof of concept. It will not be a full replacement; for example, it won't support any power management features. So if somebody wants to build it and put it on their Framework device, they can. Framework is not going to be supporting it; it's going to be independent. But yes, we will have it for both the 13 and the 16 inch. Did you ask about AMD? I've been talking about AMD. Yes, AMD, because of openSIL. Also, I forgot to mention, I'm sorry, Felix. This is also last-minute information. Felix also pushed some patches for AMD Phoenix to be integrated with openSIL. But right now openSIL for the mobile platforms is not available; the openSIL part is stubbed. But the infrastructure is being prepared, I would say. So we can probably expect something in the not so distant future. We are out of time. Yes, I can also answer questions outside, or later. Thank you.
systemd-boot, systemd-stub, UKIs
This is my second talk of the day. The first talk was on a somewhat similar topic, but it focused more on the distribution side of things, how to build all this stuff. I welcome you to look at the video if you have some time later, because what I'll not be able to answer in these 20 minutes, the other talk hopefully might. So I will talk about systemd-boot, systemd-stub, UKIs, what those are, and why you should all switch to that, of course. So let's jump right in. systemd-boot, what is it? We usually call it a bootloader, but it actually isn't. It's a stupid boot menu. A bootloader, at least in my view, is something that is actually capable of loading sectors off a disk, parsing them, and then eventually setting up the boot params and jumping into it. We do nothing of that in systemd-boot. All we do is give you a menu, you pick something, and then we chain-load some other UEFI binary. So yeah, it's a fancy boot menu, nothing else. That makes it on one hand dumb, but also nice and robust. It's built around this model that you have drop-in files inside of a directory, which I guess is very different from GRUB, where you have these boot scripts and things like this. Our way to configure things is supposed to be as simple as possible, and modeled after how we started doing things in package management and classic Linux distributions: this pattern where you have a drop-in directory, like a configuration directory, or the directory you put desktop files into, and things like that, where every package can put more stuff into it, and the combination of all of them is what makes the system work. And we just said, okay, let's do it the same way. Have one directory in the ESP, and people who want to populate the boot menu just put one file in there, and that's what populates the boot menu, and that's already it.
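To make the drop-in idea concrete: under the Boot Loader Specification, one menu entry is one small file under `loader/entries/` on the ESP. A minimal sketch of such an entry (the file name, paths and root UUID are made-up placeholders):

```
# $ESP/loader/entries/example-6.7.conf  (hypothetical)
title    Example Linux
version  6.7
linux    /example/vmlinuz-6.7
initrd   /example/initrd-6.7.img
options  root=UUID=0a1b2c3d-... rw quiet
```

Delete the file and the menu entry disappears; there is no central configuration to edit.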
So, yeah, it basically just takes this Linux pattern around package management and brings it to the bootloader stuff. So systemd-boot is UEFI-only, right? That makes things nice, because it basically means we don't have to actually do any boot loading. It implements something we call the Boot Loader Specification, which is a spec we wrote ourselves. It basically just tries to define, in abstract terms, where to place kernels and where to place descriptions of what to boot. It supports two kinds of menu entries. Type #1 and Type #2, we call them. I think the focus nowadays should always be on Type #2, because they have much nicer properties regarding measurement, cryptography, and things like that. But Type #1 still exists, and people will continue to use it, because it's more flexible in that it allows you to configure the individual items manually. A Type #1 entry is a configuration file, basically, which just says: use that kernel, use that initrd, use that stuff, and things like that. Type #2 is where the boot menu items are just binaries, UKIs, as we call them. We'll talk about this later in more detail, but the very short version is: it's a kernel glued together with its initrd and a couple of other things and then turned into one UEFI binary. So it basically takes much of the early state of the OS and makes one thing out of it that can be updated as one, signed as one, measured as one, loaded as one, which makes it robust and secure and very nice. Since Friday or something, systemd-boot is also eligible for signing; SUSE actually did this ahead of time, but now it's officially okay, so you can get it signed for shim with the same infrastructure, exactly like you can get GRUB signed. systemd-boot is supposed to be fully automatic, no configuration, right? There are no boot scripts, no nothing.
I mean, there are some configuration options, but the design is to just work and not require configuration. It should be just one binary you drop in, and then you have this other directory where you drop in the menu entries, and that's supposed to be it. Of course, you can configure some things in EFI variables, and there is also a configuration file, but that's just for the nerds and it's not supposed to be the default. It also has a nice functionality that, besides looking at these directories for boot menu items, it is actually capable of finding Windows installations automatically, and macOS, which is kind of nice because you don't have to configure that either. From the OS you don't need to do anything. sd-boot, when it boots up, just looks: oh, is there also a Windows installation? Then it adds an entry for it. It's really nice because it's robust, and it has the added benefit that if you add Windows after you install Linux, it will just show up. It also has APIs to user space, which I think is very important. For us, the bootloader world and the user space world are not distinct; they are closely intertwined, for various reasons: for example, because user space adds and manages the boot menu entries, because from user space you generally might want to be able to select what's going to be booted next, and because there are things like automatic boot assessment, right? You figure out: did this boot actually work? If it worked, boot it forever from then on; if it didn't work, you try a couple of times, then give up and revert to the previous thing. This always requires communication between the bootloader and the operating system. So we defined, and that's actually another spec, how this works generically, with EFI variables and things like that. We said: this is how the bootloader and user space can communicate and send each other commands, basically. It also does early-boot random seed stuff.
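Before moving on to the random seed: the boot-counting handshake behind automatic boot assessment is encoded in the entry file name itself. systemd-boot counts down a `+LEFT-DONE` counter before the file suffix on each attempt, and user space marks success by renaming the counter away. A minimal sketch of that naming scheme (illustrative Python, not the actual systemd code):

```python
import re

# "entry+3.conf" means 3 tries left; "entry+2-1.conf" means
# 2 tries left, 1 attempt already used.
_COUNTER = re.compile(
    r"^(?P<stem>.+)\+(?P<left>\d+)(?:-(?P<done>\d+))?(?P<suffix>\.(?:conf|efi))$")

def parse_tries(name):
    """Return (stem, tries_left, tries_done, suffix), or None if the
    entry carries no boot counter at all."""
    m = _COUNTER.match(name)
    if m is None:
        return None
    return (m.group("stem"), int(m.group("left")),
            int(m.group("done") or 0), m.group("suffix"))

def consume_try(name):
    """File name after the boot loader burns one attempt."""
    parsed = parse_tries(name)
    if parsed is None or parsed[1] == 0:
        return name                      # nothing left to count down
    stem, left, done, suffix = parsed
    return f"{stem}+{left - 1}-{done + 1}{suffix}"
```

So `consume_try("fedora+3.conf")` yields `"fedora+2-1.conf"`; once the OS comes up healthy, user space renames the entry back to a plain, uncounted name.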
Why the random seed? Because, traditionally, particularly in VM environments, there was no RDRAND, no virtio-rng, and then Linux really didn't like it: you didn't have any entropy in your VM, and then certain things just hung, and that's super annoying. So we took a bit of inspiration from something that FreeBSD did, which is an early-boot random seed. Basically you have a random seed that is stored in the ESP. You can update it from user space, and it is updated from user space. After we first did this, Jason Donenfeld, who's also the maintainer of the Linux kernel RNG, reworked a couple of things, so we're kind of confident nowadays that it's really good, actually. And the good thing is that it works everywhere, at least everywhere where you have EFI, and it makes sure that from the earliest moment on you have really good entropy, in addition to whatever the hardware might give you. It has automatic enrollment of Secure Boot keys, which I think is actually kind of nice. It implements this TOFU concept for Secure Boot enrollment. So if you want to change your certificates, which I think people should do, particularly in virtualized environments, then you can just add the keys to the ESP, and then on first boot-up, when we are in setup mode, we'll just enroll the whole thing, and then it will be locked down. So you have trust on first use: the first time you boot up, nothing is enrolled, nothing is trusted; that's the moment where everything becomes trusted. You add the keys, and from that point on, that is how it's locked down. It also has this thing where, again with a drop-in dir, you can load additional drivers; that mostly exists so that people who really want to can make the ESP one of the weird file systems, or something like that. Yeah, I already briefly mentioned that automatic boot assessment exists, which is the infrastructure for the counting:
We boot something, count how often we have booted it, and then from user space you can report back whether that actually worked, and then you get this kind of robustness thing going. So much about systemd-boot. bootctl is one part of the user-space side of things. bootctl is a command-line tool for installing systemd-boot; that's kind of its primary job, but it can do a couple of other things as well. It's the user-space side. You can tell it to boot into a specific menu entry on the next boot-up. You can list the menu entries. You can update the random seed, and a couple of other things. It is also hooked up so that it actually runs automatically on boot, for example to update the boot loader. It always will do this, to make sure that the copy of the boot loader that is in /usr is instantly copied over, so that even if for some reason the package manager, whoever updated systemd, forgot this, it's always kept up to date. So the focus is really that the boot loader is always up to date. It also refreshes the random seed, by the way, from the Linux pool, so that there's a good chance that the random seed is as good as it could possibly be. So much about bootctl. Next thing: systemd-stub. systemd-stub is also a UEFI binary. systemd-stub is basically a little UEFI binary that you glue in front of a Linux kernel and an initrd, and that runs in UEFI mode. It does a couple of things before it transitions into the actual kernel. Why do we have this? It does things like, for example, measuring the payload of what it's going to start. Now you might wonder: if it's a UEFI binary that Secure Boot verifies and things like that, why does it need to measure, because the firmware already measures all Secure Boot binaries? Very good question, if you ask that.
The reason we do this is that the measurements that the firmware does go to PCR 9, I think, and there's a lot of stuff in there, which basically means it's hard to predict, because there's stuff in there that is controlled by the firmware and stuff that comes from the OS, and you cannot bind security to a PCR that has sources you cannot really control. At least you cannot do it in a predictable way from the OS point of view, like figure it out ahead of time. You cannot predict it on, basically, the Fedora build systems, if you build Fedora. But if we do the measurement of the UKI's payload separately, we can do that in a separate PCR, and then we can predict it, because in that PCR there's only going to be the stuff that the OS vendor controls, not also the firmware stuff; and then we cover the firmware stuff with something else. A UKI is what this becomes when you use systemd-stub, right? The combination of systemd-stub plus a kernel plus an initrd plus a kernel command line plus all these other kinds of things, that's what we call a UKI, a unified kernel image.
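The "predictable PCR" idea rests on how a TPM PCR works: it can only be extended, new = SHA-256(old || digest), so anyone who knows the exact sequence of measured payloads can compute the final value offline. A toy sketch of that prediction (conceptual only; real UKI measurements hash specific PE sections and event data, which this glosses over):

```python
import hashlib

def pcr_extend(pcr: bytes, payload: bytes) -> bytes:
    """One PCR extend step: new = SHA256(old || SHA256(payload))."""
    return hashlib.sha256(pcr + hashlib.sha256(payload).digest()).digest()

def predict_pcr(payloads) -> bytes:
    """Replay a known measurement sequence from the all-zeros start
    value, the way a build system could precompute the PCR value a
    UKI will land on."""
    pcr = bytes(32)  # PCRs start out as all zeros
    for payload in payloads:
        pcr = pcr_extend(pcr, payload)
    return pcr
```

Because every input here is fixed at build time (kernel, initrd, command line), a vendor can sign a policy for the predicted result; a PCR that also receives firmware-controlled measurements cannot be precomputed this way.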
Yeah, systemd-stub supports a couple of sidecars. This UKI model that we try to push distributions towards, where you unify everything into one image that you can sign as one, measure as one, update as one, and things like that, comes with inherent problems. For example, the initrd that you build into this: we expect that vendors will build those on their build systems, so they're always going to be the exact same ones on every installation, which is great for many reasons, but also horrible for others, because depending on the machine you will need large drivers and large firmware. The Nvidia driver, for example, comes with multiple hundred megabytes of firmware. If you would always build that into all the UKIs that you, as a generic distro vendor, ship to your people, then yeah, this would be a really, really large Secure Boot binary. And as it turns out, because of all the measurements, booting really large Secure Boot binaries works, but it's kind of slow. We also had this inherent problem that, in this model where UKIs are built on OS vendor build systems, the question is open how you parameterize them. On a simple laptop you do not need parameterization, it can figure out everything on its own, but the initrd is supposed to be generic, right? You have these installations that want additional parameters: they want to configure, I don't know, additional things like root passwords, so that you can log into the initrd, or an iSCSI device that you actually want to boot from. I mean, there's a reason why the kernel command line exists; people want to be able to do this in certain setups. On a laptop, as mentioned, that should not necessarily be necessary, but the more you go to the server side, they all want to do this. So we came up with a couple of ways how you can have sidecars, so that even though we push everything to the UKI model, where you have a single thing that is
self-contained and has everything, you can put next to it sidecars that configure individual things. One concept we call systemd credentials. I went into this in more detail in my earlier talk, but let's just summarize: systemd credentials, the *.cred stuff, are basically short little bits of information, like cryptographic keys and passwords and things like that, that you need to operate. They're individual bits, and they are encrypted and locked to the TPM, so that you can actually put them in an untrusted environment, for example the ESP, where there's no implicit trust, and you have to authenticate them before you use them. There's another concept we call add-ons, EFI add-ons, which are basically the same idea as UKIs: you make a PE binary that you can sign as one and measure as one; however, you leave out the Linux kernel, the initrd, and all that kind of stuff, and just insert the kernel command line that you would otherwise add to the UKI. So you basically have a binary that looks like a binary but doesn't contain any code; however, you can authenticate it via the usual Secure Boot and shim APIs, as if it were a binary, because UEFI just cares that it's a PE thing. So these add-ons, as we call them, are our way of allowing people to extend the kernel command line, because when a UKI is booted and systemd-stub takes over, it will look for the sidecar files, find them, add that to the kernel command line, and boot on. And it's all fully trusted, because these things need to be authenticated the same way as everything else, via the shim Secure Boot stuff. I already mentioned this: systemd-stub also does measurements of the content, right, so that we get this isolated out, so that we have one PCR that only contains the OS stuff, separate from where the firmware stuff is. This means duplicate measurement, but that's fine; at least I think it's fine. Something it also does: it can
read additional kernel command line options from SMBIOS Type 11. I'm in the boot loader room, so I hope you know what that is. SMBIOS, you probably all know, is this descriptive thing that the firmware passes to the OS, and there's one object type, Type 11, and it's wonderful because it's just called "vendor strings", and we can put anything in there that we want. Various virtualizers, QEMU for example, allow you to directly set that from the QEMU command line, and yeah, we use that also to extend the kernel command line. So you can just, on the QEMU command line, set a string that is implicitly appended to the kernel command line that is eventually booted. We kind of want to push people to the model where they use this more often; it's actually an awesome thing, and I'm trying to push all the cloud vendors to adopt this as a generic way to provision data into VMs. But anyway, other topic. Another component is ukify. It's basically a Python script that helps you glue together a UKI. So it will take systemd-stub, a kernel and an initrd, and sign them as one. It will also do the TPM predictions of what the PCRs will look like when it's booted, signs all that for Secure Boot, and gives you one EFI binary that you can just drop into the ESP, and it boots up and everything's secure and wonderful. Then one other tool: systemd-measure. Much of this, like all of what I'm talking about here, is actually part of systemd, because I'm the systemd guy. systemd-measure is a tool you probably don't have to interface with anymore, because ukify does that behind your back for you; it's the actual engine that predicts the PCRs that the UKI will result in if booted. I just wanted to mention that it exists. And yeah, there's another tool, called kernel-install, for the traditional distributions, so that they can ship
inside of a Debian package or RPM a kernel, and this tool, which is plugin-based, will copy the kernel into the ESP and potentially build the UKI at that moment, right? Because we want to cover a couple of different models: one model where the UKI is built on the build servers of the OS vendor, and another model, more for the, let's say, democratic Debian-style distributions, where they can do this locally, so that they can use their own keys. So yeah, kernel-install is the infrastructure to make this happen. It has really nice, full UKI support, for example; so if you want to sign your own stuff, you can trivially do this, because you can just use that and drop your keys into /etc, and then it happens magically. There's something... I don't have that much time anymore, should we switch to questions? Okay, this is one of my last slides anyway. systemd-pcrlock is one of the most recent things. It's a more complete prediction engine. I already mentioned the systemd-measure tool, which is able to predict the PCR measurements that a specific UKI will result in; systemd-pcrlock is supposed to cover all the other PCRs, the ones that are firmware stuff and things like that. The other operating systems generally have this: Windows, Chrome OS, Android nowadays all have these predictions; well, depending on whether they actually care about TPMs or have some other secure-enclave thing, it's all a little bit different there. But the ones that care about TPMs generally have this prediction engine, where they just look at all the different things that happened during boot, analyze the UEFI event log, and try to calculate a TPM policy to lock disk secrets to. Our version of the tool is called systemd-pcrlock. It's supposed to be modular, so you again have drop-in directories, where the different components of the OS that will show up in the boot path, like the UKI, the boot loader, shim and
things like that — plus components that are not even necessarily under the OS's control but are firmware stuff — can be described with little JSON fragments that declare the measurements expected for each of these components. There is a concept of alternatives, because usually you don't want to lock your secrets to exactly one kernel or one boot loader version: you want to be able to update them, and if that update fails you want to be able to go back. So for each component, alternatives are also very well supported. systemd-pcrlock takes all this information, expands what all the possible PCR values could be in the end, and then generates a TPM policy out of this, which it stores in a TPM NV index that our disk encryption stuff can then reference as an access policy. Long story short, this locks down the OS against the firmware versions, with all the measurements the firmware does that are not necessarily predictable for the OS — because, yeah, the firmware people suck. There's also logic, of course, to deal with firmware measurements that we cannot predict.

So if you do the combination of all of this, you get a super secure system and everything's great. My recommendation is: do this. But these components are relatively independent of each other, and as things happened, different distributions started adopting different parts of this earlier. For example, SUSE has nowadays already adopted systemd-boot, while RHEL, for the confidential computing stuff, adopted systemd-stub instead. They all pick different parts of this.

OK, my time is over, so this is my summary here: if you use it all in combination, everything works great, but you can pick just what you want — you don't have to pick anything at all if you don't want to. Used in combination, you get the full boot chain, everything's secure and relatively robust, because all the update cycles are around
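As a rough sketch of how this looks on the command line — the verb names below follow the systemd-pcrlock documentation as of systemd 255, but the tool is still experimental, so treat them as assumptions:

```shell
# Record expectations for components pcrlock knows how to predict:
systemd-pcrlock lock-firmware-code      # current firmware code measurements
systemd-pcrlock lock-secureboot-policy  # Secure Boot variable state
systemd-pcrlock lock-uki /boot/EFI/Linux/linux-6.7.0.efi

# Combine all .pcrlock fragments (including their alternatives), compute
# the possible PCR value combinations, and store the resulting access
# policy in a TPM NV index:
systemd-pcrlock make-policy
```

The disk-encryption stack can then reference that NV index as its unlock policy instead of binding to one exact set of PCR values.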
individual files, and you have ways to parameterize and extend it. There are a couple more slides, but we don't have to cover them — let's move to questions; we have five minutes.

So the question was whether the systemd-stub stuff works outside of a UEFI environment. The answer is no: it uses UEFI APIs, it's UEFI only. All of what I was talking about here is more or less modeled after UEFI; systemd-boot and systemd-stub are absolutely UEFI only. The further down the list you go — kernel-install, for example — the less it has to do with UEFI. My suggestion would be: just adopt UEFI and avoid all this mess. I know everything has problems, and UEFI has some too. I get this all the time — "oh, we have to stick to GRUB because it supports the non-UEFI world" — and I say: sure. But my recommendation would always be: if you look at this stuff, there are certain philosophical ideas built into it — you have a drop-in directory, you put in type #1 entries — and that kind of thing is entirely generic. Type #2 is not generic, but type #1 is totally generic. And my recommendation beyond that is: just use UKIs as they are. They are a PE wrapper — PE is actually a really simple format — it's just an envelope that carries sections for you. I think GRUB can parse that now, and if GRUB can parse it, your bootloader should have no problem at all parsing it. Then you suddenly have a universal format, and you boot Windows-style PE binaries even though it's not Windows. I think it's the way to go: model it after UEFI. UEFI has its warts — everything has its warts — but I think it's way better than the stuff that came before it. And so my
recommendation would always be: if you can't do this stuff, at least consider the ideas behind it — drop-in directories, single-file updates, these kinds of things — and try to model your solution after that. The more you can take over, the easier your life will be, because this will probably end up in all the distributions, and the fewer differences there are, the easier it is. I think even GRUB supports at least type #1, if not type #2. We wrote the specs as generic things on purpose: there's a spec about UKIs, a spec about the Boot Loader Specification, a spec about the boot loader interface — because it was always clear to us that not everybody is going to do UEFI. We did that as a nice service to the community, but other people have to figure out whether they actually want to adopt it. It took them long enough not to adopt it so far, but now things are changing. If you don't want to do UEFI, my recommendation would be: look at the specs, and please add to them — they're specs, there's a GitHub issue tracker, and if you need something else, file an issue; if it makes any sense at all, we have no problem adding it to the specs.

Very short — so the question was whether all these projects are under the systemd umbrella. That depends. We created a group called the UAPI group, where we try to standardize these things. Admittedly this is to a large degree systemd-adjacent people, who have adopted a similar way of thinking, but there's nothing systemd-specific in it, and the specifications are on purpose written independently — the word systemd doesn't show up in them, or it might show up here and there, but that's not the point of it.
The code, though, is a different story: that is in the systemd tree, developed the way Unix was developed, I guess — you have this one git repository and you have all these components in it. The fact that something is in there doesn't mean you have to use it; you can mix and match, and as mentioned, different distributions pick different things up: openSUSE adopted sd-boot first and not sd-stub, while RHEL, for confidential computing, took sd-stub but was not interested in sd-boot, because they didn't want a boot loader at all. And this is how it should be: we give you the buffet — pick what you want, don't pick what you don't want, I don't care. Ultimately it's very Linux-focused, very UEFI-focused, very systemd-focused, but look at the specs, maybe you can reuse something. Even with the UKI stuff: how the firmware jumps into your UKI is not defined by us — it's simply an artifact of the fact that it's a PE file. You can find any other way to jump into it; you can even look for the Linux kernel inside it. That's what GRUB does: it looks for the Linux PE section, ignores all the other stuff, and then, if it wants to, does the classic boot protocol, which doesn't involve any of this. OK, anyway — thank you.
Kernel command line to configure userspace considered harmful
Okay, thank you. Good morning, good afternoon, thank you for having me. My name is Luca. By day I work as a software engineer in the Linux Systems Group at Microsoft, in the Azure organization, where I work on the Azure Linux OS that runs on the infrastructure there. By night I'm involved in various open source projects: I'm a systemd maintainer, a Debian developer, a DPDK LTS maintainer, and a bunch of other stuff that I consistently forget about. Now, if you read the title of this talk, you might think: hang on, was that really intended to be that provocative? And the answer is: yes, yes it was. This is my yearly talk to make new friends. But of course I mean it in a positive way: I want to provoke some thoughts and discussions and see what we can do about something that I consider a problem, and that I think we are in a good place to start fixing.

But first, some background, even though in this room everybody lives and breathes Secure Boot — if you work on boot loaders you already know all of this, but just one slide. In the beginning we had BIOS, and everything was great; the security model was, well, essentially nonexistent. In the 2000s we got UEFI: Intel, Microsoft and a bunch of other people got together and created this new protocol for firmware, and it actually has a security model, which is nice. Now, it gets a lot of mud thrown at it — every time there's a bug in the news, like the LogoFAIL stuff, people go: oh, why do we need Secure Boot? It's always broken anyway. Well, having a security model doesn't mean that everything is perfect or never breaks. It's software, it runs on computers — of course it breaks. The point is that we have a process to deal with it, and an actual security model to follow. The way it works is that there's a chain of trust that starts in hardware, for example Intel Boot Guard: the hardware verifies the firmware, the firmware verifies the bootloader, and the bootloader verifies the kernel.
The set of keys and certificates used is stored in the firmware — I won't go into details, because it's not too important here. This, in a nutshell, is what's generally called Secure Boot. In the 2010s, thanks to the work of a lot of people, Linux finally joined the party. We had been shut out of that ecosystem for a while: by default, distributions couldn't boot on new hardware, and you had to go and fiddle with the BIOS to disable Secure Boot. That changed in the 2010s: we got shim, GRUB 2 and the kernel lockdown stack, and distributions can boot again by default. They are signed with the UEFI third-party CA: you get your shim signed by Microsoft, and then you sign your second-stage bootloader — GRUB or systemd-boot — with your distribution key. And then we have a patch set in the kernel — called "securelevel" in the beginning, when it was out of tree, and later merged as the lockdown LSM — that basically tries to protect the kernel, the firmware and the hardware before ExitBootServices is called. ExitBootServices is an API call in the UEFI interface, and when it happens, a bunch of things get locked down: you cannot change the Secure Boot variables anymore, parts of the firmware interface go away, and a bunch of other things. It's very important to protect the system before that point. So this is what this ecosystem tries to protect. Lockdown also tries to separate UID 0 from ring 0: the theory is that even if you are root, you shouldn't be able to change the hardware or the kernel's memory outside of what is explicitly allowed. This is not perfect — it's software, it's never perfect — but it went a very long way and fixes a lot of problems. So we have this boundary between UID 0 and ring 0, and it has been working for 10 years or so. It's great: we went from no trust whatsoever to having trust up to the point where we start user space. And that's great.
But other operating systems are way ahead: macOS is way ahead, Android is way ahead, Windows is way ahead. We do nothing for user space so far. In the past couple of years, though, we've been talking a lot about how to fix this, and things are starting to happen. So this is the next level: unified kernel images. By the way, Lennart had a talk this morning about UKIs — I could not get in because the room was full — and there were at least three or four talks here touching on these things, so you might have already heard these concepts; we'll repeat them in a different context. What we're trying to do is extend that level of trust, security and authentication into user space. For example, the initrd right now, on any generic distribution, just sits on the boot partition, on the ESP, and anybody with write access — offline, or even online — can inject anything into it. Initrds are just built locally; they're unverified. Someone could add a backdoor to the prompt that asks for your LUKS encryption password and you wouldn't have any idea, because it's completely unchecked, untrusted and unverified. Unified kernel images fix this: the initrd is part of the PE binary that gets signed, so shim or the firmware verifies it before loading it, and we extend the chain of trust a little bit further into user space — at least into the first part of user space, the initrd. But that's not enough. We want to go further, because once you transition from the initrd to your root file system, well, that too is unverified. Now, there is ongoing work on IPE, Integrity Policy Enforcement: a new LSM that basically allows you to write a policy saying that any binary that runs in user space on my system must come from a dm-verity volume that is signed by a key trusted by the kernel.
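The dm-verity mechanism just mentioned can be tried out by hand with the `veritysetup` tool from the cryptsetup project. This is a minimal sketch (needs root; the image file names are placeholders):

```shell
# Build a Merkle hash tree over a read-only filesystem image; the command
# prints a root hash that uniquely identifies the entire device contents:
veritysetup format rootfs.img rootfs.hash

# Open the verified device; every block read through /dev/mapper/vroot is
# checked against the hash tree, and tampered blocks produce I/O errors:
veritysetup open rootfs.img vroot rootfs.hash <root-hash-from-format>
```

With IPE, the kernel can additionally require that the root hash itself carries a signature chaining back to a trusted key, which is what extends the chain of trust over all executable code.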
dm-verity is a kernel mechanism for online verification of block devices as they are read and opened. It's a very, very nice interface that has been available in the kernel for 10 years or so, and with IPE we can use it to extend the chain of trust to the whole of user space. So now all the code that runs is fully verified, with a chain of trust that goes back to the hardware. With discoverable disk images (DDIs) we can also protect further payloads — containers, nspawn containers, portable services and other things that are attached to the OS. If you're running a read-only system, you need some way to attach new applications, of course, and with DDIs you can extend the chain of trust in the same way to those payloads as well. So we put all of this together — shim and lockdown for the boot process, UKIs for the initrd, IPE and DDIs for user space — and we have a very nice system that chains back to the hardware and implements a full root of trust. And that's very nice — except for the kernel command line. This is just stored as a plain text file in a systemd-boot type #1 BLS entry — a type of boot entry supported by systemd-boot, and by GRUB as well. It's just a plain text file: if you have root access, you can write whatever you want there; nobody checks it, it just gets loaded and run. It can also be edited on the fly if you have access to the keyboard. That's probably fine on a laptop, because if you have access to the laptop, you're probably the owner. But if you're on a server, or a VM, or a confidential VM, that's kind of bad — especially in the trusted computing case, because the serial console is just a device owned by the hypervisor, which is outside your TCB. So why is this a problem? Because the kernel command line has become a kind of kitchen sink. Just for the kernel alone there's a document, which is very nice and lists a lot of the available options.
It's 7,000 lines long, and it itself says it is not a complete list. So we don't even have one list that says: this is everything you can do with this untrusted, unverified interface to your machine. Which is not ideal. Also — I'm not a kernel developer, but I checked, and as far as I can see — the very first parsing of the kernel command line happens in the kernel's EFI stub, before ExitBootServices. Remember, I said ExitBootServices marks a very important point in the boot process: before it, you want to protect your system and be really careful about what is allowed to run, execute and change the flow of execution. Now, you can use the kernel command line to configure the kernel to do things like disabling SELinux, or disabling IPE, which I talked about a moment ago. You can disable all these security components from the kernel command line. And it's not just the kernel that you configure. It's called the kernel command line, but it's just a command line: you can configure anything and everything in user space with it. Everybody sees it by default — it's approximately world-readable, it's just there — everybody has their own custom-written parser for it, and it's used for absolutely everything. And again, this is bad for confidential computing: the serial console is outside of the TCB. So this is a difficult problem. Of course, there are historical reasons for this: it's super convenient. It's amazing — you have a problem, you just press "e" to add "debug", and then you get some debug logs if your machine doesn't boot. That is super useful. But I think we're getting close to the point where we need to decide whether we want to allow this always, or only in some cases, or disable it completely in other cases — because it is the last bit missing, as far as I can tell, in the security story of the boot process on Linux.
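To make the problem concrete: a type #1 boot entry is literally just a text file under `$ESP/loader/entries/`, and anyone who can edit its `options` line can append long-standing, perfectly real kernel parameters that weaken or bypass security layers. The entry below is illustrative (file names and UUID are placeholders):

```
# /boot/loader/entries/fedora-6.7.0.conf  -- plain text, unsigned
title    Fedora Linux
linux    /vmlinuz-6.7.0.x86_64
initrd   /initramfs-6.7.0.x86_64.img
options  root=UUID=... ro quiet
```

Examples of what an attacker with write or console access could append to `options`:

```
selinux=0        # disable SELinux entirely
nokaslr          # disable kernel address-space randomization
init=/bin/sh     # skip init and drop straight into a root shell
```

Nothing in the Secure Boot chain, as traditionally deployed, verifies any of this.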
So for systemd-boot, we have decided to stop allowing editing of the command line — and supplying untrusted input to it in general — when you boot UKIs. You cannot do that anymore. And we made a lot of friends with that decision, I can tell you. The problem is, of course, that the flexibility is gone. Can we get it back? What are the use cases? One of the main ones is root file system discovery. Traditionally you would do root=/dev/sda1 or whatever — nowadays probably a UUID to identify the disk, so that a partition change doesn't break booting. For this we have the Discoverable Partitions Specification, which is supported by our tools: rather than systemd-boot telling the initrd where your root is, the initrd finds it automatically, based on the partition type UUID set on the partition. So this use case is very well covered now. I already mentioned UKIs, which have come up frequently at FOSDEM, so I'll go through this quickly: you can add a command line to the UKI when you build it — very easy with ukify, our tool for building UKIs. But of course it's one entry, a fixed entry. The UKI is meant to be shipped by the OS vendor, and that is not flexible at all, because the OS vendor doesn't know what you need on your machine to make it work. Now, we have a future plan — we'll get to it this year — where you'll be able to specify multiple options: your OS vendor will be able to say, here is my default kernel command line, here's one with "debug", and here's one with "factory-reset", so that you have multiple options, and in the systemd-boot menu you can select a non-default one if you so wish. This is high on the wish list, and I keep nagging Lennart to implement it very soon, but he hasn't done it yet. The other thing we have: systemd-stub is the small UEFI stub that is embedded in the UKI, the first bit that gets loaded. And we added a new thing this year called addons.
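Jumping ahead slightly, building such an addon with ukify might look like this — a sketch only; the flags and the drop-in directory name follow recent systemd documentation, but treat them as assumptions:

```shell
# Build a signed addon that carries only a command-line section, no code:
ukify build \
  --cmdline="crashkernel=512M" \
  --secureboot-private-key=db.key \
  --secureboot-certificate=db.crt \
  --output=crashkernel.addon.efi

# Drop it where systemd-stub looks for per-UKI addons, next to the UKI:
mkdir -p /boot/EFI/Linux/linux-6.7.0.efi.extra.d
cp crashkernel.addon.efi /boot/EFI/Linux/linux-6.7.0.efi.extra.d/
```

Because the addon is a PE binary, the firmware verifies its signature with the Secure Boot certificates before systemd-stub appends its command line.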
Again, they can be built by ukify, and what they are is just PE binaries — so they are signed, and the firmware verifies them against the Secure Boot certificates before loading them — but they don't contain code: there's just a kernel command line in a configuration section. systemd-stub automatically loads them if it finds them — again through the firmware, so verified, signed and trusted — and you can use them to extend the kernel command line that was fixed in the UKI. This is really meant for platform owners. For example, if you want to set crashkernel= to some amount of memory, that's probably the same across your whole fleet, at least for the same class of device, so you can use the same addon everywhere. Right now, every addon found is used; we want to add a menu to let you select which ones to apply at boot, in case that is needed, but we don't have that yet — it's again on the to-do list. Next we have extension images, which can be used to extend the initrd. You drop them into the ESP; they are DDIs, so again they get verified by the kernel, and given that the initrd inside the UKI is fixed, we can use this to extend the initrd with additional code or configuration: configuration overlaid onto /etc, or code overlaid onto /usr. Again, we don't have any way to select which ones to use — we just pick up every extension image that we find in the ESP. You can also embed them in the initrd and extend the root file system with them, or download them at runtime to extend the root file system when it's read-only. Finally — and this is my favorite one, and I think it should give us enough flexibility that we can start talking about actually disabling the editable command line by default — credentials. Credentials are a very simple concept that we added to systemd some years ago. They are just key-value pairs.
The key difference is that they are scoped: by default they are only visible in user space to the services that opt into them by key. In your service you say LoadCredential= with a key name; if a credential with that name exists, it will be loaded, and nobody else will see the content, because credentials can be encrypted. And — I think we have this already, and if not it will be ready very soon — if you know the TPM's public certificate for the SRK, you can encrypt them ahead of time for any machine, and they are decrypted only when the service starts and reads them, and only in the namespaced view of the service. So they are fully isolated: nothing outside of it can see the credential. And you can drop them into the ESP, again in per-image or global directories — again no selection yet; everything found in those locations is picked up. We are adding support throughout systemd, and outside of it, to use credentials to configure things that used to be configured with the kernel command line: your networking can be configured with a credential; your users, your autologin, your root password and a bunch of other things — literally hundreds of things — can be configured using credentials. I have a pull request open — hopefully merged as soon as I figure out the TPM measurement story — that will also allow you to create new credentials from the boot menu: where with systemd-boot type #1 entries or GRUB 2 you could edit the kernel command line, you'll be able to type a credential name and value in the boot menu, and it will be passed on by systemd and added to the initrd so that it can be used. So I think this is very powerful; it should give us most of the flexibility that we need. Maybe. Is that enough? We have GPT auto-discovery for your root file system, UKIs, addons, extension images, credentials.
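A minimal sketch of the credentials flow — this assumes a recent systemd on both ends, and the unit and credential names are made up for illustration:

```shell
# Encrypt a secret ahead of time; with a TPM2 present it is sealed
# against the local TPM by default:
echo -n "hunter2" | systemd-creds encrypt --name=wifi.psk - wifi.psk.cred

# Opt a single service into it via a drop-in, e.g.
# /etc/systemd/system/myapp.service.d/creds.conf:
#   [Service]
#   LoadCredentialEncrypted=wifi.psk:/etc/credstore.encrypted/wifi.psk.cred
#
# At runtime, only that service sees the decrypted secret, at
# $CREDENTIALS_DIRECTORY/wifi.psk; it is never in the environment and
# never inherited by unrelated processes.
```

This is the property that makes credentials a plausible replacement for stuffing configuration into the world-visible kernel command line.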
Is this enough to cover the use cases that we need in 90%, or maybe 95%, of cases? Of course, there will always be the case where you have to go put your hands on a machine that is completely broken. What you do in that case is break glass: you take your node offline — if it's a server, you move production workloads away — you disable security, you debug it, and then you can do whatever you want. Let's say that's 1% of the cases. Are we there yet for the other 95%? That is an open question, and I hope we can have some discussions about it. There's also Secure Boot 2.0 coming, and we're starting to think about it: should it say something about what is and isn't allowed to happen in the kernel before ExitBootServices? It's kind of an important topic — should that be in a specification or not? And as reviewers and maintainers of user space components, I think it is time we start saying: hey, if you're adding a new option configured via the kernel command line to my user space program, please at least also add a credential for it, so that both can be used. Is this an uphill battle? Most likely, but we keep trying, because we push the envelope a bit further every time, and hopefully we'll get somewhere. So I think that's it, and we have three minutes for questions. Oh, thank you — I was fast. Questions? Please.

So the question was: with the Discoverable Partitions Specification, how do you handle multiple installations of the same distribution — say, three Fedora installations — on the same disk? The way this works is that systemd-boot tells the initrd, via an EFI runtime variable, which disk carried the ESP that the systemd-boot binary or the UKI was found on.
So you get the disk the ESP is on, and then the gpt-auto generator takes that disk and only looks at it. If you installed onto multiple disks, we select the right disk like that. If you have multiple root partitions on the same disk, then, honestly, I don't remember how we handle that — I'd need to look at the generator. I think we recommend using different volumes for that, or — there you go — using a different UKI for the different root file systems, basically. Or again, credentials: I'm not sure we support credentials for the gpt-auto generator yet; we should add that, and then you could drop in a credential to decide which root to use. The point is that by default you find the right thing in the simplest case, and of course you need configuration for the complex ones: if you use the same disk for multiple root file systems, well, then you need to tell it which one to pick, and that's one way. But it's a good question.

That's a btrfs question: how to deal with booting from a different subvolume on btrfs? I have no clue — I don't use btrfs, but the Meta people here do. Do you remember? Right — so it's not supported in the specification right now, but there was a discussion about it, I think. Patches welcome, as usual. Anything else? Please.

So the question was: can we use the auto-generator when we create the UKI? The answer is yes, because then you would use the kernel command line — but only if you are generating UKIs locally. Our idea is that the UKI is generated on a server somewhere by your vendor, so it wouldn't work in that case. You could create a credential when you install it, for example, to tell it which UUID to go and look for.
But yes, if you are building UKIs locally — we have kernel-install plugins for that, and it does work — you could do it that way. Sorry — that was about the UUID in the UKI itself, and again, yes, that can work if you're building it locally. Anything else? Yes. And no, it is not a workaround for broken EFI variables. We added this so that we could configure autologin in VMs — I think that was the first use case back then. But the main use case was to have secrets that are encrypted against the TPM and not visible by default, so that services don't have to implement encrypting and decrypting all that stuff themselves, which is hairy, especially against a TPM. That was the main use case: to have sealed data that is visible only to the service, in its namespace, only while it runs. Normally a lot of secrets get configured via environment variables and things like that, and of course that is bad, because environment variables are inherited down the process tree, and you don't want your secrets to leak to all your child processes. So that is one of the reasons we added the credential concept. Yes.

Another question about credentials: what is the scope of credentials loaded from the ESP — can the whole initrd see them? Yes, if they opt in. Your initrd is trusted, verified and signed. In its configuration you say: services foo and bar can load this credential; service abc cannot. Then only foo and bar, which have opted in, will see it, and it will be decrypted for them. But yes, anything in the initrd can opt in, and I think we transition them across to the full OS as well.
So they will also be available to services running after the transition from the initrd to the full OS. Yes — credentials are awesome, you should check them out. By the way, the slides are online, and all of these items are links to the actual documentation. Anything else? I think we have two minutes.

I have a pretty dumb question: let's say I want to put "ro" on the kernel command line — would I do that with a credential? It depends who's reading it: is that for the kernel, or for user space in the initrd setting up your root file system? It depends on the case. If it's for your kernel, you probably want it in the UKI itself, because it's part of the configuration state you want your image to run in. If you want it only in certain cases, then maybe you use addons, and deploy them only on the machines that use the same image but a different configuration. So the answer is: yes, it depends. There are many ways to do it, and it depends on who's reading it, what the use case is, and whether you want it to be the default or not, and whatever else. I think we're out of time — okay. Thank you.
Reducing Costs and Improving Performance With Data Modeling in Postgres
Who's using Postgres in here? Yay! Thanks for being here on Sunday. Our next speaker, Charly, is going to talk about reducing costs and improving performance — and other things as well. Good luck.

Thank you. Thank you. Good evening. So yes, welcome. Today we're going to talk about how to reduce costs and improve performance doing really easy stuff, really easy things. My name is Charly Batista — but the presentation is not about me, and this slide was made by ChatGPT, so we can skip these slides. Sorry. Okay. Good to go. Nice. So, what are we going to talk about today? We'll have a bit of a review of what this talk is about: we'll review some concepts of how the hardware works, try to understand a little what cache is, how Postgres stores data, and then the summary — and thank you guys, you're good to go, so I think we're done, see you next year! I'll be a bit fast, because I have a lot of slides and not much time. If you have questions at any time, just raise your hand, that's fine — just interrupt me. Or, if you'd like, you can also wait for the end, and we'll try to answer as many questions as we can. So what is this talk about? It's not about going out there and modeling a business; it's about how to model the database. We'll try to understand a little of how the underlying hardware works, how we can play nice with it, and how Postgres can play nice with it. We'll see some concepts about how the computer stores data and how Postgres stores data. It may get a little low-level, but I'll try to keep it as high-level as possible so we can all follow along together. And we hope that by the end of this talk you'll understand a little more and save some money — especially those of you running in the cloud; you know that storage and all those things cost a lot of money. That said, let's start: we're going to do a quick review of the hardware.
So I suppose most of you have seen this picture before. This is the memory hierarchy — how memory is organized in the hardware. Down here we have secondary storage: your hard drive, SSD, HDD, tape if anybody still uses that thing. It's quite large, but it's slow and usually inefficient; the latency is high. As we go up toward the CPU registers, things get really, really fast — that's where the magic happens — but they also get really, really expensive. So we want to do our best to always use them in a very efficient way. Things to understand about memory: memory is either volatile or non-volatile. Down here is non-volatile. That's where you save your data, and you should save it there if you want it the next day, because if a power loss or whatever happens, everything sitting in the volatile memory up here is going to be lost. Also, as I said, the lower levels are cheaper and the upper levels carry a higher price. Memory can be accessed in basically three different ways: random access, direct access, and sequential access. We'll see that most of the time we're doing random access, especially in RAM — RAM is random by nature. We also always try, when we go to the hard drive, to do sequential access, both writes and reads, and we'll try to understand why. So if you look here, I have four CPU cores and the I/O controller. One thing you need to realize is that the CPU is not connected to your disk. There is no physical cable or pathway by which the CPU talks to your hard drive — it doesn't matter if it's SSD, HDD, tape or whatever. It has to go through the memory controller. The memory has the physical path to the hard drive. So every time you need something, your CPU asks the memory, the memory fetches that thing from the hard drive, and then it moves all the way up to the CPU.
Another very interesting thing that most developers don't realize is that we don't write bytes to the hard drive. When your software opens a text file and you save your name — "my name is Charlie", five characters, five bytes — no, I still save one block, which on most systems is four kilobytes. For that very simple operation of five characters, I'm dealing with a four-kilobyte block at the operating system level. And that goes for everything; the database does the same. So if we can do more work with those four-kilobyte blocks, things can go faster. That's one of the main ideas. Another thing you see here: HDDs are really slow, but a lot of companies and people still use them because they're quite inexpensive nowadays. But random I/O on them is terrible — it's slow, it's horrible. Every time you need to do random I/O there, it's horrible. And the problem is it's a mechanical device. Most people believe the performance problem is the spinning platter. Actually, that part is fast. The problem is that on top of the spinning platter you have a literal arm that needs to move back and forth, and this movement is really, really slow. So if you do random I/O and the arm has to keep moving back and forth, that's going to be horrible for whatever application, and especially for a database. On SSDs it's not that bad, but the performance of random I/O is still not the same as sequential I/O, both writes and reads. Writes are closer, but still not the same. So this is a little bit of what I mean by sequential access: sequential is when you write one block after another; random is when you have that mess. Most college kids, when you go into their bedroom, have random access in there — you can think of it like that. So, yeah. What is this cache?
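The point about block-granularity I/O can be sketched in a few lines — a toy calculation, assuming a typical 4 KB filesystem block (the real block size varies by filesystem, so treat the constant as an assumption):

```python
BLOCK_SIZE = 4096  # assumed filesystem block size; check yours with `stat -f`

def blocks_needed(nbytes: int, block_size: int = BLOCK_SIZE) -> int:
    """How many whole blocks the OS touches for a payload of nbytes."""
    return max(1, -(-nbytes // block_size))  # ceiling division, at least one block

# Saving "my name is Charlie" — five bytes — still costs one full 4 KB block of I/O.
print(blocks_needed(5))     # 1 block: 4096 bytes moved for 5 bytes of data
print(blocks_needed(4096))  # 1 block: a full block is the sweet spot
print(blocks_needed(4097))  # 2 blocks: one byte over the boundary doubles the I/O
```

This is why packing more useful work into each block — the theme of the rest of the talk — pays off: the I/O cost is the same whether the block is 1% or 100% useful data.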
We're talking about improving performance, but what does cache have to do with performance and databases? A cache, in its very simplistic definition — I got it from Wikipedia — is a hardware or software component that stores data so you can access it in the future in a faster way. We have many, many different levels of cache: we have network caches, we have hard drive caches, we have application caches — most databases have their own application cache — and we have the CPU cache. That's the one that really interests us today. And we have some definitions here. So what is a cache hit? Anybody? Come on, guys. Exactly. Let's say, for example, I do a SELECT to get Charlie's information. The first time I select Charlie's information, it goes from the disk up through memory to the CPU. But it stays there — the database doesn't throw it away — because the next time I do a SELECT for Charlie's information, it doesn't need to go to disk: the information is already in memory, and if we're lucky, in the CPU cache — remember that at the top of the hierarchy we also have caches close to the CPU which are really, really fast. That is a cache hit. A cache miss is the opposite: if I do a SELECT and that row has never been selected before, it needs to go all the way to the hard drive, so that's a miss, and that's going to be really slow and very inefficient. So the higher your hit ratio — more cache hits than cache misses — the better and the faster your application. We always try to improve that metric; it's a very important metric. We also have some write policies: the write-through and write-back policies.
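The hit/miss bookkeeping described here can be demonstrated with Python's built-in `lru_cache`, which tracks exactly these counters — `fetch_user` is a made-up stand-in for the expensive disk-to-CPU path of a SELECT:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def fetch_user(user_id: int) -> str:
    # Stand-in for the slow path: disk -> memory -> CPU.
    return f"user-{user_id}"

fetch_user(1)   # first access: cache miss, goes "to disk"
fetch_user(1)   # same row again: cache hit, served from the cache
fetch_user(2)   # new row: another miss

info = fetch_user.cache_info()
hit_ratio = info.hits / (info.hits + info.misses)
print(info.hits, info.misses, round(hit_ratio, 2))  # 1 2 0.33
```

A real database keeps the same kind of counters (in Postgres, `pg_stat_database.blks_hit` and `blks_read` give you this ratio for the buffer cache).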
The write-through policy is when you send information to the cache — especially when we're saving data, so the database is writing — and the information is immediately saved both in the cache and on the hard drive. So with write-through, it stays in the cache but is also immediately persisted. What's the problem with that? Latency. We increase latency when we use that policy. Is that all bad? No — some applications are fine with it; applications that need higher reliability will implement that policy. So we have trade-offs here. The other one is write-back: the information stays in the cache, and eventually — it's up to the CPU if it's a CPU cache, or up to the logic of the application — that data gets written back to the hard drive. Remember, everything up there is volatile: if we have a power loss, we might have a problem. So in this case we may improve performance a lot, but we may lose reliability. We need to weigh that trade-off. And then we have prefetching. CPUs are very smart — at least they tend to be. There's this principle that if you access some data now, you'll probably need more data from around it. We call it spatial locality. So based on locality, if I have Maria, Charlie and John stored together and I select Charlie, the probability that I'll need data from Maria and John is higher than for data far down the line. It's all about probability. We also have temporal locality: information you accessed just now has a higher chance of being accessed again in the near future, like in the next few minutes. So what does the prefetcher do? When we ask for one block, the CPU goes: oh, this guy is accessing this block, so he's probably going to need the next block. So the CPU prefetches that next block in advance and puts it in the cache.
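A minimal sketch of the two write policies, with made-up class names — a plain dict stands in for the hard drive, so the durability trade-off becomes visible:

```python
class WriteThroughCache:
    """Every write goes to the cache AND straight to backing storage:
    higher latency per write, but nothing is lost on a power failure."""
    def __init__(self, storage: dict):
        self.storage, self.cache = storage, {}
    def write(self, key, value):
        self.cache[key] = value
        self.storage[key] = value          # synchronous write to "disk"

class WriteBackCache:
    """Writes stay in the cache; storage is only updated on an explicit
    flush: fast writes, but dirty data is lost if power fails first."""
    def __init__(self, storage: dict):
        self.storage, self.cache, self.dirty = storage, {}, set()
    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)                # deferred: nothing hits "disk" yet
    def flush(self):
        for key in self.dirty:
            self.storage[key] = self.cache[key]
        self.dirty.clear()

disk = {}
wb = WriteBackCache(disk)
wb.write("row1", "Charlie")
print("row1" in disk)   # False — a power loss here would lose the write
wb.flush()
print("row1" in disk)   # True — now it's durable
```

The same trade-off shows up in Postgres itself: `synchronous_commit = on` behaves like write-through for the WAL, while turning it off trades durability of the last few transactions for lower commit latency.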
So if I need that block, it's already in cache — the CPU doesn't need to go back and fetch that information again. That improves performance. That's awesome, right? Now comes the problem: caches are expensive. The closer they get to the CPU, the more expensive they get. If you're going to buy a laptop or whatever device and you look at the CPU — an i9 or whatever — you're going to see those L1 and L2 caches, and they're not in gigabytes or terabytes. They're usually in kilobytes. What can you do with kilobytes nowadays? Almost nothing, right? Another problem is what we call the cache line. The cache line is literally the line the cache is divided into. The cache is divided into many lines, each line has a specific size, and it usually depends on the CPU word size. What's the CPU word size? If I have a 64-bit CPU, the word size is going to be 64 bits, and the cache line will likely follow from that. Why "likely"? Because when we go to the CPU, we stop counting time in seconds or nanoseconds — now we count time in clock cycles. Inside the CPU there are those things called registers, which are really, really fast: it takes only one clock cycle for the CPU to get information there. When we go to the L1 cache, it usually takes between three and seven clock cycles, depending on the CPU and the cache. So things start to slow down. People always tell you memory is really fast. Memory takes around 215 clock cycles — it can be 100 times slower than the CPU cache. From the CPU's point of view, memory is really, really, really slow. So we don't want to go there. Can you imagine your hard drive? They didn't even put it on this chart, because I don't know how many digits the number would need. It's insanely slow. That's why we always try to keep information up in the cache. Remember what I said about the line size?
So we might have a problem: we want to fit everything inside that line size, we don't want to waste it. Can you spot the problem here? I put a really simple algorithm up — and a tip: it's not the code or anything; it compiles, that's not the problem. Anybody? The alignment. Exactly — the alignment. Most developers, at least a lot of them, believe the information will fit in memory just like this: the int, the boolean, the int, the boolean again. The problem is it won't. Why? Let me see if this thing works. Can you see? No, the pointer doesn't work. So you see from 0 to 7, there are 8 little blocks — that's 64 bits, 8 bytes. In this example that's the size of my cache line. Because the CPU can only fetch one word size at a time, the boolean — which is only one byte here — forces the next int onto the next line. And see all that white stuff? We call it padding. It's waste — waste of money and waste of time. So in this example, if we go back, we end up using four big blocks. We could fit everything in two big blocks if we aligned the fields properly — two CPU fetches for that information. But actually we're going to have four: double the time. And we only have four variables here. So we mess things up really, really quickly. Keep that in mind, because it's going to be really important for our discussion moving on. So, how does Postgres organize its data? Postgres, like most databases, has its own file organization. These are the most common ones — B-trees and so on; I just listed them here, nothing special about the order, it's not that one is used more or is better than the others, that's not the point. Postgres uses what we call heap files. Heap files are really interesting because they are very simple — if not the simplest, one of the simplest. How does a heap file work?
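The int/boolean layout problem from the slide can be reproduced directly with `ctypes`, which lays structures out the way a C compiler would. The concrete sizes assume a typical 64-bit ABI (4-byte `int`, 1-byte `bool`):

```python
import ctypes

class Unordered(ctypes.Structure):      # int, bool, int, bool — as on the slide
    _fields_ = [("a", ctypes.c_int), ("b", ctypes.c_bool),
                ("c", ctypes.c_int), ("d", ctypes.c_bool)]

class Ordered(ctypes.Structure):        # same four fields, grouped by size
    _fields_ = [("a", ctypes.c_int), ("c", ctypes.c_int),
                ("b", ctypes.c_bool), ("d", ctypes.c_bool)]

# Typical x86-64/arm64 layout:
#   Unordered: 4 + 1 + 3(pad) + 4 + 1 + 3(pad) = 16 bytes
#   Ordered:   4 + 4 + 1 + 1 + 2(pad)          = 12 bytes
print(ctypes.sizeof(Unordered), ctypes.sizeof(Ordered))  # 16 12
```

Same four variables, 25% less memory — and, on the speaker's 8-byte cache-line example, half the fetches — just from reordering the fields.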
It's basically one spaghetti of information: you put one block after another, and that's it. That's a heap file. There's no order, no guarantee of order, nothing special — just one block after another. It's a very simplistic implementation, right? But it can also be very efficient, because remember the locality thing: if you just have one block after another and you prefetch that information, the access is going to be sequential. This is a problem for indexes. Sometimes people don't understand why Postgres doesn't pick that amazing, nice index they just created for their application. The index was there, but the database doesn't pick it up. And a lot of people try to force the database to pick it up because they think the database is wrong — well, sometimes it is, and then they realize it's not. It happens. One of the reasons is that an index is a B-tree, most of them, and a B-tree is random by nature: the access is random, because you have one block here, another there, another over there. So you're changing the access pattern from sequential to random. Remember how expensive random I/O is, especially on an HDD with the spinning platter? That would be horrible, even with a really efficient index. That's why the database says: nah, not today, maybe tomorrow — today I'm fine, I'll just do a full table scan and get everything. And more often than we sometimes think, that's faster. Heap files in Postgres have some very interesting properties — let me go back here. The blocks in the heap file are eight kilobytes in size, and each file has a limit of one gigabyte. Does that mean we can only have one-gigabyte tables in Postgres? No — when a file hits the one-gigabyte limit, Postgres just creates another one, and another one, and another one.
If I have one terabyte of information, I'm going to have around 1,024 files. Be mindful of that, because depending on the file system you use, some file systems do not play well with many, many files in the same folder. That might be a problem. So: blocks are eight kilobytes in size. Keep that in mind as well; it's also going to be quite important as we move on. As I said, one row is just appended after another — nothing fancy, no fancy organization whatsoever. If you insert all your data into Postgres in a nice order and index it, okay, fine. But if you then update that data and search again without an ORDER BY, the order is going to be messed up. So be mindful of that too: heap files have no guarantee of order at all. This is basically how it's organized, and every single block has its own internal organization. It has a header, and the header holds a lot of data — I put the fields here, but as I said, I'm not going to go through all of them, because that's not the point; you can look them up in the documentation, which is really nice. What matters for our discussion is how these things are laid out. This is a picture of a block inside the database, inside Postgres. We have the header, and then we have the data. The nice thing is that the data is stored starting from the end of the block, not the beginning. We have line pointers at the front that point to the data. So the data grows from the end of the block toward the center, and the pointers also grow toward the center. When they get close together, it means there's no space left in that block — the block is full, and the database needs to create another block to store more information. Also, Postgres doesn't split a row between blocks.
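The pointers-from-the-front, data-from-the-back layout can be modeled in a few lines. This is a toy page that ignores tuple headers and alignment; the 8 KB block size is real, but the header and line-pointer sizes are the rough Postgres values, used here only as assumptions:

```python
PAGE_SIZE = 8192        # Postgres block size (BLCKSZ)
PAGE_HEADER = 24        # approximate page header size, bytes
LINE_POINTER = 4        # bytes per line pointer (ItemIdData)

class HeapPage:
    """Toy model of a heap page: pointers grow from the front,
    tuple data grows from the back; full when the two regions meet."""
    def __init__(self):
        self.lower = PAGE_HEADER    # end of the line-pointer array
        self.upper = PAGE_SIZE      # start of the tuple-data area

    def add_tuple(self, tuple_size: int) -> bool:
        needed = tuple_size + LINE_POINTER
        if self.upper - self.lower < needed:
            return False            # lower and upper would cross: page full
        self.lower += LINE_POINTER  # new pointer appended at the front...
        self.upper -= tuple_size    # ...pointing at data added from the back
        return True

page = HeapPage()
count = 0
while page.add_tuple(100):          # keep appending 100-byte rows
    count += 1
print(count)                        # (8192 - 24) // (100 + 4) = 78 rows per page
```

Smaller rows mean a bigger `count` — which is exactly the density argument the speaker makes later for the authentication table.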
If your data doesn't fit inside one block of 8 kilobytes — actually, we'll see that if it doesn't fit inside part of the block, Postgres does something else. But you're not going to lose your data; the database will handle it. Now, those rows we see here — tuples and rows, we use the terms interchangeably — also have their own layout. I put all the fields here, and I'll go through a couple that are important for the discussion, not all of them. They have a header as well, and the header is fixed size. A question: does anybody know how Postgres stores NULL? If I have a table with a lot of columns that can be NULL — anybody? The null bitmap. Sorry? The null bitmap. Exactly, the null bitmap. Does anybody know what a bitmap is? Okay. Yeah, a bunch of bits. You have this map, a sequence of bits — 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1 — and the position of each bit corresponds to the position of the column inside your row. If the bit is set, the column has a value; if it's zero, that column is NULL and nothing at all is stored for it. So it's a highly efficient and compact way of storing NULLs inside Postgres. That's how it works. And also very important here — this is another picture of the row — is this padding. Remember the CPU padding? Well, the database also tries to keep things aligned for the CPU. If things don't align, the database adds padding, and we'll see how that plays out. And remember I said that if the data doesn't fit, Postgres saves it somewhere else? That's what we call TOAST — The Oversized-Attribute Storage Technique. And it usually goes really well with coffee.
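A toy version of the null bitmap idea — following the convention that a set bit means the column has a value, so a zero bit costs zero bytes of storage:

```python
def make_null_bitmap(row: list) -> int:
    """Toy null bitmap: bit i set means column i holds a value."""
    bitmap = 0
    for i, value in enumerate(row):
        if value is not None:
            bitmap |= 1 << i
    return bitmap

row = ["Charlie", None, 42, None]       # columns 1 and 3 are NULL
bm = make_null_bitmap(row)
print(format(bm, "04b"))                # 0101 — read right-to-left: cols 0 and 2 present
print(bm & (1 << 1) == 0)               # True: column 1 is NULL, no bytes stored for it
```

Four columns' worth of nullness fits in half a byte — which is why NULL-heavy tables can be surprisingly compact, as long as you account for how NULLs shift the alignment of the columns that follow (a point that comes up again in the Q&A).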
So Postgres uses another file — not your data file — to store the information that doesn't fit inside the block, and then it puts a reference inside the block, a pointer, to where that data begins in the other file. The database does this automatically; you don't have to do anything. So every time you have a text or a varchar(3000)-or-something column, you're going to have TOAST, because that won't fit in an 8-kilobyte page. There have been a lot of improvements since it was created, with compression and those things, so it plays quite well. And when we understand how these things work, we can also change our data organization — that by itself would be another talk, but I can give a small example. Let's say we have a users table that handles authentication. For authentication to work, we only need to compare the username and password. But once the user is authenticated, sometimes you want to show the picture, the bio, and a lot of other information that lives in TOAST. If we create a table with only the information needed for authentication, the authentication process — which is an expensive process — can go really, really fast, because you can fit a lot of rows inside one block, and the database doesn't need to go to the TOAST at all. We've kept the TOASTed information out of the hot path purely through modeling. Then, when we need to show the user's information, we can still use that same primary key as a foreign-key reference and fetch that data at that specific time. For most applications, authentication is a hot path — everybody goes through authentication, but really few people visit the bio page. So we can improve the performance of our application doing almost nothing — really simple changes — and we get a boost in performance. So we've already covered how those things work; now, the performance improvement.
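The decision sketched here — when a row spills into TOAST — can be approximated like this. The quarter-page threshold mirrors what Postgres uses by default (`TOAST_TUPLE_THRESHOLD`), but treat the exact constant and the function as an illustrative assumption, not the real implementation:

```python
BLCKSZ = 8192
TOAST_TUPLE_THRESHOLD = BLCKSZ // 4   # ~2 KB: Postgres tries to keep tuples under this

def goes_to_toast(tuple_size: int) -> bool:
    """Rough sketch: oversized tuples get compressed and/or moved
    out-of-line to the TOAST table, leaving a pointer in the heap page."""
    return tuple_size > TOAST_TUPLE_THRESHOLD

# Hypothetical sizes for the two designs discussed in the talk:
print(goes_to_toast(120))    # False — a slim auth row (username + password hash) stays in the page
print(goes_to_toast(5000))   # True  — bio + picture spill out-of-line
```

This is the quantitative reason the split-table design wins: the slim authentication row never pays the extra I/O of chasing a TOAST pointer, and thousands of such rows fit in each 8 KB block.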
So, as I said, when a row fills a certain percentage of the block, the oversized part is saved elsewhere with a pointer to it. And now we come to what really matters. If the data is too large, it goes into TOAST. But what if I have too many small columns? Well, Postgres has a limit on how many columns you can fit in a table, because beyond that they won't fit in the block. So the question is: if we have too many columns, what does Postgres do, given that it doesn't split rows? You simply have a maximum number of columns. It depends on the data types you're using — for example, bigint is eight bytes, so you get one number; with smallint you can fit more — but you have a hard limit because of the block size. That's how it is. If it doesn't fit, you need to split the table into two, three, four, however many you need. But that would be insane — having something like 1,600 columns in a table. Can you repeat the question, because it wasn't on mic? So the question was: Postgres puts information outside the block when it's larger than the block, but if we have too many columns inside the row, how does Postgres handle that? And again: Postgres just has a hard limit on the number of columns you can put there. Beyond that, you need to create more than one table — that would be the solution. Now we come to data alignment and padding. Remember I said the database does padding too. The natural alignment in Postgres is eight bytes, or 64 bits. What does that mean? It means that every time you have one integer — an integer is four bytes — you should put another four-byte value next to it to get a perfect alignment.
If you put a four-byte integer and then a bigint right after it, that bigint is going to be pushed to the next eight-byte boundary, because it doesn't fit the natural alignment. And you can ask: well, what's the problem? Just go to the next one, not a big deal. So every type has its own alignment, as you can see here. Varchar basically needs no fixed alignment of its own because it's variable-length. That doesn't mean it's good to sprinkle varchars everywhere — actually it's best to put them at the end, not at the beginning, and we'll see why. And we see that a smallint has an alignment of two bytes. Yes — question: "Does the order of the fields matter — can we optimize the order of the fields? When I design my table, do I need to think about this?" Got it. So the question is: does the database automatically reorganize the columns internally in the best way, or does the DBA need to reorder the columns himself? The database does not. This is the DBA's job. The DBA has to do this work — and it's a really good question; we'll see sooner or later why that is. As we see here, every type has its own alignment. This is from the Postgres documentation, just copy and paste. "2 bytes on most machines" means that type will occupy a two-byte slot there. So, for example, with the eight-byte natural alignment, I can fit four of those two-byte types one after another into eight bytes. However, if I put one two-byte value followed by a bigint — which is eight bytes by itself — we're going to have six bytes of padding. Waste. Remember, padding is waste — most of the time, waste of money, especially if we're running in the cloud. All right. And it's really possible to optimize these things to make them work better. Here's an example. In this example, I created a table — see, we have really few columns there, not many — and I just put them in a random order: an integer, then a varchar, then a timestamp. A very small table.
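The alignment arithmetic the speaker describes can be written down as a small calculator. The sizes and alignments mirror the Postgres documentation table (`pg_type.typalign`); the "optimization" is simply sorting columns by descending alignment — a common DBA rule of thumb, not an official Postgres tool:

```python
# (size, alignment) per type, from the Postgres documentation.
TYPES = {
    "smallint":  (2, 2),
    "integer":   (4, 4),
    "bigint":    (8, 8),
    "timestamp": (8, 8),
    "boolean":   (1, 1),
}

def row_size(columns):
    """Fixed-width portion of a tuple, counting alignment padding
    (tuple header and null bitmap ignored for simplicity)."""
    offset = 0
    for col in columns:
        size, align = TYPES[col]
        offset += -offset % align   # pad up to the next multiple of the alignment
        offset += size
    offset += -offset % 8           # the whole tuple is padded to MAXALIGN (8 bytes)
    return offset

bad  = ["integer", "bigint", "smallint", "timestamp"]         # int, pad(4), bigint, smallint, pad(6), ts
good = sorted(bad, key=lambda c: TYPES[c][1], reverse=True)   # biggest alignment first
print(row_size(bad), row_size(good))   # 32 vs 24 — 25% saved per row, on every row
```

The 25% here matches the order of savings the speaker reports on his million-row test; the exact figure always depends on the specific column mix.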
I inserted just one million rows. With that one million rows I got a certain size, and then I reorganized the columns just to align better, to remove the padding. The alignment saved me 25% of the space. How much would 25% less be on your AWS storage bill? Maybe enough to buy a burger, right? We never know. We'll see a couple more examples. I'll wait for people to take photos. All right. Yeah, question? "Have you tested with the JSONB data type?" — that's the question, right? So JSONB has its own specificities, and most of it doesn't go inside the block, because most of the time it doesn't fit inside the block. The thing about JSONB is that it has its own optimization algorithms. So no, I haven't tested that — I have not — but that would be a really interesting one to do, to see how it works, how it plays together. Especially because JSONB is a binary format, so the algorithm should be able to do a lot of optimizations along the way. But yeah, I'll take note of that one, because now I'm curious. Another question? So the question is whether there's an analyzer tool we can use to see how much space we're wasting. Not that I know of. There might be one out there, but not that I know. That would also be a really nice open source tool to develop, you know — people looking for ideas, that's a nice one. "Back to JSONB — how does the database organize it, in 32-byte, 64-byte chunks?" Okay, the question is, back to JSONB, how is it organized internally — 32 bytes, 64 bytes? I don't have an answer for that, because I really have no idea — I haven't played with JSONB much. Most of the work I've been doing with these things is performance with, how would I say, the native data types. So yeah, I don't have an answer for that either. "How long does it take to review everything and optimize it?"
"And are there any tools to make the process quicker, like automated scripts?" The question is how long the full review process takes, and whether there's a tool. Well, I don't know of a tool. And the time highly depends on how complex your database is. For a small database, like the one I used for the TPC-C test, it took me a few minutes. For a really complex database, it can take some time — but it's time well spent. Next question: does this alignment trick only work on Postgres, or on other databases as well? If I'm not mistaken, Oracle also does alignment, but the implementation is specific; it varies from database to database. I haven't played with the others, especially because this type of documentation is not widely available, and if you don't have documentation you really need to go digging in the binary format, which is really time-consuming — and I haven't worked with other databases for like 15 years. But they probably do. The next question is whether we have a way to ask Postgres what the alignment is. Yes, we do have some extensions that let you go deep and see how the data is organized; we even have extensions that let you inspect the memory from time to time. Okay, moving on, because I only have 15 minutes. Thanks — probably less now, because he's been showing me that sign for five minutes. So, what are the implications of all this? I showed you one example. Now I'll show you another, where I used a tool named sysbench to run a TPC-C-like experiment. This is just one of the tables, exactly as sysbench created it. Normal stuff — it's a really small table, there's not much, and most of the columns are integers. Nothing special going on there, right? And this is what I did. Besides the flashing, did you notice anything? I changed the order of some columns. It's a shame this pointer isn't working.
But yeah, I changed the order of some columns. See, I put all the integers at the top, and then I put the smallints arranged in pairs to improve the alignment, and then I put the other columns. See that I put the timestamp — a timestamp, if I'm not mistaken, is eight bytes — and right after it another smallint, because they can still share the space within the alignment: I have four and two, six bytes used, so the only padding I'm going to have is after that one. I tried to minimize the padding as much as I could. Back to the other one — see, it doesn't look that bad: an integer, a smallint, then an integer. Not so bad, so it shouldn't make a huge difference. Well, this is what happened. See "new" on the left, the schema name: the schema "new" is where I created the reorganized tables; the schema "public" has the old ones. The total size of order_line in the new schema is 3.8 gigabytes; the old one was 4.1. That's around a 7% reduction just from reordering. And also interesting — look at the index size. We got an improvement on the index size too, and they have exactly the same indexes. Let's go back: see the primary key here, with columns one, two, three, four, and then another two columns for the other index. Exactly the same indexes — same names, same columns, same column order. I didn't even touch the indexes for the optimization; I just left them there and only reordered the table. And this is what happens. I'm highlighting here, for one of them, how much space is saved. One table — and a very small table, right? But how does it play with performance? Because saving space is one thing; what about performance? So obviously I ran a TPC-C-type test. The answer: I got on average an 8.4% performance improvement.
And on average, for this example, this workload, a 19% disk space reduction. I'm cutting 19% off my disk space bill on cloud providers — I think that's why they never approve my talks at their conferences. Now that I think about it, it makes sense, right? The latency I reduced by about 15%. Just shuffling columns around — the application doesn't even need to change. And I'll tell you: when I created those tables, because sysbench has a fixed structure for its inserts and updates, I had to trick sysbench. I had to create views, and then rules inside Postgres to make those views insertable, updatable and deletable. So on top of this, I still had extra latency in the database because of the tooling — and even so I got almost 9% improvement. "Can I just clarify — I'm confused. Your average write and read are around 8.2 and 8.5, so I thought the latency improvement would be between those two numbers. But obviously you mean latency in a different sense, or in what you're measuring?" Latency on the application side — because latency is not only about how fast or slow you get the data, but also, in the end, how fast or slow you process it. You have more data in fewer blocks, so on the application side latency gets a lot better. So I'm also improving the application for free. "You get some kind of improvement, but what do you need to do to reorganize the columns?" So the question is what happens if I need to go into an existing table and reorganize the columns. Yeah, if you already have an application, you may need to do the trick I did here: first create the new table — so you're going to double the storage for a certain amount of time — then create views, and create rules inside the database so your application can insert, update, delete, and obviously select on those views until you switch over.
Or you can use a tool like — if I'm not mistaken, PG... I forgot the name of the tool. Actually, a guy had a really nice talk on Friday at PGDay about doing online schema changes, so your application doesn't even need to know that you're changing the database. So you have a few different options. "What's the name of the tool — pt-online-schema-change?" That one's from MySQL. "pg-online-schema-change?" Ah, pg-online-schema-change, that's true. Thank you. "Is there a reason the project doesn't do the reorganization for you?" The question is whether there's a reason PostgreSQL doesn't do this reorganization itself. Yes: even though we could have really smart logic in there, the database doesn't know the full story. It might try to reorganize the data and mess things up along the way. So it's always safer for the DBA to do it. At the end of the day, the database should store the data and retrieve the data for you — you need to know how the data plays inside as well. "I wonder if it's important which columns can be null, and how often the rows actually contain values in those columns?" So the question is whether it matters for a column to be nullable, and how often that plays a role. It does matter — not for the column itself, but for the ones that follow it. Remember, Postgres doesn't store the NULL: it's recorded in the bitmap, and nothing is stored in that position. So yeah, it does play a role. I haven't tested how much impact that has, especially because I just used the standard tooling — but it definitely should have some impact, for sure. "Would it be fair to say that this benchmark measures the B-tree more than anything? And let's say we're not prioritizing latency but doing dense joins and we want more bandwidth — with something like grid, could we get a much bigger improvement in bandwidth?"
Can you rephrase that? I don't think I understood the question. So for this benchmark, it seems like it measures bitmap scans; this is like a B-tree benchmark. And let's say that we're not after latency reduction, but rather we want more bandwidth, say in a case where we have distinct or time-based joins. Could we use something like grid to effectively get a much bigger gain in bandwidth, if it's dense enough?

If I understood correctly, the question is whether it's fair to say that the benchmark mostly tested B-tree performance, and whether more data density would improve things. Well, actually, as I explained, PostgreSQL, especially for insertions, works on the heap file. So in this type of benchmark we mostly do not deal with the B-trees, and their impact on performance would be really marginal, especially since, as you saw, we don't have many indexes: most of the tables here have just the primary key, and a few of them only one index. The only B-tree structures in PostgreSQL in this example are the indexes, because the tables themselves are not B-trees; they're just heap files, right? So in that sense I would not fully agree, but density would definitely play a role, if you could have more dense tables. Take the example that I gave for authentication: if you keep just the few columns needed for authentication in a table, you are increasing the density of the information in that table. On the same block, instead of having, say, 300 rows, you'll have a thousand. It's a lot more dense, and that per se makes it a lot faster, especially if you consider that time is also spent on the network, as you mentioned, and the network is not only the bandwidth between machines but also the bandwidth inside of the computer itself.

Well, we have five minutes. Let's rush to the end. So, here is the summary.
And actually, wait a moment for the photos, okay? If you want to take one thing away, this is the summary: every data type has its own alignment in PostgreSQL and can cause padding. And that can be really dangerous: we can really mess up our data layout, especially because most of the tables in our applications have 20-something columns, and if we are not careful about how we order them, we can waste a lot of space.

Questions? I think we have two minutes.

When you have fields with varying size, like the text fields, what are the problems? The TEXT and VARCHAR types, right? They do not play well with this padding analysis because they are variable-length, so the database doesn't know ahead of time how to optimize them. It highly depends on what your data looks like, and it's usually best to put them at the end.

Sorry, can you say that again? Okay. The question is whether I tested on a larger database, because a smaller database might just fit fully in memory. So yes, I did. And what you need to realize is that all the padding that we have here goes to memory, all the padding that goes to memory goes to the network, and all the padding that goes to the network goes to your application. So you're wasting space on your hard drive, in your memory, in your CPU cache, on your network and in your application. What are the numbers? Well, it highly depends on the application, so I would say it's empirical; there's no exact science to it, but you can get up to 30% performance improvement.

That's all. If you have more questions, come find me. Here is my link again. Thank you.
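The summary point, that each type's alignment requirement creates padding and that ordering columns widest-alignment-first removes it, can be checked with a small back-of-the-envelope calculation. The sizes and alignments below match Postgres's fixed-width types (bool 1-byte/1-aligned, int2 2/2, int4 4/4, int8 8/8), but this script only illustrates the arithmetic; it is not a measurement of an actual table, and the example column list is invented.

```python
# Postgres-style alignment: each value starts at an offset that is a
# multiple of its type's alignment; the gaps in between are wasted padding.
ALIGN = {
    "bool": (1, 1), "int2": (2, 2), "int4": (4, 4),
    "int8": (8, 8), "float8": (8, 8), "timestamptz": (8, 8),
}  # type -> (size, alignment)

def row_size(columns):
    """Bytes used by the fixed-width part of a row, padding included."""
    offset = 0
    for col in columns:
        size, align = ALIGN[col]
        offset += -offset % align  # pad up to the next aligned offset
        offset += size
    return offset

bad = ["bool", "int8", "int2", "int8", "int4", "int8"]   # alternating widths
good = sorted(bad, key=lambda c: -ALIGN[c][1])           # widest alignment first
print(row_size(bad), row_size(good))  # 48 31
```

Here the naive ordering burns 17 of 48 bytes (about 35%) on padding, which is the same effect, at smaller scale, as the disk savings discussed in the talk.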
Introduction to the Public Code and Digital Public Goods devroom
So, hello. Welcome to the Public Code and Digital Public Goods devroom. My name is Elena Findley-de Regt; I'm here from the Foundation for Public Code. This is my colleague.

Hello, everyone. Nice to meet you. I'm Amreen Taneja, the Standards Manager at the Digital Public Goods Alliance, where I manage, lead and promote the Digital Public Goods Standard. So, very excited for this devroom today.

And I'm Jan Ainali. I'm also at the Foundation for Public Code, and I'll talk later.

Cool. So, in case there's any confusion about what we're doing here and who we are: this is a devroom dedicated to everyone developing public code. That is, open source code that implements public policy, is used for the public good, and is used by public organizations like governments, public administrations and state corporations. Digital Public Goods, DPGs, are open source software, open standards, open data, open AI systems and open content collections that help meet the sustainable development goals.

We have a couple of housekeeping notes. Most importantly, the FOSDEM Code of Conduct applies here, so please be respectful in the space. Secondly, we have a window open for ventilation, to make the space a bit more comfortable. If people would like more than one window open, I'm happy to hop on that; we're going to leave the window open all day in any case. And that brings us to the third housekeeping point: if you have any questions, if anything comes up today, talk to Jan, Amreen or me. And that's it. So, on to Amreen.

Thank you so much. I'll just take a moment and get this up. Okay. So, I've already introduced myself. First of all, I'd like to warmly welcome you all to this devroom today. First things first, I'd like to share with you a bit about the Digital Public Goods Alliance, for those of you who are new to this organization and concept.
So, we are a multi-stakeholder initiative which was launched in 2019, and our mission is to accelerate the attainment of the sustainable development goals by facilitating the discovery, the development and the use of digital public goods, which are essentially all open source solutions. I'll share more about this as we move forward, but I'd like to kick off by introducing you to the Digital Public Goods Standard.

To give you a little bit of context on where this concept and definition came from: the DPG definition was laid out by the UN Secretary-General, and there are five kinds of open source digital solutions that can be recognized or certified as DPGs. These are open source software, open data, open content, open standards and open AI models. We have a set of nine indicators that make up the standard, and I'll share a bit about each of them with you today.

The first one is SDG relevance. This is a very broad topic: essentially any application that wants to do good for society in some form or another will fall under one or another SDG. What we expect from you here is, first of all, to establish a clear contribution to one or more SDGs, and also to explain how your application seeks to achieve that. We also have an SDG tracker tool, which I'll be sharing later in the presentation.

The second indicator is open licensing. The DPG standard has a set of specific licenses that we accept: for software, all licenses approved by the OSI; for open content, the Creative Commons licenses; and then we have various other licenses for AI systems as well as data. Because we're short on time, I won't go into too much detail right now, but I'd love to have this conversation with you later on.
I'll move on to the third indicator for now, which is clear ownership. The DPG status needs to be renewed every year, so you have to send in an application every year, and your application needs to be up to date with the standard that we have created. So we need to know who the owner of the application is, and it can be either a person or an organization; both are acceptable. What you have to provide to us is proof of ownership, which is in any case a legal requirement for the application.

The fourth indicator talks about platform independence. This is a tricky one, and the goal here is for vendor lock-in to be avoided. We prefer for everything to be open source, but let's say you have a proprietary component within your application. When you apply for DPG status, you have to point to an alternative open source component and explain how it can be implemented, the condition being that it should be relatively easily implementable for anybody who has enough technical knowledge. We in fact have external facilitators and experts for this particular indicator, and we have them with us today as well. Ivan, that's for you. So if you have any questions around this indicator, please feel free to contact him.

Indicator number five is documentation. This is fairly straightforward: it basically means that you need to have all your documentation in place. It can be in the form of a repository, on your website, or in the form of a handbook, and it should have enough detail that someone with enough technical knowledge is able to deploy the solution by themselves. That is the requirement that we have. Now, moving on to indicator number six.
So, that basically talks about the mechanism for extracting data. If your project collects any sort of non-PII data, then it should be possible to access it through non-proprietary formats. That is the condition that we have.

Indicator number seven is adherence to privacy and applicable laws. In fact, I have some news around this indicator which I'll be sharing with you later on. Essentially, what this means is that your application should be compliant with the privacy laws of the jurisdiction where the application has been created, or where you intend to operate. So if it's Europe, that will be the GDPR; anywhere else, the equivalent applies, and you have to provide proof of compliance, for example by providing us a terms of use or a privacy policy. These things are handled on a case-by-case basis, so you'll be speaking to our reviewers about this, and once you satisfy the conditions, we move forward.

Indicator number eight is adherence to standards and best practices. Essentially, whatever standards and best practices apply to the industry your solution belongs to, you have to adhere to them and provide some proof of adherence to us as well.

And lastly, indicator number nine is do-no-harm by design. We say "by design" because we don't look at implications somewhere down the line that are completely out of your control; we look at how the digital solution is built rather than how it ends up being used. That is what we focus on.

Now, moving on to the next slide: how do you become a DPG? This is a three-step procedure, and the first stage is nomination.
Nomination means that you can either nominate yourself or a third party can nominate you. The second stage after this is the technical review. This is a very rigorous process: we have level-one and level-two reviewers who go through your application, and if it satisfies all the conditions, your application is certified as a digital public good and recognized on the registry.

So, like I mentioned: step one, we have a five-minute eligibility test that anybody can take, so you can figure out whether your solution is, at the outset, capable of becoming a digital public good or not. Step two is the nomination; this is what the application form looks like, and it needs to be filled out as per the criteria we just spoke about. And step three: success. If your application is selected, it is added to the DPG registry. And this is the SDG tracker tool that I was talking about: this is where we have 150 of the DPGs categorized and arranged as per the various SDGs that they are striving to contribute towards.

Now, coming to the call for experts. I mentioned some news about indicator seven: the standard is entering phase two of operations. What this means is that we are going to be fine-tuning critical indicators of the standard through two expert groups that we are launching, one on privacy and one on AI. You'll see this poster across the devroom and outside as well, so if you're interested, please feel free to scan the QR code and apply. These are the requirements: if you're a subject matter expert in either privacy or AI, with a technical background, legal background, academia, or any other background which you think would be a good fit, please do apply. And it's not much of a time contribution; it's about three to four hours for this knowledge partnership.
And of course, if you have previous experience in standards-making, that is also highly encouraged. And with that, it comes to an end from my side. I would like to introduce Jan now, who is a DPGA member as well as the co-host of this devroom. Thank you so much.

Thank you, Amreen. I come from the Foundation for Public Code. We're a non-profit, based in Amsterdam, but we aim to work globally; just last year we started a chapter in North America. We exist to help the public organizations who have already decided that they want to work with open source, and develop open source, to do that in a collaborative way, ensuring that anyone can reuse what they have been doing.

To do that, we have the Standard for Public Code. Here are some old versions; we have some new paper versions here, if you'd like. Just last month we released version 0.8.0. It has a number of different certification criteria in it. I'm not going to go as deep as Amreen did, but this is what we use to certify that a codebase is easy to collaborate on. Our philosophy is that it shouldn't contain any surprises; it should be more or less the best practices of the open source world. So you're probably already doing most of it, and then there are probably also a lot of shortcuts you have made to save some time, things you wish you had the time to do properly. We have collected them all here. And if you comply with the standard, our thesis is that it will be very easy for someone to come and collaborate with you.

It's of course an open standard itself; it's CC0. You can start using it immediately. You don't need our permission to do anything, and you don't need us to come talk to you. Reuse it, adapt it to your needs. And if you find that something is chafing, please contribute back to us so we can continue to improve it with your feedback.
And these are, roughly, the types of requirements that we have. Just as Amreen showed with the DPG standard, we also have a self-assessment test that you can do: just 10 yes-or-no questions to give you an idea how close you are, before digging into it completely, because the entire standard has something like 116 requirements. There's a review template, of course, and a checklist to easily check what you're doing. And we list everyone who is compliant on this website. Today it's a list of zero, but it is a list still. We also include everyone who has said they are aiming for this goal, so everyone who has the ambition gets listed there.

And then just a tiny little thing: we also have a number of governance game decks. It's a little game you can play with your community to figure out how you want to handle your governance. We're giving them out for the small fee of signing up to our newsletter. And with that, I want to introduce our first speaker of the day.
Some updates on Public Code in Germany
Okay. Hi everyone. My name is Marco and I've been an active member of the FLOSS community for about 10 years now, with contributions to Signal and Dino, and also in the wireless mesh community tooling area. Currently I'm working for a German government agency that builds IT infrastructure for Germany, mainly backend infrastructure. We sit in the middle between the 16 federal states of Germany and the federal level, so we have a lot of stakeholders to work with and contribute to. In this job I get a lot of feedback and see a lot of the things that are happening in Germany.

So first, a little motivation for this talk. In Germany I have the feeling that the term "open source" is omnipresent in the public administration and in politics; no one actually speaks about free software, so "open source" is the leading term here. There's also very little information about how FLOSS is used in public administration, little knowledge in public administration about how to handle FLOSS software appropriately, and hardly any contact with the FLOSS community. There are exceptions, of course, but generally speaking there are ways to improve. There are also hardly any statistics on the use of free and open source software in the German government. So my impression, after three years in this domain, is that everyone is talking about at least open source software; maybe they also mean free software, maybe they don't distinguish between the two terms, which is also okay, but in practice hardly anyone is really following these software development practices.
Right now there is a lot happening in Germany, and I thought it might be a good chance to give an update on what happened in the last year or so and what's happening right now, to give you a better feeling for how these developments in Germany might also be relevant for other countries; or if you are from Germany, I hope it's interesting for you as well.

So, the first question: are we FLOSS yet, in Germany especially? I wanted to start with the state of FLOSS laws and regulations. In June 2020, a principle was defined in the Servicestandard, which sets out design principles for government digital web services, that is, interactions between people and the government. This Servicestandard is also mandatory for the largest digitalization program of the last five years; those of you who are from Germany may know it as the Onlinezugangsgesetz, or OZG for short. It's a law in Germany that mandates the government agencies to provide their services online. And the principle says that source code from the realization of digital services must be made available as open source. That's very progressive; we think that's a nice thing. But the problem with it: it's not mandatory. There was a survey in 2022, and only 15 out of 221 people who were asked said they give it a high priority in their own projects. That's only a very few people, and in practice I also see that many people don't even know about it, so it's not very broadly adopted.

Then in 2021 there was another approach: an obligation from the economic stimulus package, also intended to improve government digital services, which says the source code will be made available as open source "whenever possible".
Nobody really knows what this "whenever possible" means, and unfortunately the Federal Ministry of the Interior didn't really keep track of which projects actually released software as open source under any open license. I personally know of only one, and there were a lot of projects in there that got funding, so this really didn't have much impact.

Then in November 2021 we had a new parliament in Germany; we had elections, and the coalition that formed after these elections wrote in their coalition agreement that development contracts of public agencies should generally be commissioned as open source, and the corresponding software that is being developed should always be made public. So this is the same intention again. But there's a "but": after this agreement, the German government spent 4.8 billion euros on proprietary cloud infrastructure, in addition to 1.3 billion euros on Microsoft licenses. Of course you can't just throw Microsoft software away; that doesn't work, that's a longer-term change. But this 4.8 billion euro cloud infrastructure contract didn't exist before in this form; you could have invested that in open source software instead. Also, in general, less than 1% of investments by the current government, in the current legislative period, went into the open source software ecosystem. And the planned financing for ZenDiS, the German OSPO (the Zentrum für Digitale Souveränität), has been cut by nearly half because the money that was needed couldn't be found, so they had only 24 million euros. That's still a lot of money for an OSPO, that's great, but compared to the initial plan it's less than we expected and hoped.
And there are still no FLOSS procurement regulations, which are badly needed to give government agencies a tool to really require procurements to be based on open source licenses.

But we do have some policies in the German federal states. We have 16 federal states in Germany, and two of them, Thuringia and Schleswig-Holstein, have defined a priority for free software in their state laws. It's mainly the same text in both regulations. The first aspect is that a priority for free software should be applied "if technically possible and economical". Again, we don't really know what this means; it's hard to define when it is economical to use open source software compared to proprietary software, since this often comes with long-term impacts. This is a really hard question, and it's easy to find arguments why open source is not even cheaper. Also, for in-house developments, the rule is that an open source license has to be applied and the software needs to be published, as long as it is not used for "security-relevant tasks". Again, I don't know what exactly a security-relevant task is. Maybe people thought about police software and the like, but I think we in this room know that especially in these domains it's super relevant to have open source software, to have the possibility to see inside the code and see what they're doing there: first, to improve the security, and second, to improve oversight of what agencies do in their day-to-day business.

Still, these two federal states have thought about these questions and put some regulations in place. That's great; I really like the effort. And in practice we see there are some very motivated people in those governments, doing everything they can to improve this even further. I think that's a very good first step.
Let's have a short look at the European perspective. I created a graphic based on information from the Joinup platform and from a questionnaire to the German Bundestag, our federal parliament. We can see that currently some countries in Europe, I would say a relevant part of Europe, also in terms of their weight in the European Parliament, have some regulations in place concerning open source software. The Swiss parliament just passed a law, in March 2023, to publish all government software under an open license. There will be another talk about that in the legal and policy issues devroom, so head over there to get more insights about Switzerland.

Okay, but let's have a look at FLOSS in practice. Generally we must summarize that these political objectives, to be honest, are mainly ignored in public administration. The step from legislation to the execution of these laws is hard, and it has not been taken yet. As in industry, we also have the phenomenon of open-washing: presenting some kind of software as being open when in fact it's not. A small example is the Government Site Builder, which is used to build the websites of all the German ministries. On their website they say it's based on open source, and if we dig a bit deeper we can read that "the technological basis is 100% open source". That sounds great, so I wanted to dig deeper and tried to find a download link. I found some, but unfortunately I didn't get to any git repo or the like; instead I was greeted with an HTTP basic auth prompt. The software is based on open source software, that's correct, but it's not released as open source software.
So why is it that the public administration doesn't really respect these political intentions that have been formulated at every level in Germany, from the top federal level in the Bundestag down to the federal states? As far as I see it, the public administration has too little experience with public procurement of free software. It's hard: they don't know how to buy free software, or how to buy support for free software. They also have no experience with releasing their own code as free software. There's little incentive coming from laws and regulations to invest in existing free software, and there's little incentive to release their own code and collaborate with others to improve it, because there's so little knowledge about the benefits of all of that.

In summary, I think the application of these FLOSS software development models still heavily depends on individuals. There are individual cities, and we'll see an example later, where it works really well, but my feeling is that it's still dependent on individual persons who push for this and do the heavy work. In practice it's not really widespread or adopted across all government agencies. We will later see how to fix that, but first let's talk about some wins; there are also great things happening in Germany.
Germany just built an open source collaboration platform called Open CoDE. It consists of a GitLab instance, a Discourse forum and a Wiki.js wiki, and it's also based on the publiccode.yml standard, which is used to annotate the purpose of public software. This encourages public agencies to make things open. Today administrations often don't dare to do this, but with this platform they can see that other government agencies also release their code, and if others do it, it might be okay, and they might also be able to release their own code as free and open source software. I think that's a great thing. It's also somewhat of a safe haven for public administrations to get some first experience, where they don't have to go to external free software repositories like gitlab.com or even GitHub, which they have no experience with. This is inside the government: even if it's public, it's something government-owned, and this might help to convince some people to release their software there. I think it works okay: there are already more public organizations on there than on GitHub, at least among German organizations; to be fair, there are very few German public administration organizations on GitHub. But still, only a few real projects exist on this platform, to be honest. Many of them are stubs, many are just code dumps, or other kinds of documentation, consultation processes, et cetera. I think it's a good start, but there needs to be more code there.

Then there is openDesk, which wants to integrate all these products we know, for example Nextcloud, Collabora, Univention Corporate Server, Open-Xchange, all this kind of software that exists but doesn't really integrate very well. The idea of openDesk is to pay the software vendors to build integrations between these solutions.
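The publiccode.yml standard mentioned above is a metadata file placed in a repository's root that describes the software's purpose so that other administrations can discover and reuse it. A minimal sketch might look like the following; the field names come from the publiccode.yml specification, but the project, repository URL, and all values here are invented for illustration, required fields vary by spec version, and the allowed `categories` values come from a fixed list in the spec, so check the standard before using this.

```yaml
# publiccode.yml — illustrative sketch, not a real project
publiccodeYmlVersion: "0.3"
name: Example Citizen Portal                  # hypothetical project
url: "https://gitlab.opencode.de/example/citizen-portal"  # hypothetical repo
releaseDate: "2024-01-15"
developmentStatus: beta
softwareType: standalone/web
platforms:
  - web
categories:                                   # must come from the spec's fixed list
  - communications
description:
  en:
    shortDescription: >-
      A hypothetical portal that lets residents submit and track
      service requests with their municipality.
legal:
  license: EUPL-1.2                           # SPDX license identifier
maintenance:
  type: internal
localisation:
  localisationReady: true
  availableLanguages:
    - de
    - en
```

Platforms like Open CoDE can index files like this to make public-sector software searchable by purpose rather than only by name.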
There is also an interesting project from Germany called KoliBri. It's completely publicly funded, by an IT service provider of the federal government, and it's basically a component library that uses web components, with a strong focus on accessibility. They also follow a real open source model: they accept contributions. In my opinion they have interesting, great tech, and it doesn't really feel like a public administration project; it's a normal open source project, and that's great. A huge recommendation: if you're looking into a component library, this might be an interesting option for you.

There's also the design system that's meant to be used for all government services, to build a unified, recognizable design. It's not an actual software project; it's a design system that defines which design elements are used on the websites. But they have the philosophy and the community-building parts baked into their DNA: they're trying to get involvement and to build a community. That's also a great thing happening right now.

As already mentioned, some cities are making great progress. The city of Munich built its own open source transparency website, and this is really interesting because they document which FLOSS software they use, which software they contribute to, both in terms of code and in terms of funding, and which software they write and publish themselves. They really understood the benefits of free and open source software and built a website to make it transparent. I think that's a good example for other cities, too. And we have the national documentation portal, a Read the Docs-like project where documentation for developers on the core government infrastructure can be found. It is itself licensed under the European Union Public Licence, and it's also accepting contributions.

So let me close this talk with the question: what does it take for free software to become the default in public authorities? I've brought three challenges. The first one: we need to release custom-built software under free software licenses. I think regulation is very important here; there needs to be regulation in place that requires governments to do this, because otherwise there's little to no motivation to do it in the first place, so regulation helps very much to get all the code released. And of course, knowledge and skills in this area need to be built up in the administration; maybe our OSPO can contribute a lot to this in the coming years, but that's a major challenge in all government agencies. The second challenge is FLOSS software procurement, which is of course a real issue here. And third, measurement is very important, because we need to measure our progress: does it really work, are we making progress in this area? Right now there are hardly any statistics. I think it might be a good idea to make the use of a searchable software catalog mandatory before buying any software, like the Italian government already does. There's the Italian free software catalog, and all Italian government agencies need to look at this catalog. It doesn't prescribe what they do with the results; they just have to document that they have searched there for the kind of software they want to buy, and if there's something in there, that's a good opportunity to look into it and see whether that software is useful for them, before buying any non-free software.

If you want to learn more, we collected some info on best practices about free software in the German government, along with some examples. This kind of follows the idea of an awesome list: just some information about what has worked in the government to improve free and open source software. Maybe this might also be something for other countries and their communities; I really encourage you to build some knowledge about what already exists and to communicate about the efforts that have already been made. Okay, thanks for listening, and if you have any questions you can contact me here, or maybe later outside if we have time. Maybe one or two questions? We don't have time? Okay.
GNU Health. Incorporating Digital Public Goods in the European healthcare system
All right. So first of all, thanks to the organizers for having us here. And I have to say I'm not Luis Falcón; I'm spontaneously replacing him today. Nevertheless, I will introduce both him and myself. Luis is both a computer scientist and a physician, and he founded GNU Health a bit more than 15 years ago. He is specialized in genomics and medical genetics, and apart from being active in social medicine, he is also involved in animal rights. Then, shortly about me: I studied computer science in Hannover, and I have been employed there for a bit more than two years. Mainly I'm working on an Ansible deployment of GNU Health to ease and improve the installation process, but I'm also reporting and fixing bugs and rewriting documentation. Last year we also hosted the annual GNU Health conference in Hannover, together with the Orthanc conference; Sébastien will give the following talk about Orthanc. The institute I'm working at is called Computational Health Informatics, and even though we work inside computer science, our work is always related to medicine. Behind GNU Health there's a non-profit, non-governmental organization called GNU Solidario, which works globally and is focused on social medicine and GNU Health. There's also the Global Exposome Project, which aims to investigate how the environment impacts our health, and how social problems like water pollution, factory farming, or wars also impact this environment and consequently our health. And then again, there are also projects about animal rights in which it is involved. GNU Solidario is spread quite widely around the globe, but when it comes to productive use in hospitals, we hear the most about projects in Latin America and Africa, for example in Argentina or Cameroon. And then there are many research institutions, hospitals and so on; for example, in the top middle there's a university in Argentina that cooperates quite a lot with GNU Health.
Okay, so what is GNU Health actually? In general it is a hospital information system, but the core is a hospital management information system that is often called the HMIS node. It has a client-server architecture, and it takes a quite pragmatic approach compared to other ways of organizing hospital infrastructure. It is based on Tryton, which is an enterprise resource planning tool, so you can take over the user management, inventory, stock, and finance functionality from it; then we add modules for hospital functionality on top. Like Tryton, it is written in Python and uses a PostgreSQL database backend. Even though Tryton could theoretically use other databases, we always use PostgreSQL, first to have a uniform setup, and also because it offers many good features for productive use. There are really many modules that are part of GNU Health, for example for surgery, the laboratory, or genetics and bioinformatics. And as it's used in many precarious regions, GNU Health Embedded is also a subproject, which basically means there are, for example, images for Raspberry Pis, because sometimes it's really a matter of resources what you can use. And as the name says, GNU Health is a GNU package. The HMIS component, as I said, has a client-server architecture, and on the upper left you can see a screenshot of the client. With it you can generate graphs, display images, and use a calendar, and the electronic health record is also part of it. Then there's a reporting engine that comes with Tryton, so all the information you fill into the database fields can be exported as an ODT: there's LibreOffice in the background, and you can generate a document and print it or store it outside the program.
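The layering described here, generic ERP models underneath and health-specific modules on top, can be illustrated with a plain-Python sketch. To be clear, this is not Tryton's or GNU Health's actual API (Tryton declares models through its own ORM classes); all class and field names below are invented for illustration.

```python
# Illustration of the layering idea only: a generic ERP "party" model at the
# bottom, with a health-module model extending it on top. Invented names;
# this is NOT the real Tryton/GNU Health API.

class Party:
    """Generic ERP model: any person or organization known to the system."""
    def __init__(self, name):
        self.name = name

class Patient(Party):
    """Health-module model layered on the ERP core, adding clinical fields."""
    def __init__(self, name, puid):
        super().__init__(name)
        self.puid = puid          # unique patient identifier
        self.evaluations = []     # clinical encounters

    def add_evaluation(self, summary):
        self.evaluations.append(summary)

p = Patient("Ana Pérez", puid="PUID-0001")
p.add_evaluation("Routine check-up, blood pressure normal")
print(p.name, p.puid, len(p.evaluations))
```

The point is only the shape of the design: the health modules reuse everything the ERP core already provides (users, inventory, finances) and add domain models on top.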
Besides, there's an integration with Orthanc, which is a DICOM server, to support medical imaging; there's actually no DICOM viewer integrated in GNU Health itself, and as usual the DICOM format is used. It was chosen not to reimplement a DICOM viewer or redo all the work Orthanc has already done, but to integrate Orthanc: to synchronize patients and studies between the two systems and simply use the DICOM viewers already integrated in Orthanc. Apart from this, there are also other components of the GNU Health ecosystem, for example the Federation and MyGNUHealth. MyGNUHealth is an app that can be used to record vital data and, in the end, also to share it. Last year, at the 40th birthday of GNU, the second version was released, in which all dependencies outside Python were eliminated, because many people don't have Linux on their phones and the requirements we had before were an obstacle. It was migrated to Kivy, so now the idea is to have something cross-platform. The GNU Health Federation aims to connect multiple of those HMIS nodes, and ideally also to give people the opportunity to share the vital data they recorded with the hospitals. To give one example, the colleagues in Argentina used this at the beginning of the COVID pandemic to trace the COVID situation.
And now, to come to the topic of the room: GNU Health was declared a digital public good. This is in the context of the UN Sustainable Development Goals, where many goals should be achieved by 2030, and one of them concerns healthcare, so GNU Health is part of this. It is also listed on the European Commission's Joinup platform, where free and open source software is promoted inside the European Union. Compared to other software projects there are of course bureaucratic barriers and certification processes, and there are many steps to check whether your project counts as medical device software; but actually, at least the hospital information system itself and the electronic medical records are not a medical device. Of course there's other stuff: in Germany, for example, you would certainly need an interface to the insurances, and most of the productive use is elsewhere. From our point of view, proprietary software in public healthcare is a contradiction, and we think there should be a move to free software. There are really many barriers and a lack of funding, especially for free software projects, and there could be many benefits to putting more resources into communities like this, so that everybody can profit from what people are working on. This is why we also signed the Public Money, Public Code campaign. I already saw it in the slides of the previous talk, and I guess most people know it, but basically the name already says it: if public money is spent on a project, then the code should also be available to the public. Easily said, but not yet the reality. I'm finishing with a quote that Luis often says: GNU Health is a social project with a bit of technology behind it, to highlight that it's not only about the software but also about the philosophy behind it. That's it, thanks for your attention.
From disconnected elements to a harmonious ecosystem: The Epiverse-TRACE project
First up, we're going to hear from Hugo Gruson: From Disconnected Elements to a Harmonious Ecosystem, the Epiverse-TRACE project. Hi, my name is Hugo. I'm the lead software architect at data.org, and today I would like to talk to you about the work we are doing to build a harmonious ecosystem for epidemiology as part of the Epiverse-TRACE project. Today's scientific research relies more and more on data science and computational tools, and this is true across fields such as epidemiology, climate science, and econometrics. But the pipelines used by these data scientists are also getting increasingly complicated to maintain and update. To change just a single step in a pipeline, just to use a different piece of software, you may have to spend hours of data wrangling just to get the right format for the inputs and the outputs. And the problem is that such complicated maintenance is something we cannot afford when we are in the middle of a crisis; the price is just too high to pay. When the next pandemic hits and we want fast results to understand what's happening, it's not the time to do basic, boring data wrangling; we want to do actual science instead. Said differently, we have good isolated free software tools, but we don't need just good isolated pieces of software; we need a robust ecosystem as a whole. And this is precisely what the Epiverse-TRACE project is about. It's an international, multi-stakeholder project to harmonize the ecosystem of epidemiology tooling in R. We do this by making the existing pieces interoperable, by supporting existing tools in adopting global standards such as the ones defined by the Digital Public Goods Alliance or organizations like rOpenSci, and by developing a sustainable community around these ideals. I can also define our goals by what we don't want to achieve: we don't want to erase the existing established communities.
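The "hours of data wrangling" problem is, at bottom, a schema mismatch between tools. A hypothetical sketch of the kind of adapter code a shared standard makes unnecessary (all field names are invented; the real Epiverse-TRACE tooling is a set of R packages, not this Python):

```python
# Two tools expecting different column names for the same epidemiological
# line list. This adapter is exactly the glue code a shared data standard
# would eliminate. Every field name here is invented for illustration.

TOOL_A_TO_STANDARD = {
    "onset_dt": "date_of_onset",
    "sex_mf": "sex",
    "age_yrs": "age",
}

def to_standard(record, mapping=TOOL_A_TO_STANDARD):
    """Rename a record's keys to the agreed standard schema."""
    return {mapping.get(k, k): v for k, v in record.items()}

raw = {"onset_dt": "2024-01-15", "sex_mf": "F", "age_yrs": 34}
print(to_standard(raw))
```

Multiply this by every pair of tools in a pipeline and the appeal of one shared schema, agreed across the ecosystem, is clear: each tool writes one adapter to the standard instead of one per neighbor.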
We recognize that diversity of solutions is good; it's nice to have a rich ecosystem, but we need interoperability within it. The way we do this is by involving the community. We work with existing established communities, and by this I mean both established communities of users, such as public health institutes and NGOs, and existing communities of developers. In the end, what we want is to come up with a solution that increases usability, sustainability, and maintainability for everyone involved. We've already had quite a lot of success with this approach. We've managed to package and release a lot of unmaintained, non-portable code bases, including many more tools than the ones presented here; but for the sake of this session I should mention that two of them are already registered DPGs and one is in the process of being submitted. Having a sustainable network of collaborators is really exciting and really ambitious, but as you can guess it also comes with challenges. In particular, research and academia are really competitive spaces, which makes it difficult to build collaboration between some communities. Additionally, because we have a multi-stakeholder community, communication is really difficult in a network with so many collaborators and so many nodes, which creates delays and miscommunication. And the question is how to build something sustainable even though funding in this space is precarious. To conclude, I hope I managed to convince you that responding to crises, be it the climate crisis or the next pandemic, will require interoperable tools, and that this can only be done through collaboration and multi-stakeholder projects.
But even though it's necessary to have this kind of complex community, it also brings a lot of extra challenges, especially around communication, collaboration, and sustainability. In the end, what may initially appear as a technical challenge is even more of a communication and social challenge. With this, I will finish with a picture of the entire core team of the project, and invite you to come talk to me if you're interested in any of this. Thank you.
Legislation Editing Open Software (LEOS) - an innovative open-source solution for drafting legislation
Yes, we just go right on. Okay. Thank you. Good afternoon everyone. I'm Fernando Nubla, a project officer at the European Commission, and I'm going to tell you a story: the story of LEOS. Once upon a time, we were in Legisland, and you can imagine Legisland is not that much fun: it's all about legislation, and you know how complex legislation is. In this kingdom there was Lawgislate, who was complicating the life of everyone living there. Lawgislate enforced very complex rules on everyone who wanted to create a new piece of legislation: rules about the structure of the documents, formatting, et cetera. He was not taking care of the versions; we had versions everywhere, on local computers, in shared folders, everywhere you can imagine, even on paper. So it was getting very complicated for the people working with the legislation. And there were a lot of people, and no one was collaborating, because Lawgislate was not helping them with that. So we went to the round table and tried to find a solution; we needed to help these people. We started by creating a work plan, an idea of what we needed, and we defined more or less the solution we wanted. But there was something very important: the financing. We couldn't do this without a budget. So we used two programmes of the European Union: the ISA programme, when we started in 2012, and now the Digital Europe programme, since 2020. With the financing, the work plan, and the idea of the project, we created Mr. LEOS, whom you have here and also on our shirts. We wanted Mr. LEOS to be an open tool, a web application. We wanted to be able to draft texts, with a rich editor where you can put images, formulas, and track changes, and with collaborative tools to create comments and suggestions and work with other people, everything centralized. We take care of all the versions.
So now they are not spread everywhere; they are all in a central place, and you can go and check them. And something very important: we wanted to use open standards. We didn't want to keep drafting legislation in an unstructured format that you can't reuse; we wanted to create something structured. We are using the Akoma Ntoso standard, which is open for everyone, for any administration or government that wants to use it. And the last aspect is that we wanted to do it open source. That was something new, the European Commission doing open source projects, but we did it. And we brought the community with us; we are not alone. We wanted to do this with member states, with other countries, with academia, et cetera; whoever wants to help us is welcome. And then, I know you were waiting for this, there was a battle, of course, between the LEOS project and Lawgislate. But no one was harmed: our idea was to convince him that there was a better way to do things. So finally we ended up with Lawgislate on our team. We are going around Europe, helping member states and other institutions to use this tool, which is open source and available for everyone. And together we are heading towards the future. The future is leading us to artificial intelligence and machine learning; imagine drafting legislation with just one click and getting the proper text. And this is the tool. Everything I said is true; you can check it out. We have all the features I mentioned before: the structure, the versions in one place, the rich editor with track changes. On the right you have the collaboration features, to create comments and suggestions and reuse texts. We are using open projects like CKEditor, Hypothesis, and EUI, so we are creating our software in the open and using open source projects. You can check us out via this QR code, and you can also scan our t-shirts. We are at code.europa.eu under LEOS; all the software is available there.
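Akoma Ntoso, the standard mentioned here, is an OASIS XML vocabulary (LegalDocML) for structuring legal documents. A minimal fragment of the kind of markup it defines might look like this; the element names and namespace come from the real standard, but this is a heavily simplified illustration, and an actual LEOS document carries far more metadata:

```xml
<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act>
    <body>
      <article eId="art_1">
        <num>Article 1</num>
        <heading>Subject matter</heading>
        <paragraph eId="art_1__para_1">
          <content>
            <p>This Regulation lays down rules on...</p>
          </content>
        </paragraph>
      </article>
    </body>
  </act>
</akomaNtoso>
```

Because every article, paragraph, and heading is an addressable element rather than word-processor formatting, tools can diff, reference, and recombine provisions instead of treating the law as an opaque document.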
You can check it out and contribute; we are waiting for you there. Thank you. And this is the end.
TruBudget - a DPG to support the project workflow in international multi-stakeholder environments
So, hey everyone, my name is Zuri. I work for KfW in the field of development cooperation, and in this case development cooperation is not about GitHub, but about third world countries and donor organizations like us working together to make the world a better place. Together, over the past several years, we developed this digital public good, we registered it as such, and that's what I would like to show you, and also to invite you to collaborate. There are three slides: I would like to explain the problem, our proposition, and where we are at the moment. So, one example: if you look at development cooperation and how it works today, you see one country here, maybe someone recognizes it, it's Ethiopia, and we started to count how many government organizations and NGOs are actually working in this country and supporting it. At some point we stopped counting, because we didn't have any room to put more logos on the slide. The problem here is that we don't trust the data and information exchange of the partner countries, and they basically don't trust us. So in the end, many NGOs, and we as well, end up doing the projects ourselves instead of just giving the money to the country so that the ministry there can actually build schools, hospitals, and so on. That's the real problem, and already in the Paris Declaration of 2005 we decided that we should do all this on the systems of our partner countries. That never happened, because of the lack of trust. So what's our proposition to solve this? We called it TruBudget. We figured out that the solution is not to install some kind of SharePoint or Google Workspace, yet another intermediary, because whoever owns the data is more powerful than the rest; data is the new oil, basically. So we don't want to own the data, and if the partners own the data, we potentially don't trust them either.
So the idea was to build a truly decentralized solution and manage the data there. What I mean by data is the workflows: who is doing what, and how are the projects implemented? For example, when we build schools: how are the tenders done, how was the money disbursed, who did what. So what is actually stored as decentralized data is the status of all the different workflows of all the different participants in this network of people and organizations. Technically, it's a front end based on Material UI and React, so the JavaScript stack; on the API side, a Node.js server; and on the data side, a very small blockchain solution. It feels like a key-value data store, but it has a very nice property: it synchronizes across different nodes. There's a kind of consensus mechanism, so the data is synchronized across the different parties. And that is very important from a political point of view as well: you process this data at eye level. There's not one party that owns the data while the others don't, so if any of the participants here (and this is only a very simplified view) dropped out of the picture, it would still work, right? So, where are we at the moment? We've been doing this for several years, as I said, and we are registered as a digital public good. We have a couple of pilots, for example with the Brazilian Amazon Fund, a very important one, where Germany essentially paid money if less Amazon forest was destroyed. We had the Vaccine Alliance, also very important. Burkina Faso is, I think, one of the oldest projects we did, or one of the first we started with: with the Ministry of Water, we try to manage this data and get to a situation where we can actually give them the money, and they use it to do good things, instead of us developing the projects ourselves, which we believe is not the most sustainable approach.
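The key property described here, every party holding its own copy of the workflow data with all copies converging, can be pictured with a toy last-writer-wins replication sketch. This is only a mental model in Python; the real TruBudget stack is the React/Node.js/blockchain setup described above, and its actual consensus mechanism is more involved than a timestamp comparison.

```python
# Toy model of the "synchronized key-value store" idea: each organization
# keeps a full copy of workflow statuses, and syncing merges entries by
# timestamp (last writer wins). Invented simplification, not TruBudget code.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}  # workflow_id -> (timestamp, status)

    def update(self, workflow_id, timestamp, status):
        self.store[workflow_id] = (timestamp, status)

    def sync_from(self, other):
        """Adopt any entry the peer holds with a newer timestamp."""
        for wid, (ts, status) in other.store.items():
            if wid not in self.store or self.store[wid][0] < ts:
                self.store[wid] = (ts, status)

donor = Node("donor-bank")
ministry = Node("ministry-of-water")

ministry.update("school-tender", 1, "tender published")
donor.update("school-tender", 2, "tender awarded")

# Bidirectional sync: both parties converge on the newest status,
# and neither node is the single owner of the data.
ministry.sync_from(donor)
donor.sync_from(ministry)
print(ministry.store["school-tender"])  # (2, 'tender awarded')
```

Note how the symmetry is the political point: if either node disappeared, the other would still hold the complete, current state.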
Yeah, of course, as I said, you are invited to contribute to this project. It has been running for several years already and has a number of contributors. It's also the first open source project we did at KfW, which is a German state-owned bank; remember the talk we had before about how tricky it is for state-owned organizations to do open source. We achieved it here, and I'm quite happy to be part of this project, and I would also be happy if you joined us. Thanks a lot.
Moodle: Empowering educators to improve our world
Hello everyone, my name is Noel and I'm here to talk about Moodle. Moodle is a learning management system that you can use for online learning and teaching, and our mission is to empower educators to improve the world. We want to do this in an accessible way that can be used by everyone and customized for every use case. We do this through open source. Moodle actually started more than 20 years ago, and the first commit was in the same year as the first edition of FOSDEM. Preparing this talk, I looked at the archives, and this is actually the first talk about Moodle: it had been mentioned here and there, but this is the first talk specifically about Moodle, so it may be the first time some of you hear about it; I hope you find it useful. We are a certified B Corporation, and Moodle is a registered digital public good. In case you don't know who is using Moodle: at the moment, more than 400 million users, translated into more than 160 languages, with the translations mostly contributed by the community. You can find these stats at stats.moodle.org, but I have to mention that since Moodle is open source and can be self-hosted, this information only covers the sites we know about, so in reality there are probably more people using Moodle than this. In this slide there are maybe some logos you recognize: we know that Moodle is used by more than 60% of higher education institutions, by many education ministries, and by many governments and NGOs, so Moodle is used all around. And who is making Moodle? Well, an important part of the contributions comes from the open source community and other companies, but mostly it is done by Moodle HQ, which is the company I work for.
We are currently more than 200 team members distributed across more than 20 countries, speaking more than 26 languages. And I didn't want to leave without mentioning the tech stack: Moodle is made with vanilla PHP and vanilla JavaScript with an SQL database, and the mobile application, which is the team I actually work on, is made with Ionic using Angular; in case you want to learn more, you can look at the repositories and the code. I also mentioned that it's very customizable for different use cases: you can build plugins for Moodle, and if there is something it doesn't do already, there is likely a plugin for it; and if there isn't, you can make a plugin yourself. You can read the developer documentation to see how to build plugins, both for the LMS and for the Moodle app. Finally, even though the Moodle LMS is at the core of everything we do, there are also many other things. For example, I already mentioned the Moodle app, which is interesting for low-resource environments because you can use it offline: you can download the contents, complete the exercises and everything, and it synchronizes when you go back online. You can self-host Moodle, but if you want to get started quickly, we have a software-as-a-service solution, which is MoodleCloud. We also have MoodleNet and Moodle Academy to share and find learning resources, and if you want to integrate Moodle in your organization, we have Moodle Workplace and Moodle Certified Partners and service providers. So there is a lot more you can dig into if you want to learn more. That's it: you can learn more at moodle.com, and if you need to contact me, my mail is noel@moodle.com. Thank you.
What can digital open source projects do to reduce our environmental footprint
So, how many people here are worried about, say, climate change? Yeah, a little bit of concern. There are definitely issues: crazy weather, blah, blah, blah. What does that have to do with open source and my talk? Well, let me tell you. We live in a finite world, and as much as we want to believe that the cloud is green, it's not. Everything digital is tied to an atom. As I say here: even if you don't measure it, it still matters. It matters because the environmental footprint of our digital lives is significant: it's about the same size as that of the airline industry, and it's growing exponentially. So when we're thinking about our digital infrastructure, every bit is tied to an atom, whether that's electricity or hardware; it all comes from somewhere, and it all has an impact. Think about the lifecycle of our products: there's not just the use of the devices we have around, there's also the creation and disposal of them. There are a lot of interrelated systems, and they have a huge impact. So when you're developing code, or doing open hardware, think about the ecosystem you're working in. It's so interrelated: whether you're a JavaScript person working on the web, a PHP person building content management systems, or a Python person creating data processing tools, all of these communities rely on networks of other pieces of code and on a lot of people to maintain and organize them. So think about that ecosystem of people and code that our projects are built on. So much of this is about treating sustainability as a measure of quality: how do we make sure that good code is both accessible (because I have to say that I'm an accessibility person) and sustainable, so that we're minimizing the impact we have on the planet?
And that has to be baked into the definition of quality: we think about it early in the process, we don't wait until the very end to evaluate it, we build it into our CI/CD pipeline so that we're catching errors and minimizing the impact of our code on a sprint-by-sprint basis. And having a livable planet is not a feature request; it's the bare minimum we need for our society, and we need to start working together around that. There's so much to learn and so much happening in this space right now. Twenty years ago, this was not something people generally thought about; people said, well, just don't print out your web pages and your emails and you'll be fine. No, that's not good enough. The information is changing very quickly, and there's a lot to learn in this space. I think it's really important to learn about it, but also to find ways to contribute. Where can you give back? What experiences have you had? How can you get involved in measuring your project's impact and moving ahead on that? How do we level up our expectations and encourage more people to discuss and learn about this? It's really important to have these talks: there's a whole section of talks here at FOSDEM on energy, as there was last year, and that's wonderful. If you're going to the State of Open conference: last year they had a whole sustainability track as well. Making sure that there's some conversation about sustainability as part of your project is really important. Getting people engaged and doing something about sustainability is a good way to stay optimistic, to keep the attitude that we can make a difference, that we can make a change. This is something that is doable. So, getting people involved. And this is a huge problem.
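One concrete way to bake this into a CI/CD pipeline, as described above, is a page-weight budget check that fails the build when assets grow too large. Here is a minimal sketch of the idea; the budget numbers and asset classes are arbitrary examples, not a recommendation from any standard.

```python
# Minimal page-weight budget check of the kind a CI pipeline could run:
# flag any asset class whose measured size exceeds its byte budget.
# Budgets and measured sizes below are arbitrary example numbers.

BUDGETS = {"js": 150_000, "css": 50_000, "images": 300_000}

def check_budget(sizes, budgets=BUDGETS):
    """Return a list of (asset_class, size, budget) violations."""
    return [(kind, sizes[kind], budgets[kind])
            for kind in budgets
            if sizes.get(kind, 0) > budgets[kind]]

measured = {"js": 180_000, "css": 40_000, "images": 250_000}
violations = check_budget(measured)
for kind, size, budget in violations:
    print(f"{kind}: {size} bytes exceeds budget of {budget}")
# A real pipeline would exit non-zero when violations is non-empty,
# failing the build before the regression ships.
```

The value is the feedback loop: size regressions surface in review, sprint by sprint, instead of accumulating silently until someone audits the site.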
But everything that we do will, in the end, be insignificant as an individual contribution; yet, as Gandhi said, it is important that we do it. We need to find ways to contribute and to play our small part in moving things ahead. There are lots of best practices and standards out there. There are the Web Sustainability Guidelines, which were just launched as a draft in September; evaluation and development is still ongoing there, in the Sustyweb group, the W3C Sustainable Web Design community group. There's also the Green Software Foundation, which has done some really good work and is building infrastructure around this, and the Green Web Foundation is another one with infrastructure and information. The IETF and the IEEE, I think, also have sustainability projects. So no matter how you're involved in the tech world, there are lots of best practices in sustainability that you can work with and extend. And that's all I have. Yeah, any thoughts? Okay, any questions? Does anyone here have a sustainability question? Go ahead. Any practical tips, like looking at how much processing there is? Sorry: any practical steps. One practical step is to look for where there is compute time and data transfer: try to minimize those, and make sure you're counting the milliseconds used for processing. What are the process-heavy things your code is doing? Yes?
Open Terms Archive
So I'm here to talk to you about a digital common that I'm quite passionate about: Open Terms Archive, a digital public good incubated within the French Ministry for Foreign Affairs. It has existed for three years, and the code base is under the European Union Public Licence. Its goal is to enable democratic control of digital platforms: to shift the power balance from the large digital services towards end users and regulators. How do we do this? By tracking the evolution of the rules, the terms of service, which are the way in which the services decide what you can and cannot do; oftentimes, in practice, they take precedence over regulation. And we provide tools to archive those rules and enable analyzing and influencing them. We are basically creating technical tools that will let us unite different actors who are powerful enough to push these large companies to be more loyal, more respectful of end users. Right now this is a bit abstract, but it's going to get clearer. So, who do we try to unite? We looked at who can really change the behavior of those large actors that are not respecting your privacy and your rights online. Regulators: if I'm a privacy regulation authority and I can impose a million-euro fine on some company, I'm going to change its behavior far more than if I'm just yelling at it that it's not respecting me. Legislation: it can take a long time to be enacted, about 10 years for the GDPR, I think, but once it's in place it does actually change things. Why? Because someone in the big companies actually fears going to jail, and that's quite impactful as well. It takes a long time, but it's useful. Press and media: there are tons of articles saying that some companies are bad and evil, but sometimes there's enough coverage that it actually threatens their business model or their user base. It happens sometimes.
Think, for example, of the WhatsApp case, where they had to back down because so many people were afraid of the change, didn't like the change. And finally, consumer protection associations: some of you might have heard of the Schrems cases, for example, in the EU, which prevented the bulk transfer of data of EU citizens, or any consumer protection association that can bring a lawsuit against the companies. So we have demonstrated this impact model by providing tools that have been taken over by the European Commission for ensuring that some regulations were actually impactful on the behavior of companies. We have been in contact with Congresspeople in the US, and so on and so forth. So we have these examples, and I have to be fast now. This is a DPGA public goods session, so I want to highlight that we've been selected for the Nobel Prize Summit thanks to the DPGA. So we're very thankful to the Digital Public Goods Alliance. And if some of you are considering applying to be a DPG, I can talk more about it, but yeah, it's cool. Okay, that was not geeky enough for FOSDEM, I know. So now, how does this work and how can you participate? First of all, we just need to define which documents you want to track. That's quite simple. You just use a URL and then a set of selectors, usually just CSS selectors, a bit of JS if it's necessary, but that's like 5% of the cases. So here you define your target documents that we are going to track. Then we have made the tracking part quite simple. You could think that you could put together a script in half a day to just download stuff. That's true, but there's a difference between this. Time's up? I was told four minutes. Sorry. Okay. So, yeah, oh, good. So yeah, we do this with software. We store every version and then we make it readable. So basically we will extract just the legal text, instead of having all the menu navigation stuff and all that.
And we will provide a diff for all these people, regulators, legislators, all these people who don't really get all this stuff, but now they have a way to be notified when there's a change. And then we can have humans who will write down a summary of what has changed and circulate it around. And then instead of having one person there, one press person here, one regulator yelling, we have all of them together, and thus we believe we can actually influence. We also provide data sets, and this is a decentralized system. We would love to have more instances. If you want to create a new one in your country or jurisdiction, please do. It's going to be amazing. Please reach us at contact@opentermsarchive.org.
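The workflow described in this talk, fetch a document, extract just the legal text with a selector, store every version, and produce a readable diff, can be sketched in a few lines of Python. This is only an illustration of the idea, not Open Terms Archive's actual code; the "selector" here is a trivial tag-based extraction rather than a real CSS engine:

```python
import difflib
import re

def extract_text(html: str, tag: str) -> str:
    """Very crude 'selector': keep only the text inside the given tag."""
    match = re.search(rf"<{tag}[^>]*>(.*?)</{tag}>", html, re.S)
    return re.sub(r"<[^>]+>", "", match.group(1)).strip() if match else ""

def readable_diff(old: str, new: str) -> str:
    """Unified diff of two stored versions, for humans to review."""
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))

# Two snapshots of the same (made-up) terms page.
v1 = "<html><nav>Menu</nav><main>You may not resell the service.</main></html>"
v2 = "<html><nav>Menu</nav><main>You may resell the service with consent.</main></html>"

print(readable_diff(extract_text(v1, "main"), extract_text(v2, "main")))
```

The point of the real project is everything around this sketch: reliable fetching at scale, versioned storage, and making the diffs legible to non-technical regulators and journalists.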
Developing in Public : Open Source Tech Education
Hi everyone, my name is Germán Bencci, I come from Venezuela and I'm the founder of Code Your Future. Code Your Future is a non-profit organization that trains refugees, asylum seekers, forced migrants and people from low-income backgrounds in programming skills and helps them start a career in tech. Today I'm going to tell you a little bit of our story and what we're doing with open source, creating one of the most inclusive platforms to contribute to curriculum development. But before that I want to show you a video. You know, we're a small group, so we're going to try to make this session interactive. So we're going to talk. Are you ready for that? Yeah? So I'm going to start with a video that we normally share with our CYFers when they join to start the training program. It's a video that we created with the community, and the voices are from our students. We don't know if the sound is going to work, but let's give it a go. Code Your Future. In this video we will learn how Code Your Future works, how it is different, and how you might want to change your expectations. Code Your Future is a majority-minority community of adults, and together we are powerful. Everyone has some kind of struggle in our community, and everyone also has help they can give to others. Everyone is wanted and needed. Together we can solve many problems. We are not a school of students and teachers. We are professionals collaborating to achieve realistic practical goals. CYF is not an accredited institution. You can't use our certificate to get a job without actually learning coding. All you can do is learn to build great software, and you can show this to employers by actually building that great software. At CYF we don't tell you what you can do. We're not here to judge your potential. We only look at what you have done. We make decisions based on evidence. You have to build things and you have to show us. Nobody will chase you to do your coursework or tell you off for not coming to class.
If you don't do your work, your reward will be your own failure. So you're free to make bad decisions. You are an adult in charge of it all. And so is everyone else. Everyone at CYF is a volunteer, and we are all choosing to work together. Everything has been created by a volunteer. We saw a problem and tried to solve it. There's nobody coming to save us. We are here and we are saving ourselves. CYF grads come from many different backgrounds and go to many different jobs. We are here to support you to get a job you want and a life you want. We know that with hard work, honesty, kindness and challenge you can get there. Don't sit around waiting to be taught things or complaining that you haven't been taught enough. Build your own projects. Find ideas that motivate you. Form groups and work together. Don't wait around. Try things. At Code Your Future, the only failure is the failure to try. You have this year to change your life, with a huge hope to move forwards. Seize the opportunity. It's your life and you can change it at Code Your Future. Okay, that's it. Thank you very much. This is an intro to our organization. I get a little bit emotional when I see it, because it really puts together a lot of who we are, what we do and how we do it, what we have been doing over the last seven years. I'm going to start by telling you a little bit of my background, if you're interested in knowing. Does this sound interesting? Do you want to keep hearing about it? Thank you. It's good to know. No, no. Okay, so as I said, I'm from Venezuela. The back story: sometimes when people hear about Code Your Future and they come to me, you know, you founded Code Your Future, tell me what is the story, how did it come about, how did you do it, what was your inspiration? I think I can summarize everything in these two areas: migration and unfairness. I spent pretty much more than half of my life abroad, traveling in different places.
And I started in the tech industry, had a great, you know, a good job, and had a wonderful experience learning and growing, but I was always a little bit troubled by the unfairness, the inequality, the inequity in this world. How, why do some people have access to so many opportunities and others not, just because they happened to be born in a different place, in the wrong country, under the wrong autocrats? And that was something that just kept nagging me for a very long time. And at some point it was just unbearable to kind of just keep doing the same thing. And above all, when I was a child, you know, my country was a pretty stable, relatively safe place. This is my home country; it's in the mountains in Latin America, in the Andes, and it wasn't the richest place, but it was okay. I had a great childhood, I was playing in the streets with other children, and I never thought that the country was going to change so dramatically. No one, when I was a child, ever imagined that the country was going to become this. Mass migration: Venezuela has had one of the biggest emigrations of any country. Millions of people have left the country over the last 10, 15 years. It's completely different to my childhood. These are people that a few years ago literally started walking, leaving the country, because that was the only chance. So this is the concept of a forced migrant. You don't want to leave, but you feel you have to, if you want to look for safety, if you look for other opportunities, because things are just so bad there. And then later on, over the years, I started realizing that this story actually overlaps with many other stories of what was happening. I'm sure all of you have heard about the war in Syria from a few years ago, and the migratory crisis that emerged from there. Well, people in Syria will tell you that when they were children, they also never imagined that the country was going to end up like this.
That they were going to start migrating and walking across borders, when they had actually also been hosting refugees from other places. It was a completely different story. And it's this change of reality that made me think a lot about the world. And when I decided to quit my safe job in tech and start looking at founding something else, well, at the beginning I was living in an overlap, but then I entered this other world. The world of refugees and asylum seekers, the world of long-term unemployment, the world of unsafety, where you don't know if people have homes, or if they're going to keep the same place for a while. It felt like it was literally a parallel universe: we live in a certain reality, and the others are living in another one, and we rarely overlap. We rarely know what the reality is in that other place. We hear about it, we see it in the news, we see some statistics here and there, but do we really know the many people that live in those circumstances? And all of that is in contrast with this wonderful world that we live in, in tech, with so many advances, so many developments, such sophisticated technologies like this one. Can anyone recognize where this comes from? What device took this image? Sorry? Yes. The Webb space telescope, you know, state of the art. This is capturing a constellation from one of the oldest pieces of light that exists in the universe. It's more than 13 billion years old. This is a time machine. It's looking back to the origins of the universe, only between 200 and 400 million years after the Big Bang. And we have this technology, and we can enjoy this beauty, these wonderful images that show you the immensity that is out there. And we have these beautiful things happening, and I want us to just see it for a little bit. Isn't that beautiful? They call this the nursery of stars, because it shows the formation of stars over hundreds of millions of years. But all this beauty is contrasted with another reality.
And I want someone that is brave enough, or two people, to start explaining what these are. What are these numbers? Any guesses? You can try this. No, it's okay. We can just try. A person out of 78 has to migrate from their own country. That's correct. One in 78 people in the world is a forced migrant. And that ratio keeps getting worse and worse. One in 78: forced migrant. 10% of the world detains, no, sorry, owns 76% of the wealth. I'm not sure if I got it right. Yeah, that's correct. 10% of the population in the world owns 76% of the wealth. And I checked this a million times, and I just couldn't believe it. What else? Any other one? Sorry? Is it too low? Do you want it to be? Well, it's bad enough. What else? Yeah, the 10% produce nearly half of all the CO2 emissions. And then on the flip side: how are 50% keeping 2% of the income? Thank you. And are responsible for 12% of the CO2 emissions. Yeah, exactly. So this to me was the best representation of inequality in the world. When we're talking about wealth and pollution, these are the two extremes. You also have, like, you know, very huge numbers when we're talking just about the 1%. It's a big, big challenge. Do you think you can change something with your project? I mean, it's a bit of a dream, in my opinion. Okay, tell me about it. A lot of companies already talk about inclusion, and they have websites and such, and still there's a lot of discrimination. But I'm not sure you're going to change that. How can you have impact on that? Thank you very much for the prompt. That's very good. We're going to talk about this. Yeah, we're going to talk exactly about that. So I want to know from the audience if you would be interested in helping change people's lives. Raise your hand if you are. Great, thank you. That's great. And you don't have to say, I don't want to. For the vast majority of people, it's like, you know, it's okay, my life is complicated enough, and that's okay.
Come in, welcome. No, no, don't worry. Come, welcome, welcome, welcome. Welcome. The question is how, right? You were saying, is it realistic? Is it possible, right? Come in, come in. How? Come in, please. Do you want to sit? Don't worry, don't worry about it. How? So let's explore that idea, that concern, a little bit. How? How could that be? What could we do? Any ideas? We want to change a little bit. We want to do something, a small, tiny contribution. What can we do? Someone that hasn't said anything? No? Small steps: just teach one, and one teaches another, and another teaches another. Like small things. Training. Yeah. Thank you. One person training another one. Information. If the media can talk about all of these problems, it would be a huge step. Okay. Awareness. We want to keep talking about these things. Uh-huh. Maybe as a response to your comment, I don't know if it is, well, probably: at least considering how the world has developed, from my point of view, I do not believe that there will be one change, one technology, I don't know what, fusion, for instance, which solves all the energy crisis problems, and which will also ensure that there won't be poor people anymore, because we are living in a highly unequal world. But maybe it helps, at least, like you said, with small steps in your own environment where you can make an impact. So what you're saying, if I can summarize this: there's no one solution that fits all. There's no one magic wand that is going to solve all of this. Absolutely, right? And you know, you're good to say, hey, is this a thing? Is this possible? Now I would like us to spend one or two more minutes doing something that is extremely unusual in these conferences, which is: let's talk to each other. So look at your neighbors and just talk for a minute or two. What other ideas would you have to actually bring something tangible?
Are you, do you, do you feel comfortable? Is that okay? Let's explore that. Let's talk, you know, we're a small audience. Let's talk and say, how else can we do these things? What can we do? Okay. Do you guys want to, do you want to share? Well, just ideas, just to spark ideas. No? Just think. We're like brainstorming. OK. OK. Brilliant. Thank you very much. Thank you very much. Thank you so much for being so kind and speaking to each other and sharing. We have 10 minutes left, so let's carry on. So as you were saying, it's a complex topic, but we've been involved for seven years working in this area. So we have maybe a little contribution to make here. One thing that I just want to highlight: the road to hell is paved with good intentions. This is something that I learned, and I saw. There are a lot of people that want to do good, but it's really hard to actually do it, not only just to action it, but actually to have the impact that we're trying to have. A lot of the time, good intentions can lead to really bad consequences. And that's something that is very important to analyze in every single aspect. In a decent-case scenario, you simply don't manage to help. But in a really bad scenario, you actually cause more harm. One of the frameworks that for me has been really good for understanding this is Maslow's hierarchy of needs: to understand, okay, when we're trying to make an impact in a person's life, we kind of look at what it is that we're trying to do. And on the first level, we have the physiological needs, when people need food, air, water. For example, when there are big emergencies like earthquakes, and these horrible wars that are happening, we know that, okay, we need to help people in these areas. It's very clear.
It's very, you know, people really understand that intuitively. But then when we try to do something else, it becomes a little bit more complex, yeah, more difficult. It's like, okay, people need more than that. It's not enough to just have the food and the air and the water; we need more. And one of those areas is that we need to feel secure. We need to feel, you know, that, for example, when we have a job and we have employment, we feel a little bit safer. And also when we start connecting with others, when we have a sense of belonging, when we also believe in ourselves and we have self-esteem, and finally, when we actually keep learning and keep growing, we actualize ourselves. All of these are important, and all of these are interventions that we can do. When we think of a project, we think about more or less which area we're trying to make a difference in. And what we decided in Code Your Future was that our goal was to launch careers, to help people start a career in tech. And we're not going to help just anyone. We're going to help those that need it the most, those that you saw in those statistics, people that are forced migrants, that suddenly arrived in a new country and had no connections and no jobs and no way to know how things were working. And then we saw that tech was a great place to be, because it's pretty democratizing. If you know the skills, if you have the skills, then you can actually get a job. And that's what we started working on. But then later we realized, well, it's actually not just about this. Because actually, yes, you say, I'm going to help someone get a job. But in order to get a job, people have to be skilled. And in order to be skilled, people need to also feel safe and a sense of connection.
So our community is professionals, all volunteers, and they're all coming to share, to give their skills, all people like yourselves, normal people. They come, and they created that sense, that community and that growth. And then it goes back to the question: does this work? Does it actually make a difference? And we did measure it, because we said, we're going to do a training. Let's organize a training, that's all very nice, very exciting. We started doing web development, the full-stack JavaScript. But then that's not enough. That is something interesting, but there's much more than this. The real part that mattered was getting people jobs. And we were lucky to see, from the very first class, people getting jobs. Like Centaur, a refugee from Ethiopia, who was long-term unemployed for many, many years, went through the program, and now he's working at the Financial Times. Or Ahmed, a refugee from Syria; his dream was to go to university, but the war interrupted all of that. Come, please, come in, come in. The war interrupted it, and he was completely lost. He never went to university, but he always wanted to be a developer. He joined Code Your Future, and a year later, he had his first job as a developer. And finally, Ansi, who is from India, had given birth just four months before the program started, had a tiny, tiny baby with her. She had not been in employment for five years, was completely alone, felt disconnected, very shy, with very low self-esteem, and went through the program. Please join us. And now, five years later, she's a developer. Now, is this going to change the huge numbers that we saw there? No, of course not. But it's basically one person at a time, because every single person matters. Every single person makes a difference, and connecting and seeing that journey of people getting into employment, it's a dream. These are some of the numbers that we have.
The vast majority of people we work with are living below the poverty line; the vast majority are ethnic minorities. We have a huge range of ages. We have a great gender split, and above all, people are getting jobs. This is our ultimate measure of success. And over the years, we have been diversifying. We have an open source curriculum development that Daniel, who is here, has helped organize for many years. We've been diversifying, getting into new areas of development. We started like here, but then we've just been growing. One of the latest ones was an SRE program that we created together with Slack, to take people all the way from no programming at all to getting a systems engineering job at Slack. We are doing this. And because we wanted to talk a little bit about our syllabus development: in the very beginning, in the early years, we were very excited. We started developing our own curriculum and creating content. And we had this really long, long list of content. It was very good. We were very excited. But over the years, it became really hard to change, really hard to adapt. At the beginning, you know, we had these little blocks and they started growing. So at the beginning it was like, okay, well, we have a block and we put another block. And if we wanted to do a change, it was easier, because there wasn't much content. We had to change one and then we put in another one. But over time, we had bigger, higher towers of content and information. It was really, really hard to change. And then at some point it was like this. We had this big thing, and, as Daniel would say, I think we need to change this part because it's not working. It's like, oh, well, but if we have to change this, we need to change so, so many things. So it became really, really difficult. And then we had to design a new curriculum paradigm, and we just went with the open source mindset. How can we make it as inclusive as possible?
How can we allow as many people as possible to come and contribute? And we basically completely changed the way that we structure the curriculum, and we decided we're not going to have content. What we're going to do is basically point to whatever content is out there. And then we have basically bits of information that can live anywhere. And the curriculum basically is just those pointers that say where those bits of information are. So if we want to change something, we just rearrange it. And if we want to move from one to the other, all we have to do is point in the other direction. So our curriculum, instead of these pages and pages of long text that becomes really outdated, it looks like this. This is our curriculum. So if you're interested in knowing more, join our open source curriculum development. Thank you very much for your time. It's been a pleasure talking to you. This is Code Your Future. And we're here in the sun. Thank you. Thank you so much.
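The curriculum structure described in this talk, pointers to content blocks rather than inlined content, can be sketched as a tiny data structure: blocks live anywhere, identified by an ID and a URL, and the curriculum is just an ordered list of references, so reordering or swapping a module never touches the content itself. The block names and URLs below are made up for illustration:

```python
# Content blocks can live anywhere; the curriculum only stores pointers.
blocks = {
    "html-css": "https://example.org/learn/html-css",
    "js-basics": "https://example.org/learn/js-basics",
    "react-intro": "https://example.org/learn/react-intro",
}

curriculum = ["html-css", "js-basics", "react-intro"]

def swap_module(curriculum, old, new, url):
    """Replace a module by re-pointing, without rewriting any content."""
    blocks[new] = url
    return [new if b == old else b for b in curriculum]

curriculum = swap_module(curriculum, "react-intro",
                         "node-intro", "https://example.org/learn/node-intro")
print(curriculum)  # ['html-css', 'js-basics', 'node-intro']
```

The design choice is the same one that made the change cheap in the talk: updating a module is an edit to one pointer, not a rewrite of a tower of interdependent pages.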
How to Use Private Data in Generative AI: End-to-End Solution for Retrieval Augmented Generation with CrateDB and LangChain
in the morning on Sunday. It's nice to see you all here, looking very bright and early. So we shall get straight into it. Let me welcome the first presenters of the day, Maria and Christian from CrateDB, who are going to be talking about privacy and generative AI. Thank you. Good morning from our side. A pleasure to open the dev room today, and thanks for being here that early on a Sunday morning. We're going to talk about a very interesting topic: generative AI, how to use your own data, and how we can build such applications based on open source software. I think everyone is used to OpenAI and ChatGPT, but you never know what happens with your data in these cases. So, a very, very brief overview. This is gen AI. I think everyone in the room has played around with it already. Just a very quick summary of the basics here. You have your source data of any kind: it can be text, it can be code, images, audio, videos. Everything is transformed via encoders, with billions of parameters, a lot of text, a lot of input, to train the so-called foundation models. We as users formulate prompts against them. We ask the models some questions, they do their job and generate the output, and a language model does nothing else than predict the most likely next token that it should generate. That's all the magic behind it. We see a very, very big potential. When I first tried ChatGPT more than a year ago, it was amazing. It started to write code for me. It started to generate articles. I even went to some tools out there, gave them 30 seconds of my video, and all of a sudden I can be a virtual speaker. Very, very impressive, super fast, but there's also a 'but' attached to it. Obviously, some quality issues. All of you have heard of hallucinations. Last week we had the example of: what color is the water? Is it blue, or is it really transparent? Depending on your training data, if you use children's books, the water is obviously blue.
If you use real-world training data, water should be transparent. Same with snowflakes: they're not white, they are technically transparent. Also, a lot of ethical questions, a lot of governance questions. Official government people talking to deepfakes, not realizing it. Also a big threat that we have in the future. We have to be aware of some environmental impact as well. The key thing we want to talk about today is quality and reliability, with the importance of current, of accurate, and also of private data that is not available publicly. Because all of these foundation models have been trained on public data: what's on GitHub, what's on the internet, what is in the documentation. Yesterday I watched a presentation with a clear message to everyone writing docs: we are responsible for what these models tell us. If you write bad documentation, we get bad results from GPT or other models, because they have been trained on not-so-good training data. Here, for example, Maria found a promo code on OpenAI's website: if you register there and put in the code, 20% off. But unfortunately it was not working. So asking ChatGPT, hey, how can I apply the promo code? I'm sorry, I don't know about this promotion. That's something you don't want to happen if it's a company chatbot. You want to avoid this. So, a perfect example of why we need this current and accurate data, up to the minute, maybe even up to the second. We need this current data. And obviously non-public data, private data: internal documents, confidential documents, documentation that is not public. Think of companies that work with legal documents and technical documentation: they vectorize it, put it into a language model, and then for the maintenance workers they have an application ready. But this is information that also must not leak.
And this brings us also into a little bit of a dilemma, because there are multiple options to bring this private data into the foundation models, or to enhance these foundation models. First option, and again I think everyone in the room has heard about it, is fine-tuning, where you give some input data and really change the parameters, the weights in the foundation model, so that the knowledge gets incorporated into your fine-tuned LLM. Very good, you put the domain knowledge in there, but there are also challenges, right? You don't solve the freshness issue of the data; it's still static knowledge. There's research out there showing that one single wrong training data record can kill the overall performance. One guy says the water is blue, and all of a sudden the response of the chatbot is always 'water is light blue' or something like this. And it doesn't solve the problem of hallucinations. You might still get a lot of hallucinations, not to mention the resources that we need. So, second option: retrieval augmented generation, which has developed into kind of a standard when you want to work with your own data. The first step is that you really need to make the existing data, whether it's videos, data from internal databases, or documents, available to create the embeddings, to calculate the vectors, how this knowledge is internally represented. And then, as soon as your user asks a question in the knowledge assistant or the chatbot, a so-called retriever is asked: hey, please give me the relevant context. And this can be a similarity search in the vector database, or it can be a combination of various searches: a full-text search, a geospatial search, a regular SQL query to get information out of your databases. This context is returned back to the retriever.
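The similarity-search step described here reduces to a nearest-neighbour search over embedding vectors. A minimal sketch with cosine similarity; the toy 3-dimensional vectors and ticket texts are made up and stand in for real embeddings, which typically have hundreds or thousands of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, store, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# (text, embedding) pairs as they might sit in a vector store.
store = [
    ("Reset your password via the account page.", [0.9, 0.1, 0.0]),
    ("Invoices are sent at the end of each month.", [0.0, 0.2, 0.9]),
    ("Password resets require a verified email.", [0.8, 0.3, 0.1]),
]

print(retrieve([1.0, 0.2, 0.0], store))  # the two password-related chunks
```

A real vector database replaces this linear scan with an approximate nearest-neighbour index, but the contract is the same: query vector in, top-k most similar chunks out.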
It is put into a special prompt, as context, as additional information to the prompt, and together with the question and this additional context, now a large language model can generate your answer. And you can put into the prompt, as we will see in the demo, please use only this contextual data; if you don't know the answer, please say you don't know. That limits the hallucinations a lot; it doesn't prevent them 100%. Good. I think I talked about disadvantages and challenges already. And one advantage I forgot to mention is access control. Now that you really get this context from either the vector store or a different database, maybe CrateDB, you can put fine-grained privileges there. In the example application that I mentioned before, some of the maintenance workers are not allowed to use the legal documents, for example. So they can't use the index, the embeddings of the legal documents, but they are obviously allowed to use the technical documentation. And someone from the legal department asks, oh, what is the support contract with XYZ? Are we now liable? Et cetera. Obviously, they then need different indexes, different search indexes. How to do this? How are semantics represented? The key is the vectors, or embeddings. And a vector is nothing else than a series of decimal values, an array of decimal values, with a lot of different embedding models out there already. And every model has its strengths and weaknesses. Some are more optimal if you use, for example, German text, Chinese text or Indian text, right? A very different way of coming up with the semantics and analyzing how the attention mechanisms internally work, because the sentences are built in a very, very different way. So you see different performance there, or highly specialized models. You do an image recognition: oh, it's a sleeping cat. And this can then be vectorized as well, and you can search for this context in your vector store.
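The prompt-assembly step just described, put the retrieved context into a special prompt together with the question and instruct the model to answer only from that context, can be sketched as a template. The wording below is illustrative; real applications tune it heavily:

```python
def build_rag_prompt(question, context_chunks):
    """Assemble a RAG prompt: retrieved context plus the user's question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer the question using ONLY the context below.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "How do I reset my password?",
    ["Reset your password via the account page.",
     "Password resets require a verified email."],
)
print(prompt)
```

This string is what actually gets sent to the language model; the "say you don't know" instruction is the part the speakers credit with limiting (not eliminating) hallucinations.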
And now, if we think this one step further: how could an architecture look for such a knowledge assistant or chatbot? A prototype is always easy to build, but you need to think about a lot of additional topics. First of all, it starts with the data, right? The data that you want to vectorize, that you want to make available for your search. So we've shown here a landing zone from different sources; it can be the original sources, or you might copy it, depending on the architecture you want to build. And the important thing is the processing layer. How do you chunk your data? How do you create the vectors? And obviously, you need to store these chunks of information together with the vectors, and provide proper data access control. Second part here, the LLM part, we talked about it multiple times now. You need access to the embeddings, you need access to the large language models, and then there also needs to be some logging. What did you use as a query? How much cost does it incur? Is the performance okay? A lot of logging also occurs here. And intentionally, an LLM gateway is put in front of it, because it needs to be changeable. Chatbots with a lot of functionality, I don't want to go into all the details, and obviously monitoring and reporting. And the beauty of it: you can build all of that with open source tools nowadays. And also the embeddings and language models can be open source; there are a lot of alternatives out there. Now, why CrateDB and LangChain? You need robust data management. As we have seen, there are a lot of different data sources and data stores involved here, whether it's logging, whether it's semantics; your agents communicate in JSON. So you need to store all of this information, ideally in one store, not five or six different databases that you need to operate, where you need to learn the language, et cetera. And besides LangChain, other options are also out there. Think of Haystack and others that you could use.
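One of the processing-layer questions above, how do you chunk your data, is commonly answered with fixed-size chunks plus an overlap, so that text split at a boundary still appears whole in at least one chunk. A minimal character-based sketch; the sizes are arbitrary, and real pipelines usually chunk by tokens or sentences instead:

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping character chunks for embedding."""
    chunks = []
    step = size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks

doc = "word " * 200  # stand-in for a long internal document
chunks = chunk_text(doc, size=100, overlap=20)
print(len(chunks), len(chunks[0]))
```

Each chunk is then embedded and stored alongside its vector; the overlap trades a little storage for not losing sentences that straddle a chunk boundary.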
But all of these frameworks give you a very good set of building blocks. You can just use them. It's available in Python and JavaScript, there are also Java ports out there, and ports to other languages are now available. Everything you need is already in these libraries to come up with your overall architecture. And that's now the point to hand over to Maria. She will guide you through a demo where we try to simulate how you can use support tickets, internal data. Here we took some Twitter posts directed at Microsoft. We will vectorize them and show how a support agent or a customer can then interact with this chatbot and ask certain questions. As we will demonstrate, it's not such a big effort; you can get started right away. And for the demo, we put the link here on the slide. You'll also find the link to the demo in the app or on the website for the talk. Thank you. Do you hear me? Okay. Awesome. Thank you. So you have heard a lot of the theoretical aspects of RAG and how it works. I have a little bit more than 10 minutes to show you a practical example, but believe me, we could have an hours-long workshop on this topic. So essentially, the idea today is to show you how to augment an existing LLM with private data and how to use it as context for specific questions that this LLM has not seen so far. So we actually use data that captures customer interactions on Twitter, and these customer interactions involve different questions from users about Microsoft, Amazon, all these different products, and how the support teams from these big companies actually answer these user questions. So this is not something that you usually find on the Internet very easily. If you have maybe some problem with some Microsoft product, very often you can actually find the solution out there. But some very specific questions are asked directly to customer support, and that is probably the very reason people turn to customer support: 
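The prompt pattern described above ("use only this context; if you don't know, say so") can be sketched as a plain string template; the exact wording below is ours, not the speakers':

```python
# A minimal RAG prompt template, assumed wording for illustration.
PROMPT_TEMPLATE = """You are a support expert.
Use only the following context to answer. If the answer is not in the
context, say "I don't know".

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, retrieved_chunks):
    """Stuff the retrieved context chunks into the prompt."""
    context = "\n---\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "How do I change my shipping address?",
    ["To change a shipping address, cancel the order and reorder."],
)
```

Frameworks like LangChain provide prompt-template classes that do essentially this string assembly for you.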
So you didn't find the answer to this out of the box. And we will use CrateDB as a vector store to support this example. I think Christian already gave you a good overview of what CrateDB is. What is LangChain? LangChain is an open source Python project that is used to facilitate the development of LLM applications. It's a pretty cool project that integrates a lot of large language models, a lot of models for calculating embeddings, and it helps you integrate some data source with some language model without having to think about how the full engineering pipeline should look. You can just do this in a couple of lines of code. May I add one point here that I forgot to mention: although LangChain is a very good starting point, what we have also seen is that for very advanced purposes you want to directly interact with your data, with your source data, with your vector store, and all of that is available in standard SQL, no matter which data model you're using. And CrateDB is an open source store; one of the easiest ways to run CrateDB is actually to use a Docker image. Vector support in CrateDB has been available since version 5.5, but if you always pull the latest image, you don't need to think about this. So once you run this docker run command, we actually run an instance of a CrateDB cluster, and then we can access the admin UI on localhost. Currently, I think because of the resolution of this screen, not everything is visible, but in this admin UI you have a couple of tabs that you can use to monitor your cluster, to run queries in the console, and also to get an overview of the tables and views that are available in your database. So let's go back to the example, because time is flying very fast. 
So what we need as the first step is a couple of import statements to make sure that LangChain and all the libraries that we use in this example are available. What is also important is that you import the CrateDB vector search interface that is available in LangChain, which is used to interact with CrateDB. As a next step, because we need to interact with the CrateDB instance, we need to specify how we connect. This is done by specifying a connection string. We are using the open source version running on localhost, but you also have the option, for example, to deploy a CrateDB Cloud cluster, and at this point we also give all users the option to deploy one cluster that is free forever, so you can just run it and use it for testing purposes. Finally, we need to specify the collection name that we are going to use in this notebook session. So if we run this piece of code, the connection string is now available and then we can start interacting with CrateDB. For the purpose of this notebook, I rely on OpenAI models. Of course, LangChain supports so many different models, and you can integrate many of them, but if you choose to use OpenAI, make sure that you have an OpenAI key as part of your environment variables. So now let's take a look at what the dataset looks like. This dataset is also available in our CrateDB datasets repository, which is also open source, and it contains the customer interactions about Microsoft products. So essentially we would like to narrow the scope of this notebook, for illustration reasons and time reasons. This dataset has information like who is the author of this question, whether it's an inbound or outbound tweet, when it was created, what the content of the question or the answer was, and whether this text is a response tweet or was created in response to something else. 
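Under the hood, loading a CSV dataset for embedding amounts to turning every row into one text "document" (this is a hedged sketch of what a loader like LangChain's CSVLoader does, using made-up sample rows, not the demo's actual data):

```python
import csv
import io

# Hypothetical miniature of the Twitter support dataset.
CSV_DATA = """author,inbound,text
user123,True,How do I update the shipping address on my order?
MicrosoftHelps,False,Please cancel the order and place a new one.
"""

def load_csv_documents(fileobj):
    """Turn each CSV row into one 'column: value' text document,
    ready to be passed to an embedding model."""
    reader = csv.DictReader(fileobj)
    return ["\n".join(f"{k}: {v}" for k, v in row.items()) for row in reader]

docs = load_csv_documents(io.StringIO(CSV_DATA))
```

Each resulting string would then be embedded and stored alongside its vector in the collection.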
So essentially, all this information, and now the idea is to feed it to the large language model and to ask questions that could, for example, be answered from this dataset. So the first step, if you remember the big RAG picture, is to create embeddings. Embeddings are the representation of your data that is suitable for machine learning and AI purposes. So first we need to load the data from this dataset, and for this we use the CSVLoader interface that is available in LangChain, and in these few lines of code we are already creating embeddings for all the entries in our dataset. So if I go back to the admin UI, I can see two tables. The first table gives me a collection of entries. As we defined, the first collection we created is called customer data, but what is interesting now is to see the embeddings created for all the entries in this collection. So for example, this is an instance of a document that we are actually using for the context, and you can see what the embeddings look like. So if you use OpenAI embeddings, the length of your vector is going to be 1,500-something, but you can also, for example, choose some other embedding algorithm, for example Hugging Face, as suggested here, which is open source and can easily be used out of the box in just two lines of code. Now, once we have these embeddings, let's define our question, and our question today is: okay, I have an order in the Microsoft Store, but I want to update the shipping address; how do I do this? I also put alternative questions here, so when you play with this notebook, you can put your own questions and see whether this dataset has enough information to answer them. 
So once the question is defined, what we want to do is find the context that is relevant to this question, and this is done by doing a similarity search of the vector representation of our question against the vectors that we stored in the CrateDB instance, and this is actually done in just one line of code. As Christian suggested, vector search is one way to find the relevant context; of course, CrateDB supports other types of searches, like full-text search or geospatial search or just keyword search, so you can use different types of searches combined together to find the relevant context for your question. Once we do this, we are ready to actually ask our LLM to answer our question, and how do we do this? First we need to create a prompt that explains to the LLM what its purpose is. Its purpose today is to be an expert on Microsoft products and services, and it should use the context that we are going to give it to answer relevant questions, but if the answer is not found in the context, it should reply with "I don't know". This is a very simple way to create a prompt that gives instructions to the LLM on how it should answer specific questions. Finally, we just need to create a small chatbot by using some of the available models that are integrated with LangChain, and also passing this context together with the user question. Once this is completed, we can access the answer, and in this case it says: to update the shipping address, you will need to cancel your current order and place a new one. Maybe that's something that is still up to date, maybe it's not relevant anymore, but it's something we learned only from the dataset we provided. So this is how you actually use your private data to teach the LLM what the context for any incoming question should be. 
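The similarity-search step can be sketched as follows (ours, not the demo code, with made-up two-dimensional vectors): rank stored vectors by cosine similarity to the question vector and keep the top k as context. A vector store like CrateDB does this with an index rather than a linear scan.

```python
import math

def top_k(query_vec, store, k=2):
    """Return the k stored (text, vector) pairs most similar to query_vec."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return sorted(store, key=lambda item: cos(query_vec, item[1]), reverse=True)[:k]

# Hypothetical stored chunks with toy embeddings.
store = [
    ("cancel and reorder to change the address", [0.9, 0.1]),
    ("reset your password via account settings", [0.1, 0.9]),
    ("contact support for billing questions", [0.5, 0.5]),
]
context = top_k([1.0, 0.2], store, k=2)
# The texts of `context` are then stuffed into the prompt for the LLM.
```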
So I hope you liked this demo. You can play with this notebook; it's in our CrateDB examples repository, and you can also see there are other similar notebooks for different types of examples, for different prompt-engineering examples, for how to create other forms of chatbots, or how to use other embedding algorithms. So please let us know what you think, give us feedback, open a new issue on this repository, and we are looking forward to working with you on these topics. So I think that is all from us. Thank you for being part of this session. Maybe we have time for one question. Okay, awesome, do we have questions? Anyone? Thank you for the talk. I have a question about the embeddings model, because if you encode the prompt with the language model and use an external embeddings model, they can be in different spaces, and if you do similarity search... have you tested it, and do you see the effect of different embeddings? I mean, it's a very important question. The way you create these embeddings is super important, and you're usually limited to one embedding algorithm, because they need to have the same length and obviously they need to capture the same semantics, simplifying a bit. And this is also what I meant with the customers that we work with: they were able to create different indexes, right? And then the retriever gets more and more complex. As you've seen on this architecture slide, this is a simplified example; maybe you need to query different indexes created by different embedding algorithms, so that you can search your images, you can search your textual data, right? Obviously you might use different things there, and then re-rank the results to come up with the really relevant context, maybe from different indexes. And maybe you also want to combine it with a full-text search, or, trying to come up with a good example, limit it to customer support tickets from Europe, or to customer support tickets from the US 
with some geospatial condition. But it is then the re-ranking of the results that really identifies the particular context that is relevant for the question. Okay, thanks a lot. Any more questions? No? So thank you very much for the very nice talk. Thank you.
A murder party with Lea
Okay, so now we can start. Thank you very much for coming to the Python devroom and getting up early on Sunday morning with this cool weather outside. So now we are going to have a very, very nice talk by Pierre Denis, who is a long-time Python user. He's also the creator of Lea, and he's going to talk about Lea in this talk. Lea is a Python module for helping to calculate probabilities in situations presenting uncertainties. And what that means, I hope he's going to explain to us now. Thank you. So welcome, everybody. We are here about something serious, a sad story. I'm not a good storyteller, I'm afraid, but okay: Dr. Black has been killed last night. Maybe you have heard about that. And okay, we have four suspects that have been identified, each with a given probability of being the killer. And it seems that Colonel Mustard is most likely the killer, with 40%. Then we have Mrs. Peacock, 25%, Mrs. White, 10%, and Professor Plum, 25%. Okay, these are prior probabilities, but we have the help of a profiler. This guy is very smart. And he can tell, for example, that if Mrs. White is the killer, she'll be absent from the investigation with 95% probability. Otherwise, if she's innocent, she'll be absent with only a probability of 20%. And the profiler tells you several statements like this, with probabilities. So when you see this kind of situation, you think, okay, it's quite complex: how can I use this information, because nothing is certain? Okay. So the investigator is Lea. Here, Lea is not a person, as you have understood; it's a module dedicated to probabilities. So okay, I have several statements here. In other presentations I elaborate on this, but this time I prefer to show you Lea in action so you can better understand what it is about. My claim is that Lea is something quite easy to use, quite intuitive. You probably know that there are several packages dedicated to probability or statistics. 
The core feature of Lea is to be easy to understand, and it is probably well suited for education. Okay, let's start. First, I import Lea, which is here in version 4.0.1b. So first of all, I want to define a fair coin with head and tail. I do that. Lea can work with any Python object; here I define probabilities on strings, but you can define probabilities on numbers, on any Python object. Okay. Here, for education, I prefer to switch to fractions. You know that Python has fractions included, so I've switched the display to show fractions. If I want to create a biased coin, I can define several values, and here it means that tails will be three times as likely as heads. So I have a new probability distribution. What I'm doing here is a crash course on Lea, because we want to be acquainted with it before doing the investigation. I can also use a probability mass function to define the probabilities as fractions. Matplotlib is integrated, so you can display a histogram for any probability distribution. Okay. So I want to make 100 throws. I use my biased-coin variable, my probability distribution, to make 100 random coin throws. You see in these random throws that there are more tails than heads. But okay, how can I be sure that it follows the probabilities that I gave? Simply, you can use the same function as before, lea.vals: you provide the values, and this time it will use the random sample as a frequency counter, and you see that, more or less, it conforms to the probability distribution that I provided for the biased coin. What is interesting with this kind of object is that you can do much of what you usually do with Python objects. For example, you can index: if I ask for zero, it will take the first letter of head or tail, H or T. I can chain with the Python lower method and I have lowercase h or t. I can map a Python function here. 
This means that it counts the number of characters, which is four: head and tail, four characters each. So we have a certain value of four. And as you could expect, all the operators are overloaded. So if I concatenate my biased-coin distribution with a fixed string, I have a new distribution that follows what has been defined. Okay, and here is something a bit funny: what happens if you multiply a die by a coin? You get that. Okay. Let's now throw two coins. The new method allows you to define a new event with the same probabilities. Here I define two coins which are biased the same way. If I add them together, I have all the possible combinations with associated probabilities. We will see that this is very important. We are able to calculate conditional probability with the given method. So here I try to see: assuming that I know that the first coin is tails, what is the combination of the two coins? Here we see that the previous result has been filtered to keep just the two remaining possibilities. A common feature of Lea is that when you define variables, there is a kind of lazy evaluation: they remain linked together in a network that defines the relationships, the dependencies, between the random variables. Okay. And you can also define Boolean events, like: what is the probability to be? I define it with a given probability. And then I can use operators, like "to be or not to be". And the result is true, certainly true, because to be is either true or false, and not to be is the contrary, so together it's certainly true. Okay. And there is also a dedicated function in Lea, which is P, so you can extract the probability of true. So you get a real probability, like this. Okay, let's go on. So here is an excerpt of a book that is three centuries old, from Abraham de Moivre. It's probably one of the first problems solved by de Moivre. Okay, let's ask to find the probability of throwing an ace in three throws with a fair die. 
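The conditioning step just described can be emulated with the standard library alone (this is our sketch, not Lea itself): enumerate the joint distribution of two identically biased coins, then filter on the evidence that the first coin shows tails and renormalize.

```python
from fractions import Fraction
from itertools import product

# The biased coin: tails three times as likely as heads.
pmf = {"H": Fraction(1, 4), "T": Fraction(3, 4)}

# Joint distribution of two independent throws of that coin.
joint = {(a, b): pmf[a] * pmf[b] for a, b in product(pmf, pmf)}

# Condition on "first coin is tails" and renormalize.
evidence = {k: v for k, v in joint.items() if k[0] == "T"}
total = sum(evidence.values())
posterior = {k: v / total for k, v in evidence.items()}
# Only ("T","H") and ("T","T") remain, with probabilities 1/4 and 3/4.
```

Lea's `given` method performs exactly this filter-and-renormalize, but lazily over its variable network.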
This is how to calculate it in Lea. So here I define a die. I create three instances which are independent, which I assign to three variables. And then I ask for the probability that any one of these dice is an ace. The result is 91/216, as calculated three centuries ago by de Moivre. So far so good. Okay. Now, I don't know if you like playing role-playing games; here is a small example where you can use Lea. Imagine that you have this dwarf which fights against a troll. Okay. I first define a new kind of display, with percentages, because it's more convenient here. I define two different kinds of dice. Okay. Imagine that your attack roll is a d20 plus 4. What is the probability to score a hit? You see, it's easy to calculate with an inequality: you have to be greater than or equal to the troll's armor class. You get this probability. The damage of the magic axe is 2d6 plus 5; here is the result. But this damage is only applied if the dwarf can hit the troll. For that we have a special construction, lea.if_, with an underscore to avoid collision with the Python if. Okay, this means: if there is a hit, then I apply the magic axe damage; otherwise, the damage is zero. And here is the new histogram. So this is the distribution of the actual damage that is done to the troll. And from this data you can answer: assuming that the troll has 20 health points remaining, what is the probability to kill him in four rounds or less? You see, it's dead simple to calculate with this formula. We find it's 40%, something like that. Okay. You follow? So I have many, many examples, but for lack of time I will drop maybe some of them. The boy-or-girl paradox is something very funny that you can also find on Wikipedia. The chances to be a boy or a girl are even: boy, one half; girl, one half. Mr Smith has two children; at least one of them is a boy. What is the probability that both children are boys? 
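De Moivre's result can be checked with exact stdlib fractions: the chance of at least one ace (a one) in three throws of a fair die is one minus the chance of no ace in all three throws.

```python
from fractions import Fraction

# Probability of NOT throwing an ace on a single throw of a fair die.
p_no_ace_per_throw = Fraction(5, 6)

# At least one ace in three independent throws.
p_at_least_one_ace = 1 - p_no_ace_per_throw ** 3
print(p_at_least_one_ace)  # 91/216
```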
Many people, including myself the first time I heard this, think: okay, the information gives me no clue, it's one half. But if you calculate it like this with Lea, so you define children as a joint of two children, count the number of boys, and calculate the conditional probability, the answer is actually one third. And what is interesting with Lea is that you can understand why this is the answer, by asking Lea to show you all the combinations. So here I show the genders of the children and the number of boys, given that the number of boys is greater than or equal to one. And we see the answer here, and we understand better why it is one third. Okay, it's a bit fast, but you can do it at your own pace later. Okay. What happens if you have a more elaborate problem? Like here: we have several children, the eldest is a boy, and he has at least three brothers. What is the probability that all children are boys? You can model this like this. Here I create seven children. And, you see, when you read this expression, it's quite close to the initial problem. Of course you have to understand the elements of Lea to do that, but after that it's quite easy to model. The answer is 1/42. Again it's possible to ask why it is so, and here, by joining, you see that seven children is this part and the others are that part, so you can better understand why it is so. Okay. I will skip the Monty Hall problem, which is well known; you can read that after the session, offline. Okay. Let's go back to the initial problem. First I change the display options. We define the prior probabilities like that. Here I ask Lea to display the probabilities on one line, because it's more convenient in this case, and as percentages. Okay. So we have it like this and we see that Colonel Mustard is a priori the most likely killer. Okay. Let's now try to write down the different pieces of information we have. So, if Mrs. 
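The boy-or-girl paradox can also be checked by brute-force enumeration with the standard library (our sketch, not Lea): condition on "at least one boy" and count how often both children are boys.

```python
from fractions import Fraction
from itertools import product

# All equally likely two-child families: BB, BG, GB, GG.
families = list(product("BG", repeat=2))

# Evidence: at least one of the two children is a boy.
at_least_one_boy = [f for f in families if "B" in f]

# Of those, how many have two boys?
both_boys = [f for f in at_least_one_boy if f == ("B", "B")]

answer = Fraction(len(both_boys), len(at_least_one_boy))
print(answer)  # 1/3
```

Listing `at_least_one_boy` shows the same three surviving combinations that Lea displays, which is why the answer is one third and not one half.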
White is the killer, she'll be absent with probability ninety-five percent. So I define here a variable, Mrs. White is absent, using the if as we've seen before. I put the condition: if the killer is Mrs. White, then she'll be absent with ninety-five percent, else twenty percent. Okay. This is the probability that Mrs. White is absent. But it's not very interesting on its own, because we are more interested in who the killer is; we will see what happens later. And then we can continue and define other rules, like this: if Mrs. Peacock is innocent, she knows who the killer is with probability seventy-five percent. You see, there is a missing piece of information here, which is the else part, but we assume that Mrs. Peacock is not insane, and if she's the killer, then she knows who the killer is, hopefully. So I put the else part as one hundred percent. Okay. And then we can elaborate on more complex information, like this one. I will not detail it, but you see again that, when you see the statement, the translation into Lea is quite straightforward. And the last one is here. So what we have done here is define what we call a Bayesian network, which puts in place the relations between the different random variables. What is interesting with this kind of network is that if you get evidence about something, you can go backwards and refine the probability of being the killer. So for that, okay, I define a list of evidence here. First of all it's empty, and the conditional probability is the same as before, because I have no new evidence. Imagine now that Mrs. White is absent. I can add this to the evidence and define a new conditional probability. You see, it changes a bit. Evidence two, added to the previous one: Mrs. Peacock is drunk. Okay, I add this information and I get new probabilities, and so on. Professor Plum accuses Colonel Mustard. And finally we know that the killer is a woman. So for that I use here the Python startswith method with "Mrs." 
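The first evidence update can be reproduced by hand with Bayes' rule, using the talk's numbers (our stdlib sketch of what Lea computes behind the scenes): priors over the four suspects, and P(Mrs. White absent) of 95% if she is the killer, 20% otherwise.

```python
from fractions import Fraction

prior = {
    "Colonel Mustard": Fraction(40, 100),
    "Mrs. Peacock": Fraction(25, 100),
    "Mrs. White": Fraction(10, 100),
    "Professor Plum": Fraction(25, 100),
}

def likelihood_absent(suspect):
    """P(Mrs. White is absent | suspect is the killer)."""
    return Fraction(95, 100) if suspect == "Mrs. White" else Fraction(20, 100)

# Bayes' rule: posterior is proportional to prior * likelihood.
unnorm = {s: p * likelihood_absent(s) for s, p in prior.items()}
total = sum(unnorm.values())
posterior = {s: p / total for s, p in unnorm.items()}
# Mrs. White's probability rises from 10% to 19/55 (about 34.5%).
```

This is exactly the "go backwards and refine" step: the observed absence makes Mrs. White considerably more suspect.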
because it's a handy way to say, given the suspects, that the killer is a woman. I add it to the evidence like that, and you see, okay, there are new probabilities. There are just two suspects remaining, the two women, and Mrs. White is likely the killer. Okay. Yeah. Maybe you can consider this a game, but sometimes probability can play a very important role in some trials. A long time ago there was the Dreyfus Affair, where there was a big flaw in the reasoning of a so-called expert who made mistakes in this affair. And also, more recently, the Sally Clark case, where there was also bad reasoning about probability. Okay. So I want to mention also that Lea is able to do symbolic calculation, by using the SymPy module that maybe you know. It's very easy, it's the same interface: instead of a number, you put a variable name between quotes, like this, and you have probabilities defined with formulas. So you can redo all the same exercises and you will get formulas for being the killer, et cetera. A small example here. I don't detail it: it's a binomial distribution here, with p, and here I calculate a conditional probability and it displays a nice formula. You can check offline, if you want, that it is correct. Okay. I want just to finish with my bullshit generator, which was made 15 years ago. The goal is to produce sentences at random, based on a list of words and a list of grammar rules like this. You see that I put a probability on each grammar rule, so that the simplest rules are used preferentially, to avoid getting sentences that are too long. So yeah. Okay. I get... So it has produced... I don't know why it's... Okay. So maybe I don't know what happens, but... Okay. I restart my kernel. Normally it's supposed to speak and to write down sentences, but... Okay. Anyway, you can play with that too. The Python code is really small, so you can try it by yourself. Oh yeah. Okay. Of course, I didn't import Lea. Okay. That's it. 
But anyway, sorry for the small interruptions, but I think we don't have time for questions or... Maybe one question. Okay. Thank you for the presentation. I have indeed one question, which is about performance. Do you have information about the performance of your library compared to other libraries, or what are your insights on that? Yeah, it's a good question. So okay, it's not really the main concern. As you have seen, the results are exact, and as you have also seen, it's quite fast. There are several optimizations. I have no figures, but okay, as you expect, there are many problems which are very complex, and for those Lea provides several Monte Carlo algorithms that give approximate results in a fair time. Yeah, but I have no figures. Okay. Thank you. Thank you very much.
A slow migration from Django templates to Vue+GraphQL
Okay, so now we have both speakers here. We can start the next talk. The talk is going to be about a slow migration from Django templates to Vue and GraphQL. Jonathan and Dominik George, both from Germany, are going to talk about a system, AlekSIS, which is a school information system that was written using Django templates, and they have been porting it to Vue and GraphQL. So give them a warm welcome. Thank you very much. Can we get the microphone for the other speakers? Thank you very much. There are more speakers than announced; the last one is the surprise speaker. Hello FOSDEM and Python devroom. We are the AlekSIS project. That's the all-libre school information system, and we want to tell you how we transitioned from a Django app with a templated web front end to an interactive web front end, as the need arose for one in our project, and how we did it incrementally. I'm Michael Bauer and I'm a developer at AlekSIS; I work mostly on the new frontend and the new features we have enabled with it. So with that, let's introduce the rest of the team. More of the team. Yeah, my name is Nik. I'm more or less one of the founders of the project. I started tinkering on the school management system when I was still at school; I don't think I can remember when that was. Today, yeah, I don't know what my role is in the project right now, but someone might know. So. I have a microphone of my own, so I don't need that microphone. That's decent. So I'm Jonathan. I'm the lead developer of the AlekSIS project, and I'm coordinating the dev process and everything connected to it. Okay, so let's get started with the talk. What is AlekSIS? It is a free and open source school information system, and it has a free software license, the European Union Public Licence. So it's thought of as an alternative for schools, so they have a free option to manage and organize themselves. 
It's a modular system, so any school can just take what they need and doesn't have to use the whole system. And it's also built in such a way that it complements existing solutions; we only focus on the parts that aren't covered yet in a free-software way. It's developed by software developers, but also students and teachers. So we're working together with pilot schools and already have it in use there. The main AlekSIS features, of course, are divided into components, but the central component is base data management; it's the basis of the school, like classes and pupils and teachers and so on. Then we have a timetable system; it's like a calendar system just for schools. You can create timetables and serve them to the students, so each student has their own personalized timetable, and the teachers have them as well. And there's a digital class register to take all the notes and information for classes, and seating plans, so you can design and show seating plans for the classrooms. And it also integrates with other services: we have a Matrix integration, an OAuth integration, LDAP, and CSV import/export. And we also have a calendar system inside AlekSIS that produces standard iCal calendar feeds, so there's a lot of choice in which end devices are used to hook up to AlekSIS; it's a quite universal system. There are also provisions for student ID cards and inventory management in schools. With that, I would like to hand over to Nik. He is presenting the AlekSIS technology stack. Yeah, thank you. Okay, yeah. So thanks for making this nice graphic to help me remember how this works, Jonathan. Yeah, well, our legacy code base was a traditional Django project, with all the modules as Django applications. When we started, basically everyone was doing server-side rendering with all the nice templating features of the Django framework. 
To introduce you to the rest of the tech stack: on top of Django, we use PostgreSQL quite heavily. There's Celery as a task broker, and Redis for caching and for synchronizing several nodes when running AlekSIS in a multi-node setup. Yeah, and for the front-end parts, as I already said, we used the Django templating engine and some not very well integrated front-end utilities, like the Materialize CSS framework, which at the time somewhat allowed for making modern interfaces following the Material Design standards, but it started to bit-rot quite quickly, and Jonathan will give you some idea about that later. Okay, so that was the legacy tech stack, and, is my name somewhere else here? Do I have to say anything more? Yeah, you can see a page in the legacy tech stack. So you have to. Yes, yes, nice. A little overview of how it looked in the past, and yeah, I have to say, then the problems started. We encountered some very ugly bugs; I think users described to us that if there was a select menu and they pressed an item in the select menu, actually the item above or below that item was selected. That was not so good, because many users were using iPads. And in addition to strange bugs like this, there was also a problem with the maintenance of Materialize, as you can see by these issues here. So yeah, there was a big discussion about whether Materialize would be developed any further. And in addition to these problems, there were also requests for new features. As we spoke about timetable planning or seating plans, we needed some way to do these highly dynamic features in a better way, because the control of timetable planning is a very complicated thing. Also these customizable calendar views, and auto-saving views where you don't need to press the save button: it all wasn't possible anymore with our old front end. So we had an idea, which Nik will present to you. 
Okay, so yeah, probably many of you know that it's now the new thing to separate front end and back end entirely and make a nice shiny mobile app or whatever. And Jonathan, more seriously, already gave a few hints about why we would want to do that. I think there's one other challenge that we faced. Did you mention offline capabilities and caching? No. Because, you know, AlekSIS is used in schools, and things might be different in other parts of the world, but in Germany only two things are certain in the school system: namely that your mobile network will not work at school, and that the wifi won't work at school. These two things are certain, and therefore teachers always complained that they could not use the server side rendered views when they had no connection to the server. So I think this was more or less one of the biggest challenges we tried to solve, and separating the front end actually makes sense here. Okay, so what we wanted to do: we wanted to replace Materialize, because Materialize was stuck somewhere in 2015 and wasn't really developed, it was abandoned. We had a few patches on top of it, I think some even upstream, but it didn't get better, and it lacked the dynamics that we needed for a really new, shiny, intuitive interface. So, reactive front end libraries, to make the interface not reload on every single interaction. And also a very important idea: AlekSIS provides a very good foundation for handling organizational data at schools, but we want to tailor to the needs of different schools, of different types of schools. One of our most important claims, which we share with schools when we explain the benefits of free software, is that we can make the software work like the school works, and we can transform the software instead of transforming the school.
So on top of the foundations for organizational data management, the idea was that if we could replace the front end for some parts, like make a different class register for an elementary school because they have very different needs, we would not have to replace the data structures, the models and the APIs, but we could make a front end that is more tailored to those needs. Yeah. Okay. This is not my part anymore. No. So we then decided on our new tech stack. As we said, we just took the backend and said, okay, that's our backend, and then we decided we want to do an interactive front end with Vue.js and the front end library Vuetify and some other Vue.js libraries, and we want both parts to communicate via a GraphQL API. So this was our plan, and there were some challenges with this plan. Oh, just a GraphQL API. So, yeah, let's see again. Thanks for helping me keep up with my tradition: I always give one very good talk before beer night and one very bad talk after beer night. Okay. So, as we already said, the platform is supposed to be very modular. It consists of... I don't know, do we have a figure for how many Django apps we had at the point when we started the migration? Like around 15, I think. Like 15 apps that could be loaded dynamically into the Django project. We actually had quite a bit of magic in there to discover the modules of the Django apps dynamically, so the administrators who deploy servers for schools could simply install the Python packages needed for the system they want to put together, and then everything falls into place in some kind of black magic way. And this did not turn out so well for separating the front end, because normally, when you separate the front end, you want to have one JavaScript (or whatever) application that is delivered to the clients, nicely bundled with whatever JavaScript bundler is the current hype. And then it is one JavaScript application.
We could not do this, because we do not know which parts of the system are used, and in which versions; this can be very flexible for every school. So we need to bundle the JavaScript front end application on the machine where AlekSIS is deployed. Yeah? 10 minutes left. Oh, yeah, thank you. Okay, and you need these 10 minutes? Probably. Probably, okay. Yeah. So the right way would be: you have one front end application and one backend application, they are more or less separated in development, and they could be developed independently. But we cannot do this, because... yeah, what's that? Okay. I have to switch the display so you can see this. Okay, this is where we actually generate parts of the bundling configuration for Vite, because when we build the bundle, we know which applications are there. We have the JavaScript front end code bundled with the Python packages in the same repository, and at deployment time we need to extract the JavaScript front end code and let it all fall into place like we did with the Python applications, which was sort of a major challenge. So, yeah. Yeah, the microphone is working, that's good. And then we faced another challenge. We said, okay, we weren't able to migrate all these apps at once, so we had to find a way to integrate the old front end with the new front end. What you can see here on the beamer is how the new front end looks. There is no real optical difference from the old front end, but it is the new front end, and we had to find a way to put those old pages somewhere in this new front end. And if I just say the word iframe, I probably get some scared faces here. So, yeah, we made it and just put an iframe somewhere in there, and then we built some glue which takes the URL which is actually called, and then calls a different URL with a prefix where the old site lives, and integrates it within the front end. And that looks like this. So what you see within this container is an old page.
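The deployment-time generation of the Vite bundling configuration described here can be sketched roughly as follows. This is a hedged illustration, not AlekSIS's actual code: the helper names (`build_vite_aliases`, `render_vite_config`), the alias layout, and the app paths are all invented for the example.

```python
import json

def build_vite_aliases(apps):
    """Map each discovered app to an import alias for its frontend sources.

    `apps` is a dict of {app_name: path_to_frontend_sources}, e.g. collected
    at deployment time from the Python packages installed on the server.
    """
    return {f"@{name}": path for name, path in apps.items()}

def render_vite_config(apps, out_dir):
    """Emit a JSON fragment of a Vite config that a JS wrapper could import."""
    config = {
        "resolve": {"alias": build_vite_aliases(apps)},
        "build": {"outDir": out_dir},
    }
    return json.dumps(config, indent=2)

# Example: two hypothetical school apps discovered on this deployment.
graph_apps = {
    "core": "/srv/aleksis/core/frontend",
    "chronos": "/srv/aleksis/chronos/frontend",
}
print(render_vite_config(graph_apps, "/srv/aleksis/static"))
```

The point of the sketch is only the shape of the problem: because the set of installed apps differs per school, the bundler config cannot be a static file and has to be computed on the deployment machine.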
And what you see around this container is the new front end. If you look at which URL is displayed here, it has the prefix django, so it's within the iframe, and if I click the button, the iframe will navigate to this django URL. I will do this, and you can see that, magically, the actual URL in our new front end is also updated. So it's a kind of, yeah, ugly magic. And this also goes one way further. This is an old view within the new front end, and now I click one of these links and it navigates to a new view in our new front end. So this needed a large bunch of glue to put all this together, but now it's working, with some exceptions Nick will come to. Some exceptions, yeah. So this iframe with a server-side rendered page and the new Vue.js front end are always communicating using some sort of JavaScript message passing which I did not yet fully understand. Okay, so what are we doing here? This is the dynamically generated bundler config or something. Yes, it is. I don't think we have the time to go into detail about this. And oh, whoa, there's a video. Here you can see the new front end in action, and why we did this transition: because we wanted to have more interactivity, and here you see how you can design a timetable now with the new Vue front end. Someone is inserting lessons into the timetable, and it's highly dynamic and all just works. So we just want to tell you about the new problems, and I think this last part will also be done by Nick. So, oh yes, this problem. Okay, we already talked about iframes and how they communicate, and sometimes, as we all know, communication fails, and then you have AlekSIS inside AlekSIS inside AlekSIS. I think this visualizes quite well what sort of trouble this slow migration caused for us, but we did not see this too often recently, right? Mm, not too often. I don't think so. Prove me wrong. Okay, thank you. We called it mini AlekSIS; now we call it the AlekSIS Matryoshka situation.
If you know what that means. So yeah, it still pops up every other month. All right, so now we have ugly front end bugs from the integration, and all of this will be sorted out once we get all applications and all views migrated to the new front end. The JavaScript ecosystem shares some of the same problems we had with the Materialize situation, because, you know, there's Vuetify 3 and it's pretty neat, and we needed to migrate to Vue 3. Vue 2 has been deprecated for two years or something. Pardon? This year, this year, so not too far in the past. Okay, but it's deprecated. And Vuetify 3 is cool and we would want to migrate to it, but it's still missing the calendar component and the date picker component, right? And basically the only thing AlekSIS ever does is handle dates, so this is some sort of showstopper here. We hope that this will be sorted out. I think the release date for the date picker is moved every quarter to the next quarter, or something, but we will see how this works out. Yeah, of course there's an easy and obvious solution to the problem, because we could just do this, right? No tomatoes for me? And get some new problems. So we are always shifting from one set of problems to the next set of problems. Okay, thanks for bearing with us. I think I'm slowly getting awake. You can find us in the hallway track if you want to get more information and less chaos, maybe. All right, do you have any last words, Jonathan? I think we have about three minutes for questions, if I'm right. So maybe someone wants to ask a question; otherwise we will also be available via email. So yeah. Any questions? Thank you. I have a question. Why did you go for GraphQL instead of something like Django REST Framework, exposing APIs and using that, instead of adding a new layer in between the front end and the back end? Yeah, well, I think we chose GraphQL...
Because I think the obvious alternative would be REST or something like that. But we chose GraphQL because we were able to select what we deliver to the front end. We have very complex models, and we can say, okay, we just take this set of information for this page, and for the other page we need a much larger set. But of course this GraphQL integration is causing us problems with an unmaintained, or barely maintained, Django library and things like that. So, as we said, another set of problems. Yeah. I think the microphone is not working for the presentation. Help. Yeah, back to you. I can just be loud. Yeah, just be loud. Okay, I'll just be loud. So thanks for the presentation. I know your pain; I've had to do that job a lot. So my question is: what I've been having success with now is the back end for front end pattern, right? Because all these fancy new reactive libraries now have these meta frameworks, which is an awful word, but they kind of work. So have you considered doing that? The way I like to do it is you have the new back end for front end, and when it doesn't know what to do with a request, it just gets the page back from the old server. So, I don't know if you looked at it: why did you try to keep a single page application? Do you want to answer this? I can take over from there. Yes. What was the question exactly? Have you taken a look at these back ends for front ends? Do you like them, do you not like them? What exactly do you mean? Like Next.js, for example, for React. Yeah, okay. Vue has one like that. Yes. We have never used this. It's like, two years after this migration started, we just thought, oh, we could also have used this. But now the work is done, we have to go on with this. Our developer capacities are very limited. So yes, it's a kind of knowledge we didn't have. Okay, so thank you very much for the very nice talk. Interesting system. Thank you.
Thank you.
Django migrations, friend or foe? Optimize them for testing
Hi everyone. How many Django users in here? Raise your hands. Keep your hands up if you are dealing with Django projects with a lot of migrations, which cost time and continuous integration minutes. Okay, this talk is for you. Perfect, you are in the right room. Now, I am Denny. I am on the right side of the photo. I work with JavaScript, Python, Vue.js, Django, everything like that. So let's start with Django migrations: they are Django's way of propagating changes from your models to your database schema, and of keeping track of them. Let's quickly recap the migration commands: makemigrations, migrate, showmigrations, and sqlmigrate. The first one, makemigrations, creates new migrations based on your model changes. You can use different parameters there: for example, an empty migration you can customize, you can give a migration a specific name, and you can restrict the creation of a migration to a specific application. The model, for example if you want to recreate Twitter (we all know the reason for that), is this one. You can create a class for a model, and then creating the migration with the command will create a new file in your project, in the migrations folder, with this content: initial equal to True if it's the first migration in your project; a list of dependencies, if you are using something like, for example, authentication, or, if you are on the second migration in the project, the first migration as a dependency; and a list of operations performed during the migration. Then you can apply your migrations, of course, using the migrate command, specifying an application or not, or a migration name. So if you want to move to a specific point in the history of your migrations, you can specify this. On a new project, you can migrate everything using manage.py migrate, and everything is at the last version of your database schema. Then, if you want to roll back every migration in a project, you can migrate to zero, and everything is rolled back.
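A generated migration file of the sort described here, with `initial`, `dependencies`, and `operations`, looks roughly like this. This is a hedged sketch, not the talk's actual example repository: the Tweet model, its fields, and the dependency on a hypothetical users app are all illustrative guesses.

```python
# Illustrative only: a minimal guess at what makemigrations might generate
# for a hypothetical Tweet model; names and fields are invented.
from django.db import migrations, models

class Migration(migrations.Migration):
    initial = True  # True because this is the first migration in the app

    # Migrations this one must run after, e.g. another app's migration
    # (the app label and migration name below are made up).
    dependencies = [
        ("users", "0001_initial"),
    ]

    # Schema operations applied inside one transaction, then committed.
    operations = [
        migrations.CreateModel(
            name="Tweet",
            fields=[
                ("id", models.BigAutoField(primary_key=True)),
                ("text", models.CharField(max_length=280)),
                ("created_at", models.DateTimeField(auto_now_add=True)),
            ],
        ),
    ]
```

This fragment is declarative: `migrate` reads the operations list and translates it into the SQL that `sqlmigrate` would show you.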
You can move to the second migration in your project with this, and, without specifying a migration number, you can migrate everything to the latest version. Now, how this works under the hood: you have in your database a django_migrations table with content like this, so the application name, the name of the migration, and the date and time when the migration was applied to your database. So everything is in your database. There is a better way to show this: using showmigrations you can have a view of the list of migrations in your database, in your schema, with a tick if the migration has already been applied to your database. And then with sqlmigrate you can print the SQL statements for a specific migration. So with our example we can display the SQL code for this. Let's take a look: a transaction will be opened, every command will be applied on your database, and then the transaction will be committed if there are no errors. Now, if you need to make further changes in your models, you can apply those changes and then create another migration. The migration will depend on the first one, and then the code will again be a transaction, the SQL commands, and a commit. And again and again, you can apply migrations on your database in production using this. What if you need to do further changes, for example adding tweet likes and a lot of other stuff? Then you make the changes in your models and create a single migration, because of course I like to be well organized and structured, so every single change for me means a single migration. Then you end up having a lot of migrations, like this one. But even worse: if you need to create, for example, a shop app for a customer, then you need to create a model, and during the lifetime of your application you need to make a lot of changes to your model structures.
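The bookkeeping described above can be sketched with plain sqlite3: migrations found on disk are compared against rows in the django_migrations table, and only the difference gets applied. The table layout mirrors Django's real one; everything else (the helper names, the app and migration names) is a toy reconstruction, not Django's actual code.

```python
import sqlite3
from datetime import datetime, timezone

# Same columns Django uses to record applied migrations.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE django_migrations ("
    "id INTEGER PRIMARY KEY, app TEXT, name TEXT, applied TIMESTAMP)"
)

def record_applied(app, name):
    """Mark a migration as applied, like Django does after running it."""
    conn.execute(
        "INSERT INTO django_migrations (app, name, applied) VALUES (?, ?, ?)",
        (app, name, datetime.now(timezone.utc).isoformat()),
    )

def unapplied(app, on_disk):
    """Return migrations present on disk but not yet recorded as applied."""
    rows = conn.execute(
        "SELECT name FROM django_migrations WHERE app = ?", (app,)
    ).fetchall()
    done = {name for (name,) in rows}
    return [m for m in on_disk if m not in done]

record_applied("shop", "0001_initial")
print(unapplied("shop", ["0001_initial", "0002_add_price"]))
# only the not-yet-applied migration remains: ['0002_add_price']
```

This is also why showmigrations can draw its ticks cheaply: it is just a join between the files on disk and this table.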
Okay, we won't list all of this, but we had to make a lot of changes, for example adding tables, switching data from one table to another, from a main table to a detail table, and a lot of other stuff, changing data during the workflow. So changes can be a lot of pain, a lot of stuff, and when migrations become numerous, your performance during tests can decrease a lot. During deployment this is fine, you can move forward and backward with simplicity, but in tests it's not that simple, because you need to wait for every migration to apply before running tests. And if you are paying for your testing time on GitHub workflows or other platforms, that can be painful. As a disclaimer, the timings for this talk may change from laptop to laptop, so keep this in mind; this laptop is brand new, so it's faster, hopefully, but on my old laptop these were the timings. Running tests on 20 apps like Shop (I just copy pasted the app 20 times in the example repository) took just a single second, less than a second, and that was perfect, so there would be no need for this talk. Well, not exactly, because creating the test database took 20 seconds. So one second of tests for this project, and 20 seconds for database creation. And that was not optimal, because we were on the verge between the Team license and the Enterprise license for the timing of workflow runs, so around the 3,000 minutes monthly, and we wanted to remain on the Team license, because it was cheap, so we wanted to optimize that time. The first possible workaround is to use --keepdb when running tests; this parameter preserves the test database between runs, and that's perfect, because the first run applies the migrations, and then the database is kept in a cache somewhere, on your local disk, for example.
If the database does not exist, of course, it will first be created and migrated, and when there are changes in other pull requests, for example, migrations will also be applied, so everything is okay, hopefully. So this approach saves us the 20 seconds after the first test run. The problem was configuring your CI/CD, because a solution could be using cache or artifacts in GitHub workflows, but it takes time to create and store artifacts in GitHub; or, for example, using an external test database from inside the GitHub workflow. That wasn't optimal for us, but a friend of mine, if I'm not mistaken, suggested this package, django-migrations-ci, that allows you to simply configure an external test database, so you can consider this and save the 20 seconds if you have an external database. Another possible workaround, a one line workaround, is to use in your settings "MIGRATE": False. If you are using this, migrations won't run during tests, and it is similar to setting None as the value in MIGRATION_MODULES, but for every app in your project, so it's better this way: a single line change. This has pros and cons. Pros: of course, a single line change, and it doesn't run migrations during tests. The problem is that it's like running makemigrations plus migrate before every test run, so in our example repository this added five seconds, which was the opposite of what I wanted to obtain. So, diving into the Django documentation, I discovered this great, great command, squashmigrations, which squashes an existing set of migrations into a single one. You specify a migration name, and optionally a start migration name, and it will squash every migration into a single one. This was pretty good; I tried it on the shop application and decided to squash every migration into a single one. It was good. Not perfect for us, but it was good.
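The two settings-level workarounds mentioned above can be written as a settings.py fragment like the following. A minimal sketch, assuming a Postgres default database; the app labels and database name are illustrative, and Option B is the real Django `"MIGRATE": False` test setting available since Django 3.1.

```python
# settings.py fragments -- two ways to skip running migrations in tests.

# Option A (per app): tell Django these apps have no migration modules,
# so tables are created directly from the current models.
MIGRATION_MODULES = {
    "shop": None,
    "blog": None,  # app labels here are made up for the example
}

# Option B (the one line change from the talk, Django 3.1+): skip
# migrations for the test database of every app at once.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "myproject",
        "TEST": {"MIGRATE": False},
    }
}
```

The trade-off the speaker describes follows from how Option B works: Django still has to derive the schema from the models on every test run, which is what added the extra seconds in his example repository.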
The problem is that we needed to do some manual porting, because, for example, we used a lot of manual functions, carried from one migration to another, from one version to another, and those weren't migrated or automatically squashed, so we had to copy paste the function code into the squashed migration and make some adjustments. And if we inspect the squashed migration file, we can see at the top of the class definition a list of tuples in the replaces variable. The first item is shop, the application name, and the second one is the migration name, for every one of the 26 migrations. And the recommended process is: first squash, keep the old files, commit and release, to staging, to demo, then to production; then wait until all systems are upgraded with the new release; then you can remove the old migration files, commit, and do a second release. Then, last but not least, you need to transition your squashed migration to a normal migration: delete all the old migration files that have been replaced, update all migrations that depend on the deleted ones to point to the new squashed migration, and after everything you can remove the replaces attribute in the squashed migration, and everything is fine. Then, if you want to clean up your database, you can prune references, so in your database there won't be references to old migrations. Let's test performance after squashing, after spending a week of my work project doing that, and... oh no, no changes. So I lost a week doing that without results. Don't tell my boss. So what's the point? Well, the point of squashmigrations is to move back from having several hundred migrations to just a few. For example, if you create a separate branch where you are working alone, you can squash migrations and propose just a single migration file in your pull request. I know, I know, you wanted to speed up tests, so let's do it. Are you ready?
It's not that easy. First you need to recreate your migrations. So let's annotate the migrations for a single specific application with showmigrations, and copy paste all the names of your migration files; then you need to manually create a replaces list (you remember this one from a moment ago) with the application name and migration file names, and store it somewhere on your computer. Then move your migrations to a temporary directory, out of the way, and make sure that showmigrations doesn't show anything. Now it's time to recreate the migrations using your application name and a specific name, for example init_squash, so you remember that this is the squashed migration; that will create a first migration at your latest model version. Then open your migration file and copy paste the replaces list you created a moment ago into your class. Then you can restore your old migration files to the original directories, checking for missing or overwritten files, and remove the temporary directory. Now, with showmigrations, you need to check that everything is there, so in this case all 26 migrations are there, and the first one, the squashed migration, is there but has not been applied. Then apply your squashed migration and check again with showmigrations that everything has been squashed and you have just a single migration. And then you can go back to your post-squash tasks: commit and release to production, upgrade all systems (staging, demo, production, everything else), update all migrations that depend on the deleted migrations, remove the replaces attribute, and, if you want, prune references to the deleted migrations. And everything is perfect, right?
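The "copy paste all the names and manually create a replaces list" step above lends itself to a small helper. A sketch under stated assumptions: `parse_showmigrations` is a hypothetical function, not part of Django or the talk's repository, and it only assumes the usual `showmigrations --list` output shape of ` [X] name` / ` [ ] name` lines.

```python
import re

def parse_showmigrations(app, output):
    """Turn `manage.py showmigrations <app>` output into a `replaces` list.

    Lines look like " [X] 0001_initial" (applied) or " [ ] 0002_foo"
    (not applied); only the names matter for the replaces attribute.
    """
    names = re.findall(r"\[[ X]\]\s+(\S+)", output)
    return [(app, name) for name in names]

sample = """shop
 [X] 0001_initial
 [X] 0002_add_price
 [ ] 0003_add_discount
"""
print(parse_showmigrations("shop", sample))
# [('shop', '0001_initial'), ('shop', '0002_add_price'), ('shop', '0003_add_discount')]
```

The resulting list of (app, migration) tuples is exactly the format the replaces attribute expects, so it can be pasted into the recreated squash migration.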
Well, not exactly. If you have migrations providing initial data, you need to create a new migration for that, because recreating migrations from scratch doesn't recreate those data insertions. Or, even better, you can use fixtures, and in the docs you can see how to use fixtures both in database migrations and in testing, and that's perfect. Then you need to be aware of circular dependencies, because if your project is big and grows over time, you can have circular dependencies from one app to another and back, and this problem requires you to remove the foreign keys causing the circular dependency, create the first migration, restore the foreign keys, and create a second migration; this way you will hopefully solve it. Now, let's test performance after all of this, after another week spent on the project trying to tell your boss, oh, I'm working on something useful, I promise. And yeah, of course, after recreating everything from scratch, our database creation task took five seconds instead of 20. That was perfect. Yeah, it was perfect, but does this apply to everyone?
It depends, because if you have really big projects and you are paying by the minute for your CI/CD workflows, and you are on the verge of moving from paying $3-4 per user per month to $20-something per user per month, then maybe you want to stay on the cheaper tier, so that could be a solution. But if you just want to bring order to your migration files, then just use squashmigrations without everything else. Or if you want to speed up tests on your localhost, you just need to use --keepdb, and everything is fine, without having to spend, in my case, two weeks working on this just to save maybe a couple of seconds on your project. So it depends on your use case. And we are done. If you want to see the example repository, it's there, with three different branches if you want to compare them on your local machine, and I uploaded the slides to the FOSDEM website, so they are there if you want to take a look. Thank you very much. Okay, we have time for quite a few questions. I see one up there. Given your salary, and these two weeks of work you've done, how many years of Enterprise licenses did you avoid? That's a nice question; hopefully my boss doesn't ask me that, but I think we could have paid maybe a year of this, I don't know. But yeah, it was fun to play with this, and for me, at least, spending two weeks trying new stuff and discovering hidden stuff in Django was worth it. More questions? Good question. Yeah, thanks for the great talk. I was wondering if you looked into using seed databases for CI, so that... Sorry. Yeah, you can't hear me? No, I didn't hear you, sorry. Whether you looked into seed databases for CI, so that you run your migrations locally, then dump the database, and then use that database during CI to start off with a pre-migrated database. No, I didn't think about that. It's a good idea, so you just upload your database dump, and then on your...
Yeah, so you just set up your CI script to use that database when it initializes. That could be a good idea. I need to try that, thank you. So you restore the database and just apply your last migrations, without having to apply everything. Yeah, exactly. Yeah, that's a good idea, thank you. Thank you. I was also wondering: if you're using Postgres, for example, you can disable fsync, which will basically keep the database in memory, so that would probably be a solution for big time savings. So locally we kept the database in memory; the problem was in our CI/CD, so we created a service in the workflow files, and that was creating a database from scratch. But it's just a configuration you can add on the Postgres side in the CI... We had to consider the time for storing and restoring the database with that configuration from the cache. So there was a little bit of time for that, but yeah, that was an option I tried. More questions? So, very cool talk, I like your method. I basically came up with the same method about five years ago. Do you think there's an opportunity to create a tool to automate some of this process? Well, that's a good question. Maybe implementing it in squashmigrations in some way, I don't know. We can try to do it, just to save another two weeks of salary for other people. Okay, I think we're done with questions, so we're going to have another five minute break and then continue with the next talk. Thank you.
How can we trust 3rd party code? Using Python to understand the trust relationships within the Python ecosystem
Hi everyone, good afternoon. You're here in the Python Dev Room, and we're going to have Nigel Brown speaking about how we can trust third party code when we use Python. Many times we don't realize all the dependencies that come along when we install a package from PyPI, and Nigel will be talking about how we can avoid getting some dodgy packages and things. Thank you very much, Nigel. Thank you. Right. Okay. My name is Nigel Brown. I've been programming since 1981, as a kid; I got a job about 12 years later. I've done mobile devices, security, data, lots of different languages. I currently work at a company called Stacklok, where I'm doing some data science and some engineering. If you're interested in the supply chain, and frankly who isn't these days, you'll love Stacklok; you should check them out. This talk covers some of the ideas that we've been grappling with there for the last nine months or so. Okay. Here are some supply chain attacks, recent examples. I don't know much about these attacks; I'm not a security researcher. Every time I read about one, I feel vaguely uncomfortable: these are things that could apply to me, on the whole. And this is why we're looking at these things, and the flames show that they're scary things. Okay. So, recent lawmaking and legislation: we've got Executive Order 14028 in the States, and cyber resilience proposals. The EO pushes SBOMs. What's an SBOM? A software bill of materials. It's probably a bit too much detail to go into right now, so look it up; there are tracks over in the other building about this. SBOMs are more of a first step than a solution, but they're a step in the right direction. Creating them sounds simple, but the practicalities get in the way, and doing something with them is still more of an art than a science. They are progressing. The key point is that the responsibility for the security of your code is shifting towards vendors.
That means it's shifting towards you, on the whole. There are some more scary flames there, because that's quite scary. Okay. Supply chain attacks aren't new. It all boils down to who and what you trust. The key point, really, is that insecurity most often comes from behavior rather than the technology. Why are supply chain attacks becoming more fashionable? Maybe it's because they're easier than they used to be; perhaps everything else got harder. Or perhaps they were always there and we just didn't notice. I don't know the answer, but there is a lot more focus on them these days. So, a word on trust. Basically, we want to trust some third party code. That circle represents us: we're the victims, the stakeholders, the developers. The supply chain is how this code actually gets to us. We generally get code delivered as some form of package, and the source and the package have to live somewhere. Sometimes they live in the same place, as in Go, which is a very good example; sometimes they're in different places, some other package repository. These can be private, but we're talking mostly about open source. Important point: we have to download it. These are all potential failure points for the software supply chain. Of course, we have multiple versions; they're changing all the time, they're a moving target, and there are normally tags in a source repository that point to the different versions. And these are delivered as a bag of files to us, on our laptops or our servers. At this point we can scan them: we can do vulnerability scanning and we can do static code analysis. We should do that, definitely should do that. And the code has owners. The point here is that you can't really trust code; it just is what it is. It's the owners you're trusting. And the question we're faced with a lot of the time is: do we trust the right people? And it's not just the code owners; there are multiple other people, contributors.
And we trust those people because of their reputation. Reputation comes from several sources: it comes from various media; from personal knowledge, you might know some of the developers; and quite often we trust in a community of one sort or another. Companies have reputations too, sometimes good, sometimes bad. How do you trust a company? If you've got closed source, that's the only trust you've got, actually. The web of trust here is building up. Now, "turtles all the way down" is an expression of infinite regress. I heard it once and thought it would be a good metaphor for this stuff. It turns out, while I was looking for an image, that Cole Kennedy thought the same thing, so I nicked his image, because it displays this quite well. The average medium size project has about 1,500 transitive dependencies: you depend on something, and it depends on other things. You can investigate one package at a time; you can look at its origins, you can look at the people, you can perhaps do a code audit. But doing thousands of them is hard work; it would just take too long. So we probably want automation to help with this, and that's one of the things we're working on: trying to give this thing some oil to keep it going. So, this web of trust, the supply chain, can be attacked at any point here, and it can break at any point; it doesn't even have to be attacked. And the main point: there are thousands of ways you can draw this diagram, it doesn't have to be like this, but there is complexity there, and it's messy. So what do we do about this mess? Okay. So, what we currently do: we really like to see CVEs, and that's because we can count them, and we can fix them, and we can show improvements. They've been guilty of a little bit of misdirection, actually: in reality, only about 2% of these are exploitable, so if you're not careful, you end up doing a lot of work that you don't actually have to. This comes from a Red Hat report.
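The blow-up from direct to transitive dependencies described above is easy to sketch: each package you trust drags in everything it trusts. A toy breadth-first walk over a made-up dependency graph (the package relationships below are invented for illustration, not real PyPI metadata):

```python
from collections import deque

def transitive_deps(package, graph):
    """Collect everything `package` pulls in transitively.

    `graph` maps a package name to its direct dependencies; the walk is a
    plain BFS with a visited set so shared dependencies are counted once.
    """
    seen, queue = set(), deque(graph.get(package, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    return seen

graph = {
    "myapp": ["requests", "flask"],
    "requests": ["urllib3", "idna", "certifi"],
    "flask": ["werkzeug", "jinja2"],
    "jinja2": ["markupsafe"],
}
print(sorted(transitive_deps("myapp", graph)))
# two direct dependencies already fan out to eight packages to trust
```

Even this tiny example quadruples the trust surface; at the roughly 1,500 transitive dependencies of a typical medium project, manual auditing of each one stops being realistic, which is the case for automation the talk makes.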
I've seen other estimates of this 2% value, and they are similar sizes. Okay. Another thing you can do: static code analysis. Currently, it's mostly signature based. It finds things that we've found before. We think there may be more legs in perhaps grabbing the source, grabbing features from the source code, and running it through a neural net. And this may or may not be more effective. There's lots of research out there, but there's still lots to do. We think we're going to be doing some of this work ourselves at Stacklok. But that's more for the future. Criticisms aside, we should definitely do CVE monitoring and static code analysis. So don't take anything I say here as an excuse for not doing these things. Okay. So another idea is to look at metadata rather than the code itself. That would be descriptions of the package, links to the source repositories, activity around it, et cetera, et cetera. This is a bit like classic security traffic analysis, or perhaps fraud detection in banking. We're looking for behavior around the package rather than the actual code itself. Okay. So this is a graph. Basically, malicious packages look different from non-malicious packages on the whole. The ones on the left, these little blue dots, are malicious packages. The ones on the right are non-malicious packages. They're surrounded by a nice bunch of purple users and orange source repositories. You can't see that probably in any detail from where you're sitting. The point is they look different most of the time. Sometimes you get good packages over here that are sort of isolated, and you get malicious packages over there that are well connected. So the malice isn't always apparent; some of the malicious packages look fine. But most malicious packages don't make any effort to hide the fact that they're malicious. If you look at their metadata, it's quite obvious something is off. There's no description.
There's no effort put in at all. Unfortunately, a lot of legitimate packages look like that as well, which makes it a little bit harder. We started off by putting a neural net on this: we tried a classifier, and we classified into malicious and non-malicious packages. It worked beautifully. But so what? You don't really need a neural net to tell the difference between those two things. You just need to look: has it got any data associated with it? So, not necessarily very fruitful. We don't need a neural net. Instead, we did a simple score. It looks at malicious packages, mostly Python; we've just started with some Rust and npm as well. We looked at the activity and the provenance. I'll come on to that a bit later. We normalize it against the whole set of packages that we ingested. You can see here that most of the malicious packages — these are just malicious packages — scored really low. So hey, it looks like we can spot malicious files using the metadata. Not so fast. Unfortunately, the base rate let us down. As I mentioned, we do get low scores for malicious packages, but we've got at least 10 times as many good packages that score zero as well, which isn't great. So if we get a low score, it means we've got maybe a one in 10 chance of having found a malicious package. We don't know for sure one way or the other. So you've got to go on to your code analysis then. And also, I should point out, this isn't a representative sample. We don't have a labeled data set of all the malicious packages in the PyPI repos, because we haven't found them all yet. So we sample as best we can, but we don't know. Does that handicap matter? Probably not, because most of the packages we actually want to use are probably on the far side of the scale. They do have a good description, they do have good information, and they are linked up. There are some exceptions.
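The metadata-scoring idea he describes can be sketched as a toy function. This is purely illustrative: the field names and weights here are made up for the example and are not Stacklok's actual algorithm.

```python
# Toy metadata score: packages with a description, a linked source repo,
# contributors, and regular releases score higher. Illustrative only.
def metadata_score(pkg):
    score = 0
    if pkg.get("description"):
        score += 1
    if pkg.get("source_repo"):
        score += 1
    score += min(pkg.get("contributors", 0), 5)  # cap the contributor bonus
    score += min(pkg.get("releases", 0), 5)      # regular releases help too
    return score

suspicious = {"description": "", "contributors": 0, "releases": 1}
healthy = {"description": "A real tool", "source_repo": "https://example.org/repo",
           "contributors": 12, "releases": 40}
print(metadata_score(suspicious), metadata_score(healthy))  # 1 12
```

As the talk notes, the catch is the base rate: plenty of legitimate packages also score low, so a low score is a flag for further investigation, not a verdict.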
All right. Okay. We act like this currently: vulnerabilities are all that there is, and they're all deadly. Okay. This creates a lot of work for everyone, as I mentioned earlier. We're only really worried about things that can hurt us, right? And the reality is that it's more like this. Most vulnerabilities can't actually hurt us. We should use things like OpenVEX — a sort of emerging standard — to describe the vulnerabilities that are actually exploitable in place. And then we only have to deal with the shaded bit between the two circles there. Obviously, you want to fix all vulnerabilities, but there's a prioritization system that we can employ here. Another thing to note is that malicious code doesn't always use CVEs, and there are other things that can hurt us that aren't CVEs. Malicious code leverages bad habits, like leaking keys and manual processes. We've got abandoned code that gets taken over and isn't updated. But bugs and bad habits and abandonware can also hurt us accidentally without being malicious, right? Malice isn't everything. So, we want to avoid all these bad things. And most of the things we actually want to know about are hidden from us, right? The malicious code is hidden by stealth. Buggy code is hidden by incompetence or apathy. And since we started patching CVEs, bad actors have moved increasingly to zero-day exploits. All right. And let's remember, most code isn't malicious. When we look at the metadata, buggy, poorly maintained, abandoned, and malicious code all look similar. And since we can't tell them apart, you have to ask yourself the question: do you really want to use any of them? So, given that this is a hard problem, why not do something simpler, right? Which is to invert the question. Look for the good, not the bad. It's like looking after your health instead of focusing on disease.
So, the good bits are everything outside the circle, right? We want all the rest of the code. And for the Rust developers who insist that code can be finished, it's this bit as well, the abandoned bit. Right. So, what does this look like? We want things that probably don't hurt us. So, this is the inverse of what we just had: good coding and hygiene habits, active development, regular releases, developers we trust, and so on. Code that's in the clear. And the key point is that looking for good things is easier, because it isn't hidden. Okay. Right. So, I mentioned provenance. The first challenge is provenance. If you're going to do anything with any of this code — if you're going to scan it, do whatever you like — you need provenance. Provenance means origin, right? We need to find out where the code came from. Star-jacking is when a package lies about its origin and pretends to be a better package than it is. You'll find that lots of different packages share the same source repository in the package systems. It's very common. How do we find provenance? Okay. So, remember the executive order mentioned earlier required SBOMs. SBOMs are basically a shopping list of all the ingredients of your piece of code, whatever it is — operating system, game, package. It's like a shopping list. It's a document of provenance, is what it is. What you put in an SBOM isn't quite standard yet, but it's becoming more standard. There's lots of work going on with standardization. OpenSSF — there's a track over in the other building that covers this. It's where we probably want to go: we want to be able to record these things strongly. Now, if you've got an SBOM, you want to put it somewhere safe. You don't want people tampering with your SBOMs. So, a thing that's becoming more common is Sigstore. This is artifact signing: signing artifacts and storing the signatures in a transparency log. It's a distributed ledger.
It gives us cryptographically strong provenance. It circumvents most of the problems with delivery that we've got. And there's a sort of convergence on this. It's being used more and more in the community. I think it's where we're going to end up, and it does solve a lot of problems. But the fact is, at the moment, most code isn't signed, and I think it will be a few years before it is. And then there's historical provenance. That's a Stacklok thing. Okay, so basically we take a bunch of tags from the source repo, we take versions, and we see if we can match the dates. And if the dates match up, then we say it's got some provenance. It's a statistical process. Quite hard to fake. There's a whole video on that on our website if you're interested — videos and blogs and things like that. I won't go into that any further here. Right, so just because you've got some code with rock-solid provenance — you know where it came from — there's really no shortcut way of saying whether it's any good. The old-fashioned ways are the only ways: you test it, you measure it, SCA again, code review. That requires the provenance, of course, because you don't want to be reviewing some other bit of code that doesn't apply to your package. And you become intimate with it. And with all those turtles and packages, intimacy takes a lot of work. Right, but we've got a community of people. So to make this viable at any scale, you want to share the work with the community. And also we want to automate this, because you don't want to have to be on email talking to people all the time. All right. Okay, I mentioned reputation a couple of times. So, the reputation of the people and the companies that we're talking about — what do we know about someone? Perhaps we know them personally. We know the size of a company, mostly, but we don't know much about them internally.
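The historical-provenance idea — matching source tag dates against package release dates — can be sketched as a toy function. This is an illustration of the concept only, not Stacklok's actual implementation; the tolerance window is made up.

```python
from datetime import date, timedelta

def provenance_match(tag_dates, release_dates, tolerance_days=2):
    """Fraction of releases whose date falls near some source tag's date."""
    tol = timedelta(days=tolerance_days)
    hits = sum(
        any(abs(release - tag) <= tol for tag in tag_dates)
        for release in release_dates
    )
    return hits / len(release_dates) if release_dates else 0.0

tags = [date(2024, 1, 5), date(2024, 2, 1)]
releases = [date(2024, 1, 6), date(2024, 3, 1)]
print(provenance_match(tags, releases))  # 0.5 -- one of two releases lines up
```

The appeal of this statistical approach, as the talk notes, is that an attacker claiming a repository they don't control can't easily make years of historical dates line up.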
We guess and we hope, and, you know, do we even care? The executive order says that we do, apparently. So that's where our reputation currently comes from, I think. Where should it come from? It should come from prior art, participation, recommendations. Generally, we want some proof. And generally, we want to automate this. Okay, so, the key points. Once again: look for good things, they're easier to spot. You don't trust code, you trust people. Trust is complex; it can break in many places. Reputation is important. Communities can share work. And automation makes this possible at scale. Shameless plug: that's the kind of stuff we're working on at Stacklok. We're open to ideas. Try our tools; they're on the website. Join the conversation on Discord. The source is open where it can be — not all of it yet. And that's the end of the presentation. So, any questions, please? Great presentation, Nigel. Thank you very much. We have time for one question now. There. I'm coming to you. One second. Thank you for the talk. Maybe I missed some subtlety, but what does your product Stacklok do exactly to apply all you said? Does it scan your packages and say where the vulnerabilities are, or how does it work in practice? If you go to trustypkg.dev, you'll get to a web portal, and you can type in the name of a package and it'll give you a score. What we're doing is increasing the number of facets the score is based on. We've got provenance measures in there, and we're going to be doing a reputation engine for it as well. So, there's a website and you can go straight there. To bring this to the developer, there's a VS Code plugin: as you type along and import a package, it'll put a squiggly line underneath it and say, yeah, this has got a low score. Obviously, some of the low scores are absolutely fine, but it gives you an indication that you've got to do some more investigation.
There are ways around most of this stuff, but it just gives you flags. But yeah, go to the website. It's fairly intuitive. You don't need instructions for it. Cool. Thanks for the question. Thank you very much, Nigel. Please feel free to reach out to our speaker after. Thank you.
Match all things Python: Parsing structured content with Python's new match statement
Good afternoon. We have now Marc-André Lemburg. He's the CEO and founder of eGenix. Not only that, but he's also a CPython core developer. He's one of the organizers of EuroPython and a EuroPython Society Fellow, and he's been making many contributions to Python. So, yes, we have this pop star here. Now, he's going to talk about Match All Things Python: parsing structured content with Python's new match statement. Thank you very much, Marc. Thank you. And thank you all for coming. The reason why I'm doing a talk about the match statement is that I'm getting the feeling that it doesn't receive enough traction. So, I wanted to know from you: how many of you know the match statement? How many of you have actually used the match statement? A lot less. Yeah, that's what I thought. So, maybe a short introduction. Tatiana already mentioned a couple of things. I did a lot of stuff in Python. I've been working with Python since 1994, so a very long time. I did lots of things in core development: Unicode, the DB-API, the platform module. I'm based in Germany. If you have a need for, I don't know, a senior software architect, then please contact me. But that's not the point of this talk. The point of this talk is to show you this. So, this is the match statement that you have in Python. And it's actually a very, very useful thing, especially if you want to parse structured data. Now, the match statement itself is actually quite complex if you look at all the details. And I'm going through all the details in this talk. There are so many details that I have to rush a bit, unfortunately. And I'm not going to be able to show you live demos or anything, because I simply don't have the time for that. So, let's just head right in. So, what's the motivation behind the match statement? People wanted to have something like a switch statement, as you probably know from C or other languages, for a very, very long time.
I wrote a PEP a very long time ago which basically suggested adding something like that to Python. It was rejected at the time, so it took another 20-something years for something like this to actually make it into Python. What we now have with the match statement is a lot more powerful than the switch statement. You can do not only matching on literals, for example, but you can also do matching on types. You can match all kinds of things, including conditions that you apply to these things. You can combine all of these things. You can also do parsing and matching at the same time, which is quite useful, so you don't have to have two passes: first, to figure out whether something is actually valid, and then a second pass to figure out how to actually use the data that you have there. It all started in Python 3.10. That's more than two years ago. But, like I said, it hasn't received that much traction yet. So, what you see here — or maybe you cannot see it — is a graph from py-code.org, which is a very nice site. If you don't know that one, you should go there and have a look. It basically scans all the PyPI code and then does analysis on that. The maintainer did an analysis in July last year and looked at various features of the language, whether they were being used in the packages on PyPI or not. As you can see, in July, there were only 2,600-something packages on PyPI using the match statement. That's two years after the release, and it's only 0.55% of all the packages, so it's next to nothing. I guess one of the reasons for that is that the documentation for the match statement is not all that great. I'm talking about the official Python documentation. There are many blog posts about it, and many other resources and overviews that you can tap into. But the Python documentation for the match statement is not ideal.
What you have is these three PEPs, and this is basically the best that you have in the official documentation for Python. If you want to get into these things, then I would suggest starting with PEP 636, which is a very nice tutorial-style introduction to the match statement, and then you can go to the other PEPs for more detail. So, how does it all work? We're going to have a look at this example, and I'm going to go through the various different parts of it. So, the first part is the match object itself. This is what you want to match, what you want to analyze. The next thing is what you have behind the case statements in there. Those are called match patterns, and there are quite a few of those. I'm going to go through a list of the patterns that exist. Then, of course, you have the match code. This gets executed when one of those case patterns actually does match. And then you have something called capturing variables. I'm not going to explain what that is now, because I have a few slides on those. This is basically a way to store the data that's being matched in a variable. Plus, you have something that's a bit strange, which is just the underscore. These are non-capturing wildcards. It's basically like the else in an if-else statement. So, if the matching goes down, and as the last case you have one of these wildcard things, then this will always match. This is the way to do an else in the match statement. Matching itself is always tried from top to bottom, and the first match wins. So, the order in which you list these case statements is actually very important. There's no fall-through, like in C. How many of you know C? Well, quite a lot. That's good. So, you don't have that, because in C you can easily make a mistake.
If you forget a break, for example, the code behind the next case just falls through, and then you execute code that you probably don't want to execute. So, let's have a look at these pattern types that we have. Like I said, there are quite a few. I'm going to go through them rather quickly. The first one is the literal. You can just write a literal string or a number, an integer, a float. It can also handle a couple of special singletons, like True, False, or None. Not many more. If you have something else that you want to match, and you don't want to write it down as a literal, you can use a variable kind of notation for that. So, if you have some other value, you put that into a variable that's accessible to the match statement. And what's very important is that you have a dot in that reference. The reason for that is a bit strange: the match statement also works on types, and in order to differentiate between type names and variable names, the match statement and the parser need some kind of hint, so that they know what they're dealing with. And the dot is that hint. Now, the next two types are sequences and mappings. They look very natural to a Python programmer. For sequences, you just use the square brackets or the round brackets, and then you match a sequence. What's not necessarily intuitive about this is that this actually matches sequences, not just lists or tuples. So, if you write something in, for example, the tuple notation, and then you pass in a list as the object that gets matched, the tuple case will still match in your match statement. So, that's a bit of a gotcha. You have to watch out for that. And it's similar for mappings. You write them like a dict kind of notation, but it actually matches all kinds of mappings, not just dictionaries. There are ways to just match dictionaries.
I'm going to show them. You can also match, like I said, different types. The very simple ones are all the built-in types that you have there. There's support for user-defined classes. You have to pay some attention in user-defined classes to the order of the arguments that you have in there. I'm going to talk about that in a bit. What's very important are the parentheses. If you don't put parentheses behind this, then the match statement is going to treat the name that you have there as a variable, and very often as a capturing variable. So that's another gotcha you need to be careful with. Of course, you can nest all these things. You can combine all the things that I just mentioned in various ways. There's an OR combination with the pipe character. And to make things even more complex, you can add guards to these match patterns. So you can say, for example, down here: this is a sequence (a, b), and it should only match if the value a in that sequence is above 10. So you can write very complex things in those match statements. And then finally, you have these wildcard patterns. I mentioned those already. There are two types of wildcard patterns. One is the anonymous, non-binding one, which is the underscore. And the second one is where you put something at the bottom of your match statement and you just assign a variable to it. I often use "unknown" for this, because it just makes sense. If you read the code, you can easily understand that this is actually something that matches anything, a bit unlike the underscore. I'm not too much of a fan of this underscore thing. Right, so now let's have a look at the capturing variables. Like I mentioned in the beginning, the nice thing about the match statement is that you can actually combine the matching and the parsing.
So whenever something matches, Python will put the matched value into a variable that you define, which is very much like, for example, the as notation that you have with context managers. There are two forms of this. One is an explicit form. I put an example here. What happens is it matches a list, and if the list type matches, it will put the value into the variable sublist. Then you can use that variable in your other matching code, or in the actual code that you want executed for that particular case. Very easy to understand. It's a bit more verbose, but it always works, which is nice. And then there's an implicit form. This can cause some problems, because it introduces some of these gotchas. The way this works is that instead of putting literals in these sequence notations or mapping notations, you put variables in there. And what happens is that implicitly — for example, in the first example up there — the first entry in that sequence will go into a, and the second entry will go into b. And then you can immediately use a and b, for example, in guards or in the code that comes afterwards. And these things are actually bound variables in your code. This works very well if you have well-defined variable names. If you don't, you can get into lots of trouble. So using short names is probably not a good idea. They should be very explicit. This also works with some of the built-in types, not all of them. I think this is actually a full list of all the ones that support it. It does work with classes that you define, but you need to have a look at this PEP for the details. There are some special attributes that you have to define in order for the parser to know in which order these variables should be assigned. Unfortunately, it doesn't work with ABCs, but there are workarounds for that.
So if you work with ABCs — for example, if you want to test whether something is a float or an int, and you want to put that kind of logic into an ABC — then there are ways to still make that happen. There are some things that don't work with the match statement. Some are a bit unfortunate, because, for example, in a scripting shell language like bash, a very common use case for matching is regular expressions. Basically, you have a case, and then you put a regular expression there to match how the string should look. This is not supported directly. There are ways to work around this. I'm going to show you a reference later on where you can find how to do this. Something else that doesn't work well is set membership matching. There are ways, again, to work around this. You can use a guard to do the set matching: the guard works by having the wildcard, so it always matches, and then the guard does the actual check whether something is in a value set. Or you can use the OR pattern. But the OR pattern is sequential, so it's not really efficient. Optimizations haven't been done yet, which is a very common theme in Python. First, something gets implemented so there's something to work with, and then, over the next couple of releases, people worry about performance and improve it. That has happened a lot in Python's history. It's probably going to happen for this as well. So, I talked a bit about the gotchas. I just want to reiterate some of them. This one I already mentioned: if you use the tuple notation or the list notation, and you think that it's just going to match a tuple or just a list, you can easily get this wrong. If you want to do this explicitly, then you actually have to use the type notation: you write list or tuple, and then the sequence that you want to match.
You have the same issue with the mapping types, so you have to pay attention to that as well. Another gotcha is the wildcard pattern. You can only use the wildcard pattern at the very end of the list. If you put something like it at the top of the list — for example, if you start with case and then wrong_values — then, because wrong_values is a capturing variable, it's regarded as a wildcard case, and so it will match anything. The parser will actually complain about this; it's not valid Python. However, if you put a guard with it, then you can use it, which is probably in order to make certain workarounds possible. I don't really know the reason why this works. It's a bit strange. And then the parentheses. If you look at this code — if I hadn't put an arrow there, you probably wouldn't have seen it. What I did there is I put dict there, meaning that I want properties to have a dict, a dictionary value, and I want to match that. But I forgot the parentheses. So what's going to happen is the parser is going to regard this as a binding — sorry, capturing — variable. So it's going to put the value into dict. And then not only is it going to parse incorrectly, because it will put any kind of value that you have there into this dict capturing variable, but it will also bind dict to this value, possibly breaking code that comes afterwards, because you can no longer access the built-in dict. So this is something to watch out for. And finally, this is the talk that I wanted to mention: Raymond Hettinger. Who knows Raymond Hettinger? Not that many people. That's strange. You should definitely look him up. He has done so many good talks, it's just incredible. If you want to learn something deep about how Python works, he has all the talks in his stack. So definitely have a look at that. He did a great talk at PyCon Italia 2022, also on pattern matching.
And he shows a lot of tricks on how to work around some of the deficiencies that you currently have in the match statement. So I was actually faster than I thought, so I'm done. This is always my last slide: never stop learning. Always learn new things. Always try out new stuff that comes out in Python. And I hope this talk will make you have a look at the match statement and maybe use it more, because it's actually quite useful. Thank you. Thank you, Marc. So now it's time for questions. I can see a few people with their hands raised. I will start here, and we will go up. So we have four people, at least. In one of your first examples, you first had to check whether this is a list, like with the list in the parentheses. And then two cases later, you are trying to match against the sequence. That means that this will only match if it's a sequence, but it's not a list, I guess. Like on your first slide, literally. The first one, like this one? Yes, this one. So in the third case, it will match if the thing is a sequence with three elements, but that sequence is not a list, because otherwise it would have gone into the first case. Is that correct? Given this one, yeah? Yes. Since you have a case list — oh, yeah. Yeah, so you're right. What happens here is that this will always match for lists. So if you put in a real, true Python list, then you will always go in here. If you have defined your own kind of sequence that's not a Python list, only then will it drop down here and parse here. And as a follow-up question: what happens if you put a generator in there? Can you match against generators? Because then you would kind of mutate the element while matching the case. Would that work? This is a good question. I think if you put a generator in there, it will actually match the generator type and nothing much else. It won't actually call the generator to give back any values.
But it's a good question. I'm not really sure. It probably works like that. Hi. Thanks for the great talk. I had a question regarding the caveat you gave at the end regarding the dict. Is there a proper way to do it, like putting parentheses, or is it not possible to match a type inside of a hash map like that? Let me just find the slide. This one, right? Yeah, that one. So what was the question? So here you put the dict, and you said that, of course, it will overwrite, let's say, the Python dict. Would it be possible in that case to put parentheses to match the type here? Yes, of course. And the code is actually written in a way where this would have been the intention, right? So the intention was that properties — well, it's matching a mapping, right? So if you put in a mapping that has properties as one of the keys, and a dictionary as the value, then this will match. Without the parentheses, it will match any mapping that has a properties key, but not actually look at the value, and simply put the literal value into the variable dict. That's what happens. OK, I think I see you up there, right? Yes, hello. I was wondering — this capturing variable can sometimes lead to ambiguity. So I was wondering how well this would work with the existing typing system, where you would, for example, have an object like dict that represents the type. So that is something that I did not really cover in here, but perhaps you noticed that the syntax being used here is actually somewhat different from the type annotations that you have in Python. Those are two distinct systems working here. These types that you have here are actual Python type objects that you work with, whereas the type annotations are used by, for example, mypy or other static code analysis tools to figure out whether something is correct or not. So this actually happens at runtime.
I don't know if that answers your question. Well, sort of, I guess. So you can't really put the typing types in here, let's say, because there are generics in there, of course, that would be highly convenient for matching. Right, right. I think that, in typing, you do have some actual Python type objects. Those you can use in here. But you cannot use the type annotation kind of syntax, for example, for matching an integer or something, yeah? No, it doesn't make sense, of course. That doesn't work. Thank you. Do we have any more questions? We have time for one last one. Yes, we do. Oh, my God, we have two. I'm going to the right side, because we haven't had many questions from there. I'm coming. Let's go. Thank you. So, yeah, maybe this is wishful thinking, but how difficult would it be to implement or provide a match that will match not in order, but give me the best match? Would that be possible? Because, for example, I'm working on code generators for wrapping C APIs into Python, and sometimes you can't do that. And from C++ you've got function overloading. So I can think, OK, I can bring function overloading to Python and translate that to a single function with a match on the different signatures. However, I would need to know which is the best match for each case in order to order the match statement. Would it be possible to have that kind of logic embedded in Python, or is that too much wishful thinking? You can try to do this by ordering the cases from the longest match to the shortest match. But apart from that, I think this is actually a hard problem that you're describing there. Because if you want to figure out what the best match is, then you actually have to go through all the different cases that you have in here, and that's going to have different semantics than what you have now in the match statement.
Usually the problem is knowing which is the most concrete type. The problem I have most often is knowing which is the most concrete type relative to the base type, so that it matches the most concrete one instead of the base one, because it could match both. In C or C++, it will always match the most concrete one, and if that's not there, it will fall back to the base. So right now in Python I have no idea how I will solve that when I'm wrapping APIs. You can do that by ordering, like I said: you can order the case statements from the most, let's say, abstract one to the most concrete one, and, sorry, the other way around, from the most concrete one to the most abstract one. Then, like in the example I just gave where you have a list: if you pass in the Python list object, it will match the first one. If you pass in, in this other example that I had here, let's say a user-defined sequence, then it will drop down and match that one, because that's more abstract, right? Thank you very much, Mark. Another round of applause for Mark. Thank you. Thank you.
Python 3.12's new monitoring and debugging API
It's time. So thank you very much to Johannes Bechberger. He will be speaking about Python 3.12's new monitoring and debugging API. For those who were in the previous talk, there was a brief mention of the profiling features. Johannes is a JVM developer working on profilers at SAP. He also blogs about profiling and debugging topics. Thank you very much. Thank you for introducing me. Before we start, I want to introduce you to the concept of debugging, because I'm sure none of you have ever debugged. The first bug that was ever found was in the 1940s, when they found a moth that was stuck between the relays, and it made the whole system crash, because in the olden days these machines ran on relays. As Edsger Dijkstra once wrote: if debugging is the process of removing software bugs, then programming must be the process of putting them in. As I'm sure all of you are doing lots of programming, I'm sure you're also doing lots of debugging. So that's why we're here. Consider this example program. It's a line counter: it counts the lines in a file, in this example in itself, and it returns zero. And we're like, why? That's the problem, because the file actually has 26 lines. So let's look at the code. I've obscured the code a little so you don't immediately see what it's about. The idea is that we work through this using a debugger, because a debugger is great for understanding our system. And the cool thing is that with the new APIs we get in Python 3.12, writing a debugger is far easier and far faster, as I'll show you in the following. I'm Johannes Bechberger, as you already heard. I work on SapMachine at SAP, which is the third biggest contributor to the OpenJDK, the major Java runtime. And I started talking to people about Python because I also like Python; it's a bit easier a VM than the JVM. The question now is: why do we need a monitoring and debugging API? Because I'm from the Java world, and in Java we have a built-in debugging API.
So we have the ability to set breakpoints, to ask for values, to walk the stack, everything. But in Python, does the Python interpreter know about the concept of breakpoints? Since I'm here with a few of you: who of you thinks that the Python interpreter knows about the concept of breakpoints? Please raise your hand. And two of you think so. Who thinks it doesn't know about the concept of breakpoints? Okay. It's a trick question, of course. No, because otherwise I wouldn't be asking. So it doesn't know anything about breakpoints, which is not a bad thing. So, any ideas how we could implement it? The first idea that came to my mind was: we have this code, this is actually the code from the example, and we just place a debug statement in front of every line, a dbg method that we define somewhere. The idea is simply that in the dbg method we check: are we currently at a breakpoint in this file, on this line? If yes, we open some kind of debug shell. If you've ever used pdb before, that's essentially the kind of shell, the pdb shell, we could be opening. But the question is: how do we get this file and line? And the easy answer is that we have the sys._getframe method. It has an underscore, and the important thing is that it has an underscore because it's kind of a CPython implementation detail, which is great. Because of that, it's pretty slow in PyPy. But we have to live with it, because that's the only way we can walk the stack. We've seen before that we can do some eBPF stuff, which is nice, but most profilers and most debuggers don't do that. So the idea is: we have our stack, at the bottom is main, then count_lines, then is_code_line, and then our dbg method. Essentially we can ask _getframe for frame zero, the top frame, but because we are currently inside the debugging method itself, we ask it for frame one instead.
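The manual approach just described can be sketched in a few lines. This is my own rough sketch, assuming a hypothetical BREAKPOINTS table; the function names mirror the talk's example but are not its actual code.

```python
import sys

# Hypothetical breakpoint table: (filename, line number) pairs.
BREAKPOINTS = set()

def dbg():
    """Check whether the caller is currently at a breakpoint."""
    frame = sys._getframe(1)  # frame 0 is dbg() itself, frame 1 is the caller
    where = (frame.f_code.co_filename, frame.f_lineno)
    if where in BREAKPOINTS:
        # A real debugger would open an interactive shell here, e.g.
        # code.interact(local=frame.f_locals), much like pdb does.
        print("breakpoint hit at", where, "locals:", frame.f_locals)
    return where

def count_lines():
    n = 0
    dbg()      # the instrumentation placed in front of every line
    n += 26
    dbg()
    return n

print(count_lines())  # 26
```

Note that sys._getframe is CPython-specific, exactly the caveat raised in the talk.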
We can also get the other frames, and essentially what we get is information like the local variables, the file name, the line number, and so on. That's quite nice, because it allows us to easily implement the debugging shell: we can just open a shell that contains these locals. And so that's how we implement our first dbg method. It's nice, it works, and we can even write some basic debuggers with it. The problem is that we want to automate this, because we don't want to put this dbg statement in front of everything. So how do we do this? First I'm going to tell you about the pre-3.12 way, so you know the pain points of debugger developers. The pre-3.12 way was sys.settrace, which is an arcane way to do it. The idea is essentially that we pass it a handler, and this handler is called multiple times. The handler gets passed the frame and an event type, which can be call, line, return, exception or opcode, and it is called at the corresponding times. So when we register a specific handler, this handler is then called at every call. Every time the method count_lines is called, it's called, and every time a method like is_code_line is called, it's called too. That's nice, but we want more: we also want a handler on every line. So we can return from this handler an inner handler that's called for every line, and it has the same signature. The idea is that we essentially implement our debugger here not by writing the manual dbg calls, but just by setting an inner handler that is called at every line. And that's quite nice, because it works, but as I'll show you later, it's quite slow. But it's okay. We can even go down to the opcode level, to the bytecode level, here. But the problem is, and the question here is: do we need a line event for every function?
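The sys.settrace mechanics just described can be sketched as follows. This is a minimal sketch of my own, with a hypothetical demo() function: the global handler sees call events, and returning an inner handler from it switches on per-line tracing for that one frame.

```python
import sys

seen = []  # (function name, line number) for every traced line

def line_handler(frame, event, arg):
    if event == "line":
        seen.append((frame.f_code.co_name, frame.f_lineno))
    return line_handler  # keep tracing lines in this frame

def call_handler(frame, event, arg):
    if event == "call" and frame.f_code.co_name == "demo":
        return line_handler  # enable line events, but only for demo()
    return None

def demo():
    x = 1
    y = x + 1
    return y

sys.settrace(call_handler)
demo()
sys.settrace(None)  # always uninstall the trace function again
print(seen)         # one entry per executed line of demo()
```

The cost the talk complains about is visible here: the call handler runs on every single function call in the process, whether or not we care about that function.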
Because we know that when we set a breakpoint somewhere, we only need to check the lines there. But for example, consider that we have count_lines here, and our user decides to add a breakpoint while we're in is_code_line. So there's a breakpoint in is_code_line, and in is_code_line the user decides: hey, I want to add a breakpoint in count_lines too. The problem is, if we haven't returned an inner handler for count_lines when it was called, we can't enable line tracing for count_lines anymore. So we have to enable it for every line of our whole application, which is kind of a mess. So this is slow. There were multiple ideas for how to improve this, and one of the best ideas, if you're a Python core developer, is to add a new API. This new API is defined in PEP 669, and it's really, really cool. The cool thing is also that this PEP is written in a style that you can easily digest. I come from the Java world, and this is not always the case with JEPs, the Java equivalent, so I'm quite happy that Python does things a little bit better. The PEP is called Low Impact Monitoring for CPython, and hopefully other runtimes will support it in the future. And it's been here since October. The idea is that we get more fine-grained support, that we learn from the lesson that having to enable the line handlers for every line is probably not the best thing. So typically, when we use this PEP, we define some shortcuts at the top. We alias sys.monitoring, because that's where the monitoring functions live, so we don't have to write it out all the time, and we call it mon; and mon.events is also a bit long, so we give that a shortcut too. Then we have to deal with tool IDs. The idea is that you can have multiple tools registered here at the same time, and for each tool we register some callbacks.
So what we do here in our example: we register callbacks for our tool. Our tool is a debugger; there are several predefined tool IDs, and one of them is a debugger, another one is a profiler. We register a callback for the line event, because we still want line callbacks sometimes, and we also register a callback for the start event, for when a function is called. The start handler is just passed the code object, which is what you get from a function via f.__code__, and the offset where it is located in the bytecode; the line handler gets the line we are in. The cool thing, as you see at the bottom, is that the line handler can return DISABLE. When we return DISABLE, the event is disabled from then on for this specific line, which also makes coverage testing easier. So yes, we enable the start events, and that's fine. Then we run our program, we get the start event for every function that we call, and every time we ask: hey, do we have a breakpoint in this function? If yes, we enable the line events, but only specifically for this function. And then for every line we check. The cool thing is that these line events are emitted per thread. With sys.settrace, events were only emitted in the thread where you set the trace function, typically the main thread. But here, events are emitted for every line in whichever thread the function is currently executing. And this is really cool. Łukasz Langa wrote in a PR discussion: the biggest opportunity of PEP 669 isn't even the speed, it's the fact that a debugger built on top of it will automatically support all threads and support threads properly. And with the incoming changes of PEP 703, making the global interpreter lock optional in CPython, this will get far more important.
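The registration flow described above looks roughly like this. This is a hedged sketch of my own (the tool name "demo-debugger" and the function f are mine), runnable only on Python 3.12+, where sys.monitoring exists, hence the version guard.

```python
import sys

started = []  # names of functions whose start event we saw

# sys.monitoring is the PEP 669 API, available from Python 3.12 on.
if sys.version_info >= (3, 12):
    mon = sys.monitoring
    E = mon.events
    TOOL = mon.DEBUGGER_ID  # one of the predefined tool ids

    def on_start(code, instruction_offset):
        # Called on function entry with the code object (f.__code__)
        # and the bytecode offset, as described in the talk.
        started.append(code.co_name)

    mon.use_tool_id(TOOL, "demo-debugger")
    mon.register_callback(TOOL, E.PY_START, on_start)
    mon.set_events(TOOL, E.PY_START)  # global: fire on every function start

    def f():
        return 42

    f()

    mon.set_events(TOOL, E.NO_EVENTS)  # switch monitoring off again
    mon.free_tool_id(TOOL)
    print("f" in started)
```

A debugger would, inside on_start, check its breakpoint table and only then enable LINE events locally for that one code object, which is the whole point of the API.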
Because then we will probably see multi-threaded Python applications, and then the old approach is just not usable anymore. So the idea is that we can enable events globally and locally, and the combination of both gives the enabled events per function. The cool thing here is that the power is in the fine-grained configuration. You can set events for a function f while this function is running. So consider this example: we have a simple line handler, registered as a callback for each line. Then in f you decide at some point: hey, I want to set the local events, I want to enable line events. And later you disable them again. And it works: here we print hello, and then we get, hey, we're at line 18, which is the line that prints n, and then we print n. And that's really cool; that wasn't possible before. That's really great, because it enables us to only enable line events for the functions where we need them. The question is, of course, what is fast? There are several methods in this PEP and this API. What's really fast is registering callbacks; we can easily switch out the callbacks, and also get the tool IDs. What's reasonably fast is setting local events, because what it does is modify the bytecode that the VM is executing. And what's pretty slow is acquiring the tool ID to start the debugger, and setting the global events, because this potentially modifies the bytecode of every function. So don't do it all the time; then it's fine. So, back to the debugger. We had our start handler and our line handler, and they look essentially the same as before. The only difference is that we enable the line events only where needed. There are different event kinds, because we've seen that line events are pretty powerful for implementing basic debuggers. One of them we've seen already: the start events.
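The fine-grained, per-function configuration just described can be sketched like this. This is my own condensed sketch, not the talk's slide: f switches line events on and off for itself while it is running, using set_local_events (Python 3.12+ only, hence the guard).

```python
import sys

lines = []  # (function name, line number) for every observed line

if sys.version_info >= (3, 12):
    mon = sys.monitoring
    E = mon.events
    TOOL = mon.DEBUGGER_ID

    def on_line(code, line_number):
        lines.append((code.co_name, line_number))

    mon.use_tool_id(TOOL, "line-demo")
    mon.register_callback(TOOL, E.LINE, on_line)

    def f():
        n = 1
        mon.set_local_events(TOOL, f.__code__, E.LINE)  # enable just for f
        n += 1
        n += 1
        mon.set_local_events(TOOL, f.__code__, E.NO_EVENTS)
        return n

    f()
    mon.free_tool_id(TOOL)
    print(lines)  # line events only from inside f, between the two calls
```

Because the events are local to f's code object, no other function in the process pays any line-tracing cost, which is the "low impact" in the PEP's title.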
There are resume, return and yield events for everything that you do in your Python application. And there are also events that you can't enable or disable directly, because they are controlled by other events. For example, you have the call event, which is triggered whenever you call a function, and then C raise and C return, which fire whenever an exception is raised in C code or whenever a C function returns. And there are of course also events that cannot be enabled locally, only globally; essentially the idea is that we cannot attribute them to a specific code location. The cool thing here, maybe you've seen, is that we have a new event called stop iteration. In this Python version, when returning from an iterator, we previously threw a real exception internally, but that's pretty slow, so we don't throw an exception anymore; to still be able to notice it when debugging, we have the new stop iteration event. Of course, what you'll be waiting for is performance, because besides the threading support, the performance is the thing that's pretty neat. So what I did: I looked around, and I found some people who had done performance measurements, but they were using Fibonacci functions, and I thought, that's a bit small, that's not representative. So I started looking into Python benchmarking suites, and there's this pyperformance benchmarking suite, and I just hacked it. I just wrote random code into it, because you can do monkey patching in Python, and it's great. In Java we have private functions and everything; in Python you don't have to care, and that's why you like using Python: you can do things that you're not supposed to do to get some results. If you want to know: I'm using Python all the time when fixing bugs in the OpenJDK, to write test scripts, because it's faster to do in Python than in Java. So some OpenJDK bugs were fixed because I wrote some weird Python scripts.
But essentially what I wanted to test was a minimal implementation of a debugger. A minimal implementation with settrace, so a debugger that doesn't have any breakpoints: just a call handler, and then an inner handler called at every line. The minimal implementation for the monitoring API doesn't enable any line events, because when we don't set a breakpoint, we don't need any line events. So that's how we implement this, but I thought that's a bit sneaky, because we're comparing something that triggers an event on every line with something that only triggers an event on function calls. So I also made a third comparison, with all the line events enabled, and it turns out that's still faster, which is quite nice. So I used this pyperformance suite, which is quite representative, and what I found is that it's really, really fast. With sys.settrace on, when you run all the benchmarks, you have a 3.5 times larger runtime. That's pretty slow. When you're using monitoring, you only have a runtime increase of a factor of about 1.2, so around 20% slower, which is pretty awesome, because it means you can debug all your tight loops, you can debug your whole application, without worrying about the debugger slowing things down. And when you enable all line events, it's still around 30% faster than with sys.settrace. People here probably like charts. These are essentially all the benchmarks that are in pyperformance, and what you see here are the orange bars; those are the bars for the monitoring solution, and they sit at around one, so if a bar is not visible, it means there's no overhead in that benchmark. But you see that the blue bars, for sys.settrace, can get high, up to a factor of 10 or 12. So the difference is really significant, at least in my opinion. And when we switch over to monitoring with all line events enabled, it gets worse, but we see that it's still significantly faster.
Another question is, of course, whether this whole thing is actually used, now that it's implemented. I work on the OpenJDK, so I know that when you implement a cool feature, chances are nobody will use it for a year. But here in CPython, people started using it, especially the vendors, like PyCharm. It's not yet used in pdb, but IDEs like PyCharm, since their version 2023.3, use it, and they've seen significant performance improvements. And there's currently a pending pull request on GitHub, so if you want to help pdb adopt it, go to this pull request and join the discussion there. I would really recommend it: CPython is an open source project, and you can make pdb better, so what's not to like? Here's a quote from the pull request by Tian Gao, who wrote it: after this change, we will have the chance to build a much faster debugger; for breakpoints, we don't need to trigger trace functions all the time and check for a line number. The bad news is that it's almost impossible to do a completely backward compatible transition, because the mechanism is quite different. So there's an ongoing discussion about how to do this. You could take part there: scan this QR code, be part of the community, give something back, and not just use CPython. Because I have a tiny bit of time left, I want to quickly show you how single stepping works, because single stepping is just breakpoints. Essentially, to step out, we just wait for the next line event where the frame below us has changed, that is, where the current frame is gone. Stepping over is also pretty simple: we just wait until only the line number has changed within the same frame. And stepping into: we stop at the next line event, even when a new frame has been put on top.
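The stepping rules just described can be written out as a single predicate. This is my own condensed formulation, not code from the talk: each stepping mode decides, per line event, whether the debugger should pause again, given the stack depth and line where the user issued the command.

```python
# Each mode decides, for a line event at (depth, line), whether to pause,
# relative to the (start_depth, start_line) where stepping began.

def should_pause(mode, start_depth, start_line, depth, line):
    if mode == "step":    # step into: pause at the very next line event
        return True
    if mode == "next":    # step over: same frame at a new line, or a shallower frame
        return depth < start_depth or (depth == start_depth and line != start_line)
    if mode == "return":  # step out: pause once the starting frame is gone
        return depth < start_depth
    return False

print(should_pause("next", 2, 10, 3, 5))    # False: still inside a callee
print(should_pause("next", 2, 10, 2, 11))   # True: same depth, new line
print(should_pause("return", 2, 10, 1, 7))  # True: stepped out
```

On top of sys.monitoring, the debugger would enable LINE events and re-evaluate this predicate in its line callback.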
So that's all from me. I'm parttimenerd on Twitter, and you can find my team at sapmachine.io, so if you want to use a JVM, use SapMachine, it's the best JVM; I'm contractually obliged to say this. We work at SAP, on one of the many cool open source projects at SAP, and you can follow me on my blog, where I write about debugging, eBPF stuff and everything else. So thank you for being here. Thank you very much, Johannes. We have time for probably two or three questions, maybe. Does anyone... we have one there. Thank you very much for this talk and for this PEP, because it actually solves a lot of problems I had when, back in the day, I started developing a tool for performance analysis for Python. However, at some point I chose to use the C interface of settrace and profiling. Does your proposal, as it is implemented already, also support the C interface? So I have to correct you: I have nothing to do with the nice people who implemented this, sorry. So please ask them, they're probably on some Discord somewhere. I'm just telling you about the good news, because the programmers themselves usually don't want to go to conferences and speak in front of people, and that's why I'm giving talks on this. So sadly, I don't know. Thank you. Do we have any more questions for Johannes? You can raise your hand. No questions, apparently. I just want to use this opportunity to thank Marc-André and David, also known as Flypig, for organizing this devroom. You guys did amazing work. Thank you very, very much. And thanks, Johannes, again.
Opening Railways and Open Transport devroom
So, hello everyone. Hello everyone, thank you for being here and for making this room so full, even in the early morning. Don't be confused, I'm speaking into this microphone just for the stream, so we have to talk a little bit louder here so that people online can hear us; I hope that's all right for you, and also for the speakers. Okay. We, the organizers, are very thrilled that you are all here and that we collected so many interesting talks. We will shortly give an introduction into what we thought about this schedule, why these talks were selected, but in general we had so many good contributions and submissions to this devroom that it was really hard, and we hope that you will enjoy the program. We, that is people from different railway companies in Europe: like last year, when we first organized this room, we learned a lot, and we also got to know each other, and we can actually say that we just officially founded the Open Rail Association a few days ago. So this is one of the forums where we bring together people from the free and open source software community, but also from railway operators and from the transportation community, to work together. And yeah, this is one of these great opportunities, so it's great that we can do this for the second time in a row. Thank you. Here we have Louis Hamelon from SNCF, who will also give a short introduction now; and then there is me, Max Mehl from Deutsche Bahn; we have Cornelius Schumacher from Deutsche Bahn, Peter Keller from SBB, Mahalia Steffan from SBB as well, and Simon Clavier from SNCF. So you see, we are quite international here, and we are the organizers. So Louis, do you want to give us a short intro into the day, what can we expect today? Yes, thank you, Max.
So we tried to tell a story this year, and not just put talks one after the other. We start with something about data: about traffic forecasting and modeling demand data. Then, what happens with the data: we try to simulate passenger behavior with the MATSim tool, which is a quite fancy tool. And once we have this passenger transport model, we can use it for running simulations and building timetables. So we have three talks about a fancy tool which is called OSRD: one about the map, one about the running time algorithm, and the other one about the signaling system. And then how it is used, and how the community works around all this, for railways and for transportation in general. Cool, Tristan. That's it for me, so maybe we can start now.
Open standards, open data, open-source tools: their governance and future
So, hello everyone, and thank you for your patience. I am Tu-Tho, a project manager and expert contributor with ITxPT. I'm based in Paris, and why I'm here today is because I do believe that the tools and the data that we generate should be used to build communities and inclusion. You have below a couple of languages that you can ask me questions in; try not Japanese, because then no one in the room will understand. Thank you. So, ITxPT, who we are: we are a non-profit association, and we originally come from onboard units, where we created a standard for open architecture, data accessibility and interoperability. Which means that basically the buses, the trains, the trams, the rolling stock would all be standardized to talk together, from the actual wiring of the vehicle: making sure everyone, for example, uses the same internet connection and taps into the same feedback loop to the back office. We are a membership-based association. We have over 160 members in 28 countries, including railway operators, public transport agencies and other associations. So as I said, what we really do is build this architecture for interoperability, first and foremost. We also gather a community of open source developers, aficionados and passionate people, which is why we're here, and finally we have a label for compliance: making sure that when people use the standards they're not alone, and that they can actually check that all their different units are compliant with the standard. So from the buyer's perspective, you know that it fits the norms that exist. And I'm happy to have Brede with me. Yes, officially I'm a product owner for a small team of 10 people in Norway, representing Entur, a company owned by the Ministry of Transport. We are a non-profit. We work on building open source tools.
We use publicly funded money in our development, and we want to give back as much as possible to society, both with open data and open source, collaborating with stakeholders in Europe, Norway and internationally. What we say we do is build an open infrastructure platform: the road authority builds roads, someone builds the harbors, the airports, electricity, water supplies; we build an infrastructure platform for mobility data. Open source all the way for my part of Entur, and I advocate for that for the rest as well. So, on this slide we wanted to show you a little bit of what exists today when we talk about data relating to transport, public transport and railways, knowing that there are different types of standards and specifications. If you take the European context, you have this gigantic European norm called Transmodel, which is really to be viewed as a data dictionary and a grammar. So it's not an exchange standard; it's really a reference for you to cross-check concepts: how they integrate with one another, how they are articulated, how they are actually defined. And because it's a European norm, it is also translated into most European languages, which makes it easier to implement. Obviously, a data dictionary, a data model, is nothing if a data exchange format is not created on top of it. So there were two, and later many more, open standards created based on Transmodel. One is NeTEx, which is timetable information: everything that is known in advance to describe the transport network, the schedules, the fares and so on.
You have SIRI for real-time information, for anything that is not known in advance, where you have real-time updates: vehicle monitoring, situation exchange, for example if you need to close railway or public transport services. And one that is upcoming, that we will start defining very soon, is OpRa, which is more about running statistics and performance, so public transport agencies and authorities can compare one operator with another. You also see on the screen GTFS Schedule and GTFS Realtime, which are probably the most used today across the world to describe timetables and real-time information based on those schedules. So if you use any trip planning app, a good chance is that it's actually based on GTFS data. Because we're here in this room, I would also like to thank a couple of colleagues, including Stefan, because right now what we're doing is actually bridging what was first created for urban public transport with the rail domain, through a European project.
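To make the GTFS mention concrete: a GTFS Schedule feed is just a zip archive of CSV files. This is a hedged, miniature sketch of my own (the stop IDs and coordinates are made up, loosely in the style of Norwegian stop references), parsed with nothing more than the csv module.

```python
import csv
import io

# A miniature, hypothetical stops.txt; a real feed ships this as one of
# several CSV files (stops.txt, routes.txt, trips.txt, stop_times.txt, ...)
# inside a zip archive.
stops_txt = """stop_id,stop_name,stop_lat,stop_lon
NSR:1,Oslo S,59.9109,10.7530
NSR:2,Bergen,60.3894,5.3325
"""

stops = list(csv.DictReader(io.StringIO(stops_txt)))
print(stops[0]["stop_name"])  # Oslo S
print(len(stops))             # 2
```

This plain-CSV simplicity is a big reason for GTFS's worldwide adoption; NeTEx, by contrast, is XML-based and can carry far richer operational data, as discussed later in the talk.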
So as I said, I wanted to situate everything that is open standards and open source within railways and open transport, and basically it's just to show you that everything is linked. As a customer, you usually just see the trip planning part, at the top right, where you actually want to go from A to B: you get your train schedule, your timetable and so on, and also real-time information: the train is cancelled, the service is disrupted, one is late, or just that the tram is arriving at the station. All of that is thanks to data issued from the back office, which in turn, especially for real-time data, is often based on vehicle data, where the ITxPT specifications exist. So as you see, we really try to map out all the different standards and specifications that exist to build all of that. But that's really the backbone; it was more for me to give you an indication that all the data you work on has been standardized, and that the standards are open for you to participate in. Let's go back. In my next minutes I will focus on the upper right corner. The ITxPT standards go from the vehicle to the back office; the systems that produce real-time data in Norway are more and more based on the ITxPT standards, but I will not cover that in my presentation, because we are in the upper right corner, on the side that consumes finalized data that we can use; the ITxPT part is done by the operators. This is an overview of all the components my team is responsible for. You can split it into two: the input side and the output side. All of this is open source, available on GitHub. Someone in the room, Hannes, is part of the team, so you can ask him technical questions if you like. One French company has downloaded this from GitHub, won a tender, and serves a region in Paris with it. We seek collaboration with everyone: don't reinvent the wheel. Again, in this audience, how many of you produce data, or want to produce data, for the mobility sector? Ten?
Stevan, raise your hand, you want the right way to produce data and make it open. And the rest of you want to use data, yeah? In the middle of this week I was in another meeting where the EU countries talked about and showed what they have done with open data regarding NeTEx and SIRI over the last year. Nine countries showed up, and all of them have a lot of work to do. So let's go briefly into what we have done at Entur. We have focused on high quality data; we need that to produce good information for the travelers. The operators and authorities in Norway are responsible for three things. We have a national stop place registry, so they manually update that. They produce planned timetable data; in Norway we say no to converted data, so they produce native NeTEx from the start. In this context, NeTEx is similar to GTFS, but has a lot more data support, more operational data that can be carried along. And we use the SIRI protocols for real-time updates, which is very similar to GTFS Realtime. We develop the OpenTripPlanner open source component; we do that in collaboration with that project, and it's a successful collaboration. It supports both NeTEx and GTFS, and supports high quality data. At Entur we reached, in January, one billion requests in a month, in a country with five and a half million people; so you want to travel a lot. The API from OpenTripPlanner is openly available; you don't need to register, but if you do, you can have more access to it. We support most of the main travel apps in Norway in using that API, so we are getting the same information everywhere. The API should be relevant for different users, so it's flexible: the client can decide what is important to show to its users. So the Ministry of Transport wants Entur to be a neutral one; the biggest national railway operator wants to show their offering first, so they show their offerings and not the competing ones. In the same way, other region-based apps show only their local area and not
enough and all of them are using the same API getting the same correct information everywhere we have one place to correct if it's wrong on the left side at the source we also share data in national access point which is a requirement in Europe there we share the API I talked about we share net text and stereo raw data files and we share gtfs and gtfs as well all of this is open available and what we say to the Norwegian data producers you have three responsibilities stop places the netx data and the return data and we from an tour can take care of the data is correct in all the apps and all the international apps also so we say to them deliver data to us and we can secure that it is correct as Google which is important for them we also see that the data producers want to use the data they have delivered to us that we have merged together with other data and we have quality validation tools so they want to use it themselves to do that they want they need to have more data than we need for public information they need operational data so we have added that into our data pipeline that is supported in netx and it's not supported in gtfs so that's the benefit of using netx in regards to gtfs and they starting now to use the data opens up giving them the possibility to get out of lock-in situations which is something that is usual in the public transport sector they have a big important software provider had it for many years and it's hard to shift and get out of it by going into the netx and doing that correct it's possible to break that circle to handle the extra data we do that with one validation in the previous slide but in open data we remove all the data that is for sensitive for operational part and we give them access to that in a different data set and it's this is worked works pretty nice today open source tools open tri- I can take open tri-planer I'm a man leading that open tri-planer is an open source tool started back in the US 13 14 years ago thanks 
It was a successful trip planner from the beginning, with increasing usage worldwide, and a lot of functionality was added. After 10 years, when we started to use it, we saw that it was built for big cities. When we built a graph with all the data from Norway, the latency wasn't usable: a search from Oslo to Bergen took 10 seconds and gave one answer. We decided, together with the community, to build a new version, and Entur took the lead on that development. The first two years we did the development alone, but we had meetings with the collaborators so we made sure we were on the same path. Today around 10 to 15 companies actively develop it. We do that together on the same master branch, we have regular product-owner meetings to discuss its direction, and we share resources. OpenTripPlanner is a multimodal trip planner; it supports all kinds of modes. We are still not finished with it, but it works, and we can collaborate even more. It supports the standards we talked about today, NeTEx and SIRI as well as GTFS and GTFS-RT, in the same instance, so you can use those standards together. And then we're getting to the part that might interest you the most: we wanted to present all the open source tools that exist, and others that need to be built, hoping some of you will raise your hands and help us build them, because it is good for the ecosystem. What exists is mostly thanks to the amazing work done at Entur, because everything is open source. As was said, they have an open stop place registry, so all the stop points. Pushed by European legislation, we have a national access point, a kind of open data platform, for every one of the 27-plus-3 European countries, where you can find a lot of data sets, and not only public transport: in France, for example, you can find the registry of all the carpooling places, and descriptions of bicycle lanes and so on. If some of them require you to create a login and password, it's mostly to keep up the KPI of how many people actually use the data. And you have a lot of other open data libraries: there used to be TransitFeeds, now it's called the Mobility Database, there's the hub for GBFS, and so on. At Entur you also have a data creation tool called Nplan, to actually create your schedules and your data in NeTEx. And for NeTEx we have validation tools that are fully open source: two developed by Entur, and Greenlight, developed through the European project DATA4PT, which basically checks whether your NeTEx feed is correct against the XSD schema. Then there are a lot of other smaller open source tools created by different companies. What we wanted to show is that those are tools people created within their companies, within European projects, within their own initiatives, because they answered specific needs. However, now that more and more data is open, we need to create more tools. Ideas we had while discussing with a lot of people, though we're happy to hear your thoughts, are: graphical representations of NeTEx and SIRI feeds; conversion tools from NeTEx to MERITS, which is more on the railway side; bridging the different open source validation tools that exist; or analytical tools. So that's it for our presentation; mostly we want to hear from you, whether you have questions on tools we could develop, on how to actually extend GTFS or GBFS or NeTEx or SIRI, or on how we work with the railway industry and OSDM, which is one topic we did not have time to present today. The floor is yours. [Audience] I have a question about NeTEx, because it was kind of small on the slide: it said it's the Nordic profile that's used. What are the differences between profiles, and what does that mean for compatibility with other countries? So your question is on the compatibility of
the different profiles. I was asked to repeat the question for the live stream, so: the question is on the compatibility of the different profiles. NeTEx is a huge standard, made by more or less theoretical parts, so almost every use case you can think of in public transport is supported. Within the standard it is allowed to model a specific use case in different ways, and it's still valid. So to make data interoperable and usable for third parties, we need profiles. The regulation came at the same time that NeTEx became a valid European standard, but the profiles didn't exist yet. The UK started with a small profile first, then France built a bigger one, and then in 2015 Entur came along with aims that were extremely high: we wanted to support what I showed in the slides, both the information part and the operational part, for the rail operators, which is complex, and also for bus operators and more. The French profile was based on the UK one; we based our profile on the French one and added things. A couple of years ago the EU profile came, which is almost the same as the French one, with small differences. The Nordic profile supports all of the most important parts of the European profile; the operational part is not in the EU profile, but it's also not needed there. Those are, broadly speaking, the small differences between the Nordic profile and the EU profile. What we see now in Europe is that too many countries build their own profiles. We also started this way, with a Norwegian profile, but in collaboration with the Nordic countries, Sweden, Denmark and Finland, we asked our neighbouring countries: does our profile support your use cases? If yes, thumbs up, we collaborate; if no, come to us, ask, and we see what the difference is. They came back with some small things that weren't supported, so we added them to the profile, and they had some ideas, "this is a better way of solving this use case", and we changed it. And we were in live production development with different stakeholders producing the data, and when we changed the profile, it went back to the left side of my previous presentation: you have to change your export. "Oh, are you going to give us money?" No, you're not getting any money; you have to change it. We stopped the validation: you are not allowed to produce data into our production if you don't change this. And then they changed it. The difference today is that it's still very hard to use NeTEx data from a different country in the same system if you don't do something extra, and that's what I spent the last week highlighting: we need to solve that as well. Was that answering your question? Yes? Thank you.
Rust-transit: libraries to manage transit data in Rust
So, we can start with the next presentation. Right pronunciation? That's right. We will talk about the Rust libraries for public transport, so the stage is yours. Yeah, hello. Thank you. And I want to thank the people before, because Transmodel helped a lot to make a nice model of how transit concepts should be named: what's a stop point, a stop area, a trip. So if you ever work with public transit, read the Transmodel model; it helps a lot just to make things clear. And also, OpenTripPlanner works quite well. We've been experimenting recently with using it for the whole of France, so it's a bit bigger than Norway, and it seems to be working, so they do some nice things; we'll have to talk afterwards. So, I'm talking about the very other extreme end of public transit data: some very small tools and libraries to manipulate this data. And it's in Rust, because, well, if you're handling a few gigabytes of data and want real-time data, Rust might be an option, and that's what we've done. So, we are a very open and informal organization; on GitHub it's rust-transit, and we want to make a lot of small bricks just to get started using public transit data, and then you can do whatever you want with it. It's not very formal, we have no statutes, nothing; it's just focused on implementing things, gradually adding more implementations and getting more things working. And this presentation is kind of a call saying: we're looking for projects to add to it and for maintainers, and maybe some people will say, okay, I have this very specific need, come and see us and let's talk. Right now, it's mostly, but not only, maintained by volunteers in the cooperative where I work, which is called Codeurs en Liberté, and there's a colleague over there who can also answer your questions. So, the first one is gtfs-structures. It's the most important part, the biggest one, so I will be spending a lot of time on it; the other crates are a bit smaller.
So, GTFS, as was said before, is the de facto standard used to publish static transit data: what time will the bus run next week, at what stops will it stop, and so on. We started it, initially for our own project, as just defining the types in Rust, the structs and so on; for those who work with Rust, it's basically just adding serde serialization and deserialization annotations. As time went on, we added some sugar, like reading directly from a URL, which was a common need: say Norway publishes the file on a website, and we just download it and have the data immediately. We started adding some integrity checks, because it's just plain CSV files, so identifiers might reference data that doesn't exist; we added those checks. And we tried to make it easier to navigate from one object to another. I want to mention one alternative, which is transit_model, made by a French company now called Hove, which used to be called Kisio Digital, and before that CanalTP, depending how old you are in the transit world. It's a library under the AGPL, so that might be a problem for some. It does many more things, like file conversion: it is able to convert GTFS to NeTEx and the other way around. It has very nice query functions, like "tell me all the lines that go through this point". But it's a bit more complex to use. It's mostly built for their own tools, so it's not always well documented, and you have to read the code to know how it works. In a perfect world it would be based on gtfs-structures. We did start discussing that with them, but it would break too many little things on their end and they didn't want to bother: "it works for us, don't bother with it". Some user examples: transport-validator, which is built for the French national access point they were talking about before. So transport.data.gouv.fr has a validator that checks that every GTFS file is valid.
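To make the referential-integrity problem concrete, here is a minimal sketch, in Python rather than Rust for brevity, of the kind of check described: GTFS is plain CSV, so nothing stops a trip from pointing at a route that exists nowhere in the feed. The data below is hypothetical, not from any real feed.

```python
# Hypothetical stand-ins for parsed routes.txt and trips.txt.
routes = {"R1": {"route_long_name": "Oslo - Bergen"}}
trips = [
    {"trip_id": "T1", "route_id": "R1"},
    {"trip_id": "T2", "route_id": "R999"},  # dangling reference: no such route
]

def dangling_route_refs(trips, routes):
    """Return trip_ids whose route_id does not exist in routes.txt."""
    return [t["trip_id"] for t in trips if t["route_id"] not in routes]

print(dangling_route_refs(trips, routes))  # → ['T2']
```

A real validator runs the same idea over every foreign-key-like pair in the feed (stop_times to stops, trips to calendars, and so on).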
It checks that it doesn't have buses that go over the speed of light, and things like that. It's based on gtfs-structures. There's also their own tool, gtfs-to-geojson: some people just don't care about the timetables, so they take a file of timetables and just extract the topology of the network. And another project is Catenary Maps, which is kind of a big student project from a university in California, I think; they're trying to build a whole system and they're contributing a lot. As a very small vanity metric, we have about 15 contributors, which is both a lot and not much for a project that isn't really publicized, and regularly people who just happen to use it make a contribution. So it's moving at a small pace, but it works. What I also wanted to say is that it's quite performant. We tried to find the biggest GTFS out there in the wild, and apparently it's the German one. There are 600,000 stop points in Germany, at least in the GTFS file, one million trips (a trip is one bus run along its route, so if the bus runs 10 times a day, that's 10 trips) and some 32,000 routes. So it's quite a big file, and just to read everything from the GTFS file into memory takes about 16 seconds on this laptop, whatever that means, and about 5 gigabytes of RAM. It means you can handle the whole data of Germany on your laptop, or on a reasonable, affordable server. It's also quite robust. As I said, every file on transport.data.gouv.fr, which is a national access point, is parsed using gtfs-structures. It has data that comes from a lot of different editors and vendors, present all around the world, so we have worked through all the quirks and all the weird things that people do. Like, in GTFS you're allowed not to put the trailing commas: if you have 10 columns in the CSV file and data only in the first two, you can just stop there and leave everything empty at the end. It's all those kinds of things we went through.
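The trailing-comma quirk can be shown in a few lines. This is a sketch (again in Python for brevity, with a made-up stop record) of what a tolerant reader has to do: pad short rows so that a row with missing trailing fields still maps onto the full header.

```python
import csv
import io

# Header declares 5 columns; the data row only has 4 (no trailing comma).
raw = "stop_id,stop_name,stop_lat,stop_lon,zone_id\nS1,Oslo S,59.91,10.75\n"

def read_padded(text):
    """Read CSV, padding rows shorter than the header with empty strings."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return [dict(zip(header, r + [""] * (len(header) - len(r)))) for r in body]

stops = read_padded(raw)
print(stops[0]["zone_id"])  # → '' (missing trailing column, treated as empty)
```

A strict CSV reader would either error out or silently drop the column mapping; real-world GTFS parsing has to absorb this.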
And I'm using this as a side note, as I have an audience that might be interested. The GTFS format was created as a dump of a database: just dump all the tables to CSV files and bundle them in a zip file, which is a horrible thing to do. It's fine for a one-off exchange with a colleague, but not for a standard. So in the future, if you ever work on this kind of thing, don't make zips of CSV files, please. Use, for example, a SQLite database. You can have a schema, so you can be sure the data will be respected: an integer will be an integer, a Boolean will be a Boolean, and so on. You have foreign keys, so you won't have dangling references. You have typed columns. You have indexes, so you open the file and get fast queries immediately, for free. You have a query language integrated: you just download the file, open it, and you can already compute some statistics. And you get some fun things: you can put everything on an S3 server and make HTTP range requests, so you don't even have to download the whole gigabyte file, just 10 megs, and you have all the data you need. So, yeah: think about how you serialize your data at the end, because people will use it, and it will bring a lot of pain if you don't think about it. Okay, that's it for gtfs-structures, which was clearly the biggest part; now some smaller projects. One crate (a crate is a package in the Rust world) is siri-lite. As you were told before, SIRI is a standard, a norm in Europe, for real-time data, and SIRI Lite is a simpler version of it to use; I think it's open to heated debate whether it should exist or not. SIRI was mostly used over a SOAP interface, so over XML, and SIRI Lite is the same data, but serialized as JSON and served over REST.
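A small sketch of the SQLite suggestion, using Python's built-in sqlite3 module. The table and column names are illustrative, not a proposed spec; the point is that a schema with foreign keys rejects at write time the dangling references that zipped CSVs happily accept.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # enforcement is per-connection in SQLite
db.execute("CREATE TABLE routes (route_id TEXT PRIMARY KEY, name TEXT NOT NULL)")
db.execute("""CREATE TABLE trips (
    trip_id TEXT PRIMARY KEY,
    route_id TEXT NOT NULL REFERENCES routes(route_id))""")
db.execute("CREATE INDEX trips_by_route ON trips(route_id)")  # fast queries for free

db.execute("INSERT INTO routes VALUES ('R1', 'Oslo - Bergen')")
db.execute("INSERT INTO trips VALUES ('T1', 'R1')")

# The dangling reference from the CSV world is now rejected at write time.
try:
    db.execute("INSERT INTO trips VALUES ('T2', 'R999')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Consumers then get typed columns, indexes and SQL immediately on opening the file, which is exactly the "fast queries for free" point above.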
And we used it initially to convert from GTFS-RT, the real-time GTFS data, to SIRI, for the French national access point, to be able to expose the data in the European standard. And I also used it in a small toy project where I read all the French... sorry, not French, Parisian, the greater Paris area, the Île-de-France data, to have dashboards about the real-time situation of this stop or that metro line. So it also works quite well with big data; I mean, Île-de-France is twice as big as Norway, and it works. The standards are well done, and we got things working. Another one, which really started as a toy project, is osm4routing. When people see OpenStreetMap, they say: oh, nice, a road network, let's implement some Dijkstra algorithm on it, because I want to play around with it. And when you go into the OpenStreetMap format, you see that it was meant for mapping, not for routing. The simplest example is a way: it can be a road that runs for 100 kilometers, and it doesn't stop at every intersection. If you want to do routing with that, it's a very bad graph. So the idea of this small tool is to cut the ways into a proper graph topology, as we learned as students, so you can run routing algorithms on it. Initially I made it just for a toy project: a spanning tree of all the roads from Tokyo to every corner of Japan, making this kind of tree-like structure; nothing useful. So it's meant for toying around, like that project: if you want to try some algorithm because you're a student and want to run it on real-world data, it's very nice. It's also used by OSRD, and I think there will be a presentation afterwards, which is the open-source railway designer. Sorry, I'm bad with acronyms. They want to do the same with railways, and for railways there was no tool to do it. But be aware: don't use it if you want real-life road routing for pedestrians or cars.
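A toy sketch of the way-splitting idea described above (not osm4routing's actual code): an OSM way can run through many intersections, so for routing you cut each way at every node it shares with another way, turning long ways into proper graph edges.

```python
from collections import Counter

ways = {  # hypothetical way_id -> ordered node ids
    "w1": [1, 2, 3, 4, 5],
    "w2": [6, 3, 7],  # crosses w1 at node 3
}

# A node used by more than one way is an intersection.
use = Counter(n for nodes in ways.values() for n in nodes)

def split(nodes):
    """Cut a way at every shared (intersection) node, keeping it at both edges."""
    edges, start = [], 0
    for i in range(1, len(nodes) - 1):
        if use[nodes[i]] > 1:
            edges.append(nodes[start:i + 1])
            start = i
    edges.append(nodes[start:])
    return edges

edges = [e for nodes in ways.values() for e in split(nodes)]
print(edges)  # → [[1, 2, 3], [3, 4, 5], [6, 3], [3, 7]]
```

Each resulting edge now runs intersection-to-intersection, which is what Dijkstra-style algorithms expect; the real tool additionally handles one-way tags, geometry and edge weights.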
There are much better tools, and there are tons of constraints it's not able to handle, like a left turn being slower than a right turn, and things like that. So use osm4routing for toying around, for very specific needs, or for learning, but don't use it for real-life routing, and use the very nice open source tools that exist for that. And that's pretty much it. So, thank you. Not everybody gets to work with Rust and transit data; we're quite open, and I hope we're friendly, so don't hesitate to contact us, and let's slowly grow this toolbox for your needs. [Audience] I saw a Chinese screenshot of a departure board. Are you planning on also integrating things outside of Europe? Well, that was just a Creative Commons picture from Wikimedia. But seriously, we are not specific to any region; I mean, GTFS and SIRI might be used anywhere. GTFS is quite used all over the world, and NeTEx is more European; I think SIRI has gained some traction around the world because it's more approachable than the GTFS-RT part, which is very focused on very big infrastructures and not always used. So, maybe it's possible. [Audience] More of a piece of information: you might be happy to hear that SIRI Lite has been fully approved, so it's no longer a kind of French version of SIRI on the side, because Île-de-France Mobilités asked for it. Okay, thank you, that's nice to hear. What's nice about all this transport work is making a vocabulary and agreeing on what each word means; it makes it easier to cross boundaries and formats and things like that. It's always a bit tricky, but nice to hear. [Audience] How far along is Rust in the transit industry? There's the other project, transit_model; is that also Rust-based? Yes, transit_model is Rust-based. And it's used for... Hove makes a routing engine called Navitia, which is heavily used in France and which is written mostly in Rust nowadays. It started as C++; well, it started as Delphi, but that's a long time ago.
So, it's actually used, yes. Okay, thank you very much. Thank you very much. Thank you.
Counting on openness: Privacy-safe passenger counting
How many of you are still awake? Let's have a show of hands. Okay, 90%, very good. How many of you work in mobility in your day job? All right, that's about 25%. How many of you would like to work in mobility as your day job? These are the superheroes of the next generation. How many of you have worked with passenger counting before? All right, five, excellent. So this story covers a few of the things that I want to highlight from the development of the Finnish national automatic passenger counting system. There are bits for everyone, and I'm going to be fairly speedy with these things; if you have questions, please ask them at the end. Let's get started. I've been working in public transit for a bit over 10 years now, and in software development for a bit over 15, and I started my own consulting company five years ago when I wanted to help more organizations as well. I just want to give you a bit of background: I come from the public transport side, not so much from the railway side. So, the basics: what is automatic passenger counting? It answers the simple question: how many people are there in the vehicle? There are two different kinds of messages that the vendors send. They send how many people went in and how many went out of a particular door, and then there's the option of telling how many people there are in the vehicle right now. Some vendors send both of these (speak louder? yeah, all right, thanks), so some vendors send both of these kinds of data, and then you have to decide which one you trust: the diff or the total. So why do we collect this? For the passenger, the benefit is quite obvious: you mostly want to travel in the less crowded vehicles, and you get this information from passenger information systems such as signage.
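The "diff or total" choice can be sketched in a few lines. This is a hypothetical illustration, not the Finnish system's actual logic: one vendor stream reports per-door in/out diffs, another reports a running total, and the backend has to reconcile them because diff-based counts drift over a day.

```python
door_events = [  # hypothetical (door, boarded, alighted) events at one stop
    ("door1", 3, 1),
    ("door2", 2, 0),
]

def apply_diffs(onboard, events):
    """Update an occupancy estimate from per-door in/out diffs."""
    for _door, boarded, alighted in events:
        onboard += boarded - alighted
    return max(onboard, 0)  # diffs accumulate errors; never report negative

onboard = apply_diffs(10, door_events)
vendor_total = 15  # what the same vendor might report as current load
print(onboard, "vs vendor total", vendor_total)  # 14 vs 15: which to trust?
```

In practice a backend picks one source per vendor, or uses totals to periodically re-anchor drifting diff-based counts.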
But also, more and more, I expect there will be automatic decisions made for you without you knowing about it. Trip planners already suggest trips that are less crowded. And in the future... I think the general technology should already be there, but it hasn't reached public transit yet, so we can't yet recognize prams and bikes and such when they come in. But when we can, we can tell you whether your pram fits in the bus you're aiming for. Now, for the authorities, the public transport planners, one of the most important things is to understand where the masses are moving, so you can allocate the vehicle capacity where it's needed. For example, if there's a route where the last three stops in one direction are often very empty, it might make sense to cut the route short and just increase frequency. Also, some of the trip planners show how many grams of CO2 you have emitted when travelling, and that depends on how many other passengers there are in the bus. Pandemic precautions were also an important driver for the funding of these projects, to finally get these things funded and running. And when passengers choose to even out the load, when they choose vehicles that are less crowded, transit becomes smoother because there's less congestion in particular spots. Now, the situation in Finland before COVID and before mobile tickets were very popular (aha, I'm hearing myself; that somehow caught me off) was such that there was not much incentive for developing these systems, because we got most of the information from the ticketing systems; at least we got the information on when people got on. But in 2020, six municipalities and the government put money together and pooled it into Waltti, a service development company for public transit purposes owned by the major municipalities of Finland.
And they pooled money together, and in 2021 they chose Futurice as the contractor. Futurice is an excellent service company in Finland, and I was privileged to take part in that team as a technical architect and lead developer. Over the next two years, more companies and organizations joined this project in its many phases; I'm just giving you a bit of the background on it. In hindsight, I think our main task was to reduce vendor lock-in and to reduce the costs of APC, because currently the high-quality sensors are expensive: a typical stereo camera costs 1,000 euros per door, and two to three doors per bus means it gets quite expensive. Also, quite sensibly, many of the vendors want to offer an end-to-end service of providing data and analysis, but then it's hard to get rid of that vendor if you want to move on to the next system. So we interviewed stakeholders, held workshops, sketched out some architecture ideas, and came up with a three-pronged approach. The first prong is that we created an API spec between the onboard counting devices and the backend, and we tried to make it easy to understand for companies that don't generally work with public transport. As a starting point we took a format from HSL, the Helsinki Regional Transport Authority for the capital area; PTAs, public transport authorities, often have the most resources, so they were a bit ahead. They had a data format modified from an earlier format of theirs, and we wanted to stay compatible with HSL so we don't fragment the Finnish market. But it also had a lot of cruft for our needs. So I've split the JSON message into two columns here: the first one is about the APC data, and the right one is about general public transport metadata, such as routes and operating dates and directions and such. And all of the data on the right side is available in the backend anyway, from some other source.
So, besides reducing some of the fields and trying to get rid of some ambiguities, we just added a schema version and a counting system ID, to do the matching in the backend, and a message ID to make each message unique so duplicates can be detected. And then we dropped everything that wasn't about the APC. These JSON messages are sent over MQTT, which is very commonly used in public transport, both on board and between backend and vehicles. I think this format allows any company that understands how to count people or objects to participate in this market, so it lowers the barrier to entry, and we're hoping there will be more companies offering counting devices. Okay, the second prong was to prototype new counting technologies. We asked two companies to develop new things and one company to provide a reference device of something that already exists in the market. Dilcomp created object detection from security cameras, and Amplica used a millimeter-wavelength radar, presumably for object detection. There are a couple of pictures on the upper right, maybe a bit small: that's the prototype millimeter-wavelength device, 3D-printed parts hidden behind the ceiling panel. Now, unfortunately, we learned that 20,000 euros per new technology was not enough for us to create breakthrough technology. We managed to produce the right format of data, but the values were not yet usable. Maybe some of you can figure this out; I hope you can. Okay, the third prong was that we created an open source backend for this whole system, so there's no great vendor lock-in to our team either. Here's a simplified architecture of it. Let's forget the left side for now; in the middle, data comes from the onboard counting systems, goes to the MQTT broker, and then we push it into Apache Pulsar.
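A sketch of what such a stripped-down APC message might look like. The field names here are hypothetical, for illustration only; the real Waltti/HSL spec may name things differently. The point is the shape: only the counting data, plus the three fields added for backend matching and deduplication.

```python
import json
import uuid

msg = {
    "schemaVersion": "1-1-0",                   # so the spec can evolve
    "countingSystemId": "vendor-x-device-042",  # matched to a vehicle in the backend
    "messageId": str(uuid.uuid4()),             # deduplication key
    "doorCounts": [
        # per-door diffs only; routes/dates/directions live in the backend
        {"door": "door1", "counts": [{"clazz": "adult", "in": 3, "out": 1}]},
    ],
}

payload = json.dumps(msg)  # published over MQTT as a JSON payload
decoded = json.loads(payload)
assert decoded["doorCounts"][0]["counts"][0]["in"] == 3
```

Everything on the "right column" of the slide (route, operating date, direction) is deliberately absent: the backend re-attaches it from its own sources.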
Apache Pulsar is a distributed append-only log system, a competitor of Apache Kafka, and it has been in use at HSL for six years now. We also wanted synergy there, so that Waltti and HSL would have similar technology backends. The messages from the MQTT broker are deduplicated and then brought into the journey matcher. The journey matcher also takes input from GTFS Realtime, the vehicle positions, which tell where the vehicles go and when they leave the stops. The logic in the journey matcher is very simple in principle: you accumulate the in and out values until the vehicle leaves the stop, and then you trigger an APC message with all of the public transport metadata you need in the analytics, so routes and stops and directions and so on. The journey matcher pushes it over MQTT back to the provider of the GTFS Realtime API, and that serves the authorities. That's the raw data, or not raw exactly, but the accurate data as such. But it doesn't serve the public, because this is private data: this is mobility data of people moving about. Now you might think, okay, how many people moved through a door doesn't really match any individual. But that is not so. On the left side we describe how we also need information from the vehicle registry, to pick up data about the vehicle models we have, the seat configuration and standing places, and how we create an anonymization profile out of it. For this part we needed help from experts, so we asked university researchers from the Finnish Center for Artificial Intelligence and the University of Helsinki. There's Professor Honkela's group, which focuses on differential privacy, and especially Joonas Jälkö (and also... sorry, oh dear, okay, I'll get back to that) worked on this. Joonas was working especially hard on this with me, and they created a method for the anonymization.
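The journey-matcher logic described above, accumulate door diffs between stops and flush one APC record per stop departure, can be sketched like this. The merged event stream and its field names are hypothetical simplifications of the real pipeline.

```python
events = [  # hypothetical merged stream: door counts + departure signals
    {"type": "count", "in": 4, "out": 0},
    {"type": "count", "in": 1, "out": 2},
    {"type": "departed", "stop_id": "stop-17"},
    {"type": "count", "in": 0, "out": 3},
    {"type": "departed", "stop_id": "stop-18"},
]

def match(events):
    """Accumulate in/out diffs; emit one record when the vehicle leaves a stop."""
    records, acc_in, acc_out = [], 0, 0
    for e in events:
        if e["type"] == "count":
            acc_in, acc_out = acc_in + e["in"], acc_out + e["out"]
        else:  # departure signal from GTFS Realtime vehicle positions
            records.append({"stop": e["stop_id"], "in": acc_in, "out": acc_out})
            acc_in = acc_out = 0
    return records

print(match(events))
# → [{'stop': 'stop-17', 'in': 5, 'out': 2}, {'stop': 'stop-18', 'in': 0, 'out': 3}]
```

The real matcher additionally attaches route, direction and operating-date metadata to each record before publishing it back over MQTT.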
Now, the reasoning why we need this: consider someone who lives maybe not in the city center but a bit further away, and who travels in a somewhat peculiar manner. Let's say they do shift work and travel at noon. At the stop they use, no one else ever gets on that particular route, in that particular direction, at that particular time, except them; and no one else gets off the bus there at that time. So if you learn that pattern, if that accurate information were public, you could stalk them and figure out: okay, now their house is empty, and so on. To combat that, the common approach, as I've understood it, is just to bin the values. For example, if there are 5 to 20 people in the bus, then it's "many seats available". In the GTFS Realtime standard, the occupancy status field has these ordered categories from empty to full. And the thing is, that's not really anonymization, because when you switch from one category to the other, you're still leaking information. So the method they created is based on differential privacy, and we believe it is the first differential privacy method for automatic passenger counting in public transport. I'm really glad these researchers made this effort for all of us, and it's all open source; I think it deserves a round of applause. It's also very simple. The above case would be the one where you have no anonymization except the binning: once you switch from four people to five people, you go from "empty" to "many seats available". How their method works is that they take the vehicle model, the seats and the standing places, as input (this upper CSV file here), and they adjust the category boundaries so that they match the differential privacy condition. Now, I'm not an expert on differential privacy, but I'll explain roughly how it works anyway: we're actually using epsilon-delta differential privacy.
In epsilon differential privacy you have the small value epsilon, which you can choose, and it trades off how private versus how usable and accurate your output data is. Epsilon bounds the probability that you can identify an individual from the data set, or rather whether a result was produced from a data set with a particular individual in it or not: that probability difference is very small and controlled by epsilon, and the delta parameter relaxes the condition a bit. The black areas here have a probability of exactly zero; that's the delta in action. Otherwise you would have these violet-purple bars stretching quite far along. This is how you interpret the visualized CSV file: for example, if you have seven people, you have a small chance of publishing "empty", a large chance of publishing "many seats available", and no chance of publishing any of the other categories. We want a system where, if the accurate value would be "many seats available", we don't accidentally publish "full". The computation of these profiles is quite intensive and takes many hours; there may be various optimization possibilities in the algorithm, but it only needs to be done once per vehicle model. Then you have this small CSV file, a table of probabilities, that you just sample from every time you need to publish a result at a stop. So it's very, very fast in use, and you can just plug it in if you already have another system like the one above. All right, so this has been a trip through these highlights. Check out our API spec, especially if you're interested in creating these kinds of counting devices; please try your hand at it. The buses are dirty and dusty and shaky, but otherwise you can use whatever methods you have available. Also, if you haven't yet got your own APC system, check out our backend code, or maybe our architecture and this idea of getting only minimal data from the APC vendors.
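How the published profile is used can be sketched very simply. The probabilities below are made up for illustration (the real tables come out of the researchers' epsilon-delta computation): each true passenger count maps to a probability distribution over occupancy categories, and publication is just one sample from the matching row.

```python
import random

CATEGORIES = ["EMPTY", "MANY_SEATS_AVAILABLE", "FEW_SEATS_AVAILABLE"]

profile = {  # true count -> probability per category; rows sum to 1
    # Zero probability is the "black area": some categories can never
    # be published for this count, which is where delta comes in.
    7: [0.05, 0.95, 0.00],
}

def publish(true_count, rng=random):
    """Sample the occupancy category to publish for a true passenger count."""
    return rng.choices(CATEGORIES, weights=profile[true_count])[0]

random.seed(1)
print(publish(7))  # one table lookup and one sample per stop: very fast
```

This matches the description above: the expensive part (computing the table per vehicle model) happens once offline, and the per-stop publication cost is negligible.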
It might be attractive for you. And if you already have an APC system, please do use the anonymization method created by the researchers. If you have further questions after this, you can contact me by that email. Here are a few of the links; they're also on the talk page. And I'm not yet sure what else will be behind transitprivacy.org, but right now it's just a link to the tool that the researchers created. That's enough of the monologue, let's start the dialogue. So for public transport, I guess it's very important to easily detect when a route is being inefficient, maybe just moving air. For example, if at some end of the route the bus or tram or whatever is mostly empty, how does this anonymization algorithm make it harder to detect when some public transportation is being underused? So the question was whether or how this anonymization will affect the public transport planning use case of figuring out whether reallocation is to be done with the vehicle capacity. In our architecture, the public transport planners get the accurate data into their analytics. The anonymization happens after, and it's only for the open data part. Can you speak about your experience with the microwave-based sensors? Oh, the millimeter-wavelength radar. I have no clue. We gave these companies a lot of leeway and they produced their pilots, and we don't have insight into how exactly that works. Any insights about the results of their pilot or not? The insights were that thus far the results were not good enough to be shown. You haven't actually mentioned a great deal about how counts are actually achieved. Sorry, I can't hear you. You haven't explained much about how counts are achieved. The technology has evolved enormously over decades. In the past you simply used to weigh the carriage. Now obviously that could be distorted by adults, children, people with luggage, Americans, whatever. Then you don't even know where they are on the train, down to the carriage.
So you don't know where they are individually. Then they looked at things like light sensors as you enter and exit. Then they looked at things like whether you are connected to the internet, and counting the number of people who did that. I suggest that they actually work on facial recognition, not against a database but simply counting the number of unique faces in a carriage at a time. Then you can track them as they move around and work out what the behavior is. What's your thought on that? Alright, thanks. So that was a brief history of the different kinds of technologies used for detecting passengers and objects, and then a question of whether facial recognition would work. I'm not sure; it sounds like it could effectively work. It's very tricky how to communicate it to the public in a way that is understood correctly. Like, for example, that we don't send anything else than plus one and minus one from these vehicles onwards. Yes, in the back. About that facial recognition, you don't have to go directly to that. There is so much more stuff that you can do, and with your counting mechanisms, even the open source models can really do much better without having to get any facial information. If possible, just actually rule that out; you don't really have to go there. Like, I'm working on modal share counting, and we're doing that for cycling and also for passenger counting and stuff. You don't need to actually mine all these information characteristics of the people in the tracking algorithm yourself. It's not really necessary to get there, and that will also ease the communication. Thank you. So the comment was about how the object detection and object tracking algorithms in open source are already quite fine without facial recognition. Yeah. What about calculation of CO2 in the carriage? Sorry, I can't hear you.
Calculation of CO2 in the carriage: when people are breathing, they affect the air, and you can calculate how many people are in the space from that. Studies of this have been done for COVID, for example, and you can reuse those COVID studies. My next question is regarding the use of security cameras for counting people. Do you have any experience with the producers of those systems? The moment you use the camera for a different purpose, the warranty is gone, because we have this problem. The providers say, okay, we will never use it for other purposes than just checking security. This is like a big legal constraint that you have in procurement. Yeah, a good comment on security camera warranties. I remember hearing discussions about that, but I don't have any proper answers about what the security service providers think about using their camera feeds for something else. Okay. Thank you. Thank you.
MATSim at SBB: Using and contributing to the open-source transport simulation for advanced passenger demand modeling.
Thank you, Peter. Yeah, so today I want to talk a little bit about MATSim. MATSim is transport simulation software that is being used at SBB, the Swiss Federal Railways, but it is actually an open source tool that has been around for quite some time. So obviously there will also be a little bit of talk about MATSim itself, the software, what it does, and how you could use it if you are ever interested in that. That's the agenda: I'll briefly explain what MATSim does and why we find it useful at SBB. We also contribute actively to the MATSim code, and I'll give some examples of that. And since you might wonder why on earth we are even bothering, I'll give you some examples of our work with the software. So what is MATSim and why is it useful? If you have that elevator speech moment where you have to explain your work to your CEO and they ask you what you're doing, then I tend to say I'm playing SimCity, but with complex econometric data behind it, so you have all these weird formulas somewhere in there. Then the elevator ride is over, I have more or less explained what I'm doing, and the CEO knows that we have some guy playing SimCity all day. Well, there's a bit more behind it, but in brief that's what we're doing. So we are simulating transport, and we simulate people's behavior using transport during the day. MATSim stands for Multi-Agent Transport Simulation and it has been around for roughly 20 years. It started as a purely academic project between ETH Zurich and TU Berlin. On a side note, that also explains why a Berlin guy is now living in Switzerland.
So you can kind of imagine my background. It has evolved over the years and there are many models around the world, and quite a few of them are actually fully built on open data and are publicly available (not ours, for some reasons). For example, there's quite a nice scenario of Berlin that you can download; you can see the data and where it comes from, and you can start playing with the model. Whether this is useful for anyone, I can't say, but I think it's useful for some, even if mostly PhD students, to be fair. There are commercial users around the globe as well. Among those, us, the SBB; there's Volkswagen, who have quite a strong development effort, also into the MATSim core, but they're not as open to talk about that as we are, probably. Then there are models in Melbourne, there's one at the Berlin transit agency, so it has some standing right there. There's a book, there's code, there's a license, and for the last couple of weeks there's also an association that kind of brings the whole thing together. Now, how does it work? Imagine you have a lot of data. You have census data, for example, or register data: you know where people live in a city, or you just make that up and place people somewhere. You have econometric data, that is, values of time: you know what a person's intention is. If they travel by train, then the value of time is maybe six euros per hour, and if they go by car then it's maybe 10 euros per hour, or the other way around. You have a road network that can come, for example, from OpenStreetMap, that is a very typical use case. You have a timetable for public transport, typically GTFS. You have count data. Many of the topics discussed in the previous talks are actually input data for us, and that is a lot of input data.
What we do then is add some generic algorithms that basically randomly tell people during the day: change your route, change your transport mode when you go from one activity to another, or change your departure time. And then we let the same day run again, a bit like Groundhog Day, 200 times, 500 times, and mix people up and let them try out new things. This is what we call the MATSim loop; it's also somewhere on my T-shirt. What comes out of it is output data, even more of it. You have individual daily plans: you know what your synthetic population is doing during the day, where they go shopping, what transport modes they use. You have mode choice for each trip, whether people tend to take the car to get from A to B or public transport, depending on what's on offer. You have time-dependent traffic loads, so a lot of data to analyze and do your policy planning with. You have distances, you have all kinds of aggregate data that you can then use and play with. Obviously, this calibration process, so that the model really depicts the real world in an initial stage, is kind of the long story behind model building. What can you use the whole thing for? Of course, transport policy evaluation: what happens if there's a new road, what happens if there's a new railway line, what happens if there's a new price? You can make it person-specific: you know who's affected by a transport policy, because you have this agent-based paradigm behind it. You can also calculate, for example, accessibility. A lot is actually happening where MATSim is being used for on-demand transport modes: you can really do your fleet scheduling, your fleet planning. You can say, okay, what happens if we have a lot of automated vehicles that replace passenger cars, and what's the advantage of that? All these kinds of future scenarios you can use MATSim for. Well, basically playing SimCity.
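As a toy illustration of that loop: the co-evolutionary idea of "rerun the day, let a share of agents try something new, keep better-scoring plans" can be sketched in a few lines. This is not the real MATSim API (which is Java); all names are invented, and the value-of-time figures are just the illustrative numbers from the talk:

```python
import random

# Toy sketch of a MATSim-style co-evolutionary loop (not the real API).
VALUE_OF_TIME = {"train": 6.0, "car": 10.0}  # euros/hour, illustrative
TRAVEL_TIME_H = {"train": 1.0, "car": 0.8}   # assumed travel times

def score(mode: str) -> float:
    # Higher is better: negative generalized travel cost.
    return -VALUE_OF_TIME[mode] * TRAVEL_TIME_H[mode]

def matsim_like_loop(agents, iterations=200, replan_share=0.1, rng=None):
    rng = rng or random.Random(0)
    plans = {a: "car" for a in agents}           # everyone starts by car
    for _ in range(iterations):                   # "Groundhog Day"
        for a in agents:
            if rng.random() < replan_share:       # a share of agents replans
                candidate = rng.choice(list(VALUE_OF_TIME))
                if score(candidate) >= score(plans[a]):
                    plans[a] = candidate          # keep the better plan
    return plans

final = matsim_like_loop(agents=range(100))
```

In real MATSim the score is a full utility over the executed day (activities, delays, fares), and the replanning strategies mutate routes, modes, and departure times rather than a single mode choice.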
The MATSim project has been around for almost 20 years, and historically it has been administered by the universities, so ETH Zurich and TU Berlin. Professors grow older, and at one point they retire, and the person that comes next is maybe not as interested in such a transport simulation anymore. Since last year, the whole MATSim project sits on an association level, so that there's also some funding from other users to maintain build servers and all that stuff. The association also organizes things like the user meeting that is held annually, keeps track of all kinds of developments, publishes a newsletter, all that kind of stuff. Now, at the Swiss Federal Railways, how did we start with that? It's a very brief timeline on one slide, but I think it's kind of interesting. In 2016, our CEO saw a presentation about MATSim and decided: we need this at SBB, please buy MATSim. Well, as it happens with open source software, buying the whole thing wasn't that easy and the whole procurement process didn't quite work out, so the task was delegated to the department that deals with classical transport models; it's actually part of the passenger division, and this is also where I'm working. That came with some challenges: for example, you needed someone who knows programming in Java, and those people didn't exist in that department, but that's something that can be overcome. And you need proper computers, actually, because if you want to run proper big models then having a nice tiny laptop isn't sufficient. That was also something to overcome, also thanks to the IT at Peter's department in the end. At least we didn't kill it. At least they didn't.
Yeah, but the whole thing, building a model for Switzerland in MATSim, took three years and ran from 2017 to 2020. Along the way we noticed that there are several additions to the code base that need to be made to make this a useful project for us, and at one point you see, okay, we need to decide: do we commit this back into the MATSim core, or do we keep it in our secret chamber? Luckily, people chose wisely, and this is all open source, and it was actually a management-backed decision, so I'm very happy with that. So in 2018 the first release of the model that I will present in a moment came out, and since 2020 we have an annual release cycle of a transport model that is multimodal and MATSim-based for Switzerland and that can be used for all kinds of policy studies. Of our contributions to MATSim I just want to showcase two. First of all, one that is oddly called the SwissRailRaptor. It's a Raptor-based public transit router that works really fast, because if you want to route millions of people within a reasonable time frame then this is what you need. Compared to what we had before it was many, many times faster, and the whole simulation was actually sped up by a factor of three. What is also important to say: the SwissRailRaptor is a Java package; it's tied to MATSim, it uses MATSim data structures, but you don't actually have to use it for MATSim problems. So in a way it's something that can be used instead of OpenTripPlanner (OpenTripPlanner has other advantages), but if you really need to route a lot of routes at the same time then it's something to look at.
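For a flavor of what a Raptor-style router does, here is a generic, textbook-style Python sketch, not the SwissRailRaptor code itself, and with an invented toy network: instead of exploring a graph edge by edge like Dijkstra, it scans each route at most once per round, boarding the earliest catchable trip and improving arrival times along the way.

```python
from math import inf

# Each route: ordered stops plus trips; a trip holds (arrival, departure)
# times at each stop. All data invented for the sketch.
ROUTES = {
    "R1": {"stops": ["A", "B", "C"],
           "trips": [[(0, 0), (10, 11), (20, 21)],
                     [(30, 30), (40, 41), (50, 51)]]},
    "R2": {"stops": ["B", "D"],
           "trips": [[(15, 15), (25, 25)]]},
}

def scan_route(route, best):
    """One pass along a route: ride the earliest catchable trip and
    record improved arrival times."""
    updated = set()
    trip = None
    for i, stop in enumerate(route["stops"]):
        # riding a trip: does it improve the arrival at this stop?
        if trip is not None and trip[i][0] < best.get(stop, inf):
            best[stop] = trip[i][0]
            updated.add(stop)
        # could we board an earlier trip here? (trips are time-sorted)
        reachable_at = best.get(stop, inf)
        for t in route["trips"]:
            if t[i][1] >= reachable_at:
                if trip is None or t[i][1] < trip[i][1]:
                    trip = t
                break
    return updated

def raptor(source, dep_time, rounds=3):
    """Earliest known arrival times from `source`."""
    best = {source: dep_time}
    for _ in range(rounds):
        if not {s for r in ROUTES.values() for s in scan_route(r, best)}:
            break
    return best

arrivals = raptor("A", dep_time=0)
```

The real SwissRailRaptor layers transfers between nearby stops, range queries, and intermodal access and egress on top of this core round-based idea.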
And then, yeah, now we have a bit of a fancy routing algorithm, so it also knows range queries and intermodal access and egress. That last point is very, very important, because if you want to model stations you really need to have an idea how many people arrive by foot, how many arrive by bike, and so on, and that is not an easy question to answer. Apart from the routing problem, which is already quite complex, you also don't have that much empirical data about it. But it's really one of the most useful features of our model by now, so we're happy to have that, and as I said, you can use it kind of independently of the rest of MATSim, so it's well worth checking out. Then there's another contribution where I was a bit more deeply involved. There is a traffic flow simulation in MATSim that is typically queue-based (I won't go into detail here because I don't have enough time for that), but we replaced this by something that is roughly two times faster for the whole simulation process. It's called Hermes, because, well, it can fly now, but it has less pluggability, so depending on the use case of your simulation you can use one or the other, and they're kind of interoperable. This brings simulation run times for the Switzerland scenario, with both the router and Hermes, down to something like 24 hours, and since these typically run on AWS instances it actually saves a lot of money to have models that run reasonably fast. For the calibration process you maybe need 50 of those runs, so it kind of sums up. Now, what do we use MATSim for? First of all, there's a model called SIMBA MOBi. This is where I'm the product owner, so I know a lot about it. It literally depicts the everyday mobility of eight and a half million people in Switzerland, so basically the whole population. It includes all major transport modes, so walking, cycling, taking the bus, taking the car, and it has a representation of the transit schedule
obviously, and also the whole road network. And since it uses MATSim, the whole behavior of the people, or agents, is microscopic, including first and last mile decisions. I hope that video works now. Yeah, it does. So right now, imagine it's 8 a.m. in Switzerland, and you see people in those blue dots being at home, and those light blue dots are people starting their work time. Now we zoom into a region somewhere near Zurich and see what people do there. To get from one place to another they need to travel, obviously, and they could travel by car, then they are in those little gray boxes, or they could take the train or public transport, then they are in those little red boxes, and they get from one place to another. And obviously you can run your analysis on that: some public transport vehicles are maybe more crowded, some are less crowded, and you can sum that up over the day and see what's going on. Now we zoom in again, and what you can do is see who's alighting at certain stations, what kind of passenger groups there are: do they have a regional subscription, do they have a Half Fare card, or do they have ordinary tickets, for example. Also on the highway you can see during the day how many people are currently on their way to work, how many people are on their way home, and who's just a truck and who's doing other things. This is all in the model and you can analyze a lot of things. Obviously we are a bit more tied to public transport related analysis, so station access and egress is different at certain points: in Dietikon, for example, there are more people arriving by bus at the station than in Aarau. Or if you take the city of Baden here, you can see that people who reach the station from nearby typically walk, and people that come from further afield take the bus. These kinds of analysis are really useful, for example, for station design and station planning. So typical use cases for the model are then also the
development of rail lines, the design of stop locations and timing, and the effects of timetables. So it's nice that you created a nice timetable, but who's going to use it, and how many people will be on the trains? This is an answer we can give. And we can analyze what's happening around the stations, that was the video that I showed, and we can also see what's the effect of certain land use policies, for example. We don't only have a model that depicts today, but also one for 2030, 2040, 2050, so we kind of know how, according to today's assumptions, Switzerland is going to evolve, and then we can do policy planning with this. These are kind of future scenarios. Just one example: over the next 20 years there will be roughly 20 to 30 new railway stations opening up, mostly along existing lines. And very often these stations are being built because there's something happening around them, so there's a new development coming, new housing or a new commercial area being built or something like that. And just like in SimCity, we can add those little houses into the model, add people there, give them daily plans. For example, this is now in the city of St. Gallen, where not a new station is being planned, but the moving of a station to another place. So basically that station goes from the left to the right, and then lots of houses are being built there. With the tool we can say, okay, beforehand we had at both those stations roughly 4,000 passengers a day, and now it's roughly 6,000. So that would be the effect of the policy, of the things happening there, and these numbers help you to dimension those stops properly. Another application, which doesn't come from my department, so please don't ask me questions about it, but I think it's interesting enough to be presented here, is that we also want to go deep down into knowing what's happening along the railway corridors. So MATSim has a mobility simulation, I talked
about that earlier, and my colleagues decided, okay, we can replace this with something that we call RailSim, which actually has tracks and signals and blocks in there, and we can start playing around with that and do rough capacity planning, on a much easier level than it is usually done. So you still don't need to have an idea of every signal and every switch that's on the tracks, but roughly of the whole thing; you need to know whether a track is single track or double track, for example. The outcome of this is now also a little video. You have two trains, one coming from down here; it currently has a speed of six meters per second but it's accelerating to 11 meters per second, and there you have a train that is at six meters per second and wants to accelerate to 14 meters per second. This train wants to go this way and this train wants to go that way, and there's a station, and they both interfere at one point, and obviously we don't want them to crash. As you can see with the train coming in, the red lines are basically the blocks that are being blocked in front of the train. It depends on how fast the train is: the faster the train, the longer the braking distance, so more blocks are being blocked. And then you can see the train that comes from the right has the right of way, and the left train got a red signal and is braking. Now that the right train has passed, the switch goes to green and the other train can enter the station. And obviously you can connect that with the rest of MATSim, so you know how many passengers are on the train, and then you can do your policy planning again. If there's a heavily delayed freight train that would generate you this and that amount of money, maybe you want to accelerate it, but then you see, oh no, it interferes with all of our daily commuters, they would be very angry. So you can do your policy planning around this. It's still in an early stage, that microscopic railway simulation, but ultimately this is where we want to go. It's also released as a MATSim contribution called RailSim, so it's part of the MATSim code, everyone can use it, and I think it's the way to go in that direction as well. But please don't ask me too many questions about that; I can connect you to people who know about it. So, to wrap up: MATSim has helped SBB massively to understand customer behavior, and committing to open source has, in our point of view, really paid off here, and it's also the way to go for us. These models are very, very complex, but that is what they are, no matter whether you use commercial or open source software. Oh yeah, that one has to come too: if you want to know more about MATSim, there's an annual meeting; this year it will be part of the hEART conference, a transport conference, on the 17th of June at Aalto University. So yeah, thank you, and I'm happy to take questions, even though I only have five minutes. Thanks a lot for the presentation. Quite a few of the transport systems have a historical background; for example, some of them are based on industrial needs from like 20 years ago, some of them are related to new developments in the city, and some of these things are also represented in OpenStreetMap and these kinds of resources. Are you able to extract amenities, maybe, or a historical sense? Or is the distribution of the sociodemographics something that you get as a matrix? How do you get this kind of distribution? And, sorry, second: do you import GTFS and this other stuff? Yeah, both good questions. So the first one: we do have the census data from the Federal Office of Statistics in Switzerland, and they also have an idea how it will look in the future, so we are in a very lucky situation; there's a lot of data available, and publicly. In fact, there's also a transport model
available for Switzerland publicly, but unfortunately with a closed source license, where you need software that costs you roughly 10,000 euros a year, so that doesn't help a lot. So we rather build our own models. And the other question: yeah, typically one would use GTFS as the main data source for public transport data. Since we are the railway operator and we have timetables in all kinds of formats, we use a different one, but if you were to build a model for MATSim, typically you would use OpenStreetMap and GTFS. For the model of Swiss transport, how do you do that, there are three million agents being simulated? Eight and a half. Yeah, eight and a half. So how much can it scale, like how many agents can you have in one simulation? Yeah, so the models do scale, but there's an upper limit to what is useful, because Switzerland is still useful also in terms of regional scope, because there you have many long-distance commuters. But if you are using MATSim for, for example, really long-distance choices, then you would simply remove everyday commuters. In a previous life I created a model for Sweden that also worked, and that's also roughly the same number of people. But there are simulations for cities in China which have 20 million people, and what you can do is cut down the number of agents that you simulate and increase network capacities and so on, so there are ways to deal with that. So you mentioned that you can feed OpenStreetMap data into MATSim, but does MATSim also provide tools to add new assets or population models? There are tools that allow the addition of new people if you don't want to hard-code it. Some of them are commercial, I mean there are some spin-off companies around MATSim who provide this as a service, but you can also do it on your own; it's just basic Java or Python code that you can use for that. And if you do transport modeling for public transport, you would probably rather edit the GTFS than edit the MATSim schedule. I think there are a lot of interesting... yeah, I'm here, sorry. Well, I think one short question is still possible. One more, yeah. How do you determine the accuracy of your model? Ah, yeah, that is another talk of an hour. Getting models right and calibrating them properly is mostly about the count data that you need, and in Switzerland we have something that's called the microcensus, where we ask people every five years about their mobility behavior, and that is very, very accurate and has a lot of data that is useful to calibrate models. But it's always a fair question, if someone presents you a transport model, to ask how it is calibrated. Okay, so you...
Bending geographic maps for enhanced railway space-time diagrams
Hello everyone. My name is Alexis and I develop data visualization web applications at OuestWare. We do a lot of open source things, and I'm totally not a train person initially; I'm still not a train person, actually. But since early 2021 we started working for SNCF Réseau, which is the firm in charge of handling the French train infrastructure, and we started to contribute to OSRD, which, I guess it's been advocated here today already? Not yet? Okay. So OSRD is an open source railway designer, an open source application to simulate trains on real or edited infrastructures. That's kind of amazing. The interface is web based, the project is kind of huge, and a good part of the team must be in the room, I guess. And yeah, you can check it out. At some point we, OuestWare, were tasked to enhance the space-time diagrams. What are space-time diagrams? First of all, not everybody agrees on what they should be named. They might be circulation diagrams or graphical timetables or train graphs (which is actually a nice name, train graph), but I'll stay with space-time diagrams. It was probably invented by a French engineer, Charles Ibry, in the early 1840s. This engineer was in charge of scheduling the trains between Paris and Rouen, and he used this very smart chart I'll describe right after. Some people think it was actually a Russian military guy. No? You remember? No? Okay, there's another lead and it's not clear who invented it; let's stay on this track. So horizontally you see the time, hours of the day, and vertically you see the list of the stations from Paris to Rouen. Basically it works like this. Can I zoom in? Okay, nice. Each train is a line on this diagram. And you can read a lot of information in this type of diagram, just from those lines on this scale. Basically, the more vertical a line is, the faster the train goes.
When the line is horizontal, it means that at different times the train is at the same position, so it doesn't move. When two lines are crossing, it means that there are two trains at the same position on the line at the same time. How is that possible? If I read this map, for instance, I can tell that there are probably two different tracks, one for each direction, and probably no more. I know this because trains that don't go in the same direction can cross kind of anywhere, here or here or here. But when they are in the same direction, one train has to stop in a station, like here, or somewhere around here, et cetera. It's kind of crazy, all the information that is displayed in such a simple diagram. The thing is, I'm not a train person, but I have known this diagram for long, because it's actually on the cover of one of the data-vis reference books, The Visual Display of Quantitative Information by Tufte. And it's still used today. There are reasons why this screenshot is from OpenTrack and not from OSRD; I'll come back to that later. But OpenTrack is another software to handle trains, and they still use this kind of diagram. And it becomes actually even better once we introduce blocks. When they started using trains on tracks, it was kind of easy, because basically there were not enough trains to consider collisions. But at some point, a train goes fast enough and is heavy enough that when a driver sees a danger on the track and starts braking, the train won't stop before the collision. So people had to find solutions for this. I'm going to oversimplify how it works, since I'm not a train person. Basically, the track is split into blocks, only one train can be in one block at any given time, and there's a signal at the entrance of the block. If there's a train inside the block, the signal is red, so you cannot enter it.
If the signal is orange, it means that you must be ready to stop, because there is a train in the next block, basically. The thing is, when a block is occupied by a train, it means that during a time span and over a distance, there cannot be other trains in this block. So basically, the occupation of a block by a train is a rectangle in the space-time diagram, and when two rectangles collide, that's bad. And here is what it looks like in OSRD. The red rectangles are the blocks occupied by a train, and here in OSRD I started a simulation, and I dragged this train so that there was a collision. It's really easy, graphically, to see that there will be two trains in the same block at some point. As a data-vis person, I think that's kind of amazing. But how can we make this even more informative? People from OSRD asked us: vertically, we just have the list of the stations, or the list of points of interest, but we would like to bring more information into this. And we thought, let's start digging: who does this kind of thing? So we started looking into other transportation modes, where people have to see how they travel along kind of a line. Here is what it looks like when you are inside the RER D. This is a train that goes from the northern Paris suburbs to the southern Paris suburbs, going through Paris. When you are inside this train, you have this synthesized diagram. It's nice because it brings only the information you need, the list of stations, but also some interesting things, like where you can switch to other transportation systems, et cetera. This is nice, but the thing is, what Loïc wanted us to do was to have the exact infrastructure and to see exactly what the tracks on the line are at any given point. And this would have required us to actually know the whole infrastructure and to do heavy computations, and at this point we planned to do this as a front-end-only feature. So we kept digging. Sorry, yeah.
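Stepping back to the block occupations for a moment: the "two rectangles colliding" check that the diagram shows visually is simple enough to sketch in code. This is an illustrative Python version with invented names, not OSRD code (OSRD is a web application): an occupation is a time interval attached to a block, and a conflict is two trains holding the same block during overlapping intervals.

```python
from dataclasses import dataclass

# A block occupation: the space-time rectangle drawn in the diagram,
# reduced here to (block id, reservation interval). Names are made up.
@dataclass(frozen=True)
class Occupation:
    block: str
    t_start: float   # when the block is reserved
    t_end: float     # when it is released

def conflicts(a: Occupation, b: Occupation) -> bool:
    """Two occupations conflict if they hold the same block during
    overlapping time intervals (open-interval overlap test)."""
    return a.block == b.block and a.t_start < b.t_end and b.t_start < a.t_end

train1 = Occupation("block-12", t_start=100, t_end=160)
train2 = Occupation("block-12", t_start=150, t_end=210)
```

Checking every pair of occupations with `conflicts` is exactly what the diagram lets you do at a glance when two red rectangles overlap.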
So we can render anything we want there, but we need to know the exact topology and do heavy computations. And we kept digging to find something else. OK, sorry. So on top is the actual map of this bus line in Paris. When you take bus 58 in Paris, you have this map. And the thing is, the line you can see on the top map appears absolutely straight here, you see. And this is kind of amazing, because we assume we cannot bend things in cartography, but that's what they did, probably by hand. And they obtained this nice map where there are very identifiable areas. You can see all the streets. You can see a lot of information. But still, you know that you are going basically from left to right or from right to left. And it works. The trade-off is that we have to show everything a map would show. We cannot just pick exactly what we would like to display, as we did with the schematic diagram, because we have to take everything. But the good point is that since we show everything a map would show, we have all the context around. For a train, that would be the cities, the buildings, the places that are near the train but not exactly on it, et cetera. This is actually called a strip map, and it has existed for quite a long time. We've seen some very old examples, like this one. And it has actually already been used within space-time diagrams. This one comes from the Russian military: it's trains between St. Petersburg and Moscow. And on its left side, not vertically aligned, you can see the whole itinerary with a lot of information around it. You can see the sea next to St. Petersburg, other identifiable points, et cetera. It brings a lot of context. So, let's bend geographic maps. The strategy we used is to generate a grid made of triangles along the path, and then to generate another grid, which is totally flat.
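As a rough illustration of the construction that follows (toy numbers, not the OSRD implementation), one can walk a polyline at regular steps and emit a perpendicular crossbar at each step; consecutive crossbars are then cut into the triangles of the grid.

```python
import math

def crossbars(path, step=1.0, half_width=0.5):
    """path: list of (x, y) points; returns one (left, right) crossbar per step,
    each crossbar being a segment perpendicular to the local path direction."""
    bars, travelled = [], 0.0
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        while travelled <= seg:
            t = travelled / seg
            px, py = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
            nx, ny = -(y1 - y0) / seg, (x1 - x0) / seg   # unit normal
            bars.append(((px - nx * half_width, py - ny * half_width),
                         (px + nx * half_width, py + ny * half_width)))
            travelled += step
        travelled -= seg   # carry the leftover distance into the next segment
    return bars

# A 4-unit straight path sampled every unit gives 5 vertical crossbars
bars = crossbars([(0.0, 0.0), (4.0, 0.0)], step=1.0)
```

Joining the endpoints of neighbouring crossbars (and the path points between them) yields the bent triangle grid; the flat counterpart is the same construction along a straight line.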
And when we want to translate a coordinate from the normal geographic system to the bent system, we just find which triangle it is in, and then we translate it from one triangle to the other. And this is something that is easy to do. So let's take a path. This is from Nantes to Angers, in France. Then we generate a grid around it. Basically, I simplify the path a bit, I take regular steps along it, and at each step I draw a line crossing the path perpendicularly. And then I draw triangles between them, kind of. But I have two problems here. First of all, there are points that are in multiple triangles, and this is bad. Another issue is that I have large triangles touching really small triangles, which means that in my final map, with this kind of distortion, the result wouldn't be very smooth. So we smoothen the grid by running a few relaxation steps: I move each point towards the barycenter of its neighbors, basically, something like that. Then we index all the triangles in a quadtree, so given a point, it's really, really fast to know what the nearest triangles are, and among those nearest triangles, which one contains my point, et cetera. Then I build the regular grid on the right, so each triangle exists in both grids. And at this point, yay, we have a projection. So that's what I said: if I have a point P, I find the quad that contains P. I look for all the triangles that collide with this quad. I find the one that contains my point. Then there's a triangle with the same ID in the straight grid, so I just find this triangle, and I use barycentric coordinates to translate from one triangle to the other. Also, I had to actually develop something. We use react-map-gl and MapLibre, because they are already used inside OSRD. For this prototype, basically, we render a hidden map that contains the whole grid, but we don't show it on the screen. We just load every feature we can.
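The triangle-to-triangle transfer itself is short. A minimal sketch (plain Python, not the actual OSRD UI code): express P in barycentric coordinates of its source triangle, then rebuild the point with the same weights in the matching triangle of the flat grid.

```python
def barycentric(p, tri):
    """Barycentric coordinates (l1, l2, l3) of point p in triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    px, py = p
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (px - x3) + (x3 - x2) * (py - y3)) / det
    l2 = ((y3 - y1) * (px - x3) + (x1 - x3) * (py - y3)) / det
    return l1, l2, 1.0 - l1 - l2

def contains(p, tri, eps=1e-9):
    """p is inside tri iff all its barycentric coordinates are non-negative."""
    return all(l >= -eps for l in barycentric(p, tri))

def project(p, src_tri, dst_tri):
    """Carry p from src_tri to the same relative position in dst_tri."""
    l1, l2, l3 = barycentric(p, src_tri)
    x = l1 * dst_tri[0][0] + l2 * dst_tri[1][0] + l3 * dst_tri[2][0]
    y = l1 * dst_tri[0][1] + l2 * dst_tri[1][1] + l3 * dst_tri[2][1]
    return x, y

# A bent triangle mapped onto a straight one:
src = [(0.0, 0.0), (2.0, 0.5), (1.0, 2.0)]
dst = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
p = (1.0, 0.8)
q = project(p, src, dst)
```

The quadtree only serves to find `src` quickly; once the triangle pair is known, the projection is this handful of multiplications, and it is exactly invertible by swapping the two grids.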
We use layers from OpenStreetMap for the context, and from OSRD to have the actual infrastructure, signals, et cetera. Then we wait for the idle event that says, OK, I have loaded everything, I'm ready. Then I take all the features and I transpose them. Well, I project them, actually; it's a projection. I also have to clip them if they go through the grid, or if they come from outside the grid and enter it, et cetera. Then I can render a new map with the projected features, which looks like this with the grid, and like this without the grid. And we can look at the two maps side by side. Yeah, that's it. We have what we wanted: a map that shows the full itinerary from Nantes to Angers. We can still identify things. What I really like with strip maps is that the local information remains true: if I'm going from Nantes, at some point here, I know that I have the Loire on my right and Carquefou on my left. And that local information remains true in the bent map: at some point, Carquefou is on my left, the Loire on my right, et cetera. You preserve local context at the price of having bent lines around. In OSRD, this is how it looks. This is a screenshot; I hope to show you something that works better in a minute. So yeah, it brings a lot of context. And when you zoom in, close to the train in OSRD, you can see the exhaustive infrastructure, all the tracks. We don't have signals yet, but they will come soon. And yeah, it works for almost any path, as long as there are no loops, right? And it does bring context. With the current implementation, we lose the tiling: we have to load everything at once and render the map at once, so I can't, like, zoom in and see more things at a better definition. That might come later. And it's a bit slow at the moment, because we have to load everything and translate it at once. Yeah, the demo, that's going to be really quick. It's just a Storybook, on the OSRD UI project.
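The "take all the features and project them" step amounts to walking every coordinate of every loaded geometry through the projection function. A sketch with GeoJSON-like dicts (the `straighten` function is a stand-in for the real triangle-grid projection):

```python
def map_coords(geometry, fn):
    """Return a copy of a GeoJSON-like geometry with fn applied to each position.
    Works for Point, LineString, Polygon, Multi* via recursion."""
    def walk(coords):
        if isinstance(coords[0], (int, float)):   # a single [x, y] position
            return list(fn(coords[0], coords[1]))
        return [walk(c) for c in coords]
    return {"type": geometry["type"], "coordinates": walk(geometry["coordinates"])}

# Toy projection standing in for the triangle-grid transfer:
def straighten(x, y):
    return x + 0.5 * y, y

line = {"type": "LineString", "coordinates": [[0, 0], [1, 2], [2, 0]]}
projected = map_coords(line, straighten)
```

In the prototype this runs once, after the hidden map's idle event, over every feature the map managed to load; clipping against the grid boundary happens before this step.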
If you want, you can just ask the OSRD people. This is a project that's been moved out of OSRD, which means that you can actually use it without OSRD data. It's just a React component that embeds some dependencies. And yeah, this is from Nantes to Marseille, quite a long path. On your right, you will first have the ocean, and then later there's Toulouse, there's the Pyrenees, and then the Mediterranean Sea. So it works as we wanted, lots of context. And also in OSRD, roulement de tambour, drumroll. OK. This is the path I showed earlier. When I hover a train on the graph, I can see it on my strip map. And when I zoom in, I get the actual infrastructure. There it is, yeah, OK. I see that the train swaps tracks here. That's nice. OK. That's going to be it for the demo. Thank you very much.

I can probably take one or two questions; I'll need two minutes. So, one or two questions.

Does this projection look good with photos or satellite imagery, or would it look really strange? Yes, it might look a bit strange. But actually, when the grid is quite smooth, like the one I showed earlier, where the triangles are just slightly bent, it might work. The thing is, I only work with vector data right now. I could actually project pixels, but if I project pixels, you will get stretched pixels. Yeah, a really sharp turn would skew things. Yes. Loïc has tried with a path that starts somewhere and later passes just next to that same somewhere, and that's bad. For now. For now.

Do you know how these maps were made before, like the bus maps in Paris? By hand. I'm quite sure it's by hand, but I don't have any proof. But I know that when I saw the amazing schematic maps of the infrastructure inside SNCF, I asked, wow, what's the algorithm? "What algorithm?" So I bet it's by hand.
MARECO algorithm: how to drive a train using the least amount of energy
Now, the second talk about OSRD. It's about running time calculation, and the best way to calculate running times while saving energy. Alex will present it.

Thank you, Loïc. Hi, everyone. Thanks for coming to this talk. Today: how to drive a train using the least amount of energy, with the MARECO algorithm. This talk could have been how to drive a bus using the least amount of energy, or how to drive any public transport that has wheels using the least amount of energy; it actually works even better with bikes. So, I'm Alex Roland, working also at SNCF Réseau on the same project, OSRD, for those who were at the previous talk. Here is our GitHub repo if you want to check it out. And I'm going to mostly spend some time on one type of graph, not the space-time graph that you've seen just before, but this one, called space-speed. It's a very simple graph that just represents the speed along the path of the train, from its departure to its destination, through the stops that it might make. On this graph, you have the speed limits that are on the line. Most of the time, the speed limits are quite a bit lower in train stations. The train leaves the departure station with a speed of zero, then it accelerates until reaching the different speed limits. Then it brakes to reach the stop at speed zero again, then accelerates, and then brakes again. This is the fastest drive the train can make: it accelerates as much as possible, drives as fast as possible, and brakes at the last moment to meet each stop and each speed restriction. In this case, let's pretend that the departure is at 8:00, the stop is at 9:20, and the destination is at 10:00. This graph still shows distance; I'm just telling you the times because this graph does not show time, and we're going to need it. The problem with public transport is that if the train leaves five minutes late, it won't be able to catch up, because this is already the fastest drive.
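The fastest drive can be sketched with the classic two-pass method (toy numbers and constant acceleration rates, not OSRD's solver): a forward pass limits speed by acceleration, a backward pass limits it by braking, and both are capped by the local speed limit.

```python
import math

def fastest_profile(limits, dx=100.0, accel=0.3, decel=0.5):
    """limits: speed limit (m/s) per dx-metre segment; a stop is a limit of 0.
    Returns the fastest feasible speed at each segment boundary."""
    n = len(limits)
    v = [0.0] * (n + 1)                       # start from rest
    # Forward pass: v_next^2 = v^2 + 2*a*dx, capped by the segment's limit
    for i in range(n):
        v[i + 1] = min(limits[i], math.sqrt(v[i] ** 2 + 2 * accel * dx))
    # Backward pass: make sure we can always brake down to the next restriction
    for i in range(n - 1, -1, -1):
        v[i] = min(v[i], math.sqrt(v[i + 1] ** 2 + 2 * decel * dx))
    return v

# 2 km at 40 m/s, then a stop (limit 0), then 1 km at 30 m/s
limits = [40.0] * 20 + [0.0] + [30.0] * 10
profile = fastest_profile(limits)
```

The resulting profile is the "accelerate as hard as possible, brake at the last moment" curve the slide shows; summing `dx / v` over it would give the fastest running time.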
So it will at best arrive five minutes late at the stop and at the destination. It's also a problem if the driver does not accelerate as hard as the fastest drive assumes: the train will be late even though it left on time. And it's a problem if the driver does not drive at the maximum speed, which, spoiler, happens. So, going back to the fastest drive: it is actually a very bad way to plan trains, buses or any public transport, because everything can fail very easily, as the examples show. So we want this public transport planning to have some margin in it. If I'm adding, let's say, a ten percent time margin, I want to stop eight minutes later, so at 9:28, and arrive twelve minutes later here, at 10:12. Then I have some margin to absorb these problems: leaving late, not accelerating as hard, and so on. But we are here to save energy too. So we want to add that extra time, but we also want to save energy. Well, the good thing is, physics does both: if you drive slower, you save energy. Great news. This is due to the different forces that apply to the train when it's running. You can see them here; let's not worry about the weight and ground resistance. What's important here is the air drag and the solid friction, which scale with v and v squared, so at high speeds they are much greater. Driving slower means using less energy; you experience it if you bike, and it's the same with cars and every transportation system. So let's lower the speeds in a very basic way: a linear margin. We lower the speeds by the same percentage all the way along the train's path, and then we arrive at 9:28 and at 10:12. Did we save that much energy? Not quite sure. Here is another way to lower the speed and still be on time with the planned margin: this time, we lower the high speeds only.
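The "slower is cheaper" claim can be made concrete with a Davis-style resistance formula, R(v) = A + B·v + C·v² (the coefficients below are made up for illustration): the work needed to hold a speed over a distance grows roughly with v².

```python
def resistance(v, A=5000.0, B=30.0, C=6.0):
    """Davis-style running resistance in newtons at speed v (m/s).
    A: rolling resistance, B*v: mechanical friction, C*v**2: air drag."""
    return A + B * v + C * v * v

def energy_at_constant_speed(v, distance):
    """Mechanical work in joules needed to hold speed v over `distance` metres."""
    return resistance(v) * distance

d = 100_000.0                                   # 100 km
e_fast = energy_at_constant_speed(80.0, d)      # full speed
e_slow = energy_at_constant_speed(72.0, d)      # 10 % slower
saving = 1.0 - e_slow / e_fast                  # ~16 % with these coefficients
```

A 10 % speed reduction buys more than a 10 % energy reduction precisely because of the quadratic drag term; that asymmetry is what MARECO exploits.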
But what I'm going to show you now is the best strategy to lower the speeds, because there are infinitely many ways to lower the speed and arrive on time. I could also just stop in the middle and then get back on track. So I'm going to show you what was published by engineers from SNCF a few decades ago. I think the original paper is from 1979, so I was not born yet. It shows the best strategy to run trains in terms of energy consumption. So how does it work? There are four types of actions. Here I'm showing the same kind of graph, but a very simplified one. The train can be accelerating, maintaining speed, coasting, or braking. Coasting means the driver cuts off the traction and the train rolls on thanks to its inertia. Those are the four driving actions we are going to study. The idea is to study each type of action and see how much energy we can save per unit of time that we add. If we look at accelerations, we can try to accelerate a bit less hard than the maximum: let's say at V0 we accelerate a bit more gently, and then accelerate at the maximum again. I'm sparing you the formulas, because they would be too long for this talk, but basically this leads to a nice but small amount of energy saved per unit of added time. If we look at maintaining speed: as we saw, the speed has a huge impact on the air friction, so driving at a slightly lower speed V1 actually saves a lot of energy per unit of added time. That's interesting. There are two reasons for coasting. The first one, the small triangle you see here, corresponds to a slope: the driver cuts off the traction before the slope, slows down a little bit, and then accelerates again thanks to gravity in the slope. Over this distance, no traction was used. So it's interesting.
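Coasting is easy to simulate: with traction cut, only resistance decelerates the train, m·dv/dt = -R(v). A rough sketch with assumed mass and resistance values (not OSRD's physics) shows how far a train rolls while dropping from V1 to the braking speed VF:

```python
def resistance(v, A=5000.0, B=30.0, C=6.0):
    """Davis-style running resistance in newtons (illustrative coefficients)."""
    return A + B * v + C * v * v

def coasting_distance(v1, vf, mass=400_000.0, dt=0.5):
    """Metres rolled on flat track while coasting from v1 down to vf (m/s),
    integrating m * dv/dt = -R(v) in small time steps."""
    v, x = v1, 0.0
    while v > vf:
        v -= resistance(v) / mass * dt   # deceleration from resistance only
        x += v * dt
    return x

# A 400 t train coasting from 60 m/s down to 40 m/s rolls for many kilometres
dist = coasting_distance(v1=60.0, vf=40.0)
```

The long distances this produces are why coasting phases dominate the picture: kilometres of track get covered with zero traction energy.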
And before braking: if we know we are going to have to brake and slow down, we might as well cut the traction earlier, and then, thanks to inertia, the train keeps rolling while losing some speed. So we have two parameters here: the same V1, the maximum speed, and VF, the velocity at which we want to stop coasting and start braking. This is also very interesting in terms of energy saved per added time. And for braking: well, no energy is used while braking, so no possible energy savings there. What this analysis shows is that the two most interesting actions are saving energy on maintaining speed and on coasting, which, combined, look something like this. We want the two to be balanced so that the margin is distributed as evenly as possible; we don't want all the margin in one spot, for reasons I'll explain a bit later. Then, basically, here is how the algorithm works in the end. We start by computing the fastest drive, and then we do a binary search with iterations. We start with a V1 and a VF that lead to, let's say, this result. Then we get as output how much time this actually represents, and we compare this time to the time we want, the 9:28 and the 10:12 from before. And we iterate, computing different profiles, until we converge to the solution that leads to the time we want. If we come back to the first example, this leads to something like this, where the higher speeds are lowered, and you can see the coasting phases before each braking phase. And we arrive on time, with the margin we added. Now let's see how it looks on some examples, using OSRD simulations. Here is a train between Paris and Lyon, so a high-speed train, a TGV. You can see it on the map, and then you have a linear margin. I don't know if everyone can see the green lines, yes? OK, so the linear margin is on top, and the MARECO algorithm at the bottom.
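The iteration just described can be sketched as a bisection (simplified to a single parameter; the `trip_time` stand-in replaces a full simulation): search for a ceiling speed V1 whose simulated trip time hits the target time, i.e. the fastest time plus the margin.

```python
def trip_time(v1, distance=100_000.0):
    """Toy stand-in for a full simulation: constant speed capped at v1 (m/s)."""
    return distance / v1

def solve_v1(target_time, lo=1.0, hi=80.0, tol=0.01):
    """Bisection on V1: trip time decreases monotonically as V1 grows."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if trip_time(mid) > target_time:   # too slow -> raise the ceiling
            lo = mid
        else:                              # too fast -> lower the ceiling
            hi = mid
    return (lo + hi) / 2

fastest = trip_time(80.0)        # 1250 s for the fastest drive
target = fastest * 1.10          # add a 10 % time margin
v1 = solve_v1(target)            # ceiling speed that lands on the target time
```

The real algorithm searches V1 and VF together, and each "evaluate" is a full physics simulation, but the convergence loop has exactly this shape.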
The orange curve here is the slope profile along the path of the train. On this example, you can see some triangle parts, which correspond to the train cutting off traction and then using the slope to accelerate again. And you can see here that it cuts the traction a bit before the final braking here. In this case, we get 12% energy savings between these two strategies, for the same running time. Another example, between Gap and Briançon, so this is in the Alps, in the mountains. The slopes here are quite strong, and there are many uphill and downhill sections. It's interesting because we can use the triangle technique many times: cutting off the traction and then keeping up the speed thanks to the slope. As you can see, there are many triangle shapes. And there are more stops here, so also more braking phases that we can use for coasting. In this case it's 13% energy savings, partly because the overall declivity goes uphill, which is not as favorable for this algorithm. Another example in Bretagne, in the west of France, between Rennes and Quimper, with many stops this time: I simulated a regional train that stops in many cities. So there are many stops, hence many coasting phases before braking, and the overall declivity is quite flat. In that case, we get 20% energy savings, which starts to be a lot. And the last example is near Paris, between Paris and Mantes-la-Jolie. Also many stops, but this time the overall declivity is mostly descending, which is a very good situation for the algorithm. In that case, we get 32% energy savings. So, let's plan all the trains with this algorithm! What can go wrong? Well, the MARECO algorithm has some impacts on train planning and operation, and I'm going to start with the few downsides. Most of the margin ends up towards the braking phases, because that's where the main coasting phases are.
So we need to use it a bit carefully, especially on long distances. I showed you the Paris to Lyon trip, with no stops in between: most of the margin was at the end, which means that if the train leaves late, it will catch up near Lyon, but it won't be able to really catch up along the way. So it's going to be a bit late the whole time, which is not great; it needs to be used carefully. You can also deteriorate the headway, that is, how many trains can run in a given amount of time, because the algorithm can lower the speeds a bit too much in some areas. It also assumes that drivers will follow the fastest drive at low speeds, accelerating as much as possible, which is not the case if we study driver behaviour. So in the end we plan trains a bit wrong if we assume that every driver accelerates using 100% of the traction force. The good news: energy savings. This can be a lot of money in the end for the company; each percent can be a lot, so imagine 20 or 30%. It's also closer to real driver behaviour, especially experienced drivers who know the line: they anticipate the slopes and cut off traction to save energy. So it's more similar to actual driving than the linear margin. Strong accelerations are better for the headway, especially on dense lines: you want trains to leave the stations as early as possible and then drive at a high enough speed, because trains that drive slowly are really bad, sorry, for the headway. And coasting before braking, also on dense lines, leads drivers to approach stations at lower speeds, because they have been coasting beforehand. So they can anticipate and adapt their braking better if there is a train in front of them and they get a caution signal asking them to slow down. And that's it. Thank you.

APPLAUSE

We have three minutes for questions, I think, here and here. Yeah, you mentioned that braking doesn't cost any energy.
With regenerative braking, it does save energy if you brake more gently. Does the algorithm take that into account? This algorithm doesn't take it into account, because it's too old; I don't think trains that could regain energy while braking were a thing at that time. I personally would like to adapt the algorithm to take this into account in the future, if that becomes one of our needs for OSRD simulations. But yeah, you're right.

What about the length of the train? For instance, with a long freight train, you can get some very different results, because it can be on an ascending and a descending slope at the same time. Yeah, so, I think I need to repeat the questions for the microphone. The question was: what about long trains, especially freight trains, which can be very long? Well, yes, the declivities, the slopes, can compensate each other along very long trains. This algorithm still works, no matter the length of the train: because of the binary search, we don't assume the exact output. We simulate, we look at the total time, and then we adapt the V1 and VF velocities to reach the time we want. So it does take those effects into account, as long as the simulation you use takes them into account. One last question. Sorry.

I see your graphs arrive on time, but when you save energy, I would assume from the graph shown that it covers less distance, because you have speed over distance. Where is the time actually saved? Is it that normally the train would just reach its end station earlier, and you now take those extra minutes and spread them out by saving energy? No, actually, the graph was only showing speed and distance. So the time, if you show it, yeah, no, we don't have time for this, OK? Sorry. But the trains actually arrive on time.
They drive a bit slower, so if you represent it as a space-time diagram, you will see that the lines are a bit more horizontal, because they drive slower. Thank you, Alex. Thank you.
Railway signaling: detecting conflicts in a complex world
Hi, I'm Eunice and I work for SNCF Réseau on the OSRD project. The standard disclaimer: the opinions in this presentation are my own, and not those of the OSRD project, SNCF Réseau or the OpenRail Association. So, what's OSRD? It's a railway design toolbox built around microscopic simulation. It allows you to perform operational studies, and also to find last-minute paths through the infrastructure without creating any new conflicts. It's licensed under LGPLv3 and funded by SNCF Réseau, the European Union and the French state. Now, a short signalling primer. The main goal is that trains do not crash into each other or derail. The problem is that trains are very hard to stop: they take a very long time to slow down, and they need to know that they should slow down very much in advance. To do that, we use signals, and in order to actually use the signals, we need to know where the trains are. For that we use track circuits and axle counters: basically, we divide the infrastructure into zones, and in each zone we can know whether a train is there or not. We call the space between signals a block, and the detection areas zones. Another rule is that a train must not go over a switch that isn't set for it. For that, trains need an itinerary through the infrastructure; we call that a route. And the route needs to be established, which means that the switches must be locked in place, before a train can pass the signal at the start of the route. So, for example, this is using BAL signalling, which is the main French signalling system. Here we have a train, and behind it this route is set, so you have one red light, then one yellow light that announces that the next signal is red, and then it's OK, it's a green light. But up above, the route isn't set, this switch is dangerous, so there is not just one red light, there are two red lights. This means that under no circumstances may a train pass the signal. With a single red light, a train can pass it, but very slowly. So, we have a number of challenges.
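The aspects just described can be condensed into a toy decision function (a simplification of BAL, which has more aspects than this): double red when the route ahead is not set, red when the protected block is occupied, yellow when the next block is occupied, green otherwise.

```python
def bal_aspect(route_set, block_occupied, next_block_occupied):
    """Toy BAL-like aspect for the signal protecting one block."""
    if not route_set:
        return "double-red"   # absolute stop: switches are not locked for us
    if block_occupied:
        return "red"          # a train is inside the protected block
    if next_block_occupied:
        return "yellow"       # announce that the next signal is red
    return "green"

aspect = bal_aspect(route_set=True, block_occupied=False, next_block_occupied=True)
```

The ordering of the checks encodes the safety priority: route integrity first, then occupancy, then the lookahead announcement.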
Every European country has its own signalling system, and actually multiple ones. There is a standardization effort, called ERTMS, which is actually three levels of signalling systems, and even more complicated than that, but it's not widely deployed yet, and it probably never will be everywhere, because nobody is going to upgrade a line for no reason. But we need to cover every single one of those cases; for us, ERTMS is just another system. We also must not re-simulate the whole infrastructure every time we make a small change to a train, for example its departure time. And in OSRD, STDCM, for example, uses an A* through a graph of time, space and speed to find a path that doesn't conflict with any other train, and at every iteration of that A* we cannot simulate the whole infrastructure; it just wouldn't scale. So we need to be able to model the capacity needs of a train while simulating only that train. Also, most of the application should not have fifteen implementations of everything because there are fifteen signalling systems; it should be heavily abstracted. So, our approach is that a signal has a very restricted view of the infrastructure: it only sees what is in front of it, and it sees a linear path until the next signal. Signals see the state of the zones they protect, and they also see the state of the next signals. We give them other metadata, such as the speed of the approaching train or the kind of train; this is useful in some special cases. And we also separate the concept of a signalling system, such as BAL or TVM, from the signalling driver, which is the actual code that implements the behavior of a signal. Drivers depend on the input and output signalling systems. So, for example, here we have a BAL signal that is followed either by a TVM signal or by another BAL signal, and we have two drivers, so two modules, one handling BAL-to-BAL signals and one handling BAL-to-TVM signals.
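The driver abstraction can be sketched as a registry keyed by the (input system, output system) pair; all names and transition tables below are illustrative, not the OSRD API.

```python
# One "driver" per (signalling system in, signalling system out) transition,
# looked up in a registry instead of hard-coding every combination.
DRIVERS = {}

def driver(sys_in, sys_out):
    """Decorator registering a driver for a pair of signalling systems."""
    def register(fn):
        DRIVERS[(sys_in, sys_out)] = fn
        return fn
    return register

@driver("BAL", "BAL")
def bal_to_bal(next_state):
    # A BAL signal announcing another BAL signal (toy transition table)
    return {"green": "green", "yellow": "green", "red": "yellow"}.get(next_state, "red")

@driver("BAL", "TVM")
def bal_to_tvm(next_state):
    # A BAL signal at the boundary with TVM cab signalling (details invented)
    return "yellow" if next_state == "red" else "green"

def evaluate(sys_in, sys_out, next_state):
    return DRIVERS[(sys_in, sys_out)](next_state)
```

Adding support for a new system then means adding drivers for its boundary transitions, not touching the rest of the simulation.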
And we inject BAL parameters, because this is actually a BAL signal, since the actual lights use the BAL signalling system. From that, we can feed, along the path of the train, the state of each preceding signal, get a state, feed it forward, and we have signalling. But there are a number of problems with this. It's very cool, but as you can see, the actual signal reacts after the passage of the train, which is quite normal, because that's how it is in the real world. Our problem is that trains need to see green in front of them: their actual needs are ahead of them, not behind them; they don't really care about what's behind. And we linearize the path, but what is the path of the train that follows our train? We don't know, because, as we said earlier, we are simulating each train alone. So we need to model the capacity requirements of a train while only knowing that this one train passes. So, why do trains conflict? Either they are following each other too closely, in which case they need the zones in front of them to be free, or they have incompatible routes, which means that they need the zones in front of them to have some specific switch configuration in order to proceed. There are other reasons why trains conflict, such as power delivery needs and many others, but we don't handle those, and they have nothing to do with signalling. So, what's a spacing requirement? It's a zone, a begin time and an end time. Quite simple. For a route, we have a setting deadline, which is the begin time, and the actual switch configuration: to set a route, you need to know in which direction you are going to traverse each zone, and with which switch configuration. So how do we get this? Every time a train encounters a signal, we start by assuming the zone in front of the signal is occupied, and we probe the infrastructure linearly until that signal becomes green again.
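A minimal sketch of that probing (a two-aspect, BAL-like lookahead on a linear sequence of zones; the real code works on the signal graph): mark the train's zone occupied, then walk back signal by signal until one shows green; every zone protected by a non-green signal joins the spacing requirement.

```python
def aspect(zone_states, i):
    """Aspect of the signal protecting zone i, given per-zone occupancy flags."""
    if zone_states[i]:
        return "red"
    if i + 1 < len(zone_states) and zone_states[i + 1]:
        return "yellow"          # announces the red signal ahead
    return "green"

def spacing_requirement(n_zones, occupied_zone):
    """Zones a train in `occupied_zone` implicitly reserves on a linear line."""
    zone_states = [z == occupied_zone for z in range(n_zones)]
    required = [occupied_zone]
    # Probe backwards: signals behind the train are degraded by its presence
    i = occupied_zone - 1
    while i >= 0 and aspect(zone_states, i) != "green":
        required.append(i)
        i -= 1
    return sorted(required)

# A train in zone 3 of a 6-zone line also reserves zone 2 (yellow signal)
req = spacing_requirement(n_zones=6, occupied_zone=3)
```

Systems with deeper lookahead (more warning aspects, or TVM speed steps) simply make this probe walk further back before finding green.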
And then we know that all the zones where the signal wasn't green basically are part of that signal requirement and we can adjust the begin time of the zone to match the time at which the train saw the signal. And every time a train leaves the zone it doesn't require it anymore. In terms of routing requirements most of the parameters only depend on the path of the train. So if we go earlier the route, the traverse zone, the detectors which basically indicate the direction in the zone and the switch configuration only depend on the path of the train. We know that. We are simulating the train. But in order to find the setting deadline which means we need to know which signal protects the entry of the route. And not only that signal but the signal before it because as we saw trains can see a signal being packed by signal before the actual protecting signal. So basically we probe in the other way so we set all the zones in the route as incompatible which means that the route isn't set and then we iterate through the signals until we find a signal that's green. Good so now for a train we have its conflict and spacing requirements and the good thing with those is they are indexable by zone so we can simulate every train once and then keep a database of each requirement and we simply need to check for every zone if all the requirements match. So the spacing requirements are never compatible if they overlap and the routing requirements are compatible if they go in the same direction and have the same switch configuration. And if we had a new train we only need to check its requirement and same thing in the ASTAR of STDCM if we had a train we only need to check that the new zones that are traversed by ZC ASTAR iteration are actually conflicted. So in the future we want to implement TVM support. We are actually in the process of doing that and it should be done by the end of the month. We want to implement support for overlaps. 
The main problem is that France doesn't use overlaps, which basically are zones that must remain free beyond the end of a route, in case a train doesn't stop where it should. France doesn't use that, but Germany, for example, does, and we do not have any German on the team. Same thing for other countries' signalling systems: we want to implement those, and contributions are very welcome. There is also moving-block support, basically ERTMS level 3, and to implement that we probably need another model, specific to moving-block systems. Thank you for listening. Do you have any questions? Five minutes for questions. Here. Thanks. Another one?

You mentioned the different signalling systems and the different operational rules. Can you model these quite flexibly, or, as you said for the TVM implementation, is it manual coding? Manual coding of the signalling system and the driver. The signalling system part is quite simple; it's not actually a JSON file, but it could be. It's just declaring what properties the signals may have, and there is code to check, when we actually construct the blocks, that they are correct. So basically, sanitization of user input. And to make a driver, you just decide what the possible transitions for your system are, and you implement them; it's basically one function per class. So it's actual code, but it's quite simple.

When you do the route planning, you know of course what rules apply, such as scheduled times, but I can imagine that one train will occupy a zone for a short time, and another will have to wait a bit for it to become available. So there is an optimization problem there: what do we do? How does this system tie in with the actual timetable planning? So, the timetable planning in operational studies is done manually.
That's because the people doing operational studies actually do this manually, or want to do this manually, for now. And in the case of STDCM, we cannot change the path of any other train, because those paths are already sold. So we can add a new train, but it must not interact with any other train. So essentially this gives you a yes or no? Yeah, it gives you a yes, it's possible, and this is the fastest path. Again, for now. For now, yeah.

What was the difficulty, was it a challenge, with TVM? Because of the speed limits tied to blocks? Well, yes, it was a challenge to integrate into the design. But by now it's not a challenge, it's just developer bandwidth.

Does the driver of the train see some kind of nice map showing how fast they can still go? Or do they just see the green light, red light, orange light, double red light, and react to these very basic signals? So, with BAL signalling, they only see the green light, white light, et cetera. TVM is actually a cab signalling system, so the driver sees inside the train what speed he should go. But on BAL, I don't think so; there is no connection, so the driver just looks out of the window and sees what he has to do.

What challenges do you have now? For instance, how does this simulation help in case of delays, dynamically reallocating paths or timetables when there are delays, at scale? So, we do not currently support any dynamic simulation, but we plan to; we hope so. For dynamic simulation, you have pretty much the same constraints on simulation: you need to actually simulate what state the system is in at any point, but you also need to know the resource needs in front of a train. So those situations are for now resolved manually, from the control center? Oh yeah, OSRD is not used in real operations in real time. I think at SNCF most of it is done by the experience of the regulators. One last question, maybe, a short one. Yes, please.
What are the safety requirements of the company? No, sorry, no more time. Thanks. That's it. Thank you.
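The declarative signal driver described in the Q&A above (declare the possible aspects, then implement the transitions as essentially one class with one function) could be sketched roughly like this. The names and the BAL-style three-aspect logic are illustrative assumptions, not OSRD's actual API:

```python
# Hypothetical sketch of a BAL-style signal driver: the signaling system
# declares which aspects exist, and a single transition function derives
# the aspect shown from the state of the protected block and the next signal.
from dataclasses import dataclass

ASPECTS = ("GREEN", "YELLOW", "RED")  # declared properties of the system

@dataclass
class SignalState:
    block_occupied: bool  # is the protected block occupied?
    next_aspect: str      # aspect currently shown by the next signal

def bal_transition(state: SignalState) -> str:
    """The 'one function' of the driver: map downstream state to an aspect."""
    if state.block_occupied:
        return "RED"
    if state.next_aspect == "RED":
        return "YELLOW"   # warn the driver that the next signal is closed
    return "GREEN"

def validate(aspect: str) -> str:
    """Mirror of the input sanitization mentioned in the talk."""
    if aspect not in ASPECTS:
        raise ValueError(f"unknown aspect {aspect!r}")
    return aspect
```

Adding a new national system would then amount to declaring its aspects and writing its own transition function, which is why contributions from people who know, say, German overlaps are so valuable.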
How we at Deutsche Bahn develop IoT use cases quickly and cost-effectively
Okay. Yeah. So great, we managed to set up everything. We have a demo, so we needed to do this. So without further ado, I'm very happy that Holger is here to tell us a little bit about this IoT use case and how it's done. We have a bit less time, so please condense it as much as you can, but take your time. Yeah, thank you very much. I will do my best so that we finish this slot just in time. But let us start. My name is Holger Koch and I'm working for DB Systel, the IT company of Deutsche Bahn. I'm the product owner of an IoT team; we work with applied IoT. And when I'm not doing IoT, I'm a member of the AK Open Source of Bitkom, the German digital industry association, pushing forward the open source idea. Some words about my employer: DB Systel is a 100 percent subsidiary of Deutsche Bahn AG and is the digitalization partner for all Deutsche Bahn companies. We currently have 7000 employees and manage over 500 projects and services in the cloud. And if you are looking for a new challenge, please take a look at our job portal. Okay, let's start. What is the Internet of Things? Here's the definition from Wikipedia, but I would like to describe it in my own words. The aim of the Internet of Things is to measure conditions in the real world, to link and evaluate this information, and ultimately to derive measures from it. We unfortunately have only a little time, but I will try to give you a very deep insight into practical usage of IoT, and as an example we will realize a practical project in this talk: we would like to measure the air quality inside this room. I think it's also a quite fun topic. We will touch on the themes: where can I get the sensors? How can I transmit the data? And finally, how will the data be processed, stored, visualized, and so on. So let's start with the sensors. Where do the sensors come from?
Yeah, after understanding the customer's problem and determining suitable metrics, the question arises which sensors can be used to reliably determine those metrics. Normally we try to buy them from the market and use standard sensors, but from time to time there are no sensors available for a topic. Then we give contracts to our DB companies, like DB Kommunikationstechnik or DB Systemtechnik, or maybe external partners, to develop the sensors to our specifications. And from time to time we do some in-house development, and for this we use sensor platforms, for example Adafruit Feather, Wemos, or Tinkerforge. For our project to measure the air quality inside this room, we use Tinkerforge. We took a look at the Tinkerforge portfolio and found two interesting sensors. One is the Air Quality Bricklet. This sensor measures the air temperature, pressure, humidity, and an air quality index. The air quality index combines measurements of some gases and other values into one calculated value. The second sensor is the Particulate Matter Bricklet. This measures the particles in the air, for example fine dust. Both sensors are connected to the Master Brick, and the Master Brick handles the communication between my laptop and both sensors. We can now take a look. I connect it to my laptop and hopefully you can see this. We fire up the Tinkerforge Brick Viewer, make a connection, and we see all the bricklets and bricks that are connected together, and we can see some values from them. Without writing a line of code, you can make a first analysis: is it possible to measure the right values with these sensors, and is it worth going further? Okay, let's go back to the presentation. The next step is connectivity. How is the data sent to our backend systems? For this, there are a lot of transmission protocols available in the IoT environment. Here are the four important ones.
But it's really difficult to pick the right one, because some need further infrastructure, for example gateways, or bring other costs, like monthly fees. You also must look at things like bandwidth, coverage, energy consumption, and so on. For our example, we simply use the Wi-Fi connection of my laptop, so it's really easy. Normally we use Narrowband IoT, because we can't use standard Wi-Fi in the field, and Narrowband IoT, based on LTE, is more or less available everywhere. Okay, we use the MQTT protocol. MQTT is more or less a producer-consumer model. The producer writes data into a topic on the message broker; a topic is something like a directory structure. The consumers can subscribe to exactly this topic, and when the producer sends data, the consumer can read it, or gets it pushed from the message broker immediately. We use AWS for this, and there is a product called IoT Core. IoT Core is a perfect MQTT broker for us because it's fully managed, auto-scaling, and so on, and you can just work with it. Okay, then let's take a closer look at the code. I'm not a programmer, but it's so easy anybody can work with this. Tinkerforge has a lot of examples, and it's more or less intelligent copy and paste. You take an example; every sensor has a unique ID, so you can connect several air quality sensors together. You import some libraries, and here is the important part: we take our certificates for the MQTT communication, and we create two callback functions, one for the air quality sensor and one for the particulate matter sensor. And here you can see it's easy: one call to the library, and you get all the information from the sensor. Here we have a little bit of formatting, print it out, or write it to MQTT. The same for the particulate matter sensor. It's really easy; there are examples available everywhere for this. And we can fire this up.
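The callback-plus-publish pattern described here can be sketched as follows. The broker is stubbed with a list so the snippet is self-contained; with the real Tinkerforge bindings and an MQTT client such as paho-mqtt you would register these callbacks on the bricklet objects and publish to the broker instead. Field names, topics, and scaling factors are illustrative assumptions:

```python
# Sketch of the two-callback pattern from the talk: each sensor callback
# formats its reading as JSON and hands it to a publish function. The MQTT
# broker is replaced by a list here so the example runs standalone.
import json

published = []  # stand-in for the MQTT broker

def publish(topic: str, payload: dict) -> None:
    published.append((topic, json.dumps(payload)))

def on_air_quality(iaq_index: int, temperature: int,
                   humidity: int, air_pressure: int) -> None:
    # Tinkerforge-style callbacks deliver scaled integers
    # (e.g. temperature in hundredths of a degree Celsius).
    publish("sensors/air_quality", {
        "iaq_index": iaq_index,
        "temperature_c": temperature / 100,
        "humidity_pct": humidity / 100,
        "pressure_hpa": air_pressure / 100,
    })

def on_particulate(pm10: int, pm25: int, pm100: int) -> None:
    publish("sensors/particulate_matter",
            {"pm1_0": pm10, "pm2_5": pm25, "pm10_0": pm100})

# Simulated readings, as if the bricklets had fired their callbacks:
on_air_quality(57, 2150, 4300, 101325)
on_particulate(3, 5, 7)
```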
We start this Python program, and here we can see the values. The values are formatted into JSON and then sent to our MQTT broker. Okay, let's come to ThingsBoard. ThingsBoard is a relatively new software, it started in 2016, but in this short time it has become more or less the market leader among open source IoT platforms. One question around: who has ever heard of ThingsBoard? One, two, three. Okay, perfect. It's open source software under the Apache license, and it is an all-in-one solution: all aspects of IoT are covered, it's API-driven, so it's really easy to configure your system, with reporting, scheduling, visualization, and so on. And the best thing is the rule chain. The rule chain is a little bit like Node-RED: there you can configure whatever you want to control a backend system. For example, if the air is too bad, then open the window. Okay, next step: it's always a good idea when you use open source software to take a look at OpenHub. And here are... oh, five minutes left. Okay, it's good software, it's well maintained, it's a microservice architecture, and it's really easy to install. And I will show you a little demo. Okay, here's our ThingsBoard system. I fired it up, and first we must create an integration. An integration is the part that subscribes to an MQTT topic on the broker, maybe you remember. I prepared it before, so I don't do it now. The next step is to create a converter. A converter is for preparation of the data. Sometimes the data is in degrees Fahrenheit and you would like to have it in degrees Celsius, so you can prepare the data for storage and so on. Okay, dashboard. We create a new dashboard first, and we add a new widget: temperature, from the air quality sensor. We select the device from which we would like to visualize the data and which data key we want, the air quality index, and for the first step we would like the temperature.
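The converter step described here amounts to a small unit-normalization function. ThingsBoard converters are actually written in JavaScript inside the platform; this Python sketch only illustrates the idea, and the field names are made up:

```python
# Toy uplink data converter: take a raw payload from the integration and
# normalize its units before storage, e.g. degrees Fahrenheit to Celsius.
def convert_uplink(raw: dict) -> dict:
    out = dict(raw)
    if "temperature_f" in out:
        fahrenheit = out.pop("temperature_f")
        out["temperature_c"] = round((fahrenheit - 32) * 5 / 9, 2)
    return out
```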
And so we create our first widget on our dashboard. You repeat this a few times, and then it looks like this. Okay, I'll let it run, and we'll take a look at it in three minutes, because the system needs some minutes to measure the correct values; the sensor is a little bit self-calibrating. Okay, back to our presentation; I would like to speak about some use cases. From time to time we run an IoT hackathon, for example with our customers, to better understand their requirements or to find possible solutions very quickly and test them. And from time to time we do this also for HR, to attract new employees, or to work with students, for example at a digital summer school or similar events. Okay, here's an example, our environmental sensor. It measures temperature, humidity, pressure, and, best of all, particles: the count of the particles in the air, the mass of the particles, and vibrations. Why did we build this? Some colleagues from digital signal boxes told us there is maybe a connection between pollution, fine dust and so on, and errors occurring in our signal boxes. And I have a screenshot here, it's really recent. Does anybody have an idea what is notable in it? Okay, it's difficult to see. New Year, exactly. Twenty minutes after New Year's Eve we see a massive rise of fine dust inside our signal boxes. Nobody knew how that could be. At the moment 300 sensors are rolled out, and 15 to 20 signal boxes show this. Now we have to do some evaluation: how can this be, and maybe it's a good idea to power off the air conditioning or the ventilation system, or take a look at which windows are open, whatever is causing this. And here is another example, another use case: pest control. With pest control, we can measure the rat visits and so reduce the amount of very toxic baits.
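The kind of evaluation mentioned for the New Year's Eve fine dust spike could start as simply as flagging readings far above a baseline. This is a toy sketch with invented numbers, not DB Systel's actual analysis:

```python
# Flag particulate readings as spikes when they exceed the median baseline
# by a given factor. Good enough to spot an event like fireworks at midnight.
from statistics import median

def find_spikes(readings, factor=5.0):
    """Return the indices whose value exceeds `factor` times the median."""
    base = median(readings)
    return [i for i, v in enumerate(readings) if base > 0 and v > factor * base]

# Fine dust counts per minute; the jump mimics the New Year's Eve fireworks.
counts = [12, 14, 11, 13, 12, 180, 220, 15]
```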
And that was another example. To summarize: ThingsBoard is a perfect open source software if you would like to realize IoT projects really cost-efficiently. Okay, thank you very much. APPLAUSE We don't have time for questions, but I do want to see the dashboard. Yeah, yeah. What does it mean? Please? What does this value mean? Yeah. No, no: this is below 50, which is really good. But I'm not sure how long it takes until the sensor has recalibrated. Yeah. I think the value is still too poor, so maybe we should wait half an hour or so. Okay. Thank you very much. Thank you. And is this comparable, is it a bit like openHAB? Basically, it's a competitor to openHAB and all the other stuff. Now, openHAB is more or less...
Transportr: the Past, the Present and the Future
So we are coming towards the end of the program. We have two short community talks as the final talks of this morning. And I'm very happy that we have Mikolai here to talk about Transportr. Many of you probably know it as one of the free applications for public transport information. So yeah, it's your stage. Thank you. Perfect. Can you hear me without a microphone? Or is it just for the recording? Perfect. So yeah, then welcome to my short little talk about Transportr: its past, the current state, and a glimpse into the potential future. Let me maybe start the talk by asking who of you is using public transport regularly? Great. I could have guessed that, I guess. Then some of you may know this kind of problem. You travel somewhere, there's a different public transport system, and to find your way through it they want you to download their own app, usually proprietary, from Google Play. And at some point your home screen gets cluttered with all of these apps. One alternative that you may know is DB Navigator, which works quite well in Germany; it includes a lot of regions with decent data quality as well. But first of all, since the new update I think there's no map inside anymore, which I find, well, a bit sad. And then some people found out that DB Navigator is connecting to a lot of tracking services, even if you declined that. So maybe that's not what you want to use. Well, Google Maps is another option; I guess we don't have to talk about why you maybe don't want to use that either. So, as you guessed, Transportr tries to be an alternative to these kinds of apps. It was created in 2013 by Torsten Grote. As you may notice from that picture, that's not me. I'm Mikolai; you'll find me as ialokim on GitHub. And I started contributing to Transportr in 2017. So when you open Transportr, it might look like this: you have a list of networks.
You can choose where you are, and then you can basically look for journeys as you would expect. In this short talk I will first tell you a bit about how Transportr works internally: how do we get the data, basically. Then, as I said before, the past, present, and future of the project. First of all, these official apps: they have their data source, usually in some proprietary format, and they have apps that talk to some APIs that provide the data. In the case of Google Maps it's a bit different. They don't use the data directly; they use a format called GTFS. That's a standardized public transport format, initiated by Google, but it's an open specification. So you can create your own GTFS files and also consume GTFS files as you want. And that's what Google uses internally for their public transport routing. Now, where does Transportr come into play? Maybe you've heard of Öffi before. That's another app that also works on Android, developed by Andreas Schildbach. Even before Öffi itself was open-sourced, Andreas Schildbach had already open-sourced a library called Public Transport Enabler. And that is basically the wrapper that contains the logic to connect to and understand the data from the official APIs. Transportr is using that same library. So huge thanks to Andreas at this point for open-sourcing this and making Transportr possible. Then there's also a second part of Public Transport Enabler where you can consume GTFS files via a proxy. In that case you don't use the GTFS files directly, and you don't perform routing on your phone, but you use some third-party provider. What Public Transport Enabler was using is Navitia, from a French company. They provided this service for free, basically consuming the GTFS files and then exposing them as an API to interested apps.
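To give a feel for why GTFS is easy to produce and consume: it is essentially a zip archive of plain CSV files. A minimal, self-contained look at a made-up `stops.txt`:

```python
# Parse a tiny, invented GTFS stops.txt into a stop_id -> stop_name mapping.
# A real feed is a zip of several such CSV files (stops, routes, trips, ...).
import csv
import io

stops_txt = """stop_id,stop_name,stop_lat,stop_lon
S1,Managua Central,12.13,-86.25
S2,Masaya,11.97,-86.09
"""

def parse_stops(text: str) -> dict:
    return {row["stop_id"]: row["stop_name"]
            for row in csv.DictReader(io.StringIO(text))}
```

This simplicity is what made community projects like the Nicaragua feed mentioned next possible: volunteers can build a feed with nothing more than spreadsheets and a text editor.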
And that's actually how I got in contact with Transportr: when I was spending some time in Nicaragua, I was working there with some other volunteers to gather public transport schedule information, put that together into a GTFS file, and in the end make routing possible, at least for a limited region, with apps such as Transportr and Öffi. So now that you know a little bit more about the internals, I would like to go on with the project itself and how it evolved. This is the graph of the code frequency on GitHub. As you can see, there was quite a lot of activity in the beginning: initial commit in 2013, release 1.0 in 2015. And then in 2018, with this huge spike, there was a major rewrite of the app. Most of this was done by Torsten Grote. As you can see, afterwards activity declined a bit; both Torsten and I were busy with other stuff. So this talk is actually an attempt to attract some new contributors to Transportr. Maybe you've noticed that at some point we even got removed from the official F-Droid repository, because they found out that the map library we were using was not fully open source. It became necessary to switch to an open source fork of that library that doesn't include the non-free dependencies. Another thing that happened last year is that Navitia changed strategy: the new version of their software is not open source anymore, and they also stopped serving a lot of regions. So at that point, for example, Nicaragua was not available anymore, which is a bit of a shame. In 2023 we also got some new interest from the community; people were asking about the future of the Transportr app. That brought some new energy, and we finally finished the migration to the new map library. And since one month ago we actually made it back to F-Droid. We're back there.
But as I said before, we lack some regions that were supported before, because Navitia stopped providing them and because some APIs also broke over time. Well, as I said, there are some new contributors; there's an effort to move to a new design theme, which is great. There are quite a few open issues; some of them are bugs and many of them are feature requests. A lot of them are actually marked as so-called beginner jobs, which means they are supposed to be quite easy to tackle. So if anyone watching this or sitting here feels like looking at some Kotlin or Java Android code, feel free to pick one of those and try to work on it. Apart from Transportr itself, it's also nice to see that the whole ecosystem of similar public transport apps is growing. There's Öffi, as mentioned before, open source now for some years as well. Then there's KDE Itinerary, an app that does even more things than what Transportr is trying to do, like saving your tickets. There's another Linux app, GTK-based in that case, which is pretty new and also looks really nice. And on iOS there's also an app; I'm not sure if it's fully open source, but at least there is some variety to choose from. Looking at this growing ecosystem, I think it would be nice to try to combine efforts in some way. And maybe what would be nice as well is to find an alternative to what Navitia was providing before: some kind of shared service, maintained by the community, that can consume the GTFS files that are available for a lot of places in the world and provide an API that can be used by all of these apps and more. So that's it from me. I have three steps for you: if you haven't already, download Transportr, either from F-Droid or from Google Play. If you find anything that doesn't work as you want, tell us. Look at the code, contribute, and have fun using public transport. Is there one quick question? Hello there. I tried to navigate here in Brussels yesterday.
Yes, Belgium is one of the regions that broke; the API broke, and we would have to look into what kind of API they're using now, so feel free to look into that and contribute. Sorry, we don't have any time left for more questions. Please talk to him, please contribute, and we are moving to the next presentation. Thank you.
Software needs of a volunteer operated heritage railway
So, we're coming to our last talk for today. I'm very happy that we are closing with a real train operation, and in this case we're talking about how to do that with open source, how open source can help there. And yeah, Niels, it's your stage. Yeah, thank you. So my name is Niels. During the week I try to make medical devices talk to each other, and on the weekend I'm playing with trains. I'm working at the Dampfbahn Fränkische Schweiz on the weekends, which you can't see because the beamer is too bright. For location: this is Forchheim, which some people might know from the medical industry, because that's where my employer is, and the next bigger city is Nuremberg, which is somewhere around here. We have a short line, 16 kilometers. It was closed down in 1974; we have been running it since nineteen-eighty-something. There are something like 30,000 passengers per year, so we are decently sized for a heritage line or museum railway. We have 400 members in the club, of which 40 are actually active. We run steam and diesel every Sunday from May to October, occasionally a holiday train or special train or whatever, but May to October is the main season. We are completely volunteer run: we have a professional safety manager, but everything else is done by volunteers. We are a real railway running under the NE regulations for non-federal railways, so slightly easier rules than Deutsche Bahn has, but still real railway rules. And because we are kind of the only railway in the region, a lot of local initiatives are in contact with us. There are some initiatives that want to reopen public transport on the line, which for us would be good, because then we would get a lot of Trassengebühr, track access charges, and it would also help the area quite a lot. So why do I give this talk here? First, I want to put heritage lines on your radar: A, to come and visit us, and B, because we have a lot of need for people doing IT stuff.
And the interesting thing is: some of the heritage lines have their own line, where we can do more or less whatever we want, within what the EBA allows in Germany. So we are the perfect experimentation ground if you want to try out some stuff. If you look at Europe, we have about 100 heritage lines in Germany. The UK is the absolute mecca; they have, I don't know, far above 100. There are about 50 in Austria, 20 in France, 10 in Belgium. So all over Europe you will find some lines. They are organized in larger communities: in Germany it's the VDMT, in Austria the ÖMT, and the UK has the Heritage Railway Association, the HRA. So there's a bigger group per country, and there's a European organization called FEDECRAIL. What's our problem? We like trains, but we are horribly bad at computers. These are kind of the typical members: that's me after a training shift as a fireman; this is our engineer, who is in real life a state attorney; and this was my trainer as a fireman, who is in real life a medical doctor. So I'm the only one in this picture who does something with IT, and I'm probably one of three in the whole club. So: big problem if we want to run anything in IT. What do we do? Of course, we do the stuff that a normal railway does. We sell tickets. We run trains. We operate and repair the infrastructure, so we have switches and signals and things. We operate and repair rolling stock, so we have coaches and wagons and locomotives and everything. Not the usual stuff Deutsche Bahn has to take care of; there's a little overlap, we have a V60, which I think is still in operation at Deutsche Bahn as well, but most of the stuff is 80 to 100 years old. And we have workshops and sheds and all the infrastructure around that you need to keep a railway running. But we also have a nonprofit side of things. We have archives; we do this to preserve history, so we have a lot of documentation on our trains.
We have photography, everything on the historic side of things. We are a club, an eingetragener Verein in Germany, so we have to do membership management and all the paperwork you need to do for a Verein. And we need to get money for everything somehow, so we need to organize donation campaigns and try to get funding for things. We cannot run on tickets alone: a full inspection of a steam engine costs about half a million euros, that is about 10 years of running, and we need to do this inspection every 10 years. And we have four steam engines. So you see the problem. Okay, we still run railways as in the 1950s. Our line closed down in 1974 and is more or less in the same state; the signalling, everything, is still like in '74. Our active members, unfortunately, are getting older. We do get new active members again, but unfortunately the everyday workload on people has also increased, so you cannot spend all your free time at the railway anymore; some people need to earn money. And this is decreasing the time that people can spend on the railway. There are higher safety requirements: even if we run railways like in the '50s, we still have to fulfill all the safety requirements of the 2020s. So this is quite challenging. Our customers want more: you cannot come and say, yes, the ticket office is open from 8 till 9 on Saturdays. They want to buy a ticket on the internet. And of course there are growing regulations and administrative effort, which you have everywhere. So, the problems we have. Tickets: we still sell these Edmondson tickets, as you can see there, these cardboard things, and one of our members even has a printing machine for them. So this problem is solved. But we also want to sell tickets via the internet, and there's not really a good solution. There's Fahrkartendrucker.de in Germany, which works, and works reliably.
But this thing is stuck in the '90s: if you look at the layout, it doesn't have a responsive design, and it's really hard to use. The backend is quite okay, but the frontend for the customers is horrible. Unfortunately, it's the only thing we have. The other option would be some kind of event ticketing software; a lot of people here probably know pretix. pretix is absolutely great, but not made for railways. It starts with seating arrangements: usually we want an algorithm where not everybody sits at the windows, but where, if you book multiple places, you get one seat at the window, and then the bay fills up; only then are you allowed to hand out the next window seat, because otherwise all the windows are taken, the rest stays empty, and people say: I want a window, I won't go. There are hundreds of bachelor's and master's theses on how to do a ticket selling system, but none of them has made it into open source software, unfortunately. So this is something where Fahrkartendrucker works, but help with using it is needed. Running trains: for timetables, for us it's pretty simple. We have one line, one train, so we have used the same timetable for 20 years. This works. But if we get more complicated, we might want to run two trains, and then we need a serious timetable. We now use jTrainGraph and FPLedit, which were made for model railways; they work really well. The FPLedit author also added GTFS export now, so we might hopefully show up soon on Google Maps, OpenStreetMap, Transportr, and all the other apps which can use GTFS. So there's probably also some larger software that's interesting. We have some stuff like the signalling: this is our signal box, the complete signal box, and the other safety feature you need to know about is a key. This is something where we can improve. It might be something as simple as putting a GPS tracker on the train, which then has the other problem:
There's no mobile phone reception on our line, because we're in Germany and in the middle of nowhere. So there are lots of things where, for example, the IoT people can have a field day. Passenger information systems, whatever: there are a lot of places where you could create new software which would help us a lot. Managing rolling stock: right now, train cars have regular inspection dates and a lot of paperwork attached, and this is managed in a Nextcloud and an Excel sheet. And that's already the advanced technology solution; usually it's paper folders. So a lot of things there: we have reports, regulations, whatever. This is kind of a nightmare right now. But we also got good feedback from our regulating body, because we handed them readable PDFs and they said they were better than what they get from Deutsche Bahn. So what's our problem? Basically, the left side is the museum railway half, where we need to know our problems. We still don't really understand the problems well, so we need to get better at that; we need to find a solution, and we need to be able to apply the solution. That will be the big thing. The other side is the software side of things: it needs to fit our problem, and we need to be able to find the software. If you search for ticket systems, you will find Jira and all that stuff: completely unsearchable terms. And it needs to be really easy to use, because we are not good at computers. This whole thing started at the Gulaschprogrammiernacht last year. We did a workshop at the Chaos Communication Camp, and a small group formed trying to get the IT nerds and the railway nerds together. And that's also what I want to present here. For you: why should you bother if you don't like playing with trains? Well, playing with trains is fun. But you could also use a museum railway as a learning ground.
So if you work in software and want to do something for transport but don't know how a railway works, we are a place where you can learn that. We are an experimentation area: you can do a lot of stuff on museum railways, the railway regulations are quite open for experimentation, and I'm really surprised what's possible sometimes, coming from medical devices where you can't do anything. And you can use this as a test bed where you have the simple case: one line, one train, or one line, two trains if it gets more complicated, but we don't have the full network like DB has. How can you join? To have a super easy entry point, we created a Discord chat which you can just join. We are currently four or five people, so still small; we hope to grow. We have a wiki at kaosban.net, which was the original idea dumping ground and is now getting a bit more formalized into a knowledge base, where we try to collect the problem cases and the possible solutions, and where there aren't solutions, to get an overview of things. And we're now starting to network with the different heritage line associations: at the VDMT meeting in three weeks in Aachen, I will present basically this talk again and do the same publicity for heritage lines there. So, I'm done. Join us on the Discord if you like; we're open for crazy ideas. And if you want to play with trains, there's a museum railway near you which will normally take you in with open arms. There's one question. Yeah. More information: you started this with a NeTEx presentation, and in Norway we have six heritage railways that are using that tool to produce NeTEx data, so they are integrated in the national trip planning. Yeah, I made a lot of notes during your talk. So, just repeating for the video: in Norway there are six museum railways already using the NeTEx tool from the first talk.
So if you haven't, watch the first talk. Do the regulations on station visibility, that the station has to be on a straight section of track, apply to historic railways? We are at the limit of what I know here. I think we have some kind of heritage protection, so old stuff can stay. But for example, we have one halt in a curve which we cannot use anymore, because the border of the platform is not really there; it's just a meadow which ends at the track, and we would need to put a clear border between the platform and the track. So there are some regulations and some safety rules, but I think not 100% of what the big railways have. In the Czech Republic, unfortunately, a lot of cities have lost their railway service because of this regulation. Yeah. So, for the video, the question was whether the regulations for stations, that they must be on a straight part of track and that everything must be visible, apply to museum railways; and in the Czech Republic a lot of towns have lost their railway access because of that. So, one last question. Thank you for the presentation, I really enjoyed it, and I also have a lot of ideas. Would you consider getting into the DB network? Because, for example, in Italy there is the Fondazione FS, and you can actually buy tickets for their heritage trains within the normal ticketing system. So the question is if we have considered joining DB Netz or DB infrastructure as well. Not really. For ticketing, for example, we didn't consider that, because it has worked so far; it is a lot of manual work, but somehow it works. And everything external brings external costs, because DB doesn't do stuff for free, but we get work time for free. So if we can either do it by hand or have to pay for it, then we do it by hand. But they do have it on their mind. Yeah, there might be something; I haven't really looked at that.
But for tracks, for example: when we go out, we are running on DB tracks and have to join their tools and work with their tools to get into the timetables. OK, great. Thank you, Niels. Thank you.
Enhancing the video call experience with Forward Error Correction
So good morning. I am Flore Harlé and I am here with my colleague Jehan Monnier to present a way to enhance the video call experience by using flexible forward error correction. We work at Belledonne Communications. This company develops the Linphone product, an open source softphone for making video and audio calls, and it works on several platforms. Today I will explain how we implemented forward error correction for our video calls: first why we chose the flexible FEC scheme, then how it is described in RFC 8627, how it has been developed in our products, and I will show you some results. So first, let's talk about forward error correction. On this very schematic representation of a video call, you have two people who share a video call. On the sender side you have a device that captures the video. The signal is encoded by a video encoder that transforms the signal into frames, and those frames are split into packets that are sent to the receiver as a video stream over the network. On the receiver side the packets are collected and decoded, the frames are recovered, the signal is displayed and the receiver can see the video. In the case of a video call we are in a real-time context, so we work with RTP, the Real-time Transport Protocol, which describes how you can send video or audio over the internet. This protocol describes the format of the RTP packets. It runs over UDP, so you don't have the latency problems of retransmission-based transports, which makes it well adapted to real-time communication. Unfortunately, in the real world you have problems. You can experience problems with your network: the traffic may be high, you can have a low bandwidth, and sometimes you lose packets during the transmission. So the receiver doesn't collect all the packets, the signal is not complete, and you see that your video can freeze, which is really annoying for everyone.
To overcome that you can use a strategy to recover your lost packets. With forward error correction you recover the lost packets by using the other packets and redundancy information that is sent at the same time as the video stream. This way you can recover your full video and have a nice video call. We chose to use the flexible scheme for our project. So when you detect loss on the receiver side, there are several strategies you can apply. You can ask for the packet to be sent again, but you will have to wait to get the packet. You can preemptively decide to send the video stream twice, but it is really costly. Or you can try to recover the packets with what you have at the time. Forward error correction allows you to recover the lost packets by using the redundancy information and the other packets. There are several algorithms, for example the low-density parity-check codes, and there is flexible forward error correction. That is the method we chose because it is really simple: it is based on combining packets with an exclusive OR (XOR) operation, and it is free, there are no patents. It is a recent standard; it has for example been implemented in WebRTC, so we can be interoperable with it. The standard is described in RFC 8627. This document describes fully how an RTP stream can be protected with flexible FEC. It gives the format of the repair packets that are sent to carry the redundancy information, and it describes the procedures to generate those packets and to decode them to reconstruct the lost packets. This RFC is applicable to all media, not only video but also audio, text and application data. So now we will explain how it works as described in this RFC document. First, when you have a video stream you send the packets within an RTP session with a source RTP stream. Your packets are represented here by the squares; they have a unique sequence number that increases with time.
And when you want to make flexible FEC you add another, redundancy RTP stream, and you don't change the source stream, so it is backward compatible. The principle is simply to take a set of source packets, combine them with XOR in a parity FEC encoder, and generate a repair packet — here, for example, this one is called R4. Why use the XOR operation? Because of a nice property of this operation: you are able to recover one of the packets if you have all the others. So you can encode a repair packet and decode a missing packet. On the receiver side, when you detect a loss — for example here the packet S4 has been lost — you can get it back by applying the exclusive OR over S6, S5 and the repair packet R4. And then this recovered packet can be inserted back into the stream here. To operate your flexible FEC you can choose several parameters. You have to decide the length of your repair window: it is a duration interval during which you buffer your source packets, to be sure that you have enough source packets to make the recovery. And you have to decide which packets you will combine with which, within a protection pattern. So now we present several protection patterns. If you represent your source packets as a block, here from S1 to S(D×L), with L columns and D rows, a first way to protect them is row protection, a one-dimensional non-interleaved protection where the XOR is applied on the rows. Here you generate D repair packets, each protecting a set of source packets of length L. Another way is to combine them by columns: here you have L repair packets that protect the source packets with depth D. Now I will show you how you can recover the source packets with these combinations. Here you have an example with row protection and here with column protection. Because you have random losses in your transmission, you can apply the XOR to recover the lost packets, here with the row-wise application of the XOR and here with the columns.
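The XOR trick just described can be sketched in a few lines. This is an illustrative model, not Linphone's implementation: packets are plain byte strings, padded with zeros to a common length before XORing, as the RFC requires, and the function names are made up for the example.

```python
# Toy model of XOR-based repair in flexible FEC (RFC 8627).
# A repair packet is the XOR of a set of source packets; any single
# missing source packet is the XOR of the survivors and the repair.

def xor_packets(packets):
    """XOR byte strings together, zero-padding them to the same length."""
    size = max(len(p) for p in packets)
    padded = [p.ljust(size, b"\x00") for p in packets]
    out = bytearray(size)
    for p in padded:
        for i, byte in enumerate(p):
            out[i] ^= byte
    return bytes(out)

# Source packets S4, S5, S6 and the repair packet R4 = S4 ^ S5 ^ S6.
s4, s5, s6 = b"hello", b"world!!", b"fosdem"
r4 = xor_packets([s4, s5, s6])

# If S4 is lost, XORing the surviving packets with R4 gives it back
# (the trailing zero padding is stripped using the real payload length,
# which the repair packet carries in practice).
recovered = xor_packets([s5, s6, r4])
assert recovered.rstrip(b"\x00") == s4
```

The same call recovers any other single loss from the set, e.g. `xor_packets([s4, s6, r4])` gives back `s5`.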
But in some cases it will be more difficult, because if you have bursts in your transmission, it means that you lose consecutive source packets. You won't be able to recover, because with rows you can't have both packets here; and here you will recover the columns that have only one loss, but not the columns that have more than one loss. To overcome this problem you can make a two-dimensional protection. Here you simply have the combination of row protection and column protection, and it generates L plus D repair packets. In this case the RFC gives you an iterative algorithm to recover the lost packets. Here I show you two examples, with a long burst here and with random losses here. The algorithm starts like this: first you repair all the rows that can be repaired, then you apply the XOR on the columns, and you repeat — rows, then columns — until you can't repair any more packets. Here you can see that the burst has been fully resolved: all the packets have been recovered. But sometimes you don't have such chains, and you can't recover some loss patterns that are connected like a cycle, as here. In this case you can't do much more with flexible FEC. But this two-dimensional protection is really efficient for bursts. Sadly it has a cost, because you have to send a lot of repair packets. You can measure the impact on the bandwidth you will need with this term, the overhead. It is the ratio between the number of bytes of the repair packets that you send over the number of bytes of the protected source packets. Usually the repair packets are bigger than the source packets, but if you suppose that all the source packets are approximately the same size, the overhead will be 1/L for the row protection, 1/D for the column protection and 1/L + 1/D for the two-dimensional protection. For example, here are the values of the overhead for several values of L and D, with increasing protection level. You see that the overhead increases very fast.
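The iterative row/column repair described above can be sketched as follows. This is a toy version, assuming packets are modelled as integers (XOR works the same on byte strings) and that the row and column repair packets arrived intact; it is not the RFC's wire-level algorithm.

```python
# Iterative 2-D recovery: repair every row with exactly one missing
# packet, then every column, and repeat until no progress is made.
from functools import reduce

def recover_2d(grid, row_xor, col_xor):
    """grid: D x L matrix of packets, None marks a lost packet.
    row_xor[d] / col_xor[l]: XOR of the original row d / column l."""
    progress = True
    while progress:
        progress = False
        # Rows with a single missing packet can be repaired directly.
        for d, row in enumerate(grid):
            missing = [l for l, p in enumerate(row) if p is None]
            if len(missing) == 1:
                known = reduce(lambda a, b: a ^ b,
                               (p for p in row if p is not None), 0)
                row[missing[0]] = row_xor[d] ^ known
                progress = True
        # Same for columns with a single missing packet.
        for l in range(len(grid[0])):
            col = [grid[d][l] for d in range(len(grid))]
            missing = [d for d, p in enumerate(col) if p is None]
            if len(missing) == 1:
                known = reduce(lambda a, b: a ^ b,
                               (p for p in col if p is not None), 0)
                grid[missing[0]][l] = col_xor[l] ^ known
                progress = True
    return grid

# A 2x3 block with a burst of three losses is fully resolved.
orig = [[1, 2, 3], [4, 5, 6]]
rows = [1 ^ 2 ^ 3, 4 ^ 5 ^ 6]
cols = [1 ^ 4, 2 ^ 5, 3 ^ 6]
lossy = [[None, None, 3], [4, None, 6]]
assert recover_2d(lossy, rows, cols) == orig
```

A loss pattern forming a cycle (e.g. the same two columns missing in two rows) leaves every row and column with two holes, so the loop stops without repairing them — exactly the uncoverable case the talk mentions.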
The RFC also describes the formats of the packets. First you have your source packets following the RTP convention, with an RTP header and an RTP payload. And you generate your repair packets, which are also RTP packets with a header and a payload. But within this payload you carry two kinds of information. The first is written in the FEC header: it is the information needed to identify which source packets are protected. And in the repair payload you have the result of the XOR operation between the payloads of the source packets. When you apply the XOR between the payloads, you have to be sure that your source packets have the same length. So sometimes you need to add zeros at the end of a payload in order to have the same length for all packets. A single repair packet carries all the information needed to recover the source packets: the size of the protected source packets and the configuration of the protection. For example, here for R1 you read in the FEC header that L is positive and D is zero: you know that you have a row protection, and the sequence numbers of the protected source packets run from S_N to S_(N+L-1), with consecutive values. If L is positive but D is equal to 1, you also have a row protection, but you are inside a two-dimensional pattern: you know that you will collect several repair packets that protect rows, and then a set of repair packets that protect the columns. And when L is positive and D is more than 1, you have the column FEC protection, and the source packets that are protected are interleaved, from S_N to S_(N+(D-1)×L). It can be a column of the two-dimensional FEC block protection, but it can also be a single column protection in one dimension. This method has been implemented in our project Linphone. We decided to use four sets of L and D parameters.
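The L/D decision table just described can be written out directly. A hedged sketch of how a receiver might turn the L and D fields of a repair packet's FEC header into the list of protected sequence numbers — the function name is illustrative, and real sequence numbers also wrap modulo 2^16, which this ignores:

```python
# Interpret the L and D fields of a flexible FEC repair packet header
# (RFC 8627) as the set of protected source sequence numbers.

def protected_seq_numbers(sn_base, l, d):
    """sn_base: first protected sequence number; l, d: header fields."""
    if l > 0 and d == 0:
        # Row protection: L consecutive packets.
        return list(range(sn_base, sn_base + l))
    if l > 0 and d == 1:
        # Row protection inside a 2-D block: same L consecutive packets;
        # column repair packets for the block will follow.
        return list(range(sn_base, sn_base + l))
    if l > 0 and d > 1:
        # Column protection: D packets interleaved with stride L,
        # from sn_base up to sn_base + (D-1) * L.
        return [sn_base + i * l for i in range(d)]
    raise ValueError("unsupported L/D combination")

assert protected_seq_numbers(1, 5, 0) == [1, 2, 3, 4, 5]   # one row
assert protected_seq_numbers(1, 3, 3) == [1, 4, 7]          # one column
```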
They go from one-dimensional, very low protection, to high protection with three and three. Ideally we would always want the two-dimensional parity protection, but it has a cost, because you have to send a lot of data. So we decided to adapt our protection to the loss rate that is measured in the transmission, and also to the network capabilities. The repair window is 200 milliseconds: it is long enough to collect all the repair packets for any values of L and D, and it doesn't cause a noticeable delay in the video. The RFC has been implemented in C and C++ in our Linphone SDK. All the elements of the FEC stream are written in the library oRTP, and in our streaming engine for video and audio we added a way to manage the video quality with the flexible FEC. A few words about our strategy for the video quality: our rule is to make the best possible use of the bandwidth, but sometimes you don't know the bandwidth at the beginning of the call, it can change during the call, and you have network events to manage. We want optimal video settings — the best definition, bitrate and frame rate — but most importantly we don't want freezes in the video. So we decided to prioritize packet protection over high encoding settings. To adapt to the network events we periodically check several values: we regularly measure the available bandwidth, the loss rate, and the bandwidth that is dedicated to FEC. For example, in this graph you can see that we propose low FEC protection when you have low bandwidth, and we enable a high level of FEC only when the loss rate is very high; but if you have a lot of bandwidth, you can have full FEC protection, it is not a problem. And finally, when you have congestion — meaning there are too many packets and the transmission stalls — you disable the FEC immediately, because it is not the right tool for that and it would make things worse. So now we will show you some video with flexible FEC activated.
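The adaptation logic described here can be sketched as a small decision function: pick one of a few (L, D) protection levels from the measured loss rate and available bandwidth, and disable FEC entirely under congestion. The level table and thresholds below are invented for illustration; they are not Linphone's actual values.

```python
# Sketch of loss-rate/bandwidth-driven FEC level selection.
# (L, D) pairs; D = 0 means one-dimensional row protection only,
# None means FEC disabled. Weakest to strongest.
FEC_LEVELS = [None, (10, 0), (5, 0), (5, 5), (3, 3)]

def choose_fec_level(loss_rate, bandwidth_kbps, congested):
    """Return the (L, D) pair to use, or None to disable FEC."""
    if congested:
        return None  # FEC adds traffic and would worsen congestion
    if bandwidth_kbps < 300:
        # Little headroom: at most the cheapest 1-D protection.
        return FEC_LEVELS[1] if loss_rate > 0.01 else None
    if loss_rate > 0.06:
        return FEC_LEVELS[4]  # heavy loss: strongest 2-D protection
    if loss_rate > 0.03:
        return FEC_LEVELS[3]
    if loss_rate > 0.01:
        return FEC_LEVELS[2]
    return None  # negligible loss: don't pay the overhead

assert choose_fec_level(0.08, 2000, congested=False) == (3, 3)
assert choose_fec_level(0.08, 2000, congested=True) is None
```

With the talk's overhead formula, the strongest level here costs roughly 1/3 + 1/3 ≈ 67% extra bandwidth, which is why it is only chosen when loss is severe and bandwidth is plentiful.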
So here you have a video: we simulate a video call with a moving pattern. In the first window there are 6% packet losses and no protection, so you can see that the video is really bad. It is a very, very bad case — 6% is really a lot of losses. In this window we have enabled the FEC with a one-dimensional row protection with L equal to 5. You see that the video moves a little more, but it is still freezing. In the last window it is a two-dimensional FEC protection with a high level: L equal to 3 and D equal to 3. And you can see that here the video is perfectly fluid, so we have recovered all the lost information. We have measured the recovery rate with several values of FEC protection, and you see that it increases very fast. So the flexible FEC is really interesting to recover the lost packets, and the effects are really obvious. Another example here: this time we have simulated a transmission with losses in bursts, so we lose consecutive packets — a very bad situation. This time you can see that the performance of the FEC reconstruction decreases a little, but it is still interesting. With the two-dimensional parity protection you can now see some freezes, but it is still much more fluid than the unprotected video. So we can draw some conclusions about flexible FEC. It is a simple and efficient way to improve the resiliency to packet loss in video transmission. It is based on sending redundant information on a dedicated stream. It is adaptable to the level and the events of your network, and it works with a short delay, because you don't have to wait for the sender to send you back the missing information. And the exclusive OR operation is really efficient and fast. But you have to keep in mind that you need significant bandwidth, so in some cases it is not indicated.
RFC 8627 gives a complete description of the flexible FEC scheme, and it is clever because it is also backward compatible with the RTP protocol. And we showed that it gives a real improvement in the video quality. So we decided to release it this year in the video calls of the Linphone project, and in future work we want to add it to video conferences and to the audio stream. So thank you for your attention, and we will be happy to answer any questions. Thank you. The question is about the size of the source packets. And in fact you are right, it is an issue that we have to deal with: the source packets don't have the same size, and for the encoding you have to pad the payloads to make the XOR operation. The thing is that when you combine them to build your repair packet, you get very big repair packets, and your overhead will increase a lot because of a few big source packets. So that is a problem you have to deal with. You can change the size of the source packets, if possible, to make the sizes more equal. But you have to measure your overhead, to check that the repair packets are not too big compared to the source packets, and decide to reduce the FEC protection in order to keep the overhead reasonable. But yes, you have to take care of the real size of your source packets. I don't know if that answers your question. Thank you. Yes? Then you always have this fixed delay of 200 milliseconds, right? On the repair window? Yes, we have a fixed value here. The question is: do we have a fixed duration for the repair window of 200 milliseconds, or can it be changed? The fixed delay. The question is whether the video output is put on the screen 200 milliseconds after the respective video packet has arrived, right? Yes, the 200 milliseconds is a delay that you add before displaying your video. Yes? I'm sorry. Okay. Yes, that's it, in fact yes. Yes?
So when you assemble the stream in rows and columns, the second one is reversed, is that right? No, it's not reversed. In fact, sorry, it was maybe not clear in the representation. You have — okay, this one? Yes. The order comes back to here: you read those ones, then those ones, and then those ones. The second question: do you have an example of an SDP line describing how this is negotiated? Those ones. You have an example of an SDP body that contains a line that describes how this is established. The question is, is it mandatory to signal it? So when the stream is set up on the signaling layer — with an offer, I'm guessing — you still use SDP, and this would exist as a line in the SDP to describe how it's established. Yes. The question is to know if, during the call exchange, we signal the use of this protocol in the SDP. I'm not sure. There is a... Yes, okay. Signaling. That's the answer. Okay. Yes. So what you described seems very similar to RAID 5 with disk drives: when you join drives in RAID, you have eight blocks and then you have one block which contains a parity bit for each of the blocks. But there's also RAID 6, which has not one but two parity blocks. Could that be applicable to your scheme here? So you would have a row of five packets, and then not one but two redundancy packets, which could help you recover the row when two packets are lost. Okay. So the question is about what happens if we lose repair packets, for example, and whether the scheme could be improved by having not one parity packet but two. Yes, maybe — it's always a trade-off between what bandwidth you have and what you decide to send to improve the protection. There are other protection patterns described in the RFC: for example, you can decide to protect a small, very specific set of source packets by using a flexible mask.
So you could, maybe in this example, decide to protect some packets twice and some others once or not at all. Yes, it can be an improvement to prioritize the most important packets in your stream. And there are other schemes — one-dimensional parity, two-dimensional block parity. Yes, there are other parity codes. Honestly, I would have to try them to tell you which one is better; I don't know. Probably one of the problems if you apply too much protection is that you also generate a lot of overhead. So at some point, if you're on a lossy network and you send more data to try to recover from more loss, you end up in this spiral that doesn't make things better. So finding the balance is where the black magic usually is. Well, thanks Flo. Well, no, it's okay. Oh, there's one more question. Please go ahead, we have some time. Maybe regarding exactly what you said: how do you know that you don't make it worse? Yes, in fact, we had that problem: at some point we were sending more information on the redundancy stream than simply sending the video stream twice. For that case we control the overhead periodically, and when it goes above, for example, 1.9, we reduce the FEC protection. It's not always indicated, so it's a decision that you have to make; we have established empirical rules to manage that. Yes? I want to ask you about the masking you showed right now. Yes? The slide that is up right now: you have said that you can protect specific packets. Yes, like you protect a group of packets. Yes? That's for example for video conversations: for example, with H.264, you could protect the key frames, from which the other frames are interpolated, isn't it? Yes, so the question is whether you can protect, for example, the key frames of the video conference. Yes, it's a way to choose which packets you want to protect. If you don't want to protect everything, but mainly the key frames, it's a good approach.
Or you can enable the one-dimensional or two-dimensional protection only when you have the packets of a key frame. Okay. So on the receiver side — is it right that all your key frames are in one column and you just protect them? Yes, but the key frames are not necessarily in the same rows or the same columns; you can change the values of D and L whenever you want. On the receiver side, the receiver just reads what it has in the FEC header: it sees the value of D, the value of L, and it adapts the configuration to recover the lost packets. Okay, so you can modify that value dynamically during the... Yes, you can dynamically modify the protection configuration, and it's very powerful. Yes. How do you measure the network's bandwidth, for example, without provoking the network with high load? Yes, how do we measure the available bandwidth? We have an estimator in our program that tries to measure, if I remember correctly, the time delay between the reception of packets, and tries to establish the bitrate, and we see whether congestion occurs or not. But it's based on estimation; we have to deal with that. Yes, the idea of the algorithm that we use is to measure the regularity of the packets at the receiver side, and when it changes, we can deduce that the bandwidth is close to being saturated. This is more or less the method that we use. So do you use RTCP for this configuration? Yes, and we also use RTCP feedback in order to measure packet losses from the receiver side. But that's a bit different from bandwidth: for the bandwidth, it's really the regularity at the receiver side which is measured. Thanks, Flore. Thank you. Thank you.
Shig: distribute and clone live streams among Fediverse instances
How is it possible? This is about interactive live streaming in the Fediverse. How is this possible, or is it possible at all? To me — I'm Enrico and I'm interested in interactive live streams. Sorry. So, now it's better. I'll take it so. Sorry. Here are my contact details; I have worked for different companies, mostly on conferencing system topics. And now we're talking about live streams. In the Fediverse there is a quite interesting situation. When you're in the Fediverse, for example when you're on Mastodon, you read a post. The interesting point here is that the post came to you: you have a Mastodon app or a client, and you don't care who posted the post on which instance — the post itself is cloned from instance to instance through the Fediverse. That means you get a copy, or a clone, of this post. This is a quite interesting concept: the instances communicate with each other in the background. How do they do this? Of course with ActivityPub — we had a talk about it right before this one, so I will not go deep into it. But the main idea of ActivityPub is that you have an inbox and an outbox, and everyone in the Fediverse, in terms of ActivityPub, is an actor. The users are actors, the servers are actors, and in the end you can send a message or a post to every actor in the Fediverse. That's the way it works. ActivityPub describes things as activities — like subscription, follow, and so on — and the other topic is content. It's all described in JSON. And as I said, the instances communicate with each other in the background and the content flows through the Fediverse. ActivityPub and live streams: there are already implementations of ActivityPub in the Fediverse, like Owncast or PeerTube, which are the most famous. But the thing is, we want a little bit more. You have live streams in Owncast and PeerTube, but they are not interactive. It is not possible.
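As an illustration of the inbox/outbox mechanics just described, an ActivityPub activity delivered to an actor's inbox is a small JSON-LD document. Everything below (domains, IDs, the choice of an Announce activity for sharing a video) is made up for the example, not taken from PeerTube's or Shig's actual payloads:

```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "https://instance-a.example/activities/1",
  "type": "Announce",
  "actor": "https://instance-a.example/accounts/alice",
  "to": ["https://www.w3.org/ns/activitystreams#Public"],
  "cc": ["https://instance-a.example/accounts/alice/followers"],
  "object": "https://instance-a.example/videos/watch/1234"
}
```

A following instance receives this in its inbox, fetches the `object`, and stores its own copy — which is exactly the cloning behaviour the talk builds on.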
It means that without leaving your PeerTube instance or your Owncast instance, you cannot interact with another stream or another instance. It's not possible. Yeah. That leads to a problem: it's called scaling in the Fediverse. In the end, more or less every instance provider in the Fediverse is responsible for himself; you have to scale on your own. You have possibilities, of course: with hardware, with an HLS CDN on top of it, or with object storage. Those are the common ways to increase the number of users that can watch you. But in the end you stay alone, more or less. PeerTube tries to solve this problem with the PeerTube P2P media loader. It's quite awesome: sometimes you see it — you're watching a video and you see that other people are watching it with you. It means the peers exchange the chunks of HLS files via WebTorrent over WebRTC; you make a real peer-to-peer connection to the other viewers. I put it at the top because this is the most common way in the Fediverse to share live streams. There are other ways as well, but the basis here is peer-to-peer in the browser, with WebTorrent in the background. Of course, PeerTube can also clone videos from one server to another server; this is possible. And the new concept is remote runners. This is quite awesome: you can scale PeerTube with a remote runner, which means you can run other services that do the transcoding for you, which is quite often really expensive. These are the possibilities you have to scale your application or your instance. Owncast has a quite interesting feature. The general concept of Owncast is that you have a server and you only stream for yourself. But they have a dashboard, and on the dashboard you can see every live stream at a given time. This dashboard is nothing else than an HTML page with links to the live servers — it's like a list of links.
It's not really scaling, because when you're watching a stream there, you're still watching it from the origin server. This is the current state. But what we have now is ActivityPub: it is possible to share the information that there is a live stream. This already works in PeerTube: there is a live stream, but you cannot share the stream itself. And what we want is to share a live stream. And we want the live stream to be interactive. An interactive live stream is a little bit more than a stream with fixed tracks, like one video and one audio. No — we want a stream where the tracks inside the stream can change: you add new tracks, you remove tracks, you enable tracks, you disable tracks, and the tracks come from different sources, different instances. When we can reach this, then we have interactive live streams in the Fediverse. It's not only that you share a static stream; it's a little bit more. This is what we want to achieve — it's like a conference in the Fediverse. And we already talked about it today: there's a protocol pair called WHIP and WHEP. Of course we need a real-time protocol; it's clear we need WebRTC, the real-time protocol of the moment. And on the signaling side, there's this interesting approach, WHIP and WHEP. In short, what are WHIP and WHEP? You make an HTTP request to a server and receive a WebRTC resource. That's it. No complicated signaling, only an HTTP request. It's a little bit like ActivityPub: you make a request and you get a resource back. This is written there: with the first one, WHIP, you make a request to offer a resource — hey, I have a resource here you can have. And with the second one, WHEP, you make a request to subscribe to the resource. That is the only difference; this is the main idea. Here it is in a little more detail — you can ignore these ones, the ICE exchanges; only these two requests are important.
You offer something with an HTTP request, of course, and you get something back, and then you have all you need for the resource. Finished. And then you can build an architecture like this: you, as a client, offer a resource to an endpoint, and the endpoint offers it to the next endpoint. This works for WHIP, and turned around for WHEP as well. You can establish something like a pipe. Yeah, it sounds really great. And then you can clone streams, because to clone a stream you only send a request to an endpoint: give me this, send this to another endpoint, and clone it to another site. That's it. However, there's a problem: WHIP and WHEP are static. You cannot update the resource. Once you have offered the resource with a request, you get an SDP, and you cannot update the SDP anymore. It is static. That means you will receive the tracks that are inside the stream at that moment, and nothing more — no way around it. You have a static resource. That's fine for plain live streaming, but we want interactive live streaming, where the resource changes. This is quite important. So we want a little bit more dynamism on top of WHIP and WHEP; this alone is not enough for us. And our trick consists of two things — there are a few smaller things, but these are the two main ideas behind it. First: when you subscribe to an egress endpoint and receive a resource, you have to subscribe to a channel as well — you get a channel on top. Because you need a channel to get the information that the egress resource, the resource you are receiving, has been updated. This is the first thing you need; without it, it's not possible. Normally, in a conference system, you would perhaps do this with a signaling server: your resource updates, you get a new SDP. But we only want REST — we have no WebSocket server. So you need to establish an extra resource, like a channel, to receive this information.
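The WHIP/WHEP idea — one HTTP POST carrying an SDP is the whole signaling — can be sketched offline like this. The helper names, paths and the shape of the session dict are illustrative (no real server is contacted); the `201 Created` status and the `Location` header identifying the created resource follow the WHIP/WHEP pattern described in the talk.

```python
# Offline sketch of a WHEP-style exchange: build the single POST that
# carries the SDP offer, and parse the answer that creates the resource.

def build_whep_request(endpoint_path, sdp_offer):
    """One HTTP POST is the whole signaling: no WebSocket needed."""
    headers = {
        "Content-Type": "application/sdp",
        "Content-Length": str(len(sdp_offer)),
    }
    return ("POST", endpoint_path, headers, sdp_offer)

def parse_whep_response(status, headers, body):
    """On 201 Created, the body is the SDP answer and the Location
    header names the resource used later to tear the session down
    (or, with extensions like Shig's channel, to learn of updates)."""
    if status != 201:
        raise RuntimeError(f"endpoint refused the offer: {status}")
    return {"resource": headers["Location"], "sdp_answer": body}

method, path, hdrs, body = build_whep_request("/whep/stream42", "v=0\r\n")
assert method == "POST" and hdrs["Content-Type"] == "application/sdp"

session = parse_whep_response(
    201, {"Location": "/whep/stream42/session-1"}, "v=0\r\n")
assert session["resource"] == "/whep/stream42/session-1"
```

Chaining these requests between endpoints is exactly the "pipe" the talk describes: each hop offers (WHIP) or subscribes to (WHEP) the same resource on the next hop.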
The second point is that you have to annotate the tracks: you have to know what each track is — for example, whether it's the main track or a guest track. And here Shig uses the SDP media title attribute. It's not normally used much — some people use it for the title of a track, for example — and here it's used for some meta information: for example, that the track you receive is muted at first, or that the track is the main track or another one. And the rest is ActivityPub; you rely on that. Yeah, Shig itself is an instance written in Go, based on Pion. It comes with a JavaScript SDK. For the front end you get a web component, not an iframe. And this SDK is implemented in a PeerTube plugin, because Shig itself does nothing but this exchanging. And it looks like this: you have a PeerTube instance on the left side, and you have a PeerTube instance on the right side. You are starting your stream here, and you want to invite people from another instance. This PeerTube instance is connected to a Shig instance, and this is a completely different Shig instance — they are not related to each other. And this user, with his Shig instance and this protocol in the background, can exchange and communicate with the other, like a conference — but this is a stream. And on this side, the owner is streaming this one. It's then transcoded to RTMP, and from RTMP to HLS. At the moment I don't have direct HLS transcoding, but theoretically you could transcode from WebRTC directly to HLS; it's just not implemented yet. Yeah, and let us look at how it looks. I think I have — yeah. This one. Yeah. So, I have here the two PeerTube instances. I make it like this, and so, like this. Depending on the time I may have already created a live stream, but we can do it directly now, because we have more time. Sorry.
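For example, the media title line in an SDP media description is the `i=` field right after the `m=` line. The `main` and `guest` values below are hypothetical markers of the kind Shig could carry there — the talk only says it stores "some meta information" such as muted/main/guest, not the exact format:

```text
m=video 9 UDP/TLS/RTP/SAVPF 96
i=main
a=mid:0
m=video 9 UDP/TLS/RTP/SAVPF 96
i=guest
a=mid:1
```

Because `i=` is rarely used by WebRTC stacks, repurposing it lets the annotation ride inside the normal SDP without extra signaling.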
If you're looking — I'm not sure how familiar you are with PeerTube — here, inside of PeerTube, I have the Shig plugin. This is this one, and you can configure the Shig plugin; this field here points to the Shig server. It's called stream.shig, so it knows this one. Yeah. Here's an access key, okay. Theoretically, you can use this. This is — okay. And the other one, let me see. Yeah, this is the other one. Yes, the same plugin, but it is related to fosdem.shig, which is another Shig instance. It's completely different; they are on different servers, completely separate from each other. See, this PeerTube instance follows this PeerTube instance. You see? That means this one gets all live videos from the other one, cloned. And of course, this one's own Shig is following this instance. The communication between Shig and PeerTube goes over ActivityPub. So when the PeerTube instance gets a new live video, the Shig gets it as well, as a copy over ActivityPub. That's the idea behind it. The implementation is, well, stolen from Owncast — it's exactly the same, because Owncast has a cool implementation for it. Yeah. That's a good idea, Owncast and PeerTube together, I only want to mention that. So, what we can do now is create a live stream, right now. It's like this. I hope I have time — yeah, I have time. Make it permanent, it makes no difference. Yeah. One interesting point: when you create a live stream, the latency should be as short as possible. PeerTube can do a nine-second delay, something like this; nine to fifteen seconds is the shortest PeerTube achieves. I mean, when we're talking about interactive, it definitely can't take 30 or 60 seconds — that's too much. Okay. So what we can do as well is invite the other guy from the other instance. What you have to know is the ActivityPub ID of this guy. Yeah, this one. Now we create a live stream. I hope so. No, we don't create a live stream.
I have to update the live stream. Sorry, my mistake. So now we have a live stream. It's online. And in the back, I have to take this one, because I haven't figured out how to find this live stream on the other side — maybe someone can explain that later. Now ActivityPub has synced both, so we have the live stream on the other instance as well. So here I'm logged in as user one-two-three, and I can access it now. Now I'm in the web component. It's a web component rendered in PeerTube by the plugin, not an iframe. And I can do this here as well. So now there are two guys in two different streams, but they are not connected yet. First they have to join. He's joining, and he's joining as a guest. It takes a while. So let me see. Now we can do it. And of course we want the other guy to see something. The internet is a little bit slow, sorry about that. Now they're both on different Shig instances, different SFUs. The SFUs communicate with them, established over plain REST endpoints, and exchange the information you need, like mute and unmute. And the channel for the WebRTC egress component is established. And even when I... let me come back. And I can even do this one. Sorry. No, I can't; sorry, the connection is bad. So you see the other side. Now I have the tracks mixed, so I can even mix the live stream. And then everything works fine. Theoretically, if my internet doesn't go down, I can go live with this. Let me see whether he can see this live as well. One moment. I think it's here. Yeah, it was here. Somewhere here. This one should be... yeah, now we are live as well. Okay, sorry, the internet is not so good. Yeah, that's it. So we have established a cloned stream between two instances in the Fediverse. That's it. Yeah. Yeah, a question. I'm curious — I've worked a little bit with ActivityPub, but not super in-depth.
I'm curious whether there's a live stream post type in ActivityPub, such that other implementations — a Mastodon server or something — could play this live stream, or does it just look like a link to a live stream? How does that work? The question is whether there's an ActivityPub attribute or something like that inside, right? I'm not sure. You have the content type of Video inside, and you also have annotations saying whether it's a live video or not. That comes from PeerTube itself. So inside the JSON there is only the host server. That means when you share this JSON with another PeerTube instance, you get a description of who the owner is — which actor owns this live stream — and where the home server, the home instance, for this live stream is. That's all we have inside. And then Shig annotates this with extra attributes, like who the guests are and which Shig instance the host server has. Because you can only follow another instance with Shig when your own instance has a Shig instance as well. When you don't have a Shig instance, there's no join button; you have to go to the other instance. Those are the mechanisms behind it. I think... what was the question? Yeah. Okay. Yeah. This only works when both instances run a Shig instance. And this is supposed to work for Owncast as well, because it makes no difference; only the front end is needed for Owncast. And this is the main idea behind it: that you have a way to scale your streams in the background with extensions, based on ActivityPub. Perhaps an interesting point — it's a little bit controversial: you can use this kind of technology for, I won't say advertisement, but for recommendations. When you have live streams, you often have the problem that you want to inform other people that you have live streams too, and those people don't know about you.
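To make the description above concrete, here is a rough sketch of the kind of annotated object being discussed: PeerTube supplies the Video object with its owner actor and home instance, and a Shig-like plugin adds guest and SFU attributes on top. All field names below are illustrative guesses, not the actual PeerTube or Shig schema.

```python
from urllib.parse import urlparse

# Hypothetical shape of a federated live-stream object, as described in
# the talk. Field names are illustrative, NOT the real schema.
live_stream = {
    "type": "Video",
    "isLiveBroadcast": True,                                  # live vs. on-demand
    "attributedTo": "https://tube-a.example/accounts/alice",  # owner actor
    "id": "https://tube-a.example/videos/watch/1234",         # home instance URL
    # --- annotations a Shig-like plugin might add ---
    "shig": {
        "instance": "https://stream.shig.example",            # SFU of the host
        "guests": ["https://tube-b.example/accounts/bob"],
    },
}

def home_instance(obj):
    """Derive the home instance from the object's id, as the talk describes."""
    return urlparse(obj["id"]).netloc

print(home_instance(live_stream))  # tube-a.example
```

The point of the `shig` block is exactly what the speaker mentions: the follower side can only render a join button if it sees both the home instance and a Shig instance it can pair with its own.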
And here you have something like a pool where you can add streams and then chat during the live streams. Because in the back, these live streams and active live streams are nothing more than different kinds of sources from different federated instances. And such things then become possible. Okay. Yeah. You mentioned that you're using data channels to exchange information in the background — what exactly is sent over the data channel? The renegotiation, the SDP. I have the egress endpoint; the receiving endpoint needs a data channel from the offering resource. The question was what goes through the channel: the SDP, and the mute events as well. Yeah. That is coming soon. Yeah. What's the reason for the delay being so large? Here, in this one — the question was what's causing the latency. First, the network here, I guess. Second one... no, most likely the network. I have this one here. One moment. When you have this one — I hope I'm still online, I'm not sure — this delay you see here is bigger. It comes from the transcoding from WebRTC to RTMP. That is not optimized at the moment; that's the reason for this kind of delay. But the rest, I think, is the network. I guess. So it's not WebRTC to WebRTC? It's converted somewhere? It's like this: you have WebRTC to WebRTC. Which one do you mean, between the servers, or between...? On the right-hand side, the video is quite delayed. Yeah. This one. Yeah, there's a big delay at the moment. Yeah. Now, the thing is, in this case you have three WebRTC connections. One is from the client — maybe I can show you this in the slides. Sorry. You have three connections: one to your Shig instance, one from that Shig instance to this one, and one to this one. It's like a pipe. And I guess this part was quite fast because they are in the same location.
But I guess this one is making trouble at the moment. I guess. Yeah. Some other question? Yeah. I missed part of the presentation, sorry about that. As far as I understood, you are using WHIP and WHEP as a way to get those two to communicate with each other. So, as I was saying before, in the last year the WHEP specification changed so that it basically forces the client to create the offer as well, which makes that old exchange impossible within the specification. Are you using the old mode, where you were expecting an offer? How are you dealing with this synchronization where you have to wait for an offer and so on? Yeah, I'll try to repeat the question. WHEP, I think, has two options. First, you send an offer and get an answer back. The second option is you say, hey, I want an offer from you; then you get an offer, and you send the answer back. What's the difference? For the first, you need only one request: a single POST request — you send the offer and get the answer back in the response. For the second option, you first send a POST request, get an offer, and then send the answer — I think it's a PATCH afterwards, something like that. I implemented the second one, because I implemented it in June, and I think a new version is out now where they only support one request. Yeah. For WHIP I only need one request — yeah, that's right. But because of our setup, I don't use WHIP and WHEP exactly as they're supposed to be used, because I need things dynamically, so I establish a WebRTC data channel as well. That's additional. Okay. Yeah. If there are no more questions, then thank you for watching. Thank you. Quite interesting — yeah, because you were talking about this problem already: I wrote a long post, because I liked the old mode. I liked the way we were doing things; federation is possible thanks to that mode. Just leave a couple of minutes to sit down. Yeah.
Getting AV1/SVC to work in the Janus WebRTC Server
Well, welcome everybody. Lorenzo here needs no introduction. He brought the crazy contraption to give his presentation with — it's almost a dangerous demo in and of itself. Yeah, yeah, easy. And he'll be telling us all about AV1 SVC. Let's go for it. Yeah, you can hear me, right? Yes. So thanks for the introduction. I'll be talking specifically about AV1 SVC. I'll go into some technical details, so it may be boring here and there, but I really think it's important in order to get a better understanding of how it all works. And this is just a quick introduction about me: I'm one of the co-founders of a small company based in the south of Italy called Meetecho. I'm the main author of Janus, which is an open-source WebRTC server. And there are some links if you want to get in touch with me or learn more. Basically, what we'll be talking about today is AV1. If you're not familiar with it, AV1 is a relatively new video codec that was designed within the context of the Alliance for Open Media, which has a lot of companies behind it — Apple, Cisco, Google, really a ton of them. What they really wanted to do was create an open and royalty-free video codec — and of course, emphasis on open and royalty-free, because we don't want another H.264 or H.265. It was specifically designed for real-time applications, pretty much like Opus was designed as a codec for the internet. So that was quite an important innovation, with support for higher resolutions, so 4K and beyond. And most importantly, it was also conceived with support for SVC baked into the codec specification itself. That's quite important, because some other codecs support SVC as well, but many times that comes as, let's say, a later addition: codecs are extended to have SVC support. In this case, AV1 was conceived with native support for SVC.
So all AV1 implementations are supposed to at least be able to decode an SVC stream, for instance, which is important when you start working with hardware decoders and things like that. And of course this got me — and should get you all — very interested, because these are all very interesting features to have for different reasons in WebRTC. And SVC is important for a few different reasons. We all know what simulcast is: you use a single m-line to carry multiple quality streams — a high, a medium and a low quality stream, all sent at the same time — so that different qualities can be distributed to different participants as needed. But with simulcast, each stream is encoded as a separate stream, which means each stream is also decoded independently of the others. This means you have to encode the same source more than once, and the fact that they are decoded independently can also cause some challenges sometimes. With SVC, instead, you still use the same media source, the same m-line and so on, but the different qualities — high, medium, low, whatever — are all layers of the same thing. So you have a single video stream that, like an onion, has different layers, where each layer provides more detail, if you want to look at it that way. And the key difference between simulcast and SVC is that with simulcast, since you have different streams, you also have different SSRCs: each quality is a separate RTP stream. With SVC, all layers share the same SSRC. So as far as the recipient is concerned, it's just a single stream, which means it requires less bandwidth — you can pack some things up — and it's more of a layered approach. It is sometimes more CPU-intensive in terms of encoding, because that's a bit more tricky, but it does have some advantages over simulcast as a consequence of that.
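The simulcast-versus-SVC distinction above can be sketched as a toy model: with simulcast an SFU switches a subscriber by picking a different SSRC, while with SVC it filters layers out of a single SSRC. This is only an illustration of the data model, not how any real SFU is implemented.

```python
# Simulcast: three independent encodings, each its own RTP stream/SSRC.
simulcast = [
    {"ssrc": 1111, "quality": "high"},
    {"ssrc": 2222, "quality": "medium"},
    {"ssrc": 3333, "quality": "low"},
]

def simulcast_select(streams, quality):
    # The SFU switches a subscriber by forwarding a different SSRC.
    return next(s["ssrc"] for s in streams if s["quality"] == quality)

# SVC: one SSRC; the qualities are layers of the same stream.
svc_packets = [
    {"ssrc": 4444, "spatial": 0},
    {"ssrc": 4444, "spatial": 1},
    {"ssrc": 4444, "spatial": 2},
]

def svc_select(packets, max_spatial):
    # The SFU switches a subscriber by dropping layers; the SSRC never changes.
    return [p for p in packets if p["spatial"] <= max_spatial]

print(simulcast_select(simulcast, "low"))               # 3333
print({p["ssrc"] for p in svc_select(svc_packets, 1)})  # {4444}
```

The single-SSRC property is exactly why, as the talk explains later, the SFU has to rewrite sequence numbers and marker bits after dropping layers: the recipient still sees one continuous stream.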
An interesting aspect is that simulcast, as we know it in WebRTC today, actually already made use of SVC somehow, because when we do, for instance, VP8 simulcast and we mention temporal layers — temporal layers are not a feature of simulcast. Temporal layers are a feature of SVC. So we are basically using a feature of VP8 that gives us partial SVC functionality, where we can have different frame rates within the same RTP stream we are handling. And this just summarizes it from a visual perspective: with simulcast you send three different streams, and an SFU in the middle can choose which stream to send to the other participants. With SVC, you have one big thing that has many layers: one participant may want to receive them all, another may only want the medium layer, and another may want the lowest layer possible. This is just to give you an idea from a visual perspective. And so I was very interested in implementing this in Janus — here are a few links if you want to learn more about Janus itself — and I started to figure out what I needed to do to get it working. First of all, of course, we need a way to negotiate AV1 in the SDP; that's a given. It may also be helpful to be able to detect keyframes in the stream, for different reasons: for instance, when you are doing simulcast as a server, it helps to know whether a packet is a keyframe, especially if you want to switch on a keyframe. It's also important to be able to interpret how AV1 frames are spread across RTP packets, and for us that's especially important for recordings, because when we record stuff in Janus, we just record all the RTP packets we received, so that we can go through them later on.
So basically, turning a recording into a playable format means reordering all these RTP packets, getting the AV1 frames out of them, and then putting them into, say, an MP4 file. And this means we need to know how AV1 fits within RTP — we'll see how that works later. For SVC specifically, there is another important thing called the dependency descriptor, which I'll talk about in a minute. That means we also need to support it in the server, which first of all means negotiating it — RTP extensions must be negotiated in order to be used. Then we need to know how to parse an extension of that sort, and then we need to figure out how to use the information we receive in that extension. And as we'll see, point 5 is the one that got me in the most trouble; I'll explain why later. Starting from negotiation, that's very easy: you just negotiate the codec name and the related clock rate, so that's easy. Detecting keyframes, and basically being able to extract frames from packets, is a bit more complicated, because we need to delve a bit deeper and figure out how AV1 is packetized over RTP. That's actually true for all codecs. For every codec you need packetization rules, and that's especially true for video, because video typically has larger frames, and RTP packets cannot be that large — they are usually limited by the MTU size and so on. So you need rules that tell you: if you have a frame that is this large, this is how you split it across multiple RTP packets for this codec, that codec, and this other codec. There are usually some similarities, but each codec has its own rules, mostly because of the nature of its bitstream.
This is an activity that the IETF typically carries on in the AVTCORE working group, because basically all packetization rules for RTP and WebRTC are standards. Unfortunately, for AV1 it did not happen in the IETF, so they came up with their own specification, which is linked here. In this specification they provide information both on the AV1 aggregation header — those packetization rules I mentioned, so how do I split an AV1 frame over multiple RTP packets, and how do I get that same frame back when I have the RTP packets on the other side — and it also talks in great detail about this dependency descriptor, which is a beast of its own, as you'll see. And this is basically how it looks from a visual perspective. With RTP, you typically have an RTP header with all the usual stuff you all know. You can have some RTP extensions in there, and this is where the new RTP extension would appear. And then you have the RTP payload. The RTP payload is where this aggregation header plays a role, because, as we mentioned, we cannot just dump an AV1 frame in there — it may not fit. So we need some information that tells us how an AV1 frame is actually split, or, if there is more than one AV1 frame in the same packet, we need to know that as well. The AV1 aggregation header is fairly simple, because it's just a single byte with a few bits you can set. I won't go too much into the detail, not to bore you, but it carries information about these OBUs — and an OBU is basically the equivalent of a NAL unit for AV1. If you know what a NAL unit is for H.264, an OBU is more or less the same thing for AV1. So it's basically a unit of a frame.
Basically, these attributes tell you whether the RTP packet you just received is a continuation of a previous frame — so you know that whatever you're receiving now has to be appended to whatever buffer you had before — and whether or not this frame is complete, or whether you have to wait for something else before passing it to the decoder. You may have some information about how many OBU elements are in place, which is actually optional, and we'll see why in a second. And then one bit tells you whether the packet you received is the beginning of an AV1 frame. Again, all of these pieces are very important when you have to reconstruct the AV1 frame on reception, so you know this is the first thing you have to put in there, then you append this here, this here, this here, and eventually you end up with the complete AV1 frame. And it looks a bit like this. In this case, for instance, we are aggregating multiple OBU elements in the same RTP packet, and we are not specifying how many elements there are, which means that for each OBU element in there, after the aggregation header, we have a variable-size field that tells us how long each OBU element is. So we just go sequentially: aggregation header, we know there are some elements, we check the size, then we read exactly that amount of bytes, and that's the first element; for the second element we read its size, and we go on and on and on. The W field over here allows us to save a tiny bit of space when you use it, because if you say that, for instance, there are exactly two OBU elements in this packet, then you only need to provide the size of all the elements except the last: you can read them sequentially by checking the variable-size length until you get to a certain point.
When you get to the last element, you know that all the bytes that are left are associated with it, so you don't need that additional size field, and you save a bit of data — maybe not that much, but in some cases it may be helpful. As for using the aggregation header, I mentioned it can be helpful in a few different cases. In my specific use case, I basically interpret "not a continuation" plus "first packet of a frame" as, more or less, a keyframe. It's of course not really always like that, but it at least gives me the beginning of something, which is very quick and simple to use when you're just routing stuff: you read a single byte and make some decisions based on that — for instance, when you need to do some simulcast-related switches. For recordings, I needed to do something more complex, because, as I mentioned, we need to traverse the RTP packets and reconstruct the OBUs and the AV1 frame before we can put it into an MP4 container, which means I had to actually implement all the depacketization rules accordingly. I also had to implement the parsing of a specific OBU in order to get some additional information, like the video resolution, because if I'm creating an MP4 file, I don't need to decode the frames, but I at least need to know how large the video is so that I can put it into the MP4 header, for instance — or maybe use the RTP headers to figure out roughly the frame rate, that sort of thing. And all I've mentioned so far is really all you need if you want to use AV1 normally, just as a regular codec. With simulcast, all streams are independent of each other. So if I want to go from high to low, I can just move to the SSRC with the low-quality stream, and I don't need to do anything else: the low-quality stream is encoded separately from the other one.
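The single-byte aggregation header and the variable-size (LEB128) element lengths described above can be sketched like this. Bit positions follow the AOMedia AV1 RTP payload description as I understand it; treat this as an illustration, not production depacketization code.

```python
def parse_aggregation_header(byte):
    """Decode the single-byte AV1 aggregation header: |Z|Y|W W|N|-|-|-|."""
    return {
        "Z": (byte >> 7) & 1,  # first OBU element continues a previous fragment
        "Y": (byte >> 6) & 1,  # last OBU element continues in the next packet
        "W": (byte >> 4) & 3,  # OBU element count (0 = every element has a length)
        "N": (byte >> 3) & 1,  # first packet of a new coded video sequence
    }

def read_leb128(buf, pos):
    """Read one LEB128-encoded length; return (value, new position)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            return value, pos

def split_obu_elements(payload):
    """Split an RTP payload into OBU elements. When W is non-zero, the
    last (W-th) element has no length field: it runs to the end."""
    hdr = parse_aggregation_header(payload[0])
    pos, elems = 1, []
    while pos < len(payload):
        if hdr["W"] and len(elems) == hdr["W"] - 1:
            elems.append(payload[pos:])   # last element: all remaining bytes
            break
        size, pos = read_leb128(payload, pos)
        elems.append(payload[pos:pos + size])
        pos += size
    return hdr, elems

# 0x28 = W=2, N=1: two OBU elements, only the first is length-prefixed.
hdr, elems = split_obu_elements(bytes([0x28, 3, 1, 2, 3, 4, 5]))
print(hdr["W"], elems)  # 2 [b'\x01\x02\x03', b'\x04\x05']
```

This also shows the space saving the speaker mentions: with W set, the final element needs no length byte at all.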
I don't need to know anything about that other stream; they're completely independent. With SVC, that's not always true, because you may have dependencies in place. If I want, for instance, the highest quality layer — since we are talking about an onion — it will very likely depend on one or more packets from the medium layer and the low layer, which means I may have to forward those too, otherwise the high-quality layer will not work, because on its own it's not enough to decode anything. And these are all things you need to figure out at runtime, because the stream is coming in and you have to make a decision right away, otherwise you cause delays and so on. Most importantly, most of the time you may not even be able to parse the payload, because, for instance, if insertable streams are used and the stream is end-to-end encrypted, you cannot look at the payload to see what is what. And this is what the dependency descriptor is for. The idea is that you have an external component — an RTP extension — that contains all the information related to the packet you just received. This one is not encrypted the way the payload is, so it's something an intermediary like an SFU can use to do something. And this is just one example that comes from the specification over there; there are really a ton of examples. In this case, it's an example of how L2T3 dependencies work. L2T3 means two spatial layers that depend on each other and three temporal layers — so two video resolutions and maybe 30, 20, 10 frames per second. And this gives you an idea of how the dependencies evolve as frames go by: this is the first frame, second, third, fourth, and so on and so forth. You'll see that in this specific approach, the first packet you receive is related to spatial layer zero, temporal layer zero — and pretty much everything depends on this packet over here.
Then, if I want spatial layer one and temporal layer zero, I definitely need to relay this packet too, otherwise that one cannot be decoded. Basically, you follow the arrows and you get an idea of the dependencies, so that you can choose which packets you can actually drop or not. And, as you can guess, the problem is: as an SFU, how do I know this? How do I know that this is what is happening and that these are the dependencies in place? That is what the dependency descriptor provides, and I'll explain how in a second. So, continuing from the requirements I described before: if I wanted support for this in Janus — and this is true for every WebRTC server out there — I again needed a way to negotiate the extension; I needed to parse it, so I needed to know how it is encoded in order to figure out what's in there; and then I needed to find a way to use it, for instance to recover those dependencies. I thought negotiation was supposed to be the easy part, but it's actually not that easy. Of course, you just need to negotiate the extension with its name as an additional extmap — that's how it works for all extensions in the SDP. But it turned out I also needed to support the so-called two-byte header extensions, using extmap-allow-mixed. This is because RTP extensions are supposed to be quite small by default. You usually have the so-called one-byte header RTP extension, where one byte carries all the metadata, which means the length of the extension is limited as well. Since you are using one byte to convey a lot of information, the size of the extension itself cannot be more than, if I'm correct, 16 bytes or something like that — I don't remember exactly. And the dependency descriptor can be much larger than that.
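The "follow the arrows" dependency walk described above can be sketched as a transitive closure over per-frame dependency sets. The frames and their dependencies below are made up, loosely in the spirit of an L2T3 layout, not copied from the specification's actual diagram.

```python
# Toy dependency sets: (frame, spatial, temporal) -> frames it depends on.
# Made-up values for illustration, not the spec's real L2T3 structure.
deps = {
    ("F1", 0, 0): set(),                              # base: depends on nothing
    ("F1", 1, 0): {("F1", 0, 0)},                     # S1 needs the S0 base
    ("F2", 0, 1): {("F1", 0, 0)},                     # next temporal step
    ("F2", 1, 1): {("F2", 0, 1), ("F1", 1, 0)},       # needs both parents
}

def required(target):
    """Every frame that must be relayed so that `target` is decodable."""
    needed, stack = set(), [target]
    while stack:
        frame = stack.pop()
        if frame not in needed:
            needed.add(frame)
            stack.extend(deps.get(frame, ()))
    return needed

# To decode F2 at spatial 1 / temporal 1, all four frames must be forwarded.
print(len(required(("F2", 1, 1))))  # 4
```

This is the "hard way" the speaker mentions later; the dependency descriptor's templates are what let an SFU build sets like `deps` without reading the (possibly encrypted) payload.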
So you do need to support two-byte extensions, and at the time I didn't, so I needed to implement that first in order to get it to work — when I started testing, nothing worked, and it turned out that this was the issue. Then, once we've negotiated it and we start receiving the dependency descriptor as part of our RTP packets, we need to figure out how to parse it. And this was really a nightmare for me. This is like therapy for me right now, sharing all this with you. I actually wrote about this in a couple of blog posts, where you can see the nitty-gritty details. But just to give you an idea: it's, let's say, a mess. I won't use the other word, because I don't want to be bleeped. You can see that this is a specification written by somebody who writes codecs, not a network specification, because all fields are variable-length, often at the bit level, which makes it really a nightmare to parse sometimes. As regards the specification itself, it is indeed quite flexible, because there are a few mandatory fields — whether this is the start of a frame and the end of a frame, the frame number, and the template ID for those dependencies we've seen before — but everything else is optional, which means you can either have a dependency descriptor element that describes everything, so the whole context of the SVC stream, or just something that tells you the scope of the current frame. And when we look at what a dependency descriptor really looks like — this is a simple parser I created to debug things offline — when we receive a keyframe, we typically have a 95-byte extension, which, if you know RTP, is a lot: that's almost 10% of the payload you have. So it's really big, but that's because it contains a lot of information.
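The size problem described above comes from RFC 8285: the one-byte header form gives each extension element a 4-bit length field (encoding length minus one), so a single element tops out at 16 bytes of data and IDs 1 to 14, while the two-byte form uses a full length byte. A 95-byte dependency descriptor therefore only fits the two-byte form. A small sketch of that arithmetic:

```python
def needs_two_byte_header(data_len, ext_id=1):
    # One-byte form (profile 0xBEDE): 4-bit ID (1..14) and a 4-bit length
    # field encoding L+1 data bytes, so at most 16 bytes per element.
    # Two-byte form: 8-bit ID and 8-bit length, up to 255 bytes.
    return data_len > 16 or not (1 <= ext_id <= 14)

def two_byte_element(ext_id, data):
    """Serialize one element in the two-byte header form: ID, length, data."""
    assert 1 <= ext_id <= 255 and len(data) <= 255
    return bytes([ext_id, len(data)]) + bytes(data)

print(needs_two_byte_header(7))    # False: a short descriptor still fits
print(needs_two_byte_header(95))   # True: the keyframe-sized one does not
print(two_byte_element(5, b"\xaa\xbb").hex())  # 0502aabb
```

This matches the talk's experience: the small seven-byte descriptors would have fit the one-byte form, but the keyframe descriptor forces `extmap-allow-mixed` and two-byte support end to end.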
So if you start parsing it and serializing everything you receive, you have information about the different layers — spatial, temporal, and so on and so forth. DTI — I don't remember exactly what it stood for — but this is just the output of that tool. That's a lot of stuff. So: blah, blah, blah, some chains, some more stuff, the decode targets; I have some stuff about resolutions; and finally, we're done. Basically, all the parts we've seen were the media sender telling us: this is all the information that I use for this specific SVC context. In this case, this was an L3T3, so three temporal layers and three spatial layers. And all that huge stuff you've seen is the information related to chain dependencies — all that very low-level stuff. So if you want to use it, it's there. And then, at the end, it also tells you the resolutions of the three different spatial layers. In this case they were low, because I captured this right at the beginning, I think. And finally, it tells you that this specific RTP packet is spatial layer zero, temporal layer zero, using template index number one, which is indeed spatial layer zero, temporal layer zero. And this is the information we need, because then, looking at all the stuff we've seen before, we know what the resolution for spatial layer zero is — in this case, the value over here; in practice, it would be something like 320 by something else. And this is it. And of course, not all dependency descriptors are that long: it's usually like that only for the meaningful keyframe packets. The other dependency descriptors will be much smaller — maybe only seven bytes — because they will only tell you, for instance, the layer index of this specific packet. In this case, it is spatial layer zero at temporal layer zero. But I only know this because I received the big one before.
So I received, somewhere back in time, this huge chunk of information — because if I only receive the small one and I get template index six, what is six? Six relative to what? What does it mean? I don't even know how many layers there are. So you do need that information first if you want to make sense of all the smaller packets you receive after it, which means that when you implement this in a server, you need to start keeping state, which is not really true for simulcast or other things — well, it's partly true, but only in a very limited way. In this case, it means that any time you receive that huge packet and parse it, you need to keep it somewhere, so that when you receive packets after it, you can reference it and use it for something. And the idea is that once I have knowledge of those templates, and I receive a packet and know that it is spatial layer X and temporal layer Y, then as a server I can decide whether I want to relay it or drop it. And you can do that the relatively easy way, or you can do it the hard way. The hard way is figuring out all of those dependencies we've seen before. I went for the easier way, especially for now: if the target is, say, temporal layer 2, then I relay everything related to spatial layers 1 and 0 as well, as long as the temporal layer is smaller than or equal to the one I'm targeting. So I may be relaying more than I should, but at least I know that everything needed is there. What's important is that once you've used that information — once you've parsed it — you cannot drop it. You need to relay it anyway, because it's not only helpful to you: it's also helpful to the subscriber receiving that video stream, because they also need to know what is what. So you need to forward that information as well.
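The "easier way" just described reduces to a one-line filter: forward every packet whose spatial and temporal indices are at or below the requested target. A minimal sketch:

```python
def should_relay(pkt_spatial, pkt_temporal, target_spatial, target_temporal):
    # "Easy way" layer filtering: keep everything at or below the target.
    # This may relay slightly more than a full dependency walk would,
    # but it never drops a layer the target could depend on.
    return pkt_spatial <= target_spatial and pkt_temporal <= target_temporal

# Target: spatial layer 1, temporal layer 2 (out of an L3T3 stream).
print(should_relay(0, 0, 1, 2))  # True  - base layer, always kept
print(should_relay(1, 2, 1, 2))  # True  - the requested layer itself
print(should_relay(2, 0, 1, 2))  # False - higher spatial layer, dropped
```

The trade-off is exactly the one the speaker names: a few unnecessary packets in exchange for never starving the decoder of a dependency.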
And, very importantly, you also need to update the RTP headers accordingly, including the marker bit, which is what really drove me nuts the first time, because I had implemented all this and it didn't work for a long time, and eventually I figured out that the problem was that I was not updating the marker bits as well. And this is the reason, basically. If we have a sequence of RTP packets related to different spatial and temporal layers, this is what it looks like from an RTP perspective, including marker bits. If I'm dropping spatial layer 2 because I don't need it, it means I'm dropping some packets over here. So, for all the packets I'm dropping, I need to update the sequence numbers so that they keep growing monotonically, because otherwise the recipient will think they are missing, losing packets — but they are not missing them; I'm just dropping them because they're not needed. So I need to update the sequence numbers so that this is one, this is two, this is three, four, five, six, seven, etc. — I need to make sure they know they are not really missing anything. But I also need to update where I set the M=1 marker bit, because this is needed for decoding, especially in Chrome. In particular, you need to set M=1 on the last packet with the same timestamp. So since the timestamp now changes after the second packet — because that's now the last packet with that timestamp over there — I need to set M=1 on that second packet before I forward it, or otherwise nothing works, basically. Sorry, wrong direction. And if you want to test all this, with Janus or with anything else, of course you need a browser that supports all this stuff. The kind of bad news is that at the moment, I think only Chrome supports it. I don't know if other Chromium-based browsers support it too, but Chrome definitely supports AV1 as a codec.
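The renumbering and marker-bit fix described above can be sketched as follows — a toy model over dicts that assumes packets arrive in order and ignores 16-bit sequence-number wrap-around:

```python
def rewrite_for_subscriber(packets, target_spatial):
    """Drop layers above the target, renumber sequence numbers so they
    stay contiguous, and move M=1 to the last kept packet of each
    timestamp. Toy model: in-order input, no 16-bit seq wrap."""
    kept = [dict(p) for p in packets if p["spatial"] <= target_spatial]
    if not kept:
        return kept
    base = kept[0]["seq"]
    for i, p in enumerate(kept):
        p["seq"] = base + i  # monotonic: the receiver sees no "loss"
        # marker bit: set on the last kept packet sharing this timestamp
        p["marker"] = (i == len(kept) - 1) or (kept[i + 1]["ts"] != p["ts"])
    return kept

# Three spatial layers per frame; M=1 originally sits on the S2 packet.
stream = [
    {"seq": 100, "ts": 1, "spatial": 0, "marker": False},
    {"seq": 101, "ts": 1, "spatial": 1, "marker": False},
    {"seq": 102, "ts": 1, "spatial": 2, "marker": True},
    {"seq": 103, "ts": 2, "spatial": 0, "marker": False},
    {"seq": 104, "ts": 2, "spatial": 1, "marker": False},
    {"seq": 105, "ts": 2, "spatial": 2, "marker": True},
]
out = rewrite_for_subscriber(stream, target_spatial=1)
print([(p["seq"], p["marker"]) for p in out])
# [(100, False), (101, True), (102, False), (103, True)]
```

After dropping spatial layer 2, the marker migrates to the S1 packet of each frame, which is exactly the fix the speaker needed before Chrome would decode the stream.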
And you can check that by using the RTCRtpSender.getCapabilities() thing: if you see AV1 in that list, you do support AV1 as a codec. But you also need support for the SVC functionality and, most importantly, the dependency descriptor. And the dependency descriptor is not offered by default, so I think you still need to force a field trial first, like this. I don't remember right now if you can just munge the SDP to artificially put the extension in your SDP in order to make it work anyway; that I should double check. But you may need to launch, for instance, Chrome with that flag over here, so that the extension appears in the extensions supported by the browser. When you do that, then your browser is capable of encoding AV1 with SVC functionality and the dependency descriptor, which is quite important. And if you want to test this, I also made it very simple, because if you go to the online demos for Janus and you check the EchoTest demo, you can provide a couple of attributes: first of all for AV1 as a codec, and then for a specific flavor of SVC, in this case, for instance, L3T3, to send three temporal layers and three spatial layers. When you do, some small buttons appear over there that allow you to pick one thing or the other, which means that you will send the big AV1 SVC stream to Janus and Janus will send you back only what you asked for. So in this case, for instance, spatial layer one and temporal layer two, which is why my resolution is smaller and the bitrate is smaller as well. By playing a bit with those things you should see the resolution changing and the bitrate changing; if it does, it works. The same functionality is also supported in the VideoRoom, of course, which is the SFU to do video conferencing. So at least in theory you can have a complete video conference based on AV1 SVC as well; we haven't tested that much, but it should definitely work. And I think this is it. 
I'm not sure if we have time for questions, but before that I also wanted to announce, sorry for bothering you all, that JanusCon is back. JanusCon is our own Janus conference, a conference devoted to Janus and WebRTC in general, which will happen at the end of April in Naples, in the south of Italy. We have a few sponsors already, which I'm very grateful for, and the call for papers ends in about a week. So if you are doing anything interesting with Janus and WebRTC, feel free to submit a talk there. Tickets are available for sale as well, and of course, if your company is interested in sponsoring, that would be great too. And that is all. I don't know if we have time for questions, because I didn't really check how fast I was going, maybe too fast or... Okay, so are there any questions? I see a couple. [Question from the audience, partly inaudible: is SVC the next generation of simulcast, or will they continue side by side?] Sorry, the question was: is SVC basically an evolution of simulcast, or does it make sense to have them both at the same time? Which one will be more important in the future, which one is the technology to invest in? Functionally, they serve the same purpose, if you want, because I have the same demo for simulcast, and if you look at the demo for simulcast, it looks visually the same. You have the same buttons to say, I want high quality, low quality and so on. The differences are really just in how the thing is implemented. In general, SVC is supposed to be more advanced than simulcast, of course, and probably more resilient as well. 
But the main obstacle right now is related to what I was saying before. If you want to use AV1 SVC today, you have to pass a custom flag, which means that right at the outset it's really not something that you can ask your customers to do, for instance. So for the moment, it's not really something that is production ready. You can use the SVC flavor of VP9, which provides a similar feature and which is available out there now. But still, simulcast is overwhelmingly favored in general for production environments, because it's been battle tested, it's been there since day one, everybody supports it, it's easier to work with and so on and so forth. So for the moment, it doesn't make sense to just force SVC in your production environment right away, if not for experimental purposes and for testing how it works, for dipping your toes in the technology. But for the future, I definitely think you should pay attention to it, because AV1 will hopefully be the codec that everybody adopts: it's better quality, it's royalty free, it's open source, and it has SVC baked in. Sooner or later, hopefully, Safari will have AV1 as well, Firefox will have it, Edge and other browsers will have it too. And you definitely want to be ready when that happens, because otherwise you'll be the one stuck with the old codec while everybody else is taking advantage of the new thing. [Comment from the audience: I think you can munge the SDP to make it work.] For the extension, yeah. [Comment from the audience: there is one thing that in some environments might be relevant, which is that many hardware decoders don't cope with SVC, but they do with simulcast, because those look like a normal stream. So if you're on a resource-constrained device, maybe receiving SVC is no bueno, but receiving normal simulcast will be better.] But in theory this will not be true for AV1, because AV1 was conceived with SVC in mind. 
So in theory all hardware decoders, even smaller ones, will know how to interpret it, and since it's a single stream, they will be able to decode it. Of course, that's just theory... Ideally they would. For VP9, for example, Chrome still does not use hardware decoders when you use SVC. And I'm not sure about AV1, because AV1 hardware support is still hit and miss. And there was another question here, yeah? [Question from the audience, partly inaudible: what is the forward error correction strategy here? If FEC is used end to end with SVC and the server drops some layers, doesn't it break?] Yeah, that's a good question, and it's actually related to one of the doubts that I have about FEC. Mostly because something like AV1 SVC, and simulcast as well, only makes sense when you have a server in the middle. It doesn't really make sense if you are sending something from point A to point B and point B is the one that is meant to receive it, because in that case you are sending everything anyway. Unless you are using SVC as some sort of redundancy mechanism, because you say: if I lose some packets related to layer 2, I can still display layer 1. That's one thing, but that's not really what it's meant for. And so, the moment you have a server in the middle, it also means that you can offload the forward error correction stuff to the server as well. Which does make sense, also because, for instance, when you use FlexFEC, which is the thing that was described in the first presentation from Chrome, Chrome by default will not add any redundancy information, so it will not send any FEC packets until the peer tells it that it is losing some packets. 
And this is to optimize stuff, so you don't add redundancy unless it's needed because there's loss reported; which becomes a problem if you're doing something like a video conference, because your uplink may be perfectly fine, and then you have subscriber X over here that is experiencing loss and you don't have any redundancy packets to send them instead. So the idea, and probably the solution to that (this is something that I'm still brainstorming myself, because FEC interests me, but I have some doubts there), is that the forward error correction is probably something that the server itself will need to add on each subscriber leg. So from the server to you, I will have a dedicated FEC channel where I add some forward error correction for the stream that I'm sending you; and for the stream that I'm sending you, layer 2 may not be there, but I have a consistent stream because packets are in sequence. So the forward error correction I'll be sending you will be different from the one I'll be sending to somebody else who is receiving additional layers. And that's probably the only way to do this, if you don't want to forward FEC end to end without touching it, which anyway wouldn't be useful at all, especially if the sender is not providing that information themselves. Yeah, in my experience, and this may be an implementation choice of course, I did have to forward it, because otherwise it would not be decoded properly, basically. And I don't know if this is actually really needed; like, for instance, even the marker bit set to 1, that's not really needed from a specification perspective, because as a receiver you do see that the timestamp is changing, so you do know that it is a new frame and you can decode the previous one. But Chrome simply expects that marker bit set to 1, otherwise it will not decode a frame, basically. So in my experience, you need to forward that information too. 
And I guess it makes sense, because the recipients themselves may also need to decode the video stream differently depending on what they are receiving, because they need to know if the resolution must be this size or this size or something like that. It may all be part of the AV1 bitstream, so it may be redundant information as far as they are concerned, but at least when I made these tests a few months ago it was needed, so just relaying it makes sense. Yeah. [Question from the audience: in regard to switching layers, your previous talk was on bandwidth estimation; maybe you can comment on how they go together, or is there something specific to AV1?] Yeah, the bandwidth estimation stuff is important for a few different reasons. And in this case, I'm talking about bandwidth estimation on the subscriber side, so from server to recipients, because on the publisher side there is transport-wide congestion control, and basically the browsers themselves are capable of using that feedback to figure out if they need to send less or more. And so, dynamically, you may see that some spatial layers are not appearing, because the browser doesn't have enough bandwidth for that. On the subscriber side it's really useful because it helps with the decision. So for instance, right now I just mentioned generically whether I want to relay or drop a packet, but this actually depends on why I should relay it, because a user may want to receive the highest quality possible, or a user may want to receive the lowest quality possible. That may be because they only want a lower quality because the video is going to appear in a thumbnail, so they don't need the whole thing, and that's an application logic decision. Or the decision may come from the fact that the user doesn't have enough bandwidth for all of that stuff, so they don't have enough bandwidth for spatial layers 2 and 1; let's just send them spatial layer 0. 
And this is where bandwidth estimation helps, because if I'm sending stuff to the subscriber and I start to get information that congestion is happening, then internally the server can update which spatial layer or temporal layer it should send to this specific subscriber dynamically. This will impact my decisions to relay or drop stuff, and so it allows me to dynamically impact the quality of the subscriber depending on how much bandwidth they have. In my experiments right now I've only done this with simulcast, because I haven't hooked it up to SVC yet, but the key principles are really the same. One minute? [Question from the audience: just related to that, with WHIP and WHEP, is there a way to signal simulcast on the publisher and the subscriber side?] Yeah, the question was: with WHIP and WHEP, is there any need to signal simulcast or SVC as well, and does it make sense? In general, it's definitely important that you signal it on WHIP, because you want to make sure that the stream you are ingesting is recognized by the server as a simulcast or an SVC stream, so that the server can also parse those dependency descriptors, in case it's AV1 SVC for instance, or, in case it's simulcast, it knows that it needs to take care of, let's say, three different qualities. On the subscriber side, for simulcast it's really not important, because as a subscriber you're just always going to receive one video stream, and as far as you're concerned it's a consistent video stream. You don't even know that there is a switch happening behind the curtains from high to low to medium or whatever. You just see a single video stream, so you don't need to be aware of the fact that it's simulcast. 
For WHEP with AV1 SVC it may be important to negotiate the dependency descriptor extension, as I mentioned, because if it's needed for decoding purposes and you want the browser to be able to decode things properly, then you may want to negotiate that extension on the subscriber side as well. But as I was saying before, it may or may not be needed, so that's something that we'll have to check. And I think I'm really out of time now. Thank you. Thank you.
Using GStreamer to build real-time applications with Golang
All right, well, welcome back, everybody. Up next, the one and only Dan Jenkins is going to tell us all about GStreamer and Golang. Take it away, please. Thank you. Hello, everyone. Can everyone hear me okay? Yeah? Good. Great. Cool. Okay. I forgot my clicker. Number one rookie thing to do. No, no, I've got my phone, so I'm good. But yeah, that's why I've got my phone, and it's going to look a little bit weird. I also brought two European plugs with me, but one wasn't European, one was American. So my day did not start off well. So yes, GStreamer and Golang. A little bit about me. Oh, that's just going to get really annoying; I'm just going to click. Cool. Okay, so a little bit about me. So yes, I'm Dan Jenkins. I run a couple of companies: one called Everycast Labs, one called Nimble Ape, and another one called CommCon. Everycast Labs does broadcast stuff, bringing remote talent into broadcast workflows. Nimble Ape is a consultancy company based in the UK. And then CommCon is an event that we put on for open source people, our way of kind of giving back to the ecosystem that we build from. I was the very first Google Developer Expert in the world when it comes to WebRTC. I'm not saying I'm the best at WebRTC, but I'm the first that actually got accredited by Google's developer program. I love Lego, and I love real-time media. So yeah, Nimble Ape, we're a consultancy, and if you've got hard problems that you want solved, come talk to us. And Everycast Labs, we've got that product that I was just talking about called Broadcast Bridge. And then CommCon. CommCon is dear to my heart. Historically, it's been a residential event where everyone stays in the same place, and we've got three days of awesome real-time and open media content. And we're back in 2024. Dates are still up in the air because of contracts, but it's not going to be residential this year. 
We're going to go on tour, so we're not just going to be in the UK, and that's quite exciting. So, to the actual topic: GStreamer, building real-time applications with Golang. What are we actually going to talk about? We're going to talk about GStreamer, obviously. We're going to talk about Golang, obviously. But I want to introduce you to something called go-gst. go-gst has been around for a long time now, but kind of got itself into a bad state, where it was not unmaintained, but there were lots of little forks and lots of little patches everywhere. And so we've kind of changed how that project is being managed now. And then I also want to introduce you to something called Pion. So let's take a look at GStreamer first. Who in the room has heard of GStreamer? Good, that's the answer I was looking for. So: open source multimedia framework, basically does everything that you chuck at it in some form. And I absolutely love GStreamer. A lot of you might know GStreamer as something like this. I'm not going to ask you to tell me what that is, because I know that it's kind of taking in an RTSP source, doing something with it, and then outputting something at the end via UDP, with a load of stuff in the middle. But GStreamer is actually super powerful and ultimately lets you do ingress, do something with the media, and then egress. And it kind of boils down to something that's simple, right? GStreamer can do it all and can do a lot of things. So for us at Everycast Labs, with our Broadcast Bridge product, we care about certain things. GStreamer can do NDI, GStreamer can do WebRTC, GStreamer can do SRT, it can do RTP, it can do HLS, it can do RTMP and RTSP, right? I'm not telling you anything that you don't know at this point. But for us, at least with Broadcast Bridge, GStreamer has a superpower, and that superpower is appsrc and appsink. How many people in the room know about appsrc and appsink? Okay, good. 
That means like 60% of you are going to learn something now; the rest of you just sit and be happy. So yeah, this is what we use in our Broadcast Bridge product. And that's because we don't write C, and so ultimately adding code to plugins within GStreamer is really difficult for us. I know that's changing as time goes on, there are more and more Rust plugins, but at its core there's a load of stuff that we don't feel able to contribute to if we find a problem. We don't like writing C like this, but we do like writing a lot of Go. And so we end up writing something like this. And this is go-gst. It was originally created by a guy with the GitHub handle tinyzimmer; I love the name. But now it's in its own GitHub organization, so it's under a new GitHub org, and there are three main contributors. I think there's something like 17 total, but there are three main ones: tinyzimmer, me and R.S. Willy. And this other one, Big Little Ben, that's from the LiveKit team. The LiveKit team had their own fork of go-gst, and they had put a load of work into fixing bugs, but those fixes were never getting merged back into the project while it was under the tinyzimmer GitHub. So now it's moved out. Well, it's not actually forked: we forked it into its own organization and then did the GitHub magic where we unforked it, and the tinyzimmer one is now a fork of us. So there was a lot of GitHub reorganization going on to make it easier for everyone. Did you know that GitHub forks don't turn up in Google SEO results, and they don't turn up in GitHub search results either? And search doesn't work in a forked repo. So basically forks are dumb. I mean, they're not dumb, but forks are bad; we should not be relying on forks for a long-term thing whatsoever. So yeah, this is actually really great for everyone now. 
So less forks is better for everyone. And like I said earlier, Broadcast Bridge uses a mixture of SRT, NDI and WebRTC, among a load of other things as well. So why, you're probably asking, would we even need to use appsrc and appsink when the plugins are already in GStreamer? Like, GStreamer already knows how to take in an SRT feed, it already knows how to output an NDI feed, and it knows how to do WebRTC stuff. So why are we building on top of appsrc and appsink? It comes down to greater control, like I was kind of alluding to earlier. We use Pion to do WebRTC. And that's not because the GStreamer implementation isn't good. It's just that if we want to do anything that isn't implemented in the GStreamer implementation, then we'd need to get someone to actually go and change that code, and that's not something my team is capable of doing; but we do know Go really, really well. And so we can definitely go and take that greater control. Like I said, this means we're handling WebRTC in something that we really know. Ultimately, very few people in this room know about transcoding something from one codec to another; we just rely on FFmpeg or GStreamer or whatever to do it for us. It's the same with WebRTC for us: we really know what we're doing with WebRTC, and we want to be able to tweak things that we can't necessarily tweak with the GStreamer implementation. Pion is hugely, hugely powerful. And this is the other key thing: it's easily upgradeable. When we actually find a bug in Pion, we can fix it and upgrade easily, rather than having to dig into a GStreamer pipeline and never leave the C level. But cost isn't just measured in terms of compute. Cost is everything from building the feature all the way through to deploying the feature and running the feature, and you've got to look at the whole picture. 
Pion gives us huge, huge flexibility; we can move fast and we can add new features, and ultimately that means that we win business. So let's take a quick look at appsrc. How many people are actually familiar with appsrc? Right. So appsrc is just another plugin, module, whatever they're called. Ultimately, you can put it inside of your pipeline and you can push data into GStreamer using appsrc. You set a load of capabilities on that appsrc element, telling it: this media that I'm just about to push into you is this format and this frame rate and whatever else. And you can push in data, or you can make GStreamer ask you for the data. So instead of just going, oh, I've got data, data, data, data, and GStreamer going, hold on, I can't do anything with this, why are you sending me so much data?, GStreamer can actually ask for it. Now, that's not hugely helpful when it comes to real-time applications, because in the case of Pion, getting RTP data from Pion, for example, that's real-time, and so we want to get that data from Pion and pass it into GStreamer straight away, because we're getting it in this constant flow from Pion. Whereas if you were reading a file and then passing those chunks into GStreamer, well, you've got control over how fast you push those chunks in, so why not let GStreamer go: ah, I want a bit more data, I want a bit more data. Right. appsink is absolutely no different. It's a plugin, it's a module, and when you put it into the pipeline, it becomes an element. And ultimately you get data pushed out of appsink. So imagine you've got appsrc, and then you've got something in the middle, whether or not that's transforming it or transcoding it. 
And then you've got appsink, and you're connecting all these bits together. So you're pushing data in, GStreamer is then doing something with it, and then it's passing it over to appsink, and appsink sends it out to your application as data. Not as UDP, not via RTP or anything: it's giving you the raw buffer of data. So you get pushed your data from appsink via the new-sample signal. Here you go, I've got some data. Notice how this is all Golang. So yeah, let's take a very quick look. We've got our sink, so that's an appsink element that I've made, and I'm setting some callbacks on it. Then we've got the new-sample func, and that gives me my sink, and I'm going to tell it, as a return, what the flow state is. So I pull the sample, and if the sample isn't nil, then we carry on; if it is nil, then I'm returning that we are at the end of the stream. And then the buffer: so we get our sample, we're pulling the sample, then we're getting the buffer out of that, and then ultimately reading some information from that buffer map, changing it from big-endian to little-endian, I think, or something, and then doing some maths on it. Not a lot of useful information there, like, what am I actually then going to go and do with it? Well, at the moment, it's just printing out RMS. But then you can go off and do whatever you want with it. For us, that means getting video and audio data out of GStreamer and chucking it into NDI. Oh, Dan, why are you not using NDI within GStreamer? Well, I'll tell you. Number one, when we did our NDI integration, GStreamer didn't have NDI. It was completely separate, it was a different repo, and it wasn't part of the GStreamer Rust plugins. 
And then two, we do extra stuff that GStreamer doesn't know how to do yet. We grab tally information from NDI, and to be able to do that, you need access to the underlying NDI sender. So there's stuff that GStreamer can't do yet, stuff that we actually want to add in to GStreamer, so that we can stop sending stuff via the NDI SDK directly and just let GStreamer deal with it for us. But again, it goes back to that cost analysis, right? At the moment, we can get that data out of GStreamer using appsink and chuck it out via NDI. We can do that, and it's relatively cheap. But then there's a load of extra work for us to go in and figure out the right way of doing it in GStreamer, so that tally information becomes available as a signal, for example. So yeah, for us, this means that we have to handle RTP and RTCP from Pion. Because WebRTC is made up of lots of standards, but ultimately the media is RTP, and the bit that tells you what the quality is and everything else that goes along with it is RTCP. It's very easy to forget about things that are very important when you don't deal with them, like RTCP. SFU people in the room will go, ah, you could never forget about RTCP. But as a web developer, the browser deals with all of this for us, and so it's very easy for us to go, ah, RTP, I'm going to get my media. And then everything works really, really well when you're in a really nice network environment; but then you chuck in a real-life scenario, and the audio and the video go terrible. Why did the audio and video go terrible? Because there's no RTCP feedback mechanism to go, ah, something's going wrong. But yeah, GStreamer makes all of this easy. And very quickly on this very specific thing: we use rtpbin within GStreamer. So that's that middle bit for us. We use appsrc, chuck it into rtpbin, and then we do a load of transcoding and stuff as well. 
And then we get appsink. rtpbin is magical. If you deal with RTP at all with GStreamer, then you need to be using rtpbin. There's a lot of text there, but ultimately it implements everything you need to be able to handle RTP and RTCP and demuxing of payloads. It's just a very nice all-in-one thing that deals with everything using all of the separate plugins, but it ties it all together nicely for you. So for us, that's connecting the appsrc and appsink pads to rtpbin. And you'll notice I say pads. You can see up at the top there, rtpbin: we're requesting a pad from rtpbin in that format, so that's a recv_rtcp_sink pad, and then we're also requesting a send_rtcp_src pad as well. We then go and make a new appsink and a new appsrc, and you can see they're labeled RTCP appsink and RTCP appsrc. We then add those to our pipeline, because otherwise nothing works: all of your elements have got to be in a pipeline. And then we link them up: I'm grabbing the RTCP source pad from the rtpbin and linking it over to the RTCP appsink, and linking the RTCP appsrc's static source pad into the RTCP sink pad. That's basically just saying: rtpbin is going to give me some RTCP information via a pad, and I'm connecting to that pad so that I can then grab that information and send it back via Pion to my WebRTC peer. So you'll get RTP in, in this case into rtpbin, but you'll get RTCP in and out: you'll get told RTCP and you'll also send it back out as well. And like I say, don't forget about the RTCP. As you can tell, I forgot about the RTCP and ended up doing certain demos and going, ah, look, it's really great, and then someone went and tried it on a really crappy internet connection and went, no, Dan, it doesn't work, and made me look rather foolish. So you end up with something like this. 
So, does everyone know about the dot graphs that you can generate from GStreamer? A couple of nods, not that many. So within GStreamer, you can tell it: I want you to export a dot graph file on a state change or whatever; you've got control over when it generates one. For us, when we've got debugging enabled, we enable dot graph generation whenever state changes. This looks really small, but it's a PDF, so you can go in and look at it in high-quality detail, because it's not a PNG. You've got lots of options: the dot graph can be converted into lots of different formats. But the really cool thing about dot graphs is that they tell you what's connected to what, and so they're really great for debugging. So for us, we've got our two appsrcs. One is RTP, which is this one, and then this one is RTCP. You can see (I'm coming off the camera, sorry) that this one is set with capabilities to say that this is RTCP, and this one is set with capabilities to say this is RTP. And you can see those are linked to pads within a bin, a GstRtpBin. Those pads are then connected to an RTP session. The RTP session is then connected to a demuxer, the demuxer is then connected to a jitter buffer, and the jitter buffer is then able to go: oh, well, this RTP stream that I'm receiving is both audio and video; it's demuxed, and then it automatically goes, ah, here's the video and here's the audio. Right. And then it chucks it back out, creates some pads for me, which I then connect over to... well, there's an appsink up there, and that's my RTCP appsink. But then you can see here that it's connecting out Opus and VP8 into my pipeline. 
And then this is the rest of the pipeline, which we don't care about; but I get told it's Opus and I get told it's VP8, and so I'm able to decode it and do stuff with it, whether that's outputting to NDI or whatever. At the end of it is an appsink for sending out via NDI. So, we got into Go purely because of Pion, and Pion gives us loads of control. It's basically WebRTC in pure Golang. If you ignore the fact that WebRTC does lots of actual media stuff: when you look at just the network portion of it, of sending data from here to there, then it's pure Golang. So yeah, you can do any of this with any of the GStreamer bindings, or you can just, you know, do it with actual GStreamer C. I mean, who actually wants to do that? I don't know. But you can go and use whatever bindings you want. There are really nice bindings for Python and Rust; I haven't used any of the others, but I've definitely used the Python one and the Rust one myself, and the Golang one. I went on there this morning to take the screenshot and I was like, oh, where's the Golang one? So here's the pull request to add it to the list. So: if you've got a problem and GStreamer doesn't quite solve that problem, that's what this talk is about. This talk is about the fact that you can make GStreamer do what you want it to do using appsrc and appsink. You can build it yourself with appsrc and appsink. So why GStreamer, why not FFmpeg or whatever? GStreamer does everything that we need it to do. It has a fantastic, super friendly community. And ultimately it's just super flexible and does exactly what we need it to do, which is not something that we felt, as a team, FFmpeg would give us. GStreamer has a lot of scaffolding, let's say, and gives us an awful lot for free. 
Whereas FFmpeg is a little bit more work, right? So my last message is: GStreamer for the win. Don't wait for others to build your plugin for you. You can go and build it with GStreamer, appsrc and appsink. And that's me. Thank you very much.
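The appsrc/appsink pattern the talk keeps coming back to can be modeled in a few lines. This is a toy stand-in, not the real go-gst or Pion API: your own code pushes buffers in at one end and pulls processed buffers out at the other, with the pipeline as a black box in between.

```go
package main

import "fmt"

// Buffer stands in for a GstBuffer: raw bytes plus caps-like metadata.
type Buffer struct {
	Caps string // e.g. "application/x-rtp" or "application/x-rtcp"
	Data []byte
}

// AppSrc and AppSink are toy stand-ins: in real GStreamer you push with
// gst_app_src_push_buffer and pull with gst_app_sink_pull_sample; here a
// channel plays the role of the pipeline in between.
type AppSrc struct{ out chan Buffer }
type AppSink struct{ in chan Buffer }

func (s *AppSrc) Push(b Buffer) { s.out <- b }
func (k *AppSink) Pull() Buffer { return <-k.in }

// newPassthroughPipeline wires src -> (pipeline elements would go here) -> sink.
func newPassthroughPipeline() (*AppSrc, *AppSink) {
	ch := make(chan Buffer, 16)
	return &AppSrc{out: ch}, &AppSink{in: ch}
}

func main() {
	src, sink := newPassthroughPipeline()
	// Pion hands us RTP packets; we feed them to the pipeline via appsrc...
	src.Push(Buffer{Caps: "application/x-rtp", Data: []byte{0x80, 0x60}})
	// ...and read the media back out via appsink, e.g. to hand to NDI.
	b := sink.Pull()
	fmt.Println(b.Caps, len(b.Data))
}
```

The design point is the same one made in the talk: the edges of the pipeline are yours, so anything that can produce or consume bytes in Go can be glued to GStreamer.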
Build your ENUM LCR Server using CGRateS
I hope you can hear me. First of all, thank you for having me this year at FOSDEM. My name is Saber Katelari. I'm a core developer at ITsysCOM. And today I'll be showing you how you can build your own ENUM LCR server using CGRateS. Firstly, something about our company. It's located in Bavaria, Germany, with back offices in Romania and Albania. We have over 17 years of experience in architecting server-side solutions in Voice over IP environments. We have platform implementations covering both wholesale and retail business categories, and by now we understand real-time processing constraints and serious live system outages. Something about CGRateS. It's a real-time enterprise billing suite, more like a framework, since it can do many things. It's pluggable into any existing infrastructure and non-intrusive into existing setups, so it does not force you to make decisions: it's all up to your system admin whether you take into consideration what CGRateS gives you or just ignore it. We have been open-source software since birth in 2010, with the first sources published in 2012. Full sources are available on GitHub, 100% in Go. We always mention Go because when CGRateS first started, Go was also in its first weekly releases. This means that we were one of the first adopters of Go, and that we also paved the way for other people coming after us. We have no add-ons in private repositories, and we take community contributions into consideration as well. About the engine. The engine is performance-oriented. It has a built-in advanced caching system with transactional LRU (least recently used) and TTL-expiring records. It does asynchronous processing with micro threads; if you know about Go, you probably know more about this. It also includes an API load balancer. We have three branches: v0.10, master and 1.0. v0.10 is our most conservative branch; master is where we have our most recent developments.
And 1.0 we call the pinnacle of what CGRateS can do, but it's still in early development. We have a test-driven development environment with over 10,000 tests as part of our testing suite: unit tests, integration tests, and also call tests for switches. CGRateS has a modular architecture which is cloud-ready: microservices with a rich set of RPC APIs, because everything in CGRateS is API-related. And it's easy to enhance by rewriting specific components; for example, if you want to rewrite the engine in some other code, you can easily do so. Some features of CGRateS. You can do an online/offline charging system. You have multi-tenancy from day one; this is more for white-labeling platforms. Multiple databases are supported, to mention some: MySQL, Microsoft SQL Server, SQLite, MongoDB, Postgres, and also our internal database, which is compatible with everything we do. This is also a pretty challenging job for the relatively small team that we are. You can have real-time configuration reloads, so you can reload your configuration without having to shut down the engine and start it again. There is a rating engine with derived charging and number rating. You have account balances and management with bundles and DynaPrepaid; with DynaPrepaid you can create accounts on the fly and give them some restricted or limited permissions on your system. You have session and event charging with balance reservation and refunds; this is prepaid logic. STIR/SHAKEN authentication, which is more for North America. CDR logging with support for interim records and rating queues; this is when you have your CDR sitting in a black box, have it communicate with your switch, and have your CDR rated at the end within a matter of milliseconds, without using any databases on the CDR side.
You can have a high number of interfaces for event readers and exporters, to mention some: AMQP, SQS, SQL, CSV, XML and a couple more. You can have fraud detection with automatic mitigation, LCR with quality-based bundles and quality-based stats, and call statistics with pattern monitoring, so you can find your ASR and your ACD live from CGRateS; and in combination with your proxy, you can find your average call cost and your total call cost. You can have dynamic pricing imports with templates; this is because all suppliers have different formats and CGRateS is compatible with most of them. You can use it with Diameter, or with RADIUS if you need some authentication, say Wi-Fi authorization; with DNS if you need ENUM LCR routing, which is also the topic for today. And you can also have a basic SIP server which can do redirects with CGRateS: you can have it redirect traffic from your switch via CGRateS with some routing and IP addresses. What else? We have resource allocation and control; this is some virtual channeling for your customers. You have an API server with GOB JSON and HTTP JSON support, built-in high availability with dynamic partitioning support, and an API capturing and analysis service, which is something like internal analytics for CGRateS. Clustering through remotes, replication for internal cache and database. Data versioning with automatic migration; this is when you need to move between releases in the same branch, and you can do so with data migration. And we are also agile in developing new features, so if you have some feature or some idea that you want to bring us, you are more than welcome to do so. This is an internal schema or diagram that we have for CGRateS. It basically shows how CGRateS's components and interfaces communicate with each other. On the left side you can see all our interfaces.
You might notice that we don't have OpenSIPS over there, because OpenSIPS has its own native module, which is faster and better than anything we could do, since it's native to OpenSIPS. If we take one example, the DNS agent on your left, you can see that it communicates with sessions, which is our main subsystem, and through there it can communicate with every component, all components, or just one component; it all depends on what you want to do with CGRateS. For some use cases, again, online/offline charging: you can have a highly configurable rating bundle with voice, data, SMS, MMS, monetary or anything else; in 1.0 you can really charge anything. You can have concurrent session handling and also a centralized CDR server, and all of this together is what others call an online/offline charging system. Another use case is a dynamic routing system, where you can use the dedicated subsystem for various routing strategies. There we can mention load balancing; the difference with our load balancer is that it does not work from static setups but from real calls, since we get that information out of CGRateS. You can also have LRN support via attributes, bundle-supported routing, quality-based stats monitoring with thresholds, and the load balancer which I mentioned. Now to get to the ENUM LCR server that is the topic. First we need to know about DNS. Probably most of you know, but DNS is something like an internet address book, where you query for something and you get back information specific to what you queried for. The answer is categorized into record types. There are a couple, but we only work with these three: A, SRV and NAPTR records. We work only with these because that's what most people need, and nobody has really asked for anything more.
To shortly describe them: A records convert domain names into IPv4 addresses; SRV records are for network services, where you can find priority, weight, port and target for your SIP services; and, most importantly, NAPTR records, which convert ENUM addresses into URIs. But what is ENUM? ENUM is basically a standard to translate telephone numbers into URIs. Here's an example of how you do that. First you need an E.164 number: you convert your number into E.164 by removing any leading zero and prefixing it with your country code and a plus. Then, to convert this E.164 number into an ENUM domain, you remove the leading plus, reverse all the digits, put a dot between each digit, and add a suffix. The suffix in this example is from the RFC standard, but CGRateS doesn't really care what you put in your suffix; in my example I even replaced the arpa part with the account string that I will use later. The DNS agent, as I mentioned earlier, is an interface; it's like a middleware, where your DNS client talks to the DNS agent, which passes the request on, and from there, as you can maybe see from the schema, it can go into sessions, reach any component, and then give the answer back to the DNS client. In terms of capability, in the DNS agent we implemented our own DNS server, DNS service and listeners, and you can have as many listeners as you want, all open at the same time. You can have UDP, TCP and TLS protocols, so it is highly configurable and concurrent. For query types we support A, SRV and NAPTR. Now for configuration. This is in your configuration files: you open a new field and name it dns_agent; and this is JSON, everything is JSON in the configuration.
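As a rough illustration, a dns_agent section along the lines described might look like the sketch below. The field names here follow the talk's description from memory and should be treated as assumptions; check the CGRateS documentation for the exact keys before using this.

```json
{
  "dns_agent": {
    "enabled": true,
    "listeners": [
      {"address": ":2053", "network": "udp"},
      {"address": ":2053", "network": "tcp"},
      {"address": ":2054", "network": "tls"}
    ],
    "sessions_conns": ["*localhost"]
  }
}
```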
So you name the new dns_agent field and enable it; by enabling it you allow it to receive and send API calls. Then you name listeners, which, as you can see, is a list, so you can have as many listeners as you want. You name your address by giving it an IP and a port. In my case I use an empty IP: if it's left empty, CGRateS puts in what's in the defaults, and the default here is just localhost. For the port I put 2053; if left empty, this would be filled by the default, which is 53. And for that address I need to attach a network. In this case I use the UDP protocol; again, if left empty, it will be UDP by default. After that I also want to be open to TCP listeners, so I create the same address but change the protocol. This doesn't mean that either one or the other will work; it means both of them work at the same time. (Something is messed up over there; they should be on the same line for the last one.) For the TLS address, since I cannot have TLS and TCP on the same address, I put it on a different port for this example. After you finish with the listeners, you connect your DNS agent with sessions, and you do that using sessions_conns. You can have either localhost, internal, or some other connection configured by you. I use localhost in this case since I want to trace the packets going between sessions and the DNS agent; you can switch it to internal if you want a faster connection or if you do not need this packet tracing. In that same dns_agent field you put request processors. Briefly, request processors define the logic of what happens after a query is made to your server. You can have many request processors; here I'm only showing one, and this is what happens with it. First we define an ID for it, which has to be different from other request processors.
It doesn't matter what you put inside, it just has to be unique. In this case I'm describing what the processor does, which is NAPTR least-cost route. After that you define filters. Because I want the least-cost route to find a SIP address for my query, I first need to be sure that the query type is NAPTR and that the leading country code is 32. This is just an example; you can have any filter that you want. The first filter checks whether the query type from the request is a full NAPTR string. If that's true, it goes to the second filter, which checks whether the query name has a prefix starting with 32; and before it does that, it converts the ENUM name back into E.164. That's it for filters. If those are true, it goes to the next part, the flags. In my case I want to create an event each time this query is made, so I put *event there, which calls the sessions ProcessEvent API each time the query matches. I also put the authorize flag, because I want to get the max usage when the query is done, and I put *routes, because I want to do least-cost routing with it. Next I put *log there, because I want to get some logs out of the query when it's done: the request and the reply. After that come the request fields. The request fields are what you want to populate when the query is made. In this case I want to populate Account, Destination, SetupTime, type of record and Usage. I populate these because I want to put them in my event later, and the event needs to use them. How do I populate them? I populate Account from the query name by stripping away the e164 part and what's before it, which leaves me with only the 1001 account that I will show later; this way Account becomes 1001. In Destination I put the query name fully converted into E.164.
In SetupTime I put *now, for the current time of the query; the type of record is voice, and the usage is one minute. For the reply fields, I define what I want to reply to the DNS client with: an order of 100, a preference of 10, the flag "U" and the service "E2U+sip", and, the most important part, the regular expression, which I build from the route parameters. I didn't show it here, but I created a routing profile beforehand with two routes, whose route parameters are two different SIP addresses: one of them is the higher-cost route and the other the lesser-cost one. Since I have that *routes flag over there, those routes will be sorted by least cost; and in the reply I want the route parameters at the first index of the routes, which, with least-cost sorting, is always going to be the least-cost route. Under reply you can see how that's done: in the routing-profile structure I go to run ID *raw (the asterisk is what "meta" stands for), iteration 0 of that ID, then Routes, iteration 0 again, and I take the value of RouteParameters, which is the SIP address it found, and populate it into the regular expression. After that I just put the replacement dot at the end. For the client I'm using dig. In this case I'm querying localhost on port 2053, the record type is NAPTR, and you can see the ENUM number that I put there, with the 1001 account at the end. For the reply, I captured this using ngrep. You can see the API that gets called, sessions ProcessEvent. The flags are exactly the ones I put in my request processors. The tenant gets taken automatically from the default config, which is cgrates.org; the ID is some random number; the time is the current time of the query. And in the event you can see exactly the fields I asked for in my request processors.
And that was just the request; on the reply side, I can see the reply from that API, where I find the max usage of 60 seconds. If you remember, I put one minute in the request; you can see it here as 60 billion nanoseconds, since CGRateS works in nanoseconds. I also have the reply on the routes-profile side. You can see that it found the routes account for 1001, the sorting that it used, *lc for least cost, and all the routes that it found, sorted accordingly. You can see the route with ID route2, with the SIP address ending in 12 and the cost it would take, 60 units, and the second ID, which is more costly, with the SIP address ending in 11. And here we get the reply back from the DNS agent after it's done: you can see that it returned the regular expression with 12 at the end, which cost 60 units, as you saw earlier. As another use case you can have a failover fallback. For example, you can have multiple answers here; in my case I would just make another request processor, put a 1 instead of the 0 over there, and it would get the second least-cost route that it finds from routes. That way you get a second answer as well. And that's about it. Any questions? I'm guessing not. If you have any questions you can also ask them in our Google group. Oh, sorry, yes? Going back to the request and the response: I saw that in the request you were getting an account ID. How are you figuring out the account of the person asking, according to DNS? Well, it depends on what you want to do. In my case I just put it in my request on the DNS client over there; you can see it at the end, in that 1001. So I give it that account ID myself. Okay, so you're giving each customer a phone top-level domain name? Whatever you want. Any other questions? Okay. Thank you.
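The number-to-ENUM-domain conversion walked through earlier is mechanical enough to sketch in a few lines of Go. The suffix handling follows the talk: the RFC uses e164.arpa, but CGRateS accepts whatever suffix you configure.

```go
package main

import (
	"fmt"
	"strings"
)

// e164ToENUM turns an E.164 number like "+3212345" into the domain a NAPTR
// query is made for: strip the "+", reverse the digits, join them with dots,
// and append a suffix.
func e164ToENUM(number, suffix string) string {
	digits := strings.TrimPrefix(number, "+")
	parts := make([]string, 0, len(digits))
	for i := len(digits) - 1; i >= 0; i-- {
		parts = append(parts, string(digits[i]))
	}
	return strings.Join(parts, ".") + "." + suffix
}

func main() {
	fmt.Println(e164ToENUM("+3212345", "e164.arpa"))
	// prints: 5.4.3.2.1.2.3.e164.arpa
}
```

With a custom suffix, as in the talk's example, the same function yields a query name carrying the account string instead of the arpa tail.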
Running DOS & Unix on an 8-bit Commodore
Thank you. Hello, everyone. My name is Michal Pleban, and I would like to show you how to do cool things with a Commodore computer. So, back when dinosaurs roamed the earth, Commodore introduced the PET. It was one of the first home computers on the market, and it started a succession of business computers that were especially popular in Europe. But time went by, and competitors were introducing more and more powerful machines. So Commodore decided to upgrade the PET, and in 1982 they introduced the CBM-II. When you look at the hardware specification, you see that it's basically an upgraded PET: a faster processor here, more memory there, but basically the same architecture. But there is one little detail that stands out and makes this machine really unique: the second processor interface. It allows you to attach a different CPU to the system and run applications on it. So, if you want to do serious business, you can attach a Z80 and run CP/M. Or maybe you want to do scientific stuff, so you attach a 6809 and run Pascal. It's all made possible because the architecture is very flexible. The way it works is that both CPUs are connected by a message bus, so they can communicate with each other, and they share access to the main memory using an arbiter. Normally one CPU is running an application, and the other one is either waiting or doing some housekeeping tasks, maybe checking the keyboard, maybe updating the timer, but basically both CPUs run at the same time. And the message-passing bus gives you inter-processor communication. So, if you have an application running on the second CPU and you need to do some I/O, maybe load a sector from the disk, then the second CPU interrupts the first one. The first one grabs memory access from it, loads whatever needs to be loaded from the disk, puts it in memory, and gives the access back. And the beauty of it is that it can work both ways.
So, normally you would have an operating system running on the second CPU and call the first one for I/O; that's the standard way, we could say. But you can also do it the other way around: run an application on the main CPU and use the second one as an accelerator. That's also possible because the bus is very flexible. If you compare it, for example, to the Acorn Tube, which is also a popular and well-known second processor interface, that one works only one way; this is much more flexible. So, given this powerful architecture, what did Commodore actually do with it? There was a Z80 card planned, and we have the schematic for it, but it was never produced. What Commodore did produce in the end was this: an Intel 8088 processor card that's supposed to run CP/M-86, and of course MS-DOS as well. So yes, you can run MS-DOS on this computer, and it looks like this. I ran a check disk, because that's about the only thing I could run on the machine, because there's a tiny little problem with MS-DOS. Today, when we think about MS-DOS, we think about the PC, because that was the operating system that launched with the PC, and the PC became so popular that it overwhelmed all the other personal computers except the Mac. So we say MS-DOS, we think the PC. But in the early 80s, Microsoft had a different plan for MS-DOS. It wanted it to be the operating system for all 16-bit machines, just like CP/M was for 8-bit machines. So we had more than a dozen different computers on the market, each running its own version of MS-DOS. And the theory was that once an application was written using MS-DOS APIs, it should be able to run on all of those computers. But the thing is, the MS-DOS API is very limited. For example, if you have a spreadsheet application, you need to be able to place a cursor on the screen and update a specific cell. Well, guess what: there's no API in MS-DOS to position a cursor on the screen.
So what you had to do was go through the machine's BIOS. And of course, each computer had a different BIOS interface. Not to mention bitmap graphics and any other advanced features: all of them had to be accessed in a machine-specific way. So what really happened is that applications were written first and foremost for the PC, maybe for a few other architectures, but if your computer was not PC-compatible, then you had very little software to run on it. And the Commodore is about as incompatible with the PC as possible, so there's nothing you can run on it. So the big question is: can we do something about it? Can we somehow make this great machine PC-compatible and run real applications on it? And the answer is: of course we can. We need three things. First, we need something that has the same interface as the PC BIOS, so that applications can use actual PC BIOS interrupt calls to interact with the hardware. Second, we need video memory, because the one thing that all PC applications do is write directly to video memory instead of going through the BIOS, because that's so much faster. And third, we need virtual hardware, because the PC has a lot of I/O chips that the Commodore does not. If you want to generate sound with the PC speaker, for example, there's no BIOS interface for it; you have to talk to the I/O chip. So we need to do something to make up for the fact that the Commodore is lacking all those chips. This is what is needed to create a PC-compatible BIOS: these are all the interrupts that need to be implemented to have MS-DOS boot on the machine. If you are familiar with low-level DOS programming, you will recognize what they are. They give access to the screen, to the keyboard, to the disk, some basic stuff.
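The interrupt-table idea above can be sketched as a toy dispatcher in Go. The INT 10h/AH=0Eh teletype function shown is a real BIOS service, but the handler body is, of course, a stand-in for what the real card does: an inter-processor call into the Commodore kernal.

```go
package main

import "fmt"

// Regs models the registers a BIOS call uses to pass arguments; AH selects
// the function within an interrupt (e.g. INT 10h/AH=0Eh is teletype output).
type Regs struct{ AH, AL byte }

type handler func(*Regs)

// A tiny interrupt vector table. The real work of a PC-compatible BIOS is
// wiring every entry (INT 10h video, INT 13h disk, INT 16h keyboard, ...)
// to code that ends up talking to the Commodore side of the machine.
var vectors = map[byte]handler{}

func install(intNo byte, h handler) { vectors[intNo] = h }

func raise(intNo byte, r *Regs) {
	if h, ok := vectors[intNo]; ok {
		h(r)
		return
	}
	fmt.Printf("unhandled INT %02Xh\n", intNo)
}

func main() {
	// INT 10h/AH=0Eh: write the character in AL (here: forwarded to stdout).
	install(0x10, func(r *Regs) {
		if r.AH == 0x0E {
			fmt.Printf("%c", r.AL)
		}
	})
	for _, c := range "Hi" {
		raise(0x10, &Regs{AH: 0x0E, AL: byte(c)})
	}
	fmt.Println()
}
```

Getting every vector right before anything works is exactly the "lot of functions" problem the talk describes.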
The good news is that we can reuse a lot of code from the existing MS-DOS 1.25, because, for example, if you want to put a character on the screen, there's already a function in the Commodore kernal that does it, and there's already an inter-processor call in the old MS-DOS that uses it, so we just need to slap a different interface on it. The bad news is that it is a lot of functions, and you need to get all of them right before anything starts to work. So if you have a few years of free time, this is a good way to spend it. The video memory is actually the easiest of the three. The PC stores the video data at a different location than the Commodore, so what we do is use a timer interrupt to copy the data from one location to the other, doing ASCII-to-PETSCII conversion along the way, but it's very simple. This way, anything the application writes to the memory where video memory would be on the PC will actually appear on the Commodore screen. And the third thing: we need to pretend that the computer has the same peripheral chips as the PC. Of course, we could try putting all those chips inside the Commodore and basically make just another PC clone, but that's not cool. There's another way to do it: virtualization. So, how do we create a virtualization platform on an Intel 8088 processor from the 70s? Well, this is a virtualization platform. The way it works: we need to be able to detect when the computer is trying to perform an I/O operation and stop it. So we put a virtualization environment here, and every time the computer tries to access the I/O, it is interrupted. The interrupt routine checks what kind of I/O access is being done, does whatever magic is necessary to emulate this I/O chip, most likely using inter-processor calls to perform the actual I/O, and then it returns, and the application thinks that it has actually accessed an I/O chip. Then it's done.
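The trap-and-emulate loop just described can be modeled as a port-to-handler lookup. This Go sketch is a conceptual stand-in for what the interrupt routine does on the real card; port 61h really is the PC's keyboard/speaker control port, but the handler here is just a latch.

```go
package main

import "fmt"

// portHandler emulates one I/O chip: it receives the port number and, for
// writes, the value; for reads it returns what the virtual chip would answer.
type portHandler func(port uint16, write bool, value byte) byte

// trapIO models the virtualization step: every IN/OUT the 8088 executes is
// intercepted, looked up by port, and serviced (on the real card most likely
// via an inter-processor call) before control returns, so the program
// believes it talked to a real chip.
func trapIO(handlers map[uint16]portHandler, port uint16, write bool, value byte) byte {
	if h, ok := handlers[port]; ok {
		return h(port, write, value)
	}
	return 0xFF // unmapped ports tend to read as all ones on a real bus
}

func main() {
	var latch byte
	handlers := map[uint16]portHandler{
		// 0x61 is the PC's keyboard/speaker control port; modeled as a latch.
		0x61: func(_ uint16, write bool, v byte) byte {
			if write {
				latch = v
			}
			return latch
		},
	}
	trapIO(handlers, 0x61, true, 0x03)                     // OUT 61h, 03h
	fmt.Printf("%02X\n", trapIO(handlers, 0x61, false, 0)) // IN 61h -> 03
}
```

Every round trip through such a handler is the performance cost discussed in the Q&A at the end of the talk.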
So, is all of this enough to actually make this platform PC-compatible and run those applications? Let's grab a couple of apps and find out. We are booting the computer and starting to load the operating system, which, as you can see, is FreeDOS. And once we have FreeDOS running, of course we start Norton Commander, what else would it be? And because we are dealing with Microsoft, of course we start with BASIC. This is QBasic from MS-DOS, with a very simple Hello World program. Let's just find out if it works. Yes, it does. And of course we are going to use Turbo Pascal as well. Prince of Persia. So, again, a Hello World program. Is it going to work? Yes, it is. Just for good measure, let's try to change it. Indeed, we can position the cursor on the screen, we can make some changes, and it works again. So, that was the Intel 8088 processor card for the Commodore. But as I showed before, you can attach many different processors to the bus. So how about we do something really cool? This is the Commodore 900. It's an abandoned Unix workstation prototype that was being developed by Commodore, but it was cancelled because they bought the Amiga and decided to focus on that one. If you look at the hardware, it's a very strange machine, because it uses a Zilog Z8000 processor, which is very rare. It was used in some Olivetti machines and some industrial equipment, but basically by 1984 nobody was using it anymore, except Commodore, of course. It has a memory management unit and it runs Coherent. Coherent is like Linux, but 10 years earlier: a system written from scratch to act exactly like Unix, but much cheaper. And it's a truly multitasking, multi-user machine: you can attach many terminals to it, log in at the same time and run applications. The problem is that because it was cancelled, only a few dozen prototypes were made, so unless you are very lucky and very rich, you can't have one.
So how about we do something about it and put a Z8000 processor on the second processor interface? Here's what is needed to create a virtual Commodore 900 using this interface; nothing very difficult here. Of course, we need to create virtual hardware again: a keyboard controller, a disk controller, a serial-port controller, and so on. But the good news is that once the I/O chips are emulated properly, we can use the original Commodore BIOS; we don't need to write a new one. Even better, because it's a Unix-like system with resource protection, no applications touch the I/O chips directly; they all go through the operating system. So once the operating system works, everything else is supposed to work as well. This is very different from MS-DOS, where every application had its own dirty ways to play with the I/O chips, and we needed to emulate all of that to let them run. So, you know what happens now. But before I show you how it works: this is the Z8000 processor card that I made. It has one megabyte of memory, which can be accessed either as 8-bit or 16-bit. It has a memory card that emulates the hard disk. And, of course, it has the virtualization environment as well. So, is it going to work? Let's find out. Now the original BIOS is starting. It's going to perform a self-test of all the hardware. The hardware does not exist, but the self-test passes. And now we are booting the operating system. Root can log in without a password; that's a very secure installation. It's a nice Unix file system. And of course, because it's Unix, we are going to program it in C. That's a Hello World program again. We have a C compiler on board, which takes a bit of time to compile this tiny program, because it's just 6 MHz. But finally it's done, and the program works. And that, ladies and gentlemen, was Unix running on a Commodore. Thank you very much. Three minutes for questions.
Hi, and thank you for the presentation. There's another project, for the BBC Micro, that uses a Raspberry Pi to emulate other processors. Instead of using the hardware that you cannot obtain today, like the Z8000 and so on, have you tried using bare metal or a Raspberry Pi to do the emulation of the other hardware, or maybe the emulation of the other processors? Have you thought of that? Yeah, well, you can obtain the Z8000 from eBay very easily, but I have not tried using any platform like this; I'm a Commodore guy. The main problem with emulating the Commodore 900 specifically is that nobody has ever written a software emulation for its memory management unit. You can emulate the CPU, because, for example, in the MAME repository you can find code that does it, but there is no emulation existing right now for the MMU, and that makes it really hard to do without the real chip. But the Intel stuff? Yes, the Intel stuff should be doable easily, of course, as long as you can virtualize it. Thanks for the good talk. How much performance is being lost in the emulation of the various I/O chips? A lot. It takes a lot of time, because first it needs to go through the interrupt routine, then it needs to go through the message bus, then it needs to come back through all of this. So that's quite a lot, and that's why I decided to bump the processor speed a little bit. The original Commodore 900 has a 6 MHz CPU; I put in a 10 MHz one, and that makes a nice difference. Otherwise, yes, you would really notice it. But there's also one thing: if you have static memory in the computer, you don't need to waste cycles on memory refresh, and that also gives you a nice speed boost. So, all in all, it works quite well. Okay, time's up, unfortunately. Many thanks, Michal. Thank you. Thank you.
A Game Boy and his cellphone
So, ready to start? Yeah. Okay, so we now have Esteba with "A Game Boy and his cellphone". Hello, and thanks for being here for this talk about a Game Boy peripheral that I think is very interesting and versatile. I'm Esteba, and I've been working to emulate and restore this peripheral on and off for the last six years or so. But first, I should tell you what it is. The Mobile Adapter GB is a peripheral that allows you to connect your Game Boy to your cell phone, allowing games to make and receive calls, and in particular to call an internet service provider and connect to the internet, allowing for all sorts of online connectivity, like sharing scores and getting updates for various things. It was one of the very first attempts by Nintendo at any sort of online connectivity for consoles, but what makes this one very interesting, in my opinion, is that it supported a few rather high-profile games. There were actually a few variations of this adapter made for several different phones: a blue, a yellow and a red one. The green one, for PHS, was also planned but never released. But you will notice that none of these work with any non-Japanese phones, so this service never left the islands, and unfortunately it was sunset very early, almost two years into its life, in December 2002. To give you a better idea of what this peripheral could do, we will talk a little bit about the games that supported it. First of all, you got the Mobile Trainer cartridge with the adapter. It was used to configure the adapter, and you had to use it before you could connect to the internet. It came with a very useful usage manual, but it also had some very interesting utilities: a mail client, which supported both SMTP and POP and could communicate with the outside world, so you could actually receive real emails; and a very minimal web browser, which was hard-coded to one website, to read news about Nintendo games and games for this peripheral.
Now, the very first game that was released for this thing was Pokémon, a very popular franchise that I'm sure you're familiar with. But it was actually one of the very first times you were able to battle and trade online with your friends, or at very large distances at least. Besides that, it also featured a Battle Tower which allowed you to fight people who had entered that tower previously. It got localized with NPCs in the west, but the Japanese version worked with this adapter. You also had a Trade Corner, which is a bit of a prototype of the Global Trade Station which appeared back in Generation 4. And you had a News Machine, which I think is the most interesting part, because you could download scripts which had news items, but also minigames and questionnaires, and you had rankings to show off to your friends how big your Magikarp is. Another very interesting game in my opinion is Net de Get, which was one of the only titles which used the MBC6 on the Game Boy. It's a minigame collection that came with 15 built-in minigames, could download more, and more would be released over time, though they never reached the titular 100 minigames, unfortunately. A few other games that were very interesting: Mobile Golf, which is a sequel to Mario Golf which never got localized, but it came bundled with the adapter later in its life to help sell the adapter. Starcom, which is a sort of pet simulator. Game Boy Wars, which was part of the Wars series known for Advance Wars and Famicom Wars. And Mario Kart, which allowed you to upload and download ghost data. So let me tell you a little bit about how this project got started and where we are now. Somewhere in 2016, Háčky posted a thread on Glitch City Laboratories which explained a little bit about how the mobile adapter protocol worked. From there we spun up a Python script which communicated with the BGB emulator, giving a proof of concept that this thing actually worked.
Somewhere in 2018, a guy named Shonumi, who is known for emulating various peripherals made for the Game Boy, including sewing machines and fishing sonars, also emulated the mobile adapter, and specifically Net de Get, and created very comprehensive documentation that we are updating and keeping track of to this day. And at some point people wanted to actually bring a real Game Boy to connect to the internet, and that's kind of where I stepped in and we started doing stuff. So fast forward to today: we have a group called REON. We are a group of preservationists, developers and enthusiasts who want to preserve this system and make it usable to the common user, as it used to be. For that we are making emulators, servers and translations for a few of the games, so that they can be enjoyed by a wider audience. So to give you an idea of how this all fits together, I will explain a bit about how the system connects together. So this is a connection diagram. On the left side you have the user's Game Boy, which communicates through a custom link protocol with the adapter, which further communicates through a proprietary protocol with a mobile phone. The mobile phone is connected to the phone network, but depending on who you call, you can either call a friend and communicate with their phone directly — this was used, for example, for the Pokémon trading and battling — or you could call the internet service provider and use the Point-to-Point Protocol to tunnel your TCP and UDP connections through to the official Nintendo servers. Now most of this stuff is kind of irrelevant when we are emulating this, because when we are emulating it, we can make big black boxes depending on what you are doing. This is how it would look if you have a simple microcontroller that connects to your Game Boy and then further connects through USB to your computer.
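As a concrete illustration of that custom link protocol: the community documentation describes each command as a framed packet with magic bytes, a small header, a payload and a 16-bit checksum. The field widths below are my recollection of those docs, so treat this Python sketch as illustrative rather than authoritative:

```python
def build_packet(command: int, payload: bytes = b"") -> bytes:
    """Frame one Mobile Adapter GB command.

    Layout (magic, 4-byte header, payload, big-endian 16-bit checksum)
    follows my reading of the community docs; the exact field widths are
    an assumption -- check the real protocol documentation before
    relying on this.
    """
    header = bytes([command, 0x00, 0x00, len(payload)])
    checksum = sum(header + payload) & 0xFFFF  # simple 16-bit sum
    return b"\x99\x66" + header + payload + checksum.to_bytes(2, "big")
```

A session is then a back-and-forth of such packets over the Game Boy's serial link, with the (real or emulated) adapter acknowledging each one.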
Your computer will communicate with either the game server, or, if you want to call a friend, then we have set up a relay to punch through router firewalls and that sort of thing, which allows you to connect to any other player in the world. And these blocks can either be hardware, or a full emulator which also emulates the Game Boy itself and the adapter, so it's a little bit more variable. So we have full documentation and emulation of the peripheral itself, or at least the part that communicates with the Game Boy. And for that we have made a library called libmobile. This library can be integrated into all sorts of projects, from software emulators to hardware emulators and back. We've integrated it thus far into the BGB emulator, which is a Game Boy Color emulator. We've integrated it into the mGBA emulator. We've made a little fun interface to configure it as well. And some people have been playing around with making it work on the Raspberry Pi Pico and communicating over Wi-Fi, for example, or the Arduino Uno, which is mostly what I've been using. There's also the GBE+ emulator, which was made by Shonumi, whom I mentioned before. This is more of a local-only emulator, but it supports some games that we don't yet. And of course, full documentation of this is available in Dandocs. So these are a few examples of setups that people have put together. On the far left you've got the simplest one, which is just breaking out a few wires, connecting them to the Arduino, and then just plugging that into your computer and doing it like that. Some people have made PCBs. The central one is able to communicate over Wi-Fi, and Xenaro, a really active user lately, has made a 3D-printed version of it as well. Now of course, you don't need to connect it directly to a computer. You can also just use a modern phone, which are basically computers these days. We've also, of course, started emulating the server side of things.
We have the relay server, which I've mentioned before, which gives you a phone number and allows you to call someone else. We have a mail server, which is implemented in Node.js and stores emails in SQL, so we can manipulate them more easily. And we have a few complete game servers: for Pokémon Crystal, which actually supports everything at this point, and a very driven person called Winter has fully emulated Mario Kart and Monopoly, though Monopoly doesn't have many features, unfortunately. Also, GBE+ has emulated a few games, in particular Net de Get, Game Boy Wars, All Japan GT Championship, and Hello Kitty's Happy House, which allows you to send emails with items to your friends, which is very cute, I think. And of course, we've also made a few translations. In particular, Pokémon Crystal of course was already localized, but we've restored all the mobile functionality for it, and we've also ported all of those changes to the four other languages that the game was released in. Mark Max came to us asking if we were interested in his Mobile Golf translation, and most of it has been translated, but not the mobile features, because we don't have any support for them yet. And the Mobile Trainer, which of course is a cornerstone of this whole thing. If you want to get into it, or make an emulator for yourself, or develop a game that supports this thing: we have of course libmobile, which allows you to emulate the adapter itself. We have the server repository, which you can extend with other games if you want to emulate those, though I would suggest, if you make homebrew, that you make your own server behind this. And unfortunately, we still don't have a client library — the library that runs on the Game Boy itself — though we have reverse engineered the library from the Nintendo SDK, if you don't care about licensing problems. So in conclusion, most of the things that you'd want to see are already there. Of course, we don't have all the games yet.
The problem that we're mostly struggling with right now is authentication, and getting this usable for actual people who aren't very techie. So if you want to help with any of that — documentation, making tools, websites, whatever — you can reach us on, unfortunately, Discord only. If you want to make a Matrix server and bridge that, I would be very happy, but unfortunately right now I would be the only person who would use it. Our GitHub is over there, and Shonumi's blog, with a lot more peripherals and funky things that he's emulated with the Game Boy, can be reached through his GitHub Pages. That was it. Thank you. Thank you. We have time for one or two quick questions. I have a very quick question. Thanks for the talk. Do you know how the original games that you could download off the Internet back in the 2000s — how those were captured? Like, that's 22 years ago. So one of the things that we actually sometimes need help with is: if you have any of the games that supported the mobile adapter, don't run them — dump the save directly. If the battery still lives, then we might be able to restore some of the games that were supported back then. Thankfully, though, we have the 15 built-in games, which serve as an example to make more, so that helps a lot already. Another quick question. Yes. No. No. Okay. Well then. Thank you. You can get prepared. It was really interesting. Thank you. Thank you.
PiStorm - The evolution of an open source Amiga accelerator
Okay, we are right on time. Many thanks. So, Andrew with the PiStorm. Hello everyone. I was stupid enough to do this from an Amiga 1200, which is great because I don't have a screen in front of me, so I'm going to try and see what I'm doing whilst I'm doing it. But it'll make sense later. So I'm here to talk about PiStorm. My name is Andrew Hutchings, I'm also known online as LinuxJedi. During the day, I work for a non-profit called the MariaDB Foundation. And by night, I restore Commodore Amigas and Acorn computers, I design upgrades for them, and I'm part of the PiStorm community and a whole bunch of other things. I've also written for Pixel Addict — do go buy that, because the next issue's got a big article by me in it. And I'm also going to plug... the artwork there was made by Stu Cambridge, of Sensible Soccer fame — he did Cannon Fodder and all of that lot. And you can get him to do doodles of you just like that from his site. What's it called now? Design Droid. He doesn't know I'm plugging it, but I love his work. So anyway, about PiStorm: it was a project created by a guy called Claude Schwartz. If you've ever tried to upgrade a Commodore Amiga today, you need a processor like a 68030 or a 68060. If you want a 68060 with a board and RAM and everything like that, you need to sell a kidney, basically. They are really rare and really expensive nowadays. So the idea was to create a very fast budget accelerator. And you can get a lot of compute resources from something called a Raspberry Pi, which you probably all know about. So what this essentially does is it emulates the 68000 processor on a Raspberry Pi — running Linux, originally — but the rest of the Amiga motherboard is still used. And then it adds things such as RTG. Now, RTG stands for Retargetable Graphics. And essentially, that means it's like a second graphics card for your Amiga. So this is what I'm actually projecting from right now: the RTG from my Amiga, alongside the native Amiga graphics.
If I tried to run an old Amiga game on it, you wouldn't see it on the screen right now, because I haven't got the output for it — I'm going to talk about that a little bit later. It adds virtual SCSI. So there's basically a driver on the SD card for the Amiga, via the PiStorm, to talk directly to the Raspberry Pi's SD card. Rather than being emulated, it's almost like a direct driver, in a way. And it also adds RAM. So I've got a Raspberry Pi 4 in here, so nearly 2 gig of RAM added to what is normally a 2 megabyte system. So a little bit of a boost. And everything is open source. The boards are open hardware and so on. What we used to do is a group buy, where you could come along and say, I want to buy one of these, and we'd all go to JLCPCB, buy loads of boards together, and you just had to solder on the headers — which was great until the chip shortage, and then that kind of died off completely. But back then, as I said, you wouldn't pay more than 20 bucks for a PiStorm. So about 18 pounds, probably about 20-odd US dollars, whatever. So it was really, really cheap. You just need a Raspberry Pi. So this is what the first one looked like. Now, you can see there are quite a few chips on it, on top of what is normally a Pi GPIO header there. So essentially, the problem we have is that the Pi GPIO is 40 pins, but you only get about 26 GPIO lines from that. And the Amiga — the 16-bit Amiga — has a 16-bit data bus, then a 24-bit address bus, and then control lines on top of that. It's a lot more than you have IO lines. So what we've got here is a CPLD chip, a programmable logic chip essentially. And we have in there basically this 68000 state machine. And that does all sorts of multiplexing communications to the Pi. And then we have some buffers, basically because voltage-level translation is needed between the CPLD and the Raspberry Pi and the external IO logic. So they were nice and simple boards.
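To make the pin-count problem concrete, here is a toy Python model — my own illustration, not the real CPLD state machine — of how a wider bus can be time-shared over too few pins: the 24-bit address and a 16-bit data word are strobed across a single 16-bit port in three phases, with the latching logic playing the role of the CPLD.

```python
class MuxBus:
    """Toy model of multiplexing a 24-bit address plus 16-bit data bus
    over a single 16-bit port, CPLD-style. Illustrative only; the real
    PiStorm logic is more involved."""

    def __init__(self) -> None:
        self.phase = 0
        self.addr = 0
        self.data = 0

    def strobe(self, value16: int) -> None:
        value16 &= 0xFFFF
        if self.phase == 0:              # phase 0: address bits 15..0
            self.addr = value16
        elif self.phase == 1:            # phase 1: address bits 23..16
            self.addr |= (value16 & 0xFF) << 16
        else:                            # phase 2: the 16-bit data word
            self.data = value16
        self.phase = (self.phase + 1) % 3
```

Three strobes per bus cycle is the price you pay for having only ~26 GPIO lines; the CPLD hides that behind the 68000 bus timing.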
We could get JLCPCB to build all these originally, until the CPLD kind of ran out of stock, and then that became difficult. And the logic that we wrote for the CPLD is enough to run it for an Amiga, but it doesn't include some of the state control lines that other systems use, because we were targeting an Amiga 500 at the time. So this supports a 500. It supports most of an Amiga 2000, the 1000, and the CDTV. And then... oh, I'm doing this on my clicker — and of course, I've got my clicker connected. So it used a Raspberry Pi 3A originally. You could have used the Raspberry Pi 3B, but you'd have to raise the header a bit, because otherwise your Ethernet port would smash into the board. And that's not good. You can take off the ports on the 3B if you don't want them, or you can extend the header. Also, a Pi Zero 2 W will work. If you don't know, the Pi Zero 2 W is basically a Pi 3, but in a much more compressed format. We ran Musashi — I hope I'm pronouncing it right — a 68000 CPU emulator, which was good. It's a pretty good 68000 emulator, and then there's some kind of glue code to make it work, but it was basically an off-the-shelf emulator. And most of that software was done by a guy called Björn. He's not part of the project anymore, but he did a lot of great early work on it. Again, I'm clicking on my clicker. So, performance-wise, you can see here... this is what's called SysInfo. It's kind of the stock benchmarking software for an Amiga. And an Amiga 600, which is the same as an Amiga 500, roughly. The original PiStorm ran about 23 times faster, which is pretty good acceleration. You're getting even faster than what was called an 030 at 25 MHz — so you're getting about a 50 MHz 030 processor kind of speed out of it, which is pretty good performance for something that costs a lot less than even the CPU for an 030. How did I get into PiStorm?
I was designing some new hardware for a Commodore Amiga, and the other advantage of having Musashi on PiStorm is the fact that you can, on the fly, change the entire configuration of the Amiga. If you want a different OS ROM to boot into, a different RAM configuration, different hardware configurations — all that can be changed on the fly. I started providing patches and helped build a community. As of probably September, we had 7,000 members on Discord and 3,000 on Facebook. So it's grown into a pretty big community. Things I've done — I'm going to skip over this, but I did a lot of the early work regarding bug fixing and things like that for the original Musashi PiStorm. Then we released a version for the Amiga 600 and Amiga 2000. They are essentially the same thing, but the Amiga 2000 has a coprocessor slot, so it's much easier to just debug it in the slot. On the Amiga 600, you have to do this hacky thing where it sits on top of the PLCC CPU, and then there's a little kind of thing in there to tell that CPU to go to sleep, and then it's basically identical after that. So EMU68 came along. EMU68 is a bare-metal emulator for the Raspberry Pi, for the 68000, so it's much, much faster. You don't have to boot into Linux anymore. This is what this machine is booted from. It became an option for PiStorm in 2021, and now it's pretty much the de facto standard, and it uses JIT-based emulation instead of table-based. So performance-wise, it got a bit faster: 1,490 times faster, and this is just on the Amiga 500. Then the PiStorm32 came along. This project was scrapped. So essentially, it's the same kind of thing, but for the 32-bit Amigas like this one. But it became very hard to build, and it required a Pi CM4, which is a Pi without all the ports and everything — you just get these big connectors on the bottom — and it became difficult and expensive to build, so we gave up on that, and instead built the PiStorm32-Lite, which is Lite because it doesn't have all the ports on it.
But basically, it's the same kind of thing. And we have a nice big FPGA on there instead of a CPLD. An FPGA is just much more logic, but you have to kind of flash it every time you turn it on. And that was basically the start of what became the PiStorm32-Lite. This is kind of the peak of PiStorm right now. We released that about a year ago, and it's still going strong. Performance-wise, we're now talking 3,052 times faster than an Amiga 500, which is not too bad. Even on the Amiga 1200, which this is, it's 1,326 times faster. And you can get faster still if you overclock it. I'm not going to overclock mine — I've got a little fan running underneath it as it is. And inside this Amiga, you can see this is what mine looks like inside. So you've got the PiStorm in here, and then I've got a little cable running out of the HDMI port to the back, and that's what's running the projector right now. And then I 3D-printed a kind of assembly with a fan in it, just to keep everything nice and cool. Demo time. So, John Carmack said the Amiga is not powerful enough to run Doom. At the time, to be fair, he was right. The de facto Amiga at the time was kind of the Amiga 500, or my Amiga 600. If you wanted one that could run Doom, it would cost you thousands and thousands, much more than a PC would at the time. But today... thirty years later, we are running Doom. Yes. But I can do a bit better. AmiQuake. And I haven't got sound hooked up, unfortunately, but what I can do... timedemo demo1. It's slow, I know. So we've just got to wait for all this demo to finish, just to get a nice kind of benchmark out of it. And there we go. So we get 93 frames a second out of Quake through the RTG. If I run this through the AGA graphics — the built-in graphics — instead, we still get about 45 frames a second. So it's a bit faster than native, which would be a few frames a second at best. Oh, there it goes, that window. So... a question: can the PiStorm modify chip RAM? So chip RAM is chipset RAM.
It's the RAM that the entire chipset in the Amiga talks to each other with. So you've got the audio chip, the graphics chip, etc. That is capped at 2 megabytes by design, by Commodore. They were trying to move it to 8 meg for the Amiga 4000, but it never really got there. No, it can't, because we don't modify the chipset. We don't override the chipset, so we can't increase the RAM that the chipset uses. So whilst we have 2 gig of fast RAM, we don't have any extra chip RAM. Can you emulate a PowerPC? Probably you can, but it's going to be a lot of work, and we don't want to do it. So if anyone wants to put a PowerPC emulator in there, it will probably work. Can you use PiStorm in other 68000-based machines? Yes. So someone's done a port — I forget the name — they've done a port for the Atari, which basically had to pretty much rewrite the firmware to make it work, because the Atari actually uses all of the 68000, instead of the hacky thing the Amiga did. I love the Amiga, but Atari did that bit a bit better. And there are similar problems with the Apple. So there are projects where they're trying to get this running. It's not all the way there yet, but they're working on it. CD32 — sorry, 3000 and 4000 versions? In theory, the one in this machine should work on the CD32, but it doesn't, and we don't know why yet. We haven't had time to figure it out. It shouldn't take much modification to make it work. The 3000 and 4000 versions are going to require a lot more bus arbitration work, so it's just time to do that. And then the really cool thing we're working on right now is the Amiga Native Video Injection Device, which we haven't got a name for yet. Essentially, what it does is it sits in various places in the Amiga, depending on the model, captures the digital video before it gets converted to analog, pipes it through the camera port on the Pi, and then you can have both native video and the RTG video through the HDMI on the Pi.
So, if you want to sponsor PiStorm development, Claude has a donate button on his PiStorm32-Lite GitHub page — do check it out. Michal, who develops the EMU68 project, has a Patreon to sponsor the development of it. And if you have any questions at all about the project, feel free to come to me. I'm LinuxJedi everywhere, pretty much, and I'll be happy to answer them. And that is it. So we have time for questions. Any questions? Thanks a lot for your talk. So according to the SysInfo output, it's not emulating a plain 68000, but an 030 or 060 or 040? So, with Musashi, you can choose which one you want to emulate. The 020 and 030 were the most stable doing that. EMU68 currently pretends it's an 040, but will support the instruction set of the 060. Okay, so that's only about the instructions, and it does not emulate the MMU, I guess. Yeah, it's just saying, hey, I'm an 040, but it doesn't really matter. It will run 060 code fine. Hi. I'm actually Debian's m68k maintainer, and I'm wondering if there are plans to add MMU support, so you can boot the Linux kernel. Are there plans for the MMU? That is a good question. Musashi — no, we did have it to begin with, and it was broken, so we didn't. EMU68, I believe, somewhat supports the MMU, but needs some work to support it properly. It's at the moment a direct one: it's basically given a block of RAM in the Pi and just told, yeah, just use that. So we could probably emulate the MMU. There's too much trouble there. Thank you for your talk. Just a quick question about the EMU68 variant. Yep. Do you need to maintain a second OS on the SD card, or is it effectively a persistent thing once it's on? No, it's a system that boots up by itself completely. There's a whole set of tools out there from the Pi Foundation to create your own bare-metal OS, essentially, so it's an OS in its own right.
The downside to that is, for every part of the hardware, we have to write new drivers from scratch to be able to talk to the hardware, which is why, if you want to use Ethernet or Wi-Fi or anything like that, it becomes a much harder task for us to do that on EMU68, and that isn't there yet. So there's no USB host support — you can't use a USB keyboard. I'm sorry, say again? There's no USB host support for the Pi? Not on EMU68, no. Right. Musashi will actually support keyboard and mouse through the Pi's USB, yeah. Still time for one or two quick questions? Yeah, one in the back. Hi, a quick one, I think. Did you have to do anything special to cope with the bring-up time for the Pi, because it's a lot slower than the CPU? That's a really good question, the bring-up time for the Pi. So the CPLD versions hold down the reset until the Pi's booted. So the machine's basically resetting constantly, kind of thing. The version in this one, the PiStorm32, will boot the native CPU first, because the FPGA hasn't been flashed. Once the FPGA is flashed, then the reset gets held down. And it's a very short time — you're talking like two or three seconds. Still time for one question? That will be the last one. So I guess the problem with the CPLD version is that AMD has announced that they're going to stop making those. So AMD — well, Xilinx — do you use it? Yes, but Xilinx is AMD, right? Yeah, no, we're not using... they announced like the last buys or something from now. We're not using Xilinx ones, so I... Ah! No, so we're using... the CPLD is an Altera MAX... MAX II, yeah — Altera, also as in Intel. And then the FPGA is a Trion... Efinix, an Efinix Trion. Ah, okay, maybe... I thought it was Xilinx in the first picture, but maybe that's wrong. Yeah. Or maybe that's a prototype.
The other projects I maintain — yes, they are all screwed in regards to Xilinx, but... Good, that's it. Many, many thanks, Andrew. No problem, thank you very much.
A journey documenting the Sanco 8003 computer
We're about to start. Please be quiet. We can start with the talk by Giomba and Julioff about a journey documenting the Sanco 8003 computer. Hi. Welcome everybody. My name is Giovanni Battista, but everybody calls me Giomba. I got my master's degree in computer engineering at the University of Pisa, which is in this nice place. I'm working as an embedded software developer, so I do low-level stuff: microcontrollers and some Linux drivers. And as you may imagine, I'm a retrocomputer enthusiast. I wrote some games for consoles and retrocomputers, and you can find me on that place there. And here with me, there is Julioff. Hello everybody. Can you hear me? Yeah. My name is Julioff. I studied in Pisa too, as an electronic engineer, and I like Pisa so much that I stayed there working, because I work in Pisa as a firmware engineer in a company that produces cameras. And today, I'm here to talk with Giomba about one of my hobbies, which is retrocomputing, and how we investigated an old vintage computer, which is this one. The story is funny, because one morning I was going to work by bicycle — that's not me — I was going to work by bicycle when I saw on the sidewalk this computer, this old computer, which was an all-in-one computer with CRT monitor, floppy disks and so on, and it was abandoned there. So I looked at it. There was a Cheddar label on it, which said nothing to me. I searched on the Internet what Cheddar could be, and found nothing, and I decided to take it, to save it from the dump. Another computer in the house. I started searching about it, and with the help of some friends, I found out that it was a clone — a clone of a computer sold by Senio and Koflec, which are French and Japanese companies. But the true producer was Sanco. So our Cheddar computer was instead a Sanco. This further information gave us nothing more, because on the Internet we couldn't find anything except some photos on Wikipedia and some advertisements. No technical information, unfortunately.
And so we decided to reverse engineer the whole computer. If you open a Sanco, you find a single big motherboard with a Z80 CPU; some peripheral chips, common chips for the 80s, like the one to interface with the monitor, the one to interface with the floppy disk, and so on; some memories — RAM for the programs, ROM for the BIOS — and around them, a ton of 74LS gates. So this motherboard is quite self-documented, because it has no custom chips: all common logic and pretty common integrated circuits from the Z80 series and Z80 peripherals. And so, you can continue. And so — we had never done this before. So the first thing that we thought was sensible to do was to start by dumping all the ROMs that were on the motherboard. So there was this first one, which was a common standard 2732 ROM. We dumped it and we thought, well, let's run a Z80 disassembler on it. And it contained a lot of things that made sense as Z80 code, as you can see. Of course, it was not all easy, because this is just a huge binary blob. There is no differentiation, of course, between text and data sections. So you just have to be creative in understanding what this code does. And we found out that a lot of code made sense if it was placed at this address here, C000. But as you may know, the Z80 boots from address zero. So it was something odd that everything was starting from C000. And in order to confirm this, we used some logic analyzers, just to confirm that it was actually a Z80, and not some custom Z80 variant that started from another address. So we found out that it actually booted from zero and then jumped to C000. It was a bit odd, but in the end we found out why. And this contained the BIOS. Let's say that we have disassembled it — you can find it here. It's not complete, but it's enough for what we are interested in, too. And then there was this other ROM here that, well, as the name suggests, is a character generator ROM.
Again, it was pretty easy to understand what it did, because we just ran the dump through a matrix: we tried different configurations for rows and columns, and we found the characters that were inside the computer. And then there was this one here. It is an old 28-22 ROM, in a narrow plastic dual in-line package. It is a bit odd — we didn't have any programmer to dump it, so we rigged up some wires and an Arduino. And there were a lot of repetitive patterns in there. We thought that it was something related to some glue logic for the computer, but at that point we didn't know anything about this. So, our first hacking attempt. As you can see here, on the system ROM there is this label that says V1.01, and if you turn on the computer, it says V1.01 on the monitor. So maybe there is some correlation. And yes, in the dump we have a correlation, because we found it here. So: let's find where in the ROM it uses this string. And we found it, and we modified the ROM in order to make it display something different. We discovered that this was just some memory-mapped area for the video display. At this point, we had to understand what the various peripherals of this computer did. So we started patching the ROM with our own code, and we installed a Zero Insertion Force socket on the motherboard — without damaging it, of course. But at a certain point we realized: well, why are we swapping the ROM continuously? We can just write our own bootloader that loads things from the serial port, in order to make our experiments. And our experiments were targeted at finding out how things worked in this computer, and at confirming or denying our assumptions about the schematics. And speaking of the schematics... Yes, speaking of the schematics: to better understand the software, we had to better inspect the hardware. So what did we do? We first started reading the manuals of the integrated circuits — the standard integrated circuits in there.
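The row-and-column experiment on the character generator ROM amounts to treating consecutive bytes as the scan lines of a glyph. A minimal sketch, assuming the common layout of 8 bytes per character with the most significant bit as the leftmost pixel — exactly the kind of guess that has to be tested against the output:

```python
def render_glyph(rom: bytes, index: int, height: int = 8) -> str:
    """ASCII-art one character from a dumped character-generator ROM.

    Assumes `height` consecutive bytes per glyph, MSB = leftmost pixel.
    Both are guesses about the layout; if the output looks like noise,
    try other row/column configurations, as described in the talk.
    """
    rows = rom[index * height:(index + 1) * height]
    return "\n".join(
        "".join("#" if row & (0x80 >> bit) else "." for bit in range(8))
        for row in rows
    )
```

Printing a few glyphs side by side quickly tells you whether the guessed geometry is right — real characters jump out immediately.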
And we tried to find the connections between them, because the datasheets were quite well documented. So we found and verified, using a multimeter, some of the basic connections, and we drew them on a piece of paper. But that was not enough, because we had to better inspect the motherboard: not all connections were written in the manuals. So we had to have a better view of the motherboard. We were initially scared by the motherboard, because we suspected that it was multi-layered. And it is — it's a four-layer board. But fortunately, it follows a very common standard where the power rails are buried in the inner layers, and the signal traces are in the top and bottom layers. So it's quite simple: you only have to follow tracks under chips and things like this. We took a photo of the motherboard and started drawing the traces on it, to keep in mind where the traces were going. And that's how we reverse engineered the board — the simplest way, I think, you know. We used GIMP, free software. And then we moved on to KiCad. So after the discovery of all the traces, we put them in a true schematic. This is not a true schematic. This is a true schematic. And about 90% of the schematic is understood by us. The other 10%, which is mostly the floppy disk interface, is quite messy, but it is copied into KiCad. The whole board is in KiCad, but we understand about 90% of it. And let's talk about some pieces of the schematic. First of all, the memory map — the memory management. As Giomba said before... no, later. So: the Z80 has 64K of addressing space, and it is almost all mapped to the dynamic RAM main memory, except for some holes for the video memory, the main ROM, et cetera. And the addressing of these holes is done by a decoder chip that matches the first digit of the hexadecimal address and enables the correct peripheral. When none of the holes is addressed, the dynamic RAM is enabled instead.
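The decoding scheme just described can be modelled in a few lines of Python. The ROM hole at C000 comes from the talk; the video-RAM digit below is a made-up placeholder, since the full hole layout isn't given here:

```python
# Hole layout: 0xC (the system ROM at C000) is from the talk; 0xD for
# video RAM is a hypothetical placeholder, for illustration only.
HOLES = {0xC: "system ROM", 0xD: "video RAM"}

def select(addr: int, decoder_enabled: bool = True) -> str:
    """Model of the address decoder: the first hex digit of the 16-bit
    address picks a 'hole'; otherwise -- or whenever the decoder is
    switched off for bank switching -- the dynamic RAM answers."""
    digit = (addr >> 12) & 0xF  # first hexadecimal digit of the address
    if decoder_enabled and digit in HOLES:
        return HOLES[digit]
    return "dynamic RAM"
```

Turning `decoder_enabled` off is exactly the bank-switching trick: every address, even one inside a hole, then falls through to the dynamic RAM.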
But there is more, because if you want to address the whole dynamic RAM, you can turn off this decoder through a switch signal and always address the dynamic RAM. This mechanism is known as bank switching. But here is a catch: the Z80 starts booting from address zero, but the ROM is at address C000. How is this even possible? You would read garbage from RAM at boot. It's simple: there is a latch circuit that, at reset, forces the ROM enable until a particular instruction is executed by the Z80, which disables the latch. So you have this memory map with ROM all over the addressing space until that particular instruction is executed, which restores the correct addressing space. The boot code has to jump into the correct area and then execute this instruction, and in this way it boots with the correct addressing space. The latch is then never used again until the next reset. One interesting part that we tried to understand first was the video generation. All the video generation is done by this CRT controller, which knows everything about the timings of the system for video generation. This is very interesting information. It produces the synchronization pulses, vertical and horizontal, and it always knows what is being displayed at any moment, so it can generate the memory address to retrieve the character that is being displayed at that moment. Let's assume that this is what is on the display. It generates an address, for example, for the first character at the top left. This fetches the index of that character, which is fed into the character ROM: this is the character that we want to display. But the controller also knows which of these lines is being displayed at the moment, through this path here, as you can see. So this selects one single line, which is then fed to this shift register, which is clocked at this speed here. 
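Going back to the reset latch for a moment, its behavior can be modeled in a few lines. The 8K ROM size and its position are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* After reset the latch forces the ROM enable for every fetch; once the
 * special instruction executes, normal decoding resumes until next reset. */
typedef struct { bool boot_latch; } machine_t;

void reset_machine(machine_t *m) { m->boot_latch = true;  }
void disable_latch(machine_t *m) { m->boot_latch = false; }

/* true if the fetch at addr is served by the ROM */
bool fetch_hits_rom(const machine_t *m, uint16_t addr) {
    if (m->boot_latch)
        return true;                        /* ROM mapped everywhere */
    return addr >= 0xC000 && addr < 0xE000; /* normal map, assumed 8K ROM */
}
```

So the bytes fetched at address zero right after reset are really the start of the ROM; the boot code jumps into the 0xC000 region and only then executes the instruction that clears the latch.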
And it produces a pattern of dots that, if you are familiar with video signals, looks like this: the synchronization pulses and the data. But all of this also has some other interesting things peculiar to this computer. We have the video ROM, but we also have another memory, the attribute video memory, which produces some bits that are fed into these combinatorial networks, and we had to understand what they did. And speaking of combinatorial networks: well, you know you can describe them using a truth table, but you may wonder why I didn't label these columns input and output, but address and data. You know the answer: combinatorial networks are just read-only memories, so they can be implemented by this mysterious ROM here. So that's what it actually does. Some of these effects are generated by simple networks like this one, like two exclusive-OR gates that can produce an inverted signal like this. Some other effects are generated by the horizontal stretch network, which just clocks the output shift register at half the pixel clock; this generates wider pulses for the data, so that you can produce wider characters. And then the other network that we have is the vertical stretch one, which simply modifies the line addresses into the character generator so that each line is addressed twice. So instead of zero, one, two, three, you have zero, zero, one, one, two, two, and so on, and you have characters that are doubled in height. Since, as they say, the Sanco is a desktop computer in the literal meaning of the word, in that it takes up a whole desk, in order to work with it we built our own adapter for the video signals, based on an RP2040, which you can find here. This lets us connect a small VGA monitor, so that we can use the machine on a desk in a more compact way. Okay, and you should know what this is, I think. We had no software for this computer, so this is not a Sanco floppy. 
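The two effects just described reduce to very small functions. A sketch, assuming glyphs 8 lines tall:

```c
/* Vertical stretch: halve the scan-line index fed to the character ROM,
 * so each glyph row is shown twice (0,0,1,1,2,2,...) and characters
 * double in height. Glyphs are assumed to be 8 lines tall. */
int char_rom_line(int scan_line, int stretched) {
    return stretched ? (scan_line / 2) % 8 : scan_line % 8;
}

/* Inversion network: one exclusive-OR gate flips the pixel stream when
 * the attribute bit is set. */
int pixel_out(int pixel, int invert_attr) {
    return pixel ^ invert_attr;
}
```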
So when a friend of ours published the CP/M operating system for the Sanco computer on the internet, we had to create our own CP/M disk for it. So we started studying how to manage floppy disks with the Sanco. We studied the BIOS routines, we learned how to write, format and read floppies, and we studied the floppy image of the CP/M system. In this way we were able to write custom Z80 code able to transfer information from the serial port to the floppy disk and, together with this custom Z80 code, a Python script on the PC side able to transfer the whole disk image to the computer over serial. This whole process took about 20 minutes. I did it at midnight, on a day when I had to go to work the next morning; I left it writing, and the next day we had the CP/M operating system booting on the computer. We were very happy to be able to do this in a single try. But we still had a huge problem: we didn't have the keyboard. How could we use this computer? A friend of ours had one in working condition, and he provided us with a capture of what was transmitted on the keyboard wire. It was simply a serial protocol. We built an adapter so we could connect a common modern PS/2 keyboard to this computer and use it as a real computer. We also thought that it was not enough to put all this knowledge into some repositories with documentation, images, schematics and so on. So we started writing our own emulator, which at this moment has a working Z80; that's easy, because we use a third-party library, of course. We have interrupts working in mode 2. We have a correct emulation of the CRT controller, with all the effects that I described before. We have some serial peripherals; among them there is the keyboard. This emulator allows you to debug your programs for the computer, because it has an integrated monitor: you can set breakpoints, inspect memory and so on. There is currently work in progress for emulating some peripheral I/O with GPIO. 
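Going back to the disk transfer for a moment: to write a whole image sector by sector, the transfer code has to turn a linear offset in the image into a track/sector position. A sketch of that mapping; the geometry constants here are purely illustrative, not the Sanco's real floppy format.

```c
/* Assumed geometry: 256-byte sectors, 16 sectors per track. */
#define SECTOR_SIZE       256L
#define SECTORS_PER_TRACK 16L

void image_offset_to_ts(long offset, int *track, int *sector, int *byte_in_sector) {
    long lba = offset / SECTOR_SIZE;          /* linear sector number */
    *track = (int)(lba / SECTORS_PER_TRACK);
    *sector = (int)(lba % SECTORS_PER_TRACK);
    *byte_in_sector = (int)(offset % SECTOR_SIZE);
}
```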
The floppy disk is in a half-working state at the moment. Since the project is starting to grow and grow, we need to add some tests. But I need to show you the killer feature of this emulator, which is this one here. Oh, it makes a beep. So: we need help with documentation, software, and people who want to help us discover more about this computer, because it is quite mysterious. So if you want, you can join us. You can find everything on GitHub and help us. And now there is some time for questions, I hope. Thank you. Maybe we can take one or two very quick questions. Hi guys, very nice work. Thank you. I was wondering if you compared this to a Sinclair Spectrum architecture, because it looks really, really similar, except maybe for the character generation and all the addresses there. Have you done a side-by-side comparison of some kind, or thought about emulating a Spectrum 48K? No. I downloaded a lot of similar computers, I don't remember which at the moment, a lot of computers that use the Z80 as their core; the Osmo uses the Z80 as its core, for example. So I tried to compare them, but they are all similar and never the same. Every time, something is different. The strange thing about the boot ROM is documented online on a website that talks about dynamic RAM refresh, but I have never seen it in other computers. So if someone knows this strange mechanism, feel free to tell us. The site was about dynamic memory refresh, and it talks about the strange ROM substitution too, the mechanism that allows the computer to boot even if the ROM is not at C000. Okay, offline. Yeah, I'm afraid we have reached the time. Many, many thanks. You're welcome. Thank you. Thank you.
MAMBO - Dynamic Binary Modification Tool for RISC-V
Okay, hello everyone. We are here to present Mambo, a dynamic binary modification tool, and what better way to start the presentation than with a demo. So we are going to see a fairly complex application running on RISC-V within our system. So let's see it. We are going to use it to learn something about the running binary. So here it is. Okay, so this is not our tool; this is just an image viewer from Linux, and we generated this picture with one of these fancy AI tools so we can promote our talk on LinkedIn. But what's really happening is that this image viewer is running under our tool, which runs on RISC-V, and then we use it to find some information about the binary. So here we have a very simple tool that counts the number of threads that the application used, so we can see we have eight threads. The application ran under our tool on RISC-V, and we can see that it used eight threads. Okay, but first things first. I'm Igor, this is Alistair, and we are here from the University of Manchester. And as I said, we are going to talk about Mambo, which is a binary modification tool for RISC architectures. Okay, but does anyone here know what dynamic binary modification is, or has heard the term in the first place? Raise your hands if you did. Okay, wow, a few people. That's good. You may not have heard the term, but I'm pretty sure that if you did any development, you have used these frameworks. Examples of very well-known open source tools that do dynamic binary modification are Valgrind and QEMU. I'm pretty sure you have used Valgrind and one of its tools, which is called Memcheck. And most of you, here in the RISC-V room, have probably used QEMU. Both Valgrind and QEMU are dynamic binary modification frameworks, and they have various tools built on top of them. So this is what Mambo is. Okay, but let's break down this term a bit. What do I mean by dynamic binary modification? 
So, dynamic: working at runtime; while the binary is running, the tool is working. Binary: we work on natively compiled code; we don't need the source code, we just take a binary that was already compiled and we can analyze it. And modification means that we can alter the application in a specific way: we can add extra functionality, we can remove functionality, we can swap functionality. There are two related terms: dynamic binary instrumentation and dynamic binary translation. Instrumentation is basically a subset of modification: we just insert new functionality into the binary. For example, if I want to do some sort of profiling, I can insert some counters into the running binary. Translation is an overlapping term: I can translate one ISA to another. We could do it by modifying the binary, or there are more specialized tools that do the translation. You are probably familiar with Apple's Rosetta, which translates Intel to ARM on your new MacBook; and QEMU can also act as a translator, and is usually used like that, because it can translate one architecture to another. So, a few uses of these tools: you can do program analysis, you can do error detection, and I'm pretty sure most of you are familiar with that use case, and there is dynamic translation. Okay, but now the question is: why would you like to use Mambo if there are other tools? Mambo has been specifically optimized for RISC-V and ARM: RISC-V 64, ARM 32 and ARM 64. In this talk we are focusing on RISC-V, but we also have a version of the tool that runs on ARM. The tool features low overhead, and to our knowledge this is the only currently available DBM tool that has been optimized for RISC-V. The tool itself is fairly low in complexity, so if you would like to dive into the codebase, it is around 20,000 lines of code. 
So if you want to learn how it works, or if you want to modify the internals, the entry bar is not that high. And it has a simple plugin API that allows you to write architecture-agnostic plugins: you can write a plugin for RISC-V and later on deploy it on ARM if you would like. But it's worth saying it is not a toy. We showed before, in the video, that we can run fairly complex applications, a full GUI tool that ships with Linux; it can run things like GIMP or LibreOffice as well. So the tool itself is not a toy. Okay, and if you are interested in what the numbers roughly look like: we evaluated it on the SPEC benchmarks, so don't worry too much about the numbers; if you want, we can point you to the paper or we can talk about it later. But the idea is: for the FP benchmarks, which are more like data processing, we get around 6% overhead if we just run the framework, without an extra tool built on top of it. And it's around 30% when we do more general-purpose computing. So the baseline, if you have no plugins enabled and you just run the binary under our tool, is around 30% overhead. Okay, so that was the brief introduction to what dynamic binary modification is, and now I'm going to briefly talk about how Mambo works internally. I'm going to mention a few details that may help if you would like to contribute to the internals of the tool. But the focus of the talk will be more on the developer side, so I'm going to talk about that as well. I would just like to highlight a few bits and pieces so you will understand how Mambo works. Okay, so this is the simplified diagram, and I'm going to talk you through the more important bits of it. The instrumentation plugin API is the part that Alistair is going to talk about in much detail, and I'm going to cover everything else. 
OK, so first of all, the first component is the ELF loader. If you run any binary on Linux, it has to first be loaded into memory, and then we can run it. If we use our framework, then Mambo itself is loaded by Linux using its default loader, and then Mambo has to load the application, which we call the hosted application. So Mambo has a custom loader built inside it, which takes the application and loads it alongside Mambo, so that Mambo can interact with it, modify it, and run it. So that's the first element. The second element is the instruction encoder and decoder. While we execute the application, we have to modify some of the instructions; we have to know which instructions we are copying, scanning and modifying, and this is what the instruction encoder and decoder does. You may be familiar with the Capstone project, which is a fully fledged disassembler. This, instead, is a very simple module that basically takes a text specification of each instruction and its fields, and uses some Ruby scripts to generate the C functions to encode and decode the fields. This is what Mambo uses, because it's fairly simple and low-overhead, and that's something we want inside a tool that runs dynamically. Okay, and now the two most important parts of Mambo: the code scanner, the dispatcher and the code cache. So let me first talk about what the code cache is. We have our Mambo, and Mambo uses the loader to load the binary into memory. Now we want to run this binary, but we also want to modify it. If we just loaded the binary and ran it, it would run as it was before. That's where the code cache comes in. This is not the hardware instruction cache; it is just allocated space in memory that we call the code cache. The Mambo scanner will copy the instructions from the binary that we loaded in memory into the code cache. 
And in the process of copying those instructions, we can introduce new functionality, remove some instructions, or replace some instructions. So the scanner is responsible for copying instructions from the loaded binary into the code cache, and the code cache is what will actually execute on the processor. Then we have the dispatcher, which is responsible for actually running the code. The scanner will copy a basic block, and then it will say: I finished copying a basic block; now I go to the dispatcher. The dispatcher will start the basic block and actually natively execute it on the RISC-V processor. When we finish the basic block, control returns back to Mambo to scan the next basic block, then again we go back to the dispatcher and the dispatcher executes the next basic block, and we have this back and forth. And if the code is already in the code cache, we don't have to scan it, so we can directly execute the next basic block without scanning. Now, this is very simplified; if we did it exactly that way, it would be very, very slow. So there are a number of optimizations for Mambo to stay in the code cache as long as possible: it does scan things ahead of time and tries to guess what the next thing it jumps to would be, and if it can do that, it can stay within the code cache. Otherwise it has to go back to the scanner and the dispatcher, if it doesn't know what the next basic block is. Okay, and this is what I was talking about: when we execute the application, we have a single process with two binaries in it and two contexts. There is the Mambo context, which scans instructions, and then the dispatcher switches from the Mambo context into the application context: it will save the state of Mambo, jump to the code cache, execute the code cache as long as it can, and then, if it cannot find the next target in the code cache, it will go back to Mambo. 
So it will save the application state, restore the state of Mambo, and then the scanner will kick in, and it goes back and forth. So this is the principle of how it works. Okay, so the dispatcher and the scanner are the two main elements in Mambo that allow us to do the modification and execute the code. And the last thing is the kernel interaction. On top of just executing the application, the framework itself has to interact with the Linux kernel: we have to handle and pass on signals and handle and pass on system calls. This is important because, for signals, if there is a signal coming from the operating system, it will first hit our framework, so it will first hit Mambo. But if you don't want Mambo to handle the signal, in many cases you want to pass it to the application, because the application may have a handler installed for that signal. And in the same way for system calls: if the hosted binary is doing a system call, let's say a thread creation, Mambo needs to know that it created a thread, because it has to track every thread that gets created. So Mambo has to learn first what the system call was, and only then can it pass it to the Linux kernel. So that's it: I talked briefly about the architecture of Mambo. We had the ELF loader, the instruction encoder and decoder, the two main elements, the scanner and the dispatcher with the code cache, and then a bit about handling signals and system calls. If you are just going to use Mambo to write your plugins and tools, you probably don't have to know all of that, but it may help to know how Mambo works; and if you want to contribute to its internals, hopefully that gives you a rough idea of how the system works. But now the bit people are probably more interested in: how we can write our own plugins, our own tools, within our framework. And for that I will pass the microphone to Alistair. 
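The scan/dispatch loop just described can be mirrored in a toy model: "scanning" a basic block puts it in the code cache exactly once, and the dispatcher then "executes" cached copies. This only models the control flow between scanner, dispatcher and code cache, not the instruction rewriting that real Mambo performs.

```c
#define NBLOCKS 8

int cached[NBLOCKS];          /* 1 = block already in the code cache */
int scan_count, exec_count;

void scan_block(int block) {  /* copy the block into the code cache */
    cached[block] = 1;
    scan_count++;
}

/* Dispatch a trace of basic-block ids: scan on a cache miss, then execute. */
void dispatch(const int *trace, int len) {
    for (int i = 0; i < len; i++) {
        if (!cached[trace[i]])
            scan_block(trace[i]);    /* cache miss: back to the scanner */
        exec_count++;                /* run the cached copy natively */
    }
}
```

Running the trace 0, 1, 0, 1, 2 scans three distinct blocks but executes five, which is exactly why staying inside the code cache matters for overhead.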
Hi, so yes, I will talk to you about the API; this is how you take Mambo and build your own tool on top of it. So this is where it actually gets really useful. We've mentioned use cases, but it's worth repeating: we're talking about things like code analysis, so you can build a control flow graph; you can generate new functionality; you can instrument code; you can analyze it; you can re-implement or patch library functions; you can do all sorts of things, because you can modify the running binary. Mambo's API exposes events, so it's event-driven. You, as the user of this API, define functions which you register as callbacks on these events, and when one of these events is encountered, Mambo will trigger the callback and execute the function that you registered. There are two categories of events. There are hosted-application runtime events: these are events that happen to the hosted application as it's being executed in the code cache, so things like system calls and thread creation. And we have Mambo scan-time events: these happen as Mambo is scanning instructions from the loaded ELF into the code cache, so things like pre-instruction and post-instruction; you can do stuff with these callbacks. As I was mentioning, pre-instruction and post-instruction kind of give you an idea: you can insert something before and after an instruction, before and after a basic block, before and after a thread. So you can see it can be very, very fine-grained, or it can be at a high level of abstraction, and of course before and after an application runs. Taking all of this together, you see a slightly chopped-off diagram there, but it kind of gives you an idea of the order in which these callbacks will be executed. 
So at the very highest level, at the very start, you have the initialization function, which is where you set up a plugin; then you'll have pre-thread, so that's quite high-level, then pre-basic-block, and you also have pre-function, so it gets narrower and narrower, and then it expands out again after these things have executed. So this is something that's important to bear in mind. So how do you actually use Mambo's API? I'm going to talk to you about the following things: the functions that you'll need to register your callbacks, the functions that perform code analysis, the functions that perform instrumentation, so how you actually emit code into the code cache, and then various helper functions which you can use. The first thing you need to do is initialize your plugin, and this is done in the plugin constructor function. There are two main things that you do here. You create a Mambo context, which is a global data structure that holds the current state of Mambo and of the application being executed by Mambo; pretty much all of Mambo's helper functions will use this context to get, for instance, the current instruction that you're looking at. And this is also where you'll register callbacks. So for instance here we have Mambo register pre-instruction callback: before an instruction is actually scanned into the code cache, something that you register here will execute. Registering callbacks follows this signature: you have Mambo register, then the event time, so that's pre or post something happening, then the event, so this can be the Mambo pre-instruction callback. It's quite easy to remember that way. So you've registered your callback. Let's say we're building a plugin that counts the number of branches that are executed, and you've registered a pre-instruction callback. So now Mambo is scanning things and your pre-instruction callback has executed. 
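The registration pattern can be illustrated with a miniature event registry. This is not Mambo's actual implementation or API, just the shape of the mechanism: a plugin registers a function per event, and the framework fires it when that event occurs during scanning.

```c
#include <stddef.h>

typedef void (*callback_fn)(void *ctx);

enum event { EV_PRE_INST, EV_POST_INST, EV_PRE_THREAD, N_EVENTS };

callback_fn callbacks[N_EVENTS];

/* plugin side: register a handler for an event */
void register_cb(enum event ev, callback_fn cb) { callbacks[ev] = cb; }

/* framework side: fire the event if a handler is registered */
void fire(enum event ev, void *ctx) {
    if (callbacks[ev] != NULL)
        callbacks[ev](ctx);
}

/* an example "plugin": count scanned instructions */
int inst_count;
void count_inst(void *ctx) { (void)ctx; inst_count++; }
```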
So one of the first things you're going to want to do is use a code analysis function: you're going to want to know which instruction you're looking at. You have things like Mambo get branch type, or Mambo get condition, which would for instance give you the condition of the branch that you're looking at, if it's a conditional branch. These give you information that you can use and choose to act on. The signature of these analysis functions follows Mambo, then an action, so that would be get, set or is, and then the information. So Mambo get function type, or Mambo get branch type, relating back to our example, would get you the type of the branch that you're looking at. Bringing all of this together into a simplified plugin: we have the constructor, where we initialize the context and register a pre-instruction callback, and when that's executed we get the branch type, and then, based on what type of branch it is, we do something. It's also worth pointing out that the branch types that we're looking at here are generic; that's how it is portable between architectures. So you've found out you're looking at a branch. Now you're going to actually want to emit instrumentation: instructions that you can put into the code cache to do something. For instance we have the emit 64-bit counter increment, so this is how you can tell Mambo to emit the instructions needed to increment a counter. You can emit pushes, you can emit pops, you can set registers, so you can do all sorts of things. There are two main types. You have the emit helpers, for example emit increment; that's more portable, because we implement the backend that tells Mambo which instructions to emit into the code cache. And then you have the more architecture-dependent ones, which emit RISC-V instructions directly; this is for when you know exactly what you are trying to achieve with the plugin. Let's say you need to emit an arithmetic instruction. 
You can do that and tell Mambo to emit this arithmetic instruction. The only drawback is that it's riskier: you have to make sure that you save and restore registers and that kind of thing, which we do for you in the safer, generic ones. And then finally you have additional helper functions. For instance, Mambo exposes a hash table, which is really useful when you're instrumenting code and you have lots of data to associate with different addresses. So we have hash tables, we have the Mambo allocator; these will help you to write your plugin. And then finally, it can be very difficult to get your head around this, and it took me a while to fully understand it: the difference between scan time and run time. When we talk about scan time, we talk about something that happens once, when Mambo is scanning something; run time is when that scanned code is executing in the code cache. The reason this difference matters is that if you are, for instance, counting the number of branches that are executed, then at scan time you need to emit instructions into the code cache that increment a counter, so that when that code is executing you get the actual number of times that instruction is executed. Okay, so it's time for an example. The code I'm about to show you can be found in the Mambo repository, in the plugins directory, and it's time for a live demo. I will be running Vim under Mambo on RISC-V to show you the source code of the branch counter plugin, which is something that you can run and is in the Mambo repository, and whilst running Vim I will also have the branch counter plugin enabled, so you can see it in action. Sounds very convoluted, I know. Okay, so here we run Mambo, and I don't know how well you can actually see that but... Command shift plus. Oh, command shift. Hooray. Do we need more? Bigger. Oh, bigger. Even bigger. Okay, yeah. 
Okay, so we start with the constructor function, which is where we set up Mambo's context, and we're registering four callbacks: a pre-instruction callback, a pre-thread callback, a post-thread callback and an exit callback. The order in which these will actually be executed is: pre-thread, pre-instruction, post-thread and then exit. So I'll start with the pre-thread. In the... Let's hear some more. Oh yeah, yeah. In the pre-thread handler we're initializing the counters for that thread, so we have a direct branch counter, an indirect branch counter and a return branch counter. The reason why we have these per thread is that each thread has its own code cache, and therefore its own numbers of branches that will be executed, which is why for each thread that gets created we initialize its own set of counters. Then we have the pre-instruction callback. For each instruction that's scanned we're checking if it is a branch; we're getting the branch type, and then, for each of the types of branches, the return branch, the direct branch and the indirect branch, we select the correct counter for that thread and we emit a counter increment into the code cache, so that the correct counter will be incremented. Okay, so at this point Vim is running away, and when we close it the post-thread handler will first be executed. This says: okay, this thread is terminating, let's take this thread's count for each type of branch and add it to the global total, and it does that atomically. And then finally we have the exit handler, which just says: okay, this application has now terminated, let's print out the global totals, which are composed of the individual threads' totals. Since Vim is a single-threaded application, we get one thread and one total, which you can see there. 
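The scan-time versus run-time behavior at the heart of this plugin can be shown with a toy code cache: at scan time we append an increment operation to the cache instead of bumping the counter ourselves, so the counter reflects how many times the cached code runs, not how many times it was scanned. Again a sketch of the idea, not Mambo's real emitter.

```c
enum op { OP_NOP, OP_INCR };

enum op code_cache[16];
int cache_len;
long branch_count;

/* scan time: called once when a branch is scanned; emits instrumentation */
void emit_counter_incr(void) { code_cache[cache_len++] = OP_INCR; }

/* run time: execute the cached code */
void run_cache(void) {
    for (int i = 0; i < cache_len; i++)
        if (code_cache[i] == OP_INCR)
            branch_count++;
}
```

One emitted increment executed five times yields a count of five, with the branch having been scanned only once.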
Okay, and now I'll quickly talk to you about some lessons that we learned from porting Mambo to RISC-V, because it was originally written for ARM, so there are differences that we had to take into consideration. The first thing was the range of branches. Conditional branches and direct jumps have quite a limited range on RISC-V, which is less of an issue on ARM, where they have a much longer range. Why this matters: in a compiled binary, the offsets will obviously be fine, because that's how it was compiled. But when you take that code and put it into a code cache, that's done as it's needed, so the ordering of that code may be different; therefore the offsets may be different and may exceed the ranges of the original binary. And so we may have to replace these instructions with instructions that have a longer range. With a conditional branch, we may have to insert an additional jump instruction, which is triggered when the branch condition is true, to extend the range of that branch. And similarly, a direct jump may need to be replaced with instructions that first load the address into a register and then take a register jump. We also have load-reserved and store-conditional. You can only have a limited number of instructions between these two instructions, and you also can't have loads and stores in between, otherwise the store-conditional will fail. This matters in dynamic binary modification because we can insert additional instructions, so we have to place limits on what you can do with atomic instructions in plugins, and with the other optimizations we implement we have to be mindful of this limitation. And finally, we have the thread pointer register, x4. There isn't a dedicated register like this in the general register file on ARM. 
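To make the branch-range problem mentioned a moment ago concrete: on RISC-V, B-type conditional branches have a 13-bit immediate (±4 KiB) and JAL a 21-bit immediate (±1 MiB). Below is a sketch of the decision the scanner has to make; the three-instruction AUIPC/JALR fallback is one common scheme, assumed here rather than taken from Mambo's source.

```c
#include <stdint.h>

#define B_RANGE   (4L * 1024)    /* +/-4 KiB, 13-bit B-type immediate */
#define JAL_RANGE (1024L * 1024) /* +/-1 MiB, 21-bit J-type immediate */

int branch_fits(int64_t offset) {
    return offset >= -B_RANGE && offset < B_RANGE;
}

/* Instructions needed for the relocated branch in the code cache:
 * 1 = original branch still reaches its target,
 * 2 = inverted-condition branch skipping over a JAL,
 * 3 = inverted-condition branch over an AUIPC+JALR pair (assumed scheme). */
int rewritten_length(int64_t offset) {
    if (branch_fits(offset))
        return 1;
    if (offset >= -JAL_RANGE && offset < JAL_RANGE)
        return 2;
    return 3;
}
```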
And so when we create a new thread, Mambo will save and restore the context by saving and restoring all registers. We need to make sure that the thread pointer actually points to the newly allocated thread-local storage, otherwise there will be a world of pain, as we found out. Okay, so in terms of the roadmap, where we take it from here: we of course want to foster our open source community. We really welcome collaborations and contributions, not only plugins but also contributions to the main internals of Mambo. As part of this, we are currently in the process of improving documentation and also developing more tools to give people a flavor of what's possible; for instance, we're currently porting Mambo's memory checker from ARM to RISC-V. We also try our very best to keep up with all of the new RISC-V and ARM extensions that keep appearing. We also have various research projects ongoing that make use of Mambo. And, it probably goes without saying since this is a talk at FOSDEM, but Mambo is open source on GitHub under an Apache 2.0 license, so definitely check it out. And we'd like to thank our sponsors. So, yeah, any questions? Yeah. Oh yeah, yeah. So you're asking how we handle pointers when we scan code from the binary into the code cache, since those pointers still point into the binary. In the scanner we handle instructions like that specifically. For instance, if we take a branch instruction, the first time that branch instruction is executed it will point to Mambo's dispatcher, which will perform a lookup. We then have optimizations which replace that branch instruction with a direct branch to the next basic block. And the same for loads and stores: we update these to point to the new location. Oh, sorry, yeah, I'll repeat the question: what is a basic block? A basic block is a single-entry, single-exit sequence, so it essentially ends when there's a branch to somewhere else. At the back. 
Yeah, so in the general case... oh, I keep forgetting to repeat the question. So, how often is the load-reserved/store-conditional limit an issue? We find it's not much of an issue; most applications won't hit it. It becomes more of an issue when you have plugins that do something in between. For instance, if you're counting a specific type of instruction that may occur between those two instructions and you emit stuff into the code cache, you may end up exceeding the 16-instruction limit. Next question: I mentioned translation early in the presentation, so does MAMBO support running ARM binaries on a RISC-V machine and vice versa? Not currently; you need to be on that architecture. What happens if you run a just-in-time compiler under MAMBO? MAMBO is designed to support self-modifying code. You have some code in the code cache; when the JIT recompiles it, the cache will be flushed and the code re-scanned. That carries some performance penalty, but MAMBO will react to things like that, re-scan the code, and put the new version into the code cache. So it should work, hopefully. This isn't tested on RISC-V, because most browsers don't seem to be ported yet. Any other questions? So, what are we interested in doing with RISC-V applications via plugins? We're interested in building tools that perform things like memory checking and data-race detection: tools that are useful to people developing software on RISC-V. And just to add, we haven't mentioned it on the slides, but we also have some research, done on ARM, on architectural simulation: co-design of accelerators and CPUs in SoC systems. So there's some other stuff going on as well.
At the moment, I think for RISC-V the biggest push was to get the base system to work, and now we are exploring what we can actually do with the system on RISC-V. Any other questions? Does it update sections that refer to pieces of code, like jump tables between basic blocks? So the question is whether MAMBO supports jump tables, and how. We do not rewrite any of the sections of the original binary; MAMBO works on demand. Say we have a jump that uses a jump table: MAMBO will try to remember the most recent jump targets, but if you miss, you have to go back to the scanner, scan the code again, and then go through the dispatcher. So we use the addresses that are already there, and we keep the translations of some addresses in the code cache, but not all of them, and we do not rewrite the actual jump tables in the data section of the binary. Any more questions? Okay, so the question is about the data-race detector and whether we could implement some sort of stepping back in time within MAMBO. The data-race detection is in its early stages, and you will not get functionality as rich as rr or GDB's replay; we don't currently have functionality to detect the data race itself. But in the general case, if you want to inspect what's happening, you can introduce a trap instruction into the code cache, run under GDB, and when you hit the trap you can inspect what's in the basic block after translation and compare it with what was there before translation. So you can do this sort of thing manually, but there is no automated way to replay and go back in time. Thank you.
Unleashing RISC-V in Managed Runtimes: Navigating Extensions, Memory Models, and Performance Challenges in OpenJDK
Hello, my name is... does this work? It's green, it's good. Yeah, my name is Robbin Ehn. I work at Rivos on RISC-V, and I'm mostly working on the OpenJDK, so I'll talk about some of our experience with the OpenJDK. Unfortunately for me, I can't lie too much, because I see some experienced OpenJDK people in the crowd here, so we'll see if they correct me. This is basically what I'm going to talk about: the OpenJDK, the JIT (which is kind of important for a new architecture), trampolines (like MAMBO has), some cross-modifying code, all the extensions we have, and a bit about sign extension. I was also going to talk about canonical NaNs, but I think Ludovic did a good job of that, so I might just skim through it. I'm not sure how much everyone knows about the OpenJDK, but it's a huge C++ code base with inline assembly, and a lot of the C++ code is architecture-specific, since we have different ABIs on different architectures, so the C++ code needs to know the ABI for each of them. We have a template interpreter, which means we have assembly snippets implemented for each thing we want to interpret, which jump to each other; it's not C with a big switch statement. We have two compilers, C1 and C2: one is very fast and one is a bit slower. The first compile is usually done with profiling, so we profile in the interpreter, we keep profiling when we compile with C1, then we compile with C2 and drop the profiling, because profiling eats into your performance. The template interpreter is actually generated during startup, because we customize it: you might use a GC which requires specific load barriers and such. So we generate the code for the interpreter, and we generate a bunch of other code too, like a lot of assembly glue between the runtime, the compiler, and the interpreter.
So, the RISC-V port: it's fully functional, all great. Well, we are missing some optimizations, and when we say fully functional, we mean with limited testing. As Ludovic talked about, testing is a pain: we have small boards, we have QEMU, and OpenJDK has a lot of tests. We have tests that can run for a week, just one test; if you take that and try to run it in QEMU, it will take forever. We have JDK 21 and 17, and we're working on getting the port done for 11. I wouldn't recommend JDK 11; I would recommend at least 17, because it's much faster, it's better, and you also get a modern GC. The other platforms, like x86, have had something like 25 years of optimization, and our port is maybe four years old, so we are missing at least 20 years of optimization in the codebase. Just-in-time compilation: why? The obvious reason is that we want write once, run anywhere, but we also have some other things going on in the OpenJDK. We have a dynamic class hierarchy, as we can do class loading (we always do class loading, otherwise we wouldn't get any classes), which means the hierarchy is changing. So it's not such a good idea to try to pre-compile, because at any given time your class hierarchy might be different. Even if you did pre-compile, since mostly everything is virtual (it's virtual by default), you would do virtual calls all over the place, and that would be slow. With a JIT and profiling we can avoid virtual calls and speculate a bit about the class hierarchy. When do we compile? We compile hot methods. As I said, first we compile with C1 and keep profiling, then we compile with C2. What we do is a kind of speculative compilation, which means that if we see you have never executed a branch in your method, we may choose to remove that branch and put in a trap instead.
So if you actually want to run the code in that branch, we trap instead, deoptimize, and go back to the interpreter. We can do this speculation based on the profiling. If you have a hash table and you put Cars in it and call hashCode, we can guess that this call to hashCode will be on a Car. So we don't need to do the vtable lookup; we can instead guess that you're putting Cars in here and call Car's hashCode directly, until we're proven otherwise. We also need to do some cross-modifying code. Compiling is a bit expensive, so if we can just change the code in place and update whatever was missing, rather than deoptimize and recompile, we will do that instead. So, jumping directly to a JIT'ed call site: when the JIT lays out a call site, we have two instructions, jump-and-link (JAL) and jump-and-link-register (JALR). And when we lay out the call site, since we have a dynamic class hierarchy (I forgot to say on the first slide that classes are loaded on first use, which means the compiler is not allowed to load classes; they have to be used by the program), we might not know where the call is going, because we don't want to resolve it: resolving the call site might mean loading classes. So when we lay out certain kinds of call sites, we need the full address range for that call site, which gives us two options: we can either load the address or materialize it. Materializing requires a bunch of instructions; I think the example on the slide only materializes six bytes or something, maybe someone fluent in assembly can tell me. Normally you would maybe do a table lookup here, but we wanted to lay out as direct a call as we can, without any loading of data and such, and that's why the call site looks like this. And for the full picture, it actually looks like this.
So we actually lay out a smaller call site in the code, which calls a trampoline, which loads the address stored just under the jump-and-link in the trampoline, and then we end up at the destination. But as I said, a dynamic call site can be unresolved, which means that when we generate the code, we just point the trampoline to a resolve stub. The first thread that actually executes this will need to resolve the call, wherever it's going. So if A is Car.hashCode, when we lay out the code we don't know this; we need to resolve it and figure out the receiver of the call. Then we have cross-modifying code. What is cross-modifying code? It's when one core is writing to, or changing, the instruction stream while another core is executing that instruction stream. It's a bit complicated, of course, but OpenJDK does it a lot. Avoiding recompilation is basically the goal, because especially during startup, when your class graph is changing all the time as you keep loading classes, if we compile something that looks hot, we don't want to throw it away, recompile, throw it away, recompile. Instead we do the speculative compilation, lay out code, and fix it up a bit later. You can talk about two types of cross-modifying code. Synchronous: you're basically waiting for the other CPU to fix the instruction stream ahead of you. Here the modifying processor does a store to the instruction stream, then releases a guard; the executing processor waits on the guard, and when it's released, picks up the new instructions. (It's not that easy: picking up the new instructions is not a simple thing, but I'll get to that.) And then you have asynchronous cross-modification, where we just store directly into the instruction stream; the executing processor might see the new or the old instruction, we don't know, so we need to handle both.
So, back to our example. One thread calls resolve. After it has resolved who the receiver of this call is, it patches the eight-byte address stored in the trampoline, so anyone else making this call will reach A. But we still allow threads to see the old destination, which means both the old trampoline and the new trampoline are valid: if you see the old one, you hit the resolve stub, see that the call site has already been patched by someone else, go back and re-execute, and pick up the new destination, A. So, on the point of when the executing thread actually sees the new instruction stream: especially in Zjid, the extension for cross-modifying code, we talk about a point of unification. That means the modifying processor and the executing processor agree on the global state. I'll use the terminology from that extension, and mention it more later. So we have patched the trampoline. All good? No. Someone loads a B, which is also of that type, so we have a new receiver, and we actually need the vtable lookup. So we need to patch the trampoline once more and add a vtable lookup before we can land on A, because the receiver could have been a B. So the trampoline is not patched just once; it can be patched, I think, at most two times. And in this case all three ways of calling are live at the same time, because one thread lagging behind can still see the resolve stub, someone else might see the direct jump, and someone might see the vtable lookup. We allow all three to be valid at the same time. We do this with a small piece of code in A which verifies that, when you did the jump to A, you had the right receiver as your intended target. But that gets really complicated; the main point of the slide is to show that we need to be able to patch the call site multiple times.
Now, what we're doing here is actually not cross-modifying code on RISC-V, because we do an LD of the eight-byte address, and the patch is a store of an eight-byte value. It happens to sit just below the instruction stream, but it's never read as an instruction, since we access it with an LD. So in this case we're actually not doing cross-modifying code. But there are still some problems with this. First, as the address sits just below the instructions, your pipeline might try to decode the constant as instructions. You also have the problem of reading data from the same cache line you're executing; some processors might not like having the same cache line in both I-cache and D-cache. And we have the overhead of the jump from A to the trampoline. I should also mention why we need this slot to be atomically patchable: that's why we can't use the li pseudo-instruction, since it expands to something like seven instructions and we can only patch one instruction atomically. So what we're suggesting on RISC-V is to do the load directly at the call site in A, and keep the address as a piece of metadata instead of a full trampoline, which gets rid of one jump and puts the address on a separate cache line, so it should be faster on any RISC-V processor. This is just the general philosophy of OpenJDK: in hot paths we don't have any synchronization; we allow execution of stale instructions. If you know the ISB instruction on AArch64, it's really expensive; we cannot have that in a hot path, since we're trying to compete with C++. In slow paths we try to reach the point of unification; if you're on AArch64, that means there's probably an ISB instruction in your slow path. And there's a list of other examples of cross-modifying code. The JIT itself is cross-modifying: code is compiled by one thread.
The pointer is installed by one thread, and another thread picks that pointer up and jumps to the JIT code; that in itself is cross-modifying code. Another one is field accesses: the class for the field access is not yet loaded, so we don't know the offset for the field. We basically say "you need to fill in the offset here", and the first thread that hits this path needs to load the class if it's not loaded, figure out the offset, and patch the code. Then you have various barriers for methods, because they can get invalidated and we might need to update the method, so we have guards and barriers to protect it. We can have addresses of objects directly in the code stream, so when the GC moves an object, we need to change the immediate for the object that was moved. We can have GC barriers as immediate values, so when the GC changes color, we might need to update the load barrier to reflect the color change. Now, the point of unification. On AArch64 that usually means doing an ISB. We don't have that; what we have is fence.i, which is not so good. What we're doing today is something really crazy: for every write we do to a page that came from the JIT (meaning we think we're doing cross-modifying code, even though my first example was not), we call riscv_flush_icache, which means the kernel does an IPI on all CPUs and emits fence.i. Every write is really expensive. And from the last slide: if we have GC barriers as immediates that need to shift color for every object load in the instruction stream, we might change ten places in one method to reflect the GC color change. That's ten writes just in that method, and that causes ten IPIs. So every write reaches a point of unification.
So cross-modifying code on RISC-V is working really well with OpenJDK at the moment, since we basically don't have any races: we do the IPI on every write. On a really small board I see it costs about half a percent of performance; on a large, real server-class CPU, maybe 2-3 percent of performance lost due to all the IPIs all the time. At the point of unification, the modifier needs to make its stores visible, and the executing side needs to make sure its instruction stream is invalidated so it picks up the new instructions. We still think we can do a bit better with what we have: since fence.i is an unprivileged instruction, we can emit it ourselves in the slow path, so we don't need the IPI. But we need help with context switches. Say you're on your hart (to use RISC-V terminology): you emit your fence.i and think you have invalidated your instruction stream, but the kernel moves you to another hart. If the kernel moves you, the kernel would need to emit the fence.i so you know the instruction stream is invalidated on that hart as well. And what's going to save us, we hope, is the Zjid extension for instruction/data synchronization. Instead of fence.i we would get a better-scoped invalidation instruction, but more importantly we will get limits on instruction fetching. The architecture currently allows out-of-order fetching, which is problematic for us. If you have a call, an AUIPC followed by a jump-and-link, then even if you first NOP out the jump-and-link and then NOP out the AUIPC, the instruction fetcher could fetch the jump-and-link before the AUIPC: it can pick up the old jump-and-link but read a NOP where the AUIPC was, and then you're toast. Zjid will specify how instruction fetching works, what we can overwrite without tearing instructions apart, and so on. We're hoping we get that in place this year. How long have we been going? Okay, that's fine. That brings me to extensions.
We have a bunch of extensions. When I looked (maybe this is totally wrong), I found 60 ratified extensions, which add about 450 instructions on top of the RV64 base, and I found 45 unratified ones adding another 400 or so instructions. As an example, this fall I was looking at CRC32 a bit. OpenJDK has an implementation of it in Java, which works fine, but you probably want an intrinsic for it to make it faster. You can write a table-lookup intrinsic with the base ISA, which is the standard CRC32 intrinsic. But you can also use carry-less multiplication for an even faster intrinsic: there's scalar carry-less multiplication in the Zbc extension, and there's also carry-less multiplication in vector. So there's the possibility of four implementations of the same CRC32 algorithm: one in Java, one for the base ISA, one for Zbc, one for vector. Which is too much. Also, I at least am getting really annoyed with the architecture description string you hand your compiler; this is just the first of four lines on the slide, and if you have a server-class CPU, I'm not sure how long it can get. As Ludovic was talking about profiles, we're hoping we get nice profiles; right now RVA23 is perhaps the one that looks best. For the JIT, you need to add an option for every one of these extensions, but we have hwprobe, so we can detect them automatically. The sequence is: an extension appears, we add an option, then it gets hwprobe support. So basically you need something like a 6.8 or 6.9 kernel to make everything work nicely; I recommend 6.8, otherwise you need to add all the options on the command line. This brings me to the next set of problems. We have some major choices: does your CPU allow misaligned access? Do you have vector? What is your memory model? We allow things to be turned off. And the JIT, since we do this cross-modifying code and such, is really sensitive to code layout.
So if we change anything in the code layout, we would like to test it. Since there are so many options that change the code layout from the JIT, there are very many combinations we would like to test, but we only have basic boards and QEMU. That makes it really hard to guarantee that your combination will work fine, because I guess everyone is testing the combination matching the CPU they are targeting, so I think there are a lot of combinations which are hardly tested at all. Then we have compressed instructions. We have an option for it; you can turn it on and off, and we have an assembler that just emits the compressed instruction for you where possible. Since we're sensitive to code size, and some parts are fixed-size, we turn off compressed in certain parts just to make it easy on ourselves, because we want code at a certain alignment or a certain address. We see a 5-10% code size reduction. One thing we could do better: the compressed formats only have 3 bits for the registers, and we don't take that into account when choosing registers. For example, we have the heap base: if you use compressed pointers for your objects, we keep a base register for them, which means every time you load an object we need to materialize the full address from it. That base lives in X27, which means we can never use compressed instructions for it. If we were to put the heap base in another register, like X14, then we could use compressed more. Next, which Ludovic touched on: memory models. There's the weak and the strong model. In OpenJDK we're often dealing with three software models: the HotSpot memory model, which is from the 90s I think, so it predates C++11 and C11; the Java memory model; and the C++ memory model. Since we also have two hardware memory models, we get a lot of mapping between them, so we basically have six combinations. And extensions increase the complexity further, because then you have Zacas, which introduces CAS, which means we need CAS mappings for the memory models as well.
So yeah, again, if we're going to test all combinations, it will be really costly. Now, sign extension. Maybe it's just me, but I'm not friends with it. Sign extension is when you have a word and you need to enlarge it. (Oh, I only have a few minutes, so that's good.) To widen it to a double word, you replicate the sign bit, so we preserve the sign of the word when we treat it as a double word. And we do this because some of the instructions use the full register: branches and OR, for example. This is all fine when you let the compiler do the work. But we have so much assembly, and we do type-less passing: we have templates with inline assembly, so you get a type T and then you're supposed to plug in your inline assembly. And we have type aliasing, meaning we have one type and we access it through a pointer to a different type. So when you write all this, you need to think both about the short representation of your word and about the word as eight bytes. I get confused, and suddenly my branches go somewhere else because I forgot sign extension. So I'm not a fan of it. And on NaN signs I don't have much to say beyond what Ludovic said, but here's one example. If you're writing Java code and you use this method, you will be surprised: if you have a negative NaN and you ask it for the sign, you don't know what the bit will be; it depends on the instructions used. The C++ version is even more complicated, because the compiler may choose to evaluate it at compile time, which means you get whatever the compiler thinks the sign bit should be, while if it's executed at runtime, it depends on the instructions. So if you see anyone using functions like this without considering NaN, there might be a bug. Sorry, one slide too many. So yeah, personally I like RVA23, but of course I want Zjid, so we can formalize the cross-modifying code.
And I'd also like some of the more atomic extensions: I think Zacas is just optional in RVA23; I would like it mandatory. And I would like one more instruction to materialize a 64-bit immediate; that would help. Because of the load we're doing in the trampoline (even if we remove the trampoline, we're still doing a load), we can have cache misses, which means the call can be really expensive. And all the additional loads we need for the JIT itself, or for the JIT code, cost memory bandwidth. When you're competing with other platforms that can materialize a large enough immediate and have it atomically patchable, it's hard to compete when we can't do that in those cases. But I guess the road to a single instruction that materializes a 64-bit immediate will be long. Thank you. Yes? Two questions. First, is there a Linux interface for sending the remote fence.i's, for the IPI? You can use the glibc i-cache flush function: there is a glibc wrapper over the syscall, so you can just say "I want to flush the i-cache" and it does the call for you and fixes it up. I can't remember if we changed that or if we're still using the glibc wrapper. Sorry, I can't hear. Yeah. I haven't given it much thought; I'm not a big fan of compressed, so I don't mind what we're doing now. From what I've seen, it's the smaller boards that gain performance from compressed; on the big out-of-order CPUs we're waiting for, we don't think there will be much difference, so we haven't spent time on it. I forgot to repeat the question, sorry. So: you're not sure the code size decrease is measurable as such, but was I actually able to measure any sort of performance difference? Using the VisionFive 2, I've seen some performance improvement, but that's an in-order, simpler CPU. So yes, on the VisionFive 2 I see some performance improvements when using compressed.
Yes. And am I using the VisionFive 2? I have one at home. We have many boards, but that's the one sitting next to my desk, so I often use it. So yeah. Well done: no corrections from the OpenJDK crowd.
A framework for RISC-V SBI verification and ISA extension validation
I'll get one more minute? Sure. (Off-mic chatter about a speaker who fell ill out of the blue and had to cancel.) "I'm the last speaker, and here's our hero who fills in." Yeah, thanks, Björn, for letting me fill in. I had originally wanted a 15-minute session just to advertise this framework, because I'd like to encourage people to contribute to it; I ended up with a 30-minute, or however long, session because of the cancellation. "You'll have an hour." Don't worry, there's lunch, and maybe I can do it four times. Anyway, quickly, about who's standing in front of you talking: I work for Ventana. I work on the Linux kernel, also KVM, OpenSBI, and QEMU, and I'm trying to help build the software ecosystem we need for RISC-V. I'm also participating in the RVI working groups and in RISE, which we heard about earlier today. Prior to RISC-V I worked on AArch64, at Red Hat, also on virtualization, so the Linux and KVM bits, QEMU as well; I've carried that over into the RISC-V world as part of the virtualization work I did previously. I got involved with kvm-unit-tests, which existed before my time, because it's quite old. I wanted to use it for AArch64 specifically, so I did some ports; I also ported it to PowerPC and then left that for others to maintain. I don't think it's getting a lot of action, but it's there. And now I'm bringing it to RISC-V, and that's what this talk is about: the fact that we have this tool available to us. The outline is just: kvm-unit-tests first, a quick overview of the framework generally.
And then, regarding RISC-V, the use cases I see that we could apply it to right away, and also as the framework evolves. And then the "and you" part is my appeal for contributions. So, as I said, kvm-unit-tests is actually quite old; it's as old as KVM. Avi created it shortly after his first couple of KVM commits in order to start testing, to make sure KVM actually works. Over that time, though, we've been expanding its targets. Now we can test not just with QEMU as the user space, as originally, but with kvmtool, or you could probably put rust-vmm or whatever you want in there, crosvm... I mean, with some effort; it doesn't just drop in at the moment, but you can already test other hypervisors, and people do that. We can even test on hardware now, because for at least x86 and AArch64 we've added the ability to boot via some sort of UEFI-capable boot loader. So what are these tests that I keep talking about? They're actually little tiny guest kernels, because that's what Avi needed for testing KVM: a guest OS that would boot and exercise some of the stuff the hypervisor needed to provide for it. So that's what they are, these little guest kernels. Originally they booted in maybe hacky ways, but over time we've actually tried to build the framework in a way that is easy to port and easy to maintain. So we even have DT support in there, and some limited ACPI support, for the booting. Like I mentioned, we can boot with the EFI protocol, which helps us boot on hardware directly rather than through a hypervisor. And for AArch64, ARM, and RISC-V, I've also taken my notes from the Linux kernel's boot requirements: particular registers need to be set in a particular way when you first jump into the kernel code.
So we follow that protocol, and then everything just kind of works for bootloaders that already know how to do that: any bootloader that can boot Linux in this direct way can boot these unit tests. And you're in privileged mode, because it's a little kernel in kernel mode, so you can do all the things you would do there: manipulate the page tables, set up your own exception handlers, generate exceptions and make sure they do what you expected them to do, things like that. You're privileged, so go nuts. Despite the fact that we're actually writing kernel code, we don't have to make it complicated; it shouldn't be hard to do, or at least feel hard at first look. The framework tries to let the unit tests be written like a C application, so they look and feel that way: you've got your main function, which is the entry point for the test, and we have a bunch of libc APIs ported over (not a bunch, but enough for most tests, and we of course welcome additions as necessary, whatever looks like it's needed). All the expected ones are there; assert is there, which is of course one of the most important ones for a test framework. Also, with the scripting wrapped around these tests, when you execute them (at least over QEMU) and you hit an assert or any sort of unhandled exception, you actually get a backtrace, on all the ports that support stack walking. So we have that. This is just a little snippet of code to show you: don't be afraid, it's just main, very simple. Even environment variables can be provided to the unit tests. For that we do a little trick where we take a text file of environment variables (your usual key=val lines, a whole list of those) and put it into an initrd, so it's in the RAM disk.
And we can just read them out of there; we can find it through the DT and all that stuff, just like we're supposed to. Then we load those environment variables into memory, and you can use them like in a normal C program. That can be nice for passing in your expected values and whatnot for unit tests. You can also pass expected values on the command line, of course, which is a little easier to do, but if you have too many of them, it gets kind of ugly. And people who want to test on hardware are free to manipulate their device tree in any way they want. So they could create a special node for test cases, sure, why not, and the unit tests would just parse that node and get all their input. However you want to do it. So how do you run the tests? Originally, it was from the command line, just for running KVM guests. That still works, of course: you just pass the test as the kernel, with the -kernel parameter to QEMU. Depending on which KVM user space you're using, you'll do it in some similar way. There's also some bash wrapped around all of that stuff. It runs all the tests for you automatically, so it can be built into CI very easily, and we do have it built into many different CIs already. Or you can run just a single group. Then there's the question of why bash. Some people wonder, because it gets kind of awkward to add more advanced functionality to the test harness when you have to write it in bash. It was historically in bash; that's probably the main reason. We've actually had this discussion a couple of times: should we use Python, or Go, or whatever the latest thing is these days? It would be a little easier for the harness. But we had some pushback from people who have been using this framework quite a lot.
They like to have a very lightweight framework that they can put on an embedded, busybox-type thing, where there's nothing there except bash, and they didn't want to bring in libraries and everything else for something else. And bash is not that painful; we don't have that much functionality, so I don't really have a problem with it. Another thing we can do is build standalone tests. Nothing changes except you run make standalone, and it wraps a lot of that bash around the binary, after converting the binary with base64, so it's all embedded into one nice text file, bigger or smaller depending on how big your test is. And you can just email that or send it to people. So if you build a quick and dirty test — and I'll get to quick and dirty tests a little later in the talk — like a few lines just to prove your point that something is broken, then maybe you just want to package it up with this make standalone thing and mail it to somebody. They can run it and see for themselves. I don't think that's used a lot; it was one of the things I invented that I thought would be useful, but not too many people have been mailing these tests around. So now we know what the framework is, and this is a RISC-V talk, so we finally get to RISC-V. We already have a use case for it: the RISC-V PRS working group has more or less committed to using it for the SBI verification framework. The SBI, for those of you that don't know — I guess most people in this room do — is the interface between supervisor mode and machine mode, or also between virtual supervisor mode and the hypervisor. And we in the RISC-V community are trying to keep that interface from going nuts in all sorts of different directions, so we have a standard for it, the SBI spec.
So when we want new functionality — the supervisor needs to ask for some service or some information from M-mode, or we want to emulate that M-mode for the guest — then we need to extend this interface, this SBI. And as we add these functions to the spec, we explain in the spec how it's supposed to work, the parameters, etc., as usual. Then it would be nice to have a verification framework for that, so you can also say: okay, you've written a nice addition to our spec, a new SBI extension, please show us how it's supposed to work. And you could do that, and we do do that, with Linux proof-of-concept code. We always submit patches for Linux and also for OpenSBI or RustSBI that show that it works, right? We prove our extensions. But with the verification framework, we can avoid having to focus on any specific projects, or people having to involve an entire Linux kernel for the test. They can just do this quick, small thing here. So the idea is to build all those function tests in there, and to have regression tests as well, for everybody's SBI implementations. And we can test already: right now you can start writing tests for OpenSBI. It's quite easy to run over QEMU; you don't need hardware for that. With QEMU, you can swap out OpenSBI and drop in RustSBI; that also works. Probably other SBI implementations can be run from QEMU too. And of course KVM is an SBI implementation, because it emulates one for its guests, so you can already test that as well. That's one use case, which could be started now. Then there's CPU validation, as people actually get CPUs to validate, and when we get the EFI support merged. I haven't done that yet; I'll come to that with the current status.
When we get that done, you'll be able to just take these tests, boot them directly from U-Boot or EDK2, and do some validation tests. Arm does that, I'm quite aware, because they've been involved with kvm-unit-tests for a long time now. For their memory model litmus testing, they use kvm-unit-tests with the EFI support to go straight onto hardware. Microbenchmarks are another great use case for kvm-unit-tests. You can always find a way to create some sort of privileged-level test where you write a kernel module for Linux; I used to do that a lot — just put the whole test case in the init of the module, then modprobe it, and now it runs my test, privileged. But that's kind of awkward to begin with; it's not a real test framework. It also requires Linux to be booted up and working and everything. And it's not very good for a microbenchmark, because you've got Linux doing whatever Linux wants to do, so you're not really isolating your instruction sequence. But with kvm-unit-tests, the world is yours: the unit test is running there and nothing else. So it's actually quite good for that, and when you get your timing numbers, they're pretty reasonable to trust. (Question: so in this diagram, what does the test run as?) Ah, yeah. So the test is either this guest kernel or actually the host kernel, one of those two. If it's bare metal, if you just launch it from the bootloader, it'll be the host. That support isn't in the RISC-V port yet, but you can already do the guest kernel version. Okay. So, yeah, the tests are easy to write, as we already talked about, and the quick and dirty ones are even easier. I do this a lot: because I'm familiar with the test suite, I use it as a tool while I'm working on something else.
Like something for Linux or whatever; I use it just for my own testing purposes. And then it's kind of ugly, and it doesn't really look like something people would be interested in anyway — it's too one-off. And so I just kind of toss it, or maybe I keep it for myself to look at later, but it's not shared, which isn't really a very good open source approach. So I've been thinking that for these types of tests that don't necessarily fit what we consider the main test suite, maybe we should have a separate branch for them, so we still collect the code. And I kind of did that already. I recently wanted to test TCG — and I forgot to mention that for CPU validation, we can of course also test our emulators and our other models to see if they're correct. TCG is QEMU's emulation framework. I wanted to make sure that the MMU model it had was able to handle the accessed and dirty bits correctly, because there are actually a couple of different ways to do it in the spec, and QEMU had picked one by default. Then a couple of extensions came along that actually allow you to decide which one you're going to use, and a new bit was added, which is actually going to require another SBI call — so we'll go back to the SBI verification for that. Anyway, it balloons, as we know. I wanted to make sure it was actually working the way it's supposed to right now, so I wrote a test case in kvm-unit-tests. And then I wasn't sure — okay, this is maybe not one we're going to merge, because it's just a one-off test. But I've decided it at least goes to a branch, so we keep track of these things. The other reason for posting them, even if they don't get merged in the end, or at least not to the main branch but to a side branch, is that when people do post tests, sometimes they reinvent something they need inside the test case to get the job done.
And that looks like something we should probably pull into the common code, right? We can let the framework evolve better the more people contribute. And there's no one-and-done: usually I write some quick and dirty test, and then three weeks later I'm like, oh yeah, I actually need that again, because something similar is broken. I think I talked about everything on this slide; those are some links. So one thing I was going to do, because I have way more time than I need, is just show that test I got done talking about. It's a little more complicated than the little snippet I shoved in the slide, but you can see it's still not that complicated, right? Oh, yeah, sorry — can everyone see? Let me try to brighten the screen somehow. I don't know if I can turn off the light. ("Just smash it with a hammer.") You know what, maybe I can go to a black background and just cat the file; that might be better. Is this better than before? Yeah, black background is better. Don't touch that; that sounds like a fire hazard there. Anyway, I'll just slowly scroll through it, just to show you that you really can build these tests with like 100 lines of code, and they achieve a pretty reasonable goal, like making sure that an MMU behaves correctly in three different modes. I don't know if there are any particular lines here I want to point out; I just wanted you to get a feel for what a test would look like if you decided to go sit down and write one. You don't have to learn a whole big framework with some bizarre-looking APIs. The APIs we have are minimal to begin with, so you're going to write your own functions. But when you do need them, they're pretty self-explanatory, and it's C, so you can just grep for anything you need to know.
And yeah, that's the bottom of the file already; it's only like three page-downs. (Question: does the actual return value get used? I noticed you're carefully returning a report summary from main — does anything actually look at it?) Yeah, CIs will do that. This dumps a summary to the screen if you're just running it yourself — which I guess I might as well go ahead and do; I'm feeling brave. So you can just run it, and it'll dump stuff like this out. And CIs know how to parse that. We have the report and report_pass type APIs to make sure you get a nice, consistent format so that it's parsable. We don't use TAP; maybe we should. We've done that in a different test suite I'm involved in as well, KVM selftests, which is in the kernel. We're not there yet, but we're starting to migrate that one to TAP. This one, we have kind of our own thing going; we've had it a long time now. Anyway, so that's one test, and then there was another one — you probably saw it said skip. It's skipping because I didn't give it an environment variable. Let's see; yeah, that's the file. This is that text file I mentioned before: just plain old text with all your environment variables. And then when you want to pass it to the thing... oops... it passes like this. And we'll just run that one group of tests this time. The thing about live demos is I have to type in front of people. And so now we're not skipping anymore; now we're passing, because I gave it the vendor ID, which is zero for QEMU, and it matched. Working demo, passing SBI test. (You showed the failing test also?) Oh, yeah.
(I want to see that it's true.) Yeah, good challenge. Forgive what I called this one. There we go. So this is that other one, the MMU testing. And so now, here it is — failing. Well, skipping: that's failing, but skipping. And that's because the default CPU is missing the extensions needed. We can fix that, of course; something like this — we can add the extensions. So, Svadu... it was not there because — oh, no, because that requires an extra step of adding an SBI implementation that allows you to turn on the hardware A/D bits, and we don't have that yet. We actually need to add an SBI extension — I think we're going to call it FWFT — allowing us to tell SBI to flip bits in registers like the machine environment config register's enable bits. Because if you want to turn on this particular feature, you need to be at the machine mode level to do it; I can't do it from S-mode. So I actually hacked OpenSBI to let me do it, to test this out. I'm not going to go looking for that in a live demo, but yeah, I have that; it does work. (Question: what's in run_tests.sh? You wrote the C file, right? And then you have this run_tests.sh — did you write that as well, or is that the test?) Okay, run_tests is just the part of the test suite that pulls everything together. So if we look at this one on the screen, for example, the log here shows at the very top — which is at the bottom of the screen — this timeout, et cetera, et cetera. That's actually the command line that run_tests figured out how to compose, based on some configuration files and stuff.
And then this is the output of that. There's a configuration file that you can provide for your groups of tests, or for individual tests, allowing you to tell run_tests what to do to pull it all together. Of course, you can also just do the QEMU command line manually, and I do do that when I want to use GDB, or make sure I get the addresses dumped out so I can find them with objdump or something. So I don't always do everything through run_tests — actually, very rarely. That's more for the CIs, after you've got the thing working. (Which one?) No, that's already there; that's static, it's committed to the repo. Nothing in scripts is automatically generated, except when you do the make standalone. And then you get — might as well show that, because we're in demo mode now — you get this guy, which is generated. This bash script was automatically generated; all this junk is the base64 of the actual test code that was written in C. And some of this stuff is just extracted directly from other scripts used by run_tests and chucked in there, and now it's all one unit. (You could put anything in there, though — don't trust someone to just send you a reproducer...) Yeah, sure, absolutely anything could be in there, right? Like "enter your password, please". This is for developers passing things among trusted friends. Make them sign it, or, yeah. Thank you. (Question: those tests are very similar to what kselftests does. Are those tests integrated into kselftests, and if not, do we have such plans?) Yeah. So the question is, more or less, how does this relate to kselftests.
There's definitely overlap in what is tested; the frameworks are quite different in how they work. There's more overlap between this particular one and KVM selftests, which is in kselftests — that's one of the many subdirectories in there. KVM selftests has started to be probably the main place we add new tests for KVM. Actually, you may have noticed I did an entire presentation on kvm-unit-tests and I think I only said "KVM" when I said the name of the framework — I never actually talked about testing KVM. We do that still; we have CI that's specifically testing KVM using this framework. But now we usually use KVM selftests for the new ones, and we're even porting some of these over to that framework. I'm seeing this one going more towards testing of hardware, and other hypervisors are still using it and so on. But KVM-wise — and I actually talked to Paolo about this yesterday, on my third beer or whatever — I was saying KVM selftests is the way of the future for KVM testing, and that I wasn't going to talk about it too much today when I talk about kvm-unit-tests. And he said, ah, but kvm-unit-tests are still easier to write. And he's right: you can write a test case quicker, faster here. So if you're doing KVM testing and you want to do those quick and dirty tests I was talking about, you might jump to this one first, because the other framework — well, its library support is growing quite fast, but you have a little more boilerplate to write, because there you're writing both the user space code and the guest code simultaneously for a test, and here you only write the guest code.
(So initially we can just simply write a test here, and if it's worth it, we can move it to KVM selftests, with its bigger overhead?) Yeah, yeah. And on the other kselftests stuff from your question: there's a riscv directory there too, right, where we test some instructions. That stuff is good — we need that too — but it's user-space only, right? This here is down at the kernel level, S-mode. Okay, thank you. Any other questions? No? Let me appropriately go to the last slide. There. Thank you. All right. That's it. See you next year.
The best `case` scenario
Yes, sorry. So let's talk about case, which is a keyword that hopefully most of you have used; if you haven't, that's okay, we're going to go through it. We're going to figure out how we can use it, how it works, how we can use it better, and what the latest versions of Ruby have given us to play with this operator. So yeah, that's more or less what I'm talking about. Just in case, we're going to go through what case is, what the different syntaxes are, and how you usually use it. Then we're going to look at how it's implemented, which is terrifying, with a small dive into how the Ruby VM works, the instructions and so on. After that we'll go through several use cases; some of them are pretty basic, and some I think are pretty cool from a Ruby standpoint. And finally we're going to take a look at pattern matching, which has been coming to Ruby since 2.7 and is mainly operated right now through the case keyword. So let's start: what's a case? Does anyone not know what a case is, or has anyone not used it? Cool, so that will go fast. Basically a case is more or less a big if-else; that's usually how people think about it. You have your case, you have your different branches, and then you match each branch against your case, and depending on the branch that matches, you go down a different path. So here we can assume that status is something you get back from an API; you match it against different cases, and if you have a success you proceed, otherwise you fail, depending on what you have. If you want to, you can be even more compact by moving the body up a line and using then. And if you want it even more compact, you can add more things to a branch: if you want different conditions to go to the same branch, you separate them with a comma. So that's the basic case.
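As a concrete sketch of the three spellings just described — the classic form, `then`, and comma-separated values — with a made-up `status` value:

```ruby
status = :error

# Classic multi-line form
message =
  case status
  when :success
    "proceed"
  when :error, :failure   # comma: several values share one branch
    "fail"
  else
    "give up"
  end

# Compact form: `then` puts the body on the same line as the branch
compact =
  case status
  when :success then "proceed"
  when :error, :failure then "fail"
  else "give up"
  end
```

Both forms behave identically; the choice is purely about layout.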
One interesting form that I don't think I've ever seen in the wild — I don't know if it's useful, but it's still cool — is that you can write a case without anything at the top, just a bare case, and then it behaves exactly like an if/elsif: you use your usual predicates, the same way you would with an if-else. I'm honestly not sure that has much interest, but it's cool. So, how does case work? In general, I also wanted to take the opportunity to talk a bit about how anything works in Ruby, and how, when you're debugging something and trying to figure out how something works, you can dig deeper into your code or someone else's code. For example, say you have a method that you or someone else has written and you don't know where it is — you're in a big code base, you have 20 methods called count or show, and you don't know which one is being resolved. In Ruby, everything's an object, as you might have heard before, and so are methods. On any instance of anything you can call .method(:your_method), and then you have access to two methods that are pretty cool. One is called source_location, which tells you which file it's in — interesting when you don't know which method is being resolved. And another one is just source, which prints the source out in your terminal, just straight up. That's interesting too. If you're looking for something lower level — a built-in Ruby method like Array#last or Integer#next — and you don't know how it works, you're kind of stuck; you're going to have to go read the fabulous manual of Ruby to figure out where it is. But in our case, we're one level deeper still, because we're not looking at a Ruby method, we're looking at a Ruby keyword. If you go to the documentation, you'll find how it behaves, but you're not really going to be able to see the source code per se.
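A quick sketch of those two introspection helpers. Note that `Method#source_location` is core Ruby, but `Method#source`, mentioned above, actually comes from the `method_source` gem (pulled in by pry and Rails), so it's only shown as a comment here. `Greeter` is a hypothetical class for the example:

```ruby
# A hypothetical class, just so we have a method to inspect
class Greeter
  def hello
    "hi"
  end
end

m = Greeter.new.method(:hello)   # methods are objects too

file, line = m.source_location
# => the file and line where `def hello` lives — handy when twenty
#    methods share a name and you don't know which one is resolved

# Method#source is NOT core Ruby; with the method_source gem loaded:
# puts m.source   # prints the method body itself
```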
So one way I've used to figure out how the internals of case work is to go look at the Ruby VM instructions. A big-ish caveat for the next couple of slides: this is the very limit of what I'm trying to understand this year. I'm in that phase of my Ruby journey where I want to understand how things work, so if I say something outrageous, stop me. From my understanding, the Ruby code that you write goes through a journey before it is compiled and interpreted. Your Ruby code first gets turned into tokens: you can imagine your entire program turned into a big array of syntactically relevant stuff. That could be def, for example, or an open parenthesis, or a space, or part of a string — everything gets turned into a token. Then those tokens get organized into something called an AST, an abstract syntax tree, which is really hard to say. An AST is basically that big array, but formatted into something more understandable. If anyone has ever played with RuboCop before, that's probably where you've seen something like this, because you have to play with the syntax tree when you want to write your own cops. The tree is composed of a lot of nodes, and each node has a name — a class node, a method node, a begin node — and inside the node you have all the relevant information for that specific class or method or begin block or whatever. And then that whole tree gets turned into virtual machine instructions. This is the part where what I'm going to talk about probably only works on CRuby; I'm not sure it applies to other implementations of Ruby, like TruffleRuby or JRuby — it probably works a bit differently there.
So if we take the case from before: in the Ruby console, you have a class called RubyVM, which gives you access to the tools you need to turn your code into the tokens, the tree, or the instructions. You end up with all of this, which we're going to try to go through somewhat. First of all, in case you've never looked at it, the Ruby virtual machine — the CRuby one — is a stack-based VM, so everything in the VM interacts with a stack. You have a lot of instructions here that just manipulate the stack: the putobject over there puts an object on the stack, topn finds an object and moves it to the top of the stack, and so on. So in our case, if we look in detail, we can see a few things. First, here we're mainly preparing the stack, and here we can find the status from before — this is basically calling status to fetch the value we want to match against. And under this, you have a Ruby VM optimization called case dispatch. What this does is, in some cases — if you're using a simple case with simple objects inside of it, like strings or integers or symbols — it creates a hash where the keys are basically these values, and the values are the offsets in your instruction sequence that you need to jump to. What that means, at least the way I understand it, is that if you have a lot of if/elsif/elsif, it will usually be faster to build a case: you lose some time up front building the hash, but then whatever branch you want to go to is just a hash access. Whereas with a bunch of ifs and elsifs, you have to go through each of them in turn: does this work, does this work, does this work, etc.
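You can see this case-dispatch optimization for yourself on CRuby by disassembling a case whose `when` values are simple literals. `RubyVM::InstructionSequence` is CRuby-specific, so this won't work on JRuby or TruffleRuby:

```ruby
code = <<~RUBY
  case status
  when :success then "proceed"
  when :error   then "fail"
  else               "give up"
  end
RUBY

insns = RubyVM::InstructionSequence.compile(code).disasm
puts insns
# The listing contains an opt_case_dispatch instruction carrying the
# hash from the symbol literals to jump offsets, followed by the
# fallback branch-by-branch comparisons.
```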
If we go a bit below, we can see what it looks like when the VM does need to go through each of the branches to find the one that matches. Here we have our success symbol, which was our first branch. What this does is compare it to the status using the triple-equals method, ===. And that's the cool part of case — that's what's technically doing the heavy lifting behind the scenes. If that === works, it jumps to instruction 28 below. If it doesn't, it keeps going: the second branch is error, so we take error, put it on the stack, compare it to status, and if it matches, we go to 33. If none of those work — if you remember the case, that means we're in our error case, our else, which is over there — we keep going down the instructions and end up here at the fail-harder call, and then leave, which is instruction 28. And under that you have the lines you would have jumped to if anything had matched before: the 28 here, which calls proceed, and the 33, which calls fail. So that's more or less the instruction pattern of a case, and it answers our question from before of how a case works. The simplest answer I can give is: it works thanks to ===. That's what it uses to match everything against everything. So if we want to push case to the limit, the question to answer now is: what implements ===? And in Ruby, that's a bunch of classes. The interesting thing, and the main reason I wanted to do this presentation, is that depending on what you're calling === on, it behaves differently. The simplest example, which we've all used, is the base classes: strings, integers, floats, arrays, hashes, anything you want. In this case, === checks for equality. That's the thing we've seen before.
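So, ignoring the hash optimization, a case is roughly sugar for a chain of `===` calls, with the `when` value on the left-hand side. A hand-desugared sketch of the example:

```ruby
status = :error

# The case from the slides...
via_case =
  case status
  when :success then "proceed"
  when :error   then "fail"
  else               "give up"
  end

# ...is (modulo the case-dispatch optimization) equivalent to this
# chain, with the `when` value on the LEFT of ===:
via_threequals =
  if    :success === status then "proceed"
  elsif :error   === status then "fail"
  else                           "give up"
  end
```

The left/right distinction matters because `===` can behave differently on each side, which is exactly what the next examples exploit.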
You might have seen code like this: you get a param that has a response, and you don't know what the fuck the other person behind the API has done — whether it's a string, or a 200, or a success, or true, or "true" as a string, or anything. So you do your case and match it against whatever, and try to figure it out. In this case, it's always checking for equality. So here, with the comma we've seen before, it's one value or the other or the other; and then you have arrays, you have hashes. Otherwise, yes, you can give up. Another kind of thing that implements === with a different behavior is classes and modules. On classes and modules, === checks for — I don't really know how to say it in one word — type, ancestry. It's a bit like Ruby's is_a? method. When you have an object and you ask "is my dog an animal?", it's not only going to check the class; it's going to look a bit above, to see if Animal is included in it, if you're going the composition way, or if it inherits from Animal, if you're going the inheritance way. And that's more or less what we can do here, for example, with errors. Say you have your code and you've defined a bunch of different types of errors, and you've tagged some of them, maybe, as ignorable: if it returns any error of that type, I want to ignore it. If it returns those two particular errors, I want to return a not-found. If someone forgot about safe navigation, I want to tell them. And then a lot of errors — in Rails, for example, and I'm assuming in Ruby, not entirely sure, don't quote me on that — inherit from StandardError, and those maybe you want to raise. But if you have something else, that's probably lower level — maybe a PG error, if you're dealing with a database — and then you want to do something else. So that's it for classes and modules.
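A sketch of that error-handling idea, with hypothetical error classes made up for the example (NoMethodError and StandardError are real core classes):

```ruby
# Hypothetical error hierarchy, invented for the example
class IgnorableError < StandardError; end
class RecordMissing  < StandardError; end

def categorize(err)
  case err
  when IgnorableError then :ignored          # Class#=== works like err.is_a?(IgnorableError)
  when RecordMissing  then :not_found
  when NoMethodError  then :check_your_nils  # someone forgot safe navigation
  when StandardError  then :reraise          # most Ruby errors inherit from StandardError
  else                     :lower_level      # exceptions outside the StandardError tree
  end
end
```

Branch order matters here: NoMethodError also inherits from StandardError, so the more specific `when` has to come before the general one.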
Another type of class that implements === — which I assume most of us have already used — is ranges, and they check for inclusion. For example, if you have an integer at the top, you can check that it's included in this range or that range. And it works with Ruby's endless ranges. I mean, you might as well use an if/elsif and just check greater-than or lower-than, but it's good to have options — you never know. One thing I found that could be cool if you're working in networking: IPAddr works the exact same way. You can define IP addresses with their masks and everything, have them act as ranges, and then check that your IP address belongs to one or the other. Another one we've all probably used is regexes: on a Regexp, === checks for a match, the exact equivalent of matching your string against the pattern. Here's a kind of real use case from the company I work for, where we manage a lot of messages between clients and providers. We want to check in those messages that they're not trying to bypass us — for example by sending an address and trying to meet somewhere — and that they're not sending sensitive information; and sometimes people can't keep their dick in their pants, so we have to be careful about that too. Stuff like this, right? So this one checks for a match. Probably the most interesting example, but also the one I had the most trouble coming up with a good example for, is procs and lambdas. On procs and lambdas, === calls the proc and gives it the object you're matching against. So for example, here we can define simple procs or lambdas that just delegate to another method: unknown_host takes an element and then checks whether the host is included in some list. Oh shit, yeah, I've done it again.
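Before moving on to procs, here's a runnable sketch of the range, IPAddr, and regex behaviors just described. The grading thresholds, network, and patterns are all made up for the example:

```ruby
require "ipaddr"

# Ranges check for inclusion, including Ruby's endless ranges
def grade(score)
  case score
  when 90..    then "A"   # endless range: 90 and up
  when 70...90 then "B"
  when 50...70 then "C"
  else              "F"
  end
end

# IPAddr with a mask acts the same way: IPAddr#=== checks membership
INTERNAL = IPAddr.new("10.0.0.0/8")

# Regexp#=== checks for a match, just like matching the string by hand
def suspicious?(message)
  case message
  when /meet (me|us) at/i then true   # trying to bypass the platform?
  when /\b\d{10}\b/       then true   # looks like a bare phone number
  else                         false
  end
end
```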
In case you haven't seen it: this is just the new way of writing the old thing here with the pipes where you name a variable; it does the exact same thing. `_1` would be the first variable that you'd name there, `_2` the second one, `_3`, et cetera, et cetera. So let's say that we've defined a simple list of hosts. When we get, in this case, probably a request, we could delegate to one of those to see if it's whitelisted or if something went wrong. And then, if it goes there, we can take a request, let's say a webhook for example, write our case on it and say: when it's whitelisted, I want to do something; if the host is unknown, I want to do something else; if the action is unknown, it's going to do something else. What this is going to do behind the curtain is call whitelisted and give it the webhook as the first parameter. So again, it's a more compact way, and it allows you to put that code somewhere else instead of having to copy-paste it into three ifs.

And the last one: we're in Ruby, thankfully, so for every other class, we've got duck typing. We can just implement the triple equal method and have it work for more or less anything we want. So bear with me, because that's going to take a little bit of time. In this case, still sticking with my response example that we've been following the entire presentation: here I can define, in my Response class or module or whatever, different classes that implement triple equal and that do anything I want. And then if I do this and call them, it's going to do what we've seen before in the VM instructions, right? It's going to take the response, call triple equal with this, and then see if the answer is true or not. So with this, you can basically create as many matchers as you want; especially on custom classes, that can be pretty interesting.
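The proc-as-matcher slide described above can be sketched like this. `Proc#===` calls the proc with the matched value, and `_1` is the numbered shorthand for the first block parameter; `KNOWN_HOSTS` and the request shape are hypothetical.

```ruby
KNOWN_HOSTS = ["api.example.com", "hooks.example.com"].freeze

# _1 is equivalent to naming the block parameter with |req|.
whitelisted  = proc { KNOWN_HOSTS.include?(_1[:host]) }
unknown_host = proc { !KNOWN_HOSTS.include?(_1[:host]) }

def route(webhook, whitelisted, unknown_host)
  case webhook
  when whitelisted then :process   # Proc#=== calls whitelisted.(webhook)
  when unknown_host then :reject
  end
end

route({ host: "hooks.example.com" }, whitelisted, unknown_host)  # => :process
route({ host: "evil.example.net" }, whitelisted, unknown_host)   # => :reject
```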
One example that came to mind is payments. If you're managing payments, then you can define in your payment class different subclasses that could be success or canceled or processing, that just call your payment API and check if it worked. So all that code sits in its own place, and then you instantiate your object here and use case to easily delegate where you're going. Another example that we've kind of used is a wrapper for services. So basically you define new classes for your service, and your service answers with a class that's either a success or an error, and then you can use this to do some kind of early-days pattern matching.

So, speaking of pattern matching, how does it work? Again, just in case, we're going to go quickly through what it is and how it works. The whole idea of pattern matching is that, as the name implies, you define a pattern, then you try and match it against something and see what sticks. So here my pattern is going to be a hash with a status key and a body key, inside of which I'll have a user with a name and an age; and whatever is in there, if I can match it, I want to store it in the variable. Once you have your pattern, you can try and match it against any collection of stuff. In this case, it's going to work, because we had the same status and the form we're trying to match against was the same, and what it's going to do is assign the name variable to whatever was there and the age variable to whatever was there. If you try to match it against something that looks very different, so this hash for example, it's not going to work, because even though status and body are there, this value is not going to match against that one, right? So if you try and do this, then it doesn't work, and you're going to get an error.
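The duck-typed triple-equal matchers described above can be sketched like this, sticking with the response theme; `Response::Success` and `Response::Error` are hypothetical matcher classes, not a real library.

```ruby
module Response
  class Success
    def ===(response)
      response[:status] == 200
    end
  end

  class Error
    def ===(response)
      response[:status] >= 400
    end
  end
end

def dispatch(response)
  case response
  when Response::Success.new then :ok     # calls the matcher's === with response
  when Response::Error.new then :fail
  else :unknown
  end
end

dispatch({ status: 200 })  # => :ok
dispatch({ status: 500 })  # => :fail
dispatch({ status: 302 })  # => :unknown
```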
In Ruby at least, when a pattern fails to match like that, it raises an error that just tells you "I wasn't able to match it", and in Ruby this was implemented using case. So the way it works is: if you have a response, or literally anything, you create the different patterns that you're going to want to match it against. One thing to note is that, to make the difference, you no longer use case/when, you use case/in, because in is the keyword that's mainly used for pattern matching, even outside of cases. So in this case, if the response that I get has a status success, I'm going to take whatever is in the body and put it there, and otherwise, if it's an error, I'm going to fail and put it over there.

Again, it kind of does the same thing. The whole counterpoint to this presentation is "I could do it with an if/elsif". You always can, but I do think this is less verbose and makes it clearer what you're trying to do, because you can see the entire pattern. Whereas if you wanted to do an if, you would have to open response and do "if the status is success, then I want to look at the body". For this example it looks the same, but if you're dealing with big JSONs from APIs where everything is nested like four times, and you have response, body, value, and then you take the first element and then the address and then whatever, this starts to become more interesting.

Another thing that we get with pattern matching that we can't do with case/when is access to guard clauses. What that allows us to do is: I want response to match this only if I'm not in maintenance. So this gives us a bit more control over whether or not we want the pattern to match, because sometimes you might want patterns that are very similar, but you want to condition them on something different. Another thing we can do with pattern matching: let's look at a more complex pattern.
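A hedged sketch of the case/in form with deconstruction and a guard clause, as described above. The response shape and the `maintenance` flag are illustrative, and the `else` branch avoids the no-match error.

```ruby
def handle_response(response, maintenance: false)
  case response
  in { status: "success", body: { user: { name:, age: } } } unless maintenance
    "#{name} (#{age})"           # name and age were bound by the pattern
  in { status: "error" }
    :failed
  else
    :no_match                    # without else, a miss raises NoMatchingPatternError
  end
end

handle_response({ status: "success", body: { user: { name: "Ada", age: 36 } } })
# => "Ada (36)"
handle_response({ status: "error" })  # => :failed
```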
We have access to a lot of new tools. For example, what this thing here means is that I want to match this pattern where the ID is whatever I put on top. If I didn't put it, then it would act like the one we saw before and store it into the variable id, but by doing this I can tell it: no, no, no, use the value that's already there, and match one that has 69 as an ID; I don't want anything else. And we also have access to splat operators, kind of. A simple splat for arrays, a double splat for hashes, the same as with method arguments. What this allows me to do is: I want to take user, and if the user is in an array with some elements at the beginning, some elements at the end, and somewhere in the middle an element with ID 69, I want to store the value of admin. So this is kind of equivalent to taking my entire array and doing a detect where the ID is 69 and then printing admin. It kind of does the same thing, but in a more flexible way, because I can then keep putting more patterns underneath to filter out more stuff or try to find more elements.

So how does it work? At this point in the talk, I kind of wanted to go through the same journey with pattern matching as I did with a simple case: open it up, look at the VM instructions, and try to figure out what's underneath. The problem is that pattern matching is kind of new, so in the Ruby VM there are a lot of instructions to go through. So I ain't going to go through everything. But there are a few things that we can see here. For example, here we have the same response; that's the beginning of our case. This calls the thing that's going to go in the case, the thing we're going to try and pattern match against. We're looking at pattern matching, so of course the thing called check match, we can kind of assume it's going to match our pattern against something.
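The pin operator and the array-splat "find pattern" described above can be sketched like this; the user data is made up. `^wanted_id` matches against the existing value instead of binding a new variable, and `[*, x, *]` scans the array for a matching element.

```ruby
users = [
  { id: 1, role: "guest" },
  { id: 69, role: "admin" },
  { id: 3, role: "guest" }
]

wanted_id = 69
role_found =
  case users
  in [*, { id: ^wanted_id, role: }, *]   # find pattern, Ruby 3.0+
    role
  else
    nil
  end

role_found  # => "admin"
```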
The way, at least the way I understand it, is that all of this is going to build our pattern, and then it's going to match it to continue. And if we look at the way it builds the pattern, we can find one method that is interesting, which is this one: deconstruct_keys. After looking at it a bit more and going to read the documentation, this is what Ruby uses, at least for now, to do pattern matching. You have two methods. One is called deconstruct_keys, which is used on patterns that are hashes. The other is called deconstruct, which is used on patterns that are arrays. Makes sense? So this does all of the deconstruction, and if the object that you're matching doesn't respond to the deconstruct_keys or the deconstruct method, then it's just going to give up and tell you to implement it yourself so that it works. After that, it's more of the same thing, right? That's the second pattern that we have; it's still trying to deconstruct them. And eventually, if it doesn't find anything, it's going to return a no-match error.

The interesting thing, then, is how do we implement it ourselves? If you have your class and you want to use pattern matching on it, then one thing you can do is implement the deconstruct_keys method. In this case, we have a location and we want to have a latitude and a longitude in the deconstruct_keys. That allows us, every time we have a location, to use pattern matching on it, because it's going to deconstruct this, deconstruct this, and then see what matches. And an interesting thing also: inside of our pattern, we have access to everything that we've been talking about earlier. So in your pattern, you can put classes, you can put regexes, you can put ranges, in this case.
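A minimal sketch of implementing deconstruct_keys on a custom class, as described above; the Location class is hypothetical. Note how the pattern can also embed a class check and bind with `=>`.

```ruby
class Location
  def initialize(lat, lng)
    @lat = lat
    @lng = lng
  end

  # `keys` is nil when the pattern wants everything, or the list of keys
  # the pattern actually asks for, so you could deconstruct lazily.
  def deconstruct_keys(keys)
    { lat: @lat, lng: @lng }
  end
end

coords =
  case Location.new(50.85, 4.35)
  in { lat: Float => lat, lng: }   # class check, then bind into lat
    [lat, lng]
  end

coords  # => [50.85, 4.35]
```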
And the only thing I think we haven't seen before is this little magic thing that wants to match it against this and then store it into a variable that we can then use for anything else. And I think that's it. I've tried to go through everything. I sped through that one, sorry. We have so much time.

"You used a variable that was not declared before." Yeah, probably. Where? "latitude. Did you declare latitude before?" No, you don't have to declare it before. Basically, what this does is it takes whatever matches here, so that would technically be this, and then stores it into the latitude variable. You don't have to declare it before. "And what's the scope of that variable?" It's going to be scoped to whatever the case is in, right? So if your case is defined in a method, then you have access to it in the entire method. "Is this in current Ruby?" Yeah. I think this might have been implemented in Ruby 3. The first occurrence of pattern matching, the one with the case/in, was experimental in 2.7 and then actually arrived in Ruby 3. And they've been trying to push it a bit more in subsequent versions. So now, for example, you don't necessarily need to have a case: if you want to use pattern matching, you can just write your variable, in, and a pattern, and use it as a predicate to see if it matches or not.

"In your example where you're looking for an admin user in an array of users and you have those splat operations around it, does that work if your admin user is the first or the last?" Yeah, yeah. "It might not." Yeah, fair, definitely fair. What it will do is put nil in here and nil in the other variable, right? It's like there's nothing after, or there's nothing before.
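The standalone predicate form mentioned in the answer above can be sketched like this, with a made-up response. Since Ruby 3.0, `expr in pattern` returns true or false outside a case, and `expr => pattern` binds variables or raises on a mismatch.

```ruby
response = { status: "success", body: { id: 1 } }

ok  = (response in { status: "success" })   # one-line boolean match
bad = (response in { status: "error" })

response => { body: { id: } }               # rightward match: binds id or raises

ok   # => true
bad  # => false
id   # => 1
```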
Yeah, that's the thing I was a bit iffy about. Basically the, shit, I have to go through all the animations, sorry, bear with me, it's going to scroll again. Okay, sure, whatever. The argument that it takes is in case you only want to deconstruct some keys. So if you have a big object and you only want to deconstruct latitude, for example, you could work it that way. That's what it's supposed to do. In the example, I didn't go through the trouble of implementing all of it, because if I want to write the code big, I can't fit too much code. And yeah, that's why. "So it was deconstruct for arrays and deconstruct_keys for hashes, though?" Yeah. "And you can define deconstruct as well, if you've got a class that implements an interval or something?" Probably, yeah, I think so.

"Just to ask how stable you think the syntax is. Do you think it's going to stay the same?" Huh. I think it's going to stay the same, because it's the exact same syntax that Elixir uses, for example. They've probably been inspired by other languages and used it. So I'm expecting it to stay the same. But then again, I don't know. Right now I've been trying to push for it in very simple use cases. Usually, if we have to make an API call, that's probably the best foot in the door to get it working in your code base, because that's the thing that seems the most obvious, right? I get an answer, and then I can not only fetch the status, but assign everything in the answer and then give it to another method. I'm not a frontend dev, so don't quote me on this at all, but it looks a bit like the object destructuring thing from JavaScript,
where you can get an object and then assign all the variables out of it. In this use case, I think it's a good first step to implement it in a code base. I wouldn't go all out and start putting deconstruct_keys in every class. That would be, well, I really hope they put it in Ruby core at some point. I don't think that's in the plans right now. I think the main idea behind it, when they put pattern matching in Ruby at all, in 2.7, was kind of touch and go. People were discussing a lot about "do we want this in our code base?", because pattern matching, in the collective brain, is usually more functional than object-oriented. But now that it's there, past experimental and now stable, I think they're eventually going to do it. It'd be a shame not to, right?

"Do you think some of this stuff is going to end up in the Ruby style guide, and become something where RuboCop comes and says: no, no, you don't want to do that, you want to use this instead?" Probably not in the near future, because I think people are still very much trying to figure out what good style is. Even when I was preparing this, I couldn't find a lot of examples. So I kind of came up with what I think would look the best. But I don't think, for now at least, there are a lot of established guidelines. We good? Cool. Nice.
Besides Web: a Worker story.
Okay, awesome. The mic is on, hopefully. All right, good afternoon everyone. So I'm going to talk to you about a worker story, which is something we did at work recently. And for once, it was not using Rails. That's awesome. Not using the web at all. That's what motivated me to tell you this story. So before we start, I would like to know: who here is a Rails developer? Yeah, awesome. Who would say that they are a Ruby, but not Rails, developer? Okay, awesome. That's great. Love it. I didn't expect that. Awesome.

All right, first of all, who am I? Because if you don't know who I am, you might not rely on whatever I'm going to say. So I've been a Ruby, and mostly Rails, developer for 10 years. I've been working with Kevin for almost that whole period. More recently, I have become a lead dev, then a manager, then a CTO. So I have a lot of new responsibilities now, which also gives me a new perspective on a lot of programming topics; you actually get a new perspective when you start making decisions about people and processes and stuff like that. And finally, I've been a teacher for more than six years. I've given lectures at EPL and Le Wagon. Hopefully, we'll do that again. I have a deep-rooted love for teaching and sharing knowledge, and this is also why I'm here today.

So, as I was saying, the point of this talk is talking about Ruby, but not about Rails, not about the web. And this was a first for me, a new experience. It's strange to see how much changes when you start doing that, how much you realize Rails was giving you once you don't have it anymore. I have some notes. By the way, all my slides are going to be minimalistic; I'm not going to show you a single line of code. I'm also going to forget a lot of stuff, which is why everything I intend to tell you is written in notes available directly in the slides. So hopefully you will get everything I intend to say, because I'm going to forget part of it.
So the main message of this talk is: it's doable. It sounds strange that this is my message, but like most Rails developers sometimes, when we think about a plain Ruby program, we're not even sure we can do it. We're not even sure how we would approach it. So the main message is: yes, it's doable. There are a lot of tools, a lot of process, a lot of help along the way. And you can possibly, you can very likely, sorry, get most of your tools and knowledge used in a normal, non-web Ruby application. The second piece of news is you can also get all of your Rails knowledge to be useful in a Ruby application, if you get things right.

So the story I'm going to tell is about a worker. What is a worker in our case, in my case? It's like a microservice. The specificity, why do we call it a worker? Because it's not a web microservice. It's a microservice which is consuming messages from a queue, and very likely it's going to process files: it's going to get files from a bucket, process them locally, put them on another bucket. We are using the word worker because we have lots of them. That's the simple definition: we have lots of them. So I'm going to talk about one of them, but it could be any of them.

So it starts with a loop. The whole story starts with a loop, because when I started this, I opened my editor and I saw something which I hadn't seen since school: an empty directory. It's very strange. As a Rails developer, I'm really used to `rails new`, and then you get everything. You get folders, a tree structure. You get the config directory, the app directory. There are drawers everywhere for where you're expected to put things. In this case, I just created a new folder and it was empty. I'm a firm believer in emergent design. So I started immediately: new file, worker.rb, make a loop: while true, read, perform, delete message. I'm done. It was nice.
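A minimal sketch of the loop just described, under stated assumptions: InMemoryQueue and EchoProcessor here are made-up stand-ins for a real queue client and handler, and a real worker would block or poll instead of stopping when the queue is drained.

```ruby
class InMemoryQueue
  attr_reader :deleted

  def initialize(messages)
    @messages = messages
    @deleted = []
  end

  def receive
    @messages.shift
  end

  def delete(message)
    @deleted << message
  end
end

class EchoProcessor
  attr_reader :performed

  def initialize
    @performed = []
  end

  def perform(message)
    @performed << message
  end
end

class Worker
  def initialize(queue, processor)
    @queue = queue
    @processor = processor
  end

  def run
    loop do
      message = @queue.receive
      break if message.nil?        # a real worker would block on the broker here
      @processor.perform(message)
      @queue.delete(message)       # only acknowledge after a successful perform
    end
  end
end

queue = InMemoryQueue.new(%w[a b c])
processor = EchoProcessor.new
Worker.new(queue, processor).run
processor.performed  # => ["a", "b", "c"]
```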
I knew it was not the end, but it was capturing whatever I knew about my process. It was a single level of abstraction. So I thought it was a good start, but it wasn't. It wasn't a good start, because I was already forgetting my main tool when doing Rails apps, which was going to be my main tool when doing any app: tests. Anybody who knows me knows that I'm a firm believer in tests. And it's a policy. It's not a religion, but it's a policy. This is how I write code. I do believe in it, but your mileage may vary. But for me, it was the beginning. And it's funny, because I knew I was going to write the loop of my program, but I was also starting another loop: the loop of my process. And this is what tests are for me. Test first does not mean you do tests, then you do code, then you're done. Test first means your first step in the journey is a test. Then code, then test, then code, then test, then code, then test, then code. That's what it means to me to do tests. But I did it wrong. I started with code.

So I tried again. I deleted my file. I created a spec directory. I created a spec file explaining what I knew about it. And I was happier, because the test is the file that depicts my best understanding of what I currently believe is success. And I need that, because I'm going to write code right afterward, and once you're deep in the code, you're super focused. You forget about the landscape. You don't know what comes next. You might have a story, you might have specification requirements, you name it. But I do believe that a story or specification is like the coordinates of where you're supposed to land. The whole puzzle, the whole activity of development, of programming, is like playing golf in the fog by night. You know where you are at the beginning. You sort of know where you want to land. But after your first shot, you're going to be lost.
It doesn't even matter anymore where you're supposed to land, because you've taken your first shot and you don't even know where you are anymore. I'm using tests as torches in the night. So I read my specs. I write some tests. This is my belief; I'm going to follow that path. And then I take my first shot. Hopefully I'm going to reach my first torch in the night. When I have reached that one, I'm going to go to my second torch, again and again. But my loop is that my test is only my best understanding of my success. So my test is going to evolve. I'm going to move my torches and I'm going to move my ball. And this is how they make sense together.

Back to the story. I wrote my test, was happy with my understanding, ran it, and it failed. It was a catastrophe. And why did it fail? Well, because it couldn't find RSpec. Because I didn't bundle it. Because it couldn't find Bundler. Like, that is how empty the whole story was. I didn't even have Bundler. Okay, so bundling is always easy: bringing in my dependencies, starting my Gemfile. I need to run my spec. Run it again. Well, it still fails. But for a better reason. And that's the whole point of TDD, right? You have to fail, but for a better reason than the previous failure. So now it's failing because it doesn't know what a queue is, what the receive method on the queue is, what a message is, what a processor is, what perform even means. Well, that makes me happy. Because now I can actually write more tests about what I believe a queue is at this stage, what I believe a processor is, what I believe the receive method should do. And this was really the start of both my loops. I got my main loop back, but I got my working loop as well. I got a lot of tests. I knew that trying to make them go green would just generate more tests. I got my actual work loop. Right. So: test, code, test, code, test, code. I was in the middle of it.
And every single code file was starting with probably five to ten require or require_relative lines. And I wasn't happy with that. First of all, because it is boilerplate, it's noise, and I don't like noise. Also, because I want my code files to be about the responsibility they're supposed to hold. And knowing which files contain the dependencies that this file depends upon is not the responsibility of each file; it shouldn't have to know where I store the other responsibilities. That was wrong. And this is not something we have to deal with in Rails. I realized that we actually get something super nice from Rails: put any file in any subdirectory of the app folder, and you get it. It's like magic. Once you have to start all your requires by hand, it felt wrong. So I Googled. I got a few options. And the best one, which is actually the one currently adopted by Rails, was using Zeitwerk. Hopefully I'm pronouncing it right; it's written in my speaker notes. And that helped me autoload the constants I was looking for by looking them up in my lib directory. Default config; I'm happy, as far as I know this is what I need.

But reading the rest of Zeitwerk, I also realized that it enables you to use short names. So if you are in the same namespace, you can just mention a constant by its short name. Well, obviously, I want that. I'm doing that in Rails, so I want that again. It's also handling multithreaded code loading. I have no idea if I'm going to need that, but I certainly don't want to handle that myself. It sounds like something I really don't want to handle myself. And it also handles code reloading, which is not something I'm going to use because of TDD. But again, this is my approach; I know that most people don't do that, and code reloading is a very important part of code loading. So Zeitwerk was my first take, my first really great companion that I found along the way. The second one was dry-container.
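The core mechanism behind the autoloading described above can be illustrated with just the standard library: `Kernel#autoload` maps a constant to a file, and the require only happens the first time the constant is referenced. This is only a sketch of the lazy-loading idea; Zeitwerk adds the whole file-path-to-constant-name convention on top, so you never write these mappings by hand. The temp file stands in for a real `lib/` file.

```ruby
require "tmpdir"

# Simulate a lib file on disk (hypothetical content).
dir = Dir.mktmpdir
path = File.join(dir, "processor.rb")
File.write(path, "class Processor; end\nPROCESSOR_LOADED = true\n")

# Register the constant -> file mapping; nothing is loaded yet.
autoload :Processor, path

defined?(PROCESSOR_LOADED)  # => nil, the file has not been required
Processor                   # first reference triggers the require
PROCESSOR_LOADED            # => true
```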
Now, small disclaimer: I knew from the start that I was going to use the dry-rb gems, because I wanted to. And as Kevin said, it's also a little bit about finding joy. So I wanted to heavily rely on the dry-rb gems, but I wanted to wait until the use case was there. Because I did not only want to skip the requires, I wanted to not know the classes. I wanted to not call new in the middle of my code. My code is about business logic; most of the code is about business logic. I wanted to separate the logic about creating objects from the logic about "I need something". And most of the time, when you're in a Rails controller, you don't even care where the request object comes from. You're just like: okay, I want a request object, just make it happen. If you're in a view, you don't care where the view context comes from. You just have it. You just want it. And it's really comfortable to write code focusing only on using the stuff you need, not on how you get it.

So this is what dry-container brings. I've been using dry-system, which builds on dry-container, for handling all of that, and dry-auto_inject. And dry-auto_inject basically works hand in hand with dry-container and allows you to call your services, call your dependencies, by their small name, their first name. You give a name to an object, and then you can basically say: okay, I want this object. I don't want this class; I don't want to instantiate that class. I want specifically that object, and I'm going to use it. And I don't even care what its class is; I want that object by name.

Interestingly, this had almost no effect on the tests. Even though it's a very different approach, I still had most of my tests instantiate objects by themselves. Why? Because unit tests actually get a lot of fake dependencies. That's the point of a unit test, right? You want to test a single unit. So I was still building my subject in tests manually.
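A hand-rolled sketch of the idea behind dry-container and dry-auto_inject, not their real API: register objects (or factory blocks) under a name, resolve them by that name, so business code never calls `.new` on its collaborators. All names here are made up.

```ruby
class Container
  def initialize
    @registry = {}
  end

  # Register either a ready-made object or a block that builds one lazily.
  def register(name, object = nil, &block)
    @registry[name] = object || block
  end

  def resolve(name)
    item = @registry.fetch(name)
    item.is_a?(Proc) ? item.call : item
  end
end

CONTAINER = Container.new
CONTAINER.register(:logger) { $stdout }      # built on first resolve
CONTAINER.register(:greeting, "hello")       # registered as-is

CONTAINER.resolve(:greeting)  # => "hello"
```

Business code then asks the container for `:logger` by name and never needs to know which class provides it; that is the separation of "creating objects" from "needing something" described above.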
And for the larger, broader tests, I actually wanted to use the container set up correctly, because I wanted to test that things were correctly wired together. So even though dry-container is like "you can stub and fake and change whatever you want", I didn't stub it, because I was either using it and testing it, or not using it at all in my test. And... sorry. Yep. Yeah, I'm still on time.

dry-system also brings something else, which is quite interesting: a settings object. And I realized very soon that the settings object was the object that I was injecting everywhere. Almost every part of my system needed to access settings, so I was injecting it everywhere. It was awesome. And the settings provide some really interesting value. First of all, they allow any of the settings to be overridden by an environment variable, which is quite important. If you know about the twelve-factor app, it is one of the aspects you want: for your config to be overridable by the environment that your program runs in. So that was the first part. And the second part is that you can coerce, you can define the type of your settings. Because if you work with environment variables, everything is a string. But when you work in your system, not everything is a string. We do have a lot of strings, but we have dates, we have integers, we have a lot of types. And usually what we do is we just parse them. dry-types allows you to create your own types, and name them, for starters. Naming things is probably the most important stuff we do in our work, I believe. You can name your types and get your settings in the proper types, which brings me to my next slide about dry-types. So dry-types creates a contract. It says: okay, this value, this setting, has to be a phone number. And I'm going to explain exactly what a phone number is.
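A plain-Ruby sketch of the two settings properties just described, not the dry-system API: every setting can be overridden by an environment variable, and the raw string is coerced into a proper type. The setting names and defaults are hypothetical.

```ruby
class Settings
  # name => [default (as a string, like ENV), coercer]
  DEFAULTS = {
    "QUEUE_URL"  => ["amqp://localhost", :to_s],
    "BATCH_SIZE" => ["10", :to_i],
    "DRY_RUN"    => ["false", ->(v) { v == "true" }]
  }.freeze

  def self.fetch(name)
    raw = ENV.fetch(name, DEFAULTS.fetch(name).first)  # twelve-factor override
    coercer = DEFAULTS.fetch(name).last
    coercer.is_a?(Proc) ? coercer.call(raw) : raw.public_send(coercer)
  end
end

Settings.fetch("BATCH_SIZE")  # an Integer, 10 unless the env overrides it
Settings.fetch("DRY_RUN")     # a real boolean, not the string "false"
```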
And I'm also going to coerce a string into a phone number, which means at the end of the day I either have an error or I do have a phone number, which is exactly the object I want. And it makes a big difference. I don't know if any of you have ever created a class like PhoneNumber, like Age, like BucketName. If you read the literature about object-oriented design correctly, we are supposed to do that. We are sort of supposed to do that, like subclass String when we want to make a FirstName. To be honest with you, I've never done that in my life. I've always used String, and it's not a first name, it's a string. I know it's a first name. I know I'm not going to use all the methods of String, but the variable is named first_name; that's enough. Using types allows us to actually have proper types, more meaningful types, without creating full-blown classes for everything.

Well, settings is one thing. But this contract can really be used for something else: it can be used for app input. When you are working in a web application, app input is a request. This is where most of our payload comes from. In our case, the app input was messages from a queue, but the concept was very similar. As soon as we got a message, we treated it in a very similar fashion as we would have treated a request. When working with app input on the web, there's a very well-known pattern for handling that input, for validating that input, for coercing that input into everything you want: form objects. We basically reused the same idea. I realize that I'm doing my slides in the wrong order, but you don't care, because you don't have the order. But that's okay. We used a kind of form object in the form of a dry contract. It comes from dry-validation; that is the gem we have been using. dry-validation is really about two pillars. The first one is about typing. Eventually, it leverages dry-types.
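A minimal sketch of a constrained, coercing type in plain Ruby; dry-types gives you this declaratively, and the phone-number format here is deliberately naive.

```ruby
# Coerce free-form input into a normalized phone number, or fail loudly.
PhoneNumber = lambda do |value|
  digits = value.to_s.gsub(/[\s.-]/, "")   # strip spaces, dots, dashes
  unless digits.match?(/\A\+?\d{8,15}\z/)  # naive shape check
    raise ArgumentError, "not a phone number: #{value.inspect}"
  end
  digits
end

PhoneNumber.call("+32 2 123.45.67")  # => "+3221234567"
```

The point, as in the talk: after this call you either have an error or a value that is guaranteed to be a phone number, without subclassing String.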
It ensures that you get the keys of your payload that you expect, that you get the values that you expect, that basically your data is of the type you expect. That's the schema, that's the structure. Once you have the proper types, you still have business logic to handle. This is the second pillar of dry-validation. A typical example would be if you have to handle a deadline. Imagine that somewhere in your payload there's a deadline. The first pillar would ensure that the deadline is actually a date, because you get a string. Hopefully it's an ISO 8601 string, but it could be anything else. You want to coerce that into a date; you want to ensure that you have a date. If it's not coerceable into a date, you want the first error. But now that you have a date, you also need to validate that this actual date is in the future. This is what the second pillar is: you can create rules, business rules. That means that once your payload goes through the dry-validation mechanism, you actually get a very valid, very reliable payload from a typing perspective, but also from a business perspective.

Once we have that payload, what do we want to do with it? We actually want to process it. For that, we are using a pattern which is named Interactor; at least, we used to use a gem which is named Interactor. You can think of an Interactor a little bit like an operation in Trailblazer. I don't know if anybody has used Trailblazer previously. No? Okay. All right. I'm going to go back. The idea of an Interactor is that this is the entry point to your business layer. Because the entry points to most web applications are the controllers. This is how... I'm not talking about the routes; let's consider that the entry point is the controller. But that's not true, because sometimes your entry point is your test. Sometimes your entry point is a rake task. Sometimes your entry point is an Active Job. Sometimes your entry point is a channel.
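The deadline example above can be sketched in plain Ruby as the two pillars in sequence: coerce the structure first, then apply the business rule. dry-validation expresses the same thing declaratively; this hand-rolled version only illustrates the idea.

```ruby
require "date"

def validate_payload(payload)
  errors = []

  deadline = begin
    Date.iso8601(payload["deadline"].to_s)  # pillar 1: type coercion
  rescue ArgumentError
    errors << "deadline is not a date"
    nil
  end

  if deadline && deadline <= Date.today     # pillar 2: business rule
    errors << "deadline must be in the future"
  end

  errors.empty? ? { deadline: deadline } : { errors: errors }
end

validate_payload({ "deadline" => "2099-01-01" })  # => { deadline: <a Date> }
validate_payload({ "deadline" => "yesterday" })   # => { errors: ["deadline is not a date"] }
```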
So you actually get a lot of entry points into your app. But at the business level, you don't really care if you want to delete a user because of a GraphQL request, a REST request, or an Active Job. You want to delete a user. It's the same business unit. And this is how we encapsulate things: we are using an Interactor. One Interactor is responsible for one business unit. And, very fortunately, dry-rb has a solution for us. It's named dry-transaction. It allows you to create a series of steps. It relies on dry-monads, because each step can give you a result. And if the result is a success, then the next step is going to happen. If the result is a failure, then the next step is not going to be run; you're going to keep your failure. This is known as railway-oriented programming. Nothing related to Rails: it's just because you either stay on your success track, like a train track, or at each step you have a junction to your failure track. Well, the thing is, we didn't use dry-transaction. I wanted to let you know because I would really recommend that you use it. I wanted to use it, but we also have a team of several developers who are used to our Interactors. And it sounded like a better idea to use what everybody knew than to try to reinvent the wheel. We had something, it's working well, everybody knows it well. So this is my manager voice talking: if it ain't broke, don't fix it. But if you're starting from scratch, give dry-transaction and dry-monads a chance. At this point in the talk, I had hoped to attempt my own definition of a monad, what a monad is, which would probably take the next two hours, so let's skip it. So the end of this slide is about why we want to do all that validation early. And this was also something a bit new. First of all, failing early is a good idea. But that was not reason enough, because doing the business validation at each step would have made more sense.
It's just easier to keep the business steps together. It makes more sense: if you want to check some permission, then delete a record, then send an email, it makes sense that you do everything related to sending the email at the send-email step. It doesn't really make sense to already check that stuff at the start. But the thing is, in Rails we are very much used to a highly rollback-able environment, because most of what we do (well, sending email doesn't count) is manipulate the database. And it is a huge comfort to be able to say MyRecord.transaction do ... end: if anything goes wrong, just roll back, and done, nothing has happened. When you're doing a microservice, at least what we are doing, nothing is rollback-able. Everything you do, if you send an API request to something, if you delete a file, download a file, create a file, there's no rolling that back. And this is why it was so important to check as much as we could right from the start. All right, next step. Next challenge. The next challenge was an interesting one, as every challenge is, because it was about design, and design opinions. And there's no strong truth in design opinions. So what was the challenge exactly? The challenge was that we realized we were not using dry containers properly. It felt like we were supposed to use them in a new way. Why was that? The reason was that we are very used to object-oriented design, object-oriented programming, which means we are putting together state and behavior in small objects, and they are responsible for doing their stuff. And dry-system, the dry-container, was pushing us to use stateless objects, because that's what works well if you want to inject something everywhere: it had better be stateless. But the code we wanted to write, because we have a lot of experience with it, was stateful. We don't want a command wrapper.
We want a command execution specifically about this option. We want a specific invocation; we don't want the full program. So it was very important to be able to write the code that we wanted to write, but it was also important to use the tools properly. And initially, what we did is we had that big interactor, our big entry point, get injected with a ton of stuff from the container. It was getting all the services that it would eventually use, and that interactor was instantiating all the small, short-lifecycle objects that it was going to use, giving them their state, so maybe the current date, the current user, the current payload, and all the dependencies that those objects needed. So maybe there's a command service, maybe there's an API client, and the interactor was instantiating all of that, which means the interactor knew about almost everything. There's a name for that: God Object. And it's not a name you want to earn. So we knew we were doing something wrong. We had a small discussion, and we realized that the literature, again, had a solution ready for that. There's a pattern made for that. The pattern is Factory. So what we eventually did is we created new services, factories, very shallow services. Each factory was injected with the services that it needed, and the interactor was simply injected with the factories, and the interactor was just asking a factory, well, give me a command invocation specifically about this file, about this API, about this payload. And it's funny, because it was so difficult to realize at first that we needed that, but at the same time it was so obvious what the solution was. It also raised an interesting comparison with a former colleague of mine, who told me he was like a functional programmer. He, I'm not going to say despised, but he despised object-oriented programming. Well, I said it. And he told me, you know, an object is just a set of partially applied functions.
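In plain Ruby, the arrangement described here (a stateless factory holding the long-lived dependencies, building one-use stateful objects on demand) might be sketched like this. All names are invented for illustration:

```ruby
# All names invented for illustration.

# Stateless: holds only long-lived dependencies, so it can safely be
# registered in the container and injected everywhere.
class CommandInvocationFactory
  def initialize(command_service:)
    @command_service = command_service
  end

  # Builds a one-use, stateful object for this specific payload.
  def build(payload)
    CommandInvocation.new(command_service: @command_service, payload: payload)
  end
end

# Stateful: carries the payload, so it is created fresh and thrown away.
class CommandInvocation
  def initialize(command_service:, payload:)
    @command_service = command_service # dependency: the first "partial application"
    @payload = payload                 # state: the second "partial application"
  end

  def call
    @command_service.run(@payload)
  end
end

# The interactor is injected with the factory only, not with everything:
#   invocation = factory.build(payload)
#   invocation.call
```

The interactor no longer needs to know every collaborator, which is exactly what dissolves the God Object.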
He was very disdainful about it, like, oh, it's just a set of partially applied functions; we have, like, objects at home. Well, it's not the same. But to be honest, introducing those factories gave me that feeling, because we had those functions, and we were partially applying all the dependencies, that's the first partial application, and then we were partially applying the state. It also opened our minds about what is stateless and what is stateful. Usually we think state is all your instance variables. It's not really true; we don't see it like that anymore. Your dependencies might still leave you stateless. Your state is really what makes an object throwaway. So if it's a reusable object, it's stateless. If it's a one-use object, it's stateful. That's sort of our new definition. And factories help us create one-use objects, because factories are all stateless objects. Well, I felt bad creating this slide without mentioning a single dry gem, so I also want to bring one here: the dry-initializer gem. And to be honest, this is my favorite, and it's so small. The thing is, it is so small that it's crazy that this is my favorite gem. It creates constructors. It just creates an initialize method. But why does it matter? Because if you are very strict about it, all your initializers very probably look the same. You pass them arguments, and then you store them in instance variables. Nothing more, because doing business logic in an initializer is a bad idea. So you always get the same initializer time and again, and it makes no sense, and it creates noise. And in most style guides, it has to be at the top of the file, so it also takes a very important spot, focus-wise, because the top of the file is very important. So dry-initializer just does that for you. It lets you write one line for each dependency or piece of state that you want. You can give it a type. You don't have to, but if you have a dry-type, you might want to.
You can give it a default value, and you automatically get an initializer that accepts them, and you automatically get an attr_reader for each of the dependencies. If you don't want the reader, you don't have to have it, but by default you get it. And that's it: you just transform something very long and noisy into a series of lines. We used to have attr_reader anyway; most of our classes had one line of attr_reader anyway. So it changed nothing in terms of noise. It changed everything in terms of clarity and intention, and anyone reading a file now gets something directly by reading those lines. And yeah, I'm still on time. Well, we were done with the code of the application. Of course, we had additional challenges, but eventually, using those tools and approaches, we reached the end of the application, and we were done, right? Well, no, we still had to package it. We still had to deploy it, because even though we had actually solved the problems, we still had nothing running. This is again a time when we realized how rich the Rails ecosystem is, because for deployment you either get services like Heroku and similar, or you use Capistrano, which does everything for you: you write one Capfile and everything is magic. When we had to deploy, we were like, yeah, we have files with code, but we still have no application. So we got some help from partners with that. We use Docker Compose locally for creating containers. We use Kubernetes remotely for deploying them. We use Helm for actually doing the deployment. And this led us to realize that we still had problems, because we had no observability. We had very difficult access to the log files, so there was still a lot of stuff we didn't have. So what we did is we introduced Yabeda from Evil Martians. I don't know if anybody is from Evil Martians here, but if you are and if you're watching us: thank you, you're awesome, Evil Martians. So we used Yabeda, which is an observability framework.
It allows you to declare what you want to observe, to create metrics, without having to care where you intend to ship those metrics or what you intend to do with them. And then, in another part of Yabeda, you can declare what you actually want to do with them. You can separate the two, so your business logic is not riddled with technical details about monitoring. So this observability allowed us to expose some metrics, which in turn enabled us to create autoscaling and to measure health. These are typically things you get for free in Rails if you're using New Relic or Datadog, but we had to do it by hand. And we finally reached our latest challenge, because we are not experts in Helm or Kubernetes. We are actually very much noobs at that. So we had partners helping us. But those partners are also responsible for running our app and ensuring that it is working properly. So the agreement we had with them is that they handle their own repo with everything they do for us, and we have our own repo with our code base. And the problem we realized, and we still haven't solved, is that part of the application is actually in the infrastructure. And this is something we are not used to in Rails. But typically, the queue we use has a dead-letter queue. If you try to read something and it fails, you release it, you retry receiving it, it fails again. After some time, you put that message into the dead-letter queue, because you don't want to waste more time trying to handle it. Another aspect is that buckets have lifecycles. If a file is forgotten there after 24 hours, you want to delete that file. You don't want to pay fees for that file for the rest of your life. And this is application logic. Even though it sits in the infrastructure, it is application logic. And this bothers me, because with application logic, anyone who clones a repo should be able to see everything, to know everything. They don't have to be masters of everything. They don't have to change everything.
But cloning a single repo should explain everything there is to know about this app. So at the moment we still have those two repos. One is focused on the infrastructure; one is focused on the code base. Hopefully we will solve that very soon. But with that done, we actually had the app deployed, monitored, scaled, and we learned quite a lot. We actually made a blueprint out of it, so we are creating several other workers right out of that. And we feel much more confident actually using Ruby for something other than web applications. So thank you, everyone, for your time. Thank you. Any questions? We have two minutes for questions, hopefully. You've talked a lot about... I mean, first, you never talked about Rails, but you actually miss it a lot. It's pretty funny that it was not about Rails, but actually... Anyway, you talked a lot about types. Is that something you want to bring to the rest of the ecosystem? Yeah, that's a very good question. So the question is: I talked a lot about types; do I want to bring that into Rails? Actually, the interactor is something we do in Rails already, which means we are using dry-validation already, which means we are using dry-types already. To be fully honest, we don't use it enough. We sort of use it when we realize that we should have used it before. So it's not good enough, but it is something we are using, and types have been very helpful in the past already. And there are a lot of other tools that we discovered here, because we had to, and I very much hope that we are going to use them. But also, my first slide says that I'm no CTO, I'm no manager, which means I don't get to make those calls anymore. And it's very important to me that the ones who write the app are responsible for writing it, maintaining it, running it. So I can influence, I can give my opinion, but I don't make those calls anymore. Yes? You said that you use dry-monads.
What has been... can you tell me more about your experience? Because I used it quite extensively in the past, before they introduced the do notation. And it was very sticky to the code, as in, it made Ruby not look like Ruby, like something else. So, if something has changed there, how's your experience? All right, so the question is: do I use dry-monads, what do I think of the do notation, and how Ruby-esque does it feel? Is that right? Yes. Okay. So I am not using dry-monads, except for toy projects. We are not using dry-monads in this; our own take is using our own Interactors. So whatever I'm going to say comes out of my experience on toy projects. I initially learned about monads in Haskell. This is still very painful to me, 10 years later. So my take on monads is that, most of the time, it's not the right tool. And it's something where the learning curve for understanding what a monad is is so high that once you've earned the right to understand what it is, you want to put it everywhere. A little bit like metaprogramming. So this is my take on monads. I wouldn't force them on anyone who is not very comfortable using them. I do believe that it is a very elegant solution, but I also believe that sometimes a bunch of if-elses makes the team happier than using the best tool for the occasion. And I don't have any opinion about the do notation and how Ruby-esque it feels. All right, thank you.
Backtracie and the quest for prettier Ruby backtraces
Okay, let's get started. So hello and welcome to Backtracie and the quest for prettier Ruby backtraces. So, who am I to be here today? My name is Ivo Anjo and I'm currently a senior software engineer at Datadog. I've been in love with Ruby since I started using it professionally around 10 years ago. And I am a really big fan of digging into and exploring language runtimes like CRuby, JRuby, TruffleRuby, the JVM and others. And I've been attending FOSDEM every year since 2017, but this is my first time speaking, so I'm excited. So I also... yes, pray for the cable. I'm also passionate about concurrency, application performance, and making tools that help us look at our apps in different and new ways and try to uncover new insights about performance. So that's how I ended up working on this thing, the Datadog profiler for Ruby. So if you're curious, come talk to me about Ruby performance; I like to talk a lot about that. But for today, what we're going to talk about is: what's a backtrace? How can we get one? Then, how does the Ruby stack work in reality? Then we'll talk a bit about the Backtracie gem... this is not good. I will be talking about accessing the internal VM APIs to do some of the weird things that the Backtracie gem does. Then we'll play with Backtracie in action, and then we will talk about maybe a new feature in Ruby 3.4, which is having class names in backtraces. So, what's a backtrace, and how do you get one? If you're a Ruby developer, you probably know what a backtrace is, but as a quick reminder, it's mostly a trail of what methods were called and are still waiting to return at some given point in the app. It's also called a stack trace in some languages, because it represents what's on the thread's stack. So backtrace, stack trace: usually the same thing. Okay, and if we have this method A that we call, that calls B, and then B raises an exception, then you get a backtrace.
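The ways of getting a backtrace that the talk goes on to list can be seen in one small plain-Ruby snippet:

```ruby
def a
  b
end

def b
  raise "boom"
end

begin
  a
rescue => e
  # The backtrace is set when the exception is *raised*, not when created:
  innermost = e.backtrace.first   # a String pointing inside `b`
end

frames = Thread.current.backtrace # any thread can be asked for its backtrace

def who_called_me
  caller.first                    # Kernel#caller: the frames of whoever called us
end

def where_exactly
  loc = caller_locations.first    # a Thread::Backtrace::Location, not a String
  [loc.path, loc.lineno, loc.label]
end
```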
So we probably see this way too often, and maybe you have some nightmares when you see this, but hopefully it will help you figure out the issues in your app. So there are multiple ways of getting a backtrace in Ruby. One of them is rescuing an exception. And an interesting thing is that the backtrace actually gets set on the exception when the exception is raised, not when it's created. Because you can create an exception but not raise it immediately. And so the backtrace only gets set when you raise it. And you can get a backtrace by just getting a thread and asking it for one. Or you can use the caller API, which is part of Kernel, so it's part of every Ruby object: you can just type caller and you will get the stack trace of the methods that called you. So you might have noticed that there were backtrace and backtrace_locations. The methods that end with locations return an array of these location objects that include absolute_path, base_label, label, etc. So basically it gives you a nice domain object to represent the stack trace, whereas the other methods just return you the strings that Ruby prints. So that's the difference. There are also some Ruby VM C APIs for getting a backtrace. A few of them, for different kinds of use cases. And actually, these two at the top, we'll come back to them in a bit. So, talking about the stack itself: how does the Ruby stack work under the covers, specifically for the CRuby runtime? The idea is that a Ruby thread usually has two stacks. One is the Ruby stack that we usually see in our application, and the other is the native stack: the stack of the VM, which is a program built in C. And we can actually look at both of them in a really weird way, which is: let's crash Ruby. And this thing is a weird thing. I'm telling Ruby to send a segmentation fault to itself, which will crash Ruby.
And then when we crash Ruby, what we get is this thing, which is the output of the Ruby crash handler, which includes a lot of nice things. So if you ever get a crash in Ruby, please do include this when reporting bugs; it's really useful. And the first thing it shows is the Ruby stack. So here at the bottom we see, okay, we have this each that represents the each in our code. Then we have the block, then collect, then the block, then the call to kill. So probably not a big surprise. One thing that is interesting, and you can see there at the top, is that Ruby, at least the Ruby version that I'm using, uses C methods to implement each, collect and kill. And so you see that internally Ruby is actually keeping track of that and knows that those are C methods. This is not very good. And this is actually the native stack, which is also printed in that whole big thing. So please ignore the wall of text. The thing you care about is this column in the middle, which is the names of the C functions that the Ruby VM is actually executing. And if you squint hard and ignore a few of them, you can see our app here. We can see each showing up. Then we can see the block, the call to yield. Then we can see the collect showing up, rb_ary_collect. Then we can see yield. And then we can see kill. So we can see all of our methods. And you can additionally see these two functions in between the Ruby code that we're writing: rb_vm_exec and vm_exec_core are the Ruby VM actually executing the bytecode, the Ruby bytecode for our application, which is kind of the glue between the other functions that you see there. And then at the top you see the code for the VM to handle a crash. So these are the two stacks. Now let's focus on the Ruby stack and mostly ignore the native stack. So how is it represented inside the VM?
So inside the VM there are a bunch of structures in memory that represent the stack. How do they look? So hang on, I will show three slides of C code and then we can come back to actual Ruby code. Please don't stab your eyes out or something. So yes, in vm_core.h, the VM header where a lot of the interesting internal Ruby things are, there's this rb_thread_struct that includes a bunch of things. This is what Ruby holds for a thread. And inside that we have this execution context thing: it keeps a pointer to another structure, the rb_execution_context, which has a few more things about the thread that were separated out for reasons. And in here we actually see the size of the stack and the information about the stack. And then we have this array of rb_control_frame_t elements. This is a pointer into an array that has these entries, the rb_control_frame_struct. So basically these entries are what represents a stack frame in the VM. If you see five lines in your stack trace, there will be five of these. And you see that there are some things in here, like the iseq, which is the instruction sequence: this is the Ruby bytecode for the method or block or whatever is getting executed. You see self, the object on which it was called. You actually see jit_return, which was added to support YJIT and the other JITs, so that they can use it. And there are a few more things that we'll ignore. But yeah, this is how the Ruby VM internally represents the information that's on the stack. So whenever a method gets called, a frame is pushed to represent this new method that got called. There's this vm_push_frame function; the interesting part is here on the right: we're setting it up, we have the self object, there are some things that we want to track. So that adds one more frame onto the stack.
And you would not be surprised if I told you that this stack gets popped, and there is a vm_pop_frame function in C that takes care of this. So fine, this is kind of what you might be expecting. So let's talk a bit about the Backtracie gem. Yes, maybe this is good, I'm doing well on timing. So the Backtracie gem is this really weird gem that I created. And let me tell you why I created it. I created it because of something like this. So if I show you this: main, print_stacks, new, initialize, times, block in initialize, backtrace. If you squint at it a bit, maybe you can speculate about what's going on. But it's kind of hard for you to get a lay of the land and understand what this weird example thing is doing without looking at the source code. You need to be looking at the source code, and then it makes perfect sense: it's here, it's here, it's here. But if you're not looking at the source code, it doesn't make a lot of sense. So this is something I was thinking about: can we actually improve stack traces and give you more information, so that you can read the stack trace and get more information without actually going to one or more files? Because this could be spread across ten files, and you would have to follow along. So actually, this is the code; you don't have to read it very closely. The interesting thing is that we have this method print_stacks that gets called here, which creates an instance of PrintStacks, which then initializes, and then inside there's the times, and then we print the backtrace. But I've shown you the code. So the idea is that what you saw was printed with the Ruby backtrace. And with the Backtracie gem, you can instead get this. You get the class names, you get like a dot on the "print hello FOSDEM" method, and here you can see the namespace. So here you see that we're calling new on the class, and then this is an instance method.
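The class-versus-instance distinction this fancier output draws on is the convention Rubyists already use when writing method names: a dot for class methods, a hash for instance methods. A small illustration (not Backtracie code):

```ruby
# Not Backtracie code: just the naming convention the fancier output follows.
class Greeter
  # Instance method: conventionally written Greeter#hello
  def hello
    "#{self.class.name}##{__method__}"
  end

  # Class method: conventionally written Greeter.hi
  def self.hi
    "#{name}.#{__method__}"
  end
end

Greeter.new.hello  # => "Greeter#hello"
Greeter.hi         # => "Greeter.hi"
```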
And then we're calling Integer#times, and then we're having a block inside initialize, and then backtrace_locations. This is kind of the thing I wanted to experiment with. Maybe it won't look exactly like this, maybe it will look different. But try to get more context, so that you can look at it and go, I think I see what's going on, even without opening up your editor and navigating to the ten different files. So this is what I mean by prettier backtraces. I wanted to experiment with adding more things: things such as class and module names; things such as showing a dot or a hash depending on whether the method is an instance or a class method, to be able to quickly distinguish that. Maybe distinguish singleton methods, so methods that you define on a specific object, versus just a regular method from that class, so that you can see: this is a weird thing that showed up on this object, maybe that's relevant. You could distinguish refinements, which are this weird thing where methods show up based on some context. You could maybe show method arguments; maybe that's useful sometimes to distinguish between a few of your methods. Maybe even show C function names, or their file names and line numbers. Because one thing you might have realized is that I've shown you that the array methods, collect and whatever, are implemented with C code. But you never see the C file and the C line where they are implemented in your backtrace. So if you want to actually follow that into the VM and understand what's going on, or maybe you're just working on a Ruby native gem, you don't see that information; Ruby hides it and doesn't even keep it. Another thing is to maybe even have some visibility into the native stack and what might be going on there, because you might be debugging, say, the Postgres or MySQL driver, which goes into C code. So how far did I get? Well, I got this working, this working, this working, this working.
This one I haven't tried yet. This one is a really awful hack, so let's say maybe. And this one is not working yet. So I'm still experimenting with how far we can get. So a question is: how does Backtracie work? The TL;DR is, I've basically shown you how things get stored inside Ruby. So we basically just go in there and get what we need out of Ruby, without Ruby really having any APIs for doing this, which is fun. But these are internal VM APIs. They are in private headers and they are not available to gems. So how does this work? How can we access this information? And this is the cool thing that this prototype allowed me to play with. So let's talk a bit about accessing Ruby VM internal APIs. What's the backdoor? There are actually two different backdoors for accessing these VM internal C headers in CRuby. One is the hidden MJIT header. So you might have heard about the MJIT experimental JIT compiler. From Ruby 2.6 to 3.2 it was a part of Ruby. And it actually generated some C code and then compiled it. And that C code needed a header with some of the internal things. And so what the Ruby developers did was, very silently, they went into this folder with a weird name and they created this rb_mjit_min_header, which nobody is supposed to use, and put that information there. So we can actually source this information from there and then use it. So yes, it says it's just for the private use of the MJIT compiler. And if you import this, it's weird to work with, and a bunch of things don't work very well, because it was not supposed to be used by anyone other than the MJIT compiler. But it includes a copy of all the things we're looking at, so we can make it work. Backdoor number two, which is one of my weirdest backdoors, is the debase-ruby_core_source gem. The idea is, since the Ruby VM doesn't ship any of the headers this gem needs... thank you.
This gem actually just kind of copy-pastes all of the Ruby headers. So it has a folder, with subfolders for every Ruby release, and someone just copy-pastes every header in there for every release and then releases a new version of the gem. It's very crude, but it works for all Rubies, even 3.3, now that MJIT is gone after 3.2. And it also works as far back as Ruby 2.1 or 2.0. But yeah, you can do something like that. So, through these backdoors, once we know the shape of these VM internal structures, we can access them in Backtracie. And if you remember the slide where I said I'll come back to this one, rb_profile_frames and rb_profile_thread_frames: now is the time. So what I did in Backtracie is that I started by copy-pasting rb_profile_frames into the Backtracie code, just going into the Ruby VM and copy-pasting. And obviously, when you copy-paste from an open source project, make sure you understand the license and whether you can do that; you can do that with Ruby. And so I did this. It's fine, but make sure to keep the copyright headers and all that information. And then I added a bunch of features to experiment with it and get all of the things I was talking about. And actually, it was really interesting; I found this approach a really great way of prototyping something without having to depend on a custom build of the Ruby VM. Because I actually started by modifying the Ruby VM, but then I'd have a Ruby VM that works only for me and has features only for me. Instead, if I do this, I can tell you gem install backtracie and you can get it as well. So it's an interesting approach for playing with something that you otherwise couldn't, but be careful. Obviously, there are a lot of small details to get right. I am glossing over a ton of things needed to get this weird thing working.
So for instance, you might want to access some VM internal structure, but you might not know exactly how to access it. So sometimes you need to go read the Ruby API very carefully and see: this object that Ruby hands me actually internally has a pointer to this other thing, which has a pointer to this other thing, which eventually is what I want. So sometimes you need to do a bit of squinting at Ruby and understanding how you are going to get access to this information. In some cases, the copy-pasted code also called other private VM internal APIs that are not exposed by the VM. So when I copy-pasted, I compiled it and then I tried to run it, and it didn't work, because those APIs aren't there; they aren't visible to gems. So again, a lot of details here. Sometimes you just copy-paste more, and you keep copy-pasting until it works. Sometimes you need to re-implement some things yourself, because it's easier once you look at it and realize, okay, I don't need all of these things. But you need to play with it a bit until you understand how to get it to work. But it has some really cool side effects. For one, I was able to get this to work as far back as Ruby 2.3, with a lot of conditional compilation things in C. And I've even done some experiments going as far back as Ruby 2.1, so I think you could do this. And it was kind of cool, because this includes backporting of rb_profile_frames features. I copy-pasted from a Ruby 3 version, and they had actually added a few features and some bug fixes and whatever. And so by copy-pasting this and then using it on Ruby 2.3, I was actually getting features that were not present in Ruby 2.3, from the modern version of the code, which was really cool. I also did not do it alone; thanks to KJ from Zendesk, who did a lot of work on Backtracie. And so let's quickly take a look at... interesting. Is it only one color? I don't know. So let's take a look at how we can use Backtracie.
So you can go on the website, you can install the gem. As I said, the magic of doing this thing in this weird way is that it works for everyone: just install it. It has this API, Backtracie.backtrace_locations, which gives you an array of locations, Backtracie's version of Ruby's Thread::Backtrace::Location. So you get a lot of the nice methods that Ruby's locations have, but Backtracie has a lot more things, as I will show you in a bit. Then you also get caller_locations, like Ruby's, where you get just the callers of the current thread. And here are some use cases. You can obviously probe what information is there and implement your own printer. So there's a lot of information about the different names of the methods; for this very simple example they actually all have the same names, but sometimes Ruby has these notions of different names. So you can access all of them, you can access the object this was called on, you can access the class, a bunch of things. So you can use this and then implement your own printer that prints a very nice stack trace. You can obviously use this to just get the pretty stack trace. By default, Backtracie prints exactly as Ruby does, but if you call the fancy to_s, you get the one with the class names and a few other fancy things. And you can also call this weird Backtracie gem from C code; it has a bunch of APIs. In particular, it has a special low-overhead API for profilers and tools like that. So if you're interested in building something like that, you can use Backtracie to get the stacks and not have to care about the internals. And actually, one gem that's using Backtracie is this Ruby memory profiler that was created by KJ, and I helped a bit as well.
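Backtracie itself may not be installed on your machine, but the built-in API it mirrors and extends, Kernel#caller_locations returning Thread::Backtrace::Location objects, can be tried directly (a minimal sketch; the method names `inner` and `outer` are just for illustration):

```ruby
# Ruby's built-in Thread::Backtrace::Location objects, which
# Backtracie's richer locations are modeled on.
def inner
  # frames of whoever called us, innermost first
  caller_locations.first(3)
end

def outer
  inner
end

outer.each do |loc|
  # each location knows its file, line and method label
  puts "#{loc.path}:#{loc.lineno} in #{loc.label}"
end
```

Backtracie adds to these objects the extra pieces the talk lists: the receiver object, the class, and the various method-name flavors.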
So this memory profiler is an open source gem by Zendesk which uses the Backtracie API to build a flame graph of memory, so you can investigate memory usage and memory leaks, and reduce the memory footprint of your application or even fix memory leaks. Me and KJ actually gave a talk at RubyKaigi about this, called Hunting Production Memory Leaks with Heap Sampling, so if you're curious, check that talk out. Some other use cases we've been playing with in Backtracie: you can actually access native function debug info. There's a lot to be said about how you get debug info from native libraries on Linux and different OSes, and debug symbols and DWARF and whatever. I will not go much into that, because that's a nightmare. But I have a working prototype, and for `each` you can see, okay, `each` belongs to Array, but you can also see it's implemented in this libruby.so object (you can see I'm using Ruby 3.1), and you can see that the C function name is rb_ary_each. And in the future, we could even get more of the debug information, assuming it's still available, and see the file name, the line number, etc., and allow you to smoothly go from Ruby code to C code. And theoretically, this native information doesn't have to just be C. So if you have a Ruby gem that is built in Rust, and the Rust binding has the correct debug information, you could go directly from your stack trace to: it's this Rust line. So that's why I'm looking into having this information. Another idea that I have, which I still haven't tried really hard to do: could we build a Backtracie stack trace for exceptions? So that when you have an exception in your app, you get the nicer objects which Backtracie provides, and you can get the full information. I haven't tried it yet; I want to do it. So, just kind of a recap: what did I learn from all of this experimentation and playing?
One thing is that the Ruby VM itself is very interesting and, I would say, surprisingly approachable. My prior C experience was university projects and really, really tiny personal stuff, so I would not classify myself as a C developer, ever. Like everyone that goes to uni, I just kind of listed C on my CV because I did one or two courses on it. But really, I was not a C developer, and I could still follow along with a lot of stuff. And especially if you go in there and add a printf and start playing, changing the code a bit, you see things happening. It's really interesting. And also, the power of having a working prototype to show off a crazy idea. This had two side effects that I was kind of hoping for, but didn't quite expect would happen. One is that at Datadog we actually ended up using a similar approach for the Datadog Ruby profiler. With Backtracie I kind of proved to the team: yep, it works, I've got it working, this is one thing we could do if we wanted to. And the other is that the Ruby core team also kind of liked the show-class-names-in-stack-traces thing, and this started an interesting discussion. Which leads us to the final item: class names in backtraces, coming soon in Ruby 3.4. Question mark. So in the Ruby issue tracker this is now being discussed: issue #19117, include the method owner in backtraces, not just the method name. This was opened by Jean Boussier, who had this proposal after we discussed it at RubyKaigi. And then Mame implemented a working prototype for this; there's a PR for Ruby. And if you just build it, it works. As we were saying, you now get this information of the full class, and you see that this is an instance method, and you see the dot for a class method. So we'd have this extra information for developers, just out of the box. Obviously, this is still being discussed.
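The piece of information issue #19117 wants shown in backtraces, the method's owner, can already be probed by hand with Ruby's reflection API (a small sketch; `Greeting` and `Greeter` are invented names):

```ruby
# The "method owner" that issue #19117 proposes to include in backtraces,
# fetched by hand via Method#owner / UnboundMethod#owner.
module Greeting
  def hello; end
end

class Greeter
  include Greeting
  def self.build; new; end
end

m = Greeter.instance_method(:hello)
# the owner is the module that defines the method,
# not the class of the receiver
puts "#{m.owner}##{m.name}"

# for a class method, the owner is the singleton class
puts Greeter.method(:build).owner.inspect
```

This is exactly the distinction the proposal surfaces: `Greeting#hello` rather than just `hello`, and a different notation for class methods.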
So if you like this idea and you want to see this in Ruby 3.4 and use it in your app, just try it out, go and leave feedback on that issue. And that was kind of what I had to tell you. So yeah: email me if you want to talk to me, @KnuX on whatever they call the social network these days, my blog. I have a few other talks, and again, yes, go give feedback, because the Ruby developers are actually calling for feedback on that ticket. And thanks to my employer, Datadog, for allowing me to work on these things. And if you're interested in coming to work on the Datadog Ruby gem, ping me, because we are hiring right now for the Ruby team. It's a really different kind of Ruby that we do. Yeah, questions? Hello. Yeah. I think you mostly answered it, but the class shown in the trace, in 3.4 and in Backtracie: is it the owner of the method, something we can get statically? Because in JRuby, the only way we get the compiled backtrace is by cramming a bunch of data into the class name or the file name or whatever is on the JVM's trace. I can't make that dynamic: once I set that in stone for the method, it's going to stay that way. But if it's the method owner, then at the point where I compile it, I can just throw that extra information in there and pull it out. I think that's right. Yeah, I believe this implementation is exactly the method owner. In Backtracie, yeah, I experimented with having both, but it's much harder. And I think part of the discussion going on in the ticket is also: what about dynamically defined stuff and whatever? In some cases, it might not show, because it's kind of hard to get this information even in CRuby and expose it in a very efficient way. But in a lot of cases, it's a regular method on a regular class, and it gets it. So yeah. More of a product question instead of a technical one. Yes.
You said it came from you wanting to have access to what was being called. Is that something you personally wanted, or something that was shared across the team? Essentially: will I get something out of it myself? I have a small company. I think so. So my other background, besides Ruby, is Java, and in a Java backtrace you usually get the class and the method. And I've always found it easier, in a lot of cases, to think about: oh, this is the class, and this is the method on my class, rather than just the method names. Obviously, in Ruby, if you have a very well-structured code base, you know that app/foo/bar.rb is going to be Foo::Bar. But sometimes code is not actually that simple; there are gnarlier parts of the application. So that's where I feel this kind of thing comes in handy. And I kind of missed it from Java: I had worked with Java tools and I was thinking, I want this thing from Java, can I have it? The other thing I can add is that, for methods implemented in the Ruby VM, right now Ruby never shows you where, say, Array#each lives in the VM. It kind of blames your code. I can show it very quickly. If I go back to the way, way, way beginning, you can kind of see it here. So this thing: have you noticed that Ruby is lying, there and there and there? Is `kill` defined on line three? Is `collect` defined on line three? No. When you have a C func, a C API or native API, being called from Ruby, Ruby lies and basically decides it's at the caller. So that's the thing. And actually, at some point I had to debug this really weird case where Ruby was calling inspect and I really didn't understand it. I had a bunch of new, inspect, new, inspect, new, inspect going into the VM, and I really didn't understand it.
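The lying-about-line-numbers behaviour just described is easy to reproduce in plain Ruby: the C-implemented Array#each frame shows up with the caller's line number, because the C function has no Ruby source line of its own (a minimal sketch; `boom` and `capture_backtrace` are invented names):

```ruby
# The C-implemented Array#each frame gets attributed to the caller's
# line, since C functions have no Ruby line of their own.
def boom
  [1].each { raise "oops" }
end

def capture_backtrace
  boom
rescue => e
  e.backtrace
end

# the `each` frame reports the same line as the block and the caller
capture_backtrace.first(3).each { |line| puts line }
```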
And I actually got out Backtracie to just get that stack trace, and I understood that it was this weird case where, when you have a NoMethodError, on some Ruby versions Ruby will actually call inspect on your objects. And in some cases, after calling inspect, it will throw the result away. Which... whatever. But sometimes it gives you a lot more context if you know exactly where the methods are getting called, and on which classes. So here you would see Process.kill, et cetera. It's much clearer, in my opinion. Yeah. Did you try to apply the same approach to heap dumps? Apparently you're just inspecting the internal C structures, so at least theoretically it should be possible to inspect heap dumps. Yes. It's been a while since I've looked at the JSON output of a heap dump, so I'm not sure if it has this information, but it could. And actually, even if it doesn't, I don't think you need to go as far as Backtracie and access the internal stuff, because you can do ObjectSpace.each_object to implement your own heap dump, and when you do ObjectSpace.each_object, you have access to the objects where things are defined. I was talking about the dump files. The JSON file, yeah. I mean, not the JSONs that you can get from Ruby; I mean a crash, a heap dump from a crash of the VM. Yeah, yeah, it could. It's the same thing: the structures are there, so you could do this. You could even do a GDB script, or whatever debugger script, that accesses the same things and reads them. And actually, one thing: if you've ever heard of the rbspy profiler, which was built by Julia Evans originally, rbspy is kind of doing the same, but from outside the process. It's a separate process that's reading Ruby's memory, reading those things and then showing the information.
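The ObjectSpace approach suggested in that answer can be tried directly, with no VM internals at all (a minimal sketch; `count_live` is an invented helper):

```ruby
# Walking the live heap in pure Ruby, the alternative to reading
# VM-internal structures that was suggested for heap-dump use cases.
require 'objspace'

def count_live(klass)
  n = 0
  ObjectSpace.each_object(klass) { |_obj| n += 1 }
  n
end

puts "live Strings: #{count_live(String)}"
puts "live Arrays:  #{count_live(Array)}"
```

A real heap-dump tool would record per-object details (class, allocation site, size) instead of just counting, but the traversal primitive is the same.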
So I actually at some point tried to prototype this in rbspy, and then I just got bored and did something else. Yes. If we want to start looking into the C code of the VM, is there documentation, or somewhere we can start without reading all of the code? Yes, there is. There is actually a really nice repository, I think it was built by, I'm going to say Koichi, one of the core Ruby developers, that has a nice introduction to the VM. I don't know exactly the name of the repo, but email me; I have it in my bookmarks and I will send it to you, because it exists. Actually, let me quickly do something. Maybe there's a... A challenge. Yeah, it's that thing, exactly: the Ruby Hack Challenge. And I think in the Backtracie repo there are actually some links at the bottom, and it might be there, because I included in the repository a bunch of links to interesting things I found for reading this information. So if you go to the GitHub repo, at the bottom, it might be there. But yes, it's the Ruby Hack Challenge; Google it and you'll probably find it. Thank you. Thanks, everyone. Thank you.
Writing your own Rust linter
Can we have your attention? We'd like to begin with the next talk. We have Guillaume, who is going to explain to us how to write your own Rust linter, as you can see on the lovely slides. And for the talk: Luca, have we got the audio unmuted and everything? Perfect. Wonderful. Okay. Take it away. Hi, everyone. I will try to speak loud so everyone can hear. So, like he mentioned, today I will explain to you how to write your own Rust linter. First, a little presentation: I'm Guillaume Gomez. If you come every year, I give a talk, so by now you should more or less remember me, I think. I'm a member of a few teams of the Rust project, and I'm an engineer at Huawei. First, let's explain what a linter is, in case some people don't know yet. A linter is a tool that is generally an addition to a language's compiler. And here in Rust, I suppose everyone has heard about Clippy; at least I hope so. The goal is to detect some very basic logic errors, to suggest improvements for any method you might use, anything you could use better. In short, the goal is to make your code better. So now, how does a Rust linter actually work? We are directly entering into the subject. Let's say it's an extension of the Rust compiler. The Rust compiler has an API, a very unstable one, so very frequently we have to update the linter to be able to keep working with the Rust compiler. And that's exactly how Clippy works. When Clippy is running, it's actually running a lot of parts of the compiler to get things like the AST. For people who don't know what an AST is: it's tokens representing your code. So if you have the struct keyword, it's a keyword, and it's a struct. That gives you a higher-level view of your code. But it's not only that, because if you only had the AST information, you could only make suggestions like: yeah, you use these generics, but not in a good way, so you could do it like that, et cetera.
So the goal is to go beyond that and get access to more information, like the borrow checker and everything. So if there's a trait you're using, but you could use another trait which does the same thing but shorter, we can now suggest it, because we have this information from the compiler. But because of that, we have to update the linter often, or pin exactly which version of the compiler we are using. So why does it need to be a rustc extension? It's quite simple to explain. Unless you want to reimplement all the parsing, the borrow checking, and pretty much everything, you'd better use what already exists, and ask them nicely to make their API public so you can use it. And that's exactly how things went with Clippy, and that's exactly how I went as well. So, I mentioned a few limitations already. It can only work on crates compiled with the same rustc version. You don't see this with Clippy because it's tied to your compiler: when you install Clippy, it's tied to your current compiler version, so it just works. But it's something to keep in mind, as you will see later. Like I mentioned, the rustc API is not stable, so very often you have to update your linter code to be able to keep up. It's tied to a specific rustc version, and I'm not talking about a stable release but literally about a commit, which is a bit annoying. And also, because of all this, it's annoying to wrap in a cargo command, because you need to use a very specific rustc version. Again, we'll come back to that later. I will voluntarily not mention all the lint passes; I will only speak about the two main ones, the early and the late passes. The early passes give you access to the AST. So you are able to see the syntax and work a bit on it, but you don't have type information or anything. You can only know that this is a struct, its name is such-and-such, and it has generics, but you don't know what traits it implements or anything.
You just have very basic information. And then you have the late pass, which goes a lot further: you have access to borrow-checker information, you have access to everything. What is this type implementing? Does it implement this trait? What is its layout? Everything. So, in this case, we will talk about how to write a linter with the rustc-tools crate. The goal of this crate is to wrap the rustc API into something easier to set up, because there is a lot to set up. And to add it, it's just that: like you would add any other crate. For now it's version 0.3; later on it will be updated. And now we start to get into the fun. So, to actually make it work, you need to add this little line in your Cargo file to say: okay, it's a crate, but not just any crate; it's a rustc crate. So you need to do some very funny things. And we'll come back to this one, but there are things you thought were long gone, like having to write `extern crate` to import a crate: you actually need to import the crates from the compiler with `extern crate`, otherwise it doesn't work; they're not provided by default. The other thing is that we need to create a rust-toolchain file. It's literally its name. If you've never used it: if you have a rust-toolchain file in your folder, cargo will only use the version provided inside this file, in this case the version of the compiler we're using. This is all in the documentation of rustc-tools; basically you just need to copy and paste the file into your local file. So in here we say that as components we want rustc-dev, which means the crates from the compiler; we want rustfmt, because we are not savages, we want to actually format our code; and llvm-tools-preview, to be able to actually compile, because otherwise you don't have a backend, which is also problematic. And now let's get into the code. Declaring a lint is mostly macros. As you can see at the top, we use internal rustc crates: rustc_lint and rustc_session.
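As an aside on the toolchain setup described above, such a rust-toolchain file might look like the following. This is only a sketch: the channel date here is illustrative, and the exact pin to use comes from the rustc-tools documentation.

```toml
[toolchain]
# illustrative nightly date: copy the exact pin from the rustc-tools docs
channel = "nightly-2024-01-25"
components = ["rustc-dev", "rustfmt", "llvm-tools-preview"]
```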
rustc_lint provides some types linked to handling lints, and rustc_session allows us to give information to the Rust compiler about things we want it to run. So here we create, with the declare_tool_lint macro, a lint called WARN_GENERICS, in capital letters. It's warn-by-default, and we add a message, in case you want information about it, that says: warns if any item has generics. It's an early lint pass, so it means we only have access to the AST information. I voluntarily picked this one because, to be honest, the code is much, much shorter and simpler, and for a 15-minute talk that's better. The other thing we need to do is implement some very basic traits provided by the compiler, which we don't need to care about, so they provide a macro for that: declare_lint_pass, which in our case declares a structure called WarnGenerics, and we link it to the WARN_GENERICS lint. And after that, at the end, we have the very empty implementation of the EarlyLintPass trait for our type. This is a visitor trait; if some of you don't know the visitor pattern: it lets you implement only whatever you need. For example, visit a function: whenever the visitor encounters a function, it will call this method, and it will be ours. The rest, which we don't care about, is already implemented; we don't need to care about it. Very convenient. In our case, we only want items that could have generics, so functions and enums and everything like that. So it should be pretty easy. So now we implement the lint. As I was saying: check_item. We don't have anything else to do. It provides a context, the context of the compiler at this stage, an EarlyContext, and we get the actual item. And then it's pretty simple: we have methods provided by the compiler and everything.
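Since the visitor pattern is doing the heavy lifting here, a toy version in plain Rust may help. This is a sketch with no rustc crates at all; the `Item`, `Visitor`, `walk` and `WarnGenerics` names are all invented, modeled loosely on how an early lint pass only overrides the callbacks it cares about:

```rust
// A toy AST and visitor: the walker calls visit_* hooks, and default
// (empty) implementations mean you only override what you care about.
enum Item {
    Function { name: String, generics: Vec<String> },
    Struct { name: String, generics: Vec<String> },
}

trait Visitor {
    // default implementation: do nothing, like the lint-pass traits
    fn visit_item(&mut self, _item: &Item) {}
}

fn walk(items: &[Item], v: &mut dyn Visitor) {
    for item in items {
        v.visit_item(item);
    }
}

// Our "lint": warn whenever an item has generics.
struct WarnGenerics {
    warnings: Vec<String>,
}

impl Visitor for WarnGenerics {
    fn visit_item(&mut self, item: &Item) {
        let (name, generics) = match item {
            Item::Function { name, generics } | Item::Struct { name, generics } => (name, generics),
        };
        if !generics.is_empty() {
            self.warnings.push(format!("no generics here: {name}"));
        }
    }
}

fn main() {
    let items = vec![
        Item::Function { name: "plain".into(), generics: vec![] },
        Item::Struct { name: "Wrapper".into(), generics: vec!["T".into()] },
    ];
    let mut lint = WarnGenerics { warnings: vec![] };
    walk(&items, &mut lint);
    for w in &lint.warnings {
        println!("warning: {w}");
    }
}
```

The real EarlyLintPass works the same way, except the walker is the compiler itself and the items come from the parsed crate.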
So we check, and I hope everyone knows the `if let Some(...)` syntax, that we have generics, and we check that the generics are not empty, because otherwise there is no point. If we have generics and everything, then we say: okay, we found generics, we don't want generics, because reasons, so let's emit our lint. First, the lint name. Second, the span. The span is how the Rust compiler maps things back to your actual source code: it's basically a beginning and an end. You don't have to care about what it's pointing to; you just say, the item I want to lint about starts here and ends here, you underline it, you do whatever you do, I don't care. And we have our message saying: no generics here, because we don't want generics. And the last thing is in case you want to add more information: for example, we could add a help message, and we can do a lot more. In case some of you don't know what the syntax with the vertical bars is: it's a closure, a closure taking a diagnostic-type argument. Now, the interesting part: how can we run this lint? As you can see, not much code, because rustc-tools is doing pretty much everything. First, we get the cargo args, because it's a cargo command; we will run `cargo tools`. We don't want the first two arguments, because `cargo` and `tools` are not something we are interested in. We pass the rest of the arguments, if any, into the rustc-tools cargo integration function, which internally calls cargo and builds everything with its own compiler version, because it's not necessarily the same. And once everything is built, it generates the command line that you actually need to pass to the rustc compiler to be able to run our linter, which we do with with_lints. So this time, args is what cargo provided us, so we can now generate and run our lint. We just hand everything over, because it's already done by rustc-tools.
And inside this with_lints callback, we need to actually say to the compiler: okay, I created a lint; it's called, well, I did that badly, WARN_GENERICS. And that's it. We have everything, we can now leave, and the compiler will do everything when leaving the with_lints function. So now, it's always nicer to be able to run it as a cargo tool. You just run cargo install, with --path if it's local, otherwise not, and I named it in this case tools-inner; you will understand why later. So we just run it. And it doesn't work, because we are not using the same version of the compiler. Congrats. So what's important to note here is that you very much need to use the same version of the metadata as the files generated by the compiler to be able to use them with the lint. rustc doesn't understand itself if it's not exactly the same version. If it's just one commit of difference: no, I don't know him, don't care. No problem. So we can actually get around this limitation by providing the version like this. And if we do, well, I thought I had the error output... if we do, we actually have the tool running. But to be fair, we can't really ask our users to do that themselves; it's a pretty bad user experience. So we will get around that with this file, as you can see, which in this case will be called cargo-tools. And this one will literally run the command that we saw here, itself. That's it, it does just that: we just wrap our linter, and it just runs. So now we install it, we run it, and again, I don't have the output, which is a shame, but believe me, it works. So yeah. Like I said, I voluntarily didn't show a late lint pass, which has access to the compiler type information and everything, but I wrote a blog post explaining all that in much more depth. Inside it, you have an example with unwrap, if I remember correctly, saying: yeah, don't use unwrap, use something else.
And there you see how we actually get the real type information, because when you call unwrap, you need to check that unwrap is actually called on a Result or an Option. But for that, you need the type-check information, because if it's, for example, `Self::unwrap`, capital-letter Self, double colon, unwrap, and then you pass your type, you actually need to infer the type, and for that you need type-check information. You will see a lot of things that seem very easy but are not so easy. For example, if you want to know which type an implementation is being implemented on, funnily enough, it's quite difficult: you can get the trait very easily, but the type it's being implemented on, not so much. And thank you for your attention. More information on my blog, and you have my email and social media and everything. And thank you for your attention. So we have about two minutes for questions, if anyone has them. Yes, come right to the back. Hello, thanks for this presentation. Sorry, we can't hear you at all. Okay. Hello again, thanks for this presentation. A few years ago, I wrote a refinement type system for Rust as a linter. I had the courage to maintain it for about one or two versions of Rust. A few months ago, I tried to pick it up again, and everything was broken, bit-rotted, in tatters; everything had changed. Do you know if there are any plans to make things a bit less messy? Because right now it's really, really, really painful to maintain a linter. No, it's just pain, enjoy. It's a shame. No, in fact, it's actually better now, because we have fewer functions to worry about. For example, a lot of APIs that existed before only for rustdoc, because rustdoc is a compiler extension, are being used less and less, because we said: okay, we now stop accepting completely broken code. And soon enough, rustdoc will very likely be using the same APIs as lints. So normally it will still break, but not as much. I don't know. How is this related to Clippy?
I can't hear you at all. Ah. Basically, it's working the same way, but it exists because not all lints can be implemented in Clippy. If you have specific needs for your project, because you need higher security levels, or you don't want certain code patterns, or whatever, you can't expect those to be implemented in Clippy. So you implement them yourself, and that's very much why rustc-tools exists: so you can actually do it without having to set up everything yourself. Perfect. Thank you so much.
The plan for gccrs
So, today we have, despite the slide saying Arthur, Pierre-Emmanuel, who is going to speak about gccrs. Give him a welcome. Hello everyone. So, I'm not Arthur, but I'll try my best, so please bear with me. Yeah, okay. So, I'm a compiler engineer at Embecosm, and I'm not the co-lead of gccrs. I believe Philip is in the room, but I can't see him; Philip is one of the co-leads of gccrs. What will we talk about? I'll introduce gccrs, because some of you may not know the project. I'll talk about what we've achieved and what we've done this year, basically, and what we will do in the upcoming year. There's a lot that's going to change, and I need to introduce those changes. So, let's begin. What is gccrs? gccrs is an alternative compiler for Rust. You may already know rustc; we aim to provide a new front end for the Rust language within the GCC project. There are already a lot of front ends in GCC: there's Ada, Go, Fortran, and many more. So this is just one new front end that can leverage the GCC back end, as well as the GCC plugin system and the GIMPLE representation. We are targeting version 1.49 of the Rust language, and the work is funded by Embecosm as well as Open Source Security. So, let's talk about the points. Why should we create a new compiler? Well, there's a whole lot of architectures that aren't supported by the LLVM back end. There's already work on rustc to leverage, for example, libgccjit to get to the GCC targets. So basically, we aim to leverage those architectures and provide more targets for Rust. You can check the GCC room tomorrow, actually; there will be more about this. There's another big point: the Rust for Linux project. Basically, the Linux kernel wants to integrate the Rust programming language into its codebase, and this means some people want support for the Rust language from the GCC project.
Having multiple compilers helps in multiple domains: working on this draws attention to some dark spots in the Rust language. We can show the rustc people what could be improved, what is good. So this brings discussion on some subjects. And I've been working on rustc too, on macros, for example. So, a lot is brought by a new compiler: it brings new points of view on things. One last thing is working with very old C++ compilers. There are some systems that only have very old compilers, which just compile C++ or C, and you may want to bring the Rust ecosystem to those systems. So, yeah. What have we been doing in 2023? In 2023, we had multiple Google Summer of Code projects. One was by Mohamed, who was working on the error framework for GCC. We basically want to introduce friendlier error codes, like you can find in the rustc compiler: if you've used rustc, you may have seen the friendly error codes and user errors. And we want to bring this to the GCC ecosystem. The second Google Summer of Code project was from Raiki, who, I believe... yeah, he's there; you can see him tomorrow in the GCC dev room. He implemented multiple things to support Unicode. We've also been working on borrow checking, closures, iterators, and a lot of things. I also worked on proc macros. Proc macros are big in Rust; they're used almost everywhere. So, yeah, I've been working on this in the past year. We are able to expand some macros right now; it's not completely polished, but it's almost finished. We had to develop a new binary interface, a new system in GCC, to leverage proc macros. You may as well watch my talk from GNU Cauldron 2023 if you want to get deeper into the subject. Okay. I've been talking about borrow checking: so, Jakub Dupák has been working on the borrow checker.
Basically, rustc has a pass in the compiler which emits an IR, and the borrow checker works on this IR to check that some facts are valid: that the code is valid and that the borrow-checking rules are all respected. So, Jakub Dupák has been working on a new IR in GCC, so we can have a borrow checker. It leverages Polonius: if you've been working on rustc, you probably already know Polonius. So, this is a representation of rustc on the left and GCC on the right. As you can see, in rustc, the MIR is where the borrow-checking step happens, and the MIR is then lowered to LLVM IR. In GCC, we've been doing things a bit differently. Basically, we had to separate the two IRs, and there is one kind of dead-end IR specialized for borrow checking, because we couldn't create a single IR that could then be lowered to the GCC back end. So, at one point in the compiler, there are two parallel IRs: on one side there are GCC trees, and on the other side there is the BIR, which gets borrow-checked. But the BIR won't be reused for the creation of your final binary. I've been talking about the Unicode support: as I told you, the Unicode support was by Raiki, and tomorrow there will be more. On error codes: we want to be able to pass the rustc testsuite, so this means we should be able to emit the proper error codes, and this means we need to fix our own error codes to make them the same as rustc's. We are opening a few more entries this year for GSoC, for students to help us; feel free to apply if you want. So, what will we be doing in 2024? We aim to implement the format_args macro, and to continue the work on the Polonius borrow checker as well as the traits. Why do we need the format_args macro? Basically, this macro is required in order to compile the standard library, and we would like to be able to compile even a simple hello world.
If you have ever used hello world, you may not know it, but under the hood there is the format_args! macro to format all your arguments, and without this we cannot even compile a simple hello world. So, yes, this will come soon, before GCC 14 hopefully. Currently, our borrow checking pass only rejects some invalid code for some facts; we still miss a lot of facts and a lot of things. So we hope to implement more fact validation from the Polonius engine. Okay. Now, we had to change our strategy for gccrs. We met people at EuroRust in Brussels a few months ago, people from the Rust project, I believe from the types team as well as the traits team. Those people told us that the work required to make the trait solver work is easy to do, but to get it right, you need a lot more work. So, basically, if you want 90% of the work done, it's easy, but if you want 99% of the work on the trait solver, it will be a whole lot of work, because there are many rules that are very specific to some code, and, yes, it will be very hard. So, in order to do this, we chose not to implement those ourselves, but to leverage existing Rust code: we'll be using different Rust libraries from the Rust compiler within gccrs. That means there will be two steps in the GCC bootstrapping process. The first one will compile GCC Rust without the borrow checker and without a proper trait solver, and it's only in a later step that gccrs will compile itself with the borrow checker and all that fancy stuff. The first version of gccrs, the one without the borrow checker and without all those things, should never land in the hands of end users. It's only for bootstrapping purposes, and nobody should use it. Here is a schematic about it. So, as I told you, the bootstrapping process will be in two stages. First, we'll compile GCC Rust stage one without the borrow checker.
Then we'll compile Polonius, and then we will compile GCC Rust with Polonius embedded inside it. The format_args! parser follows the same principle: we will compile it as a separate library and then link it. So, in order to do this, we need to make a version of gccrs which can compile the format_args! parser's Rust code. And that'll be it. Let's look at the plan. We need a type checker, macro expansion, name resolution, as well as format_args!. We will integrate those into the compiler in a two-step bootstrapping process, in order to then be able to compile the standard library, and then be able to call your favorite println! macro. In the long term, what should we do? We want to catch up with the Rust for Linux requirements: we want to be able to compile Rust code that can be used for Linux kernel modules. Rust for Linux targets a much more recent version of Rust; I believe it's 1.70, I'm not sure, don't quote me on that. But we still have some additional work. It won't be that hard, because once the standard library compiles, there are not many things left: most of the work in Rust is done within the standard library, not in the language itself. Then we need some analysis as well as semantic testing. We do not enforce, at the current time, some runtime guarantees: for example, array bounds checking, that kind of thing. Rust panics when you try to access an array out of bounds; those checks are not generated yet by the compiler, so we still need to add that. And we need to ensure the compiled assembly produces the exact same behavior as Rust. We want to leverage the rustc test suite in order to be sure that gccrs is compliant with the Rust compiler. We need to work on a lot of improvements, more CI, because currently all our CI is like four little steps and that's all. We want to make sure gccrs works with every architecture supported by GCC.
For example, we have some build failures with the SPARC backend, so, yeah, let's make sure SPARC works again. SPARC64 too. One thing we want to do in the upcoming year is more upstreaming. Last year, we were a bit late; work kept coming and coming, and we didn't upstream as soon as we wanted to. So we want to upstream more frequently. This will avoid the kind of situation where we want to push 900 commits in one mail to the GCC repository and everything crashes, because, well, it is not supposed to handle 900 commits in one go. We want more contributors, more students, and, yeah, more fun too. Thanks to Open Source Security and Embecosm, and to a few members of the Rust community who are helping us get the details right from Rust. There are a lot of people with a lot more experience in the Rust compiler who help us improve the gccrs compiler, as well as many contributors: Thomas, Marc, and even Raiki here. Thank you. Here are different links to our blog, to GitHub if you want to contribute, to the IRC channel, as well as to the mailing list. Yeah, so I'm a bit early, I'm sorry. Not my slide, sorry. You're the second replacement speaker. What? You're the second replacement, after Arthur. As a replacement speaker, I think you did a very good job. So, can we... Yeah. Thank you. Great, we do have some questions up here at the back, coming around. Yes, they'll have a microphone for the stream, but if you could repeat the question anyway. Thank you very much. I have two questions, actually. The first one is related to the borrow checker. Right now, the borrow checker is really deeply tied into the MIR. How are you going to guarantee that you have compatibility between the MIR-based borrow checker and the BIR-based borrow checker? Basically, we... let me repeat the question. So, the question was about the borrow checker in gccrs, is that right?
How do you make sure that it's compatible? I'm sorry, I didn't hear. How do you make sure that the borrow checker is compatible? We'll be using the same... so, how will we make sure the borrow checker is compatible? Basically, we will reuse the same borrow checker as rustc: we'll be using Polonius. Polonius can be compiled as a library, so we'll just make an FFI interface and use that interface in order to use Polonius directly within GCC. Okay. And my second question is: do you think you will be able to emit wasm as well? I'm sorry, it's very hard for me to hear. One of the nice things with the current rustc is that you can emit wasm, WebAssembly. Can you do that? Do you think you will be able to do that with GCC-RS? Yeah. Okay, that was a nice answer. Hello. As a GCC developer, how can we help from the GCC side? Well, you could drop by our GitHub repo. I mean, there's a whole lot of controversy within GCC... sorry, I will repeat the question: how, as a GCC developer, could you help on the GCC-RS project? Is that right? Yeah. So there's somewhat of a controversy on the GCC project, because we're using GitHub and the GCC folks don't really like it. I believe you could use your usual workflow for pushing patches upstream, but I'm not sure. I think the best way to help us is to come to our GitHub repository, clone it, and basically, like everyone: submit issues, solve issues, and so on. So, over here we have a gentleman who would like to comment. Hey, I just want to clarify about WebAssembly: GCC does not currently have a backend for WebAssembly. So if you want to emit WebAssembly, you first have to write a backend, unfortunately. There is, however, precedent in GCC for other high-level assembly backends, so it should actually not be too difficult to do, but it's not available right now. Okay. Two questions. The first one: you mentioned at the beginning compatibility with GCC 4.8.
What are the consequences of this choice from a technical point of view? I mean, you have the modern GCC code base, and you want your code to be compatible with this old code base. Basically, for those who are not accustomed to GCC: GCC 4.8 is a very old version of GCC which doesn't even support C++11, at least not entirely. So we have a few steps in our CI to make sure our code is compatible with GCC 4.8, because there are some constructs in C++11 that are not supported by GCC 4.8. We need to make sure we don't introduce those constructs into the compiler, so that GCC 4.8 can bootstrap our gccrs compiler. Thanks. And a second question: Rust performance relies on LTO, and GCC and LLVM have different LTO strategies. Does that impact you in any way? I don't have much to say about that, because we're not at the stage in the compiler's development where this matters. We want things to work first, and then apply fixes and tricks to improve performance. For now, we'll focus on a working compiler before focusing on things that work fast. Thanks. Yeah, I just had a... oh, loud, loud, very loud. Oh. If I didn't misunderstand, you said that one of your goals was to be able to compile gccrs with itself, but without the borrow checker or string formatting. I would just like to know what the benefit of doing that would be, instead of just compiling with rustc until you have a working borrow checker. I'm not sure I understood your question. Could you please speak louder? What would be the benefit of compiling gccrs with itself, without a borrow checker, versus compiling it with rustc? Okay, so: what are the benefits of compiling gccrs without a borrow checker and then with a borrow checker? Basically, that's this slide here. The borrow checker, as well as the trait solver and many systems like this, are very hard to implement. We would need a lot of time, and we don't have many resources.
So we want to focus on making the compiler work, even if it means reusing components from rustc. This means we first produce a first compiler without a borrow checker, knowing that Polonius, for example, works well, because Polonius has been compiled with rustc. So rustc covers the missing borrow checker step for gccrs, and this version of gccrs will then be linked with Polonius so it can leverage it itself. So, basically, this is a temporary version that the user should never see, and that the user will probably never see. This is a version that will stay on the build machine of someone who wants to build GCC. Most of you won't ever see it. And, yeah, that's it. You didn't quite understand, am I right? Yeah, I understood that you were going to use the gccrs version that didn't have the borrow checker and string formatting to compile gccrs itself. That's what I'm saying, I don't know, maybe I misunderstood. I think what he meant is he wants to know why you want a bootstrap step that is free of rustc. What is the need for this? Because we need to be able to compile Polonius; you need a separate compiler. I don't remember, I'm sorry, but those are steps that are not yet implemented, and I haven't looked much into it, so I don't want to state any mistakes. So... gccrs doesn't yet support all the architectures that GCC supports, so if you want to bootstrap on an architecture that rustc doesn't support, you need this path. Oh yeah, okay, thank you. I had another question. You talked about the rustc-style errors, and also panics on out-of-bounds access. Is there a possibility that we will see this for other languages in GCC, from the work you have done? Yes, because... don't quote me on that, but as I remember it, the student who made the changes to the error framework made changes to some common directory in the GCC project. So other frontends may be able to use this new code. So maybe.
But, I mean, those changes won't come by themselves for all of the languages; we need to integrate them into each frontend. Good, okay. So I understand your point in reusing the borrow checker and the format_args stuff, because it's already done and it's known to work, so why not reuse it? On the other hand, on your slide about why you are doing this GCC-RS project, you quoted the point that you want to provide an alternative second implementation next to rustc, because it often helps to have different implementations of the same stuff, to better understand what the stuff is all about, to better understand the design. Maybe there is something strange in the design that you just don't notice if you only have one implementation. So this would be a point for also having a second borrow checker, for also having a second format_args parser. So what is your philosophy? Where do you draw the line between "we want to implement a second independent system" and "we want to reuse proven code"? Yeah, the question was where we draw the line between components that we need to code ourselves and components we reuse from the rustc project. I would say that... I mean, we don't really draw the line, because those are merely temporary solutions. We want the project to get to a state which compiles Rust code, but in the long run, that won't stay the case: we will probably reimplement those components in C++ within GCC. So, yes, for now, we simply choose the components that are too hard or need too much time, and in the long run, we may replace them with our own implementation. Hi, so my question is a bit, let's say, different, in the sense of: what would be wrong with, for example, emitting GCC trees directly from rustc? This way you have, I think, maximum reuse of already-written Rust code, because you don't use LLVM, but instead you emit GCC trees from rustc. And you could, for example, use a feature flag to toggle between these two things.
So would there be merit in exploring this? I'm not sure. I'm not sure I'm understanding your question. Are you talking about the GCC JIT backend in rustc? Yes. Yes? Well, one thing: this means we get a new front end, which brings diversity, on one end. And I believe we could backport the new front end, as well as multiple other things, to an earlier version of GCC for really old systems, which we cannot do with rustc. Yeah. Well, I think my question was a bit different, in the sense that Rust is a bootstrapped compiler, and the only C++ parts it really needs to function, I might be wrong here, are LLVM, in the end. So instead of LLVM, you could substitute in just a different backend. I don't want to state any mistakes, so I think you should come to Zulip and ask Arthur directly, because he will be way better than me at giving you a proper answer. I'm sorry. No problem, thank you. Sorry. So, to answer that question: you gave the reason yourself. You said you want to support GCC 4.8, but the thing is, GCC 4.8 doesn't have the JIT part yet. Yeah. So what he was talking about was the rustc_codegen_gcc backend that Rust already supports: you can actually already use GCC's JIT to generate code using the rustc front end. But again, that does not work when you want to support very old versions of GCC. And that is actually what I wanted to ask. I wanted to ask: do you actually plan to upstream gccrs support into GCC 4.8, so that people who actually want to use it in an old GCC version don't have to patch it themselves? Probably. I mean, for now we're focusing on upstreaming only the things that we can maintain and support, but it could be possible in the future. So probably, I don't know yet. So, to follow up on your question: I'm the maintainer of libgccjit. I apologize for the name, because it also does ahead-of-time compilation. Worst project name ever.
And it is itself a part of GCC, and therefore its build-time dependency is the same as your build-time dependency, as in: the subset of C++11 that GCC 4.8 supports. In terms of the other question, about backporting the gccrs work into GCC 4.8 itself: I believe GCC 4.8 is still written in C. I'm not sure; it was about then that we migrated from C to C++98, and that sounds difficult. But there is a bootstrapping path. We had another question over here, and then... thank you for being a good sport. Is there a question over here? Okay, all right. No more questions. Wonderful, can we thank our speaker again? Thank you.
Hardware pointer checks in a Rust application near you?
Alright, we have a real FOSDEM hero standing in for Lewis: Pierre-Emmanuel again. We have two more heroes at the back who have also, obviously, fixed the audio. Thank you very much as well. Take it away, please. Hello again. I'm still not Lewis, and I'm still not the original speaker. This talk will be even worse than the first one. Let's talk about hardware pointer checks on the CHERI architecture. Before we get started, here's what we'll cover. We'll be talking about memory safety, capabilities, the CHERI design, Digital Security by Design, as well as the CyberHive Connect project. We'll then talk about the motivation for CHERI and Rust, as well as the implementation and the different challenges and problems we found during this work. So, memory safety. Accessing memory through a pointer: what could go wrong? You probably already know the answer if you're writing some Rust, but is Rust even safe about this? The problem with Rust is that once you tag code with unsafe, or you're in an unsafe context, the hardware will not back you up. It will simply let you access the memory; if you're lucky, you have a kernel which will give you a page fault, but that's all. So the hardware will not protect you against use-after-free, out-of-bounds accesses, everything like that. But you may already know that Rust helps us; I mean, safe Rust is cool. There might still be something that could go wrong. So, what are capabilities? Capabilities are a kind of metadata that we embed, at the assembly level, with pointers. This means every pointer will have a big field of metadata: whether it can be written, read, or even just used, and how it can be used; and the second part of the pointer will be the address itself. So we can encode in this metadata bounds, permissions, invalidation state, all those kinds of things. And this helps us catch code that behaves badly, even when the compiler thinks it is valid. So, let's talk about CHERI. CHERI is a project from Cambridge University.
CHERI isn't an architecture in itself; CHERI is more of a specification. It's a set of specifications for a hardware extension. It allows the creation of capability-based systems, and the specification covers all the capabilities required to make code safe. So, I was talking about this metadata: here you can see, in this slide, the encoding of metadata in the CHERI specification. We've got the permissions, the type, as well as the bounds of the address, in order to check any out-of-bounds array indexing or things like this. And you've got the 64-bit address behind it. Okay. One note: purecap and hybrid mode. CHERI provides two modes. Purecap, basically, means every pointer has metadata: every pointer is 128 bits. And hybrid mode is there in order to ensure compatibility with older, or rather capability-less, systems. Okay. So, here you've got an example of an instruction with capabilities. It takes an address and it raises an exception if permissions are not correct or something is wrong, as on the previous slide. So, we've got bounds setting here. This means we can use a pointer for an array with set bounds, and if we try to access this array out of bounds, the machine will trap and give us an exception. Digital Security by Design: what is it? It's a United Kingdom government initiative that wants to expand the use of CHERI out of academia and into industry. They fund multiple works to demonstrate the application of CHERI and make it work in the real world, in industry. Initially, it revolved only around Morello. You may not know Morello: Morello is an extension for the ARM architecture. Recently, they have focused more on architectures such as RISC-V, for example. CyberHive Connect: CyberHive Connect is a security-critical application written in Rust. It aims to implement end-to-end encryption over a mesh network. So here you've got an example: this application is a security-critical application.
And it works with a mesh network and end-to-end encryption, so this means, obviously, there should not be vulnerabilities. Okay. So, why CHERI and Rust? Rust already provides different restrictions, but some restrictions cannot be provided by Rust statically. For example, there are runtime enforcements that are provided by Rust, but they slow down the code. You may have seen out-of-bounds checks on your arrays when you index an array, that kind of thing. And that kind of code is slow, but if you replace it with CHERI-based hardware checks, it can be faster: on an attempt to access an array out of bounds, the hardware will simply trap. You don't have to handle the check yourself; you just have to handle the trap. Also, when you need to connect an application to Rust code, for example through the FFI, the foreign function interface, you may be safer, because the CHERI extension will be there to back you up and provide you correct pointers. And you can be sure that the pointer you'll be using in Rust doesn't come from nowhere, and isn't a bad pointer or whatever. So, yes, unsafe can become, in some way, safer. Here's an example. We've got an array; we convert it to a pointer. We make a string, we try to read a line and parse a number, and then, at the end, we try to add that index to the pointer. As you may have seen, we are using unsafe code, and the Rust compiler won't catch any of this, because we told it to trust us. So here, CHERI might help us: CHERI will raise an exception when we go out of bounds in the array. Lewis provided two new targets for the Rust compiler: a Morello purecap target and a Morello FreeBSD purecap target. As you may have seen, both are purecap. This means they are not compatible with hybrid mode, and these implementations are not compatible with standard pointers. As we might say it, all pointers must have capabilities enabled.
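A sketch in the spirit of the slide's example (the exact code from the slide isn't in the transcript, so this is a hypothetical reconstruction): indexing a raw pointer with a value the compiler cannot check. On conventional hardware, an out-of-range index here is silent undefined behavior; on a CHERI purecap target, the capability carries the array's bounds and the hardware traps instead.

```rust
// Hypothetical stand-in for the slide's example: `index` plays the role
// of a number parsed from user input.
fn read_at(index: usize) -> u8 {
    let data = [10u8, 20, 30, 40];
    let ptr = data.as_ptr();
    // `unsafe` tells rustc "trust me": no bounds check is emitted.
    // With index >= 4 this is undefined behavior on conventional
    // hardware, but a hardware trap on CHERI.
    unsafe { *ptr.add(index) }
}

fn main() {
    assert_eq!(read_at(2), 30); // in bounds: fine everywhere
    // read_at(1000) would be UB here, but a trapped exception on CHERI.
}
```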
So here we have a new kind of pointer coming into the Rust compiler, and all those changes are available in the repository right here. There were different implementation challenges. We had to provide a new pointer type with capabilities. One thing that created debate a few months, or years, ago is the usize type: what should usize in Rust represent? Should it cover the entire addressable space? Should it be able to contain a whole pointer? That kind of thing. So we chose to represent only the address part of the pointer within usize. Layout and address space differ for pointers with capabilities; more on that later. And we had to generate some CHERI-specific intrinsics for LLVM IR. Again, as I said, usize is not a pointer. Okay. There should have been a demo, but I haven't got one, so, well, enjoy the screenshots. Okay, so here we get a segmentation fault when we make an out-of-bounds access in our array, even if we don't hit an unmapped page, for example. So that's cool. I'll skip the slide. Okay, future work. So, what will Lewis concentrate on next? He will add more CHERI targets, hopefully, and possibly some hybrid-mode targets. We want the Rust test suite to pass: for now, only 50% of the tests in the test suite pass. And refactor the code, document the code, and rebase on a newer version of Rust, because right now it's on Rust 1.67. And Lewis would like to begin upstreaming his work. Well, thank you, and sorry again for this whole talk. If you've got questions, I may be able to answer them, but, to be fair, probably not. Thank you. What other targets are you looking at, then? Are there other targets besides Morello which actually implement CHERI today? I'm sorry, I didn't hear. Are there actually targets which implement CHERI today, besides the ARM Morello thing you showed? I don't know.
I mean, there's been talk about some RISC-V extension, but the gentleman behind you might be able to answer. So, thank you. I'm one of Pierre-Emmanuel's colleagues. There are a number of RISC-V implementations out there. Codasip demonstrated one at the RISC-V Summit, and Microsoft, and I believe lowRISC, also have ones as well. So RISC-V is actually running ahead of ARM, if anything. But are those RISC-V implementations so far virtualized, or are there any boards which support CHERI? I'm sorry? Regarding the RISC-V implementations: are there so far any boards which support CHERI, like RISC-V CHERI, or is it mostly virtualized QEMU environments? I suspect these have only ever been made by the development teams as demonstrators on FPGAs, but Codasip certainly intend to be able to ship stuff to their customers, and I think before too long there will be hardware available. You have a slide about GDB. Do you have GDB support for when someone prints one of these pointers, like the semantics of, you know, the extra capability bits? If we take a look at, in fact, if we take a look at Lewis's work, the capabilities are stored in address space 200, if that makes sense. So there is some kind of support, but I believe it's more a hack than the real thing. I'm not sure; as I said, I really don't know much. So, just to follow up: I believe there is reasonable support for GDB and CHERI on CheriBSD, and it displays all the things you need within GDB. Any more questions? If not, then let's thank our speaker again.
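As a footnote to the CHERI talk: the permission-and-bounds checking described earlier can be modeled in a toy, software-only sketch. All names here are invented for illustration; real CHERI stores this metadata in the pointer encoding and performs the check in hardware on every access.

```rust
// Toy software model of a CHERI-style capability: metadata (bounds,
// permissions) travels with the address, and every access is checked.
#[derive(Clone, Copy)]
struct Capability {
    base: usize,    // start of the region this capability may touch
    len: usize,     // length of that region
    readable: bool, // one of the permission bits
}

impl Capability {
    /// Returns Err (the software stand-in for a hardware trap) when the
    /// access violates bounds or permissions.
    fn load(&self, mem: &[u8], offset: usize) -> Result<u8, &'static str> {
        if !self.readable {
            return Err("trap: no read permission");
        }
        if offset >= self.len {
            return Err("trap: out of bounds");
        }
        Ok(mem[self.base + offset])
    }
}

fn main() {
    let mem = [1u8, 2, 3, 4, 5, 6];
    let cap = Capability { base: 2, len: 3, readable: true };
    assert_eq!(cap.load(&mem, 0), Ok(3));
    assert!(cap.load(&mem, 3).is_err()); // one past the bound: "traps"
}
```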
Proving Performance
So now we have Nikolai Vasquez, who has come all the way from Atlanta to tell us how we can improve performance in our Rust programs. Give him your attention; it's going to be a really good talk. Take it away. Thank you very much. So, yep. Hi, I'm Nikolai Vasquez. Some of you might be familiar with my work in the Rust community, such as the static_assertions crate or, recently, Divan, which is a benchmarking library that I'll be discussing in this talk. And so this title, I realize, is a bit of a misnomer: you can't really prove performance. There are just various factors that make this impossible; for example, there are various system conditions that could affect performance depending on the machine, and you could be working over different data sets. So rather than considering this as proving performance, this is more like getting a vibe for performance. And so, by show of hands, how many people are familiar with measuring the performance of their code? All right, so the vast majority. Great. All right, so you're all experts and you don't need me. So, I know you probably know this, but when we discuss what performance means in software, we're usually talking about how fast it is; but to me, in broader terms, performance is more about how software uses resources to achieve a desired result. So along with thinking about the time being spent in our software, we should also be considering the amount of memory it's using. I think that's a very important aspect of performance. And so making good software can be a balancing act of trade-offs between time and space, and so it can be a bit difficult. As developers, the way we write code can have a pretty direct impact on the performance of our software. So, for example, we could be using really inefficient algorithms, with a time or space complexity of O(n²), O(2ⁿ), or whatever that yellow line might be. We might also be repeating computations instead of saving previous results.
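That last point, saving previous results instead of repeating computations, can be sketched as memoization. This is an illustrative std-only example, not from the talk:

```rust
use std::collections::HashMap;

// Memoized Fibonacci: without the cache the naive recursion is O(2^n);
// caching previously computed results makes it O(n).
fn fib(n: u64, cache: &mut HashMap<u64, u64>) -> u64 {
    if n < 2 {
        return n;
    }
    if let Some(&v) = cache.get(&n) {
        return v; // previously computed: no repeated work
    }
    let v = fib(n - 1, cache) + fib(n - 2, cache);
    cache.insert(n, v);
    v
}

fn main() {
    let mut cache = HashMap::new();
    // Infeasible without memoization; instant with it.
    assert_eq!(fib(50, &mut cache), 12_586_269_025);
}
```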
We could be choosing to use slower operating system APIs: for example, waiting on sockets in Linux with the select system call versus epoll. But also, performance can be bad for reasons that are out of your direct control as a developer. So, at a micro level, for example, the CPU might not have cached the data that you're requesting, and instead it will have to reach for main memory. The CPU might also expect the wrong branch to be taken, so it won't speculatively execute the correct branch; and the CPU might be waiting on previous computations before executing subsequent code. And then at the macro level, looking outward, other cases might be that the network has really poor latency or bandwidth, other processes could be using excessive amounts of RAM, which can cause the OS to swap memory to disk, or your storage might be a slow device, like spinning hard drives instead of SSDs. So when it comes to performance, why do people choose Rust? I believe that the central reason to pick Rust for performance is its community. I find that the community's culture of performance has led to many zero-cost abstractions, ranging from async/await in the compiler to very fast standard library APIs. And we also see this culture in third-party libraries. So people will try to make their code work really well in constrained environments in the embedded space, or people will focus their attention on how well they're using time and space: how fast their code is and how much memory it's using. And the community has developed many tools to measure performance, so this really does speak to the culture. And now that we have a sense for what performance is, how do we go about measuring it? For the sake of simplicity, I'll only be covering things that can be implemented with the Rust standard library; I'm not going to be covering everything.
For example, I'll skip hardware performance counters, because each operating system has different APIs for them, and usually accessing them requires root privileges, which can be difficult. So let's consider a simple operation, such as allocating a vector of 100 integers. We could try timing it by using the standard library's Instant type, and this is generally considered an all-right approach, but the results may be surprising: it might just say zero nanoseconds. And so why is this happening? Well, it turns out that the compiler is smart enough to realize that the value wasn't actually used, and so it optimizes out the allocation. So when you're benchmarking, you really should pretend, or at least trick the compiler into believing, that the value is being used. The standard library provides a black_box function, and you can use that to prevent the compiler from optimizing out code that you want to run. And I find that a lot of people don't reach for this when they should. And so now that we're using this, we're actually getting higher timings that are more realistic, and this is evidence that we're actually now measuring the allocation time. But why 500 nanoseconds? How consistent or accurate is this timing? Well, it turns out that if we run our code repeatedly, the times can fluctuate greatly. So the numbers might vary because of noisy system conditions, or some of the things that I mentioned earlier. And you might wonder: okay, then how can we get a better sense of our code's speed? You could dive into existing solutions; what I generally recommend, for practicality's sake, is to use an existing library that implements this correctly. And so recently I created a library, Divan, for exactly this. I wanted to make a tool that makes it very easy to do correct measurements and to compare various pieces of Rust code. And so, to me, I would say Divan is so easy that it's a comfy benchmarking library, because a divan sofa is like a comfy bench.
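For reference, the raw Instant-plus-black_box approach just described can be sketched with only the standard library. The iteration count here is an arbitrary choice for illustration:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    for _ in 0..1_000 {
        // Without black_box, the compiler may see the vector is unused,
        // delete the allocation, and report ~0 ns; black_box keeps the
        // value "used" so the allocation is actually measured.
        black_box((0..100).collect::<Vec<i32>>());
    }
    let per_iter = start.elapsed() / 1_000;
    println!("~{:?} per iteration", per_iter);
}
```

Note that repeated runs of this will still fluctuate with system noise, which is exactly the problem a benchmarking library smooths over.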
And so that's why I named it that. You can read a bit more about it in the announcement blog post on my website, but I'll also dive into what Divan can do for us today. So, I wanted to make Divan really easy to use, and I wanted the way to register benchmarks to be very simple. So I came up with this very simple yet powerful attribute macro that, behind the scenes, will generate the code that's needed to benchmark a function. And this might look familiar, because this is also how you register unit tests in Rust. And like unit tests, registering benchmarks with Divan can be done anywhere, not just in the same crate as your benchmark runner. I also take advantage of this feature within Divan, by measuring the internals of Divan with Divan, which is kind of meta. And so, given the previous benchmark that we wrote, it's pretty straightforward to adapt it to Divan: we just stick it in a function and then mark it as a bench, and then Divan will be able to run it. And after executing our benchmark, Divan presents us with pretty succinct information about how it ran. Here, we can see that the best speed was measured at about 70 nanoseconds, and this realistically represents how fast the function would perform under ideal conditions. We also see that the worst case was measured at about 200 nanoseconds, and there are various things that could play into that: it might not necessarily be the code itself, but the situation around the code. And then we also have the median and mean, which represent the average time that the function took to run. We can see that these values are pretty close to the fastest sample, so we can be fairly confident that this function will generally perform at this speed, at least on this machine. And so, to give some insight into how Divan is running this code, we see that it's reporting the number of samples and total iterations.
Samples represents how many timings Divan has measured, and iterations is the number of repetitions across all the samples. If we divide the iteration count by the sample count, we end up with what I call the sample size, which is how many iterations per sample. Here we see that each sample took about 64 iterations. This is chosen dynamically at runtime based on how much time is spent in earlier samples, and this number can be higher for faster functions, or as low as just one iteration per sample for really slow functions. But if we only want to measure the time to allocate a vector and not the time to deallocate it, then the way this works in Divan is you simply return the created value from the benchmark function. This defers freeing the vector until after the sample is finished being timed. And since Divan will automatically black_box the returned value, we can actually remove the black_box from our function, which just makes it a lot easier to read. Since we're now measuring vector allocation but not deallocation, our benchmark results are about half the time that we measured before. So far we've only been benchmarking allocating vectors that contain 100 integers, but we can also benchmark other cases. We can use the args option to measure across one, five, 10, 1,000, you name it, any value that can be provided as an input. I find it's generally very good practice to measure across various cases to get a better sense of how your code is running. And we can see that, as expected, as the size increases, the benchmark also slows down. But interestingly enough, for cases that are at 10 or smaller, there's not really a difference in performance.
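The two ideas above, returning the value to defer its drop and parameterizing over inputs with `args`, combine into something like the following sketch (again assuming the `divan` crate; the option name `args` matches Divan's documented attribute options):

```rust
// Returning the Vec moves its Drop (the deallocation) outside the timed
// sample, and Divan black_boxes the returned value automatically.
#[divan::bench(args = [1, 5, 10, 100, 1000])]
fn collect(n: i32) -> Vec<i32> {
    (0..n).collect()
}
```

Each value in `args` gets its own row in the output, which is what makes the comparison across sizes so easy to read.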
So really the differences there, I would say, are more like systemic noise, because it doesn't really make sense that creating five values in a vector consistently takes longer than creating 10. We also notice that this function really starts to slow down a lot at bigger sizes, and that aligns with whatever hypothesis we might have had about this benchmark before. But we can also compare the performance across multiple types by making the function generic, and then providing a types option to pass all the cases. So now this benchmark is not only running the standard library's vector type, it's also comparing that against SmallVec, using all the same cases as before. For those who aren't familiar, SmallVec is a type that's designed to be faster for smaller sizes. It does this by storing values on the stack instead of doing a heap allocation. But once there's not enough space on the stack, it will fall back to using the heap, like the standard library's vector does. To make what's happening a bit clearer: Divan isn't actually doing anything special to the function. This is just normal generic code that's pretty common to write. Divan simply uses the function as is to generate the benchmarking code for each type that's passed into the attribute. Once we run this, we get this nice tree output and table where we see that Divan has grouped the types as separate trees under the benchmark function's name. We can also see from these measurements that, at least for this specific operation of collecting from a range, SmallVec is faster than the standard library's vector when the number of items fits on the stack. However, once the size grows beyond fitting on the stack, once SmallVec needs to do heap allocations, interestingly enough the standard library's vector is faster.
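A sketch of what that generic comparison can look like, assuming both the `divan` and `smallvec` crates (the `types` option and the inline-capacity `[i32; 16]` are illustrative choices, not the talk's exact numbers):

```rust
use smallvec::SmallVec;

// Ordinary generic Rust code: Divan instantiates one benchmark tree per
// type listed in `types`, each run over every value in `args`.
#[divan::bench(
    types = [Vec<i32>, SmallVec<[i32; 16]>],
    args = [1, 5, 10, 100, 1000],
)]
fn collect<T: FromIterator<i32>>(n: i32) -> T {
    (0..n).collect()
}
```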
I imagine this is because the standard library's vector can do nice optimizations like specialization, which, if any of you can make that stable, please. I've been waiting forever. But when we think about software performance, like I mentioned earlier, we shouldn't only be considering speed; we should also be considering the amount of space being used. Normally, if you're profiling a long-running program, keeping track of allocations with a tool like DHAT has a relatively low cost, because it gets amortized over the life of the program. And the nice thing about tools like DHAT is that they collect backtraces to tell you exactly where your allocations are happening, so they give you a lot of information. However, in microbenchmarks, the time spent tracking allocations can have a noticeable impact. Taking backtraces can take microseconds, whereas the code we want to measure may run in just a few nanoseconds, so we would be totally blowing out the timings. In a sense, by observing the behavior of our program, we've also affected our measurements. So is it possible to gather insights without affecting measurements? Is it possible to reduce the amount of time spent here? Well, I actually managed to do that. Divan has a custom allocator that will only track the number of bytes allocated and the number of allocation events during benchmarks. This applies to allocations, deallocations, and reallocations that grow or shrink. The way you use this is you override the global allocator with Divan's AllocProfiler. But you can also pass it a custom allocator if in reality you are going to be using a faster allocator such as mimalloc, so it's fairly flexible. Once we've registered this allocator and we rerun the same benchmarks as before, we can see which cases are allocating and how many times.
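Registering the profiling allocator is a one-liner, sketched here from Divan's documented `AllocProfiler` API (assuming the `divan` crate; `AllocProfiler::system()` wraps the system allocator):

```rust
// AllocProfiler intercepts the global allocator and, during samples, only
// counts bytes and allocation events — no backtraces, so the overhead
// stays close to noise.
#[global_allocator]
static ALLOC: divan::AllocProfiler = divan::AllocProfiler::system();

fn main() {
    divan::main();
}
```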
Notice that we are not seeing the deallocation listed here because, like I mentioned before, we're returning the created value from the benchmark, so it's being dropped after the sample is timed. I also want to note that the timings here are the same as before we did any allocation profiling. I managed to optimize this to the point that its footprint is pretty much indistinguishable from noise, by using thread-local storage and then optimizing that further, at least in the case of macOS. If we look a little closer, we can see that, for smaller sizes, SmallVec is indeed not performing any heap allocations and is strictly doing its operations on the stack. We can also tell Divan the number of bytes or the number of items we're processing, and this gives us a pretty different perspective. The way we do this gets a little more complicated. We change our function to take a Bencher argument, then we call the counter method on that and pass it an instance of BytesCount. In this case, we're saying that we're counting n 32-bit integers, and then we pass a closure to benchmark our FromIterator implementation. We then see that Divan will output the number of bytes being processed, in this case in terms of megabytes or gigabytes per second. For a lot of people, this might be an easier data point for building an intuition about the speed than the strict timing numbers. For some people, seeing growing numbers for better performance is just easier to think about. So to recap what I just covered: Divan has various features that really set it apart from existing solutions. I find that its API is just a lot simpler and easier to remember. I also really like how the compact output makes it pretty easy to consider various cases. And because you can parameterize your benchmarks across various cases, you can really get a sense for the difference in performance depending on the scenario.
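The throughput setup described above can be sketched like this (assuming the `divan` crate; `Bencher`, `counter`, and `BytesCount` follow my reading of Divan's API, and the byte count is computed explicitly as n 32-bit integers):

```rust
use divan::{counter::BytesCount, Bencher};

// Attaching a BytesCount makes Divan report throughput (MB/s, GB/s)
// alongside the raw timings.
#[divan::bench(args = [1_000, 1_000_000])]
fn collect(bencher: Bencher, n: u32) {
    // n items of 4 bytes each — "counting n 32-bit integers".
    let bytes = BytesCount::new(n as usize * std::mem::size_of::<u32>());
    bencher.counter(bytes).bench(|| (0..n).collect::<Vec<u32>>());
}
```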
I also really like that, by going with an attribute macro, if you make the function generic, you can just pass the types in, because you're just passing whatever you want as the options. So you can have benchmarks over various collections from the standard library, like LinkedList, Vec, HashSet, et cetera, and you can see how different operations really differ between those collections. An operation that's pretty common, like clear, might be a lot slower on a LinkedList, whereas on a Vec it's pretty much instant. Another feature that helps me a lot is thinking of the numbers in terms of throughput. I find that it tends to be easier to understand than durations. Also, something that I find no existing tool out there does: you can track the number of heap allocations at the same time that you're measuring the time spent running your benchmark. And one feature I didn't mention here, because I thought it might be a little complex to cover, is that you can do some interesting things like run benchmarks over multiple threads, and this allows you to measure the effects of contention on atomics and locks. So if you're developing a low-level synchronization library, you might find this pretty useful. I also want to cover what motivated me to pursue this. I found that a lot of existing tools in this space were pretty good, but their APIs had some, in my opinion, unnecessary complexity, and so I wanted an API that didn't go too far beyond the complexity of the code that you're benchmarking itself. And I really appreciated that trying this new API opened up new possibilities, such as what I mentioned before with benchmarking generic functions. It was relatively straightforward to implement, which was a bit of a surprise. So, some food for thought if you're developing your own libraries.
I also found that the default way some other tools run is pretty slow, and I get why: they're trying to do a lot of statistical analysis to remove outliers. But there are some cases where you do actually want to know when the code was especially slow, because if you're benchmarking over various inputs, it's possible that one case just happened to create a really large string. So you want to be able to get a sense for everything that happened, not just the best-case scenarios, in my opinion. And if you do want to run your benchmarks for longer, with larger sample sizes or more samples, there are also options to do that, so you're not restricted. I also want to mention some other Rust performance measuring tools that I think are very much worth considering. Criterion is obviously the current go-to Rust benchmarking library. A feature that I particularly like about it is its graph output, because I'm a very visual person; I also do graphic design. Another, newer microbenchmarking library is Tango. What I find unique about it is that it has this technique called paired benchmarking, where the execution gets interleaved between benchmarks. What this does is spread whatever negative systemic conditions evenly across your benchmarks. This is certainly a feature I eventually want to have in Divan. Currently my API tries to avoid requiring ownership of the closure you're passing in; I might have to change that to make this work, I don't know. I think if we had coroutines, I could make it work. Maybe if someone knows how to abuse async/await to get coroutines, please talk to me. Another very useful tool is flame graphs. This is more of a general technique that's well known across the industry, and there are plenty of blog posts about it. But for those who aren't familiar, it's a visualization tool that really helps you find where to focus your time.
I think it's very important to find where the bottleneck in your code is before you actually start picking specific places to optimize and microbenchmark. So try to reach for profiling with flame graphs before you do microbenchmarking, if you can. There's also the DHAT crate. Like I mentioned before, every single time an allocation operation happens, it takes a backtrace, and so it's able to give you pretty deep insights about where and how you're allocating memory. It can also do some other things, such as tracking the maximum heap usage at a point in time. I'm going to try to add that to Divan, but unfortunately it adds a bit of overhead. Maybe it's possible to subtract that overhead from the timings; we'll see. Some thoughts I want to leave you with: if you're going to reach for microbenchmarking tools like Criterion, Divan, or Tango, really figure out if it's worth micro-optimizing that kind of code. First try to find the meatier performance issues in your program; like I mentioned, flame graphs are particularly good for that. Also, rather than having standalone benchmarks, you should be comparing between different cases, so you can measure across different inputs and implementations. Like I showed before, with Divan you can benchmark generic functions. And this, for example in the case of SmallVec versus Vec, really gives you a better sense of whether it's actually worth it to optimize your code using unsafe. Try to find the specific scenarios where you actually get those wins, because no one likes nasal demons. And also, when making your own projects, since I imagine many people here are contributors to open source and have their own stuff that they're proud of, I strongly advise that you don't let perfect be the enemy of good. Divan has a lot of features that Criterion doesn't have, but also vice versa: Divan doesn't have graphs or machine-readable output yet.
I do eventually want to get there, but I didn't let that stop me from publishing something that I felt was good and that people might want to use. So try to focus on the features that matter to you most, or at least are the most academically interesting. Not everything needs to be a full-fledged product. Definitely try to pursue your interests when making your own projects, and always remember that you can fill in the gaps later if you want. So that's all I had for this. I plan to take questions, but in the meantime, you can read more about me. Currently I just have one blog post on my website, about Divan. I plan to publish another one on something like std::conditional_t from C++, but in Rust, which is as cursed as it sounds if you're familiar with std::conditional_t. You can also follow me on Mastodon, or on Twitter, though I refuse to call it X. You can check out the code for Divan. Please give it a star and play around with it. And if you want to reach out to me, I'm pretty accessible through Mastodon; I'm @nikolai on Hachyderm. So yeah, any questions? We do have ten minutes for questions, so there's plenty of time. Just raise your hand and I'm going to come to you. So, Nikolai, thanks for your talk first, very nice. I have two questions. The first question is: have you thought about integrating into CI/CD, so continuous integration things? To me it seemed like this is a very handy tool which I can use if I have a problem at hand that I want to analyze: I can quickly do some benchmarking and then dig deeper. But if I have found an issue in a very specific place, I might also want to have a test case out of it that I can monitor, or be alerted by in my CI/CD if there is an issue again. That was the first question. And the second question would be: is it possible to run all those benchmarks in parallel, or do you have to sequentialize them in order to get the measurements right? Both great questions.
So right now, what's blocking getting a lot of value out of running your benchmarks in CI is that Divan doesn't yet have programmatic output. My plan is to have JSON output and maybe some other format, if that makes sense. Once you have programmatic output, Divan can then consume previous iterations of that if you're benchmarking across different branches. Also, the author of Tango was exploring some ideas of using a shared library compiled for different branches, and making that pretty straightforward with GitHub Actions. So yes, I'm definitely very interested in that. Sorry, repeat the second question? The second question was regarding the execution: whether you are able to execute more than one benchmark in parallel, and whether there's some impact on the measurement itself. Yeah, so while technically you can, I find that putting the current process under more load could just negatively affect your timings. So it didn't seem reasonable to do that, although I haven't actually measured whether it would have as big of a negative impact as I would expect. Thank you. One question I had is: is there a way to compare the execution time with and without a warm cache? The impact of the cache on some data structures can be huge, and sometimes in benchmarking, in microbenchmarking especially, you have the problem that you're reusing the same cache lines, so the second benchmark is always going to be faster. But maybe your use case is actually the one in which the cache is cold, for instance. Yeah, great question as well. I considered having a helper function to evict the CPU caches, although I haven't thought of a good way of documenting when this is best to use. But in lieu of that, there's a method on the Bencher type where you can pass a closure to generate inputs for every single iteration.
So if you wanted to, you could create a new buffer every single time that function is run in your benchmark. Since that would be in a different place in memory, the cache effects wouldn't make the benchmark seem so much faster than it might be in a real-world case. We have a question from the Matrix; people are following online. The question is: thanks, Divan looks very interesting and the API looks much cleaner and simpler than Criterion's. Now, Criterion can compare across different runs or implementations and then summarize whether performance improved or got worse, within some confidence interval. Does Divan have something similar, or plan to? Yeah, so it currently does not. I found that I kind of shoehorned myself a bit with this output format, in that it's not super easy to add a lot more information to it. So it's kind of become a bit of a UI problem in a way, which I find interesting given that it's a command line. But yeah, I would very much like to just directly tell the user that this is faster or slower than previous runs. There's also the issue that, for example, if I have my laptop plugged in, my benchmarks run faster; if I don't have it plugged in, then it's slower, it gets throttled. So it's not always obvious that it was a change in the implementation that caused the function to get slower. I believe Criterion's documentation has a big warning section about this issue. But yes, I do think that is valuable to have, and I would like to have it. And also, if you are actually very interested in this project, feel free to submit ideas to the GitHub page, or better, pull requests implementing the features that you'd like to see. I'm only one person and only have so many hours in a day. Yeah, I have two questions.
The first one was: you mentioned that one of the flaws or design differences with Criterion was that it leans a lot on statistics, instead of just giving you the fastest, the average and so on. I was wondering if there is a mechanism in Divan to output, for example, percentiles or something like that. And my second question was: when you're benchmarking memory, if the function you're benchmarking deallocates instead of returning all the memory that it allocated, would the output show the total memory allocated, or just the memory remaining when the function returned? Yeah, so any allocations that happen before the benchmark runs won't be visible to Divan in a sense: it will have recorded them, but they won't be associated with the current benchmark; the counts just get cleared before the benchmark runs. So in that case, you would see that the number of deallocations would be greater than the number of allocations in the benchmark. And to answer your first question: when you say percentiles, are you talking about confidence intervals? Well, no; that would also be an option, but the first thing that came to mind for me was percentiles. Like when you order the timings in ascending order: if you did 100 runs, which was the 95th fastest or slowest, for example. Yeah. So I would like to have more interesting statistics output. Right now, I just focused on having what I considered most important for your average benchmarks. I'd also like to output what the variance is across all the samples. But again, I kind of painted myself into a bit of a corner, in that people usually only have so many columns in their terminal, so it will be interesting to see how I add to this table output.
I think what I'll end up doing is have options for emitting the columns that you're interested in, and just have certain columns by default. So when I do get around to having more interesting statistics, that will probably lead the way to making it user-configurable whether you end up with a giant table or not. Okay. Thank you very much for your talk and your answers. Thank you.
Friend or Foe Inside? Exploring In-Process Isolation to Maintain Memory Safety for Unsafe Rust
All right, let's settle down. We have Merve Gülmez. She's going to talk about friend or foe: exploring in-process isolation to maintain memory safety for unsafe Rust. Thank you very much. Take it away. Hello, everyone. I am happy to be here. Let's get started. I hope this one is working now. As you saw in the previous presentations, there is an uptake of Rust in end-to-end projects, for example Rust for Linux, or Mozilla, or, currently happening, Rust in the Windows OS. For example, Mozilla is now 11% Rust, and the rest is other languages, for example C and C++. Of course, all of this gradual development requires mixed-language applications. Also, in the previous talk today, they talked about unsafe Rust. Rust actually has two languages: one of them is safe Rust and the other one is unsafe Rust. Unsafe Rust doesn't enforce memory safety guarantees, and why do we need it? Sometimes we want low-level control or implementation details, and sometimes we need it for optimization. In an earlier talk, there was a demo of how unsafe Rust can completely violate a Rust application's memory safety. It lets us dereference raw pointers, and it allows us to call unsafe functions via the foreign function interface. Academic work shows that more than 72% of Rust crates depend on at least one unsafe FFI. So now we have two things, safe Rust and unsafe Rust. Unsafe Rust says: trust me, I know what I am doing. Should we trust it, or should we put up a shield? And the gap is here, actually. As I always mention, mixed-language applications undermine the memory safety guarantees of safe Rust. As a result, isolation is really needed. I am a PhD student, I am a researcher, and there is a lot of academic work addressing this issue, for example ERIM, TRust, Sandcrust, and so on. But what is the difference between these different academic publications?
They either say that we should use process-based isolation, or that we should use in-process isolation. When you have process-based isolation, firstly you have integrity: each process has its own virtual memory space. You also have another nice property, resilience: each process is its own failure boundary, and if one process crashes, the others are not affected. A good example of that is a multi-process software architecture. On the other side, we have in-process isolation. It means that you have one address space and, inside this one address space, you isolate one part of it. For example, you want to protect just a key, or you want to protect one part of the application. Of course, in-process isolation can significantly reduce the runtime cost, because context switching is cheaper than with traditional process isolation. I put the early approaches here. A small box inside the bigger box means in-process isolation, and the others are sandboxes providing process-based isolation. I would like to highlight something: just one of them offers crash resilience, but it is process-based isolation. We have SDRaD here, but SDRaD doesn't support Rust; it just supports C applications. I did some measurements, and according to these measurements, process-based isolation is actually ten times more expensive than in-process isolation. But the gap here is: can we get the best of both worlds? Can we have integrity and failure boundaries similar to process-based isolation, while also having overheads similar to in-process isolation? My goals are, firstly, to maintain Rust application safety, to increase the software resilience of Rust applications, and also to provide ease of use in development. In my early work, we provided an approach for C applications.
This is secure rewind and discard, an approach for recovering a vulnerable application after an attack is detected. How do we achieve that? First we compartmentalize the application into distinct domains, and we want to make sure that a memory defect within a domain can only affect that domain's memory. This approach relies on hardware-assisted software fault isolation, namely Memory Protection Keys for Userspace. And how do I detect attacks? I use different pre-existing detection mechanisms, for example stack canaries and domain access violations. As a result of this work, we published a C library, the SDRaD library. If you want to check it out, you can scan the QR code. Now I would like to explain the high-level idea. We have a function F, and it is unsafe. If we just draw a box around it, we want to have some memory safety guarantees and some isolation. So let's get started. We have a parent domain, and we want to run this function F in another domain. Another domain means stack and heap isolation: I want to run this function F with its own stack and heap. To run my function in the nested domain D, I first need to push the arguments, enter domain D, and pull the arguments from the parent domain. Then I invoke F and execute the function. And the question is: did something bad happen or not? We now have a guarantee that if the nested domain has a memory vulnerability, it will not affect the parent domain's memory. The parent domain is still secure. And I am saying that you don't need to kill your application; you can just continue the execution. How? Probably all of you know that Rust has an API for panics. We run the function F in the nested domain, and after that we check the result.
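The "run F, then check whether something bad happened" control flow mirrors the standard library's panic API, which can be sketched with std alone. This is an illustration only: the real SDRaD-FFI mechanism uses hardware-backed domains rather than unwinding, and `run_checked` is a hypothetical name:

```rust
use std::panic::{self, UnwindSafe};

// Illustrative analogue of the pattern: execute F, and if something bad
// happened, report Err instead of taking down the parent. In SDRaD-FFI the
// "something bad" is a memory safety violation in the nested domain; here
// we stand in for it with an ordinary panic.
fn run_checked<F, R>(f: F) -> Result<R, ()>
where
    F: FnOnce() -> R + UnwindSafe,
{
    panic::catch_unwind(f).map_err(|_| ())
}

fn main() {
    // The happy path just returns Ok with the function's result.
    assert_eq!(run_checked(|| 2 + 2), Ok(4));
    // The failing path is contained; the parent keeps executing.
    assert!(run_checked(|| -> i32 { panic!("simulated violation") }).is_err());
    println!("parent still running after the failure");
}
```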
If the result says that something bad happened, for example I detect a stack memory overflow or a domain access violation, it will return an error. If everything is okay, we don't need to do anything; it will just return Ok. The idea is that we are using this panic API, but we are actually adding a new feature: panic now also comes with memory safety guarantees. It means that when a panic happens, you can still continue your execution. After that comes the rewind and discard I explained. If nothing happened, if we didn't detect any memory safety violation, we just push the mutable arguments and the return value and return to the parent domain. For this whole high-level process, the SDRaD C API already offers the pink box; the point of the blue box is how we can track the Rust types of the arguments, and how we can push them to another domain and pull them back again. Probably all of you know that we have a lot of serialization crates. And what is serialization? It means we encode the data in a format, like I just put it in a jar, and after that, when we jump to the nested domain, we should deserialize it. As a case study, I worked with two Rust crates, bincode and abomonation. Bincode transforms data into a common binary representation that allows passing data between different platforms. I already mentioned that Sandcrust is a process-based isolation mechanism, and it uses bincode. But actually, we realized that bincode is redundant for our case, because with Sandcrust and also SDRaD-FFI, we are only interested in a single platform. So I explored abomonation. It is based on the Rust object memory layout representation, but it is specific to a single platform. It doesn't store any metadata or any type information, and it can deserialize in place without the need for extra copies. We realized that abomonation is efficient and suitable for our purpose. And we did some benchmarking.
Snappy is a fast compression C library from Google, designed for high-speed compression and decompression of data, and it is also presented as an FFI example in the Rust documentation. What do we compare? We compress and uncompress randomly generated data of different sizes, and we measure the execution time of each operation for the different serialization crates, bincode and abomonation. As I show in the demo here, when I did all this, I just used a sandbox macro. The sandbox macro ensures that this compress function runs completely in a different domain and will not affect the parent domain. The uncompress function is here as well. We tested with different numbers of bytes, and this is the execution time. What are our lessons learned? Of course, if the number of bytes is small, the in-process isolation approach clearly outperforms process-based isolation, because with in-process isolation you don't have so much overhead. But the interesting part starts later: we realized that even for modest-sized arguments, the context switch is no longer what dominates; the cost is dominated by the data serialization method you use. So our lesson learned three is that the data serialization method can significantly impact performance, and it is critical to optimize it for different cases. If you are working on or developing serialization crates, we can talk about how to improve them or how to fit our use cases. In summary, we introduced secure rewind and discard with isolated domains for Rust FFI. We have two goals. Firstly, we want to protect the integrity of Rust applications from memory safety violations in unsafe code. The main point I would like to highlight is that we are increasing Rust application availability, because if the unsafe portion of our application hits a memory safety violation, we still have the option to rewind back. And I provide a Rust FFI crate; it is open source if you would like to try it.
And what is our takeaway? The in-process isolation approach clearly outperforms process-based isolation, but the other important thing is that the data serialization method can significantly impact the performance. Thank you; now, if you have a question. Can you quickly explain how these domains actually work? How do you enter a domain, and how do you define what part of memory is part of the domain and what is outside the domain? Of course. It is actually handled underneath by my C library, which I wrote. On the Rust side, if you just use the sandbox macro, it will handle it automatically. But if you go into the details, for each domain I create a new stack and a new heap memory area. Earlier, there was some talk about allocators: you can specify allocators for a specific domain. Entering a new stack, what does it mean? Just change the stack pointer and continue execution there. So you do a stack switch and jump to an entry point? Yes. But the important point for doing this rewind and discard is that you should first save your execution context in a secure way. That is how we can recover. That is kind of like setjmp and longjmp style? Yes, setjmp and longjmp, but in a secure way. Now we have a guarantee that... Then you use some hardware mechanism to make sure that certain memory is only accessible from certain domains? Yes, exactly, that is completely true. This is an Intel feature; we are using that one. It is lightweight; that is why. Because you don't need system calls? Yes, exactly. You don't need a round trip into the kernel? Yes, exactly. You've got it now. First, thanks for the great talk. When deciding which piece of memory you put in the new domain, is the global state shared between the different domains, or do you copy all the global state? The current version just supports stack and heap memory. For global shared state, you should copy and pass it; it is not handled automatically, you would have to change your application.
But as future work, I would like to support this: how we can actually sync globally shared state between different domains. That could be very costly. Sharing and copying the global state could be very costly. Yes, exactly. For example, here also, even though I have improved the isolation, exchanging arguments between domains creates a lot of overhead. Yes, this is the bottleneck now. How can we actually improve this part? How can we pass a function argument from one domain to another domain? This is the actual cost. Second question. You copy back all the mutable arguments. Do you do that even if they are not changed, or do you do that all the time? If they are mutable, I just copy the arguments back. But you don't check if they have been changed by the function? If they are mutable, then you copy them back. Yes, exactly. So it is a static check and not a runtime check. Thanks. Thanks for your nice question. Awesome. Sorry, unfortunately that was all we had time for. Can we give another round of applause? Thank you to Merve. Thank you. Thank you.
The Four Horsemen of Bad Rust Code
Let me do a quick survey. Who has a JavaScript background? Okay, maybe like 10%. Who has a C background? C++? Holy hell. It's like 80%, for the people on stream. Who has a Python background? What are you, polyglots? What's going on? 70% or so. Any other languages? Just scream out. I heard something, it was something like... oh, but I can't really remember. Does anyone own this book? I found this book in my attic, and it was kind of peculiar because it had some arcane incantations in it and it looked like magic, but it certainly had something to do with Rust. And I was really excited. I was really enticed by this book. This is why I want to talk about that book. It was pretty old. There was one section in there which I really liked, and it was called the Four Horsemen of Bad Rust Code. This is what this talk is about. Before we get into what the Four Horsemen are, I would like to introduce myself. I'm Matthias. I live in Düsseldorf in Germany. I've been doing Rust since around 2015. I do Rust for a living as a consultant. I did a Rust YouTube channel a long, long time ago called Hello Rust. Only 10 episodes, but well, what can you do? And lately I started a podcast called Rust in Production. If you like what I say in this talk, maybe you also want to subscribe to the podcast later on. That's it for the advertisement; going back to the Four Horsemen. I thought about this title a lot. Why would you talk about bad Rust code? From my experience as a Rust consultant, I see patterns evolving over time. I see people doing the same things in Rust that they do in other languages. They repeat the same mistakes, and I saw that no one really talked about those problems. That is an issue when you come from a different language and you try to learn the rustic way, the idiomatic way, to write Rust code. This is what this talk is about. Let me present to you the antagonists. While I do that, try to picture yourself.
Imagine who you are and what you think your role would be in this talk. The first horseman is this. Actually, let me show all of them. The first one is ignorance. What is ignorance? Magical little term. We will get to that in the next slide. And we have excessive abstraction, premature optimization, and omission. Of course, you could add your own personal Rust horseman. These are just very subjective, but these are the things that I see in the real world. Now that we have introduced the antagonists, let's go through their anti-patterns and what they are famous for, one by one, starting with ignorance. The horseman behind this pattern is someone who uses stringly-typed APIs. You have seen it before. Someone uses a string where they could have used an enum, or they don't really embrace pattern matching. And that makes APIs brittle. You are in a situation where, if you refactor something, you run the risk of forgetting that you changed something, or maybe you make a typo and then your string is incorrect, and so it doesn't represent what you want to represent. They also freely mutate variables. They go and say, yeah, this is state and I can change it. Rust has the mut keyword for this, but they use it liberally across the entire code base, which makes reasoning on a local scope very, very hard. They also use bad or no error handling. We will get to that in a second. They use unwraps a lot, and they don't really think about the error conditions of their application. They also have a lack of architecture in their applications, and they use a general prototype style of writing Rust code. And where do they come from? Usually those are people who were administrators before, or they write shell scripts, or they come from other languages like scripting languages. And this is what they know. Nothing wrong with that, but they haven't fully embraced what Rust has to offer. How do you discover that you belong to this group in the code?
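As an aside, the stringly-typed pattern described here can be contrasted with an enum in a few lines. RoomType and the price factors below are made-up illustrations, not the talk's actual code:

```rust
// Stringly-typed: typos compile fine and fail silently at runtime.
fn price_factor_stringly(room_type: &str) -> f64 {
    match room_type {
        "single" => 1.0,
        "double" => 1.5,
        _ => 0.0, // a typo like "duoble" silently lands here
    }
}

#[derive(Debug, PartialEq)]
enum RoomType { Single, Double }

impl RoomType {
    // Parsing happens once, at the boundary; bad input surfaces immediately.
    fn parse(s: &str) -> Result<Self, String> {
        match s {
            "single" => Ok(RoomType::Single),
            "double" => Ok(RoomType::Double),
            other => Err(format!("unknown room type: {other}")),
        }
    }
    fn price_factor(&self) -> f64 {
        match self {
            RoomType::Single => 1.0,
            RoomType::Double => 1.5, // exhaustive: the compiler checks all variants
        }
    }
}

fn main() {
    assert_eq!(price_factor_stringly("duoble"), 0.0); // bug slips through
    assert!(RoomType::parse("duoble").is_err());      // bug caught at the boundary
    assert_eq!(RoomType::parse("double").unwrap().price_factor(), 1.5);
}
```

Renaming a variant later is a compile error at every match site, which is exactly the refactoring safety the string version lacks.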
Well, if you do things like this: you have highly imperative code. You go through the code and then you tell the program, hey, do this, do that, do this, do that, instead of using, for example, a declarative way of describing what the state should be. They also use magic return values like minus one or an empty string to represent a certain special value instead of using errors. Everything is a string. Unwrap is used freely. You clone all the things and you use the mut keyword. Why is cloning a bad thing? I don't think it is. But the problem with clone is that maybe you don't buy into the Rust model of ownership and borrowing. And that means that you bring what you learned in the past from other languages to Rust, and at some point you run into issues with your architecture which you cannot easily resolve anymore. And this is why clone is kind of a warning sign. It's not a stop sign, but it should make you think for a moment. It's an indicator of structural problems in your code, if you like. Okay. With that out of the way, let's make it a little more practical. How could we put this into practice and improve our code step by step? Imagine you wanted to calculate prices for different cities for a bunch of hotels that you have in these cities. For example, imagine this was a map. This is an actual map, by the way. Africa does not look like this. And also, Jerusalem is not the center of the world. I mean, we can debate about that, but certainly geographically there are some issues with this map. Imagine your input looked something like this. It's a CSV file. You get a hotel name, a city, a date, a room type, and a price. And you go through this file line by line and you try to parse it into something that looks like that. For Brussels, you have a minimum hotel price of 40 bucks, a mean price of 80, and a maximum price of 150. Fun fact, I arrived yesterday not having a hotel room, because I thought I had booked a hotel, but it was for last year.
So I was in the upper range here. Thanks, Walshbeng, by the way, for sharing your room with me. Otherwise, it would have been a nightmare. If you wanted to parse the input file and create a result like this, all you have to do is write this code. That's the entire code. Nothing really big going on here. There are some peculiarities, but this is usually what someone would write who would say Rust is not their first language. Maybe they just tried to port whatever they had in another language to Rust. This is code that I see them writing. What you do is you read the CSV file, then you create a hash map of cities, then you iterate over each hotel, you try to parse the data by splitting each line, you extract fields from it, you parse the price, and then you update the city. Updating the city happens somewhere in the lower end. At the end of it, you print the mean, the max, and the minimum. That's it. That's the entire code. You know, it's working. Technically, you could run this code and it will produce the result that you expect. Prices for different cities, we're done, right? Unless we think about the bigger picture and the demons and the monsters that are out there out in the ocean, and they can haunt us and bite us. There are dangerous beasts out there, killer animals. I think what you want to do is improve that code a little bit. How can we make this code a little more idiomatic? This is the same code. Now, let's look at some parts that I personally wouldn't want to have. Consider this block. There are some things going on, but overall, it's a very manual way, a very imperative way of going through the list of hotels. We literally have a couple of if conditions here. If price is smaller than city_data.0 and so on, we update the price, yada, yada, yada. There are patterns that make that a little nicer to read in Rust. This is the same code. It's just something very similar, but we kind of managed to shrink it down a little bit.
In comparison to what we had before, we get the city data and then we use some sort of tuple destructuring to get the mean, the minimum, and the max. That makes things a little easier. We can suddenly talk about mean instead of city_data.0, for example. But that's not the major problem with this code. There are unwraps in here too. Well, for a first prototype, that might work fine, but later on, maybe you don't want to have that. What if you cannot open the hotels CSV file? What if you cannot parse a price? In this case, the entire program just stops. A question of design, but I would say if there's a single line that is invalid, you probably don't want to stop the execution right away. Another problem is that we index into the memory right away. Who tells us that a line has that many entries, five entries? It might have three. It might have zero. Who knows? But if we index into something that doesn't exist, the program will panic, and that is kind of a bad thing. The underscores mean that the variables are not used, so we can remove them. We have a little bit of a cleaner structure, and a simple way to check that a line is valid would be to just have this manual check in there. I know it's not very sophisticated, but it helps us along the way. Now we check if the hotel data length is five, and if it is not, we just skip the entry. Let's look at parsing for a second. How do we want to handle parsing? I said that maybe we don't want to stop the execution when we run into an issue, and we can do that in Rust by matching on the parse result. A very simple way to do that would be to say match price.parse(), and if we have an Ok value, we take it, and if we have an error, we don't really care about the error. We just print an error on standard error and then we continue with the rest of the parsing. Looking at the input, one thing we can do as well is apply a similar pattern and introduce a result type. Now we use a Box for representing a result type.
This is because you don't need any external library to have a result type whose error type can be literally anything. It can be a string, anything that implements Error, the Error trait. In this case, it's a very simple way to improve your Rust code. It's a good first step. What we do instead now is we say read_to_string, and then we map the error, in case we have an error, to something that a user could understand and act on. Then, yeah, the code is already a little cleaner. We handled a few error cases already, and this is something that might pass a first iteration of a review cycle. Now of course there are certain other issues with this code. For example, CSV handling. CSV is tricky. Proper handling of delimiters is very hard. For example, you might have an entry which has semicolons like on the left side here, or you have something that has quotes around a semicolon, and you probably want to handle that. So a simple string split does not suffice. Same with encodings. On what platform are we operating? Do we know the encoding right away? Does the CSV file contain headlines or no headlines? And there are many, many caveats like that. If you're interested, there's a talk called Stop Using CSV. I don't say you should stop using CSV, but I say you should start watching this talk, because it's really good. Right. How can we introduce types? I talked about types a lot, and Rust is great with types. We should use more of them. Here's a simple way. I already talked about the result type, and in the first line we just create an alias for our result, and we say it's anything that has a T, where T is generic, and the error type is of type Box<dyn std::error::Error>. And then we can use the result in our code to make it a little easier to read. As well, we introduce a Hotel struct, and we have a couple of fields, just strings and floating points at this point. But this helps us make the code a little more idiomatic already. We will combine those things on the next slides.
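Putting the pieces just described together, a sketch of the alias, the struct, and the match-on-parse error handling might look like this. The field names and the semicolon delimiter follow the talk's CSV example, but parse_line itself is a stand-in for the slide's code, not a copy of it:

```rust
use std::error::Error;

// The alias from the talk: any error type that implements Error, boxed.
type Result<T> = std::result::Result<T, Box<dyn Error>>;

#[derive(Debug)]
struct Hotel {
    name: String,
    city: String,
    price: f64,
}

// Parse one CSV-ish line; a bad line becomes an Err the caller can skip,
// not a panic that stops the whole program.
fn parse_line(line: &str) -> Result<Hotel> {
    let fields: Vec<&str> = line.split(';').collect();
    if fields.len() != 5 {
        return Err(format!("expected 5 fields, got {}", fields.len()).into());
    }
    // Matching on parse() instead of calling unwrap().
    let price = match fields[4].parse::<f64>() {
        Ok(p) => p,
        Err(e) => return Err(format!("bad price {:?}: {e}", fields[4]).into()),
    };
    Ok(Hotel { name: fields[0].to_string(), city: fields[1].to_string(), price })
}

fn main() {
    let ok = parse_line("Grand;Brussels;2024-02-03;single;80.0").unwrap();
    assert_eq!(ok.city, "Brussels");
    assert!(parse_line("Grand;Brussels;2024-02-03;single;oops").is_err());
    assert!(parse_line("too;short").is_err());
}
```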
But first let's look at the CSV parsing. There's a CSV crate. I advise you to use it. It's pretty solid. And what you can do is you create a builder, and the builder pattern allows you to modify a struct and add members or modify members dynamically. In this case we decide that our CSV file has no headers and the delimiter is a semicolon. And the way you can use it is like this. You now say for hotel in hotels.deserialize(). No more string splitting. And now we match on the hotel, because this returns a result. And now we need to make sure that the hotel that we parse is in fact correct. And after this step we don't have to deal with edge cases anymore, because we know that the struct is valid. That means it has the required number of fields, and the prices are also floats. Which is great, and makes the code much more readable already. And it was very simple to do so. Now I want to quickly talk about this part. There's a cities hash map. It has a string, which is the city name, and then it has three floats, which are the mean, the min, and the max price. I don't think this is particularly idiomatic. The way it was used before was something like this. And we kind of managed to work our way around it. But a better way, I would say, would be to introduce a type for this as well. Because if we're talking about prices, and pricing seems to be something that is very central to what we do in this application, maybe we should have a notion of a price. It's very simple to do that. You just introduce a Price type. Now you might be confused why we suddenly don't have a mean anymore, but instead we have a sum and a count. The reason is that when we parse the files we update the sum, and later on, at the end, we can calculate the mean. This has some mathematical properties which are favorable, because now we don't run into rounding issues anymore. This is an aggregation that we can do whenever we want, to get kind of a mean on the fly.
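The Price type with a sum and a count, as described, could be sketched like this, together with the Default and Display parts the talk also mentions. The exact field names and the printed format are assumptions:

```rust
use std::fmt;

// The Price aggregate: min/max plus sum-and-count, so the mean can be
// computed on demand at the end instead of being updated incrementally.
struct Price { min: f64, max: f64, sum: f64, count: u64 }

impl Default for Price {
    fn default() -> Self {
        // min starts at the maximum float so any real price replaces it.
        Price { min: f64::MAX, max: 0.0, sum: 0.0, count: 0 }
    }
}

impl Price {
    fn add(&mut self, p: f64) {
        self.min = self.min.min(p);
        self.max = self.max.max(p);
        self.sum += p;
        self.count += 1;
    }
    fn mean(&self) -> f64 {
        if self.count == 0 { 0.0 } else { self.sum / self.count as f64 }
    }
}

impl fmt::Display for Price {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // min / mean / max, the structure the talk proposes for printing.
        write!(f, "{:.2}/{:.2}/{:.2}", self.min, self.mean(), self.max)
    }
}

fn main() {
    let mut p = Price::default();
    for v in [40.0, 80.0, 150.0] { p.add(v); }
    assert_eq!(p.to_string(), "40.00/90.00/150.00");
}
```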
And at the same time we have a default. Now the Default is not really idiomatic either, I would say, but the great part about it is that we can later reuse it and make our code a little more readable. In this case we set the min price to the maximum float value. But then whenever we introduce a new price it will overwrite that maximum, because I guess by definition it's smaller than the maximum, or smaller or equal. And the max, and the sum and count, are set to zero to begin with. And just before we bring it all together, here's one more thing that we should do, which is have a notion of a Display for Price. In this case we implement the Display trait, and we say, yeah, if ever you want to print a price, this is the structure that you should use: the min, the mean, and the max. And this way we can make our code way more readable. Now you can see that instead of using a tuple of floats here, we use a Price. And when we update the prices we can talk about this object. We can tell the object, hey, update your min, for example. Here we say price.min.min(...); min holds a price, and we automatically get the min price as well. We update those price fields, and yeah, we can even introduce a price.add method. I don't show it here, but technically, why not? We can add a new hotel price. Prices could be added over time. Now that depends on, I guess, your taste, your flavor of Rust. This is the entire code. It's a little longer, but you saw all the parts. And now you have something that I would say is in a workable state. It's not great, but we did one thing. We considered Rust. We fought the ignorance. We started to embrace the Rust type system. We started to lean into ownership and borrowing, which are fundamental concepts in Rust. We leaned into design patterns, and we learned how to improve our architecture. And I would also say, if you want to improve this part, try to learn a different programming paradigm. Rust is not the only language. Try Roc, or try a functional language like Haskell.
It might make you a better Rust programmer too. This is how you fight ignorance. Now, if you see that none of these horsemen fits you, by the way, just think of your colleagues and how you would want to introduce them to Rust, because this is the code you will have to review and probably also maintain in the future. So it's time well invested. If you want to learn more about idiomatic Rust specifically, there is a website. I just put it there. It's an open source repository. It has some resources. This is a rendered version of it. You can sort by difficulty, so that's your experience, and then you can sort by interactivity, if you want to have a workshop or not. For example, there are free resources on there, and paid resources too. Right, let's go on and look at the next horseman: excessive abstraction. Everyone in this audience knows someone like that. They try to over-engineer solutions, because Rust leans into that. It allows you to do that. It's a nice language for writing abstractions. Everyone likes to do that. But then you add layers of indirection that maybe people don't necessarily understand if they come from a different background. They use traits excessively, and generics, and lifetimes, and all of these concepts are great in isolation, but the combination of them makes programs hard to read and understand for newcomers. Now, if you find yourself in this camp, try to fight this as well. Common symptoms are things like this, where you have a file builder which takes a T: AsRef<str> and a lifetime 'a, and this makes sure that you can pass any type and that there are no allocations that are not visible, because of the lifetimes. So this might be fast, and it might also, to some extent, be idiomatic, but it is something that your colleagues also have to understand. Another thing is: I might use this again, let's make it generic. Or: traits everywhere. And how do you get to that mindset? It's very simple. After you wrote your CSV parser, it's natural that you want other parsers too.
Of course you want JSON. Of course you want to read and write to a database. You start thinking that you'll need all of those formats at some point, and this is the part that is important: at some point. And then you end up with something like this. It's a trait definition for a hotel reader, and it has a single method called read, and it takes a self, that's why it's a method, but it also takes a reader which implements the Read trait. That means you can pass anything that implements the Read trait, and it returns a Box of Iterator with Item = Result<Hotel> and a lifetime of 'a. No allocations except for the Box, but the iterator itself is a very idiomatic way to say a result of Hotel, so parsing errors are considered, and it's very applicable for all of the reader types that you could possibly want. Let's say you wanted to use that trait and implement it for our hotel reader. Now suddenly we blow up the code into something that is harder to understand. Or, if it is easy for you to understand, please reconsider whether your abstractions are too much. Maybe you ain't gonna need it. Right. So we have a hotel reader, and it owns a ReaderBuilder, and inside of our new method we initialize the CSV hotel reader, and we implement HotelReader down here, the single method called read. And we say self.reader_builder, this is the code that we saw before, we just put it here, this is our CSV parser, the initialization of it, and then we return the reader's deserialize iterator, and this is where we map the errors. Right. Does it look great? I don't know, depends. Someone's nodding. We need to talk. But it's certainly nice to use, I guess. Now we can say for hotel in hotels.read_file(). Should hotels know about files? Maybe not. But it's great if you go one step further and implement Iterator on it, and now you can say for hotel in hotels. All right, we're getting somewhere. From a user's perspective that is really great. But remember, we're talking about application code.
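A rough std-only reconstruction of that trait shape, without the CSV crate and with a simplified two-field Hotel, might look like the following. The names, the String error type, and the line format are assumptions, not the slide's code:

```rust
use std::io::{BufRead, BufReader, Read};

#[derive(Debug)]
struct Hotel { name: String, price: f64 }

type Result<T> = std::result::Result<T, String>;

// The trait shape from the talk: read() accepts anything that implements
// Read and returns a boxed iterator of parse results.
trait HotelReader {
    fn read<'a, R: Read + 'a>(&self, input: R)
        -> Box<dyn Iterator<Item = Result<Hotel>> + 'a>;
}

struct CsvHotelReader;

impl HotelReader for CsvHotelReader {
    fn read<'a, R: Read + 'a>(&self, input: R)
        -> Box<dyn Iterator<Item = Result<Hotel>> + 'a>
    {
        Box::new(BufReader::new(input).lines().map(|line| {
            let line = line.map_err(|e| e.to_string())?;
            let mut it = line.split(';');
            let name = it.next().ok_or("missing name")?.to_string();
            let price = it.next().ok_or("missing price")?
                .parse::<f64>().map_err(|e| e.to_string())?;
            Ok(Hotel { name, price })
        }))
    }
}

fn main() {
    let data = "Grand;80.0\nPlaza;oops\n";
    let results: Vec<_> = CsvHotelReader.read(data.as_bytes()).collect();
    assert_eq!(results.len(), 2);
    assert!(results[0].is_ok());
    assert!(results[1].is_err());
}
```

Even in this shrunken form, the signature is the part a newcomer has to decode, which is the talk's point about the cost of the abstraction.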
This is probably code that you earn money with. It's not a library function that is used by thousands of people. It's your simple CSV parser, and now we just blew it up into something that is harder to understand. Do you really need this? Well, I don't think so. I don't know what this person on the bull does, but it certainly looks confusing to me, and this is what people think when they see the top signature. I know, kind of, you wanted to optimize it a bit, but at what cost? Right, whenever you sit here and you think, oh, I should implement JSON support, and you don't do it for fun, start thinking about whether you really need those abstractions, because they can haunt you. Most of the time there is no need for them. I don't know what sort of animal this is, is it a lion, a cat or something, but it's kind of strapped to a cannon and it doesn't look too happy to me. I don't want this. Probably you're not going to need it. As a side note, another thing you probably shouldn't do too often is macros. There are crates out there that excessively use macros. What do I mean by macros? macro_rules!, but also derive macros. These are great, but they come at a cost, and the cost could be compile times. Just yesterday I talked to Daniel Kerkman, who, I don't know, is he here? He's not here. But thanks for the tip. He has a situation at work where compile times just blow up because of macros, and for you it might be easy to write, but for other people it might be hard to use. Maybe you want to prefer traits over macros if you can. That was the second horseman, fighting excessive abstraction. How can it be done? If you find yourself in this situation, keep it simple. Avoid unnecessary complexity. Just think that the person that will maintain the code is not a mass murderer but your best friend. Do you treat friends like this? Watch newcomers use your code. That can be humbling. Ensure that abstractions add value. Yes, you can add a layer of abstraction, but does it add value? That's up to you.
Decide, and don't add abstractions that you might need in the future. Add them when you need them. Right. Two off the list, we have two more to go. The next one is premature optimization. This is for a lot of people in here, because you are C and C++ programmers. I'm looking at you right now, because 90% of you raised your hand. I see a lot of people from C and C++ come to Rust with this mindset, with these patterns. What are the patterns? They optimize before it's necessary. This is importantly different from adding too many layers of abstraction. Optimization in this case means profiling is not done, but instead you kind of try to outsmart the compiler, and you think about performance optimizations way too early, before you even need them. Did I even tell you how big that CSV file was in the beginning? How many entries does it have? You don't know. Maybe you should not optimize for it right away. They use complex data structures where simple ones would suffice. For example, we saw the hash map with the three tuple elements. These are things that kind of unravel, and then it ends up being a mess, not very idiomatic, and arguably not even faster. And they also have a tendency to neglect benchmarks. Some red flags, quotes you might have heard: Without a lifetime this needs to be cloned. Ignore that. If you know that you have a performance problem, then you can think about lifetimes. It's fine to clone. Let me help the compiler here. The Box is so much overhead. I use BTreeMap because it's faster than HashMap. No need to measure, I've got years of experience. They love the term zero-cost abstraction, or zero copy. Actually, it should be zero cost in here. And they hate allocations. Whenever they look at an allocation they feel terrified, and they bend over backwards to make the program faster. Whether this is the developer or the compiler, and vice versa, is up to you. I've been in both situations.
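Before the flame graphs, here is what "measure, don't guess" can look like with nothing but std::time::Instant. A real project would reach for criterion or Divan instead; the two competing functions below are invented purely for illustration:

```rust
use std::time::Instant;

// Two ways to count the digits in a batch of lines: a clone-happy
// version and a borrow-based one. Instead of arguing, we time both.
fn sum_cloned(lines: &[String]) -> u64 {
    lines.iter().cloned() // clones every String
        .map(|s| s.bytes().filter(|b| b.is_ascii_digit()).count() as u64)
        .sum()
}

fn sum_borrowed(lines: &[String]) -> u64 {
    lines.iter() // borrows, no clones
        .map(|s| s.bytes().filter(|b| b.is_ascii_digit()).count() as u64)
        .sum()
}

// A tiny timing helper: run a closure once and print the elapsed time.
fn time_it<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

fn main() {
    let lines: Vec<String> = (0..100_000).map(|i| format!("hotel-{i};40.0")).collect();
    let a = time_it("cloned  ", || sum_cloned(&lines));
    let b = time_it("borrowed", || sum_borrowed(&lines));
    assert_eq!(a, b); // same answer; only the cost differs
}
```

And remember to compile with --release when timing; debug-build numbers mean very little.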
They turn a completely simple hotel struct with a couple of string fields, which are owned, yes, they live on the heap, into something that lives on the stack and has a lifetime. And every time you use a Hotel you have to carry the weight of the lifetime. Well, does it matter for this one particular case? Probably not. But then you look at other places of the code base, and you see that they kind of reverted your changes. They took what you introduced, your hard-won knowledge about the abstractions, and they took it away. Now we start to index into our data structure again. We use string split again. We go backwards. We've been there before. It is super fragile. Again, we are going backwards. Now let me play a little game here. Since there are so many C and C++ programmers in here, I expect you to answer this. What is the bottleneck? This is a very famous medieval game, Who Wants to Be a Millionaire. What is the bottleneck? Is it A, CSV parsing, the deserialization of our entries? Is it B, string object creation after we deserialized it and put it into a Hotel struct? Is that the bottleneck? Is it C, floating-point operations when we parse the price? Or is it D, hash map access? Who's for A? Some shy hands? Don't be shy. Who's for B? Okay. Nice. Who's for C? No one. And who's for D? The hash map. Nice. The correct answer is: you forgot to run with --release. How do you find the actual performance improvements? There's just one correct answer, and it is: measure. Profile. Use the tools. cargo flamegraph, cool thing, you will see that in a second. Use benchmarks. There's criterion. Is Nick still in the room? Nikolai? No. His benchmarking tool, Divan. Pretty great. Use it. Okay. I will give you one example. Let's look at a flame graph of our initial program, the one that a junior developer could write in two hours. What is the bottleneck? There is no bottleneck. This is the setup of our flame graph itself. This is the profiler setup. The code itself is negligible. Negligible, I guess.
And why is that? Again, because I didn't tell you how big the file was. Do you think I can come up with thousands of entries for hotels? No. So I added 100 entries. There is no bottleneck here. Okay. You might say, okay, but what if the file grows? Let's add a million entries. Okay. Oh, this is still 120 records. So let's add more. This is a million. You probably can't read it. Let's increase it to 10 million. And indeed, deserialization of the struct takes most of our time. Okay. If we look a little closer, it says serde deserialize, deserialize struct. Okay. We have some memory movement going on. Let's take a baseline. That is our baseline. This is what it takes: 34 seconds. Okay. Now, let's say we kind of want to prove our C and C++ developer wrong. Does this added abstraction for the Hotel struct really add that much overhead? No. It's the same. It's like 34 seconds still. Oh, actually, this is the part where we remove the unnecessary fields. But we can go further. We can say, yeah, here we have a little safer version. We don't index, but we use nth. And we have 32 seconds. Now our bottleneck is append string. String appending. Okay. I think there's something that we can fix. Well, okay. Maybe this is not really that readable, but what we do is we split now by a string, and instead of doing an allocation where we append to our string over and over again, we use this pattern matching here. And this reduces the runtime by 30% already, because we save on allocations. Now, if we try to profile this code again, where's the bottleneck now? read_until. Okay. What is that about? We have a lot of memory movement going on, and now we reach a point where the disk becomes the bottleneck. We can use an mmap for this. Now, remember, we are talking about performance, and maybe you should not do those optimizations, but we want to prove the C and C++ programmers and their intuition wrong. And then you see that the bottleneck might be solved elsewhere.
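The exact string-splitting change from the slides isn't reproduced here, but the same "save on allocations in the hot loop" idea can be shown with std alone: reusing one line buffer instead of allocating a fresh String per line. The function names are invented:

```rust
use std::io::{BufRead, BufReader, Read};

// Allocates a fresh String for every line.
fn count_lines_alloc<R: Read>(r: R) -> usize {
    BufReader::new(r).lines().count()
}

// Reuses one buffer across all lines: same result, far fewer allocations.
fn count_lines_reuse<R: Read>(r: R) -> usize {
    let mut reader = BufReader::new(r);
    let mut buf = String::new();
    let mut n = 0;
    while reader.read_line(&mut buf).expect("read failed") > 0 {
        n += 1;
        buf.clear(); // keep the capacity, drop the contents
    }
    n
}

fn main() {
    let data = "a;1\nb;2\nc;3\n";
    assert_eq!(count_lines_alloc(data.as_bytes()), 3);
    assert_eq!(count_lines_reuse(data.as_bytes()), 3);
}
```

Whether a change like this is worth the readability cost is exactly what the profile, not intuition, should decide.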
Now we are at 30 seconds by changing like four or five lines from the entire program, not the entire thing. We can keep using our abstractions. That's the main point. Here we use an mmap. That's a memory map in the kernel. We save on allocations. 30 seconds. Okay. What if we wanted to do more? It's hard to read, but now we reach the point where, in fact, the hash map is the bottleneck. And one more step to improve the code would be to split it up into multiple chunks. You can use rayon. You can now finally use a better hash map implementation. And we are down to 3.5 seconds. And we did that not by guessing, but by profiling. Now if we run a profile, it looks different again. Very different. These are the individual chunks that we managed to split up. We went from 40 seconds to three or four seconds in a couple of slides and with few changes. And the point is: don't guess, measure. This is the worst habit that C developers bring into Rust. They think everything is a performance overhead. And if this challenge, by the way, looked very similar to the One Billion Row Challenge, that's because it was inspired by it. And it is very similar. Read up on it. It's kind of fun. We did something similar for hotel data. But the more important point here is: how can we fight premature optimization? Measure, don't guess. Focus on algorithms and data structures, not micro-optimizations. More often than not, if you change from a vector to a hash map, this will be way, way more efficient than if you remove your little struct and add lifetimes everywhere. You can get carried away pretty quickly, and Rust encourages you to do so, but it also has the tooling to fight it. Be more pragmatic. Focus on readability and maintainability first and foremost. Use profiling tools to make informed decisions. You covered all of that. Your code is idiomatic. It is fast. You didn't overdo it. What is missing? Well, the entire rest. Do you have tests? Do you have documentation?
Is your API too large? Does your code lack modularity and encapsulation? These are things that I see from people that are like the lone-wolf coders. They know all about Rust, but what they are not really good at is the rest: explaining the differences to their code maintainers, and writing documentation. Not about the how, but the what. What does your program do? Some things they say: It compiles, my work is done here. The code is the documentation. Let's just make it all pub. I'll refactor that later, which never happens. Let's look at that code again. This is our first version, junior programmer, three hours. Okay. How do we test that? It's kind of impossible, because this is one big binary, one main. How would we test that? Well, I guess the question is, what do we want to test? Well, first off, I would say let's add a test for parsing. The entire thing can be a very simple, crude test. But if we refactor it such that we have a function that parses cities, now we can start to introduce a path here and do the parsing. And this is where the parsing logic is, by the way. We split it up into a main and the parse_cities function. Great. This is our first test. Very crude, but we get to a point where suddenly we can test our changes. We create a temporary directory, we have a path, and then we write into a file, and that's it. The parsing is done. Great. If we wanted to make it a little better, instead of passing in a path, we pass in something that implements Read. Now we don't need to create files like here. Instead, we can have our input as a binary blob. And these are simple things. Add some documentation, add some tests. It's not that hard. And in order to fight omission, what you need to do is write more documentation, write unit tests, use tools like Clippy and cargo-udeps, set up CI/CD so that you can handle your changes, create releases, use release-please, Marco, greetings go out to you, and keep a changelog of what you changed. Right.
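The refactoring step described here, taking impl Read instead of a path so that tests can feed in a byte slice and never touch the filesystem, can be sketched like this. parse_cities is the talk's function name, but the body is a simplified reconstruction:

```rust
use std::collections::HashMap;
use std::io::{BufRead, BufReader, Read};

// Taking `impl Read` instead of a path: production passes a File,
// tests pass `&[u8]`, and no temporary directory is needed.
fn parse_cities(input: impl Read) -> HashMap<String, Vec<f64>> {
    let mut cities: HashMap<String, Vec<f64>> = HashMap::new();
    for line in BufReader::new(input).lines().map_while(|l| l.ok()) {
        let fields: Vec<&str> = line.split(';').collect();
        if fields.len() != 5 { continue; } // skip invalid lines
        if let Ok(price) = fields[4].parse::<f64>() {
            cities.entry(fields[1].to_string()).or_default().push(price);
        }
    }
    cities
}

fn main() {
    // The "binary blob" test input from the talk: no files involved.
    let csv = "Grand;Brussels;2024-02-03;single;80.0\n\
               Plaza;Brussels;2024-02-03;double;150.0\n\
               broken line\n";
    let cities = parse_cities(csv.as_bytes());
    assert_eq!(cities["Brussels"], vec![80.0, 150.0]);
    assert_eq!(cities.len(), 1);
}
```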
We're getting towards the end. We have seen the anti-patterns. You know them now. I hope that you will be able to, you know, spot them in your code. If you want to learn more, there are some other talks that were given here at FOSDEM and other places. You might want to check them out. Maybe I can put the slides somewhere. And that is all I have to say. Thank you. Thank you.
WASM 101: porting a Sega Game Gear emulator to the browser
So we have Anis Astier, who is going to tell us about Wasm 101, which is very nice. Thank you very much. Thank you. Thank you. A quick presentation: my name is Anis. This is not my first talk, but it is my first time here in the Rust devroom. You can find my social media here. Follow me if you want. I've been learning Rust for five years on and off. I wanted a bigger project to learn a bit more about Rust, so I said, why not write an emulator, and I started this project. This is a Game Gear emulator. The Game Gear is this small device. I don't know if you've ever heard of the Game Gear. So yeah, it's a Sega handheld from the 1990s. So this is the name of my emulator: Gears, as you can see. It's written in Rust. It depends only on the standard library. It has a native UI. This is how it looks. It works. After I developed this native UI, I thought maybe I should port it to the web. To do that, I would need to use WebAssembly. So quick show of hands. Who here has never heard of WebAssembly? Interesting. Who here has heard of WebAssembly but never used it? Who here has heard of WebAssembly, used it, and developed things with it? Oh, many people. Okay, quite interesting. So WebAssembly is kind of a new platform. You can think of it as a new platform, a new target to port code to. It defines a text and a bytecode format. It's a take on Java's "compile once, run anywhere", whatever your system. It works in the browser, where it's as secure as JavaScript: it's sandboxed. It also has many other use cases. You can run it on servers; it has many use cases. So I want to port my emulator. So there's this first level, which is: how do I build my code? How do I compile it? So let's go through this journey. How do you compile WebAssembly? I assume you know about Rust, but if you don't, usually you install Rust with this tool called rustup. You need to add a new target with rustup. 
Then you also need this tool called wasm-bindgen, which bridges your WebAssembly code with the JavaScript world and generates some glue. So you use rustup, you build your code with the new target, and then you use wasm-bindgen to generate a directory with JavaScript. You serve that with an HTTP server, and that's how it works. You don't have to use wasm-bindgen directly. You can use tools that integrate wasm-bindgen and call it for you. There are many such tools; I have selected a few. wasm-server-runner, which comes from the Bevy community. You have cargo-run-wasm. You have Trunk, which is even higher level, and wasm-pack, which is from the rustwasm project. I won't go into the details. You can find the commands on how to run them here. I did a quick comparison of those tools, from, let's say, the lowest-level tools to the highest-level ones. wasm-bindgen: everyone uses it; it's like the reference tool. Then you have slightly higher-level tools, and then more opinionated tools like wasm-pack and Trunk. wasm-pack will generally be used to generate libraries that you can use from the JavaScript world, for example with NPM. Trunk will integrate even more things, like compressing your HTML assets and things like that. Now you know how to build. How do you run the code? You usually write a binary: you have a main function, and the entry point of your main is how it works. Or you can build a library, and then you usually annotate your entry point with the wasm_bindgen(start) macro and say, okay, this function is my entry point, start executing from here. We know how to compile. Let's continue porting our application and go to the second level of porting the emulator. For this emulator I've written, Gears, for the desktop UI I only selected dependencies that work with WebAssembly. They were all WebAssembly-capable; they work with the web platform. I have pixels, winit, cpal, and gilrs, which is for gamepads. We'll go deeper into that. They all support WebAssembly. 
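The build pipeline described above can be sketched as a few shell commands. This is a minimal sketch; the crate name `my-emulator` and the `./pkg` output directory are placeholders, not the talk's actual project:

```shell
# Add the WebAssembly target to the Rust toolchain.
rustup target add wasm32-unknown-unknown

# Install the wasm-bindgen CLI that generates the JavaScript glue.
cargo install wasm-bindgen-cli

# Build the crate for the wasm target.
cargo build --release --target wasm32-unknown-unknown

# Generate the JavaScript bindings into a web-ready directory.
wasm-bindgen --target web --out-dir ./pkg \
    target/wasm32-unknown-unknown/release/my-emulator.wasm

# Serve ./pkg (plus an index.html importing the generated JS) with any
# static HTTP server and open it in a browser.
```

Tools like trunk or wasm-pack wrap most of these steps into a single command.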
How hard can it be? It should be very simple. Well, it depends. For pixels and winit: pixels is a library that gives you a frame buffer, basically a frame-buffer library that's GPU-accelerated. You write pixels at coordinates and it renders them with wgpu. Pixels uses wgpu, which is another crate, to do the rendering. In order to work on the web, you need to enable the WebGL feature of wgpu. In the future, it will also use WebGPU, but that's another subject. The initialization of pixels is also different, because it uses winit, and winit needs to be initialized differently if you want to render your UI in a canvas in the browser. Last but not least, the initialization of wgpu is async. In my emulator I had never used Rust async, so I needed to add that. I used wasm-bindgen-futures to bridge the async world from Rust to JavaScript promises. To port the audio part, I used the cpal crate, which also works on the web. This is the reference crate to play audio. It needs a web feature enabled as well. There were also some challenges, because natively the audio started directly, and in a browser you can't start playing audio directly. That's actually a good thing, because it means nobody can play audio in your browser without interaction: the user has to have wanted this action. Another issue I had with the standard library is that I used mpsc channels, and they don't work on the web platform. So I wrote a quick channel myself, because mine was simple enough. There are other channel crates that work on the web platform, but I preferred to implement something with no extra dependencies. For time: usually, for synchronization in an emulator, you need to know the current time. Just like with the channels, the standard library's time APIs are not available on the web platform. So there are crates that do the bridge. I used the instant crate. You can also use web-time, which also works. 
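That time-API bridge can be written with cfg attributes. A minimal sketch, assuming the `instant` crate named in the talk (the helper function is illustrative, not the emulator's code):

```rust
// On wasm32, std::time::Instant is unavailable, so pull in the `instant`
// crate (which wraps performance.now() in the browser); everywhere else,
// use the standard library type under the same name.
#[cfg(target_arch = "wasm32")]
use instant::Instant;
#[cfg(not(target_arch = "wasm32"))]
use std::time::Instant;

// The rest of the code can now use `Instant` unconditionally.
fn frame_elapsed_ms(start: Instant) -> u128 {
    start.elapsed().as_millis()
}
```

The `web-time` crate mentioned in the talk works the same way and can be swapped in for `instant`.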
This is the code: if it's for the wasm32 target, use instant; if not, use the standard library import. For gilrs, which was very nice, there was no action needed in order to support the browser. Everything worked out of the box, except that the gamepad API on browsers is not as mature as on native, so there is some rebinding to do. There are good reasons for that: for example, browsers don't want you to be able to fingerprint someone with the gamepad API. But then it means the bindings are not mature enough. Not the bindings, but the key bindings, which is something else. And then during porting, I also had bugs that were entirely my fault. I used usize too much, mostly because I like to index slices; that's what you need to do for slices. wasm32, as it says in the name, is a 32-bit platform, so I had overflows when multiplications and additions grew bigger than 32 bits. All of these were caught because in my Cargo project I had overflow checks enabled. And yeah, it worked well: I just replaced usize with u64 where needed. And that's it. So let's take a quick break and go through a demo of what it looks like. So, just for FOSDEM, brought to you here, which is this one; I will lend it to you for a few minutes, it's a FOSDEM exclusive. I recommend you play this demo not necessarily on mobile (it will work, but you won't be able to control it), so maybe more on a desktop browser, or anything that has a keyboard or gamepad controller. So I'll give you a bit more time to load it. It might not work for you if you don't have WebGL enabled in your browser, but otherwise it should, if you have Firefox or Chrome. Here's how it looks. So I've loaded the page, I press play, and basically the emulator starts. If you have audio, it will play audio. And yeah, this is what you should see. Okay, yeah, it works. I can play it. 
Who here successfully ran the demo? Just a quick show of hands, who managed to run it? Okay, thanks. Okay, let's continue. So we have this port. It mostly worked; I showed you, it worked. There were a few tricks I picked up along the way. They're not mandatory, but let's see what we have here. First thing: if you're used to debugging like me with println, printing to the terminal, it probably won't work as-is in the browser, so you want to use the web console. There's this console_log crate, which does the binding to the console. If you use the log crate, it's really well integrated, with log levels and things like that. I also recommend that you use the console_error_panic_hook crate. This one helps show when your program crashes; for example, I showed you the overflow checks, which can panic. It will show you the panic in the console. That's how you register the console panic hook. Another trick I picked up along the way is the cargo config. For the demo I showed you, there's a bit of a problem with some interactions. Some APIs I use directly from Rust, through the web-sys crate which allows accessing them, are considered unstable, and you need to add an environment variable when you build, which is a bit annoying to add every time. You can add these rustflags directly in your .cargo/config.toml. This way you can build with cargo build and it will work. Another trick: if you're used to VS Code or integrated development environments, you're probably using rust-analyzer. If you have code that works on multiple platforms like me, native plus WebAssembly, you probably want to tell rust-analyzer to build for two different architectures. This way you have completion for the WebAssembly part too. This is done as well in the .cargo/config.toml, by specifying multiple build targets. When you build, you will have multiple build targets. 
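A `.cargo/config.toml` combining both tricks might look like this. This is a sketch, assuming a Linux host target; the `web_sys_unstable_apis` cfg is what web-sys's unstable APIs require:

```toml
[build]
# Build (and let rust-analyzer check) both the native and the wasm target.
target = ["x86_64-unknown-linux-gnu", "wasm32-unknown-unknown"]

[target.wasm32-unknown-unknown]
# Opt in to web-sys's unstable APIs without exporting RUSTFLAGS every build.
rustflags = ["--cfg=web_sys_unstable_apis"]
```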
There are some drawbacks to that. It won't work in a workspace member; it must be at the root of your workspace. It also means that when you use cargo run, since you have multiple targets, cargo run will say: no, you have to pick one target in order to run, which makes sense, but can be a bit annoying. So, what did I think of this experience of porting this emulator? What's my feedback? I would say that in general it's very easy to port standalone code to WebAssembly if you're using Rust. I did not change anything in my app's architecture. The total port took a few hours over a few days. As I told you, I did custom code for initialization and for DOM interaction, which is the demo you've seen. To go a bit further, what I won't talk about in this talk is how to build a web UI, for example. You probably want to use Yew or Leptos, because I don't recommend accessing DOM APIs directly. It's very ugly, not really ergonomic. I did it so you don't have to try. Those library developers do a great job there. I didn't try building a complete UI; as you saw, nothing is configurable, etc. I'm thinking of building a UI with Slint or egui, but I'm not really satisfied with the current status of font rendering. I know it's something that's being worked on. Likewise, minification and wasm binary size: that's not web-specific, and there are many Rust tutorials you can find on it. And I didn't do any performance measurements. I can tell you that it works; it also works on native. But I don't have any special feedback on that. That's it for my presentation. Thank you. We have a question. Yes, I have a question. When you build websites today, they have to be responsive. You use media queries in CSS style sheets to adapt to different kinds of resolutions, so that on mobile, tablet or desktop it still looks nice. 
Can you also do this in WebAssembly, so that you would say: if I run the game in portrait or landscape mode, or if I do it on a bigger screen, it takes care of the resolution? Will it also scale the graphics accordingly? There are multiple aspects to that. If you're building a web UI, you probably do that with CSS. If you use Leptos or Yew, you will be able to generate HTML, whether on the server or on the client. Then it's basically the same thing as web development: you have CSS, and you can style this HTML directly. For this demo, it's an emulator, so it's a bit specific, especially because it's a full-screen application. It basically takes the whole width of your screen, and that's it. That's how it works on mobile and tablets and desktops. But note that you can combine those: you can also do something in JavaScript or CSS. You can do that. You can find tutorials in the Rust and WebAssembly book; you can look at the guide from the rustwasm project, which is at this URL. You can find information there on how to bridge the two worlds. If you decide to use a crate, as I recommend, like Yew or Leptos, they also have a lot of documentation on how to do that. I understand. Maybe a general question: why did you choose Rust? Did you also consider programming in C++? Or are there any advantages of using Rust compared to C++? That's a great question. It was actually covered in other talks, but I like using Rust because it's a very nice language. It has nice ergonomics. It's fast and native. It has more safety guarantees than C++, and a great ecosystem. Thank you. You're welcome. Any other questions? I'm curious what your main loop looks like. Do you spend all the time polling for events? Do you get called back from the browser? Does the browser hang if you never sleep? That's a good question. I did not modify my main loop, mostly because I use winit. I use a winit event loop. This is specific to the winit crate. Nothing was modified in the main loop. 
It spins; I don't remember exactly how often, but basically once per frame length, and then the display gets refreshed. Yeah, that's it. And that's all the time we have. Thank you.
Embedding Servo in Rust projects
Yes. Rakhi. Rakhi. We'll close the doors. Thank you very much. So anyone that knows some history of Rust is going to be very excited about this talk, and Rakhi is going to tell us what's going on with Rust and Servo. Hello. Hi. Thank you. I am Rakhi and I'm a software engineer. I work on Servo at Igalia. Before I even start with the talk, I'm very curious about the audience. How many of you are writing Rust professionally or full-time? Many of you. And how many of you are writing personal projects, out of interest? How many of you are coming from the front-end or back-end world? I still see some. This is like the perfect audience for this talk. Normally when I start a talk, I tell you about the project I'm going to talk about. But today I want to start by answering some questions, because people have questions. What is happening? Is it dead? Is it alive? What is happening with the Servo project? I'm simply going to take you a bit back and walk you through. Servo's journey, on this slide, starts around 2012. It started at Mozilla Research, around the same time the Rust project also started; they were working quite closely together, actually. And people who have been active in the Rust community, or have known about the Servo project, knew what was happening with Servo in 2016, '17, '18. But the questions on my previous slide came in when we were in 2020. A lot was happening in 2020, but this also happened: Mozilla's layoffs impacted the Servo team. This affected the whole team, and the future was not as bright as we thought. Around the same time, the Servo project joined the Linux Foundation. There were a few people from the Servo team who were trying to maintain the project in their personal time, but that is not enough. The Servo project is huge, and that's not enough. It needs funding. It needs more people, expertise and many other things. Around 2022, we started restarting Servo. I just mentioned it needs lots of funding, lots of expertise, lots of people. Who is going to start? 
In 2023, a team was formed at Igalia and we restarted the Servo project. So what did we do in 2023? The list is not huge, and I want to keep it small because this is not really the focus of the talk today. We restarted the project in the first half of the year. We were trying to maintain the project, to take it out of maintenance mode, actually, and tell people what is happening with the project and that we have restarted it. That's what I mean by outreach. Make it easy for new people to contribute, because an open source project is literally nothing without the contributors. We started work on the layout engine. We started shipping CSS 2 features. We also had to make a choice between layout engines. We in Servo have two layout engines; we still have both, but we haven't deleted the old one yet. We ended up choosing to work on layout 2020; the old one is called layout 2013. I'm not going to talk too much about it. You can go to the wiki and find out why we took the decision of choosing layout 2020. At the end of the talk I have a QR code that will give you access to these slides, so don't worry about searching for it now. In 2023 we also worked on internal WPT tooling so that we can track the Web Platform Tests, and we built lots of Servo demos. When you are talking about a project, it is very important to show things. You can't just sit at the computer and code and say we are doing this and that, with no way to test it. I'm saying we built a CSS feature, but hey, how can I test it? There were a few Servo demos available before as well; we also moved them to the new Servo demo website. Then we did quite a lot of embedding work. We built a mini-browser, which is going to be the focus of this talk. This was 2023. I also want to cover what we are going to do in 2024, because it is already here; we are in February. We want to continue the project maintenance and the outreach, because some of you, I'm sure, were not sure where the Servo project is going. 
I'm sure there are people outside this room who are still not sure. We want to make sure that everyone in the community is aware of what is going on. We want to continue shipping CSS support. Right now, while I'm standing here, we still have a few PRs open related to tables. We are really trying to ship this table support in Servo. We want to continue working on the embedding API and the initial Android support. We already have an initial build that runs on CI. We landed this PR, I think, two weeks ago, or maybe one week ago; let's say somewhere between one and two weeks ago. If you look, the list is pretty similar to what we did last year. This is the focus of the talk today. I'm talking about embedding because embedding has been asked for by the community for a very long time. I was just looking on the internet, Reddit, Hacker News, Twitter, GitHub, at what people are asking, and I ended up collecting some things. If you see that screenshot, this Servo embedding support was asked for 11 years ago. This is exactly what I'm trying to tell you. I'm saying that we can't just say, hey, we are adding support for X or Y feature to Servo, without showing you a POC of how this feature works, or how you as a user can test it. Last year around the summer, when we did lots of maintenance and took it out of maintenance mode, we decided we wanted to work on embedding. Then we ended up building a mini-browser. When we were talking about the mini-browser: in any open source project, the first step is to open an issue. That's what I did. I opened the issue. We wanted to decide how we want to move, which library we want to use. I opened the issue initially, and we had some discussion. The whole idea behind opening the issue was to get comments from the community, in case they had suggestions on libraries. We already had this window that we were using in the code base, so we could have a quick POC. 
You don't want to spend years building something without knowing how the people, companies or users who are going to use your product feel about it. We ended up building a mini-browser. I want to show you, actually. Let me switch over; I hope you can see the screen. This is the mini-browser. Just keep a mental model of how this toolbar looks; I'm going to show you some code in a bit. This is about how we can make your life easier. This is the demo website I was talking about. You can go and do stuff there, check out how the performance is. We have some demos that exercise technical tests; you can test the WebGL support as well. Certain things. Play around, go back and forward. Just to give you an idea of how this looks in the code: depending on how you prefer to read code (I prefer to go from top to bottom and then work my way up from there), you want to go to servoshell. Once you arrive in servoshell, you want to look for the mini-browser. I can see some code here. This initialization part is okay; it is not the focus. If you see the if condition here: what is happening is that we don't just provide an option for you to have the mini-browser, we also provide an option for you to disable it. It is enabled by default. In case I have to show you how it looks, you just need to pass this, and you can already see how it looks: you don't have a toolbar. This is also the window we used to test how our Web Platform Tests look. While you are here, we also want to look for the event loop. That is an important part: run forever. Yes, this part, the event loop. As the name says, it runs forever. In this particular case, for the mini-browser, we want to see this part: we are using winit. We run the winit event loop, which really helps initialize the window. As I was saying, I want you to remember how the toolbar looked. 
I want to go to the mini-browser and show you something that we have going on in the update function, this particular code that I want to walk you through. This is something that we did. Like I was saying, we opened an initial issue where we wanted to decide what library to use. We were already using winit to create the whole window. We ended up using egui; there is really good support through egui-winit and egui_glow. This was very helpful for the input and output stuff we needed to do with the mini-browser. As you can see, there are just two parts going on. You can imagine the toolbar as two parts. Laid out left to right, we had the back and forward buttons. Then the other part was laid out right to left: on the right side you had the go button, and to its left, the location field. All together, that was the whole toolbar. One other thing that I want to show you is how we are initializing Servo. I can go to Servo::new. Inside the event loop's run forever, we are passing all the data to the new function that initializes Servo. A lot is going on here, and I am not going to walk you through all the code because it would take forever; the next 10 minutes are not enough. Initializing Servo means, among other things, creating a thread for WebGL. One of the most important things going on here is the configuration of the constellation. If I have to show you, it should be here already, actually. Here is the constellation. It creates the constellation here. If I go and see how the constellation is started, this is the constellation part. This is the Grand Central Station of Servo. I love this comment; someone left it like 11 years ago and it is still valid. From here, you can really get lost in the code, and not lost in a bad way, because this is the place where everything is connected. 
Here, you go to the pipeline, navigation, the layout thread, the script thread, and you can really go to layout and then the script. From here, you can go everywhere. Then I was just showing you this code right here. We are in lib.rs. This is our engine; we call it libservo, and this is the whole engine we are talking about. I want to keep it short here, because I want you to see something else as well. That was the mini-browser that we built. Next: around the same time we were done with the mini-browser, we were talking with Tauri about how we could collaborate to integrate Servo into the Wry project they have. Thanks to the funding we got from NLnet and the collaboration with the Tauri team, we did collaborative work where we embedded Servo in Wry. Wry is a library that aims to provide a fully open-source webview to users. This is the screenshot of the demo that the Wry team built, a hello world from Servo and Wry. If you have questions about Tauri and Wry, Daniel is sitting here; you should catch up with him. He has lots of answers for you on that side. Thanks to my colleague Delan for putting it all together so that I could show it to you today. Earlier, when I started this talk, I asked how many of you are coming from the front-end or back-end world. I have spent quite a lot of time in my career doing front-end and back-end work. When this embedding work started, we worked with the Tauri team to figure out the whole task list, what is and isn't needed. We shipped off-screen rendering; we shipped pre-compiled mozangle; we still have to figure out how we are going to do the packaging and distribution of the shared objects we have been creating for the two biggest dependencies of Servo, that is mozjs and mozangle. We shipped the mozjs shared object already, but have yet to figure out how we are going to do the distribution; that part is still in progress. We have some work going on on the static-lib side as well, so we are going to do that too. 
Before this talk, I wanted to see, as a user, how I am going to use Wry and how it is going to impact me. I started this; this tells me that I am close to the finish. This is the demo I built on top of Wry. Behind the scenes, Servo is running, rendering things for you, through the Wry integration. This is the result of the integration work we did. Just to show you quickly what I had to do in order to run this project: I just had to write this HTML, CSS and JavaScript code. That is all. As a user, I don't need to care about what is happening on the Servo side, or what is happening behind the scenes in Wry itself. As a user, I just need to write HTML, CSS and JavaScript, and things are ready. Maybe you can go ahead and try to write an input and browse the UI, and maybe you will have something like that. It was pretty cool to see that as well; I was personally very happy. One of the reasons why I showed you this issue (and we even have a meta issue for the mini-browser) is that there are some unchecked boxes in case you want to contribute. We will be very happy to help review your PR or help you get started. That was about integrating the Servo rendering engine itself. We have another story with Dioxus, which is doing pretty unique work by just taking one crate; you might know about this, it is the Stylo crate, for CSS styles and selector matching. This is something unique, because we have been talking about integrating the whole Servo rendering engine into a project, but this opens another opportunity, where you maybe just want to use the script crate or the Stylo crate in your project. You can simply do that. It is possible, and Dioxus is proving it. After this whole talk, one question that some of you might have is how you can do it in your project. In short, you are literally one step away from doing it. You just need to reach out to us on Zulip chat. If you have time, you can check out how the mini-browser works, or how the integration with Wry took place. 
You can try it out with your applications, with your projects. If it works, great. If it is not working and you figure out that you need us, our team, to implement a particular feature, you can reach out to us on Zulip, or you can open a discussion on Servo. We will be really happy to help you get started and answer questions. We really have lots of people coming in and asking questions like: I want to integrate; we have some talks about Velo; we have some talks about certain other things going on. You can also follow up there; lots of things are happening in Servo. In short, you are just one step away from reaching out to us. Thanks for listening to me. You can scan this QR code to get access to these slides. Thank you. Unfortunately, there is no time for questions here. I am here, so please catch up with me. Yes, I am happy to answer your questions. Thank you.
Thunderbird: How to Exchange Rot For Rust
So, if I could have your attention. When we got this talk, I didn't know Rust and Thunderbird had a connection, so this is pretty exciting and pretty cool. So we have Sean and Brendan, who are going to talk about how to exchange rot for Rust. Thank you very much. Hi. I'm Sean Burke. I am a senior software engineer at MZLA, which is the company that maintains Thunderbird. And this is my colleague, Brendan Abolivier, who is a software engineer at MZLA as well. So we're here to talk about how to exchange rot for Rust. Our colleague Ikey Doherty couldn't join us, but I feel I need to shout him out, because we would not be giving this presentation without him. And I also have to applaud his pun in the title, because the project that forms the basis for this talk is Microsoft Exchange support in Thunderbird. So we're working on adding support for the Exchange Web Services protocol. This is the first Rust component written specifically for Thunderbird. Our code is based on Firefox, and so there's Rust there, but nothing specific to Thunderbird. And it's also the first mail protocol to be added to Thunderbird in Thunderbird's lifetime, which is a slightly strange statement, but I will explain that a little bit here. When we started this project, nobody actually knew how to add a new protocol to Thunderbird. And that gets into the "rot" part of the title a little bit. So first off, a little bit of history of Thunderbird. Thunderbird grew out of Netscape Communicator originally, as did Firefox. So a lot of the code in Thunderbird predates Thunderbird itself. The 0.1 release was July 2003, so this is a fairly old code base already. Starting around 2012, Mozilla started to hand over Thunderbird to the community, because it felt that Thunderbird wasn't self-sustaining under the Mozilla umbrella. That situation persisted until around 2017, when Thunderbird rejoined the Mozilla Foundation. And so what does that actually mean for Thunderbird? 
We had a pretty big gap in paid maintainership, which results in, you know, a community can only do so much. Thunderbird is a very large project; there's a lot of work to do just keeping up with building, making sure that it's following Firefox's changes, since we're based on Firefox. And that gap meant there was a lot of time where you can't expect a community to have a holistic view of the architecture of a huge project like Thunderbird. You can only ask so much time from them. And so changes were made without a view to how they would affect the architecture, how the architecture played into things. There was also a loss of institutional knowledge, because the people who'd been employed to work on Thunderbird were no longer there, and there was nobody to take over for them. In a lot of places in Thunderbird, there hasn't really been any kind of architectural maintenance in over 20 years. And that also means that, you know, large portions of the code base are written in C++. C++ has changed quite a bit over the years, and Thunderbird has not kept up. So this is a pretty significant challenge, but it also presents us with a pretty significant opportunity. That opportunity is Rust. So we'll talk a little bit about why we decided to use Rust. This is a room full of people interested in Rust; I'm sure most of you are pretty aware of the major benefits. We're a large application maintained by a small team, and we take input from anybody who sends somebody an email, so memory safety is pretty critical. We do not want security bugs letting anybody have access to somebody's computer. Performance is also pretty big. We use a lot of JavaScript in our code, but for low-level stuff, JavaScript is going to have some performance issues. And then, you know, the modularity of Rust: having that built in gives us access to a pretty large ecosystem. There are a lot of people doing mail-related stuff in Rust, and we can benefit from that. 
The next reason is that we're based on Firefox code, and Firefox already has Rust in it. The build system is set up to integrate with Cargo, and we share CI infrastructure, which already has provision for Rust. Firefox also has something called XPCOM, which is a framework for communicating between the different languages that Firefox uses, and there's Rust support in that already. Introducing a new language also gives us permission to rethink some of the aging ideas in Thunderbird. It allows us to sidestep some of the more delicate code paths that have been changed ad hoc, special case by special case, throughout the code, where changing things is a little bit scary because you don't know what you're going to break. I also mentioned the loss of institutional knowledge. We need to rebuild that, which means a lot of documentation, and personally, I love the documentation tooling that Rust provides. I think that helps a lot in moving forward. But as with any project like this, it's not just "okay, we're going to use Rust, cool, we're done, we're good to go." We had some problems getting started. Part of that is that we have a large existing code base, which means we have existing patterns: a lot of idiosyncratic async stuff going on that doesn't integrate nicely with idiomatic Rust, and lots of features and capabilities already in the Firefox and Thunderbird code base which don't have any sort of Rust bindings, or sometimes have somewhat painful Rust bindings. I mentioned XPCOM as a benefit, but it also became a bit of a drawback, particularly in terms of developer experience. Over the years, Firefox has excised a lot of its XPCOM interfaces, because they can be a little bulky, a little painful to use sometimes, even from C++ and JavaScript. That work never happened in Thunderbird; we have many more uses, and heavier uses, of XPCOM than Firefox does.
So what works well for them in terms of developer experience doesn't work for us; it's really painful for us to use XPCOM at this point. I also mentioned the build system as a positive, but in a big way that became a drawback for us. Because Firefox has a C++ entry point and no single point of entry for Rust, there's a hack in place that builds a single Rust workspace and shoves it into the Firefox build. The problem with that hack: we're built as a subtree of Firefox, rather than having Firefox as a subtree of our code, which is a little unusual, and Cargo doesn't like it when you try to have a workspace inside a workspace. We're not in the same repository as Firefox, so we can't change their Cargo.lock and we can't change their dependencies. We solved this by basically taking their entire dependency tree, merging it with our own, building from within our code, and using a script to keep that up to date, and we hope things don't break. So far, so good. With that, I'm going to pass it off to Brendan. Now we can use Rust in Thunderbird; we can build and run Rust code in Thunderbird, thanks to that work to integrate it into the build system. What do we do with it now? To answer that question, it's good to think back to where we're coming from and what we're trying to achieve. Our end goal with this work is to be able to support Microsoft Exchange in Thunderbird. More specifically, we want to support something called EWS, which stands for Exchange Web Services; that's Microsoft's proprietary protocol for interacting with Exchange. That protocol is based on XML over HTTP, or SOAP to be more precise. And that means we're missing a few key pieces of code infrastructure to make this possible.
First, we want to be able to send HTTP traffic, and preferably we want to send it through something called Necko. Necko is the networking component of Thunderbird, and since we already have a well-functioning networking stack, it would be a bit sad to completely bypass it. We want to be able to interact with Necko in a way that is familiar and easy to use for Rust developers. Once we have the capability to send those requests, we also want to be able to fill them with the content that we need, in this case XML. We need to figure out how to serialize and deserialize XML in a way that scales to a lot of data structures. To give an example of the scale: EWS specifies about 100 different operations and about 1,700 different data structures. Let's start at the bottom of the stack, which is sending HTTP requests. Because we want to interact with a specific component within Thunderbird, we want to use XPCOM, which I mentioned. The acronym stands for Cross-Platform Component Object Model, and its job is basically to allow inter-component interaction by defining platform-neutral interfaces. That way we can cross the language boundary, which is good for us, because we want to write Rust code that interacts with Necko, which is in C++. So let's use that. Except that using XPCOM from Rust directly doesn't look very Rust-like. It's mostly designed around C++ APIs, so it doesn't have a lot of the features that we can find in Rust, and that means there's a lot of boilerplate. This is the code to just send a single GET request and print the result to standard output. We need to define a bunch of callbacks and a bunch of different objects, and because we're crossing a language boundary, at the very bottom we need to wrap the actual call in an unsafe block.
None of that is ideal, and we obviously don't want anyone who wants to use Necko from Rust to have to do all that every single time they want to interact with the network. So let's split this into two sub-issues that we're going to solve. The first one is that we want to support native Rust async/await syntax. The way we do this is that we added a new internal crate to Thunderbird, which is actually the first Rust code to be added to the Thunderbird code base. The role of that crate is to translate asynchronous operations in XPCOM into Rust's native async. It defines a custom stream listener, which is that big struct that we saw earlier with a bunch of callbacks. That stream listener buffers any incoming data and calls wake on a waker when the request finishes. Then we wrap that in another struct, which is in charge of triggering the asynchronous operation in XPCOM and which implements the Future trait, so it can query the state of the buffer every once in a while and return the result when the request finishes. In the future, we'll probably also implement the Stream trait, in order to be able to process incoming data incrementally; we don't need that immediately, so we just went with Future for now. Now that we have this native async/await support, we want to build on top of it to have some idiomatic Rust code for sending HTTP traffic. We do that with yet another internal crate, which provides a more idiomatic, reqwest-like HTTP client. It's not a one-to-one replica of reqwest, but reqwest was the main inspiration for this work.
Under the hood, that crate is in charge of creating and configuring all the necessary XPCOM objects and wrapping them in our future. It also provides more Rust-idiomatic error handling, because standard XPCOM does its error handling with just error status codes, which isn't the best we can do in Rust. So that's all nice, but what does it look like? Let's do a demo. We're going to do a live demo, because we don't like to live safely. Here is some code that lives on my local checkout of Thunderbird. It's got a bunch of plumbing to hook it into XPCOM for the next step of the demo, but the important bit is what we can see here: with the client from our HTTP crate, we can create a POST request, set a custom header and a custom body on it, send it, natively await on it, and then process the response, or the error, as the case may be. We're going to run this code in a local Thunderbird, which apparently crashed while I was preparing the demo. Let me just... So this is the Thunderbird DevTools. It might look familiar, because it's the same DevTools that Firefox uses. We use it to work on the front end of Thunderbird and to access some internals of Thunderbird when we need to. So we're going to instantiate that XPCOM plumbing that I mentioned. It's basically just a dummy interface that has one method to do the thing, which in our case is sending an HTTP request. We can see that we successfully sent a request through Necko. We know that because it appeared in the network tab, which means it went through the Thunderbird networking stack. If we inspect the request, we can see that it did include our custom header, it attached the right content type, and it also correctly set the right body on the request. And to confirm that once more, the server (that's just a simple, stupid server that I quickly wrote in Python; sorry for using Python)...
...just takes that custom body and that custom header and prints something. Right, so that works. Now, where do we go from here? We have requests that we can send, and we can process the responses to those requests. But what do we actually put in those requests? As I mentioned, we want to put XML in there to be able to communicate with Exchange servers. So we started with a kind of exploration, a lay of the land of the status of deserializing and serializing XML in Rust. We quickly identified that most crates we could find had some existing issues. Either they don't provide a good way of handling attributes and/or namespaces in XML, or they're very boilerplate-y. For deserialization that's fine, because we don't necessarily need to process every single attribute or namespace from the response. For serialization, it's not something we can tolerate, because obviously if you omit a required attribute or something like that, the Exchange server is not going to be able to understand the request. And we not only want but need a low amount of boilerplate in our code, because EWS defines a lot of data structures and a lot of operations: dozens of operations, more than 1,000 data structures. Any amount of boilerplate is going to make the code ten times more difficult to maintain. So we decided to create a new crate. This one is not tied to any Thunderbird internals, so it just lives on GitHub. In this crate, we leverage Rust's procedural macros to dynamically generate, at compile time, implementations of a trait that we also define. Almost everyone in this room will just say, "yeah, this is just a derive macro." I'm fairly new to Rust, and when I saw that, I thought it was pretty cool, so I want to mention it. We don't want to reinvent the wheel.
So we built it on top of quick-xml, which provides some pretty nice tools for writing and formatting XML, and we tried to design it with the fairly low-boilerplate approach that we need. So what does this look like? This is a kind of dummy data structure that I defined, and as you can see, I was thoroughly uninspired with the naming. But it showcases some of the features of this crate. We can set namespaces, either default or custom ones. We can set namespace prefixes. We can instruct a field to be serialized as an attribute. We can flatten some structures. And then all we need to do is actually populate our data structure, serialize it, and in our case just print it to see what it looks like. If I run this, it generates valid XML that matches the data structure we defined here. So that's a lot of useful code infrastructure that we now have for our Microsoft Exchange implementation. Where do we go from there? Obviously, the next step is to implement the damn thing: implement protocol support for EWS in Rust, and hook it into the Thunderbird UI to expose it to our users. If there's enough interest, we also want to generalize the xml-struct crate, the one on this slide, because at the moment it's fairly designed around the EWS use case in terms of configuration and defaults and things like that. So if there's enough interest, that might be something we look into in the future. Another next step is that we might start working with the XPCOM team and the Firefox developers to try to improve the situation around Rust bindings for XPCOM, and make them nicer to use for Rust developers. So that's where we are, and that's where we're going. Thank you for listening. Thank you. I think we have quite some time for questions, if you have them. Yeah. Well, as I make my way over there, one question I had:
If the protocol support is in Rust, do you think it's possible that it could be more shareable with other email clients? Yeah, this is one of the things that we're trying to keep in mind. One good example: you might have heard that a few years ago we welcomed the K-9 email client on Android into the Thunderbird family. If we're building new protocol support for the desktop application, we would like, in the future, to potentially include that support in K-9 / Thunderbird for Android. So this is definitely one extra reason we decided to go with Rust: the ease of reusing Rust code across multiple platforms. And we are going to make the EWS crate public as well. Yeah, I'm going to repeat that because I have a mic: we're going to make the EWS crate public. And we also seek to build it in a way that is fairly agnostic to the actual desktop application.
SPDX 3.0 - a migration journey
Good morning, everybody. So I've got too many slides to present, so I'm going to go kind of fast. I think all of you know what SPDX is. Does anybody not know what SPDX is? Ah, a troublemaker. So I'm going to skip through the "what is SPDX" slide and jump into what we're doing about 3.0. This is really a talk about my more practical journey. I'm one of the maintainers of the tools, and I recently went through the process of upgrading the tools for 3.0, and I thought that if I shared my experience with you, those of you who are writing tools yourselves might gain something from it that helps you out with your own tooling. As far as the agenda goes, I thought I'd start with a little bit of context: why did we even do 3.0? Because, as you'll see, there are some breaking changes, or at least changes you'll have to adapt to in your code, so it's good to know why we're all doing this. I'm going to talk a little bit about the approach to creating the spec, because that gives context to some of the techniques I've used for upgrading the tools, and give an overview of the changes, the important part. Then I'll talk about my practical experience with the Java libraries themselves. So, why SPDX 3.0? We've gotten a lot of feedback from the community. By the way, can you guys hear me okay? We have no idea. Yeah, we don't know if it's... There we go. Let's see if I can... It is there. I think that'll work. All right, a little bit easier. So we've gotten a lot of feedback that the SPDX 2 spec is just too complicated; there are too many pages in the spec. So we took some steps in 3.0 to simplify it. At the same time, we added a lot more use cases, which can actually make it a little more complicated. So we've taken a few approaches. I think the biggest one is the introduction of profiles in SPDX, to allow you to focus in on the things you care about in SPDX, and that does create some changes to the spec and impacts the tooling.
Another big change that impacts the tooling is that we made it a lot more flexible. We have some people using SPDX in extremely large deployments, with very, very large SBOMs, and they want to be able to distribute an SBOM across many different files, across the network. So you'll see some structural changes that allow you to do that more easily, and of course reliably, in a way that lets you authenticate the relationships. There's a lot of interest now in using SPDX in non-licensing and non-security scenarios. Product safety is coming up in 3.1, and some of that actually started to come into 3.0 as well, so there are a lot of changes to support that too. And one big change: there was actually a time when there was yet a third standard. I know many of you are frustrated with two standards; there were actually three for a while, and we merged two of them together into SPDX, so that also had an impact. I'm not going to go through all of this, but just to point out: we've been around for a long time, so don't ask me why we created two standards. We started back in 2010 and we've gone through a lot of evolution, most of it adding use cases, back in 2010, 2013. We added security use cases. More recently, we did the merger. And you can see at the top some of the external influences, the NTIA being one of the most significant in terms of accelerators. What's that? And the CRA. Absolutely. I am in Europe. Yes. We've also done some work on ISO standardization; that's on the timeline as well. And of course, this has an impact on how we evolved the spec itself. We started off with a very simple PDF. We'd tell tool developers like myself: here's a PDF, go implement it. That created some errors; some of us read the words a little differently. English isn't the most precise way of describing technical features. We then moved it over into a markdown file, which was a little easier, and we generated things.
And then we went to an ISO spec. Have any of you ever gone through an ISO specification process? It's interesting. There are a lot of requirements; they're very picky about their format. So we went through that. And where we ended up is a more model-based description of the language, from which we actually generate multiple different schema files. For 3.0, we spent quite a bit of time deciding how we wanted to do the spec infrastructure. A lot of us wanted to write directly in schemas, but a lot of people wanted to keep it human-readable and, more importantly, human-writable. So we took a middle ground: we describe everything in markdown files, but in a very specific format, and every time you commit to the repository, a check makes sure you adhere to that format. From that we generate an intermediate schema file, and that schema file then generates everything else. I have a little diagram to show you what the process is, and this is important if you want to contribute to the spec; it gives you a guide on how to do that. We started with a conceptual model. This is kind of temporary; we don't use it anymore, but it's just a picture to get us all on the same page. Then we write the spec in markdown, and this is where you can contribute: you can commit directly to the spec in that specific format. And thanks to Alexios and quite a few other contributors, we have tools and generators that right now generate a website, an HTML version of the pages. And here's where we tools developers get to be a little excited, at least I do: we generate a SHACL/OWL schema file. Now, how many of you have never heard of SHACL or OWL? Okay. Oh my gosh. You guys are going to kill me. But there's good news: we translate the SHACL and OWL into something that you do understand.
So just hang in there, because we're certainly going to be generating JSON schema files, which I think are really popular. But first you might be wondering what on earth SHACL and OWL are. Look them up; it's really interesting. It's very complicated, but it's very complete. Okay? It's very complete. And then from there we go to what we call serialization schemas, because JSON looks different from XML, which looks different from, you know, other formats we may generate schemas for as well. The reason we did all this is that it ensures consistency. If we agree on what the markdown file says, everything is completely consistent all the way through to the schemas you use to validate your source code. So it's well worth the effort. Really, Kate, it's worth the effort. Now, after asking yourself what SHACL and OWL are, you might ask, especially if you look at the spec, why we picked them. For one thing, SHACL captures not only the syntax of the data, which all the schemas do well (this is an integer, this is a string, it has this pattern), it also captures the semantics behind it. So it goes beyond what you can capture in a simple syntactic schema, and that is the additional information we pull out of the markdown files and put into the SHACL file. We can say things like: you've got a relationship between a file and a license; it can only be of this type, and you have to have at least one of them. Whereas in a syntactic schema, all you can really say is that there's a relationship, it's got this cardinality, and it's got this type. So you can go beyond the syntactic specifications. And of course, if you start from that, you can easily generate the simpler schemas, but you can't go from the simpler schemas to the more complex one. So that's why we picked SHACL.
Now, the other reason we picked it is that there's tooling for it: there are libraries that support SHACL in most language ecosystems, like Python and Java, etc. Am I back yet? So, it's not coming back, because the slides aren't being captured on the stream, and the HDMI is fed through here; there's something going on with this machine. You need to go and talk downstairs. The stream is not available, is what they're saying. Oh, dear. Technical difficulties. So, let's... oh, is it coming back? You want me to disconnect? Okay. Okay. In SPDX 2.x, if you cared about security, or you cared about licensing, or whatever you cared about, you had to read the whole spec to find the little field that you're interested in supporting. It's kind of hard to navigate. And if you wanted to conform, what is required? If you're interested in security, you really want to make sure you have the integrity fields. If you're interested in licensing, you want to make sure you have the licensing fields. But what do you make required in the spec? So we introduced profiles, and we have six or seven profiles in total. There are really three different aspects to a profile. The most important is the conformance requirements, and for us tools developers, that's the most important part. What it means is that if you are a producer of SPDX data and you say "I conform to this profile," you're meeting the minimum requirements; that's your promise to the consumer. So you can say: I conform to licensing, security, and AI and data, but I don't conform to the new services profile. And that's, of course, carried in the data itself. It's also a namespace.
And this is where the simplification comes in: you can filter the spec down to what you care about by using these namespaces. Technically as well: there is a technical namespace that goes along with all the classes and properties, and you can filter on that. Within my code, I also use it to help with some of the verification code that's there. And it's also the way we organize within SPDX: we have meetings that are organized by profile, so people of like mind and like concerns get together and actually develop the spec. So, let's talk a little bit about some of the other structural changes. In SPDX 2, everything was organized around a document, and that was basically a file. And we had a mechanism for reliably linking documents together, because you may get many kinds of SBOMs from many vendors; you may want to bring them together, you may want to compare them, and you may want to link them together. So we had a mechanism to do that. In 3.0, we still have that ability. I've got to make this really clear, because there's a rumor going around that SPDX documents are dead in 3.0, and that's not true. They're still there, and you can use them the same way you've always used them. But you can also link directly from the elements. An element is a package or a file or, you know, a unit of something you care about in SPDX. So now you can go directly, and you can put these things out on a network without having to worry about the files that contain them. Think about the World Wide Web, where you have files and images that are linked together in HTML. You'll be able to do that with SPDX documents in the future. So it's a very flexible, powerful mechanism we're introducing. Relationships have changed. In SPDX 2, they were a property of the element: you have an element like a package, and you say it has a relationship to another element like a file, and that would be a property.
There's a problem with that when you go to a distributed environment, because you have to know about the relationship in advance; you can't introduce a relationship after the fact, because it's a property of the element itself. So we moved the relationship outside. Now you have a separate object, which is an element, that expresses a relationship from one element to another. And we've put a bunch of properties in there that in a way simplify the relationships: rather than having hundreds of relationship types, we can have dozens of relationship types plus a few properties within the relationship to take care of it. How am I doing on time, by the way? You're at, yeah, 17 minutes. Oh, perfect. Okay. I want to make sure I go through these changes, because I think this may be the most interesting part for you. There are a few other changes. There's a better model for what we call entities: the person, the organization. In SPDX 2.x, they were just strings, and you'd have to parse the string to figure out whether it was a person or an organization. We now have a whole object hierarchy that describes what these things are, which makes parsing a little easier. We renamed and removed a lot of confusing properties. Those of you who have built tooling for SPDX 2 will love this, because people complained about these properties all the time. We either renamed them to make them clear or just got rid of them. For example, filesAnalyzed: a lot of people don't like filesAnalyzed. The functionality is still there, but it's just a lot clearer how to actually do those use cases. We've added some additional useful classes and properties. For example, we elevated package URL from an external identifier to a property on package, because a lot of people are using it directly for identifying the package metadata. And then we have some additional profile-specific classes and properties, of course.
And on this, I know you're not going to be able to type this in; hopefully you'll get a copy of these slides. There is a Google Doc that I put together. It's a living document, which means it's open for comment from any of you. If you find something missing, please comment on it. It's a guide to all the detailed changes. I was writing it as I was doing this work, and I know some folks have done the same thing and contributed to this document describing the migration. It'll turn into a migration guide once we do the full release, but right now it's more of a living document. So, stepping back from these changes, what's the big picture? It'll be a lot more flexible with the profiles. There's the new relationship structure, and we've made annotations independent as well, so you can make more incremental changes to an SBOM without having to go back and create a whole new big SBOM. And then simpler profiles, simpler snippets, more use cases. And again, see the migration document for the details. So now I'm going to switch over to my personal experience. I was involved in writing the spec, and now I'm going to tell you how much fun it was to actually implement it. First, to give you context on the Java libraries, I wanted to give you an overview of what the current SPDX 2.x library architecture looks like. It's what you'd expect: there's a set of model classes that match the SPDX 2.x model exactly. The only change, really, is that I had to rename some things that conflicted with the Java language; Java doesn't like you to call a class Package, for example. And then there's a set of utility classes with some useful functions, like being able to do a comparison of licenses, little things like that.
And because the very first iteration of this started about 10 years ago as a pretty-printer, it was very monolithic, and I got a lot of feedback like: hey, I don't want to have all these RDF library things in there if all I want to do is generate JSON. So we introduced a storage interface that lets you create a lot of different model stores. A model store can represent a very specific serialization of a file, or it can represent a database, or a triple store if you're into RDF; the most common is just an in-memory store. This allows you to separate these out into separate jar files. It does add complexity, because there's a storage interface in between that everything has to adhere to so we can separate things out, but I think it makes it a lot cleaner. So, a couple of breaking changes that I noticed right off. One I did not expect: the change to the namespaces actually caused a change to the storage interface. I had been using just the property names, knowing I could always map a property to the full URI, the full string with the namespace, because we had a clean mapping. I can't count on that anymore. So I had to add one extra parameter, which means, oh my goodness, now I've got to change all these different libraries to use it. Of course, I put in a compatibility library that made it a little easier, but that was a breaking change for everything implementing the storage interface, at the storage model below. The model itself introduced some breaking changes too, as you'd expect after going through what the changes are. What I did is take all of the SPDX 2.x code and move it over to a compatibility library, so it's all still there. It is in a different Java package, though, so there is a small change to the imports, but it should work pretty much as is.
The relationship and annotation structures definitely impacted the Java code, because they move out of properties and become a little more independent. I came up with a trick to help consumers of my libraries avoid breaking changes; I'll come to that in a couple of minutes. It might be an interesting tip for some of you. The external document ref structure really changed; that was probably one of the more significant changes. We talked about the agents, the snippet simplifications, and then moving properties to relationships. Sure. That layer will direct you either to the compatibility layer or to the new model layer; it basically minimizes the impact on the users of my library. The SpdxModelFactory is what does the switching there. Here's the little trick that I came up with for relationships. We used to have these as properties, and now we've moved them over to separate, independent relationships. You can imagine what this does to all the users of the library. It's like: oh, this isn't just a change of coding or a change of names; I've got to restructure my code. So I came up with a way to make a relationship look like a property inside the class. I have a special class that says: okay, this is a relationship, but it looks like a property. If you're interested in that technique, let me know; I can show you the code. It really wasn't that hard. It's a very generalized class that I can use for just about any kind of property; I think it's called RelationshipProperty or something like that. That makes it a little bit easier. The other thing I focused on was reducing errors. (How am I doing? Five minutes. Thank you. I saw you getting ready.) You remember that in the spec, we did a lot of things to reduce the translation errors down to the actual schema files. We're taking that further with the coding as well: from the OWL/SHACL file, I'm generating the Java code.
So now you've got everything from the markdown file all the way through to the actual Java library code: traceable, reproducible code generation to make sure it's all done right. I can't tell you how many errors I have personally made, where I mistyped something or didn't read it right and it got implemented wrong in the Java library. The errors I make now will be much bigger, because they'll be in the code generator and they'll happen to everything. Sorry, that's a little bit of a joke. No, it'll get rid of all those little errors. We'll also be generating, as I mentioned before, the schema files for those of you who would rather see things in JSON Schema or XML Schema. I also generate the verification code from the SHACL/OWL files, so if you're into RDF, it complies with the SHACL/OWL. Those are some of the techniques for reducing errors. I think this is my last slide, the new architecture. One thing I didn't mention is this copy manager in between. It's a little bit of a detail, but it's an important one: if you've got two different SPDX documents with two different versions referencing each other, that copy manager will let you copy objects over to the new version. It kind of does the upgrades. It'll also copy between the different types of model stores, so it'll let you convert between tag/value and JSON, whatever. Anyway, I think I've got three minutes for questions. Did I go so fast? Did I lose all of you on that? That felt like speed presenting. Yes. I recognize that you did a lot of work on the specs so you can read them more easily, and that's great work. But I would also say we need a lot of different kinds of examples. You write your library, then you want to test it, and then you find out whether you really understood the spec the right way. Yes. And therefore, more examples of different types. Yes.
I don't think I can repeat all of that, but the basic comment, and I think it's a really good one, is that in addition to the spec we need examples, so that people can work off of those examples. We do have an examples repo in SPDX today, and for 3.0 we're going to organize it by profile. So if you're interested in security, you can go down, look at the security examples, and use those. Excellent point, thank you. Yes. Is the current code ready to convert a file from SPDX 2 to 3.0? Yeah, that's a really good question. So the question was, do we have code today that'll let you convert from 2 to 3? The Java code is not ready to be used yet, unfortunately. It compiles, but it's not quite ready. Yes, Dolph. Dolph mentions there's a project that can do that with the SPDX 3 release candidate. Is that in the presentation later today? Yeah. Okay, so we'll hear about a tool that can do that later today. It's not the Java library, so it's coming, but not quite ready yet. Yes. Is SPDX 3 Lite coming? SPDX 3 Lite is coming, and that's one of our profiles. It's unique in that it skinnies the spec down, rather than adding things to it, which the other ones do. Yeah. Yes. Can you talk about the relationship of SPDX to RDF? Have you come up against any requirements, things that RDF doesn't support, stuff that you feel like you need to push back up into RDF? Ah, that's a good question. The question is, is there anything we ran into in the RDF world that we couldn't satisfy using, say, the SHACL/OWL specification? I'd have to think about it. I have a feeling we have, but I can't think of an example right now. Let me think about it and get back to you later. Yeah, thank you. Yes. What's the view about converting SPDX 3 to CycloneDX and back and forth? Oh, yes.
Because they've obviously got things like AI in their model, etc. You've got one too. Right. And security. So are you looking at compatibility, because you want people to be flexible? Yes, we do, and I'm with you on that. So the question is, what about converting between CycloneDX and SPDX? We actually had an effort going on in SPDX 2 where we had people from CycloneDX, myself included on the SPDX side, collaborating, and we were doing two things. We were writing libraries to convert, so really testing it hands-on, and we were also working on the spec itself. In 2.3 I actually put a number of things in, per request of the CycloneDX team, to make it easier to convert. So we were doing both of those. Unfortunately, that collaboration stopped. I am looking for somebody from the CycloneDX team to work with to do that in 3.0. So if you're on the CycloneDX team, or if any of you in the room are in CycloneDX and are interested in collaborating with SPDX to make it easier for all of our users, let me know. I'd be happy to work with you. Thank you. Yeah. Okay, so I'm not sure I completely understand the... oh, time's up. Why don't you answer him while we close the screen and take over. So, sorry, the question is about how the decisions about changes are made. Is there a committee that decides? Like the governance of how the spec itself is made. So we do have a formal governance process. We have kind of a steering committee, and then we have different work groups. The real work gets done in the profile work groups; most of it's in the core. There are team leads that are nominated, and the steering committee runs that whole process. And we really try hard to make all the decisions consensus-based, and it's based on contributions too.
So if somebody says, hey, I want to do this, but they don't contribute anything, yeah, we don't really listen. If they say, hey, I want to do this, and here's a pull request, here's the spec text, here's some tasks, here's what you do to the schema to make it work, then it's like, oh yeah, come on in, we'll work on it. Sometimes there are differences of opinion, and we try to work them out together. Very rarely, the team leads will have to make a call, and we try to do it based on the majority, but it's rare when we do that, and we think very carefully before we do. All right. Max, thank you.
Know Your Ingredients: Security Starts With the SBOM
Okay, great. Good. All right, so welcome everybody. Thanks everyone for joining the hottest room at FOSDEM. My name is Steven Chin. I'm VP of Developer Relations at JFrog, and I'm going to talk about a lot of different projects which help secure the open-source supply chain: why we need security, a bunch of different security incidents, both historical ones and new ones which you probably haven't heard about, and a lot of new research going on. And hopefully we can all help to improve the open-source supply chain together. So I think a great analogy... can you guys in the back hear me? Okay, good. I think a great analogy for the software supply chain and how we think about it is to compare it to our food supply chain. We know that the way you get great cooking is by starting with fresh ingredients: having things which you know are safe, which come through the food supply chain without people interfering in the middle or failing to follow good hygiene practices. When you have an issue with your supply chain, you end up with spoiled ingredients and kitchen disasters. Anyone here seen Gordon Ramsay's Kitchen Nightmares show? Okay, a lot of good fun. And these are not the free-range chickens you're looking for. We're hoping we can get better quality and better security out of our software supply chain, so we can build enterprise applications which are hopefully very difficult for attackers to exploit. This is how the USDA looks at creating a healthy food supply chain, but it's somewhat analogous to software. You have a lot of production, farms and things which are producing; you have distribution and processing; so it goes through a bunch of different toll gates and different people in the process.
Eventually it ends up in a restaurant or a retail location, and then you have home users or restaurants or other folks who are cooking the food. So if at any point in this process you have issues with your quality, if you have infections, if you have bacteria entering into it, that results in potential issues on the consumer side. So when we're looking at the software supply chain, we need to look at it through a different lens, and I think a good lens is SLSA, which is one of the OpenSSF standards. It really focuses on getting attestations of the different parts of the build that your software has gone through: figuring out at each of these different gates, is the source control secure? Have you done the right things with code reviews? Have you gone through the right processes with builds? When you have all of this information about the build, then you can figure out whether you're actually secure. And a key ingredient to knowing this is the case, and this is why we're all here in the software bill of materials room, is that you need that final index of your ingredients, which can show you everything from end to end. SLSA and the SBOM standards, both SPDX and CycloneDX, go really well together, because this way you have the attestations of what's happened in your build in your artifacts, and then you can put that together into a single document which shows you all the things which verify the components, and the potential issues they might have. And if you're not following these good practices in how you build software, how you get provenance of your software and how you attest to it, then you end up with issues like, for example, the Log4Shell incident. I think this by now is infamous.
It sparked a whole second round of government security concerns over open-source software, and the real challenge for big organizations trying to address the Log4Shell incident in production was: are my production systems affected? It depended upon the version of Log4j you were using. It depended upon whether you were using just log4j-core, or the full set of libraries. And the answer for most organizations was, well, I don't know if I'm affected in production, so I'm just going to patch everything. That's very expensive and very difficult to do, and when you have libraries like this which are used so much across the entire ecosystem, it's quite challenging as well. What really started a lot of the government concern around the software supply chain was an earlier incident, the one that sparked the Biden administration's legislation around this: the SolarWinds incident. A very different sort of incident, because this one was a true software supply chain attack in the sense that they specifically attacked the build system. SolarWinds was using TeamCity, and the attackers got in right before the certification, the signing of the artifacts, happened. So to the downstream people SolarWinds was supplying, it looked like the software was signed and certified by the company, and nothing looked malicious; in fact the attackers had done a very good job of infecting it before it was properly signed. We'd like to prevent these sorts of attacks from happening, because they cause a lot of damage. They can give malicious entities access to information, they can cause privacy issues for consumers, and they cost a lot of money: according to IBM's data breach report in 2023, over 4.45 million USD per breach, a 15% increase over three years. So this is a huge issue, and it continues to get bigger for us as a software industry. Okay, so let's talk about some additional incidents.
So, which one of these is your package? When we're talking about delivering libraries and dependencies, the majority of software relies on open-source components, because we don't want to write the same code again, and it's actually more secure to leverage open-source libraries that have been peer reviewed, that have been patched, that are staying up to the latest standards. But what if you can compromise the systems in the middle which are supplying this information? The dependency confusion attack basically relies upon the fact that a lot of companies, organizations, and open-source projects use some sort of package management middleman. They'll set up repositories which pull from upstream or from local corporate repos. If you can get the internal names of the corporate packages, and this is an example from Yelp, then what you can do is upload packages under those names to npm or other public repositories, so essentially you're spoofing these libraries. So as a developer or a CI/CD system going through a potentially vulnerable package manager, rather than getting awesome-corporate-lib 1.2, the latest version of your company's library, it sees, aha, there's a newer version in a public repository, and serves that up instead. And as you know, bad things happen when kittens get access to nukes. So we don't want this to happen in our supply chain. Fortunately, all of the commercial package managers, including my company's Artifactory, are now patched for this: by default, they will not go out to a public repository if the package exists in a local repository. So this blocks that attack upstream. But Alex Birsan, who did this exploit, was very creative. He took an attack which was theoretical at the time; nobody had actually exploited it.
He attacked Google, Facebook, Apple, a whole bunch of companies, simultaneously claimed about a dozen bug bounties, and ended up getting 130,000 USD for his effort. So you can see that maybe, instead of helping secure the supply chain, there's a more lucrative path. But researchers like him are also helping to expose the potential issues in the supply chain in a way where they're not introducing threats, right? This is white-hat hacking, and we need people like this to find the exploits. It also helps guide us on what we need to do for new standards in SPDX, for implementing things like VEX, to make it easier to figure out what the vulnerability scope is. So I think these sorts of attackers are actually helping the ecosystem a lot. Now, another food example. If you have a recipe that calls for a specific type of rice, for example if you're making a risotto, you wouldn't want to use a mixed-grain rice; you need a specific type of rice. And this is something else which attackers make a lot of use of in the supply chain. Another common type of attack is called typosquatting, and a variant of this is leaving off namespaces. As an example, our research team found an attacker who released to npm a whole bunch of libraries from Azure, with the Azure prefix left off. So if you were a lazy developer and just typed in the package you wanted, leaving off the namespace, you would get a malicious library instead of the actual library you wanted. A very clever attack. And the way they did this inside of npm is they had a random account generator which generated a unique account for each of the different libraries they uploaded, so it also wasn't easy to systematically say, oh, this is a bad entity, I'm going to block them. They managed to spread out the attack.
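The namespace-dropping variant described above can be caught with a simple client-side check before installing. Here is a toy sketch (the package names and allow-list are invented for illustration; real resolvers and scanners do far more):

```python
# Hypothetical allow-list of scoped packages a project actually depends on.
KNOWN_SCOPED = {"@azure/core-http", "@azure/identity", "@cadl-lang/compiler"}

def suspicious_unscoped(requested: str) -> bool:
    """Flag an unscoped install request whose bare name collides with a
    known scoped package: the 'prefix left off' pattern from the talk."""
    if requested.startswith("@"):
        return False  # already scoped; the namespace pins the publisher
    bare_names = {full.split("/", 1)[1] for full in KNOWN_SCOPED}
    return requested in bare_names

print(suspicious_unscoped("core-http"))         # True: likely a squat
print(suspicious_unscoped("@azure/core-http"))  # False
print(suspicious_unscoped("left-pad"))          # False: no collision
```

A check like this is only a heuristic, but it captures why the attack works: the unscoped name looks identical to the legitimate one minus the namespace.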
They did it on 280 different packages across @azure, @azure-tests, @azure-tools, and @cadl-lang. With those they could install any software they wanted on a person's computer, but basically it was set up for potentially exfiltrating data from personal machines. Our security research team found this, we reported it to npm, they took all the packages down, and then we publicly disclosed it. Later on, a security research firm claimed that they were just testing out npm, so this was a company testing the waters. There wasn't actually a malicious payload in any of the packages yet, but it had a lot of potential for that, and the security research firm wasn't exactly upfront about what they were testing, either. Okay. And then, of course, if you're serving food, you want the ingredients to be very fresh, right? You can't make gourmet food if you start with a pile of rotten ingredients. When we're looking at the software supply chain, a good analogy for this is the somewhat infamous picture of a stack of more things and more things, with very small, fragile components nested inside the supply chain, where if you pulled out the banana, suddenly the whole supply chain falls apart. And a great classical example of this is the left-pad incident. Basically, there was a left-pad package published on npm by the developer of the kik package. Not a lot of code, so it's not something that's hard to write. But as developers we are very, very lazy: if you can possibly save a line of code by including a dependency, of course you would do that. And then the kik name was later claimed by a company which wanted to own that name. npm sided with the company, the publisher of kik got upset about this, and he pulled all of his packages down.
Later on, Cameron published an identical version of left-pad to solve this problem. But this is the source code which caused this huge incident, and it's not worth including a library dependency, a potential vulnerability, for such a trivial piece of code. So this is, again, a huge threat. Now, one of the ways you can find out what all your dependencies are and figure this out in a visual way is using GUAC. This is a new OpenSSF project; it just got added to the OpenSSF suite. What it does is give you a visualization of all of your dependencies. It lets you see exactly what you're using and how you're importing it, with some nice visualization on top. Using things like this helps you figure out what your risk is, what the potential scope of your application is, and how vulnerable you are as a project. So everyone knows Coca-Cola and its very secret recipe, right? The secret recipe is locked in a vault, very secure; nobody actually knows exactly what's in Coca-Cola. That's their trade secret. I think we pretty much all know what's in it now, but there's this aura of mystery about the recipe and the history behind it. So how do we as software developers, or as open-source projects, keep our secrets? And the reality is, we do a very bad job of it. This is all of the exposed secrets in different central repositories, which we found by scanning npm, PyPI, RubyGems, crates.io, and Docker Hub. Docker Hub, being the biggest repository and having large containers which contain a lot of other software, had a humongous number of secrets exposed: 5.78 million. But even the software repositories like npm had 1.16 million, and PyPI had 0.43 million. So there's a lot of accidental exposure of secrets in open-source repositories. This is yet another vector by which attackers get into open-source projects, and it allows them to attack the CI/CD infrastructure and cloud accounts which the projects are using.
And there are often accidental leaks of corporate secrets inside open-source repositories, because as a developer you're working in the daytime on your corporate projects, and then evenings and weekends you're working on open-source projects, and there's a certain amount of crossover. So here are the top mistakes to avoid so this doesn't happen in your own project. The first is not using automation to check for secrets exposure. Using something like TruffleHog, or a commercial scanner like Xray, lets you scan your packages before you check them in, to make sure you don't have exposed secrets. This is how we found those numbers: we basically ran our tooling on top of the central repositories to find exposed secrets. The second mistake is generating tokens with broad permissions that never expire. You always want tokens scoped as small as possible in terms of what they can do, and expirations set in a reasonably short time frame, so you're rotating keys at the right times. The third is having no access moderation for the secret. Putting it inside a service like HashiCorp Vault or Docker Secrets will help protect your secrets and tokens. The fourth is fixing a leak by just unpublishing the token. This is a really, really common mistake: you can't simply check in a new revision which deletes the token, because Git has a long history and is going to remember it. Now, if you followed point two and you have very short-lived tokens with very small scope, that limits the damage, because by the time somebody finds the token it's likely not useful anymore. But again, a big mistake: you actually have to go and rotate the token to fully mitigate the issue. And of course, the fifth is exposing unnecessary assets publicly. We saw a lot of cases where, in test libraries and other code which was not the main library code, there were secrets exposed that were visible in the infrastructure.
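The automated-scanning point can be illustrated with a toy pre-commit check. This is just the shape of the idea with two made-up rules; real tools like TruffleHog ship hundreds of patterns plus entropy analysis:

```python
import re

# Illustration-only patterns: a GitHub personal access token and an
# AWS access key ID have recognizable prefixes and fixed lengths.
SECRET_PATTERNS = {
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def scan(text: str):
    """Return (rule_name, match) pairs for anything that looks like a secret."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group()))
    return hits

sample = 'AWS_KEY = "AKIAABCDEFGHIJKLMNOP"\ntoken = "not-a-secret"'
print(scan(sample))  # [('aws_access_key', 'AKIAABCDEFGHIJKLMNOP')]
```

Running something like this in a pre-commit hook or CI gate catches the leak before it lands in Git history, which, as the talk notes, is the only cheap time to catch it.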
In some cases, it looked like the test code, or the other sidecars beside the main code base, were not even meant to be published; they were more internal code. Okay, so to safely use open source, we also need standards. If you've ever gone to a restaurant, this is really common: in New York City they have letter grading on restaurants, a review of the source. A great way of doing this for open-source software is the OpenSSF Scorecard project. Basically, this gives you nice tooling for Git and a command line. It'll analyze your project and give you a score; it's kind of up to you to interpret the score for the different things it analyzes. But it tells you about code vulnerabilities, maintenance, continuous testing, build risk assessment, source risk assessment, a wide set of different things about your project. It helps you figure out how much risk is in your project, but also, more importantly, how much risk is in upstream projects, because if you depend on projects which are vulnerable, then your project itself is vulnerable. Okay. And given we're in 2024 and clearly the machines have been taking over, this wouldn't be complete if we didn't talk about what's happening with the security of machine learning models and some of the code we're leveraging to make better use of AI infrastructure. Unfortunately, it's not looking that good for us so far. The machine learning models which we all use and publish to public repositories like Hugging Face are highly vulnerable, and we're already seeing a bunch of attacks against these public repositories, with malicious actors injecting payloads into them.
And it's not very hard to do: the H5 format used on Hugging Face actually gives you the ability to put inside it what is basically executable code that sits alongside your model. Attackers have figured this out, so from the moment you install the model, they can run code on your system. As a developer, there's already the possibility that simply using models from Hugging Face and other public repositories could expose your development environment to risk. This is an example of the base64 payload, and you can run whatever you want from inside the model. Another attack for injecting malicious packages is exploiting generative AI. If you're using technologies like ChatGPT and other generative AI tools, what they'll often do is suggest packages that you should use as part of your code. AI algorithms are prone to hallucinations, and the hallucinations are actually quite predictable: a lot of the standard code queries people ask for will include perfectly valid dependencies, but they'll also include fake dependencies, packages which don't exist in npm, PyPI, etc. Hackers have already figured out that by uploading malicious packages under the names which the generative AI is hallucinating, you can effectively cause people using ChatGPT to execute malicious code. So that's another potential exploit, and now even the AI is introducing vulnerabilities into your code. Here are some examples from perfectly reasonable queries, for example requesting an endpoint that returns file contents. This code is vulnerable: if you now pass a couple of ../ sequences, you're going to end up in other directories and get access to files you shouldn't.
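For reference, a hardened file-serving function canonicalizes the path before checking containment, rather than trying to filter out ../ sequences. A minimal Python sketch (the directory name is invented; this is one common approach, not the only correct one):

```python
from pathlib import Path

# Hypothetical root directory the endpoint is allowed to serve from.
BASE_DIR = Path("/srv/app/public").resolve()

def read_user_file(user_path: str) -> bytes:
    """Serve a file only if it resolves inside BASE_DIR, rejecting
    ../ sequences, absolute paths, and symlink escapes."""
    candidate = (BASE_DIR / user_path).resolve()
    # resolve() collapses ".." and follows symlinks BEFORE the check,
    # so string tricks like "a/../../etc/passwd" cannot slip through.
    if not candidate.is_relative_to(BASE_DIR):
        raise PermissionError("path escapes the allowed directory")
    return candidate.read_bytes()
```

The key design choice is to normalize first and compare against the resolved base, instead of pattern-matching the raw input, which is exactly where naive (and AI-generated) implementations tend to go wrong.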
And if we again ask ChatGPT, okay, give us a secure endpoint that returns a file for user input and prevents directory traversal, it gives us a more complicated example, but this is still exposed to path-based exploits. So as developers, we can't really trust the current generation of code-suggestion algorithms to give us secure code. The attackers know this, and this now makes for a class of security vulnerabilities which are very likely to get injected into open-source projects and other work, simply by virtue of being recommended. And here's something we're going to be publishing soon, so you're getting this before the official publication. Basically, we went into Hugging Face, Kaggle, and some of the other public repositories, ran our malicious-package detection to figure out the current exposure of developers in the ecosystem, and we found over 60 models which contain malicious behavior. We analyzed the payloads; some of them were not truly malicious, but some of them were, and they basically allowed the attackers to run code on local environments. I believe we're scheduled in another week or so to publish the results on the JFrog research blog, but we're of course doing the right disclosures to Hugging Face and Kaggle so they can take down the models before people are actually exploited. Building awareness of these sorts of attacks helps the entire open-source security ecosystem, because we're the ones, both in this room building software bill of materials standards and in the general open-source security space, who have to figure out solutions so these sorts of attacks don't become the next SolarWinds. Okay, you can find a little more about the research I've been talking about from the JFrog research team at our research blog. This isn't our commercial blog; just the research guys publish here.
So it's all the fun stuff. And hopefully together we can create a more secure software supply chain. Thank you very much for having me in the software bill of materials room today. Okay, if you guys don't mind, I want to do a quick selfie with the audience. So what's a good security sign? Log4j, Log4j. Okay, let's give a thumbs up for Log4j. Cool. All right, thanks everybody for joining. I think we have five minutes for questions if folks want to ask them, or if you need a breather, because this room is very hot, feel free to leave the room as well. Any work on combining SBOMs with stored secrets and verification, things like that? That's a good question. I don't know of any work going on now about getting secrets information into SBOMs, but maybe that's a good addition for the standards. Yeah, thank you. So the question is what kinds of vulnerabilities Xray handles. I would say we're clearly in the application security department, AppSec. We find malicious dependencies, we do secrets detection like I mentioned, and we can actually build SPDX and CycloneDX files with both regular vulnerability info and also the new VEX standard. We don't currently do anything with runtime security, although that's coming. Our package manager, Artifactory, is open source; Xray is proprietary. Yeah. Okay, so Kay asked if I've looked at any of the stuff that's happening in AI for SPDX, the AI and data profiles. I know about the working group that's collaborating on this stuff, but I haven't looked at any of the new material yet. But I'm very interested to see what you're doing. Okay, we'll do. Thanks everybody.
Make your software products trustable
Hello everyone. Thanks for coming. My name is Dejan. Unfortunately, Marco couldn't be here today; he got a call, but yeah. What I want to talk about today: we saw a lot of sessions today about producing SBOMs and producing the data, and very little, I think only Philippe's session, about actually managing the produced data. So the challenge we try to tackle with the Trustification project is how to take all this data that is currently being produced by more and more organizations, SBOMs but also VEX files and more and more advisory data, and get it into some kind of manageable system. Because without that, the information is just a bunch of mostly JSON files spread all over the place, right? So what we try to do is provide a system that will take all this data, put it into a system that is searchable and queryable, and actually give us actionable information: making software development more proactive in managing security, but also making it much easier to respond to security issues. And, as I said, this got us to start working on the Trustification project, which basically set these goals for itself: being able to ingest and store all kinds of SBOM and VEX documents, for open source but also proprietary company products; for those ingested SBOMs and VEXes, discovering all the new vulnerabilities and advisories related to the packages inside the SBOMs; being able to explore and search that information; but also creating an API that can be integrated into other systems, letting us share this information with the rest of the developer toolchain, like IDEs and CI/CD tools. Ideally, we would want to mark all the vulnerable dependencies directly in the developer's IDE, and also, for example, fail builds that try to build software containing dependencies that are known to be vulnerable.
When we started to do this, around this time last year, we also figured out that there is another open-source initiative that revolves around similar ideas, called GUAC. It was mentioned in the previous session as well, and I will cover it a little bit more here. GUAC stands for Graph for Understanding Artifact Composition, and the idea is to be able to ingest all different kinds of artifact documents, like SBOMs and VEX files and advisory data from all kinds of sources, and basically create a graph ontology of that. At first the project just experimented with a graph database, but today the ontology is based on a GraphQL API and can be implemented by multiple persistence backends. That's the left side. On the right side of the graph, we also want to be able to query all this data. So GUAC should be able to provide us with answers about what the dependencies in my SBOM are and how these dependencies correlate with each other, what depends on what, so it's easy to find the whole dependency tree of your project, but also to attach to a particular dependency all the vulnerability, advisory, and VEX data that we can find in additional systems. This is the basic architecture. Let me just see how much time I have here. But I basically explained it with the previous graph: we can collect documents from different sources, we can certify them against different sources like OSV or deps.dev, and get it all through the GraphQL API ontology into a database. The two currently supported databases today are PostgreSQL, a relational database that we use and that works just fine, and an ArangoDB backend, which is a pure graph backend. And then on the other side, GUAC provides the GraphQL API to query that, plus a bunch of CLIs to extract data from the system. In the Trustification project, we try to provide a little bit more functionality on top of that.
First of all, we want to be able not just to ingest all the data about the different relations into the database, but also to provide a central place to store all your documents for the organization. So it provides S3-compatible storage for storing and ingesting all the company's data into a single place; it can be an S3 bucket in AWS, but for local deployments it can also be some kind of MinIO instance. It has what we call walkers for different kinds of CSAF repositories, so that we can automatically ingest SBOM and VEX files, and then provide what you can see on top and on the bottom: what we call a single pane of glass, a nice UI to be able to search all this data that we have, but also the external API, as I said, for integrating the system into the rest of the developer toolchain. So there's a nice VS Code plugin that can work with Trustification today and automatically get all the dependencies from the current project and flag vulnerabilities if they are found in the system. So I thought to do a little demo; let's see how it's going to work. Here we can see the UI with some pre-loaded data, and we can see that we have what we call six products here, which are actually six SBOMs that are already ingested in the system, and a large number of CVEs that have been collected from multiple sources. We can see that we identified around 2000 packages for these SBOMs, and most importantly, from the VEX files ingested here we identified 29 advisories. So if we go to a certain product we can see various information obtained from the SBOM: we can see the basic metadata that we have, and usually we can see all the packages and how they relate to each other.
I think this SBOM is pretty flat in structure, so there's not much dependency nesting going on there, but the most important thing is that we can see the different kinds of advisories issued against it, and also immediately see which actual packages are affected by these advisories. We can go back and forth through this system: we can go to the actual package and see that it's affected by this vulnerability, and we can also go from the package and find the SBOMs it belongs to, the SBOM or the product. But what we also provide is that nice search capability, as I said; maybe at some point you don't remember the exact vulnerability you're looking for, so you can just do a full-text search and find that there are packages related to that, but also find the exact vulnerabilities that we talked about a little bit earlier. So this is just a basic demo, right? I have a little bit more time, so let me explain what the challenges were for us, and I think we heard about these challenges in a lot of sessions: it's mostly still early adopters everywhere, tools are immature, including the project I'm working on, which we definitely don't consider mature, and there's also a lot of inconsistency in the data wherever you look. We heard today about the multiple competing formats in the SBOM space and all the work that people are doing to bring them closer together over time, which I think is awesome. We also heard a nice discussion about all the different kinds of identifiers. If you work with only one source of data, then it's easier, but if you try to correlate this SBOM with this VEX file, and the SBOM is using PURLs while the VEX file uses CPEs, it becomes impossible to correlate the data and build the graph properly.
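The PURL-versus-CPE correlation problem can be illustrated with a naive join on a shared (name, version) key. This is a toy sketch with placeholder data: real correlation needs ecosystem-aware mappings (vendor names rarely match package names this cleanly), which is exactly why a graph like GUAC is needed:

```python
def purl_key(purl):
    # "pkg:pypi/aiohttp@3.7.4" -> ("aiohttp", "3.7.4")
    rest = purl.split("/", 1)[1]
    name, _, version = rest.partition("@")
    return (name.lower(), version)

def cpe_key(cpe):
    # "cpe:2.3:a:aio-libs:aiohttp:3.7.4:..." -> ("aiohttp", "3.7.4")
    parts = cpe.split(":")
    return (parts[4].lower(), parts[5])

def correlate(sbom_purls, vex_entries):
    """Join SBOM packages (PURLs) with VEX advisories (CPEs) on a shared
    (name, version) key -- a naive stand-in for real identifier mapping."""
    index = {purl_key(p): p for p in sbom_purls}
    hits = []
    for cpe, advisory in vex_entries:
        purl = index.get(cpe_key(cpe))
        if purl:
            hits.append((purl, advisory))
    return hits

sbom = ["pkg:pypi/aiohttp@3.7.4", "pkg:pypi/rich@13.7.0"]
vex = [("cpe:2.3:a:aio-libs:aiohttp:3.7.4:*:*:*:*:*:*:*", "ADV-1")]  # placeholder advisory id
hits = correlate(sbom, vex)
```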
Also, what we found is that even though all these things are standards, there are a lot of unwritten rules in organizations about how they present their data. So the documents will parse, but what information you actually get out of a document really depends. So I think it's good that you're all here, and there's a lot to do, because it's early days. For the project itself, we'll try to further simplify the architecture and the deployment model. We're all about microservices and Kubernetes for now, which is okay, but I think we could reach many more people by simplifying how many resources it takes and where you can deploy a project like this, and by supporting more standards. You saw here just basic searches and basic correlation; I think once we have much more data in the system we can get much more insight from it, and providing that is the value of the project, in my opinion. And we'll continue working on future integrations, because in my mind, if we keep doing this right, at some point in a couple of years all this infrastructure should be invisible to developers: it should be part of your developer toolchain, automatically working in VS Code, in all the Git pipelines and everything. So we are just beginning. That's it. The Trustification site doesn't have too much documentation, as you'd expect of an immature project, but there's a devbox sandbox that you can try, the code is there, and we're always on the Matrix channel, so if you're interested, please reach out. A question from the audience: are you using the SPDX libraries to help with the ingestion? Sorry, yes: the question is whether we are using existing SPDX libraries. Yes, we are. There's a Golang one used in GUAC, and there's also a Rust one used in Trustification itself, because they are good.
Another question: what is the reason that you decided to start a project from the ground up, instead of helping one of the four or five big open source projects that already do, maybe not at the same level, but mostly 90 percent of what you are doing today? Why not help one of those instead of creating a new one? So, why are we starting a new project instead of helping others? First of all, we joined the GUAC project, which is itself another new project. Beyond that I can't fully answer; a lot of people were involved in that kind of decision. But it's all open source, and we are contributing to other projects, so it's not a closed source product. Another question: one of your early slides said this can be used to share SBOM data; can you talk a little about that feature, how this can be used to send SBOM data around to other projects? So, about sharing the SBOM data: it's not about sharing the data, but about providing the API so that external systems can query things. For example, the VS Code plugin gets all the PURLs from the current project, queries the system, and gets actionable items back. There's no distributed sharing of the data, just the integration API. Okay, thank you. Thank you.
Can SBOMs become first-class citizens in Open Source ecosystems?
Thank you to the organizers for allowing me to come here. I am almost new to the SBOM community out here. My name is Salve Nilsen. I am part of something called the CPAN Security Working Group. We work on supply chain stuff and security for the oldest open source software repository system out there; it started in 1995. Let's see if we can switch the slides here. There we are. And I am here with software and all that implies: developers publishing there, and downstream in Debian and Red Hat and all the systems out there, being used all over the world. And there are still something like 14,000 developers and more than 14,000 packages. So it's a real system; it's out there, it's working, and people are earning a lot of money on this stuff, so they want to keep it going. And now we have a new reality coming with legislation. So today I am trying to bring the open source supply chain perspective. These are recently finished slides, so please excuse me if I finish either early or late; I will do my best to make you happy. So I talked to a bunch of the people who are involved in the middle parts of this chain of events. They often say: why should I care about this stuff? We already keep track of our dependencies. We have the new formats, and this is not my problem. If you pay me, maybe we can talk. This is paraphrasing, but that is the essence of the discussions; some of the blog posts literally say "I am not your supplier." It's actually like that out there; I can confirm that notion. Then reality arrives, and the end users of all this software are obliged, under the threat of fines, to keep track of all the dependencies and what is happening with them, so that we don't get all those horrible security situations. That means they need authoritative and up-to-date information from the utmost upstream sources.
And to do that, you actually have to have the supply chain bits and pieces and steps play along in this game, so that we can get all the good stuff: figure out where stuff comes from, check it against vulnerability databases, and all that. We like that. So I've been researching, looking around at the documentation, trying to learn this whole SBOM thing, reading documents from the US government and all kinds of interesting organizations, like many of you probably have done. Very interesting stuff. Then I find this thing. This is from NIST. They try to describe where SBOMs show up, and there's something wrong there. There's no supply chain mentioned at all. It says "third-party software enters here," and there's no open source or processes or communities or anything. This seems to be a pattern: in a lot of documentation, and even in some of the standards, it's just assumed there's something going on here. And well, there is stuff going on. I would like you to get a little picture of what's going on there, so let me draw a simplified supply chain. We have an author at the top who does stuff and publishes something. There's a language ecosystem they publish on. They also collaborate with others on a collaboration platform. So the language ecosystems would be the PyPIs or the npms or the CPANs; the collaboration platforms would be the GitHubs and GitLabs and all that. And they are sources for downstream packaging. Oops, sorry. There we have it. So that one, the red one, that's where I come from. That's CPAN, and the npms. And we care about the infrastructure: how that happens, making sure that only the right people get to upload software, and that it's published and available, and all that good stuff. But downstream of us, we have the packaging ecosystems.
These folks here, that's the Debians and the Red Hats and all kinds of places that compile stuff for their own environment and make sure it's available in a consistent manner. But they also feed into each other: downstream of Debian you find Ubuntu. And sometimes the packages here are patched, because of upstream availability, or because you have to backport security fixes. And there's a packager there who sometimes has to talk with a curator about which of the software pipelines a package should be published into, because some of them are LTS pipelines; you don't want to do stuff there that you can do in another one. And then of course you have to make it all available, so that the developers at some business can do their work and all that magic, so that it can be put into some production environment and make people happy. All these boxes here: I tried to make it so that each box represents a role that cares about something that is supposed to be in an SBOM file. I'll try to be quick. So these bits here, that's actually this one, except that the third-party software arrow here, the tiny little grey one, is doing some seriously heavy lifting. That needs to stop, seriously. And there's another term: second-party software. We are not third-party software; we're second-party. We're partners. When people say we can get third-party software from open source: no, we get second-party. When you accept a license, you're actually getting a partner, someone you are supposed to cooperate with. Most people don't, but they're still there and expected, and you need to know about that, and people who make decisions about risk management have to know about this. That means anyone who writes documentation and teaches this kind of stuff needs to stop calling open source a third-party source of software. That's just insane. Calling it third-party software means the actual people working on the infrastructure out there get ignored, basically.
And that's not a good way to get the inclusion and the support from the software supply chain people and the ecosystem that you actually depend on. So, okay, who are these people? They are, in fact, your open source colleagues. In fact, they are your unpaid open source colleagues, just so you know. So stop treating them as strangers; start treating them as colleagues. Talk with them, interact with them, teach them and learn from them, as colleagues do in a healthy environment. Of course, if you don't have a healthy environment at work, maybe you should go do something else, or quit, or something. So, to make SBOMs first-class citizens in open source ecosystems, make open source ecosystems first-class citizens in the SBOM community. Please do that. Don't just put them behind a minuscule, one-pixel-wide arrow that says "third-party software enters here." That's just so bad and wrong. It's horrible. So there we have that. They are your partners. It's a good thing to have them on your team, even if they live somewhere else and you don't pay them. They're competent people, and they actually do want to help you. If you've treated them badly, they'll just say: this is your problem, see if you can fix it yourself. And you can't. If you want something to happen with somebody you don't have a monetary relationship with, you have to treat them as friends, with respect, and help them if they have a problem, and communicate. That is the good way to operate if you want your supply chain to be in on the SBOM game. So I hope this is a message that you find useful and can adopt in your work in the years to come. Thank you. If you have any questions, maybe we have room for one or two. One question? A comment from the audience: I've been involved in some of the groups that produced the documents you showed.
I think there may be a miscommunication here, because that third-party perspective wasn't meant to offend anyone. Anybody can be a third party when you're developing, right? So it's not just you. And in fact, some of the work they're doing has been approaching those language communities and helping them build their own tooling, for example being involved with the Python packaging community and their efforts to create their own SBOMs. So if it's a miscommunication, then we just need to sit down and talk a little bit more. The speaker's reply: there might be a miscommunication, of course. I'll have to repeat that: there was a long comment here saying there might be miscommunications out there. And of course, my perspective comes from one community. Communities that are more resourceful, like the Python community, may have it easier, and of course it's not meant as an insult. But I think my point still stands: by treating open source communities as partners, you get all the benefits, whether it's a small community like mine or a big one. So, thank you for your comment. I still mean what I said; you haven't changed my mind. Thank you very much. Okay, that was it. Thank you.
12 months of SBOMs - an experience report
Right, I'm live. I'm green. Right, welcome to the post-lunchtime slot. Okay, some of you know me because I was here last year, and basically I want to tell you what we've done over 12 months with SBOMs, and what came out of it is that it's all about change. So I'm going to talk about change. Some of it is about my tooling, which has changed, but a lot of it is about observations. A bit about me: I'm from Manchester; that's where that bee is from. I get asked about that picture. It's a security hub for innovators and startups in Manchester, trying to grow the ecosystem in the north. There's more to tech than just London, please. Normally at this time of year I'm running around muddy fields most weekends, so I've had a weekend off from muddy fields. My background is mission-critical systems: for 40 years I was delivering mission-critical systems, big complex systems. So what Nicole was talking about was my bread and butter; those are what I used to worry about. Now I have a startup, and I'm known as Mr SBOM in Manchester. They didn't know what SBOMs were 12 months ago; they do now. A lot of it is about a tool called CVE Bin Tool. This has been presented a number of times; it's a binary scanner. It came from Intel, who wanted to understand what binaries were included in their deliverables and whether they were vulnerable. A common question. It's open source, but one of the things we've done is become a Google Summer of Code project, and each year we've added more features, and I've been pushing the SBOM world in there. So we added SBOMs, then we added the CISA KEV, and we've added EPSS this year. We do a very trivial triage; let's say we might improve that with VEX, the world of VEX. And it got to a thousand stars this week, which is very good. The OpenSSF best practices angle is interesting. I don't work for Intel, by the way.
It's sometimes a challenge in terms of having multiple maintainers, and it's a challenge in open source when a project is run by a commercial organization. Generally the release dates tend to be triggered by GSoC. We found a little problem this week, which is why we didn't release this week, but it's very close to having a new version with all the EPSS stuff formally released. And then there's a tool that I write, which hasn't got a thousand stars yet, which is a Python SBOM generator: I take the installed Python environment, work out all the dependencies, and build the SBOM. Think about Python: there are lots of direct dependencies; what are the transitive dependencies? I'm agnostic about which flavour of SBOM it is; I've always wanted to support both CycloneDX and SPDX. Initially I wrote my own parsers and generators; I do want to migrate to the stable libraries, but it's all about time. What I'm pleased about is that there's a benchmark, which we'll see in a minute, and this was the first tool to get a benchmark score of 10 out of 10, which is quite hard, because the ecosystem needs to play together. And this is what I do: generally I just enrich it when I get the time, and I've been a bit busy the last six months, which is why it has slowed a bit. So generally the sort of thing we've been doing is adding more stuff into the package information using SPDX, trying to get as many of the package attributes into the SBOM as possible, because you want enrichment: the more data you have, the more useful it can be for more use cases. That's hard, because that data is not readily available. So what have we done? This came out of a conversation at a monthly open source meeting we have: it would be nice to work out how much change we have, and is the SBOM going to tell us what those changes are? So we set up a GitHub Action that runs at about two o'clock in the morning.
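The "direct versus transitive" distinction described here is just a graph closure. A minimal sketch, using an illustrative dependency mapping (the package relationships below are roughly realistic but hand-made, not generated from a real environment):

```python
def transitive_deps(package, requires):
    """Walk a direct-dependency mapping to the full transitive closure --
    the hidden 'iceberg' below the requirements file."""
    seen, stack = set(), [package]
    while stack:
        for dep in requires.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# Illustrative graph: each key lists only its DIRECT dependencies.
graph = {
    "cve-bin-tool": ["aiohttp", "rich"],
    "aiohttp": ["multidict", "yarl"],
    "yarl": ["multidict", "idna"],
    "rich": ["pygments"],
}
deps = transitive_deps("cve-bin-tool", graph)
```

In a real generator the `requires` mapping would come from the installed environment (e.g. `importlib.metadata` requirement records) rather than a hand-written dict.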
We use a clean virtual environment, on Ubuntu, and that's quite an important point, because it's going to come back later. We then install all the dependencies, and then we generate the SBOM in the different formats, whichever version is the latest flavour. And 3.12 will be added, I think, maybe this week or tomorrow; generally we just run it for the supported versions of Python. So, a little digression about Python dependencies; if you're in Node or Java or the like, everything has its own little quirks. The thing about Python is that it tells you what the direct dependencies are, and it can tell you a bit about the environment: if you're working on Windows, you may have different dependencies than in another environment. But it says nothing about the transitive dependencies. How much is hidden? Let's look at an example. This is a subset of our requirements file. At the top you've got aiohttp, with a constraint saying the minimum version we require is 3.7.4, but it also has optional requirements. So straight away you've got two potential ways of installing aiohttp, with or without that additional component. Then you look at beautifulsoup: any version will do. And then you look at these down here: importlib-metadata is only installed if the Python version is less than 3.10, and it's got a constraint, and similarly importlib-resources, again only if it's less than 3.9, because the Python standard library changes; as we heard earlier, the language ecosystem is part of your partnership. And you can see the number of dependencies gradually change over time as you add more features. But what you get, that's what you really have: that's the hidden part, that's the iceberg; it looks quite like an iceberg actually. That's from one of my tools. And the green are all the transitive dependencies. Look how deep that is. That was fascinating.
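The environment-marker behaviour just described is why the same requirements file installs different sets on different Pythons. A toy evaluator, handling only the single `python_version < "X.Y"` marker form (real installers implement the full PEP 508 grammar; the requirement strings are examples):

```python
def needs_install(requirement, python_version):
    """Decide whether a requirement line applies on a given interpreter,
    honouring a single `python_version < "X.Y"` environment marker.
    Toy sketch: real resolvers use the full PEP 508 marker grammar."""
    if ";" not in requirement:
        return True  # unconditional requirement
    _, marker = requirement.split(";", 1)
    marker = marker.strip()
    if marker.startswith("python_version <"):
        bound = marker.split("<", 1)[1].strip().strip("\"'")
        limit = tuple(int(x) for x in bound.split("."))
        return python_version < limit
    return True  # unknown marker form: assume it applies

reqs = [
    'aiohttp>=3.7.4',
    'importlib_resources; python_version < "3.9"',
]
on_py38 = [r for r in reqs if needs_install(r, (3, 8))]
on_py311 = [r for r in reqs if needs_install(r, (3, 11))]
```

So a 3.8 environment pulls in both packages while a 3.11 environment pulls in only one, and the two SBOMs differ accordingly.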
A picture is worth a thousand words, I think we all agree. If you really zoomed in, I've put the license values in as well; that's interesting if you want to do some analysis. Visually, it's quite an eye-opener, and we've only got 60 packages there. So what have I observed by looking at all the data we've collected? I want to look at the context, a bit about quality, a bit about velocity, which was the original question about change, and then other things I've discovered in the analysis. Generally this is all out of GitHub, so I wrote a little utility to download the file history, so that I could quickly analyze it locally, and I ended up writing a little tool called sbomtrend which turned it into a JSON file, so I could play around with it and generate the pretty pictures you're going to see. So, first thing: there's nothing in any of the SBOMs that tells you it's Python, or which version of Python, or which Python environment. Maybe a few things might hint at it, but that's actually quite important, because you're going to see in a minute the difference it makes. If you just get an SBOM and you don't understand its context, how do you know whether it's a real representation of the environment you're using? This picks up what we were saying in the previous talk. CycloneDX has user-defined properties which you could use; SPDX doesn't yet. You could use comments, but it's a bit harder. Yes, I'm sure you could, yeah. So I use CycloneDX properties just to say language: Python, language version: something. I think that's quite an interesting point, and it's good that SPDX 3 is addressing it, because I think we need it; it's quite important. And this is what you get.
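Recording that context with CycloneDX name/value properties might look like this. The property names (`language`, `language-version`) are the speaker's kind of home-grown convention, not spec-defined keys, and depending on the CycloneDX spec version the properties may belong on components rather than metadata:

```python
def with_language_context(bom, language, version):
    """Attach generation context to a CycloneDX document as free-form
    name/value properties.  Property names here are a made-up convention."""
    meta = bom.setdefault("metadata", {})
    props = meta.setdefault("properties", [])
    props.append({"name": "language", "value": language})
    props.append({"name": "language-version", "value": version})
    return bom

bom = {"bomFormat": "CycloneDX", "specVersion": "1.5", "components": []}
bom = with_language_context(bom, "Python", "3.11")
```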
If you plot all the different versions of the SBOMs across the year, the higher lines are the older versions of Python; it stops at 3.7 in the middle of the year because we stopped supporting it, but you see a trend. So that's the requirements trend, and you see it roughly follows it, and then there are a few other bumps: we didn't change anything, the outside world changed. And sometimes you see it drop, and that's because a package ceased using a dependency. It wasn't obvious until I dug into it, but that's what it was telling me. Quite interesting. So the lower versions of Python have rather more dependencies; you can sort of see that from the requirements file, but the requirements file is lost in the SBOMs; it's not there. So there are differences. Transitive dependencies vary independently of your direct dependencies; you could probably guess that, but it's interesting to see the evidence. And the later versions of Python have the fewest dependencies. So that's a good argument for saying: don't just update your packages, update your language versions as well. So let's look at the quality of SBOMs; we could probably have a whole conversation about this, a whole conference even. I've chosen four tools because they demonstrate four different things: an SPDX tool that checks conformance to the NTIA minimum elements; the scorecard tool, which comes from eBay, not the OpenSSF one; something called sbomqs, which is from Interlynk; and one from me, because it showed something else I discovered on Friday which was really interesting. So first of all, NTIA: we are no different from day one to today. We still fail, because we still can't get all the supplier information. I would like to see how many people can achieve that on a real project. You can get it for small projects, but not for real-life projects. I think we all recognise that.
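The NTIA minimum-elements check boils down to "does every package carry the required fields?". A minimal sketch, checking only a subset of the per-package fields (the component records below are illustrative):

```python
NTIA_PACKAGE_FIELDS = ("name", "version", "supplier")  # subset of the minimum elements

def ntia_gaps(components):
    """Report which packages are missing NTIA minimum-element fields --
    in practice 'supplier' is the one that is almost never available."""
    gaps = {}
    for comp in components:
        missing = [f for f in NTIA_PACKAGE_FIELDS if not comp.get(f)]
        if missing:
            gaps[comp.get("name", "?")] = missing
    return gaps

components = [
    {"name": "aiohttp", "version": "3.9.1", "supplier": "aio-libs"},
    {"name": "rich", "version": "13.7.0"},  # no supplier recorded in the metadata
]
gaps = ntia_gaps(components)
```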
Then the eBay one: one of the things they look at is package identifiers, which goes back to the 10 o'clock talk about PURLs and the like. I didn't have PURLs at the start of the year; now I do, so my score went up. It measures enrichment, and the licenses have probably got better as well. Then sbomqs, by Interlynk. I don't know where they got the idea, but they have a whole load of different things they look for: licenses, whether you still have deprecated licenses, whether you have checksums for your packages, et cetera. That was a target for me: how can I get a better score? So we got to 9.6. If you go on their website, most projects, excluding sbom4python, are in the sevens and eights; a lot of the containers are in the sevens and eights. So I'm quite pleased to get to that level. The reason it's not 10 is the supplier failings, same as with the NTIA check. And then I have a tool called sbomaudit. The reason I wrote it was: could you use the SBOM to drive policy? If I've generated an SBOM and I've got a license like GPL, and I don't want GPL in my product, can I have an allow list or a deny list of licenses, for example? That was the use case I came up with. I also check whether we're using the latest version of each package; that was the other thing I wanted to try. So I was getting a reasonable number, and the number of checks increased because I had more packages. And here is the interesting thing I found: this scan is from last weekend; I scanned it on Friday. I was expecting to get 100 percent, all the packages at the latest versions. Four of them got updated last Tuesday, which is why the green ones are so happy. But there were a couple that hadn't changed. That got me thinking: why don't packages change? Pinning. The convention in the Python world is generally not to pin. These are indirect dependencies; I've no control over those.
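The allow/deny license policy idea can be sketched in a few lines. The deny list and component records below are examples; a real audit, such as the one described, would also parse compound SPDX license expressions:

```python
def license_violations(components, deny=("GPL-2.0-only", "GPL-3.0-only")):
    """SBOM-driven policy sketch: flag any component whose declared SPDX
    license ID appears on a deny list.  Example data, not a full checker."""
    return [c["name"] for c in components if c.get("license") in deny]

components = [
    {"name": "reportlab", "license": "BSD-3-Clause"},
    {"name": "somepkg", "license": "GPL-3.0-only"},  # hypothetical package
]
flagged = license_violations(components)
```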
And I haven't quite got to the bottom of where the pinning is happening, because they're not even at the first level of the direct dependencies. The reason I noticed was that I did an SBOM scan and got a vulnerability on my RSA package, and the reason was that I'm not using the latest version of RSA. So that was the question: could you detect that? That's something I only discovered this week, which I thought was really interesting to share, just because I happened to have that tool. So: NTIA is a good benchmark. It's hard; accurate supplier information, I think we all know the challenges of that. But enrichment is good: can you enrich your SBOMs? Look for that threshold, look for that utopia moment where you get 10 out of 10 for your SBOMs, because the more information you have, the more useful it's going to be for all the different use cases people will put your SBOMs to. And it is possible. So, this was the original use case: what's changing, what's not changing, who's changing what? These charts are all driven by Matplotlib; they're in the sbomtrend tool, so if you want to play with them as examples, you can. The top line is the number of packages; the red line is the number of changes on a week-by-week basis. Every week at least one package changed. At least one. Which is good: the ecosystem is alive. But what are the triggers for those changes? You can see some of the spikes relate to when we did an update of the requirements. But generally things are changing all the time. I was trying to show the rate of change, and things like that. So I came up with this train-line-style diagram.
A steady diagonal line like that means it's changing every week. Except for the holidays. From the audience: except for July and August. Oh yes, well, I think we can understand why. Actually, that's probably quite a point: look at time. Time is a driver as well. Do lots of things happen at Christmas? Do lots of things happen in holiday periods? Interesting. Another comment: more people work at Christmas. Well, we've seen problems where people have released something on Christmas Day. Anyway, that's a really good observation; I hadn't thought of that. So you can see these things. And these are just the packages that changed more than five times in a year, which is what, 20-odd packages. And if I look at the ones that frequently change, quite a few of them are direct dependencies. Why are they changing? Mostly for features, not vulnerabilities. But can you find out? And there's one, the rich package: why did it change? They actually removed an unmaintained package, which got me onto another little track that you're going to see in a minute. So yes: security fixes aren't the drivers for many of these changes; features are. And then if I look at the direct dependencies, again, they're going up; some of them change a little more slowly. That one is a case of no longer being used. So you're getting quite a rich picture of change, which says: if I had pinned on the first or second of January, I'd have missed all these changes. A lot of changes. The features may be performance improvements, et cetera; you might want them for good reasons. And this shows the ones that changed only once, that essentially haven't changed.
And the red ones are the ones that haven't changed in two years; I just took two years as an arbitrary value. And you think: okay, there are 10 of them that haven't changed in two years. Does that not start ringing a bell? Is it maybe unmaintained? Is it now an unmaintained package? I don't know what industry uses in terms of judging the health of an open source project; is two years long enough? And it says maybe we need to look at alternatives. Right at the top there is tomli, which is now in the standard library in Python 3.11. Until I did this, I had missed that. So I raised a pull request to say: if it's 3.11, we want to use the standard library, not the third-party version, on the assumption that the language ecosystem libraries are going to be better maintained, or have a greater need to be maintained, than a community package. Right? So change happens, but we should be very careful with pinning, because direct dependencies change frequently as well. So there's a pinning debate. Right, let's look at data analysis. The first thing is licenses. I've tried to map the license metadata to SPDX license IDs, and if it doesn't quite match, I have a few rules to try to alias them. So is it "Apache 2"? Well, that's Apache-2.0, that sort of thing. And Apache 2 is a really good example: people don't know how to write the Apache-2.0 SPDX license ID. A question: are you pulling this from PyPI? Some of these come from PyPI, yes. A comment: PyPI is a disaster in terms of specifying licenses. Right, you're preaching to the converted here. As a community we should be looking at this and fixing it, because many of the packages with license failures have been updated in the last 12 months. Probably because they've got new features, but metadata doesn't really matter, does it? Well, metadata matters now, in the world of SBOMs.
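The aliasing rules mentioned for license strings can be sketched as a small normalisation table. The alias entries are a tiny hand-made sample, not a complete mapping:

```python
LICENSE_ALIASES = {
    "apache 2": "Apache-2.0",
    "apache 2.0": "Apache-2.0",
    "apache software license": "Apache-2.0",
    "mit license": "MIT",
}

def normalise_license(text):
    """Map free-text license strings (as found in PyPI metadata) onto SPDX
    IDs, falling back to NOASSERTION when nothing matches."""
    key = text.strip().lower()
    if key in LICENSE_ALIASES:
        return LICENSE_ALIASES[key]
    # Already a known SPDX ID?  (Illustrative short list only.)
    return text if text in ("MIT", "Apache-2.0", "BSD-3-Clause") else "NOASSERTION"

spdx = normalise_license("Apache 2")
```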
Let's look after SBOM metadata as much as the code and the tests. Told you, I told you. Right. So I summarise all the licenses and things like that, and you can probably quite quickly get a summary: have you got a license problem? Okay, cve-bin-tool is GPL; everything underneath it is okay. So you may be able to see quite quickly whether you've got a license compliance problem. The other thing is you can look at all the suppliers. Do you have a supplier that you really need to be loving and looking after, because you're very dependent on their packages? In this case we've got 60 different suppliers, so it's not quite that obvious. But this could be a way of understanding who the suppliers are that you depend on, that you need to maybe get closer to — maybe supporting, maybe helping. And I'm thinking about the world of the enterprises as well, who might need to do this. But four of these packages have no supplier. Three of them were updated — why didn't they update the metadata? And then, just as a summary: I've got a tool that just diffs two SBOMs — arbitrary formats, it doesn't matter, you can compare CycloneDX and SPDX — just to see generally what's changed in the 12 months. Well, 39 of the packages have had at least one version change, and we've lost two packages and gained 11. So that's about 15% growth in the number of packages we depend on. And then I did a scan. And the last one is: I was expecting the last SBOM to be clean of vulnerabilities. The reason I've got one vulnerability is because of the RSA problem that we heard about earlier. Potentially. So, takeaways — I'm doing all right for time here. Right: generate your SBOM for each version of the supported environment you're targeting.
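Aliasing free-text license strings to SPDX identifiers, as described above, can be sketched as a small normalisation function over a lookup table. The table entries here are illustrative assumptions, not a complete mapping:

```python
# Illustrative alias table: lower-cased free-text -> SPDX license ID.
ALIASES = {
    "apache 2": "Apache-2.0",
    "apache 2.0": "Apache-2.0",
    "apache license 2.0": "Apache-2.0",
    "apache software license": "Apache-2.0",
    "mit license": "MIT",
    "gplv3": "GPL-3.0-only",
}

def to_spdx(raw: str) -> str:
    """Map a free-text license string to an SPDX ID where possible."""
    key = raw.strip().lower().rstrip(".")
    if not key:
        return "NOASSERTION"  # SPDX convention for missing metadata
    return ALIASES.get(key, raw)

print(to_spdx("Apache License 2.0"))  # Apache-2.0
```

Summarising the normalised IDs across an SBOM then makes license outliers (a lone GPL package at the top of an otherwise permissive tree) visible at a glance.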
So if you're supporting Python 3.8 to 3.12, generate five SBOMs — and also do it for both CycloneDX and SPDX, because the generators may differ: there may be different data, there may be enrichment between the two. Please, as a community, can we improve the metadata? We are all responsible for that. Once you've got an SBOM, that's the start of the fun: start analysing it, start using it, start reading it. I'm sure many of you are quite familiar with reading JSON — help the people around you who aren't. Look at the data; there are some documentation tools there you may find useful. This is the thing that we do when we install: we install with pip with this upgrade strategy, which tries to make sure we're using the latest version of everything. But obviously that doesn't stop pinning, so it's interesting — I need to think a little bit more about that with the Python teams. Keep your packages up to date. I have a problem in my own stuff, because I just do pip install and it'll say, oh, I've got Beautiful Soup — yeah, that'll do. It's not the latest version, I'm sure. So just be aware, and use the latest version of Python. I have another tool called sbom4files, which looks at the files, so you can look at the change of files as well. That's a bigger thing. So you could start to see the amount of change in, say, one of your source trees in your repos — the test files changing, for example. And then obviously add vulnerability scanning as part of the generation. So this is what you all probably want: the list of other tools. The presentation will be in the cve-bin-tool repository — there's a pull request there that just needs to be approved. Those are all the tools; I haven't written all of them. But if you want to follow me, that's me on LinkedIn, that's me on GitHub, and that's me in Manchester. Okay, thank you very much.
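The upgrade strategy mentioned above is, I assume, pip's `--upgrade-strategy eager` flag, which upgrades dependencies even when the installed version still satisfies the requirement. A minimal sketch that only builds the invocation (so nothing is installed as a side effect):

```python
import sys

def pip_upgrade_command(requirements: str = "requirements.txt") -> list[str]:
    """Build a pip invocation using the eager upgrade strategy, so that
    transitive dependencies are upgraded too, not just the ones that no
    longer satisfy their requirement specifiers."""
    return [
        sys.executable, "-m", "pip", "install",
        "--upgrade", "--upgrade-strategy", "eager",
        "-r", requirements,
    ]

print(" ".join(pip_upgrade_command()[2:]))
```

The command list could then be handed to `subprocess.run` in a CI job that regenerates the SBOM after every refresh. Note this does not override explicit pins in the requirements file, which is the tension the talk points at.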
So, on your list of increasing hidden dependencies — is it per package, or per package version? Okay, this is about the picture that I showed, the hierarchy of all the packages. They are all packages — Python packages. So if you have two different versions of the same package, do they appear as one? No, they would appear as a pair if you had that. I've never seen that. Oh, okay. Well, then you've been lucky. And we have a presentation coming up on that. Next question: you say that updates are driven by features — do you see that changing in the future? Okay, the question is: I mentioned that a lot of the updates appear to be driven by feature changes rather than security fixes; do I think that will change with things like the CRA? Probably. It depends. There's time for one more. Okay, this is about improving the metadata upstream. Probably the two things I would say are: licenses, to support the license compliance teams; and secondly, the supplier — because that identifies where you've got your software from. Do you know where you've got your software from? What can we do to help that upstream? Use SPDX tags in your Python modules. You can do a pull request. And recognise that there is a community out there: if you've got the effort, do it. Because we know the open source community is stretched for volunteers, and if the enterprises are taking value from it, can they not contribute back? Because you're going to help many people. All right — that's time for me.
Phantom dependencies in Python (and what to do about them)
Good afternoon everyone. Give me a moment and I will begin. You're on green? I'm on green, yes. Okay. Hello everybody. My name is Georgios Gousios. I'm head of research at Endor Labs, and part-time associate professor at the Delft University of Technology, in a nearby country. Let me say a few things about myself, since it's customary here to introduce the speaker. I have been in this particular field of SBOMs, dependency analysis and so on, as a researcher, since 2015 more or less. So I have seen all the failures coming up in real time — left-pad, then SolarWinds, and so on. In 2020 we organised the dependency management room, which was a precursor — perhaps slightly overlapping with this particular room — where we introduced the FASTEN project, which was one of the first projects that did reachability-based analysis of SBOMs and software dependencies. One thing led to another, and this project basically became a startup — perhaps now we can call it a scale-up — which is called Endor Labs. It is based in the Bay Area, and we are basically providing solutions in the space of software composition analysis, plus-plus. We will see what the plus-plus is today. Well, by describing my history so far, you might have understood that I'm of a certain age. Part of being of a certain age is that I have a teenage daughter who is into pop music, and lately she came up with this song, which I really, really like, because it describes the problem that I'm going to talk about in almost excruciating detail, I would say. Let's read the lyrics a bit: like ships in the night, you keep passing me by, just wasting time, wasting time trying to prove who's right; if it all goes crashing into the sea, it's just you and me trying to find the light. What are the ships? Any guesses? Let me help you a bit: there are two ships.
One is the package manager, where the developer declares their intended dependencies, and the other ship is the compiler and the runtime of the language — depending on whether you're using Python or C or so on — which has its own view of dependencies. Right? Which ship is right? Who says both? Let's see — one vote for both. Who says it depends? Okay. All right. Who says that the compiler is always right? Yes. Is it ground truth though? They're both wrong all the time. The compiler can be ground truth, but what about when you load code dynamically at runtime? Okay, yeah, that's the issue. And the developer over-declares the dependencies — for test coverage and contract validation — and horrible things happen, because you pull in too much code, bring it all in, and things go even further. Fantastic, yes, exactly — this is where I was trying to get to. Maybe I should repeat all that — well, it will come through in the presentation. So, traditional dependency management and software composition analysis: we're in the SBOM room, I guess everybody knows how it works. We start a new project, we create a package manager manifest — requirements.txt, pom.xml, a Gradle file, whatever tool we're using. The build system and the package manager, when we're trying to build, download stuff from the Internet — side parenthesis here: how can we trust just random stuff we download from the Internet? That's a different question; close parenthesis. The package manager copies all the files into a directory, and then the compiler starts using those dependencies in order to compile or run the project. This is what we know. But if we think from a higher-level point of view, we get more or less to this: the developer declares their intent in a manifest file — that's the requirements.txt file.
A package manager does the dependency resolution, we get the dependencies onto our system, and we have a compiler. Now, the developer themselves also writes source code. The source code might or might not be using some of the dependencies; it might declare dependencies that it is not using; it might actually depend on transitive dependencies that other dependencies bring in. All those nice things that make software composition analysis, the way most tools are doing it at this point — well, I wouldn't say wrong, but perhaps incomplete. The output is always a program, and the program is the source of truth of everything. So what I want to advocate here is that when we're doing software composition analysis, there is a lot of stuff that we don't really identify. Can you guess where that stuff comes from? For example, I write a Python program and I have some packages that I have installed through Git. What else? Copied code in my repository — that's a dependency, right? But I don't really maintain it. What else? All of those fancy shift-left dev tools. Okay, so how? Right — they're pulling things in, they're updating and flagging, and sometimes automatically changing things. Yes, there are some tools that are indeed pulling things in — Bazel, for example — but Bazel also depends on the versions that you provide to it. How else can I have, let's say, dependencies on code in my program? Yes — the runtime library from your compiler. Fantastic, yes: libc. This is a dependency. It gets installed from the operating system, for example, but sometimes — when you do JNI calls through Java, say — you depend on libc. Various ways. So things also tend to become out of sync: developers import new dependencies, and the dependencies are in the environment.
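The manifest-versus-source-code mismatch described above can be sketched as a set comparison between declared dependencies and imported modules. A simplifying assumption here is that the distribution name and the import name match, which is often false in practice (e.g. the `Pillow` distribution is imported as `PIL`) — resolving that mapping is part of what real tooling must do:

```python
def dependency_drift(declared: set[str], imported: set[str]) -> dict[str, set[str]]:
    """Compare dependencies declared in a manifest against the modules
    actually imported by the source code."""
    return {
        "unused": declared - imported,   # declared but never imported
        "phantom": imported - declared,  # imported but never declared
    }

# Illustrative names only.
drift = dependency_drift(
    declared={"requests", "click", "rich"},
    imported={"requests", "rich", "numpy"},  # numpy came from the environment
)
print(drift)
```

"Unused" entries are the redundant dependencies the talk mentions; "phantom" entries are the ones that only work because the environment happens to provide them.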
In some cases, dependencies are declared in a testing scope, but we're still using them in production — for misconfiguration reasons, or for any other reason. In some cases, dependencies are removed from the code but not from the package manifest, so we have an extra dependency that we somehow need to maintain but are not using — it's basically redundant. And then we have Python. This is the average Python repository, especially if you're dealing with machine learning and AI stuff: you have a requirements.txt file or a Poetry file, and then a list of instructions that looks like this. It tells you: please install TensorFlow at this version; please install NumPy at this version, but with this patch for this particular GPU, because otherwise the thing is going to be dog slow. So how can we actually maintain this? How can we even discover those dependencies in the first place? Let's take a look at this particular project. This is from OpenAI; it's called baselines. It's pretty old, as you can see from the timestamps over there, but it still has exactly this problem: it tells you to create a virtual environment, and then it tells you that you need to install TensorFlow between 1.4 and 1.15 by hand. It's not part of the requirements.txt file, for some reason. Using our tooling, I have run an initial scan of this project without considering what I call phantom dependencies. What you see here is more or less the same thing that most SCA tools would give you: all the packages that are in requirements.txt, plus their transitive dependencies. So we have some direct dependencies, and then some transitive dependencies over here. That's it. Is that it, though? Well, we will see at the end of this presentation, but first I need to run a full scan of the project, with phantom dependencies enabled, which I will leave running in the background. Yes — the idea here is not to watch it run.
But what I want to show is basically that the thing actually works — it's not vaporware. All right, so what are phantom dependencies? Phantom dependencies are the thing we have discussed: dependencies that are provided by the system and assumed to be available somehow in the runtime of the project. They can come from various locations, some of which we have already described. And if you think this is just a problem with Python — it's not. It's a problem with npm as well, with Java if you have plugins, even with native environments. As I said, what I consider the ships in the night, going back to the song, are basically two things: the package manager and the compiler-and-runtime view. The package manager usually sees way more dependencies than the runtime or the compiler actually uses, because there are a lot of transitive dependencies to which we have no reachability path — starting from the client, there is no calling path from the client into the actual transitive dependency. What we have found in the company is that, of all the source code that is imported into a repository — and around 80 percent of the code in an average repository is imported — only around 15 to 30 percent is actually being used. So there's a lot of code that we import which perhaps forms an attack surface without ever being used. It would be very nice to clean this up. One way to do that is with reachability, but that's not this talk — that's another talk. Okay, so how can we identify phantom dependencies and do this type of cleanup? We need to do program analysis. Any idea what program analysis is? Yes? Some people might have seen it. No, it's not parsing — parsing is one component of program analysis, the first step in the whole program analysis chain. Sorry? You disagree with my first bullet.
The source of truth is the source code? Yes. It is not? What about a C project, or JavaScript pulling in dynamic code, or anything written in Lisp, where code and data are the same? So you love the optimism that the source code determines the dependency graph — it's actually the binary. Could be true, yes — for languages that have binaries. For languages where the source code is what gets executed, the source code is the truth. Right? No? I agree about the nature of source code and data, especially when we move into data. When we're talking about why we're doing the program analysis, we're trying to understand what we're bringing into the project. Let's stop here and come back to it. I agree and disagree: we need to start with program analysis, and program analysis necessarily starts from the source code. So perhaps this formulation here is not precise, but it is something we can work on, hopefully. So why do we need proper program analysis? Somebody could say: well, if I track all the imports in my Python code, it will be easy to identify all the libraries. It's easy — I'll just write a Perl script, or a Python script these days, to take the imports and try to find which libraries provide them. Well, this is a hopefully comprehensive list of the ways of importing in Python. You can import a module; a function from a module; all functions in a module; you can alias a module to a different name; import a package; do a static import or a relative import in the source code; use importlib — in which case you can also alias importlib itself, and then you need to de-alias it in order to be able to track the import — and you can even do an eval of import code. Good luck doing that with a simple script. No — the code may be inside it. Exactly, that's my next point. All those things can be in a conditional statement.
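The import forms listed above are why an ad-hoc script falls short. Walking the AST at least catches every static form in one pass — dynamic `importlib.import_module`, `__import__` and `eval` calls still escape it, which is where fuller program analysis comes in. A minimal sketch:

```python
import ast

def static_imports(source: str) -> set[str]:
    """Collect top-level module names from every static import form:
    `import x`, `import x as y`, `from x import f`, `from x import *`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level > 0 means a relative import within the project itself
            found.add(node.module.split(".")[0])
    return found

code = """
import numpy as np
from collections import OrderedDict
try:
    import simplejson as json
except ImportError:
    import json
"""
print(sorted(static_imports(code)))  # ['collections', 'json', 'numpy', 'simplejson']
```

Note that the try/except block yields both candidates — the AST alone cannot tell which branch runs, which is exactly the conditional-import problem the talk raises next.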
You can have a try: import foo, except: import bar; or you can put a condition on some variable and then import a custom library. Those are the reasons why we need proper program analysis to do this. Okay, so the steps that we have taken to solve this problem: first of all, we need to start with the source code. We start, let's say, with the client code, and for each file in the client code we follow all the imports. How can we follow all the imports? We first need to have analysed the client code, the virtual environment, and the site packages that come with the operating system — basically, all the locations from which Python can tell you it can find code. If you open the Python interpreter configured with a particular virtual environment, for example, you can ask it: please give me all the locations where code can be found for this particular execution. Okay, so we start with that. After we have analysed and mapped everything, we start from the client code and do case-by-case import analysis: I import this particular library, I go into the file — the module, basically — that provides this particular library, look at its imports, and go on, transitively, until the whole thing has been exhausted. This has been done by a bunch of build tools when they resolve dependencies — did you leverage that code? No. To repeat the question for the audience: the question was that this analysis has been done before — did we reuse it? The answer is no. Thanks. Okay. Now, how to do program analysis in Python? As I said, you need to have parsed everything to begin with and resolved the types for everything. If you have type information, the analysis is much more precise, because you can track specific function calls on specific types.
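The first step above — asking the interpreter where it can find code for the active environment — can be sketched with `importlib.util.find_spec`, which consults the same search locations (virtual environment, site packages, built-ins) that the running interpreter would use:

```python
import importlib.util

def locate(module_name: str):
    """Resolve a module name to its on-disk origin, exactly as the
    interpreter configured for this environment would resolve it."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        # Imported in code but installed nowhere: a phantom candidate.
        return None
    # A file path, or "built-in"/"frozen" for interpreter-internal modules.
    return spec.origin

print(locate("json"))  # a path inside the standard library
print(locate("surely_not_a_real_module_xyz"))  # None
```

A transitive analysis would then parse the located file, extract its imports, and repeat — with caching, since popular modules are reached from many paths.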
Then you can take one of the existing static type checkers — like mypy or Pyright; we're using Pyright, for that matter — to parse the code and do all the resolution. So it's not extremely hard to do, but you need to be aware that it needs to be done. I will show you the results of the scan — yes, the scan has finished just in time. As you saw before, we had 11 dependencies. Now, by doing this phantom dependency discovery, we have found 51. And according to our findings tooling here, what we can see is that in one of those dependencies we found an actual vulnerability. It's in setuptools, of course, so it's not necessarily something that can be actively exploited, but it can be exploited while the package is being installed. So we have found that there is this vulnerability, and this is a call chain that goes from the client source code into the vulnerable code. If I hadn't done this analysis, I wouldn't be able to track this or do anything with it — this would be information I simply wouldn't have. Of course, this is a trivial example that we use for demos; we have found actual vulnerabilities when running this analysis for clients that I cannot disclose. But this at least gives me an indication of how this phantom dependency problem can be tracked and, hopefully, solved with program analysis. After that, everybody wins: developers can know what is vulnerable in their code; they can accurately map and create accurate SBOMs of what their application is consuming; CISOs can be aware of vulnerabilities that they otherwise might not have known about. So everybody wins. That's it. Thank you. Yes — I'll go first. So, great insights on Python — I'm a Python developer, so I know at least the majority of these things. But the tricky thing is: here you are talking to the people who are already interested in SBOMs. How do we spread the information?
How do we get all the open source communities to know about all these issues? How do we get them to publish their SBOMs? Your stuff is important, but making all the communities aware of this is even more important. Yes — so the question is how we can make the communities aware of the problems I have been describing, and my answer is that this is part of the effort: we're giving these talks to make the communities aware that those problems exist. The tooling that I have described might sound, let's say, extremely complicated and whatnot, but I can actually show it running — I have it here. This is our analysis tool. It is closed source at the moment, but it can easily be re-implemented, I think. If you ask me, I could re-implement it in a couple of days, though perhaps somebody else could do it a bit faster. As you can see, it just goes, transitively, over all the code that is available to this particular project. Yes — there was a question. You've answered it, because I asked: is this open source? No, it's not open source, sorry. What you've done is a great benefit to the whole of the open source community. So if you could describe the architecture and the algorithm, we're sure the community would jump on it and build the various projects. Is that something you could share with the community? We need to think about that, yes. The question again was whether this is open source; the answer is that we need to think about it as a company. I don't know, is the honest answer. Yes. Are you scanning for Python binary dependencies as well? Because TensorFlow, for example, includes FFmpeg. So — the question is whether we are also considering the binary dependencies inside Python packages, in the wheels, for example. The answer is no — not yet.
So we are trying to get into cross-language analysis. This requires modelling the interface between Python and the native library. We're getting there. Yes — the question is: we're doing static analysis, and static analysis has false positives; how can we prevent those false positives? In this particular case, the analysis that we do doesn't have false positives, because it is only considering imports. False positives in static analysis come from the fact that, for example, you have a virtual dispatch call site: you might be linking to multiple implementations of a particular interface, in which case you might be over-linking. We're not doing that at all here. Sorry? Do we have false negatives too? We cannot have false negatives here, because we are considering the source code as ground truth, which means everything that is in the source code will be parsed and reported. Except for eval. Sorry? Except for eval. Except for eval — yes, eval is eval. Right, final question? So you have a method that tracks the imports — what if there is an imported module that one function uses, but that function is not reachable from the main code? Isn't that a false positive? If a file is imported, yes, we will still analyse it, because it has an import — it will be reported as a risk, and it is a risk. But if it's never called... So, sorry — the question, if I understand it correctly, is what happens if we have a file with an import inside a function that is never called. In that case we will not report it, because we first consider the call graph of the whole thing.
Thank you again, Georgios. Thank you.
Open Source based Software Composition Analysis at scale
First things first: welcome. My name is Marcel Kurzmann — you see my company, Robert Bosch, but what is more important, I think, is my community, and I also have some jokers in the audience for the questions later, so I'm happy that you're here too. And — that went too fast. Our community was formed a few years ago already; I think we were ten people in the beginning, and we had exactly the same discussions some years ago that we had this morning, so it's good that we now have a bigger audience. And if you want one takeaway to go out with, please have this one: join us in our tooling community — sharing creates value. I really liked that name from the beginning, because that's really what we are here for, and I think this is an entirely non-differentiating thing, right? We all profit from having better SBOM tooling, processes, et cetera. The title of my talk is about doing this at scale, and here is where my organization comes in, because we're not doing this for fun — it's more of a hygiene topic, so not every developer says "yay". Nevertheless, what is important for my company is to be a good citizen in the open source community. That means, on the one side, if you use something, we want to give back — so we also do open source of our own. That's what we called, from the beginning, eating your own dog food: if our developers need to suffer through all this open source paperwork, we should also have a clue about it. And when we started with this automation — "beginning" doesn't mean the beginning for Bosch; Bosch began much, much earlier — our journey began here, with a typical Java Maven project.
Before I can tell you all the details: I made this fact sheet, which will accompany us through the presentation, just so you get the idea. On the right side you see a fact sheet that I took from NASA — which made the presentation preparation really complicated, because that stuff is so interesting that once you dive into it you get distracted very, very quickly, so I've also put the link there. So now we have the idea. For me, Java Maven is a community. We had projects approach us: okay, we need some kind of support for doing this SBOM stuff — in the beginning mainly triggered by license compliance. And in the beginning we didn't have that fact sheet, right; we just started, and we were done. We said: okay, now we create the SBOM automatically, the FOSS compliance bundles, et cetera. So I said, okay, mission completed. No. Then the next wave came up: web apps, JavaScript, npm, blah, blah, blah. So here again, a fact sheet. Some things looked similar, and here the build mode is very important — you could ask why you should automate all this if you release once a year, but we had this hard requirement from the beginning to do CI/CD. For JavaScript, though, our paradigms didn't work, because as most of you probably know, if you use one dependency you get thousands of transitives; also the one-to-one assumption — one source is one binary — didn't work. So we had a hard time, but we somehow handled it, 80 or 90 percent — here I would rather call on my jokers later, but I think we handle it now, somehow.
Also, when we started to do this — because, as I said, the developers were not super keen on it — we tried to push it by centralizing it. Looking back from where we are now, that was perhaps not such a good idea; better to decentralize, because now we have also centralized all the problems we've heard about all morning and all afternoon, right? This is all on our desks. And this was also the biggest learning: we still had a vendor solution at that time, and saying "we need this, we need that" was hard. This is where the community and open source really helped us, because then we could build what we needed at that point. The other thing — here I put Mars — is that we still had all the other projects, right? We couldn't just say "mission completed"; they continued and kept doing new things. Well, this is innovation: you never stop. So we were going from one orbit to another — and the little wordplay there is intended, because most of you will know that we use ORT, the OSS Review Toolkit, to get there.
And, so you know who the crew is in this: I'm not the developer — that's why my joker comes later — and I'm rather looking at it from the process perspective. But we also had the development team; we had developers we needed to talk to. And the next stop was then embedded C, and that is a completely different planet. Here I also learned that Saturn is a gas planet: there isn't even a surface to land on, so you need to stay continuously in orbit, sending out probes. And again there are differences — you don't need to read all of this; it's just to give you the idea. I also went back in the history: when we came in, Thomas and Sebastian were already pretty busy, and we had this starting point that was already supported in the beginning. Then you see the history — even more planets coming up, and I'm still convinced that the right side is not the end. And now, "at scale": for me, with all those planets, at scale means scaling in the horizontal, if you want. As I said earlier, we centrally supported this with our two teams, and we then needed to scale horizontally, ideally supporting all the teams with all their different planets. From the experience we had in the team, it was very helpful — within the team, but also with the customers — to have those fact sheets. They developed more and more into an onboarding tool: okay, what are we talking about, what are you doing there — helpful for the developers we contacted as well as for ourselves. On the other side, when we started talking with teams about how they do it today, we started with the documentation. Why? Because we also have teams from automotive, right —
they need to have audit trails, all this reproducible documentation. So we documented it, and that was good, because then we could reuse the concepts, iteratively improve them, and finally say: okay, that's good — this we can standardize, this we can reuse. That is the evolution: once you have standardized, you can start automating. You cannot start automating directly from the beginning; you need this information first. The other thing is: especially if you start from scratch, you might reinvent the wheel. And with all the tools — I think Anthony showed it earlier, you have a list of links to several tools — which one is the right one? Here too, the concept documentation was very important, together with the fact sheets, to see what the correct solution for my problem is. Our next stop, you see, is embedded IoT Linux — and here's an invitation: join us, because we still have a lot of fun ahead of us to manage this. Here, again, things are completely different: where we had build-less approaches, we could say — we had this discussion earlier — okay, source code is the truth, perfect; we just take the repository and analyze it. But here we obviously need a build-based approach, because the build also does a lot of things — and I just learned: compilers, blah blah blah, all this other stuff. Then, coming back to the point — and it's the last point here in the background — the learning is that we need those fact sheets in order not to lose time up front, to say: okay, what are we talking about? We also came to a generic architecture model, and this is what you see in our tooling group, so that ideally we use the same wording and a standardized representation. But the other thing is: once I have this — where can I find it, if I have a good match? And then I took
here this example from online shopping if you take want to shop close and I that I found what was a nice thing I say okay I would love to have this also for us right so am a man woman or a kid do I need clothes or shoes and check its blah blah blah and then you get the selection of what you can buy but at that point where you need also to give now to narrow down the selection okay you need to measure okay what size do you have and now okay this looking at my belly this is then exactly where he said you here you need the the semantics okay where do I measure this right and this is what is then where we need the genetic architecture model for yeah now you see the story what we do we prepared this already the our new project eclipse up oapses so and I as I said it has several aspects so for those who are not familiar with what is up oapses I also copy pasted the definition again if you dive into this world you will probably need some some hours even or days because that's very very interesting but I also liked the link here so therefore I also put the two definitions so up the abscess is that fast fast from the center of attraction right the high point in an orbit and up oapses the other thing is then the fast fastest away from the body it is orbiting so therefore I made this little picture it's it's potentially wrong so please but then you just say there you here you have the center of attraction left side and the up oapses is this point so the really when you in the orbit that is fast away from this planet and I think that fitted very well because if we switch between the different plans you need to switch the orbit and that's the best point because then the attraction of the original is is so low and here you will not really find a lot of contents yet because this will be in the next days and weeks but the also the proposal the text here I put the link so what is inside so this is more or less the trailer of a film if you want so you can see the film later we I 
also give you some hints in the in at the end but just that you know what we are talking about and therefore I saw as I'm the process guy I said sorry I had also I also wanted to be a part of this projects and collaborate so therefore there are some process level documents on and that would rather cover this horizontal scaling right so that you can say okay now let's have a common way how we can map this the other thing is that we are using a lot the OSS review toolkit and here that would be then the vertical level because here we need to scale also as we need to run really really a lot of scans every every day and here we have performance issue in the way we use it past and here you see this would be also some code contribution that we would do is that you can expect and the upper part as I try to use the the icons a little bit is then really the idea what is the goal that you can just come say okay what are my needs and what is already there so to really jump start your process definition ideally just copy pasting then the the the templates and you can directly start ideally then via our tooling group mapping the tools that are available on the technical level so here we're using word so if you are a single project then anyway not in a target group of my talk today so then just go to or page or then if you have built based issue then you join us in the tooling group you will find your solution if you have an organization that has the same issues than my organization that you really need to use that at scale then you would invite you also then to join us in the eclipse apoapsis project and here there the old server you can what what we will contribute there you can really just take it and build up your own service in-house to automate then this software composition analysis we're coming heavily from the open source compliance but we also have security aspects and we heard also safety export control so everything that will be there and this is also why I said okay 
this is important that this is not this is just another puzzle piece if you want but this very important is the cooperation with the spdx here we also call for action invitation we have this new operations profile workgroup so we are invited also to join us there that will start in the next days I think the first first meeting on the one side and we have this dependencies to the open chain tooling group where we do the capability map so also that we do not need to reinvent all the wording so this is already there so you can already check it out and on the technical level here this is also where the maintainers are the same right in the uss3v toolkit where we have strong connections and then also you can see the technical dependencies but as I'm the process guy and I have a lot of chokers in the room but we have not that much time I will tell you later I explicitly asked my colleague so for those who are interested I have an offer later to you so on the process level side so this will be then this is just a work in progress thing where we have the generic architecture that we map against the capability map from that we already developed with the with the tooling group so that we have then the ideally the same semantics so just that you have this picture in mind as I said we're coming from open source compliance so therefore the current capability map is open source compliance related so for if we then also go further with security then I think we also need to define further capabilities then the question is if we do this in our group or in other groups but we will do it together well that's that's for sure the other thing is this and therefore I originally called this abstraction layer so it's abstraction layer not in the sense of a software that we create but rather on a process level that you say okay we have different stakeholders we have different products and here this is where the operations a work group from spdx then will will it will help yeah to say okay 
to rather say am I a man woman kid etc so where I'm in on the one side on the other side also having some standardization to have those fact sheets so is this the perfect solution or next to a perfect solution or at least something I can start with because here and this is where my organization is potentially special here we are in the middle of a supply chain typically in the automotive industry so we also we are not alone we cannot just say okay we will use this or we have legacy therefore we have to have flexibility for you that would be then the possibility to still keep the flexibility and say okay I have just as Anthony showed I have choice right and that's not bad of the on the other side and therefore we started this touring group is that we are only we were only a small group so we said we cannot maintain thousands of redundant things so we should focus and especially if we still have gaps on the other side so then rather say okay let's consolidate on the one side and then use the resources rather on another place then also the blueprints as I said so we're here we have everything we heard today beginning with deja-court from from Philippe now a phosology I saw here the colleagues also software c60 or obviously false light so there are so many things where you you're totally lost right so there hey I know there's a solution but I don't know does it fit for me or not so I this is something we should tackle and the other thing that's funny because we also talked about this also the question so who am I right so am I rather developer I need rather this for my use case or or I'm rather try to manage this thing or audit thing so here this is important that we document that somewhere so I would offer to the start here but I'm open also to do this somewhere else especially as this will be necessary also for us to do the testing later because with a well-defined use case it's easy then to start then the test case yep thank you the test case definitions and the 
better the test the better than the solutions the on the technical output so the ord server what will that be so the goals of the server is that we have a unified API that you can use so if you run really at scale from your CI CD you can just call this API and it will all do do all the rest more or less easy setup integration you can read it on your own what we do not have yet but is one of the goals is frontend because this was typically then the the issue if especially if we were compared as those as River to a kid community with the vendors the vendors typically come up with a new and find ceo I blah blah bling bling bling and we will just well we have a tool that just does the do the business right but so we we know it's it's important and you see then later in the outlook and here I would potentially also have a choker then in the in the audience what what will come so that you see how would such a setup look like so you have a development team who wants to use this software composition analysis service so they can just use the API the ord server would do the rest of here you see then the different workers Martin called that from ord ord analyze the downloader scanner so these are the typical steps that you need then ending then with a reporter but we have teams that only needed parts of that then also sometimes the one thing uses a lot of performance so therefore the balancing is very important and that's what what you can then do with the server here we have some usage blueprints that we already will prepare with the project also in the use case collection the things that we use then for testing it and then another use case as I mentioned would then also be those those dashboards those your eyes people that want to know more this will also work then with the server just here the MVP that you will be able to expect then in the next weeks to be in our repository the repository was already there so we're preparing currently the initial contribution and this is 
the next step and that was I said so now you would say I want to know more about this ord server how does that work so there's an invitation we have twice a month the tooling group meetings every first and third one Wednesday the first Wednesday in the morning for the rather Asia Pacific region the third Wednesday era and the european sorry the european afternoon will also to cover better the time zones with 50 us and yeah so please check out the open chain global calendar so if you if you're interested and the next one I will moderate and Martin will be there so for all detailed questions regarding the technical stuff and we can also discuss the the other parts so I would really be glad if we could have some follow-up discussions in our existing communities then as I said this is yeah depending also on the on the load the initial contribution that we prepare ideally we should have it ready then for the community days so we plan ord community days beginning of March I hope I have put the right link so please register if you're interested because then we have also a detailed session about that and about the front end so we will not provide the front end in the beginning but we already in discussion with double open project so that they are also interested in providing something but as this is an api so everyone would be welcome to have their python shawasaki whatever you want but yeah at least we will care that we have a cool server in the background thank you very much stay tuned so here are all the links especially so really the who didn't know about this especially Anthony I would really welcome you if you also could present your your benchmark that you did that was really great because we also have the python inspector in the ord analyzer how we would get there and these are things that we really do in the tooling group thank you very much we'll have to answer a few questions address that noise yeah so the question is that we have a lot of noise and transitive 
dependencies I would directly rephrase that so you have dependencies in the build scoped in the test scope whatever and then you get the build of materials with thousands of things and there are irrelevant repositories that's how we call them so here this is a process level thing right this is where we typically come in and check with the teams please unscope that so here we have configuration as code is in the in the ord principle so you can directly silence down that that noise by just excluding then scope things like that but also that depends which planet you're coming from right at scale um at scale this is uh yeah so I would forward you to my joker here which is thomas so again here we should talk about which planet you're coming from right and we can also reuse a lot of this uh in a in a central database this is where also the previous talks were about sharing then creations package configurations uh here you're also more than welcome to uh all collaborate thank you thank you again
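The scope-exclusion idea from the Q&A can be sketched in a few lines of Python. The pattern list and the dependency data layout here are invented for illustration; they are not ORT's real configuration model, which expresses scope excludes declaratively in the repository configuration.

```python
import re

# Illustrative scope-exclude patterns, in the spirit of ORT's
# "configuration as code" scope excludes. Patterns and data layout
# are made up for this sketch, not taken from ORT itself.
EXCLUDED_SCOPE_PATTERNS = [r"test.*", r"dev.*", r"provided"]

def is_excluded(scope):
    """Return True if the scope matches any exclude pattern."""
    return any(re.fullmatch(p, scope) for p in EXCLUDED_SCOPE_PATTERNS)

def filter_dependencies(resolved):
    """Drop resolved dependencies whose scope is excluded."""
    return [dep for dep in resolved if not is_excluded(dep["scope"])]

deps = [
    {"name": "junit", "scope": "testCompile"},       # test-only noise
    {"name": "eslint", "scope": "devDependencies"},  # dev-time tooling
    {"name": "log4j-core", "scope": "compile"},      # ships in the product
]
print([d["name"] for d in filter_dependencies(deps)])  # ['log4j-core']
```

In actual ORT usage the excluded dependencies are not silently dropped: they stay in the analyzer result but are marked as excluded, so reporters can hide them while the record of the decision is kept.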
Panel discussion: Best practices for managing SBOMs in the supply chain
Okay, so next up we have a panel with Arun and Jeff here. So we're going to be discussing best practices around SBOMs in the supply chain. My idea initially, and if anybody has questions or whatever, feel free to jump in, or maybe keep questions for the end, would be to break this into three sections. The first one would be around what you think is best practice or challenging about SBOM generation, then sharing and ingestion, and then basically handling them, storage, all of that. So yeah, without further ado, I'll pass it over to you, if you want to introduce yourself, and then Jeff. Yeah, hi everyone. I guess most of the faces are familiar from the fringe event last week. So yeah, hello again. I'm Arun, I work for Siemens Healthineers. I head secure development and compliance at the organizational level, and I'm also a co-lead of the SW360 project hosted at the Eclipse Foundation. The topic of SBOMs has, as we all know, come into the limelight in the last year or so, and it has had its effect in our workspace too. I only needed to concentrate on license compliance until recently, but now the whole thing has changed, and the requirements have come up in a much stricter way because of the regulation and the executive order. And predominantly from my side, being a healthcare organization, we have a lot of challenges in adapting to or catering to this requirement in a very sudden way, because this new regulation calls for a very stringent, very disruptive way of changing things. And it is very challenging in the healthcare sector, because our processes are very closely tied to the FDA regulations, and our relationship with our supply chain is quite sensitive: we cannot suddenly demand that our suppliers deliver this, otherwise they would be out of the process. That cannot be done. So we have to operate carefully; since it is healthcare, I can take the example of doing a surgery.
We have to keep the patient alive and then do the work. So that is the situation right now, and we are taking it that seriously and taking the steps very deliberately, because right now what we are mostly doing is identifying the current challenges and gaps in the process and evaluating our existing methods of how we meet the regulations. Declaring all the elements of open source usage was already there, but it existed in different formats. Now that the regulation demands a particular format, or a set of a few formats, there has to be a lot of work done in all areas, right from the R&D processes to the contracts with the suppliers, all of these things. And then we have the legal aspect as well. So the current challenge, the inability of a large organization which is heavily regulated to transition at a fast pace, is a major one. But we are closely monitoring the developments and trying to go along with the community on this. And the one core strategy that we have right now is to make sure that we don't miss the train in this respect. But it's a challenging thing; I'll come back to that in more detail later. Hi everyone, I'm Jeff Mendoza. I'm a maintainer on the GUAC project. GUAC is an incubating project under the Open Source Security Foundation, and it's a tool that ingests SBOMs and is then used for querying your SBOMs. I can get more into it later, but as far as SBOM management goes, the idea is that we need to be able to ask questions about our SBOMs, and that's the part I focus on. Some other background for me personally: I used to work at the open source programs office at Microsoft, where I worked on security and license compliance. We had scanners that didn't generate SBOMs; we had an internal format, but they would scan, put the results, all the component versions, into a database, and then give you security and legal alerts based on that.
So I can pull from that experience as well for best practices in managing SBOMs, even though we didn't call them SBOMs. So one question for you, Arun. When you think about healthcare as a heavily regulated industry, and about yourself as a participant that is using software from third parties, the suppliers in this case: how developed is healthcare regarding this kind of information? How transparent is its supply chain? Do your providers reliably give you the information that you need to understand your dependencies and all of that? Could you share a few words on that? So, until recently, getting the entire dependency list was more of a manual, person-centric process where all the architects were heavily loaded with this task. With the internal process of software lifecycle management it was very much engrained in that process, and hence we had to believe the person: okay, I'm the architect, I decide that this is going to be our function and these are the things we need. So for a long time, until automation kicked in, we all believed that that was the truth and that we had only a certain number of components, and we did all the required clearing for those, whether from a compliance aspect or from a security aspect. The moment it transitioned to the package-managed world, where we started ingesting third-party packages through the modern package managers, the situation changed. People were so alarmed by the number of dependencies that they saw, and there was a surprise for these architects, for those people: are we really using this many components? Could they be false positives? Are you sure? Is your tool working right? So this was the discussion at some point. As I said earlier, because it is a regulated industry, for all the listed components there are existing processes where they are validated and evaluated for security and license compliance.
But now that task has scaled up to an unimaginable extent, so we are trying to figure it out. And I guess, when you think about the problem of handling those dependencies, that's where GUAC comes in. It's one of those tools that will let you ingest everything and try to understand the relationships between the components. So we just heard a number of talks about trust, and at some point I would like to understand how sharing and trust go together. I'm not sure that GUAC currently covers that, but maybe you have views on how more visibility into that information can lead to better trust. I don't know about trust, but I do feel that when you're cataloging your SBOMs and have all of the dependency information in them, and you're scanning through all of them looking for vulnerabilities or legal information, it becomes important to see where you're using the same dependency across all of your projects. If you're looking at individual SBOMs, you won't see that; you'll just see the same vulnerability coming up in all these different products that you have. So one thing that correlating those gives you, and this is part of what GUAC does, is that you can see what the path is from my products to the vulnerability. Does one product depend on another, and is that why I have this showing up in multiple ones? Is that what you're getting at: that by looking at how the different SBOMs all point to the same project, you can get a lot more insight into what you should be trusting, what you want to look at, and what you should be concerned about? Yeah, what I'm thinking is that when you have a large body of dependency data, like some organizations such as the healthcare industry have, then you can start using tools like GUAC and other databases to start understanding basically who's telling you the truth, who's lying a little bit, and who's giving you missing information.
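The cross-SBOM correlation described here can be sketched in a few lines. Each SBOM is reduced to a product name mapped to a list of "name@version" component ids, and all the product and component names are made up for illustration; GUAC itself builds a much richer graph with many more node and edge types.

```python
from collections import defaultdict

# Find components that appear in more than one product's SBOM.
# This is only the concept shown in the panel, not GUAC's data model.
def shared_components(sboms):
    usage = defaultdict(list)
    for product, components in sboms.items():
        for comp in set(components):
            usage[comp].append(product)
    # Keep only components used by more than one product.
    return {comp: sorted(prods) for comp, prods in usage.items() if len(prods) > 1}

sboms = {
    "scanner-ui": ["openssl@3.0.8", "zlib@1.2.13"],
    "mri-console": ["openssl@3.0.8", "libxml2@2.10.3"],
}
print(shared_components(sboms))  # {'openssl@3.0.8': ['mri-console', 'scanner-ui']}
```

A vulnerability in a component returned by this query then has to be triaged once, but fixed in every listed product, which is exactly the insight that looking at individual SBOMs in isolation misses.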
And you can start making sure that the players are doing the right job. That should give you a good overview of the state of things, and ultimately make that information useful. So, just to switch tracks a little: I'm curious, how does the healthcare industry share information? How do you share SBOMs? What's the practice when you have an SBOM? How do you give it, how do you receive it, how do you supply it to others? Okay, that's a very tricky question to answer at this point in time. How to put it? Email is fine. Yeah, I mean, I can say that we have not started supplying SBOMs in the currently prescribed formats yet, but various other ways of submitting this information to the FDA are already in process. So I'm not going to talk about the format, but this information is being submitted; it is just not in the prescribed formats such as SPDX or CycloneDX. We are only getting there now. Just to continue with what was mentioned, this brings me to the SW360 part of it, how we catalog these applications. The real challenge for us is that we have legacy systems, which are still 15, 20 years old and still part of machines running across different hospitals, so we still have to maintain them. And now we have a new set of software coming in a new form, the modern applications, compared to our old scanners and other tools. So right now, all the teams that work in this area, in terms of compliance or security, have the challenge of bridging this gap: we have to hold both sides equally well and then think of a solution where we can adhere to this new regulation. For us, the major challenge is dealing with the legacy systems, because part of that is already submitted to the regulators, and we cannot really make heavy changes to it. So that's the tricky part, and SW360 is helping in that way.
But making it adaptable to the modern world is a challenge, so there is still going to be a certain level of manual work alongside the automation to meet the goals. Do those applications get run under some SCA system to break them down and then catalog them properly in SW360? Yeah, we do use multiple tools right now, a lot of tools from the CycloneDX ecosystem. And the majority, I would say, is internally developed, for various reasons, or because we might not have felt that what is available in the open source world fits our requirements. So a mix of both, but dominated by internal SCA tools. A question for you: when do you run your scanning tools, your SBOM generators? At source time, build time, or deployment time? So it depends. The current focus is on license compliance and security. Earlier, the SCA was specifically for security, with a separate team taking care of it; they maintained the vulnerability database, and we linked it to SW360. Right now it is changing: for the modern software, where the modern package managers are used, it is run at build time, and then it is a process downstream. Yeah, in my experience as well, scanning at build time is absolutely required. It's completely different building an SBOM from a source repository, where you're either parsing a manifest or maybe doing a simulation of a manifest install, versus at deployment time, when you no longer have all of those artifacts that may go away after the build is done. So yeah, at Microsoft we only scanned at build time. Kind of a proposal for the group, and a question: should we be cataloging or categorizing SBOMs by the time at which they were built? Is that something that, when we have a database and we want to query them or look something up, would be important metadata to attach, right?
But when you query, as a user, as a consumer, I want to know that, right? As a consumer, especially on the medical side, what do you need to see for support windows, and what do you want to see recorded for support windows? Yeah, I'm just repeating the question so that I can capture it. So, from a consumer side, what do we want to see in the SBOM? Or, as a producer of a device, what do you want to be able to share out, especially with the CRA coming in, in terms of your support windows? And I don't think either format supports that properly and fully right now. So the question is, do you have guidance on things that you're trying to put together for that? Because I wrote some of the tools they are using: why do we have our own tools? Because most of the tools, for example from the ecosystem providers, do not give you all the information that you would need for licensing. So we look for: where do we find the source code? What is the potential or existing information on the licenses? Are there notice files or whatever? And again, with most of the existing tools you have to repeat that work. Most of the existing tools do not provide all the necessary information that you would need for license compliance, or to have a really complete SBOM. And this is still an issue, because it requires that you use different tools and then manually adapt the data. And then we come to the point, Jeff, that you mentioned: yes, we create the SBOM during CI, but later we store it in SW360, and most probably we have to add or modify some information that was lacking at the beginning. Yeah, and regarding the submissions to the FDA part: as far as I know, even with the recent submissions, not in the current formats, the FDA says things are okay with it. But one thing I heard is that when a machine-readable format was shared, the FDA was not able to process it, so they asked for a human-readable version of it.
That's an insider story. The FDA will do the conversion from SPDX to Excel. Okay. Yeah. So, adding on to what Thomas mentioned: yes, because the priority in our organization first comes to license compliance, and then it is shared with security. And the missing information is not common across all packages, maybe very few. But I think in the last six months we have seen a great improvement; for example with npm you get this kind of sorted out, we get almost 90% of the information right away, but other ecosystems are a little more challenging. I mean, we are comparing it with much older times, so we feel better. It might not be the best, but yeah. All right, I have another topic, but if people have questions, we have about five minutes left. Yeah, I would like to know how to solve the problem of verification and validation of the SBOM. Say a supplier offers you an SBOM: you will see it and trust the developer, right? But how do you know that what the developer provides is right? Yeah, I'll repeat the question: how do we verify and validate the SBOM that we receive from the developer? The short answer is trust, at this point in time. And it rests mostly on the tools that we have implemented. So far that is what has been done, but there is no formal validation that we do on SBOMs, because we have not thoroughly started processing automated SBOMs in the complete sense; things are here and there, partially, so there is still manual intervention happening in between. Yeah, I'll add to that. If you're consuming open source and you're getting an SBOM, you don't need it: you should build your own SBOM.
For anything that's open source, you should be able to fully create an SBOM yourself, based on how you're actually using that library and the dependencies that you're pulling in at that time. If it's a library saying it depends on these other libraries, that's just theoretical, right? So yeah, if you're using open source and you're getting SBOMs, just build your own. I heard about quite a few projects on Friday where people are talking about public databases of this dependency information for open source, maybe in Software Heritage. So we should be able to get the right kind of dependency graphs for open source without trusting an SBOM that somebody else built. So we need better verification. Coming from the automotive industry, here is how we validate it: I take your SBOM, I look at the language, I re-implement a mini project, take all of your dependencies, and run it again. If I then get dependency conflicts, I know, boom, we already have a problem. Then I run it through, in our case, the OSS Review Toolkit, which Marcel spoke about, where the entire idea is to basically download all of the source code of all of the code again to check all of the licenses. But my standard first check is to take your dependency list, generate a project for whatever ecosystem it is, Maven, Java, and run it again. If that already gives me a conflict, then I know why we have a problem, and I will go back to the supplier and say: we cannot compile this. How on earth did you compile these versions of the code? How did you compile these branches into a product? But again, validation is very difficult. So the other thing we do is a risk-based approach. You cannot do this manual check for everything. So I run the SBOMs through some special rules that I wrote in ORT, where policy is code, and I filter out the products that in my context are high-risk products. In automotive, that means updating a car.
In case you don't know, over-the-air updates for cars don't work everywhere. If you need to fix some of the cars, you need to recall the car to the garage, and that means millions of euros, dollars and yen that need to be spent. Those products get checked in depth; anything else we don't. So we have a risk-based approach to validating SBOMs, because consuming SBOMs is still an insane pain in the butt. And you should add: train your developers. Train your developers. Because in the modern world, with different tools and different formats, even if you're good at building an SBOM you can't always do a good job across all of these topics; sometimes it's just hard to do it right. So maybe try to do your best, provide the best SBOMs you can, and have trust in yourself that there is nothing malicious in there, that it's just a lack of knowledge or communication that could be the problem. Yeah, to summarize: developers don't have the proper tools, so it would be better to just provide them. So what I'm hearing is that maybe the SBOM community needs to get together and put together a verification story. I shouldn't need to go and rebuild the software; I should just take that document and, through some mechanism, verify that it's actually true. Yeah. Maybe we have time for one more question. Any closing thoughts? Yeah. I mean, just to summarize: after observing all these developments and discussions that are going on, I think from a healthcare perspective we are going to take a very cautious path, because we know this is quite a disruptive change, and it is also a mandatory change that we need to make. So our approach will be very cautious.
But we want to stay equally close to the community and to the developments that happen, so that we can adapt to changes faster than we could three or four years back. Yeah. I mean, I think what I already said: set a high standard for generating SBOMs, do it at build time, catalog your SBOMs, and then try to derive insights and relationships where you see commonalities between all your products. Thank you, everyone. Thanks.
Sharing and reusing SBOMs with the OSSelot curation database
All right then. Thank you. So, my name is Caren, I work for OSADL, and I'm going to take a step back from everything we've been discussing here today about what capabilities SBOMs have, what more capabilities they need, and what tools can create them, and go back to what I think they were originally meant for: license compliance. And there we have a lot of cases where we're still redoing the same work again and again, because creating SBOMs doesn't work fully automatically, at least for most of the software that we're dealing with in embedded Linux systems. So there's still a lot of manual work required, and that's where sharing and reusing work makes sense — and this is where the OSSelot project comes in. I think it's fairly obvious why reusing makes sense; I don't need to go into that a lot. We don't want to redo work that has been done before and that is being done again and again. I mean, we still get these questions every day: why do I have to extract copyrights from Linux kernel source code? Someone must have done that already. Why can't we reuse that? So why not do that, why hasn't it been done before, and what exactly can we do? A compliance toolchain could look more or less like this. We can't share work everywhere, but we can share work where most manual effort is required: with scanning and with curating data. Because as good as the scanners out there are — we've heard about ScanCode, we've heard a lot about ORT — all of the scanners, and all of the tools that use the scanner results, are really good, but there are still quite a lot of mistakes. So to do license compliance properly, we still need manual curation of the data.
And this is where the OSSelot project comes in. You can find more information on the OSSelot website; the data itself is available in the open source compliance repository, and in the package analysis repository you find material you can already use today: license and copyright analysis results for various packages, mainly from embedded Linux systems. We have about 320 when I last checked — different versions, of course, of around 200 unique packages — with more than 1.5 million files that have been manually curated. For each package we have some metadata: where the package comes from, a package URL to find it, the download location, and so on. Then there are the SBOMs. The SPDX SBOMs are what we're focusing on for license compliance, in different formats, with the license conclusions and the copyright notices in there. And what is probably the most valuable part of this: comments on why a particular decision was made. Because sometimes it's not clear what you find in a file — you know how licensing information is noted in some files. It doesn't really follow any standard, especially in older software that's still being used. And then you have to make a decision, you have to do some kind of interpretation. This is explained as part of the SPDX files that are available there. The SPDX files themselves are also explained, because what we find is that even though there is a standard, a specification, people still understand it differently. Someone might expect to get an SBOM from their suppliers and have a certain expectation of what a particular SPDX file looks like, but they understand the different tags differently than the customer does. So we have a clear explanation of how we understand the SPDX tags, and of course we try to stay as close to the specification as possible.
Then, for convenience, there is also a disclosure document: if you use a particular package unmodified, in exactly that version, you might for license compliance just use the finished disclosure document, with all the license texts, copyrights, acknowledgments and so on aggregated. Of course it's not yet big enough to immediately cover an entire system, but it is definitely a start. As I said, the question of why this hasn't been done before has been around for quite a few years. I think two of the main reasons are liability and trust, which are more or less two sides of the same coin. On the one hand: who is willing to supply such information, which is legally relevant if we're talking about license compliance — companies have gone to court over licenses. Who is willing to provide this and say: look, you can use this; we don't give you a guarantee, but we did our best to make this documentation as sound as possible, so that hopefully you won't be taken to court if you use it. And on the other hand: you're a company, you're putting out products, and you reuse legal information that you found somewhere on the internet. How can you trust this information? These are the things we were thinking about when starting this project. So, how can we limit liability, first of all for ourselves and for anyone who's contributing? Of course we asked some lawyers about that, and the idea was to license as liberally as possible. So we went with CC0 1.0; that gives you as many rights as possible, and it works well for documents too. In this case the rules for gifts apply, and liability arises only for willful intent and gross negligence, which we try to avoid. Also, I think the times have changed.
Maybe ten years ago there was a lot of worry, especially from the US, that there would be lawsuits in the open source area. But there haven't been any — not over providing legal information or support with licensing — or none that are known. So I think people just got braver and said: okay, maybe now is the time we can do this. And then on the other hand we have trust: how can you establish trust in the information? I think that's fairly straightforward: provide good quality. Do the curation conscientiously and diligently, only let people contribute who actually know what they're doing, and train anyone who wants to contribute. It's a bit of a bigger hurdle for contribution, but it's really important to keep up the quality. The same goes for review: the material is on GitHub, so we can use that for the review process. And we stand behind it with our name, to promise that we'll keep the quality as high as it started out. We'll see. So what are the curation guidelines that we established to ensure this quality? Well, we're working with FOSSology — I think that's just our preference; you can use any other tool as well — and we're using ScanCode for scanning, integrated into FOSSology. And we use source code from as far upstream as possible, ideally directly from the project page, so as not to go through any of the stages we've seen on some slides before, where material gets added by package managers. For the moment we start as far upstream as possible. The diffs against what's added by package managers are something that could be included as well, but we're not there yet; at the moment we still go with the origin. And then curating the licenses — as I said, there's manual work in there, and I think that's the valuable part of this project.
So license findings and copyright findings that the scanners have created are curated manually, of course with all the help that FOSSology can give. As for our curation guidelines — let me check the time — I don't think I'll go into too much detail. If you have ever looked at scanner findings, you know why some manual work is still required. With copyrights, it mainly means that text incorrectly identified as a copyright notice is removed: formatting characters, license notices, or bits of code that got identified as part of the copyright notice. And then there might be references to external files, like "copyright by the project authors, see file AUTHORS", and that information has to be resolved and added as well. With licenses, again, review happens at file level. Every file of the source tree is inspected if the scanner has found anything, or if it is mentioned in some kind of notice file or similar. And this is done even if a package contains some kind of licensing metadata, because we've made the experience — and probably a lot of you have as well — that metadata gets outdated or is incomplete, so it can't really be trusted entirely. And I think that might also be one of the reasons — I can imagine this question might come up — why we keep all this information in a separate place rather than upstreaming it into the projects: there is some reluctance in upstream projects to provide legally relevant information along with the source code. And also, we would again have the same problem that it just wouldn't be updated. That's just how people are. So, okay, we do curation on file level; we confirm or correct scanner findings.
We add individual license texts where files have them, especially with BSD licenses and so on — something that's not usually done by scanners. We only tag main licenses if there is a clear main license given in the root directory of a package, to not mislead anyone into thinking this might be the only license in there. And as I said before, the LicenseComments tags of the SPDX files explain any license decisions, any curation decisions, that are not obvious or that need some level of interpretation. Yes, please? "What's your correction rate on average?" Do you mean how many scanner findings we have to step in on? Well, that differs — sorry, the question was what our correction rate is. It differs heavily per package. There are some packages in really good order where — I don't know, I don't have a number — I would guess around 10%, and there are packages in horrible shape where it's closer to 80% that needs manual work. "I've processed more than 3000 packages, and I would agree — maybe 20% in general." Yeah, so that was some agreement about the numbers from someone who clearly has more experience than me; I can't say I've done 3000. That's because of the grey hair — he has talked in detail about the clearing process at Siemens, and you might guess there's some connection there as well. Okay, so what do these license comments look like? They also follow a kind of heuristic: it usually says "the information in the file is:" and quotes whatever the information in the file is, and then we give a reason for whatever conclusion is made. Example: no version of a license is given; we find "this file is GPL". Then the license comment would say: as no version of the GPL is given, GPL-1.0-or-later is concluded. This clearly is an interpretation.
So this is a legal step. When we find "this file is GPL", one could also argue that the most widely used GPL is still version 2, and conclude that they probably mean version 2. But our interpretation is: if they only say GPL, the author wanted to give us the option to choose whatever version of the GPL is available, so GPL-1.0-or-later is concluded. It is a step of interpretation, but it is explained in the data. Or, for example, a URL is given instead of a license text. Then of course the URL is checked, and the date of the check is recorded; more often than not the URL is dead, and then additional research is required — and the findings and the date they were checked are recorded as well. Yes? "In that last case, do you report it to the project itself?" Yes, yes. When we do find problematic things, we report them back. I mean, there are some licenses that have a dead URL in the license text itself; people usually say that license is outdated, but it's still valid for some files that are out there and being used. So sometimes it's helpful and projects react, and sometimes not. But whenever possible we try to report it back — that was also the question, sorry for not repeating it: whether we push fixes back into the projects, and yes, we do our best. "And if upstream doesn't take it — what's the rate of them ignoring it?" The question is what the rate is of them ignoring it. It's not large; mostly they do take it, because most projects are interested in being license compliant as well, or in making it possible for users to be license compliant. Because that's what we're trying to do.
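As a toy illustration of the version-interpretation rule above — an unversioned "GPL" finding concluded as GPL-1.0-or-later, with the reasoning recorded alongside the conclusion — here is a sketch. The rule table and function are invented for illustration; this is not OSSelot's actual tooling:

```python
# Hypothetical curation helper: map a raw scanner finding to an SPDX
# license conclusion plus the LicenseComments text explaining it.

def conclude_license(scanner_finding):
    """Return (SPDX conclusion, comment) for a scanner finding."""
    rules = {
        # Unversioned GPL: the author left the version choice open.
        "GPL": ("GPL-1.0-or-later",
                "As no version of the GPL is given, GPL-1.0-or-later "
                "is concluded (the author left the version choice open)."),
        "GPL-2.0": ("GPL-2.0-only",
                    "Version stated explicitly in the file."),
    }
    # Anything not covered by a rule is confirmed as scanned.
    return rules.get(scanner_finding,
                     (scanner_finding, "Scanner finding confirmed as-is."))

conclusion, comment = conclude_license("GPL")
```

The point is not the lookup table but the pairing: every non-obvious conclusion travels with the human reasoning behind it.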
We're trying to do what the project or the authors wanted us to do: make it possible for users to be license compliant. So projects are usually keen to take fixes, or to help. You asked about the rate before — I couldn't give you exact numbers, but I can give you some examples of the kinds of scanner findings we have to correct. I think they're fairly typical, and if you've done any curation you'll know most of them, so I'll go over them fairly quickly. First, "not a license": something was found that simply isn't a license but a bit of code or whatever; that's removed, of course. It might also be not the file's own license — some content of the file that isn't the file's license; we see that in documentation quite a lot. Then license texts: this is something scanners get wrong, and I don't think there's any way to fix it either. If you have a LICENSE.txt file that contains a license text, then of course the license of that file isn't the license text itself — most licenses don't have a license themselves, though some newer licenses, for example, do. This is corrected as well. Then generic license texts, as I said before: individual texts, where they differ from the generic license text. Of course, we have imprecise findings, in particular with respect to the version of a license. Then dual-licensing cases, especially when it's not a simple dual license where you have this license or that one, but this or that plus a third license, or this one license and then a second or a third — these need some manual work as well. We have license exceptions, which we handle a bit differently than FOSSology does, to bring them into one finding; but that's maybe particular to FOSSology. And we have external references that need to be checked — as I mentioned before, these might be URLs.
They might also just be external references within the package. There are a lot of problems there too, because you have files that were integrated from a different project, and the file says "see the copyright file in the root directory" — but it means the root directory of wherever the file originally came from. So that information isn't true anymore, and we need to do some research, and then of course explain what research was done to find out where the file originally came from and which license it is referencing. That usually takes a bit of effort. And then there is global license assignment, or partially global license assignment, which we usually don't use — again for the same reasons: that meta-information is usually wrong, or that material is included from different projects. So if there's a readme file that says "all files in this directory are licensed under the following license", we usually don't go with that information — unless a particular source code file itself says "for license information, see readme file". This is something that I think just comes from experience. Yes? "Some package managers have a specific license field, filled out with a proper SPDX identifier — do you apply that?" Okay, the question was about package managers that have a license field, a tag for what the license is. At the moment that hasn't come up much, because we come fairly from the bottom, from Linux-based embedded systems, so we haven't yet handled much that is managed by package managers. For the material I have looked at, it depends: if that's the only information there, we'll go with it, but there might be different information again in the source code — and if possible, we'll always go back to the source code.
But if we do have third-party or meta information, we also add that to the information in the package — in the license comment, I think, is where we add that kind of information. Yeah, I think there was another question. Yes? "The project seems to be mostly organized for collaboration among humans, not really for machine consumption. For instance, is there an API?" Yes, there's a REST API. "But, for instance, package naming seems to be quite vague — there's upper case, lower case..." Well — the question was about whether the project is made for human consumption or for machine consumption. There is a REST API to fetch the files, which is described not in the repo but on the OSSelot website, osselot.org. And on the naming schemes: we try to stay as close to the upstream naming as possible. But then again, upstream isn't usually consistent, so we didn't make up our own scheme; we stay with upstream, and where there are inconsistencies, well, we mirror them. "Do you know where the API is described? Even on the website, I cannot find it." It might actually be under tools, sorry — on the wiki, wiki.osselot.org. Try that. Yeah. "Lately, in some license listing, I found someone using libuuid, and it was listed as GPL only, and that was tainting licenses. But the readme file tells me: various source code in this package has different licenses. And looking at the source code of the functions we are presumably using, it says: oh, it's not strictly GPL — so not poison from a proprietary business point of view."
"And how do you express that? We had the discussion before, regarding vulnerabilities, that you need to get down to function level. Do you foresee that necessity in your work too, or do you strictly handle packages?" We handle files. So, the question relates to what I said before: the meta information is imprecise, let's say. We go down to source file level, and what we find there, we believe. You're right, we might have to dig deeper, but then we're into snippet matching. So we only assign something to a whole package if there is a clear main license, and we also warn: you can have this information and take it, but don't take it as the only information that's there. Okay, more questions? Yes? "Thank you. What you're doing, I think, is great. I was wondering about the upstreaming of the information that you gather. First you said upstream is often not interested in it, and then you made a statement that they really like to be license compliant. Do you have statistics on that, or what is your gut feeling? Because my personal experience is that many people have interest, but they need help with it — and you have a great data set to actually help them." Yeah, yeah. Well — oh sorry, the repetition: the question was about upstreaming, what our experience was, how the projects react, and your experience was that they're keen about license compliance. Well yes, most of them are; there are always exceptions. With concrete, particular cases — when we say, we found this problem in this file, can you fix it, can you clarify —
— then most of them, really the vast majority, are fine. But I have to admit we haven't tried with very many projects. If we say: we did a complete license analysis of your entire package, for this and this version release, here's the SPDX file — then they're not as keen to provide that via their own website, because it is legally relevant information. I think we had one or two projects who were like, oh, that's cool, we'll point to your site — but we're not going to provide it through our own infrastructure, because there is interpretation in there, as I said before, which we explain in the comments. There's some wiggle room, and I don't know, maybe we could reach more with more effort. Okay, there are a few more questions — do we have another minute, or is time up? Okay, so contact me anyway; I'll skip to the last slide. No, that's good, I prefer discussions. So contact me at info at osselot.org and we can chat. Okay. Thanks.
Welcome to the Devroom and Announcements
Welcome to this year's devroom on software defined radio and amateur radio. It's been, you know, a bit of a harsh year for this place; actually, we're very happy that we made it here. Last year the SDR community didn't have a devroom at FOSDEM, which was really sad. And I'm really happy that this year we don't just have our own devroom, but a devroom together with the amateur radio community, which obviously has a lot of overlap. So this will be a slightly more diverse set of presentations than we might be used to, which is super nice. I'm Marcus Müller, one of the three devroom organizers. Do you want to introduce yourself? — If I have to: my name is Paul, I'm from Switzerland, obviously. I'm a software developer, and what I do in amateur radio is mostly developing software. I'm also very happy we're together with the SDR folks here, because that's a field of activity for amateurs, so it's very interesting. — Yeah, so, I'm Marcus Müller, maybe known from the GNU Radio project. I'm very happy to work with amateur radio here, because, well, the applications follow the tools, and the other way around. The third organizer I haven't seen yet this year, so I hope he comes in, but we'll start without him. A couple of things I'd like to ask the audience: of course, clean up after yourselves. If you leave, check whether you left some bottles or something, because otherwise things will get messy. Hooray, we are not overfilled today, which is a new thing for us — usually the SDR devroom was so packed that we had to arrange for people to stand outside the escape routes. I'd like to ask that if you see someone blocking an escape route, you talk to them. The other thing I'd like to ask is whether we can find a volunteer to occasionally check the online stream for this room, in case someone writes in chat something like "we can't understand the speaker" — let us know. So much for the organizational things.
Coming to the content of things — hi, you made it! This is Bastian, our third organizer; come over here. — I'm a bit late, sorry. — For the speakers in the room: the camera is over there, and you can see yourself on the small screen there. If you can't see yourself on the laptop, then you're not on the stream. Content-wise, we've got a pretty diverse collection of things, and during selection and scheduling of the talks we tried to group them a bit, so that people who want to visit other devrooms can stay for more than one talk. We start off with me, obviously: I will give a really, really brief introduction to what happened in GNU Radio since last FOSDEM, which honestly is going to be a bit opinionated, because it's what I think is worth mentioning in this context. Then we go over to Sylvain, who's going to talk about using GPUs to improve throughput in SDR computing — this is very interesting to me. We go over to Mark, who will talk about a more modern approach to controlling transceivers than most of the tools we have today. Then we get to the radar and satellite group of topics: we start with Jean-Michel talking about synthetic aperture radar, we follow up on the QO-100 payload, and we close the satellite part with nanosatellites. I'm not going to go through all of these, I just realized. The next block is basically SDR architectures and SDR application software, and then we go into cellular and radio science. So this is our rough rundown. I'm pretty good on time — I could start my next talk now, but I'll take the opportunity: if there are any questions, ask them now. Okay, so, yes: if it's loud, then close the door, please. And yes, that's true, we do have a schedule switch: the second talk, the TETRA talk, I think gets switched with Ang's talk on SatDump, right?
So that's another satellite-heavy part then. Yeah, should be fine — and I'm excited that it actually works out. The schedule at the room door is on the wrong side? Oh yeah, we should probably fix it — you're the perfect person for that; I have some paper here. And while we're at it: this will probably mess up the video stream afterwards, because they're going to cut it by the minute, but that's something we'll arrange in post.
Using GPU for real-time SDR Signal processing
Real-time processing — a very brief background first. I'm Sylvain, Foxtrot 4 Golf Kilo Romeo, F4GKR; that's my amateur radio call sign. Very briefly about myself, for those who don't know me yet: I'm the founder of a small company in France doing SDR, called SDR Technologies. The most important thing here, to introduce the story about GPUs, is the next line: I was working for ONERA, the French aerospace research lab — well, military, I would say, affiliated to the Ministry of Defense. And that explains how this started; I will come back to it in a slide. So, very briefly, the outline of the talk: I will explain the motivation, then the approach I took when I tested this GPU, and why I had the idea to use it for a DDC. And I'll take a few minutes to explain the background — not of the code, because it's on GitHub and you can read it, and I'm quite sure you will improve it a lot, no doubt — but for those who are not yet familiar with GPUs: why they can be useful, and what kind of things you can do with them, as long as you take the time to write code for them. The story started a while ago when I was working at ONERA, where we have radars; I took some pictures that you may have seen already. One is Nostradamus, an HF over-the-horizon radar, and the other one is GRAVES, which is very famous. These two radars were designed and operated — not by my team, because I was not leading the team, but by the team I was working in. One of the key problems here is that you have a lot of channels: one antenna is one channel, which means you gather a huge amount of data. And one of the key problems is how to process this data in real time. At that time — I don't remember exactly the year — NVIDIA released the Tegra K1, which was a very small thing, but looking promising, in particular for embedded systems.
So my boss said: can you have a look at this and tell us whether it can bring anything to the game? To make the story short, the answer was yes, it's useful — and that made my decision to leave the research team and found my company. So yes, the quick answer is: yes, it works. Okay, now let's go back to more serious things. This is from the leaflet, I would say, for the Tegra K1 at the time. They were promising something like 326 gigaflops, at five watts, 99 euros for the dev board. You say: wow, does this really work? And that was the idea — to test whether this can be used for software defined radio. I'm assuming most of you have a rough idea of what a GPU is, so I will just take a few seconds to explain. I'm just realizing that if I move to the screen, nobody will see from the remote, I guess — yeah, okay, I'll try. Sorry. So, to explain the model: this architecture has two things inside. You have the ARM processor — that's the four cores here — and you have the CUDA cores, 192 of them, next to it. And the good thing is that they share the same memory. If you have a PC, you have your CPU cores, whatever you want, and in one of the slots you have the GPU card, and they have to share data through the PCI bus. In this device it's also a bus, a kind of PCI bus, but you will see that the performance is much more interesting. The second thing is that one core does one simple operation at a time. So in this very simple example, I'm computing C equals A plus B, and the code just says: for each CUDA core, take A, take B, make the sum, and store it in C. Pretty simple. The key point is that there are three steps: push the data, push the code, then run the code and fetch the results. Keep in mind that you have to push the data, and this costs a lot, of course.
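The per-element kernel model just described — each CUDA core computes one output element — can be mimicked in plain Python. This is a conceptual sketch, not CUDA code: the host-side loop below is exactly what the grid of CUDA cores replaces:

```python
# Emulating the CUDA execution model in plain Python: the same tiny
# kernel body runs once per element index.

def add_kernel(i, a, b, c):
    # Body executed by one "core" for one index i: c[i] = a[i] + b[i]
    c[i] = a[i] + b[i]

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = [0.0] * len(a)

# Host side: (1) push data, (2) launch one kernel instance per element,
# (3) fetch the result. On a GPU all iterations run concurrently.
for i in range(len(a)):
    add_kernel(i, a, b, c)
```

On the Tegra-style parts discussed here, step (1) is cheap because CPU and GPU share the same memory; on a discrete card it means a trip over the PCI bus.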
Coming back to our SDR and DSP: what are the things that may need this power? Just examples — the one I will elaborate on this morning is the DDC, the digital down converter, but there are many others that I have not yet investigated so much and will not describe this morning. (Feel free to take a seat, no worries.) Interpolation, decimation, clock recovery, synchronization, pattern detection, and so on. One of the key issues is that some algorithms are extremely difficult to run in parallel, while for others it's much simpler — and some just don't parallelize easily. So in this example, let's focus on something simple: a multiband DDC. We'll assume we have a wideband signal coming from a wideband SDR, whatever it is; I took an HF example. Here, for instance, we have a receiver transferring a 50 megasamples-per-second bandwidth to the device's memory, and we want to extract small sub-bands from it. I took examples of HF bands: one at 7 megahertz, another at 14 megahertz, and so on — just examples. The core question is: how do we extract the sub-bands from the single wideband signal? For one channel, it's pretty easy, and it's the classical approach: this is a DDC. You translate the frequency, then you do some low-pass filtering, and then you throw away all the samples you don't need. Very classical; I have not invented anything here. And I guess you all know by heart what a low-pass filter is, but let's take a few seconds to recall how it works: on one hand you have the input, the samples coming from the SDR; on the other hand you have the filter you want to apply for the low-pass filtering. You compute a convolution — basically some multiplications and additions — and you retrieve the output. Okay. Now let's look a bit more at my example: how many taps do we really need?
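The single-channel DDC chain just described (frequency translation, low-pass FIR, decimation) can be sketched with NumPy. The sample rate, tone frequency, tap count and decimation factor below are small illustrative values I picked, far smaller than the 50 MS/s case in the talk:

```python
import numpy as np

# Minimal single-channel DDC: mix the band of interest down to baseband,
# low-pass filter, then throw away samples (decimate).

fs = 1_000_000          # input sample rate (illustrative)
f0 = 40_000             # centre of the sub-band we want
decim = 10

t = np.arange(4096) / fs
x = np.exp(2j * np.pi * f0 * t)        # a test tone sitting at f0

# 1) frequency translation: multiply by a complex oscillator at -f0
mixed = x * np.exp(-2j * np.pi * f0 * t)

# 2) low-pass FIR: Hamming-windowed sinc, cutoff fs/(2*decim)
ntaps = 101
n = np.arange(ntaps) - (ntaps - 1) / 2
fc = fs / (2 * decim)
h = np.sinc(2 * fc / fs * n) * np.hamming(ntaps)
h /= h.sum()                           # unity gain at DC

# 3) filter, then keep every decim-th sample
filtered = np.convolve(mixed, h, mode="same")
baseband = filtered[::decim]           # output rate fs/decim = 100 kS/s
```

The test tone ends up at DC after the mix, so away from the filter edges the output is a constant — which is exactly what a DDC centred on the tone should produce.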
So for this example, let's assume that we have a 50 megahertz, so 50 mega-samples per second, bandwidth incoming. And we want three kilohertz, just to extract the audio. This is a fully digital system. So at the end, we want audio, plain old voice, someone speaking. And we assume that three kilohertz is enough. There are a lot of different approaches to estimate as accurately as possible the number of taps we need. I tried to find an example, and I saw plenty in pages from you, Marcus. I was going to copy and paste some of yours to avoid questions. No, I'm joking, of course. Well, there are many ways to estimate the number of taps, and one of the approaches is, I don't remember, yes, the harris approximation, sorry. And so if you do the calculation, you arrive at 50,500 taps. Okay. 50,500, so what? Now let's go back to this stuff. To do the convolution with 50,500 taps, you need to do this 50,500 times for each sample. It means that to get one value out of the FIR filter, the low-pass filter, you need to take 50,500 inputs and 50,500 coefficients, do the multiplications, do the sum. And you have one sample. And you have to do this for every incoming sample. That begins to be a huge amount of processing. Of course, you have all seen many low-cost SDR applications running on low-cost PCs, and they do this in real time. So how do they do it? Of course, there are tricks. The easiest one is to divide by two instead of going straight from 50 megs to 3 kilohertz. You do this step by step by dividing by two. So you take the first band, apply a half-band filter, so you keep half of the samples, and you repeat this several times. That's very interesting because each time you remove a lot of samples, and if you do it cleverly, 50% of the coefficients are zero if you design the FIR in a good way.
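The harris rule of thumb mentioned here can be sketched as follows. The stopband attenuation and transition width are my assumptions, chosen so the estimate lands near the 50,500 figure quoted in the talk:

```python
import math

def harris_taps(fs, transition_hz, atten_db):
    # fred harris rule of thumb: N ~ (fs / transition width) * (attenuation / 22)
    return math.ceil(fs / transition_hz * atten_db / 22)

# assumed numbers: 50 MS/s input, ~2.7 kHz transition band, 60 dB stopband
print(harris_taps(50e6, 2.7e3, 60))   # about 50,500, the figure from the talk
```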
So that saves you a lot of computation. Of course, yes, but this is not ideal, because you will hardly be able to reuse the computation you've made for the other channels. You will reduce a lot the number of calculations you need for one of the channels, but for the next one you will want to reuse some of the calculations you've made, and that's not easy. So in the end, this doesn't work so well. So, can this stuff help? I just put two examples here. On the top you have the Jetson Xavier NX. I know that in an open source conference, promoting a brand like NVIDIA is probably not the best idea, but just to make the story short, I have no sponsoring from NVIDIA. Okay, so just to give figures and facts: the first one is the Xavier NX, so it's roughly 500 euros. This one has 384 cores, and the next in line is the NVIDIA A100, which is not the same price, 20,000 roughly, and has 6,912 cores. Okay, the interesting thing is the two FFT benchmarks put below them. If you look at the Jetson Xavier NX, to perform an FFT of, sorry, I'll say it this way, 2 to the power 19, which is quite a lot, it takes 310 microseconds. And if you look at the most expensive one, you have 170 microseconds for 2 to the power 23, which is a huge FFT. A huge FFT. You can do this with an FPGA, but at those sizes it becomes fun and extremely tricky to do. Okay, and for the Xavier NX, you see that if you go up to 2 to the power 23, it's 7 milliseconds. That's a huge FFT, and it's still quite fast. So how can we use this? Of course, if you look back at your DSP lessons, it's pretty simple, in fact. Applying an FIR to a signal is just making a convolution, and for the convolution, you can use the FFT. That's pretty well known. You take the input signal, you do the FFT; you take your filter, you do the FFT here; and then you make a product of the two vectors. There is a bug on the slide. It should be FFT minus 1.
There is a bug here. Inverse FFT. And you get your output. So basically, you do FFT, FFT, multiplication, inverse FFT, and you have your output. That is for one single block. Okay? That's quite good. It works well. But this is for a steady signal, not a stream. So if you want to do this for a stream, there is an improved version of this algorithm, which is called overlap-save or overlap-add. I use overlap-save, which is basically sliding a window, sliding blocks, moving the input, doing the computation, and so on and so on, and repeating this. The key point here is that you always use the same filter, so you can compute the FFT of the filter once and keep it. And the input, you will see, can be reused several times. So basically, if you do this in the GPU, the performance is quite interesting. And this is what I did, and this is what I'm going to show you here. So this is the architecture of the code I'm proposing. You receive the samples from the SDR. You push the samples into the GPU RAM. Then your code does a first FFT of the incoming block. You assume that you've already done, at init time, the FFTs of the several filters you want to apply. So here in this example, I have two. You do the complex products, the multiplications, for both filters, then the inverse FFT and the decimation. And you're done. There is one trick; I will come back to it in a few slides. So basically, it means that, sorry, if I go back to this slide, excuse me, you do this FFT in fact only once. You reuse it for the different channels you want. You have done the FFTs of the filters once. So in practice, for each new incoming block of samples, you have to do one FFT here, the multiplications, the inverse FFT, and the decimation. And that can be quite fast. None of this needs data to move from the GPU memory to the main CPU memory. So it's quite fast, in fact. Then one trick, and the reason why I ended up using CUDA, the proprietary API, and the NVIDIA stuff.
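A minimal overlap-save loop in numpy, just to illustrate the algorithm; the talk's implementation is CUDA-based, and the block sizes here are arbitrary:

```python
import numpy as np

def overlap_save(x, h, nfft=256):
    """Stream x through FIR h block by block (overlap-save)."""
    m = len(h)
    step = nfft - (m - 1)                  # new samples consumed per block
    H = np.fft.fft(h, nfft)                # filter FFT computed once, reused forever
    xp = np.concatenate([np.zeros(m - 1, x.dtype), x])
    out = []
    for start in range(0, len(xp) - (m - 1), step):
        block = xp[start:start + nfft]
        if len(block) < nfft:
            block = np.concatenate([block, np.zeros(nfft - len(block), x.dtype)])
        y = np.fft.ifft(np.fft.fft(block) * H)
        out.append(y[m - 1:])              # drop the circularly wrapped samples
    return np.concatenate(out)[: len(xp) - (m - 1)]

rng = np.random.default_rng(0)
h = rng.standard_normal(33)
x = rng.standard_normal(1000)
print(np.allclose(overlap_save(x, h), np.convolve(x, h)[: len(x)]))   # True
```

The FFT of the filter is computed once outside the loop, which is exactly the reuse the talk emphasizes.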
I've heard from guys in this room that you can now do this in OpenCL. I have not tested it, to be honest. One of the tricks is that if you don't pay attention to the scheduling, the different channels, the different kernels, will run in series, in sequence, FFT and so on and so on. So you will have to wait for the last block's sequence of operations to be finished before you can retrieve all the samples, and you may end up waiting quite a lot. But if you use this trick, which is just a compile option, a switch, then the scheduling inside the GPU is different, and then everything runs in parallel. And the difference is quite large, quite big, to be honest. It's much faster this way. One last thing: if we only do what I proposed, then you miss the frequency shifting. There is a problem: the output frequency is not the right one. So you need to apply an NCO to shift the signal in frequency. And of course it's much more efficient to do this at the end, because you have fewer samples, so it's much faster. You do the shift at the very end, and you just use the fact that you have some aliasing. The code compensates for the aliasing, and that gives the frequency shift at the very end. Just look at the code; it's easier that way. So what am I proposing this morning? You have on GitHub a lib and an example. The code is quite old, from me, but it works. And the key thing is that you have to allocate the maximum number of channels you will use at the beginning, basically because it will allocate the RAM in the GPU for the different operations. Then the code is thread-safe. That is to say, you can add, remove, shift, replace, change the number of channels you use, the size of the channels, and so on, in real time. This is CUDA-based. I know that maybe OpenCL could do something; I have not tested that. And I have only tested this with NVIDIA GPUs. So just to give an example of what you can get with this.
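The final fine shift can be sketched like this: an NCO applied only to the decimated stream, where it is cheap because there are few samples left. The rates and offset below are made-up numbers, not the talk's:

```python
import numpy as np

fs_out = 48_000                 # decimated output rate (illustrative)
f_residual = 1_234.0            # leftover frequency offset (hypothetical)

n = np.arange(4096)
x = np.exp(2j * np.pi * f_residual * n / fs_out)   # tone at the residual offset

# the NCO runs at the decimated rate, so it is cheap: few samples to touch
nco = np.exp(-2j * np.pi * f_residual * n / fs_out)
y = x * nco

print(int(np.argmax(np.abs(np.fft.fft(y)))))   # 0: the tone is now at DC
```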
So I just benchmarked this with two different architectures, the ones I had, but I'm sure that I will receive tons of PRs to add new figures to the tables on GitHub, for sure. So practically speaking, on my machine at home, it's a, well, average PC with an RTX 2060. Throughput here is just a bench test code which is pushing data to the GPU, making the computation, and retrieving the samples. So with one channel, it's roughly 600 mega-samples per second. With two channels, 530. Okay. Just as a baseline for comparison, with the Jetson Xavier NX, depending on the FFT size, that changes quite a lot, and you can reach up to 156 mega-samples per second with one channel, sorry, and 117 with two channels. The filters were 7,200 taps. Excuse me, that's an average; you can change this in the code. I'm checking the time because I know Mark will kick me out soon. So, just one of the interesting things is that if you look at the figures here, you see that the GPU is roughly 80% used. The PCI is at 36%. So there's room for improvement. And if you look at the CPU, one core is at 100% and the others are relaxed. So it means that maybe there's room to go much faster, in fact, because we are far from overloading the machine. And in fact, if you look in detail at where the bottleneck is, it appears that the bottleneck is the memory copy. The synchronization around copying the memory from host to device, waiting for the threads to start, waiting for the kernels to stop: all this synchronization takes a lot of time. And if you start to plot this in time, NVIDIA comes with a tool, I don't remember the name, where you can see the different threads in time, how they work. And you clearly see that the bottleneck comes from the synchronization with the host, so there's room for improvement, for sure. So if you want to tune this, you will see that, of course, the size of the FFT used has a strong impact on the performance.
But that really depends on the performance of the GPU you're using. As I said, moving the data from host to GPU is extremely expensive. In the example, I was copying from host to device in complex float. I could use complex ints, the raw data from the SDR, and there is in the code one example where you can convert the int16 to float directly, so it's cheaper; I mean, the amount of data you copy from the host to the device is much smaller. And I was using libusb in real life. I mean, not in the example, but in real life. So that's also very expensive. libusb is far from optimal, I would say. And of course, one of the important things is that the CPU is mostly free; the different cores have room for other things. It means that you can do other tasks, like painting a nice spectrum on the screen, sending emails, listening to music, whatever you want. I think that's all, folks. Thank you very much. I didn't want to spend too much time. And I'll be happy to reply to questions if you have any. Thank you very much. Yes. Yes, please. You said you did the frequency shift at the very end, after this; is it possible to already do at least a significant part of the frequency shift by just offsetting the FFTs? That's what I do. I rotate the FFT. I rotate, yes. But then you have a remainder, because if you do this, the shift you perform is an integer number of FFT bins. So you need a fine tune afterwards. And that's exactly this. Yeah, you're right. That's what I'm doing. Yes? Did you consider an IIR, FIR, or CIC filter? Just FIR. Yeah, because it's just FFTs and complex products. That was the simplest approach. Thank you. Yeah? Was there any attempt to merge this into GNU Radio? Not yet, to be honest. I'm not good enough in GNU Radio. I had a side discussion with Jean-Michel, and there's a plan to do it. The point is, I was not able to do it for them.
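The "I rotate the FFT" answer can be illustrated in numpy: an integer-bin shift done by rotating the spectrum, leaving a small remainder for the fine NCO. All the values here are illustrative:

```python
import numpy as np

fs = 1_000_000
nfft = 1024
f_target = 123_456.0            # desired shift, not a multiple of one bin

bin_hz = fs / nfft
coarse_bins = round(f_target / bin_hz)        # whole bins: done by rotating the FFT
residual = f_target - coarse_bins * bin_hz    # remainder for the fine NCO

n = np.arange(nfft)
x = np.exp(2j * np.pi * f_target * n / fs)    # tone at the frequency to bring to DC

X = np.fft.fft(x)
y = np.fft.ifft(np.roll(X, -coarse_bins))     # "I rotate the FFT"

# after rotation the tone sits at the residual offset, close to DC
print(int(np.argmax(np.abs(np.fft.fft(y)))), round(residual, 3))   # 0 409.125
```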
I don't have enough practice with the C++ blocks, so I said, okay, let's do this with the guys who know. So we will come with a proposal. Yes, that's the idea. Typically, the idea would be to have something, if we can do it, that would permit messages to change, add, and remove channels, or tune the channels in GNU Radio directly. Because one of the points is that you need to allocate, to define, how many channels you want to use. So depending on the application, you might need different numbers of channels. That's why I wasn't able to do it. Any other question? From the audience? Yes? Just a small question: did you use single precision floating point? Very good question, in fact. Single, except for one thing: the frequency shift. Because in CUDA, the single-precision sine and cosine functions are a nightmare. They produce a lot of noise. So in the code, it's written: double precision, don't touch this. Because otherwise, the noise goes up very quickly. Anything else? Okay, thank you very much. So there are more folks pressing in, so if I can ask you to give a little bit more space. You didn't need to kick me out. That's quite fine. Bonjour. Thank you.
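The single- versus double-precision NCO issue can be demonstrated without any CUDA: accumulating a phase increment in float32 drifts far more than in float64, which is the kind of noise the speaker describes. The step count and increment are arbitrary:

```python
import numpy as np

steps = 100_000
inc = 2 * np.pi * 0.123456789      # phase increment per sample (arbitrary)

acc32 = np.float32(0.0)
acc64 = 0.0
inc32 = np.float32(inc)
for _ in range(steps):
    acc32 = np.float32(acc32 + inc32)   # single-precision phase accumulator
    acc64 += inc                        # double-precision phase accumulator

err32 = abs(float(acc32) - inc * steps)
err64 = abs(acc64 - inc * steps)
print(err32 > 1000 * err64)   # True: the float32 phase drifts far more
```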
Covert Ground Based Synthetic Aperture RADAR using a WiFi emitter and SDR receiver
I'd like to show you a little bit how I'm using software-defined radio, of course running GNU Radio, for developing a covert ground-based synthetic aperture radar using Wi-Fi as a radio frequency source. So, just to see what it looks like by the end of the presentation: this was done with some funding, or leftover project funding, so I put the affiliation of the lab. Actually, it's a hobby project, but I had some leftover contract money to develop this thing that I wanted to share with you. So what is ground-based synthetic aperture radar? Let's start with the objective of what we want to look at. A radar is a remote sensing measurement technique where you want to do some radio frequency detection and ranging. So you would like to see targets. And in the case of GBSAR, it's mostly used for small, minute variations of distances. So in this example here, I was lucky enough to visit Professor Sato's laboratory in Sendai, and that's one of his setups where he's looking at landslides. And when you're looking at landslides with ground-based synthetic aperture radar, you're using the range information to get the distance from the SAR measurements, and the lateral resolution is given by the spatial diversity of moving your antennas. It's an active measurement technique. So as opposed to passive remote sensing like optical measurements, photogrammetry, optical satellite imagery, you're not sensitive to lighting conditions, day or night, or cloud cover, because you generate the signal that is returned. And unlike laser detection and ranging, you're not sensitive to weather conditions. Radar works in all weather conditions. So that's the beauty. Now, you've got some commercial systems. I just took some of the European ones I'm familiar with: the Italian IDS, the Dutch MetaSensing. I don't claim to be competing with these guys. These guys have 100k-euro units.
I'm not going to show you a 5k device that competes with that. I see this as an educational project, to try to get familiar with the concepts of SAR and to try to do this, well, I wouldn't say legally, but at least without getting caught, by using Wi-Fi signals. So what are the requirements for radar? On the one hand, you want to detect a distance. Range resolution is the inverse of bandwidth, so you need a large bandwidth, some wide bandwidth signal, and Wi-Fi is very good for this. Now, there is no fundamental reason why you would get more bandwidth at higher frequency, but it's a fact that technologically it's easier to get more bandwidth at higher frequency. And so I moved to 5.8 GHz Wi-Fi because you've got 200 MHz of bandwidth. That's kind of nice because your range resolution, c over 2B, is going to be something like sub-meter. So you can separate, by range, two pixels separated by less than a meter. And then, also because I want a mechanical setup, I showed you in the introduction that we want spatial diversity, so we're going to have some moving stuff. And the higher the frequency, the smaller the wavelength; the smaller the wavelength, the smaller the antenna. So it's going to be easier mechanically to move a smaller antenna, hence the move to higher frequency. And also, for the rail along which you're moving to get the spatial diversity, the azimuth resolution is given by the wavelength over the aperture length. So if you go higher in frequency, your wavelength is smaller, and the rail can be a bit shorter and cheaper. These are the reasons for moving to higher frequency. So the SAR measurement means you're doing in the spatial domain what a classical radar does in the time domain: you're moving the antenna in steps. And I'll show you in the next slide that azimuth compression is in fact a Fourier transform. So you're really adding phase each time you're moving the antennas.
And if you want to match Shannon's sampling theorem, you can show that you must sample at half-wavelength steps; that is the spatial equivalent of sampling at twice the signal frequency. And when the transmitter and the receiver are collocated, because they're both moving, it's actually not lambda over 2 but lambda over 4, since you're moving both the transmitter and the receiver. So you need a system that allows you to move your setup in quarter-wavelength steps. And because I want as few sliding contacts as possible, all this electrical stuff that's moving has poor contacts, I wanted to put everything on the moving part. So everything that is moving: the Wi-Fi dongle as the transmitter, a B210 SDR as the receiver. An important point here is that you need a dual-channel coherent receiver, because you don't know what the Wi-Fi is streaming. It's streaming a broadband signal, but you don't know what it is. For me, it's noise. And if I'm sending noise, I need to record the reference signal, and on the surveillance antenna I will look for time-delayed copies of this transmitted signal. That's your basic passive radar measurement. And this is all running on a Raspberry Pi. The Raspberry Pi at the moment is a Raspberry Pi 4 running Buildroot, running GNU Radio, and I'm streaming over ZeroMQ to the processing PC. That's what we showed a few years ago. So this is the final setup. I took some commercial antennas here. You want them to be a bit directional so that you can have some bigger range. And this is why I'm saying it's not completely legal, because I'm sending the 10 dBm of a Wi-Fi transmitter, but of course that limit is for an isotropic radiated pattern, and here I'm focusing it with a 20 dBi gain antenna. Let's forget about this; no one's going to notice. And we do the same on the receiving side. So you see here, you have the rail, everything that's moving, and the transmitting and receiving antennas. The Raspberry Pi is over here. The B210 is over here. So everything that's moving keeps its cables with it.
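The numbers behind these choices can be checked quickly; the rail length is my assumption, since the talk does not give it:

```python
c = 299_792_458.0            # speed of light, m/s

B = 200e6                    # ~200 MHz usable Wi-Fi bandwidth at 5.8 GHz
f = 5.8e9                    # carrier frequency
L = 1.5                      # rail (aperture) length in metres -- assumed value

range_res = c / (2 * B)      # range resolution c/2B: the sub-metre figure
wavelength = c / f
step = wavelength / 4        # collocated, co-moving TX/RX: quarter-wavelength steps
azimuth_res = wavelength / L # azimuth resolution (radians): wavelength over length

print(round(range_res, 2))        # 0.75 m
print(round(step * 1000, 1))      # 12.9 mm per rail step
```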
And then I'm transmitting. Here, I'm transmitting over Ethernet, but it could be over the Wi-Fi communication, the 2.4 GHz link, for the stream. Now, about doing Wi-Fi measurements: actually, yesterday, if you walked in the garden just outside here, you would have seen this poster. And I was reading the poster; for those of you reading French, there's a PhD student from Brussels using Wi-Fi for what he calls crowd safety. I call it crowd control, but he's a PhD student, so he's still optimistic. And of course, MIT is very good at advertising what they're doing with Wi-Fi. MIT has been doing through-the-wall Wi-Fi measurements for a long time. So Wi-Fi sensing is not new; I'm just trying to show you here how to make an educational system. So the principle is that we continuously broadcast Wi-Fi. You could be streaming a very big movie, or you can take Bastian's packetspammer. This is what I'm doing. So packetspammer will just keep on sending packets over time, and you have this non-cooperative source sending the signal. And because it's non-cooperative, it might be that sometimes, because you cannot squeeze too many packets into a second, you'll have some gaps. So you just have to detect the gaps, throw these parts away, and collect enough data that you don't have only noise. Now, we've just seen the presentation by Sylvain about GPUs. And coming to the correlation: when you're doing the correlation, you're looking for time-delayed copies of your signal. And you might think: he's talking about correlation, Sylvain was talking about convolution. They are related. The relationship between convolution and correlation is that you just flip the time in the argument: convolution is tau minus t, correlation is t plus tau. And when you flip the time, you take the complex conjugate. So you see, it's exactly what Sylvain said.
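The gap handling can be sketched as a simple amplitude gate; the threshold, minimum run length, and function name are illustrative, not the talk's actual code:

```python
import numpy as np

def drop_gaps(samples, threshold=0.05, min_run=64):
    """Keep only stretches where the reference actually carries a packet.

    packetspammer cannot fill every microsecond, so between packets the
    reference channel drops to the noise floor; those gaps are thrown away.
    """
    active = (np.abs(samples) > threshold).astype(np.int8)
    edges = np.flatnonzero(np.diff(np.concatenate(([0], active, [0]))))
    kept = [samples[a:b] for a, b in zip(edges[::2], edges[1::2]) if b - a >= min_run]
    return np.concatenate(kept) if kept else samples[:0]

# toy stream: two constant-envelope "packets" around a near-silent gap
rng = np.random.default_rng(0)
packet = np.exp(1j * rng.uniform(0, 2 * np.pi, 256))
gap = 0.001 * (rng.standard_normal(128) + 1j * rng.standard_normal(128))
stream = np.concatenate([packet, gap, packet])
print(len(drop_gaps(stream)))   # 512: the gap is gone
```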
You take the IFFT of the Fourier transform of the surveillance signal times the complex conjugate of the Fourier transform of the reference signal. And the complex conjugate is what takes you from correlation to convolution. The problem with this is that if you have some ripples on your reference measurement or on the surveillance measurement, then you will multiply the ripples, because you're multiplying the amplitudes. And what's really important in correlating is that you want the phases to subtract, because the signals are coherent: if they come from the same source, from the same direction, they have the same phase. So you want to subtract the phases. And actually, instead of using the analytical formula of multiplying the Fourier transforms, you can take the ratio of the Fourier transforms, which is the same in terms of phase, because taking the inverse gives you the negative phase, but you cancel the amplitude fluctuations. So that's what I do at the end of the day: I take the inverse Fourier transform of a ratio of Fourier transforms. Now, each Wi-Fi channel bandwidth is 20 megahertz. And 20 megahertz is, on the one hand, more than I can stream from my B210 to the Raspberry Pi 4. And secondly, I told you there are 200 megahertz available in Wi-Fi, and we don't want to be using just the 20 megahertz of one channel. So what we're doing here, if you look at the allocation of frequencies: Wi-Fi is very broad. It starts at 5.4 gigahertz. Actually, you should avoid 5.4 gigahertz; that's the C-band radar band, also called the military G-band. So you would like to avoid that kind of frequency. And C-band is also Sentinel-1; we don't want to be jamming Sentinel-1. So we start working above the C-band radar. We have all these channels here, and what you do is what is called frequency stacking. You reprogram your Wi-Fi dongle to jump from one channel to the other, and then you just keep on sweeping.
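The ratio trick can be demonstrated on synthetic data: both the classic cross-correlation and the ratio of Fourier transforms recover the delay, but the ratio cancels the spectral amplitude ripples (the delay and sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4096
delay = 137                                  # true target delay in samples

ref = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # unknown Wi-Fi waveform
surv = np.roll(ref, delay)                   # surveillance: a delayed copy

R, S = np.fft.fft(ref), np.fft.fft(surv)

# classic cross-correlation: spectral amplitude ripples get multiplied in
xcorr = np.fft.ifft(S * np.conj(R))

# ratio trick: same phase difference, but amplitude fluctuations cancel
xratio = np.fft.ifft(S / R)

print(int(np.argmax(np.abs(xcorr))), int(np.argmax(np.abs(xratio))))   # 137 137
```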
So this was seen on the spectrum analyzer. You see here how you're broadcasting on each one of these channels, and I can check that indeed this is working. And so for each channel, I reprogram the dongle, I stream the data over ZeroMQ, I record the data once I know I've reprogrammed the new channel, and after I've collected the number of samples I wanted, I reprogram the next one. And I keep on looping like this. At the moment, all the FFTs are done offline. Everything I'm showing here is post-processing. I showed you a very fast movie, because a full measurement takes 15 minutes, and if I had run the movie in real time, my time would have been exhausted by the time I finished the introduction slide. So a full measurement takes 15 minutes, and processing the full data is about an hour, because I'm not using a GPU. This is all CPU post-processing. But one thing that I would love to see: we've seen very fancy GPUs here, and I just got two Raspberry Pi 5s, and I'm told that it will be documented how to use the GPU of the Raspberry Pi 5 to do some sort of processing. That would be really beautiful to do; at the moment, it's beyond what I can do. So this is actually experimental; this is what I do. Each one of the colors was a spectrum collected by the B210. And so you see my frequency stacking, which allows me to span the 200 megahertz of Wi-Fi. Be careful that there are some gaps, I think it's these guys here. So when you do the ratio, just make sure that you NaN the values so that you don't divide by 0; otherwise it's going to be unhappy about the calculation. Just a little side note: when I bought this rail... usually I try to do some hack where I find what's in the lab and assemble it, but this time I had a bit of money left, so I bought a real rail. And I learned, I discovered, that all these industrial controls, the programmable logic controllers, run on 24 volts. That is very standard.
And your Raspberry Pi, of course, has 3.3 volt GPIOs. So you will need a voltage converter: your legacy ULN2803, an open-collector Darlington transistor array that will convert the 3.3 volt input into 24 volts. And the other thing that's kind of funny for us is that in industrial control, they don't want you to do anything you want with the rail, because if you misbehave, your rail might run off its end. So you're not allowed to program an arbitrary position. You have to pre-program a set of positions where your rail can go, and then you say: I want you to go to position 1, 2, 3, 4, and so on. This, of course, is proprietary software from the rail manufacturer. But it does run under Wine. So it's not open source, but you can do it. So this is what it looks like on the moving part: you've got the Raspberry Pi with the 24 volt controller over here. Okay, having said that, what you collect, for each antenna position, is all the spectra in the frequency domain. Once you've got, on the reference channel and on the surveillance channel, all these antenna positions and all the frequencies, that's a 2D matrix, you cross-correlate each one of these. So you end up with one 2D matrix, because you've correlated these two guys. You've got the antenna position on the x-axis, and you've got the time domain, because you inverse Fourier transform, on the y-axis. So this is before azimuth compression. Then you do your azimuth compression by doing the FFT in this direction, along the antenna positions. And then the part that I'm not completely used to: here you get sine of theta, and you want range-azimuth position. My colleague Weike Feng, from the Air Force Engineering University in Xi'an, gave me the algorithm for reprojecting the sine-theta range map to the range-azimuth position. And once you get these maps, well, the really beautiful thing is that there is no degree of freedom.
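Azimuth compression as an FFT over antenna positions can be checked on a synthetic far-field target; the number of positions and the target angle are illustrative:

```python
import numpy as np

c = 299_792_458.0
lam = c / 5.8e9                 # wavelength at 5.8 GHz

n_pos = 64                      # number of antenna positions along the rail
d = lam / 4                     # quarter-wavelength steps
theta = np.deg2rad(30.0)        # far-field target direction (illustrative)

# two-way phase across positions: exp(j * 2*pi * 2*x*sin(theta) / lambda)
x = np.arange(n_pos) * d
signal = np.exp(2j * np.pi * 2 * x * np.sin(theta) / lam)

# azimuth compression = FFT over the antenna positions
peak = int(np.argmax(np.abs(np.fft.fft(signal))))

# spatial frequency 2*sin(theta)/lambda, sampled at 1/d = 4/lambda,
# gives normalized frequency sin(theta)/2, i.e. bin n_pos*sin(theta)/2
print(peak, round(n_pos * np.sin(theta) / 2))   # 16 16
```

The FFT bin indexes sin(theta), which is why a reprojection to range-azimuth coordinates is still needed afterwards.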
If you know how you move the antenna and you know the frequency step you use, you cannot cheat with the results. You've got an x and y position that is fully determined by your data acquisition conditions. So here is one example from our lab. This is the rail, this is the antenna. You've got here this round circular building, which is over here. You've got the portal, which is over here. And you've got the university housing, which is over here. So there is no degree of freedom other than positioning the radar at the focal point, and this I know; I know where I'm located. The only degree of freedom is the azimuth: you can rotate the picture so that it fits. In this case, I thresholded the backscatter to make it transparent where there's no output. So this is on the other side. This is close range, this is further away. So we're looking at the opposite side. You've got this building that is over here. You've got this container, which is over here at near range. Again, no degree of freedom. And there is this reflection. And you might tell me: how can you get a reflection when there's a field over here? Well, actually, that picture was taken this summer, when Google Maps had not yet updated their imagery, because this building was indeed built since then. So this is one example where we actually get reflections up to 500 meters. This building here is giving us something. This range here is 500 meters. So it's working, all right; let's say, at least you can see things with it. Then you might ask: is this reproducible? So last weekend I said, okay, like any open source project, you put it on GitHub, you say: trust me, it's working. And six months afterwards it's all going to be broken, because all the libraries changed and nothing's working anymore. So last weekend I said, let's take everything out and check if it's working. And it is working again. So here you've got the XY map, which I project over Google Maps. And the nice thing is that Google Maps updated their database.
So now the hotel is over here. And here you've got the reflection far away. And you've got something here. So you might say: wow, I get something even further than 500 meters. And it's reproducible; I took a second image over here, and you get twice the same image. Don't be fooled. This here is not a target: if you change the orientation of the radar and you look a bit to the right, you'd think the reflection is still over here. This is your ambiguity function. The ambiguity function is the autocorrelation: you check whether there is some self-similarity. And obviously, OFDM Wi-Fi does have some self-similarity; there is a repeated pattern every 1.5 microseconds or something like this, I don't have the details. So be very careful when you're using a non-dedicated radar signal: check the ambiguity function, because the signal might create its own repetitions, which are not targets, just because it does have some structure. Wi-Fi looks like noise, except when it repeats the OFDM header or something like this. But still, you see that this guy, for example, is a real target, because if I rotate the radar in azimuth, you do see the target at the same location. So I'm not completely lying here. And finally, I was wondering: why is this reflection so powerful? How come there is one building at 500 meters that is sending back such an echo? So I went to see. I walked around and I took this picture. And what you see here: they've got the windows, but as a sun shade, they put something that looks very much like a corner reflector. If you remember what a corner reflector is, it's three right-angled plates. And actually, architects of modern buildings seem to love corner reflectors. You look at modern architecture, you've got three right-angled corners everywhere. So that's very good for radars. And this is actually why this building in particular is returning such a good signal.
Finally, I told you that the range resolution is only 75 centimeters here, with 200 megahertz bandwidth, and we want to detect landslides with sub-centimeter displacement. The classical method is to do interferometric measurements, InSAR. So in InSAR, you don't only look at the magnitude of the returned signal, but also at the phase. And the phase is ambiguous, because you've got two-pi phase rotations. So you don't know how far the landslide is from the phase alone, but you don't care, because you get that from the range resolution. And by looking at the phase, you can get your distance variation, which is half the wavelength times the phase rotation over two pi. The only subtlety is that, because it's a radar, it's half the wavelength, because you've got a two-way trip. And so basically, what I did is I took all the strong reflections. The pink here is misleading: that's NaN, not a number. And I took the average and the standard deviation of all these guys. And you see that the mean value is within 1 millimeter. So you do get a millimeter on the mean value, with 1.5 millimeter standard deviation. So I claim this to be 0 plus or minus 1.5 millimeters, which is probably not state of the art, but it's just educational. So I'm still almost pleased that it works. And if you had seen some of my previous presentations with a corner reflector: I tried to do it, and it failed then. Here, if you move a corner reflector in steps of 5 millimeters, you do see it. So the phase analysis is working as well. So to conclude this presentation, I wanted to share with you how you can use, I think, affordable hardware for running a ground-based synthetic aperture radar, especially as an educational tool, using a commercial off-the-shelf Wi-Fi emitter, in this case as a cooperative source, because I'm broadcasting the signal myself. And I think it's a great opportunity to get started with digital signal processing.
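The phase-to-displacement relation, half the wavelength times the phase rotation over two pi, can be verified numerically for a 5 millimeter motion at 5.8 GHz:

```python
import numpy as np

c = 299_792_458.0
lam = c / 5.8e9                 # ~51.7 mm wavelength at 5.8 GHz

true_disp = 0.005               # a 5 mm motion, like the corner-reflector test

# two-way trip: moving the target by d rotates the phase by 4*pi*d/lambda
dphi = np.angle(np.exp(-4j * np.pi * true_disp / lam))   # wrapped to (-pi, pi]

# displacement = (lambda / 2) * (phase rotation / 2*pi)
est = -(lam / 2) * dphi / (2 * np.pi)

print(round(est * 1000, 2))     # 5.0 -> millimetres recovered
# motions beyond lambda/4 would wrap: the two-pi ambiguity mentioned above
```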
Now, just to give you an idea of the budget, because I told you I had some leftover budget from a former contract that I had to spend by the end of the year, so I bought all this hardware. The antennas are 1,000 euros for the pair of transmitter and receiver, plus the accessories for mounting the antennas. You've got the rail, which is by far the most expensive part. But you do need the accuracy of the rail: the repeatability of the rail is what gives you your ability to do InSAR. If you've got a shitty setup with an uncertainty of 5 millimeters on the position, well, 5 millimeters with respect to the half wavelength at 2.4 gigahertz is significant, and this will blur your image. So that's where I wanted to spend a bit of money to have good quality. These guys claim 100 micron reproducibility, so below a tenth of a millimeter, which I think is really good. And it's kind of easy to use. You've got your Wi-Fi dongle, you've got the passive RF parts, and you've got the Raspberry Pi. These are all easy to find. And the B210, actually, I had leftovers; I think I have a dozen B210s in the lab, so I just took one of the leftover B210s. As I was preparing this talk, I wanted to share with you the fact that everyone could do it, and at the end we have a 7,000 euro project, and I'm not sure everyone wants to spend 7,000 euros on a B210 setup. You do see that the most expensive part here is the B210. So I checked, and I do have quotations from the beginning of last year that say the B210 was 1,400 euros. In January 2024, it's now 2,100 euros. So I'm sorry for NI, but I'm not going to advertise the B210, because this is really too much of a price hike. You do have the Pluto Plus with two channels, which I can get on AliExpress for 300 euros, and it's the same AD936x chip inside, and it has an Ethernet output.
And when you've got all these moving parts: if you ever did some USB on moving parts, USB is the worst connector you want on a moving part. Ethernet, at least, you plug it in and it stays there. So yeah, unfortunately, I wanted to demonstrate this for this presentation, and my Pluto Plus is still in the mail. So I cannot demonstrate that the noise level is the same, that the communication capability is the same, that it runs flawlessly on the Raspberry Pi. That may be for next time. But you will save 800 euros on this budget, and then it's a 5,000 euro project that I'm showing you here. You can find all the repositories with the processing on GitHub. Hopefully I documented everything. If you wish to reproduce it and you are missing information, feel free to reach out to me, and I'll be happy to complement any missing information. Be aware that if you want to use different hardware, running Bastian's packetspammer does require what is called promiscuous mode, and not all chipsets support promiscuous mode. Furthermore, be aware that the chipset of this particular board is not in the mainline Linux kernel, so you will need to recompile the kernel. And if you're doing cross-compilation for Buildroot, you need to know how to cross-compile your kernel module. And finally, this was all done with your taxpayers' money, so public money, public code. Thank you for supporting our research, and thanks to my colleagues from the mechanical workshop who did a very good job in assembling these antennas. And with this, I thank you for your attention, and I even have one and a half minutes for questions. [Question from the audience, partly inaudible, about tuning packetspammer to leave more radio silence after each packet.] The question is, how do I tune the silence in packetspammer? And actually, I did the exact opposite: I wanted to have the packets as close to each other as possible, so I have as little gap as possible.
And as I was putting in too small a value: if you ask packetspammer to send a new packet while the previous one is still being broadcast, then it will send back an error message, and the Wi-Fi dongle becomes very unhappy. So I was conservative and I did put in excess delay. It's not that I wanted genuine Wi-Fi users to still have their connection, that I didn't really care about, but I didn't want my Wi-Fi dongle to crash. So I put in some additional time delay, not too much, so that I'm not wasting too much time. The reason why this measurement takes 15 minutes is really the collection: I'm taking something like 100,000 samples per position, per spectrum. And really the collection of the data and getting rid of the silence is the reason why it takes so much time. If you look at commercial GBSAR, they promote a one-second measurement duration. Another reason, which I didn't mention, is power consumption: GBSARs are typically installed in remote locations, and of course, the longer it takes, the more power you draw. I made a power budget for this device: it's 25 watts. So whether you draw 25 watts for 15 minutes or 25 watts for one second is going to completely change the life expectancy of your battery. So if I had to work on something now, it would really be making it faster, so that it can run on a battery or a solar panel and the energy consumption of each measurement is much lower. So the initial question of putting gaps in packetspammer: it's just to not crash the Wi-Fi dongle. [Question from the audience:] Have you considered using rails from 3D printers, because they are usually cheaper and still have very nice precision? [Follow-up, partly inaudible, about which parts could be replaced with 3D-printer-style rails.] So the question is which parts could be made with 3D printer rails. The problem here, and I did not put the weight estimate on the slide, is that I think the two antennas plus the hardware setup weigh something like 1.2 to 1.5 kilos.
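The battery argument above can be put in numbers. This is a back-of-the-envelope sketch using the 25 W figure from the talk; the 100 Wh battery capacity is an assumed value purely for illustration.

```python
# Back-of-the-envelope energy budget for a remote, battery-powered GBSAR,
# using the 25 W draw from the talk plus an assumed battery size.
P_watts = 25.0

def energy_wh(duration_s: float) -> float:
    """Energy for one measurement, in watt-hours."""
    return P_watts * duration_s / 3600.0

slow = energy_wh(15 * 60)   # this prototype: ~15 minutes per scan
fast = energy_wh(1)         # commercial GBSAR: ~1 second per scan

battery_wh = 100.0          # assumed 100 Wh battery, for illustration
print(round(slow, 2), round(fast, 4))                  # Wh per measurement
print(round(battery_wh / slow), round(battery_wh / fast))  # scans per charge
```

The ratio is simply the 900x difference in measurement duration: a handful of scans per charge versus thousands, which is why speeding up acquisition is the obvious next step.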
And that's really the challenge in having a nice mechanical setup. You do see that there is a bit of hardware there. So when you want to move this stably and reproducibly, I went for a fancy rail. Also, I wanted to do it fast, because my previous setup was a screw-driven rail, and it would take like 10 to 15 seconds to go from one position to another. So just the time to move would add up to something like five or six minutes in the measurement. This guy can move in a fraction of a second from one position to another. So there are many solutions that you could go for. There are also these rails for photographers who want to do time lapses where they move a camera. Yeah, I didn't trust them, so I went for the more expensive option. But yes, there are many other solutions that you could go for. So thank you so much.
Design of a follow-up QO-100 payload
Good morning, all. My name is Frank. I'm working at the satellite communications group of ESA, the European Space Agency, in the technical centre. And we have a presentation here that is not extremely technical; I would just like to explain a few initiatives we are embarking on. And that is in the area of possible future amateur satellite payloads which are hosted on satellites for experiments. It's maybe not so well known, but we work quite a lot in commercial satellite communications with companies in Europe, and we finance and co-finance various projects. But it should not be forgotten, we think, that many of the innovations that came to the world of satellite communications actually come very often from the amateur satellite world. A lot of work has been done there that has since spun off into commercial applications. And we give here a few examples of things where the amateur satellite community was always first. They flew CMOS chips, for example, for the first time in the world on their satellites. They are the ones who made maybe the first inter-satellite links. And there have also been companies, for example SSTL, that started with building a few CubeSats at Surrey University and slowly became basically a larger satellite company. So that is all heritage from the amateur satellite world. Also quite interesting is that the amateur satellite world flew the first GPS receiver at very high altitudes, even up to highly elliptical orbit. So these are all, I think, quite nice achievements of the amateur world. At the ESA side, we would like to support initiatives that at least think of future amateur satellite payloads on future satellites, and we will explain a few things on that. And that is specifically also related to the payload which is currently on a geostationary satellite, called QO-100.
Maybe not everybody is familiar with that, but it's a very nice payload which for the first time is hosted on a geostationary satellite. So it's always above you. You can find excellent videos explaining how to work that satellite and how to build up, at low cost, such communication over this satellite. Let me explain, just to give a quick idea: this is the footprint of that payload, which is on a geostationary satellite, so it travels at the same speed as the Earth rotates, so virtually it's always above you. And this is a very large commercial satellite, hundreds of millions, but on it there is a small payload which handles amateur communication in S-band and the lower Ku band. And the beauty of this, I think, is that it is enabled by the reuse of existing low-cost components: existing 2.4 gigahertz amplifiers, and the modification of the low noise blocks that you use for normal satellite television, for maybe 10 euros or even less; you modify them and then you can use this satellite. So with relatively low cost you can communicate with the satellite. In particular, German and UK amateur radio enthusiasts have been instrumental in getting a community working over this payload on Es'hail-2, which is the name of the commercial satellite. This was the first time that radio amateurs were able to have, let's say, more continuous communication with each other over a satellite. And you can see here the footprint. Take the green and the red lines there, which are linked to what kind of elevation you need for the antenna. But this goes from Brazil to Indonesia. So in a single hop, basically, a user in Brazil could communicate with a user in Indonesia. So there is enormous potential there for all kinds of experiments.
And this has also led to more broadcasting using standards like DVB-S2, which is very active at the moment on QO-100, where new technologies went into the amateur domain, and the amateur domain is now making very nice open source implementations of the MiniTiouner and various other DVB-S2 equipment. That is something we would like to support. But how do we support that? Now, a longer time ago, because these processes do not go that fast, unfortunately, a letter was written to IARU, the International Amateur Radio Union, of which, I think, Sylvain is one of the bosses; it is divided into various regions of the world. We, let's say, stimulated that letter, and IARU basically asked ESA: could you not help? We need to think at least maybe of a follow-up of QO-100. As you probably know, we are publicly funded, so the various countries want to have their say. Everybody had their say and said, that's okay, here you have some funding. So we have funding to start that process, and that funding is meant to collect requirements and also to make a few prototypes, maybe. It will not be enough to host a payload on a satellite; that will not be possible. We have to look for other funding mechanisms for that later, but we'll come to that. So what we will be doing is to identify requirements from all the people in Europe and Canada, Canada being one of our member states, to identify what would be good requirements to fulfil for a next geostationary payload. And regarding the orbit, geostationary or other orbits: we have heard that some people would be quite interested to also explore maybe a payload in medium Earth orbit. You can imagine that then you have a longer contact time, and it still has a bit of a global attractiveness. We are considering that, because there might be various institutional initiatives starting soon where there could be hosting opportunities for small payloads in medium Earth orbit.
So we will be consulting the various parts of the amateur community; that process will start very soon. We already requested a few inputs, we still have to process all that, and we will then talk to the various satellite operators to see how we could accommodate a payload and how to get the funding for that. The first idea we heard already is that a few people would be very interested in, let's say, keeping it simple. Actually, the payload which is currently on Es'hail-2, it is fantastic that it's there, but functionally you could say it's rather simple. It's an analog transponder: what goes up, in whatever modulation you use, comes down. And many people like that, because that means a lot of experimentation maybe at the modulation level, deeper down at the RF level. There is also a whole community that has been raised with SDRs, let's say, starting more at that level. And there is a whole community in the amateur world that is working more at, let's say, the IP level and even higher. And we have to find a bit of a balance, from maybe simple payload designs to maybe something which is really more complicated. And you can imagine also that in the amateur world there are communities that are going up and up and up in the frequency range. In the amateur community we have 24 gigahertz, we have up to 77 gigahertz, which we all could use for satellite. But you could imagine that going higher up in frequency also means that you possibly have narrower beams. And on the other hand you would also like the satellite community to be served, let's say, on a larger scale. If we have one very nice spot beam in E-band, let's say at 76, 77 gigahertz, that will serve probably one country, which is not so inclusive, I would say. So there are a few balances to be made there. But it would be quite nice in some way to come to a combination, as some people have already suggested, where we have an analog transponder.
And actually what you would like maybe is to have in geo orbit basically the ultimate Linux brick with everything around it, and then everybody can do what they want. The only disadvantage here, again, is that if we put something like this on board, then we get a certain degree of centralization: you need a sysadmin for this satellite, basically. And that is not always to the liking of the amateur community. One sometimes likes a certain degree of chaos, let's say, and anarchism and so on; it should not be too regulated. So that is also a bit of a balance to be made. But we have various ideas to put it also a bit more in the 5G area, where maybe certain splits in the whole 5G architecture, like CU/DU, could be partly put on board. Because many, many people, even in the amateur world, are starting to look at various communications based on 5G NTN, non-terrestrial networks. So there are various trade-offs to be made. We listed here a few of those, and we will now start a larger consultation on all those topics. Back again, let's pick a few, let's say the attractiveness of future user terminals. Like the earlier example of the ground-based SAR at a few thousand euros, in that case thanks to the opportunity of some taxpayers' money: what is acceptable later on for a radio amateur? If we go to 77 gigahertz, maybe we can use automotive radar hacks and so on. That all needs to come together. So we would like to request, later on, in a more structured way, input from the amateur community, but also taking into account all these factors, because there's no sense in proposing something that not a lot of amateurs can benefit from. So that is what we are currently starting, and we will show you a little bit of what we will be doing in the next months. I will not go into detail on the planning, but what we are now going to do: we are already talking in a very small group to get a bit of a sense of what we should do, and also, in particular, what ESA should not do.
Because some things are far better left to the amateur community. We are preparing a bit of a consultation; we talk to the amateur community and we also talk to a number of, let's say, parties who would likely build such a payload. However much fun it would be, it is not so likely that the amateur community would build such a payload themselves. A geostationary operator with a 300 million satellite would like to know what he hosts as the few kilos extra, and it is not so likely that he will accept that it is built by amateurs. However good they are, with all respect: he will not accept it, and his insurance company will not accept it. And what we would organize in May is a day at ESTEC, also with support from our technical people, to discuss a few options. Then we start prototyping; we have the funding to prototype a bit what some people call a flatsat: it's the model of the satellite, but you basically lay it out on a table. And we would actually like to have a few ideas ready in September. In September there is always a very large satellite conference where all the satellite operators are, and we are making appointments there, and we will pitch this to satellite operators as, let's say, a good thing. Many people in the satellite world complain about the lack of people that understand RF. That is a real lack. There are a lot of people on the programmatic side, but there are not so many people that really understand RF. And I also think satellite operators, and maybe the industry too, could take a bit of responsibility to stimulate young people to start to understand satellite communications. That is at least also one of our objectives: to get more people enthusiastic about satellite communications. So we hope to advertise that shortly. May or June, we will see a little bit depending on availability, to organize a day to go through a number of payload designs.
And we're also trying to get some travel reimbursements and so on in place so that people could come to us. And hopefully in September we discuss a few things with satellite operators, and even better, we hope that maybe the outcome of such a discussion could be presented at a next FOSDEM, hopefully next year. I think that's it from our side. You'll hear more from us. And as said, it's not such a technical talk; it's a process we're starting. All your technical inputs from the AMSATs, the amateur satellite organizations in the various countries, which we are approaching already, but also individually, would be highly appreciated. That's it. Thank you very much. Thank you, Frank. We have time for one or two questions. Please. [Question:] Is there any plan for phased arrays or beam steering, or is it way too expensive for such payloads? Yeah, that would indeed be very nice. Of course, let's take first a phased array on board: that will be highly dependent on the frequency range. Let us assume it would be in E-band, you know, a 77 gigahertz amateur phased array; that would be a fairly expensive thing. But we are also there to see where maybe certain developments could spin off further into industrial developments. So this would be a good candidate, but a phased array also needs a type of management of that beam; that comes with it. And of course, it's still a challenge in development on the ground as well. Maybe if we take the scenario of a medium Earth orbit, where you would need pointing, then I can see, also from the amateur community, the various YouTube videos that appear with the educational Pluto beam steering and things like that. There I see a lot of opportunities to do beamforming, to educate people on the essentials of beamforming, with maybe an existing beacon that comes from the MEO. I think that would be excellent. Yeah. Oh, please. [Question:] Supporting Canada... Sorry, you support what?
[Question, partly inaudible, about supporting Canada and the elements in there.] Yeah. Okay. The question was about the geographical coverage. Canada is in; it's also part of, let's say, our ESA member states: Canada funds ESA. So we are interested to include the Canadian footprint. We have already received a bit of input on that, but we have to see, because you can imagine that the orbital location of such a geostationary satellite is not up to us; we are not the ones picking the orbital location. In that respect, a medium Earth orbit hosting would of course be preferable, but again, a trade-off. We can't decide that ourselves. Sorry. Exactly. Yeah. In ESA, there are a number of geostationary projects ongoing, and we are, let's say, trying to see whether we could use some leverage there, maybe, to host something in the future. Please. [Question:] One part would be, if you're talking about payload providers that are not amateur radio, are you talking about universities or only commercial partners? And then, is there a project that is not an M- or L-class ESA mission, also like CubeSats, not from the amateur radio side, but let's say more from a university, where you could also see this situation? Yeah, on the payload provider: the question is whether the payload provider could maybe also be a university. Now, I must say we have seen universities providing payloads to various missions, though not always to commercial missions. And that is, I think, where the satellite operator will always have, let's say, the last word, the last say, because that links to the insurance and things like that. So unclear at the moment, I would say. Then secondly, whether a payload could also be hosted on more educational missions and so on: I think that is an option.
The only thing is that there are already quite a lot of amateur payloads hosted on the various OSCARs, so there's nothing new on low Earth orbit satellites and CubeSats and so on. The essence would be to either do something new in medium Earth orbit, or also to advance a bit the payload technology and what you can do with it. That would be the idea. Perfect. Perfect. [Question:] Would the Proba missions also be interesting for this kind of application? The Proba platform itself, indeed. Yeah, I do not know whether the current Proba, one of the, let's say, satellites that is used in various scientific missions, whether its orbits are always, let's say, appropriate. That is of course to be seen. From a platform point of view, I see no problem. Why not use that? Yeah. If there are no more questions, thanks again. Thank you.
An open source digital radio protocol for amateur radio
Hi everyone. Maybe before we get started: how many of you know about ham radio? This is kind of the topic of the room, I know, but still. Okay. And how many of you are ham radio licensed? Nice. Okay, good. I have still included a small introduction. Please put up your hands again, licensed operators, please. All right. So I still have a brief presentation and introduction to the topic for you. Your experience with ham radio might not be my experience, so I think this introduction is interesting. And then we will have a brief overview of what ham radio and open source mean. Not everybody understands open source the same way, especially, I think, in the ham radio community. You will see that open source in ham radio did face and does face a few obstacles. We will pinpoint a few of those, we will see the workarounds, and then finally we will talk about M17, which is the project that I want to talk about today. So first, who am I? I'm a research engineer at the University of Liège in Belgium. I do mainly embedded systems and RF. I have been a licensed ham radio operator for two years now; my call sign is ON4MOD. I joined the M17 project one year ago, right after FOSDEM. Wow. And yeah, I mostly do hardware design, some of which you can see on the table in front of me, and we will come back to that later, and I work on firmware. Okay, amateur radio. I think almost everybody knows this logo for ham radio. This is a technical hobby. The goal is to experiment, to play around, to get your hands dirty. It allows you to legally transmit on certain frequencies which are allocated to amateur radio operations, and which you cannot use if you don't have your license, of course. The hobby is extremely vast. You have operators who do what we call DX, which is reaching the furthest away on the globe using the lowest power, or specific modes, frequencies, whatever.
You have people who are dedicated to antennas, transceivers, reception, transmission, whatever. It's very, very vast. And I think most of you also know that the mainstream products come from just a few brands: Icom, Yaesu, Kenwood, and then that's pretty much it. And you have the Chinese brands, and usually your typical ham radio operator doesn't know much about those Chinese brands. So, open source in amateur radio. Well, this is a bit controversial, a bit difficult to describe, but the ham spirit, which lives in every one of us, has always been about sharing designs, ideas, discoveries, the problems we encountered, and how we solved them. You could call that open source knowledge, maybe. This is not to be confused with the fact that most digital voice protocols that we use have published specifications, which means that if you dig deep enough in whatever search engine you use, you will probably find specifications for those protocols. That does not mean they are free and open source, which is very important, and this is kind of the point of this presentation. So, yeah, some protocols are freely available. Some of you know about a few of those. There is AX.25, which is an amateur adaptation of the X.25 protocol, which mainly works on VHF and UHF, so above 30 megahertz, and which is not designed for voice, of course; it's digital, but mainly data bits, let's say. And D-STAR, which is what most of us could consider an open source protocol for amateur radio. It is the first protocol really created for amateur radio usage, so it is designed from the ground up with amateur radio in mind. It has open specifications: from the start, they decided to publish the specifications, mostly in Japanese. So this can be an obstacle; maybe if you speak Japanese, it's easier for you, I don't know. Then YSF, Yaesu's proprietary mode. Specifications can be found online, but that's pretty much it: you have to have a Yaesu radio to do YSF.
FT4, FT8: just an example of a few modes that are used, very slow speed, very long range, very low power on HF, so very low frequencies. And this kind of illustrates a point I'm going to come to. Then you have, of course, DMR, TETRA, P25, all those commercial protocols which have been adopted by amateur radio but which are not designed for amateur radio. The main thing is, and especially when you talk about modes like FT4 and FT8, and there are many of them, those have only one closed source implementation. It's not easy to play around with it. You can't just say, okay, I'm downloading this, trying to modify this; is it better, is it worse? The way you play around is: okay, which power do I need to reach this country in this weather, or whatever, which is not what is suitable for each and every one of us. So we will briefly take D-STAR as an example. It was released by the Japanese amateur radio league, JARL, in Japan in 2001. It uses the AMBE vocoder from DVSI. Very briefly, a vocoder: you know that voice is a very complex signal. You need a lot of bits to transmit the voice, but amateur radio protocols are slow speed, low bit rate. So you do need to encode the voice into something which is manageable by those digital voice protocols, and the way it is done in the case of D-STAR is using the AMBE codec from DVSI. The specifications are publicly available, but there is no license tied to those specifications, which means you do not have to publish whenever you deviate from those specifications. And so it kind of de facto became Icom's proprietary mode. It is called D-STAR, but it is not made to be interoperable with other D-STAR implementations, and by the way, there are not really many other D-STAR implementations. So yeah, the main obstacles. First, manufacturers exploit the fact that specifications are not really licensed, and so they can find a trick to lock down their environments. The second main obstacle is technical capabilities.
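To put rough numbers on why a vocoder is needed at all: telephone-quality PCM speech is around 128 kbit/s, while these digital voice protocols only carry a few kbit/s. The sketch below uses some of Codec 2's documented mode bit rates purely for the arithmetic; the 20 ms frame length used for the 3200 bps mode is my assumption for illustration.

```python
# The point of a vocoder, in numbers: telephone-quality PCM speech versus
# a low-bit-rate vocoder such as Codec 2.
raw_bps = 8000 * 16                 # 8 kHz, 16-bit PCM: 128 kbit/s
codec2_modes_bps = [3200, 2400, 1600, 1300, 700]  # some Codec 2 modes

for bps in codec2_modes_bps:
    print(bps, "bps -> compression", raw_bps // bps, "x")

# At 3200 bps, assuming 20 ms frames, one voice frame is just 64 bits:
frame_bits = 3200 * 20 // 1000
print(frame_bits, "bits per frame")
```

So the vocoder squeezes speech by a factor of 40 or more, which is what makes it fit into a narrowband RF channel, and why, in 2001, that squeezing had to happen in a dedicated chip.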
Back in 2001, encoding voice on a microcontroller was not really feasible. That's why you needed an ASIC, a dedicated chip on the board, made by DVSI with their AMBE codec, to be able to encode voice into bits manageable by whatever digital voice protocol you wanted to use. So I'm not spitting on DVSI and AMBE all the way; there is a whole lot of technical reasons why it is that way, but I think it's good to understand where we come from, to see where we can go from there. Another thing to note: D-STAR, YSF, DMR, P25, NXDN, whatever, almost all the digital voice protocols, amateur radio or even commercial, use AMBE, AMBE+2, basically AMBE and variants of it from DVSI. Basically, one vocoder to rule them all. So how does one tinker with a closed source vocoder? You don't. At least not really. But hey, the vocoder is an integral part of the protocol, so what do you do? Is it possible to have what we could consider a fully FOSS protocol if the vocoder is not open source, and if so, how do you do it? Well, 2001 was, sorry to break it to you, quite a few years ago. So a solution came in 2010. Its name is Codec 2, released in 2010 by David Rowe. I want to underline that this was not an easy task; it was the topic of a full PhD thesis, itself relying on older work and algorithms and so on. Nobody woke up in the morning and said, hey, let's do an open source vocoder, it's going to be easy. That's not how it goes. It's fully open source, no patents, no industrial secrets. That's the point. And since 2001, computing power on microcontrollers has increased by quite a lot. I mean, 8-bit PICs and 32-bit ARM microcontrollers are not the same thing. So this last brick, which was kind of the missing brick to have fully open source protocols, allowed the emergence of two main protocols. The first one is FreeDV, which is designed by the same David Rowe.
He's not alone, but he is one of the contributors of FreeDV, which is licensed under LGPL 2.1, using Codec 2 at lower bit rates because it's on HF, so low frequencies, narrow bandwidth: you can't transmit a lot of bits, so you just slow it down. You degrade the voice a bit more, but then you're able to do long range digital voice communications. And it's also used as the reference Codec 2 implementation. So again, just like I said about FT4 and FT8 a few slides ago, you do something and then you provide your own implementation, except that this one is open source. And then M17, under GPL v2, uses Codec 2 at the highest bit rate available. It fits in a standard FM bandwidth for VHF and up, so you can't really use it on HF; it's a bit too wide, and you are going to annoy a few people. It was published in 2019. So, the M17 protocol has all the features you could expect from a digital protocol in amateur radio. You have the packet mode, so you can use it to control a remote site, for example, by just sending commands. You have a stream mode, which is the mode used for digital voice. It supports AES encryption, which, depending on where you live, you might or might not be allowed to use; I know I can't. It also has specifications for traffic over IP, which I think is a good thing. If you look back, the main digital voice protocols do not have that, so the community goes off in all directions, each one with its own way of doing it and different implementations, and then somebody comes along: hey, I have an idea, let's try to interconnect this, and it's just one more brick in a very tall and fragile wall. So here we provide specifications for this, which does ease up implementation and interoperation. You probably know about DMR: to use DMR, you need a DMR ID, which is centralized. We don't. In this protocol, you only need a call sign, and if you can use the protocol, you most probably already have a call sign, so problem solved.
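On the call-sign addressing: M17 packs a call sign of up to 9 characters from a 40-symbol alphabet into a 48-bit address field (40^9 < 2^48), which is why no central ID registry is needed. The sketch below illustrates that base-40 packing; the alphabet and the character ordering follow my reading of the spec, so verify the details against the published M17 specification before relying on them.

```python
# Illustrative sketch of M17-style base-40 call-sign packing: up to 9
# characters from a 40-symbol alphabet fit in 48 bits (40**9 < 2**48).
# Alphabet and ordering per my reading of the spec -- verify before use.
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-/."

def encode(callsign: str) -> int:
    assert len(callsign) <= 9
    value = 0
    for ch in reversed(callsign.upper()):  # first char = least significant
        value = value * 40 + ALPHABET.index(ch)
    return value

def decode(value: int) -> str:
    out = []
    while value > 0:
        value, idx = divmod(value, 40)
        out.append(ALPHABET[idx])
    return "".join(out).rstrip()

addr = encode("ON4ABC")   # hypothetical call sign, for illustration
print(hex(addr))
print(decode(addr))       # round-trips back to the call sign
```

The practical consequence is the one stated above: any valid call sign is already a valid M17 address, with no registration step.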
And the specifications are open source, licensed under GPLv2, which means that if some big manufacturer says, hey, I like this new protocol, let's try to benefit from it, great. But if you modify it in a way that is incompatible with our specifications, the license forces you to publish your modified specifications completely, and we will find a way to make sure that whatever we do next is compatible with you. If you don't want to be compatible with us, we will be compatible with you. Okay, so this was the M17 protocol, but we should go beyond that: there is the whole M17 project. More than a protocol, getting rid of the proprietary vocoder enables a lot of things. First, you can have it running on your computer. You don't have to pay any license fee, and you don't have to plug a USB dongle into your computer so that your software can go through the proprietary chip. You can have software on your computer for D-STAR, for example, but you need such a USB key to have the license to use the codec. You can have it on your phone, same thing. DroidStar, maybe some of you know about it, maybe not yet, is a very small app that lets you use digital voice protocols and connect through reflectors, so servers. M17 runs there straight out of the box. For the AMBE codec there are software implementations online, illegal implementations, but you have to find them and download them, and it becomes very shady, a cat-and-mouse game between DVSI and amateur radio operators. You can have it on your radio, a lot more about that in just ten minutes, apparently. And you can have it on reflectors: D-STAR reflectors need the same USB key that you would need on your computer to transcode the voice between AMBE and whatever else you use.
A small note: if you have a D-STAR reflector going from D-STAR to D-STAR, you do not need this key, because you do not need to decode and re-encode the voice; you can just pass the encoded bits around. But if you want to go from something to something else, then you're stuck. So yeah, it's a whole ecosystem which was able to grow from the ground up because of the open-sourceness of Codec 2. Part of this ecosystem is Module 17, this board, which is open source hardware, open source software, open source protocol, open source almost everything you can wish for. Let me get into the frame of the camera, maybe. You have the board here, in its newest revision 0.99, because you never get to 1.0 in one go, and the enclosure that goes around it, because having the bare board on your desk is screaming "I want a short circuit as soon as possible", so let's avoid that and put it in an enclosure. The difficult thing is that when you do open source hardware, making money out of it is difficult, but this exercise is intentionally left to the reader. Fully open source, affordable: about 50 euros. Try to find a digital voice modem, TNC, whatever, for that price; I think you will come back to us. OpenHT is another baby which is on its way, not as advanced as Module 17 yet, but it's aimed at being a fully open source portable radio. Basically, if you can modulate it, we can send it. For now it only works on the 70-centimeter band, 430 MHz, and on 2.4 GHz. So for those of you who can see QO-100 in the sky, it does the uplink to QO-100, at 25 milliwatts. Hey, it's a prototype, step by step please. It comes with a 3D-printed enclosure, also open source. Very quickly: it relies on a dev board from STMicroelectronics, and the back side we did ourselves, with the power supply, the FPGA, and the transceiver. About the FPGA, you can maybe see a few asterisks on the screen: the FPGA toolchain is sadly not open source.
If you play with FPGAs you maybe know that having open source FPGA tooling is difficult. It is one of our goals, but vendors will usually provide IP blocks in their software and say, if you want to exploit this commercially, please talk to us first. So it's always a bit difficult to deal with. We have plans for the future, though. We are starting the work to port it over to OpenRTX, which you will hear about in six and a half minutes. We want an open source FPGA flow. Quick note: the FPGA toolchain may not be open source, but that does not prevent you from building the design yourself. You can download the software, which is free as in beer, rebuild the bitstream and upload it to the radio. So you can still tinker with it however you want; it's just not, strictly speaking, open source. We want five watts of output. We want USB-C charging. So yeah, we have plans. We are not only pushing our own protocol with it; we want to make better, open source products for the community. I think there is a big hole in the ham radio community here, and we are here to fill it. A very quick shout-out to some very interesting projects close to M17: OpenRTX, the open source firmware; WPSD, the hotspot software, which has supported M17 for a while now, contrary to Pi-Star, which has supported M17 for, I don't know, ten days; and MMDVM, the hotspot hardware. We rely on those to have hotspots which do M17, and there are many, many more things that revolve around this. Okay, so thank you for your attention. I hope you liked it, and I hope it gave you some ideas and the desire to join us and help us. We need devs, please; I know everybody needs devs. Check out our ham radio infobooth in building AW. I think most of you already came to say hello, but if you did not, we are still here today. Okay, thank you very much. We have some time for questions. Yeah.
4-FSK with root-raised-cosine filtering. The main chip you would use is the CC1200 from Texas Instruments, using its packet radio port, just like you would do for AX.25 with your old TNC. So this module basically takes the sound from the microphone, processes it, encodes it with the Codec 2 vocoder, does the protocol framing, baseband creation, processing and filtering. You have the baseband output here; feed it to your radio, and the output is 4-FSK modulation, for the M17 protocol. If you want more, come to our infobooth. Yeah. What FPGA is on the board? For now it's the latest... I forgot which one. It's not an iCE40, which would have an open source toolchain available. We had some technical issues with the FPGA, so the one we use is a Lattice Certus-something, I guess, because for the transceiver we use we needed LVDS pairs for the data transmission, 64 MHz for the LVDS pair speed. Yep. You have been addressing some of the shortcomings of the other modulation schemes and protocols. Leaving aside that they are not all open source, they have other shortcomings: on UHF there are reflections and fading and many other things that you experience outside the lab. Are you also addressing these with M17? Is there better voice quality, does it cope better with fading and reflections? Okay, yeah, there are indeed shortcomings in many digital voice protocols, and we are between a rock and a hard place. We use a 4-FSK modulation scheme, which basically does not allow you to overcome the multipath problems, reflections and so on, so we are aware of this. We have to go step by step. The specifications are open: you are free to fork them and, I don't know, implement the protocol in OFDM to avoid some of the problems linked to the physical layer.
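For reference, the 4-FSK mentioned in this answer maps pairs of bits (dibits) onto four frequency-deviation symbols. The sketch below uses the dibit-to-symbol table as I read it from the M17 specification; the deviation figures in the comments are likewise from the spec and should be double-checked there.

```python
# Dibit-to-symbol mapping for M17-style 4-FSK, per the mapping table in the
# M17 specification as I read it. At 4800 symbols/s with 800 Hz of deviation
# per symbol unit, symbol +3 corresponds to +2400 Hz.
SYMBOL_MAP = {(0, 1): +3, (0, 0): +1, (1, 0): -1, (1, 1): -3}

def bits_to_symbols(bits):
    """Group a bit sequence into dibits and map each onto a 4-FSK symbol.

    The symbol stream would then be pulse-shaped with a root-raised-cosine
    filter before driving the modulator, as described in the talk.
    """
    if len(bits) % 2:
        raise ValueError("bit count must be even")
    return [SYMBOL_MAP[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]
```

This single-carrier scheme is cheap to generate and demodulate on a microcontroller, which is the trade-off against multipath robustness discussed in the answer.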
And for voice quality, some will say that Codec 2 sounds better than AMBE, some will say it's worse. It depends, really, but I think it's at least on par with the other protocols. Yeah? So that's a nice thing. For the intended use case on UHF, do you have curves or measurements of signal-to-noise ratio, analog voice versus M17, and how much further you can get, for example with bursts? Yeah, so this basically boils down to: do we have any graphs comparing analog and digital? Well, this is a wider topic than just M17, but I agree that having those comparisons might be a good argument to push people toward those modes. But no, we don't have those curves at the moment, I believe. Many people could produce them, though. Another question if I can. Yeah. Have you thought about interfacing with it? One of the things I like to experiment with is putting other data on the channel, but on the current hardware you cannot interface with it. Okay, so: can we send arbitrary data using Module 17? Not yet, but everything is there for you to be able to do it. The firmware on it is OpenRTX. There is a USB-C port with data lines connected to the chip; we use it for firmware updates and for standard I/O output, for debugging basically. In the future it is planned, and Silvano will talk about that in a few minutes, to have a communication channel between the computer and the board, and from there you can send basically whatever you want. So yes, it is feasible. Yeah. Have any large manufacturers shown interest in M17? Yes. We talked with Kenwood, which would be interested in implementing M17, and I point back to the fact that the specifications are licensed GPLv2, so they cannot lock it down for their own use. We also have Connect Systems, which showed interest in our radios, and Baofeng.
With no more questions, and I see no hands rising: a big thank you again. Thank you very much.
OpenRTX: an open source firmware for ham radio devices
Hi, welcome everyone. A small technical problem: I almost lost my voice standing at the ham radio booth, so my voice is not very good today. Anyway, I'm going to talk about the OpenRTX project, which is an open source firmware for ham radio devices; we will see that it's not only an open source firmware, it's also a little more. First of all, who am I? I'm also known as Redman on the various communication channels, Matrix and Discord and so on. I come from Milan, Italy. I've been a ham radio operator since 2017 as IU2KWO (India Uniform Two Kilo Whiskey Oscar). I don't do that much on-air ham radio activity; I do it on hardware and on code. I'm a firmware developer by profession, and I work in the motorsport industry. I'm co-founder and developer of OpenRTX and also a member of the M17 team since 2021. So, OpenRTX is, as I said, an open source firmware for ham radio devices which is modular by design, which means it's designed to let you easily add or remove components. It's designed to be easily portable to new devices: if you want to have it on whatever radio or device you prefer, you can almost just take it, implement a couple of files, and have it up and running on your new device. And finally, it's easily extendable to new protocols, which means we try to standardize the interface between the firmware core and the protocols, so that you can implement your own protocol in the firmware without too much pain. Currently it supports only FM and M17, even though it has been developed primarily on DMR devices; I will tell you why it does not support DMR. And finally, not written here: it's licensed under GPLv3. Now, a bit of timeline. The project started in March 2020, and it started because of COVID: me and Nicola, IU2KIN, were at home with basically nothing to do.
We decided to try porting the OpenGD77 project to the Tytera MD-380, which at the time was known, and still is, because of the work done by Travis Goodspeed on md380tools: he did a lot of reverse engineering, modding and work on the original firmware. Then, in September 2020, after a bit of work (we had already brought up the radio driver, the display and a bunch of stuff), we realized that the OpenGD77 firmware was too tied to the structure of that radio. So we decided: okay, let's take everything, restart from scratch, and design things from the top down instead of adapting the firmware to the hardware. We abandoned the original porting idea, and we can say that's the official beginning of OpenRTX, because we created a separate repo with the OpenRTX name; before that it started as OpenDMR. In early 2021, so about four months later, we had the first release, v0.1, with working FM transmission and reception on the MD-380. One month later we joined the M17 team and started experimenting with transmitting M17 from the MD-380. Some time after that we brought up support for the GD-77, the DM-1801, the MD-UV380 and Module 17, plus a lot of work in the middle, and finally in May 2022 we released v0.3.3, which has full support for M17 voice transmission. Since that version you can take a radio, flash the firmware, and use it for M17. Then in November 2022 we had a huge contribution from a ham radio operator from Australia who implemented voice prompts for vision-impaired operators, for extended accessibility. And in October 2023 we wrote a BSP for the LilyGO T-TWR Plus, which is a small device you can easily buy and play with, with an ESP32 and a radio module, plus various technical improvements in the middle, restructurings, the M17 demodulator later on. A lot of stuff, and more to come. So the supported devices are these; we saw them in the timeline.
We have the two Tytera/Retevis radios: the MD-380, which is single band (UHF or VHF), and the MD-UV380, which is dual band, VHF and UHF. They both do FM and M17. The GD-77 and DM-1801 do only FM; they are DMR radios and cannot do M17 for technical reasons. I will give you more details about the M17 implementation later and tell you why those radios cannot support it. We have Module 17, and finally the LilyGO radio, which for now does only FM and in the future is also going to support M17: the hardware is good, it's just a matter of writing drivers, which is not always an easy task. The internal structure of the firmware is this one. At the bottom there is an interface with the operating system, which can be either real-time or not. A real-time operating system is preferred on the radios, because you have to implement protocols, protocols have timings, and you cannot just rely on ordinary scheduling. Then we have our hardware interface API, a set of header files which define the functions used for keyboard, display and so on. In the middle is the code which is common between all the devices; I call it the core. It has a small graphics library written by us: we thought about using LVGL, but one of the objectives is to keep the code size as small as possible, and LVGL was a little too heavy, at least for some devices. There is GPS support for the radios that have a GPS. There are the voice prompts. There is the whole system for saving and restoring user settings and default settings, and settings management. There is the codeplug system, which gives you a list of channels, contacts and the like, saved in the radio. And there is the audio management subsystem, a mechanism we set up to make the management of audio connections easy.
We have the concepts of audio path and audio stream, so that you can easily open a path from a source to a destination, and the audio management system is responsible for actually talking to the hardware, making the various connections, and managing possible conflicts between paths: if you open a path towards a device which is already busy, either your path has a higher priority and takes over, or it gets rejected, but you need code to manage this. Then there is the UI, which is the keyboard-and-display part, basically. And then the operating modes, which for now are FM and M17, but more are to come, and we have some ideas of what we are going to implement in the future. Regarding the interface with the operating system, all the code uses the standard POSIX API, which means that any POSIX-compliant operating system is fine. For example, you can compile and run the firmware natively on Linux, which is very good for debugging: when you are developing the UI or other parts, instead of the compile, flash, test, debug, modify cycle, you can just run it on Linux. All the remaining pieces use the standard C library, so you need an operating system which supports it, and if possible a real-time operating system on embedded devices. As for the hardware interface, as I said, we have API functions for display, keyboard, audio connections, management of the radio section and the non-volatile memory, plus a general-purpose platform API which is used for device initialization and for everything which does not fall into the other categories: LEDs, calibration data, hardware information data, and so on. The code is also made such that you can share an implementation between devices, as long as there are similarities between the hardware of the various targets; for example, in the MD-series radios, the display is always the same and always connected the same way.
We wrote the display driver once and we compile it for every target. As regards the user interface, we currently have the standard GUI, the one you find on all the handheld devices, plus a dedicated GUI for the Module 17, just because the Module 17 does not need all the elements of a standard radio, and it also requires some dedicated menu entries, for example for the calibration of levels. Given that Module 17 is not a complete radio, we decided to make a dedicated GUI just for it. This also means that, if you want, you can write your own GUI from scratch, or you can run everything headless, without the GUI. We also have future plans to make the GUI scriptable, or expandable with modules, so that you can implement your own module, for example a satellite tracking module: we expose a standard interface to interconnect with the rest of the GUI, and you just write your code using that interface plus the graphics functions, and that's it; you can add whatever you want. For the operating modes we are using C++. I didn't say it before: all the code is written in C, plus some pieces in C++. We use C++ for the operating modes, simple C++ luckily, just because it was a bit more handy for that part. We define a generic operating-mode class which has a bunch of functions to enable, disable, and do a periodic update, where you can check the squelch, make computations, whatever; the update function returns the squelch status, which then gets queried by the GUI and so on. So, to define your own operating mode, you just subclass the operating-mode class, implement the interface functions, integrate it into the list of operating modes, and you are done. There is still some work to do on the operating modes, because for example there is currently no clean way to set configuration data specific to one operating mode.
For example, for M17 this code is currently a bit hacked in, to let the user easily set the source call sign, destination call sign and so on. As regards M17 in particular, as I said before, we started the work on the MD-380 and then extended it to the MD-UV380. By the way, we started on the MD-380 because, luckily, we had the schematic of the radio, so we could check whether the hardware actually made the implementation possible. Everything is managed inside the microcontroller, and this is why the other radios cannot do M17: the hardware must have the following connections. The microcontroller has to sample the microphone for audio encoding. It has to sample the demodulated audio from the RF stage, because we do all the 4-FSK modulation and demodulation in software, in code: we have to sample the raw data at 24 kHz, and the connection must have enough bandwidth to not distort the signal. And outbound, so to say, we have to be able to send the raw demodulated speech to the speaker as an analog signal, and also send the baseband to the RF stage for modulation. We need all of those connections. The GD-77 lacks, for sure, the RF stage to MCU connection; MCU to speaker is probably okay, but the MCU to RF stage connection is also missing. So the microcontroller cannot send and receive baseband to and from the RF stage, and you cannot go on air this way. And a current problem we have: you have to mod the hardware of the radio if you want M17 on it. Quite a limitation, because you need a bit of practice with SMD soldering. There are guides on the website, but you still have to do it, or find someone who is capable of doing it.
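To make the operating-mode interface from a couple of paragraphs back more concrete, here is a rough Python analogue of the C++ class the speaker describes: enable, disable, and a periodic update that returns the squelch status. All names and the RSSI threshold formula are illustrative only, not the actual OpenRTX API.

```python
# Rough Python analogue of OpenRTX's generic operating-mode class as
# described in the talk (the real firmware uses C++; names are invented).
class OperatingMode:
    def enable(self):
        """Called when the mode becomes active (claim audio paths, etc.)."""

    def disable(self):
        """Called when the mode is switched away from."""

    def update(self, rtx_state):
        """Periodic tick: run DSP, check squelch, return RX status."""
        raise NotImplementedError

class FMMode(OperatingMode):
    def __init__(self, squelch_level=3):
        self.squelch_level = squelch_level

    def update(self, rtx_state):
        # Squelch opens when the measured RSSI (dBm) exceeds a
        # level-dependent threshold; the GUI then queries this status.
        threshold_dbm = -120 + 10 * self.squelch_level
        return rtx_state["rssi"] > threshold_dbm
```

Adding a new mode is then just subclassing, implementing the interface functions, and registering the class in the list of operating modes, exactly as the talk outlines.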
Given that we do everything in the MCU, the MCU has to be powerful enough: the MD-380 has a 168 MHz Cortex-M4 CPU, which is powerful enough. The point is that for M17 you have to send a frame every 20 milliseconds, so you have to do everything, audio encoding, baseband generation and so on, within 20 milliseconds. If your MCU is not quick enough, you cannot do that. And finally, quite a problem: Codec 2 uses floating-point math, which means that if your MCU does not have hardware acceleration for it, or does not have enough clock frequency, you are busted, because you are not able to stay within the timings of the protocol. As regards the codeplug: when we started, we initially thought about parsing and keeping the original codeplug format of the radio manufacturer, but we decided not to, because otherwise we would have to manage a shipload of codeplug formats in various versions. A mess, basically. So we decided to write and specify our own format. We tried to make something which is open and free, of course; which supports common ham radio needs, direct communications, repeaters, hotspots, whatever; and which is portable across devices. Portable both for end users, so you can move your codeplug file around between OpenRTX devices, because the structure is the same and they speak the same language, and for developers, who can take the reference implementation and use it in their own projects if they want. It's currently a work in progress: there is a request for comments open on GitHub. Please take a look at it, comment, help us; we need feedback from ham radio operators on how this should be structured. Technical details: it is a binary format, which means you can write it either raw to the non-volatile memory or inside files, and it's compact, because it's binary. It holds up to 65K channels, so a lot of space. And it currently supports FM and DMR, the latter for historical reasons, because we worked on DMR radios.
And M17, of course, with more to come: the format is already designed to take more operating modes. It may also become a separate entity from the firmware, because we want to make it something which can hopefully be used by radio manufacturers too. I hope at some point to see it become a reference standard for the exchange of codeplug data, because right now each codeplug is specific to its radio. There are programs which convert codeplugs, but it's not as easy as taking one file and flashing it into different devices. So, summing up, what's next? On the software side, the codeplug: it's next in the row, and we are working on it these days. Then, a bit more internal to the firmware, an event exchange system, so that you can register for events or send events, which should make development a bit easier. Also APRS support, because why not, and probably GPS tracking, which has been done before and needs to be integrated. I would like to have something for meteorological sondes, radiosondes, a demodulator for those. And then we will see. No DMR in the end, because we still have the problem that the audio codec is patented, and how do you include a patented binary blob in your work without risking being sued for patent infringement? I don't know. It's an open problem; we have to find a way to solve it, and this is the main reason why there is no DMR. The point is that DMR itself is an open, public specification, an ETSI standard, but the problem is the codec, in short. On the hardware side, which means hardware support: the OpenHT; the Baofeng DM-1701, which is another radio which can do M17. And quite a dream: a series of radios which is difficult because of the microcontroller they have, an automotive microcontroller from 2008, big-endian, no debugger, no free compiler. It's a mess. But, you know, why not. And more to come. So that's it.
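As an illustration of the compact binary codeplug idea discussed above, here is a hypothetical fixed-size channel record using Python's struct module. The field layout is invented for the example and is not the actual OpenRTX codeplug format; see the request for comments on GitHub for that.

```python
import struct

# Hypothetical fixed-size channel record, illustrating how a compact binary
# codeplug can stay portable across devices. NOT the actual OpenRTX layout;
# fields and sizes are invented for the example.
# '<' = little-endian, no padding: mode (u8), power (u8),
# rx/tx frequency in Hz (u32 each), 16-byte zero-padded name.
CHANNEL_FMT = "<BBII16s"
CHANNEL_SIZE = struct.calcsize(CHANNEL_FMT)  # 26 bytes per channel

def pack_channel(mode, power, rx_hz, tx_hz, name):
    raw_name = name.encode("ascii")[:16].ljust(16, b"\x00")
    return struct.pack(CHANNEL_FMT, mode, power, rx_hz, tx_hz, raw_name)

def unpack_channel(blob):
    mode, power, rx_hz, tx_hz, raw_name = struct.unpack(CHANNEL_FMT, blob)
    return mode, power, rx_hz, tx_hz, raw_name.rstrip(b"\x00").decode("ascii")
```

Because every record has the same size, the same blob can be written raw into non-volatile memory or stored in a file, and any device that knows the layout can index straight into channel N without parsing.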
Happy hacking. Okay, yeah. The question is about the radios which do not currently have the hardware connections: what if they had them? Well, they would do M17. Once you have the hardware connections, it becomes just a matter of writing the drivers, and you have M17 support. Thanks again. Yeah.
Expanding IQEngine into a Hub for Previewing RF Signal Processing Software
Awesome, thank you. So my name is Mark and I'm here to show off the IQEngine open source project; I'll talk about where it's headed in the future as well. Also here we have Roman, who's involved in IQEngine as well as SigMF. This talk is aimed primarily at two groups. One is folks who are new-ish to SDR and RF signal processing: students, hobbyists, anyone who wants to learn more about all this software that you're seeing. The second is folks who run or maintain an open source project that involves RF signal processing in some way. And hopefully, even if you're not in those groups, you'll still find something of interest here. So, IQEngine is currently a web app that is all about RF recordings. It lets you preview recordings, manage them, analyze them, do some light processing, and, most importantly, share them, all in your browser; it's entirely web based. I'll show a quick demo of what the current tool looks like. IQEngine is available at iqengine.org, where the project runs a public instance of the tool, but in this case I've got one running locally because I wasn't sure about the Wi-Fi. The main screen is essentially a list of these RF recordings, all stored in the SigMF format, if you're familiar with SigMF. We have some good ones from Jean-Michel and Aang23, a lot of folks who are here today. You can also open a recording that's local to your machine, in which case all the processing is done client side: I can open a local directory full of recordings and it will list them all and generate the thumbnails (it's actually the same directory that I had served from the server), or you can open just one local file pair. Anyway, back to the list. If you click on one of the recordings, you're brought to a spectrogram-style interface which loads only the samples that you're looking at at any given time, so you can have enormous files, and the minimap on the right represents the entire recording.
So you can jump to any part of it, and the little gray area is the part you're looking at. We have time, frequency and IQ plots, like you'd expect. That's FM. Some other features: there are time and frequency cursors if you want to measure things, adjustable dynamic range for the color scale, windowing, FFT size, and you can add FIR filter taps. All of that runs client side: the FFTs and the FIR filter are computed in the browser. The one part that's not client side is our plug-in system. If you select a portion of the recording that you want to send to the plug-in server, you can select it there and then, let me zoom in here, choose a plug-in. This was an FM recording, so I'm going to run an FM receiver that's implemented in GNU Radio. It sends the samples to the server that runs GNU Radio, and in this case it returns a WAV file with the audio of that signal. There are other types of outputs too: you could run a block or a plug-in that gives you IQ as the output, so if I run a low-pass filter, it's just going to output IQ. Let me give it a proper cutoff frequency. Currently we just display the resulting IQ in a pop-up, but in the future we're trying to figure out the best way to replace the signal that's already on the screen with the new one, so that you can chain plug-ins together. So that's the gist of the tool. Now back to the slides. IQEngine is built on top of SigMF in many ways. If you're not familiar with it, SigMF is an open standard for saving your RF recordings to a file, and it's as simple as it gets: you have a binary IQ file, which is the native way to store a recording, and then a JSON file. The SigMF specification mainly tells you how to write that JSON file, with things like how you specify the sample rate, center frequency and data type; I'll show you annotations in a second. And by using SigMF, you get software interoperability.
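A minimal SigMF metadata file, as described above, is just JSON alongside the binary IQ file. The sketch below uses core-namespace keys from the SigMF specification; the sample rate and frequency values are made up for the example.

```python
import json

# Minimal SigMF metadata (the ".sigmf-meta" JSON that sits next to the
# ".sigmf-data" binary IQ file). "cf32_le" means complex float32,
# little-endian. Values here are invented for illustration.
meta = {
    "global": {
        "core:datatype": "cf32_le",
        "core:sample_rate": 2_400_000,
        "core:version": "1.0.0",
    },
    "captures": [
        # Each capture segment records where it starts and its center frequency.
        {"core:sample_start": 0, "core:frequency": 100_000_000}
    ],
    "annotations": [],  # time/frequency bounding boxes go here
}

meta_json = json.dumps(meta, indent=2)
```

With the sample rate and center frequency stored next to the data, any SigMF-aware tool can open the recording years later without guessing, which is exactly the bit-rot point made next.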
You also avoid data bit rot, where in five years you forget what sample rate something was recorded at. If you want to learn more about SigMF, there's a link at the top of iqengine.org, and it also links out to the SigMF GitHub page. The SigMF standard is managed by GNU Radio; it's kind of a sub-project. Now, as far as the IQEngine code itself: it's web based, the front end uses React and Tailwind, and some big dependencies we get a lot of use out of are CodeMirror for all of the code editing, Pyodide, which lets us run Python in the browser (I didn't demo that, but there are some videos online about how it gets used), Plotly for the time, frequency and IQ plots, and WebAssembly for the FFTs. For our documentation we use the MDX system, which lets us write it in markdown and have it rendered as part of this page: this was written in markdown and rendered as React components, which is kind of nice. So that was the introduction, but I wanted to pick up where I left off at GNU Radio Conference last year. What have we done since then? Well, it's now possible to run a local instance of IQEngine, for example within an organization, to share things privately. You can run an instance and put the recordings on the same server, easy enough, or on anything mounted to the file system: as long as Python's open() can see it, it can serve the recordings. The other option is to use cloud storage, which is what we do for iqengine.org. As far as how to do that: the general idea is you pick a directory on your server and run IQEngine with the Docker images. If you go to the "install with Docker" page, all you really have to do is change the directory that's mounted into the container, pretty much this part of the command; the rest of the command pulls the latest IQEngine Docker image and runs it.
And you should be able to see your recordings. They'll look like this, because they'll be local to the back end, versus iqengine.org, which has a few different data sources that pop up here. That's fairly new, so if you end up using it and notice some quality-of-life issues, definitely reach out on Discord or GitHub. Next up, I'm going to dive into the plug-in system that you saw me use with the FM receiver. The idea is any RF signal processing that you want to run on a back-end server but trigger from the browser. What we have in the project is a REST-based API that lets someone write the plug-in server in any language they want. We have an example in Python, and Loic wrote one in Rust. The Python one can run GNU Radio flowgraphs: it pretty much runs the Python flowgraph and uses ZMQ to get samples in and out of it. In the future there will be more languages, and thanks to the REST API it doesn't matter: you can deploy and implement it however you want, as long as it supports the interface. I'm going to show a little demo later running SatDump, which is an example of a whole separate project, not a GNU Radio flowgraph or anything, but a piece of open source software that you can trigger from IQEngine. Aang will be presenting more about SatDump in an hour or so. As far as how the plug-ins look, for the Python-based ones we tried to make it as easy as possible to create a new one. This isn't the actual REST API, this is just how you would write a new Python plug-in and then reuse the existing server code that we already have. You can see that you have to specify your custom parameters, and then there's a run function where you're given the samples and you can return several different data types. For GNU Radio, you specify the flowgraph, and the only catch is that you have to substitute your file source and GUI sinks with the ZMQ source and ZMQ sink.
That's how we get samples in and out. Not the most performant thing, but it gets the job done. You can see these first couple of blocks are the ZMQ ones, and the rest represents the flow graph. So we have a Python flow graph that implements an FM receiver in this case, and that was the plugin I ran earlier. The motivation here is: if you're the author of an out-of-tree module for GNU Radio, you've probably already shared the code somewhere like GitHub and created some example flow graphs, but the next step would be making it more accessible and easier for folks to find and play with, and I think this could be an option there, by exposing it as a plugin. Now, let me go back to the plugins. I'll go ahead and run the SatDump one. I've got a recording of NOAA APT right here, contributed by Aang. I can click it and browse around the signal. I'll notice it's actually offset, but I believe this is the APT signal. You can jump to different parts of the file, and as far as running it through SatDump, I want to run the entire file, because it needs a decent amount of samples. So I'll select the whole file, and then under plugins we've got the fresh new SatDump plugin, already preloaded with the pipeline for APT, but you can put in whatever pipeline you want. So it just ran SatDump under the hood, and here's one of the images that comes out. I think IQEngine still has some work to do as far as presenting a bunch of different outputs to the user; there's a lot of web design that could go on there. Either it pops open something or it saves a file, and it supports all the different MIME types. If you're familiar with the web, it just uses MIME types, and we added some custom MIME types for IQ, like the different data types for SigMF. As far as other plugins, we have a detector as well.
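To make the plugin shape concrete, here's a hypothetical sketch of a Python plugin in the style described above: declared custom parameters plus a run() function that receives samples and returns results. The class shape and field names are my assumptions, not the actual IQEngine plugin API; the FM demodulation itself is a standard polar discriminator.

```python
import numpy as np

# Hypothetical sketch of an IQEngine-style Python plugin; the real base
# class and return-field names may differ. It shows the shape described
# in the talk: custom parameters exposed to the web UI, plus a run()
# function that gets IQ samples and returns typed outputs.
class FMReceiverPlugin:
    # custom parameters the web UI would expose (names are made up here)
    sample_rate: float = 250000.0
    audio_cutoff_hz: float = 15000.0

    def run(self, samples: np.ndarray) -> dict:
        # Polar FM discriminator: the phase difference between consecutive
        # complex samples is proportional to instantaneous frequency.
        phase = np.angle(samples[1:] * np.conj(samples[:-1]))
        audio = phase / np.pi  # normalize to [-1, 1]
        return {
            "data_output": [{"samples": audio, "data_type": "audio/wav"}],
            "annotations": [],
        }
```

In the GNU Radio variant of this, the run() body would instead push `samples` into the flow graph's ZMQ source and pull the result back out of the ZMQ sink.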
So let me go to a recording that I give my students when we study signal detection and classification. This is a toy example meant for testing a detector, where you have a few different signals. IQEngine isn't about implementing RFML, it's about sharing it and making it more accessible, so we made a very simple detector just to have an example. It's written in Python and you're welcome to check it out in the source; it's called Simple Detector. We also have Marco's detector; he was someone else working on this. Simple Detector was pretty quick for that number of samples, and it did a decent job; there's one extra little detected emission there. Now, the results are in the form of SigMF annotations, which are bounding boxes in time and frequency, and that's how results are shared from the plugin. If you want to download the raw metadata file, the SigMF file, you can go to the bottom here, and here are the annotations the plugin created. So we basically copied the SigMF format for the return data type. And if you wanted to perform classification, you would simply fill out the label and the labels would show up. Within IQEngine you can also edit the annotations and edit the labels. So if you wanted to manually tweak things, like you were making an RFML dataset, sort of a golden meta file, you could do that here. What I find most useful is simply having a quick glance at how well something worked. If you had tons of files to run through, you wouldn't want to do all this clicking; you would just make a script, and you can certainly run the plugins from a Python script. It just needs to call the REST API. Back to the slides. All right, I want to take a really quick tangent to remind people what GNU Radio provides and how it relates to the plan that the project has. So GNU Radio is a way to implement your RF DSP in C++ or Python.
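The annotations just described are easy to picture in code. The `core:*` field names below come from the SigMF annotations segment; each annotation is a bounding box in time (sample start plus count) and frequency (lower and upper edges), and the label field is what you would fill in for classification.

```python
# A detector's output as SigMF-style annotations: each one is a bounding
# box in time (sample index + length) and frequency (lower/upper edges).
# Field names follow the SigMF annotations segment.
def make_annotation(sample_start, sample_count, freq_lower, freq_upper, label=None):
    ann = {
        "core:sample_start": sample_start,
        "core:sample_count": sample_count,
        "core:freq_lower_edge": freq_lower,
        "core:freq_upper_edge": freq_upper,
    }
    if label is not None:
        ann["core:label"] = label  # filled in when you classify, not just detect
    return ann

# Illustrative output of a detector run: one unlabeled detection and one
# the classifier has labeled.
detections = [
    make_annotation(0, 4096, 915.0e6, 915.2e6),
    make_annotation(10000, 2048, 914.7e6, 914.9e6, label="LoRa"),
]
```

A script batch-processing many files would collect lists like `detections` from the REST API instead of clicking through the spectrogram page.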
It gives us a standard framework for doing that implementation, and it's easy to get annoyed at the boilerplate and at how to install everything. But in the end, if you use that framework, it means other people who are familiar with GNU Radio can install your out-of-tree module: they know the standard usage of your blocks, where to look for the example flow graphs, how to connect your application to the SDR sitting on their desk. And that's an enormous value; in my opinion it's one of the main values of GNU Radio. The GUIs are nice as well; it's not always easy to program GUIs. If you're curious about learning about different out-of-tree modules, CGRAN.org is where we point people. I mention this because CGRAN represents a centralized location for GNU Radio applications and libraries, what we call out-of-tree modules. But zooming out one more layer, going beyond just GNU Radio, is what I'm going to talk about here in a second. Let's say you're a developer of open-source software that involves RF processing in some way; say you wrote SatDump and you're doing satellite signal processing. You build something, you want to share it, you want to keep it easy to demonstrate, to show off to people, easy to use. Those are sort of the main steps you might take. Now, on the other side of things, you have users out there, whoever they are: individual students, organizations. First they need to discover that this software exists; that's the very first step. Then: how do you install it, how do you run it properly, how can I evaluate how well it's working and use it with my SDR or my recordings? So there's a duality here. On the developer side, you might post your code to GitHub, you might share it as part of a FOSDEM talk; that's kind of the current method we use. On the user side of things, you might Google the topic you're interested in, like a specific satellite, Wi-Fi, whatever.
You'll probably come across what's out there, but just Googling is not the best way to do it, right? So installation can be an enormous barrier. When I teach CS students, it depends who you are, but some students and some folks are better at getting this software installed than others. Obviously having a lot of Linux experience helps; folks who are new to Linux but want to dive into signal processing can struggle here and there. So it can definitely be a barrier. Now, how do you actually run it? If it's a GNU Radio flow graph, you probably know how, but not everything is easy to use. There are RF libraries out there where it's not clear exactly how to use them, but you know they're powerful. And lastly, evaluating the software, maybe because you're going to use it as a dependency or as part of a project. So the idea is to evolve IQEngine: instead of just being a way to share and evaluate RF recordings, it can also be used for RF open-source software in general. Sort of a central, community-driven hub for devs to share stuff and for users to find and discover software. And by exposing the software as a plugin, they can try it out on recordings that are already on the site, or on their own. One side benefit: a university, or anyone else who wants to show off their expertise and creates open-source software, can use this central hub as a way to do that. Now, this is all in the browser, primarily for accessibility's sake. It's not the most performant way to do something like this, but it's extremely convenient; it removes a lot of barriers. Users would be able to play around with a certain function using a variety of recordings. And it's more than just using recordings: in the future, maybe there's a way to lower the SNR, like add noise, and see if it still works, or add a frequency shift and see if the RF function still works.
And then on the author side of things, all you would really need to do is add this REST-based interface, or at least make the tool easy to call from the CLI and retrieve the results. SatDump, for example, isn't exposing a REST interface; I'm just running its CLI in a way that's easy. Anyway, one design decision that was made was to allow multiple plugin servers to connect to a single IQEngine instance, like IQEngine.org. That way a university could run their own plugin server, have total control over it, but still share their expertise, everything they want to show off. And this is really just a concept for now. So far I showed you how IQEngine lets you preview RF recordings and RF datasets. I think in the future, with the building blocks I showed through the plugin system and the REST interface we're designing, you could have a tool for previewing what I'm calling functions, apps, and software: really anything that involves RF signal processing. Now, there are limitations; a lot of RF apps can't simply be run on a recording. srsRAN is an excellent LTE and 5G radio stack, but because of LTE and 5G's strict latency requirements, you can't easily just play a recording back. It's not straightforward; you'd really want to simulate that closed-loop system. So not all RF functions and apps are going to be shareable this way, but I think a vast majority of them are, definitely GNU Radio apps and those kinds of processing applications. The other thing you wouldn't show off this way is an SDR interface, like a GUI; that wouldn't make any sense. Now, if you're interested in contributing, it's a community-led project, so we can always use more web devs. It turns out that the folks in these RF circles tend to know C++ and Python, but less so the web side, and I know I've had to learn a lot of web development to get this project moving.
So even if you're not a web developer, there are plenty of other ways to contribute. We're always looking for more interesting RF recordings to share. If you have an entire dataset, we can add a whole category here on the left; we have Daniel Estevez's awesome satellite recordings as an example, where we link off to his website. And if you want to get involved in any way, there's a Discord link at the top of IQEngine.org. We have a little community that's slowly building. And with that, I will take questions. Yep? So the question was related to geolocation data, like running it as a plugin, I assume. There actually is already a maps-based interface for that, sort of; anyway, when we designed the API I mentioned, we made sure to allow multiple channels of RF, and those channels could be time-synchronized recordings from different sensors. That way you could at least run it from the back-end perspective. Then we would need a maps interface on the spectrogram page to make that fully happen. But great suggestion. Yep? Well, GNU Radio has some Azure credit that they got, and that's what we've been using for a lot of these recordings, and we can use it for other folks' recordings if they want to share them publicly. You can reach out and we can transfer it over. No, no, I could upload it for you: GNU Radio has a blob storage account, so I could give you a SAS token for you to upload it yourself, or I could upload it for you. Yep, I think there was one more. Yes, there's something that's a work in progress, but I guess I'll share it: there's an upload page. IQEngine.org/upload should allow you to upload a recording. The Wi-Fi here is not great, but that would be the first place to go. I think we're out of time. Any last question? Yep?
So how well does it actually handle really big files? Well, it was designed to deal with terabyte files from the start, which is why we have that minimap, and when you open the spectrogram page it's only loading what you're looking at at any given time. It's sending the IQ samples to your client, to the browser, and the browser is doing the FFTs. So it's sending maybe a few million samples to get a spectrogram like this, and if it's a multi-terabyte recording, you'll just have a smaller gray window here, because it represents a smaller part of the whole recording. You still have to store the recording, but there is no part of the code that sends the entire recording to either the client or the back end, because we know that's not going to fly for huge stuff. All right. Yep? Yeah. Actually, SigMF has a lot of that; there's even an extension for more details about the hardware involved. Definitely check out the SigMF specs. If you want a five-minute introduction to SigMF, that's what we have here on IQEngine, but go to the specs and dive in, and you'll find a lot of the parameters you mentioned. All right, thank you very much.
DAPNET: Bringing pagers back to the 21st Century
Thank you very much. Hello, good afternoon. Hope you're all well and not cooking up too much in this room. My name is Manuel. I'm a radio amateur, a radio nerd if you prefer. I like experimenting with new or older equipment to see what we can do with it, or taking existing software or hardware and deploying it as widespread as possible, if possible within the amateur radio community, and keeping things open source whenever I can as well. So today I'm about to talk about cutting-edge technology straight from the 1990s: pagers. If you've seen those things, they might bring back some memories, because they were heavily used in the 80s and the 90s, mostly by doctors, drug dealers, or businessmen, sometimes all three at the same time. So basically they were everywhere in the 90s, and they started to disappear later on when GSM made its appearance. You can still see them in medical TV shows: the doctor gets paged because there's a code blue, whatever that means, in room 204. Now, I'd like to explore this thing, because behind this relic from before lie extremely simple communication systems, and I think it's worth exploring them a bit more and seeing what you can do with them today in the open-source community and the amateur radio community. So today we'll look at what paging is in itself, what it means, and how it works, generally speaking. We'll go a bit into the technical part: how it works, the modulation types, how you can make a pager ring. Then we'll bring that into the amateur radio context: we'll talk about the DAPNET project, which has been around for a few years now, what you can do with it, and how you can get started, and then I'll be open for questions if you have them. So, coming back to the techniques, let's talk about paging in simpler terms.
Paging is basically sending a message, making a small device ring one way or another, very often to small, low-power, compact receivers. Most of them use a standard called POCSAG, which was developed in the 1980s; much older standards exist but are almost never used anymore, so POCSAG is the one that remains. Another one was developed by Motorola and is proprietary, but we don't talk about that here. The topology is always the same: you've got one big, high-power transmitter, and then you've got your receivers around it that receive the messages whenever there is one. As for the frequencies, you have pagers starting in HF, on 27 MHz, and then all the way up. Here in Belgium, the national services use 160 MHz. In other countries you will see them on 460 MHz, and sometimes even higher: in the US they go all the way up to 900 MHz, if I'm not mistaken. So you see them on a lot of different frequencies. You'll also notice that, compared to a classic two-way radio, the antenna is built into the device, which is itself a challenge, because it means your signal needs to be higher in intensity to be received: those antennas perform a bit worse than a standard whip antenna. As for use cases in the commercial world, you'll find them, for instance, in a single hospital to call doctors, or in industrial-scale systems, or sometimes a bit bigger, national scale being one of them. Here in Belgium, we have one single frequency for a distributed system of transmitters operated by Astrid, which is used by firefighters, ambulance services and others, so it's still being used today. You will also see them in food trucks or takeaway food courts: if you've been to the Wolf two days ago, you'll have received a little pager that rang whenever your food was ready. So this is also a pager in itself. How does it work?
As I said, it uses one single frequency, a specific carrier that we modulate in FSK, simple frequency-shift keying: you send a one by shifting the carrier one way and a zero by shifting it the other way, and those ones and zeroes are formatted into very simple packets. (Very often... please mute your radio, I just saw it.) When you want to send a packet, you usually send a preamble that wakes the receiver up, because those receivers sleep for long periods of time and wake up from time to time to check whether there's a preamble for them. Once it wakes up, it starts decoding the signal, and then comes an address and the linked message. If the address doesn't match the pager's address, it will just shut down and go back into sleep mode, and that makes for very power-efficient receivers: this thing can last up to one month on a single AA battery. So that's pretty much the idea. Again, if you want your pager to receive a message, you program the address into it. For instance, to receive this message, which is aimed at the pager with address 101, you put the address 101 in the pager. If it receives it, it displays the message and rings; otherwise it just stays asleep, because a message for 102, for instance, is not aimed at this pager, so it won't ring. Now, you can also make group alerts that way, and it's quite simple: you put the same address, called the RIC, across all pagers, and if they receive it, they all ring together at the same time, displaying the message. That means that for individual and group calls, you basically assign one individual ID to a single pager and then put one common group ID across all pagers. So you can select whether you want to address one person or a specific group, and you can organize your system this way.
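The wake-and-match logic just described can be sketched in a few lines. This is only the addressing idea, not a real POCSAG decoder; the RIC values reuse the 101/102 examples from above, and the group RIC is illustrative.

```python
# Sketch of the POCSAG wake/match logic described above (addressing only,
# not a real decoder). A pager sleeps until it hears the preamble, decodes
# the address (RIC) of each message, and rings only when that RIC matches
# one of the RICs programmed into it.
def should_ring(decoded_ric: int, programmed_rics: set) -> bool:
    """programmed_rics holds the pager's individual RIC plus any group RICs."""
    return decoded_ric in programmed_rics

# Two pagers with individual addresses 101 and 102, sharing group RIC 1040.
pager_a = {101, 1040}
pager_b = {102, 1040}

# A message addressed to 101 rings only pager A; a message addressed to
# the group RIC 1040 rings both at the same time.
```

Everything else on top of this, individual calls, group alerts, rubrics, is just a choice of which RICs you program into which devices.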
So it makes for a very simple type of receiver, and when you're building the network you can decide yourself how you address each pager or each group of pagers. POCSAG on amateur radio is not new; it's been done since the 1980s. I think it appeared at the same time as paging itself, so we started fiddling with it a long time ago, using TNCs connected to old VHF systems. The thing is, you very often had to modify the pagers themselves, changing quartz crystals and retuning the receiver loops, to make sure they fell within the amateur radio frequency allocations. Very often those were individual stations used for bulletin board systems at the time, for weather alerts or that kind of message. It kind of disappeared when packet radio really folded after the 90s. Right now the main thing we still see on packet radio is APRS; BBSs you don't see anymore, and the technology got lost in the ages. But now we have easier ways to interconnect stations, using HAMNET for instance: you can now make IP links on amateur radio frequencies quite easily, with modified Wi-Fi equipment or otherwise. And there's a team of German radio amateurs from Aachen University that developed a network of internet-connected POCSAG transmitters using free and open-source software, and that is the DAPNET project. DAPNET stands for Decentralized Amateur Paging Network. The idea is to have various core servers that are geographically separated, interconnected via HAMNET, that exchange the messages through multiple nodes, so if one fails, the others take over. Of course, if you're outside of a HAMNET link, you can always bridge through the internet, and this is what I'm doing here, because I don't have a HAMNET link here; we still haven't brought the HAMNET links from Germany all the way up to Brussels. But you can go either way. The frequency is almost universal.
Depending on your regulations, we try to stay on the same frequency everywhere, which is 439.9875 MHz. That's a mouthful, but that's the one we try to use everywhere. The only exception I see right now is the Netherlands, because they don't have access to this frequency, so they're using one on 432 MHz if I'm not mistaken. But with this pager I can basically use it in Belgium, in Germany, in Switzerland; there are some transmitters in France as well, so it's growing little by little. Now, transmitters have to be synchronized one way or another, otherwise several transmitters would start keying up at the same time and interfere with one another. So they're split up into time slots: if you have two overlapping transmitters, you put one on one time slot and the other on another time slot, just to make sure they don't transmit at the same time. So what happens is you send a message on the DAPNET infrastructure, and as I said, pagers only understand plain numbers, so there's no call sign you can encode in there. Instead, there's a database on the DAPNET infrastructure that links your call sign to an identifier; very often we put the DMR ID, because this is an established way to identify hams with numbers. It matches your call sign to a specific RIC, a specific address, and sends the message to the transmitters that are linked to the area you selected, so you can key up all transmitters or regionalize your calls. For example, if you know the person you're trying to reach is in Belgium, you put Oscar November dash all. If you want to reach an area in a specific province, you can narrow it down, to avoid using the network too extensively and reduce the load when you know where your person is. The same goes for Germany, the Netherlands, Luxembourg, France: there's the same kind of geographical way of splitting up the transmitters.
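There's a REST API behind this infrastructure, and a script addressing it might build its request roughly like the sketch below. The endpoint idea, field names, and call signs here are assumptions from memory, not the documented DAPNET API, so verify against the current DAPNET documentation before using them; the 80-character truncation reflects the short POCSAG message length mentioned in this talk.

```python
# Hedged sketch of building a DAPNET page request. The field names and the
# idea of POSTing them as JSON to the DAPNET core (e.g. via hampager.de,
# with your account credentials) are assumptions from memory; verify
# against the current DAPNET API docs. Call signs below are made up.
def build_page(text: str, callsigns: list, transmitter_groups: list) -> dict:
    return {
        "text": text[:80],                            # pages are short (~80 chars)
        "callSignNames": callsigns,                   # resolved to RICs server-side
        "transmitterGroupNames": transmitter_groups,  # regional group, e.g. Belgium
        "emergency": False,
    }

# Page one ham, keying up only the Belgian transmitters ("on-all",
# i.e. Oscar November dash all):
payload = build_page("CQ from FOSDEM", ["on4abc"], ["on-all"])
```

The key point the sketch captures is that the sender works with call signs and geographic transmitter groups, and the core handles the mapping down to RICs and individual transmitters.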
You can also make group calls: we call them rubrics, and there are rubrics for weather alerts, DX clusters, etc. I'll come back to this in a moment. So what can we do with all that? Well, pretty much whatever you want. You can send messages manually to a specific pager via the hampager.de website. There's an Android app, and I think there's even an iOS app, but I don't know its status. You can also send via the DMR infrastructure from Brandmeister, from APRS, from TETRA: basically, sending a text message from your radio will make it land on the DAPNET infrastructure, which then relays it to the person you want to call. Then there's an API you can use to send weather alerts. There are automated messages for urgent alerts, which will make all the pagers ring, for example. DX clusters, as I said, or status on space weather, solar flux conditions, etc.; that's also sent every four hours on the platform. You could also build something for repeater telemetry, or any IoT device you want, but keep in mind this is a network by amateurs, for amateurs, non-commercial, and it's maintained by volunteers who do it in their free time with the servers they have access to. So don't start bombarding the network with telemetry that sends the status of your fridge every second, because that would be kind of a problem. Stay reasonable; but that's the kind of thing you can do, as long as it's non-commercial. Now, how can you get started? As long as you're a radio amateur with a call sign, you can register: there's a website to submit tickets, and we'll create your account. Once you do that, you have access to the platform and you can send messages. If you want to receive them, you'll have to buy, modify, or build your own pager for the 439 MHz frequency. That's one thing, but then you need a transmitter somewhere. If you're lucky enough to have one within your living area, you're good to go; enjoy.
Otherwise, you can go your own way and install a hotspot at home, or you can make it a nice project for your local radio club and build a wide-range transmitter for everyone to enjoy. So there are two ways you can go. Speaking of specifics, acquiring a usable pager is relatively easy today. As I said before, you used to have to buy second-hand pagers, replace a quartz crystal, and retune the receiver chain, but today we have more frequency-agile receivers, with PLLs instead of crystals, that can be directly retuned, or directly bought to work on those frequencies. One of them is the AlphaPoc 602R, which I have here on loan. This one costs about 90 euros, when we checked on the AlphaPoc website, directly from Germany. You can buy it on AliExpress, but your mileage may vary. So that's a way to get into it quickly. You could go higher-range and buy the commercial ones, which are a bit more expensive but work as well, or you can go the DIY, free and open-source route and build your own using open-source software, like a project I've been working on: the ESP32 pager, which Bastia also improved a bit on the UI side, because I suck at UI. Basically, using an ESP32 LoRa dev board, you can make a POCSAG pager and have a receiver for quite cheap; I think those dev boards are about 15 euros on AliExpress as well right now. It's built on RadioLib, so it's also freely modifiable. Have a look if you're interested.
As for transmitters, you have two options right now. For hotspots: if you already have an MMDVM hotspot, you're all set; you just need to register it on DAPNET and activate the transmitter. If you want to build a wide-range transmitter, things can be extremely simple, because you just need a small single-board computer such as a Raspberry Pi and an FM transceiver: you feed the signal directly into the unfiltered audio path of your transmitter, and you're good to go. It basically requires four components: the transmitter, the Pi, a transistor, and one capacitor, so you can get on the air quite quickly. Other transmitters are being worked on: again, Bastia is working on an ESP32 transmitter to make a small hotspot even cheaper, if possible. So again, quite easily reachable. So where does that leave us? For me it's quite an elegant solution to receive text messages on our own independent networks, having fun along the way, learning how basic systems work, implementing them, and deploying networks that everyone can enjoy. And it has its uses, for telemetry or other things: you can do weather reports, emergency messages, text your friends via pager, send silly jokes, with the challenge of having them fit within 80 characters; there are ways to make snappy jokes, and intelligent ones at that. I think that thanks to the DAPNET network, and the arrival of audio cards that can act as TNCs instead of using an external module, the whole thing has become much more accessible. So, if I'm able to SSH into my hotspots, I can give you a quick demo of how that works. Give me a quick second. Who's got a pager here? One? Nice.
Very nice, that's already one; depending on what you registered in it, I don't know if I'll be able to make it ring. So basically, here you have my personal pager, this is one from a friend I just borrowed, and this one, which just died on me, which is not a problem in itself, it'll just make this presentation shorter... oh no, it's alive. There you go. Those all have their own individual addresses: this one is 2069009, this one 206500, sorry, this one I don't remember, and this one is address 100. So I can make this one ring specifically: I just key the transmitter up and say I want to make pager number 100 ring. Please work, don't make me look silly... there you go. Right now only this one is ringing, so I just made an individual call to this one. Now let's imagine I want to send a group alert, for, I don't know, some storm weather coming up, or a rare DX spot happening right now on 18 meters... 18 MHz, sorry. Then I can make everything ring at the same time: 1040, and then everything rings, and it's just a nightmare, and I need to acknowledge it, otherwise it will ring again. So there you go: quite simply, using basic addressing and basic open-source software; this is just the hotspot, just an MMDVM running in the background, and I can directly key the transmitter up. If you have access to the DAPNET system right now, and I think at least two or three of you have access, you can make an individual call to myself... there you go, he just sent me a message on my pager. And what did he just say? "How does a SQL expert get a date?" Okay, nice, very nice. So there you have it. Oh yeah, there's another open-source project that's just coming up: where is Alexander? Hello, didn't see you yet. If I'm not mistaken, you worked on a POCSAG decoder for SDR++, which is getting finished up as we speak. I think it's important to mention that as well; sorry I didn't get the time to fit it into the slides. But again, if you
have any questions, I'm all yours. Thank you for your attention. All right, I hear a question; do we have a microphone, or I'll repeat the question. "So I live quite close to an old-school pager site. They transmit very high power on VHF, and it causes interference with a lot of other stuff. So, in practice, how much power do the transmitters in this network need to be useful? And what happens when the pager misses a message: is there a retransmit, or do you just get one shot?" So I'll repeat the questions. First: there's an interfering pager transmitter next to you, because it's using high power, so how much power are we using? And the second question, sorry, short-term memory, is what happens when the pager misses a message. For the first one: commercial systems very often use 200 or 300 watts per transmitter, because it needs to reach inside parking garages, and the pager antennas are lousy at best, so you need high power to get through. For amateur radio systems it's less of a... (God, now everyone is trolling me.) For amateur radio systems, we usually don't have that imperative of reaching everyone through parking lots, so very often the transmitters are 25 to 50 watts. Going higher would cause problems like the ones you're describing, so usually we keep it low and just add more transmitters. Here in Belgium that's a problem, because every time you add a transmitter you need to pay for an extra license, so we're still a bit limited legally speaking, but it's not a problem in Germany or other countries where they don't pay repeater licenses, or they're much cheaper. Speaking of missed messages, there are two mitigation measures... well, actually just one, which is repeating the message. If it's lost, it's lost: if you don't get it, that's it, because there is no way to send an
ack: either you receive it or you don't, and that links back to the first problem; that's why the commercial systems use high power. There is no store-and-forward in paging, so yeah, that's a small limitation. Other questions? Yes? You don't need a call sign to receive signals on amateur radio bands, so you could perfectly well use an SDR, or buy a pager and listen to some public messages; but to receive messages addressed individually to you, or to transmit, or at least to access the platform, you would need an amateur radio call sign. But amateur radio is much more than paging, and I think it's worth looking into if you don't have a license yet. I'm not going to launch into my big talk about that, because I've done it about 25 times today, but there's a lot to discover in that hobby, and it might be worth looking into if you have the time. Other questions? Yes? Yes it does: you can change the ringtone, make it go beep or bloop, whatever; you can even compose your own ringtones on some of them. The ESP32 pager actually has a provision for that: there are different tones and you can compose the music you want, so if you wanted to make it play Tetris, go ahead. There was one question here, and then you... no question here, okay. What's the frequency range of the receiver? The receiver itself can be tuned pretty much anywhere on the 70 cm UHF band, so roughly 430 to 440 MHz, but the catch is it's using a loop antenna, which has a very high Q, so you need to retune it. Yeah, it's 70 centimeters. Yep, there is one. If you have internet, if you're connected to the network here, and you go to hampager.de, you should at least be able to get the address book. So, my time is up. Thank you.
srsRAN Project Update
Yeah, thanks guys. So I'm going to talk about the srsRAN Project, our deployable, Open RAN native, open source RAN solution. I'll talk about both 4G and 5G, because I know that many of you still use 4G, although these days we usually focus on 5G. I'll start with our repositories and their naming, as this causes a bit of confusion, and then the talk splits roughly 30% 4G and 70–80% 5G, primarily covering our newest baby, the srsRAN Project, which is an Open RAN native CU/DU implementation written from scratch — and then obviously also demo that here. So if you go to GitHub under srsran, these days you see two repositories: one is called srsRAN 4G, and one is srsRAN Project. I'm going to explain why we have those two projects here in a second. But let me ask a question: who in the audience is doing, or most interested in, 4G? Great. And who's interested in 5G? Okay, that's actually more. Nice. So, a little bit of background and history. We started ten years ago with libLTE, which back then was a pure C implementation of the LTE physical layer — obviously all 4G. The first real application we had was the UE, a 4G UE, back then still in a separate repository. We figured it would be better to join the two when, by 2017, we created the eNodeB. So srsLTE basically became the PHY layer, the UE and the eNodeB, all in the same repository, and later on an EPC was added to that as well. And to explain the gap between 2018 and 2021: we added a lot of new functions like carrier aggregation, eMBMS and MIMO — there are actually FOSDEM talks that go into detail on what we did there — but primarily what we did in this time was harden the eNodeB and UE applications to make them deployable, to run them in real networks, and that's what we have today.
So they're deployed in the field, running in hundreds of base stations on a daily basis. Then in 2021 we got very excited about 5G, of course, and we implemented both NSA and SA based on the existing srsLTE 4G software architecture: for NSA a UE and an eNodeB — or gNodeB — and the same for 5G SA. But that was all based on the 4G code base, and we were kind of in trouble because we still called it srsLTE. So we figured, okay, what can we do? If we call it srsLTE-NR, that's going to be a problem at some stage with 6G, so we said, okay, let's call it srsRAN. That made total sense back then, right? So in the same repo we had a 4G UE and eNodeB, as well as 5G SA and 5G NSA UE and gNodeB. But what happened then was that O-RAN came into play, and everybody was getting excited about O-RAN as a buzzword. What does it mean? It's an initiative by operators to open up the interfaces between the radio components. The idea is to avoid the de facto vendor lock-in that we have today: when you build a network, you're buying all the components from a single vendor. The idea here is really to use off-the-shelf hardware for the CU and the DU — which together are basically your gNodeB, your base station — put Linux software on it, and use open interfaces between those components, as well as between the gNodeB hardware and the radio, the RU side, on the left-hand side. This is in fact something that 3GPP itself brought into the game, because they are the ones who define the CU and the DU; but the O-RAN Alliance took the fronthaul interface between the DU and the RU and basically defined the Open Fronthaul interface — an open fronthaul interface to talk to an RU. And that's basically what it all is.
And having implemented all those 5G applications in the old codebase, and knowing the limitations of that codebase, we sat down and said: okay, wouldn't it be cool to start from scratch, get rid of all the legacy, and rewrite everything? And that's what we did. We sat down and rewrote everything, and the srsRAN Project, as we call it today, is a completely new software architecture — people really laid it out from the beginning with all the interfaces that the O-RAN Alliance specifies, and with all the thinking that went into openness, interoperability and performance, all towards a really deployable, open source RAN platform. That's what the srsRAN Project is today. That was also when we realized: okay, this is now a new project and a new platform, and we need to give it a proper name and distinguish it from the old codebase, which is still totally valid and totally functioning. So we renamed the old srsRAN — which was not that old by then, only a year or two — to srsRAN 4G. That is what it is today and what you find in the GitHub repository: still getting new releases and updates, but the new stuff is the srsRAN Project, this new architecture, this new 5G SA codebase. And as we have seen, there's quite a bunch of people who are still interested in 4G. It's often a little bit misunderstood that this is an old project that isn't maintained, but that's not the case: as I said, it's deployed, and it's a maintained 4G codebase for the eNodeB and the UE. It also still contains a UE implementation — the proof-of-concept UE that we did back then, with admittedly limited 5G support and all the legacy and limitations that we had at the time.
But it's still used by quite a few people in the research community, and it's good enough to attach to a gNodeB and work with. The 5G gNodeB code in that repository is not recommended to be worked on — for everyone who is using SA or interested in SA, please use the new repo; we're not fixing any bugs in the old one. The last release was actually just at the end of the year: for those users who want to use srsUE in 5G SA mode, things have been fixed to support more bandwidths — minor things, no real DSP changes or anything bigger. So you can use the UE in the old repository, this testing UE, to connect to the gNodeB in the new repo, and within its limitations it works perfectly fine. And now let's come to the srsRAN Project. This is the architecture in a nutshell, and everything here in green and blue is within our scope. If you're a little bit familiar with the nomenclature, it's the DU and the CU. The CU is the central unit, doing most of the control plane stuff, here in the upper left corner; it's further split into a CP and a UP component. Then you have the DU, which is the physical layer and layer 2; those two components again have a split in them. So there are many splits that give you options — possibilities to cut them and implement one thing in hardware, one thing or everything in software, however you want and whatever your application requires. Then you have the so-called fronthaul interface, which you can see down there, where we support fronthaul split 7.2 — this new Open Fronthaul protocol, which you can use to talk to commercial radio units — and also fronthaul split 8, which is IQ baseband, so USRPs. That is the default, so to speak. And then four or five points about this. This is a complete solution — layer 1, layer 2, layer 3; you get everything there.
It's not just a subset of a RAN solution. We don't implement anything like a RIC or SMO, and we don't implement a core, but we expose all the standard interfaces so we can talk to third-party components that implement those. It's very portable: it runs on ARM and on x86, on Intel and on AMD. All performance-critical things — already coming to the third point — are written with SIMD instructions: NEON for ARM, AVX2 and AVX512 for x86. It's also very scalable: you can run a full 5G SA stack with the physical layer on a Raspberry Pi, attach a B200 and a phone to it, and it will work; but the same thing also runs on a 128-core Ampere or EPYC server, obviously then doing MIMO and all the bells and whistles and higher bandwidth and throughput. And it's very flexible: every interface that you have there you can cut and then talk to a third-party component, or mix and match — maybe put some of the physical layer implementation on the ARM cores of an embedded system. Everything is interoperable: we have integrations with radio units, with core networks, with RICs — all the components that you need to build a full RAN and which are out of our scope; we do integrations, talk to others and try to work with them. And it's all open, so please feel free to look at the code — it's all very transparent, which we believe is very important, also for telco projects. I don't want to dwell too much on the mainline features here, but the main takeaway is that everything a normal user, or even an operator, would actually need is there: all the bandwidths, all the modulation schemes and so on. Performance-wise we are looking at carrier-grade numbers: many UEs, 24/7 operation, highest bandwidths — I mean, this is what the spec defines: 1.5 Gbit/s in the downlink, 200 Mbit/s in the uplink.
That's with a four-layer downlink at 100 MHz and a one-layer uplink. And it can all be accelerated: we support FEC hardware acceleration with Intel ACC100 cards or other DPDK-bound BBDEV devices, but we don't need it — it all runs efficiently on ARM and Intel. Then there are some features coming up in the next release. We do bi-yearly releases — a pattern we have been following for many years — with constant pushes to main in between, but without releasing that code. The next release includes, for instance, mobility, so handover between cells. Then — I know there's interest in NTN — there will be initial support for Release 17 NTN, to talk to geostationary satellites; multi-cell support; and the split between the components that I showed, which is also an important point. And then, I know that all this telco stuff is very overwhelming, especially when you start with it — I completely understand. That's something we put a heavy emphasis on: to really improve the user experience, to invite people to engage with us, and to lower the entry barrier for telco in general, and also for O-RAN, which has again its own complexities. So we've put a lot of effort into documentation: there are application notes, developer guidelines, code styles for contributions. Lots of testing is going on — there's a MATLAB-based repo where we do all the physical layer conformance testing against MATLAB; yes, you need the 5G Toolbox for that, but it's still very useful for researchers, who tend to have access through the university or wherever they work. Everything is hosted on GitHub, and we ask you to engage there — in the discussions forum; we used to have a mailing list, but for the new repo we're not using that anymore, so GitHub is better. And then of course the code itself. And there's an overview at docs.srsran.com.
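The peak rates quoted above — roughly 1.5 Gbit/s down with four layers at 100 MHz — can be sanity-checked with a back-of-the-envelope throughput estimate in the style of 3GPP TS 38.306. The overhead fraction and TDD downlink duty cycle below are my own rough assumptions, not values from the talk.

```python
def nr_peak_tput_gbps(layers: int, n_prb: int, qm: int = 8,
                      code_rate: float = 0.9258, scs_khz: int = 30,
                      overhead: float = 0.14, duty: float = 0.74) -> float:
    """Rough 5G NR peak throughput estimate (TS 38.306-flavoured).

    qm=8 is 256QAM; overhead and the TDD duty cycle are assumptions."""
    slots_per_sec = (scs_khz // 15) * 1000          # 2000 slots/s at 30 kHz SCS
    res_per_sec = n_prb * 12 * 14 * slots_per_sec   # resource elements per second
    bits = res_per_sec * layers * qm * code_rate * (1 - overhead) * duty
    return bits / 1e9

# 4-layer downlink, 100 MHz at 30 kHz SCS -> 273 PRBs
print(f"DL ~{nr_peak_tput_gbps(4, 273):.2f} Gb/s")
# 1-layer uplink, assuming a small uplink share of the TDD pattern
print(f"UL ~{nr_peak_tput_gbps(1, 273, duty=0.2):.2f} Gb/s")
```

With these assumptions the downlink lands in the 1.5–2 Gb/s ballpark the talk mentions; the exact figure depends on the TDD pattern and control-channel overhead of the deployed configuration.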
You find everything there, and if something is not there, reach out — we're also collecting ideas for new application notes and things like that. And then the demo. How am I doing on time? Not too bad. I would have loved to do a real O-RAN demo and bring all the components. Unfortunately, the reality is that it's very complex: you need big servers, usually extra hardware like switches, and timing is very important — you need PTP, so Precision Time Protocol grandmasters, GPS clocks and so on. And the RU that you see there is a big brick that weighs five kilos — nothing you can just put in a trolley and bring over, and definitely not the servers. What we did was miniaturize it a little bit, and this is how small we got it: the whole setup fits in a big suitcase, one of those Peli cases, but it's still like a desktop PC. And that radio unit there on the left-hand side still weighs five kilos — plus, you know, a power supply, then the PTP grandmaster, GPS, which is difficult to get here. It's still complicated — nothing for a weekend in a backpack. But obviously we do a demo there. What I will show here is the exact same software. Remember, we support both O-RAN split 7.2, to talk to those radio units, and also USRPs — that's why I have my B200 mini here. And I have a Pluto running Maia SDR, and I have a COTS phone here — a Motorola phone. And I'll use something that we also created to make it easier for people to get started: all the components that I'm showing here are running off a Docker Compose script that is also in the repo, in the top-level docker folder. There is a Compose file for the gNodeB that includes Open5GS as a 5G core, an InfluxDB, and a Grafana dashboard.
This is something we use to visualize the performance of the radio. And I'm trying to get it all running here — let me just connect this guy. Yeah, one thing was missing. What I will run here is just Docker Compose, in the main gNodeB docker folder, specifying a configuration that I adapted a little bit to find a frequency that's actually empty. Because if we look at Maia here and tune around a little bit, it was actually quite crowded — all the radio folks are occupying the spectrum. This looks okay, so I picked an ARFCN here that is empty. So now it's running — that was just one command, just a docker compose up. It's starting the core, starting Influx, starting Grafana, starting the gNodeB. If we go back to Maia, we see the gNodeB broadcasting: this is the SSB and the SIB, all the information the gNodeB broadcasts to identify itself, without a UE attached. And what you see here in the wideband emission are CSI-RS — reference signals for the UE to adjust to, do channel measurements, and report the quality back. And what I do now — do you see this? This is the Motorola; I will make this a little bit smaller. If I take this phone out of airplane mode, what happens is it will scan for a cell and do a RACH — send an initial signal — and then the whole attach procedure follows: the communication with the cell and with the core. It goes very quickly, but that's something we can see here. And you see — now it's actually attached. All the transmissions that we see here now on the outer bands, this is the UE doing uplink control signaling. And now the level has adjusted a little bit.
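Picking an empty ARFCN, as in the demo, means picking a channel number on the NR global frequency raster defined in 3GPP TS 38.104. The mapping from NR-ARFCN to carrier frequency is a simple piecewise-linear formula; the example channel number below is a made-up band n78 value, not the one used in the talk.

```python
def nr_arfcn_to_mhz(n: int) -> float:
    """NR-ARFCN -> carrier frequency in MHz (TS 38.104 global raster)."""
    if n < 600_000:                        # 0 .. 3000 MHz, 5 kHz step
        return n * 0.005
    if n < 2_016_667:                      # 3 .. 24.25 GHz, 15 kHz step
        return 3000.0 + (n - 600_000) * 0.015
    # above 24.25 GHz (FR2), 60 kHz step
    return 24_250.08 + (n - 2_016_667) * 0.06

# Hypothetical band n78 channel:
print(f"{nr_arfcn_to_mhz(632628):.2f} MHz")
```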
So the downlink transmission, I guess, is still the blue level there, and the higher-power ones — that's actually the UE. And just to see it: do you see the network name here? That's FOSDEM24. What we can do now is have a look at the dashboard — that's running in the background as well. It's obviously a console application and we have console traces, but we created this nice Grafana dashboard: all the data is pushed into Influx, and the stats are displayed there. What we can do now is use an application, Network Signal Guru, to also look at this. Actually, the Wi-Fi is very bad here — I would have backhauled over Wi-Fi, but it's not great — so what I do is just run an iperf. So — can we still see that? Now you see this one user — obviously this is only a very narrow 20 MHz cell, SISO, so we're not getting 1.4 gigabit, but we get about 32 megabits here. Maybe we can do a little bit more if I run it longer — maybe 50; I'm not sure if the channel is good enough for that. But yeah, it's going up. And the phone is also showing that — this is an application we use to get information from the baseband. And what I do now is disconnect this here, maximize — cable-free mode. So now I can actually walk around with the phone while it's still running the load. Have a look at the MCS level — it's very good here. If I walk around, it should adapt the link, and the rate should go down a little bit — this is dynamic MCS adaptation, depending on the measurements that the phone does. We could go out of the room as well, and when I come back the MCS goes up again, doing full rate again. And I think with that I'm — I think we can still take a question. Sure. Yeah. Sorry. Yes. Yeah.
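The MCS adaptation seen in the walk-around demo — the rate dropping as the channel worsens and recovering afterwards — is classic link adaptation. Below is a toy outer-loop link adaptation sketch showing the textbook mechanism (nudging an MCS offset so the long-run error rate converges to a target BLER); this is an illustration, not srsRAN's actual scheduler, and all names and step sizes are made up.

```python
class OuterLoopLA:
    """Toy outer-loop link adaptation: adjust an MCS offset from
    HARQ ACK/NACK feedback so the error rate converges to target_bler."""

    def __init__(self, target_bler: float = 0.1, step: float = 0.5,
                 max_mcs: int = 27):
        self.offset = 0.0
        self.target_bler = target_bler
        self.step = step
        self.max_mcs = max_mcs

    def report(self, ack: bool) -> None:
        # ACKs push the offset up slowly, NACKs push it down fast;
        # at equilibrium the NACK rate equals target_bler.
        if ack:
            self.offset += self.step * self.target_bler
        else:
            self.offset -= self.step * (1 - self.target_bler)

    def mcs(self, cqi_based_mcs: int) -> int:
        """Final MCS: the CQI-derived value corrected by the outer loop."""
        return max(0, min(self.max_mcs, round(cqi_based_mcs + self.offset)))

la = OuterLoopLA()
for _ in range(20):           # a run of successful transmissions
    la.report(ack=True)
print(la.mcs(16))             # drifts above the CQI-suggested MCS
```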
I mean, we have native support for UHD in this repo, but yes, you can use the SoapySDR UHD wrapper to run the BladeRF over UHD — or the UHD Soapy wrapper, I always mix them up. But yeah, you can do that, also with the LimeSDR or any other radio that SoapySDR supports. It needs to support full duplex, obviously — after all it's still LTE or NR; bandwidth-wise 10 MHz is enough, but it needs to be full duplex. Even — I mean, TDD in theory is not full duplex, but the way we and UHD handle it, it needs to be full duplex. But no other requirements there. Yeah. Yes. Yeah. Yes. And in fact we are looking at this. The RU that I used, that I showed here — this one — is a so-called O-RAN 7.2a RU, so the precoding is done in the DU; the RU is not doing the precoding. If you had a massive MIMO RU, what you'd want to do is send all the layers to it together with the precoding coefficients and let the RU do the precoding. That's something you can do with 7.2b: if you have an RU that supports that and speaks O-RAN Open Fronthaul, you can do that. Okay. Thank you very much. Thank you.
Crash-consistent group snapshots in CephFS for k8s CSI and you!
Hi everyone. My name is Leonid. This is my colleague Patrick, and today we're going to talk about snapshot consistency with you. Before we dive into snapshot consistency, let's discuss consistency on its own. Here's some data storage with a bunch of data written on it. Is this data consistent? We don't know — and the reason is that consistency is not an intrinsic property of data. We have to consider a system that comprises an application with its logic, a storage system, and the data that is written; only then, including the logic, can we reason about the data and define whether it's consistent or not. So: the application is running fine, data is written, everything is consistent. What happens if the application dies? We don't know whether the system is in a consistent state or not. Now, it is actually possible to write an application and a storage provider in such a way that, by making some smart decisions at runtime, the application can reach a consistent state after restarting from a crash. This is called crash consistency. How is crash consistency related to snapshots? The truth is that the snapshot we're talking about — a crash-consistent background snapshot — is equivalent to a crash: the application cannot tell the difference between restarting after a crash and restarting after recovering from a snapshot. So let's look at the system. We have an application and storage, and it was a poor choice of storage: we cannot reach consistency in this system, even if the application is a high-quality, well-designed app. Same thing the other way around: if you are using an industry-leading storage provider but the application just doesn't care, or is poorly written, you're not getting consistency. And if you have a well-written application and an industry-leading storage provider, it is still a question whether consistency — or rather crash consistency — is reachable.
And it is only reachable if we consider a contract that the application and the storage adhere to; when they both do things right, together they can reach crash consistency. For the scope of this talk we would like to refer to this kind of application and storage as "enterprise". There are many ways to unpack this term, so bear with us: for the scope of this talk, an enterprise app and enterprise storage, from our perspective, are those that adhere to a contract. Now, what is this contract? Or rather, what is interesting in our case: what do we need to do in CephFS, as a storage provider, so that we automatically combine with an enterprise app already written with this contract in mind, and together provide a crash-consistent system? And remember, we want a crash-consistent system because that is what enables consistent snapshots. To understand that, we need to understand write ordering. Writes A and B here are ordered if and only if write B begins after the app has received and processed an acknowledgement from the data storage that write A has been successfully completed. It is important to note that the acknowledgement has to come from the storage, not from the OS: usually your applications interact with the operating system, and it is the operating system that gives you the first acknowledgement after a write. These applications are aware of that; they know they need to use things like flush or direct IO to ensure the acknowledgement originates at the storage level when performing ordered writes A and B. Now that we understand what ordered writes are, let's inspect what the storage needs to do. We have two ordered writes: write B hasn't begun before A has been acknowledged. And to understand what the storage should or shouldn't do, let's look at the different kinds of background snapshots the storage might have taken. It could be that we've taken a snapshot before A — that's a consistent snapshot.
It's a snapshot that has no knowledge of either A or B. It could be that the snapshot already captured A — we know this is possible because there is a window of time when A has already completed and B hasn't yet started, while the application was waiting for the acknowledgement. This is a consistent snapshot too. And finally, there could be a case where the snapshot contains both A and B. This is also a consistent snapshot. What the storage — the enterprise storage provider — must absolutely promise to the app is that snapshot 4 is not possible: there cannot be a case where a snapshot contains operation B but has somehow lost operation A. That's basically the contract: preserve the order of writes. So I'm going to ask Patrick to discuss how this relates to Ceph. — Within the context of CephFS, we're first going to look at how snapshots work. On the left we have MDS0, managing two trees of interest in the file system, SV1 and SV2, and two clients, client 1 and client 2. How do we take a snapshot in CephFS? An operation called mksnap is sent to the MDS, and that snapshots a particular tree within the file system — in CephFS you're allowed to snapshot a particular directory and everything under it, not just the entire file system. When a snapshot is taken, the MDS sends a notification to all the clients that a snapshot has been taken for each inode the client is interacting with; once that's all done, the snapshot is complete. If you want to take a snapshot of another volume, you have to do another operation — there's no compound snapshot operation. So we send a second snapshot for the other volume and again notify the clients for any inode they may be interacting with. When clients interact with RADOS, the underlying distributed object store of CephFS, they create snapshots implicitly when they write to the objects that hold the file data.
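The write-ordering contract described above has a direct POSIX translation: the app must wait for a storage-level acknowledgement of write A (e.g. via fsync, O_SYNC, or direct IO) before issuing write B. A minimal sketch, with hypothetical file paths:

```python
import os

def ordered_writes(path_a: str, path_b: str, a: bytes, b: bytes) -> None:
    """Write A, wait for the storage-level acknowledgement, then write B.

    With this ordering, a crash-consistent snapshot may contain neither
    write, A alone, or both A and B -- but never B without A."""
    fd = os.open(path_a, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, a)
        os.fsync(fd)   # the ack must come from storage, not the page cache
    finally:
        os.close(fd)
    # Only now may write B begin: A has been durably acknowledged.
    fd = os.open(path_b, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, b)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Skipping the fsync would leave the OS free to acknowledge A from the page cache, reintroducing exactly the B-without-A hazard the contract forbids.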
And they do that by including a snapshot vector of the snapshot IDs that have been taken on the files — those are what is transmitted to the clients in the snap updates. And here lies the rub: with CephFS snapshots we have eventual consistency, because when a snapshot takes effect on the file data depends on when the client gets the update from the MDS. So they're eventually consistent, not synchronous. To really highlight this, we'll look at a case study. Here we have two clients and an MDS. Operation B on client 2 is dependent on the completion of operation A on client 1 — say this is a database application, a distributed database. The MDS starts a snapshot and sends the notifications to the clients, expecting acks. Client 1 initiates operation A after it has been notified of the snapshot, so operation A is not part of the snapshot. Meanwhile, client 2 has not gotten the notification from the MDS yet, or has not processed it yet, but it has already started operation B — just a simple write to a file. Well, operation B is in the snapshot, because the client processes the notification afterwards. This is a problem and creates inconsistency: op B is in the snapshot but op A is not. Looking at it another way, you may have a utility that's trying to create a snapshot of the file system. It tells the MDS to make the snapshot, which it does, and then the utility induces operation A on the client, expecting operation A not to be part of the snapshot — because as far as it knows, the snapshot has already been taken. But that's not the case: operation A is in the snapshot, because the client has not been notified of the snapshot yet. So this is also inconsistent. The solution we've implemented is fairly common among enterprise storage systems in the industry that are trying to address this issue of crash-consistent snapshots — which has become a bigger topic now with the Kubernetes CSI requirements — and that is to introduce an IO pause.
An IO pause ensures the ordering by preventing any operations within the tree of interest while the snapshot is percolating through the entire file system and all of its clients. The way this looks in practice: op A is started, and the IO pause is established. Client 1 tries to induce client 2 to execute operation B, but operation B cannot execute because the IO pause is enforced. Looking at it a little differently, we could have op A and op B both happen before the IO pause; they're both part of the snapshot — this is consistent. And we may also have a situation where op A is sent to the MDS just before the IO pause is established: op A waits through the entire course of the IO pause, and when the IO pause is lifted, operation A is allowed to complete; the notification is sent back to the client that the operation is done, and op B is started. This is also consistent. We'll also look at a super-operation — a compound monolith, a variant of mksnap — which will establish this IO pause for you, but also at the underlying mono-operations you can use to establish the IO pause yourself. That will be the mechanism you can use to actually establish these crash-consistent snapshots. So I'll move back to the approach. — Thanks, Patrick. So we now realize that all we need is an IO pause; let's see how we do it. We were considering a couple of approaches, and one of them is, apparently, a monolith solution: we would define some new command that would mean "consistent snapshot", you would configure it somehow and start it off, and it could even sync across file systems — even across a CephFS file system and an RBD volume. If you have multiple different types of volumes configured for your Kubernetes applications, with this approach you would still be able to create a consistent snapshot across all of them.
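The race in the case study, and how the IO pause closes it, can be captured in a tiny simulation. This is a toy model of the talk's scenario — the class and its fields are invented for illustration, not CephFS client code.

```python
class Client:
    """Toy CephFS-like client: writes issued before it has processed the
    snapshot notification land in the snapshot image."""

    def __init__(self):
        self.snapped = False    # has this client processed the snap notification?
        self.in_snapshot = []   # writes captured by the snapshot

    def write(self, op: str, paused: bool) -> bool:
        if paused:
            return False        # IO pause: the write is blocked until release
        if not self.snapped:
            self.in_snapshot.append(op)   # pre-notification write is captured
        return True

# The race from the talk: client 1 already saw the notification, client 2 did not.
c1, c2 = Client(), Client()
c1.snapped = True
c1.write("A", paused=False)     # after c1's notification -> excluded from snapshot
c2.write("B", paused=False)     # before c2's notification -> included
print(c2.in_snapshot)           # ['B'] while A is missing: inconsistent

# With an IO pause, B simply cannot execute until the pause is released,
# so a snapshot containing B without A is impossible.
c3 = Client()
assert c3.write("B", paused=True) is False
```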
To expose this to the user, we introduce the concepts of a quiesce set and quiesce roots. A quiesce set is basically just a collection of mount points whose IO you'd like to quiesce — in the world of Kubernetes, a set of volumes whose IO you would like to quiesce. It's reasonable to give users this quiesce-set entity because you don't want them to chase around all the different subvolumes asking whether each one is quiesced or not; we're interested in whether a group of volumes is quiesced together, and that's what we await. So a quiesce set implements this state transition. Internally, your mount points map to some path inside the CephFS file system, and this is where the magic happens — this is where we actually quiesce the IO — and we refer to that as a quiesce root. We have also thought about the condition where a quiesce root may be part of multiple quiesce sets at the same time, because we don't want to interfere too much with the logic of automated snapshotters like Kubernetes, which might involve a consistent snapshot of some volume that is part of two different, unrelated processes. The way we resolve it is really simple: as long as the root is part of at least one active quiesce set, IO to this root is quiesced. So let's talk about the API. This is the command we're suggesting: quiesce. We give it a file system name, we name our set ID so that we can refer to it later, and then you include as many mount points as you wish into the set. You can also ask this command to be synchronous with --await, so it won't return until the quiesce has been achieved. Once that is done, you can go on creating snapshots — these are regular snapshots, the snapshots that you do in CephFS; nothing changed about those. So we've created three snapshots for the three mount points that we've added to the quiesce set, and then we invoke the quiesce command again, but this time asking it to release the pause.
If we successfully quiesced — hopefully we'd have done nothing else if there was a failure — and the release also succeeded, then we know that those three snapshots are consistent, because the pause has been confirmed active for the whole duration of the process. And here's your monolith: almost for free, we also get a monolith approach — a one-liner for system administrators who don't need to interface with the internals. We're suggesting a --consistent switch to the snapshot command; we're changing the semantics of this command a little bit by allowing all the mount points to be provided at the same time, and then it does everything under the hood — the same thing, just done for you. Now we have a tool, and of course we can shoot ourselves in the foot with it, because we can DoS our application. We thought about this, and we've built DoS protection into the quiesce database. We've done this by implementing two watchdog timers. The first watchdog timer is a timeout: when we consider the set, it's going to spend some time quiescing. Why? Because there are ongoing operations, right? Before we can acknowledge the quiesce, we have to let applications finish whatever they have been doing. Under the hood, quiescing is managed automatically for you over each and every mount point, and all the mount points share this timeout to reach the quiesced state. If at least one of the mount points fails to quiesce within the timeout, the whole set is timed out, and whichever quiesces were achieved are released immediately. The second timer is the quiesce expiration timer. For that we need a quiesce set that actually succeeded in quiescing: we know that to succeed, all the mount points must have successfully quiesced within the configured timeout.
But then, if we forgot about the set, or something crashed, something bad happened and we never released it or never cancelled it, and the expiration timeout elapsed, then the set is going to enter the expired state, and again everything is going to be released automatically for you. Why do we have two timers and not just one? The reason is that you're going to have different considerations when you try to come up with the values for those timers. The quiescing phase really depends on your system: how many mount points, what kind of applications you're running, what kind of operations they're doing with the storage. That determines how long you should wait for the system to quiesce, and you allocate some reasonable amount of time for that. However, the quiesced state is already on you. When the system has reached the quiesced state and you have the notification about it, then you can say, okay, I know that I need to do just a single snapshot, so I don't need more than, let's say, 10 seconds. Whatever, right? Two different considerations that you need to take into account when figuring out these two timer values. This is the API; I've simplified it a little bit in the previous slides. We're not going to go into all the details, but it's basically a Swiss army knife; you should have all the options that you want. And with that, let's ask Patrick to discuss the design. So let's just take a quick look at the high-level design of the entire system. Here we have an administrative client, which in the wild is probably going to be the Kubernetes CSI driver. That's going to be interacting with the Ceph Manager, specifically the volumes plugin within the Ceph Manager. The volumes plugin will actually be executing the commands on one of the MDSs in the file system; we'll call it the quiesce leader, or rank zero in reality. And then that will also be coordinating with any other ranks in the file system; we'll call them MDS B and C.
And then finally the file system clients, which are talking to the MDSs. To talk to the volumes plugin, the API will be the regular Ceph command-line interface that we all know and love; the API will be exposed at that level. The volumes plugin will be talking to the MDSs using the libcephfs API. The MDSs will replicate the quiesce database amongst themselves, so they all have a view of the same quiesce database. And then the quiesce protocol will be used to actually quiesce the IO and stop the clients from doing IO on a given subtree in the file system. We're going to talk about that part next. So how do we actually quiesce IO on a subtree? Before we can get into that, we'll take a small step back and look at some context and background regarding what CephFS client capabilities are. CephFS is somewhat different from a number of distributed file systems in that the MDSs and the clients maintain a cooperative cache. Clients have an elevated status within the file system in that they can also cache metadata, not just data. And not just cache it: they can also have rights to mutate that metadata locally without involving the MDS immediately. To give a specific example, here MDS0 is authoritative for a given file, 0x19db.dat, and client.1 on the right has a capability for that file. The access rights it has on that file, delegated to it by the MDS, are to read, write, cache reads and buffer writes. It has shared extended attributes, meaning it has a local cache of the entire extended attribute map for the file, and it knows the extended attributes will not change without it being told by the MDS. Similarly for the link count of the file. This allows the client to respond to certain stat calls locally, without actually talking to the MDS.
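As a rough illustration of what capabilities buy the client, here is a hypothetical model (invented structure, not libcephfs): with a shared link cap, a stat for the link count is answered from the local cache instead of a round trip to the MDS.

```python
# Toy model of CephFS client capabilities. Caps describe what the
# client may cache or mutate without asking the MDS first.
RD, WR, CACHE, BUFFER = "r", "w", "c", "b"

class ClientInode:
    def __init__(self, caps, xattr_shared, link_shared):
        self.caps = set(caps)
        self.xattr_shared = xattr_shared  # xattr map cached and stable
        self.link_shared = link_shared    # link count cached and stable
        self.cached_nlink = 1

    def stat_nlink(self):
        # With a shared link cap the client answers locally; otherwise
        # it must round-trip to the MDS for the authoritative value.
        if self.link_shared:
            return ("local", self.cached_nlink)
        return ("mds", None)

ino = ClientInode(caps={RD, WR, CACHE, BUFFER},
                  xattr_shared=True, link_shared=True)
print(ino.stat_nlink())  # ('local', 1)
```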
Capabilities themselves are modeled loosely after leases, from an academic paper I put in the slides; leases mostly differ in having a time-based duration, whereas capabilities within CephFS have an undefined time duration. So now let's look at exactly how we're going to quiesce IO. We have this issue of clients holding these capabilities and maybe trying to continue doing writes to the file or modifying metadata, so we have to recall those capabilities. Here we have two MDSs, zero and one, and a quiesce database replicated between them. On the right we have client one with a number of caps for a given tree of interest that we're trying to quiesce, rooted at SV. When we want to quiesce, the quiesce database launches a quiesce subvolume operation; it's an internal operation on the MDS, and it will start on MDS zero. That in turn launches some sub-operations, quiesce subvolume inode, and it will do that on every inode in the given subtree that that particular MDS is authoritative for. The inodes are colored according to MDS authority, so the quiesce subvolume inode calls will be performed on just the first two inodes at the top of the tree. We'll look at what that does in the next slide. The quiesce database will also launch the same operation on MDS one, which will launch quiesce subvolume inode operations on the inodes it is authoritative for. Once all this is complete, it's done. So what does quiesce subvolume inode do? As an example, we have the operation being executed on 0x19db.dat. We have a client on the right with the capability to read, write, buffer and cache data for the file. It has exclusive rights on the xattrs, so it can even make local changes to the xattrs without telling the MDS immediately about them, and it has a shared link count. Now when I start the quiesce subvolume inode operation, it actually behaves similarly to many client requests that are already executed within the MDS.
We're using internal facilities that already exist to do this. The operation acquires a number of locks, internal locks on the inode, not the POSIX-facing locks that normal file system users are familiar with. These internal locks control which metadata the operation has permission to change on the inode. So we're acquiring the auth lock, the link lock, the file lock and so on, either for reading or exclusively. By doing so, the MDS will reconcile this with what rights have already been given to clients, that is, what capabilities have been issued. If necessary, it will revoke capabilities before those locks can be acquired. So when this operation tries to acquire those locks, the MDS sends a revoke to the client. The client updates its local capabilities according to what the MDS is now allowing it to have, possibly flushing data if it changed the file size, for example, or added an extended attribute. It may flush that along with an update message to the MDS saying, yes, I've updated the capability, I don't have these access rights anymore. And now you see that it has no file permissions; its xattr cap is now shared instead of exclusive, and the link count continues to be shared. After this has occurred, the operation is considered done, and any future ops on the client associated with this inode will block, because these locks are still held. Why? Because this is a long-running operation. Unlike most ops in the MDS, which acquire these locks, perform some metadata mutation and then drop the locks, this is necessarily a long-lived operation, because it needs to keep clients from getting capabilities on the file, and to keep metadata operations, which would also try to get the locks, from executing. So a getattr would block, as would any other client operation that would acquire those locks. Now, to close out the talk, we'll take a quick look at the quiesce set state diagram, focusing only on the happy path.
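The revocation step could be sketched like this (again a toy model, not MDS code): taking the internal locks forces conflicting caps to be revoked, and later client ops on the inode block while the locks are held.

```python
# Hedged sketch of the revocation described above. Lock and cap names
# are paraphrased; the conflict rule here is an illustrative guess.

class Inode:
    def __init__(self, client_caps):
        self.client_caps = set(client_caps)  # e.g. {"r", "w", "c", "b"}
        self.held_locks = set()

    def acquire_quiesce_locks(self):
        # auth/link/file locks taken by the long-running quiesce op
        self.held_locks |= {"authlock", "linklock", "filelock"}
        # write/buffer caps conflict with the file lock, so revoke them;
        # the client would flush dirty data before acknowledging.
        revoked = self.client_caps & {"w", "b"}
        self.client_caps -= revoked
        return sorted(revoked)

    def client_op(self, name):
        # While the quiesce op holds the locks, later ops block.
        if self.held_locks:
            return f"{name}: blocked"
        return f"{name}: ok"

ino = Inode(client_caps={"r", "w", "c", "b"})
print(ino.acquire_quiesce_locks())  # ['b', 'w']: caps the client gave up
print(ino.client_op("getattr"))     # getattr: blocked
```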
You know, it's a typical state diagram, lots of error paths, right? So we have a new set, we add a number of roots to the set, and once it's in that state it's going to enter quiescing. At that point we're going to launch all our quiesce subvolume inode operations and acquire all these locks; capabilities will be revoked, new operations will be blocked. When all of those operations have their locks and they're complete but not dead, then we can enter the quiesced state. That'll trickle back up the stack: when we're querying the database, we'll be able to see that the set is quiesced. At that point we're going to take our snapshots on all the roots that we need, more than one probably, and when the snapshots are complete we can then release the set. We'll go into the releasing state, all of those quiesce subvolume inode operations will be killed, and the locks automatically released, allowing clients to be reissued caps and any blocked metadata operations to be kicked and resumed. Once those operations are all dead, the set enters the released state, and the quiesce set is considered terminal and done. So that is the basics of quiesce sets, and again there are a number of error states shown on the slide, a cancelled quiesce set or an expired one, etc. With that, that's the end of our talk; we're going to leave time for questions. Again, I'm Patrick Donnelly, this is Usov. I said your last name right, right? I don't often say his last name. These are the pull requests we still have open for our work; they've not yet been merged into the main branch, so this is not yet live even in the development version of Ceph. And we have some preliminary documentation that you can also review; some details may change, but for the most part it's reaching a very concrete state. That's it, thank you. Any questions? Yes? CephFS takes snapshots and stores them in, like, a .snap or _snap type directory within the folder. You mentioned that all IO
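The happy path of that state diagram can be paraphrased as a little table-driven machine (state names as on the slide, error states like cancelled and expired omitted):

```python
# Happy-path transitions of a quiesce set, as described on the slide.
HAPPY_PATH = {
    "NEW": "QUIESCING",        # roots added, subvolume inode ops launched
    "QUIESCING": "QUIESCED",   # all ops hold their locks
    "QUIESCED": "RELEASING",   # snapshots taken, release requested
    "RELEASING": "RELEASED",   # ops killed, locks dropped, caps reissued
}

def run_happy_path(state="NEW"):
    trace = [state]
    while state in HAPPY_PATH:
        state = HAPPY_PATH[state]
        trace.append(state)
    return trace

print(run_happy_path())
# ['NEW', 'QUIESCING', 'QUIESCED', 'RELEASING', 'RELEASED']
```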
in that path from then onwards will be paused, so will reads on previously taken snapshots also be frozen? For the most part, yes. All right, so the question is: if I've quiesced IO on a subtree, can I continue to access past snapshots of that subtree? And the answer is probably not, because of the way the locks work on the inodes; it may also incidentally block access through the snapshot version of the inode. Then there is, like, the shallow volume feature, where maybe we can mount the snapshots for a backup system to just read the contents. Not at this time. We're also looking into a variant of quiescing that allows read-only access to the files. Right now it's very much a stop-the-world IO pause, for the most part, so you won't even be able to execute most reads on the file system. Some stat calls may still be answered locally on the clients, because they still retain certain read-only capabilities. In the future it would work for that access vector, like we can access read-only snapshots? That is the hope, that in the future we'd be able to support that, yeah. Any other questions? Neil. So now you have the Ceph command to quiesce volumes; is it also possible to run fsfreeze on the client side, if you have a CephFS kernel mount for example, and have that quiesce the volume for all other clients? Do I answer that one? So the question was whether it will work for kernel clients. Exactly, if you call fsfreeze on the client side instead of running the Ceph command. As of now we haven't planned to support fsfreeze, but we will look into it. I think it's pretty reasonable to consider, even for the first iteration. One of the good things about what we're doing right now is that it's intrinsically backward compatible: because we're building on the Ceph capabilities, kernel clients will be able to reach the quiesced state.
Now, how you trigger the quiesce is another question, and we'll definitely consider this. Other questions? Okay, thank you very much. Was it pleasant?
Chorus - Effortless Ceph S3 Petabyte Migration
Hi everyone, hope I am audible to the larger crowd here, thank you. So I am Sreesha Gurduru, I am a Ceph engineer working for Clyso, and I am here to present an open source project called Chorus, which does effortless S3 petabyte migration slash replication. So let us talk about why data migration, like when do we come across a data migration scenario. A lot of companies and organizations these days have private cloud clusters, and hardware with certain specifications and SKU can come to end of life anytime: the vendor might stop supporting the existing hardware, there might be new hardware coming up. In that case there are two options in front of us: either augment the existing cluster with the new hardware, if the specifications and SKU are similar to what we have now in the cluster, or build a brand new cluster altogether. And when we build a brand new cluster, there is a strong reason for data to be migrated between the old production cluster and the new cluster, so that the data continues to stay and operations can happen smoothly. This can be one of the main reasons for data migration. Let us talk about a few woes, or difficulties, with data migration. When we are talking about migration of data, we are not talking about a few bytes or gigabytes; we are talking about petabytes of storage. We have a lot of data being stored in our storage back ends these days, and it has to be effortlessly migrated to the new clusters. So the challenges include syncing petabytes of data and the continuous monitoring that we have to do behind the scenes. We just pick up some tool, rclone in this case, and rclone is a robust synchronization and copy tool. But even if we run it in the background, we keep monitoring the status of the replication, and also the time consumed with that huge amount of data to be copied across the clusters.
Then, obviously, there are the tools that are used for the migration, and continuous changes in the data. We do not decommission the existing cluster yet; we have active operations happening on the cluster, be it reads, writes, updates. Due to those continuous changes in the data, we might find it a bit difficult to copy or migrate. Let me share one of our experiences with a customer, where in a similar scenario their cluster went end of life, we built a brand new cluster for them, and the data to be migrated was around 3 petabytes. Between the old and new clusters we picked rclone as the data migration tool. Let alone the data, we had to migrate the metadata too, obviously, and there was some issue with rclone where we could not copy the ACLs and the bucket policies for a particular bucket; we had to tweak around and eventually got it working, but it was a difficult task for us. Indeed it is a Herculean task. These encounters, these experiences, led to a tool called Chorus, which is open source data replication software, capable of synchronizing S3 data, as of today S3 data, between multiple cloud storage back ends. Let me present some of the problem statements for our tool. How to migrate S3 vendor data with reduced downtime? I would not say there would be no downtime, but with reduced downtime, with the cluster being operational and the data being copied to the new cluster at the same time. And how to back up S3 data to another S3 in a different region or with a different vendor? Here we might not have the same back end; we might be using storage from different providers like Amazon, Google, MinIO, and we might have our own private clusters. So it is vendor agnostic: one of the initial goals of Chorus was to have a vendor agnostic solution.
It should be able to support multiple back ends, and with a pluggable architecture, meaning the components in Chorus are loosely coupled. If I see that one of the layers can be better, it can be replaced with another tool which is more performant and more efficient; I should be able to replace it. Then benchmarking: before we add in any component, we benchmark that tool so that it will be compatible with the entire project. Then the focus on correctness: for the data which is present in the source and the follower back ends, we ensure that the data is correct and in sync across all the storage back ends. And then migrating a big bucket under load without downtime, or with reduced downtime. There are two things here: there can be multiple buckets with small amounts of data, and those buckets are easy to copy because it just takes a couple of minutes. But there is a scenario where one bucket has a huge amount of data, a lot of clients might be writing to that one bucket, and that bucket has to be migrated. That is a bit of a concern. So, an overview of Chorus: there is one main storage, and the remaining ones can be configured as follower back ends. The user starts by inputting the storage credentials in the configuration. Once that is configured, the Chorus S3 API can be used instead of the storage back end API. If you are using AWS, Google, MinIO, every back end has its own endpoint; instead of using those, you can use one Chorus API to communicate with multiple back ends, because they are all S3 based. Chorus proxies requests to the main storage, and then eventually the data is copied to the followers in the background. All the existing data is replicated too: when we introduce Chorus into our ecosystem, we might already have clusters with a certain amount of data, which has to be copied to the different back ends that we configure later.
So the existing data can be replicated in the background using this tool. The data replication can be configured, paused, resumed; there are different life cycle states for a particular request, and you can check the status. You can stop, start, resume at any time you want, and the management can be done using a web UI or a CLI. The features of Chorus include routing and replication per bucket: you can configure where to route a request and where to replicate a bucket, and again you can pause and resume anytime. Then synchronizing objects and metadata: not just the data, you can also copy the ACLs, bucket policies, tags and everything using the same tool. Then, as I said earlier, migrating existing data in the background, and tracking replication lag. As of today we might have one set of configuration and the data might be copying to the follower back ends; we can always track the replication and improve it with the configuration options: we can set a rate limit, we can increase the number of workers. And Chorus exposes Prometheus metrics. We have the entire logging story, the metrics are sent to Prometheus, and the logs are in JSON format, easy to read. Prometheus and Grafana form the monitoring stack: you can visualize how a bucket is being replicated, and at what stage it is, using the visualization stack. Let me briefly talk about the architecture of Chorus. Chorus is structured around two main web services: one is the proxy and the other is the worker. Initially the request comes to the proxy. We are talking about a flow where a routing policy is in place. The request comes to the proxy, and based on the routing policy, the bucket request goes to the main storage, which is configured to be MinIO here. The response goes back to the proxy and then eventually to the user. That is one of the flows.
The second flow is where a replication scenario is established. Again the request comes to the proxy, and then an event, or task, is created in Redis based on the replication policy: it knows what the main storage is and which storage the replication should be done to, in this case Ceph for example. Then the Chorus worker reads the tasks from the cache, reads from the main storage and replicates to the follower back end. The Chorus worker is accessible using the web UI and CLI. So this is an overview of the entire flow based on the different scenarios. Chorus also has an initial migration feature: as I said, the replication can happen in the background. When that initial migration is happening, all the buckets are listed from the main storage, then the objects within a particular bucket are listed, then tasks are created based on the objects, and it is ensured that every object is copied to the follower back end by a particular task. The worker processes tasks in the background, copying the data to the follower back ends. These are the main components of Chorus: proxy, worker, admin UI and Redis. The proxy and worker are written in Go and the admin UI is written in Vue, and the entire deployment is done in a containerized fashion on Kubernetes pods. Let us talk about the resource requirements for the different components in Chorus. For Redis, scaling is done using Redis Cluster, persistence is ensured using Redis AOF and RDB, and for fault tolerance, in case of Redis data loss, we can always restart the bucket replication because the state is maintained. As for memory consumption, if there are around 1 million objects to be migrated, that is approximately 1 million tasks in the queue, then approximately 700 MB.
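The initial migration just described, list the buckets, list the objects per bucket, enqueue one copy task per object for the workers, might be sketched like this (dict-backed stand-ins for the S3 back ends and the Redis queue, purely illustrative, not the real Chorus code):

```python
from collections import deque

# Toy initial-migration flow: one copy task per object, processed
# by a worker that copies from the main storage to a follower.

def enqueue_migration(main_storage, queue):
    for bucket, objects in main_storage.items():
        for key in objects:
            queue.append(("copy", bucket, key))

def worker(queue, main_storage, follower):
    while queue:
        _, bucket, key = queue.popleft()
        follower.setdefault(bucket, {})[key] = main_storage[bucket][key]

main = {"photos": {"a.jpg": b"...", "b.jpg": b"..."}}
follower, q = {}, deque()
enqueue_migration(main, q)
worker(q, main, follower)
print(follower == main)  # True: every object replicated by its own task
```

Because each object is its own task, a crash only loses in-flight tasks, and the replication can be restarted from the maintained state, which is the fault-tolerance point made above.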
This is all based on our benchmarking; it can change in different scenarios. Redis is assumed to take little CPU, and it can handle between 100 and 1000 requests per second. Coming to the proxy: it is stateless, it consumes little memory and little CPU but high network, because the proxy is the kind of brain; it takes in the requests and routes them according to the replication or routing policy, hence it also needs high network bandwidth. Coming to the worker: it is again stateless, and it takes high memory and high network but less CPU, because the worker is the one that reads requests from the cache and routes requests to the back ends based on the replication policy. The worker instance's network and memory consumption can be rate limited in the configuration: during the day, when there is a huge amount of requests coming to our clusters, we can just stop the migration activity for a while, or rate limit it to run at a lesser rate, and then eventually increase it when the bandwidth is available. So what are our next steps for Chorus? We want to perform more load tests, with larger buckets and more data, and measure time consumption; then resource optimization at the various component levels, like Redis, how can we make it better, and the workers, where we want to make the logic more functional, and then the API cost. Then routing policy alternatives: since we have multiple storage back ends, what we want to do is route based on object size, for example. If there is a one GB file, you can configure it to be written to a particular storage back end, and if there are small files you can configure them for another back end, based on quota and a lot of other parameters, and load balance read requests for replicated data.
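Chorus exposes rate limits in its configuration; the exact mechanism below is only an assumption for illustration, a classic token bucket that a worker could consult before copying each chunk:

```python
# Illustrative token-bucket limiter for a worker's copy bandwidth.
# Chorus's real rate limiting may work differently; this just shows
# the idea of throttling background copies during busy hours.

class TokenBucket:
    def __init__(self, rate_bps, capacity):
        self.rate = rate_bps        # tokens (bytes) refilled per second
        self.capacity = capacity    # burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, nbytes, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # worker should back off and retry later

tb = TokenBucket(rate_bps=1000, capacity=1000)
print(tb.allow(800, now=0.0))  # True: the bucket starts full
print(tb.allow(800, now=0.1))  # False: only ~100 tokens refilled so far
```

Raising the rate at night and lowering it during peak hours is exactly the "increase it when the bandwidth is available" knob mentioned above.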
Now that we have multiple storage back ends at hand, we can always make efficient use of each back end; we can load balance the requests. For example, if the main storage is busy taking writes while the data is being copied to the follower back ends, we can always route a read request to any back end which is idle; so that logic. Also, every storage back end provides bucket notifications and an event log, so we can subscribe to those events instead of querying the proxy every time and overloading it, or keeping polling for the bucket information; instead we can use the proxy to really write data and migrate data. Then we are planning Swift API compatibility: as of today we have robust S3 API compatibility, but we are planning to have OpenStack Swift integration. And then life cycle policies: the API parameters for the different back ends are different, so we want to have good life cycle policy handling, which is being tested. When a bucket is created with a particular life cycle policy in one storage, the same should be replicated to the other back ends as well, without loss of any policy configuration. Use cases: the further use cases that we see for Chorus are an active transparent proxy post-migration. To speak briefly about that: if the source and destination are completely copied, and we want to stop using the source any more, then once the data is moved we should be able to switch the proxy to another back end, to make it the main storage, instead of reconfiguring it in the configuration file.
Then a robust backup service: if we have two DCs, two sites, we want to synchronize data between both sites in both directions. The simple setup is to synchronize data between prod and the backup site, and we want to make this tool efficient enough to be a robust backup service, so that during disaster recovery, even if the primary goes down, we can simply do all the operations from the other back ends that are available, by switching the storage back ends based on how they are configured. So, any questions regarding Chorus and its implementation? One question regarding versioning: if you do replication from a source to a destination, and the source has versioning enabled and there are a couple of versions, how does this integrate into Chorus? So the question is basically about object versioning: if object versioning is configured on the source, how do we replicate it, right? In object versioning, the versions are also stored as objects in a hidden bucket, so that bucket will eventually also be copied with the metadata. The object which you create initially will have metadata, you configure versioning on it, and there is a hidden bucket where all the versions go, and we can restore to a version anytime; so the entire data is copied to the other back end as well, with the metadata. That is how we can ensure it. Yes, I'm asking also about the backup use case: does Chorus manage object lock? Can you repeat that for me? Does Chorus manage object lock? Object lock. Object lock. Yes. Lock. Lock. Yes. I'm sorry, I couldn't get that; maybe I'm not so much acquainted with that scenario, but can you just elaborate on what you mean by object locking? It's a WORM technology: the object is written one time and it cannot be modified. Okay. Read only. Read only. Read only. It's like malware protection, the ransomware thing.
That was one of the features we want to implement. So the question is more about when an object is locked, or when there is an attack on the data. The ransomware thing is in one of our discussions: the follower back end would be exposed more instead of the main storage, so the WORM or whatever is introduced would be on the back end. That's just in discussions, we are not there yet to implement it, but please feel free to post your question on GitHub. You can raise an issue, you can start a discussion on GitHub, and then we can definitely take that as a feature with more details. These are the resources. We have open sourced the back end, and you can definitely reach out to us on GitHub; this is the GitHub link for the Chorus project. I cannot speak about Chorus so much in these 30 minutes, but it definitely is an efficient tool, and it has a lot of capabilities to act as an orchestration layer for multiple back ends: we can use one API to talk to multiple different types of back ends from different vendors. So yeah, we are looking forward to more improvements and more features, and you can always talk to us on GitHub, and then we can definitely improve this project together. Thank you so much for this opportunity. Can I ask a question? While you're migrating, you want to implement a load balancing feature. Yeah. So you need to know the state of all the objects that you already migrated and that you still have to migrate, to make an informed decision where a request should go. So do you already have, like, a database, or how do you know? Yeah, we are going to do that, to make the load balancing so that the request is sent to the correct back end. Yeah, so you can, like, go to the faster cluster, you can go to the new cluster. Exactly. It needs time to be more mature; I wanted to add a presentation of the code, but it didn't fit in 30 minutes. Yeah, I got it. It was, like, down to how to restore it. Yeah, sure, sure.
But it's still, like, really new; we have some people there testing, we're talking to start. Thank you. Very nice. At the moment we just need to connect both. Yeah, that would be great. And we can set up the next speaker.
SMB for Linux with SMB3 POSIX extensions
Yeah, thank you. Just to introduce myself, my name is Volker Lendecke. As you can all see, I have worked on Samba since the mid-90s, last century actually, so for quite a while. And I think I don't have to introduce what Samba and SMB really are; they are file-serving protocols. And what I would like to do eventually is kill NFS. I know this doesn't go down well in some communities, but this is what I'm working on in my spare time, when I have spare time; in the last few months, unfortunately, it was a bit limited. Some of you have already seen this talk at SambaXP or other conferences. There's a little bit of new stuff, but I think it's still interesting to see that you can actually serve Linux clients with SMB. So what is it all about? You want to share file systems, directories and files across a network. You have one server where you have a directory, a file system, and you want this to be shared across a network to possibly many, many clients. If you go Linux to Linux, you typically use NFS, and one of the reasons is it's so simple. What you do is you just add a line to your /etc/exports, maybe you have to kill or restart a daemon or whatever, then you issue just a mount command on your client and you're done. That's about it. However, it comes with some downsides. First, there is essentially no real metadata or data cache coherency in NFS. This means that it can regularly happen that you create a file somewhere and it doesn't really show up until a bit later on other clients. If you just write to files, other clients don't really see the mtime or size updates precisely, and so on. So this is kind of problematic. Why does the maildir format actually exist? Because locking doesn't work over NFS. And yes, NFSv4 has locking, NFSv3 has external protocols to do locking, but you can't really rely on those.
And it's really, really complex to set up locking properly and to get failover done and so on. Then there's this initial very simple setup, and I love this acronym for NFS: no file security. Because essentially what you do is you trust your clients to assign the UIDs and GIDs, and essentially the group permissions and whatever, correctly on the client, and there's nobody in between who actually checks. I know these days there are protocol extensions to do NFS over TLS, so at least the transport is protected in a standard way. You can of course go and enable Kerberos for NFS, but this is also pretty complicated, and we have done it in customer scenarios; the client at least is buggy as hell, and you get incompatibilities all over the place, you lose keys, you lose everything. So it's really, really difficult to set up; as I said, clients have a very bad day when you Kerberize them. SMB, however, really comes from the Windows world. There was one talk by the original SMB implementer, Barry Feigenbaum. Is it available online, Günther, do you know? At one of the conferences that we regularly go to, there was actually a talk by the original inventor, or developer, of the SMB protocol. Essentially what they did is they took the MS-DOS interrupt 21h, put the arguments on the wire and let the server take care of it. And this means that they had to be compatible with a lot of applications on DOS. And DOS means that applications like Word 5.5 or whatever believe they are alone on the machine. So this means you have to get locking right. If Word opens a file and it believes it's the only one editing that file, you better make sure that nobody else also edits that file simultaneously. So they had to get locking right from day one. The other one is cache coherency. We have a protocol for this, and between Windows and Linux this actually works.
So if you open a file over SMB, typically what you get is permission to cache stuff — to cache your updates, to cache reads, and so on. This leads to much, much better performance. And if somebody else also wants to open the file, you get notified: oh no, you're not alone on the file anymore, please drop your caches, please write back your caches. And you tell the server: hey, I'm done writing back, now please let the other one in. So the clients agree that they all have to write back their changes, read new data from the server, and so on. One of the other advantages is that SMB servers are everywhere. Every home router in Germany, the FRITZ!Box, has an SMB server in there. All NAS appliances have SMB, so it is everywhere, and you can access it from almost any place. Whether all the features we are talking about here are correctly implemented everywhere, that's a different story — for example, FRITZ!Boxes don't talk to my mobile phone properly, but that's a different story. But essentially, it's everywhere. The SMB protocol is very flexible. There were very early extensions of the SMB1 protocol. Like every protocol, you have a lot of requests going back and forth, and there is unused protocol space — you have a create request, a read request, and so on; they are numbered, and there's number space that you can take. And this is what we did early in the 2000s or so for the SMB1 protocol: there are UNIX extensions that map all the UNIX semantics onto the SMB1 protocol. This was never properly carried over to the newer — and now only remaining — SMB3 protocol. And what we are working on is extending the SMB protocol with all the behavior that a POSIX client expects. How is that done? The first packet sent between client and server is called Negotiate Protocol. And it does exactly what it says: it negotiates different flavors of the protocol.
For example, it says: hey, I'm SMB1, I'm SMB2, I'm SMB3, I have this and this subfeature, I can do these capabilities and those I can't, and so on. And what Microsoft did with the SMB3 protocol — they did the smart thing and made this request extensible. Essentially, you have this Negotiate Protocol request and you can add what I would call extended attributes to it over the wire. I mean, it's not a file system, but you can extend the request in a standard way with a new negotiate context. So you have a ton of negotiate contexts that say: okay, I can do encryption this way, I can do whatever. And we just have an additional negotiate context that says: I can do POSIX in this version. So the client tells the server "I can do POSIX", and the server answers whether it can. The default behavior for unknown contexts is that the server just ignores them and doesn't send a reply. If the server does send a reply, I know I'm talking to a Samba that is able to do all this stuff I'm talking about here. File name handling. This is really painful in our case, because Unix file systems are case sensitive, and Windows file systems — in particular NTFS — are not. What does that mean? Under Unix you can have two files, Makefile and makefile, one with a capital M, one with a lowercase m; under Windows, under NTFS, you can't. When a Windows client now comes in and says "I want to create Makefile", what you have to prove at creation time is that no other uppercase/lowercase combination of Makefile exists in the file system, to fulfill the promise of case insensitivity. What do you do by default? You scan the whole directory. And this leads to O(n²) performance behavior.
If you just drop a million files into a directory, file number 900,000 takes a lot longer than file number one, because I have to scan the whole directory to prove that no other uppercase/lowercase combination exists. What we can do is add a new create context — not only the Negotiate request, but also the open/create file request has these extended attributes. I can say that I want to open a file POSIX style by adding one of these create contexts. And we have defined a create context so that clients, on a per-request basis, can say: I want POSIX behavior, I want case-sensitive behavior, I don't want file name restrictions, I want double quotes in a file name, which Windows wouldn't allow. I want them; I know what I'm doing; I'm POSIX. What we also need is POSIX metadata. If you look at the properties of a file on a Windows client — so we are here, a Windows server, I say Properties — there's a lot of stuff. In particular, there are timestamps: we have four timestamps in Windows that are roughly similar to what we have in Linux. We have attributes and so on. There's a lot of metadata that Windows has. However, the semantics are a bit different. In particular, Windows doesn't have a good notion of UID and GID, and it doesn't really have a good match right now for POSIX permissions. Some of the fields that we have in struct stat, like file size, are the same in Windows, but UID and GID in particular are not. So we extended the protocol. If you, for example, do a stat on a file, if you ask for file information, you can say "I want this info level" — there's a 16-bit field for info levels. And we just added one. We talked to Microsoft: hey, let us use this additional number for a POSIX information level. They agreed, and so we have an additional info level that we can use to fill in all this information that a client might want.
However — second-to-last line — none of this is really the topic of this talk. It's about file types. If you look at a Unix file system, you have seven types of files. You have a normal file. You have a directory. What else do we have? We have block and character devices. We have named pipes. We have symlinks — oh, shock and horror. And we have sockets, Unix domain sockets. Samba can handle regular files and directories extremely well — oh, there's a typo here, as you'll find out. I mean, that's what we are made for: we are a file server, so we'd better handle directories and files well. What do we do about the other ones? If you go and share /etc in Samba — sorry, share /dev in Samba — something you probably shouldn't do, but if you do, Samba will find a lot of stuff that it can't really properly present to Windows, to any client. It will find character and block devices, all sorts of stuff in /dev. Or if you just share a home directory, you will find sockets for gpg-agent, ssh-agent, and so on — all sorts of stuff that doesn't really fit into the file-and-directory scheme. In particular, for example, you find FIFOs. In previous Samba versions this used to work: a client could come in and open a FIFO for writing, hoping that the server-side process on the Unix machine still exists, and you could write into it and that process would get the data. This can't be very popular, because many versions ago we broke it and nobody noticed. Alexander is confirming — you're using it, Alexander? We have a lot of tests, but Alexander's comment was that we don't cover this, which means we didn't notice. Why did we break it? If you open a FIFO under Unix, all you can do is issue read and write syscalls.
We don't do that in Samba anymore, because whenever we get a read or write request from Windows, there's an offset attached to it, like in NFS. And we do the natural thing: we pread and pwrite, with an explicit offset. This is all from times when you couldn't really expect pread to exist, but those times are long gone. We have some very special support for sockets. What's a socket? It's essentially a FIFO on steroids. And what we do with sockets is implement the Microsoft notion of RPCs. What is that? A Microsoft Windows client can, over SMB, open a special file on the share IPC$, like \pipe\winreg, and transfer data over it. What you do is you run winreg — the Windows registry: you open a file on the IPC$ share and you talk to the server-side registry over RPC calls. Since 4.16 we have implemented this so that our Windows registry server actually listens on a Unix domain socket; the SMB server connects to that Unix domain socket and just passes requests back and forth. So this is what I mean: we have limited support for sockets, but this is not what somebody would expect for, say, an ssh-agent socket in a shared home directory that clients want to connect to — this would all need to be done on the client side then. Block and character devices — I mean, we find them server side, but they don't make sense at all over the network. You don't want to read and write to /dev/sda over the network. You just don't. You could, but why? Enter NTFS reparse points. There's actually a Wikipedia article on NTFS reparse points: reparse points provide a way to extend the NTFS file system. A reparse point contains a reparse tag and data that are interpreted by a file system filter driver identified by the tag. What does this mean?
One use case is HSM systems — hierarchical storage management — where you have a huge file on NTFS that some software pushes to tape, leaving a stub inside the NTFS file system that looks like a normal file to the client. Now when the client opens the file, the open code sees: okay, this stub is a reparse point, and the extended data the reparse point carries points at a place somewhere on tape — it's on this tape at that offset. What you can do then in Windows is install a driver, so that when a client opens this file, the Windows kernel goes to the tape library and says: get me that file back. So this is software you can install in the Windows kernel to extend NTFS semantics. And this is, by the way, what the NFS server uses — we will see an example of this. So applications can use this for arbitrary blobs. It's a special marker on a normal file that says "I am a reparse point", and you can store stuff in there; essentially it's an extended attribute. When opening a file, NTFS filters can interpret the contents. This is also what Microsoft actually uses for symlinks: Windows has symbolic links, and they are stored as reparse points. If you double-click on such a reparse point — and I can demonstrate this here; I know demos never work — I have a file foo, and I will show you how I created it. I double-click on the file foo. Oh, okay, wait — as I said, I should never demo. Ah, foo.txt. Here it says "Text Document", which is just the description of a .txt file. I double-click on it, and what it says is that the file cannot be accessed by the system. Because this is a reparse point that happens to be named foo.txt, Windows believes "oh, we have to open Notepad", but Notepad can't access that file. The error code you get when you double-click on that file is STATUS_IO_REPARSE_TAG_NOT_HANDLED. You have to tell the server: I want to open this special file in a special way.
You have to set a flag. So a reparse point, as I said above, has a so-called reparse tag, which is a 32-bit integer. If you look at the Microsoft documentation, Microsoft uses these reparse tags and documents their use to a certain extent. And there are a lot of them. If you go to that website, there's a ton of reparse tags: reserved 0, reserved 1... What you see here — I hope you can read it — that's HSM, that's HSM 2, and so on and so forth. Filter manager, reparse tag, symlink. So this is what Microsoft defines in their spec: these sets of reparse tags, and you get the integer there. The symlink tag is 0xA-something with a C at the end. And we are about to use that. So now we have two kinds of users of these reparse tags. Do you remember WSL1, version one of the Windows Subsystem for Linux? They tried to run Linux applications on Windows, and they faced the same problem: Linux applications expect sockets and FIFOs and symlinks to work. And in version one, they actually used NTFS for your home directory, for your local files. And what they did is: they have this reparse tag "address family unix" — AF_UNIX — and they use that. And what you will see here — it must be somewhere — if you dig a bit deeper, what they tell you is: the contents of these reparse points are not meaningful over the wire. They were intended just for the WSL server side. So they define, as part of the data stored in such a reparse point: hey, we have a block device, a character device, a FIFO, and so on, with the obvious counterparts on Linux. What they did in WSL1 is: when somebody did a mkfifo, they created a file with a reparse tag, and in the contents of the reparse point said: hey, this is a FIFO. None of those contents are actually documented. And because that caused so much trouble, version two of WSL — is anybody actually using WSL? Some are. It's actually usable.
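For reference, the documented tag values involved in this section, as published in Microsoft's reparse-tag list ([MS-FSCC]) — I reproduce them from memory, so verify against the spec before relying on them:

```python
# Reparse tags from Microsoft's published list ([MS-FSCC]).
IO_REPARSE_TAG_MOUNT_POINT = 0xA0000003  # NTFS junctions
IO_REPARSE_TAG_SYMLINK     = 0xA000000C  # Windows symlinks
IO_REPARSE_TAG_NFS         = 0x80000014  # Windows NFS server special files
IO_REPARSE_TAG_AF_UNIX     = 0x80000023  # WSL1 unix domain socket
IO_REPARSE_TAG_LX_FIFO     = 0x80000024  # WSL1 FIFO
IO_REPARSE_TAG_LX_CHR      = 0x80000025  # WSL1 character device
IO_REPARSE_TAG_LX_BLK      = 0x80000026  # WSL1 block device

def is_name_surrogate(tag):
    # Bit 29 is the "name surrogate" bit: the file stands in for
    # another named entity, the way a symlink does.
    return bool(tag & 0x20000000)
```

Note how the symlink tag (0xA000000C — "0xA and then a C at the end", as on the slide) has the surrogate bit set, while the NFS tag does not.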
You can't really tell the difference from a real Linux — at least I can't. I mean, if you look at /proc, of course you will find differences, but for normal day-to-day use it actually works pretty well, because these days they are using a real ext4. Then there is a Windows NFS server. Pardon? Why? The question is why. I don't know. The Windows NFS server — this is what I'm going to present here, hopefully in a demo — they also have the same problem. A client does a mknod or a symlink or whatever, it does a mkfifo, and they have to store the data somewhere. And they define yet another set of reparse points. And if you look here, they actually have a definition of what goes into the data field: reparse tag, reparse length, and so on, in general. So they define symlink, character device, block device, and so on, and they actually specify what goes into the data field. For a symlink, the target goes in there; for a character device, you have two uint32s for major and minor; and so on. So they define what goes in there. I mean, you would have thought these guys would talk to those guys to share an implementation, but no. Why? The interesting thing is — I created a FIFO server side, and if you look at the properties of this FIFO — you have to trust me; the ones in the first row can confirm — you have an "L" here. It says "archive" and "L", L for symlink, if you look up the documentation. No, it's not a symlink, it's a FIFO. So their GUI is not really prepared for this. The GUI believes all files that are reparse points are symlinks. Alexander? Is this client side? I can demonstrate what I see. Because this is a local file, right? That's a local file that I created over NFS. So the NFS server created, on a local file system, something with this reparse point attached. Yes. So this directory here is a local disk share.
This is a local NTFS file. And what I did is export this via the NFS server and mount it from the client, which is here. That's my client with a mount — if you look at the top: NFS. And you can see in the left column here, I have symlinks, I have block devices, and so on. I created them with normal Unix commands over NFS, and this is what ended up on the NTFS file system server side: reparse points. Now, this is not too popular with Windows applications. The Windows Explorer believes all files with reparse points must be symlinks, because that's the most popular use of reparse points in the Windows world. OK, so they don't look at the reparse tag; they just see that this is a reparse point and assume it must be a symlink. Yes — Alexander's comment was, and I will show you that in the Wireshark trace in a minute: there is a special flag in the metadata of the file that says "I am a reparse point". And you can of course get into the details of that reparse point if you want to. But if you're Explorer, you don't care; you say it's a symlink. OK. Now, this is the discussion: do we use these guys or those guys to present to the client when Samba finds a symlink server side? For symlinks, we even have three options. WSL version one has reserved reparse tags — if you look at one of the lists that I've shown you, there are reparse tags for the individual subtypes, but they are not documented, they are not used anymore at all, and you don't get any interoperability with anything else. We could of course use them. So when Samba on disk finds a symlink or a block device, how do we present that? We have to make a choice. WSL defines reparse tags with undocumented content. NFS uses only one reparse tag. Pro NFS: we have documentation available. And so on.
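Given that the NFS server's layout is documented, parsing it can be sketched as below. The generic header follows [MS-FSCC]'s REPARSE_DATA_BUFFER; the character-device payload follows the talk's description — a sub-type field, then major and minor as two little-endian uint32s. Treat the payload offsets as assumptions to check against the spec:

```python
import struct

IO_REPARSE_TAG_NFS = 0x80000014  # the single NFS reparse tag

def parse_reparse_buffer(buf):
    """Split a REPARSE_DATA_BUFFER into (tag, payload).

    Header: ReparseTag (u32), ReparseDataLength (u16), Reserved (u16),
    followed by ReparseDataLength bytes of tag-specific data.
    """
    tag, length, _reserved = struct.unpack_from("<IHH", buf, 0)
    return tag, buf[8:8 + length]

def parse_nfs_chr_device(payload):
    # Sketch: skip an assumed 8-byte NFS sub-type field, then read
    # major/minor as two little-endian uint32s.
    major, minor = struct.unpack_from("<II", payload, 8)
    return major, minor
```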
And what we can do is write protocol-level tests against a benchmark, which is the Windows NFS server. So we have ways to create these things on Windows and just write tests, which is very good. Also, say you create a FIFO from a Windows client that has mounted the home directory of a user on a Windows server. If I do that, the Windows client will create a reparse point that an NFS client talking to that same file system on the Windows server will also see as a symlink — or as a FIFO, whatever. The same thing. And so this is why I would say: okay, I would like to use the NFS reparse tags. I have to talk to the cifs kernel developers. I think with Linux 6.8 they went a different route. Andreas, do you know? No. So I think they went a different route, but we need to talk. Coming to symlinks. Symbolic links, in BSD Unix — depending on how you look at it — are the best idea since sliced bread, or the worst nightmare that everybody falls over security-wise. Even the Rust infrastructure — I mean, Rust being a very security-sensitive language — had their symlink-race security bug. But we have to deal with them, we have to live with them; they are there. So what do we do when we see a symlink on the Samba server side? We can do that with the two ways that I presented. But as I said, Windows even has its own notion of symlinks. So if you create a symlink, depending on where you come from, you get one of three ways to represent it on NTFS. And this Windows flavor of symlinks actually works pretty well over SMB in the pure Windows world. For example, you can have a symlink in a directory on an NTFS that is shared over SMB, and the symlink target can be \\ip-address\sharename\directory. And if you want to cat that file, or cd into that directory from a Windows client, it will redirect to that server.
So you can have cross-server symlinks with the pure Windows NTFS notion of symlinks. Even better: when you try to open a symlink the Windows way — you double-click on that file — under POSIX you would typically just follow the symlink; if you mount over NFS, the NFS client has to take care of symlinks and follow them client side. But Windows does it a bit differently. When you open that symlink file, the server tells you: hey, you hit a symlink. And in the error response it will even tell you: by the way, the symlink points there. That saves at least one round trip — or several round trips — because when I hit a symlink, the response tells me directly where to go on the client side. And Windows is typically completely path based, so if I open a file \a\b\c\d and somewhere in the middle there's a symlink, they don't follow it server side, but they tell me: hey, go there, and by the way, I have parsed \a\b already, and c was the symlink. So if I have a long path with many components, they tell me: okay, the third component is a symlink. Okay, how do we create these special files? Protocol-wise, there's a special flag to the open call, and we just set the contents. And what we can do with Samba — what we don't want to do, and what we will never do: if a Windows client comes in and creates a symlink the Windows way, we will not create a symlink server side. What we will do — because Windows symlinks are also represented by normal files (ten minutes left) with some special contents, some special extended data — is the same. So if you do a mklink from Windows, we will create such a file, telling the client "hey, this is a symlink file", and the Windows client will just work as it does.
Okay, let me switch to a shell for the demo. What you can see is: ln -s foo bar. This means I make a symlink from bar to foo, I believe — I always get that wrong. Yeah. So what I did — and this is a file that actually lives on NTFS, shared via NFS — what we should see here now is this file bar; I created that. Now what I'm doing is running my little user-space tool, test start — and you know my password now, the one I always use for Windows boxes. What does this do? It creates a connection to that Windows server over SMB, and I just fetch all the metadata over SMB. And let me just tcpdump that. Oh, this is wrong — tcpdump cannot overwrite its own files, that's very strange, I know. OK, let me Wireshark this. And what I see is: it's a symlink. What I could have done, actually, is extend this command's output with where the symlink points — haven't done that yet; maybe on the train back home. Let's look at that Wireshark trace. In the background I have my connection for RDP running. SMB2. So here you go. It's a bit verbose, but what I want to point you at is: I try to open the file bar, which is the symlink, and it says: reparse tag not handled. Then I open the file again — and don't be confused by the create request; the create request is the catch-all open-file operation. And there I tell the server: hey, I want to open this file, and I don't want you to interpret it. I don't want the HSM engine to go running; I just want to open the HSM stub or the symlink as such. I want to see the metadata. And it says: OK, here you are.
And a bit further down, what I can do is get the reparse point data out of this file. And what I can see here is: oh, this is the NFS reparse tag, and by the way, this is a symlink with target foo. So this is data that the NFS server gave me — this blob here was created by the NFS server. And so we can just utilize it. And we will utilize it. Before I take questions, I have one more slide that I want to talk about: long-running compute jobs. Very quick overview. If you have your HPC job farm, the one thing that gets in your way is all this file security. You want NFS — no file security — for long-running compute jobs, because if a machine dies, well, this is a trusted environment, and you just want your jobs to keep running. SMB actually has secure provisions for this. What you can do with SMB is create a machine account and give the machines a password — essentially something like a keytab for Kerberos. And — this is standard Windows protocol — you can extend the connection to a share with yet another tree connect context, saying: OK, dear server, I know what I'm doing, you trust me by my machine account; for this connection to this share, please use this UID and GID. This is a standard SMB protocol extension, and this is what needs doing before we can actually claim success and say: OK, we can also cover these long-running compute jobs properly, like you can with NFS or any other file-sharing protocol. Not implemented yet — neither server side nor client side — but it's there. Yeah. Mark. Mark's question: so the machine account is authorized to impersonate any of the IDs? Correct. The comment was that SMB has a provision that you can trust a machine, identified by the machine account database — you know what I mean: you mark server side that this machine is trusted for doing the no-file-security thing, and there's a protocol extension to signal it.
OK, that's it, really — thanks for your attention. Any questions? No questions? This is not good. Just an observation: WSL version 1 — do you really want to implement that, for obscure data remaining on some obscure machines? I would suggest forgetting about it completely. The comment was questioning whether we want to go the WSL1 way with these reparse tags. Talk to Steve — Steve French, the main author of the cifs client; I mean, it basically is him. Steve French and Paulo Alcantara, those are the ones who, I believe, for Linux 6.8 have implemented the WSL1 tags — here, this one here. If you look at LWN, they can now create block and character devices, and I think they went the WSL1 way. But I mean, talk to Steve. The comment was: WSL1 is the only one available under Windows Server 2019. There you go. Any other questions? The question was how the current cifs client deals with these reparse points. That's actually what is covered here: they are starting to properly implement that. They already have support for symlinks, the Windows way, because, I mean, they are there. But they are starting to work on all the other ones that we were talking about. Mark was asking about the status, what's currently implemented. It's work in progress — parts already exist, other parts don't yet. So I don't know when we can actually claim that we do full SMB3 Unix extensions; I can't promise anything. One more — time's up? I think we are pretty strict here. Just come to me later.
Advances in Garage, the low-tech storage platform for geo-distributed clusters
If you can't hear me at the back, do not hesitate to ask me to speak louder. So, I'm Alex and I'm a co-founder of the Deuxfleurs Association, which is a non-profit self-hosting collective. We're a member of the CHATONS network in France. What that means is we're doing self-hosting, and we're trying to promote self-hosting as an alternative to putting everything in large data centers and relying on cloud providers. The thing is, actually doing this is relatively hard, and mostly it's hard because we want systems that are reliable, that are available most of the time. And if you have computers at home, you have a lot of issues. In particular, the computers we're using at Deuxfleurs are these kinds of computers: very cheap, old desktop computers. They're not meant to be servers, and we expect that they could crash at any time. These are some other examples that we had — and those are still in use, actually. So these are also old desktop computers, and we have a system which is based only on these kinds of machines. So they can die. We also possibly have issues with the internet or the electricity connection, because we're at home, so we don't have redundancy; it can go at any time. To alleviate these issues, what we do is distributed systems, and we have a multi-site, geo-replicated cluster. In our case, the Deuxfleurs cluster is in three places: some nodes in Brussels here, some nodes in Lille, and some nodes in Paris. Basically, the aim is to build a system that makes use of cheap hardware disseminated across all of these locations; the nodes can relay one another when there's an issue somewhere, and the whole thing stays up even if there are issues in individual locations. And this is one of the reasons why I call this a low-tech platform: we're using what we have at hand — cheap machines and regular internet connections. One of the main components in this platform is object storage.
I will not go too much into why object storage, except that it's very well adapted to flexible deployments, which are kind of inspired by what is done in the cloud. Indeed, Amazon S3 was created as a cloud product — it was introduced in 2006 — and it has since become a de facto standard; many applications are compatible with this object storage. So it makes sense to base our infrastructure on this kind of software, because we can just plug in all kinds of things which are already able to use this kind of storage layer as a backend. There are actually many alternative implementations of S3. MinIO is one of the most common ones; I think Ceph also has an implementation. What we discovered is that these implementations are not very well suited to geo-distributed deployments — deployments where nodes are in remote locations — because in that case you have higher latency between the nodes, which causes issues, and the system is basically a bit slower. Sometimes it's even really unusable. So Garage was made specifically for this use case. We make use of distributed systems theory — CRDTs in particular, which I will talk about later. Basically, the aim is to provide a drop-in replacement for Amazon S3, an S3-compatible storage system, which it is possible to run directly on this kind of geo-distributed cluster; the data is replicated at several locations, it's kind of transparent, and it's supposed to be reasonably fast — to not completely slow down all the applications which are running on it. One of the main ways we were able to achieve this was to use CRDTs and weak consistency. This is the theoretical explanation of what is going on in Garage, and I will have another slide talking about this later. But basically, we're trying to avoid so-called consensus algorithms like Raft or Paxos, because these algorithms have issues and are actually very sensitive to latency.
Just to list the issues in a clear way: the first of them is software complexity. I think Raft is actually a complex piece of software; it can be implemented badly, and if you do it wrong it can lead to various unacceptable outcomes. And of course there is the issue of performance, which I've talked about already. Algorithms like Raft use a leader, so the leader becomes a bottleneck for most requests in the system — you cannot really scale if you have a naive strategy with just one leader. It's also sensitive to high latency, because if the leader happens to be a node in a very far away location, everything has to transit through there and then come back. So if the leader happens to be the wrong node, everything in the system is going to be much slower. And if the system is disrupted and the leader goes down, the system has to take some time to reconverge — and that can take a long time, especially if the latency between nodes is high and they are not able to communicate very efficiently. So for these reasons we gave Garage a completely different design, based entirely on CRDTs internally, which solves most of these issues. Object storage is actually very similar to a basic key-value store, except that the values are objects — big blobs of data. Here we have an example: in the key, there's no notion of a file system hierarchy, so we just have the entire path, slashes included, in the key; the slash doesn't have any specific meaning. And the value is some metadata — here it's inspired by HTTP headers, because S3 is very strongly based on HTTP semantics — plus the binary data for each of your files. It happens that this key-value semantics actually maps very well onto CRDTs, and this is why we were able to make this work.
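A minimal illustration of the CRDT idea — not Garage's actual code — is a last-writer-wins register, close in spirit to how per-key metadata can be merged without a leader:

```python
class LWWRegister:
    """Last-writer-wins register, the simplest useful CRDT sketch.

    Every write carries a timestamp; merging two replicas keeps the
    write with the highest (timestamp, value) pair. Merge is
    commutative, associative and idempotent, so replicas converge in
    whatever order updates arrive -- no Raft/Paxos leader needed.
    """

    def __init__(self, timestamp=0, value=None):
        self.timestamp = timestamp
        self.value = value

    def write(self, timestamp, value):
        self.merge(LWWRegister(timestamp, value))

    def merge(self, other):
        # Tie-break on the value itself so the outcome is
        # deterministic even when two timestamps are equal.
        if (other.timestamp, str(other.value)) > (self.timestamp, str(self.value)):
            self.timestamp, self.value = other.timestamp, other.value
```

Because merging in either direction yields the same state, two nodes that exchange their registers in any order end up identical — the convergence property that makes leaderless replication workable.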
So, just to convince you in one slide that this is actually a worthwhile trade-off, this is one of the best results we have for Garage: a performance comparison of Garage versus MinIO. It's a simulated deployment where the nodes are simulated on a single machine and we add some artificial latency between them. Here we have 50 milliseconds between nodes, so a pretty long delay, and basically we can see that requests take a duration which is a multiple of the round-trip latency; but for MinIO it's a very high multiple, so some very common requests like RemoveObject or PutObject take more than one second, while for Garage we were able to bring this down to somewhere between 300 and 500 milliseconds. Quite an improvement. The main focus of this talk is to discuss recent developments in Garage, because we were here at FOSDEM two years ago and I think many people in the room are already aware of Garage. Two years ago we were at the beginning of a grant by NGI Pointer, our first grant, which allowed us to release version 0.6.0, the first public beta that we launched. It was the point where we considered that we had a basic feature set which was pretty good, and we could ask people to come try it. Actually many people were interested, and this is also the point where we started to get external contributions to the project. We did a FOSDEM talk about it at the time. In April we released version 0.7, which focused mostly on observability and integration with the ecosystem. We added support for metrics and traces using OpenTelemetry, which is a standard for exporting observability data.
We also added some flexibility: while we had originally built the system around having three copies of everything, expecting nodes in three different data centers, people were also willing to run it with fewer copies, so we added support for one or two copies, and we also added some weaker consistency modes, which are useful to make the system faster or to help recover data in some scenarios. We added integration with Kubernetes for node discovery, so the cluster is able to set up the links between nodes automatically, and we added an administration API, which is useful for setting up the cluster. It's basically a very simple REST API where you can create buckets (which are storage spaces), create access keys, give rights to access keys, and so on. I will show a little bit of the monitoring part. This is a Grafana dashboard that we made for Garage, and as you can see it's pretty complete. Here are the requests going through the S3 API endpoints, and here are the requests going through the web endpoints, because Garage supports serving buckets directly as websites, which is something we make heavy use of at Deuxfleurs. Here we have the error rate, and more interestingly, here we have some internal metrics: this is the data being read from and written to disk on the nodes, these are internal metrics for the RPC communication between nodes, and these are some queues, showing how much data remains to be processed. Just a quick note here: the GC queue is a common point where people ask, why is this queue not going to zero? It's normal that it's not going to zero, because items stay in that queue for 24 hours before they're processed, just for information. But this queue should be close to zero, and this one too, and if they're not, your system is probably under too much load.
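The admin API calls mentioned above (create a bucket, create an access key) are plain authenticated HTTP requests. As a sketch only, here is how such requests could be assembled in Python; the port, paths, payload fields and token below are assumptions for illustration, so check the actual admin API reference before using them.

```python
import json
import urllib.request

# Assumed values for illustration; not taken verbatim from Garage's docs.
ADMIN = "http://localhost:3903"
TOKEN = "secret-admin-token"

def admin_request(method, path, body=None):
    """Build (but do not send) an authenticated admin API request."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(ADMIN + path, data=data, method=method)
    req.add_header("Authorization", "Bearer " + TOKEN)
    req.add_header("Content-Type", "application/json")
    return req

# Hypothetical calls: create a bucket, then an access key.
create_bucket = admin_request("POST", "/v1/bucket", {"globalAlias": "backups"})
create_key = admin_request("POST", "/v1/key", {"name": "backup-writer"})
```

Sending the requests with `urllib.request.urlopen` would then perform the actual cluster setup against a running node.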
And we also have tracing, so if you want to go further into how Garage is handling a request, you can use this feature. Here we're exporting traces to Jaeger, and this is a trace of a pretty standard ListObjects API call. We can see that ListObjects first reads some data to get access information on the access key and the bucket; this is a very fast call, because all this information is replicated on the node and it can just read it locally. Then it does the actual requesting on remote nodes for the list of objects it should return. We see here that it's sending a request to two nodes, and the request takes a bit of time before it completes. This is a pretty slow cluster, so it's taking 100 milliseconds, but on faster hardware it can of course be much faster. So this was 0.7, and then we did 0.8, at the end of the NGI Pointer grant. For 0.8 we focused heavily on performance. The first thing we did was change the metadata engine, because we were using sled and it had a lot of issues, which I'll talk about, and we made various performance improvements across the board, with some pretty good gains in this area. In terms of features we added quotas. This is not a feature from Amazon, but a feature you can set on Garage: limiting a bucket to a maximum total size or a maximum number of objects. It's pretty useful in a multi-tenant setup where we'd like to lend some storage space to someone but have them restrained to some fixed capacity. And of course there were the regular quality-of-life improvements, etc.
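The quota check described above amounts to a simple admission test on each write. A toy model (illustrative only, not Garage's implementation or field names):

```python
def check_put(bucket_usage, quota, object_size):
    """Reject a PUT that would exceed the bucket's quota.
    Toy model of per-bucket size/count quotas."""
    if quota.get("max_size") is not None and \
       bucket_usage["bytes"] + object_size > quota["max_size"]:
        return False
    if quota.get("max_objects") is not None and \
       bucket_usage["objects"] + 1 > quota["max_objects"]:
        return False
    return True

# A bucket near both limits: 900 of 1000 bytes used, 9 of 10 objects.
usage = {"bytes": 900, "objects": 9}
quota = {"max_size": 1000, "max_objects": 10}
```

A 50-byte object still fits here, while a 200-byte one would be rejected.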
So, to talk a little bit about the metadata engine: we were using sled, an embedded key-value store written in Rust. We thought, it's written in Rust, it's pretty good, Garage is also written in Rust, so let's just integrate them; and at the point when we started Garage, sled was one of the most popular key-value stores for Rust. But it's not very well maintained anymore, and it had many issues. It was producing very large files on disk, because it was just writing and writing and writing; probably this was some internal way of optimizing performance, but it was not very satisfactory for us to have data files that were 10 times too big. The performance was also pretty unpredictable, and on spinning hard drives it was actually very bad. And from a developer perspective it has some API limitations which have prevented us from implementing some specific features in Garage; hopefully when we get rid of sled we can actually do that. As an alternative we added LMDB. LMDB is a key-value store which is used, I think, in OpenLDAP and some other software; it's a pretty established piece of software at this point, so we consider it quite stable. It has good performance and it keeps the files on disk at a reasonable size, so this is the default now. We also have SQLite as a second choice. Originally we had not optimized the SQLite backend that much, so it was not recommended and we had not run many tests, but by now it's probably okay to use as well. Just to show some comparison, we did some benchmarks, and basically LMDB is much faster than sled on all these common API endpoints; roughly twice as fast, well, not exactly twice, but significantly faster. SQLite was not optimized at that time, and I do not have updated data for it now.
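For reference, the metadata engine is selected in the node's configuration file. A minimal illustrative fragment is shown below; the paths are placeholders and option names may vary between Garage versions, so treat this as a sketch rather than a reference configuration.

```toml
# garage.toml, illustrative fragment only, not a complete configuration
metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"

# "lmdb" is the recommended default; "sqlite" is an alternative;
# "sled" is deprecated.
db_engine = "lmdb"
```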
Another optimization we made is block streaming. The idea is that when Garage receives an object, it splits it into pieces of 1 megabyte by default and stores those pieces on data servers around the cluster. When you want to read the data, your API request goes through some Garage node, which receives the request, looks at the object's metadata, and determines: okay, we have to get this part, this part and this part from these different nodes in the cluster. So it does an internal RPC request to the storage node which has the actual 1-megabyte data block. Before, it worked like this: the first node, the one receiving the API request, would read the whole 1 megabyte into RAM and not send anything to the client until then. So the client was just waiting while the data was being transferred between these two nodes inside the cluster, when it could have started receiving data earlier. The optimization we made was conceptually simple, but a pretty big change in the code: start sending the data to the client as soon as it arrives at this intermediate node, keeping only a small buffer of data that has been received and is waiting to be sent on. With this fairly small change we managed to significantly reduce the time-to-first-byte measurement. This measures the following: when you do a request to Garage to get an object, you specify the path of the object and send your HTTP request with all the headers, then you wait for the server to reply; the server sends some headers saying okay, the object is coming, and then it starts streaming data. Time to first byte is the time between the moment you start sending the request and the moment the first actual bytes
of the data file come back. Here we are, again in a simulated deployment, with pretty slow networking, 5 megabits per second, so it's actually very slow. Before the optimization, Garage was here: we would have to wait pretty much two seconds before any data came back, because a whole 1-megabyte block was being transferred over this very slow connection before it could be returned. MinIO has average performance here, and with the optimization Garage is very fast: we're able to return the first bytes of the data quickly. This is important because for websites, for instance, you want to display the content as fast as possible, and even for a big file the first bytes may be very relevant: for an image you can have a preview from the first bytes, and for an HTML file you get pretty much everything. So minimizing this time is critical to user experience, and I think we pretty much managed to do it. We also made various other improvements to the code paths in Garage. On the bottom we have 0.7, then 0.8 beta 1 and beta 2; here we removed some fsync calls, and fsync is now completely optional. We're almost matching MinIO: this is raw throughput when you're continuously reading and writing big objects to Garage, and while the throughput is still a bit worse than MinIO, it's getting pretty close. There's still room for improvement in this area; we haven't done much more work on it, but it's definitely something that could still be optimized, I believe.
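The difference between buffering the whole block and streaming it through can be sketched with plain Python generators (a simplified model, not Garage's actual code): the streaming relay yields each chunk as soon as it arrives instead of waiting for the full block.

```python
def fetch_block():
    """Simulates a storage node sending a 1 MiB block in small chunks."""
    for _ in range(16):
        yield b"x" * (64 * 1024)   # 16 x 64 KiB = 1 MiB

def relay_buffered(chunks):
    # Old behaviour: read the whole block into RAM, then send it at once.
    block = b"".join(chunks)
    yield block

def relay_streaming(chunks):
    # New behaviour: forward each chunk as soon as it is received, so the
    # client sees its first byte after one chunk instead of sixteen.
    for chunk in chunks:
        yield chunk

first_buffered = next(relay_buffered(fetch_block()))
first_streamed = next(relay_streaming(fetch_block()))
```

With the buffered relay the first thing the client receives is the entire megabyte; with the streaming relay it is a single 64 KiB chunk, which is what drives the time-to-first-byte down.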
So then came the end of the NGI Pointer grant. We did a bunch of conferences in France, not me, but other people from Deuxfleurs, and then we started another grant, from NGI Zero through NLnet, which led to the release of 0.9. And 0.9 was actually a pretty big release. We added support for multiple HDDs per node, which is a big feature, because now a single Garage node can talk directly to each hard drive: you don't have to do pooling at the file system level or use a RAID system. You just format each of your drives independently with a file system, each of them gets a directory as a mount point, and Garage uses all of these mount points and shares the data between the drives. This is probably the model which allows the best performance on a server with multiple drives. We also added some features for S3 compatibility, including basic lifecycle support. Lifecycle is a feature which allows you to clean up stuff that accumulates in a bucket. For instance, in S3 you can start uploading an object using a multipart upload: you initiate the upload at one point, then you do individual requests to add pieces of the file, and once you're finished you send a complete request, and then the file gets stored completely in the system. It can happen that such a multipart upload gets aborted in the middle, the requests never finish, and in that case some data is left lying around in the cluster. If you configure a lifecycle on your bucket, using this very standard S3 API, you can get rid of all this stale data after, say, a delay of one day or something like that.
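The cleanup rule just described can be sketched as a pure-Python filter (illustrative only; in practice you would configure this through the standard S3 lifecycle API rather than write it yourself): drop multipart uploads that were started more than a day ago and never completed.

```python
import time

DAY = 24 * 3600

def expire_stale_uploads(uploads, now, max_age=DAY):
    """uploads: dicts with 'started' (epoch seconds) and 'completed' (bool).
    Keeps completed uploads and recent in-progress ones; drops stale
    aborted ones, mimicking a lifecycle rule on incomplete uploads."""
    return [u for u in uploads
            if u["completed"] or now - u["started"] <= max_age]

now = time.time()
uploads = [
    {"id": "a", "started": now - 2 * DAY, "completed": False},  # stale: dropped
    {"id": "b", "started": now - 3600, "completed": False},     # recent: kept
    {"id": "c", "started": now - 5 * DAY, "completed": True},   # finished: kept
]
kept = expire_stale_uploads(uploads, now)
```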
Another thing we added for S3 compatibility is retries of multipart upload parts. In S3, if uploading a part fails, maybe because your network broke, you can try that part again and still complete your multipart upload. In the first versions of Garage we did not have that, and you would have to restart the upload from the beginning; now you can resume a single part. LMDB is now the default, we're deprecating sled, and we have a new layout computation algorithm, which I will talk a little bit about. As I said, Garage is meant to work on geo-distributed clusters: you have nodes in different geographical locations, which we call zones in Garage. Here we have three different zones, and the data is replicated such that each file is on different zones, for optimal redundancy. Here is an illustration: if we have five zones, for example, the blue file will be in Belgium, France and Switzerland, so in three different places, and the red file will also be in three different places, not necessarily the same ones; here it's the UK, France and Germany. The idea is that we do this using a pre-computed layout, a table which says: the data in the cluster is divided into 256 partitions, and each partition is assigned to a fixed set of three servers. For each partition we have to pick three servers in different places in the cluster, and we also have to balance the quantity of data that will go to each server. For 0.9 we added an algorithm which does this in an optimal fashion. The table is computed once, when you set up the cluster or when you add new nodes, and then it's propagated to everybody, so every node knows the table and knows where to look for the data. We actually published a paper if you're interested in the details of the algorithm.
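A heavily simplified sketch of the layout idea (the real algorithm is an optimization described in the paper; this toy version just round-robins over zones and ignores node capacities): split the key space into 256 partitions and assign each partition to three nodes in three distinct zones.

```python
import hashlib
import itertools

# Toy cluster: node -> zone (names are illustrative).
nodes = {"n1": "be", "n2": "be", "n3": "fr", "n4": "ch", "n5": "de"}

def compute_layout(nodes, partitions=256, replicas=3):
    """Round-robin over zones, then over the nodes within each zone.
    The real Garage algorithm balances by node capacity; this does not."""
    by_zone = {}
    for n, z in nodes.items():
        by_zone.setdefault(z, []).append(n)
    zone_cycles = {z: itertools.cycle(ns) for z, ns in by_zone.items()}
    zones = sorted(by_zone)
    layout = []
    for p in range(partitions):
        chosen_zones = [zones[(p + i) % len(zones)] for i in range(replicas)]
        layout.append([next(zone_cycles[z]) for z in chosen_zones])
    return layout

def partition_of(key, partitions=256):
    # Which partition (and hence which three nodes) stores this key.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:2],
                          "big") % partitions

layout = compute_layout(nodes)
```

Every node keeps a copy of this table, so looking up where an object lives is a local hash plus a table read, with no coordination needed.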
Okay, so that was 0.9, and then we went on and worked on 0.10. 0.10 is currently a beta version, and I think we will not make a stable 0.10, because it's not worth updating to 0.10 and then updating again to 1.0 when it comes out; I think we will just leave 0.10 at beta and do the 1.0 in May. But I'll talk a little bit about the 0.10 beta. It's mostly focused on fixing some consistency issues that would happen when you added servers to the system or removed servers. I will go into a bit of distributed systems theory to explain why exactly this is an issue and what solution we implemented. Since, as I've said, Garage is not based on consensus, we have to work with weakly consistent primitives. This means working with conflict-free replicated data types, CRDTs. These are not transactional; they are very weakly consistent, very freeform to use. And there's the last-writer-wins register, which is pretty much the fundamental building block of Garage. CRDTs alone are not enough to ensure consistency, so what we add is a read-after-write guarantee, which is implemented using quorums. I will try to explain how it works; I hope you will understand, I think it's not so complicated, but it's a bit theoretical, so hold on.
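Both building blocks named here, the last-writer-wins register and the quorum-based read-after-write guarantee, fit in a few lines of Python. This is a textbook sketch, not Garage's implementation: each value carries a timestamp, merging keeps the newer write, and a write or read only succeeds once 2 of the 3 replicas have answered.

```python
class LwwRegister:
    """Last-writer-wins register: a minimal CRDT sketch.
    Ties are broken on the value so that merges stay deterministic."""
    def __init__(self, value=None, ts=0):
        self.value, self.ts = value, ts

    def merge(self, other):
        if (other.ts, str(other.value)) > (self.ts, str(self.value)):
            self.value, self.ts = other.value, other.ts
        return self

# Three replicas of one register; a quorum is any 2 of the 3.
replicas = [LwwRegister(), LwwRegister(), LwwRegister()]
QUORUM = 2

def write(value, ts, reached):
    """Deliver the write to a subset of replicas; succeed on quorum ack."""
    for i in reached:
        replicas[i].merge(LwwRegister(value, ts))
    return len(reached) >= QUORUM

def read(reached):
    """Merge the states of a quorum of replicas. Any read quorum
    intersects any write quorum in at least one replica."""
    assert len(reached) >= QUORUM
    out = LwwRegister()
    for i in reached:
        out.merge(replicas[i])
    return out.value

# The write reaches only replicas 0 and 1 (replica 2 lags behind)...
ok = write("a", ts=1, reached=[0, 1])
# ...but every possible read quorum still sees the value "a".
results = [read(pair) for pair in ([0, 1], [0, 2], [1, 2])]
```

Because merge is commutative and idempotent, it does not matter in which order responses arrive, which is exactly what makes this work without a leader.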
So read-after-write means: if client one performs a write operation and the system returns to the client, okay, your write is saved in the system, and then another client sends a read for this data after the write has returned, then client two will read at least the value x that was written, or a newer value. In practice it means the system evolves between states: for instance, we have the state where the system stores nothing, and then we can store some value a, or some value b; and if this is a basic set, and you have stored a on one node and b on another node, then when the two nodes merge they will have stored both a and b. But let's do an example for the writes. These are the three storage nodes, and we suppose a client sends a write operation for value a. The value a is sent over the network to these three nodes. At some point, maybe the purple node receives the value a, moving from knowing nothing to knowing a; then the green node also moves from knowing nothing to knowing a when it receives the message. These two nodes return to the client: okay, I've stored the value a. At this point the client has received two responses out of three, which is what we call a quorum, and the client considers the data stored in the system, even though the third node has not received it yet. This is the point where a read request can start. For the read, the client asks all three nodes to return the value they have stored. Maybe the first node to answer is the red node, which has stored nothing, so the read first receives an empty value; but it waits for another response, and that other response will
necessarily come from one of the other two nodes, so it will necessarily include the value that was written, and the client just merges the two responses. This is why we use CRDTs, for this merge operation, and consistency is guaranteed. Maybe at some later point, through some synchronization mechanism, the red node will catch up and also receive the value. We have this in algorithmic form as well. Now, the issue is that we rely very strongly on these quorum properties. If we have three copies of the data, a quorum is at least two nodes out of three. But what happens when you remove some nodes and add some others to the system? Some data was stored, say, on the nodes in red here, and in the new system the data is being moved and should be stored on the green nodes. If you take a write quorum on the red nodes and a read quorum on the green nodes, there is not necessarily an intersection of one node that has seen both the write and the read, and consistency is basically broken. So the question is: how do we coordinate in this situation, and how do we ensure consistency even while the cluster is rebalancing data? The solution is a bit complex, but basically we need to keep track of what data is being transferred between the nodes; we use multiple write quorums, so we write with a quorum on the old set of nodes and on the new set of nodes, and we switch reads to the new nodes only once the copy is finished. This is something we implemented in the context of the NGI grant. We did some testing using a tool called Jepsen, which is very good for validating this kind of thing, and as you can see, in Garage 0.9 we had consistency issues in most of our runs, while in 0.10 all runs are green except one which failed; but at least there was no run where the data was plain wrong, which is a very good result for us. Okay, so this was version 0.
10. Now we're here at FOSDEM, and we're looking forward to releasing a version 1.0 in April or May. Basically we're going to focus on security and stability: there's a security audit that will be done by Radically Open Security, miscellaneous features will be improved or added, and there will be improvements to the user experience and some refactoring. That's it for 1.0; hopefully we'll have it out in April this year. And beyond that, we have a survey going on in the community right now, and this is the list of the most requested features from the users of Garage, and there's actually a lot of work to do. The first item is a web interface for cluster management, I guess for visualizing the state of the cluster and setting up new buckets and new access keys. Then there's S3 versioning, a feature of Amazon S3 where you can keep historical data in the bucket; it's very good for a backup system where you don't want to override data accidentally, and it's a pretty crucial feature that we would need to have. ACLs are here, monitoring, and various other things. And this is the point where I'm calling for help, because there's a lot of work and I cannot do it myself; if anyone wants to step in and help us with this, please do. We can probably find some more funding. Actually, we do have some funding in progress for someone who would like to do a PhD on this system in relation with Garage, so if anyone wants to do a PhD in France working on this kind of thing, come to us, we have this application going on. We can also probably ask NLnet for more money, since they have funded us once, and NGI also once, so we can probably get more funding if there's some specific task that is planned and somebody willing to do it. Okay, and I will spend the last few minutes of this talk explaining a little bit how you can operate Garage, for people who have not run it or who are willing to
scale their clusters to bigger systems. This is what I would call the main screen of Garage: when you interact with the cluster, always start by running garage status, and it will tell you if everything is fine. This is a five-node cluster and everything seems to be fine, but you might have failed nodes, meaning the connection could not be established; something is wrong and you should fix it. Garage is made like a cake of different layers. On top we have the S3 API (we also have a custom API which I'm not talking about in this talk), and the S3 API is implemented using an internal key-value store for the metadata and a block manager for the actual data of the big objects, and then we have the systems which maintain consistency. To be a bit more specific about what's going on, we have these three metadata tables. The first one is the list of objects in the system. The second is the list of versions of objects; it's a bit different, because an object can have a version which is currently stored in the cluster and a version which is currently being uploaded, so multiple versions can exist for the same object. And each version references a bunch of data blocks, so the third is the table which holds the references to actual data blocks. All of these tables are sharded across the nodes, and in particular for the block reference table, if a node has the shard for some references, it means it's also responsible for storing the blocks associated with those references. From this metadata table we derive a local counter of how many references exist for each block, and then we have a resync queue and scheduler which is responsible for ensuring that the locally stored data blocks actually match the blocks which have a reference in the store. So we have this block resync for data blocks, and this Merkle
tree based system for the metadata. If you run the garage stats command (so not status, but stats, a different command), you will get some information about the internals. These are the metadata tables, and you can see objects, versions and block references: these are the number of items in each table, alongside the number of items in the Merkle tree, which is always a bit bigger. Then you have the number of RC (reference counter) entries for the block table, which is the number of blocks that actually have a reference in the system. Here we have 42,000 data blocks but 334,000 block references, which means each block is referenced by about eight different objects on average. Then we have some information on the actual nodes. The partitions column basically says how many of the shards in the tables are assigned to each node: more partitions means proportionally more storage space used on that node. And this is a metric reported by the node itself; it measures how much space is available on disk, not the used space but the available space, for the data partition and for the metadata, which is not necessarily on the same drive. From all this information Garage is able to tell you how much data you can still store on the cluster; here, 600 gigabytes. If you go even further, you can get this list of workers. Workers are background tasks which are running in Garage all the time: there are the block resync tasks, which copy data blocks between nodes when they're missing, and there are synchronization tasks for each of the metadata tables. You can change the parameters of these tasks. For instance, for the block resynchronization you have resync-tranquility and resync-worker-count, and tranquility is a parameter which can be increased to make the system go slower and use less I/O. If
you're saturating your I/O, you can increase the tranquility, and if you want it to go faster, you can just put it to zero. Then there's also the worker count: you can set it up to eight, and then you have eight parallel threads sending and receiving data blocks over the network. There are some potential limitations if you're running extremely big clusters. You probably cannot run with more than about 100 nodes; I mean, you can, but then the data will not be very well balanced between the nodes, because we're using only 256 partitions. We could probably compile a version of Garage with more, but that's currently not the case. On the metadata side, if you have one big bucket containing all your objects, you will also have a bottleneck, because the object table will store the list of objects on only three of all your cluster nodes; so if you have lots of data, split it over different buckets. Also on the side of the data blocks: if you have a 100-megabyte file and your block size is 1 megabyte, your file is going to be split into 100 different files, so you end up with a lot of small files on disk. You can increase the block size to reduce the number of files; with more files, the processing of the resync queue can also be slow, and this is of course also dependent on your networking conditions. Some advice for actual deployments. For the metadata, if you're going to run a very large cluster, we recommend mirroring over two fast NVMe drives; ZFS is possibly a good choice, since Garage itself does not do checksumming on the metadata, so it's good to have a file system that does it for you. LMDB is the recommended storage engine. For data blocks it's a bit different and we have other recommendations: we recommend using an XFS file system, because we already do checksumming for each block, since we always
compute hashes of the blocks in Garage, so you do not need a file system which does this checksumming again; it would be wasteful. Just format your partitions as XFS, which is one of the fastest file systems, and store your data directly on it. If you have a good network and nodes with a lot of RAM, you can increase the block size, to 10 megabytes for instance, and you can tune these two parameters according to your needs. And of course you can do some more global tuning: split your data over several buckets, use fewer than a hundred nodes if possible (or come talk to us and we can work out a solution), and you can also use gateway nodes, which are a good way to make requests go faster: if you have a local gateway on the same server as the client, it can route the request directly to the data server, and you can potentially avoid a round trip. We have not made any deployment bigger than 10 terabytes at Deuxfleurs, but some people actually have, as we learned from the survey, and if some of those people are in the room, it would be great to share your experience. And with this I think I've talked enough. Garage is available as open source software on the Deuxfleurs website, written in Rust, and we have a Matrix channel and an email where you can contact us. I'm taking some questions. So the question was: if you store websites on Garage, can you integrate with DNS? Basically we copied the semantics of Amazon, where you can have a bucket whose name is the domain of a website, and Garage will route requests to the data according to the Host header of the HTTP request. So you just have to configure your DNS server, which is something you do outside of Garage, to route the requests to your Garage server, and then Garage will select the right bucket with the right content based on the name of the bucket. And you should add some reverse proxy
probably in the middle if you want TLS, because Garage does not do TLS. Yeah, it's also because when one of those website servers goes down, you need to reroute to another one; at Deuxfleurs we have a solution for that, but it's external to Garage, it's more tooling. So the next question was: in all the examples you have effectively one node per zone; is that by design, or can you have multiple nodes per zone? I think the examples were just not very good, but yes, of course you can have multiple nodes in a single zone. I think in this graph, no, this is not the right one, but there is an example where we have several nodes in the same zone; it's not a problem. And the question of, if everything else falls and you only have the one zone remaining, will the nodes still try to balance the data between themselves, or is that effectively "you're in trouble"? So the question is how data gets balanced between the nodes if you have one zone which has only one node, and maybe that node is smaller. Garage tries to preserve this property of having three copies in different places. You can ask it for only two places, but by default it's three, and this means that if you have only three zones and one is a smaller server, then you have a smaller total capacity for the cluster. Yes? So the question is: why did we integrate multiple-disk support instead of having multiple nodes in the same zone? I think one of the most important reasons is that this way you can reduce the total number of Garage processes and entries in this table; the table has only so many rows, and if you start having many different nodes, it's not going to be well balanced, so reducing the number of nodes helps us be better balanced
basically. Yes, so the question is: many of the design points match OpenStack Swift, and have we investigated using it? I personally have not used OpenStack Swift and I have not looked much into it. Yes, so the question is: despite putting this much effort into multi-node deployments, is it still worth running the system on a single node? Many people are doing it, and I think one of the reasons is that Garage is pretty simple to set up and to use, so it's definitely possible. There are also other solutions which are good for single-node setups, so try it out and figure out what works best for you. Okay, I think we're done for this talk, thank you.
MicroCeph: Get Ceph Up and Running in Minutes
Hello, welcome. My name is Peter Sabaini, from Canonical. I'm a software engineer there, I work on various Ceph stuff, and I'm very excited to present MicroCeph, with the tagline "Get Ceph up and running in minutes", unlike my slideshow. So, problem statement. MicroCeph packages Ceph. This is a big, complex system with distributed configuration, distributed components, a complex bootstrapping procedure and complex operations. It also has non-trivial hardware requirements. It's not like you can just download a package, install it on your notebook and be ready to go. This impacts uptake and adoption among users. So if you're, for example, a famous physics research organization with thousands of nodes in your storage cluster, you probably have trained staff on hand 24/7. So you're good, you don't need MicroCeph. If you don't have a team of trained experts on hand, maybe MicroCeph is something for you. So what is MicroCeph? MicroCeph is a single-package Ceph cluster. Everything is in one file. We designed it for simple setup, so you can get a running Ceph cluster in four command lines. And it runs on your notebook; you just need one node with, obviously, one hard disk. So the simplest possible Ceph cluster you can do is: install MicroCeph, bootstrap the cluster, and add some simulated OSDs, disk drives. These are loop files in this case, no extra block devices required. Then wait a few minutes and your Ceph cluster should be ready to go. How did we do this? MicroCeph is a snap package, as you might have guessed. Snap packages have the benefit that you're completely isolated from the host system. All the userland is in a separate namespace; you just need a kernel, network devices, block devices, hardware, etc. to get up and running. This gives you good isolation from the host system and a consistent environment across different operating systems. Some other goodies: isolation from the host system also means its access is isolated.
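The four-command quickstart described here can be sketched as follows; this is not a verbatim copy of the demo, and the exact flags may differ between MicroCeph releases:

```shell
# Single-node MicroCeph quickstart, roughly as described in the talk.
# Requires snapd and root privileges.
sudo snap install microceph            # install the single package
sudo microceph cluster bootstrap       # bootstrap a one-node cluster

# Add three simulated OSDs backed by 4 GiB loop files -- for labs and
# testing only, never for production.
sudo microceph disk add loop,4G,3

# After a few minutes the cluster should report as healthy.
sudo microceph.ceph status
```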
The snap package just cannot do anything it wants on the host system, which is good for safety, security and robustness reasons. And you have standardized risk levels: if you want to install release candidates, etc., there's a standardized way to do this. A little bit of an overview of the MicroCeph architecture. You have a service management daemon that manages the standard Ceph components and has a distributed database, dqlite, for storing configuration and node topology. Also included in the snap package is a CLI that talks to the service management daemon via an API. All this is built from standard Ubuntu/Debian packages, no special binaries involved. I mentioned the service API; everything in MicroCeph happens via this service API. Things you can do with the API: listing block devices, adding or removing nodes, adding and removing disks. Everything works via the API, and the included CLI is just another client for this API. So this is obviously great for integrating it into other systems. Some more internals. MicroCeph is built on the MicroCluster library, which provides this distributed configuration database, using Raft for consensus. It also provides cluster membership and an API framework. I already talked a little bit about scaling down, so single-node systems work. One important component here is that we automatically manage the CRUSH rules from Ceph. This means that as you start up with a single node, you get a failure domain of OSD, so in effect your single-node clusters work out of the box, but if you add more nodes, your resiliency and your failure domain get scaled up automatically. It's also possible to provide custom CRUSH rules. This is important, for instance, if you go for larger failure domains, say rooms or racks; you can implement this. MicroCeph itself doesn't know about your rooms or racks, but it won't step on your toes if you provide a custom CRUSH rule set.
Ceph is famously scalable to thousands of nodes. MicroCeph's scalability upwards is primarily bound by the Raft algorithm used in the dqlite database. For performance, I would like to note that we're not sitting in the data path anywhere for Ceph operations; you get the standard Ceph performance behavior with MicroCeph too. Some integrations. MicroCeph is the storage back end for a number of projects in Canonical, for instance Sunbeam, MicroK8s, MicroCloud and LXD. Also, if you're running Juju models, there's a charm available, currently in beta, to integrate MicroCeph into your Juju clouds. Last but not least, there's a nice little GitHub action that we provide to integrate MicroCeph into your GitHub CI workflow. So if you need, for instance, an S3 endpoint for your testing pipeline, this is an action that would help you with that. Okay, on to demos. I prerecorded these demos for time reasons and also because I'm very bad at talking and typing at the same time. So let's see how this goes. This is the single-node setup we talked about before. I'm going to install the single-node MicroCeph cluster, give it some simulated disks and enable a RADOS Gateway, which gives you an S3 endpoint. Yeah, installation. We have the standard stable risk level set here, so this is what you get by default. You can see my DSL connection here. So we bootstrap the cluster; this is done pretty quickly. We can see now that we have a few services running already, but no disks. Then we add some simulated disks. These are just loop files. This is useful for a lab environment or for testing; don't use it for production. For production you would use separate block devices, obviously. But if you want to get going on your laptop, that's the way to do it. We enable the RADOS Gateway; you can see it is active here now. We create a RADOS Gateway user, the standard Ceph way to do it. It's a little ugly command line here. And yeah, we're done.
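A sketch of the RADOS Gateway part of this demo; the user id and display name here are made up for illustration, not copied from the recording:

```shell
# Enable the S3-compatible RADOS Gateway service on the node.
sudo microceph enable rgw

# Create an S3 user -- the "little ugly command line". The uid and
# display name are hypothetical examples.
sudo radosgw-admin user create --uid=demo --display-name="Demo user"

# The command prints the generated access and secret keys, which any
# S3 client (s3cmd, aws-cli, ...) can then use against the endpoint.
```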
We can use our favorite S3 client to access our new RADOS Gateway endpoint. Just to prove that it works, we create a bucket here and put some image up in this bucket. So that's it for this demo, the simplest possible case. Let's do something a little bit more complicated. Say we've got a few extra nodes now; we want to expand the cluster and provide it with real block devices. This is the way we do it. I'm now using the candidate risk level, because I want to use some features of MicroCeph that didn't make it to the stable risk level yet. To cluster MicroCeph, you need to get a token from the bootstrap node, so the first node that we provided, like this: name the node you want to add, get the token for it, and provide that token to the node you want to add, here. And yeah, small typo; these happen as well. And now all our nodes are clustered. Let's check MicroCeph status. We can see all our new services here, but the new nodes don't have any disks yet. So let's add some disks. What I'm going to do here is use a feature that comes from the release candidate, which is automatic probing for empty block devices. Anything that's not mounted and is clean we take as a block device, with this switch. Let the thing settle a little. You can see there are lots of virtual disks from KVM, and now we have a lot of disks in our cluster. The Ceph cluster is still settling a little bit, but we suddenly have a lot more space available. One thing we can do is provide a second RADOS Gateway endpoint. We can see that the data we put in before is still here, which is reassuring. And what we'll try now is to put another OSD on the first node, but this time we want to make it encrypted. Full disk encryption is something we provide here. It relies on the dm-crypt kernel module. Not all kernels have that, so that's a little bit of a gotcha; you need to make sure yours does.
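The encryption gotcha just mentioned translates into a couple of snap commands; these follow the MicroCeph documentation, and the device path is illustrative:

```shell
# Allow the snap to use the dm-crypt kernel module (not connected by
# default), then restart the daemon so it takes effect.
sudo snap connect microceph:dm-crypt
sudo snap restart microceph.daemon

# Add an OSD with full disk encryption; /dev/sdb is an example path.
sudo microceph disk add /dev/sdb --wipe --encrypt
```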
And also, this goes by so fast: this is something the snap is not allowed to do by default. You need to connect the dm-crypt interface explicitly to make this happen. But once you do, it will give you an encrypted OSD device; that's the one up there now. Just to prove that this is a dm-crypt device, you can see the cryptsetup output here. Well, we still have the loop-file OSD from the first node that we originally installed. Let's remove that; we have plenty of block devices now, so our cluster has real disks. As a last step for our production cluster: something that snaps do by default is auto-refresh. This is typically something you don't want for your Ceph cluster; you want to control updates for your Ceph cluster. So a step you do is hold all the snaps and prevent auto-refresh, so that you can refresh or update your software at your own leisure. So, yeah, that's it for the demos. A short outlook on what comes next. We want to make the clustering experience a little smoother still, with no passing around of tokens. One thing we plan to do is use mDNS on the local network to discover new nodes. Another thing we want to do in the near future is provide built-in HA and load balancing for RADOS Gateway endpoints, and also RBD mirroring support. So that was it for the demos and for the... Thank you. Any questions? I don't know if we have time for questions. One question maybe; otherwise I'll be outside, just talk to me and I'll be happy to answer your questions. Oh, sorry. Here you go. Which CPU architectures do you support? Can you repeat the question please? Which CPU architectures do you support? So snaps are pretty flexible. We develop on AMD64, but ARM is tested. I don't know off the top of my head... but ARM, AMD64, POWER and RISC-V, I believe...
Welcome to Testing and Continuous Delivery devroom
All right, good morning everyone, and thank you for coming so early. I'm just going to take less than five minutes of your time to say the welcome, and then we continue with the awesome speakers that we've got. As you can see, there are a lot of us here today. If you aren't aware of the history: in the past we had the testing and automation devroom, which was separate from the CI/CD devroom, and this year the two devrooms have been merged; going forward we will continue like that. So we have the two teams from the devrooms organizing together this year. We start with Anders. Say hi. Yeah. Then we have Jan, Olivia, Fabrizio, Carlos. We've got Sirio, who is at home at the moment; he cannot join us because he has newborn small kids. And my name is Alex, nice to meet you. You all know the rules, I think. Don't be a jerk. Enjoy the presentations. If you see more people coming in late, move towards the middle so that they can squeeze in on the sides and not jump over you. And if you want to talk to the speaker, it's up to them whether or not they take questions during the presentation or after, but we do not have a lot of time for switch-over. So if you want to talk to the speaker, please take it outside so the next speaker can set up and we can continue; you can then just come back in. And this is it, I guess. Thank you again for coming, and let's start.
Streamlining Developer Experience: The Power of CI/CD Standardization and Interoperability
I've just been informed that I have a two-hour talk. So we're going to use that time wisely. Hopefully. We also have, like, a minute, so I can't start the talk for another minute. So with that: who, is this your first time at FOSDEM? I'm also raising my hand for first time; I tried for years and finally got here. So cool, glad you're all here. So now it's like 25 seconds, we have to kind of just, whatever. Yeah, everybody awake? What was the latest you were out? Who went to bed at, like, 10? Okay, good, nobody. And that doesn't mean 10 this morning, having just been up since. Who went to bed after midnight? One, two, three... 3:15? That was 3:15. Four? You're still awake. You're still good. Okay. All right, so we're going to start now. Hi, I'm Jeremy. We're going to talk about streamlining developer experience: the power of CI/CD standardization and interoperability. We're really going to touch on, when we think about developer experience, what's the role of CI/CD in that, and how it fits within all of the different tools and systems that we use. So I'm going to talk about that. A quick note: I did use ChatGPT and DALL-E for the images on a fair amount of these slides, because I evidently had time on my hands. That is very interesting; don't go into it thinking you're going to get exactly what you want. As you'll see on some of the slides, it's a little weird, but why not? Figured I'd try something new. Okay, so as I said, my name is Jeremy Meiss. I'm the co-founder of a kind of stealth DevEx startup right now; hopefully we'll have some news in the next couple of months. I've been in tech for a couple of decades.
Previously, most recently, I was at CircleCI for about three and a half years, running the DevRel and community team, doing a lot of talking around CI/CD and stuff. So that's me. Now, I did have some early feedback on the title of this talk. Gray had a lot to say: this is probably heretical, what I'm going to talk about. I don't know about that. Heresy? I felt that was kind of harsh; he hadn't even heard the talk and already he's giving feedback. But we're going to talk about this evolving landscape of software development, especially in the modern world. If you've ever seen the CNCF landscape: it fits on one slide, but there's no way you're going to read it. That's how big it has grown. I really should have had a slide that showed the progression over time. This snapshot was from a couple of days ago; I'm sure it's grown since then. Continuous integration, when you zoom in, has a good section of that landscape. And CI/CD really stands as this transformative pillar that has reshaped how we look at software, how we look at deployments, and how we look at delivering quality software, hopefully quality software, to the users and the companies, driving that very experience. I also put out: when we think about developer experience, what is the shortened version of that? We're going to use DevX. The internet has spoken, so we're going to use DevX, not DX. So you all say DevX for short, instead of saying all of "developer experience" over the next three hours, I think we have. Okay. So developer experience: defining it, it really encompasses the journey of developers as they're learning and deploying technology, whether that's software or even hardware.
And when you have a successful developer experience, it really focuses on eliminating the obstacles that hinder a developer or practitioner from being successful in their endeavors, whatever they're trying to do. Now, CI/CD's transformative influence on the developer experience is really pretty profound, because we've had this dynamic shift in how developers over the years have collaborated, how they create, how they deliver software. And by automating the pipelines, the integration, testing and deployment processes, it really empowers developers to gather the feedback necessary, with faster feedback loops, so that they can improve code quality and continue to iterate swiftly. That is not a Taylor Swift drop, it's just "iterate swiftly". By streamlining workflows, it helps reduce a lot of the friction that we see and provides a lot of intuitive tools. So good DevX empowers developers to focus on creating that high-quality code we talked about, fostering innovation and ultimately contributing to faster, more reliable software delivery. So we're going to hone in on two of the critical pieces of what that looks like in CI/CD: standardization and interoperability. On the CI/CD standardization side, that really brings the consistency necessary to your pipelines, so that you can reduce friction and enhance the collaboration between your different coworkers or different teams. We're also going to look at a few open source tools, Argo and Flux. I'm not going to bring up any demos or anything, but we're going to talk about some of the features they have that work well with this idea of standardizing processes, and how you deliver good software that way.
Then we're also going to talk about the interoperability side, which ensures seamless integration across multiple different tool sets, everything from observability to potentially different frameworks, all the different tools that integrate with that. With that, we'll look at some of the features that Spinnaker has, and also tools like Backstage, and how they work with the developer experience on the interoperability side, bridging tool-chain gaps and such. At the end we're going to summarize how both of those things play a pivotal role in optimizing developer experience and improving overall productivity, which is really the idea. All right, so the standardization side. That really means we're trying to minimize the variability, reducing errors, fostering an environment where developers can, again, collaborate; that's efficient collaboration. When you're standardizing, you're defining clear, repeatable code integration, testing and deployment processes. Standardizing all of those things ensures that your pipelines are streamlined and the development process becomes a lot smoother for everyone interacting with what you're trying to do, whether you're building something internally, or for external users, or both. So when we think about the steps for what better practices look like, we start with assessment and analysis. Here you're really looking at your current CI/CD pipelines. You want to understand existing workflows, the tools, all the processes that you're using, to identify the pain points, the bottlenecks, and the areas where standardization is really needed.
The next thing is that you look at all the specific requirements that are in place and the constraints of your projects, building on that first step. Then, when you're defining this, you're really going to define the goals and objectives that you're trying to achieve with your pipeline standardization. Those goals should align with the overall dev strategy that you have and some of the organization's business objectives; you don't want to stray away from that. That also helps you start to identify the KPIs that are going to measure what success looks like for your development process. Usually you're probably going to try to reduce deployment times or decrease error rates; we always want to decrease error rates. Then you want to look at what the tools and practices are going to be for your CI/CD standardization, so that things align with your organization's needs and goals. That's things like Jenkins, GitLab CI/CD, other cloud-native solutions, AWS CodePipeline; there's TeamCity, I think, on the cloud side. There are a bunch of different options there. But you want to make sure you have the tools and practices that help you achieve those goals. There are standardized templates for pipelines, defining those essential stages of build, test, deploy, what that's going to look like for you, and then what a standard configuration would be for all of your pipelines. And then you're also going to enforce a lot of those coding standards for CI/CD configurations, ensuring that there is consistency and readability for everything that you're doing, so somebody can come in and understand exactly what you're trying to do and you don't have to spend a lot of time.
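As a toy illustration of the standardized build/test/deploy stages described here (the stage bodies are placeholders, not any real project's steps):

```shell
#!/bin/sh
# Hypothetical standardized pipeline template: every project runs the
# same stages in the same order; only the stage bodies differ.
set -e

build()     { echo "stage: build"; }    # compile / package the project
run_tests() { echo "stage: test"; }     # run the test suite
deploy()    { echo "stage: deploy"; }   # ship the artifact

run_pipeline() {
    build
    run_tests
    deploy
}

run_pipeline
```

A team would keep this ordering fixed across projects and let each project fill in the stage bodies, which is the "standard template" idea from the talk.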
I mean, there's going to be onboarding, but you want to make it as standardized and relatively simple as possible. Then, on documentation and training, which I touched on quickly: you want to make sure that documentation is comprehensive, that you're outlining all of the standardized processes that you have in place. Make sure everybody is aware of how you work, including how you work with your workflows. What's your standard configuration? What are some of the better practices that you're using inside your organization? Make sure that's documented. And then you also provide training sessions for your dev teams and the support teams that work with the dev teams, ensuring that they understand and can be really effective as they use your CI/CD tooling and all the templates that get created. Then you move into the version control side. You want to make sure you're storing those pipeline configs in some kind of VCS: Git, GitLab, GitHub, whatever. That practice ensures that your configurations are versioned, so you can go back to something, you know what the changes were, you can trace where potential errors are, and, like I said, you can easily revert back to something if you need to. Then implement your branching and pull request strategies. They should mirror what you're already doing in the standard that you've hopefully already documented, which we just talked about, making sure that the standard templates and such all follow that same path of branching, pull requests and so on. And then automated testing. Since this is the testing room, we want to make sure we talk about testing.
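A minimal sketch of keeping a pipeline config under version control, as recommended above; the directory, file name and commit message are all made up for illustration:

```shell
# Keep the pipeline configuration versioned so every change is
# traceable and revertible. All names here are illustrative.
mkdir -p ci-config-demo
git init -q ci-config-demo
git -C ci-config-demo config user.email "ci@example.com"  # demo identity
git -C ci-config-demo config user.name  "CI Demo"

# A stand-in for a real pipeline definition file.
printf 'stages: build, test, deploy\n' > ci-config-demo/pipeline.conf

git -C ci-config-demo add pipeline.conf
git -C ci-config-demo commit -qm "Add standardized pipeline config"

# Every revision of the pipeline config is now recoverable:
git -C ci-config-demo log --oneline -- pipeline.conf
```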
You want to make sure you're integrating your automated testing and validation into the pipeline and all those templates, to ensure that those standardized configs produce your expected results. Don't just create a standardized template without testing that it works; otherwise you're going to create problems downstream. It's another great opportunity to put code reviews in place: build out your standardized templates and then start code reviews. Make sure you're not missing something, bring more eyes to it, validate it, catch those errors before they become an issue downstream. Okay. And then continuous monitoring, on the continuous improvement side: make sure you have monitoring and alerting in your CI/CD pipeline so you're detecting issues and bottlenecks in real time, before they become a problem. Establish a culture of continuous improvement. That means you're regularly reviewing and updating those pipelines based on the feedback and the evolving framework that your projects and pipelines go through; make sure those templates aren't being left behind. Governance and compliance is also a very important part of CI/CD standardization. Make sure your policies are enforcing that the standardized pipelines comply with industry regulations, or internal or external standards that are in place; make sure you're accounting for those. Really audit and assess how you're adhering to them, so you continue to improve there as well. Scaling and adaptation: ensure that those standardized pipeline templates can scale and adapt to the different project types that you have. Every team or organization probably has different types of projects that you're all working on.
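The "validate before you ship" gating described here boils down to a pipeline step like this; `run_tests` is a placeholder for a real test suite:

```shell
#!/bin/sh
# Gate the deploy stage on the test stage: if validation fails, the
# pipeline stops before anything ships. run_tests is a stand-in for
# a real test suite.
set -e

run_tests() {
    echo "running tests"
    # a real suite would return non-zero on failure
    return 0
}

if run_tests; then
    echo "tests passed - continuing to deploy"
else
    echo "tests failed - aborting pipeline" >&2
    exit 1
fi
```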
So make sure those templates are easily applied to the different things you're doing: different sizes, different technologies that might be in place inside your organization or that you're developing for externally. Maintaining flexibility helps accommodate the unique requirements that each project is going to have, while still adhering to your standardized core practices. And then there's the feedback loop. Feedback loops are very much a part of DevOps; even more, that's part of why continuous integration and continuous deployment exist, they give you that feedback loop. So have an environment where developers can really collaborate, provide that feedback, and contribute to continuously improving those standard practices, and then continuously communicate the benefits of those outside your organization. Making sure everybody knows what you're working on and the achievements you've had really helps drive more collaboration, drives more awareness of what your organization is doing, and also brings a lot of praise to the teams internally. So by putting these steps, these better practices, in place on the standardization side, organizations really can implement more efficient, consistent workflows, and you start to see the results of that in the developer experience. So we're now going to look at Argo and Flux, just some of the features that they have that help implement some of these better CI/CD practices for standardization.
So Argo has reusable workflows: orgs can define reusable workflow templates that set up the standard sequences for CI/CD, like build, test, deploy, so that devs can reuse those things across projects, not just within the project you're working on now; you can reuse those templates elsewhere. Argo also follows GitOps principles. Your configs and workflows are managed as code in Git repos, ensuring everything's versioned and, like I said, traceable, with easy processes to collaborate amongst dev teams; that's really a core piece of GitOps. And then the way they manage artifacts: Argo supports managing and storing artifacts like Docker images as part of the CI/CD process, so that you can make sure the right artifacts are used in the right situation, deployed across the environments, and used as inputs in subsequent steps as part of your template. So those are some things that Argo has in place specifically. And then from Flux, we have the declarative config model they operate on, where the desired state of how a system is going to exist is defined in code. With this, orgs can define and enforce those standardized practices in a VCS, ensuring that you can track things consistently. On the continuous synchronization side, they allow you to continuously synchronize the desired state in your Git repos with the actual state of, for instance, your Kubernetes clusters. So as that changes, everything can continuously be replicated, so that you have standard configs and deployments that are consistent across your environments. And then there's the policy side; that's where we have, does it say Flagger there? Yeah. So that's the feature-flag side.
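The reusable-template idea maps onto the Argo Workflows CLI roughly like this; the template name and file are hypothetical, and this assumes a Kubernetes cluster with Argo Workflows installed:

```shell
# Register a reusable WorkflowTemplate once; its YAML would define
# the standard build/test/deploy sequence.
argo template create ci-template.yaml

# Every project can then instantiate the same standardized pipeline
# from that shared template.
argo submit --from workflowtemplate/ci-template
```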
So Flux has feature-flag-style capabilities through Flagger, which is part of it, so that you can define the different rules for how things get deployed, either different sets of things or to different users. So you can really do a lot of that A/B testing; if you think about progressive delivery, it's that kind of thing. Yeah. So who here uses Flux? Okay. What about Argo? Okay, so I think there was some overlap. Good. So when we want to achieve these standardized workflows, the summary here is: your templates with Argo and Flux allow for standardized templates and definitions, so that all your orgs have an established baseline to work with, for consistency. There are also integrations with VCS and CI/CD tooling, so your configs are maintained and accessible to all, which is really important, bringing visibility to what you're doing. And then on the documentation and training side, it's really essential to make sure that you've standardized the docs and training, and that you have docs and training for the things that you've standardized; make sure you've done both, so that orgs can really be responsible for making sure that dev teams and even support teams understand what these standard processes are. Continuous improvement really fosters the culture that's necessary to achieve a good developer experience, so that everything's regularly reviewed, the workflows are updated, you're getting feedback, and improvements are continuously happening, making sure that, again, developer experience is high on that list. All right: interoperability in CI/CD refers to the ability of different tools, technologies and components within the CI/CD ecosystem to work effectively together.
That means the various parts, like the pipeline, source code repositories, build systems, testing frameworks, deployment platforms, monitoring tools, are all able to interact with each other in a way that ensures you're able to see what's happening, that data is effectively exchanged, and that there aren't really any compatibility issues or disruptions to your workflows. So what that looks like: there's a collaboration side that gives flexibility and choice. When you're trying to implement interoperability in your environment, it really enables dev teams to use the best tool for the job, so that you don't have to live with vendor lock-in; give them the flexibility to use what works best. And then there are the various tool preferences that each org or company has, and with enhanced collaboration you want to make sure that all those various tools are not a blocker to success. That also ensures smooth interaction. It's also a really important side of this that interoperability really enforces better utilization of your resources, so your orgs can make efficient use of your existing infrastructure and tools. You should not always have to build something new; if you have systems that are interacting together, you're not being wasteful. You're reusing components and scripts, saving cost. The next side of that: scalability and growth. As organizations scale, they're adopting new tech, which happens constantly.
Interoperability really ensures that your CI/CD systems can adapt and expand as necessary to support incorporating new tools, processes, ideas and workflows into the way you all work as a team. And then, yeah, cross-platform deployments. The interoperability advantage there is that, in the multi-cloud, hybrid kind of environments that are out there now, it really promotes a unified approach, so that having all these different systems doesn't have to be a blocker; you have it all together, ensuring that the data gets transferred well. It also promotes unified deployment and infrastructure management. And then troubleshooting and debugging; I knew there was another one there. When issues arise, this interop enables seamless data sharing between all the different tools and processes. I've seen an amazing, almost astronomical growth in the average number of SaaS tools that are in place in an organization, into the hundreds on average at some companies. So being able to look at all the tools, troubleshoot them and have everything working together is a huge game changer for better issue identification, troubleshooting, resolution and such. All right, so in essence, when CI/CD systems interact together, this interoperability acts as that bridge. It's one of those ChatGPT-created images that kind of works. But, you know, connecting all the different parts of your dev and delivery processes together, fostering, as we talked about, collaboration, ensuring teams can work cohesively and efficiently; all of that is tied into the importance of interop. So, looking at how Spinnaker and Backstage do this. On Spinnaker's side, there we go: integration with cloud providers.
Spinnaker lets you integrate with pretty much anything you want, so you have one consistent interface for deploying and managing across platforms, ensuring seamless targeting of the different environments devs have in place and letting them choose what works best. Don't tie them to one specific tool — that's the old analogy of hammering a square peg into a round hole. Then integration with VCS systems: Spinnaker can work with whatever you have, so you can trigger your deployment pipelines directly from your repositories and automate the release process, reducing manual intervention. Then extensible integrations: an extensible architecture supports a lot of different integrations, which lets teams connect with various tools — monitoring, incident management scripts, those things — and ensures Spinnaker fits seamlessly into your org's existing tool sets, requirements, and workflows. Then artifact management. We talked about Argo having that; Spinnaker also lets you integrate with different artifact repos — Docker Hub and Artifactory are the two that come to mind — to assist in managing those artifacts, ensuring the right things are consistently used in your deployments. And then there's pipeline abstraction, which helps you abstract the deployments, making the process more flexible and adaptable to what you're trying to do. Developers can reuse the templates you've created, making adaptation easier as the projects and their requirements evolve.
That bridge between abstraction and flexibility ensures Spinnaker can cater to various deployment scenarios. So that's the Spinnaker side. Now think about Backstage. Backstage integrates with a lot of CI/CD tools and other things — we're talking about CI/CD here — like Jenkins, CircleCI, GitHub Actions, Flux, Argo. All of those bring visibility. Having that interoperability with pretty much anything allows developers to visualize and manage what's going on in their pipelines directly from Backstage instead of jumping between multiple systems; you can do it all in one place. So there's that unified, single-pane-of-glass view of the entire dev workflow. Then service catalog integration: Backstage acts as a service catalog, helping teams manage and discover the services and apps they can use. Interoperability with all the different systems ensures that the information from your CI/CD is integrated into the service catalog itself, making it easier for teams to understand service status and history. The history is really important — being able to go back, see what's happened over time, and spot trends. It also has a really good plugin ecosystem: an extensible architecture with all the custom plugins you can create, or that the community has already built, can bring better visibility to the things you do. And then customization and theming, allowing orgs to customize the UI and theme in place. That may seem like a small thing, but when you're trying to get your organization to buy in on something like Backstage, the ability to customize the look and feel satisfies a lot of the branding requirements that companies — and marketing departments — have.
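To make the service catalog idea concrete: Backstage discovers entities through small metadata files checked into each repository. A minimal sketch of what registering a component might look like — the names here are placeholders, and the exact annotation keys depend on which plugins you install:

```yaml
# catalog-info.yaml — hypothetical component registration (sketch)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-service            # placeholder name
  annotations:
    # example annotation wiring a CI pipeline to this entity;
    # real keys vary per plugin
    github.com/project-slug: example-org/payments-service
spec:
  type: service
  lifecycle: production
  owner: team-payments
```

Once such a file is registered, the catalog page for the component can surface pipeline status and history from whichever CI plugin reads that annotation.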
It's important to have that kind of flexibility; it ensures your org can stay flexible and use what's there. All right — Spinnaker and Backstage both prioritize flexibility and adaptability, allowing organizations to integrate with the diverse tool sets out there and accommodate the various needs developers have. By bridging the gaps between different tech and systems, each acts as a central hub that connects the parts, enhances the flexibility of your CI/CD pipelines and developer workflows, and ultimately promotes a more efficient and collaborative development environment. Now, organizations often use a mix of tool sets, and some challenges come in when you're trying to implement this. The first is that mix itself: each tool has its own ecosystem and APIs. Then data format and schema differences: tools use a lot of different data formats, and they're not unified — having something different from everybody else is kind of their niche. That presents a challenge. Authentication and authorization also present challenges: not only how do you manage access to all these different tools, but you have different APIs going back and forth — how do you work with that? Versioning and compatibility: tools change, new versions come out, and they can introduce breaking changes that either could have been avoided or not — it doesn't matter; you're trying to use them and now something doesn't work. That's a real challenge. And then lack of documentation. We've all seen it: an API that's on version 2 while its docs are on version 1.1, or they haven't documented one change and it breaks things.
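The versioning-and-compatibility challenge is often tracked with a compatibility matrix. A tiny sketch of the idea — the tool names and version numbers here are made up for illustration:

```python
# Hypothetical version compatibility matrix: for each (tool, version),
# list the versions of other tools it is known to work with.
COMPAT = {
    ("ci-server", "2.4"): {"artifact-repo": {"7.x", "8.x"}, "deploy-tool": {"1.30"}},
    ("ci-server", "2.5"): {"artifact-repo": {"8.x"}, "deploy-tool": {"1.30", "1.31"}},
}

def compatible(tool, version, other_tool, other_version):
    """Return True if the matrix records the two tool versions as working together."""
    known = COMPAT.get((tool, version), {})
    return other_version in known.get(other_tool, set())

print(compatible("ci-server", "2.5", "artifact-repo", "7.x"))  # False: support dropped in 2.5
print(compatible("ci-server", "2.5", "deploy-tool", "1.31"))   # True
```

Checking this matrix in CI before an upgrade is one way to catch a breaking combination before it reaches a pipeline.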
That is often a challenge when trying to work with all these different systems, and in some cases building your own integrations between them gets hit by that same lack of documentation. But there are ways to overcome these. Use unified config formats for how you define your deployment pipelines, documented and enforced for all the associated tools, so libraries can automatically convert between formats, ensuring data consistency and compatibility. There are API gateways that translate data between systems for consistency and simplify authentication and authorization across all the different tool sets. To maintain version compatibility, it's important to use version compatibility matrices, so you can track everything, see what works with what, and make better decisions. And make sure you've documented. Oh — time out. Okay, so that's good there on that piece. The last little bit: when we think about developer experience, it's really important to remove all the barriers. So with that — thank you.
Ghosting the hardware
Hello everyone, and welcome to this session about ghosting the hardware. Maybe the title is a bit obscure to you; I will explain what it means a bit later on. My name is Rémi Duraffort. I'm a principal tech lead at Linaro, and I've been working on different open source projects for many years now. I'm currently working on Lava, a test automation system that I will present. Lava stands for Linaro Automated Validation Architecture. It's a test execution system, which means it lets you test your software on real hardware, on real devices like a Raspberry Pi or a DragonBoard — physical devices. It deploys your software, boots it, and tests it on real devices. It's used by multiple projects — KernelCI, for example, mainly uses multiple Lava instances. We also use it a lot at Linaro for the LKFT project, the Linux Kernel Functional Testing project that we are driving. We also use it for bootloader testing: for example, you can test your U-Boot version directly on your board, and Lava will interact with U-Boot and test it. We do firmware testing with it too. It currently supports 364 different device types, which is a lot. So, say you want to test your software without Lava. You have a kernel, DTB, ramdisk, rootfs, and modules that you want to test, and a board — this is a pretty old Raspberry Pi 3, but it doesn't really matter. You need a way to access the serial console to interact with the board, usually an FTDI cable over USB. You need a way to power the board on and off, so you need some device that lets you send a TCP request to a specific port with some commands: one command powers on the board, another powers it off, so it can be made automatic.
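That remote power switching is essentially a one-line TCP exchange. A stdlib-only sketch of the idea — the PDU protocol used here (`on <outlet>` / `off <outlet>` answered with `OK`) is invented for illustration; real power switches each have their own command set:

```python
import socket
import threading

def fake_pdu(server_sock):
    """Minimal stand-in for a network-controlled power switch: accept one
    connection, read a command like 'on 3', and answer 'OK' or 'ERR'."""
    conn, _ = server_sock.accept()
    with conn:
        cmd = conn.recv(64).decode().strip()
        conn.sendall(b"OK\n" if cmd.split()[0] in ("on", "off") else b"ERR\n")

def set_power(host, port, outlet, state):
    """Send '<state> <outlet>' to the PDU and return True on 'OK'."""
    with socket.create_connection((host, port)) as s:
        s.sendall(f"{state} {outlet}\n".encode())
        return s.recv(64).decode().strip() == "OK"

# Wire the two together locally for demonstration.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))          # pick a free port
srv.listen(1)
threading.Thread(target=fake_pdu, args=(srv,), daemon=True).start()
print(set_power("127.0.0.1", srv.getsockname()[1], 3, "on"))  # True
```

Because it is just a command with an exit status, this is also exactly the kind of thing a CI worker can script and retry without a human in the loop.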
Usually we use TFTP and NFS to share the kernel, DTB, and rootfs with the board, so you don't have to actually flash it — if you do that a bit too often, you will eventually destroy the SD card. Once you have all of this, to test the board you power it on by sending the right command to your power controller. You connect to the serial console, interrupt U-Boot, and send some commands: dhcp, so the board gets an IP address; load the kernel over TFTP; load the ramdisk; set the console arguments for the kernel; then send the right boot arguments, which are board-specific. You watch the kernel booting, looking for crashes or warnings; then you get the prompt, log in, run your tests, collect the results, and shut down the board. That's tedious, not really fun, and you will have to do it for every release of your software. That's where Lava comes in. Instead of doing all that manually, we keep the board, the power control, the serial relay, and the TFTP and NFS servers, and we replace you with a program: the Lava worker. Instead of typing commands one by one, you explain in a YAML document to the Lava worker what you expect it to do. You explain that you have a kernel, a DTB, and a rootfs that you want to deploy using TFTP, and that you want your rootfs available over NFS. Lava then knows how to automatically interact with your board, sending all the right commands that I explained on the previous slide, in a reproducible fashion — and it can do that day and night, weekends included. This document you write is what we call a job definition, or job configuration.
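In rough shape, such a job definition might look like the sketch below — the device type, URLs, and prompts are placeholders, and the exact keys depend on the Lava version and the device:

```yaml
# Sketch of a Lava job definition (placeholder values throughout)
device_type: bcm2837-rpi-3-b        # assumed device-type name
job_name: kernel-smoke-test
timeouts:
  job: {minutes: 15}
actions:
  - deploy:
      to: tftp
      kernel: {url: https://example.com/builds/zImage}
      dtb: {url: https://example.com/builds/board.dtb}
      ramdisk: {url: https://example.com/builds/ramdisk.cpio.gz}
      nfsrootfs: {url: https://example.com/builds/rootfs.tar.gz}
  - boot:
      method: u-boot
      commands: nfs                  # boot with the rootfs over NFS
      prompts: ["login:"]
  - test:
      definitions:
        - repository: https://example.com/tests.git
          from: git
          path: smoke.yaml
          name: smoke
```

The worker translates each action into the same serial and TFTP/NFS interactions a human would perform by hand.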
Obviously you can have multiple DUTs — devices under test — per worker, and multiple workers attached to your Lava instance; they all connect to the Lava server in the classical server/worker model. For example, at Linaro in Cambridge we have a lab with hundreds of boards, and I know Collabora also has a board farm like that. It has been designed for large board farms if you want one. Regarding the roles: the server is the web UI and API — it's what is visible to the user, and it usually has no access to the boards. At Linaro, all our Lava servers are in the cloud somewhere, while the boards are physically connected to the workers in a closed lab. The workers have direct control of the DUTs and are not accessible to the users: users have no direct access to the boards or the workers, only to the server. The server is responsible for storing the logs, jobs, and results, doing the scheduling, sending notifications, exposing the UI and API, things like that. On the other side, the workers are responsible for the hardware: they deploy resources, power the boards on and off, interact with the serial consoles, and look for kernel crashes and board health issues. This is the list of supported devices. Obviously you cannot read it — it's way too small because there are too many devices. But it shows that we support everything from really tiny IoT devices, up to the Raspberry Pi form factor, and even large servers, if you want to test those with Lava. And as we support many different kinds of device types, we have to support different deploy and boot methods. For example, you can deploy with TFTP, NBD, fastboot for all the Android boards, vexpress, etc. For booting you can use DFU, U-Boot, pyOCD, fastboot, etc.
And so on for the other ones. For the tests, you can have a POSIX shell interaction if one is available on the system. You can have interactive tests — for example, when you want to interact with a bootloader, it's not a POSIX shell, so you have to send commands and expect results. And we can also do multi-node tests, where more than one device boots at the same time and they can interact: for example, you can test your server on physical hardware streaming to multiple different clients. That's something you can do in Lava. Today I will also talk about why we want to test Lava itself — why test the CI system? The obvious reason is that it's just a piece of software, so it's buggy; you have to test it to know what is working and what is not. Even more important, when you're building a CI system you have to ensure it is rock solid, for two main reasons. First, false positives: if you report something as a bug in the software under test when it isn't buggy, your developers will just say, okay, I'm done with it, it's not working — I won't look at your CI system anymore. Second, false negatives: not reporting an error that actually happens. You run a test, it fails, but the CI system says everything is okay — so you tell the developer "I tested it, it works" when in fact it's buggy, and you release software that was tested but is still broken. So you have to prove that your CI is reliable, otherwise it's just useless. How are we going to test Lava itself? We have a classical hierarchy of tests: static analysis, and unit tests that run on every GitLab CI merge request.
We also do integration testing — that's what I will present today, a project called meta-lava — plus federated testing and tests on staging instances. We have instances that we upgrade every day, run an actual workload on, and check that everything still works the same way as before. The main problem when you want to test Lava is that it's a combinatorial issue. As I said, we support 364 different device types, roughly 16 deploy methods, roughly 26 boot methods, and 5 test methods. If you do the combinations, the number you would have to test is insane. Yes, I know a lot of these combinations just won't work — not all devices support DFU or fastboot and so on — but it's still huge. Maybe you want to give me boards and money — I'd be all for it — but obviously I don't think that's the case. So maybe we should consider faking the DUTs: faking the hardware. That's the goal of the meta-lava project: to test the full system from the user's point of view, back to the user. The user should be able to send jobs; they have to be scheduled and run on a fake DUT; results are sent back; and the user pulls the results through the user interface. And I don't want any boards, because I want this running somewhere in a CI/CD system — and it has to be cheap and fast. There are two ways to fake devices. You can do board emulation, using FVP or QEMU for example. The main problem is that it's CPU-intensive, so it will be slow and expensive. The other way is to ghost the hardware. Looking back at the lab architecture: I don't want to touch the user — that will be my testing system — and I don't want to touch anything in the server or the worker, because I want to keep the system under test intact.
So the only thing I can change is on the left: the board, the power control server, and the TFTP/NFS servers. What I have to do is build a fake DUT that feels like a DUT, looks like a DUT, smells like a DUT, sounds like a DUT, and tastes like a DUT, because Lava should not see any difference between a real DUT and a fake one. But that's not enough: I also have to check that the interaction Lava has with the fake DUT is still valid. If my fake DUT accepted anything, Lava could do something stupid and it would still work, while being just wrong. So the fake DUT also has to check that what Lava sends is legit — that Lava is still behaving correctly. Let's look at the interactions between Lava and the DUT; there are three. First, power control: the way Lava is designed, it's just a shell command that Lava runs — any command that returns one if it's failing or zero if it's passing. From the fake DUT's point of view, the DUT should be able to check that this command was called at the right time — before booting — so that Lava is still doing what it's supposed to do. Second, the serial relay: again just a shell command Lava runs, interacting over stdin and stdout. So I need to build something that feels like a DUT when you interact with it over serial. Third, the TFTP and NFS servers: I'll use normal TFTP and NFS servers, and just check, from the fake DUT's point of view, that Lava has deployed the right binaries for me. So the question is: where do I want to mock things? Let's take an example. Say I don't want to give this presentation — I want to stay in bed, with something standing in for me so you don't see the difference.
I could build a robot that takes my place, speaks like me, explains the same things, and interacts with you the same way I would. That's one way. I could also force you all to wear glasses that inject an image of me into your vision. Those are two different ways to fake me, but from your point of view they are the same: you wouldn't notice the difference. Mocking is the same: there are different places where I can mock. I could create hardware that interacts with Lava exactly the way real hardware does, but without actually booting a kernel — it would be enough to fake the serial. But as I said before, I don't want any hardware, only software. So what I will do is have software that fakes all the interactions with Lava — the serial relay, for example. We're going for a full software implementation: a project called Demisis. When you run it, it produces the same output as a normal board; you can interact with it and it feels like interacting with a real board — I'll show you right after. You can send it commands and it reacts like a normal board would. And when you issue TFTP and NFS commands, it actually performs the TFTP and NFS transfers for you and checks that the binaries are present. Let me do a really short demo. I have a run script, just a wrapper so I don't have to type everything, because that's painful. So, my Demisis system — the program that fakes a DUT — is a Python script, and I give it a set of commands from a YAML file that I'll explain right after. If I start it, those of you used to watching U-Boot boot will recognize it: it prints what U-Boot usually prints, waits for you to press enter, and then you get a shell where you can type commands — for example dhcp, and it will do a dhcp. This is all fake.
I don't have any board attached; it's just a program faking a U-Boot interaction, a board interaction. Then I can ask it to boot. It's not booting anything — it's faking it — but from Lava's point of view, it is booting something. The screen is a bit too small, but you can see it looks like a board booting; it's just printing text. And that's enough, because it fulfills all the requirements from Lava's point of view — and it's just a program running. I can also do a login interaction, for example: I want to check that Lava is able to log in automatically, sending the right login and password. I can create a program that does that. Again, it just does the basic thing: booting. You see there's a small delay when printing — that's on purpose, to mimic a real board, because a real board does not send all the characters in one go; the serial link takes some time to process and transfer. We fake that too. If I don't send the right parameters, it answers "login incorrect"; if I send the right ones, it logs in as normal. Again, this is not doing anything — it's just pretending to run a system. And this is what Lava usually expects when it runs tests: it expects certain signals, and I can fake those too. Looking at what's inside — it's a bit too small — the argument to my program is just a set of commands. I ask my program to print some lines — the ones you've seen — then it prints the different lines and accepts being interrupted, like U-Boot does. Then it has a shell with the U-Boot prompt, and it will wait forever for exactly this command, "usb start", and so on and so on. For the fake DUT to work and move to the next stage, Lava has to send exactly the right commands.
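The core of such a fake DUT is a small expect/send state machine: print some "board" output, then insist on receiving an exact command before advancing. A stdlib-only sketch of the idea — the script steps are invented, not Demisis's actual format:

```python
# Each step prints some "board" output, then requires an exact command
# before advancing; any other input fails the run, which is how the fake
# DUT verifies that Lava sends exactly what a real board would expect.
SCRIPT = [
    {"print": "U-Boot 2023.10 (fake)\nHit any key to stop autoboot:", "expect": ""},
    {"print": "=> ", "expect": "usb start"},
    {"print": "=> ", "expect": "dhcp"},
    {"print": "=> ", "expect": "boot"},
]

def run_fake_dut(script, incoming, out=print):
    """Feed `incoming` commands through the script; return True only if
    every expected command arrives in order."""
    lines = iter(incoming)
    for step in script:
        out(step["print"])
        got = next(lines, None)
        if got != step["expect"]:
            out(f"unexpected command: {got!r}")
            return False
    return True

print(run_fake_dut(SCRIPT, ["", "usb start", "dhcp", "boot"]))   # True
print(run_fake_dut(SCRIPT, ["", "usb start", "printenv"]))       # False
```

The real tool additionally fakes timing (the per-character serial delay) and side effects such as actually fetching files over TFTP when the matching load command arrives.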
And if Lava doesn't send them, the job fails. Thanks to this list of expected commands, I can check that Lava sends exactly the same commands from one version to another, because that's what a real board would expect — and at the same time, from Lava's point of view, it gets the output it expects. For example, here Lava is waiting to get a TFTP instruction to load the kernel image over TFTP. I wait for this exact command, and when I get it, I actually download the file — I have a small script that downloads it over TFTP and checks that it's present. That's what I meant by "what Lava sends should be meaningful": all the artifacts should be available where they should be. That's it for the short demo. So that's what the meta-lava project is doing: we have a server and workers working together, and instead of a real board, I just have the Demisis system. It's currently running 28 different device types, including boards I've never seen, because I only need the logs and the commands they expect — I don't need the real board itself. It also lets you test bootloader failures, for example, which are difficult to reproduce in real life: you'd have to damage your board to get some specific errors. Meta-lava and Demisis can reproduce the same error every time, because it's just a specific output that Lava has to see. If you want to contribute — to have your boards, or rather fake versions of your boards, tested by Lava — please come see me. I'd be happy to add them to the system, and that will help ensure your board keeps working with future Lava versions. System mocking is a fun thing to do; you just have to look at the interactions between the different systems. That's all. Do you have some questions before we go to the next presentation?
Pushing test lab to its limits: performance tracking techniques
Hello everyone, my name is Paweł Wieczorek. I work with Collabora, and I've been involved in the maintenance of the server-side components of Collabora's automated testing laboratory. Today I would like to share a few lessons learned from that experience, particularly related to tracking the laboratory's performance and pushing beyond the limits of the software it runs. We'll start with some background information. Next, I'll move to interactive approaches for tracking its performance — I mean the lab's performance. After that, I'll describe a few solutions for automating that, and finally I'll share some thoughts on data generation. So, let's start with the reason why — what brought us here today. Thanks to Rémi's talk, we now have an idea of what Lava is, what it provides for testing automation, and how it supports all these efforts. Some of you might also recall a talk given by my colleague Laura at last year's FOSDEM. Laura described how the lab at Collabora is set up, what its day-to-day maintenance tasks look like, what the main challenges of running this kind of laboratory are, and she shared some best practices. The key piece of information for us today is that Collabora's lab is a fairly large Lava instance that is continuously growing, and together with a high number of devices comes a high number of test job submissions to process — which, unsurprisingly, can result in higher load on the server side of things. And that, in fact, was our case. There was no need to panic, though — at least not right away. High load means that the resources allocated for lab purposes are in use, and that's what they are meant to do, after all. Interestingly, especially high load was observed on the nodes running the database processes. And all of that is mostly fine, until the system becomes unresponsive.
That might make the lab unreliable, or even unusable for higher-level test systems — like Mesa CI, or KernelCI on the screenshot — which other Collaborans are involved in developing, maintaining, and of course using as well. My first thought was to simply estimate what resources are required for day-to-day operations and throw them at this workload. That could work short-term, but it wouldn't really solve the problem. To do it the right way, a deeper understanding of the root cause of all these issues was needed. By the way, this photo is from the Polish IT Olympics, where a hardware-component-throwing contest is held. It's a hard-drive-throwing contest, which might not be the type of resource we needed, but that was the initial idea. Thanks to Rémi's talk we also have a rough idea of Lava's main components, but let's recap them real quick. At a very high level, Lava on the server side has two main components: a scheduler and a connection to the database. Taking a closer look, those are, respectively, a Django application and, by default, a PostgreSQL database. These are widely known, used, and mostly loved software components, so we can make use of several already-available performance tracking tools for them. Let's go through a few interactive, or semi-interactive, ones. As trivial as it might sound, it is equally important to start by simply enabling verbose logging on affected instances. This way we get first insights from replaying user stories, based either on direct reports from users, on Matomo statistics collected by recent Lava releases, or on logs from the load balancer, which show us which API endpoints and which views are most commonly requested. In the case of Django we get a few other perks — it's as easy as literally flipping a switch.
In debug mode, Django also allows logging every statement executed on the database, and that can easily be extended with additional profiling information. But even with all these perks, this is post-action information. To collect it in a truly interactive manner, Django fortunately has us covered and provides just the right tool for the purpose: the Django Debug Toolbar. It isn't much harder to enable than verbose logging: add an additional Python package to your deployment, set the internal IPs from which the Debug Toolbar should be available, confirm enabling it, and you're good to go. The Debug Toolbar not only provides great, immediate feedback, but also includes traces and additional profiling information, all in an easy-to-use, graphical, user-friendly way. As you can see on the right-hand side of the screenshot, you even get all the requests sent to the instance and all the SQL statements run. But even though these tools are easy to enable, there are drawbacks as well. They should not be used on any user-facing instance, which brings us to setting up a personal, local Lava instance just for debugging and performance-tracking purposes. Such a local instance would usually come as a clean slate: an empty database, no devices — and most local instances would not be able to connect to physical devices, at least not in the numbers the production instances run. And even though we could fake multiple devices, as Rémi mentioned in his talk, that wouldn't solve the problem of having a database pre-populated with data. We could potentially prepare a database fixture for that purpose, but it might not be easy to mock the entire database — as you can see on the model graph for lava-server, it's a non-trivial task, especially when it comes to keeping large numbers of processed jobs as archives.
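Flipping those switches might look like the settings sketch below — statement logging plus the toolbar, for a local-only instance. This is a generic Django sketch, not Lava's actual settings module, and as said above it must never be enabled on a user-facing deployment:

```python
# settings.py additions for a local debugging instance (sketch).
# Do NOT enable any of this on a user-facing deployment.
DEBUG = True

INSTALLED_APPS = [
    "django.contrib.staticfiles",
    "debug_toolbar",              # pip install django-debug-toolbar
]

MIDDLEWARE = [
    # The toolbar documentation recommends placing its middleware early.
    "debug_toolbar.middleware.DebugToolbarMiddleware",
]

# The toolbar only renders for requests coming from these addresses.
INTERNAL_IPS = ["127.0.0.1"]

# Log every SQL statement Django executes (effective only with DEBUG = True).
LOGGING = {
    "version": 1,
    "handlers": {"console": {"class": "logging.StreamHandler"}},
    "loggers": {
        "django.db.backends": {"level": "DEBUG", "handlers": ["console"]},
    },
}
```

The `django.db.backends` logger is what surfaces each query with its execution time, which is often enough to spot the first obvious hotspots before reaching for heavier tools.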
But the question is: do we really have to mock the database? It's all done locally, in our private debugging and performance-tracking instance. Maybe we don't have to create a new database, but can reuse a backup from a staging or production instance that we also run. As the old saying goes about two groups of people and backups, I believe we all belong to the group that already makes them. There is also an important second part of that saying: make sure that restoring your backup works properly as well. And by reusing your pg_dump output as the input for your performance tests, you can tick that task off your administrator's list. Also, if you base your Postgres Docker images on the official one, there is a really simple data-initialization method: just mount a volume with the pg_dump output, and everything else is taken care of by the image's init logic itself. It even supports on-the-fly decompression of the most popular archive formats, as you can see in the snippet taken directly from the init code of the Postgres image. Since we now have this database in our local instance, it would be useful to incorporate even more statistics from the database itself. For this, we could simply use pgAdmin, or even the PostgreSQL command-line tool, to check the actual runtimes and other statistics with EXPLAIN ANALYZE queries. This highlights database-level bottlenecks for us. And this way, having a database-level tool, we can also run various experiments on the database, like changing indexes or adding query-planner hints. It costs us almost nothing — just another container in our local setup — or, if pgAdmin is too much, you can also use the graphical tool available online, which highlights the bottlenecks for you with a heat map showing where the issue might lie.
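The init-directory trick mentioned above might look like this compose sketch — file and credential names are placeholders. On a fresh data directory, the official postgres image executes any `*.sql`, `*.sql.gz`, or `*.sh` files found in its init directory:

```yaml
# docker-compose.yml sketch: restore a pg_dump backup on first start.
# File names and credentials below are placeholders.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: lavaserver
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: lavaserver
    volumes:
      # The entrypoint runs scripts from this directory on first start only,
      # decompressing gzip archives on the fly.
      - ./backups/lava-dump.sql.gz:/docker-entrypoint-initdb.d/lava-dump.sql.gz:ro
```

Note the scripts run only when the data directory is empty; to re-import a newer dump, remove the database volume first.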
Using this database-level utility completes our tool set of interactive solutions. And while it is really important to be able to perform all those actions once, it's paramount to do it again sometime soon — and again and again and again — which moves us to automation solutions. By now we know what to look for, or watch out for, in our Lava instances; and from the user stories, bug reports, Matomo statistics, or load-balancer logs I mentioned earlier, we have specific code components to track, or maybe even test cases ready to check. The question is how to run those test cases to get statistically valid feedback. We would have to take into consideration cache warm-ups and test-case calibration, preferably with a way to compare between benchmark runs — and it would also be great if it fit well into the test suites currently used by the upstream project, which, by the way, are based on pytest. Fortunately, it turns out there is a pytest fixture that provides all of that and even more. For the Lava bottlenecks found in the Collabora instance, the next step was simply to wrap the prepared test cases with this fixture; wrapping the key pieces of code gave us benchmarks ready to run. The next step, once the test suite was prepared, was to plug it into the pipeline. Both the upstream Lava project and the downstream Lava tree make heavy use of GitLab CI, and that shouldn't be surprising — many projects do the same; for example, DRM CI was merged in the kernel 6.6 release. Currently, the job definitions for those GitLab CI pipelines — the downstream one above, the upstream one below — don't share any reusable code. This might change in the future; for now, downstream changes are made with ease of upstreaming them later in mind. Moving to external definitions could make the GitLab CI pipelines a bit more complex, but we'll see whether it brings any value in the future.
Of course, GitLab CI jobs need an environment to run in, and to get a baseline of what should be expected from benchmark runs, the easy way out is having a dedicated runner that provides the most stable results, not affected by, for example, other test suites running in parallel on the same GitLab runner. A good choice would be to select a machine that has resources similar to the node your LAVA instance runs on; for proof-of-concept purposes, I used a small desktop computer, which gave just that. GitLab runners are also really easy to plug into a GitLab server. And while we are already optimizing the pipeline, we should also take into consideration caching the CI data resources for benchmark runs. For that, we could easily use the already available upstream LAVA caching solution, which is based on specific CI images to run tests on. But that would also mean that the production data from the database we used earlier is no longer a valid option for us, and we need to revisit the LAVA server model, which brings us to data generation, which we could no longer omit. That brought us to creating a dummy database generator, focused on just a few key tables and relations, according to the Postgres planner statistics. It was implemented with a very limited scope, to only support the worst bottlenecks that were found in the Collabora instance. For that, we used standard Python tools: factory_boy and Faker. As a bonus addition, you might also want to ask a few questions: should LAVA actually archive all the test jobs that are run, or can archiving those jobs be delegated to higher-level test systems? Fortunately, a retention mechanism is already available in upstream LAVA; it just required enabling it in the Helm charts used to deploy the LAVA instances. To summarize all of that, I've got three final thoughts that I would like to share with you. Building and tuning testing laboratories is not a one-time job.
It's a process that might differ from instance to instance, depending on your specific workload, but it's something that I hope could be easier for you if you come across the same set of issues. It also requires frequent revisiting and adjusting according to the results you see. And even small changes can bring huge boosts in performance, but that is probably a topic for another talk. That's all I have prepared for you today. Thanks for your attention. Do we have time for questions? If there is a question, I will be happy to answer it.
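As an aside on the dummy database generator mentioned above: the real implementation used factory_boy and Faker, driven by the Postgres planner statistics. A dependency-free sketch of the same idea follows; the table layout, column names and weights below are invented stand-ins for what `pg_stats` would report.

```python
import random

random.seed(42)  # fixed seed -> reproducible data, like a fixed Faker seed

STATES = ["Submitted", "Running", "Complete", "Canceled"]
# Pretend these frequencies came from pg_stats.most_common_freqs:
STATE_WEIGHTS = [0.05, 0.10, 0.80, 0.05]

def make_test_jobs(n, device_count=20):
    """Generate dummy rows for a hypothetical test-jobs table."""
    jobs = []
    for job_id in range(1, n + 1):
        jobs.append({
            "id": job_id,
            # foreign key into a devices table, uniformly distributed
            "device_id": random.randrange(1, device_count + 1),
            # categorical column following the observed value distribution
            "state": random.choices(STATES, weights=STATE_WEIGHTS)[0],
            # durations are roughly log-normal in many job systems
            "duration_s": round(random.lognormvariate(5, 1), 1),
        })
    return jobs

rows = make_test_jobs(1000)
```

The point is not realism of individual values but matching the distributions the query planner sees, so that benchmarks against generated data trigger the same plans as production data.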
Performance testing and why even the imperfect one is important
Hello everyone. (You have two more minutes.) My name is Andre. I work for... okay, I've worked for Red Hat for several years now as a quality engineer. And today's talk is really about performance testing, but it's not about the testing itself; it's more about why we should do it. (You have six more minutes.) Okay. (You're starting early, just so you know.) Yeah, okay. So it's more about why we should do it and why there are benefits in it, even if you do it wrongly or imperfectly. That's the main point of the talk for today. So, first things first: why should we do it? What are the benefits of doing performance testing even if we don't have an isolated environment and all this kind of stuff? Well, for me, the main benefit is that even if you don't have the environment you would want, you can still find the bottlenecks in your application, or whatever you are testing, and you can still optimize it even if you don't have everything ready. Because the truth is that performance testing is quite expensive, and for the good kind, I don't think there are companies that will give you the resources you need to do it perfectly. So that's, for me, the main reason to do it. And for me, the second most important is that you will gain the knowledge you need about the product itself, because you will suddenly see things that you cannot see when you normally deploy things. You will see the little thingies that are happening here and there, and the information you gain is quite nice to have. So those are probably the points you should look at if you are thinking about performance testing; this is what you will actually gain from it. This is only my opinion; you will see a lot of papers about performance testing and all the things that you have to take care of.
On GitHub, I know about two or three papers that have like 40 pages about performance testing and all the criteria that you have to fulfill. In my opinion, there are two variants of performance testing: the first is measurement and the second is testing. For me, measurement is when you are really looking for numbers, and you need those numbers for, I would say, legal reasons, or anything you have to declare to your customer. For example, for us, I work on Debezium: if we wanted to say that this connector is actually able to do 30k per second, we would need some kind of proof that we can do it. And getting this proof is very complicated; you have to do it in very specific ways, and even if you have everything, it's not always accepted. The second variant is just testing. For me, testing is really just finding the bottlenecks in your product. And I think the testing is even more important, because there you will find all the bottlenecks and you can really optimize your application, and you can see the flaws in your code. You cannot see these things when you run it regularly and you don't have everything around the application tuned up, so you don't push your application to the maximum. These things usually happen when you go over the top, or near the maximum. So, in my opinion, these are the two ways to do performance testing, or the two variants of it.
What is not really optimal about the testing I was talking about, just finding the bottlenecks, not the numbers, is that you need massive monitoring, and I will say more about that later in the talk. That's the main disadvantage: most of the time you will go around tracing, monitoring and metrics, and you will find stuff that will really give you a hard time figuring it out, because you are going for performance and you are speaking in milliseconds, yeah? But most of the tools that are used for monitoring are not really prepared to handle one-millisecond resolution. They think that it's okay to scrape metrics, for example, every 10 seconds, and this will give you massive headaches along the way. I think I have already somehow covered this, but the goal of the performance testing is, as I said, to find the bottlenecks. But there is much more to it, for example the load types. If any of you have come across performance testing, the main point all the guys are talking about is what kind of load we are going to generate, and how we are going to do it to make it reproducible. That's a problem, because on some applications you put a constant load; let's say you are going at 10k requests per second against the API for one hour. That could be fine, but we have all seen that some websites, for example the systems where you are buying tickets for concerts, need peak loads: you are going low at 5k per second and suddenly you spin it up to 100k per second or something like that. So you really need something that will generate the load for you, and do it reproducibly. You need to have the same load so you can repeat the test a couple of times and be consistent, because otherwise you will find all the other things except the flaws in your code. So, the main problems you will find during performance testing. I have said that you don't need an isolated environment to do it.
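The constant-versus-peak load profiles just described can be sketched as a small seeded generator: same seed, same load, run after run. This is an illustrative sketch, not the tool the speaker's team built; all parameter names and numbers are made up.

```python
import random

def load_profile(duration_s, base_rps, profile="constant",
                 peak_rps=None, peak_every=60, peak_len=5, seed=0):
    """Return the target request rate for each second of the test."""
    rng = random.Random(seed)  # fixed seed -> reproducible load shape
    rates = []
    for t in range(duration_s):
        rate = base_rps
        # "peak" overlays short bursts, like a ticket-sale rush
        if profile == "peak" and t % peak_every < peak_len:
            rate = peak_rps
        # +/-2% jitter so the load is not artificially perfect
        rates.append(int(rate * rng.uniform(0.98, 1.02)))
    return rates

steady = load_profile(120, base_rps=10_000, seed=7)
spiky = load_profile(120, base_rps=5_000, profile="peak",
                     peak_rps=100_000, seed=7)
```

A driver would then issue `rates[t]` requests during second `t`; because the profile is deterministic, a regression seen in one run can be reproduced in the next.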
And that's true: we don't have an isolated environment in our team when we do performance testing. But if you don't have an entirely isolated environment, you need to know your environment pretty well. You have to know your latencies, you have to know all the hardware specs and these kinds of things. You really need all that information, because if you see some very specific things happen during the test which are not common, you can then put the puzzle together with all the information you have about the environment, and at least, I would say, decrease the number of stupid mistakes you will have to chase down. The next thing that is very important is to have the monitoring, which I have mentioned already. You need all the metrics you can get, because beforehand you don't even know what kind of information will be valuable for you, but you will need all of it. If you don't gather those metrics and then figure out you need them, you cannot get them from the past. So really, gather everything you can and it will be fine. And the last thing: you need to tune up all the systems you depend on. We are working with databases, and we cannot really push our product to its limits if the database isn't optimized for the hardware it's running on, because if the database is not running at full throttle, we are not either. So you basically need to have everything there on high spec, so that you don't bottleneck your own application; that's one of the main points, because some things are quite problematic to tune up. I quite like this quote, because it's all about the metrics: if you have them, it's fine and it's nice; if you don't have them, you have massive problems. So I think that's really the quote you should keep in mind. So, again, monitoring. I have already talked about the problem with the scraping.
So: we have mostly used Prometheus, and the maximum you can get from Prometheus is one-second scraping. That's fine for informational purposes, but not for performance metrics, because things happen within milliseconds. Maybe 10 milliseconds would be enough, but with one second you are really losing a lot of information, and later on I have an example of what you can see when the scrapers are not fast enough. And that's a massive problem, because not every scraper, or I would say no scraper, can do it really fast, so you probably have to implement it yourself, and we are actually working on that. So that's the main problem. And the second problem is that you will end up having a lot of systems in the field, because you need hardware metrics, you need JMX metrics, and I don't know what else; it really depends on your application. For us, we needed hardware metrics, JMX metrics, and some metrics from our test suite, and these three things all have different outputs. You know, we have used Netdata for the hardware metrics; it's a really nice tool, open source, fast, everything nice. But you cannot import JMX metrics into Netdata. And Netdata also has one problem: you cannot import anything that happened in the past, because it's strictly hard-coded for "now". So that's the problem, and then you will say, okay, so I cannot have JMX metrics in Netdata, so I'll add Prometheus, that's fine. Okay, so now you have Netdata and Prometheus. Well, and then you will continue, because, at least in our experience, you still need someplace to store the metrics from your test suite, and you cannot just import those anywhere. So then it happens that you deploy Postgres, because you can use Postgres as a back end for Prometheus, for the data storage. So now we have Netdata, Postgres and also Prometheus.
And last but not least, you will add Grafana, because you need to visualize it, and getting all those things into shape means you have massive monitoring. Everything can go wrong, so if you can use the least amount of tools, it's better, because once you have too many, it's a nightmare to keep them all in shape the whole time. Yeah. So, with performance testing, you are not really looking for the numbers. Numbers are not important in this case, because you don't want to see that the throughput is like this or like that; you need to see the trends in the graphs, because there you can see if you are constantly slowing down, or if you are on your way to optimizing. So you really have to look for the patterns and the trends in the graphs. I have an example from our testing where I will show you the patterns we have found. But before that: our system under test is Debezium. I don't know if you guys know Debezium, but it is effectively change data capture streaming, which means that we sit on top of the database, we scan the transaction logs, and we send all the events that happen there to Kafka. We are effectively running in the Kafka Connect runtime, which makes the performance testing even more juicy, I would say, because the runtime is not ours. So it's a little tricky. That's our system under test. And this is the first example I have put up. The graph on the top is basically our process duration, and there are two things that you can see in the graph. Most of the time we are oscillating between some values, around 170 to 220, and that's entirely fine; that's actually what you want to see if you are looking at response times, you need to oscillate around some value, like a sine wave or something like that. But what is not nice is this at the start, where we are constantly getting slower and slower and slower.
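The "oscillating is fine, drifting is a problem" distinction can be checked mechanically with a least-squares slope over the duration samples: a healthy series hovers near slope zero, while a series that keeps getting slower has a clearly positive slope. A minimal sketch, with made-up sample data and illustrative thresholds:

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Process durations in ms: one series oscillating around ~195,
# one getting slower and slower (both invented for illustration).
healthy = [200, 170, 220, 180, 210, 175, 215, 185]
drifting = [170, 180, 195, 205, 220, 240, 255, 280]
```

Here `slope(healthy)` stays well under 1 ms per sample while `slope(drifting)` is over 15, which is exactly the kind of trend, rather than absolute number, worth alerting on.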
And we have some peaks there, and these peaks don't have a reason to be there, because the data are the same all the time. So this is most likely a flaw in the code; there is something happening that shouldn't be happening. It can be the database flushing out to disk, it can be basically anything, but you know that there might be a problem. You have all the other metrics; you can have metrics from the database that will show you that flushing was happening, or anything else. But this is what you have to look for. It will certainly be different for your application than for ours; you will have to define what you are looking for from the start. But that's the main thing. And the funnier example is this. No, here. These are JMX metrics from Debezium, and they show you the size of the queue, the internal queue of Debezium. That basically means that from the database you are reading into the internal queue; so once the queue is zero, you are not reading. But we are still processing, right? So there must be some mistake. And this is actually the problem with the scrapers: if the scraper samples every second, the database is pretty fast and it can empty the queue during that time. And if the scraper hits the right moment, it will give you zero. So from this point until the end, the graph is all wrong; it's not true. And it's all because of the speed of the scraper, because it basically hit the wrong time. So that is the thing you have to be worried about, because this will surely happen. And these are some other graphs. These are, I would say, more wild; they are from the start of our playing with performance. But the top one is also pretty cool. It's not as constant as the previous one, but it's still within some borders; we are somehow oscillating, but there is no really clear pattern. But the queue size is okay now.
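The scraper-aliasing effect just described is easy to reproduce in a toy simulation: a queue that fills and drains within each second looks permanently empty to a one-second scraper that keeps hitting the drained moment, while a 10 ms scraper sees the real activity. All numbers below are illustrative.

```python
def queue_depth(t_ms):
    # Toy workload: the queue fills for 900 ms of every second,
    # then the (fast) consumer drains it in the last 100 ms.
    phase = t_ms % 1000
    return phase if phase < 900 else 0

def sample(interval_ms, duration_ms=10_000, offset_ms=950):
    # offset_ms models the scraper "hitting the wrong time" every time:
    # it always lands in the drained window when the interval is 1000 ms.
    return [queue_depth(t) for t in range(offset_ms, duration_ms, interval_ms)]

slow = sample(1000)  # 1 s scrape interval: reads zero every time
fast = sample(10)    # 10 ms interval: the activity becomes visible
```

With these numbers, every element of `slow` is zero, so the dashboard would claim an idle queue while the system is busy; `fast` captures the actual fill pattern.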
You can see it, because you have some data there, but not zero. So that was an issue with the scrapers, as I said. And this is really the thing: you will have to look for the patterns in the graphs, that's important. You can see all the different ones, and there are a lot of papers on the Internet about what to look for in your specific application. So, yeah, don't look at the numbers; numbers don't tell you anything. You can usually get higher numbers if you upgrade the hardware you are running on, but if you can optimize on whatever hardware you have, you will surely get big numbers almost anywhere. Now some tips and tricks from me. Along the way, as we started playing with performance, we have developed a lot of tools. First is the database manipulation tool, which effectively gives you a JSON API, and just with the JSON API you can create DML for almost any database; we now have MySQL, Postgres and Oracle there. So you don't need to have a lot of different JDBC connectors in your code; you just deploy this and it takes care of it. We have also implemented a load generator that can generate load for you: constant load, peaks, and all this kind of stuff. We also have some automation and other things, like MySQL auto-tune. We are pretty proud of that one, because it can basically tune your MySQL to the whole VM or physical machine you are running it on. And you would say that it's easy, but it's not; you know it's hard when you end up on the seventh or eighth page of Google. At that point, you know, we are probably not in good shape, and this is one of those things. So please take a look if you are working with MySQL; we have the calculation of the parameters for the database there, and it will save you a lot of time if you want to tune up your database. We have spent the time for you.
And secondly, we have implemented a Netdata-to-Prometheus scraper, so you can get rid of one struggle point in your monitoring environment. And we are also starting to work on the fast scraper for our monitoring stack, but it's not done yet, because it's quite a bit more complicated. So, please take a look; I will have the links, it's all on our GitHub and everything. It's all open source, so you can also add some code there if you want. Yeah, so I started quite a bit earlier than I should have, so I have some time now. But okay, we can just summarize everything now, and I hope for some discussion with you guys who have done performance testing. So, for me: don't be scared of performance testing. It's not some monster; people mostly just make a monster out of it. If you don't need it for legal reasons or anything like that, it's fine, you can play with it. It's fun, and you will gain a lot of knowledge about your product from it. And especially if you are a QE; I mean, a lot of QE folks don't have the necessary knowledge about the product itself, and this helps a lot to get through everything, because in the end you will just go through the code and look for the mistakes or something like that. So that really helps a lot. So, gather all the metrics you can. We are also writing our blog, and all the repositories are behind the two links, the repository and the organization. I would be happy to hear from you. And yeah, that's probably it for my talk. As I said, I started a lot earlier, so thank you very much for listening to me. Please do ask some questions. Yeah, so my question would be: what kind of experience do you have in your complex system, when you see something happen there and say, okay, here is a latency spike or something like that?
What experience do you have with, let's say, finding the cause of the problem? When something happens randomly, you will see it in the graph and say, okay, something happened there; and that's annoying, especially when it happens randomly. So what kind of strategies do you use to find the cause of the problem in a complex chain? Yeah, okay. So the question was: if there are some changes in the environment, some latency things and so on, how can we deal with that and how can we find the causes of the problems? Surely, this is the main problem of the whole performance testing outside of an isolated environment. Well, you need the metrics from everything, because then it at least helps you to get all the things onto the right timeline, and you can see the dips and peaks and what could have happened. But if it's something really bad, you often cannot find it, because it will mostly just disappear in all the logs. It can be something like, and this happened to me once on some small machines, a funny thing: you fill up your TCP queue, and you cannot find that anywhere in the logs. In that case, you just repeat the test, even if it takes long, and you see whether the same thing happens or not. I don't have any other recommendation for that, because this is really the main problem if you are doing it outside of the ideal environment. You will surely face this, but mostly it's not happening that often, I would say, because you can have observability and tracing for a lot of things, and most of the time you can correlate those things together, so you know exactly what is wrong, especially for the network. You can capture a lot of the network traffic, so you can see all the traffic and what is going on, especially on one line.
So then you can usually put those graphs together and you know the timing. Is that an okay answer for you? Yeah, yeah, yeah. Thanks for the talk. I was just wondering how you use the traces, analyzing the traces, because I've seen that you mentioned metrics and traces. Sorry, can you speak louder? Oh, yeah. Can you hear me now? Yes. Yeah, I was wondering how you use the traces for performance testing. When you collect the traces, how do you deal with the sampling of the traces? If you miss something because the sampling is bad, or you are not sampling everything, maybe you have to infer something from the metrics and the traces; I was wondering how you deal with that, and whether you use distributed tracing in a large project, collecting all this kind of stuff. I'm not sure I understand the question. The question is: I've seen that you are collecting the metrics and then analyzing the metrics. What about the traces? Yeah, so, well, Debezium does not really have that amount of traces that we could get from it. We have mostly JMX metrics, you know, from the Java environment, so that's what we analyze. I'm not sure how I can answer the question any further, I'm sorry; we can discuss it later, I'll come to you. Okay, so my question is about long-running tests. Sometimes a performance degradation is visible only after a long run, for example one week, a couple of weeks, sometimes even more. How do you address this in your process, or how do you recommend addressing this problem? Yes.
So for this, actually, colleagues of mine in our open source organization are also developing a long-running cluster environment, something like that. Because, you know, having a long-running thing is complicated in itself: you have to manage it a lot, especially on OpenShift or Kubernetes, where these things are a little problematic because of the upgrades and such. So we have not dealt with that yet, but we are planning, once we have everything prepared for the databases and everything, to get it up and running on the long-running clusters and regularly do the performance tests over, I would say, a month or something like that. Or a week; that is usually enough, especially when you set all the numbers to the low ranges for the retention, for the memory and all this kind of stuff. It doesn't take too long to fill everything up, and then you will start to see the retention and flushes and everything. So yeah, that's our plan, but we haven't done it yet. But if you are interested in that, you should definitely look at the repositories we have on GitHub, because it could be useful. Do you have any tips for running performance tests in the cloud? Because for me that's quite the opposite of dedicated test runners, but when the software finally runs in the cloud, you should probably also performance test it there. It's a problem. A big one. We have tried it, and it is so inconsistent; the results are all over the place. If you have two identical clusters, Kubernetes or OpenShift, it doesn't matter actually, and you run the tests on those clusters at the same time, with the clusters in different zones on AWS, you will get entirely different results, because of all the load balancers and these kinds of things. If you have only internal communication on the cluster, without anything coming from the outside...
It could actually be doable, I think. But if you have any communication during the test that goes outside the cluster, through the load balancers and so on, I think that's not doable in any way, because you don't know what latency you will have for each of those requests and round trips. So I think that would be really problematic; but if you can mock the external communication with some internal endpoint, it should be quite okay-ish, I would say. You will still not get really good results from that, I think, even if you try more and more. But I think there are some special Kubernetes builds that should be usable for these kinds of measurements; I have never actually tried them, so I cannot recommend them until I have tried them. But yeah, this is definitely a problem. Okay, do I have more questions? No, we have a few more minutes for questions. Come on. Otherwise, I'm going to ask you to, you know, move your seats. Wait, wait. You said you would want to have a very small sampling interval, down in the milliseconds. So doesn't that create problems of its own, something like noisy neighbors and so on? Yes, it does. It does. Right, but how big of a problem is that? Well, that's the thing. We are really thinking about writing some scraper that is fast enough for this, and yes, you will probably create some problems along the way, especially if you would like to send the metrics directly to Prometheus every millisecond; you will probably fill up the network line or the TCP stack or whatever, because it's really fast. It will strongly depend on the machine you are running on: if you have space there, if you have a lot of RAM, you could actually batch all the metrics and send them as one package after the test is done. But yes, that's actually what we are now fighting with, and we are trying to figure out how we are going to approach this.
But mostly we are thinking that we will do some kind of configurable scraper that will either batch the requests or send them directly, or something like that. I cannot tell you yet what problems it creates, because we haven't actually tried it, especially with batching; I have counted it up, and metrics aren't small, actually, so they will take a lot of space in memory. So we will have to try it and somehow figure it out. But without the fast scraper, it will give you real headaches, because you will try to find something, fight something, and you will spend a long time debugging it, and then you will find that the scraper hit the wrong time, every time. So we have to deal with this in some way, but it will be hard and problematic. I think we have time for one last question. No one? Tough crowd. Thank you very much.
squash the flakes! - how to minimize the impact of flaky tests
Come on, people! Yeah, let's cheer for Daniel, because he's a first-time speaker and everything's failing. And it's off to a good start. Yeah, come on, big applause. Thank you. You're doing awesome. And you know, the only certain thing about technology is that it's going to fail exactly when it doesn't need to. Yeah, like I think I said already, flakiness is not only happening in tests, obviously, right? So, while we're waiting for this thing to happen, I could ask a question: who actually has an idea what a flake would be in testing? Okay, I should just repeat what you're going to say. Yeah, yeah, go ahead. So you have an idea, but you don't want to tell me. Exactly, exactly. So, to me, or I think to most people that agree about this topic, a flaky test is a test that fails and passes in successive runs without changing any code: neither testing code nor the underlying production code. Okay. So, yeah, this talk will be about flaky tests. Yeah, of course, of course, flaky behavior is not determined by just the test being flaky, but also the software. But I would divide those two kinds into different categories, and how they are handled is different. So, yeah, but let's wait. So, yeah, I'm going to start with the introduction. My name is Daniel Hiller. I'm working at Red Hat, I'm working on the upstream KubeVirt project, and there I'm maintaining the KubeVirt CI system. So this talk will be about flaky tests and how we should, or how we are actually going to, handle them in our community of KubeVirt contributors.
So I don't say I have the silver bullet for handling that; I would be happy to have any input from you folks on how we can improve, and I would actually also want to have some kind of extended Q&A session, if the time is still there, so that you might talk about what you have experienced and how you are going to handle it. Just as a quick note on how I think this should be going: I'm going to start with what a flake is, but yeah, you described it perfectly already, so it's fine; then what the impact of flakes is; then how we are doing this, or how we can find flakes somehow; then how the flake process works and what tools we have that support this; and in the end, I just want to describe what we're aiming to do in the future to improve this. I just don't have internet for some reason. Oh, no. My email, okay. Yeah, yeah, I think it's going really terribly wrong. Sorry for all that, by the way. A packed room, I didn't expect that, to be honest, so thank you all for coming, really great. I'm gonna help you out, don't worry. So, tell me a little bit more while we wait for the slides. Can you give us a hint as to what you wanted to show us, and just tell us the story about it? Yeah, without the slides I'm just going to open it up a bit. Pretend I'm stupid and I have no idea what's flaky, and just, you know, tell it to me. So, I told you already about the agenda, and the question of what flakes are was already answered, so I have two other questions. The first one is somehow a little bit suggestive, I guess: who thinks handling flakes is important? Like, put your hand up. A few of you don't. Of course everyone thinks handling flakes is important, okay, I thought about that. So. You saved my day. Do you have a USB port? I hope so. Once again, it's on, you need to put it in presentation mode. There, on the right, should be presentation.
Yeah, that should be okay. Yeah, okay. So, the questions we already had. Yeah, and another question: who has to deal with flakes on a regular basis? Wow, okay. Yeah, I expected something like that. So yeah, like you correctly said already, flakes are caused either by production code, which is a bug, of course, or also by flaky test code; this is also a bug, but it's handled differently, like I already said. So, we are using Prow for our CI system, which comes from the Kubernetes ecosystem. I'm not sure whether you're familiar with it, but it's pretty flexible and it can just start jobs from GitHub events, which is exactly what we want and what we need. This picture actually shows, on the top, for example, the commit ID; I don't even see it, but there, where I'm pointing, this is the commit ID, and these are the job runs that are defined, like the jobs on the CI system. And this, of course, is a failed job, and these are successful jobs. So obviously you can see: this is the PR history for one of our PRs inside the KubeVirt CI, and what you can see here is that the jobs all run on the same commit ID, but some failed and some succeeded. And that's exactly how we see where we have our flakiness. Oh, wait a second, that's the wrong direction. Okay. So, there is a really interesting survey, a major survey about flakiness in tests, which is just called "A Survey of Flaky Tests"; the title is not really impressive, but there is great stuff inside. There you can read that 79 percent of the flakes were for the lungs, and that more than 50 percent of flakes could not be reproduced in isolation. Which of course leads us to the conclusion that ignoring a test flagged as flaky is okay, right?
It isn't, of course. So when we're talking about CI, we want a reliable signal of stability, because of course we want to know whether we can ship our product or not. Any failed test run signals us, as the CI maintainers, that the product is unstable and that we can't ship it. So if we have flakes in our CI, they give us a wrong signal — that the product is unstable and that we can't ship — and we then have to check the test code to see what exactly went wrong, and only then do we notice it's a flaky test. This of course wastes a lot of time. Not only does it waste the time of the developers themselves, who have to look at the test results and determine whether it is a flaky test or not, but also, when you have a CI that decides whether a PR can get merged based on the tests, and there is a failed test result, of course the merge will not go through. So this causes friction for the developers, who then have to somehow reissue another test run. If they see it's flaky and there is nothing to fix, they will just retest. And sometimes you would just think, okay, there was flakiness, I'm just going to retry — not even looking at the test results — which I would call the retest trap. And we have actually had retests — I mean, the highest number I've seen was 25 rounds of testing and retesting on the same commit. (Do I have to... oh, I have to stay here. Okay.) And another very bad thing: I guess any CI system has something like an acceleration system, where for example it tests multiple Git commits at once so that it can merge them all together. Of course, if there is a flaky test, this acceleration effect will just be reversed.
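The core signal from the PR-history slide — the same commit producing both failing and passing runs — is mechanical enough to sketch in a few lines. This is a toy illustration, not KubeVirt's actual tooling; the job names and data shape are made up:

```python
from collections import defaultdict

def find_suspected_flakes(runs):
    """runs: iterable of (commit_id, job_name, passed) tuples.

    A job is a flake suspect when, for the same commit (i.e. identical
    code), it produced both passing and failing runs."""
    seen = defaultdict(set)  # (commit, job) -> set of observed outcomes
    for commit, job, passed in runs:
        seen[(commit, job)].add(passed)
    return sorted({job for (_, job), results in seen.items() if len(results) == 2})

runs = [
    ("abc123", "e2e-network", False),  # failed on the PR...
    ("abc123", "e2e-network", True),   # ...then passed on a retest of the same commit
    ("abc123", "unit-tests", True),
    ("def456", "unit-tests", True),
]
print(find_suspected_flakes(runs))  # -> ['e2e-network']
```

As the talk notes, this only tells you a job *might* be flaky — the root cause can still be in production code or in the test itself.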
It will not be effective — yeah, like I said, another wasted thing. Flaky tests also cause trust issues with the developers themselves, because they lose trust in automated testing. Which is really sad, because that's all we want: we want to trust the tests. But if we can't, then we are just ignoring test results, which is not a good idea. So we want to minimize the impact in our CI, so that people don't experience that much friction. Time flies, so: what we do is we quarantine those tests. We take them out of the set of stable tests and put them in another set, so that they are not run during pull request runs. We want to do that as early as possible, when we detect the flakiness, but only for as long as necessary, because the tests themselves of course have value — otherwise they wouldn't be there. What do we need for that? We need some mechanism to move a test from the set of stable tests to the set of quarantined tests. Of course, we also need a report on the flakiness, so we can triage which flaky tests we need to act upon first — if you have a lot of flaky tests, that matters, because the higher the flakiness of a test, the higher the impact. And yeah, lots of data, because you need to somehow analyze whether a test is even flaky or not. So, as I already described, this is the latest commit on a merged PR, where we have some failing test runs which later got green on the same commit — so, no changes to the code. This of course does not prove that it actually is flaky, but it might be, and like you said, it could either be in the production code or in the test code itself. But that doesn't matter in the end; the problem we have is the friction in CI and the wasted resources. So our flake process is pretty rough, I'd say, pretty easy. We have regular meetings where we look at the
results and at the flakes, and then we decide what we want to do with those flakes. So first of all, of course, you have to know whether a test is flaky or not. You look at the test results and decide whom you should contact so that they fix it — because we don't fix the tests ourselves, we let the developers do that: they created their mess, they should clean it up. A problem, of course, is when people are gone from the project; then someone else has to care. So we hand the flaky tests to the developers, and when they have been corrected, we bring those tests back in. The tooling we have is the mechanism that decides whether a test is run for the pull request. It's just a note on the test itself: in the test name there is this "quarantine" keyword, which makes the test get ignored for the pull request runs. We still run those tests, to keep the stability signal, but not in the presubmits which are required for the pull request merges — only in the periodic runs, which run, I think, three times a day. So we still have a signal telling us when we can take a test back in, to have the value added again. Another thing is, of course, the reports. This is a not really nice-looking but efficient thing, a heat map, where you see where the action is going on. The more reddish the colors get, the worse the problem is. This is in... oh no, I can't go there. So on the top you can just see, for each day, how many failures occurred, and there is another axis, which is the per-lane failures, so that we can pretty much see which lane is flaky and where the biggest impact was. This is the first time I'm using this, sorry.
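The quarantine mechanism described above — a keyword in the test name that excludes the test from PR-gating runs while periodics keep running it — could be sketched like this. The `[QUARANTINE]` tag and the test names are illustrative assumptions; the talk only says there is a "quarantine" word in the name:

```python
QUARANTINE_TAG = "[QUARANTINE]"  # assumed spelling of the keyword

def split_tests(test_names):
    """Partition tests into the stable set (run on every pull request)
    and the quarantined set (run only in periodic jobs, to keep a
    stability signal until the test can be brought back)."""
    stable = [t for t in test_names if QUARANTINE_TAG not in t]
    quarantined = [t for t in test_names if QUARANTINE_TAG in t]
    return stable, quarantined

tests = [
    "migration succeeds with shared storage",
    "[QUARANTINE] hotplug volume survives restart",
    "VM boots with default settings",
]
stable, quarantined = split_tests(tests)
# Presubmit (PR-gating) lanes would run only `stable`;
# periodic lanes (e.g. three times a day) still run everything.
```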
I'm just always switching the directions. Okay, this is the detailed report about how flaky a test is, or how flaky those tests are. It is ordered by the number of failures occurring per test. It's a bit overwhelming, I think, but in the left column you see the test name, and across the top you see the test lanes — the lanes for the latest three versions. We have a lot of test lanes, with different SIGs maintaining them, and this obviously creates a matrix of at least 12 really important lanes which absolutely have to be stable. This helps us find which tests we should look at and quarantine, and which we shouldn't. We also have long-term metrics where we can see how we were doing in the past — everyone of course wants to know whether they are improving or getting worse at handling flakes — so we have long-term metrics we can look at. How many merges per day, for example, or how many merged PRs with zero retests, which is the metric we currently measure against the most, because obviously that number should be 28 of 28, but we seldom reach that. We also have a small report of the tests that are currently quarantined, so that we can find them quickly — grepping over the code base is of course also doable, but it is easier to have a report we can look at straight away during our meetings. And then finally we have Testgrid, which also collects all the periodic results, so that we can deduce whether the tests have been stable or not.
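The test-by-lane matrix described above, ordered by failure count, boils down to a small aggregation. A minimal sketch — the test and lane names are invented for illustration:

```python
from collections import Counter, defaultdict

def failure_report(failures):
    """failures: iterable of (test_name, lane) pairs, one per failed run.
    Returns tests sorted by total failures (worst first), each with its
    per-lane breakdown -- roughly the matrix the report displays."""
    per_test = defaultdict(Counter)
    for test, lane in failures:
        per_test[test][lane] += 1
    return sorted(((sum(lanes.values()), test, dict(lanes))
                   for test, lanes in per_test.items()), reverse=True)

failures = [
    ("vm-migration", "sig-compute-1.29"),
    ("vm-migration", "sig-compute-1.28"),
    ("vm-migration", "sig-compute-1.29"),
    ("hotplug-disk", "sig-storage-1.29"),
]
for total, test, lanes in failure_report(failures):
    print(total, test, lanes)
```

Sorting by total failures first is exactly the triage order the talk describes: the flakiest tests have the highest impact, so they get quarantined first.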
So this is the tool — I guess the folks from the Kubernetes ecosystem know it, because Kubernetes also uses Testgrid for collecting all the test results, so that you can quickly drill down. We have also established another lane that checks the tests for stability, which makes things like test dependencies visible. I guess you know what a test dependency is: some test that hasn't cleaned up and left a mess for other tests, influencing them so that they might fail — or the other way around, a test might only pass because a previous test's leftover state was already sufficient for the following tests. If you just randomize the test order, you catch those cases — you have to have isolated test cases, right? It also tries to run each test five times, because, like I said before, in that survey a bit more than 80 percent of the flaky tests failed within about five runs. It's not that you catch all of them, but the majority. And yeah, that's just the ci-search tool. So, in a nutshell, we hold meetings at regular intervals where we look over the data, like I described before. What we want to do is of course collect even more data: we want to run the majority of tests in the same way as we do in the flake lane — running them five times, one after another, and always randomizing the order — so that we have a better picture of how flaky our code base is. And of course we want to avoid this retest problem, where you blindly just retest your things, so we are looking for ways to directly find that case. Yeah, I've been running through pretty quickly — any questions? "So you've been talking about the responsibility of devs to fix the flakiness. This kind of assumes that the flakiness is introduced either by new tests, or by changes to tests, or changes to the code base. But what about flakiness that is introduced by
your infrastructure, actually — like network latency or things like that? Do you have those problems, or is it something that you..." I didn't get that — could you repeat the last sentence, sorry? "Sorry — so you imply that the flakiness can either be introduced in new tests, or changes in tests, or changes in the code base. But have you ever been confronted with flakiness introduced by your infrastructure, like network latency or something like that, and how do you detect it?" Of course, of course, that is also a problem. But when you have flakiness in your test infrastructure, or even failures in the test infrastructure, that's an entirely different problem. What we have observed there is that a lot of tests tend to fail at once, and that's what we look at first of all: when, as a rough estimate, we have more than 20 tests failing in one run, that is likely because the test infrastructure is failing. In that case, we just quickly verify that there is something going on in the infrastructure, and then disregard that run. In earlier days we had that problem pretty often, but recently it hasn't been happening anymore — or much less, let's put it like that. Of course we look at... so, what we have to test are e2e tests. KubeVirt is a complex system — it's an addition on top of Kubernetes so that you can run virtual machines — and for testing that end to end you need a full Kubernetes cluster, on which you deploy KubeVirt. That's what we do in the CI: we actually spin up, I would say, a frozen cluster — virtualized nodes that have been frozen and that are spun up on demand. This takes around one and a half minutes, and then you run all those tests. And we always have three versions of the... "Thank you very much, we are running out of time." Yeah, you can continue outside. Thank you.
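The "flake lane" mechanics this talk describes — run each test several times and randomize the order each pass — can be sketched as a toy harness. The real lane runs full e2e suites under Prow; the names and pass/fail callables here are illustrative:

```python
import random

def flake_lane(tests, repetitions=5, seed=None):
    """tests: dict of name -> zero-argument callable returning True/False.
    Runs the whole suite `repetitions` times, shuffling the order each
    time (randomized order exposes hidden inter-test dependencies), and
    reports tests whose results were inconsistent across runs."""
    rng = random.Random(seed)
    outcomes = {name: set() for name in tests}
    order = list(tests)
    for _ in range(repetitions):
        rng.shuffle(order)
        for name in order:
            outcomes[name].add(tests[name]())
    return sorted(name for name, seen in outcomes.items() if len(seen) == 2)

# A test that fails its first two runs and then passes -- the kind of
# flake the survey suggests usually shows up within about five repetitions.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    return calls["n"] > 2

def stable():
    return True

print(flake_lane({"flaky": flaky, "stable": stable}))  # -> ['flaky']
```

Five repetitions won't catch every flake, as the speaker says — only the majority.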
From "Free-Lunch" to Dog-Fooding: A Culinary Journey in Crafting IaC for Kairos Testing and Building
Hello. Hello. All right. Hello. Welcome. Thank you for joining this talk. This is about infrastructure as code, and mostly about... mostly about choices. That's the main point I want to make. There will be some analogies with food, so, yeah, I hope you don't leave the room to go find food. Stay with me. ("You really need to stay close to that." Yeah, yeah, okay. Thanks.) All right, so this is me. The one piece of information you may want to keep is my Codeberg username, because the code I'm talking about — some samples and things you can copy from — is in the Codeberg repo I'm going to show later. So if there's something to keep from this slide, it's this one. And also that I'm working on Kairos, which I'm going to talk about a bit more later. That's the open-source project I'm working on. So, yeah, I said this is mostly about choices. I'm starting with food, because it's a general thing in life: when you have to judge something by just one criterion, sometimes choices seem obvious, right? So let's take... this is a popular Greek dish, moussaka, and that's the well-known burger. If we were to choose what we eat based on just one criterion — say, the amount of time you need to prepare it — the choice is obvious, right? But obviously we don't always eat burgers, for some reason. And the reason is that we have other requirements. But sometimes, when the main criterion is right there in front of us, our mind gets stuck on certain choices. We had to choose otherwise, and that's what I want to show you: our story in Kairos and how we chose to do infrastructure otherwise. So this was our problem. What is Kairos, in a sentence — no, maybe two sentences? It's an immutable OS, a special-purpose OS, mainly targeting Kubernetes. It makes Kubernetes very easy to deploy and maintain — day-two operations and such. But it's distro-agnostic.
What that means is it brings immutability on top of your favorite distribution. You start with your distribution, then you apply Kairos, let's say, on top, and it makes it immutable — safe, secure, encryption and all. But that also means, for our CI, that we have to build many different images, lots of artifacts. You see some numbers there, like the number of pipelines. And one thing to keep in mind — you all know this — is that when you're working, you don't generally want to think about whether you push something, or whether you open a pull request. You just do it; you don't want to have to think that you're going to pay for something. So initially we started — it's on GitHub, by the way, the project — with GitHub Actions, the free runners. But because of these requirements — huge disk space in some cases, two VMs in one job, KVM support, which the free runners didn't have — we were looking for a different solution. And initially we said, okay, let's put money into it, right? Money solves everything; let's just start paying for runners. In our case, that didn't work because, like I said, we didn't want to have to think before we push — and we found ourselves holding back: okay, let's not open a draft pull request, because that also runs pipelines; let's wait a bit; I'll do that tomorrow, and things like that. So we very quickly reached the point where the amount of money we were paying was double the cost of the same kind of hardware on bare metal. So what did we do? I mean, okay, we rejected that one. Dogfooding: as I said, Kairos makes it easy to deploy Kubernetes, which is maybe the hardest and most complex part if you want to start with your own hardware — how do I maintain it and such? And that's what we do as a project. So we said, okay, we're going to do that. And then some more tools: with Kairos we solved the Kubernetes problem, how we provision stuff and how we maintain it.
Then we chose Flux CD for GitOps. I can't go into the details of what these are — I hope many of you know these tools, but look them up, because I only have 10 minutes or so. SOPS for secrets: that allows you to actually commit secrets to your repository, but encrypted. It's safe to commit them because you need keys to decrypt them. But if you give the keys to Flux at runtime, they will be decrypted and deployed. That means you can have full GitOps — no manual intervention for secrets or anything. And then there is a project you probably know, the Actions Runner Controller, that allows you to run GitHub runners in a Kubernetes cluster. So this is our toolset. The next slide is completely relevant... it's just that I generated this one with AI, and it reminded me of something. After a while, I remembered what: that's my real dog there, doing the same thing. So we don't need the AI — real life does the same. So yeah, back to infrastructure. Some steps we took: you can start with a cluster on your laptop, right? Very easy to get one — K3d, kind, MicroK8s, whatever you prefer. Then you read the docs for Flux, obviously, and SOPS. You create the keys you need — I'm not going into details, but this is what we did. More documentation on how you deploy Actions Runner Controller. And when everything is working — that's the interesting part — then you can go and get some real hardware. We went for value-for-money options. We tried a couple of providers, got, I don't know, 10 different machines or something, and then we deployed. And that's the interesting part: the three commands there. How much time do I have? I think I can show you a demo. At this point, where am I? I'm there. There. No, no, no. Where are you? Where did you go? There you are. So what I'm doing here — not here, but here; I will restart it, don't worry — I start with a project that has no runners. On the left is the repo.
I'm gonna show you the link in a while. And on the right, I'm just copying and pasting three commands. There's a timer down there — it's three and a half minutes, spoiler. So with three commands I'm going to start from a cluster, from scratch. I use K3d to create the cluster. Then I apply the secrets: there are just two, one that Flux needs to pull the repository, and another one to decrypt the secrets that are encrypted with SOPS, like app keys for GitHub and such. This is the secrets command. And this is the one that creates the namespace and, finally, bootstraps the already existing repository. So after three minutes, you'll see that we get runners deployed and connected to GitHub — and that's from a scratch cluster. We actually had to do this two or three times, because we chose some hardware that didn't work due to some network issues at Hetzner, I think. Then we moved, but in moving we were afraid that, okay, we'd have to do all of that again — and it turned out to be very simple. So what I'm trying to say here is that the choice we made paid off, because every time we need to recreate the whole thing, it's just three commands, right? We create a cluster and we just spin it up. With the initial choice, we weren't sure whether we should spend the time to create all this — yeah, I described it in 10 minutes, but it took a sprint or two, like a week or two, to implement — but it turns out it pays off, because now our infrastructure is cattle, not pets anymore. We don't care; we can go somewhere else if it's cheaper. So it does pay off. I'm not sure where it is now. Yeah, I'm cheating a bit here, because I don't want to wait for the reconciliation, so I just kick the controller to force it to check. This tool is K9s, by the way — if you don't know it, check it out; it makes it extremely easy to navigate through Kubernetes resources and all.
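The three-command bootstrap the demo shows could be scripted roughly like this. The cluster name, repo URL, branch, and paths are placeholders, and the exact flags are from the `k3d`, `kubectl`, and `flux` CLIs — treat this as a sketch of the shape of the demo, not the talk's actual script:

```python
import subprocess

def commands(cluster, repo_url, branch="main", path="clusters/ci"):
    """The three steps from the demo, as argv lists."""
    return [
        # 1. a scratch cluster on the local machine
        ["k3d", "cluster", "create", cluster],
        # 2. the only manual secrets: Flux's git credentials plus the
        #    SOPS decryption key (in practice the target namespace may
        #    need to exist first)
        ["kubectl", "apply", "-f", "secrets/"],
        # 3. point Flux at the existing repo; GitOps reconciles the rest,
        #    including the Actions Runner Controller and the runners
        ["flux", "bootstrap", "git",
         f"--url={repo_url}", f"--branch={branch}", f"--path={path}"],
    ]

def bootstrap(cluster, repo_url):
    for cmd in commands(cluster, repo_url):
        subprocess.run(cmd, check=True)  # stop on the first failure
```

The point the speaker makes is exactly this property: because everything lives in Git, recreating the whole runner fleet on new hardware is these three commands again.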
So yeah, at this point — this is the Actions Runner Controller, and this will bring up the runners. When we want to make changes to this thing, we just commit. If you don't know what GitOps is — you probably do — you just make a change, you commit that to Git, you can actually review it, and this thing will apply the diff to make your cluster look like what you described. So yeah, after a while, they come up. Maybe I can quickly skip a bit. Yeah, there we are. And eventually they show up. So, going back: yeah, that was it. This is the repo — everything I showed is there. You can't really use it as-is, because some of the secrets are encrypted with my own keys, but you can copy everything else; you just have to replace them with your secrets. There are instructions — I tried to write as many docs as possible — but feel free to open issues or ask me anything. And that was a screenshot, in case I didn't have the video. So, the outcome: yes, it works. It broke sometimes initially, but it balanced out. And like I said, it really paid off, because it makes making changes extremely easy. So what I wanted you to take away is that it is possible, and it's a good idea. I'm not saying it's going to work for everybody, and I'm certainly not saying that GitHub, or the other paid options, are not good ones — different teams have different needs. But if you're thinking about it, check it out. Check out Kairos and ask us any questions. This is my email, this is the team's email — we'll get that one — and this is our Matrix channel. Whatever questions you have, we'll be happy to talk to you, and we're around. We also have a nice hoodie like this, but it's large — so first come, first served, whoever is large — and we have stickers, and we'll be happy to talk to you outside. Thanks. "Do we have time for questions? Yeah, we have." OK. Need a mic? I don't know. Microphone? Where? Or you can shout, I'll try to hear. Hold this there. Hi.
"Have you faced problems with different CPU architectures? Because sometimes it may be hard to get some types of hosts, like ARM versus x86." Are you talking about Kairos? "Yes." Yes, we're not trying to test... we're not trying to specifically... so the architecture of Kairos had problems with the... Can you repeat the question? "So you run the tests, but it's mostly based on containers locally — have you found problems testing on different CPU architectures?" Yeah, got you now. No, the tests mainly run with QEMU — that's why we need big machines, because it's a full OS. So we do test ARM as well. It's just a bit hard to test boards, because Raspberry Pis are just not as easy to automate; you need some KVM or something. So yeah, mainly QEMU for ARM, but not the actual boards. So sometimes things break. But yeah, that's a Kairos question. Thanks. Anybody else? We can take one more, I guess. Yeah, there you go. "I saw that you're using the summerwind runners. Those are the old ones, right?" Yeah — no, sorry, I realized that when I was running it; yeah, we have to update. You saw that. "You're going to switch to the new GitHub-supported runners?" Ah, they changed the images, you mean? "Yeah, I think GitHub adopted the actions-runner-controller." Ah, I didn't even notice that. Yeah, sorry. Thanks, we'll do that. Last question? No? OK. Thank you. Thank you.
Chaos Engineering in Action: Enhancing Resilience in Strimzi
This is nice. So hello, guys. Today we have prepared a presentation about chaos engineering in action. I am Marj Orsak and this is Henrik Srenching, and we both work as software engineers at Red Hat. Today we have also prepared a quiz — you can see the QR code; you can scan it with your phone, and if you are quick enough and get the correct answers, you can win a prize. So over to you, Henrik. Yeah, so the content of the presentation is as follows: we will begin with a brief explanation of chaos engineering, then we will describe how the target systems may actually look in production, then we will turn our focus to designing the chaos. Afterwards there will be two brief demonstrations, and then a quick conclusion on how to actually work with chaos. So, when we are thinking about system resilience or application resilience, we have to think about all the components our application depends upon — that means other components and other services. There is also a big dependency on the network and the infrastructure. All of these things are most visible in distributed systems. There are many known fallacies about distributed systems, mostly concerning the network and bandwidth. When we then look at a system from the viewpoint of many instances and services which have to communicate with each other for the system to work, we come to the problem of complexity, and the fact that there is possibly no single person who understands the system completely, including every state the system can get into. So what can happen — and what will probably, inevitably, happen in a system of such magnitude — is that one or more instances will crash. This is the story of Chaos Monkey, which I guess some of you may be familiar with; all we have to know so far is that it was one of the first chaos tools, which just randomly killed instances in production, forcing engineers to take proactive action to make the system more resilient.
We can take this a step further and bring down not just a few instances but an availability zone or a cluster, or break some kind of network traffic, and get the system into a state we are not so comfortable with in a production environment. So we get to the definition of chaos engineering: it is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. This may sound weird, because why would anyone want to bring chaos into production? Isn't that something we should actually avoid? The real reason for doing so is the time difference: it's much easier to solve problems at 4 p.m. rather than at 4 a.m., when you are under high pressure from the customers to solve them. There are many principles we have to abide by — or should abide by — in chaos engineering. The first, and most important, is a minimal blast radius for each experiment you conduct. We should imagine a red button for each experiment, which can stop it in case anything goes wrong. The other principles are mostly focused on the same thing: testing things as they are in real life. We want to focus on how it actually works in production, we want to make sure it works correctly, and we want to introduce the problems that may happen in real life. The last principle is continuous runs, which is basically about running these tests or experiments as often as possible and as effortlessly as possible. Now, over to the target system. This all started with the monolith architecture, where you get one box, one backend, one database and one UI. In terms of complexity it was quite low: you simply get some user connections, and the load on the server was not so high.
Then after some time you add more and more customers — let's say four or five thousand — and the load gets pretty high, and the server might just crash. Such an architecture is really hard to scale horizontally; one way to tackle this problem is to scale vertically, but you can't scale vertically forever. The second point is that the fault tolerance of such an architecture is really bad: you just take down one node, the server immediately crashes, and the users will be really sad because they don't get any response. Then Docker came, with the microservice architecture, where all of the previous improved: we got portability and isolation. We somehow got better horizontal scaling, but when you have, say, thousands of instances, it would be quite hard to manage all of those containers. On the other hand, the complexity here also increased, because of the network traffic and more. And so Kubernetes came, to solve horizontal scalability. In Kubernetes, if you want one replica of the system, you just type it in the YAML file, apply it, and Kubernetes will do it. Then if you see your server crashing, or somehow overloaded with requests, you simply set it to three, and Kubernetes will do it. The same with fault tolerance: if you, I don't know, inject some disruptions or something else into the pods, one will still be up if you only target two of them. But still, complexity increased again. And so we are in the operator stage, where no one can entirely grasp the system's behavior. And I want to present one such operator: Strimzi. Strimzi is basically Apache Kafka at its core, encapsulated in the Kubernetes system. On top of that you get some operators which simplify upgrades and dynamic configuration; there is tracing, more security involved, also Grafana dashboards. And it is part of the Cloud Native Computing Foundation.
But that's quite tough — too many unknowns, right? So let's break this down. Apache Kafka has a lot of buzzwords, as you can see: publish-subscribe model, it is a messaging system, and so on — but that still doesn't help, right? So let's move to the basics of Kafka. We have some producers — not these ones, but some clients, right? These clients send messages to the broker. They are happy because the connection is up. We could also scale the system: we could create another Kafka broker, set up some listeners, and another one. We have a second set of clients, called consumers, and they simply receive this data. So we have this simple example of the system, where you have producers and consumers, but we also need some representation of the data, which is Kafka topics. Also, each Kafka broker has its own configuration, and you can set versions, set up in-sync replicas — but this is not important for this talk. So we have a lot of buzzwords, as you can see, but unfortunately — or maybe fortunately — we don't have time for all of them. So we can stick with this model for now. We have the producers, we have the consumers, we have some brokers, which are the servers — and what if we encapsulate the system in Kubernetes? On top of that we add some operators managing the Kafka ecosystem, and this is basically Strimzi. Really complex, right? So here we can see an example deployment of Strimzi, where you have a lot of connections. These components are not really important now; the main idea is that even with this small deployment, you get a lot of places where you can inject the chaos. So now, if we go to production — one such production environment is the scale job — and before I dig into it, I want to thank these guys, because without them we would be unable to run such chaos at such a massive scale.
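The producer → broker/topic → consumer model sketched on the slides can be illustrated with a tiny in-memory stand-in. This is just the messaging model, not the Kafka protocol — real clients would talk to an actual broker through a Kafka client library:

```python
from collections import defaultdict

class ToyBroker:
    """Minimal stand-in for a Kafka broker: a topic is an append-only
    log, and each consumer tracks its own offset -- so reading doesn't
    remove messages, just like in Kafka."""
    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic, offset):
        """Return (messages_after_offset, new_offset) for one consumer."""
        log = self.topics[topic]
        return log[offset:], len(log)

broker = ToyBroker()
broker.produce("orders", "order-1")
broker.produce("orders", "order-2")

# Two independent consumers, each with its own offset:
msgs_a, offset_a = broker.consume("orders", 0)
msgs_b, offset_b = broker.consume("orders", 0)
print(msgs_a)  # -> ['order-1', 'order-2'] -- both consumers see the full log
```

The per-consumer offset is the design choice that lets many consumers read the same topic independently — one of the reasons a real deployment has so many connections to inject chaos into.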
So, as I said, the scale job is the production environment for Strimzi and other projects, and there are a lot of technologies involved, such as Tekton Pipelines, Thanos, Prometheus, Grafana, Loki for logs, and more. And here you can see a basic example of how we introduce the chaos. We have some Kafka clients — streams, producers, consumers — with some databases, communicating through Kafka Connect. We have a MirrorMaker which transfers data from Kafka A to Kafka B — but that is not the point of this slide. There are a lot of connections. So, over to you, Henrik. Thanks. So the point of these slides was to show, or somehow explain, that when we come to a system and take a first look, it may look quite messy and quite complicated. We may not understand the whole underlying technology stack or every single component, and we are in the position of wanting to talk about how the system behaves when we introduce chaos, while we are not even sure how it should behave normally. That's made even harder by the fact that the system doesn't behave as it does on paper; in actuality there are countless instances and connections, operators, clients, network traffic. We need some sort of observability and some intuition about the system. As in the other presentations before us, there were already mentions of Prometheus and Grafana; they are quite famous for this purpose, so we will be using them as well. As mentioned, we need some intuition about the system and how it behaves — without that, it is just a mess. So before we actually introduce chaos into the system, we start a search for the problematic parts of the system, for what we actually want to focus on. It is a simple process, where we basically take a simple look at the system.
Take a look at what the critical components are, where some possible bottlenecks are, whether some parts of the network are really critical here, whether there are real-world events that can make the system vulnerable for some time, like rolling updates or nodes restarting in the cloud, and things like that. What would be really helpful is to collaborate with all the people involved in the system. We definitely need some input from the devs. We need at least some basic information about the architectural components. What we may come up with is a simple document describing all the important parts, the things that may occur there, the protocols that are involved, and we will naturally come to the important configuration parameters and maybe even some proposals for simple chaos that could be injected. So the output of this, in reality, is a first look at the parts of the system which may actually be targeted for simple chaos. Now that we have some first insight into what could be our first guess to start the chaos with, we may focus on concrete chaos and start with some simple experiments. Now, how to actually formulate some kind of hypothesis, some sort of experiment? We will take a look at a specific thing, just part of the system or a few components. Say we decide to make sure that the core part of our system is actually capable of withstanding some instances being lost or having failures. Because this is still a production environment, and although it was even in the main principles of chaos engineering, we don't want to start with chaos in the production environment. I guess everyone here knows why: the first intern who tries to introduce some chaos will bring down all the instances, the service will not be available for a whole day, and good luck explaining that to your boss.
So we will probably start at a smaller scale, in a stage environment, with much smaller traffic and much smaller stakes; let's say there will be some clients, maybe just a random fraction, a few instances and a few controllers. We will start by making sure that the system is in a steady state: we have our instances up and running. When we are sure about it, we can introduce the chaos. When we introduce the chaos, instances go down, and afterwards the system stabilizes by bringing the instances back up. During all this time we are observing all the important metrics and parameters of the system; for example, it could be messages per second. Now that all that is set and done, we can actually implement our chaos. What can be really helpful for this are chaos tools. We will not describe all of them, but simply mention that there is Chaos Mesh, Kraken, Litmus, and some other choices. They will help with definition, evaluation, execution and all the other stuff. We will end up with very simple YAML files to be executed. Now we can actually execute our chaos and see that everything went as expected. There was a small decrease in the traffic, but overall the system got to the desired state after a while. Okay, this was the first experiment, in stage. Everything went great. We've got the good feeling of resilience being confirmed in our system, but what we are supposed to do now is repeat the experiment, scale it up a bit, go into production, and really make sure, because it is this production environment where we will get the confidence. What may happen is that it will not at all go according to plan. It will fail miserably, and this is also the reason why we should scale these experiments up slowly, and why we eventually want to run them in production: because we want to make really sure that this environment, which is so important for us, is actually able to handle the problem.
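The "messages per second" steady-state metric mentioned above can be watched in Grafana with a Prometheus query. A sketch, assuming the JMX metric names produced by Strimzi's example metrics configuration (the exact name depends on your exporter rules):

```promql
# Cluster-wide incoming message rate, summed over all brokers
sum(rate(kafka_server_brokertopicmetrics_messagesin_total[1m]))
```

Watching this single number before, during, and after an experiment is often enough to tell whether the chaos stayed within the hypothesis or cascaded.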
So as I said, no reason to despair, just keep on, try it in stage, and definitely start slowly. So, to the demos. Okay, so today we prepared two demos for you. The first one is the brokers' failure. Here we will target the Kafka pods. We have seven replicas of Kafka and we will be targeting three of them. The observability, the metrics we would gather, would be throughput, CPU, memory and the traffic in the Kafka pods. Then we also define a steady state, which is basically that all broker and client replicas are okay and the communication throughput stays stable even when we inject the chaos. And if we define the hypothesis, it would be: we will eliminate three of the Kafka pods, this will not eventually cause some cascading failure, and we will be okay; users will not be affected by this. We will also have some checks on the producers and Kafka pods. So let's move on to the demo, and hopefully it will somehow work. Okay, so here we've got the setup. We have a Kafka cluster, we have some nodes, we have producers. Here the pod chaos is defined. We have mode fixed, targeting a value of three. Three means that we will be targeting three pods, which will be unable to run, and the duration for this will be three minutes. So let's try our script to inject the chaos. Yeah, so now we are injecting the chaos and we see that three of the pods are not running. We move to the Grafana dashboards where we have some metrics. Here's a really simple, not production-ready, messages-per-second panel, as you can see. Now you can see here the immediate decrease in connections. The average number of messages also decreases, but Kafka recovers even while the pods are down. So here's the decrease, but after a while we see that it eventually recovers somehow. Yeah, and as we can see, we've also got the brokers online count at four. It's correct now. There are also some under-replicated partitions.
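The pod chaos just described — fixed mode, value three, three-minute duration — maps naturally onto a Chaos Mesh PodChaos resource. A sketch, with the namespace and label selector as assumptions (Strimzi labels broker pods with `strimzi.io/name: <cluster>-kafka`):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kafka-pod-failure       # illustrative name
  namespace: kafka              # assumed namespace
spec:
  action: pod-failure           # pods become unable to run, matching the demo
  mode: fixed                   # pick an exact number of pods...
  value: "3"                    # ...three of the seven replicas
  duration: "3m"                # three minutes, as in the demo
  selector:
    namespaces:
      - kafka
    labelSelectors:
      strimzi.io/name: my-cluster-kafka   # assumed cluster name
```

Applying this and deleting it (or letting the duration expire) is what the injection script in the demo automates.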
Yeah, Kafka is okay now, and after this the experiment will be done. I think it could be done. Yeah, so now we do the checks. We are checking the Strimzi pods and the Kafka resources, which are just internal custom resources of Strimzi, and now the Kafka pods are ready. We're done, and in the Grafana dashboards we see that the brokers go back online and the under-replicated partitions all go to zero. And here it is. Okay, so this was the first demo, and we've also got a second one. This is basically a worker node crash, and to quickly describe it: the topology is that we have the producer, we have Kafka A and Kafka B with some consumer, and in the middle there is Kafka MirrorMaker, which basically just transfers data from Kafka A to Kafka B. The steady state, again, is that all services are fully available and ready to accept traffic. We made the hypothesis that eliminating one of the Kubernetes worker nodes will not bring any services down, and also the producers and consumers will not be affected; they will simply keep sending messages without any harm. So let's move on to demo two. I will show you the important things. We've got the source Kafka cluster, the target Kafka cluster, the MirrorMaker, we have some worker nodes, and we inject the chaos. We also create continuous clients, producer and consumer, and that's to check for correctness: that all messages are sent and also received without any harm, no connection refused or anything like that. So we now crash the worker node. We will see that the worker node moves from the ready state to not ready. Here it is, it's not ready, but the clients are successful and happily sending and receiving messages. The script just checks that the worker node is still not ready, and we are waiting for the recovery. It will take some time. And so now the worker node moves back to the ready state.
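For reference, the MirrorMaker piece in this topology is itself a Strimzi custom resource (KafkaMirrorMaker2, which runs on Kafka Connect). A hedged sketch, where cluster names, bootstrap addresses and the version are made up for illustration:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: my-mirror-maker         # illustrative name
spec:
  version: 3.6.0                # assumed version
  replicas: 1
  connectCluster: "target"      # the Connect workers run next to the target cluster
  clusters:
    - alias: "source"
      bootstrapServers: kafka-a-kafka-bootstrap:9092   # assumed service name
    - alias: "target"
      bootstrapServers: kafka-b-kafka-bootstrap:9092   # assumed service name
  mirrors:
    - sourceCluster: "source"
      targetCluster: "target"
      topicsPattern: ".*"       # mirror every topic from Kafka A to Kafka B
      sourceConnector:
        config:
          replication.factor: 1
```

Because the MirrorMaker pods run on the worker nodes too, a node crash exercises both the clients and this replication path at once.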
We can see that all the containers which were affected on that specific node are being created again, and the producers and consumers are still sending and receiving messages. We do some checks again on the stateful sets. Yes, this is okay. The target cluster recovered as well. We also do checks for the MirrorMaker. And the script just finishes successfully and we are happy. Okay. So those were the two demos, and for the last words, over to Henrik. Yeah. So as you could see in the demonstration, the benefit of the chaos, or the execution of the chaos, was a bit different from the testing we are used to. There was quite a big hype about chaos engineering and the possible benefits it can bring to your organization. Yes, it can definitely reveal bugs in production. You can drastically improve the actual performance there, or the situation in the cluster regarding the resilience of the system. But the main benefit of doing such a thing is getting confidence in the system and finding misconfigurations. Those of you who have tried running an application in Kubernetes know how important it is to have all the volumes and all the liveness and readiness checks set correctly, and the overall infrastructure set in place. The greatest benefit is in fact getting experience and new knowledge about the system, and really understanding how it is supposed to work. This is not a holy grail, as I said, and it can be a bit disappointing for some. But if we think about chaos engineering as a natural step above the other kinds of testing, and not their replacement, we can see a great benefit in it. So how can we actually embrace it in our organization? A very well-known concept is game days, where we put together a lot of roles and a lot of people from our organization, introduce some kind of chaos, and let them handle it in some reasonable manner, where they can all communicate, all contribute, and fix the problem in a reasonable time. That's a friendly way to start with it.
Know your tools. I know it can be overwhelming. You could see it even in the demo: we had to introduce quite a lot of tools in order to run even simple experiments. But once you know the basics and have some confidence in them, you can really start to make some kind of chaos. We can really recommend some great books about chaos engineering, and about Kafka if you want, but still, there are a lot of tools, and what is most important is to definitely start small. Don't be afraid to set up a stage environment where you can actually practice and confirm your hypothesis before you go into production and start doing mayhem. Thank you for your attention. Really appreciate it. Questions? No time? One question. Question? Yeah? Yes, there are. It actually depends. In practical terms, it mostly starts to make sense when we are talking not about some kind of monolithic application, but about something actually deployed on a cloud, some kind of microservices architecture. I would say that it does not depend as much on the size of the system as on how much you depend on the customer experience, in a sense: when it would really be detrimental for your system to get into a chaotic condition. But yeah, thank you as well.
Progressive Delivery Made Easy with Argo Rollouts
Thank you for being here. I'm going to talk about progressive delivery, and hopefully by the end of this talk you're going to know how to easily do canary deployments on Kubernetes. Who is using Kubernetes today? Raise your hand, please. Everybody. I'm not asking if everybody knows what Kubernetes is, because then you're in the wrong place. I'm a principal scientist on the Adobe Experience Manager Cloud Service. This is a content management system. I'm a long-time open source contributor to Maven, Jenkins, Puppet and a few other things. I'm also part of the Google Developer Experts program. But probably most of you know me because of what I did with Jenkins on Kubernetes. Some people will love it, some people will hate me. We'll talk about that later. Actually, just before this talk, 15 minutes before, I realized: oh, this was 10 years ago. Time flies. Back then people didn't know what Kubernetes was. So, what is progressive delivery? This came in August 2018; this is when the term was coined on the LaunchDarkly blog. It was also picked up by RedMonk. And I said, this is a great name for these things that everybody knows about, and the name sums up very well what we're trying to do. So I said, I'm going to steal this. That's the gist of it. So, it includes deployment strategies that avoid this: I'm going to push this new version to all my nodes, all my containers, all my files, whatever it is that you're running, and if it breaks, it breaks for everybody. We want to avoid that. With progressive delivery, you have new versions that do not replace the existing versions; you have both old and new versions running in parallel for an amount of time. The interesting part is that this is happening in production, and you can evaluate both the old version and the new version during a period of time that you decide is right for you.
And only then do you say that this is a success that you need to roll out to everybody, to all your customers. So, continuous delivery is hard. I like to say that progressive delivery makes continuous delivery easier to adopt, because it reduces a lot of the risk associated with continuous delivery. Yeah, it's great that you commit something to main and it gets pushed to everybody. But what if that breaks in production? Then you have these methods behind progressive delivery that will prevent you from breaking things, and give you guardrails that will protect your users. The key points: avoid downtime; limit the blast radius, so when you deploy something, it only affects a subset of your users, not all of them; and shorten the time from idea to production. From the time you create a commit until you push it to production, you can use these techniques to shorten that time as much as possible. And if it's not affecting your live customers, it could affect maybe only internal customers, employees, something like that. So you can confidently push things to production. The name is great, but all the techniques have existed for a long time. We have rolling updates on Kubernetes. This is the standard way: when you change something in your deployment, you get a new pod with the new version, and when that pod comes up, the old pods start going away. You can configure that easily on Kubernetes: how many pods you want to come up, whether you want them to come up little by little or all at once, and they will start rolling. So that has been around on Kubernetes forever. Blue-green deployments, same thing, have been around forever, well, for some definition of forever. You have what you consider the old version, which is green, and the new version, which is blue, or maybe the other way around, I don't know. And you have both running at the same time.
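The rolling-update knobs mentioned above live in the Deployment spec. A minimal sketch, with the app name and image as illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # illustrative name
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2              # at most 2 extra pods above the desired count during a rollout
      maxUnavailable: 1        # at most 1 pod below the desired count at any time
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: example/my-app:v2   # hypothetical image
```

Setting `maxSurge` high and `maxUnavailable` to 0 approximates "all at once with no capacity loss"; small values give the "little by little" behavior described above.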
You evaluate, or you start sending traffic to, the new version, and if something happens, you just have to flip the switch to go back to the old version. So this is a variation. The difference is that with rolling updates you don't need to have all the machines or all the containers running at the same time; with blue-green, you need to have room for both versions running at the same time. Canary deployment is one of the most interesting ones, where you send a percentage of the traffic, or a percentage of your users, a small percentage, to the new version, and you keep growing. I mean, you could just stay at a small percentage, or you could keep growing that canary percentage. A lot of companies do this. First, a change gets deployed to internal employees only, then to some countries, like New Zealand, or to a percentage of users, depending on some characteristics of them. And they keep growing this canary pool over time until they reach 100%. Feature flags are another interesting one, which allows you to push things to production behind flags so you can test them in production, and also to disable them after you deploy them. You push something and you realize it breaks, either for a lot of users or for a percentage of users; you can switch that feature off using some tool, or using something as simple as environment variables. And yeah, there are tools that allow you to manage feature flags so you don't have to deal with environment variables and things like that. Monitoring is the new testing. The goal is to know when users are experiencing issues in production, and the other characteristic is to react to the issues automatically. If you deploy something that is bad, how can you automatically roll it back before some human has to go and figure out what happened? So, did you know that 90% of outages could be solved?
There's a study that said 90% of outages could be solved with progressive delivery. Did you know that? No? Because I just made that up. And one thing you need, one requirement, is having a good amount of metrics. You need to know what's happening in your production system before you can react: which users are seeing the new version, which users are breaking with the new version, what's happening here. So you need to have this visibility. And I always love to plug DevOps Borat, who disappeared from Twitter: "To make error is human. To propagate error to all server in automatic way is devops." Raise your hand if you have broken a lot of servers automatically like this. So yeah, what I love to say is, if you're not breaking something automatically, you haven't automated enough. When you get there, it's like, okay, maybe I should step back a little bit. Until you get there, you keep automating things. Now, more to the practical side: how can I do this in Kubernetes? Introduction: who's familiar with Ingress? Ingress in Kubernetes, okay, yes. Ten years ago, this was not like this. On Kubernetes you have the load balancer and you can have services, and the load balancer sends traffic to one service or another. But this was kind of the old way. The new way is that you have Ingress controllers running on Kubernetes. Routing is typically by domain name, but you could also route on headers, all sorts of things. And the Ingress sends the traffic to whatever service you're running. So you can have one service A, one service B, with their pods, and in the Ingress you configure this domain to go to this service, and that domain to go to that other service.
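The host-based routing just described looks like this in a standard Kubernetes Ingress. A sketch with hypothetical hostnames and service names:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-routing        # illustrative name
spec:
  rules:
    - host: app-a.example.com  # this domain goes to service A
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: service-a
                port:
                  number: 80
    - host: app-b.example.com  # this domain goes to service B
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: service-b
                port:
                  number: 80
```

Which controller actually satisfies this object (NGINX, a cloud load balancer, Traefik, and so on) is decided by the cluster, not by the manifest.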
And there are a lot of Ingress controllers out there. If you run on a cloud provider, you're going to have the AWS one, the GCP one, whatever. And then you can have your own: NGINX, Ambassador, Istio, Traefik. There are a lot of them. And Argo Rollouts, anybody using Argo? Wow, okay. What are you doing here? I mean, you already know this. So it provides advanced deployment capabilities. All the things that I mentioned, blue-green, canary, canary analysis, experimentation, are variations on the same thing. Argo Rollouts provides that to you and makes it very easy to do. And the good thing is you don't need to use Argo CD to use Argo Rollouts. You don't actually need anything else to use Argo Rollouts. You can run Argo Rollouts with just Kubernetes, nothing else. You don't need external dependencies. And yeah, it allows you to do this very easily. A bit on the architecture of Argo Rollouts. We have the controller that is watching a new object called a Rollout. So Argo Rollouts has this object that can replace or complement your existing Deployments, and I'll get to that in a bit. This Rollout manages the ReplicaSets. Typically you would have your Deployments with their ReplicaSets, and now they become part of the Rollout. And it has the concept of an analysis run that will check metrics or any other external source, and this analysis will decide whether the rollout is successful or not. Based on that, it's going to cancel the rollout or keep it going. So you get the traffic coming from the Ingress into your services, and you can tell Argo, okay, send traffic to this new canary ReplicaSet or send it to the old one. For the percentage-based one, you need a service mesh.
So if you need to do something fancy, like, I want to send 1% of the traffic, or I want some traffic that matches this header, then you need something like a service mesh, or the Argo Rollouts integration with the Ingress controller. But if you use bare Kubernetes without any integration, you can still do it. Basically it will use the number of pods in the ReplicaSets. So if you have 10 pods, you can tell Argo, okay, one new pod is going to run the new version, and now you have a 10/90 split, more or less. You cannot do the fancy things that require support from the Ingress controller or a service mesh, but you can still do things. The Rollout object: you have two ways of defining the rollout. One is you replace the Deployment with the Rollout and add extra fields, or you create a Rollout that points to a Deployment. I don't much like the way of replacing the Deployment, because then people who are not aware of the Rollout objects may go and see, oh, there are no Deployments, what's going on here? So it requires you to change things. And for us, it also means you have to change runbooks, you have to change commands, people need to search the documentation and all that. I don't know why the decision was made that way, but it's not something I'm too happy about. Of course, you need all the YAML tools to write these things. And let's go to the demo now. So I have here... I'm running the Argo Rollouts demo. This is hitting the backend, and it's returning one color or another depending on what is running on the backend. Right now I have the blue one. Let me see how I can do this more easily. What I'm going to do is change, update my deployment to use a new image that is going to be green. And... I lost the terminal. Okay, so it updated the image, and let me make it big here to show what this is doing. Okay, I think I pushed twice and now I see two rollouts happening at the same time.
Otherwise it's not working. Here it is. Okay, so I have the green one. The one shown as stable is the one that was running. So I think I have five pods running, and I push a new change, which is the canary, and this should be using the green image. Okay, there it is. So about 20% of the traffic is getting green, right? And how do I define this rollout? The part at the bottom is just the standard deployment configuration: what image do I want, what ports do I want to expose, and so on. But at the top I have the strategy configuration from Argo Rollouts. So I can say: point to this analysis template; this is what defines what is successful and what is not, and I'll show you that in a bit. And I have several steps: set weight 20, then pause; set weight 40, pause for 10 seconds; set weight 60, pause for 10 seconds; set weight 80. These are percentages. So this is my definition of a rollout: 20%, then wait for me to manually do something. I only do that for demos; in real life that's a bit harder to do, but you could still do it. So right now it's waiting, because I set a pause, and it's waiting for me to give it the okay. I look at it, it looks okay, so I can do the promote, and this is going to continue through the rest of the steps. So hopefully we'll see this in like 60 seconds. It should continue the progression until everybody receives green, a green color, when they call the API. So this shows that just by creating a Rollout object with this small section defining what your rollout is, you can do this. There's nothing else you need. Well, you need to install Argo Rollouts. And what else can you do? Oh yes, you can also have a preview version. So you can have another Ingress pointing to your preview version. So even if I say I want zero traffic to go to the new version, all the existing traffic goes to the old version, but I want to see the new version in a new place.
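The rollout definition walked through above can be sketched as a Rollout manifest. Names, the image, and the analysis template name are illustrative assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo                       # illustrative name
spec:
  replicas: 5
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-check   # hypothetical AnalysisTemplate
      steps:
        - setWeight: 20
        - pause: {}                # indefinite pause: wait for manual promote
        - setWeight: 40
        - pause: {duration: 10s}
        - setWeight: 60
        - pause: {duration: 10s}
        - setWeight: 80
        - pause: {duration: 10s}
  selector:
    matchLabels:
      app: demo
  template:                        # the "standard deployment configuration" part
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: demo
          image: example/demo:green   # hypothetical image
```

`kubectl argo rollouts promote demo` is what moves it past the indefinite pause in the demo.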
I can do that too. So that's very useful for preview-environment sorts of things. So if I go back, okay. While this continues running: this is running on Google Kubernetes Engine autopilot clusters, but you can run it on any Kubernetes. Autopilot is pretty cool because you only pay for what you use; if you scale things to zero, you don't pay anything. What does it say here? Okay, so now green is the stable one. It says "stable" here. What if I want to do... I was talking about how this protects me, right? What if I do a rollout that is broken? Let's see. This works. Right. Okay. So now I push an image that is bad. I'm changing the deployment. Of course, you would do this with GitOps; you would never push straight to production, but YOLO. So I'm pushing the red image, but this red image is returning 500 errors. And now Argo realizes, oh, this is giving errors, based on my analysis template that I'll show you. And this is in the degraded status. It went down and it scaled it down, and my canary was marked as failed. And you see that only a small percentage of traffic got the red dots, and then it was automatically rolled back. So I think this is the power of doing progressive delivery. Of course, this is very easy when your application is exploding; it's very easy to see. People ask me, oh, can we do this if a button doesn't work? Can we do this? Well, it depends which button. Imagine you're Amazon and you break the button that adds things to the cart, and you get a metric that says nobody is adding things to the cart; then you're like, oh, something is really bad. Right. So let me show you the analysis template. Is this the one? Yeah. In my case, my analysis template is a very "complicated" call: it fails if this doesn't return a 200. But again, you can integrate this with whatever you want, with metrics. Argo Rollouts also gives you a nice dashboard.
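An analysis template of the kind described — fail the rollout unless an HTTP call succeeds — can be sketched with Argo Rollouts' web metric provider. The URL, JSON path, and success condition are assumptions for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-check            # hypothetical name
spec:
  metrics:
    - name: http-health
      interval: 10s              # re-check every 10 seconds while the rollout runs
      failureLimit: 1            # a single failed measurement aborts the canary
      provider:
        web:
          url: http://demo.default.svc.cluster.local/health   # assumed endpoint
          jsonPath: "{$.status}"                              # assumed response shape
      successCondition: result == "ok"
```

Swapping the web provider for Prometheus queries is the usual next step once real metrics exist.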
If you are not into the command line, you can come here, where I can see the status of my rollout and what the strategy is. As I said, Argo Rollouts supports multiple strategies, including some of the more complex ones. I can see my steps that I showed you before in the YAML: 20, 40, 60, 80. And I can see what the last image I pushed was, and I could click here and do the clickety-click instead of writing YAML. Okay. So, yeah, as I mentioned before, if you're using a service mesh like Istio, it integrates with a bunch of service meshes and ingress providers. So you could go and say, I want 1% of traffic, because Istio supports doing those things. When you are using only pods, a pod either receives the traffic or it doesn't, so it's more of an approximation. But with Istio and other advanced setups, you can do more complex things. It hooks up with Prometheus, and it also supports multiple sources to get metrics from. And yeah, hopefully you've learned how to do a progressive delivery canary deployment very easily; you just need to do some YAML here and there. Let me see. On here, this one. So you can have labels added to the existing stable version and labels added to the new version, so you can do other things with Services on Kubernetes. You pass what analysis you want to run and you set what steps to run, and everything else is just the template, the deployment template. If you don't want to put the deployment template in the Rollout object, you just point at it here; there's another option where the Rollout points to an existing Deployment. The only problem with that is that, when you're migrating, the Rollout is not going to scale down the Deployment. A colleague of mine submitted a PR to Argo Rollouts, which is going to be in the next version.
So if you have thousands of deployments, when you spin up a Rollout pointed at a Deployment, once the rollout is successful it will automatically scale down the Deployment. So that's how that will actually work. Okay, so, yeah, and what's that thing? I lost my... Did I close it? Yeah. Okay, so, just a quick summary, you saw everything, and I hope this helped you; you can try it and do it at home if you like it. And I have time for two questions. Two questions. No questions. One question. "I was wondering if you've been testing using the Gateway API instead of Ingress?" So the question is whether I've tested using the Gateway API instead of Ingress. No, I have not been using that API yet, but I'm guessing that if there's no support already, there will be. We did not. Yeah. Hello. "So my question is, in the case of a buggy rollout, the traffic which is forwarded to the buggy instances, is it possible to automatically replicate it and send it to the stable versions after the failure? To ensure that even the traffic which hit the buggy rollout instances is served later by stable versions?" So, it's possible to roll it back automatically, but also... Yeah, the individual traffic, individual requests. So you don't want any user to see the problem? Yes. The other thing you could do, if you use a service mesh, is clone the traffic and send the clone to the new version, while the actual traffic goes to the old version. Then you could see whether the new version is breaking or not. But that's tricky, because you need to make sure it's not changing your state. If you are doing GETs, it's fine; not if you are changing state. "That's my point. Don't do the duplication in advance, because it will go to the parallel execution; do it only when the first execution failed, because it went to the canary instance." Yeah, I think you could do that.
Send traffic to the new version, but it's a copy of the traffic that is not seen by any user. And then at some point you could say, okay, this is good. I'm promoting this. I think it's doable. Yeah, thank you. Okay. Thank you. Thank you.
Own your CI with Nix
Okay, all right. Hello, everyone. Meet Bob. Bob's a software engineer, and Bob just had the idea of the century for a new startup: GrayCat. GrayCat is a service that, given the picture of a cat, will return the same picture, but grayscale. Bob's really excited about that and just got some funding to start working on it. So Bob gets started. He chooses to write it in Rust, because that's cool and trendy, and uses GitHub, because that's the standard. So he just writes the initial Rust boilerplate, the initial boilerplate for having that built by GitHub Actions, then git commit, git push. The first CI run is green. That's wonderful. Champagne. Now Bob decides to do something useful with that code, and so he pulls in image2, which is a Rust library for doing image manipulation, uses that in the code, builds it. The build is just fine locally. Wonderful. Now git commit, git push. The CI runs. It's not green. It's complaining about some missing data files somewhere. Okay, so Bob, no big deal, he's a software engineer, he knows how to use Google. So he searches and finds out that this image2 library is mostly a wrapper around a C++ library, which is OpenImageIO. And it turns out that Bob had that installed on his laptop, but the CI runners don't, which is why it's failing on the CI. But no big deal. Bob just tweaks the CI config a bit to install OpenImageIO before running the build, and now the CI is green. Wonderful. Fast forward one year later: BobCorp has grown quite a bit, and so has the tech stack of GrayCat. It's getting a bit complex, but no big deal. I mean, it's just a matter of having the right CI config file to make sure that everything gets installed. So the config file has grown a bit out of hand, with 5,000 lines, but it's no real problem. People just treat it as append-only: whenever they need a new piece of software, they just add a few lines to the config file to install it.
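The kind of tweak Bob made earlier — install the native library before the build — might look roughly like this in a GitHub Actions workflow. This is a hypothetical sketch; the package name and file layout are assumptions:

```yaml
# .github/workflows/ci.yml -- a sketch of Bob's fix
name: CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install the native dependency the Rust wrapper links against
        run: sudo apt-get update && sudo apt-get install -y libopenimageio-dev
      - name: Build
        run: cargo build --release
```

Each such `apt-get` line is exactly the kind of thing that accumulates into the 5,000-line append-only config described above.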
They generally forget to remove it if they don't need it anymore. But, I mean, no big deal. It works, right? And it's not like anyone would want to maintain that, because just the feedback loop of having to do a change, push, wait for the CI, get the results... Yeah, no. Better keep it like that. We need to move fast anyway. There are some troubles, and we know them. Like, for instance, when GitHub decides to update the base image of the runners: given the complexity of the setup, obviously something breaks here and there. No big deal. It takes a couple of days sometimes to fix, sometimes a bit more, blocking things a bit. That's annoying, but, I mean, we have to move fast. Now, one day, a big deal happens. It's now 2025. Microsoft is in a slightly difficult financial situation and decides that this GitHub Actions thing is actually wasting money, so it just decides to shut it down. Well, not such a big deal. I mean, it's not like BobCorp is married to GitHub Actions. They just have to migrate this little config file and use another CI provider. That's what they do. And three months after, they actually manage to migrate to a new CI config with a new provider. But by that time the competition has caught up. GrayCat is definitely behind, BobCorp is bleeding money everywhere, and this is the end of the GrayCat dream. So, a very sad story for Bob. But could he have avoided that? There's a bunch of things that went wrong, as you might have noticed. But most of these are just natural consequences of wanting to do things as quickly as possible; we can't take care of everything. But there's one practical choice that Bob made at the very beginning which caused the ultimate failure, and that was being stuck on one single service provider and being at its mercy. And Bob could have avoided that, hopefully.
The first, like, the elephant in the room, if I may say, is just the blue whale, Docker, which would have given Bob an agnostic way of defining this CI environment that doesn't depend on GitHub. Bob could just have written a Dockerfile instead of the CI config file. Now, some things that wouldn't have solved: the feedback loop for a big Dockerfile is not that much better than that of the actual CI. Docker's layering is great for caching when you don't need to touch the last lines of the Dockerfile, but if you touch things at the top, you're pretty much screwed. The other thing it would not have solved is that, unless you're very, very careful, it's easy to have lines here and there that might break at any time because upstream decides to change something. But, I mean, this is okay. The big problem that Bob would still have is that libOpenImageIO issue we had at the beginning. I mean, Bob has his laptop, he's working on that. On his laptop, he has the code for the project and the toolchains needed to build the project. Then the CI, or the container, has the same code for the project, checked out from the same commit, and it also has toolchains to build it, but not exactly the same ones. They are provided by different means, so obviously there are going to be some differences that might break things down the line. If you're lucky, it breaks your build. If you're unlucky, your build still succeeds, but then something underlying behaves slightly differently and you have absolutely no clue why. So what would have been nice would have been to have Bob's laptop and whatever is running the code in the CI use exactly the same toolchains. And there's an obvious solution to that: just ask Bob to do all his development in the Docker container. That is great.
The thing that's not great here is that Bob's laptop doesn't only have toolchains and code; it also has his text editor, his config, his whole development environment, fine-tuned for years to make Bob as efficient and productive as can be. And if Bob has to develop in the container, he mostly can't access that easily. And now we get a very sad and angry Bob, and a very inefficient Bob. Now, there's one bit of the infrastructure that I barely mentioned in passing and didn't pay much attention to. That bit is Cargo, the Rust package manager. And the reason why I haven't really talked about it much is not that it's not important. I mean, it's probably the most crucial part of the infrastructure, because it's the thing that pulls in the bulk of the dependencies of GrayCat. But the reason I didn't talk about it that much is that it just worked. I mean, I was talking about broken things, because it's always funnier to talk about broken things, and Cargo was not broken, not once. And the reason Cargo just worked, I think, is twofold. The first one is that Cargo has been very transparent in its role. The CI runs Cargo to provide the dependencies for the build, and that's fast enough for the CI that it just works fine. Bob on his machine runs Cargo for that, and that works; it doesn't prevent him from using all the rest of his tooling. So Bob is happy using it. And beyond that, Cargo is also declarative. There's one file, two files, that exactly define the set of Rust dependencies that your code has. When Bob runs Cargo on his laptop, Cargo just reads that file and provides the exact environment needed. When the CI clones the project and runs Cargo, it reads that same file and provides exactly the same environment. And that's why it works. Now, there's one thing that Cargo doesn't do properly. And that thing is everything except Rust packages. And yeah, that's a problem, because that's why we have this OpenImageIO problem.
And so it means that actually the declarative aspect of Cargo is limited: it's declarative up to a point. You really have to read the terms and conditions. And so, at that point, it would be great if only we could have something a bit like Cargo, but more generic, you know? That would be so awesome, if only such a tool could exist. Okay, so meet my friend, Nix. You can think of Nix, if you don't know about it, in this context as something exactly like Cargo or npm or whatever you want, except that it's fully generic. So you can use it to package and provide your Rust crates if you want, but you can also use it to provide the C libraries that your Rust crates depend on, and the C compiler used to compile these C libraries, and the server you're using to run the tests, or the PostgreSQL database you're using for your deployment server. And so now, declarative is not just a vain word. It is fully declarative, down to the lowest level you might want to think of. So what could happen for Bob, if he were using Nix, is that he has his laptop with everything set up, and then he can just use Nix to provide these toolchains. And because Nix is transparent, it won't prevent Bob from still using his editor with all his tools; he just, on top of that, has the required toolchains to build the code. And that makes a very, very happy Bob. And then the CI system can just use the same Nix with the same Nix config files to get the toolchain. And then the CI builds exactly the same thing as Bob on his laptop, and the world is wonderful. So now, assuming Bob is convinced that Nix is the great thing, and probably you are too, right, what would that look like in practice for Bob to use Nix? Bob would essentially drop a shell.nix file at the root of his project saying, hey, I want a shell.
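As a sketch, such a shell.nix might look roughly like this. The exact package and file names here are my assumptions, not taken from the slides:

```nix
# shell.nix — hypothetical sketch of Bob's development shell
let
  # nixpkgs.nix pins a specific commit of the Nixpkgs repository,
  # which fixes the versions of all transitive dependencies.
  pkgs = import ./nixpkgs.nix { };
in
pkgs.mkShell {
  packages = [
    pkgs.cargo
    pkgs.rustc
    pkgs.openimageio # the C++ library that the image2 crate wraps
  ];
}
```

Running `nix-shell` in the project root would drop Bob into a shell where these tools are on the path; exiting the shell removes them again.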
So: calling the mkShell function to get a development shell, saying, I want in my shell this set of packages: Cargo, Rust, OpenImageIO, whatever you want. And the little bit of magic here is this pkgs thing, from which everything comes, which you import and which points to Nixpkgs, the Nix package collection, a big repository with recipes for all the packages that exist in Nix. So I've hidden that, but you can import it in this nixpkgs.nix file, pointing to a very specific commit of the Nixpkgs repository, which will pin down every single version of all your transitive dependencies. And now, if Bob wants to use that, he can just run the nix-shell command and be dropped into a new shell in which, for instance, Cargo will be available, at a path that is managed by Nix. And once Bob exits the shell, no Cargo anymore; that's what we wanted. But then, I mean, Docker also does that. The bit that Nix has in terms of transparency, the extra coolness, is that the shell doesn't say anything about Vim, Bob's editor. But still, Bob can access his Vim, installed globally on his laptop, even inside the shell, which is what you want for development, because it's much, much, much nicer. And when Bob wants some extra guarantees that his shell really is complete, that he's not just accidentally leaking things from his computer, there's an extra pure mode that you can use for building things with more guarantees. So that's Bob's machine, but this talk is about CI. So let's look at the CI side of things. On the CI, Bob would still be using GitHub CI, because that's the standard at that time. But then, beyond the initial mandatory boilerplate to just fetch the repository and all that, there are only two things that Bob would need in his CI config file. The first one is: install Nix, because it's not yet part of the default GitHub image.
And the second one is: run the build within a nix-shell, a pure one, because you really want to be strict at that point. And then, if Bob needs to migrate, all he needs to do is, on his new CI system, install Nix again and copy that exact nix-shell command. And now Bob is fine, and we can all send great cat pictures over the internet. So this was just scratching the tip of the iceberg. We could go much further, although I only had 15 minutes, so I won't cover all of it. But the first thing we could do to go a bit further is to improve the pinning situation. I hand-waved this: oh, you have this file that pins things down to a very specific version. There are ways to make that much nicer, using a proper lock file like all modern package managers do, so that you have full control over when you want to upgrade, but it's also trivial to upgrade. More interestingly, you can also push Nix a bit further and not only use it to provide some development or CI environment: you can build your thing fully with Nix, which gives you, first, some extra guarantees that, oh, this is really the right thing I've built, in the right environment, with the right set of dependencies. But more interestingly, now you can integrate that and use Nix a bit further, for instance to build OCI images on top of that with only a few extra lines of Nix code, or use it to build AMIs for whichever cloud provider you want to use. At that point, you probably want to care about caching. Nix is pretty great at caching things. If you have a no-op build, it's really going to be a no-op and take you a few seconds, rather than the whatever time it takes to build the project. And you can also get that cache to be distributed, meaning that if something is built on the CI, then your developers can just reuse the pre-built results, which makes their life much nicer.
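The two CI steps described above — install Nix, then run the build inside a pure nix-shell — might look like this on GitHub Actions. This is a hypothetical sketch; the action names and versions are assumptions:

```yaml
# .github/workflows/ci.yml — hypothetical sketch
name: ci
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Step 1: install Nix on the runner.
      - uses: cachix/install-nix-action@v24
      # Step 2: run the build inside a pure shell defined by shell.nix.
      - run: nix-shell --pure --run 'cargo build'
```

Migrating to another CI provider would then mean reproducing only these two steps; the nix-shell command itself stays identical.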
And the last thing that could be done to go further is to use NixOS, which is a Nix-based Linux distribution that follows that same philosophy of being purely declarative, which means that you have one config file that describes the whole system, and you just rebuild the system based on that. That is useful for deployment, because that's infrastructure as code, but really down to the deepest level, not just scratched on top of something that existed before and was never meant to be that way. And you can also use that for testing things further. There's, in particular, a really nice testing framework that allows you, again declaratively, to declare a whole network of virtual machines that you can spawn, run commands on, and then just read the results. And that's really useful as soon as you start wanting to test some weird multi-tenant applications. So, I've been talking about Bob all along, but maybe a few words about me, so you know who's talking to you. I'm Théophane. I'm the leader of the Nix group at Tweag, which is the open source program office of Modus Create, and it's pretty big on Nix, as you might have guessed. I'm also a maintainer of Nix and a member of the board of the Nix Foundation. You can reach me in all these places, and, more concretely, you can also meet me right here in the AW building, where we have the Nix stand. And that's all for me.
Testing Go command line programs with `go-internal/testscript`
Good afternoon, everybody. Who is a Go developer? Very well. Very nice to meet you. My name is Giuseppe Maxia. I work with VMware. I am not the creator of this thing that I'm presenting today, and my company is not involved in it. I'm just a user, and since it makes my life easier, I decided to share with you what I do with it. So the thing that I want to do, theoretically, is what you see on the screen. But practically, what I want to do is to make you curious about testscript, so you will try it and eventually see how good it is and what you can do with it. The important thing is that you will learn a few basic things. I mean, I could talk about testscript for three hours and probably wouldn't exhaust the topic. But since we have only 20 minutes, this is what we are going to do: we are going to show the basics of testscript, so you will know it. We start with why. Why do we need this kind of tool? It's because we have a problem. When we write a command line program, we need to test it, and to test it, we need to build it. Then we need to do something with this build, to shake it up and check that it's doing what it's supposed to do. You can do a lot of things instead of testing the command line program directly in the shell. For example, you could test the single functions that are inside the program, and you should do that. But this is not the same as testing the program. To test the program, you need to make sure that the function that works well in your tests is also linked to the command line command or option that you hope it is linked to — correctly, but not always. Also, the input that you put in the function works beautifully, but since it has spaces in between, it doesn't work on the command line. You really need to test the real thing.
The problem is that, to achieve this goal, you first need to compile the program, and second, to find a way of testing that program such that it works well with your Go code and is checked in the right way. By checking in the right way, I mean that you are sure that what you hope to achieve is exactly what happens. Doing this kind of thing in a shell script is not always easy. Let's talk about this testscript. What is it? It's a Go library. It's also a standalone tool, and the best thing is that it is developed directly by the Go team. They use it to test the Go tool itself and all the tools that come with the Go distribution. A few years ago, it was released in the go-internal package, so you can use it separately from the Go source code. You can use it for mostly anything. If you are developing a command line program in Go, it's much better, but you could also use it to test mostly any command line program, even if it's not written in Go. Of course, if it's written in Go, it helps. Let's see a first example. To test something with testscript, you need two components. The first one is a script. It's a script that says something about what you want to test and what you want to get. In the script that you see on the upper part of the screen, there is an echo, a hello world, and a keyword, exec, before it. Exec is an internal command of testscript that will run something. Then there is stdout, with a confirmation: the confirmation is that you should receive something that says hello world and a newline. Then there is an exclamation point, which says standard error, and a dot. This line means: I don't want a standard error. I don't want anything in standard error. More about this later. Then you need a component in your Go code. The component will just call testscript.Run, which receives at least one piece of information, meaning in which directory you find the scripts.
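Put together, the two components from the slide might look like this. The file and directory names here are illustrative assumptions:

```
# testdata/script/hello.txtar
exec echo 'hello world'
stdout 'hello world\n'
! stderr .
```

On the Go side, a single call such as `testscript.Run(t, testscript.Params{Dir: "testdata/script"})` inside an ordinary test function is enough to run every script found in that directory.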
I say scripts, plural, because in that directory you can have one or a thousand scripts that do different things to your program. Let's modify the first script a little bit. Instead of expecting hello world in stdout, we expect an h, and then you have this strange thing that is a regular expression. If you know regular expressions, what we are saying here is: I just want two words, one that starts with h and one that starts with w. Like before, I want the standard error to be nil. The standard error with a dot suggests that what we are expecting here is not a dumb piece of text but a regular expression. You can use a dumb piece of text if it suits you, but you can have a much more powerful kind of check. For example, you can use several statements to describe better what you expect from the output of the program. In this case, instead of putting everything in one line, I put it in two lines. This is often useful if you want to make your test more readable, to express exactly what you are expecting. More important, the testscript environment includes one thing that is called txtar. Txtar is a very simple way of encoding files. To encode files, you just put the name of the file between double dashes, and then you put the content of the file there. The file will be magically created in the environment where the test is executed. What happens is that testscript will use a different temporary directory for each script. Every script can run in parallel, and it will be more or less isolated from the rest. This data.txt will exist in the temporary directory created for this script only. You have some built-in commands that you can use directly in your tests, exactly what we have seen already. Then stdout and stderr will check what happens after you have run your command, and stdin will create the input for the next command.
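An embedded txtar file, as just described, might look like this in a script. The wordcount command and its expected output format are illustrative assumptions:

```
# data.txt is created for us in the script's own temporary directory
exec wordcount data.txt
stdout '3'

-- data.txt --
one two three
```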
Then there is the command exists, which checks that a file exists, and stop and skip, which will interrupt the test. If you put an exclamation point before a command, it will negate the command, meaning that you expect that command to fail. Other commands are cmp and cmpenv, to compare files, plain or with environment variables expanded, and then you have env, which will set variables, and this can be useful. Then you have things that are also available in shell scripts, like cat, cd, cp, chmod, mkdir, mv and rm, which work like in a shell. Then you have conditions. A condition is like a command, but within square brackets, and you are telling the program that you expect something to be true. For example, with exec and a file name, you are saying: I want to make sure that this program is in the path. [unix] will only be true if you are running on a Unix system. And after the condition, you put a command that will run if the condition is true. And you can check other things, like whether you have a network, whether you are running a specific version of Go, and so on. There are some specific environment variables: WORK is where you are running the test, in practice. HOME doesn't exist, but you can set it if you want. And then there is a temporary directory that is created for each script, but you can change it if you need something different. If you run the test with verbosity, you will get a lot of information that tells you what the environment is where you are running and everything that is executed. If you don't pass the verbose flag, the test is silent; it will just succeed silently. You will see some output only in case the test fails. Let's see some more examples with commands and conditions. The first line says: if it's not Unix, skip it. The second line says: if it's Linux, say, exec, good choice. The third one means: if sed exists, then run the command echo, the command sed was found, and so on.
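Those conditions could be written roughly as follows in a script. I am assuming, as in my reading of the spoken example, that the third condition refers to sed:

```
[!unix] skip                        # only run on Unix systems
[linux] exec echo 'good choice'     # runs only on Linux
[exec:sed] exec echo 'the command sed was found'
```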
Remember, I mentioned something about compiling the executable, and one thing that testscript can do for you is having a transparent executable. How does it work? Let's say I have this wordcount command that I have created in Go, and I want to test it. So I run exec wordcount, and this command may fail or succeed depending on whether wordcount exists or not. If wordcount is in the path, it will succeed. If it's not, it will fail. But we want to make sure that it always succeeds, so we need to tell testscript: this wordcount, I not only want it to exist, but I want it to be the one that I have created, for which I have the code, and to be fresh, not stale. How do we do that? In the test, we use TestMain. In case you don't remember, TestMain is something that you put in your test code that runs before any test function that you may have in that directory, in that package. So TestMain contains a call to testscript.RunMain, which takes a map of functions that you can associate with a name. In this case, we have a name, wordcount, and we have a function, runMain, that returns an integer. In the main code, you will have a main that doesn't run the code directly, but calls runMain and exits with the integer that runMain returns. So what happens here is that the wordcount in your script is in reality a call to this function, and the funny thing is that there is no separate executable. If you remember, Go is a compiled language, so whenever you run a test, nothing runs like in Python; it's not interpreted. There is a compilation, and the compilation happens in a hidden place, so you will have a piece of code that has been compiled, a binary, and that binary will be available for your test. The good thing is that there is no additional binary: it's the same binary that is used for the test. I'm going to show you an example later. So let's see something more.
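The TestMain wiring just described could be sketched roughly like this, using go-internal's testscript package. This is not self-contained (it needs the third-party module), and the wordcount body is elided, so treat it as a sketch rather than the speaker's actual code:

```go
// main.go
package main

import "os"

func main() {
	os.Exit(runMain())
}

// runMain does the real work and returns the process exit code.
func runMain() int {
	// ... actual wordcount logic would go here ...
	return 0
}
```

```go
// main_test.go
package main

import (
	"os"
	"testing"

	"github.com/rogpeppe/go-internal/testscript"
)

func TestMain(m *testing.M) {
	// "wordcount" in a script now calls runMain inside the test binary:
	// no separate executable is ever built.
	os.Exit(testscript.RunMain(m, map[string]func() int{
		"wordcount": runMain,
	}))
}
```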
I said before that we have built-in commands, but we may want something more profound, like custom commands. So, for example, I want a command that will sleep for, let's say, three seconds, and this command is not available, because it's not one of the built-in commands, but I can build it. I can also have a command that checks all the files in a directory: I want all these files to exist. So I want a check-files command that has as its first argument the name of the directory, and the rest of the arguments are the file names. How do I do this? When I call the script Run, I can pass an indication of a map of functions that will produce these custom commands. So the custom commands are a map of functions, and each function receives a testscript object, a negation, in case I put a bang there, and a list of string arguments. If the command succeeds, I do nothing; if the command doesn't succeed, I call testscript's Fatalf and I fail. So, for example, this is how I implement the custom commands in my wordcount: I have check-files and sleep-for. And if we look at the implementation of each one, you see that sleep-for is a function that accepts a testscript, a negation, and a list of arguments, and it uses that argument to determine how many seconds I want to sleep for, and then it calls time.Sleep. If the first argument was not a number, it will fail, and the command will not succeed. A similar thing I do for check-files: the first argument will be the directory, and then I will check that the files exist, for each one of the arguments. A similar thing I can do for custom conditions. In addition to the conditions that we have, I can implement conditions that suit my environment better. For example, I may want a condition that says the version of this particular program must be at least 0.2. How do I do that? I cannot do this with the built-in syntax of testscript, so I can implement a custom condition.
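The custom-command signature just described could be sketched like this, again using go-internal's testscript types. The command name and error messages are my own illustrative choices, and the fragment needs the third-party module plus the strconv and time imports:

```go
// sleepFor implements a hypothetical custom command: sleepfor <seconds>
func sleepFor(ts *testscript.TestScript, neg bool, args []string) {
	if len(args) != 1 {
		ts.Fatalf("usage: sleepfor seconds")
	}
	secs, err := strconv.Atoi(args[0])
	if err != nil {
		ts.Fatalf("not a number: %s", args[0])
	}
	time.Sleep(time.Duration(secs) * time.Second)
}

// Registered through Params when running the scripts:
//
//	testscript.Run(t, testscript.Params{
//		Dir:  "testdata/script",
//		Cmds: map[string]func(*testscript.TestScript, bool, []string){
//			"sleepfor": sleepFor,
//		},
//	})
```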
Writing a custom condition is similar to what we do for the custom commands. I have a function; that function receives one string and will parse that string to determine what we do with that condition. In addition to that, testscript allows us to pass arguments that vary depending on the environment. So, for example, I want to pass the current version to the test, or I want to pass the home directory, and I can do that with a set-environment function. So, back to the custom conditions: we receive a string and return a boolean and an error. What do we do inside that function? We parse the string, and the string could be a simple condition, or it could be a condition with some arguments that we need to parse and then see whether the condition is true. In this case we have a version_is_at_least, and you see I have created a function that will check that, with the elements that are parsed in the first line of the function. So I assume that the arguments are separated by a colon, and I use them. For example, this version_is_at_least will check that we have at least two arguments: the first argument will be the version, and the second argument will be the version to compare with. In the same way, I can have a condition exists_within_seconds, which checks that a file exists within at most a number of seconds that I wait. This is useful, for example, when I test a database system that is supposed to create something, but doesn't create it instantly. So I say: I want to see this log file at most 10 seconds after the database starts, and if not, I get an error. So I'm going to show you a quick demo of something that happens when we run testscript. If I run go help testflag, I get a bunch of options, and these are always used by the tests.
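The environment passing and custom-condition hooks just described might be wired roughly like this. The condition name and version string are hypothetical, and the comparison logic is elided, so this is only a sketch of the shape of the API:

```go
p := testscript.Params{
	Dir: "testdata/script",
	// Setup runs once per script and can pass dynamic values into it,
	// visible as $current_version inside the script.
	Setup: func(env *testscript.Env) error {
		env.Setenv("current_version", "0.2.0")
		return nil
	},
	// Condition evaluates e.g. [version_is_at_least:0.2] in a script.
	Condition: func(cond string) (bool, error) {
		parts := strings.Split(cond, ":")
		if parts[0] != "version_is_at_least" {
			return false, fmt.Errorf("unknown condition %q", cond)
		}
		// ... compare parts[1] against the current version here ...
		return true, nil
	},
}
```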
Now, if I run wordcount -h, this is the real executable that I built with Go, and you see I have these options that are just the options of the program. But if I run something a little bit different — let's see a test that I have here, where I run wordcount -h like I did on the command line — you see I have here the options that are provided by the executable, but in addition to that, I also have the options that belong to the test. This is to show you that what we are running here is an executable, but it is the executable that Go builds for the test itself, and the side effect is that it contains the command line options that belong to the test itself. Now, back to the presentation. What we have learned today is that using testscript you can simplify the testing of any command line program, and programs that manipulate text are extremely suitable for this kind of testing, because testscript was created for the Go tools, which manipulate text. You don't need to have a separate executable, because the testscript environment will create one for you, and you can build your own commands and conditions if the built-in ones are not suitable. If you want to see the slides, and if you want to see a full example of how to use testscript to test a common command created with Go, you can go to github.com/datacharmer/wordcount. There is the code for this wordcount, and all the examples that I have shown here, and a lot more that test the wordcount in most of the conditions that you may have. So you can see how to test this kind of program in reality. Well, in reality I could show a lot more than that, but it would be too long. So this is the beginning of a project that I have, to illustrate all the characteristics of testscript using code, and the first step is to show a simple command line program and all the tests that are needed.
Here you will also see, on GitHub, all the resources that you can use to learn more, and if you want to learn more right now, you can ask me outside and I may show you some more examples. Do we have time for questions? Three minutes. Any questions? Yes. When you still want to do unit tests, is this only for CLI input, or would you always create custom commands for that? So the question was whether I can use testscript for unit tests. You can use testscript for mostly anything. I use it for unit tests and I use it for integration tests. For integration tests, I just put some logic before the test to create the environment, so it will run a little bit slower, but doing unit tests is the easiest thing in the world. If you look at the Go code, most of the unit tests for the Go tool itself are run with testscript, but the integration tests can also be run that way. More questions? Okay, thanks a lot.
How mutation testing got practical
Thank you. That was a technical issue; we can edit the video later. Let's start very lighthearted then. Maybe a show of hands: who here has heard of mutation testing? Amazing. I can go very quickly through some slides then. Who has never heard of mutation testing? For whom is this a completely new concept? I will cover it for you. That's nice. Of course I'm here, but I'd like to promote Stryker a little bit. Who of you is actually using Stryker already? Nobody. One person. Well, that's my colleague. He's also working on it. Cool. You guys are definitely going to learn some stuff today, then. I can just start, right? Sure. Welcome, everybody, to the talk How Mutation Testing Got Practical. I'm really focusing on the Got Practical in this talk. I will be explaining mutation testing a bit, but I'm really looking deep into the internals, and into why the idea is really old while its practical use is relatively new. So I'll get into that. But first, a little introduction. My name is Jan Dele Kester. I'm a software engineering consultant at Info Support. It's a consultancy organization in the Netherlands. I'm also a trainer there, and I'm a research supervisor. And that last one is very relevant today; you'll hear why soon. If you want to contact me afterwards, you can use LinkedIn, I guess, but you can also find me on GitHub. And as I said, I'm here on behalf of Stryker. Stryker is a mutation testing framework for JavaScript and TypeScript, C#, Scala, and hopefully, at some point, Kotlin; there's a partial implementation there already. We're working hard on it. You can find us at stryker-mutator.io, and of course, I have all these nice socks for you. So if you're really good at asking questions and reacting to my questions, you might get some. And otherwise, you'll see us afterwards; we'll have some more.
So, in this talk, in the next 25 minutes, because I am going to try to leave room for questions as well, I first want to talk about why we actually need to understand our tests. Why is just writing a test not good enough? I also want to go into what mutation testing is, for the people that don't know yet. And finally, and that's the major part, hopefully if I don't run out of time, I'm going to go deep into how we got to practical applicability, how we got to this state now where we can actually run mutation testing on our real projects. And that means talking about some state-of-the-art performance improvements. But first, we have to talk about the false sense of security. This is a promotional image I copied shamelessly from the SonarQube website. They show this nice dashboard where they say: well, everything is good. There are no issues, no bugs, it's all fine. And there's even 76% test coverage. Who would be happy with that? Okay, who wants to say why? Why is 76% good, according to you? Lots of green. Lots of green. Large or small socks? I don't wear small socks. You get some anyway. Sorry about that. They're hard to throw in this room. I would say I would not be happy with that. Because, I mean, when we're running our tests, our tests apparently reach 76% of our code. I don't think that's enough, because there is more than 20% of our code that is not even getting executed during the tests. That's a problem. But even 100% code coverage doesn't actually say much, because coverage only means that code is being executed. We are only testing, in the worst case, that the program does not crash. What you actually want to know is whether our tests do something, and I can very easily get 100% code coverage on a program without writing any assertions; then I'm just checking that the test execution does not crash. So we need a way of testing our tests.
And no, we're not going to write unit tests to test our test logic, because that would be stupid. I mean, it would never end. We need to be smart about that. And that's where mutation testing comes in. So what we're going to do is introduce changes into the production code automatically. There's a tool that's doing that. We're going to run the tests again to see whether they start failing. Because when the tests start failing, at that point, you know that your tests are actually able to catch that bug that we purposefully introduced. So it's also a form of white-box testing, because we really have to know stuff about the internals of the code to change it, and see whether the tests are good enough to catch that. And this is really not a new idea, because there's this nice paper from '79 already, where they talk about a new type of software test. And you can actually find this on Google, and you can read it if you want. But even back then, 40 years ago, 45 years ago, they were already talking about it. But only recently, and I say recently in a very broad sense here, because I wasn't even in high school, I think, in the period I'm talking about, did it get more traction. Because the problem is, in the late '70s, it was just a good idea. We did not have the resources to actually apply it in practice. And what you see here, the dark-colored bars, are publications, research publications about practical applicability, and they really spike early this millennium. And there are reasons for that. Mostly also, I think, because our computers got fast enough. And why that is important is because of how mutation testing works, what the process behind it is. Because we start with our source code, and we are feeling very happy about it, of course, because we made all this nice code, we even wrote tests for it, so we're very confident. And what the tool is then doing is introducing mutants in your code. And mutants are just changes.
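To make that concrete, here is a small illustrative sketch (my own example, not from the talk's slides) of one original function and two mutants a tool might generate from it; a good test suite should make at least one test fail for each mutant:

```python
# Hypothetical illustration: an original function and two mutants that a
# mutation testing tool might generate from it.
def is_adult(age):
    return age >= 18          # original code

def is_adult_mutant_1(age):
    return age > 18           # '>=' mutated to '>': only a test for the
                              # boundary value age == 18 kills this one

def is_adult_mutant_2(age):
    return age < 18           # '>=' mutated to '<': almost any test kills it
```

A test like `assert is_adult(18)` kills both mutants, while a suite that only checks `is_adult(30)` and `is_adult(5)` would let mutant 1 survive unnoticed.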
And for every change that is made, the tests are executed again. And we can have two results. Either the tests start failing, which in this case is good, because we then detected that mutant, we found the bug, so we say that mutant is killed, or the mutant survived. And that means that your tests are not complete. And when you do that for everything, in the end, we get a nice report out of it, like a coverage report, except a bit more detailed. And how that process actually works is that there are operators for that. And an operator is basically a transformation: given a certain construct in your code, what kind of changes can we make that might fail your tests? And some examples are here. There are way more, and one piece of original source code on the left could also result in multiple mutants. But you could, for example, just switch the operators, or throw away a whole block of code. And when we do that for every one of these mutants, we measure something, right? So I already talked about killed and survived. But that's only the ideal scenario, because in practice, there might be code that is not even reached by a test. So you can say, well, we have no coverage. Or we have a timeout. And a timeout basically means that the mutant caused an infinite loop. And we consider that, okay, well, with an infinite loop the tests actually failed, so that's kind of killed. But you can also get runtime errors, or compile errors, because we're just introducing weird code changes without looking at what the code is actually supposed to be doing. And finally, mutants can also be ignored, because a developer said, I don't want a test for this, so I don't want to see it in the report anymore. It's just like suppressing warnings in your code projects. Who here does that very often? I do, actually, but... And then, just like with the code coverage score, we want to have a nice metric. We want to know how well we are doing.
And for that, we can compute the mutation score. And basically what we say here is we want to express, in a nice number on a scale of 0 to 100%, how many mutants you actually managed to kill. So how many unexpected changes in your code are your tests actually catching? And that's this nice formula, but basically everything above the line is what you consider killed. And we divide that by everything below the line, which is everything that was actually a valid change. So we exclude the crashes, for example. And that gives you, for your whole program, for your whole code base, an indication of how good your tests actually are. But what if you don't have that many tests yet? Well, we can also compute a variant of the mutation score where we just look at the code that's actually being tested. So you see here that we do not include the mutants without coverage anymore. So one might think, just like with code coverage, we should maybe strive for a high number. We should maybe have 100% mutation coverage, a 100% mutation score. That would be nice. But there we actually run into a problem, because we cannot actually kill all the mutants. It's relatively easy, at least, to reach all your code, to make sure all your code is getting executed, so basically to get 100% code coverage. But because you're still calling functions from the outside, you might not be able to test every single operation happening inside of these functions. And actually, some mutants you can never kill, even if you tested your whole code base completely. And one category of those is equivalent mutants, which are also a problem. So given this code, we have a nice for loop, and we say, well, we want to iterate 10 times. You can also write it like this, and it will still work. So this mutant we cannot kill, because even though we changed the code, semantically it's doing the same thing.
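The two scores just described can be sketched in a few lines (a simplification with the outcome categories reduced to four; not Stryker's actual implementation):

```python
# Sketch of the mutation score described above. "Detected" mutants
# (killed + timeout) go above the line; invalid mutants (compile/runtime
# errors) and ignored mutants are excluded from the denominator entirely.
def mutation_score(killed, timeout, survived, no_coverage):
    detected = killed + timeout
    valid = detected + survived + no_coverage
    return 100 * detected / valid

# The "covered code" variant: mutants that no test even reaches are
# left out, so the score only judges the code you actually test.
def covered_mutation_score(killed, timeout, survived):
    detected = killed + timeout
    return 100 * detected / (detected + survived)
```

For example, with 70 killed, 5 timeouts, 15 survivors and 10 uncovered mutants, the full score is 75% while the covered-code variant is about 83%.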
So that's what you might want to ignore, basically. And mutation testing is also very challenging. That's actually where that practical application problem comes in, because you can imagine that mutation testing, basically changing your code and then running all the tests again, takes a lot of time. And if you have a very large code base, that might actually not finish in a reasonable time. You also need a lot of configuration. The mutation testing tool needs to know stuff. It needs to know how it can run tests. It needs to know how it can verify whether those tests completed successfully or not. It also needs to know stuff about your programming language, for example, in order to make sure that it rewrites the code in a correct way. So there also needs to be a lot of tooling support to make it work. And for a long time, mutation testing was simply not feasible, or not easy to do. But we're bridging the gap. Not specifically at Stryker; a lot of stuff has already been done, luckily for us. But when we're looking at performance, this is basically the worst-case scenario. The time it takes to analyze a single mutant is basically the time it takes to run all your test cases. So we can approximate it by just counting the number of tests that you have. And then the time it takes to mutation-test your whole program is just the sum of that. So it basically means that you multiply the number of test cases that you have in your code by the number of mutants that you need to check, because mutants need to be checked in isolation, because they can influence each other. So you can imagine that already with a very small program, this number, the time approximation, can get really, really large. So we need to be smarter about that. We want to make sure that the total time is a lot less, not just a bit less, a lot less than just that multiplication. And basically there are three approaches to get there. The first is to do it faster.
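That worst case is easy to put into numbers. A back-of-the-envelope sketch, with my own illustrative figures rather than numbers from the talk:

```python
# Naive worst case: every mutant is checked in isolation against the
# full test suite, so total time ~ mutants x tests x time per test.
def naive_runtime_seconds(n_mutants, n_tests, seconds_per_test):
    return n_mutants * n_tests * seconds_per_test

# Example: a small project with 1,000 mutants and 200 tests at 50 ms
# each already needs 1000 * 200 * 0.05 = 10,000 seconds, almost 3 hours.
```

This is why simply rerunning everything per mutant does not scale, and why the optimizations discussed next matter.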
And doing it faster means, for example, that we're going to parallelize it. We're going to use a lot of cores. Take a big machine; nowadays it's relatively simple to get a machine with 128 cores, so we can do 128 things at the same time. You can try to do fewer. So you can maybe try to make smart choices and say, well, certain stuff we maybe don't need to analyze, because we kind of know that it's probably fine. Or you can try to do it smarter. And the study I referenced here really did an analysis of this; it's a literature review. And most of the studies are actually focusing on fewer or smarter. Some common techniques are listed here; I won't have time to go into detail on all of these. But you can think of random mutation, where we just randomly pick some mutations that we're going to check. But that's not deterministic, so it might not give you the best knowledge about what the quality of the tests actually is. Parallel execution I already mentioned. You can also do stuff like data flow and control flow analysis and try to reduce the set that way. Or maybe look at AI to try to pick smarter sets of stuff that you're actually going to check. But if you want to use that mutation score as a benchmark, as a comparison, for example with a pull request, did you actually improve it or not? Or to give you a good indication of how good your tests currently are? Then you actually need to execute everything. So the approach of just running less doesn't always work. And one big way this process can be sped up is by looking at how we actually change the code. A very naive approach would, for example, be just changing the source code, running the compiler again, running the tests again, and then making another change in the source code, and then running the compiler again, running the tests again.
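The "do fewer" idea of random mutation can be sketched in a few lines (illustrative only; real tools make smarter choices about which mutants to drop):

```python
import random

# "Do fewer": randomly sample a fraction of all generated mutants.
# Faster, but as noted in the talk the selection is not deterministic,
# so the resulting score is only an estimate of the true mutation score.
def sample_mutants(mutants, fraction, seed=None):
    rng = random.Random(seed)            # seeded for reproducible sampling
    k = max(1, int(len(mutants) * fraction))
    return rng.sample(mutants, k)
```

Passing a fixed `seed` at least makes two runs comparable with each other, but the sampled score can still differ from the score over all mutants.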
And if you have a compiled language with a fairly slow compiler, that's quite problematic, because then it gets really, really slow. A bit better might be bytecode mutation. So, for example, the JVM languages have bytecode, an intermediate step; maybe you can mutate that. Then you only have to compile the source code once; you just change the bytecode and run it. And while that is a lot faster, it's also a lot more complicated, and it has one big downside: every change that is possible in the bytecode is not necessarily something you can do in your source code. Especially if you write Kotlin or Scala: a very simple thing in Scala, for example, can result in a lot of bytecode, and if you're trying to mutate that under the assumption that the Java compiler produced it, you might come up with mutants that you did not kill, but that you cannot actually kill, because they don't exist in your source code. So, who has an idea how we can do this smarter? Any ideas? Yes. Sorry, what did you say? You compile all the different mutants at the same time, and you select which mutant you run from the outside. Exactly. Large or small socks? Large. These are the larger ones, right? Sorry about that. Thank you, Nico. And yeah, basically, the answer that was given was this, and we call it mutant schemata. This just makes sure that you compile all the mutants into your code once, and then use an environment variable to switch them on or not. So, if you do this, your compiled code is just full of if statements that check a certain number. And that is complicated, but it is manageable, it's not that hard.
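A minimal sketch of that mutant schemata idea (hypothetical code, not Stryker's actual output): every mutant lives in the compiled artifact at once, and an environment variable selects which one is active:

```python
import os

# Mutant schemata, sketched: all mutants are compiled in once, each
# guarded by an identifier. The runner activates one mutant per test run
# by setting an environment variable instead of recompiling.
def add(a, b):
    mutant = os.environ.get("ACTIVE_MUTANT")
    if mutant == "1":
        return a - b   # mutant 1: '+' replaced by '-'
    if mutant == "2":
        return a * b   # mutant 2: '+' replaced by '*'
    return a + b       # original code (no mutant active)
```

The test runner then executes the suite once per identifier, e.g. with `ACTIVE_MUTANT=1` in the environment, without recompiling anything in between.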
The main problem is in keeping track of it, but if you assign every mutant a unique number, it should be fine. And this really helps with compiled languages, especially with stuff that's a bit slower, like Scala. And this is actually relatively new. In the world of mutation testing, this is relatively new. It is from 1993, though, so it's the same age as me. I wouldn't say that I am relatively new. Something else that you can do is coverage analysis. That's also something that has been part of Stryker for a long time already. So we actually do an initial run where we just check which tests are reaching which code. So we know, if you change one part of the code, which tests actually need to run and which don't. So you can also get that number of test cases down a lot, depending on where you mutate the code. For some code you don't really know; if you have something static, for example, defined somewhere, you might not be able to figure out how that is used. Then you might still have to run the whole test suite, but you try not to. Something else you can do is incremental analysis. So you diff against some previously stored state and try to guess which mutants you actually need to check. This is very hard to do fully foolproof, fully complete, but you can get there like 99%. And that means that if you make a small change, a small pull request, checking whether your changes are tested properly is relatively fast. Nico here in front gave a talk yesterday in the JavaScript dev room, and he actually showed this feature, and I think there was a difference between like 30 seconds and three seconds, something like that, on a small project. Another cool thing is mutation levels, and that's where you actually give the user a choice. Do you care about testing it fully, or do you care about performance?
And the choice that the user wants to make can depend on the type of project or the domain. Do you have code where it's really important that every single thing is tested, or isn't it that important, actually? Do you actually want to spend the time? Or maybe you want to do a quick-and-dirty but pretty good analysis for every pull request, and then a nightly build where you test it fully. There are different approaches here, and this is actually something that was researched by one of my colleagues at Info Support; he did his master's thesis on it. So it's really cool. But what could be a downside of this approach? Any ideas? Remember, you can get some socks. Yes? So the answer was: the feedback loop is longer, basically. It might take time to find out that your tests are not that great. That's one. I have another slide, but yeah, another guess? Sorry? It's still very useful because you run the full thing nightly, yeah. What size socks do you have? Large? Sorry? Large. Large. That's too far to throw; just come get them later. I will put them aside for you. What's your size? Small. Small. Ah, damn it. I'm not good at throwing. But yeah, so the mutation score that you compute, if you choose to run only some of the mutants, might not be comparable. So you really need to take care of that. And the tool that my colleague actually created analyzes a code base so that you can do this for a specific project. It analyzes the code base and the tests and tries to find a nice balance between accuracy and the number of test executions that you need to do. So it tries to see if there are some mutants that we can exclude that will gain a lot of performance. So it speeds it up massively, but doesn't lose a lot in accuracy. So that's really nice. And you can actually find his thesis online; if you go to the FOSDEM page for this talk, there's a link there as well.
And it's very hot off the press, actually, so there is not even documentation for it and it's not even merged. It's project Xavier, and that is actually implementing that idea, because it was very theoretical, implementing it in Stryker for JavaScript. So if you're really interested in how that all works and what decisions they made, and I honestly don't know yet myself, go look at the pull request. And it's also a very cool example of how a project group from a university, in this case the University of Twente, contributed to Stryker, and they actually built this. Mic is dead. Oh, it's back again. Yeah, documentation, as I said, still to follow. And a very new thing that another student is currently working on is doing more static analysis on the code to figure out whether we can analyze multiple mutants in one run of the test cases. But in order to do that, we actually need to make sure that these mutants do not influence each other. So this only works if you know for sure that they don't cancel each other out, or if, given the tests that fail, you can still say with confidence which mutant it was. So this is really, really complicated. Again, I'm not entirely sure about its progress yet, but, question? So the question is whether modularizing your application would help. And it would help, because if your modules are smaller, then the test runs are also smaller; they would take less time. But it only works if a normal pull request for you, a normal change that you want to do, is contained to one module or two modules. If your change still touches all modules, then it doesn't help. So it might help. But that's also a general thing: if I want to make my CI pipeline quicker, I make more repos or smaller modules. So yeah, that would definitely work. Come grab a pair of socks later.
And now it really is time to also start testing your tests. So if you're not using mutation testing in your projects already: it's really good now, we can actually use it. There's been a lot of progress in 45 years. We have better hardware. We have process improvements. And there's actually a lot of research still going on to make it faster all the time. We also have production-ready tooling. There are many great libraries out there. Some of them are more mature than others, some of them are faster than others; not all of them integrate the same process improvements, for example. But in general, for most popular programming languages there is a tool available, and you can run it in your pipeline. And most of these tools just integrate with the build tool that you expect. They use information that the test runner already gives you. So that's great. Here's an overview of some suggestions. But if you just Google your favorite programming language plus mutation testing, probably the first result will be the right one. So in summary, when we're talking about mutation testing, we're really talking about testing your tests, making sure that your tests actually test what you expect. And if there's only one thing that you take away from this talk: don't rely on code coverage alone, please, because it doesn't say much. A lot of research has gone into performance improvements, and there's lots of research still being done. There are always students coming to us interested in contributing to an open source project with research, so there are plenty of open research questions. And it's applicable now. So if you're maintaining an open source project, at least consider mutation testing, because especially in open source, where there are many contributors, it's a really good metric to get an idea about the quality of the tests that somebody wrote for a pull request.
If you want to know more of the implementation details, as I said, my colleague gave a talk yesterday in the JavaScript dev room; it will probably be online at some point, so you can go check that out. And that was my talk, so thank you for listening. APPLAUSE Exactly 25 minutes as well, so that went great. Any questions? Yes? How do you determine which expressions to mutate? OK, the question is how we determine which expressions to mutate. Basically, there's a lookup. So it does abstract syntax tree analysis, and for a given node there's a lookup table that says, OK, if we have this kind of operation, these are the mutations that we can do. So there's basically a big mapping file with all the options, and that is probably not complete for every standard library out there, but for a lot of the logic, comparisons and stuff like that, you can do it pretty completely. Yes? I couldn't hear the last part. Can you repeat that, please? OK, so the question is what a good baseline is when you start with mutation testing. Right, so if you have a new code base, if you have a green field and you implement mutation testing from the start, it's actually relatively easy to get a mutation score in the high 90s. With an existing project, it's usually very hard. So actually Stryker for JavaScript has a mutation score of around 80%, which is actually pretty good. It's really hard to get very high scores. It's not like coverage: if you're anywhere close to 80%, you're actually doing pretty well, I think. Yes? Yeah? Yeah, so the question is: when the purpose of mutation testing is to make sure your tests are good, if you're doing selective mutation, how do you know you're not missing something? Actually, you don't. You might miss stuff.
That's the point: because it can take a really long time, you at least have the option to say, OK, I'm OK with 80% accuracy if it's half the time, because for me that's a good balance for some use cases. But yeah, you have to accept that you're missing stuff, because you're just not running all the mutants. Then the question about the combination of mutation testing and test-driven development. Well, you have to write your test first then, but you can only mutate once the implementation is there. So once your tests are green, you can check whether you actually did a good job before writing your implementation, which is kind of strange, actually. But it's very nice, actually: if you have to change your tests because of changing requirements and you re-implement part of your code, then mutation testing will check whether your tests are still complete. So that's actually very good, very nice. Yeah, and if you really want to go further into that, there's property-based testing; you're never going to test all possible inputs to know for sure that it's correct, because that's not feasible. And property-based testing is really hard, too. Do we have time for more questions? Four minutes. All the time in the world. Up front. The question is, from your experience, if, for example, you have normal unit tests and they run in, let's say, one minute, how long will they run using the framework? So the question is: if I know how long my unit tests run for, how do I know how long mutation testing will take? And there's only one answer: it really depends. It really depends on why your tests take a minute, but it's going to be a lot longer. It's not going to be four minutes or five minutes; it's way more than that. The only way to find out is to actually run it, because the problem is it really depends on how many mutations can be generated for your specific code, because that is what makes it slow.
And because of all these optimizations, you cannot really predict how long it will take. I didn't hear that one. So the question is: how do we report it? For Stryker, and you can go to the talk by my colleague, he went into that in a bit more detail, we have a standardized format for that. There's a nice dashboard, but go watch that talk and you will know more. Up front, yes. So you already run one mutation at a time? Yeah, you have to. If you want to do more, you have to prove first that these will not influence each other, so you need to know, if one test fails, which mutant caused it. Can you achieve that with coverage? No, you really have to do data flow, control flow analysis, and stuff like that. So that's very, very difficult. But yeah, there's somebody working on it right now. So maybe in half a year's time, we'll have some more to talk about. Yeah, Stryker.NET already has an implementation for it, but it's not scientifically proven, so we do not know 100% whether it's correct, but it's at least 95% there. So if that was the last question: if there are any more questions or you want to talk about some stuff, I'll be outside in the hall. And if you asked a question, feel free to come grab your socks here up front. There's plenty. Thank you. APPLAUSE
Running systemd integration tests with mkosi
I'm Daan, I work on systemd and I also maintain the mkosi tool, which is a sister project of systemd, and in my day job I work on the Linux user space team at Meta. So specifically, why do we want to do this? systemd is a pretty low-level user space project, so running its integration tests is not as trivial as it is for a regular project. Specifically, we want to make sure that we don't accidentally break the host machine, which, when you're running something like systemd, becomes rather easy. We also want to minimize the requirements that are needed to run the systemd integration tests, so that regardless of which machine you actually run them on, or regardless of which machine you're hacking on, you can still run the tests. This is especially important for new contributors, because at the moment the barrier for writing a new integration test is pretty high, and we want to make that lower. We don't want any host system details to leak into the integration tests. Currently that actually happens quite a bit, and it means that you often get a failure, for example on a CI machine, that you can't reproduce locally. And when that happens, it's usually a huge pain to figure out what's going wrong and how to fix it. So we want to try and make these tests more reproducible regardless of the machine that they're running on, so that we avoid issues like this. We want to be able to parallelize them as much as possible, and again, the isolation from the host helps here, because it allows you to run more instances of tests without having to fear that they are fighting over the same resources that might be leaking in from the host. We want to make them easy to run, of course, like I said, for new contributors, and we also want to make them easy to write. So before I go further with the integration tests, I'll give a little overview of mkosi. mkosi is basically systemd's tool to hack on systemd.
So because systemd is such a low-level user space project, you can't just build it from source and then run it, especially not if you're working on the init system itself, because you're very likely already running a systemd on your laptop and you can't simply replace it with another one. And even if you could, if you write a bug and it crashes systemd, then your laptop is instantly unusable. So we need another solution, and specifically we need to run it in a virtual machine, so that if something goes wrong and it crashes, you can simply kill the virtual machine and it's like nothing ever happened. And this is where we use mkosi. So we use it to build a Linux image that contains systemd compiled from source and installed into the image, along with all the other packages from your Linux distribution that you would need for development. You can then boot it in QEMU, do whatever testing you need, shut down the machine, and then you can submit your patch. So mkosi does a few things, but the primary thing it does is simply run your distribution's package manager to install whatever packages are needed, and then run various tools to configure the image, most of them coming from systemd but also a few from Linux itself. It builds an initramfs where necessary, it generates a unified kernel image if you want it to, and then it packages it all up and boots it in QEMU. We can generate a few different output formats, but the most important ones are probably disk images and just a plain directory. So what does this look like? If you want to build an Arch Linux image, install systemd and Linux, and then enable autologin, that's how you do it. And this will build that and then boot into it with QEMU. So you eventually end up in a root shell in a virtual machine with systemd installed. You don't need root privileges for any of this, which is another thing we want to do with the integration tests.
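The configuration on the slide isn't in the transcript, but based on mkosi's documented settings, the example described here might look roughly like this (a hedged sketch, not the exact slide contents):

```ini
# mkosi.conf — sketch of the Arch + systemd + autologin example
[Distribution]
Distribution=arch

[Content]
Packages=systemd,linux
Autologin=yes
```

With a file like this in the project directory, building and booting would be a matter of running `mkosi` followed by the QEMU boot verb.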
Currently you need root privileges, so any files that get written are owned by the root user in your home directory, which means that you run into weird issues when you try to delete files and stuff like that. So we want to try and do it all without root privileges. As for configuring mkosi: it's a systemd project, so we do the usual INI-style file stuff. You can conditionally include settings with a match section, to only apply something to the Fedora distribution, for example. So we already use this for hacking, but we don't use this for the integration tests yet. We use mkosi for manual testing, which is not exactly great, because the automated testing still runs outside of mkosi. This is because the integration tests existed before mkosi was there, and the way they were implemented, they of course still wanted you to be able to run them in a virtual machine, but instead of assembling the virtual machine from distribution packages, the implementation decided to use the files from the host. So, similarly to initramfs generation tools like dracut, which is where the approach came from, they pick various files from the host when building the integration test image, and that becomes the image, and in that image you run the test. The problem is that this is completely independent from mkosi, so we have two very different environments: one for hacking, and then another for running the integration tests, which isn't great. Even if you manage to do some manual testing inside mkosi, you then have to somehow translate that to the existing integration tests, which is very hard sometimes. We have a custom test runner using make, so it's all implemented with make and bash and shell scripts. We don't really use any off-the-shelf tooling here, so it can get very nasty. The tests themselves, and this is the one part that does work well, the tests that run inside the image are implemented as systemd units. So how does that work?
We start the image, and then we pull in the test unit, and that unit executes the test. If the unit succeeds, then the test has succeeded; if it fails, the test failed. Of course, all the test-specific dependencies have to be added to the image, so this ends up being, I think, a two or three thousand line bash file now, which is responsible for making sure all the dependencies get picked up from the host file system and put into the image. So it's very complex, and I don't think anyone fully understands it. Any customization that you want to do to these test images also requires writing a lot of bash, which again is very hard, especially for new contributors, to figure out. As you can see, this is roughly what you currently do to run a test. So as I said, the files get picked up from the host for the current images, but of course we do need the systemd built from source. So you build systemd from source on the host as well, and then what the three-thousand-line bash file does is basically take files from the host, take files from the build directory, combine them, and you end up with this Franken-image that contains God knows what: half systemd built from source, half from the host. And that's the image the tests run in, and as you can imagine, figuring out what's going on in this environment can be rather complicated. So what do we want to do instead? We want to reuse as much of our existing tooling as possible. One part is mkosi, which is already used for the development environment, and the other part is systemd's build system, which is Meson, which already has test targets that will execute the tests. This was primarily intended, I guess the primary goal for this was actually unit tests for C or C++ projects, where the test target in Meson simply executes the unit test.
But there's nothing really specific about it that says it can only be used for unit tests, since all it does is run a command and check whether it returns a zero or non-zero exit status. So it's perfectly possible to run integration tests with it as well. I wanted to make use of that so that we can simply add a meson suite specifically for the integration tests, and then running them is exactly the same as running the unit tests. That makes things more uniform, and we hope it will generally lower the barrier for running the integration tests for newcomers to systemd. We want to make sure that all the tests reuse the same image. Currently the image gets rebuilt quite often for individual tests, which makes the whole thing a lot slower. We want to get to a point where we can ideally reuse the same image, even the same one that we use for hacking, for the integration tests as well, so we can make use of caching and avoid having to rebuild the image. And for customization, instead of writing a whole pile of bash, you can just reuse all the settings that mkosi provides to customize the image. We hope that running an integration test will look roughly like this. A proof-of-concept PR is already available on the systemd GitHub repo where we more or less have it like this, so that an integration test can be executed simply by running meson test, specifying the individual test if you want to run one, or the entire suite if you want to run all of them. Meson supports running tests in parallel, so we want to make use of that as well, to be able to run multiple integration tests in parallel. Of course, since these tests are quite heavy, because they spawn a virtual machine, we can't do as much parallelization as we would with unit tests, but we can probably still run more than one. So how do we run an integration test in a virtual machine with systemd?
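The meson suite described above might be declared roughly like this (a sketch; the test name and wrapper script are hypothetical):

```meson
# meson.build sketch: register an integration test in its own suite,
# so `meson test -C build --suite integration-tests` runs the whole
# suite and `meson test -C build TEST-01-BASIC` runs just one test.
test('TEST-01-BASIC',
     find_program('run-integration-test.sh'),
     suite : 'integration-tests',
     timeout : 1800)
```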
There are a few interesting things about running a test in a virtual machine that make it tricky to get the results out. For example, if meson runs a unit test, the process simply exits with its exit status, either zero or non-zero, where non-zero means the test has failed. But if you're running an integration test in a virtual machine, when that test unit fails inside the virtual machine, that doesn't mean your virtual machine is suddenly going to exit with exactly the same exit status. So you're not able to use that, without some effort, to determine whether the test failed or not. You need to somehow get the exit status of the test out of the virtual machine and onto the host, so that it can be interpreted by meson. The way we do this in systemd is by using the AF_VSOCK socket family. This is a socket family like TCP or UDP sockets or Unix sockets, but it's specifically intended for virtual machine to host communication. You can assign a virtual machine an AF_VSOCK device, and it has a connection ID which identifies the virtual machine. Then you can bind to ports on it inside the virtual machine, and you can connect to it from the host. So we use this for passing data from the guest to the host. systemd has the notify protocol, which basically lets it send messages about its status over a socket. And we extended this with support for AF_VSOCK, so that we can send information about the virtual machine to the host if someone is listening. The most basic use case of this is to tell the host when the machine has finished booting; we send READY=1 then. But it turns out that we can also simply send EXIT_STATUS= with whatever the exit status is, and that's how you can get an exit status out of the VM. So this is the exit status of systemd itself. How do we make this exit status of systemd the exit status of our integration test?
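The notify messages mentioned here are just newline-separated KEY=VALUE pairs sent over a socket, so the host side mostly has to split them up. A toy sketch of that parsing in Python (this is not mkosi's or systemd's actual code, just an illustration of the wire format):

```python
# Toy sketch (not mkosi's actual code): parse an sd_notify-style
# datagram, which is newline-separated KEY=VALUE pairs, and pull out
# the exit status that the guest reported over AF_VSOCK.
def parse_notify(payload: bytes) -> dict:
    fields = {}
    for line in payload.decode("utf-8", errors="replace").splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '='
            fields[key] = value
    return fields

# e.g. what a VM might send when the test unit exits cleanly:
msg = b"READY=1\nEXIT_STATUS=0"
fields = parse_notify(msg)
exit_status = int(fields.get("EXIT_STATUS", "255"))
```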
Well, we have two unit settings for this: SuccessAction=exit and FailureAction=exit. What these two settings say is that when this unit exits, systemd should also exit, and specifically with the exit status of that service. This gives us a way to pipe the exit status from the integration test to systemd, which then exits with the same status. systemd sends it over AF_VSOCK to mkosi, which is listening; mkosi reads the exit status and then exits with that same exit status itself. So you get this whole flow of data through to the host, just to be able to exit with the same exit status in mkosi. Of course, just getting the exit status isn't really sufficient. If you had to debug a test failure just by looking at the exit status, you'd have a pretty bad experience. So you also need the logs, ideally. Because we run on a serial console, the serial console output is already displayed, so you get that automatically. But we also wanted a way to get the systemd journal off the virtual machine and onto the host. Normally you would just mount the disk image after the virtual machine has finished executing and get the journal out that way. But remember that we want to be able to run these integration tests without root privileges, and if you don't have root privileges, you can't mount any file system on Linux. So we can't mount the disk image anymore after the virtual machine has shut down; we need to get the logs out while the virtual machine is running. How do we do this? Well, again with AF_VSOCK. In the next version of systemd, most likely, we're going to add another forwarding mode to systemd-journald so that it can forward its logs over an AF_VSOCK socket. So again, you can have something listening on the host on AF_VSOCK, configure journald to send its logs over that socket, and then simply store them on the host instead of in the virtual machine itself.
Or do both, because having the logs available in the virtual machine as well can be useful for debugging. To listen on the host, we have this little program, systemd-journal-remote. You can configure it to listen on any address; this can also be a Unix socket and so on. It will simply store the logs in the directory that you specify. Once it's done, you run journalctl, you specify the directory the logs are stored in, and you get the logs of the virtual machine. You can access them, read them, debug what's going on, or just store them in whatever CI system you're running the tests in. Then, of course, we need to be able to debug any failing tests. The test gets started via the serial console, but when meson is running a test it doesn't give you interactive access to the serial console. So we need a way to get into the VM without needing the serial console. The regular solution for this is SSH, of course, so we want to provide SSH access to the VM. But we don't want to tie this to the network of the VM, because we might be testing very specific network setups; tests might involve multiple VMs with a very particular networking configuration, and that network setup might not allow access to the VM via SSH. So we want to use a different transport, and again we can just use AF_VSOCK for this. This just got merged; it will be in the next release of systemd. When systemd is started in a VM with an AF_VSOCK device, it can now detect this during early boot via a new generator, and it will bind port 22 in the AF_VSOCK family to a socket unit, which will start sshd when connected to. This allows you to use sshd over VSOCK, so you can connect from the host to the connection ID of the virtual machine using SSH, and you get an SSH session in the VM without needing to configure the network.
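The host-side glue for this is a small ssh_config drop-in; the sketch below is modeled on the one systemd ships, though the helper's path can differ per distribution:

```
# ssh_config drop-in sketch: route "vsock/<CID>" host names through
# AF_VSOCK via systemd's ssh proxy helper instead of the network.
Host unix/* vsock/*
        ProxyCommand /usr/lib/systemd/systemd-ssh-proxy %h %p
        ProxyUseFdpass yes
```

With something like that in place, `ssh root@vsock/4` opens a session in the VM whose connection ID is 4.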
To provision your public key, we use systemd credentials, which can be provided via SMBIOS, to get your SSH public key into the VM in the correct location, .ssh/authorized_keys. So you don't need to do anything; you don't need to enter a password or anything. You just SSH in, it does the usual public key authentication, and you get your root shell in the VM and can debug whatever you want. To make this nice to use on the host, we can drop in an SSH config file that configures a proxy command for SSH. We take ownership of the unix/ and vsock/ host name prefixes, so you can do ssh vsock/ followed by the connection ID of the virtual machine to get an SSH session into that virtual machine. This is what we're going to try to use to debug any tests that are going wrong. That was all I had to say. I'll put up a link to the project; go take a look. We want to use this for the integration tests, but mkosi is of course useful for a lot of other things as well. If you need it for building Linux images, please take a look. I'm always happy to add new features, or you can join the Matrix channel, which is linked in the README, and ask questions, and I'll be happy to answer them. Thank you for listening.
Making it easy to get to SLSA level 2
Hello, hey everyone. Welcome to Making It Easy to Get to SLSA Level 2. Thanks for sticking around; it's the last day of the conference and pretty much the last talk. So today we're going to be talking about SLSA and compliance, and hopefully how you can meet those compliance requirements easily. My name is Theofilos and I'm going to be talking about Chalk, an open source framework we developed at Crash Override. I come from a security background, and every time I hear the word compliance I get bored to death; it feels like a box-ticking exercise. But hopefully we can discuss this today and see how you can do it in your own organization easily, while also getting value for your org. Before jumping into the topic, let me quickly set the scene and talk a little bit about software supply chain attacks. In a software supply chain attack, the attackers compromise the build system or the package registry and get a foothold there. Over the past years we've been seeing an increase in these types of attacks. There was a report from Sonatype that said since 2019, year after year, we've been seeing a sevenfold increase in this type of attack. A report came out in 2022 saying supply chain attacks had surpassed malware-based attacks by 40%. And last year, around two out of three US businesses were impacted by a supply chain attack. You can take these numbers with a grain of salt, but the fact of the matter is there is a surge in these types of attacks. And this popularity on the attack side drives policy changes. So in May 2021, there was an executive order from the White House saying software must be provided and purchased with a software bill of materials and provenance information. Quick show of hands: how many are familiar with the terms SBOMs or provenance? Cool. How many of you have been deploying these in your pipelines or your organizations? Okay, great. There's also an SBOM devroom here, by the way.
So today we're going to jump into these topics real quick. We'll discuss some concepts, then talk about the challenges people face when trying to deploy these things to production. Then we'll talk about Chalk, and how Chalk can help you solve these problems and achieve many, many more things, and hopefully have a discussion at the end. For those of you who are not familiar with software bill of materials, or SBOMs, you can think of them like a list of ingredients for software. You go to the supermarket, you see a package, you read the label and you get a list of all the ingredients in there. An SBOM is pretty much the same thing, but for your software applications. You get either an XML or a JSON document, and from that you can get a list of the packages, their versions, etc., etc. When we're talking about provenance, what we're really talking about is: how did the artifact get here? Who created it, who packaged it, how was it modified along the way until it actually reaches the user? That is all good, but if we think about a list of ingredients, what are the guarantees that what we get is actually what we were promised? For instance, you could have an npm package and generate an SBOM for it saying these are the ingredients in there, but then an attacker could get a foothold somewhere in your build pipeline and inject something that was not originally there. So another key component here, besides generating the SBOM and the provenance information, is having some attestation around the integrity of the generated artifacts. Anyone should be able to cryptographically verify that at least what we were promised has not been tampered with, and that the contents of the SBOM came from the original author, etc., etc.
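For a concrete picture, a minimal SBOM in the CycloneDX JSON format, one of the common formats, might list a single package like this:

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "version": 1,
  "components": [
    {
      "type": "library",
      "name": "lodash",
      "version": "4.17.21",
      "purl": "pkg:npm/lodash@4.17.21"
    }
  ]
}
```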
And what's really important here is that we need some clear assumptions around the threat model, i.e. what can an attacker compromise, and what security guarantees do we get depending on that. Do we require the attacker to compromise our build pipeline, say, or to get a foothold on developers' boxes? What's our threat model? That's really important, because if we think about DevOps pipelines in practice, you have many components: developers pushing code, that code ending up in some provider like GitHub or GitLab, open source packages, container images, infrastructure code that modifies this code and pushes it out, and then somehow it ends up on a server or in the cloud. And as we build out this whole graph of components, attackers could get a foothold at various places. This is where SLSA comes in. SLSA is an OpenSSF project and essentially gives us a framework to talk about the security posture of our applications, with different levels for the supply chain security of our artifacts. At level one, essentially all we're doing is providing information about how the package was built; we have a report, but we don't really have guarantees about whether the report has been tampered with or not. At level two we get signed provenance. Essentially, at this point we can say that once the report has been generated, there has been no tampering with that artifact, but you don't get guarantees around the build platform, etc. And as you move up the levels, you get stronger and stronger security guarantees. So today we're going to be talking about Chalk, and how easy it is to get to SLSA level two if you deploy Chalk in your build pipelines. So how does one start to do this? This is all good; we all want to improve the security posture of our applications and deploy these things in our organization. How does one start to do it?
You could think that surely this is a solved problem and there must be tools for it already, and you're right to some extent, but the tooling ecosystem is really in its infancy and largely fragmented at this point. It's not necessarily obvious to a newcomer which tool or framework they should pick, and even within a single space, like SBOMs, the outputs of different tools are inconsistent with each other. One tool gives you a certain report, another tool gives a different report, and there may be assumptions about how these things should be set up and deployed. So it's not straightforward, and what's really, really hard is thinking about how you can do this at scale. If you have a large organization with multiple repositories and different providers, how do you make it easy for your teams to just set this up and let it run, make the data easy to consume, and also generate the data that is of interest to you? So yeah, it's not an easy problem, and hopefully Chalk will help here. The main idea behind Chalk is that we have some metadata we care about, and we want to embed that metadata, which we call chalk marks, into the artifact. The artifact could be a binary or it could be a Docker image, and you embed this metadata into the artifact at build time or post build time. So you could have an ELF file on a box, and you can inject metadata into that ELF file saying, okay, this was indeed here, along with information you care about, like the security settings on that box, for example whether AppArmor is enabled, or the users, or the network connections. You embed that metadata, and now that artifact is tagged, and once you have that tagged artifact you basically let it go and it gets deployed somewhere in production.
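To make the chalk mark idea concrete, here is a toy sketch in Python of embedding a JSON metadata blob at the end of an artifact and reading it back later. This only illustrates the concept; it is not Chalk's actual mark format, and the marker string is made up:

```python
import json

MARK_PREFIX = b"\n# MARK:"  # hypothetical marker, not Chalk's real format

def add_mark(artifact: bytes, metadata: dict) -> bytes:
    """Append a JSON metadata mark to the artifact's bytes."""
    return artifact + MARK_PREFIX + json.dumps(metadata, sort_keys=True).encode()

def read_mark(artifact: bytes):
    """Recover the last mark appended to the artifact, or None."""
    idx = artifact.rfind(MARK_PREFIX)
    if idx == -1:
        return None
    return json.loads(artifact[idx + len(MARK_PREFIX):])

# Tag a tiny "artifact" with build metadata, as a build wrapper might:
script = b"#!/bin/sh\necho hello\n"
marked = add_mark(script, {"commit": "abc123", "builder": "ci-runner-7"})
```

Later, anything that can read the file back can recover the metadata, which is the essence of scanning production for marks.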
So think of Chalk pretty much like AirTags for your code: you embed the tags and then you track the artifact across the ecosystem of your infrastructure. And once the artifact actually gets executed, what's interesting is that you can get back reports with the metadata you configured. So essentially you can scan what's out there in production and grep for all this metadata that has been embedded in the artifacts, or you can configure the artifacts, in some cases, to phone home and give you the report themselves. You can do this once, or periodically, for instance by configuring Chalk to send you heartbeat reports. So let's see this in action. I have set up here a very, very basic git repository, and all this repository does is deploy a Lambda function. We have the main code of the Lambda function here, and as you can see there's nothing really special to it; we just sleep and return a 200 OK. We're building this Lambda function using a Dockerfile, and there's nothing Chalk-specific in this Dockerfile; it pulls from a well-known image. And we're actually building the Lambda using a GitHub action. During the GitHub action we check out the code, we set up the build environment, and then here we're setting up Chalk. So we're telling our build ecosystem that Chalk should wrap this build of the image and embed metadata in it, and what sort of metadata we choose to embed is completely up to us; Chalk comes with defaults. These are the only lines we ever need to add to our build pipeline so that Chalk can embed SBOMs and provide cryptographic guarantees around the integrity of the generated reports, and we're also creating attestation manifests using Sigstore, for those of you who are aware of that framework. So cool, let's go ahead and trigger this.
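The setup step in the workflow can be as short as this sketch; the action name is an assumption based on Crash Override's published GitHub action, and the image name is just the demo's:

```yaml
# Workflow fragment (sketch): wrap the Docker build with Chalk so the
# pushed image carries chalk marks, SBOM and signing data.
- name: Set up Chalk
  uses: crashappsec/setup-chalk-action@main
- name: Build and push the image
  run: |
    docker build -t demo-lambda .
    docker push demo-lambda
```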
I'm going to go here into the action and re-trigger it once more, and what we're doing here is building a Docker image and telling Chalk to encapsulate the whole build and inject metadata. And we can choose how we want to emit that metadata: as a report on the CLI, or to some destination we care about, like S3 or some server. So I have here a dummy server that's running and waiting for reports; there's nothing here currently. And I'm going to go back into one of the previous actions and show you a report that was emitted by Chalk on the CLI. During the build, after we've actually pushed the image, you can see down here we have a chalk report, and this is basically a JSON file with the metadata we care about. So here we know that an image got built, the date and time, the Dockerfile path, the exact contents of the Dockerfile, the commit ID, the author and the committer, but you also get a cryptographic signature for the integrity of this report. You get interesting things like the environment variables and arguments; you can configure this however you like. This is generated on the CLI, but we can send the exact same report, or variants of that report, to other destinations. So going back to the action we just triggered, hopefully once this completes, we will see a report populated on our server. So not only will we see a report here on the CLI, we'll also get the metadata at the endpoint we configured. What could possibly go wrong? This is just a live demo. And you can make this as fine-grained as you like. Chalk supports plugins, so if you want to run, say, your static analysis tools like Semgrep or CodeQL, you can embed their output into the report alongside the other metadata you're tracking. So it looks like this finished, and we did get a report here.
And if I go here, essentially we see that we got a build operation, so that got sent over to our server. This is essentially just a predefined rendering of the JSON, right? You can send it wherever you see fit and render it however you like. But we get some interesting information. We get a signal that we collected SBOM and signing data, and indeed, if I scroll down here, I do see that I have the full SBOM, and I can fetch information about the attestation of the artifact. But I also get a bunch of interesting metadata that might not have been obvious just by seeing the build. So I see here that the cloud provider is Azure, and we have information about the actual Azure instance metadata for the machine the build happened on. Essentially what happened here is that GitHub runs their machines on Azure in this particular instance, so that build was triggered on one of the Azure instances. So that's nice. We can also see the build command, and you can see here how the normal build command is now wrapped by Chalk; Chalk is in charge of the build and embeds the metadata into your image. So that's nice. What we did here is push this demo Lambda, essentially; you can see it was modified just now. So I'm going to go ahead and execute the image, and hopefully, if things work as expected, the Lambda will execute and I'll get a second report here. And that second report is an exec. If I zoom into the exec, you now see that the command that got executed is actually running within the Lambda environment. So Chalk wraps the entry point of the execution for that Docker image and tells you: hey, this chalk mark that you inserted, the metadata that you captured, is still here, but now I'm executing in Lambda. And indeed, if I go here and look at the cloud metadata, you can see the region, the role it's in, the account ID, et cetera, et cetera.
So with this, we can basically say that the metadata we injected in our build pipeline is still present wherever we deploy the image, and we can keep track of where the thing actually executes. So if I look at this chalk mark, sorry, let me zoom out here, I can see that there are two reports associated with it. One was a build and the other one was an exec, for the exact same hash. So the exact same hash that I built on one machine has been executed on the other machine. So what did we do here? First of all, with four lines of YAML in our GitHub action, we generate and distribute the SBOMs. We also have provenance information, because we can track where the build happened and where the actual image got executed. And we also get artifact integrity; in our case we're using cosign, but you could use different frameworks to do this. But essentially, we're meeting the basic requirements here, so we're checking those boxes, and with minimal effort, in my opinion: all you need to do is configure whatever destinations you want these reports to be sent to. So you say, okay, that's cool, what more can you do? Let's think about typical scenarios that happen in live production environments. You might be on call for a given service, you get paged in the middle of the night, and there is some issue: there's a bug, there's a vulnerability, something is off. And you want to figure out which component is responsible for this. You could have, say, a pretty complex application with multiple teams pushing code, and for large organizations, usually the pattern for resolving these issues is you cut a ticket to a team, you wait for somebody to look at it and say, hey, that's the responsibility of that other person. Potentially you grep through the code and ask, what was the last commit? Or you have metrics, and you track from your metrics what changed.
And you try to correlate it to somebody. If you're using Chalk in your build pipelines, it's much, much easier to correlate which exact version of an image is running where, what the components are, and potentially who the code owners are. Because if we go back here, you see that we have things like the committer and the commit ID. So with the commit ID, you can start building these ownership profiles incrementally as you go. So instead of having a process which could potentially take a couple of hours to determine the root cause of an outage or an issue, you can now have this in a few clicks, hopefully. Another common use case is application inventory and change management. So say, for instance, you're part of a large organization and you want to deprecate a framework, say AngularJS. So AngularJS is running in production, and you want to figure out: okay, what is the impact? How many teams are using it? Is the code even live? When was the last time it got executed? You can get reports around these things. More importantly, you can see how applications change over time. So many of the people we've been talking to have processes where, for instance, they hold a sort of change management meeting once a week: okay, what has changed? What has been deployed? Do we need to go through a security review? What's the exact list of changes? And that process is manual to a large extent. Using Chalk, you can automate this, because you can generate an exact report of the diff, and you can get integrity guarantees around that report. But more importantly, besides these things, you can do much, much more, right? You're not limited to chalking containers; you can run tools of your choosing, or write custom plugins for surfacing metadata.
And currently, the open source implementation that we have on GitHub only supports entry point wrapping for containers, but we're working to expand Chalk's functionality with more and more features. You can still chalk ELF files and PYC files and JARs, et cetera. So yeah, the framework is out there. It's written in Nim; Nim is a very, very cool statically compiled, type-safe language. So any fans of Nim here, feel free to contribute, and we're welcoming feature requests. And I think that's my talk. I'm happy to take questions or discuss this. Thank you. Thank you. Yep. [Audience question about large organizations.] Yes. So the question is: I brought up large organizations, but can I give a concrete example of the use cases this would apply to, right? So just to make this clear, this does not only apply to large organizations; it applies to everyone. It's just that if you have a single application with a single repository, you pretty much know exactly what version is deployed where. The complexity of these situations gets amplified the bigger and bigger you get, right? So if you have, say, a web application, and that web application has multiple components that are live at any given time, or say you have a distributed service with microservices running, you have multiple teams committing different versions of their components at any given time. And potentially some of these teams change, so you could end up with a repository containing outdated code, right? There's an incident now, something has failed, and you go into the code and ask, what was the last commit? It was six months ago. The committer of that application has left the team, potentially has left the company. Who do you contact? How do you even know that part is actually outdated?
But if you keep track of your builds and your executions, you now have the ability to tap into the whole history, all the provenance of a certain artifact, and surface the metrics you care about. So if you cared about, say, showing all the components that haven't been updated in the last month, or that haven't been executed in the last month, it's way, way easier to do this. I'm not sure if I answered the question. Yeah. [Audience] So you showed how to do it in a GitHub action, but could you generalize and do this manually, on-prem or in a different pipeline environment as well? Yes, yes. If you go now to the GitHub repo or the website, you can just download Chalk, and it's a binary that runs. You can run it locally and embed metadata into any artifact you care about on your machine. So you can download it on your laptop and scan all the ELF files on your system, or the JAR files or whatever, or even scan a whole directory; you can specify whatever you want. And then you can configure the metadata that you care about, and it will be embedded there, and you can then extract it. So you don't necessarily need to have Chalk report back to you or run it in a GitHub action; you can just use it to embed information and then surface it. So you can both insert and extract, if that makes sense. Yep. [Audience question about third-party software.] So that's a great question. I think one of the big benefits of Chalk is that you can embed information even in generated images or artifacts, right? So if you're using some third-party software, say a library that you're consuming, perhaps you don't know where it came from, but you know that you saw it on a certain machine with a certain hash. And then you can use Chalk to encapsulate that information for your artifact. And basically, if you run a query across all your applications that, say, import a given library, you can see all the versions of that library that are running.
So you can start building these application inventories very easily, even for third-party software. [Audience] What about a third-party container? It's still the same premise, right? Because if you have a container, you have several layers, so you can start saying, okay, these are the layers I have seen here. And potentially you don't have the full information, but you can at least attest: okay, this is the hash that I have seen. We are starting to add support to actually wrap entry points of different layers if you'd like to, so you should be able to interpose yourself in another layer should you want to, but that's not in the open source implementation yet. Yeah? [Audience] How does Chalk play together with reproducible builds? Do you need to include something in the binary? That's a great question. No, you don't need to include any compiler; all you need is the binary. And if you have a reproducible build in your pipeline, you should still be able to achieve the same guarantees. For instance, if you have, say, an ELF file, we'll embed the metadata into a section, and that will survive stripping and all that. So once you have a build, then, assuming you know that you're running with Chalk, right, and you don't modify the thing inappropriately later on, you would at least know that if you're getting a report, that report has not been tampered with. Yep? [Audience] Let's imagine I have a JAR which I have chalked, right? Then I modify it and zip it again, and then I chalk it again. At that point, how do you handle that? Right. So the question is: suppose you have a JAR, you chalk it, then you modify it and chalk it again; how does the tool help you here? So Chalk does not limit you to a single chalk mark within a binary; you can wrap chalk marks within chalk marks within chalk marks, essentially.
So if you're making modifications and you want it to, you can maintain information about past Chalk marks. Or if you're building a jar, say, out of other jars and those have Chalk marks, you can use that information and embed it into your final jar, if that makes sense. So you can wrap and encapsulate all the metadata from all the components. So do I need to do something more for this? Well, it wouldn't be more complex than just saying chalk insert. Chalk would take care of all the build dependencies and make sure it injects it automatically. At least that's where we're heading. It might not be full-featured for all the flavors of what can be Chalked currently, but that's where we want to go for sure. Cool. Thank you.
Are Project Tests Enough for Automated Dependency Updates? A Case Study of 262 Java Projects on GitHub
All right, everyone. So I guess this is the last session for today. And what I'm going to present now is "Are Project Tests Enough for Automated Dependency Updates?". And before I delve into my presentation, does anyone actually have an answer to this? Who wants to attempt to answer the question here? It depends. Yeah, that's a great answer. And I think that's also, in a way, in the right direction. So a little bit about myself. My name is Joseph Hejderup. I'm a member of technical staff at Endor Labs. It's a startup, or more of a scale-up now, based in Palo Alto in California. And before that, well, I'm still actually a PhD candidate at the Delft University of Technology in the Netherlands, so quite close by to Brussels here. And for the last, let's say, six, seven, eight years of my life, I've been quite involved in working on testing and security, but also developing techniques that are focused on applying program analysis to, for example, package repositories, or trying to better understand what's going on within dependencies and dependency trees. And just to talk a little bit about what I mean by automated dependency updates. I guess most of you already know what it is. Essentially, whenever there is a new release on Maven, RubyGems, Cargo or npm, you would have a tool, and I just listed a couple of them, like Dependabot or Renovate or a few others. So when there's a new release for a dependency of your repository, a pull request is created. The tool creates a branch of your repository and tries to build it. If that goes fine, it usually goes to the next stage, which, if you have it configured that way, is to run the tests. And then if everything is fine (in this case the slide shows an X mark, but imagine everything is fine) you will merge it. In some cases, if you know it's not a problem, you would merge it in any case. And I think many of us have usually seen something like this.
It would update version 2.2 to 2.4. So that's the essential thing that I'm focusing on: what I mean by automated dependency updates. And an interesting thing around automated dependency updates is that there's usually this promise that if you just run your tests, you are essentially able to catch any type of regression error, any problem that might exist in your code. And me as a researcher, with maybe a somewhat questioning nature, I felt like, hmm, the tests that we usually have in our projects are more focused on the project's own test suite, and maybe not so much on the third-party dependencies or third-party libraries that you use in your code. So that raised three questions. The first question I asked was: do we even write tests against dependencies in the first place? The second question is: do project tests even cover usages of dependencies in the source code? And the last one is: are tests alone sufficient to detect any bad updates that you might get when using these tools for automated dependency updates? And to study this, I looked into open source projects. Oh yeah, and there's another question, of course: should we even write tests for dependencies? Because if we like to reuse components from open source package repositories, why should we even write tests for that? It kind of gives us the ergonomics that we can just use anything in our code and that's it. And this started as an empirical study, and that's what this talk is primarily centered around. So the first thing that I looked at in the study was the statement coverage of function calls to dependencies. And this is similar to the kind of coverage you'd get from, for example, JaCoCo as a tool. And the other thing that we focus on in the study is how effective tests are at detecting updates with regression errors.
So what we're doing here is that we are basically trying to find, I mean, either find or actually put, regression errors in existing libraries, and then directly validate whether the project test suite can detect that or not. And that's also something called mutation testing analysis, and I think there was one talk about this earlier. And then the last thing in the study is that currently the state of the art is to focus on just using test suites, but could we use another way to find problems or detect issues early that might exist in updating our dependencies? So yeah, the first question is: how can we do some type of statement coverage, or get an idea of what exactly we are using in third-party libraries? We did this in two ways. The first was, and this was of course in Java, to extract all the call sites that we could find in projects; if those call sites point to third-party libraries in bytecode, we consider that a usage. Then for transitive dependencies, because now you're no longer in your own source code, where you have the call sites of direct dependencies, you would also need to go to the transitive ones, and to approximate that (it's not an exact measurement) we essentially build static call graphs to get an idea of what would be used in the context of our project. And then, last, we did some instrumentation. We essentially ran the tests of a project and recorded which functions were invoked in the dependencies. And this gives, let's say, some idea of what exactly is being used or not used at all. So essentially, first we statically derive what all the usages are, and then by running the tests we know which of those functions were covered or not. So quite similar to code coverage. And we did this for around 521 GitHub projects.
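The measurement just described, intersecting the statically-derived set of used dependency functions with the set the instrumented test run actually executed, can be sketched in a few lines of Java. This is a toy illustration with made-up function names and a hypothetical `coverage` helper, not the paper's actual tooling:

```java
import java.util.HashSet;
import java.util.Set;

public class DepCoverage {
    // Fraction of statically-found dependency functions that the
    // instrumented test run actually executed.
    static double coverage(Set<String> staticallyUsed, Set<String> executedByTests) {
        if (staticallyUsed.isEmpty()) return 1.0;    // nothing used: trivially covered
        Set<String> covered = new HashSet<>(staticallyUsed);
        covered.retainAll(executedByTests);          // set intersection
        return (double) covered.size() / staticallyUsed.size();
    }

    public static void main(String[] args) {
        // Call sites found in bytecode / via static call graphs...
        Set<String> used = Set.of("com.dep.A.f", "com.dep.B.g", "com.dep.C.h");
        // ...versus what the test suite actually invoked at runtime.
        Set<String> executed = Set.of("com.dep.A.f", "com.other.D.k");
        System.out.println(coverage(used, executed));
    }
}
```

The same ratio computed per project is what yields the roughly 60% (direct) and 20% (transitive) medians reported next.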
And what we found very interesting was that when we look at the direct dependencies of a project, so this is all the direct dependencies that were found, about 60% of the used functions are, let's say, covered when running the tests. But when we go to transitive dependencies, we found that the median was only 20%. Which means that a lot of the transitive functions that may be used may not even be reachable by tests. So that sort of rings some alarm bells, right? Because it means, essentially, that if you have a dependency update and you don't have any test covering that area, it will still give you a green tick and you might merge it. And I don't think many would want to do that. So that, let's say, raises some questions about how effective tests are for automated updates. And yeah, the other question is: does this matter at all? I think a very interesting case here is Log4Shell, because I don't think many of us would have tests that particularly target logging libraries. But here is an instance of something we wouldn't normally have tests for in any case, and if you do an update and there are breaking changes, then yeah, there will be a problem. Then, going to the second part of the study, which was on test effectiveness: we measured that by doing mutation testing. The underlying framework we used here was PIT, but we modified PIT to do things a little bit differently. And to give a quick idea of what mutation testing is: you essentially have a function, for example, return x plus y. And then you apply some type of mutation operator, where you swap, let's say, the plus. And then you would expect that your test suite would be able to catch this, because here the behavior is completely changed, right? It's no longer an addition.
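The operator-swap example can be made concrete with a tiny Java sketch. This is my own illustration of the concept, not PIT's actual machinery: a mutant is "killed" when a test that passes on the original fails on the mutated version, while a weak test (one that only exercises inputs where original and mutant agree) lets the mutant survive.

```java
public class MutationDemo {
    static int add(int x, int y) { return x + y; }        // original function
    static int addMutant(int x, int y) { return x - y; }  // mutant: '+' swapped to '-'

    // A test kills the mutant if it passes on the original but fails on the mutant.
    static boolean killsMutant() {
        boolean passesOriginal = add(2, 3) == 5;
        boolean passesMutant   = addMutant(2, 3) == 5;    // 2 - 3 == -1, fails
        return passesOriginal && !passesMutant;
    }

    // A weak test checks a case where + and - coincide, so the mutant survives.
    static boolean weakTestKillsMutant() {
        boolean passesOriginal = add(2, 0) == 2;
        boolean passesMutant   = addMutant(2, 0) == 2;    // 2 - 0 == 2, also passes
        return passesOriginal && !passesMutant;
    }

    public static void main(String[] args) {
        System.out.println("strong test kills mutant: " + killsMutant());
        System.out.println("weak test kills mutant:   " + weakTestKillsMutant());
    }
}
```

Surviving mutants are exactly the "bad updates" the study injects into dependency code to probe test effectiveness.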
So normally with mutation testing, you would give it your whole project source code; it would start modifying the source code and then see whether the test suite is able to catch that or not. What we did differently is that we mutated functions in the dependency code, and not the project code at all. And we only mutated those that were reachable by tests. As I was saying earlier, we ran the tests to know which functions were executed, so we used those functions to apply the mutation operators. And from there we can see whether the tests are able to catch that or not. And before I go into that: another alternative approach that we investigated is called change impact analysis. Here we leverage static analysis, specifically call graphs. How it works, essentially, is that we have versions 1.0.2 and 1.0.3. We compute a diff, and from the diff we find out which functions changed. For example, here we know that in the bar and baz functions there is an arithmetic change, instead of y minus minus it's y plus plus, and in the baz function we see that there is a new method call. And what do we do next? We basically build a call graph of the application and its dependencies. And then we use reachability analysis: we know that bar and baz were changed, and here we have, let's say, a reachable path from the application code down to bar, and likewise to baz, where we have the new function call. By using this we can directly figure out, if there is a change in a dependency, whether it is reachable from your code in the first place. And why this is a very nice complement to dynamic tests is that we are essentially finding out, by looking at the source code, what we are actually using.
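The reachability step described above amounts to a graph search over the static call graph. Here is a minimal sketch with made-up function names, not the study's implementation: starting from a project entry point, walk the callees and collect any changed dependency functions that can be reached.

```java
import java.util.*;

public class ChangeImpact {
    // Tiny static call graph: caller -> callees (hypothetical names).
    static Map<String, List<String>> calls = Map.of(
        "app.main",       List.of("lib.bar", "lib.unused"),
        "lib.bar",        List.of("lib.baz"),
        "lib.baz",        List.of(),
        "lib.unused",     List.of(),
        "lib.changedFar", List.of());   // changed in the update, but never called

    // Which functions in `changed` are reachable from `entry`?
    static Set<String> reachableChanges(String entry, Set<String> changed) {
        Set<String> hit = new HashSet<>(), seen = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(entry));
        while (!work.isEmpty()) {
            String f = work.pop();
            if (!seen.add(f)) continue;               // skip already-visited nodes
            if (changed.contains(f)) hit.add(f);
            work.addAll(calls.getOrDefault(f, List.of()));
        }
        return hit;
    }

    public static void main(String[] args) {
        System.out.println(reachableChanges("app.main",
                Set.of("lib.baz", "lib.changedFar")));
    }
}
```

An update that only touches unreachable functions (like `lib.changedFar` here) can be flagged as low risk without running anything, which is the "nice complement" to dynamic tests.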
And then, as a complement where tests might not be covering, we can directly find whether there is any change that might affect your project. Then of course comes the trickier part, which is semantic changes. I mean, it's nice that you can detect that a method changed, but sometimes you might just do a simple refactoring that, you know, refactors a huge method into a couple of smaller methods. The truth is that it's extremely difficult to know what exactly a semantic change is, because there are a lot of factors around it. So what we did was to take what were, let's say, behavioral changes. We looked only at data flow or control flow changes. So for example, if you add a new method call, we consider that a behavioral change; or if you make a major change to your if statements that may introduce new logic in how the control flow works, then we consider that an interesting change to follow. And I implemented a tool called Uppdatera, which means "update" in Swedish. So I applied this, and it essentially shows which function had a change, for example, an RxJava subscriber's onError, and we can see that it's reachable from the project, and then it shows exactly how it was reachable, through the code. And then in the second section it shows what the major changes in that function basically are. So this can give you some context about what essentially changed, other than just telling you that the tests either passed or failed. And then, using this mutation pipeline I was explaining, we essentially generated one million artificial updates by introducing those regressions, and we did this on 262 GitHub projects. And what we found was that project tests are on average able to detect 37% of those changes, which means that a lot of changes may go unnoticed in general.
But if you use static analysis, now that you have the whole context, we were able to detect 72% of all those changes. What we find more interesting, though, in the context of the study, is that there are basically no guarantees that tests can prevent bad updates, and using either of those techniques alone is not good enough to ensure that updates are safe. Then of course the other thing is that static analysis is not perfect; there are problems with it as well. The problem is over-approximation, and we have over-approximation in two places. One is the call graphs themselves, because when it comes to dynamic dispatch, if there are maybe 200 implementations that might stem from an interface call, we have to link to all of them, and that might generate false positives. And the other case is the semantic changes that we are detecting, because we also don't know exactly what type of semantic change it is. But to see how this worked in practice, we also analyzed and applied this to 22 Dependabot PRs. And from the results, what we found in general was that by using static analysis we were able to detect three unused dependencies. So here, let's say, the tests would just pass whatever, but in fact we found that the dependencies were not used at all. And we were able to prevent three breaking updates, one of which was actually confirmed by a developer, where the tests were not able to detect it. And then of course we found that there are, let's say, false positives; as I mentioned, there were many cases with refactorings, and of course the over-approximated call paths. So if you use a tool like this, or static analysis in general, it can help to prevent bad updates, but then you also get a lot of noise as a result. So, coming towards the end of the study: what are, let's say, the recommendations that I have after looking into how tests are being written in GitHub projects, and so on?
So one thing I found missing when it comes to updating with test suites is that we don't have any form of confidence score. What I mean by a confidence score is that, for example, if we start measuring test coverage, we can see whether, when there is a changed function in a third-party library, we even have a test that reaches it or not, and that could directly indicate whether my test suite is able to catch that change or not. Another very interesting thing could be, for example, if you find that one of your libraries is very tightly integrated with your project, that can also indicate whether you have, let's say, enough tests to cover that usage at all. And by having this kind of score, you can maybe get an indication of how well my tests are able to catch problems in third-party libraries. This is something that I would like to see in tooling in general. And then, when it comes to the gaps in test coverage, and this relates to the results on statement coverage and effectiveness, I believe more in a hybrid solution. Where tests, or dynamic analysis, are able to capture things, I think we should use that, because that is more precise. But in areas of the code where we don't have any coverage, for example, think back to the Log4j library, where I usually wouldn't expect there to be much test coverage, it could be nice to complement with static analysis. So you get a little bit of the best of both worlds. And another advantage I see in static analysis over running tests is that we can detect potential incompatibilities much earlier, rather than running everything through the build system, consuming extra resources on builds and tests, and so on. So those are, let's say, the main things that I find important to address. And then for users, like myself, of these automated dependency update tools.
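A confidence score of the kind proposed here could be as simple as the fraction of the functions changed by an update that the project's tests actually reach, plus a threshold below which a green CI tick should not be trusted. The sketch below is my own hypothetical scoring function, not an existing tool's feature; the names and the 0.8 threshold are assumptions for illustration.

```java
import java.util.Set;

public class UpdateConfidence {
    // Score an incoming dependency update: what fraction of the functions
    // changed by the update is exercised by the project's tests?
    static double score(Set<String> changedInUpdate, Set<String> reachedByTests) {
        if (changedInUpdate.isEmpty()) return 1.0;   // nothing relevant changed
        long tested = changedInUpdate.stream()
                                     .filter(reachedByTests::contains)
                                     .count();
        return (double) tested / changedInUpdate.size();
    }

    // Below the threshold, passing tests prove little: flag for manual review.
    static String verdict(double s) {
        return s >= 0.8 ? "tests are meaningful here"
                        : "green tick proves little - review manually";
    }

    public static void main(String[] args) {
        double s = score(Set.of("dep.Parser.parse", "dep.Util.pad"),
                         Set.of("dep.Parser.parse"));
        System.out.println(s + ": " + verdict(s));
    }
}
```

Attaching such a number to an automated update PR would give exactly the warning discussed in the Q&A below: "your tests only cover N percent of what this update changes."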
So although reuse is free, in the sense that we can easily just use a library, we often forget the operational and maintenance costs, and those are not free. So trying to automate everything away with tooling is not always the solution. I think it's important to consider that once we start adopting a library, we also need to think about how we maintain it, but also understand what potential risks might come with it. It could be, for example, that the maintainers have a very different way of handling security vulnerabilities. It could also be the release protocol: there can be disagreements about what is or isn't a breaking change for clients. So I think having that perspective is one important thing. The other thing is, of course, not blindly trusting automated dependency updates, and I guess no one really does this. And then another thing, which could be debatable, is to actually write tests for critical dependencies. This could be a library that's very critical to your project. Having tests there could help catch issues that arise in dependencies early, so they don't come as an unwanted breaking change later on, once you merge the automated PR. So if you want to know more about this work, I have a paper, and I also uploaded the slides on the FOSDEM website, so you can click the link; the paper is open access. And yeah, that concludes my talk, more or less, so I'm happy to take any questions. Do you know if any of these bots, like Dependabot or Renovate, are working on such a score? So that, let's say, the merge request gets a warning: hey, your tests are only covering 10 percent of the dependencies. Do you know if there is any work?
So what I'm aware of is that there is a compatibility score that looks, for a particular dependency version update, at whether, out of, say, 200 PRs, 100 of those were successful for other projects; then it will give a score saying there's a 50 percent chance that you will succeed here. The only thing I find problematic is that every project has its own specific use case, its own context of how it uses the dependency, so it could be misleading. But I haven't heard of anything that looks specifically into your test suite to see how well it's able to do that. Thank you. You mentioned the number of 60 percent of coverage for direct dependencies, and I believe it was a lot less for transitive dependencies. Do you have any numbers on the amount of transitive dependencies in such chains, actually? I can imagine that the 60 percent is cumulative. Do you mean for the statement coverage? Yeah, the first one. So for the first one, the 60 percent was on direct dependencies, and the 20 percent was on the transitive ones. Do you have any numbers on the amount of transitive dependencies, so you can relate it to that 60 percent? Okay, so I did this on around 500 projects, but I might have the more specific numbers in the paper. Okay. You have been looking at detecting errors. Have you looked at the other side? Because you could use it in a hybrid mode, where your tool maybe can tell me: you can make this update for sure, because you don't care about any of the code that has changed. For example, if you look at low-level libraries like Apache Commons, you only use a part of them, but you want to keep up to date, and some updates are more or less completely safe because you don't touch any code that has changed, because only new features have been added or so. That would also help, if I just know: yes, that's safe. Yeah, that's a great question.
So this is a little bit the idea we had with introducing call graphs, because with the call graphs you can start learning what exactly is used. So even if you use a major library and you just use maybe two utility classes, even if you go to a new major version of it, you might not be affected by it. And this is something that should be covered by the call graph: we will see, for example, that for the utility classes there are no changes, but in the rest of the package there are a lot of changes that you're unaffected by. Thank you. Did you check how the call graphs work with dynamic dependency injection? Yeah, so, if I understood the question right, we did generate the dynamic call graphs by running the tests, and this is something that we essentially used to guide our mutation testing framework, to only make changes in those functions and not, for example, functions that the tests didn't touch, because otherwise we wouldn't know whether the tests were able to detect the changes or not.
Perl at PayProp
Thank you. This is a QR code for the slides and also all of the talks I reference in this talk. And yeah, thank you Theo for organizing the Perl and Raku devroom. I'm going to talk about — you can all hear me okay? Yeah, perfect. I'm going to talk about Perl at PayProp, which is the company I work for, an established company, been around for almost 25 years now. And briefly about me. I don't normally do this, but I see a few faces I don't recognize, and I'm sure people don't recognize me as well, so I thought I would do this. I'm a senior dev and head of processing at PayProp. I've been there for 10 years. I've been a developer for just over 20 years. I've worked with various languages, but mostly Perl. But I've only worked for three companies in those 20 years, so I've kind of seen tech stacks grow and shrink and change. I'm a CPAN contributor, LEEJO on CPAN and MetaCPAN. And I'm a London and Swiss Perl and Raku workshop organizer, so come and talk to me if you're interested in any of those. We're searching for a venue for this year's London Perl Workshop, so if you have any ideas, come and talk to me. And I'm a regular speaker at other Perl workshops and conferences, and often I'm helping out with the video. And I occasionally blog on Perl; I prefer to do long-form articles rather than technical "this is how you use this module" kind of posts. And I run the Perl events Instagram account, but that's about the limit of my social media interaction. And I'm a FOSDEM newbie, so this is my first time here. We usually have a work event that runs at this time of year, so it always clashes with FOSDEM, so I've never managed to make it; this is the first time it hasn't clashed. So, about PayProp. That's kind of what we look like, the public-facing part of the site at least. We're a payment and reconciliation platform for the residential lettings industry.
And our core business value is that we turn things like this, and this is one of the nicer ones to deal with, this is a SWIFT format, into things like this. So we put interfaces and automation on time-consuming payment reconciliation flows. And this literally saves our customers hours, days, weeks of time, so we're really, really useful to them. The keen-eyed of you might see cgi-bin/script.cgi in that URL. So yeah, we've been around for over 20 years, so we have some old code, a bit of an understatement in places. But the Perl we are using is relatively modern, 5.32. And we build our own Perl; we don't use the vendor-supplied Perl or the system Perl. We don't do anything special with it. We could in theory compile it with different flags, but we don't do that. So we get the defaults, which means we don't get things like ithreads, because if you use the vendor-supplied Perl, you get things you probably don't need. Yeah, the key is that it's not the system Perl, so we're not tied to any particular version of an OS or package or whatever, and we can apply updates and patches as necessary. We should be on 5.38 by now; we tend to trail a major version. I've been spread a bit thin, so we haven't managed to get to the latest, but that's on the roadmap for this year. Yeah, and it gives us dependency control, which is critical. If you've been paying attention the last couple of weeks, there have been a couple of critical CVEs against a couple of spreadsheet parsing modules, so we could get those updates out quite quickly. Loose coupling, so yeah, like I said, not tied to the OS or anything like that. And the key is it's everywhere. We have the same version of Perl, the same versions of modules, from dev through CI, staging and demo, all the way to production. Because otherwise you get interesting debugging problems.
And the issues and challenges around that, well, probably the ones you've all heard: "you still use Perl?" or even "what is Perl?", and the bus factor, which is, you know, becoming a problem with some of the Perl dependencies. So yeah, it's a 20-year-old, 22-year-old app, so we are in the process of migrating from CGI.pm to Mojolicious. A 20-year-old app has some legacy, a bit of an understatement really. This is an ongoing task, and we're about two-thirds complete in terms of the number of requests to the site. We have a lot more pages than we really use after 20 years. It kind of inevitably happens that people write features and functionality that end up not being used, and we've got hundreds of pages, and really only 20% of them are actively used. So a lot of them will never actually end up getting converted. And one of the ways we did this in one of our other apps is using a plugin for Mojolicious. We decided not to do this with PayProp, because we're using Apache on the front end anyway, so we can proxy to a Mojolicious server, or just use ExecCGI if it's a CGI script. So we're not serving the CGI scripts from Mojolicious using a plugin; there's no real value there, to be honest. So that's what the setup is. I actually gave a talk about this almost a decade ago, so there's a link there to that talk, which has some suggestions for how you can do this if you're using CGI and want to move to Mojolicious, what the options are. But it was 10 years ago, so it's a little bit out of date now, because Mojolicious moves fast. And that is one of the challenges in using it: they say that you can always count on backwards compatibility, but they will deprecate and remove features within a three-month window, which is not really backwards compatibility. So you just have to be aware that if you haven't done an update in a while, things might break. And we're adding an ORM.
And I know this can be a contentious issue, which I kind of find surprising. I'm just tired of writing this kind of stuff. And this is simplified, about as simple a query as you can do. You select some columns from the table, prepare the query, make sure you have the error handling, execute it, grab a hash ref. I'd rather write something more descriptive. All the stuff we can get for free is there. And we can still drop down to vanilla SQL if we want, and we do do that. We have some pretty hairy reporting queries, and we're not writing them in ORM speak, because they're big enough already; if you start using the DSL of your ORM, they become an obfuscation. And the reason we're doing this is that it allows us to isolate some of the legacy issues in the schema. Again, 20-year-old app, organically grown schema, you can have some issues like this, and we can nicely abstract them away in the ORM that we're using. Put this down as stuff Hacker News says: you know, "just fix your schema", and things will break and you might see it. And it's like, no, we're not going to risk the business by breaking stuff. We don't move fast and break things; we want to keep our customers happy. And then another suggestion is: well, why don't you write your own? But why would you do that? We could abstract all our logic into an ORM, but it'd be a half-done one, full of the bugs that all of the available ones have already ironed out anyway. And yeah, we're using DBIx::Class. Very feature-rich, but not dogmatic about its use; you can use it in the ways you want to use it. Some of the issues and challenges around that: well, there's a learning curve, a big learning curve, especially if you haven't used an ORM before. But the manual is very good, and there's lots of stuff on the web about how to do quite bespoke things with it. Currently, I said unmaintained, but I would say stable rather than unmaintained.
There are talks happening to address this, because there's a backlog of patches that could be applied, that kind of thing. And I did talk about this, I want to say, six years ago: using a legacy schema with DBIx::Class, and how you can address some of those issues that you might have in your schema. Business objects, the model. So the older code is kind of a procedural mashup of business logic, database access, view logic, and so on, all smushed into the same layer. The newer code we're factoring into business objects. And the key is that the business objects are our model. Our ORM is not our model; people often conflate the two. And the reason we're doing it is to get all of this stuff: if you're doing object-oriented coding properly, you get all of this really nice stuff. It's not just calling a method on an instance of a class; you get really powerful, useful things. And we're using Moose. We were previously using Mouse, but we're moving to Moose, for reasons that I won't go into here. Corinna is one to eventually look at; that's been added to the core in 5.38, an early version of it. Ovid's going to talk about that a bit later, so I won't go into it too much. But just a quick example, this is the kind of thing we're doing. We're dealing with payments, so we have this incoming payment class, and it has an attribute that references a bank statement, so we're using type constraints. So we can properly constrain that it has to be an object of this type with an ID, and we can throw a useful exception if we try to put something in there that shouldn't be in there. And then we can use the tell-don't-ask principle. We can say "fail that payment", and then the logic is in one place. And we're throwing exceptions if things aren't in the right state, and then we're delegating to the bank statement object to then fail its payment. So it's all nicely isolated, easy to test. So yeah, Moose, again: what are the issues and the challenges?
Well, again, the learning curve. If you've not used much object-oriented programming, this is a big paradigm shift. But I think it's worth it, because I think Moose is one of the best object systems available across any language. And then you add the MOP, the meta-object protocol, and you can use introspection and everything; Perl is very powerful at introspection. And there have been multi-day courses at the Perl Conference just on Moose, so it's impossible for me to even scratch the surface in a small section of a 20-minute talk. People often talk about the slow startup if you're using some of these frameworks and systems, but if it's in a persistent process, a Mojolicious server, that's not an issue: you load it once, it's loaded. If it's on the command line, well yeah, it used to be slow, but things have caught up now, and you're probably running those command-line scripts once in a blue moon anyway. CGI scripts: we do use some of this, but we lazy-load. These are pages that are taking a couple of seconds to run their commands anyway, so the compile time of loading some of those objects is a tiny percentage of that. Yeah, mutable state, that's my technical debt. It's one of the things you learn: mutable state is bad, so in all our new code, our objects are immutable objects. Refactoring and regression testing, and I'm talking about beyond unit and integration testing, because that's kind of the easy stuff. We're adding this for all new code, and any time we do refactoring, we're making sure there's test coverage there and addressing any gaps. But what about those critical business scripts that have existed forever, have no test coverage, and basically run the business? I mean, how do you address this bootstrapping problem of refactoring: you can't work easily with the scripts because there are no tests, but you don't want to refactor them because there are no tests. It's kind of a catch-22 situation.
Well, this is Perl, so we've got some useful features we can use to work around that. One of the frameworks we've come up with: we create override libraries that we pass into scripts, which allow us to override various functions at various times in the lifecycle of the script as it runs. So here we are overriding the call to File::Slurper's read_text function — we say "run this script with this override library path", and then we have these various blocks that will override calls, so we can kind of monkey-patch things. So we can add as much test coverage as we need and then start changing the script. That's an example of how we do it, a bit down in the weeds, but I would encourage you to watch this talk by Nick — he talked about this at the Perl and Raku Conference last year. It goes into all the details of how you can do this: which blocks you can use to run when, how it works, and some of the issues around doing it — because you're actually adding technical debt when you do this, but we need that test coverage there. So the aim is: get the test coverage in place, refactor the scripts, refactor the test coverage, and we're in a better place. This has been critical for some of the scripts we have, because they literally run the business and they literally had no test coverage — well, they have test coverage now. Like I said, we don't move fast and break things. Contributing to CPAN. So yeah, we actively encourage contributions to CPAN. These are all the distributions that we've either written or taken over maintenance of in the last decade, which is the time I've been at PayProp. Stuff like some Mojolicious plugins — there's this plugin for Mojolicious that allows you to profile your routes using NYTProf; it's really useful. I wrote some of this OAuth2 server stuff — if you've ever used OAuth2 and tried to implement the server side, it's a fun game; that hopefully makes it a bit easier. Third-party payment libraries.
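To illustrate the monkey-patching idea (this is not the speaker's actual override framework, just a sketch of the underlying Perl mechanism it relies on — and note that if the legacy script imported `read_text` into its own namespace before the override runs, you would need to patch that copy instead, which is part of why the real framework cares about the script's lifecycle):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurper ();

{
    no warnings 'redefine';

    # Replace read_text with a stub that returns canned test data,
    # so the legacy script can be exercised without real files.
    *File::Slurper::read_text = sub {
        my ($path) = @_;
        return "canned contents for $path\n";
    };
}

# Any later fully-qualified call now hits the stub.
print File::Slurper::read_text('/etc/some_config');
```

With enough of these overrides in place you can pin down the script's current behaviour in tests before touching a line of it.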
We interact with third-party payment providers, so we've written some stuff. GoCardless do direct debit in the UK. TrueLayer is a newcomer — they're using the open banking spec, so I think they're going to get quite big in the coming years. And other stuff: we maintain CGI.pm, because we still have CGI scripts. We maintain otherwise unmaintained libraries — Google Maps stuff and that kind of thing. The issues and challenges around that? Well, the pool of contributors to CPAN is shrinking. Libraries for newer services and APIs don't exist — often you'll find third-party libraries for every language except Perl, which is a shame. But modern APIs are RESTful and easy to create a third-party library for, and we're happy to throw somebody at it for a week or two, which is what we did with the TrueLayer one: they threw me at it for a week, and now there's one on CPAN. Navigating around IP issues — well, that encourages us to decouple our code, so that's actually quite a good thing. And finally, hiring devs new to Perl. I'd say Perl has been on the plateau of productivity for quite a while. Those that left it a long time ago don't know the current ecosystem — some are more than a generation removed from even modern Perl 5. Perl 1 was released in 1987, and Larry was probably prototyping it a long time before that. Perl 5.10, which can be considered the start of modern Perl — there are people starting university now who were born after 5.10 came out. But it's still in a lot of places, and I know that because we've interviewed people. Some of these users can't talk about it: banks, the FAANGs — I won't emphasize which letter in FAANG — but we know there are people using Perl in these places. So I think the rumours of Perl's demise are greatly exaggerated, but it's kind of a known unknown at this point. And it's still being used in greenfield projects: the system that FOSDEM uses to review, annotate, cut, process, transcode and publish all of their videos runs on modern Perl.
So over a thousand videos this weekend are going through a modern Perl system. Its popularity has kind of normalised over the last two decades, I think. So it's hard to find Perl developers. But newcomers don't have preconceptions — that's my experience of interviewing, anyway. I think those under 30 either haven't heard of the language or haven't used it. And those who don't want to use it self-select out of the process anyway, because we are explicit that we use Perl in our job specs — we just don't require it unless we're hiring a senior Perl developer. And I find modern Perl an interesting and enjoyable language to work with. Working with legacy code is not specifically a Perl thing, and we make sure to do all of this stuff, because you should be doing all of this stuff — and we're finding that in a distributed work environment you really need to do all of this stuff. I've not really talked about this much in the past, but I have written blog posts, so check those out if you're interested. And the key is that you can be very experienced but still a newcomer, and that's absolutely fine. I think it's actually beneficial to the ecosystem and the community. So if you are, please speak up — we want to hear from you. And that's it. I don't think I have time for questions, so thank you very much. Thank you.
Open Food Facts: Learning and using Perl in 2024 to transform the food system !
I'd like to welcome Pierre — I'll get your last name wrong, Pierre. Pierre Slamich. All right. I think it's one of the more recent Perl projects started, isn't it? We created Open Food Facts back in 2012, so it's just over a ten-year-old project — nearly a teenager. Right, let's welcome Pierre. And thank you, Lee, by the way — we use, we depend on your work. So I'm going to talk about Open Food Facts, and it's not going to be a very technical talk, but more about the experiences of people getting into Perl in 2024 to contribute to food transparency and to transform the food system. So, yeah, I'm Pierre, one of the co-founders of Open Food Facts. I'm not the technical guy — I'm the product manager, but I dabble in Product Opener, which is our Perl backend. So, on the menu: I'm going to briefly introduce Open Food Facts for those of you who don't know it yet, then I'll have a part on starting Perl in 2024, with some portraits of our contributors, then how you can have an impact on the food system with Perl, and finally some Q&A. So, about Open Food Facts: it's the answer to a very simple problem — how do you pick a product in the supermarket? You have many products and a lot of information; it's hard. If you want to pick one for your children, it's very hard. And then you have this long ingredient list that sometimes you can't read, and the nutrition table — personally, I have never managed to make sense of it. And you have to make decisions every day to get food. So Open Food Facts is all about empowering users to have an impact on their own health, but also on the environment and the food system at large. We kind of have this slogan: when you're in the supermarket, don't panic — organise. Trying to get together and have an impact on the food system. We've been nicknamed by the media "the Wikipedia of food products". We have over 3 million products in 160 countries and languages.
Our data sources are crowdsourcing — using your mobile phone, you can add photos and add data, manually and with machine-learning help — and the food industry, which is beginning to realise that keeping its data closed doesn't make any sense. We want transparency to become the norm. So I'm going to show you how Perl code in production is having an impact every day for millions. The first thing is the Nutri-Score, which you may have seen in Belgium, in France and in other countries. We started computing the Nutri-Score in 2015 — it was just a scientific formula at the time. So we decided, okay, let's compute it on all the products we have and show it to people in the app. And we helped democratise the Nutri-Score before it passed into law. This is a screenshot of something one of our contributors did at the time: he pasted a Nutri-Score onto all the products using image-editing software. Fast-forward a couple of years, and you go from digital to actually seeing a whole supermarket aisle full of Nutri-Scores — which shows you go from digital to real-life impact. So not only the people who run the code or use the software benefit, but everyone — even people who don't care about it. From Perl code to real-life impact. And it goes even beyond just displaying the score: we started to realise that producers are actually changing the nutritional composition of their food products. It's a systemic impact — code can have a systemic impact on the food system. It's absolutely bananas. What you can also do with the platform is compare products at very large scale. For instance, we are able to monitor the composition of Fanta, and as you see, it's not the same in every country — so basically we can show what the industry is trying to hide from us. We also help producers improve their products: one part of our software stack is the producer platform.
And we do some computations based on the nutrition table to actually provide reformulation opportunities: if you reduce sugar by 20 milligrams, you can actually go from Nutri-Score B to Nutri-Score A. So computing also helps change the products. And yeah, brands are starting to... oh, sorry, I went a little bit too far. Yeah, brands are starting to — all those brands have actually started to share data and use the import system, the mapping and import systems in Open Food Facts, which involve some pretty hairy XML parsing and all of that. And so, yeah, they are sharing data in many countries, at large scale. And to code this stack, we have Stéphane, the founder of the Open Food Facts project, but we also managed to get more coders on board: people who picked up Perl just to be able to contribute to food-system transparency. They started learning Perl in 2022, 2023, 2024 just to be able to have an impact. And Lee, I can confirm that newcomers don't have any preconceptions. For instance, Yukti picked up Perl in 2022 and she's improving the backend code quality. She's very serious about food transparency: she doesn't look at the front of the pack, she looks at the back, where the nutrition tables are. She wrote a lot of tests and bug fixes, and she's into Perl correctness. And she's obviously, like, a soul trying to convert everyone she meets into Open Food Facts users. Stéphane, who coded much of the code, learned Perl in 1998 when he was at Yahoo. He likes to do origami in his free time, and some of the code base is things he coded perhaps a little too quickly ten years ago when he launched the Open Food Facts project. And he recently paddled in 10-degree water. Monalika picked up Perl in 2023 to improve the UI, the tests, and the code — it was part of a program funded by the Perl Foundation to include more people in computing.
So she worked on product image protection, to ensure that data quality stays constant, and on email misuse and user management. Alex, who's a Python person but took up the Perl camel two years ago to contribute to Open Food Facts, is part of the permanent team and uses some of the tools you code in this room — Proxmox, Sanoid, and many, many more. Benoît picked up Perl in 2023 to improve the data quality system, and he's learning nutrition science almost as fast as he's picking up Perl. And John, who didn't do much Perl before and started learning it in 2022, is spending one day a week levelling up in Perl to be able to contribute to Open Food Facts. So I'm going to go a bit faster, but as you see, the dynamic of people picking up Perl is very much alive — young people, women, and so on are actually learning Perl to be able to contribute to Open Food Facts. John also introduced Perl::Critic into the pipeline, and we thank him for that, somehow. Now, a bit more technical. Our backend is Product Opener, the backend for the web version. It's a monolithic system for the web — there's no separate front-end/back-end split — but it also provides the API of Open Food Facts. It provides the database, the read-write API, the website, the producer platform I talked about, the analysis and enrichment of product data (so a lot of regexes in every direction), and the computation of scores from the nutrition table, from the data. We are then able to calculate the Nutri-Score, to know about ultra-processing, and the Eco-Score — which is even more complex to compute — about the environment. So a lot of ingredient parsing, very hairy stuff. And here's what the architecture looks like: we use mod_perl and Apache to query the products, which are stored as Storable files on the filesystem. We are then able to fulfil user queries, and for aggregated queries we store everything in a MongoDB database for more complex queries.
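Retrieving a product from that Storable-based store might look roughly like this — a sketch only; the directory layout shown is hypothetical, not Product Opener's actual path scheme:

```perl
use strict;
use warnings;
use Storable qw(retrieve);

# Each product lives as a serialized Perl data structure on disk;
# retrieve() deserializes it back into a hashref in one call.
# The path below is illustrative.
my $product = retrieve('/srv/off/products/1234567890123/product.sto');

print $product->{product_name}, "\n";
```

Reading a single scan this way is a cheap filesystem hit, which is why the aggregated, cross-product queries go to MongoDB instead.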
So the data structure is very hairy — food is a very complex matter. As the years went by, the data structure became more and more complex; you're seeing probably one-tenth of it here. This is the old interface, and we store everything: we store every revision of each food product as well, to see the evolution of food products over time. I told you that producers were evolving products to make them better — we are able to basically go back in time by storing revisions of the products. So when people scan, we retrieve the product's stored file, but only its last revision. We are also exploring, for aggregated queries, the possibility of migrating to PostgreSQL. So yeah, that's how we do a MongoDB query. The tags are the normalised version of the data, and we are then able to return products that match a specific query. It's very powerful — you can do very powerful stuff, like requesting orange juices that are organic, are sold in Belgium, and possibly contain no additives, etc. So the website is in Perl, the business logic is in Perl: ingredient parsing and data analysis. We have those taxonomies to structure data and data navigation, the score computations as well, importing data from producers, and even a GS1 connection — GS1 is the standard way to share product data. And we also have a knowledge panel system, which basically drives the app completely remotely: rich content, images and all of that. One thing we realised is that we have to make contribution as easy as possible, so we dockerised the project and started adding documentation. We are also working on a better API — the current one is not very RESTful. And we refactor as we go, as we add features, because the food system is constantly evolving. We also want a more service-based approach, as opposed to the monolith, so we have introduced Open Food Facts Query for aggregated queries.
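A sketch of the kind of tag-based MongoDB query described here, using the Perl MongoDB driver. The field names (`categories_tags`, `labels_tags`, `countries_tags`, `additives_n`) follow Open Food Facts' public data model, but treat the exact query as illustrative:

```perl
use strict;
use warnings;
use MongoDB;

my $client   = MongoDB->connect('mongodb://localhost');
my $products = $client->ns('off.products');

# "Organic orange juices sold in Belgium with no additives":
# the *_tags fields hold the normalised, language-prefixed values.
my $cursor = $products->find({
    categories_tags => 'en:orange-juices',
    labels_tags     => 'en:organic',
    countries_tags  => 'en:belgium',
    additives_n     => 0,
});

while (my $product = $cursor->next) {
    print $product->{product_name}, "\n";
}
```

Because every facet is pre-normalised into tags at import time, queries like this stay simple indexable equality matches rather than free-text searches.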
The Folksonomy Engine for additional properties. And our machine-learning stack is called Robotoff. We are currently revamping search, and introducing Keycloak for identification. We are also trying to better document the API with OpenAPI, and adding more tests and integration tests — because stuff breaks, and stuff breaks often. Things we'd like to do on the technical side: the API v3; lowering the barrier to contribution, probably by using a modern web framework — we don't use any. And I saw that there was a Corinna talk: we are also considering Corinna instead of anonymous hash maps, so our data structures could be more documented. And globally, we're refactoring the code into smaller chunks — something for the Nutri-Score, something for the Eco-Score. But one thing: we are not giving up at all. The core of Open Food Facts is, and will remain, in Perl. And then, yeah, also more design-ish stuff, because our interface is still monolithic and people need to be comfortable with Perl to actually do front-end work. So what's next for 2024? I'll go perhaps a bit faster. We are going to improve the mobile app, do some more machine learning, and also do something on Open Products Facts. The Nutri-Score is going to evolve this year, so a lot of computation — we basically have to change the algorithm. It's still a very controversial thing at the European level; Italy is trying to block the Nutri-Score. And once we compute it, we will make it available to everyone. We also have the question of prices: we are launching into price collection due to inflation, so that people can compare prices and make sense of the ongoing situation — and scientists too. And the last, and probably most interesting thing Perl-wise, is that we are going to merge all of our projects together. We currently have Open Food Facts for food, but we also have Open Beauty Facts for cosmetics and Open Pet Food Facts for pet food.
We actually launched those as April Fools' jokes a couple of years ago. But now people are asking to be able to scan anything. So we have four installations of Product Opener on four different servers, and we need to be able to bridge them all together — in terms of architecture, you can imagine it's going to require a lot of retooling. Open Products Facts is all about providing circular solutions to extend the lifespan of products: ensuring that they have a second or third life, that you are able to repair them, to give them away. So, extending the life of objects with Open Products Facts — a data platform for the circular economy, computing the carbon impact of products — and also Open Beauty Facts and Open Pet Food Facts. The work is actually just starting, so if you'd like to get involved, this is just about the right time: we haven't actually started retooling Product Opener for that yet. In terms of helping, how can you contribute? I'm very well aware that you are probably already maintaining a lot of projects. The casual way is basically to scan and add products in your country; then there's translation, spreading the word, design, and of course, for those of you who hack, fixing the code. The best way to start is just to try to install the Docker setup on your machine — it should be straightforward. Also, if you'd like to mentor: we will be part of the GSoC program this year — hopefully we will be — and we will also probably try to submit a project through the Perl Foundation. So if you'd like to mentor Perl projects on Open Food Facts, or actually to take part yourself — it's not just students anymore; as a professional, you can actually be part of the program — be sure to get in touch. How can you get in touch? These emails; you can install the app using this QR code; and if you scan this QR code, you'll get a link to leave your contact details, and we will get back in touch if you want to become a volunteer.
Either a technical task or non-technical task. And that's it. So perhaps if you have any questions or no. Thank you.
Synergy: a chat bot framework
Thank you. Welcome, Ricardo, from Fastmail. Thank you very much. Hey, okay. I did a timing run of this, but I'd had like zero sleep in 48 hours, so either it's going to run shorter or longer — maybe right on time, we'll see. Let me get the timer going. There we go. All right. So, imagine it. It's the future. The year 2018. And at Fastmail, all of our critical systems run through our chat bot. Right? You want to deploy, you go to the chat bot. You want to set up a task for somebody else to do, you want a reminder, you go to the chat bot. And the chat bot speaks IRC — through an IRC gateway, because we're a cutting-edge company. And it's in charge of everything. Right? And then I got this email from Slack, and it said: hey, just so you know, in like three weeks we're turning off the IRC gateway. And I talked to the shareholders, who said they didn't want to close the company. So it turned out we had to take this thing — this is Synergy, our bot — and go through a ragingly quick process to upgrade her to talk to Slack. So this talk is about that project. But it's also about the fact that when we did that, we totally rewrote — not every line of code, but all the lines of the interesting code — to make it not horrible to deal with. Because it was. This is the three of us who did this. Matt Horsfall, at the top middle there, is a frequent person at Perl things. He happened to be in town — he mostly works remote — and we said, great, let's drop everything else we're doing. We sat in a room for five days and rewired our chat bot. And it was great. It was written originally for Perl 5.16, which at the time was cutting edge. And it was written using POE, which was not cutting edge. Who here has ever used POE? Yes? Yeah? Okay. Sorry. This is me looking excited back when I was younger — like, yeah, POE! No. This is POE code. Look, you don't need to know everything about this code, but there it is.
The thing you should notice is that it's pretty weird. Like, `$_[ARG0]` — what the hell even is that? Or `$_[HEAP]` — I use Perl so I don't have to think about a heap, right? It's a mess. So what you need to know about POE to understand this talk is: nothing. Don't worry about it. But even in 2005, when I very first started writing the first lines of this code, it felt weird to use. And it's not really POE's fault. The problem is that for a long time, any kind of concurrency felt weird — at least for me, and at least in Perl. Anything you do, you're like: why is my code now coming from outer space? And POE was just more weirdness that I didn't want to deal with. So my strategy for building the software was really simple: do as much as possible without POE. Don't write the POE code — that's where everything gets messed up. Only use it when you absolutely need to, like all this asynchronous talking to the network server. And you can make that statement generic by saying: concurrency is weird, and weirdness is hard to cope with in your program, so minimize the weirdness by writing less concurrency in your code. Minimize how much of your code has to do concurrency. So you imagine the program looks like this. You've got that magic IRC thing — that's where all the POE lives, that's where it's weird — and then a thing that gets messages and dispatches them to something that does something. And we tell ourselves it works like this, right? The concurrency lives over here, and then there's the good code that we wrote. The magic IRC thing does its magic and calls the dispatcher, and the dispatcher calls the handler and sends it the IRC thing, and you're good, right? That's it. And the problem is: that's not how it works, okay?
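For flavour, here is a tiny POE-style event handler showing the `$_[HEAP]` and `$_[ARG0]` conventions being complained about — illustrative only, not code from Synergy itself:

```perl
use strict;
use warnings;
use POE;   # exports KERNEL, HEAP, ARG0, ... as positional constants

POE::Session->create(
    inline_states => {
        _start => sub {
            # The heap is per-session storage, reached positionally.
            $_[HEAP]{count} = 0;
            $_[KERNEL]->yield( say_hello => 'world' );
        },
        say_hello => sub {
            # Event arguments arrive as $_[ARG0], $_[ARG1], ...
            my $name  = $_[ARG0];
            my $count = ++$_[HEAP]{count};
            print "hello, $name (event #$count)\n";
        },
    },
);

POE::Kernel->run;
```

Handlers never receive ordinary named parameters; everything comes in through `@_` slots addressed by exported constants, which is exactly the "where is my code coming from?" feeling described above.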
Some abstractions let you believe lies and they're good, and some let you believe lies and they hurt you. So imagine it: subroutines form a stack, right? A subroutine calls a subroutine calls a subroutine, and it returns and it returns and it returns. You can violate that, but don't tell me about it. The handler down here has to return. So either the dispatcher is getting the return value from the handler and passing it back to IRC to send a reply, or the handler is doing some weird thing to send a reply before it returns. So what's actually happening? Let's say it's the first one. We're going to engage in a little Socratic method — go through the logical process. A message comes in to IRC, a network message, and it's turned into something that can go to the dispatcher. The dispatcher sends it to the handler. The handler sends it back to the IRC thing. The circle of life, right? Great. No. What actually happens is: it comes in from the network, goes to the dispatcher, goes to the handler — and the handler is like, I got this, but it's going to take a minute; I need to look up 70 million rows in the database. And meanwhile, everybody on IRC is still sending all these other messages, and you're not talking to the network anymore. Your asynchronous thing is sitting there like: I would be so busy being asynchronous if you would just yield to me. And you don't, because you've been avoiding putting concurrency anywhere you can. And pretty soon the whole thing falls apart, and you lose all your messages, and everybody's like, why aren't my deploys working? Because of IRC. So the other option has to be true, right? The handler is doing a thing. A message comes into IRC. It goes to the dispatcher. It goes to the handler. The handler has to do something, because this thing's happening. So it sends a message back to IRC — but now it's blue. Now it's a different kind of message. It's not the "you've got a message" kind.
It's an "I want to send a message", and everything's good. This is no longer just IRC — it's all your async. Your stuff comes in, it goes over there, it keeps going, you're good. Now you need something to handle both kinds of messages: one for "you've got a message, you're going to do something", one for "you're going to send a reply". These boxes should be labeled differently. You're fine. For every kind of message that comes in, you've written your own simple, pretty-much-blocking, but okay handler. You don't even need to dispatch anymore — you just tie it to the message: here's where I go, and you call me. Great. Your code got simpler. The problem is that making a ticket involves talking to the database, which in non-blocking terms means: starting to talk to the database, doing the talking to the database, finishing talking to the database, dealing with an error. So you have to write all of these little pieces of code individually. They're not concurrent, right? They just block if they need to, or they just get called once. They're not doing anything weird. And then your program looks like this. Ah! They call this — this is roughly the dumb-ass version of the actor model. Like Zoolander code. But it's not — it can be good. I just came from the Erlang room; the Erlang room is cool, actors are cool. But you don't write Perl code that way, which means your Perl code feels weird — and we actually want to write Perl code that feels like Perl code. So here's what we do. We make a message, and the message contains its own reply handler. You're like: I'm going to send you a message, and don't worry, you don't need to go write all these million things — when I send you the message, I'll include a self-addressed stamped envelope. If this happens, send this envelope — that's your little reply handler. And now your code still looks like Perl code. You're good. And you do this all over the place.
Like, when you're setting up the listener, you're like: okay, I'm going to bind to the socket, and if there's an error, here's what you do. And if you do connect, here's what you do. And by the way, after you've connected, once you start receiving packets, here's what you do. Right? And you're nesting all these envelopes, and it's great. You've got a pile of top-level envelopes and you're like, yeah: I'm going to listen, and then maybe bind, and then maybe connect, and then maybe accept. And over here, I'm going to do an lstat, then block on the file. Now it's easy: I'm going to create a file over here and then do some stuff with it. I'm going to poll, and we've got all these nested things, and everything piles up, and everything is an envelope inside an envelope inside an envelope — and it doesn't look like Perl code anymore. There's a name for this pattern, by the way: they call it callback hell, because this is what it feels like. Thank you. All of your code is just callbacks — there are no named subroutines anymore. So, you wanted to write this code. Okay, this is all you wanted: you just wanted to make a ticket. You got a message from the network — I put the whole thing up here; I was going to go through it line by line, but whatever. You get a message and you parse it, and it's like, here's the ticket you should make. That's the plan. And then you say: if they're allowed to make the ticket, good; but if they're not, you reply "no" and you return. You're done. Crash early. And then you make the ticket, and then you reply, "I made the ticket". That's the code you want to work. This is the perfect platonic expression of a chat bot: I got a message and I did a thing. And the problem is, these three things block. And this is where your whole program just starts falling apart, because you've got like 75 kinds of event handlers that all look like this, and they all block. So it's okay.
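The "platonic" handler described here might look like this — the helper names (`parse_ticket_request`, `make_ticket`, and so on) are illustrative, not Synergy's real API:

```perl
# Parse, check permission, make the ticket, reply. Simple, readable --
# and every marked call blocks the whole event loop while it runs.
sub handle_make_ticket {
    my ($event) = @_;

    my $request = parse_ticket_request($event->text);

    unless ($event->from_user->can_make_tickets) {
        $event->reply("Sorry, you can't make tickets.");   # blocks
        return;                                            # crash early
    }

    my $ticket = make_ticket($request);                    # blocks: database
    $event->reply("I made ticket " . $ticket->id);         # blocks
    return;
}
```

Three blocking calls in one handler, multiplied by 75 handler types, is exactly the falling-apart scenario the talk describes.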
You can fix this problem by using sequencing — by leveraging promises or futures. And all you have to do is make your code look like this. Which is just another kind of callback hell, right? You're just lining all this stuff up, and you'll end up feeling like "I'm living in the future, it's amazing, I can write all my non-blocking code" — but you're so sad inside, because it's all these anonymous subroutines that you can't debug, and they're real bad. So, remember when I said this earlier: concurrency is weird, so minimize it by minimizing the concurrent code? That was bullshit. Don't do that. You need to lean into it. The problem is this: when you minimize the concurrent code, you write crappy programs, because you write programs where all the weird shit is over here and everything else is coping with it. All of your code is just "I'm here to cope". Don't do that. What you want to do is get the language to hide that complexity for you. The language says: don't worry, you write the code you want to write, and I'm going to make it work. And then you make the code concurrent at the slightest provocation. You're like, oh, this might block — make it concurrent. That's what you do. And you can do that now, because of async/await. And that's what I'm going to talk about for a little while — I promise I'll get back to the chat bot. So you take this ugly-ass code, where you're like "do this, and then call this other code, but then call this other code, and if it fails..." — you don't need to read it; I've read it once, that's enough for all of us. Instead, you write this. It's just that beautiful, perfect, platonic code, except I've stuck some green stuff on it. This sub is now asynchronous: it can yield. And this line of code: I will yield here if I need to. That's all you're saying. "I identify that this code might block — I don't know; let something else figure it out."
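Marked up with async/await, the same handler stays readable — a sketch using Future::AsyncAwait; the helper names are illustrative, and assume each async helper returns a Future:

```perl
use Future::AsyncAwait;

# Same shape as the blocking version, with "green stuff" added:
# `async` marks the sub as able to yield, `await` marks each point
# where it may suspend instead of blocking the event loop.
async sub handle_make_ticket {
    my ($event) = @_;

    my $request = parse_ticket_request($event->text);

    unless ($event->from_user->can_make_tickets) {
        await $event->reply("Sorry, you can't make tickets.");
        return;
    }

    # "This might block; yield here if needed" -- the event loop keeps
    # processing other messages while the database call is in flight.
    my $ticket = await make_ticket_f($request);
    await $event->reply("I made ticket " . $ticket->id);
    return;
}
```

The control flow reads top to bottom like ordinary Perl, but between each `await` the event loop is free to run other handlers.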
And how does this actually work? Well, when you do this, something — it's called Future::AsyncAwait — takes this and pulls the whole subroutine apart into different units, and says: I'll put these together the right way, don't sweat it, I'm going to make it work. And kind of what it puts them together into is this. Kind of. The reality is that what it's really doing is gross and scary, and it involves mangling optrees and putting them back together. But that's what all Perl code is anyway — all this time that you've been writing Perl, it's just building some crazy-ass optree. Maybe there's one person in this room who thinks about optrees every day. Hi, Paul. Most of us don't have to do that, and you still don't have to. So the conclusion of this long digression about async/await is: you should embrace this weirdness. Make your code concurrent easily, all the time. Embrace the stuff so hard that all the weirdness becomes part of you and you don't think about it — but the weirdness is there, making you powerful and making your code better. Just use Future::AsyncAwait. Okay. I'll talk more about it later if somebody asks; I like talking about it, it's very good. So let's talk about Synergy. If there's an unopened bottle of water in this room, I would definitely drink it. Okay. So you can find Synergy here — the link will show up again later; you can ask me for it. You can install it; it's super cool. If you install it and it doesn't work, I'm sorry — and that's all you're getting out of me. I might answer a question. We don't support this. This is software written in the open, not a public project — we're not asking everybody to use and adopt it. If you come and find bugs, we might fix them, or we might say: that's a cool bug you found; here's how it works. There are basically three abstractions in Synergy that you need to know about. Channels, where messages come and go.
And when I say messages — because in, you know, concurrent object-oriented networking code, "messages" can mean a lot of things — messages means chat messages, right? Like, hello, how are you? Those messages. And a reactor, which decides: should I react to this message? So that is the Synergy software diagram. There you go. That's it; you understand Synergy now. And I'm almost not joking. It's really about that simple, which is why it's nice to use. But let's look at the code and answer the question. Most of the time, when you work with Synergy, you connect Synergy's channels to your chat system, and then everything is about the reactors: what does the bot actually do? So that's what we should look at first. This is a reactor. It's a reactor that I use a lot when I don't understand why Synergy did something. I ask for the uptime, and Synergy says, I've been up for four seconds, and I say, aha, well, you just crashed. Here's how it works. It's a package. It's a class. Everything in Synergy is written with Moose. And this one does a role called Synergy::Role::Reactor::CommandPost. The Reactor part of the role means it's a reactor, and CommandPost is so that later, at the bottom, we can say: here's a command. You can write reactors in lots of different ways. I've been spending lots of my free time converting all of the old-style reactors, which were called EasyListening, into the new style, which is CommandPost. You do whatever you want, I don't care, but use CommandPost. It just lets you write a bot really easily. And then the meat is this one thing: the command takes a sub, and that's what runs. So when someone says, hey, Synergy, what's your uptime? — this subroutine runs, and it figures out how long I've been up, the duration since the process started, and replies. So we got a message in this event, and we call reply on it. "You are the best." "I am guilty." I have actually stuck into the message the ability to reply directly to it.
There is some small amount of callback hell; that's maybe the last instance of it you'll see. So this is a reactor. You don't really need to know almost anything about asynchronous code, other than make sure you write async and await in the right places, and everything will work. So you could, at this point, install Synergy, connect it to something, and be happy. But we're going to keep talking. The one last thing I should talk about on this slide is $event. $event is the object that represents the message. I'm really sorry that I've called it both message and event, taking two useful names that mean the same thing and using them to mean the same thing, when I could have made them mean different things. I guess that's better. Here's what the event looks like. It has text: that's whatever the user typed. It has a channel it came from — so we said channels are how you connect to your chat network; that's the channel. It has a from address; if you are on IRC, that's like the channel again, sorry. It has the user it came from, if it came from a known user. And: was it said in a public channel or in DMs? Was it said at me? Like, did someone say, Synergy, what time is it? Or did someone just say, what time is it? Because you don't want the bot to respond to everything, and send a reply to a reply to a reply. But this time it was an error. That's it. So this is basically the stuff a normal reactor does. So now you know, right? Channels, reactors, and you've seen a specific reactor. Great. Now you know how to handle events: you get an event object, you call reply on it, and you do whatever you want in that sub. Where do they come from? They come from channels. I'm going to talk about how channels work and how you can make one, but the short answer is: don't. There's a Slack channel. You might remember from the top of this talk that we needed a Slack channel, and that's why we wrote this whole stupid thing. Synergy's not stupid. Synergy's great.
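Put together as code, an event object with the attributes just described might look roughly like this. This is a guess at the shape, assuming Moose; the attribute names are paraphrased from the talk, not copied from Synergy's actual source.

```perl
package My::Event;
use Moose;

# A sketch of the event/message object described in the talk.
has text         => (is => 'ro', isa => 'Str', required => 1);  # whatever the user typed
has from_channel => (is => 'ro', required => 1);                # the channel it arrived on
has from_address => (is => 'ro', isa => 'Str');                 # e.g. the IRC channel
has from_user    => (is => 'ro');                               # undef unless a known user
has is_public    => (is => 'ro', isa => 'Bool', default => 0);  # public channel or DM?
has was_targeted => (is => 'ro', isa => 'Bool', default => 0);  # "synergy, ..." or bare text?

# The event carries the ability to reply directly to it:
# the reply goes back out via the channel it came in on.
sub reply {
    my ($self, $text) = @_;
    $self->from_channel->send_message($self->from_address, $text);
}

no Moose;
__PACKAGE__->meta->make_immutable;
1;
```

A reactor then only ever sees `$event`, calls `$event->reply(...)`, and never has to know which chat network is on the other end.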
There's a Discord channel, because I don't do my personal chatting on Slack. There's an IRC channel, although it doesn't work; I'm probably going to try and bug Paul to get some help on it — it works for a while and then it falls over. There's a Twilio channel, so you can SMS with your bot. There's a console channel we'll talk about. And there's a test channel, because of course you can write automated tests for the thing. Channels are kind of a pain to write. This is the place where the complexity lives: those things you thought you could make not concurrent, you have to make concurrent, but that part's easy. At some point, though, connecting to a remote web service over WebSockets, handling different kinds of frames, dispatching, and reconnecting — that's complicated. So there's an irreducible complexity here. The good news is you won't need to write one, but I'm going to show you very roughly what it would look like. You'd have something like this. This is a stupid subroutine that, every five seconds, sends an event. What does it do? It makes an event object saying the user said "boop", and it tells the hub to handle that event. The hub is that box in the diagram with Synergy's face on it. It drops the event in there, and everything good happens: it goes to all the reactors. But to see how a channel really works, we're going to look at the console channel. The console channel is for working at the terminal. I'm sorry if you can't read this stuff; I did what I did. So here I'm going to run Pizzazz. Pizzazz is my local testing instance of Synergy. It just fires up Synergy with a bunch of reactors sitting in the console so I can test with it. I run it, I get my little "I've started up", and I say: uptime. That's the reactor we've all seen how it works. And Synergy replies and says, I've been online for one second. So that's it, right? This is how I use Synergy when I'm developing: I stick the reactors into the console and I test there.
Because if you've ever tried connecting a chat bot to Slack — you'd think that for a company that makes a chat product, they'd want to make it easy. But they do not. It is a real pain in the butt, and about every 18 months they change the way you connect a bot. Discord's much easier, and it's documented in the repository how to do it. Slack I haven't bothered. But if you look at the top of the screenshot, you see console channel online, console channel online, console channel online. Why are there multiple console channels? That's a great question; I'm glad you asked. Here's another reactor. This is the announce reactor. Back when we were on IRC, we didn't have our work chat on our phones, right? That was before Slack at all; we just didn't have it on our phones. But you might be at lunch, and lunch is running long, and you want to say, I'm late getting back. And there was a Twilio channel, right? So you text the bot and you say: announce I'm still eating. And then Synergy would receive this message on the Twilio channel. It would go to this reactor, and this reactor says, okay, I got the event — is it from the channel I want to send to, the to_channel_name? We'll come back to that. If it is, I say, like, what are you doing? You're telling me to announce something, but you're already in the announcement room. And otherwise, she'll look up the to_channel and send a message there, saying this. So when I would text the bot saying I'm still at lunch, the bot would post a message in IRC saying: Rik says he's still at lunch. And this all works because you can have multiple channels in your Synergy. This is one of the really keen things about writing asynchronous code: you can have lots and lots and lots of things in your process, and they all work. You can have lots of consoles that talk to each other. So here, in my testing environment, I've spun up several console channels.
Now only one of them is getting my input, because I can only type into one terminal at a time unless I want to do something really weird. And I've set up the announce plugin. And I can say announce — yeah, I was going to do a live demo of this, but I didn't, because I've got enough going on. What you see is: on the input-output terminal, Synergy says, great, I announced it, thank you. And on the announcement one, you see the message come in. So this testing environment is simulating multiple channels. I also have a purple channel, which you won't see in this deck, representing Twilio. So you can say, this should page somebody's phone with an emergency, and you'll get the page showing up there like, yeah, I would have sent a text message, you're good. So it's all nice and simple. The one thing you might be wondering is: what's up with to_channel_name? So in the world where that's IRC, to_channel_name here might be "private-irc-server" — just a string. And that's how you're going to go find the channel off the hub: which channel am I sending to? This one. But where did it come from? How is this set up, how is it configured? Well, remember, all the channels and all the reactors and everything else, they're Moose objects. So there's an attribute on the object called to_channel_name, and it's a string. Now, if that's all we did, we'd be a little screwed, because at some point someone would try announcing, and we'd realize we had a typo in there, and it would crash at run time. So also, when the reactor starts up — when Synergy is really booting up and connecting — she'll say, do I actually have a channel called that? And if not, crash. Crash early, everybody. But that's it. All the reactors work this way; they're all configured with attributes on the objects, which is what you want. That's just one more turtle, right? But where did it come from? This is the bottom turtle: it comes from a config file.
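The attribute-plus-boot-check pattern just described can be sketched like this. Moose is assumed, and the class, hub, and method names here are hypothetical stand-ins, not Synergy's real API.

```perl
package My::Reactor::Announce;
use Moose;

# Configured from the config file: the *name* of a channel, just a string.
has to_channel_name => (is => 'ro', isa => 'Str', required => 1);

# The hub that holds every channel and reactor.
has hub => (is => 'ro', required => 1, weak_ref => 1);

# Validate at boot rather than at run time: crash early, everybody.
sub start {
    my ($self) = @_;
    $self->hub->channel_named($self->to_channel_name)
        or die sprintf qq{no channel named "%s"\n}, $self->to_channel_name;
}

no Moose;
1;
```

A typo in the config then kills the bot at startup, instead of surprising somebody the first time they try to announce.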
So you've got a config file where you list all the plugins that you want — all the reactors, all the channels — and all their properties. And somewhere in here at the top you'll see the announce reactor, and it says: here's the address that I send to, and here's the channel on which I will send to that address. And then you'll see all these other reactors that are configured just the same way. The clocks reactor: which time zones do I care about? Melbourne and New York. There's a dc reactor that you can use to run dc calculator programs. I didn't write that. Okay. So now we've written a channel, we know how channels work, and we know that all the stuff comes from configuration. That's great. Now we're going to talk about Linear. Linear is not part of Synergy. If any of you don't know about it, Linear is a bug tracker — it's a work tracking system we use for running our scrums and stuff. It's really, really good. I like it a lot, and I'll tell you all about it whenever you want. But what you do need to know is that Linear, like a lot of web services, does webhooks. So you can say: something happened to one of my issues, and a POST gets sent to wherever you want, saying a thing happened to one of your issues, and you can respond. This is great for, like — I track a calendar, right? And if somebody moves an event on the calendar, I get a POST telling me this thing's been rescheduled; consider whether your whole day has just been upended. Webhooks are great. And Linear uses them, and we want to react to them. One of the things we use them for is escalation. Escalation inside Fastmail basically means: a customer made a ticket, and the support team — who are great — don't really know what's supposed to happen next. They escalate by taking the ticket and saying, escalate it. They put a flag on it, and it goes to the developers. And when we do that, we want to do something like this, right?
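As a rough idea of what such a config might look like, here is a hypothetical JSON-ish sketch — the keys and class names are invented for illustration, not Synergy's actual config format.

```json
{
  "channels": {
    "slack":   { "class": "SlackChannel",   "api_token": "xoxb-REDACTED" },
    "console": { "class": "ConsoleChannel" }
  },
  "reactors": {
    "announce": {
      "class": "AnnounceReactor",
      "to_channel_name": "slack",
      "to_address": "#announcements"
    },
    "clocks": {
      "class": "ClockReactor",
      "time_zones": [ "Australia/Melbourne", "America/New_York" ]
    }
  }
}
```

Each entry becomes one object's constructor arguments, which is why every reactor can be configured the same way: its attributes are its configuration.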
Make a message that just says: this issue got escalated by so-and-so, and here's the link. And we send it to the escalation address, right? Which is #escalation in Fastmail Slack. And this is straightforward — I think if you've followed things so far, you follow this — except you might be wondering: where do you put this code? It's got to go someplace, so that's a good question. You're not going to put it in a command, like uptime, because there's no command to say, like, hey, check it, you got a webhook. That's not what a webhook's about. And it's not in a channel — remember when I had to tediously explain that channels are about chat messages, not just generic messages? So it's not in a channel. Where is this POST going to go? The answer is: it goes in a reactor. It doesn't need to be in a reactor; that's just where we happen to have put it. It's not because it's a reactor — it's because it's got this role called HTTPEndpoint. You say: in addition to reacting to chat messages, this thing is a web handler. And you say, I want to take the path linear-notification. So when you connect this thing up, /linear-notification will now be a path that you handle. And how do you handle it? Well, you've got some async sub that is a Plack handler, because if you're writing web stuff in Perl, you probably want to do it with Plack. And that's kind of it. I mean, look, there's a whole bunch of code here that's getting the thing, authenticating it, figuring out who's who. But this is basically it: this hunk of plugin — and anybody can write a plugin — requests a path from the web service and mounts a Plack application on there. And then at the end it says, like, yeah, return 200. So now, this HTTPEndpoint — how does that work? Synergy runs a web server.
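A minimal sketch of such an async PSGI handler might look like this. It assumes Moose and Future::AsyncAwait; the role is paraphrased as a plain attribute, the class name is a stand-in, and JSON::PP (core Perl) does the decoding — none of this is copied from Synergy's source.

```perl
package My::Reactor::LinearNotification;
use Moose;
use Future::AsyncAwait;
use JSON::PP qw(decode_json);

# Hypothetical stand-in for the HTTPEndpoint role's path registration.
has http_path => (is => 'ro', default => '/linear-notification');

# An async PSGI (Plack) app: takes the PSGI env hashref and
# returns the usual [ status, headers, body ] triple.
async sub http_app {
    my ($self, $env) = @_;

    my $raw     = do { local $/; readline $env->{'psgi.input'} };
    my $payload = decode_json($raw);

    # ... here the real thing would authenticate the webhook, figure out
    # who's who, and post a message to the escalation channel ...

    return [ 200, [ 'Content-Type' => 'text/plain' ], [ 'ok' ] ];
}

no Moose;
1;
```

Because the handler is async, a slow lookup inside it awaits instead of blocking, and the rest of the bot keeps running.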
You say: I want web service to be provided on this port. And all of the channels, all the reactors, every other thing that has an HTTP endpoint, mounts onto those paths; conflicts are detected at start time, and it crashes. Then, when a request comes in, Synergy dispatches to the right place, and because they're all asynchronous, they can all interact. And that's a really important point. This whole diagram — every reactor, every channel, every HTTP endpoint — they're all in one process. It's just one program that's running with everything loaded in it. And to share data, they share memory. There's no IPC, and this is a big win. Like, I don't want to say that IPC is bad, that IPC is the enemy, and I certainly don't want to say everybody should share memory to share information — these are big, broad claims. But we do have to talk about IPC sometimes. IPC solves problems, right? What does IPC mean, by the way? It's inter-process communication; it lets you have two processes talk to each other. But that's not the solution to a specific problem, right? It's not valuable per se. It's valuable because you have a problem that you could solve by having two processes talk to each other. And the question is: when does that problem arise, and when is IPC the right solution? Well, a good one is: if you have different parts of your system that scale differently, need different kinds of resources, need different access to things, maybe different processes are useful. You can scale up more workers — thank you — you can scale down workers; that might be really useful. Maybe you have security constraints: this process needs access to certain constrained resources, needs to have these namespaces, needs to talk to the kernel, whatever, and this part of the system doesn't. That's a good reason to have two processes.
And maybe you have to do work where multiple things need to be happening at once, and you have multiple processes to eliminate blocking — blocking that would cause your code to be sequential when it doesn't need to be sequential. This is, you know, where we often would have multiple programs running, or things forking. And it's fine. But remember that any time we add a solution to a new problem to our program, we're almost always adding more code. And when we're adding more code, we're deforming the program from that ideal platonic version, where we're like, well, if I could just write it, it would look like these eight lines — and then we go add all the code that solves all the problems we don't want to think about. What we always want to be doing as programmers is picking the changes that deform our platonic program as little as possible. A program is always a compromise between these things. Once upon a time, it was pretty clear — especially in languages like Perl, but kind of in a lot of programming — that if you had to eliminate blocking, the easiest, most effective thing to do was to have multiple processes. Fork is a great example: I need to be able to handle a lot of requests, I'm going to fork. Yeah, that makes sense. Forking's easy. It solves a lot of problems. And then later you have to introduce IPC, because that's how life goes. But, you know, that's what you're going to do. I don't think it's this clear-cut anymore. I think that at this point, we all need to be re-evaluating, when we want to eliminate blocking and have more communication between multiple concurrent operations, whether forking plus IPC is the answer to jump to in Perl anymore. I don't think it always is. I think it often is not the right answer anymore. And that's because of async/await. Async/await is really, really powerful, and it really moves the needle on which solutions you should be picking. It's not just a Perl thing, by the way.
Hopefully everybody here writes in other languages too — it's important to put your eggs in multiple baskets. You'll find this abstraction in a bunch of places. It's very good. Okay, one more thing. I've got a little time left. So, take a breath. Got quite a bit of time left, which is good. So, we've got channels and we've got reactors, and we understand those. And we've got these HTTP endpoints. And there's some other stuff we've got in here; maybe we'll even talk about more of it. But at some point, I thought, you know, it would be really cool to stick a telnet server inside of Synergy. So, we built a thing. It's not really telnet — telnet's actually a protocol, and it has all kinds of weird stuff in it, like control characters; don't learn it. It's a netcat server. So, there's a netcat server — call it a raw TCP stream server — built into Synergy, which is called the diagnostic uplink. So, here I am back at my terminal. I run my local development server with a diagnostic uplink available on localhost 4321, because I like those numbers. And when you telnet in, you get greeted with this: Welcome to Synergy. All right, you have connected to the diagnostic uplink. Would you like help? Of course I would; I don't know how to do anything. So, I say /help. It's like, here you go: you've got some diagnostic commands, you've got notifier commands — stuff for inspecting a running Synergy. Because when your critical chat bot is sitting there acting weird and you don't know why it's doing that, and you don't know what's happening — well, you can reboot it. And that's fine. Thank you. You hope that's going to be okay. You can, like, look at the logs — and I make a lot of logs, so that might help, but most people don't write logs, and that's not going to help. But another great answer is: yeah, just connect to the thing and ask it questions. So, you can say, like, tell me about your configuration.
I'm running a web service here; here's this file. You can say — I don't show it here — show me all the endpoints that your web service listens to, so I can see all those. You can say: show me all of the notifiers currently connected to the event loop. And it's going to show you all these things that are going on. They get names as they're generated, so you can see things like: yeah, there are 47 open web requests all talking to GitLab — well, that's probably a problem. Really useful. You can also get this guy. This is so good: eval. You can connect to the diagnostic uplink and instruct my Perl program to evaluate a string of Perl code in the context of the running bot. So, here I am saying: hey, bot, tell me your name. I'm Synergy. Great. What's your refaddr in memory? Here you go. These are stupid examples — you never need to know the refaddr of the bot. But you can do things like connect to the bot and instruct it to change its configuration as it runs. You can connect to the bot and add and remove reactors. You can do anything that you can do with eval, as long as you're happy typing it into one line, because I have not implemented multi-line input. It wouldn't be that hard; I'm super lazy. Okay. That's everything I planned to talk about. We have a couple of minutes left; I'm happy to take questions. Yes. "This might sound confrontational, but it is not. Well, actually, it does. I'm an Elixir developer. The code you showed looks very much like how you would actually write it in an Erlang-family language. So, why actually use Perl for use cases like this?" No, it's a great question. "Well, I would say maybe the async stuff, like the tasks, could be Perl, but the framework..." No, it's a great question. The question is: why do this in Perl, when Erlang or Elixir is a much better language for it? And I'm not trying to make you put any tone into that at all. That's the question.
I think it's a good question, and the answer is a boring answer. The original version was written in Perl, and all the little handlers were written in Perl. What was the easy thing to do? Keep it in Perl. I also really like Erlang, and I really like Elixir, and I think they're really well suited for this. In fact, in a lot of ways we didn't talk about — like, any one of those reactors crashing has to be handled by the hub saying, oh, an exception happened, don't worry, I'm going to catch it and recover. And if a channel crashed, you have to figure out reinserting the new instance of the channel into the hub, and what about its pending messages? That stuff's all solved, right, on the BEAM languages. But we wrote it in Perl because we write Perl. And I think that if I had said, guess what, everybody, we're rewriting the bot in one week and we're doing it with OTP — we would not have written that bot, and nobody would have bought me a beer that night. Yes, in the back. "You said that async/await is much better than callback hell, and you also said that the event loop is kind of callback hell but with upgrades. So can you expand a bit on how async/await is better than callback hell? You mentioned that there may be a definite difference in debugging, but anything other than that?" Yes, sure. So the question is: how is it the case that using async/await is practically better than callback hell? Larry Wall says that you can never eliminate the complexity in your program; you can only move it around. You can move around the lump under the carpet, but the dust is all still there. And my view is often that what you want to do is take the things that are complicated and obnoxious and pack them into an infinitely dense ball that lives at the center of your program, and everything else is beautiful and living on the outside. I've got one minute, so this is maybe my final concluding remark.
You want to put all the complexity deep, deep down in the middle and have everything else be simpler and built on that. Callback hell makes the programmer writing the application think about the complexity. And async/await makes Paul think about the complexity — it makes one person cope with it. And I think that is why it's practically superior. Just curious: how many in this room have used Future::AsyncAwait? Yeah — who else has used async/await? Six, seven people? Yeah. It's very good. It's got problems, but mostly they don't come up, and I use it every day because mostly they don't come up. Okay, if you want to run Synergy, that's the URL. It's really good. Don't expect to get technical support; I'm going to change stuff whenever I feel like it. That's it. Thank you very much. Thank you.
The CPAN Security Working Group
It's all right. Am I early, or on time? I'm on time. I'm punctual. That's brilliant. So, hello. My name is Salve Nilsen. I'm one of the fellows that hack around with Perl in Oslo, Norway. And last year, I bumped, with some other people, into thinking about security on CPAN. So, stuff happened, and I'm going to tell you about that now. This is a little bit of an introduction for FOSDEM — similar talks have been given at other conferences already — and a little bit of an update. So, I hope you can bear with me. We were established at the Perl Toolchain Summit in Lyon last year. And the purpose here is basically to fill a void around caring about CPAN security. There are already people who care about security in the Perl community; mostly they live on the perl5-porters list. But when it comes to the CPAN ecosystem, a couple of us raised our hands and said, okay, we'll try and do something about that. These are the people that showed up at the Perl Toolchain Summit, and a bunch of these are also on the CPAN Security Working Group. So, what's in scope for this working group? There are a lot of people who are interested in the security of Perl, so we try to do security outreach — that means information work. It's maybe not obvious that's needed, because of course everybody knows how to Google and figure out something, but we try to think a little bit about how to do the things that are connected to the security of Perl. That includes making sure that important security issues are properly registered as CVEs, and that if anything shows up in the CVE index, it is responded to in a good way. And we're not solving the problems — we're helping the people who are involved. For a project that doesn't have a responsive author, for example, we'll make a little bit of an effort to try to find a replacement, or solve it that way. This is basically what happened with Spreadsheet::ParseExcel and Spreadsheet::ParseXLSX.
And we are super happy somebody stepped up and actually resolved those issues. We also do some coordination with other CERTs through the cert.org VINCE interface. We are trying to build up a network so we can make sure to report things properly, share the information we have, and help those people who need help. And there is some triaging and coordination going on there. The goal here is to make sure that important vulnerability issues are not ignored. So, that's one of the major topics we're working on. We also care about having a good vulnerability index. There are, I think, one or two options right now. This one, CPAN::Audit, I think, has something going on there which is useful. But it needs to be up to date, and we want to help with that, and maybe see if we can integrate it with other indexes out there. Furthermore — let's see what's going on here. That was not the point. Okay, the screen is saying hello. Sorry for the technical problems here. It looks like my computer doesn't like the USB-C connection for a moment. Sorry about that. Okay, let's throw it out and put it in again. That's always how it works. There, sweet. Yeah, it managed to fix itself — or it's the old computer, I'm just saying. So, yes, vulnerability index. We also care about what's called supply chain provenance, which is basically where the stuff comes from and how it became the way it is — and, in general, supply chain security. Things that we are working on there — look here, it's already disappearing; this is a bit annoying; I'll try to continue. We want to make an effort to make sure that all the CPAN clients use HTTPS by default, for example, so we connect securely to the servers that we want to download from. We want to make use of something called The Update Framework, which is used by other packaging ecosystems for securing the whole process of publishing and sharing the modules out there.
We want to introduce repository signatures and author signatures at some point. Moving on — come on. It looks like I'm having more trouble than is necessary here. This is quite annoying. No. No. No. All right. So we are also looking at — oh, this is the wrong page. Interesting. We're also looking at tracking all the changes that happen to the software — look here — using SBOMs, software bills of materials. That's a huge topic, and there are demands from downstream: people running software on critical infrastructure, for example, are now obliged by law to keep track of dependencies and what's going on. This whole field also includes solving the problem of how to refer to dependencies across package ecosystems. For that, there's something called package URL, which is currently in use by a lot of systems out there, and by SBOM standards, to refer between two packages in different ecosystems. If all goes well, we'll actually have CPAN as part of the package URL standard sometime this weekend, I'm hoping — I talked with the author yesterday at the party, at the conference here in Brussels. And we want to improve the indices in general when it comes to interoperability with other package indexes. Let's see. Since we don't have slides here, this is really annoying. I'm sorry this doesn't work as expected. Does anybody have a USB-C to HDMI connector? No, no — I need a female HDMI. Ah, okay. Let's see if this helps. Crossing fingers. Because if it doesn't get better now, then it's not my computer. All right. There's something called transparency logs. There is some tooling called Sigstore and Sigsum that we want to take inspiration from, to create transparency around what changes happen on CPAN. So if something is updated without anyone knowing, we want to detect stuff like that.
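To make the cross-ecosystem reference problem concrete, package URLs are short identifier strings like the following. These examples are illustrative only — in particular, the exact shape of the cpan type was still being finalized at the time of the talk, so check the published purl specification.

```text
pkg:cpan/DROLSKY/DateTime@1.65      # a CPAN release, keyed by author and distribution
pkg:deb/debian/libdatetime-perl     # roughly the same code, as Debian packages it
pkg:pypi/requests@2.31.0            # a different ecosystem, same notation
```

The point is that one machine-readable scheme can name a dependency no matter which packaging ecosystem it lives in, which is what SBOM tooling needs.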
We also would love to have a way to do patching of CPAN distributions when an upstream author is completely unresponsive and we have no way of resolving a crisis quickly — to publish a patch in a structured way so that, say, a client can detect: oh, there's a patch that is not applied here, do we want to download it? Something like that. We'll see how that works; it's a current dream we're having. We do care about compliance and privacy — having an idea of what kind of legislation is relevant for us. That's super important, and documenting that stuff is part of it; we have a reading list already published. We also want to have good tooling for software composition analysis: ways to detect if some of your dependencies have gotten a vulnerability, so that, for example, during a test run you see — oops, there was something happening, one of the dependencies you need to update. There are lots of good ways to do that, and there's already some tooling in place, actually, but this is what we want to do. There's also project management itself, so we're taking care of that part: creating a good charter, having a pre-release disclosure agreement that tells us under what terms we can share information or not, and documenting general information about how things are put together as an organization and which place we play in the larger ecosystem. Funding is also an important part of this, because I have to be frank here for a moment: working on security issues on behalf of others on a volunteer basis isn't always fun. Sometimes it can be, like, horribly boring, or frustrating, or just solving problems that I don't have. I imagine this is the same for everyone. So we're also looking for ways to actually fund some of the work that we want to do. And there's a whole lot of other stuff we want to do.
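As an example of the software-composition-analysis tooling that already exists, the CPAN::Audit distribution ships a `cpan-audit` command that checks dependencies against a database of known advisories. This is a hypothetical session from memory of the tool, not from the talk — check the module's documentation for the current subcommands.

```shell
cpanm CPAN::Audit     # install the tool from CPAN
cpan-audit installed  # scan every installed distribution for known advisories
cpan-audit deps .     # scan the declared dependencies of the current project
```

Wiring a check like this into a test or CI run is one way to get the "oops, one of your dependencies needs updating" signal mentioned above.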
And the most important thing for us is that while Perl isn't the super big thing it was 20 years ago, it's still used everywhere: in critical infrastructure and in important businesses where money is being earned right now. People call these legacy systems these days, but we have to remember that legacy also means earning money. So we cannot just ignore it and say "I'll rewrite stuff later" or "we'll just update eventually." No, we need to update stuff now, and we need to figure out exactly what's running, and to make that happen we need to enable a whole lot of things using the stuff I already mentioned. There are also some cultural things worth mentioning. In the CPAN and Perl community, we don't always think actively about security, so we're hoping to be a little bit of a catalyst to change the culture over time. That means learning new stuff: not only doing DevOps, but thinking about how to become DevSecOps, or whatever it's called, so that security becomes part of how we operate. In my opinion we're also pretty good at having our own ecosystem where things have worked for a long time, where we know we can trust it and it's been very predictable. But we're not that good at interoperating across ecosystem boundaries. Say, for example, you package something in Debian: from Debian's perspective it's "what do we have to do to make whatever these guys are doing work in our environment," when we could have used good standards for communicating dependencies in a machine-readable, common way that works across all kinds of ecosystems. That's a super interesting problem that people are working on right now, and I personally hope we can be part of that. So why do we do this? There are new security demands coming from the EU and from an executive order in the US.
These are specifically aimed at institutions and companies that write software for critical infrastructure, and that could be anything from power, internet access, street-light management, water treatment plants, administrative systems; all kinds of places throughout society where, if something breaks, it affects the normal operation of society in a negative manner. That means these two directives apply. The Cyber Resilience Act, which is still upcoming, is more about internet-connected devices, which basically means anything from toys to phones, plus all the systems that connect to and update those. So that means everything; we will be affected. These laws are coming this year and will be rolled in over the next few years; I think it's 18 months or something. So this is upcoming stuff. That means we have the legislative guns pointed at us, basically. We would also love to find ways to show that those of us who publish things on CPAN have our ducks in a row: we have things in order, people can trust the code we publish, and we do what's necessary to make that happen. So there's some awareness raising; we're discussing blog posts and all kinds of other ways to get more people involved in this. Who are we? Breno, Graham, Inge, José, Andreas, Leon, Olaf (sitting there), Pete, Renée, Sam, Salve (me), Stig (sitting there), Tim; Merijn isn't here today. And a whole lot of others. These are a couple of the people who were at the Perl Toolchain Summit; I'm in there, and it's a photo of me where I don't look horrible, which is good. That's Stig and Inge and Leon and Merijn and Breno. And the reason I list all the names here is to make a point: when somebody talks with you about supply chain security, it's people like these, the people in the group picture, who are actually working on the supply chain, on the bits and pieces that make it up. On a volunteer basis. Meaning: humans. It's not a black box where stuff suddenly appears.
We have to actually think of these people almost as our open-source colleagues; we work together with them. So what I want to do here is ask you to join us. Do you care about open source security? Do you have some spare cycles, some time to spend? Do you have a manager who is aware that there's a shared security commons out there that needs to be updated and kept alive and kept healthy? Would you like to fix security yourself? Please contact us; we need help. We are a bunch of volunteers right now, but we do not have all the time needed, and at the moment we don't have the funding either. So there's that. To find us: there's a link here, and you can find everything necessary on security.metacpan.org. The mailing list where we coordinate is the cpan-security list; it's closed off, but with a little bit of dancing and singing you can get in. So, I don't know, we probably don't have time for questions and comments. Two minutes? Two questions then. Yes. From the audience: three very short remarks. First, I'd love to see a module creating SBOMs natively. Yes, we're working on that; if you want to help, talk with me. Second, I'd like to have FIDO2 passkey support in any of the big frameworks we have in Perl, Mojolicious or Dancer2. We won't do anything on that ourselves, but if you want to publish something, go ahead; I've been looking into that a little bit. Third: who in this room has a VINCE account? I have one, and I like it very much, but please make yourselves one. VINCE is a vulnerability coordination system that CERT/CC runs; a couple of us have it already. So if you care about security enough to have an account there, you're welcome to join us; that's a very good criterion. But of course, please actually help. We have a lot of people who are bystanders, just looking on.
There's something called the bystander effect, where lots of people watch an accident, each waiting for someone else to make the first move. We cannot have that; we need people who actually want to make things happen. Having a VINCE account is maybe not enough; you have to put yourself forward and say: hey, we'll take on this problem. There's a whole lot of stuff to do. More questions? One question. Well, you'd get a different answer from everyone, but for me it's that we need more people who are actively working at the moment. We have a whole lot of planned stuff, and all of it is good things; I've tried to paint a picture of where we are today. If something tickles your brain, you're very welcome to join us and make something happen. And if you know something we don't, please tell us; we're in the process of learning. I'm getting the signal that this is the end, so I will say thank you. I hope this was useful for you, and please get in touch if you care about security on CPAN.
openQA - How do you test a testing software?
Let's move on to the next talk. It's always interesting to see the people here in this group; a lot of people I've seen before, and a lot of people I've seen giving talks before. Somehow it never happened that Tina gave a talk here; she's been here before, but not as a speaker. (Tina:) First of all, thank you for the great welcome. Hello, hello, can everyone hear me? Is the microphone on? No? Okay, alright. I'll try my best, but remind me if I'm getting too quiet. So, I'm going to do a talk about openQA. Who of you has heard of openQA? Okay, a couple. And are you using openQA? Okay. I've been doing Perl since 1998, and I'm now working as an engineer at SUSE Software Solutions, in the tools team, where we develop openQA, and it's written in Perl. Just to give you a short demo: with openQA you can test the installation of an operating system, you can start applications, you can test pretty much everything you want. Here you see the installation process of openSUSE Tumbleweed; it's not real time, it's fast-forwarded a bit. But yeah, here it gets more boring. Okay. It's not only used by openSUSE, but also by Fedora and Debian, and actually more; I think AlmaLinux is using it. In this talk I'm going to demonstrate the web UI a little, show some relevant test API functions and the project structure, and show how we deploy it and how we actually develop and test it. Okay, I think I'm going to sit down for this. Is this readable? Here you can see all our tests, and they are grouped into so-called builds. Here we have Tumbleweed, on aarch64 and PowerPC. We can click on a build and see all the tests of that build. There are three main states of a test: passed, failed, or soft-failed. Soft-failed means: okay, we know about this bug, it's not critical for the release, but we mark it so we look at it later. Let's look at some actual tests.
So this is a Tumbleweed DVD installation. You see all these boxes; most of them are screenshots, but they can also be informational things. We can move through those screenshots, and here we see a screenshot of the installation where you have to choose a time zone. We call these reference screenshots needles: a needle is something that we want to match against. Here at the top it says 99% matching; that means the screenshot we got matches our expectations to 99%. And why is that? We have this bar here: on the left side you can see the actual needle, what we expect, and on the right side the actual screenshot. You can see that the font has changed a little bit, and we don't care much about that; it's still okay. That's why we set a threshold of ninety-something percent, and it still matches, and that's fine. Here's another needle, and here you can see that the upper area of the picture is what we want to match. You can also see this gray area where the penguin is supposed to run around. It's gray because we can go into the so-called needle editor, so we can actually live-edit such a needle. Here, the highlighted screen area is what we want to match. There are also some red areas (and I'm sorry, it's not colorblind-friendly yet): you can use such red areas to exclude certain regions, because we don't know where the penguin will be at the time of the screenshot, so we just exclude that. And then you can also review the JSON. A needle actually consists of a picture and a JSON file that says which areas should be matched. Here's another needle, showing the desktop runner, and this one demonstrates another purpose of a needle: we don't only want to make sure that we get what we expect, we also need needles to know when to proceed in a test. Say I'm in a test and I tell openQA to send the shortcut for the desktop runner.
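The needle JSON just mentioned looks roughly like this; treat the exact key names as a sketch from memory rather than a schema reference. Each area is either something to match (optionally with its own match threshold) or something to exclude, like the region where the penguin wanders:

```json
{
  "area": [
    { "x": 0,   "y": 0,   "width": 1024, "height": 300, "type": "match", "match": 96 },
    { "x": 300, "y": 500, "width": 200,  "height": 120, "type": "exclude" }
  ],
  "tags": [ "grub-bootmenu" ]
}
```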
If I then immediately type something, or tell it to type something, it wouldn't work, because it takes a moment until that pop-up is actually there. The easy way would be to just sleep one second, right? Or, to be sure, two seconds, or maybe rather five. And it can sometimes take even longer, because the worker the test runs on is running tests in parallel. So there is this function called assert_screen, and you can give it a timeout, for example 60. It will take a picture every second until it gets the picture we expect, and then it knows: okay, now I can type the command. Because if we always slept five seconds, the tests would take a long time. We can also look at the log files of a test, and the settings. And here we have all these job groups, so you can see what kind of stuff we are testing. We are actually testing openQA itself: here you can see a screenshot of openQA inside of openQA. We use our own software to test ourselves. Okay, so that was the demo. Here are some code examples. Here you can see an assert_script_run call, for example, which sends some text to the console to run and asserts that the exit code is zero. It also has a timeout. And this is the job group configuration; we use YAML for that. The YAML can be huge, so we are actually using the YAML merge key to avoid duplication. Okay, so far about the demonstration. These are the test API functions we have. The most relevant ones: you can send a key, and you can also send a key repeatedly until a needle matches, something like pressing a key until you get into the BIOS. There are screenshot-related functions, and there are mouse functions: mouse drag is a function, and click and double click. What we don't have yet is seeing the cursor, the pointer, moving. So if you want to use it for demonstrations, which is actually a very good use case for openQA.
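Put together, a minimal openQA test module using the functions just described might look roughly like this. The needle tags are invented for the example, and the exact base class varies by distribution, so read it as a sketch rather than copy-paste-ready code:

```perl
use base 'basetest';   # openQA test modules derive from a base test class
use strict;
use warnings;
use testapi;

sub run {
    send_key 'super';                     # open the desktop runner
    assert_screen 'desktop-runner', 60;   # poll up to 60 s until the needle matches
    type_string "xterm\n";                # safe to type now that the pop-up is there
    assert_screen 'xterm-started';
    assert_script_run 'uname -a', 30;     # run a command, assert exit status 0, 30 s timeout
}

1;
```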
You can just demo your software by writing a test and having a demo at the same time, but you don't see the mouse pointer moving yet. I have a proof-of-concept pull request for that, but it hasn't gotten in yet. And you can even write test modules in Python now, but that's boring for you, because you're in the Perl devroom. This is how it would look: we have all these functions like send_key and set_var available in a Python script. And okay, now to the project structure. It's split into two parts. os-autoinst is a name I don't like much, because it's hard to type and to pronounce; that's the actual code that runs the test, and it's the project everything started with. openQA is all the stuff around it: viewing the tests, configuration, worker scheduling, managing assets like ISO files and qcow files, the API and WebSockets. It was all started in 2009 by Bernhard Wiedemann, working at SUSE. Our code is using Mojolicious by now, as the HTTP user agent, for the web server, and for various classes. We're using DBIx::Class, and it's really helpful. We're now using subroutine signatures. In our tests we use Test::Warnings to make sure we don't get any unexpected warnings, and we're using Test::MockModule, Test::MockObject and Test::MockTime. For tidiness we use Perl::Critic and Perl::Tidy, and Devel::Cover of course. But we also have a lot of JavaScript, Python and shell code. os-autoinst, like I said, is the heart of the software. The main script is called isotovideo: it takes an ISO and makes a video. When you develop a test, you can actually run it directly if you have an ISO file and some vars set, and then you can start a VNC viewer to watch what's happening and also change things, for when your test is bad and you want to try out stuff. Our deployment is fully automated: we just merge pull requests, and with every new commit, the openSUSE Build Service will fetch the new commit.
Then we also do a separate update on the web UI regularly, and on the worker hosts. The necessary service restarts happen, and database changes are also applied automatically, thanks to the DBIx::Class deployment feature. The openSUSE Build Service is used for all openSUSE packages; it can build RPM and also other package formats. Here we have all our packages related to openQA. And how about testing? In openQA we have 98% code coverage, and for os-autoinst we have 95%. So how did we achieve such a high test coverage? We cheated. Well, at least we do cheat a little bit, if you look at this: there's a feature of Devel::Cover that lets you add a comment, "uncoverable statement", to exclude a statement from coverage. We have a couple of those, and most of them are actually in the test directory. Compared to 37,000 lines, I think that's okay. Here's the coverage trend we get from Codecov. Our general tests are under t/, and then we have API tests and UI tests. We are using Selenium currently, but we are considering changing to Playwright. And the tests themselves are actually included in the coverage. We also use openQA to test openQA; I showed you that. Some of our tests fork, and ideally everything would be turned into a unit test where I don't need to fork, but Devel::Cover is able to handle it: if you add one line at the end of the forked process, the coverage of the fork will also be collected. Codecov will complain if a pull request adds uncovered code. It will also complain if the percentage goes below a certain threshold, and some directories are already marked as fully covered, so if any line there goes uncovered, it will also complain. And since we are using the Mergify bot, nothing can get merged if it fails any of those checks. You need two approvals and no failing tests, and then it gets merged automatically. That's working quite well.
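The two Devel::Cover tricks mentioned here look like this in practice. The surrounding function and variable names are invented for illustration; the "# uncoverable statement" directive and Devel::Cover::report() are real Devel::Cover features:

```perl
# 1) Excluding a statement that is hard to reach in tests from coverage:
sub stop_worker {
    my ($self) = @_;
    return unless $self->{pid};
    # uncoverable statement
    kill 'KILL', $self->{pid};   # only reached if the worker refuses to terminate
}

# 2) Collecting coverage from a forked child before it exits:
my $pid = fork // die "fork failed: $!";
if ($pid) {
    waitpid $pid, 0;                                         # parent waits for the child
}
else {
    run_child_work();                                        # hypothetical helper
    Devel::Cover::report() if Devel::Cover->can('report');   # flush the child's coverage data
    exit 0;
}
```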
But checking Codecov might not be enough. Having 100% code coverage doesn't guarantee you anything (well, a little bit), so pull request authors are encouraged to add new tests with every pull request, and writing tests is seen as part of every ticket we work on. Refactoring is also encouraged, and for every regression we encourage people to think about what we could do to prevent similar things in the future. And yeah, I showed you that already. Okay, I don't know how much time we have for questions, but that's it from me. Thank you. Any questions? One minute. Okay, no questions? Then, alright. Thank you.
Corinna—Perl's new object-oriented system
Ah, good. So if you're on YouTube, you probably just missed the first five minutes of this. I said nothing; don't worry about it. So I decided, rather than do what I had done previously, I'm just going to give an overview of all the major features of Corinna for the minimum viable product that we're putting together, so you can have a fairly complete idea in your mind of what's going to happen. I actually haven't done that talk before, and you probably don't want to go and read a multi-section RFC and all the work we did to put that together. Since Perl 5, the object-oriented syntax has basically been bless and @ISA. There's a little bit more than that, but this is primarily the bulk of it. The model was mostly stolen from Python, and since I also do Python programming, I can see the similarities. Larry regrets stealing it from Python, and I can understand why, even though I like Python. All that bless and @ISA do is say: we have methods, and here is where to find those methods. I'm taking the short version of this, because we're not going to spend a lot of time talking about the original version of object-oriented programming in Perl, because it didn't give you much. Basically, if you wanted everything that you want out of a class-based OO system, you had to write your own constructors. You've got a DESTROY method in Perl, but destruction is non-deterministic, so that's kind of a frustration; it doesn't work as well as you'd like. If you want to maintain state, if you want encapsulation, all the sorts of things that you expect to have out of an out-of-the-box OO system, you don't have with bless and @ISA. And everyone had to redo it themselves every single time, and if you're a programmer, you know you don't want to do that; you want to abstract it away. So people have abstracted it away, a lot. Depending on your definition of what a class is, or what support for a class is, there are well over 80 such modules. This is not an exhaustive list.
I just decided to order them alphabetically by link. Have fun picking out the one that you happen to like. If you're familiar with the Lisp Curse, or if you're not, go ask your favorite search engine for the Lisp Curse; it will be the top hit, and it will explain how that mess came about and what we're trying to fix. Let me make this a bit larger, because I can't read it. Okay, so not everything you see here is implemented, and not all of it is going to be implemented, but you do want to look at Object::Pad, which Paul Evans put together. That's a test bed for many of the ideas of Corinna, so we can make sure it actually does what we want it to do, and there are companies using it in production; it is very valuable to them. So some of the things you see might change; it's work in progress, but I've tried to strip out anything really problematic, and I'll call out the things which are still work in progress. This is pretty close to what we can expect. A simple class. It's very simple, it's not exciting: you create a new Person, name is Ovid, you print the name, Ovid. Then you give them a title and print the name, and it automatically prepends the title, so there's Dr. McCoy. Very simple; this is not complex. On the left-hand side, that's how you would do it using bless in old-style Perl. Here's how you do it in Corinna. Note that almost all of this is very declarative in nature. You might quibble on one point, and we'll come back to that later, but it's very short, very concise. And you probably didn't notice this in the bless version: a misspelled hash key. That means your code is not going to work correctly because you misspelled the name, and it's not even going to give you a warning; it's just going to fail silently. The sort of bug we love to have, silent failures in code. In Corinna, because the field $title is a lexical variable, misspelling it is a compile-time error. That's Moose, by the way.
Moose didn't gain us a lot here. Well, not true: it does have the isa checks, isa => Str and so on, for those various fields. You could argue that a non-empty string type might be better for one of them; we could argue about that all day long. But basically, Moose is not more terse, and it also has a lot of startup overhead. It's not slow per se anymore, but it's not the fastest thing in the world. It does make writing OO code better, though. In Corinna: same thing, much more terse, with the exception of the isa checks. So let's walk through this so you can understand what's going on. To declare a class, you just say "class Person". It used to be that you couldn't really declare a class: you would say "package Person", and then you would bless a reference into that package, and it wasn't really a class or a package, it was kind of this thing. Now they can be separate, and we have a future where we can truly disambiguate these things. I might add, you can also do it with the postfix block syntax; I prefer this syntax, and I will have it on the slides. I argued strongly, as the lead designer (I thought I could get away with this), that we were going to require the postfix syntax. I lost that fight; basically almost everyone disagreed with me, so I went ahead and made it optional. A lot of my examples use the postfix syntax, but it's absolutely not required, so don't stress about it, because I know people gave me grief about that at first. "field $name :param": that is an instance attribute, or instance field, or instance slot, depending on the language you're coming from. It's just a piece of data tied to the instance after you construct it. Because it has :param, it is required in the constructor; you cannot not pass it, or else it will blow up. Same thing with "field $title", except it has "= undef": that means it is optional in the constructor, and you do not need to pass it in. Or you could use = 'Mrs.' or something.
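The class being walked through can be sketched like this. This is my reconstruction from the talk's description; it uses the class feature that shipped as experimental in Perl 5.38, which implements a subset of the Corinna design:

```perl
use v5.38;
use experimental 'class';   # Corinna-style OO, experimental since Perl 5.38

class Person {
    field $name  :param;          # :param means it is required in the constructor
    field $title :param = undef;  # the default makes this one optional

    method name {
        return defined $title ? "$title $name" : $name;
    }
}

my $person = Person->new( name => 'Ovid' );
say $person->name;                                  # Ovid

my $doctor = Person->new( name => 'McCoy', title => 'Dr.' );
say $doctor->name;                                  # Dr. McCoy
```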
You can give it a fake default title if you want to: anything after the equals sign is evaluated and assigned as the default value. And then we have our name method down there, where we just access those variables directly. This gives us the chance for a lot of performance benefits, and it also tremendously encapsulates the data, something which has traditionally been very, very hard to do with older Perl, because you could always reach inside the object and do stuff. Many languages make it easy to reach inside the object and do stuff. When we eventually get around to implementing a meta-object protocol, you will be able to reach inside the object and do stuff, but we're going to make it harder. The intent is that you will be allowed to do it, but when you're doing things you shouldn't do, you have to put some more effort in, so it's going to be easier to spot in code reviews or just with grep. Corinna out of the box provides constructors, destructors, state, composition, encapsulation, private methods, and so on. The private stuff might actually not make it into the MVP; we won't cover that. But basically, most of what you want out of a class-based OO system is there, in a very short declarative syntax. Just like that, very easy. But there's more than one way to do it. I mentioned this is mostly declarative. You see the method down there and you're thinking: I don't have any way to change the name and title. Everything by default is pretty much immutable externally with Corinna, so I'm not mutating anything. So why am I even computing the name every time? I could just make it a field with a :reader, defaulting to: defined $title ? "$title $name" : $name. And that's computed once and only once, at object construction time. Fields are generally evaluated in the order they are declared, which makes it much easier to reason about. In Moose, I think it's evaluated alphabetically. No? Hash order. Hash order! Oh, sweet.
Thank you, Steven, for making me feel even worse about it. I've long wanted to submit a patch to see if I could fix that, but they've said no more patches, which is fine; I totally get why. So, because fields are constructed in the order they're declared, you now have the potential for deterministic destruction, because you can track that order and unwind them in last-in, first-out order. I don't know whether that will be in the MVP either. Okay, there are only four keywords, by the way: class, field, method, and role. We actually had a lot more originally, and then Damian Conway came along and did a deep dive into the spec. He pointed out a way we could reorganize everything with just those four keywords, class, field, method, and role, plus attributes to modify their behavior. That tremendously simplified the code and made the logic and the structure much easier to follow. Now, I apologize, this is a much bigger slide, probably harder for some of you in the back to read. "class Character :isa(Person)": that means we've inherited from Person. Corinna is single-inheritance only. You'll notice there are a number of OO languages out there which allow no inheritance, and some allow only single inheritance; they almost invariably give you a way to work around that, such as interfaces or mix-ins or something else. Or you can do it with delegation, and delegation is much more powerful than people think, but this is not a talk about that. So I've now declared this class, and you'll notice I have an underscore, _defense, for my reader; I don't have readers or writers for anything else. The reader means that you can call $target->_defense and read that value. There's this idea of trusted methods, where you want methods to be callable by other classes, but you don't want people outside to be able to call them. We have done a lot of bike-shedding on how to get there, and it's not going to happen anytime soon.
So for now, I punted and thought this is a reasonable compromise: we use a familiar Perl convention and name it _defense, with a leading underscore. Think of it as a trusted method or a private method; you can call it, and people outside know not to. Notice the only public methods we have are is_dead, adjust_hit_points, and attack, because you want your interfaces to be as small as possible; later on, if you have to change your interfaces, you're stuck if you've exposed everything publicly. So Corinna by default forces you to add the :reader and :writer attributes to fields, because you have to choose, you have to opt in to making your contract public. Whereas with Moose and Moo and others, the default is that everything is public, and if you want it private, too bad. And we have this constrain function. I'll talk more about subroutines being imported later, but basically constrain is a function. Again, this is something I don't think we're going to get to in the MVP: the intent is that methods and subroutines are not the same thing. You should not be able to call a subroutine as a method, and you should not be able to call a method as a subroutine, and you can disambiguate them even if they have the same name. Just something to think about for later work. So, we did our subclassing, and there's a little Darth Vader on the slide. We create a new Darth Vader object and a Captain Kirk object, and while not $kirk->is_dead, Vader beats him with his lightsaber until Kirk is dead. It's very simple, it's easy, it works; yes, Vader will kill Kirk. I'm sorry for doing that to Star Trek, mixing Star Trek and Star Wars, but in this case, yeah, Vader wins. Very simple, very easy, and when you get down to it, there's nothing really complicated about the code. It's simpler, it's easier to write, and it's well encapsulated. But I want to talk about constructors a little, so you understand some of the design work that we put in here.
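A sketch of the subclass on that slide, reconstructed from the description. :isa is the Corinna inheritance attribute; the named-reader form :reader(_defense) follows the Corinna/Object::Pad convention and, like the exact field names and numbers here, is my assumption rather than a quote from the slide:

```perl
class Character :isa(Person) {
    field $defense    :param :reader(_defense) = 10;   # underscore name signals "trusted, not public"
    field $hit_points :param = 50;

    method is_dead { $hit_points <= 0 }

    method adjust_hit_points ($delta) { $hit_points += $delta }

    method attack ($target) {
        # roll against the target's (trusted) defense value
        $target->adjust_hit_points(-1) if rand(20) >= $target->_defense;
    }
}

my $vader = Character->new( name => 'Vader', defense => 12 );
my $kirk  = Character->new( name => 'Kirk',  defense => 10 );
$vader->attack($kirk) until $kirk->is_dead;   # yes, Vader wins
```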
A lot of it we argued about; I think it took like three years of arguing to finally get to something we could agree on. So, we have key-value pairs, named arguments to the constructor: name, title, and offense. And it is absolutely required that you do it that way. You can create an alternate constructor if you want, call it new_unnamed, and delegate, but we do this for readability, and there are also some other benefits. So right now, here's a constructor in Java: Character vader = new Character(...). If you didn't know what those arguments were, it might not be clear what you're constructing. And in fact, if you've got optional data for your constructors, you have to create multiple constructors. I won't go into details, but in this particular example you might have to create multiple, multiple constructors, or use a HashMap and extract everything manually. It's a pain. In Corinna, you don't have to do that: you have a declarative specification at the top of your code saying here's how our instance data works. So, the manually written constructor in Java for a car is actually very readable. Calling it is not: I wrote this code, I just looked at it, and I don't remember what those numbers mean. That's why we try to avoid that. And in Perl, we have named arguments. Yes, you have to do a little bit more typing; this is for maintenance. You absolutely want to make it easier to maintain your code. And it's going to bite you a few times, and you're not going to be happy about it, but you'll get used to it, because it's going to become natural, I hope. So here, that's not the Character class, that's the Person class, and we've passed in offense. Offense is not defined as one of its :param fields, so that's going to die. And I've heard people argue: well, I should be able to pass in extra data; maybe my subclass will use it, or there's some other way I can handle it.
Yes, there are other ways you can handle it, like other languages do: provide something which actually captures that data properly. But the real reason is: remember, title is optional. So if I misspelled title, the class would think it's simply optional extra data. Because the rule is strict instead, you can't pass in anything which is not known to the constructor; that is going to be a fatal error, and that's a very hard-to-detect bug that you no longer have to worry about. If you want to pass in extra optional data, make a parameter called extra, "field $extra :param = {}", a hash ref, and then just let callers muddle around with that. It's much cleaner. Moose allows you to pass in a hash ref instead of a list. We do not do that; we want one way of calling the constructor, because it's just simpler. A list also preserves the ordering of the arguments, in case that becomes necessary in the future. Also, any duplicate key in a hash ref will collapse over a previous one, which is kind of annoying; there are ways to work around that if you actually want that behavior for setting defaults, but we decided the safest way to go was one and only one way of calling the constructor. Thank you. So, I didn't talk fast enough, apparently. Here: field $name in the parent and field $name in the child; both of those are lexically scoped, so there is no conflict anymore. With bless, if you had a {name} key in your hash ref, but your parent did too, you're going to blow up. Here, it's completely encapsulated until you expose it. Now, when you expose it: I have :param on each of them, I now have two name params, and that's going to blow up. You can't override params; we may revisit that later. You can override methods; sorry, methods automatically generated by :param or, sorry, by field and other things, I got ahead of myself, never mind. So I can do ":param(car_name)".
That means you now pass car_name to the constructor, and there's no longer a conflict with the parent class. Your parent and child classes should always be able to trust their internal implementation — always. It's when they hit an external interface that they're making a contract, and then they've got to negotiate and find out what works. Here's another example. Those are also going to blow up. That's the case where we're actually generating methods, and we cannot override those directly. You can create your own little stub method if you want to override it. Again, you can rename those in order to allow that to be safe. Class data: field $num_characters :common means this is class data. You can also slap :common on a method and call that a class method. ADJUST is called after the object is constructed — or actually it's called when it's hit. Sorry, Paul: is it called when it's hit, or after the object's constructed? ADJUST is run as part of the constructor, yeah. Okay. The destructor will run when the object is destroyed. So here I can track how many character instances I've created. It's very simple; it works naturally in the language. And then I have another class, my World class. I can figure out the difficulty of my world: I've got my class method available, I can figure out how many characters there are, and I can tell them how difficult the world is. Again, this is stuff which is now built into the language, and you don't have to worry about it anymore. Is there anyone here who does not know what roles are? Okay. Just in case: roles are kind of like mixins you'd find in Ruby, or interfaces with default implementations you'd find in other languages. They allow you to take a small subset of behavior which doesn't necessarily belong to a specific class and move it off into its own role. Then you can compose it into the class, and you get that behavior.
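Before moving on to roles: the class-data pattern just described — a shared :common counter plus a class method on a World class — can be sketched in Python (an analogy; the names are illustrative).

```python
# Sketch: shared "class data" like `field $num_characters :common`,
# incremented during construction (like ADJUST) and decremented on destroy.
class Character:
    num_characters = 0  # class data shared by all instances

    def __init__(self, name):
        self.name = name
        Character.num_characters += 1  # runs as part of construction

    def __del__(self):
        Character.num_characters -= 1  # destructor hook

class World:
    @classmethod
    def difficulty(cls):
        # a "class method": only shared data, no instance needed
        return "hard" if Character.num_characters > 2 else "easy"

a, b = Character("Leia"), Character("Han")
print(Character.num_characters, World.difficulty())  # 2 easy
```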
However, those methods are flattened into the class directly. There are no tricks with inheritance, no funky dispatch or anything like that — the methods are actually in the class. So method as_hashref; — that's what we call a forward declaration, because it doesn't have a body. Anything with a forward declaration is required to be implemented by whatever consumes the role. It can be implemented by the class directly or, if the class consumes other roles, those other roles might implement it. And then to_json — here's another example where we want to get to the point where we can disambiguate. This is probably a terrible example, because you don't want to confuse those. But the reality is you should be able to call those separately and have them work correctly, even though you probably shouldn't name them the same. It gets you some safety in the code and avoids the odd case where you called a subroutine as a method — and believe me, I've hit that before. And $self is injected directly into the method; you don't have to declare it in your signature. Along with $self, you also get a $class variable, which is the class name of the invocant. If you have the :common attribute, that means it's a shared method, which means $self will not be available, but $class will. And again, those will fail at compile time if you get them spelled wrong — which means if you declare something as a class method with :common and you try to access $self in there, that should be a compile-time failure. You don't want to use this code, but here, field $cache — once again, my implementation should be able to trust its internals. Nothing else actually gets to see the $cache that I have declared in my role. You don't want to use this, because it would only work if you could guarantee your objects are immutable, and you can't. So you actually probably don't want to cache those.
But this is one way you can have data inside the role which you don't share with others. And using a role is pretty simple. There's my Serializable role — this one just does JSON. My Character class isa Person, does Serializable. All I have to do is define an as_hashref method, and hopefully, when to_json is called up there, it will properly serialize into JSON. I did a lot of hand-waving there, but that's basically how it works. If you're familiar with roles, it's what you expect out of roles. So here are the various attributes we have. Class attributes: we have :isa and :does. :isa, again, is single inheritance — you can put one class in there. Okay, great, I've got plenty of time. :does, however, can take a comma-separated list of roles. If you're familiar with roles, there are ways you can exclude or alias methods. We don't actually provide that syntax here, because we argued too much about how to make that work, and we just punted on it. I apologize. Role attributes: a role simply does. Role Serializable does some other role, whatever. Maybe it does a YAML role, a JSON role, and a TOML role, and can serialize all those different things if it's given the right data structure. Quite possibly it cannot, but that's how roles work. Roles can consume other roles. And we do want to make sure we preserve commutative and associative behavior, so you can mix and match roles any way you want, in any order, in any combination, and it should work correctly — unlike with inheritance and mixins, where if you shuffle the order, you have no guarantee your code's gonna work anymore. Field attributes — this one's a little bit bigger. :reader, or you can rename your reader. :writer automatically prepends the name with set_, because we're disambiguating between reading and writing. There are reasons for that, dealing with return types and not being able to overload things properly.
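The Serializable role usage above can be approximated in Python with a mixin plus an abstract required method (an analogy — Python mixins use inheritance rather than Corinna's flattening, and the names are illustrative).

```python
# Sketch: a "role" that supplies to_json but forward-declares as_hashref,
# which the consuming class is required to implement.
import json
from abc import ABC, abstractmethod

class Serializable(ABC):
    @abstractmethod
    def as_hashref(self):
        ...  # forward declaration: no body, must be implemented

    def to_json(self):
        # behavior the role provides, built on the required method
        return json.dumps(self.as_hashref())

class Person(Serializable):
    def __init__(self, name, title=None):
        self.name, self.title = name, title

    def as_hashref(self):
        return {"name": self.name, "title": self.title}

p = Person("Ovid")
print(p.to_json())  # {"name": "Ovid", "title": null}
```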
And also wanting to discourage people from writing mutable objects, while making it easy for them to do so if they wish. But it's available there. :param — whether or not it's available in the constructor. :weak, to create a weak reference. :common means it's class data. Method attributes: do we override a parent method? If you want a method to be abstract in your parent class, just declare it as method method_name — do not use a signature and do not provide a method body — and it's automatically an abstract method. It must be overridden in a child class, or with luck it will be a compile-time error. :common, so you can have a class method which does not inject the $self variable. Around, before, and after are the standard method modifiers that you have. To be honest, I wish we had gone with something like — sorry, folks — Python decorators, because they're so much easier to use. But that would require changing how attributes get parsed and handled, because right now the data inside the arguments to an attribute is just a simple string; it can't be parsed or run effectively. There's some discussion — I think Paul has been handling some of that — about how to maybe change that in the future. Some of the things we have already written with just the very beginnings of Corinna: we have Stella, an actor model for Perl. An actor model basically means that if you have a box of toys, they know how to play with each other — you don't have to play with them yourself. That's the simple explanation. What's that? Okay, thank you. I'm very curious to see that. We also have a cooperative message-passing system: concurrency, event loops, actors, promises. That one looks like a lot of fun. That's also done by Stevan. You don't like that? Okay. These are some of the early prototypes we've been building with this. I used Corinna a lot. This is a roguelike tutorial that Chris Prather has been putting together. You've seen Rogue before, most of you.
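Since the talk mentions wishing Corinna's before/after method modifiers had been decorator-style, here is what that style actually looks like in Python (a generic sketch, not anything from Corinna itself).

```python
# Sketch: before/after "method modifiers" implemented as a decorator.
import functools

def with_hooks(before=None, after=None):
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if before:
                before()                    # runs before the original
            result = fn(*args, **kwargs)    # the wrapped method itself
            if after:
                after()                     # runs after the original
            return result
        return wrapper
    return deco

log = []

@with_hooks(before=lambda: log.append("before"),
            after=lambda: log.append("after"))
def dump():
    log.append("dumping")

dump()
print(log)  # ['before', 'dumping', 'after']
```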
And I've elided some of it, but it's basically parts one through six — he hasn't done more than that. What amazed me is I thought we would have to have much more of Corinna built for it to actually be useful. I was wrong. Even a very tiny, properly designed subset of a class-based system works very well and is very powerful. I was really surprised by that. It also might force you to use composition and delegation more often, which, trust me, is your friend. I won't go into it right now. And I'm sorry, that was very fast. It was an overview. It was probably one of my least exciting talks, but I wanted to have something that I can refer people to and say: look, here's a short overview — a video, if you want one instead of reading the RFC or something like that. The actual RFC is on GitHub, in the Corinna repository. I'll put the slides up on SlideShare. There are seven stages referred to in that MVP of what we're trying to implement — unknown timeline as to when it's going to be done. It's already much more powerful than I thought; really surprised by that. There's lots more to be done. If you want to see this happen, the single best thing I think you can do is download it, compile it, start playing around with it, send bug reports to Paul, give feedback, write tests for it, write documentation for it. We need that, because conceptually it's very small, but under the hood there's a lot of stuff which has to happen to make it work. Anything you can do to take some of that work off of Paul means we will get it out there faster. Does anyone have any questions? No — yes, sorry. Please speak up, by the way; I'm a bit hard of hearing. Yeah — you mentioned the override attribute. What happens if you have a base method and a derived-class method with the same name, without the override attribute?
Right now, I think that should be — sorry: what happens if in a subclass you're overriding a method which already exists in the parent and has a body? One thing: a parent class generally should not know who or what is subclassing it. It shouldn't have to know that, if at all possible, because that winds up coupling it too tightly with the subclass. So if we try to put any sort of annotation on the parent class saying "this is subclassable" — we might want to allow a :final attribute on something so it can't be overridden, but we had to get an MVP out there. So right now, if a method body is defined and you override it in a subclass, adding the override attribute is good. And I would like there to be a warning if you override something and you don't have the override attribute. If it's an abstract method and you don't override it, that's fatal. Or maybe, if you override without the override attribute, it should be fatal — but we can punt on that. Any other questions? Can roles have a method body? I'm sorry? Can roles have a method body? If it's a required method in the role, it cannot have a method body. There are ways you could work around that. You could create a default method which has a separate name from the required method. And inside of your methods — no, you'd still have to have the other method required. So — the yada-yada-yada operator? Oh, I forgot about that. Basically you make a method, and the body of the method is just dot-dot-dot, which is the yada-yada-yada operator, which was added — I don't know when — so it's been around forever. All it does is blow up when it's hit; it dies with no message. But it's very useful for this. Yeah, that might work. Any other questions? Or do we still have time? Two minutes.
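For comparison, Python also accepts a bare `...` as a method body, but it silently returns None rather than dying the way Perl's yada-yada stub does; to get the blow-up-when-hit behavior, you raise explicitly (a generic sketch).

```python
# Sketch: an abstract stub that dies when called, like a `...` body in Perl.
class Base:
    def required(self):
        raise NotImplementedError("required() must be overridden")

class Child(Base):
    def required(self):
        return "implemented"

print(Child().required())  # implemented
try:
    Base().required()      # hitting the stub blows up
except NotImplementedError:
    print("stub blew up")
```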
Not you. You were exporting subroutines — lexically exported. I've been using it, and it's been working quite well with Corinna; it doesn't seem to conflict. Lexically exporting subroutines, and then it removes the symbol. Yeah — so it's bound but not callable. Yeah, in the builtin package there's an export_lexically, right? You put that inside your import, and you can export things lexically, and then they're in an entirely different scope. Nice. Okay. I very much like that. I'll show you. Okay. Actually, talk to Paul, because he's the one who's going to be doing some of that. What's that? Wait 20 minutes and I'll be talking about it. Okay. One last question. Okay. Thank you very much. Thank you.
The Art of Concurrent Scripting with Raku
Alright, thank you — thank you, Theo, for all the organization. I want to thank my employer, Instacart, for sending me here, and thank you, everybody, for coming to this talk. Quick survey: how many people have written some code in Raku before? Oh, good. How many people use it kind of regularly? Okay. How many people write bash scripts? Okay, excellent. So my name is Brian Duggan. I'm a logistics engineer at Instacart — we do grocery delivery. I like to say that basically everything is a race condition for us. I'm also a Raku module author, and I like to write scripts in Raku. So this is a brief outline of the talk. First, some motivation: what am I talking about? I'm going to go over concurrency in Raku, just a basic overview. Then I'll show you some tricks for migrating stuff from bash to Raku — how things look in bash and how the same thing would look in Raku — and then how using some of the concurrency primitives can help your problem-solving abilities. Okay. So, a few words about scripting. I tried to enumerate some of the characteristics of what we call scripting — I'm really talking about shell scripting. When I'm writing a shell script, it's usually to solve something pretty quickly. I don't want very many dependencies. It should be easy to understand. And it should be pretty reliable — because has anybody had the experience of writing a script that lasts for several years when you thought it was going to last for a few minutes? Okay, yes. Good. All right, we are together. Okay. Another thing I've noticed about shell scripts is that they're supposed to be pretty simple. You know, you basically run some commands. Maybe you check their exit status. Maybe you have a little bit of control flow in your scripts.
If you're fancy, you might write a PID file and use the file system to do an atomic write-and-rename, so you get some guarantee that you don't have two copies of your script running at the same time. For the most part, though, they look like this: a standard procedural flow. You have some decisions; you go forward. If you're really fancy, you might use trap to capture some signals. You might try to time out some commands — has anybody used timeout, by the way? I just learned about it recently. You might have some progress indicators. You don't usually see things like async/await, or event loops, or message queues — and definitely not threads, definitely not mutexes or shared memory or anything like that. With scripts, we assume we don't really need real programming. We're just doing something simple; we want to get it done. There's this idea that the world is just not that complicated. I think in reality the world actually is that complicated. This is how I envision our scripts: on the right, we have this vision of a perfect linear world where things are well organized, running one after another. But in reality, the world is kind of a mess. As a wise man said earlier today, minimizing concurrent code leads to crappy programs. Let's talk about Raku. For a deeper dive into the implementation of concurrency in Raku, I recommend you watch Jonathan Worthington's talk from many years ago about parallelism, asynchrony, and concurrency in Raku. He gives some really good definitions of those three words. I'm just going to state the definitions without going into too much detail. Parallelism is the idea of choosing to do multiple things at once. Asynchrony is reacting to things that will happen. Concurrency is competition to access and mutate some shared resource. Raku has great support for all three.
Being a multi-paradigm language, it doesn't impose any particular strategy on you for dealing with concurrency. We had some conversations earlier today about Elixir, which has the actor model. There's a Go track, where you have a lot of threads running at the same time. Many languages have particular models of concurrency; Raku tries not to be too dogmatic and lets you do whatever you want. You have to deal with race conditions yourself. If you want to get started writing some Raku and experimenting with concurrency: instead of say hello and say world, you can just put the word start in front of say hello. What that does is schedule the execution of that statement on another thread. Congratulations — you've just made a race condition. The output of this program is not deterministic. You might get hello world; you might get world hello, if world runs before the second thread; or you might just get world, if the second thread doesn't get a chance to start before the program exits. You can experiment and find out for yourself why other languages impose those models — it's to manage things like this. The simplest fix, if you want to avoid this race condition, is to just add the word await. We heard earlier about async/await; there's no async here, there's just await: wait until the promise finishes before going on to the next statement. The documentation of concurrency in Raku breaks it down as follows. You have high-level APIs, you have low-level APIs, and there are also some other built-in event sources that are not mentioned on the concurrency page. The high-level APIs start with promises, which are what we just saw: you're scheduling some execution, and it's going to finish at some point in the future. You have channels, which are basically one-to-one message queues between different threads of execution.
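The start/await race described above maps directly onto threads in most languages; a minimal Python sketch of the same nondeterminism:

```python
# Sketch: `start say "hello"` as a thread; join() plays the role of await.
import threading

results = []

t = threading.Thread(target=lambda: results.append("hello"))
t.start()                 # scheduled on another thread: a race begins
results.append("world")   # may run before or after "hello"
t.join()                  # like await: both appends have now happened
print(sorted(results))    # ['hello', 'world'] - order during the race varied
```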
You have supplies, which are one-to-many message queues. Then you also have this nice thing called Proc::Async, which is a great way to deal with external processes. We're going to see more of that today, since this talk is mostly about scripting, where you're managing external processes. There are also low-level APIs. If you want to deal with threads, you can. If you want to deal with locks, then congratulations: you have access to the kernel's implementation of mutexes, which may even be hardware-backed. We also have atomic types and atomic operators; again, these are sometimes implemented at the hardware level. You can even use the scheduler — if you want to change the concurrency paradigm you're using, you can write another scheduler that implements your strategy for queuing different threads. I'm going to go through some of the different built-in event sources and do some very practical examples — things you might see in your scripts. File system changes: all these things are built into the core. TCP or UDP sockets. Of course, time ticking by provides a great event stream. IO::Pipe lets you watch Unix pipes and respond to incoming data. I'll also talk a little bit about parallel execution with race and hyper, and about phasers. Let's take a quick trip from bash to Raku. Easiest way to go: take your bash script, and in front of every line, put the word shell. Congratulations, you have now ported your bash script to Raku. Shell is built in, and even better than that, there is an entire sub-language for quoting. You don't have to deal with all the horribleness of trying to escape your quotes when you have subcommands that all have quotes for their different arguments. Probably the most interesting one is the one at the top, the two angle brackets. A lot of languages have a way of taking words separated by whitespace and turning them into an array.
That's what this does, except that anything in quotes becomes its own element. So echo followed by "starting database dump" in quotes makes an array with two elements: the first is echo, and the second is starting database dump. Those then get passed as an array to the shell command. There's some extra fancy stuff going on behind the scenes. Here's a little script that starts your database dump. One of the goals of Raku is that easy things should be easy and hard things should be possible, as Larry has said. Of course, you don't have to say shell echo; you can use the word say to print things to the screen. Say starting database dump, run your shell command. Here we have our first little glimpse of asynchrony, with say now minus INIT now. What's happening here is that INIT is a phaser, and it runs during the initialization phase of your Raku program. Anybody use Go here? Go programmers? Go has something similar: deferred execution. You might use it in a Go program to say: hey, when you're exiting this function, don't forget to finish this database transaction — something like that. There are a lot of phasers; basically, you can use them to schedule code. I don't know about you, but one of the most annoying things is having a script that starts doing something, and then you sit there waiting and watching and nothing is happening on the screen. I like to have at least a clock showing how much time has passed. For something like a database dump, if you do it a lot, then you'll know roughly how long it takes, and you could even turn it into a progress bar and estimate how much time something is going to take. We can do this easily with a supply. We make a supply: we call Supply.interval(1). This makes a clock that gives us a new value every second. Then we make a tap on the supply by saying my $timer = $clock.tap.
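The quote-aware word splitting just described (whitespace separates words, but quoted text stays one element) behaves much like Python's shlex.split:

```python
# Sketch: shell-style word splitting that respects quotes.
import shlex

args = shlex.split('echo "starting database dump"')
print(args)  # ['echo', 'starting database dump']
```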
The code inside the tap just says whatever its argument is — basically it'll say 1, 2, 3, 4, 5 — and that's running on a separate thread. So while pg_dump is running, you're seeing time go by. That's really nice. There are some other nice built-ins to make it even a little prettier. Since we want a script that doesn't have a lot of external dependencies, you can use the very clever polymod — one of my favorite methods — which basically does a sequence of mod and div operations and turns your seconds into minutes and seconds. Then you can format it, and you have a nice little clock. Then you might say: hey, that's great, I want to do this on all my shell commands. I always want to see a little clock. Well, guess what — in Python they're called decorators; in Raku, it's the wrap method. Basically we can say shell.wrap, and this will start our timer before calling callsame to run the original shell command; then it closes the timer and says done. Now you've got your nice script — the one where you just copied and pasted your bash and added shell to the beginning of each line — and all of your commands have a little clock saying how long they're taking. Let's talk quickly about timeouts. This is the timeout command in bash. You give it an argument that's the number of seconds. Timeout 1, and in this example we're doing a DNS call, looking up example.com. So we say timeout 1 host example.com, and if it fails, the exit code is nonzero; otherwise we say DNS seems okay. The way we do that in Raku is with a promise — actually a couple of promises: one that expires after a second, and another that does whatever you're doing. Then you make a third promise which resolves when either of those two finishes first. So we await Promise.anyof, and then start shell host command, and start sleep 1. We do those two things in separate threads.
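Raku's polymod, as used above to turn elapsed seconds into a clock display, corresponds to repeated divmod in Python (a small sketch):

```python
# Sketch: seconds -> MM:SS, one divmod step of a polymod chain.
def clock(seconds):
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"

print(clock(125))  # 02:05
```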
Whichever one finishes first tells us whether we timed out. It doesn't quite work, though, because shell is going to fork something off, and when you fork something, even if your Raku program exits, it keeps going. There is a better way to do it, and that is to use Proc::Async instead of shell, so you don't have this tree of processes. So we say my $timeout = Promise.in(1); my $proc = Proc::Async.new('host', 'example.com'); and then you await any of $proc.start — which finishes when the process finishes — or $timeout, and then you call $proc.kill if the timeout fired. Okay, I'm going to do a few more examples here to show how we use some of these other primitives to help you think about solving problems concurrently. Supply.tap has another way of being written — these are exactly the same: instead of saying supply.tap, you can say start, react, whenever, supply. Which is a lot of words: start a new thread; react makes a reactor, an event loop; and whenever makes a tap on the supply. So let's watch a directory for changes: whenever anything changes in there, we're going to turn our markdown file into HTML. $*CWD is the current path. You can call watch on it, grep for certain files, and call md-to-html whenever one of them changes. I have a few more examples — quickly; the slides will be available for you to look at more slowly if I run out of time. So let's look at ping. Ping is great. One thing that is sadly missing from ping is that it prints all these nice statistics at the end, but it doesn't print the median. What if you want the median ping time? You only have the min, max, and average. Well, let's compute it by watching the output of ping, keeping track of the times, and printing the median at the end. And here you can see what's really nice about the react/whenever construct: you can have a whole bunch of whenever blocks inside your react.
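The run-or-kill-on-deadline pattern built here from Proc::Async plus a timeout promise is, in Python, bundled into subprocess.run's timeout parameter, which kills the child when the deadline expires (a sketch; the sleep command is assumed to be on PATH):

```python
# Sketch: run a command, kill it if it outlives the deadline.
import subprocess

def run_with_timeout(cmd, seconds):
    try:
        subprocess.run(cmd, timeout=seconds)
        return "finished"
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising
        return "timed out"

print(run_with_timeout(["sleep", "5"], 1))  # timed out
```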
So we have a little LEAVE phaser which kills the process when we exit. We have our process doing the ping. Whenever there's a timeout, we're done. Whenever we get a line, we parse it and add it to a lines array. Whenever we get a signal — signal makes a stream of the signals sent to the process — we can also finish, and then we compute the median at the end. So, can we make it even fancier and ping multiple hosts at the same time with our program? Let's see if we can make something that looks like this: multi-ping, which gets a list of hosts and makes a little bar graph by watching the output of ping, so you can sort of see which of these hosts is responding more quickly, all at the same time. This is a really short program to write. Basically, you start a loop using a channel, and this runs in a separate thread: the channel's receive blocks, and whenever it receives something, you take that something and print it to the screen; then you start your processes. There are a few nice features of Raku that make this even a little easier. The little percent sign is a way of constructing a hash, and what's kind of cool is that constructing a hash looks exactly the same as destructuring one. If you've programmed in JavaScript, you know it has this really nice argument-destructuring syntax that is equivalent to the construction syntax. So you can make these channels that communicate between threads and send structured data over them. You can also have type checking and things like that, so it's really nice. And that's your output. Similarly, if we want to dump a whole bunch of MySQL or Postgres databases at the same time, and we don't care about the output, it's even easier.
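The missing-median computation described above boils down to parsing the time= field out of each ping line and taking the median at the end; a Python sketch with canned ping-style output:

```python
# Sketch: extract "time=NN ms" from ping-style lines, compute the median.
import re
from statistics import median

lines = [
    "64 bytes from 93.184.216.34: icmp_seq=1 ttl=56 time=11.3 ms",
    "64 bytes from 93.184.216.34: icmp_seq=2 ttl=56 time=14.9 ms",
    "64 bytes from 93.184.216.34: icmp_seq=3 ttl=56 time=12.0 ms",
]

times = [float(m.group(1))
         for line in lines
         if (m := re.search(r"time=([\d.]+) ms", line))]
print(median(times))  # 12.0
```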
And the way we do that is with a statement prefix called race, which basically says: take this loop and run the body concurrently. You can give it a parameter for the batch — the number to run at once — and the degree of concurrency. And then, in a few lines of code, we've made pg-multidump, which can dump several databases at once. Okay, so in conclusion: we've seen examples of tracking the progress of a command, timing out a command, using asynchronous techniques to respond to file system events, using asynchronous techniques to respond to lines emitted from a command, and instant parallelism; we saw some locks. For further reading, there's stuff in the ecosystem, and the Raku documentation on concurrency is excellent. So that's it, thank you. I think we may be out of time — I don't know if we have time for a question. One question. Or multiple questions at the same time. I'll take your one question. You gave an example of watching a file system event and kicking off a process based on that. Yes. Sorry, I don't have the word I need here, but is there a programming paradigm or capability for testing whether something has finished being written before you kick off your process? If the file is really big, then the file appearing might not mean it's finished being written yet, and you'd kick off the HTML conversion before it's done. — Yes. I think I know where you're going with this, especially because if you're using an editor, there isn't a single event where the file changes: the editor will often do a write-and-rename, or it'll start writing, and so you want to be careful about that. There are some things you can do. You can throttle your supplies, for one. If you're spawning a process, you can say $proc.ready, and that tells you when it's ready, before you start sending things to it.
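The race statement prefix with a bounded degree of concurrency corresponds to a thread pool with a capped worker count; a Python sketch (the dump function here is a stand-in for shelling out to pg_dump):

```python
# Sketch: run the loop body for several databases concurrently,
# with at most two running at once.
from concurrent.futures import ThreadPoolExecutor

def dump(db):
    return f"dumped {db}"  # stand-in for an actual pg_dump invocation

dbs = ["sales", "users", "logs"]
with ThreadPoolExecutor(max_workers=2) as pool:  # degree of concurrency
    results = list(pool.map(dump, dbs))          # result order is preserved
print(results)  # ['dumped sales', 'dumped users', 'dumped logs']
```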
And basically it's hooking into the file system's notification API, so the events are whatever the file system provides — file changed, file renamed, and so on; the limit is only what the file system gives you. Yeah, sure. All right, thank you.
Updates from the PSC
All right, last session: Paul Evans, working hard behind the scenes to make sure we have Perl through 5, 6, 7 — what's it going to be? Don't want to change it? Let's call it, I don't know, 100. Who knows? All right. Man, I'll be dead by then. All right. Hello, welcome. Hello. So this is updates from the Perl Steering Council. A bit of history first. We've had yearly releases of Perl for a very long time now. 5.32 — that was out in 2020, middle of the summer. And then, every year or so, like clockwork, we've had new releases. This is a thing; people maybe don't realize this. Some recent changes we've had: in 5.32, we added the isa operator. That was kind of cool. In 5.34, we added try/catch syntax. These are some new things we've had. 5.36 was a lot of new stuff — we added loads of things; brief list here. First big headline thing: stabilized signatures. So finally, that nice little signature syntax is a stable part of the language. You can just use it; you don't have to fiddle with the @_ array anymore. It's very, very nice. We added this multi-variable foreach mechanism. Come in, come in. So if you want to iterate over multiple variables at once out of an array, for example, you can just pull multiple of them at a time, and it works. It's especially nice for iterating over hashes. So you have a hash here, and you get each key and each value inside the body of the foreach loop. It's wonderful. I love it. What else have we got? We've got defer blocks. So you use feature defer, and now you've got this defer thing: you can put in a piece of code. If you're familiar with Go, this is not like the Go one. If you're familiar with any other language that has defer, it's exactly like that. In Go, they decided that defer blocks would always push onto a stack, and at the end of the function it would run them. Whereas every other language said: no, that's kind of crazy.
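Perl's defer, which (unlike Go's per-function stack just described) is lexically scoped, corresponds to try/finally or contextlib.ExitStack in Python: cleanup runs when the enclosing block is left, not at the end of the whole function (a sketch).

```python
# Sketch: block-scoped deferred cleanup via ExitStack.
from contextlib import ExitStack

order = []
with ExitStack() as stack:
    stack.callback(lambda: order.append("deferred"))  # like a defer block
    order.append("body")
# leaving the block ran the deferred callback, before the code below
order.append("after block")
print(order)  # ['body', 'deferred', 'after block']
```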
We'll just do it lexically scoped. So you have a defer, and then you get to the end of the block and it runs it. Every other language does it that way. Even C: some people are discussing adding defer to C, because if you don't have this crazy per-function stack, you can do it mostly statically in the compiler; it's just kind of shorthand for putting the code at the end of the block. And every other language does it this way. I don't know why Go does it its own weird way. It's a bit weird. Anyway, so we have defer blocks. And you can put finally blocks on try/catch as well. It's basically the same as a defer, but people seem to expect that if you can do try/catch, you can do try/catch/finally. OK, we added it, fine, whatever.

Another thing we added in 5.36 is this builtin namespace. For years and years and years, if people wanted things like weaken and blessed and refaddr and so on, they'd have to get them out of Scalar::Util, which is another module you'd load off the file system. It's a bit annoying. These are now built in to the language. So you don't have to use anything; it's just right there, always available. But if you want, you can import it as well. So for example, we have this nice indexed function that plays very nicely with the multi-variable foreach. You give indexed a list of things, and it gives you back a list that's twice as big, where the first value is prefixed with zero, the second one is prefixed with one, and so on and so forth. So if you're iterating a list out of an array, at every element you can see the index of that item in the array. It's really, really nice. And it's built into the language. The use builtin line here is really just telling the parser, for this scope, I want to have this indexed word available; but being built into the interpreter, it's always available. And, as people were talking about lexical imports earlier, these builtin imports are lexical.
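A short sketch of defer and builtin::indexed as described above, assuming Perl 5.36 where both still carry experimental warnings:

```perl
use v5.36;
use builtin qw(indexed);
use feature 'defer';
no warnings 'experimental::builtin', 'experimental::defer';

# indexed interleaves positions with values: 0, "zero", 1, "one", ...
my @words = qw(zero one two);
foreach my ($i, $word) (indexed @words) {
    say "$i: $word";
}

sub first_line ($path) {
    open my $fh, '<', $path or die "Cannot open $path: $!";
    defer { close $fh }    # runs when we leave this sub, however we leave it
    return scalar <$fh>;
}
```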
So basically what that means is: you've just written some code here and you can see it, but it's not putting that word indexed into your package. So if you're writing an object class, you don't get the word indexed visible as a method. It's not visible from outside; it's only visible within this scope. Really nice, really handy, an excellent way of working. So these builtins are very nice.

Alongside the builtins, we finally, finally have actual Boolean values. C originally didn't have Booleans either, and then eventually in C99 they realized, whoops, we should have Booleans. It's taken us until 5.36 to realize we should have Booleans, but we now have them. So we've got the builtin true and false. Look at that: my $t = true. Guess what that does? There won't be a prize; it's not that subtle. But specifically we have this is_bool test. So you can ask: here's a value, is it a Boolean or not? So 1 and the string "1", well, they're not Booleans, but this real true, well, it really is a Boolean. That's kind of handy. It's particularly handy because things like Data::Dumper know about it. So if we print this array here, with 2 plus 2, 2 concatenated with 2, and 2 == 2: well, that gives you the fairly obviously expected 4 and the string "22". But it also gives you this !!1. That's not very nice, but the reason for that is that everyone uses Data::Dumper wrong. Data::Dumper's one goal is to output valid Perl code. It doesn't know that it's trying to output debugging values for humans to read; its sole purpose is to output valid Perl code. And it doesn't know that you might be loading its output back in to an older Perl that doesn't know what the true keyword is, for example. So it's going to print !!1 because it has no other choice. This is really more a comment of: please stop using Data::Dumper for human debugging. What you want to use is something like Data::Printer.
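The distinguished booleans and is_bool test described above look like this; a minimal sketch assuming Perl 5.36, where the builtin functions still warn as experimental:

```perl
use v5.36;
use builtin qw(true false is_bool);
no warnings 'experimental::builtin';

my $t = true;
say is_bool($t)  ? "real boolean" : "not a boolean";   # real boolean
say is_bool(1)   ? "real boolean" : "not a boolean";   # not a boolean
say is_bool("1") ? "real boolean" : "not a boolean";   # not a boolean

# Comparison operators now yield these distinguished booleans too:
say is_bool( 2 == 2 ) ? "comparison gives a boolean" : "plain value";
```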
Data::Printer is specifically designed for outputting pretty things to humans. And this slide doesn't show it, but that comes out in color. It colors the strings and the numbers and the keywords and the surrounding shapes all subtly differently, and it looks really nice on the screen, and it's lovely, and Data::Printer is so much nicer. If you're debugging stuff for humans, use Data::Printer. So thank you to Breno for implementing the boolean support; it's so good.

That's not all. JSON::PP also knows about it: yeah, there we go, JSON::PP. You encode this very same array, you get 4, the string "22", and true. The JSON::XS version, Reini is still looking at; last time I looked, about three or four days ago, it hadn't been merged yet, but he's working on it. Hopefully that'll come soon. The YAML modules: is Tina around? Tina was around earlier. Yes, hello Tina, thank you. This slide is for you, Tina. Look at that: 4, the string "22", and true. And YAML::PP as well; they all do it. So thank you, Tina, for doing that one. So yeah, real Booleans. Use them, use them, they're nice.

Moving on to 5.38, the newest one that's currently around. Somebody wrote this class thing; I don't know if you've heard of it. Have you heard of it, Ovid? I don't know. Ovid obviously talked quite a lot about this class system earlier, so I'll just go through it briefly. Here's a small example: a small piece of code that you can write to implement an object class. You can create, here, these points, and they have some values. Yeah, they're great; you can have another point. It kind of behaves in the obvious way you'd expect from looking at the code. There are several things about this that I want to point out, again covering similar ground to Ovid earlier. There's a lot of low-level stuff that this thing just does for you. So you don't have to write sub new anymore. You don't have to write a bless... sorry, wait for that noise outside to finish.
You don't have to write a bless expression anymore. You don't have to call accessors to get at your instance fields; they're just accessible directly as lexical variables. They're nicely there straight away and you can just use them. Specifically thinking of Java programmers, and Python programmers in particular: this slide is for you. You write a class, you declare that it has some fields, x and y, here are the default values, and that's it. Nowhere did I have to unpack self.x = x, or args.x or whatever, and work out: did they pass a value in? Take the argument, otherwise take a default value. No, you don't have to do any of that. Here's a method. I've straight away got access to the local fields, and I've got the self. And notice that I didn't have to put $self in the signature here. I didn't even have to shift self in old-school style. I was writing some Python class code lately and I kept forgetting to write def method(self, ...). Why would I put the self in the parameters to the method when I don't put the self in there when I'm calling the thing? As soon as you start getting used to using method, you forget about taking self as an argument. It goes out of your head. It's again nice and neat and lovely, and it just takes away things you don't have to think about.

More things you don't have to think about anymore. As I said, signatures: we added these in 5.36. So here's an example of a signatured subroutine, and here we're taking this optional parameter... this one's fine... I'm shaking all over the place... this y here. If you don't pass in a value for that y, you get this default. So here we have an x of 20 and a y, well, you just take the default of 10. That's all very well, but the way these work inside, if you specifically pass in an undef, well, you've passed in a value, right? But that's probably not what the author of this code really intended.
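The slide itself isn't reproduced in the transcript, but the shape of the example being described is roughly this; a sketch assuming Perl 5.38's experimental class feature:

```perl
use v5.38;
use feature 'class';
no warnings 'experimental::class';

class Point {
    field $x :param = 0;    # :param lets the constructor set it; 0 is the default
    field $y :param = 0;

    method describe {
        # $x, $y and $self are just there: no unpacking, no accessor calls
        say "Point at ($x, $y)";
    }
}

my $p = Point->new( x => 3, y => 4 );
$p->describe;               # prints "Point at (3, 4)"
```

Note there is no sub new, no bless, and no self parameter anywhere in the class body.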
So it kind of breaks a bit if you pass in an undef. It gets a bit worse if you're just passing in variables, because now, well, you'd have to check carefully: is the variable defined or not? And if it's not defined, then I'll just not pass it in. And it's messy to write code like that. So, new in 5.38, you can now use the defined-or assignment operator to declare your signature parameter. Internally it behaves much like this, where you look at whether the value is defined, rather than just whether it was passed at all. So, as you'd expect, you pass in one value and you just get the default. If you specifically pass in an undef, Perl goes: ah, you've passed in an undef, that's the same as if you hadn't passed it in at all, I will take the default. Which means that passing in two arguments is a lot neater. I have another talk where I go into more detail about specifically what's in 5.38, and I point out that if you were to have, say, five parameters to your function and four of them were optional, you literally couldn't do it without this operator, because you can't just not pass the middle parameters and still pass the last one; you'd have to pass in an undef. And so suddenly, with this operator, you can have those middle ones missing and still put in a value at the end. So it makes possible a kind of thing that you literally couldn't do before. Pretty much any time you're using default values in a parameter, to be honest, you probably wanted this defined-or, because specifically passing in undef is almost never a thing you want to distinguish from just not passing a thing at all. So that's quite nice. And these two things combine together quite nicely. So, for example, when you have a class and you have some default values on parameters, you can of course just use the defined-or operator there.
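The defined-or signature default described above can be sketched like this, assuming Perl 5.38:

```perl
use v5.38;

# With //= the default also kicks in when the caller passes undef,
# not only when the argument is missing entirely.
sub scale ($x, $factor //= 10) {
    return $x * $factor;
}

say scale(5);           # 50 - argument missing, default used
say scale(5, undef);    # 50 - undef also means "use the default"
say scale(5, 2);        # 10 - a defined value is used as given
```

With a plain `=` default, the second call would have tried to multiply by undef and warned.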
So once again, it means that in things like this, where you're constructing an object by just passing in whatever values you have in variables, if those happened to be undef, it wouldn't matter. It would say: okay, I'll just apply this default of zero. So that's all very nice and handy.

Other new things in 5.38: we have these pluggable infix operators. For a long time now, we've had pluggable keywords, which is how a lot of the weirder syntax modules like Syntax::Keyword::Try and Future::AsyncAwait and Object::Pad work: with this keyword mechanism, they can tell the parser, I want to implement a whole new keyword, give me control for a bit. So, new in 5.38, we've added more support for doing similar kinds of stuff with infix operators. That means we can have even more CPAN modules to experiment with things that might become new syntax in Perl at some point. And we've got a few things to play around with. People always ask for an in operator, and I've explained in great detail why that's not as easy as it sounds, but there are a few examples there. And there are things like a zip and a mesh, and a few other modules. But for example, this one in particular, equ, is a nice behavior that at some point we might add into core Perl. It behaves very similarly to the normal string eq operator, but it knows that undef is different from the empty string. So this is really cool. And here it's literally this new infix operator. You use it very much like eq, in that for two strings that are either the same or different, it tells you about those; but it knows that undef is equal to undef, and it knows that undef is not equal to the empty string. And in none of these cases will it print a warning. That's quite often the sort of thing you want. It's slightly nicer than writing eq and defined tests all the time.
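A sketch of the equ behavior described above, assuming the CPAN module Syntax::Operator::Equ is installed (this is the experiment on CPAN, not yet core Perl):

```perl
use v5.36;
use Syntax::Operator::Equ;   # CPAN module providing the equ infix operator

say "same"       if "abc" equ "abc";   # like eq for two defined strings
say "both undef" if undef equ undef;   # undef matches undef...
say "differ"     unless undef equ "";  # ...but not the empty string

# And none of these lines emit an "uninitialized value" warning,
# which plain eq would do when given undef.
```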
And this is exactly the kind of experiment that's really useful to be able to test on CPAN first and say: hey, do we like it? We'll use it in a few places, go along, and maybe eventually decide, yes, we'll put that into the language; maybe not. It kind of depends.

One thing you might have noticed from pretty much all of these examples so far is that every single one of them starts with this use v5.36 at the top, or use v5.38. There's a reason for that: the use VERSION mechanism. It allows you to configure, effectively, the language from the very first line of your source code. So rather than you just deciding this is the version of Perl I want to use today, your file says: I want to be 5.36, or 5.38, or whatever. And it's a thing we've always had in Perl, but people haven't necessarily used it as much as they should. And I keep trying to point out how good and how useful it is and why you should do it all the time. Because, for example, it implies a feature bundle. So you say, for example, use v5.36, and you get all of the features that were enabled in 5.36. So rather than having to ask for all of these things individually, if you just write use v5.36, you get all of this good stuff, like say and signatures; maybe some of the other ones are good as well, but those two by far are the ones that I just tend to use all the time. Everything is just say and subroutine signatures. So those are all very nice.

But it gets better. It's very similar to when you compile some C code: you tell the C compiler which version of C you want to be using. It means that just because you've installed a new version of GCC, if you don't tell GCC you're compiling C99 code, well, you can still compile C89 code or whatever. Just because you've updated your compiler, you can still compile old programs. It's even better than that, because it's not just applying to a file; it applies anywhere. You can just put a use VERSION inside a block.
And you can say: inside this block, I want to behave as if it was 5.36. But I'm not going to put use v5.36 for the whole file, just because I still happen to have some older code here, for example this thing using prototypes, where I don't want to turn on signatures. So rather than going to fix up my entire code base in one go to work on 5.36, I'll just do a small bit here today, and then maybe tomorrow I'll do a bit more. And so I can incrementally update to using the new stuff.

It gets even better than that. Not only does it imply a feature bundle, but ever since 5.12, it turns on strict. So any time you write use strict at the top of your file (you always do that, right?), you can instead just put use v5.12 and you've already got strict. Oh, and you've got all the features. But new in 5.36, we added warnings. So where at the top of your file you would write use strict; use warnings (by the way, you should always do that), you don't have to. You can just put use v5.36, and now you've got strict and warnings and all of those features. So it's really, really nice. It gives you your choice of the latest features, and it means that we can maintain the compatibility of the language. We can add new stuff in Perl. You noticed 5.36 added a lot of those keywords like try and defer and so on. If you don't write use v5.36, you don't get those. But that's fine: it means that if in any of your code you had something called try or defer, well, we haven't broken that. We can add new stuff in Perl without breaking your code. All you have to do is put use v5.36 or use v5.40. What that means is: we can update Perl without breaking your code. That means you can update your Perl binary without breaking any code. Hands up if you've ever installed a new Perl and something has broken. Interesting. That means we failed. If you install a new Perl and something breaks, that means we failed. A few years ago? Yeah, yeah; the really early ones, sometimes they didn't go so well.
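The block-scoped use VERSION pattern described above looks like this; a sketch of incrementally modernizing one block of a file at a time:

```perl
use strict;
use warnings;           # older-style preamble for the rest of the file

sub old_style {         # existing pre-5.36 code, left untouched for now
    my ($name) = @_;
    print "hello, $name\n";
}

{
    # Inside this block only, opt in to the whole 5.36 bundle:
    # strict, warnings, say, signatures, and the rest.
    use v5.36;

    sub new_style ($name) {
        say "hello, $name";
    }
}

old_style("old");
new_style("new");
```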
But more recently... For example, I think it was about last month, I updated a bunch of stuff on my email box and all of my email scripting stopped working. And I looked into it, and I discovered there's a little bug in procmail now: something in procmail had changed that meant a piece of Perl code I wrote over 15 years ago was not being invoked properly, and so all of this stopped working. But the script that I wrote 15 years ago for handling all my email works perfectly fine to this day. I haven't bothered touching it. I'd almost forgotten that I wrote it. It just works. And it's all because of this use VERSION mechanism. And so when people say, oh, why do I have to put use VERSION or use feature or whatever to turn on new stuff? This is exactly why. It means we can update Perl, and you can update Perl, and not break your stuff. But it means you have to ask for new things.

Speaking of asking for new things, I've been mentioning that a lot of these things are quite experimental. So, some terms here. Stable means long-term guaranteed. What that means is: if we put something in the language and we say it's stable, that means in a decade's time, in two decades, it'll still work; all of the stuff we're talking about now has been stable for, like, the last 20 years. And it's all the stuff that, if you update your Perl, you don't have to think about, because it's all there and stable and working. Experimental simply means a lack of that guarantee. All experimental means is: we don't guarantee that this will still work in 20 years' time. But it's no worse than random stuff you downloaded anyway. Like, if you install stuff off GitHub or CPAN, or in other languages, things like npm or PyPI, if you just download it and the author says, oh, actually, next week I've changed my mind, it's going to work some other way, that's only the same level of guarantee.
So don't be afraid of experimental. We're not saying, oh, it's crazy, it might break and blow up your code. That's not what we're saying. What we're saying is: if you use it now, we don't guarantee it'll still be around next year. But maybe it will. It's not about does it work. We know it works. We have lots of tests. Things don't get merged at all unless they actually work. So things like the object system and try/catch and all of this lot: it works. We know it works. People use it in production. The question is: do we like it? And that means you. Do you like it? If people come back and say, yes, we like this, this is great, then wonderful, we'll take the experimental tag off. If nobody comes back and says, hey, we've used this, we like this, how do we know whether we should commit to it? There are things, literally this week, that we've been staring at to do with lexical subs, that if more people had been using them over the last eight or nine years since they were made non-experimental, we might have encountered sooner and said: actually, yeah, that's a bit of a design flaw. Whoops, that's a shame. But hardly anybody was using them, so we didn't know, and now it's a little bit late to change them.

So this is a request. This is the one takeaway from this talk; if you learn nothing else, learn this: please use experimental features. Not necessarily in your "I still want this to run in a decade" production code. But if you're writing some small little test thing that's maybe only going to last for today or a week or whatever, or you're just grabbing some data and mangling it and fiddling around with it on your laptop, and you're going to throw away the script after lunch anyway, please play around with these experimental features. We're not saying they don't work. What we're saying is they might not exist next year. But if you're writing some code that doesn't exist next year, who cares? So please try them out. So, with that said, what are the current experiments?
Well, we've got try/catch. That's still a bit experimental because, ideally, I would like that when you catch an exception, you get more information out of it than just the string of what the exception was. So we might expand a bit on that. Defer is experimental. There are a few reasons for that, to do with: if you throw an exception while you're deferring, while you're unwinding from another exception, you've got this kind of double-exception collision thing going on. It's a bit weird. Multi-variable foreach, that's just because it's new. Some of the builtin functions are currently experimental, but they probably don't need to be. Class is obviously very experimental because we're changing a lot of stuff around; that will change and evolve over time.

There's one particular experiment that I do want to draw attention to. When we un-experimented subroutine signatures overall, we did leave in one thing: if you use the default arguments array for some reason inside a signatured sub, that does currently print an experimental warning. The reason being, it's kind of annoying to implement, and if people stop doing this, then we can get rid of a whole bunch of the implementation and make all functions faster in Perl. Please stop doing this, and then we can make your Perl faster. Sorry? Could it become a feature? We could, yeah, it could become a feature. Maybe, maybe. We'll see, it's complicated. Talk to me at lunch.

Anyway, we've only got 10 minutes left. Coming up in 5.40, the new release that we're expecting to be out sometime this summer: most builtin functions should become stable. So things like reftype, where at the moment you get experimental warnings: when you do use v5.40, you won't get an experimental warning anymore, because, hey, it's fairly simple, fairly stable, seems to be fine. We're also going to get builtins in the bundles from use VERSION. You know how I said use v5.36 implies all of these things?
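The @_-in-a-signatured-sub pattern being discouraged above looks like this; a sketch assuming Perl 5.36 or newer:

```perl
use v5.36;

# This triggers the experimental warning the speaker mentions,
# because @_ inside a signatured sub blocks future optimisation:
sub discouraged ($first) {
    my $count = @_;              # please don't do this
    return "$first of $count";
}

# Take a slurpy parameter instead if you need the rest of the arguments:
sub preferred ($first, @rest) {
    my $count = 1 + @rest;
    return "$first of $count";
}
```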
Well, use v5.40 will add another one. So that means when you write use v5.40, you get all of these builtins for free, which means you can write use v5.40; say reftype ... and you just get the thing. And obviously, we're going to put that in with the capital -E as well, so you can just do perl -E 'say reftype ...'. Look at that, that's lovely. Everyone likes to do reftype in their one-liners. Yeah, I don't know; with these ones it's hard to come up with small examples, but it's nice that they're there. It's nice that you don't have to ask especially for them. You just get them. So yeah, use v5.40.

So I want to talk a bit about the process behind some of these things. We have this thing, the Proposed Perl Changes, the PPCs. It's a formal process where people can request changes in the language. So already we've seen, in v5.36, the n-at-a-time foreach; that one was written by Nick Clark. We have defer, the Booleans and the... well, no, that says Booleans. That should say builtins. That's a bug. I wrote those ones. Xenu wrote the command-line flag for slurping; it's just a small little bit. Rik wrote the builtin indexed one. These are all the people who wrote the documents; these aren't necessarily the people who implemented the code. So part of the whole PPC process is about saying: if you have an idea for Perl, but you don't know how to implement it, well, that doesn't matter. Write us a document to explain the kind of thing you want, and if we accept it and we like it, we'll say yes, and we will work out how to get it implemented. You don't have to implement it. In v5.38 we got rid of the old single-quote package separator; that was Nicolas Mendoza who wrote that one. Ovid over here did the module-true one, and I can't remember who implemented that. I did some of it, but someone else did. Who? Chromatic? Yeah, chromatic wrote that one.
Yeah, you just suddenly surprised us one day and said, oh, by the way, I've implemented this. Wow, OK, fine. Yeah, so we have the module-true feature; that's quite nice. And the lexical exports. Sorry about that; we're going to have to change them. Yeah, chat to me later.

We're currently testing... there's only one little PPC that we're testing at the moment for 5.39, and that's the load_module builtin. It's going to be quite nice. It's just a nicer way of doing a require where you have a package name in a string: rather than having to do all of the horribleness of turning it into a file name, you just go load_module. It's quite nice.

There are a few other ones that we're in the middle of implementing. Things like English names for punctuation variables, so rather than writing $@ like that, you could just ask for the eval error by name. It's quite nice. Template strings. I'm almost upset you didn't... you had a sprintf in your code earlier, Rik. I mean, come on. If only we could finish the implementation. Yeah, it's hard. Sublexing is hard. So, this horrible thing, especially with objects: if you try to invoke an object accessor inside a quoted string, you'll know you can't do that, and so you're always having to break out of the quoted string and so on. So we've stolen this thing from a few other languages: template strings, so you can just put expressions in your string. It's lovely. It's nice. It's horrible to implement. If anyone knows how to implement it, let me know, because I've had about three attempts.

Anyway, another one we're in the middle of implementing is optional chaining. Python, actually, a couple of weeks ago said they were considering this thing. They call them the None-aware operators, where you want to do this method call or a hash lookup or whatever it is...
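The load_module idea described above replaces the usual package-name-to-file-name dance; a sketch of the current idiom, with the proposed builtin shown in comments since it was still under test at the time of the talk:

```perl
use v5.36;

# Today: requiring a module whose name lives in a string means
# hand-building a file path (or a string eval).
my $class = 'Data::Dumper';
(my $path = "$class.pm") =~ s{::}{/}g;
require $path;

say $class->can('Dumper') ? "$class loaded" : "load failed";

# The proposed builtin would make this simply:
#   use builtin qw(load_module);
#   load_module($class);
```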
But the thing on the other side might be, well, in Python's case, None, and in Perl's case, undef, and you want to just return undef instead. So we have this wonderful idea of just putting a question mark on the operator name. So that there: if the hash key exists, it'll call name on it; if the hash entry doesn't exist, or if it's undef, the whole expression is just undef. And that's often a thing you want to do. It's nice and neat and tidy. I like it.

And the metaprogramming API. So all these crazy things that you do with no strict 'refs' and glob refs and all this other stuff that's horrible and messy: we're going to make that much, much nicer. You just get a meta-package, and you get the symbol out of it, and you get the value in it. It's all lovely. It's all inspired by things like Package::Stash, and there are a bunch of other things on CPAN, but we want to make this an official part of core Perl so that we can tie it into things like the object system as well. That just makes it much more powerful.

A few other little upcoming ideas at some point, but probably not going to be in 5.40: I'd like to have named parameters in signatures; it'd be nice to be able to have these named things here. And I want to do more stuff on class. I've not really added anything extra in class for 5.40. So roles would be nice. The convenience accessors might be nice. It's possible that by 5.40 I'll get around to the easy one, like reader, but even something like writer is going to be a little bit awkward. But even just having readers in 5.40 might be nice. I'll see if I can get around to it. And I've got three minutes left. Yep. And the last thing I want to do at some point is renumber 5.whatever into 7, because I really want to be able to type use v7 and just have it work. And with that, I'm going to say that's the end. There's a link to the slides.
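The optional-chaining idea described above is a proposal, not yet in any released Perl; this sketch shows today's hand-written guard, with the proposed (hypothetical) syntax in comments:

```perl
use v5.36;

my %config = ( db => { host => "localhost" } );

# Today you guard each step by hand to avoid both warnings and
# autovivifying $config{db} as a side effect of the lookup:
my $host = defined $config{db} ? $config{db}{host} : undef;
say $host // "(no host)";

# The proposed operator would collapse that to something like:
#   my $host = $config{db}?->{host};
# where the whole expression is undef if $config{db} is undef.
```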
There's also a link down here to some slides and the video of my talk, What's New in 5.38, which goes into a lot more detail about the new things we added in 5.38. And then we'll take some short questions, but I'm mindful this is the last talk in here, so afterwards I'm going to go for lunch. If people want bigger chats, we can chat over lunch or in the hallway or something. So with that, small questions. Yeah. Question: with the class support, do you expect, or is it planned, to implement interfaces? So the question is about interfaces: do we plan to implement interfaces? In summary, no. Java's idea of an interface is all about defining, up front, what kinds of methods you can call on a thing. It's all to do with static typing; that's exactly all it is. And Perl doesn't have static typing in that sense. If you have an object, you can always, at compile time, write the code to invoke any method you like; maybe at runtime the method may or may not exist, but that's a different thing. Whereas adding the concept of static typing to a dynamic language like Perl basically turns the entire language upside down. So the idea of a pure interface isn't really a thing that we want to add. But we definitely want to add roles, because roles are statements about an interface, but they can also have implementation with them. So it's all about gluing small bits of functionality together to make a larger class. So we definitely want roles, but pure abstract interfaces are not really a thing that fits in dynamic languages. Oh, good, a comment on the question: Java allows default implementations now. Oh, does it? In recent versions. So basically they're roles. Yeah. Yeah. Okay. So we've only got one minute left. Question: for my ($x, $y) over an array, if $y happens to be undef, how do we know whether the value was undef or we hit the end of the array? It doesn't matter at that point.
It's just... oh, with the default-argument signature parameters thing? No, the multi-variable foreach. Oh. So, I want to know that my array is evenly sized if I'm pulling out pairs. Yeah. Yeah. So for the foreach, when you have multiple variables: yeah, if the size of the array doesn't exactly match, if it's not a whole multiple, you will just get undefs for those last missing positions. We did think about other bits of behavior, but I think in the end we decided that it just matches what happens if you did my ($x, $y, $z) = @array. When you get undefs in those last few values, you don't know whether that's because there were undefs in the original array or you just ran out of values, and so you've got the undefs. So it's kind of the same thing. If we did consider implementing something where you could tell the difference, then you'd start to ask questions about: would you put it in other features as well? A large part of trying to do language design is saying, well, we're not just going to do this one isolated feature; we have to consider how it plays with all these other things. And running out of the array is a thing that happens in a lot of places. So I think that's the end of questions now. We'll stop there, but if people want to chat more, I'm happy to chat over lunch. Thank you very much. Thank you.
Open Source DocOps
Welcome. Our first speaker will be Lorna Jane Mitchell. I always say Lorna Jane as one word. I think everyone knows... yes, you probably already know Lorna, and she's going to talk about open source DocOps. Take it away. Thank you. Hi, everybody. Thanks for coming. It's a busy room and you've had a busy day; I hope your brains are not too full for something more. My name is Lorna. I'm VP of Developer Experience at a company called Redocly. We make API tooling, including documentation tooling. I've worked on docs projects in a couple of previous roles. I describe myself as an engineer with a writing problem, and I'm very happy to be here with some like-minded individuals. I'm also passionate about open source. Yeah, my background is in software development; I learned in the open source community. I'm an open source project maintainer and an open standards contributor. And I want to bring to you today how open source and DocOps work together. So, this works better if I plug it in. There we go. This is my second talk of the day; I'm not sure I've still got sentences.

Okay. What is DocOps? It's in the talk title; you believed in it enough to be here. Documentation operations is about allowing documentation to be created, but also maintained and published, collaboratively and in an efficient manner. It's really about being able to make changes with confidence, and being able to make a lot, a lot, a lot of changes with lots of contributors. And the way I think about DocOps is that, coming from some of the more traditional documentation practices, DocOps is a culture shift. Some of you are enough in the software space to have seen the DevOps culture shift, and we're bringing something very similar to our written word. Everything I'm going to say in this talk really builds upon the concept of docs-as-code. If you are not treating your docs as code, you cannot benefit from the cool tools that the coders build for themselves, which we adopt into our tool chains.
This especially includes source control. Git is the key to many of the workflows that I'm going to talk about today. Text-based markup, so that we can manage multiple change sets simultaneously and bring them together without pain. I personally enjoy rebasing, but you shouldn't have to. Bringing in continuous integration and those practices, and also having a good local setup. If you have to push to see if you did it right, that's not a good documentation creator experience. And having good tools all the way through the stack is what makes this a really effective workflow. It makes you very productive and lets the machines do the heavy lifting. For a long time I used to say that the software developers, the coders, build the tools that they want to use, but I don't think they should keep them for themselves. I think we should take them and bring them into our world of documentation. Open source: you're at FOSDEM, in English I would say I am already preaching to the choir. Open source means freedom, but it also means not having to build the same tool in every team that needs to publish a docs platform or check that the links work. It means being able to run that tool wherever you want to. Tools that fit into continuous integration systems are typically open source by default. We don't expect license keys or sign-ins; we expect them just to run on our temporary compute platforms or on our local machines. Best of all, there's no vendor lock-in. So we can choose this tool or that tool, and just because we chose one doesn't mean we're stuck with it. We're using standard formats and open source tools. Just because we didn't have to build and rebuild the tool doesn't mean we don't have to build it at all. We all need to be participants in the tools that we use: reporting bugs, fixing things, thanking our maintainers when we see them. It's all part of the story.
So I'd like to share with you some of the tools that I use on my docs projects, and I've tried to pick just a few categories of things that I think are vitally important. We'll start with the obvious. You need to be able to preview your docs change before you publish it. Everybody should have access to preview. Everybody who contributes to the documentation or reviews any docs should have access to a tool like this. This is a screenshot of VS Code. I'm editing an OpenAPI file on this side and this is the Redocly rendering on the right-hand side, and I typically work like this. So I always have local tooling that updates immediately. I can see instantly, oh, that didn't render like I expected. There's something wrong with this. I can clearly see that's broken. My table is missing a cell, because I've got that live preview response, and this is part of the story. It doesn't have to be embedded in your IDE. You can run a local server that updates, or use a watch command to rebuild your static site, but you should have fast preview when you are working on documentation. You also need to be able to see the build errors locally if there are any. I see too many places where that's hidden away somewhere hard. The other place you need preview is in your pull request. You open the pull request. That needs to build exactly as it's going to ship. We need to spin up a per-pull-request preview. Don't merge the branch and put it on a staging server and hope. Pull request builds for previews, and that also enables the reviewers. So it gives them a nice view. I used to think that previewing docs was for people who weren't technical enough to read Markdown. Now I'm a VP. It's just people who are too busy. You put the web page in front of me, I can review it. If I have to go picking through a pull request somewhere, it's a bit less likely to happen. Okay. Link checking. Who has link checking in their docs build today? Yeah.
It's not very many, and it's the thing that most easily rots in your documentation. There are two problems. One is all the links between all your own resources, which are just super easy to get wrong. And the other one is other people breaking their links, making you look like a fool. So I use a link checker to check both of those. It automatically does, like, a click on all the links. For a long time I was building the HTML and checking the links after render, which is cool and works. Now I'm working on more of a dynamic site, so I actually have a tool which checks at build time. I'm using mlc. There are lots of others. Pick your favorite. It can read Markdown, and so then it can just check: this link makes no sense, your syntax is terrible, please do this better. All those things. Either approach works, but I think it's very important. It's an easy thing to add. You can run that tool locally. You can run it in CI. The downside of checking all your links is really other, I mean, all the problems are really other people, aren't they? All the problems are other people. Sometimes the internet goes wrong. I used to work on a documentation platform which relied on an upstream open source project. Whenever that project launched a new version, all its links were broken for 12 hours. There comes a point where you don't want to know what the explanation for that is, but it meant that all of our builds failed for 12 hours because the links were broken. No, no, their links are broken. So I have a couple of different strategies for this. One is to only check the links in the files that are changed, because especially on a big documentation set, you don't want to have to deal with something that's gone wrong in a link from another section that might be owned by another team. So I just do that, and then I do a weekly check-all-the-links job. If that job fails, it opens an issue.
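To make the build-time idea concrete: checking your own internal links is just a matter of resolving each relative link target against the file system. The sketch below is a toy illustration, not the mlc tool the talk mentions; the function name and regex are my own.

```python
import pathlib
import re

def broken_relative_links(md_path: pathlib.Path) -> list[str]:
    """Return relative link targets in a Markdown file that don't exist on disk.

    A toy sketch of build-time link checking; a real tool like mlc also
    validates syntax and fetches external URLs.
    """
    text = md_path.read_text(encoding="utf-8")
    broken = []
    # Capture the target part of [text](target), stopping at ')' or '#fragment'.
    for target in re.findall(r"\[[^\]]*\]\(([^)#\s]+)", text):
        if "://" in target or target.startswith("mailto:"):
            continue  # external link; skipped in this offline sketch
        if not (md_path.parent / target).exists():
            broken.append(target)
    return broken
```

Feeding this only the files changed in a pull request, and running a full sweep on a weekly schedule, mirrors the two strategies described above.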
So if something's decayed, we'll catch it, maybe not always faster than a customer, but fast enough. So these are some things to think about: whether somebody else's broken link or downtime should block your build or your release, because other people's links are outside of your control, and so that can be a hazard. Let's talk a bit about validation. If you're coders, you are accustomed to working with syntax checking tools. Some programming languages will error at build time before you even run them. Some of them are more interpreted, so they don't go wrong until you run them. We don't historically do that with our documentation, but the tools are there, especially when you are doing docs as code. We are doing docs as code. It's got all the advantages of working in code, and it's got all the disadvantages of working in code. It may not be obvious that something is wrong. The errors can be super subtle. You have a full stop where the comma should be, or the wrong sort of bracket. This stuff, even when I work with it all the time, can be very difficult for humans. Super simple for machines. So we can build on those tools and let the machines do the work. The other thing I like about having the validation errors automated: I can run them locally. I never do. I always push it and then wonder why it's failed. The other thing that's nice about that is when you push your pull request and you are missing a comma or you have the wrong sort of bracket, perhaps this is personal to me, but it feels kinder coming from a machine than having someone else criticize my use of a bracket. And I don't have to wait for a person to come and review it. I immediately get that very impartial, factual feedback that my bracket is in fact wrong. And I think that's what I like about using validation like this. I was going to say the bots are not judging me. What a horrible thought. Are they?
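That impartial feedback usually comes from a linter configuration checked into the repository, so every contributor and every CI run applies the same rules. A minimal markdownlint configuration might look something like this (the specific rule choices here are illustrative, not a recommendation):

```yaml
# .markdownlint.yaml (illustrative settings)
default: true        # start from all rules enabled...
MD013:               # ...then turn the volume down where it's too loud:
  line_length: 120   # allow longer prose lines
MD033: false         # permit inline HTML where the theme needs it
```

Committing the file next to the docs means the command line tool, the IDE plugin, and the CI job all read the same settings.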
For the validation tooling you have a few options, and it depends a bit which flavor of markup you are using. I'm working mostly with Markdown these days, although let's just say it's not because it's my favorite. Let's keep the markup language war for later. I'm using markdownlint. With Markdown I find it very good and very, very configurable. So like all of the linting tools, and the same with OpenAPI, which I work with a lot as well, probably some of you have API reference docs, the default settings for all of those linting tools, the volume is too loud, especially if you were not already using those linting tools at all. markdownlint is really configurable, and it has really excellent documentation on what all the options do. It is remarkable how few documentation tools have genuinely good documentation. This one does. For reStructuredText I've mostly been using that with Sphinx, and Sphinx has really great validation, and I think it builds on docutils, so you can use that by itself. All of those also come with command line tools, IDE plugins, and you can put them in your continuous integration. So GitHub Actions, Jenkins, whatever it is that you use in your setup, set that up for your prose content exactly as you do for your code. If you're using OpenAPI you should also be at least validating that. I've already given my OpenAPI talk today, so I will attempt not to rant about API linting and standards, but put those tools in, set your standard, and make sure that you are consistently checking that. Again, it goes in your tooling. Disclaimer: I make Redocly CLI, that's my day job. Other excellent competing open source tools also exist, and I'm probably not the right person to take a recommendation from. I'm very biased. So we talked about validation; very closely related to validation is formatting. Again, software development does a lot of reformatting of code, and that is to give a very consistent presentation.
We always use the same white space in the same way, the same indentation, the same wrapping rules. It makes it visually very consistent. So when you work with the same code base all the time, it gets easier to read. We can do that for our markup too: Markdown, reStructuredText, AsciiDoc, whatever. By allowing things to adjust our new lines, our white space, the indentation, the wrapping, things like: do you need a blank line before your bullet list or after your heading? Lots of tools don't care when they're rendering, but by keeping that the same you can make it easier to read the raw text and easier to look at it and spot problems, because the layout is so consistent. I've only recently started doing this. I write a lot of docs that are in the same repository as the code, and we just turned on the engineers' Prettier tool for our Markdown. It's actually really nice, and I was initially like, of course you can, I don't mind. Now I'm turning it on everywhere. So yeah, I really recommend it. I also really enjoy prose linting. Now I don't see enough of this. I'm using a tool called Vale, and I'll be honest, I don't know very many other tools in this space. Lots of people nodding. Good. I'm also happy to be contradicted, like tweet me what I should have said. You can give it a dictionary, so it's going to do all of your spell check for you. It can also do quite a lot of grammar checking. This is brilliant for me. I work with almost entirely non-native speakers. So having a little bit of help for me and them to get the words out correctly is brilliant. I am a native speaker; doesn't always help. So Vale helps me a lot. Also, you might be able to tell from my accent, I'm British. My company is standardised on American English, and at this point my spelling can only be described as mid-Atlantic.
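A Vale setup for that kind of situation is just a small config file in the repository. A sketch (the "Britishisms" style name is a placeholder for whichever style package you actually enable; the built-in `Vale.Spelling` check is real):

```ini
# .vale.ini (a sketch; "Britishisms" is a placeholder style name)
StylesPath = styles
MinAlertLevel = suggestion

[*.md]
# Vale's built-in spell check plus a custom style for British spellings
BasedOnStyles = Vale, Britishisms
Vale.Spelling = YES
```

Custom vocabulary, product-name casing rules, and banned words then live as style files under the `StylesPath` directory.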
So having Vale just to catch those common ones. We have, like, a Britishisms rule enabled, and it's because I'm here typing all these British-spelled words into our American docs. It catches repeated words. You can teach it product names. In my previous employment I worked with a company that published a bunch of open source database products. You have to get people's trademarked product names right. Uppercase, lowercase, trademark. This has to be legally correct. So unless you want your lawyers to have to think about this a lot, you just teach it to Vale, and Vale explains it back to you really regularly. The other thing we did there was we put a bunch of common misspellings in. So we worked on Kafka. When I set up a search for Kaka: loads of hits. We also banned the English word flick, because we had a product called Flink. And indeed, we just don't need this word in English, because it probably is a misspelled product name. So those are the sorts of things that Vale can help with. I know we have a Vale talk next. Yes? A little cheer. So I'm not going to say more about that. Vale's amazing. Stay and listen to the talk. Okay. Let's talk a little bit then about how all these amazing different tools that solve different problems, and they have your back, they support you in lots of different ways, but let's talk about how they fit that life cycle, that workflow. The key is that you are using exactly the same tools with exactly the same config everywhere between your laptop and your production platform. That's the goal. Every contributor needs access to the same tools, set up the same way. The tools, if you haven't used them or you don't yet feel confident (and I know lots of people who have been using Git for years and still think it might bite, which is fair), there are lots of things to learn. Source control.
I'm focused on Git, but I've been doing this long enough that I learned on something else, and I don't doubt that there will be more transitions in our future, because that's technology. I like a workflow that's called GitHub flow, where you have a main branch, you make a small change, it gets reviewed, it comes back in. If you see another spelling mistake, don't put it on this branch. Put it somewhere else. And it means that you can branch off lots and lots of shoots that are waiting to be reviewed and merged. And in this way you can multiplex lots of changes, even while a feature is waiting for review. Be confident. Actively practice changing branches, because it will give you the momentum to switch a branch, make an edit, push it back. If you are writers, you are probably editors and reviewers as well; these are the skills that will multiply the stuff you're already good at, by getting the tools to help you. I've talked a bit about the continuous integration. Always hook in everything that you find useful locally; maybe you get an extra VS Code plug-in. Figure out how to put that into your continuous integration setup and apply that tool to every pull request. This way we can never forget to check the formatting or the links, because it will just be there. No more "that one's a bit risky, I think we should deploy it to staging and check it." The preview will always be there, and the machines will always be on your side. It helps the reviewers to do a better job, and it maintains documentation quality. One of the most important places to have exactly the same tools and exactly the same config running is on your local machine. The smaller your feedback loop, the more quickly you can adapt and correct and move forward. So having to open a pull request to get the build to see if it's okay, that's a big feedback loop. It's not ideal.
I have one project where I need to do it, because we have an amazing test harness setup and it's much faster to run the tests there than here. So I, like, open the pull request to let the build run, because it's quicker to do that than to wait for it to run locally. But for most docs tooling, they should take a few seconds at most, even on very large doc sites. You must have them locally. If you use an IDE or similar, take the time to figure out how to plug these tools into that setting. Lots and lots of them are supported in both places, and you can have it in context. I use Vim. All of those tools are plugged into Vim as well. So it's not modern, hand-wavy, cutting edge. This is standard practice. The other really important thing is that this is all written down. Documentation specialists, everybody: write down how to set up the tools, how things are configured, where we publish to, where the sources are, how the remote sources come in, how things are set up, maybe some troubleshooting guides. Write that down. The onboarding should be easy, whether that's a new hire or you get a new laptop someday. Set yourselves up for getting it right, because again, we're looking for confidence and efficiency, and this sort of thing is part of the culture change. There's a saying in software about move fast and break things. DocOps is about move fast and don't break anything. I mean, maybe it doesn't matter as much in documentation, because it's easier to iterate than it is in code, or especially in API interfaces. But the goal here is that we have professionals who are really good at what they do, but the tools can make that faster, easier, simpler, more accurate. They can catch us on things that we might slip up on. So bring the tools, but also the DocOps mindset, into your projects and see where it can take you. I am pretty much out of time. Here is a list of useful resources.
My slides are linked from my session, and I will say thank you for your time. I think we have maybe time for two questions. Would anyone like to ask a question? Yes. This is a really good question. Do I have tips for helping with the translation of documentation within the process? I haven't worked on a lot of projects that have this. The ones that I have, Git is the key, because you know which files have changed and which things have changed. I have mostly seen setups where the translation is a mirror, and whether it's a week or a month or however often you pay your translation people, you can snapshot the pages that have changed and get those re-translated. So I think source control helps a lot with that. One more question. You can imagine that you also have a very strong opinion regarding documentation in Confluence or Notion or something? I would like to hear it. I will repeat the question for the stream. The question is, do I have a strong opinion about having documentation in Confluence or Notion or something like that? I have two strong opinions; not too strong, because we are being recorded. The other one maybe we can talk about in the bar. Using a tool like that hurts collaboration, because you can't all make multiple changes at once and bring them back. If one person is editing while you're editing, it's very tricky to do that. The other reason is the lack of standards. So on a very personal level, I have some accessibility needs. If you switch your documentation platform to Confluence or Notion, I can't do my job anymore. So docs as code is the way, because it lets everyone choose the tools that work. Thank you. All right. Thank you very much. I think we have to stop there.
Easily Going Beyond Markdown with Material for MkDocs
No, it works. Okay. Gotta get, there we go. So, thank you very much and enjoy. Yep, thank you. But before I get started, is this readable in the back or do we need to blow it up? A little bit bigger. How about this? Okay, good. All right. So, welcome to my talk on Material for MkDocs. Let me quickly introduce myself and my co-author. So, Martin isn't here today, but... So, I'm Kenneth Hoste. I'm an HPC system administrator at Ghent University. HPC is high-performance computing, supercomputing. Some people may not know this, but it's lots of servers, lots of noise, lots of money, lots of annoying users as well. There's a lot going on there. I'm the lead developer of EasyBuild for the last decade. So, EasyBuild is a tool for installing scientific software on supercomputers. It can get very fun, I can tell you. I'm involved in way too many FOSS projects. I patch things and try to fix things left and right. And I've been attending FOSDEM since 2013. If you think FOSDEM is total chaos, you should try organizing a devroom, doing a talk and planning live demos during your talk, which is what I'm going to do. So, I actually had to run out of the HPC devroom, which I'm co-organizing. I've been a big fan of Material for MkDocs since I discovered it, and I think more people should know about it, so that's why I'm here. The other person on the talk as an author is Martin. Martin Donath, he's the lead developer of Material for MkDocs. I reached out to him to ask him, please submit a talk on Material for MkDocs to FOSDEM. He said, I can't make it this year, so then I said, OK, I'll just do it myself. And I had a call with him to discuss what should be in the talk, so he's been involved. All right, so why do I want to give this talk? Well, Material for MkDocs is great. More people should know about it, more people should be using it. It's very easy to install and use. You get very good results with pretty minimal effort, and I'll actually show this hands-on.
Tons of great features. It's actively developed. It's open source, of course. And there's a very interesting story about how it's funded as well, and I'll cover that as well. And I was very shocked that there's never been a talk ever at FOSDEM on MkDocs or on Material for MkDocs. That's just wrong to me, so I'm here to fix that. My personal journey is, I actually haven't been using it for very long. It's pretty recent, basically since 2021. I had to create a tutorial website, or I wanted to create a tutorial website for EasyBuild, for the tool I'm mostly working on. The existing EasyBuild documentation was in Sphinx. I wasn't terribly happy with that. It felt slow. It was using RST. The syntax didn't make sense to me. It was very difficult to work with. We were not getting a lot of contributions to that documentation, so I was looking for other things that could be possible. The tutorial was a totally new project, a totally new website, and I started looking around. I found Material for MkDocs, and I was sold after like five minutes. That tutorial was built with Material for MkDocs, and shortly after we started porting the EasyBuild documentation, and also our HPC documentation in Ghent, to Material for MkDocs, because it just made a whole lot more sense and was a lot easier to use. And also new projects that I've started since then have always been using this tool to create documentation and tutorials. So to start with, what is MkDocs? How many people here are familiar with MkDocs? Who has used MkDocs? About half of the room. Good. MkDocs is a static site generator. It's not a very complex tool, I think. It has a very strong focus on building documentation for software, so technical documentation, code, all these kinds of things. The documentation sources themselves are written in Markdown, which is one of the things that sold me on MkDocs. Markdown is everywhere.
If you're doing pull requests on GitHub or GitLab, issues, formatting there, wikis, it's all Markdown, so the documentation that you're writing should also be Markdown, just to make the jump a bit smaller. To configure MkDocs, so how the site should look and all the bells and whistles it has, that's all done in a single YAML file. Maybe you don't like YAML, but at least it's a single file that you want to look into and figure out how to configure. MkDocs itself is implemented in Python. Other than when you install it, you don't really notice that. That's probably a good thing. But it is very easy to install, use, customize, and extend. So how do you get started with MkDocs? This is a bit of a long list. You install it, pip install mkdocs, basically. You start by creating a landing page, so an index.md in a docs folder, typically; you can change that if you want to. You create a minimal configuration file, and then you launch MkDocs. You do mkdocs build; that will take the Markdown that you put in the index.md and generate an index.html from it. You can open it in your browser and you're good to start with your documentation site. You can do mkdocs build --strict as well. If you have any mistakes in your documentation, like you're linking to a page that doesn't exist, for example, it will go ahead and warn you about that. And that's very useful in CI. If you're making changes to your documentation, you can run this in GitHub Actions, for example, and it will warn you that something is wrong and you shouldn't be merging those changes. There's a way to live preview the documentation while you're editing it as well, through mkdocs serve. I'll show you that as well. And now you can go ahead and write your documentation. So showing all of that on the slide is very boring, so let's do it hands-on. And let's see how quickly this goes wrong. All right, so I'm essentially starting here from an empty folder.
There's an empty docs directory, just so I don't forget to put stuff in there. The first thing we'll have to do is install MkDocs. It's not here. So this is just a pip install mkdocs. If you're a little bit familiar with Python, you know you have to be careful if you do pip install, because it may end up somewhere you didn't intend. So what I want to do here is create a Python virtual environment. If I remember how to do that... all right, so now I'm in the virtual environment, and in here I can just do pip install mkdocs. And if the Wi-Fi works, that should be working. So now I have MkDocs available, whatever version is there. Okay, so that's the pip install part, that's step one. Now I can create a very minimal mkdocs.yaml. And all you really need to put in there as a very minimal thing is the name of the site. So let's just put FOSDEM here. And in the docs folder, we want to create an index.md in Markdown. So let's say hello FOSDEM, this is a demo. Okay, that's all we need. We do mkdocs build. This should be very quick. It generates a site directory with a whole bunch of stuff in there, including index.html. We can open this in our browser and it looks like this. Hooray, it works. We even get a search function here. Of course, now there's not a lot to search yet. You can search for FOSDEM and it will bring me to that page. Okay, so the search functionality is already built in and ready to go. Now, once you start creating a couple more pages, let's say getting-started.md, like this, if you save this, you have to do mkdocs build again. And you have to refresh the site. And then here you see there's a getting started page as well. Now Firefox gets a bit confused because this is all static HTML. So it says, what do you want to do? I want to open the page. What's more interesting is if you do mkdocs serve. So now you're getting a small web server here running locally. You can click this. You see the same website.
But when you start changing stuff, for example, in an existing page, let's say magic happens. As soon as I've saved this and I switch back to the site, this should now refresh. Oh right, okay. See, demos always go wrong. Try again. You save it, and if you switch back, it pops up there. So you're automatically getting live preview while you're editing the documentation. To me this is absolutely brilliant. Okay, now what I don't like about this is this theme. Like, what the hell is going on? The links are white, and getting started is here. Where's my hello page? So weird stuff is going on. That's where I think Material for MkDocs kicks in. So Material for MkDocs is a theme for MkDocs. It makes things a whole lot better, nicer to look at, just straight out of the box. Very easy to use. And it comes with a whole bunch of plugins and extensions, so extra features that MkDocs cannot do by default. So I see this as MkDocs with batteries included. This is actually how MkDocs should be out of the box. Again, easy to install, use and configure. All you need to do is, in your Python virtual environment (so I'll have to kill the serve here), you do a pip install mkdocs-material. So you just install an additional Python package, which will bring in a whole bunch of extensions, and there's a whole lot of stuff going on here. I'll serve this again. Now, if I look at the website, nothing has changed yet, because I have to change the theme as well that's being used. So in my mkdocs.yaml, I say theme: name: material. And as soon as I hit save on this and I switch back, I think it needs a refresh. Why is it not working? Oh, something went wrong. Ah, demos. Okay, let's try restarting this. Okay, I'm not sure what went wrong with the live preview. Usually that works. So this already looks a lot better. So at least now I'm seeing my pages and the search. The search here is amazing. It's blazingly fast.
Even if you have pretty big documentation, it highlights the things. You can customize the search. You can rank pages up or down if you want them to be more prominent in your search results; there's a whole bunch of stuff you can do. All right, so that's getting started with Material for MkDocs. Just with pip install mkdocs-material, you change your mkdocs.yaml to use material as the theme, and things start looking a lot better already. And now the fun really starts, because there's a whole bunch of plugins and extensions you can start using as well. Now, I'll do a quick cheat code here, because the mkdocs.yaml I'll end up with is going to be pretty big, because I want to show you all the bells and whistles. I'm not going to type all of that, so I have a hidden file here that I'm just going to move into the right place. And if we open this, you can see there's a whole bunch of stuff here going on now. I'll explain in the slides what's going on. So one of the first things you can do is you can start playing with the colors. You can change the accent colors. So here I use, like, FOSDEM purple. That's very easy to change. You just say the palette primary color should be purple, and the accent color, so that's when you hover over stuff, should be blue. So it's very easy to play with the colors if you're interested. What's also very easy to do is introduce light and dark mode in your documentation. So with a little bit more stuff in your mkdocs.yaml, you can say I actually have two color schemes: I have a light mode and a dark mode. The dark mode is called slate, for whatever reason. The light mode is called default. Okay, now you know. And what actually happens when you do that: so here, when I moved that big configuration file in place, it actually already did a re-render, and now I have dark mode here as well. And it's actually a dark mode with tuned colors, so I'm getting FOSDEM colors in my website now as well. So that's one small thing that's very easy to do.
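The palette part of that mkdocs.yaml looks roughly like this. The scheme names default and slate are the real ones mentioned above; the specific colors and toggle icons here are illustrative:

```yaml
# mkdocs.yaml excerpt: light/dark toggle (colors are illustrative)
theme:
  name: material
  palette:
    - scheme: default        # light mode
      primary: deep purple
      accent: blue
      toggle:
        icon: material/brightness-7
        name: Switch to dark mode
    - scheme: slate          # dark mode
      primary: deep purple
      accent: blue
      toggle:
        icon: material/brightness-4
        name: Switch to light mode
```

Listing two palette entries is what produces the light/dark toggle; with a single entry you just get one fixed color scheme.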
Now let's show off some of the additional features. Let me start a new page here. Let's call it material.md. And let's start showing content tabs. Material for MkDocs. Save this, go back. It should be picking up on that straight away. Okay, so content tabs are a way of getting tabs in, like, a subsection of your documentation page. And the best way to show it is to really do the demo. So I'll copy-paste this Markdown code into this one here, and I think it needs empty lines in between or it will not be happy. Right, and now here I have tabs in my documentation. So that's very nice if you say, I need to show different examples with C++, Python, different code, for example; this is a very nice way of doing it, because people can just pick what they're interested in. You can also make sure that people can somehow give a preference, like always show me the Python stuff. And it will remember the first time they picked something, and throughout the whole page it will always show the Python example by default. So it does some caching of this as well. To enable this you need to enable two extensions: SuperFences. So SuperFences is something where you can embed content into each other, so you can start with content tabs that then include other stuff, and it basically goes recursively, so you want to enable that. And then you do tabbed with alternate_style: true. Why it has to be true, I don't know, but fine, it works if you do it like this. Code blocks are a very nice thing as well, also built into Material. So let's show that off here. We can do a code block with Python code, and that looks very nice. So this uses Pygments to do the syntax highlighting. You tell it that it's Python here; it doesn't figure that out by itself. You have to tell it. I could try rendering this as shell, and it's probably going to look a bit funny. Okay, so it looks reasonably okay. So all of this works out of the box. You don't have to install any additional stuff to make this work.
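In mkdocs.yaml, the two extensions just mentioned are enabled roughly like this (a sketch; the highlight entry is an optional extra for the Pygments-based code blocks):

```yaml
# mkdocs.yaml excerpt: content tabs and highlighted code blocks
markdown_extensions:
  - pymdownx.superfences     # lets blocks nest inside each other
  - pymdownx.tabbed:
      alternate_style: true  # required for the current tab rendering
  - pymdownx.highlight       # Pygments-based syntax highlighting
```

These are PyMdown Extensions, which ship as a dependency of mkdocs-material, so no extra install is needed.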
It knows about pretty much all the programming languages out there; if you try Fortran here, it will probably still work. Another very nice feature is what's called admonitions, which is a very strange word to me. I'm not a native English speaker and I wasn't familiar with the term, but it's notes, warnings, tips: all of those kinds of boxes you can include in your documentation are called admonitions in Material for MkDocs. A small demo of that is here; let's do, whatever, notes and stuff. Again, it needs an empty line in between or it will not be happy. And you start getting notes. You can use custom titles here. All the admonitions have a particular type, which mostly defines the color and the icon you get, and you can change the title so you don't get the default. The default would just be "Tip", I think; if I remove this, you'll see "Tip" instead. I think there's a more normal name for this. Sorry? Callouts, yeah, OK, fine. Over naming you can always argue; I didn't pick the name, so don't blame me, blame Martin for that. No, no, it's fine. To me it's a confusing name. Another thing I really like, and I know very well that not everybody is a big fan of it, is emojis. You can use emojis in your documentation. I think this is great: it makes things a bit less serious, a bit more lighthearted. You can have some fun in your documentation as well, because some people think it's very boring to read documentation, and for some people this works. There are emojis and there are icons as well, so there's an arrow in here: this arrow-right is not really an emoji, it's an icon, and this works pretty well too. Again, I want an empty line in between here. "So be careful if you have too many Belgian beers; you may get sick in the morning." All right, this really works well for me. And in the documentation for Material for MkDocs there's a search engine, so you can look for "beer", and it will give you all the options you can use.
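A hedged sketch of the Markdown involved; the admonition title and the exact emoji are made up for illustration, and the emoji extension configuration follows the Material for MkDocs documentation at the time of writing:

```yaml
markdown_extensions:
  - admonition
  - pymdownx.emoji:
      emoji_index: !!python/name:material.extensions.emoji.twemoji
      emoji_generator: !!python/name:material.extensions.emoji.to_svg
```

```markdown
!!! tip "Belgian beer warning"

    Too many Belgian beers :beer: and you may get sick
    in the morning. :material-arrow-right: Pace yourself.
```

The `!!! tip` type picks the color and icon, the quoted string overrides the default "Tip" title, `:beer:` is an emoji shortcode, and `:material-arrow-right:` is one of the bundled icons rather than an emoji.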
You can look for "arrow", and if you click something in here, it will copy it to your clipboard, so you don't even have to type it over. Really well done. Over 10,000 icons, so you can probably find something you can use in there. All right, another very cool feature, which I haven't used much myself, is Mermaid. Mermaid is a JavaScript tool to create diagrams. With a block of Mermaid code like this, you can start including graphs in your documentation. These can be very complex, they render very quickly, and it doesn't only support diagrams like this: you can do pie charts, UML diagrams, Git branch workflow kind of stuff, so it's very rich in what it can do. Again, you have to enable the corresponding stuff in your mkdocs.yml: you need SuperFences with some custom fences and so on; just copy-paste this. You start playing with diagrams in your docs, you hit save, and, if you're quick enough... where is this site? Here: you have diagrams in your documentation as well. If you need this kind of stuff, to me this beats putting pictures in there, because here you can copy-paste stuff from your diagrams as well, so this is better in many cases. All right, I think I'm doing quite well on time. The last big feature I want to highlight is the blog support. This is quite new in Material for MkDocs; it has been in the works for quite a while, but now it's finally there in the open-source version. It's a dedicated plugin for integrating a blog in your docs. All you do is enable the blog plugin, and then you can start with a special structure: you create docs/blog/posts and start creating Markdown files in there. Let me show you what happens if you do that. So we want to exit here; you want to make sure that the blog part is set up.
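The "copy-paste this" configuration he refers to looks roughly like the following, per the Material for MkDocs documentation; the diagram itself is an illustrative example:

```yaml
markdown_extensions:
  - pymdownx.superfences:
      custom_fences:
        - name: mermaid
          class: mermaid
          format: !!python/name:pymdownx.superfences.fence_code_format
```

````markdown
```mermaid
graph LR
  A[Edit Markdown] --> B[mkdocs serve]
  B --> C[Live preview in browser]
```
````

The custom fence tells SuperFences to hand any ```` ```mermaid ```` block to the Mermaid JavaScript renderer instead of treating it as a code listing.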
This part is auto-created by MkDocs: as soon as you enable the blog plugin, it creates your landing page. There's no post in here yet, of course, so it's empty. I can create a small Markdown file here; let me copy-paste. You can see this has a date, which is basically the publication date, so if you put this in the future, it will not show up until that date hits. I think so; let me try if that works. So here's the blog: this is the blog post we just added, and it has a dedicated page as well. It's hard to tell here, but the URL will actually use the date you've put in the front matter, so everything is nicely date-stamped. I think if I set this to a future date, it's not going to show it. Let's try February 5th. And now the post is... ah, OK, it's still here. All right, fine. But there is another way: you can set `draft: true`, as in "I don't want to show this yet", and then, at least on the landing page, it should not be there. As long as it's a draft, it will not be shown; if you flip it to `draft: false`, or just remove that, it comes back. OK, so this is built into Material for MkDocs, which is quite amazing. All right, so lots of features, and there's lots of stuff I haven't shown; it can do a whole lot more, so please take a look at the documentation of Material for MkDocs itself. It's a very nice tool. Another aspect I want to talk about very briefly is the way this is funded. Funding is a very big issue for lots of open-source projects, and Martin here has come up with a way that works amazingly well, and it's actually pretty simple. Material for MkDocs is what's called sponsorware. There's an open-source version available to anyone: you just download it from GitHub, you can pip install it, and you can start playing.
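Putting the pieces together, a sketch of the plugin setup and a post file under `docs/blog/posts/`; the date, title, and body are illustrative:

```yaml
plugins:
  - blog
```

```markdown
---
date: 2024-02-03
draft: true   # keeps the post off the blog index until set to false (or removed)
---

# Hello from the new blog

First paragraph of the post, shown as the excerpt on the landing page.
```

The `date` in the front matter drives both the publication ordering and the date-stamped URL of the post's dedicated page.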
But there's also a private version, which has a couple more features already implemented that are not available in the open-source version yet. To get access to this private version, you become what's called an insider: you make a monthly donation to the project, and I think it starts as low as $15 per month, so it's quite affordable. You can also do a yearly donation if you're up for that. Then you get access to all the new features that are only available in the private, insiders version. But eventually these features come back to the open-source version as well, and that happens when a certain funding goal is hit. Martin sets goals: for example, "when I get $10,000 a month of income, all the features I list here will become part of the public version". As soon as they hit that target, it happens. And this is nicely covered in the documentation: on the Material for MkDocs site you can see they're now getting over $13,000 a month, which is quite a lot, right? Martin is actually building a development team around Material for MkDocs thanks to this funding. So right now this is the funding level, and he says: as soon as we hit $16,000, we will move all these implemented features from the insiders, private version of the tool to the public version, and then they stay in the public version forever. Now this is interesting, because they hit the $14,000 target already, but then some sponsors dropped out, and now they're back to a little bit below $14,000 again. But that's fine: once it's public, it stays public. What's amazing to me here is that the private version is just a private fork on GitHub. You get access to that private fork, you get added to the fork essentially as a contributor, so you can access the code.
But this model somehow works. You could say: OK, if I get the private version, I could just give it to anyone, right, and then it stops working. But for some reason, that doesn't happen. It's an honor system: if you sponsor the project, you get access to it. Literally at the bottom of the page it says "please don't distribute the source code that you get access to", and apparently that works. They keep getting new sponsors over and over again, and they're hitting these goals every couple of months. So that's maybe an idea for other open-source projects to take a look at. Martin told me that this was a bit of a jump, a gamble, a let's-see-what-happens, and it's been working amazingly for them: he's able to build a development team rather than having to work on this all by himself. OK, there are a lot of features that I didn't cover, which I'm not going to get into here; check the documentation. One thing I do want to mention: it also makes it very easy to publish your documentation on GitHub Pages or GitLab Pages. MkDocs has a gh-deploy command, and if you integrate that in your GitHub Actions workflow, it will push your docs to GitHub Pages and nicely integrate with your GitHub account. Yeah, that's all I have, and hopefully there's time for a couple of questions. Thanks. Let's have a couple of questions and we'll see how fast they go. First question, very quickly. "I have two of them. Do you know if the icons, not so much the emoji, but the admonition icons, and also the charts, are vector or raster?" So you're talking about these, right? The question is: are these vectors or bitmaps? I'm not sure. I think they're vectors, but I'm not entirely sure. You could check, I guess, if you zoom in... where do I have that website open? Here. If you zoom in, you can tell that these are probably vectors, right? Yeah, right. So they look pretty good.
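As a sketch, the GitHub Actions side can be as small as this; the workflow name, branch, and action versions are assumptions, and `mkdocs gh-deploy` is the MkDocs command that builds the site and pushes it to the `gh-pages` branch:

```yaml
# .github/workflows/docs.yml -- sketch of publishing to GitHub Pages
name: docs
on:
  push:
    branches: [main]
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - run: pip install mkdocs-material
      - run: mkdocs gh-deploy --force
```

Every push to the main branch then rebuilds the docs and republishes them automatically.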
"Maybe it's a stupid question, but is there any kind of translation of this kind of documentation to, say, PDF?" OK, so: is there a way to export the documentation to PDF? I think the answer is no, but they're very much aware that this is a missing feature, let's say, and it's something they want to work on. I'm not entirely sure that's correct, but I think that's what Martin told me. Compared to other tools, there's Docusaurus, there's Sphinx, and there are other things, and some of these tools can do a little bit more. "There's a plugin for PDF export." Oh, look at that, yeah; the plugin system in MkDocs is very nice. Yes? "One of the nice things about Sphinx compared to MkDocs is that you can easily do code documentation, so you just pull something into the documentation and grab the link." You mean like generating API docs? Yeah, there's a plugin for that for MkDocs. I didn't show it here; I actually don't have it on the slides. I had it somewhere, but I think it's... oh boy... docstrings. Yeah, so it's mkdocstrings: that's the plugin you want for generating API documentation. I'm using it in the EasyBuild documentation, for example; works fine. The next question is: did you run into issues with complexity, because it's a tool on top of another tool, and when something breaks, knowing whether the issue is in Material or in the underlying MkDocs, and where to contribute the fix? Not really, because usually if something goes wrong you get a Python crash, and you can tell whether it's in a particular plugin, in Material, or in MkDocs itself. I haven't run into many issues like this, but when it happens, it's usually quite clear.
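For reference, a hedged sketch of how the mkdocstrings plugin mentioned in the answer is wired up; the module path `mypackage.mymodule` is a placeholder, not something from the talk:

```yaml
plugins:
  - mkdocstrings
```

```markdown
<!-- anywhere in a docs page: render API docs from the docstrings of this module -->
::: mypackage.mymodule
```

The `:::` identifier syntax tells mkdocstrings to collect the docstrings of the named object and render them in place.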
And if you don't know, you just report the issue to Material for MkDocs, and one of the maintainers will tell you: it's not an issue here, it's an issue there, you should report it there. And you say one thing on top of another, but it's not entirely that: it's plugins, and they do integrate with each other, so there's some complexity there, but usually it's quite clear; if you get a Python crash, you can tell where it's coming from. Can I stop here? Yeah. Thank you very much. OK, thank you.
Drop the docs and embrace the model with Gaphor
Get rid of all those pesky words and use images instead, which I'm on board with, actually. Yeah. And over to Frank. Thank you very much. So, maybe hold off with the applause till I'm finished and then make up your mind. I'm Frank van Bever and I'm here to give this talk, "Drop the docs and embrace the model with Gaphor", which might be an incendiary statement in this room. Louder? OK, I'll try. Can we also put a mic down? I think people are deciding. So, yeah, this is going to be a quick introduction to what model-based systems engineering is, using free and open-source tools; that's what we're here for at FOSDEM. First, real quick, because this is not the interesting part: I'm Frank, father of two; if I look tired, that explains it. I'm a bass player. I've been successfully daylighting as a software developer for the last 10 years. I trained as an electrical engineer; I actually had my digital systems and DSP courses, digital signal processing, on this floor in this building. I'm specialized in embedded systems, but I like to think I know just enough of the rest of the stack to be dangerous. I work for a company called Mind; we do free and open-source software for embedded systems. If you enjoy dereferencing pointers as a job, then come talk to me later if you're looking for one. Currently I'm a software architect at a robotics company, a company that makes automated guided vehicles. With that out of the way, a quick outline of what I will be talking about. First, I can't talk about what modeling is without explaining what model-based systems engineering is. There are three pillars to MBSE: modeling language, modeling method, and modeling tool. And then I'm going to talk about Gaphor, a free and open-source application.
Gaphor is a modeling tool, and I'll show how you can also use it for documentation, plus additional tooling that you can build around your model. So first of all, what is a model? It's a bit abstract: a model is an abstraction of a system aimed at understanding, communicating, explaining, or designing aspects of that system. A model really is a central repository for the design decisions you make about a system, captured as model elements and the relationships between those elements. Typically you use graphical languages, and you have a set of views that describe the model. However, and this is an easy trap to fall into, these views of the model don't represent the model itself. The model is the entire containment tree; a view is just a single slice of it. Well, I'm from Belgium: the painter René Magritte painted this thing called "La Trahison des images", which translates to "the treason of images". It's exactly the same with these views: the view is not the model. Then, expanding on that, what is model-based systems engineering? MBSE is a formalized application of modeling to support capturing system requirements, doing system design and analysis, and then also the verification and validation of a system, throughout the entire lifecycle: from the initial concept through development, commissioning, and eventually decommissioning of the system. And it's positioned as an alternative to what is called a document-based approach, and that's where "drop the docs" comes from. I'm most definitely not against documentation; the more documentation, the better. But it's a different approach: instead of writing large amounts of prose to describe what a system should do, you use these more formal graphical languages.
I'm guessing most of you are involved in documentation in some way or another, otherwise you wouldn't be here, and my point is that this doesn't replace all your documentation efforts. So, the three pillars. First, there's a modeling language that you need to describe your system. Multiple options exist, typically graphical; I'll be talking about SysML specifically today. The modeling method is then the way you organize your model. Again, there are plenty of options out there, and this really depends on the processes of your organization. If I had to talk about modeling methods properly, I would probably need the rest of the day and tomorrow as well, so that's beyond the scope of this presentation; what I want to give is just a quick introduction. Finally, you need a modeling tool to bring together your modeling language and your modeling method and actually build the model. Mostly these are very large, very commercial, closed-source tools, with a sales process where, if you want to buy one, you first need to make an appointment; the very large tools from IBM and Dassault Systèmes seem to be very popular. But I'm here to talk about Gaphor as a free and open-source alternative to those tools. Now, Napoleon had this quote that a good sketch is better than a long speech, which I think might actually make him the first model-based systems engineering practitioner; he did some other stuff too, but OK. And so, the good sketch: SysML is the Systems Modeling Language. It's a graphical language, and it's actually a profile, which is the extension mechanism of UML, which you may or may not have heard of. There are some differences between UML and SysML, though. First of all, UML is really software focused: it has the concept of a class, and everything is built around that. SysML, on the other hand, moves an abstraction layer higher and talks about blocks.
Another thing that the organization I work for really likes about SysML is its built-in concept of requirements. You have these requirements, they can be refined, you can have derived requirements, and requirements can be assigned to different parts of a system. Generally this is a good way to make sure the necessary information gets to the people doing the actual development work. The point is that SysML has a systems focus, whereas UML is really more of a software-focused thing. There are nine types of diagrams. The activity, sequence, state machine, and use case diagrams are all just lifted from UML into SysML; they're the same thing. The requirements diagram is where you describe the requirements of a system, do derivation, and basically build up a tree of requirements. Then there are the structure diagrams, which have analogs in UML. You have the block definition diagram, where you decompose the system into the blocks that make it up, and you have the internal block diagram, where you take these blocks and show how they are interconnected and what the interfaces between them are. You have the package diagram, which is really the tool you would use for your modeling methodology: it allows you to split off different parts of your model into packages to keep the overview. And finally there's the parametric diagram, a special case of the internal block diagram, usually used when you want to do systems simulation. Gaphor doesn't do this, so I'm not going to go into it too much, but from what I've read on the developer chat it's something that's being worked on, so that's exciting.
Now, I think some of you might be thinking: haven't we tried this before? Sometimes I feel old, sometimes I feel young, it depends; I was a kid back in the 90s, but UML was all the rage during the .com boom, so you might have that reaction of "haven't we tried this before?". Well, if we can believe Vogue magazine, Y2K is entirely back, so yes: this is UML, or rather an extension of UML. But there are some observations I've made in the field. First of all, we have this Miro board proliferation. Miro is an application that's like a digital whiteboard, and it's natural for people: if you want to explain something, the most natural thing is to go stand next to a whiteboard, draw some boxes, draw arrows between those boxes, and try to explain what's going on in a system. So Miro is being used a lot, but without the context of the human sitting next to the board doing the explanation, you start getting these problems: OK, what exactly did they mean here? What is the grammar? What are the semantics? It's hard to understand what a drawing means without a lot of prose next to it, or the actual human being doing the explanation. And the block diagrams that SysML has actually map onto what people are already doing informally. So a bad model is still good documentation: bad not in the sense that it describes a system different from the actual system, but bad in the sense that it's not a perfect application of the SysML specification, because that's what people are already doing anyway. Another thing, of course, is that software architecture is basically systems engineering, and that almost every developer is also a bit of an architect.
I don't know a lot of people who really lack that level of agency in their job, who can't make some architectural decisions themselves. And that is where a free and open-source tool like Gaphor comes in nicely, because it's widely available and easy to trojan-horse into your organization. Which brings me to Gaphor. It's a multi-platform graphical modeling application, written in Python with a GTK UI, and it supports multiple modeling languages: UML, and then SysML, RAAML, and C4, implemented as extensions of UML. It's Apache 2 licensed, free and open-source software, otherwise I wouldn't be standing here. And it's extensible in multiple ways: it supports plugins, but you can also extend it in other ways. A quick disclaimer: I'm not affiliated with the project in any way, I'm just a fan; I really like what they're doing. I have some ideas for development work I could do, but I need to find the time. One of the features I really like is that it integrates very nicely with Sphinx. If you've ever seen a Read the Docs website, that's Sphinx on the back end. You can have your model sitting in a repository; you push your changes, your CI system rebuilds the website and takes the diagrams you've drawn inside your model and plugs them into the Sphinx static website automatically. We've found that this is a good way to communicate architecture to downstream engineering: you can have all the diagrams specifying a specific part of a system, and you draw these diagrams and maybe add a bit of text in between. Even though I said "drop the docs", there's still some sense in writing a bit of prose to introduce things.
But you have these formally defined things that show: OK, this is the idea we have, this is what we're going to build. So it's a good way to communicate these decisions and ideas to downstream engineering. We also intend to use this for architectural decision records: if an architecture decision is made, we put it into the Sphinx site, and if everybody reviews it in CI before it gets merged, then the necessary people can sign off on it before the decision actually becomes written law, let's say. Another thing that's really nice: Gaphor is a graphical application, but because it's Python, you can also just use it as a Python module. So it integrates very nicely with Jupyter notebooks, the interactive programming environment. Your model has an API, so it's perfectly possible to query your model, basically ask the model questions and get answers from it, in this interactive environment. Jupyter itself is also a little programming tool, so you can display diagrams inside your notebook, add text, and really create a narrative structure inside your Jupyter notebook, which can also serve as documentation. So you can explore your model and collaborate with other people. Very nice feature. And finally, because your model has an API, you can test it. We've combined Gaphor with pytest in our CI system, and basically we run a bunch of tests against our model; this is a screenshot I took of the single test that runs against it.
It tests whether all the requirements are satisfied. You can have a block and a requirement, and there's a "satisfies" relationship defined in SysML, so you can say: this block satisfies this requirement. Ideally, if you have a set of requirements, all of them should be satisfied: the plan for your system should satisfy all these requirements, and if that's not the case, you have a problem. Another thing you can test for is: does every block have a reason to be there? One of the things Gaphor already does for you is that if a block is no longer represented in any view, it's automatically deleted. But if a block has no requirements associated with it, that might also raise some questions, like: does this thing really need to be there? Other examples of things you can test against are interfaces: two blocks are connected, and they have ports; these ports expect certain things. Are both parts of the system expecting the same thing? If not, you have a conflict in the contract between those two blocks. Basically, it allows you to detect those types of problems before development starts, so you can potentially reduce some wasted effort there. And you know it's a real CI system because, well, it's red. Anyway, that is basically all I wanted to say today. I hope I gave you at least an OK introduction to this concept of model-based systems engineering and how Gaphor can help you with it. I don't know how much time I have left... Two minutes. Two minutes. Sorry. Whoa, three in a row. Oh, four. OK, left there. I think the key question is: are you going to stay in this room after this talk, or are you leaving?
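The idea behind these CI checks can be sketched in a few lines. This is not Gaphor's actual API (which exposes the model through its own element classes); here the model is reduced to plain data, blocks, requirements, and "satisfies" links, so the two checks described above are visible:

```python
# Sketch of testing a model in CI: two checks the talk describes,
# run against a toy stand-in for a SysML model.

def unsatisfied_requirements(requirements, satisfies):
    """Requirements that no block claims to satisfy."""
    satisfied = {req for _block, req in satisfies}
    return sorted(set(requirements) - satisfied)

def orphan_blocks(blocks, satisfies):
    """Blocks with no requirement attached: do they need to exist?"""
    linked = {block for block, _req in satisfies}
    return sorted(set(blocks) - linked)

# Toy model: two blocks, three requirements, two "satisfies" relationships.
blocks = {"MotorController", "BatteryMonitor"}
requirements = {"REQ-1", "REQ-2", "REQ-3"}
satisfies = {("MotorController", "REQ-1"), ("BatteryMonitor", "REQ-2")}

print(unsatisfied_requirements(requirements, satisfies))  # ['REQ-3']
print(orphan_blocks(blocks, satisfies))                   # []
```

In a real setup, pytest would collect checks like these and fail the pipeline when a requirement is left unsatisfied, which is exactly the red CI run shown on the slide.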
"I would like to ask you several questions." We can go in the hallway and you can ask me. "Right. Do you use it for modeling software systems, or mechanical systems, or electrical systems, or everything?" So it supports modeling all of those. For us, mainly... one model I have built includes electrical components, like off-the-shelf components that we buy, and software components that interact with those things. The system itself also accepts multiple voltages for powering it: it can be 24 volts DC, but also 230 volts AC wall power. It's all stuff that you can put inside your model. "A very, very quick one. I guess you use the model to generate those.
Experimenting with AI and LLM to make docs searchable through a chat application
I'm here as someone who's just interested in this stuff; I'm definitely not an AI or machine learning expert. I'm just a developer writing docs, like a lot of people here in the room, and I have been experimenting a bit, and I want to show you what I've done. I'm Frank. That was my first computer, so I've been programming... how many people are the same generation? Not that many, OK? That's a long time ago. I'm involved in an open-source project called Pi4J; it's a library for interacting with electronic components on the Raspberry Pi with Java. Yes, I'm a Java developer who loves Java; not enough people do. I even wrote a book about it. I have been programming for many, many years, over 20 years, but by writing the book I became a bit of a writer, and I contributed a few of those chapters as articles to a website, Foojay.io, a website for friends of OpenJDK. And that's how I eventually landed a new job: by starting to write about the projects I love and work on, I actually got hired by Azul. Azul is one of the distributors of Java, so you can have a Java runtime created by Azul. And my job is: I'm half of the documentation team. We live on docs.azul.com, where we have several sections about the different products, and from time to time we also blog about experiments with Java, what's changed, performance, stuff that we built, things like that. And of course, last year we had ChatGPT, the new thing, it knew everything. It's a damn good liar too, so you have to be careful with it. But what does it know about Azul? Azul is one of the Java distributions we have at our company, and its answer is reasonably good; I'm happy with this answer. But it also says: this information is based on what I knew as of January 2022. For software, that's a problem. We're already two years further. We have a new security release every three months, and a new version every six months. So although the basic information is correct, it's outdated.
And that's a bit of the problem with large language models. A large language model works completely differently from our brain: the only thing it does is predict the most plausible word after the previous ones. It's based on a lot of knowledge, that's true, but it doesn't do real reasoning. If it tells you a lie and you point out that it's a lie, it will give you another answer, and it will continue doing that until you're happy. Luckily there's a lot of evolution in these large language models. We have these GPT versions; GPT-5 is around the corner. We have no data there, but each of these models is trained on more and more data and gets better. Now, what they say about GPT-5 is that it will also understand video, and I think that's an important thing to realize: they will also train the new model on videos. So if you're a documentation writer and you create, from time to time, a video alongside a blog post or something else, all those sources will be used as part of the new models. By the way, who knows Phind.com? Only a few people. I like it a lot more than ChatGPT: it gives you links to where it found some of the sources it's using. That's one of the things ChatGPT lacks; it doesn't tell you where it found the information, because it didn't "find" the information: it's just reasoning on your question and what the most plausible answer is. Now, if you are a bit familiar with Java and Spring, DaShaun Carter is one of those developer advocates, and what he says is true: the documentation that you're writing and publishing on the public web is the source for these language models. They can only become as smart as possible about your product if the information is available somewhere. And then, of course, as a docs writer, you've definitely heard this question; someone from your management comes to you: can we make a chat version of our docs? Who has had that question? OK, not that many.
Luckily people are researching and trying things out. Vaadin is a Java web framework, and one of its developer advocates has written a nice blog post where he has done exactly that. Vaadin has very good docs on the website, so he took the website and described the whole approach, which takes two steps. Again, open source stuff, it's available online. By the way, all the links from my presentation are on my website. If you go to webtechie.be — the damn password thing is there again — you will find all the links. Sorry for my voice. So what he did is create a little application that went through all their docs and created vectors. Vectors are the base of a large language model; they are a conversion of text to some kind of mathematical model. I don't understand a bit of it, but it's amazing, it works. And then the second step: starting from these vectors, if you ask a question, it will first filter out the documentation which is related and will then create an answer based on that. And it works pretty well. He was pretty happy with how it answered questions about his own doc set. But there are two problems. We are at an open source event, and it has a dependency on two paid services. The first one is Pinecone.io, which is a vector database — you can definitely find an open source alternative online — and OpenAI, which is actually providing the chat API. He found out that it still needs training — and I tried the same example he created with our own Azul docs — you still need to do some training and rewrite some of your documentation to really get good answers. And it doesn't provide you, again the same problem as ChatGPT, links to your documentation. So when I tried this, and when he tried this last summer, it was not really the right time to do such a thing. So, go to your management: no, not yet. But last October we had the Devoxx conference in Antwerp. It's an amazing event.
If you love Java, if you love software development: it sells out faster than Tomorrowland. They have 3,000 tickets; they were sold out in five minutes. Then they had 500 additional tickets; those sold out in two seconds. So it's easier to be a speaker than an attendee at that event. That's how I fixed it. Now, Lize Raes, she's from Belgium, is one of the developers of LangChain4j. LangChain is a Python library for doing stuff with OpenAI and all these chat-based things and machine learning. LangChain4j is a Java version of that Python library. During her talk she gave 12 demos in one hour. The last one was: how do I interact with an existing text? So she gives the chat system a text, a story about, I cannot even remember what, and then she asks specific questions about that story, and she gets answers from that story. And that's what we're looking for: how can we interact with our own documentation? So this looked promising. And again, when you're at a conference and you get inspiration — I had a few tools that I had already taken a picture of that I wanted to try out — this stuck in my head. And luckily we have FOSDEM and the Tool the Docs devroom, so I had a reason to try something out. And that's exactly what I did. If you go to the LangChain4j examples repository on GitHub, since two weeks I have added a little JavaFX application there as one of the examples. JavaFX is a tool to create user interfaces. Yes, I'm a Java Champion, I have to sell Java today. So what this application does — it still relies on OpenAI, sorry, so you have to buy a few credits; with all the experiments that I've done I spent a few dollars, not that big of an effort — is it remembers your previous questions. So I asked it to pick a random boy name and a random girl name and then tell me a fairy tale of five sentences. And you see that the fairy tale picks them up again: there were two children named Ethan and Olivia, the answers to the previous questions.
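The "remembers your previous questions" behaviour boils down to resending the whole conversation with every request. A minimal sketch of that pattern, with a stand-in for the real model call (LangChain4j wraps this in its chat memory support; the class and method names below are invented for illustration, `sendToModel` is not a real API):

```java
import java.util.*;

// Minimal sketch of chat memory: every request to the model carries the
// full conversation so far, which is how "remembering" previous questions
// works. sendToModel is a placeholder for the actual OpenAI call.
public class ChatMemoryDemo {
    record Message(String role, String text) {}

    private final List<Message> history = new ArrayList<>();

    String ask(String userText) {
        history.add(new Message("user", userText));
        String answer = sendToModel(history);   // the model sees all prior turns
        history.add(new Message("assistant", answer));
        return answer;
    }

    // Stand-in: reports how much context the model would receive.
    static String sendToModel(List<Message> context) {
        return "answer based on " + context.size() + " message(s) of context";
    }

    public static void main(String[] args) {
        var chat = new ChatMemoryDemo();
        System.out.println(chat.ask("Pick a random boy name and a random girl name"));
        System.out.println(chat.ask("Now tell me a fairy tale about them"));
    }
}
```

The second question works only because the first answer (with the chosen names) is replayed as context — that is why the fairy tale knows the names.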
So you can have a chat with an application within Java, with reasoning. But this is based on OpenAI and what it already knows. So then I went a step further. For the docs that we create for the Azul docs website, we use the Algolia search engine — it was already mentioned a few times here; we are a company, so we can afford to use a third party for this. But to feed it with data, we already created a little tool that breaks our docs into sections. Every header becomes a JSON block — I know it's hard to read — with the title of the header, a link to the page, a link to the specific anchor on that page, and then the content which is under the header. So we already have that JSON: a structured data set of our docs. We can use this. Can we chat with an application against this documentation set? That was my idea. Can I do that? I know this is not a coding conference, but still, let me dive into it, because I like Java — I think that's clear — and it allows you, thanks to these amazing libraries, to create powerful applications with minimal code. The part you see here is actually about the UI, so I will go to the chat service. What I do here is a few things. First I have this JSON; it has over 1,500 records. The application creates an object of each of these records and puts them in what's called an embedding store, I think. So here it creates — where is it? — the embedding model. Again, I'm not an expert; I have no idea what it's doing behind the scenes. I just found out it works and it does some great things. And then, if you ask it a question, it will search, out of all these 1,500 blocks, the 10 most relevant ones, and then give those 10 text blocks to ChatGPT. And ChatGPT will create an answer out of them. And when we ask ChatGPT to create an answer, we also give it some rules: do not provide any additional information.
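The retrieval step just described can be sketched as follows. Each docs section becomes a record mirroring the JSON fields mentioned (title, link, content); for a question, every record is scored and the most relevant ones are kept for the prompt. The real application uses LangChain4j embeddings and vector similarity; here a simple word-overlap score stands in for the vector math, and the sample sections are invented:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of retrieval over doc sections: score every section against the
// question, keep the top-k to hand to the model. Word overlap stands in
// for the real cosine similarity over embedding vectors.
public class DocsRetrieval {
    record Section(String title, String url, String content) {}

    static double score(String question, Section s) {
        Set<String> q = new HashSet<>(Arrays.asList(question.toLowerCase().split("\\W+")));
        return Arrays.stream(s.content().toLowerCase().split("\\W+"))
                     .filter(q::contains).count();
    }

    static List<Section> topK(String question, List<Section> sections, int k) {
        return sections.stream()
                .sorted(Comparator.comparingDouble((Section s) -> score(question, s)).reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Section> docs = List.of(
            new Section("Install Zulu", "/zulu/install", "how to install the zulu jdk on linux"),
            new Section("Release notes", "/zulu/releases", "security release every three months"));
        // keep the most relevant section for the prompt
        System.out.println(topK("how do I install zulu on linux", docs, 1).get(0).title());
    }
}
```

In the real application k is 10 and there are 1,500+ sections, but the shape of the pipeline is the same: retrieve, then ask the model to answer only from what was retrieved.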
I will show you why later. I try to make it not provide answers about other programming languages, but it just ignores that; it still answers questions about Python, for instance. I don't know why. I said it's a damn good liar, but it's also a cheater. And: if you don't find the answer in those 1,500 elements, then just say "sorry, I could not find an answer", because you don't want your chat system to come up with something else. So this is the application. I should probably have made the font bigger. You see we have 1,522 embeddings. So if I ask it: what do you know about Azul Prime? You see those are the 10 links I now have to the specific information. It's a demo, so it should fail... no, an exception. It's going to the network indeed, but in cases like that we have, of course, the too-many-open-windows exception. Okay, good, I recorded this at noon, just in case. Converting those 1,500 elements to vectors takes some time: about a minute and a half before the application starts. If you run this on a server you don't mind; you start it once and then you get your answers. So you see the answer streaming back; it's really a chat-like interface, and it gives a pretty good answer. And I know the docs, and that's the handy thing in this case: I know what it should answer, so I can really try it out and see if I get the expected answers. Like, for instance, we have several products — do I get the right installation instructions for the product I'm asking for? And it's really answering with the right results. I could remove one of those commercial dependencies, the vector database, because that's now inside my application. I still depend on OpenAI; we'll come to that. It still needs training. And actually the training is our fault as docs writers, because I found out that the chat cannot tell me the difference between two of our products, and if I dive into the documentation, I understand why the chat cannot answer.
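The rules handed to the model alongside the retrieved blocks are just text in the system prompt. A sketch of what such a guardrail prompt could look like — the wording here is illustrative, not the actual prompt from the demo application, and as the talk notes, the model may ignore these rules anyway:

```java
// Sketch of guardrails sent with the retrieved documentation excerpts.
// Illustrative wording only; models do not reliably obey such rules.
public class GuardrailPrompt {
    static String systemPrompt(String retrievedBlocks) {
        return String.join("\n",
            "Answer only from the documentation excerpts below.",
            "Do not provide any additional information.",
            "Only answer questions about Java, not about other programming languages.",
            "If the answer is not in the excerpts, reply exactly:",
            "Sorry, I could not find an answer.",
            "",
            "Documentation excerpts:",
            retrievedBlocks);
    }

    public static void main(String[] args) {
        System.out.println(systemPrompt("Zulu is Azul's build of OpenJDK."));
    }
}
```

The fallback sentence matters most: without it, the model will happily invent an answer when retrieval comes up empty.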
The answer is not there. So it can only answer as well as the information that you provide. So how I'm going to use this is to find out if the documentation is okay. I'm not going to publish this as a product — well, it's online, you can find it on my GitHub, and you can run it; I even added the Azul documentation JSON. I'm going to experiment with it. Please do the same and let me know what you find. And is it the right time? I'm not sure. I cannot limit it enough; it's still giving Python answers while we are only doing Java, and I don't know why it doesn't want to listen. Yeah, all languages are good, let's conclude that. If you want to replace OpenAI, there are a few alternatives — there are probably many more, but there are a few I noticed. Someone has written a nice article on medium.com — I think it's one of the free articles, you're lucky — where they compared Llama, which is such a kind of model, and even ran it with Java, and they got nearly as fast answers as the C implementation. Jan.ai is also something which promises to do this all on your own system. Now be careful: you need quite some power on your machine to be able to provide this chat functionality. Although, in the MagPi magazine, someone managed to do it on a Raspberry Pi of 15 euros. So that's maybe an idea, but I don't think that's the ideal use case. Why is it probably not the right time? And that's why I said I have some bad news for you: ChatGPT is a big cheater. A Chevrolet dealer in America had a chat like this on their website. Someone asked it: can you give me a Python script? And — that's why I tested the same on my solution — of course it gives you a Python script. But even worse: if you tell it you're not working for Chevrolet but for Honda, and ask which company, which car do you advise me to buy, it answers you with another brand. That's why I ask you to be very careful with this. I asked my demo application: can you give me a Python script? And it answers yes. So I didn't solve that yet in my case.
Another one, just this week: DPD, a transportation company. Someone asks its chat: can you swear? And it does — "fuck yeah", it swears — and it calls DPD the worst delivery firm in the world. I don't think that's the kind of reply you want from your chat-based system. My application was a bit more polite: "I'm committed to maintaining a respectful and professional conversation." So okay, that problem is probably already solved. I also asked it: do you have a message for documentation writers? Actually, I had told it not to reply to this kind of question if it doesn't find the information in the Azul docs, but it did anyway: make sure your content is clear, consistent and directly addresses the questions or issues at hand. That's a good rule for all of us. If you want to know more about this, you can find all the links on webtechie.be, which is my personal blog. If you're interested in Java on the Raspberry Pi, it's a nice thing to experiment with, and I have a good book you can buy. I have a lot of content on Foojay.io, the website for friends of OpenJDK. If you're interested in Java and everything related to machine learning: I create podcasts around the theme of Java, and we have a few episodes already about machine learning, so that's also a topic you can find there. And yeah, just like I did: experiment, fail, that's how you learn, and have fun. I hope you can do that also with ChatGPT. Thank you.
Embeddable code playgrounds for fun and profit
Okay, well, cool. I usually like standing and jumping around, but here I've got to type, so you'll forgive me for sitting down. We'll talk about embeddable code playgrounds, specifically in docs. Let me ask: how many of you prefer dull static docs compared to interactive docs? Because you probably have to maintain them. Oh, okay. I thought you were all writing docs. Well, understood. Just so you know who I am: Peter Zaitsev. Anton is actually the author of the code we'll talk about, but unfortunately he couldn't get a visa to come here, so you're stuck with me. But if you have any super advanced questions, Anton's contacts are there and you can send them to him; he's a very responsive guy. So, if you think about it, interactive code playgrounds and interactive scenarios generally work better for explaining topics and also for engaging the reader — maybe not all the readers, but the best ones, the most curious ones, the ones who actually want to understand how things work. We'll look at three items in this short presentation: the use cases, the approach we have in this open source project, and the implementation. First, let's look at tutorials. In a tutorial you often want to explain something by example. We can look at this very simple case: we are actually using a real live SaaS out there which provides a very simple database where we can push a simple JSON object. We can go ahead and run it. What happens in this case: it does the interaction as described above, then sends the object to the database. What we can also do is go ahead and modify that and run it again, and then we can see this object was stored. Now, we want to demo the cloud API. To play with it, we can go ahead and also use the GET call.
Let's say we have a second message, and we can ask: is there some message number 45? We can see, well, it's not there. Again, if you really want to experiment and see what works and how it works, this can be a very beautiful way to do it. Another cool way we found this being used is release notes. The example we have in this case: Golang just recently made a very important change to how loop variables interact with goroutines. If you look at this case, it looks a little bit counter-intuitive: we have goroutines called in a loop, and for some reason they are not showing different loop counters. Well, in Golang 1.22 that was fixed. If you want to showcase a feature in the release notes and the documentation, but really let people explore it and poke holes in it, I think this is a wonderful tool. I'm not sure about you, but I often read about some feature — I do a lot of work with databases — where they say, hey, we implemented this new feature, and I want to poke at it: oh, did you implement that option, or does it work in this way? This is a very easy way to play with it compared to going through the whole installation process and so on and so forth. Another example we can see is describing some of the options in documentation. Take curl — everybody here could use curl — it has this wonderful --json option, with a very cool, correct, but also quite a mouthful example, which we can also go ahead and make runnable. Say, hey, this is a JSON object, we post it to the server, and this is what we get in return. This httpbin is actually another well-known open source project which essentially allows you to post something and get back exactly what you posted, with all the headers and so on and so forth — very convenient for debugging.
What you can also do here, if you are curious: interesting, so curl has support for JSON — does it validate the JSON, or does it just send whatever we have? Well, let's check it out. We can see in this case that we are getting an error response back from the server rather than some sort of curl error, which means it doesn't validate. Again, this can be very helpful to let the user explore what they're not very certain about, which may not be quite explained in this portion of the documentation. Or we can also make runnable the example which already existed in the docs of how to send the payload from a file. Pretty simple here. Okay. If you are looking at deep dives, there may be interest in more functionality. Going to databases, the space where I spend a lot of my time: let's say we want to describe what an upsert is in SQL. Anybody heard what an upsert is? Right — that is something like: we want to insert the data, but if it's already there, we want to update it. Very common. So let's say we have this table out there, and we want to go ahead and use MySQL's insert-or-replace syntax. We want to, as I said, update one employee's salary and also add another one, and we can go ahead and run it. Why use this as the example here? Because, as you can see, we are not showing everything in the example: we are working with some sort of seed data which was pre-created as part of a previous scenario, which is very common. Here is also another example of the same thing, but with Postgres, where we're using a different scenario. And you may ask: okay, this is how it works, but I know PostgreSQL also has the syntax ON CONFLICT DO NOTHING — what would happen if that's what we do? Well, in this case we can see that Emma's salary, the conflicting row, was not changed.
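The upsert semantics described here ("insert the row, or update it if the key already exists") can be sketched with a plain map, outside of SQL. MySQL spells this `INSERT ... ON DUPLICATE KEY UPDATE` (or `REPLACE INTO`), PostgreSQL `INSERT ... ON CONFLICT DO UPDATE`, with `DO NOTHING` keeping the existing row; the employee names below are just example data:

```java
import java.util.*;

// Upsert semantics sketched with a map keyed like a primary key:
// upsert() inserts or overwrites, insertOnConflictDoNothing() keeps
// the existing row on a key conflict.
public class UpsertDemo {
    static void upsert(Map<String, Integer> salaries, String name, int salary) {
        salaries.put(name, salary);              // insert, or update on conflict
    }

    static void insertOnConflictDoNothing(Map<String, Integer> salaries, String name, int salary) {
        salaries.putIfAbsent(name, salary);      // conflict: leave the old value
    }

    public static void main(String[] args) {
        Map<String, Integer> salaries = new HashMap<>();
        upsert(salaries, "Emma", 4000);                    // insert
        upsert(salaries, "Emma", 4500);                    // update existing employee
        upsert(salaries, "Noah", 3800);                    // add another one
        insertOnConflictDoNothing(salaries, "Emma", 9999); // conflict: Emma unchanged
        System.out.println(salaries.get("Emma") + " " + salaries.get("Noah")); // prints "4500 3800"
    }
}
```

This is exactly the behaviour the Postgres demo shows: with DO NOTHING, the conflicting row's salary stays as it was.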
So again, we can play with those things. Okay, these were setting the landscape of what can be useful; now let's look at the approach and how it works. Now, if you think about the tools and doc creation, you will find it is not easy to find good technical writers or documentation authors, and they can also be rather, let's say, protective of their time: they don't want to do a lot of useless work. So we want to make sure the writer experience is important too, not just the reader experience, which we already defined as one with interactive playgrounds. The approach we took in this project is to make it as seamless as possible. We don't want to say: hey, you know what, you are going to create your interactive code playground in completely different tooling, separate from your documentation, and then figure out whether that is going to live in the same version control and so on and so forth. What we want is to take your documentation as it is, and, as easily as possible, add the ability to edit and run. So you can see we added the Run and Edit here, and if I run, you can see what the output of that documentation example is. So how can we approach that, an integration which is easy on the writing side? Well, it's actually quite easy. You write documentation in the same format you are used to — let's say maybe it's Markdown, as in this example, or something else — and then you embed this codapi widget. That widget will itself pick up the previous code block and make it interactive. So there is no special thing required, and that pretty much works in any documentation system which already exists.
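In practice the embedding looks roughly like this. The `codapi-snippet` element with its `sandbox` and `editor` attributes follows the codapi-js documentation; the sandbox name and the code block are illustrative and depend on your codapi setup:

```html
<!-- A plain code block, exactly as the docs already render it -->
<pre><code>SELECT 'hello world';</code></pre>

<!-- The codapi widget attaches to the previous code block and adds
     Run/Edit buttons; the sandbox value must match your codapi config. -->
<codapi-snippet sandbox="postgres" editor="basic"></codapi-snippet>
```

The widget finding its code in the preceding block is what makes this work in any existing documentation system: the source stays an ordinary fenced code block.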
So you can see that example here: the code which already existed just becomes interactive. Well, of course, hello world is always easy, so let's look at some more complicated examples. One which I think is very important is the template approach, which I briefly mentioned already. If I want to show something like this — a relatively complicated query — for it to be meaningful, I also need a pre-generated table, which I probably do not want to have in my documentation. And this is done by providing a template. A template in this case is basically something which is run before the scenario is run. So I can write some text: hey, I created a table — without spelling out all the specifics, because they're irrelevant in this case — I populated it with some data, and then I have the code which was shown before. That is how the template would look. So I can highlight where exactly, in that context, I want to run the code which is the interactive part of the documentation. Okay, here is another thing which you will find quite helpful. If you are building some sort of tutorial, you will often want to say: hey, there are actually multiple steps which I need the user to go through, one after another. And that is the example here: what you can see is that we are defining a function in one code block and then we are using that function in another code block. We can — and I'll show you in a second — define a dependency between those code blocks. That means when you are running the second section, the first section will always be run first. Let me, I don't know, let's say break this code, for example, and then run the second one. It says: oh well, you know, things got broken in the previous step.
And how that works is: we give the first one an id, we refer to it as cell number two, and then we mark the second snippet as a cell which depends on cell number two. That means the content of that first cell is going to be run before the second cell is run, even if you, as a user, don't click run on it. So if you say, hey, I don't want to go through all those five steps in the tutorial, I want to start with step number six because that is where the real meat happens — you can do it. You can just jump into the middle. Okay. So finally, how does this all work? Well, there are actually a couple of ways. One is we can have a browser playground and a sandbox environment, which is pretty much Docker-based, and in the browser we can use browser APIs, JavaScript and whatever. The second approach is WebAssembly. So if you say, hey, you know what, we want no server component at all, it runs completely in the browser — we can do that. But WebAssembly can sometimes be heavy. Especially if you say: well, I want to showcase how Postgres operates — getting all of Postgres pulled in as WebAssembly and started may not be the best experience, especially on slower connections. So that is where Docker can be very helpful. With Docker you can implement whatever you want, and the setup of this service is an open source project, so you can roll your own as well. There is also a variety of existing playgrounds which are supported on the codapi website, which can get you started pretty quickly. Yes, so here are some examples, and I will of course share the slides. This is a live tutorial online. You can see there are a number of projects which already started to use this, with pretty good success.
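Based on the codapi-js documentation, the cell dependency described here is expressed with `id` and `depends-on` attributes; the ids and sandbox name below are illustrative:

```html
<!-- Step 1: defines the function / creates the table -->
<pre><code>create table employees (name text, salary int);</code></pre>
<codapi-snippet id="create-table" sandbox="sqlite" editor="basic"></codapi-snippet>

<!-- Step 6: replays "create-table" first, even if the user never ran it -->
<pre><code>insert into employees values ('Emma', 4500);</code></pre>
<codapi-snippet sandbox="sqlite" editor="basic" depends-on="create-table"></codapi-snippet>
```

This is what lets a reader jump straight to step six: the widget walks the dependency chain and runs the earlier cells' content first.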
And you can see on the codapi.org showcase where all the examples exist. Here are the specific projects. There are two sub-repositories: one is the JavaScript client side, and the other is the server side. It's split because you may just want to use the client side if you're using JavaScript or something else where you don't need a server component. And if you want to ask Anton some more questions or give some feedback, antonz.org is his website. So that's all I had, and I would be happy to answer questions, or to get out of the way, because I think I'm the last thing standing between you and your beers. Yeah, we started with docs as code, and now we've gone to code in docs. Any questions? If I understand correctly, is the back-end also part of this project or not? Yes, yes — in this case codapi, that is the Docker back-end, and codapi-js the client; I think both of them are open source. What do you mean? Oh, you mean in terms of what people run, what kind of... so, not right now.
A universal data model for localizable messages
Hi, I'm Eemeli. I work for Mozilla. This is a talk I literally don't think I could give anywhere else except to an audience like the Translations devroom at FOSDEM. So I thought I would. In my work at Mozilla on localization systems and tools and standards, I've recently ended up spending quite a bit of time participating in the Unicode Consortium's project to define MessageFormat 2, an evolution of the ICU MessageFormat standard and a bunch of other things. And I'm here not talking about that specifically, but more about a side product of what we've ended up doing through that work, which is defining a data model for messages. In particular, messages that are not just a single segmented phrase that you've extracted and might be able to send to translation, but more dynamic content as well. And one of the interesting things about what we've ended up — not discovering, but kind of stating the obvious — is that there's an upper bound to what makes up a localizable segmented phrase or message, really. This is limited by the keyword "localizable" there, because it's dealing with humans: humans who need to understand it, but also translators — who, well, are still mostly humans — who need to be able to take in the source message and produce an output from it that is understandable in their locale. And this ends up depending on a limited number of different dimensions in which messages vary. Variants — I've kind of hidden it there as the first one, and of course spoiled everything by saying so — are the way that message content can vary depending on inputs like numbers and their pluralization categories (you have no apples, you have one apple, you have 15 apples) and gender-based determinants: grammatical gender, personal gender, all sorts of various things in different locales and languages. But this is one dimension.
If you can express that, hey, we've got this variance happening, this message depends on these input values — this is a dimension that we can express. Then of course we have a single pattern, a single sequence, which might include placeholders: it might include the number n for how many apples you have, or it might be something entirely different. But then, finally, we've ended up — at least through the MessageFormat 2 work — determining that markup should be kept as a separate thing from placeholders. Markup here means something like HTML, but it doesn't need to be HTML; it can be any sort of syntax or indicator that says that the content in this part of the message has these attributes, or this something about it. Then, within a placeholder, we can have values like numbers that we need to deal with. They can be just strings coming from an external source. We can also have annotations on them: we can say that this number here, yes, it's coming in as a number, but I want it formatted so it has at least two fraction digits, for instance. This needs to be accounted for in how the whole message ends up being formatted. Then, as I mentioned, we need to be able to deal with variables, because we are ultimately talking about the scope of dynamic messages. We need to be able to say explicitly that this message might be getting a number from the outside; it might be getting some string content; it might be getting anything as input, and it needs to deal with those. But sometimes, within a message, we also want to do a little bit of further processing on a variable value: we may want to select a certain part of it, capitalize it if we're talking about a string, do other sorts of transformations, or express the same value in multiple different ways within the message. So we need a little bit of tooling to deal with variables. And that's it.
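The apples example, with its plural variants and an annotated number placeholder, could be written in MessageFormat 2 syntax roughly like this — the exact spelling was still evolving in the draft specification at the time, so treat this as illustrative:

```
.input {$count :number}
.match {$count}
0   {{You have no apples}}
one {{You have one apple}}
*   {{You have {$count} apples}}
```

All the dimensions just listed appear here: a declared input variable, a selector, per-category variants with a catch-all key, and a placeholder with a `:number` annotation.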
Through the MessageFormat 2 work of the past four years or so, we've not come up with effectively anything else that is really core, driving the qualities of a formattable message. And that's ended up meaning that one of the things we've produced out of this whole project is this data model for what a message looks like when you don't consider its syntax, when you consider it as a data structure. I'm not going to go through all of it, but roughly speaking, we can say that a message has two different forms it can take: it can be either just a single pattern, a single sequence that we're dealing with, or it can have some variants. That's the select message over there, which has some selectors that, when formatting, guide us towards selecting one of the variants of the message. The declarations let us declare the variables — inputs, or local to this message — that exist. And then the variants and the catch-all key end up defining, when we have multiple message variants, how exactly that works. And then when you get to within a single pattern, again, as I alluded to, it can obviously just contain literal content, a string; or it can have expressions, placeholders that is; or it can have markup, which can be opening or closing. We also included standalone there, so you can have an element — for example an inline image — be expressed within a message. Then we can have literals, variable references, and the annotations that I mentioned. That's it. These two slides define the whole data model that we've ended up dealing with. Okay, I left some tiny details out — like, for example, that an expression needs to have at least one of an argument or an annotation in order to be valid, and minor details like this. But that's it. This is, we think, a universal data model for representing messages.
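The two-slide data model just described can be sketched as Java types. The names follow the MessageFormat 2 data model only roughly; this is an illustration, not the normative definition, and some details (attributes, reserved parts) are left out:

```java
import java.util.*;

// Sketch of the message data model: a message is either a single pattern
// or a select with selectors and keyed variants; a pattern is a sequence
// of literal text, expressions (placeholders), and markup.
public class MessageModel {
    sealed interface Message permits PatternMessage, SelectMessage {}
    record PatternMessage(List<Declaration> declarations,
                          List<PatternPart> pattern) implements Message {}
    record SelectMessage(List<Declaration> declarations,
                         List<Expression> selectors,
                         List<Variant> variants) implements Message {}
    record Declaration(String name, Expression value) {}
    record Variant(List<String> keys, List<PatternPart> pattern) {}  // "*" as catch-all key

    sealed interface PatternPart permits Text, Expression, Markup {}
    record Text(String value) implements PatternPart {}
    // In the real model an expression needs at least one of an argument
    // or an annotation; here both are nullable strings for brevity.
    record Expression(String variable, String function) implements PatternPart {}
    record Markup(String kind, String name) implements PatternPart {} // open, close, standalone

    public static void main(String[] args) {
        // "You have {$count :number} apples" as a single-pattern message
        Message msg = new PatternMessage(List.of(),
            List.of(new Text("You have "),
                    new Expression("count", "number"),
                    new Text(" apples")));
        System.out.println(msg instanceof PatternMessage); // prints "true"
    }
}
```

The point of the talk survives the simplification: any gettext, ICU, or Fluent message can be parsed into a structure shaped like this, independently of its source syntax.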
And I'm here basically saying: hey, I think this is kind of cool. And this is not necessarily relevant just for the work specifically to do with the MessageFormat 2 syntax. It's more that this is effectively a data model that allows us to separate the concerns around syntax — whether your messages are stored in gettext files, ICU MessageFormat, Fluent, literally any format. You can take that syntax and parse it into this data model structure representing a message. And this is, I think, leading us to a world where we can consider more of a UNIX philosophy for what we do next, a separation of concerns. And I have, yes, cherry-picked explicitly the part of the UNIX philosophy where it says to do one thing and do it well — and not included, for instance, the part about communicating values from one process to another as strings, because that's kind of not working so well here: we need those parsers, and if we need to understand all of the structure in a message every time, we end up, for the most part, mixing up the syntax concerns with everything else we're doing with messages. So, some of the things you can do with this data model, as ideas: if you can read and write from a syntax to this data model, and you can do this with multiple syntaxes, this is effectively an interface through which you can take messages from one syntax, turn them into this data model representation, and go from there to any other syntax — with caveats, but roughly. Another thing is that we can build tooling on top of this. You can build a linter or a validator on top of the data model representation of messages, rather than on any syntax representation. And this means that you can use the same validation for all messages, independently of what syntax they might be coming from.
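A syntax-independent lint of the kind just described might check that a source message and its translation use the same variables. For this self-contained sketch, the "data model" is reduced to just the set of `{$variable}` names pulled from a message string; a real implementation would walk the parsed expression list instead:

```java
import java.util.*;
import java.util.regex.*;

// Sketch of a lint working on message content rather than file syntax:
// extract the placeholder variable names and compare source vs translation.
public class PlaceholderLint {
    static Set<String> placeholders(String message) {
        Set<String> names = new TreeSet<>();
        Matcher m = Pattern.compile("\\{\\$(\\w+)\\}").matcher(message);
        while (m.find()) names.add(m.group(1));
        return names;
    }

    static boolean sameVariables(String source, String translation) {
        return placeholders(source).equals(placeholders(translation));
    }

    public static void main(String[] args) {
        System.out.println(sameVariables("You have {$count} apples",
                                         "Tu as {$count} pommes"));   // prints "true"
        System.out.println(sameVariables("Hello {$name}", "Bonjour")); // prints "false"
    }
}
```

Because the check runs on the extracted structure, the same validator applies whether the messages started life in gettext, ICU, or Fluent files.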
And if you have these capabilities, it matters, because many localization systems right now are very much monolithic. They have expectations that this is the exact syntax, in these sorts of files, that is used for messages or resources; this is exactly how you deal with them; this is what gets included in your output or your program; and this is exactly how it works. But as we're defining here a data model that can be read from any of these syntaxes, it means that you can build a different sort of formatting or runtime on the same syntax. So you can start from where you are now and maybe consider whether you want to change how you're doing localization. You don't necessarily need to change everything all at once: you can take just the formatting runtime, change that to work with the same messages you've got, and move on from there. Or vice versa, actually: you could change the message structure, how you store your messages, and still use the same runtime, because this brings in an ability to very effectively transform your messages from one syntax to another. And when you're dealing with localization, you of course need to deal with translation, which means that you need to somehow present to your translators the messages they're working with. And if a translation tool or framework is going through the MessageFormat 2 data model, it means that you can build an interface for localizers where they don't need to know what the underlying syntax is for the placeholders, the variables, the markup, or anything else; they can be presented the same thing for all syntaxes, which might make things a little bit easier for everyone. So those are the ideas I came up with for what could be the next steps from here. But I'm here saying, hey, this is a cool thing, you should play around with it.
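A minimal sketch of such a swappable formatting runtime, consuming only the data model (again plain dicts, my own simplification, not the spec's types), with variant selection by literal key match and a catch-all fallback:

```python
# A tiny formatting runtime over the data model. Because it never
# touches source syntax, it works for messages parsed from any format.

def format_message(message, values):
    def render(pattern):
        out = []
        for part in pattern:
            if isinstance(part, str):
                out.append(part)
            elif part.get("type") == "variable":
                out.append(str(values[part["name"]]))
        return "".join(out)

    if "pattern" in message:                       # single-pattern message
        return render(message["pattern"])

    # Select message: pick the first variant whose keys match the
    # selector values, falling back to the catch-all "*" keys.
    selected = [str(values[s["name"]]) for s in message["selectors"]]
    for variant in message["variants"]:
        if all(k == "*" or k == v for k, v in zip(variant["keys"], selected)):
            return render(variant["pattern"])

msg = {
    "selectors": [{"name": "count"}],
    "variants": [
        {"keys": ["1"], "pattern": ["You have one apple"]},
        {"keys": ["*"], "pattern": ["You have ",
                                    {"type": "variable", "name": "count"},
                                    " apples"]},
    ],
}
print(format_message(msg, {"count": 1}))   # You have one apple
print(format_message(msg, {"count": 3}))   # You have 3 apples
```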
For us, the current and ongoing work is to extend this sort of definition to also include message resources, and to include the comments and metadata that are quite essential for communicating the context of a message to translation, which, as I'm hoping some of you noticed, was completely left out of the earlier slides. But that's intentional, so that we can separate these considerations from each other. But that's it for me. Thank you very much for listening. I'd be very happy to have any questions or comments. In another talk, I heard about MessageFormat 2 and function invocations; how do function invocations relate to the data model? The question is how function invocations relate to all of this. They are represented here in the function annotations. So something like plural selection could use a function with the name plural, for instance, existing in a select message as a selector expression, which is there. The next question was whether there is a set of built-in functions that are supported. MessageFormat 2 does come with a relatively small set of built-in functions. The data model itself does not presume this set absolutely, and the set of functions can be extended. For MessageFormat 2 in particular, we are looking at a very minimum of effectively number, which does plural and ordinal selection but also acts as a formatter, and then string, which is a sort of default for string formatting but also does the same sort of selection as the ICU MessageFormat select does. And we are still discussing for MessageFormat 2 what other things to include here. Now, of course, when representing messages coming from some completely other syntax, it is entirely conceivable that it is not directly possible to express those messages using the functions that MessageFormat 2 defines out of the box.
But the data model does allow you to define that a function like this must be used here, and you can otherwise define how that function works, if that makes sense. And it's possible to make translations between these function meanings. Anything more? The reason to separate context from the minimum required, and here I'm jumping into the answer, is that the context is absolutely required for the translation work, but it is not absolutely required for formatting a message. So we need to be able to represent it, but we do not absolutely need it to be a part of the message itself when it is being formatted. And this is why we are dealing with it slightly separately. They are very much related concerns, but with the data model we've tried to find the minimum required for representing a message, and when you trim down to the minimum, the context ends up as a thing we can define externally, so we've chosen to do that. And if you're interested in that, in particular the specifics of what we should include in the base set of metadata and context fields, there's an issue link here where we're discussing this right now, and I would be very happy to have your input on it. Anything more? Regarding the translator tools: right now most translator tools present a string and expect that the translator will write in a string. Do you imagine that this will change, and that the translator will see the elements of the data model in a more graphical way and choose translations through toggle boxes or something like that? Or do you think it will stay as a string representation for the translators in the future? I have no idea, and anything is possible, and that's kind of cool. Predicting the future of the translator experience here is, shall we say, a hard question.
One thing I do think is that this sort of data model makes it easier to build tools that can present to a translator more of the options and opportunities they might have in modifying a message and content like placeholders and markup, which might just show up as syntax when presented as a string, where it can be a challenge to even realize that you could change how this bit of it is styled. But if we can present interfaces that read the data model and understand from it that, hey, hang on, this could be tweaked this way, then richer interfaces could be built. However, of course, we do need to keep in mind that in the vast majority of cases a message is best represented as a pure string. So the majority of the work is not going to change, but the corner cases are where it gets interesting and challenging, and for those there might be opportunities to present messages in a more translator-friendly way. And one part of this that I kind of skimmed over, which was mentioned in Ujjwal's presentation yesterday on MessageFormat 2, is that here the selection of variants is not an inline component, as it is for example in ICU MessageFormat or Fluent; instead, all of the variants need to be complete messages presented at the top level of the whole message. That is entirely intended to guide towards structures that are easier for translators to deal with. Rather than needing to figure out "you have" and then a selector of apples, you have a selector which gives you "you have one apple", "you have three apples", and this sort of an interface. But yeah, anything more? If not, I would like to thank you very much for your time, and that's it for me.
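The extensible function set discussed in the Q&A above could be sketched as a runtime registry. This is a toy illustration only: the English-only plural rule stands in for real CLDR plural rules, and all names here are mine, not the spec's:

```python
# Sketch of an extensible function registry: the data model only
# names a function (e.g. "number"); what it does is supplied by the
# runtime and can be extended or replaced for other syntaxes.

def select_number(value, keys):
    """Match a numeric value against variant keys: exact match first,
    then the plural category, then the catch-all '*'.
    Toy English-only rule; real code would use CLDR plural rules."""
    category = "one" if value == 1 else "other"
    for preference in (str(value), category, "*"):
        if preference in keys:
            return preference

registry = {"number": select_number}

# Extending the set is just adding an entry:
registry["string"] = lambda value, keys: str(value) if str(value) in keys else "*"

keys = ["one", "*"]
print(registry["number"](1, keys))   # one
print(registry["number"](5, keys))   # *
```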
Long Term Effort to Keep Translations Up-To-Date
Okay, can I start now? Okay, sorry for the delay. So, I will try to present how the Indonesian team is doing translation for several translation projects I have been involved with. This started from one project: I created statistics for one project, and then I tried to add other projects, and I had greatly miscalculated the effort needed to do the statistics. You will see how difficult it is; maybe I'm not skilled enough to create the long-term statistics. Maybe some of you can help me. So, this is the start. Why are we, the Indonesian team, doing translation? Because in Indonesia we have around 276 million people, and from these many people we have several major languages. Actually we have more than 1,000 languages. Every small community, which in Indonesia we call a desa or kampung, a village, has its own language. But do we really need to support or translate to everything? No. So, at least we need to consider translating into these big six languages. But I myself am only fluent in Indonesian and Javanese, so I cannot help with the other languages. So I tried to start with the one language used by most of the people, so we can start doing something that can quickly be used by many people. Okay. In my talk I will compare LibreOffice, GNOME and Ubuntu. Why these three? Because I thought they had sufficient data to create long-term historical data. I thought so, but let me show how difficult it is to extract this. The other thing is that the three of them have good, periodic release schedules. For instance, LibreOffice is releasing in February right now. I think last year they released in March and September, but starting this year, February and September. And then GNOME; GNOME is usually a little later than LibreOffice.
They also release two times each year. Also Ubuntu; with Ubuntu you can see it from the version numbering, something .04, something .10, so Ubuntu releases every April and October. The translation platform for the current version has very detailed, very good statistics. But if you want to see a previous version, especially a version that is already out of support or something, it's not easy. So with that in mind, I planned to compare how many strings changed in each release. Is it easy to get that kind of data? How many strings changed in each release? It's difficult. Why is it important? Why is it important for us translators to know how many strings, or how many words? Because we need the size to try to guess how many days, how many hours, we need to spend to do this translation. I think it would be helpful for every project if they could create that kind of data extract to prepare their translation teams. And how far along is a translation? Usually you can use a percentage. Some projects use the percentage to decide whether a language's translation is definitely included in the release; maybe less than a certain percent means it is at most a beta version or something. Again, for the current version this number is easily found, but for past releases, you need to, ah, I don't know. Some platforms maybe still have the data, others not. And the other thing is, I want to know who did this translation. For instance, in my teams, in this project I have, let's say, 10 members, and in that project 5 members. Who really does the translation? Not every platform can provide that kind of data.
But with GNOME, because the workflow uses the whole file, download, claim to be translated, then upload, it is quite obvious who is doing what translation. Okay, so let's see. LibreOffice. Right now the main platform is Weblate. It has a powerful search facility, with many options; it's quite difficult to get the correct query to convey my needs and extract the data I need. It also has the Weblate API. Maybe I'm old school and never used APIs, so when I tried, there was so much data and so many options; how do I create the correct API call for this? So, actually, Weblate is good, but it needs time. I'm not sure who has created something from their API; I want to learn from him or her. The other thing is that with Weblate, they have a cron, a scheduled job, to push from the translation platform into the main git repository. So I can actually choose between data sources: do I want to take from Weblate directly, or from the git repo? I tried to access the git repo as well. For instance, I create a clone, change to a certain release, and then there is a directory for the Indonesian translation. From that directory, I try to list all the commits, and from there I can try to see how many lines changed and who did the commits. But actually, when someone does a translation in Weblate, the data stops in Weblate, because the commit from Weblate to git is done by a special account, not the actual translator. So there is data missing if we take it from git commits, although some other details are good. And actually, the data I can present here came from the Wayback Machine and from my wiki, because I maintain a wiki to keep track.
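The git-side counting just described, listing the commits touching the translation directory and summing line changes per author, could be sketched like this; note the caveat from the talk that commits pushed by Weblate show up under its service account rather than the original translator:

```python
# Sketch: count lines added per author for a translation directory,
# by parsing `git log --numstat` output. The parsing is split into
# its own function so it can be tested without a repository.
import subprocess
from collections import Counter

def parse_numstat(text):
    added = Counter()
    author = None
    for line in text.splitlines():
        if line.startswith("author:"):
            author = line[len("author:"):]
        elif line and "\t" in line:
            plus, _minus, _path = line.split("\t", 2)
            if plus != "-":                  # '-' marks binary files
                added[author] += int(plus)
    return added

def lines_added_by_author(repo, path):
    out = subprocess.run(
        ["git", "-C", repo, "log", "--numstat", "--format=author:%an", "--", path],
        capture_output=True, text=True, check=True).stdout
    return parse_numstat(out)

sample = "author:Alice\n3\t1\tid/file.po\nauthor:Weblate\n10\t2\tid/file.po\n"
print(parse_numstat(sample))   # Counter({'Weblate': 10, 'Alice': 3})
```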
I maintain maybe 20 or 30 translation projects for Indonesian. So many. I don't want to make bookmarks; my wiki is my bookmark for translation. So, this is the result, the latest status for LibreOffice. You see that the UI is only 99% finished. These few strings are not done because we are not sure how to translate them; they are missing context. But this one is very bad: Help is at 70%, and even for the newest release, less than 70%. The UI over the last four years, with releases every six months, has been relatively close to 100%; not so for Help. For version 7.0 we had 3,000 untranslated strings, and right now, 18,000 untranslated. We had a good result in version 6.2, when UI and Help for Indonesian were 100% translated. That could happen because we did a translation sprint of two or three days with maybe 30 people going at it together, and boom, zero untranslated. But of unknown quality, because they started translating only for that occasion. Yeah, we tried it; that time we tried for quantity, not quality. But after that, they were only involved on that one occasion, and the long-term translators did not have enough energy to continue. I sometimes try to translate maybe 10 or 20 strings. Why so few? Because a UI string usually consists of maybe one to five words, maybe 10 at the longest. But a Help string consists of 30 words, 40 words, 50 words, so long. So you cannot compare only the number of strings, no. You need to see how many words, because that really shows how large an effort is needed to do them. Okay. So, who did it? Actually, only two people.
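As an aside on the words-versus-strings point above: given entries already parsed from a .po file (for example with the third-party polib library; the parsing itself is omitted here), the effort measure could be computed in words rather than strings like this:

```python
# Measure remaining translation effort in words, not strings, since
# a Help string can be 30-50 words while a UI string is only a few.
# Entries are assumed to be (msgid, msgstr) pairs from a .po file.

def untranslated_effort(entries):
    strings = words = 0
    for msgid, msgstr in entries:
        if not msgstr:                  # empty msgstr = untranslated
            strings += 1
            words += len(msgid.split())
    return strings, words

ui_entries = [("Open", "Buka"), ("Save As", "")]
help_entries = [("Choose this option to export the document "
                 "as a PDF file with the current settings", "")]

print(untranslated_effort(ui_entries))    # (1, 2)
print(untranslated_effort(help_entries))  # (1, 15)
```

One untranslated string in each set, but the Help entry represents over seven times the work, which is exactly why the raw string count misleads.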
These three, I think, were involved in updating the source, not doing the translation, but they show up in this data. Okay, that's LibreOffice. Then let's see GNOME. These are the latest numbers. You can see GNOME divides their many modules into several categories. What we need to translate, what we usually prioritize, is the user interface for this group. But there are other groups, for instance "GIMP and friends". I never touched those for many years, because not every project is easy to translate. GIMP especially is an image editor, and I'm not familiar with the terms used in that community, what kind of translations they usually use. I didn't have enough time to consult with, what do you call it, an SME, a subject matter expert, on image processing and image editors. So I didn't do that. But otherwise, the UI is good; Help still needs some effort, it's not quite good yet. Yeah. This is the statistic. Since version 3.6, the GNOME Indonesian translation has almost always been at 100%, and Help is getting better. In this period I had too much free time, so I did many, many translations, and after that, busy time. This is how many commits, to calculate who did the commits: actually only two people here. Two people maintained that kind of percentage for how many years? I don't remember, maybe 10 or more. I started as the GNOME Indonesian translation coordinator in 2010. I have asked many people, would you replace me, and no one came forward. There is nothing special about being translation coordinator; it's just that no one else wants to do it. Okay. Ubuntu is the most difficult one. You see here? Untranslated: 200,000. So much. Why? Because many of them are GCC. Do you want GCC to be translated? Why? Do you want GDB to be translated? All those libraries? Because you want to use that, ah, what was the platform called?
Transifex, eh? They do not create subcategories, so it's quite misleading how well Indonesian is translated in Ubuntu, how well any language is translated there. Okay. I think that's the statistics; a statistics analyzer, someone called it. So, why do I care about this? Because I hope that if some other team can recreate this kind of effort estimate, maybe they can use it to create a funding proposal, or maybe it can be used to plan a translation sprint: how many people are needed, for how long, and a target of how many strings can be finished. It's rather difficult to use data from one language for others, yes, but at least we can have an approximation. Okay. I think that's my thought. Any questions? Good. Ah, sorry, yes. Because we are all from Asia, right? I'm not very familiar with Indonesian, but I know you did a lot of translation work for upstreams like Ubuntu. If I go back to China and I want to do some translation for the upstream, how can I start? Sorry, what kind of language, Chinese? Because, as far as I know, in Ubuntu they have Traditional Chinese and Simplified Chinese. So you just join those existing teams and start translating. I think it's quite easy. I'm not sure what the number is, but there is already a team with many translators; you can join them and continue. For a language that has not been started, it's rather difficult, but for Chinese, I think, yeah. Ubuntu is quite established, I think, for Chinese translation.
For different projects, you need to check; they have different ways to do translation, different platforms, different processes. For instance, with my three examples, GNOME, LibreOffice and Ubuntu: different platforms, different processes. With GNOME, you need to take the whole translation file, do the translation locally on your computer, and then upload it back to the system. With the other two, Ubuntu and LibreOffice, you can do it both ways: you can do the translation online, or you can download the file, translate locally, and upload it back. Okay, other questions? No? Ah, yes. In the beginning you mentioned you have had some problems using the Weblate API for your projects. How many projects, how many different languages, do you actually use within the Weblate process? Six different Indonesian local languages? No, no, only one, the main Indonesian. So it's only one project? Yeah, language code ID. And what, especially, are the problems with the API? For instance, if I want to see how many strings are left, not translated yet, for Help, for this version: the API call needs to be expressed with a URL, then the main function, then the project name; they have a list of UI master, UI for a certain version, Help master, Help for a version, so that's the project name. And then the next component is, what do you call it, a module? From this one project name, they have 200 modules or something. And then slash language, ID. So I need to iterate over all those things, create the summary, and add everything up. It's quite difficult for me. So it's more an organizational problem? Yeah, yeah.
So, maybe because I don't understand the API well enough, my approach is not efficient enough. That's why I need to consult or discuss with whoever is more familiar with this API or that API. Okay. No more questions. Thank you.
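The iteration the speaker describes, a project, then each of its components, then the Indonesian translation statistics for each, could be sketched against the Weblate REST API roughly as follows. The endpoint paths and field names here are taken from the Weblate API documentation as I understand it and should be verified against your server's version; the aggregation is separated out so it can be tested offline:

```python
# Sketch: sum untranslated strings across all components of a
# Weblate project for one language (e.g. "id" for Indonesian).
import json
from urllib.request import urlopen

def sum_untranslated(stats_list):
    """stats_list: one statistics dict per component, each with
    'total' and 'translated' string counts."""
    return sum(s["total"] - s["translated"] for s in stats_list)

def fetch_language_stats(server, project, language):
    stats, url = [], f"{server}/api/projects/{project}/components/"
    while url:                                   # the component list is paginated
        page = json.load(urlopen(url))
        for component in page["results"]:
            slug = component["slug"]
            stats.append(json.load(urlopen(
                f"{server}/api/translations/{project}/{slug}/{language}/statistics/")))
        url = page["next"]
    return stats

# Offline example with the shape of data the API would return:
sample = [{"total": 500, "translated": 480}, {"total": 200, "translated": 50}]
print(sum_untranslated(sample))   # 170
```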
Using Open Source AIs for Accessibility and Localization
So, hello all. Thank you for being here until the last talk, and I'm a first-time presenter, so if I get a bit jittery, I'm sorry. So, the topic that I'm taking is open source AI for localization and accessibility. The main idea is to use open source AI tools to elevate the content that you are actually receiving and to enhance the localizations that you can benefit from. So, okay, sorry for that. Essentially, coming back to what I was saying, we can use open source AI to enhance subtitling and to have voice-to-voice conversion of a lot of video and audio content, in addition to the text-to-text conversions that we have. So, you might be wondering, what is the actual problem? Well, I've seen that most of the time when I'm trying to access a doc, I'll be having this language issue. For example, I was working with a technology in the augmented reality realm, and all the documentation was actually in Japanese. I tried reaching out to the developers there, but unfortunately the language barrier still hit hard. Another case: the same guys actually had a few tutorials available on YouTube, but same problem, I don't speak Japanese, and I'm unable to convey my ideas to them in a language that I know. So, this was an issue that we were all facing. And then you might be thinking, why can't you use something like Google Translate? Well, the obvious case is data safety. If I'm working on a cutting-edge technology, I don't want that to be leaked to other people without my consent, or before I want to release it to the public. I don't want someone else to just take my data and release it without my approval. And when it comes to usability, that's another factor, and financials: when I was working as an independent developer, the financial side was a big issue for me.
I didn't have the money to bankroll something like $1,000 every month into a translation subscription. So, let's elevate that with a bit more of a user story. Suppose I'm a research student, and you can take the case of augmented reality right now: I'm trying to work in this very niche area, and people know about it, but I can't really converse with them. I'll be having a few issues like that. One of the main problems I'm facing is the lack of resources. Or there could be resources, but they are present in another language that I'm not really able to understand or converse in. So I would require the resources to be converted to a language that I speak, and I want the conversions to have a decent level of accuracy. So that's it. As for solutions, I can obviously ask people who speak the language and pay for their services, but it is expensive and time-consuming. So a stopgap would be to use an AI solution. A similar case would be that of a docs manager. So, before I go there: what do we actually have right now? We have a few text-to-text conversion engines. Like, in India, there are about 128 languages actually spoken, and we actually have coverage for two. So if people are from such places, you will require more coverage and more assistance. To sum up what I was talking about so far: we don't have an all-in-one solution which can fulfill all these requirements, be it text-to-text, text-to-video, or the other way around. So that is some of what I would like to talk about.
And yeah, if we are looking for an open source library, I would like one that focuses on the audio and video translation side, because enhancement and accessibility for audio and video is what is actually helping us improve the language models and reach a wider audience. So solutions which can help me with automatic translations can be a good choice. So, just to recap, I think I'll just skip this part. What I actually require is an open source model that can be executed locally, gives me a decent accuracy and a decent execution time, and helps me enhance the quality of the content that I'm delivering. So the one to watch right now is called Seamless M4T. It's a model from Meta, and it's under the MIT and the Creative Commons licenses. It gives you speech-to-speech translation, speech-to-text, text-to-speech, text-to-text, and automatic speech recognition. So that's a pretty good one. And as I have been trying to highlight for the last 10 minutes, we require that because we need an all-in-one solution. And I think I already highlighted all these parts, like the super informative video or the precarious conversations that you are having. Like, if I'm trying to have a conversation with, can I have a name, sir? Samath. So if I'm having a conversation with Samath, and for a moment let's just think Samath doesn't speak English, he speaks French, and I speak English, it's hard. That's the moment where I would require a tool like this. But what if it's not French, but some language that's actually not well documented? So, I'm just going to go with Swahili. Okay, that was a random guess. So, yeah.
If I'm going with a language like Swahili, and I'm speaking in English, or in my mother tongue, Malayalam, I'm just going to sit here trying to explain the concept to him, but he doesn't understand what I'm saying, and I don't understand what he's saying. That is the moment where we require such a tool. He might have cutting-edge research in the domain, but unfortunately, it's only accessible to native speakers. So, as I said, that is where the benefits come in: the universalization of resources. Anyone from a large org, a creator, a student, a developer, basically anyone, can make use of these technologies. As for the technology I mentioned, the M4T is actually under an MIT and a Creative Commons license, so it can only be used for nonprofit uses, and I believe it should remain like that. So that's the summary. But before we go, I think there's something else that I can show. So, okay. Excuse me. Okay. Yeah. Okay. So just suppose I'm this really famous creator, and I hope the mic has an arm. Okay. Sorry, that just played out in a way. So suppose I'm actually a really famous creator and I'm doing something about AI. I want this content, I only speak English, and I want this content to reach you guys everywhere. So let's watch the original video first. Okay, let's pause it there. And how good would it be if you could actually hear the same thing in another language? Okay, wait a second. So how many of you here know Spanish? Okay. Just tell me whether this even remotely makes sense to you. So that's it. Yeah. It sounds good, right? I'm just going to take it from those two guys. And yeah, the same thing can actually be in French, in Dutch, in Italian, and in Hindi. I do speak a bit of Hindi, so I can validate that. So, would you guys like to hear it in another language?
It's going to be the same audio. Okay, for the French-speaking guys, I'm just going to play this as well, so you can see the text, the audio and all this. The model I just mentioned, Seamless M4T, is actually doing all of this in a single model. And I feel it's about time we have a few of these solutions coming into open source as well, as a totally open source model. This is what I dream of, and maybe I'll be back next year with something that's remotely close to this. So thank you. Yeah. So, any questions? The model we're talking about, it's not open source, but it's usable for non-commercial purposes, or is it open source? The model's open source, but you cannot use it for commercial use. It's under the MIT and the Creative Commons licenses. Yeah. Is the training data made public? No, the training data is not public. It's the classic Facebook thing. Yeah. Does it run on speech recognition? Yeah, it runs on speech recognition, and you're converting to a base language, and from there you're converting again. So suppose I'm speaking in Hindi and you want the conversation to be in French: it's going to convert the Hindi conversation to English, and from English it's going to convert it to French. What about the latency between the speaker speaking and understanding all this? Okay. There are models in the same family that can actually offer latencies near 100 milliseconds, but that's a lightweight model and you won't have the accuracy of the heavyweight one. So we have to trade off between accuracy and speed. The heavier the model is, the more parameters it has, the more accurate the results you'll be getting, but unfortunately the speed will come down. And I saw a couple of other hands. Yeah.
The model is the 2.7-billion-parameter one, and if you're talking about the actual size of the model in gigabytes, it's about eight gigabytes. Yeah. So, just to clarify a few things that have already kind of been answered: the licensing is MIT and Creative Commons, and, not or, correct? Yeah. Oh, okay. And then the other thing: when you are translating from one language to another, you said that some of the models have a latency of about 100 milliseconds. The audio samples you showed us, how long did they take to run? It took me like three seconds, but I'm running it on a Colab T4 GPU. So it depends on the GPU. Right. Yeah, I'm sorry. Sure. So the proposed solution that I use here is the same as with LLMs: you can split the text up into smaller and smaller bits and then combine them. For example, this model actually performs better if you have something like 20 seconds of audio. So what I do is, if I have one minute of audio, I'm going to split it into three chunks of 20 seconds each, and it goes on from there. Sorry, I don't have a solution to that. Yeah. And I think you have a comment. Yeah, I think. Okay, sure. Okay. You can run it locally, but it's going to depend on your GPU speed. That's it. Is it possible to have a view of practical information, links and so on? Oh, sure. Okay. So yeah, I just closed that a bit early, one moment. Okay. You can hit me up here: my name is Nevin Koshy Daniel, and that's the same for the Gmail ID. You can just text me there. And if you're looking for the particular model, you can get it from Seamless Communication on the Facebook Research page. Yeah. And someone did ask me a question about latency, right? So if you guys have a moment, there is something called Seamless Expressive. I am not 100% sure how this will work, but let's try the demo. Can you come over here, please?
Tate? Sure. Let's have a try with this. Do I have to say something in English? Yeah, I think so. Or do I have to speak French? Let's try it with French, I guess. No, I can't speak French. It's not going to work. Okay. Yeah, let's speak English and let's translate that to French. French? Sure. Oh, you have to allow audio permissions. Yeah. Don't worry, I'll just redo this. Yeah, yeah, yeah, it's okay. Yeah, right. "Can I use Linux and run this on my server? I hope that works." So yeah, it's going to take some time to generate the translation. Oh, wow. It's pretty quick. And... I don't speak French, so for someone who knows the language: is it correct? It's correct. That's very cool. Okay, is the model doing both the translation and the text-to-speech? Yes. This model can do all four things: speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation. So all four, yeah, and automatic speech recognition as well. So five things. To give you guys this... I am. Okay. That's pretty much everything, and thank you. Thank you.
The importance of Web Performance to Information Equity
All right. Hey, everybody. Welcome to the Web Performance Dev Room. I facilitated this year with Mozilla and Wikimedia. We're going to go ahead and get straight into it, and I'm going to introduce Bas Schouten. And, yeah, I'll pass this over. In Yemen, the cost of a gigabyte of data is approximately $16. In Chad, the cost of a gigabyte of data is around $20. In Yemen, the average income, or the median income, is about $250 a month. In Chad, the median income is about $60 a month. Hundreds of millions of people in the world live in areas where they spend a significant portion of their income on their data bundles. They often lack stable charging facilities, and they can only imagine what it would be like to have the kind of high-end, or even mid-end, devices that people like you, I, or anyone else in this room have. Often, when we think about performance, we're thinking of making our websites faster. We're looking at making them faster and more fun to use in order to improve our conversion rates and sell more products. I'm Bas Schouten. I am a principal engineer at Mozilla, and I am the tech lead for Firefox Performance, and I'm going to talk about how performance is so much more than that. I'm sure that most of you are familiar with countless statistics about the limited means that the poorest half of the world's population live with. There can be no doubt that those people deserve the same access to information as you, I, or anyone in this room. Understanding the importance of that information equity means understanding the importance of web performance. When we're talking about performance, we're usually talking about one of three things. The primary thing, and the most obvious one, is speed. Speed is about how fast and how smooth the results of a user's interaction with your sites or services actually render on their device. Another aspect that directly leads into that is data usage.
Obviously, before you've actually sent the data to the device, there is no way they're going to be able to see what you're about to render. But something that's a little bit less obvious is that power usage is also an important aspect of performance. Not only are you going to help the environment by using less power, but you're going to extend the lifetime of people's devices, making their batteries last longer, but also causing them to heat up less, have fewer fans spin up, keeping the devices more comfortable to use, decreasing the wear and tear and increasing their longevity. And finally, you're going to reduce the amount of heat, obviously. In the time we have together, I want to talk about how people living with more limited means are at a disadvantage for all three of these pillars of performance, and specifically also web performance. We'll go over what the global landscape looks like and the situation, particularly in the global south, that people are dealing with when we're thinking about some of these things. I'm confident that you will be left more motivated to improve your sites and services, and as a result, you'll pay extra close attention to the speakers that are here the rest of this day. For now, the first thing I want to talk about is raw speed. When I talk about raw speed, what I mean is the performance of the CPU of the device that we're talking about. This is basically: once a device has all the instructions it needs to render something onto the screen, how quickly can it do that? What does that look like? Well, over here I've compiled a list of the most common, most popular smartphones for Africa versus Europe. Now, getting good public data for Africa is actually kind of hard, but that's not too important right now. The most important thing is that this list of phones for Africa is probably as unfamiliar to you as it was to me. This Hisense, what now? I've never heard of these things, right?
And the important thing here is the trend. Most of these devices on the list for Africa, actually almost every single one of them, is at least 2x to 3x slower than the devices we see here. And do not let their names fool you. That Itel S23, that cute naming trick, that device is nothing like the same model from Samsung. So what that means is that if my LCP takes 500 milliseconds of CPU time, delivering that same LCP in another part of the world will take at least a second. Now, we know that LCP impacts conversion rates quite significantly. A one-second improvement to LCP means a 27% improvement to conversion rates. And that is not just about how fast or how likely you are to sell your products. It's also about how likely people are to access the information that you are looking to present to them. And of course, this is not limited to the global south. The performance improvement that that Samsung S23 offers over that Itel S23 comes at a hefty price tag, with a Samsung S23 costing over 650 euros and an Itel S23 costing under 150. It doesn't take a genius to know which class of society is more likely to own one device over the other. But obviously, the raw performance of the devices is not the only thing that is different between here, where most of us live, and the global south. And there are other aspects of those differences that have a much more direct impact on people living with more limited means. The most important one, which I'm going to talk about, is data usage. Let's take a look at what the global landscape looks like when it comes to data usage. I've pulled a list off Wikipedia here of the countries with the slowest mobile data connections. One of the first things you'll notice is that the mobile speed of none of these countries exceeds 20 megabits a second, or about 2.4 megabytes, right? And for some of these, the landlines don't even exceed 1 megabyte a second.
But an important thing to note here is that this list from Wikipedia is built based on results from speedtest.net. Now, we can sort of assume that people aren't likely to run speedtest.net when they're not actually in a good connection situation. And we can see that, because even the slowest here, 3 megabits per second, is almost the maximum speed of 3G. 3G has a maximum speed of about 500 kilobytes per second. And let's look at that a little bit. Displayed here is a 4G coverage map for one of the major carriers in Nigeria, the most populous country in Africa with a population of approximately 225 million. What you can see here is that outside the major population centers, there's not much. And a lot of the really less densely populated areas don't even have 2G. So you can assume that for a lot of these people, the fastest connections they could possibly have access to are about 500 kilobytes a second. Now, let's compare that to a 4G coverage map of the United States of America by the FCC. Unless you're visiting a national park, you're probably going to have 4G. And if not, you're still going to get 3G. That's a very different situation. But of course, the speed of mobile data transfer is not the only component which is different between the western countries and the global south. A more pronounced impact is what your users will experience through cost. Visualized here is the cost per gigabyte of mobile data. If we look at Chad, the cost of a gigabyte of data is over $20. If we look at the United States, that cost is less than $10. And in most European countries, the cost lies even lower. But even if we ignore the outliers, it's important to realize that the global average is roughly $4 per gigabyte. Compare that to a global median income of about $300 a month; half of the population has to do with less than that. So let's think about that for a second and think back to the introduction of the talk.
A gigabyte of mobile data in Chad costs about $20. A monthly income in Chad is about $60. That means that if your site ships one megabyte to the median user in Chad, it costs them about 0.03% of their income. If your website takes about three megabytes per visit and a user visits it once a day, that will cost that user about 3% of their monthly income. Now, to make that even a little bit more concrete, I went to bbc.com. I looked at about five articles and I read them. During that time, bbc.com consumed about 17 megabytes. If a median user in Chad chooses to read five articles a day on bbc.com, every day, that consumes about 15% of their income. Add to that the consideration that 95% of sub-Saharan Africa accesses the internet solely through mobile devices, and you can see what an immense impact the data consumption of your websites can have on the disposable income of the people living there. And of course, when thinking about that data usage, it's not just about making it faster, although, after all, on 3G those three megabytes to show your site take at least six seconds to retrieve. It's saving you and your users money, as we already talked about, and it's going to lower the carbon impact of your sites and services. And talking about carbon impact, let's talk a little bit more about power. When we're thinking about optimizing power, we shouldn't just think about reducing the power usage by making our websites render faster. Obviously, if your website consumes less CPU, it's also going to use less power. But more impactful for power is: what are we doing when a user isn't actively interacting with our websites? We should be avoiding animations, videos, or animated ads when a user is just reading on our sites. And of course, we should be minimizing the amount of JavaScript that's associated with simple interactions. This comes with a myriad of benefits, even though those two watt-hours a visit to your site might consume might not sound like much.
If your site has a million visitors, those two watt-hours become 2,000 kilowatt-hours a day. The per capita power consumption in Africa is about 150 kilowatt-hours a year. But when we delve a little deeper, there are a lot of other advantages. You're going to be decreasing the amount of heat users' devices produce, particularly on slower devices. You're going to be reducing the number of fans that spin up, reducing the wear and tear, making the devices more comfortable to use. But most importantly, you're going to be increasing the lifetime of their batteries. And this is again an area where the global south is disproportionately affected. There are about 1.1 billion people living in sub-Saharan Africa. By estimates of the International Energy Agency, about 40% of those 1.1 billion people live without access to electricity. I want you to stop and think about that. There are more people living in sub-Saharan Africa without access to electricity than there are living in the United States and Canada combined. For many others living there, access to electricity is limited and power outages are frequent. Of course, those people with no access to power are also less likely to have mobile phones. However, those millions of users that do own mobile phones and have no or limited access to power, and all those countless users that are going to be coming online over the next decade, are often dependent on centralized charging facilities to charge their devices. Needless to say, for them, their phone running out of battery is a very different situation than it is for most of us in this room, where your phone running out of battery means you have to grab a charger, it's a nuisance, or maybe you grab your power bank, right?
So now that we have a better understanding of what the world looks like in terms of the internet, where does that leave us? The internet is going to play an increasing role in everybody's lives, from how you do your taxes to how you are billed by your service providers. And the potential for the internet to do good is immense there, and that potential can particularly benefit those people and organizations with limited means, by reducing the costs for staffing, travel, and time spent for those people and organizations that are least able to afford them. At the same time, it reiterates the role that we have as developers to ensure that we have a responsible impact on the most vulnerable communities on the planet. Now, we're here at FOSDEM, which means that hopefully most of us are working on open source projects, and if you're anything like me, other projects, commercial or otherwise, using your code is a great source of pride. And that means that when we're designing our components, our code, we may not be thinking about these particular use cases. We may be thinking about users that are not affected by these particular disadvantages. However, we have to think about what other projects may be using the code that we're writing and what users they might be reaching, and those users may be in those vulnerable positions. Thinking about them means thinking about web performance. The great news is that the work we do to make our sites faster, make them use less data, and make them use less power isn't just good for those users. You're going to be making your websites faster for your own users too, especially when they're riding a train through a tunnel or riding an elevator. You're going to be keeping their devices cooler, making them more comfortable to use and making them last longer. You're going to be helping the environment.
The greenhouse gas emissions produced by the internet and the devices we consume it on are vastly more than those of all of global aviation combined. And of course, this works the other way around as well: when you're making your websites faster for your users here, you're also going to be helping those people in more vulnerable positions. Since you're all here at this early hour, I'm certain that many of you were thinking about web performance already, and thank you for that, and thank you for being here. I'm confident that you're going to be even more excited to make all your websites faster, and there are a bunch of great speakers coming up the rest of this morning to help you do exactly that. Enjoy the rest of your day. Are there any questions? It's an interesting question. The answer... Oh, yes. So the question is, what do we do, or what do I do, I guess that means Mozilla, right, to make sure that Firefox works well on devices with limited CPU? I think the short answer is: not enough. I think that, like probably most of you and most developers out there, almost everyone working on Firefox is on fast devices, fast phones, many, many iPhones, where we don't even ship our own engine, right? And I think that is a hard thing to change. That's a hard mentality to change in the business of software development in general. We do explicitly test certain low-powered devices and their performance characteristics. But the global landscape is very diverse. I think the reality of it is that we tend to work a lot on making Firefox faster and consume fewer resources on fast CPUs, and then we hope that translates to a better experience on slow CPUs. One day, perhaps, we'll have optimizations and work that targets very specifically the different types of CPUs and different compositions of CPUs, in terms of heterogeneous architectures and things like that, that are more common in the global south, but we do not currently do that. Anybody else?
The next speaker has five minutes to set up.
Let's build a RUM system with open source tools
Hello, everyone. Today I'm going to be talking about building a real user monitoring system with open source tools. And before I dive in, a bit more info about me. My name is Tsvetan Stoychev. I work on mPulse at Akamai. mPulse is a real user monitoring system; it's a commercial one and it serves, I think, thousands of Akamai customers. And my hobby project is Basic RUM, which will be the focal point of this presentation. And really before I dive in, I would like to share a bit more about some of my other personal activities. Every December, I make an attempt to publish at least one blog post on the Web Performance Calendar; that's the best place for web performance content at the end of the year. And the other thing is, sometimes I do street art. So that's my safety net plan: if ChatGPT takes over the world, I will still have something creative to do. Yeah. So let's now move on to the important part of the presentation, and let's take a look at what, in general, a real user monitoring system looks like. We will need to have something in the browser, ideally a JavaScript agent, that will read some data and send it to a server. We will store it somewhere in a storage and later we will analyze this data. And here we see the most trivial piece of JavaScript; this is the bare minimum that will do the job in the browser. This piece of JavaScript will read the current page URL, and it will create a one-by-one-pixel image, append it to the HTML, and this will force the browser to create a request to this endpoint. And here is a really simple code snippet on the server side, how the code looks when we need to intercept this data and store it somewhere. So here is our route, where the browser will hit this route.
We will read the query parameters and headers, and we will even put a timestamp in the structure; then we'll save it as JSON on the file system and return a transparent GIF back to the browser. And eventually, at the next stage, when we want to analyze the data, we will go through all the files and we can create a summary of the page visits. For example, in this example we can see that category four was the most visited page, with 427 page visits. So that's the theory. In 2019 I started Basic RUM as a hobby, and these are the initial version and the components that I used to build it. On the browser side I started using an open source library called boomerang.js, which collects a bunch of interesting metrics from the browser and sends them to a server. On the server side I used nginx and some PHP application code. For storage I used MySQL; for analyzing the data I also used PHP, for reading the data and serving it to a frontend; and on the frontend I used plotly.js for visualizations. And I ended up with something like this. It's actually really interesting: after five years it's still running. So if you want to give it a try, this is the first version of Basic RUM; you can visit demo.basicrum.com and play with this UI. Now about boomerang.js. Boomerang.js was started in 2011 at Yahoo by Philip Tellis, who now actually happens to be a colleague of mine. Currently the library is maintained by the mPulse engineering team at Akamai. As I mentioned, the library collects a bunch of interesting metrics, like the interesting ones for Core Web Vitals: LCP, CLS, FID. It can also track some session data, it can help users of the library create a timeline of all the clicks on the page, like the life cycle of a visitor, and it has more modern ways to send the data to the server, using more modern JavaScript APIs: fetch, XHR and sendBeacon. It can be found on GitHub at akamai/boomerang.
On the backend side, that's again very theoretical, but here is what was actually happening: I was still saving every request that reached my server into a file. And then periodically I was running a cron job, which here I've just marked as overhead, kind of too much overhead, and you'll understand why later. That cron job was reading all these collected files, creating one big batch, and inserting this data into MySQL. I also ended up with a database model that's very biased. My previous background was building Magento online shops, and if somebody has ever worked with Magento, they'll probably recognize some patterns here: all these foreign key relationships and this main table that's at the center of everything. I had to put a bunch of indexes here, and again this created a bit too much overhead, I would say on the code level, on the application level, but also for me as a maintainer. Every time I wanted to introduce some new dimension, I had to create a new table and add a bit more code for inserting the data, and it was just too much maintenance for me. I also had to take care about not duplicating some of the data here, and this is because of the nature of PHP. PHP is stateless, so every request is independent from the others, and I couldn't keep things in memory. If I could keep some references in memory, I probably could have optimized some things here. And actually, a question to the audience: do you have an idea what this query would actually produce? What's the idea behind this query? Maybe. I can say that. Bucketing? Yeah, it's bucketing for a histogram. I also had to write a lot of data-scientist-type queries, which introduced a bit of a learning curve for me, but the system really had such queries coded into it, and this one here represents a histogram of the time to first byte.
We can see that the median is around 1.8 seconds; it's a bit skewed. And with the help of Plotly, the JavaScript library for visualization, I could create panels like these for distributions of operating systems and mobile operating systems, and I could also build bar charts showing the relationship between time to first byte and start render time. And yeah, credit to Plotly: it's a really cool, really rich library, and you can create a bunch of panels with it. But I found myself having difficulties and probably not focusing on the right things. As I said, when you build a real user monitoring system you need to change your mindset, and your queries should be more in a data scientist style. On the PHP side, the ORM that I was using, Doctrine, is not really meant for writing complex queries of this fashion. So I found myself writing my own query builder, using Doctrine when convenient and my query builder when convenient, but this was again too much maintenance for a single maintainer of a project. I also wanted to introduce user management and a permission system, but with my limited time, working on the project from time to time during the weekends, this was just too much; it was not the right focus. I just wanted to show some meaningful data. And yeah, I really love Plotly, but I ended up with large blobs of JavaScript here and there, and it became more and more Plotly. I wanted to see data, not write JavaScript. So I took a break, I believe half a year, and focused on my main job, but from time to time I did research, read articles about time series databases, and started exploring some of the available open source systems for visualization. And I ended up rebuilding the complete backend. I still kept Boomerang, but I rewrote the server side: I completely removed nginx and PHP and used Go.
I replaced MySQL with ClickHouse, and I replaced all the custom code, all the PHP and Plotly, with Grafana. And again, if you would like to play with the current version of Basic RUM, that's what I ended up with; it's a slightly rebranded version of Grafana with the specific Basic RUM dashboards and settings. So if you would like to play with it, just visit this address and use calendar / calendar as username and password. So, where Go was really useful: Go is just a different paradigm, a different idea compared to PHP. With Go you can compile a single binary, and in this single binary everything that I needed was packaged inside, so it's just one process that you run on the server and it has everything in it. This allowed me to get rid of nginx, because Go has a built-in HTTP server package. Yes, PHP also has packages for an HTTP server, but you need a lot of workarounds to make them work, because it's just not native in PHP. I could also leverage the existing ClickHouse package in Go for interacting with the ClickHouse database, and I took advantage of asynchronous inserts, which saved me a lot: I could get rid of some code that I had in the PHP version of Basic RUM. Also, in Go it was very easy to create a backup mechanism for all the data flowing through the system, because in Go I could easily keep stuff in memory; I didn't have to write each request to a file and later batch and bundle it.
I was just keeping these data points and requests in memory, for example for 10 minutes, and I could just flush them to the hard drive and compress them. This was really easy, a few lines of code, coming natively with Go. Also, for the cases where I needed encryption, there is a Let's Encrypt package for Go; it's a third-party package, but I could easily spin up a server, say okay, I want to use Let's Encrypt, and get a secure connection to the server with it. It really reduced the effort on the operations side. I also took advantage of a GeoIP lookup library which uses the MaxMind database. Why did I need this? Because in a real user monitoring system you would like to see from which city or country a visitor visited the website; this is really helpful when you want to create a report and figure out in which countries your website is slow. I also took advantage of another library for user agent parsing, which helped me extract important information like the browser name, the operating system, and the user agent family. And I also started using my new favorite database, ClickHouse. So, you remember where I said I was doing a lot of work batching and bundling everything and inserting big batches into MySQL? ClickHouse comes with a really cool feature called asynchronous inserts. It allowed me, every time a request reaches my backend, to immediately issue an insert to ClickHouse; ClickHouse was internally batching this and deciding when it needed to insert into the database, and this helped avoid hitting performance bottlenecks.
Another thing I could do with ClickHouse: here you see I had seven tables in the old setup with MySQL, but in ClickHouse I actually ended up with two tables. I could even have had one table, but I needed the second one for showing the host names in the filters in Grafana. With ClickHouse, or in general when we work with time series, the main idea is that the data is denormalized. I had tried to build a real user monitoring system in the fashion of a webshop, which is really the wrong idea. When we use a time series database, the idea is that you can just throw your data into one large, fat table, and you don't really need to worry about duplication of the data. For example, here we have this device type filter, and I don't have a foreign key to another table where I keep references to all the device types; I can just insert the same string over and over again, desktop, desktop, desktop, and the database will be completely fine with it. It will compress the data internally, and I won't experience any performance bottlenecks when I filter by this field. And here is my other favorite feature in ClickHouse: it's called the LowCardinality data type. This data type is really convenient for columns where the number of distinct values is less than about 10,000, because it optimizes internally, and the WHERE conditions and filters in this case are much, much faster when we use LowCardinality. If we have more than 10,000 distinct values, we probably need to go back to something like this and start introducing additional dimension tables. Also, what's here on the left is really, I would say, insane. I don't even know how I created this, I'm still really surprised with myself, and we cannot zoom in here, but this was a process which included querying my MySQL database, and I had some application code and I had a bunch
of cron jobs, and all this was trying to guess and find out all the sessions that bounced and what the duration of the sessions was. It was just really complex. To calculate the bounce rate with my new setup in ClickHouse, I could just use a query like this. Again, I got a bit of help with this query, I don't completely understand it, but it actually works, it's much simpler, and it makes Basic RUM much, much easier to maintain. With this query I could easily create this correlation between bounce rate and a metric, in our case time to first byte. I also want to say that open source is not only about how great the open source product you work with is; the community is also very important, and that's another reason I stick with ClickHouse. They have a really great Slack community, and every time I ask a question, I can say that within a matter of a few hours I get a really good response. For example, here I'm asking: hey, I wrote this query but I feel that it's not optimal, I'm not a SQL expert. And here another expert actually suggested a better way to write this query; it's shorter and much more performant. Also, this is probably the first and probably the last database YouTube channel I will ever be subscribed to, but I'm actually subscribed to the ClickHouse YouTube channel, and they have really good videos. Every month they have a release party video where the ClickHouse team shows the new features, and there are a lot of good tutorials, so it's really welcoming for beginners. As I said, you get support from the community and there are really good materials out there. So now let's look at the user interface, Grafana. Earlier I mentioned that in the first version of Basic RUM I was about to start implementing my own user management, login, and authentication, and with Grafana this
comes out of the box, so it's much easier to add new users and give them different permissions. Again, this is just code that I would never want to write again, right? And in this repository I bundle the Basic RUM version of Grafana; it has some customizations. Another benefit of Grafana is that it's very easy to model the data and what you want to see in the visualization panels. For example, here we can define filters and have a preview of our data. We can also configure different things; here I'm just showing how I can configure different colors for different thresholds. There is also an SQL editor, so when I write the SQL here, Grafana uses this SQL to fetch the data from ClickHouse. And here are other panels that I took advantage of. Here is the world map; it was literally plug-and-play, I just configured a few things and told it from where to read the data about the countries. Grafana also has a third-party plugin for Plotly, so for the scenarios where I wanted to build some more complex panels, I could use it; with this panel I could actually build this one, which shows how the width of the visitors' screens is distributed. Yeah, time series is kind of the default view in Grafana, and I could also present the data in a table, which is very good when you want to explore your own data. Grafana also comes with different data sources, and of course Grafana needs to know how to talk to ClickHouse. In Basic RUM I'm using a data source developed by a company called Altinity, but there is also another one, an official one, developed by ClickHouse. And just to say that all these things I'm showing, all these dashboards built into the Basic RUM version of Grafana, everything there is actually under version control. It's not just that I created a dashboard in a Grafana instance, exported it, and saved it somewhere. I have this repository where I have the
configuration for each panel that I maintain. This makes it much easier when I need to change something or add a new panel, and I can go through the history and understand what actually changed if something has to be reverted. For example, here you can see how I keep a row as templated SQL, and this is how it's then presented in Grafana. From all of this source-controlled configuration I build a Docker image: here we do a bit of branding work, removing or rewriting some things from the default Grafana image; here we install the plugins we need for our setup; and here we import all the configuration for the dashboards and the data sources. What I found over time, when I spoke to different people who asked me about real user monitoring systems, is that the conversation would very often end when I started explaining: you need to run this component on this server, and this component on that server, and so on. The use cases of the people I spoke to usually didn't require them to scale; they had pretty small websites or web shops. So I work on something a bit more monolithic, called basicRUM All-in-One. The idea probably sounds like bad practice from an engineering point of view, but it can actually be really practical: run everything on one big box. I believe this could be hosted somewhere for 20 euros a month, and I tested it; it can handle 1.5 million page views a month. In this setup we introduce Traefik, a proxy that sits in front of all the components and helps me with SSL termination and request routing, because some requests need to go to the data collection part, and other requests need to go to Grafana, to the part
where we analyze the data. So this is really convenient, and really easy for people who just want to give it a try. A few takeaways. A real user monitoring system is a fairly complex system, and if you want to develop one you need to train yourself: you need to learn about the data collection side, how the data is collected from the browser, and how to visualize the data, and it's a bonus if you learn how time series databases work. Choosing the right database to solve the right problem is key, and it's great when you can move a problem from the application layer onto the database layer; it just saves a lot of time. Grafana can also save a lot of time and effort. Even if you still want to build your own front end, maybe start with Grafana to play with the data and display something; it will literally save a lot of time. I got a signal that I've run out of time, but you can catch me afterwards. All right, I can take one question. So, regarding user data: in this project we don't really keep any IP addresses. The backend doesn't store any personal data; by default it uses the IP address only to identify the country and the city, and it doesn't store the IP address after that. On the data collection side, part of the Boomerang library's source code is private, but I know that for PCI compliance reasons it has special logic that tries to avoid collecting sensitive user data. Sometimes a user may type, for example, a credit card number, and that could be collected by mistake, so the library also tries to avoid capturing critical user information. Do you mean cookie consent? The library comes with a special loader snippet, and you can have your own callback, so you can call the loader snippet only after a cookie consent. So it's possible.
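The consent gating described in that last answer can be sketched as follows. This is a minimal illustration of the loader-snippet pattern, not Boomerang's actual snippet; `injectScript` and the URL are hypothetical stand-ins for appending the real loader `<script>` tag.

```javascript
// Hedged sketch of gating a RUM loader behind cookie consent.
// `onConsent` registers a callback with your consent-management code;
// `injectScript` is a hypothetical injector (in a real page it would
// append a <script> tag for the vendor's loader snippet).
function loadRumAfterConsent(onConsent, injectScript) {
  let loaded = false;
  onConsent(() => {
    if (loaded) return; // inject only once, even if consent fires twice
    loaded = true;
    injectScript('https://example.com/boomerang-loader.js');
  });
}
```

The point of the pattern is simply that no measurement script touches the page until the user's consent callback has fired.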
Keyboard Interactions
So, hello everyone. My name is Patricia, and I'm very excited and happy to be here. A quick disclaimer about me: I am a Chromium contributor, and today I'm going to talk about keyboard interactions and how I tried to improve them. Quickly about myself: I am from Vilnius, Lithuania, and I moved to the Netherlands to study computer science. During my studies I really got into open source, specifically through the Google Summer of Code program. In 2022 I worked on the definition of the INP metric, and I continued that work by diving deeper into metrics and interactions, specifically the keyboard interactions I will explain today. In 2022 I also worked on the Perfetto tool, which is a wonderful tool for developers, but I won't get into details here because Alexander will explain everything you need to know in a few moments. How I use it to this day: I trace websites with the DevTools Timeline checkbox enabled, and it gives me all the necessary information about interactions, specifically event timings. And once we have event timings, we can derive everything about the INP metric. So what is this INP metric that was already mentioned? It's a very popular metric today: Interaction to Next Paint, a metric that assesses responsiveness by measuring key press, tap and click interactions. For example, when you press a key on a virtual or physical keyboard, tap on a touch screen, or click a mouse to open a menu on a website, everything is measured so developers can see how fast their website responds to user input. This definition is, I think, wonderful and very innovative, and Google recently announced that INP is going to replace First Input Delay this March 12th, which is very exciting for the web performance community. However, after looking more closely at the metric, we found out that it's not entirely perfect. It's a wonderful metric, but key press interactions specifically weren't working the way we would like them to.
My goal today is to explain how we improved key press interactions: first, what a key press interaction is and how we measure it, and then we will dive into the more complex concept of non-standard interactions, such as emojis, and how we measure those for INP. This slide may bring you back to kindergarten, especially considering FOSDEM's usual heavy tech topics, but to understand a key press interaction we really need to look at a simple button, because a key is just a button in a more complex context. When you press any button in this world, the button goes down, and when you release it, the button goes up. It's just that simple. A key press has very similar behavior, built around two fundamental events: keydown and keyup. In this example we have one interaction: in the input we see the character A. The entire interaction starts with the first event, keydown, meaning the user pressed down a key. We immediately generate an interaction ID for it, saying, okay, we're starting the interaction. After keydown there is the keypress event, which is dispatched if and only if there is a character value. In this case we do have a character value, meaning the key was mapped to a specific character; if you pressed something that wouldn't produce a character value, you wouldn't see a keypress. Then we have the events about the DOM: beforeinput, which means the DOM is about to be updated, and input, which is dispatched immediately after the DOM is updated. Lastly, we finish the interaction with keyup, to which we assign the very same interaction ID that was generated on keydown. This definitely makes sense, and most importantly, keydown and keyup are the most significant events in the interaction, as you have perhaps already seen. This sequence gives us the entire definition of a keyboard interaction within INP.
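As a toy model of the sequence just described — simplified to one key at a time, which is an assumption; real interaction IDs are generated inside the browser — the ID assignment might be sketched like this:

```javascript
// Toy sketch of interaction-ID assignment for a plain key press.
// keydown opens a new interaction; keypress and keyup share its ID.
// beforeinput/input keep ID 0 in the plain (non-IME) key press case.
function assignInteractionIds(eventTypes) {
  let nextId = 1;
  let currentId = 0;
  return eventTypes.map(type => {
    if (type === 'keydown') currentId = nextId++;
    const belongsToKeyPress =
      type === 'keydown' || type === 'keypress' || type === 'keyup';
    return { type, interactionId: belongsToKeyPress ? currentId : 0 };
  });
}
```

Before the update the talk describes, `keypress` would have sat in the "ID 0" bucket too; moving it into the interaction is exactly the change being explained.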
It consists of three time spans, as with click and tap interactions: input delay, processing time and presentation delay. Input delay is the duration between the user pressing down a key and the event handlers starting to execute. Processing time is the time it takes for the code in the keydown and keyup event handlers to execute. Finally, presentation delay is the duration from when the event handlers finish executing until we actually see something on a frame. This definition does make sense; however, it had some problems. After further investigation we found it can be more than a little confusing. Firstly, the keypress event having an interaction ID of zero makes developers wonder whether keypress is related to the keyboard interaction at all. And to make things worse, it turned out that keypress can take as long as keydown and keyup together. Just a second. So, given this problem, we updated the definition of the keyboard interaction, and the new definition is very similar. It still contains the three time spans of input delay, processing time and presentation delay; however, the processing time now includes keypress, spanning the three candidates of keydown, keypress and keyup at the end. We really hope this removes some confusion for developers, since we now assign the interaction ID to the keypress event as well. And finally, we do believe that including keypress is a step towards a polished, improved and more accurate INP metric, especially for keyboard interactions. With a simple key press, everything is quite well defined: we know where the interaction starts and where it finishes. More interesting things happen with non-standard keyboard interactions, because we cannot be sure our users will always use just standard keyboard input.
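The three spans above can be derived from a `PerformanceEventTiming` entry via its `startTime`, `processingStart`, `processingEnd` and `duration` fields; in this sketch a plain object stands in for the browser-provided entry.

```javascript
// Derive the three INP time spans from a PerformanceEventTiming-shaped
// entry. `duration` runs from startTime until the next paint after the
// handlers finish, so the tail of it is the presentation delay.
function breakDownTiming(entry) {
  return {
    inputDelay: entry.processingStart - entry.startTime,
    processingTime: entry.processingEnd - entry.processingStart,
    presentationDelay:
      entry.startTime + entry.duration - entry.processingEnd,
  };
}
```

In a real page, such entries arrive via `new PerformanceObserver(cb).observe({ type: 'event', buffered: true })`, and entries sharing an `interactionId` belong to the same interaction.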
I even came across a post on Google's Instagram account that uses everything from emojis to basic symbols to express one idea. To understand how fast a website responds to such input, we really need to dive into input method editors. So, does anyone know what an input method editor is? That's great. Most of you have probably used one in some form. It's a software component that enables users to input text that cannot be easily represented on a standard keyboard, typically because of the very large number of characters in the user's written language. It's very common in East Asian languages, for example Korean, Chinese and Japanese. Although I would love to speak Korean, Japanese and Chinese, unfortunately I cannot, so today we will look at a more standardized example, a simple emoji, which has a very similar structure. We can already see that we need to process far more events for one emoji than for a simple key press, and everything actually depends on how many interactions were made while producing that emoji. In this case the user started by typing H and then selected the emoji, as you can see in the example on the left. Since the complexity of such interactions is much higher, we originally only assigned an interaction ID to the input events. But thinking about the differences between pressing down a key in an emoji context and a non-emoji context, we find that we still have those two important fundamental events, keydown and keyup. So with our updates we assign an interaction ID on keydown and keyup. We still start the interaction with keydown: we generate a new interaction ID on keydown, assign the same interaction ID to the input event, and finish that key press interaction in the emoji context with the keyup event. When the user just selects an emoji without typing, we simply assign an interaction ID to the input event.
That gives us a better understanding of how non-standard interactions behave. The algorithm really requires some creativity and deeper understanding, though. Coming up with a solution for something that doesn't behave consistently was quite a challenge, and the solution may not be perfect, because it relies heavily on the order of the dispatched events. For example, when we hold three keys down at the very same time and release them at the very same time, we get three keyups at the end with the exact same interaction ID. This shouldn't happen in general; they should have different interaction IDs. It might not be the perfect solution, but looking into input method editors is an important way to address web responsiveness in East Asian regions, where people really do use a lot of graphemes in the Chinese, Korean and Japanese languages. And who knows, maybe all our emojis are just the introduction to more complex interactions; maybe one day we will see 3D interactions, and emojis will be the simple ones. For this project I'm really grateful to my mentors, whom I call heroes. They really helped me through the entire process of understanding interactions, defining INP, and understanding how to improve web responsiveness for developers. Thank you a lot for listening. If you have any questions, let me know, and if you're interested you can read the blog post or just ask me anything you want. Thank you. Do we have a question? Yeah. So, do the interactions go from top to bottom? It really depends on how websites show it. There are websites that show all the dispatched events starting from the top, and maybe reading from bottom to top is not the most intuitive, but this is the usual order you would get when you look into the events dispatched during a keyboard interaction. So yes, this was from bottom to top, absolutely. Okay, thank you.
Web Performance at Mozilla and Wikimedia
Hello, my name is Peter. I work for the Wikimedia Foundation in the quality and test team. I have about three minutes, so I'm going to show you a couple of things. In the team we want to make sure that we find regressions, and the cool thing is that we keep all our performance metrics in the open, so you can go to our Grafana instance, grafana.wikimedia.org, and see our metrics. Now I'm going to do a live demo. Oh, that didn't work out so well. Let's see. Okay. I'm going to show you four dashboards. We have our real user monitoring dashboard, with the data that our real users send back to Wikimedia. I suggest you go to that dashboard and look at our performance metrics; I think it's quite interesting, because not many big websites actually show their data. We also have another dashboard with all our synthetic tests, where you can use the dropdowns to see the pages we test and their performance data. This is more like internal data, so maybe less interesting for you. So I have two more dashboards that are more interesting. First, the user network dashboard. Here you can actually see what kind of network our Chrome users have: we use the Network Information API and beacon the data back, so we can use the dropdown and see what kind of network our users are on. If we scroll down, you can also see what kind of connection type they have. This is interesting because you can see what kind of connection different areas of the world have when they access Wikipedia. And the last thing I want to show you is the CPU benchmark. As Beth said, this matters because different users have different devices. For some of our users we run a small piece of JavaScript, measure the time it takes to run, and beacon that data back, so we can see what kind of performance different devices have for different users all around the world.
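A hedged sketch of the two measurements Peter describes: reading the Network Information API and timing a small CPU benchmark, then beaconing both back. The loop body, endpoint, and payload shape are assumptions for illustration (this is not Wikimedia's actual benchmark kernel), and `send` stands in for `navigator.sendBeacon`.

```javascript
// Time a tiny piece of busy work as a crude device-speed signal.
// `now` is injectable so the function can be tested without a browser.
function cpuBenchmark(now = Date.now) {
  const start = now();
  let x = 0;
  for (let i = 0; i < 1e6; i++) x = (x + i) % 9973; // illustrative busy work
  return { score: now() - start, sink: x }; // sink defeats dead-code elimination
}

// Beacon the connection type and benchmark result back to the server.
// `send` stands in for navigator.sendBeacon(url, body);
// `connection` stands in for navigator.connection (may be undefined).
function beaconMetrics(send, connection, benchmark) {
  send('/beacon', JSON.stringify({
    effectiveType: connection ? connection.effectiveType : 'unknown',
    cpuMs: benchmark.score,
  }));
}
```

In a real page you would call `beaconMetrics(navigator.sendBeacon.bind(navigator), navigator.connection, cpuBenchmark())`, guarding for browsers that don't expose `navigator.connection`.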
And we use that data to compare different devices, and to tweak how we run our tests internally. So if you go to that page, you can see what kind of benchmark results to expect for Wikipedia users. Okay, that was all for me. Dave? Thanks, Peter. Hey, everybody. I'm Dave Hunt; I'll stand over here so I'm not blocking the screen. I'm the engineering manager for the performance tools team at Mozilla, and I'm going to show you a little bit about how we handle regressions for Firefox. We have tests that cover page load and benchmarks, and I'm going to use a real example of a recent regression. I'll go pretty fast through these because I only have a few minutes. So, the obligatory slide with a quote from a famous person. Galileo Galilei said, "Measure what is measurable, and make measurable what is not so," and I think this is something we try to do in our team. Here is a performance alert. We have a bunch of tests running whenever somebody commits something to mozilla-central, our repository for Firefox, and when we notice a change in the baseline, we generate an alert. One of our performance sheriffs will be monitoring and triaging these alerts; in this case, this one was triaged by Andra. It shows you the magnitude of the regression and the tests that alerted. In this case I filtered it down just for simplicity: this is Expedia, and we can see some of our speed index tests have regressed. The sheriff will do some investigation. This is one of those tests shown in graph view: you can see this is our baseline, and there was a change. The sheriff has come in here and done some retriggers and backfills to narrow the regression range and identify a likely culprit. Then the sheriff will file a bug in Bugzilla. Because we've identified the likely culprit, we'll also needinfo the author of that patch, requesting further information so that they can be aware that there might have been a regression, and maybe we need to back it out or maybe we need to fix it. I'm also highlighting here the links through to one of our other tools, the Firefox Profiler. We provide as much as we can to the engineer, so they can confirm that yes, it looks like it's their patch, and also take a deeper dive into what might have caused it. Another tool we have is PerfCompare. If engineers think they have a fix, or something that might affect performance either positively or negatively, they can push it to our CI system, run the tests, and see a comparison. Here, again for that Expedia contentful speed index example, this is the before, in this case the regression, and a patch that should fix it. We ran the tests multiple times, and we can see that the distribution of the results is narrower, which indicates that perhaps this is fixed. And it was. We also alert on improvements: this is the alert that came in, probably a couple of days after the patch landed to fix it or back it out. I think this change was in how aggressively we do garbage collection. We can also look at the graph view, see the period of time where we had that regression, and see that it is fixed and back to the baseline we had before. We also capture videos, which is another useful way for engineers to confirm that there really is a regression; in this case, this is the fix, so this is the slower run and the improved, faster one. I mentioned the Firefox Profiler: I encourage everybody, if you haven't used it, check it out, try it, give us feedback. And finally, I just wanted to promote: Floring Kes is talking in Janson at 1pm today, on the main track, about Firefox profiling. You'll see an example of using the profiler for something other than web performance; it's a very versatile tool. That's it.
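The sheriffing workflow above boils down to spotting a step change in a noisy series of measurements. Here is a toy sketch of that idea, comparing the mean before and after each point; this is illustrative only and not the actual alerting algorithm used for Firefox.

```javascript
// Toy step-change detector: slide a window over the series and flag the
// point with the largest relative shift in the mean, if it exceeds a
// threshold. Returns null when the series stayed within the threshold.
function findStepChange(values, windowSize, threshold) {
  const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;
  let best = null;
  for (let i = windowSize; i + windowSize <= values.length; i++) {
    const before = mean(values.slice(i - windowSize, i));
    const after = mean(values.slice(i, i + windowSize));
    const change = (after - before) / before; // relative shift
    if (Math.abs(change) >= threshold &&
        (!best || Math.abs(change) > Math.abs(best.change))) {
      best = { index: i, change };
    }
  }
  return best;
}
```

A positive `change` on a "lower is better" metric like speed index would correspond to a regression alert; a negative one to an improvement alert.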
Fast JavaScript with Data-Oriented Design
Hello everyone. My name is Markus, and I would like to share some lessons I learned while working on the Firefox Profiler. I work at Mozilla, on the Firefox Performance Team; I work on Firefox itself and on the profiler, and I also have my own profiler called samply, which you can use on macOS and Linux to profile your own applications. I'll give you a brief overview of the Firefox Profiler. This is what it looks like with a profile loaded: you have a timeline at the top, a call tree, and a sidebar here next to the call tree; it's very small down there, so I'll zoom in a little. In the call tree you can see which function called which function, and how much time each function spent running; let's take this function here, dispatchEvent. The Firefox Profiler is a sampling profiler: it interrupts the thread at a given interval, usually every millisecond, checks what's on the stack, and accumulates that into this call tree. One thing I want to call out here is the category breakdown in the sidebar. Here we have a bunch of categories: User is just regular code, and Ion means JavaScript code that was jitted by the IonMonkey tier of our JavaScript engine; there are a bunch of other categories too. You can select a range in the timeline, and as you draw your selection, the category breakdown in the sidebar updates. We can also zoom in and look at something a little smaller; here we have more code in Ion, more code in the User category. It also has a flame graph, so let's zoom back out. You're probably familiar with flame graphs: they're a different representation of the same information. Like the call tree, you have nested functions, and the width of each box is the time spent in that function. And we also have a tooltip in the call tree, which again gives us a category breakdown.
I'm emphasizing the category breakdown so much because in a minute we're going to implement our own profiler, and we're going to focus on calculating this breakdown. Here you can see it's a bit sluggish as you move the mouse around, because it actually needs to iterate over all the samples in the timeline. For every sample it checks whether the stack is inside the function you're hovering; if so, it checks which category the CPU is spending its time in for that sample, and accumulates that into a map from categories to counts, for all the samples. And we can see here at the root node that we have about 500,000 samples in this profile. What I didn't tell you is that this is actually the version from last July, and I have since fixed this performance problem. This is what's live now on profiler.firefox.com: hovering these boxes is now instant, and it's still doing the work, still going through all 500,000 samples every time you move your mouse. So I want to talk a bit about how we can crunch through all that data in a very short time. Wrong direction. So yeah: even with lots of samples, we can now have a fast UI. I made an example project just for this talk, called mini-profiler. It's on GitHub, and it's also live on Netlify, so you can try it out in your browser if you want to. This is what it looks like: it has a very reduced feature set, but it also has this timeline, you can select parts of it, and it calculates the category breakdown. Let's say here we spent 30% in IonMonkey-jitted JavaScript code. At the same time it also calculates the heaviest stack, which is the stack we spent the most samples in. All right, so the mini-profiler has only two features: you select a range, and it gives you a category breakdown and the heaviest stack. How does it calculate that? We have an input JSON which describes the profile contents. The JSON is a list of samples.
Every sample has a time, a weight, and a stack; every stack is an array of frames; every frame has a name and a category. I'll show you that here in an example: a list of samples, where each sample has a time property, a stack property, and a weight property; the stack is an array, and each stack frame has a name and a category. To calculate the category breakdown, we take the profile and a range of indexes for the samples that are in the selection. We iterate over this range and get each sample, its stack, and its weight. We get the top frame from the stack and the category from that frame, and then we check: does our map already have this category? If so, get the current value; otherwise default to zero. We add the weight of the current sample and put the sum back into the map, and this map is what gets used by this Svelte component. For the heaviest stack, it's somewhat similar. We again iterate over all the samples in the selected range, and for each sample we get the stack and the weight. Now we need to check whether this stack has been used by multiple samples, and how do we find two samples with the same stack? Well, the stack is an array, and you can't easily check arrays for equality, so what I'm doing here is stringifying the stack into a JSON string and using that as the map key. Then there's a similar pattern to the category breakdown: do we have an entry in the map for this stack? If so, take its current value; otherwise default to zero. Add the weight, put it back into the map, and if this stack is the heaviest we've seen so far, we remember it; at the end, we return it. So these are the two main algorithms in this mini-profiler, category breakdown and heaviest stack, and both of them have to iterate over all the samples. So how fast is it? If I select here, it's reasonably fast; if I make the selection bigger, it starts getting a little janky.
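Reconstructed from the description above, the two naive v1 algorithms might look like this; treating the last array element as the top frame is an assumption of this sketch.

```javascript
// Naive v1 category breakdown: nested-object samples, Map keyed by
// category name, weight summed per category of the top stack frame.
function computeCategoryBreakdown(samples, rangeStart, rangeEnd) {
  const map = new Map();
  for (let i = rangeStart; i < rangeEnd; i++) {
    const { stack, weight } = samples[i];
    const category = stack[stack.length - 1].category; // top frame (assumed last)
    map.set(category, (map.get(category) ?? 0) + weight);
  }
  return map;
}

// Naive v1 heaviest stack: arrays can't be compared directly, so the
// whole stack is stringified and the JSON string used as the map key.
function computeHeaviestStack(samples, rangeStart, rangeEnd) {
  const weights = new Map();
  let heaviest = null;
  let heaviestWeight = 0;
  for (let i = rangeStart; i < rangeEnd; i++) {
    const { stack, weight } = samples[i];
    const key = JSON.stringify(stack); // expensive: hashes the whole stack
    const sum = (weights.get(key) ?? 0) + weight;
    weights.set(key, sum);
    if (sum > heaviestWeight) {
      heaviestWeight = sum;
      heaviest = stack;
    }
  }
  return heaviest;
}
```

Both functions walk every sample in the range on every update, which is exactly why the stringify-based version becomes the bottleneck as profiles grow.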
I'm computing some throughput numbers down here: about 100 nanoseconds per sample is how long the category breakdown algorithm takes, and 30,000-something nanoseconds per sample for computing the heaviest stack. Because, as we saw, the heaviest stack algorithm was really inefficient: it used JSON.stringify, it looked up a gigantic string in a map, it needs to hash that entire big string, and so on. This is obviously not the way to go, but it's a place to start so that we understand what's going on. The nanoseconds-per-sample figure might not tell you much on its own, but think of it as limiting the profile size you can handle while still being responsive. Say you have 100,000 samples; in this example we just had 1,600-something. With 100,000 samples you get 10 milliseconds for computing the category breakdown and 3.6 seconds for computing the heaviest stack. 3.6 seconds per update is not acceptable, so we need to do something. Also, the JSON file is just massive, because of all those repeated stacks. So let's use a different JSON format. I made a v2 format. It still has samples, but instead of having the stack right in the sample, each sample just has an index, and this index points into a stack list. Each element in the stack list has a frame index, which points into the frame list; each frame has a name and a category index, which points into the category list. I hope that's not too overwhelming: we just have a bunch of indexes now. Instead of nested objects, we have some side-by-side lists and we index into them. The stacks are also a little special here because of the parent stack index. If, for example, a sample refers to stack number two, then this is the frame at the top of the stack.
Then we go to the parent stack and find its frame; that's the next frame on the stack. We keep following the parent links, putting each frame on the stack, until the parent is null, which means we're at the end of the stack. I hope I haven't lost anyone yet. So let's go back to the heaviest stack algorithm. We were iterating over the samples, stringifying the stack arrays, and comparing the JSON strings. Now we don't need to do that anymore: we have an index, and if two samples have the same stack index, that means they have the same stack. So we just use the stack index as the key, with no JSONification and no lookups of big strings. And this is a massive performance improvement: 300 times faster. The category breakdown is also affected by the new format. Instead of getting the stack and the frame directly from inside the sample, we get a stack index, look it up in the stack array to get a frame index, look that up to get a category index, and look that up to get a category name. That name is a string, which we put in the map as we add up the weight. But this string is kind of unnecessary: we know that if two samples have the same category index, we can use the index itself as the key. So I made an optimization to remove the name lookup, and now we're just accumulating weights per category index in this map. There needs to be some post-processing afterwards to get the names back for the category breakdown display, but that's outside our algorithm. All right. Earlier I had selected the small profile in format version one; let's switch to the same profile in format version two and do the selection again. Now we can select the full width and it's still very responsive. So here are our throughputs now: 47.1 nanoseconds per sample for the category breakdown is what I measured, and 51 for the heaviest stack. Okay, that's much better.
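A sketch of the v2 format and the index-keyed heaviest-stack computation; the example data is made up to mirror the tables described above.

```javascript
// v2 format sketch: samples refer to stacks by index, and each stack row
// has a top frame plus a parent stack index (null at the root).
const profile = {
  samples: [
    { time: 0, stackIndex: 2, weight: 1 },
    { time: 1, stackIndex: 2, weight: 1 },
    { time: 2, stackIndex: 0, weight: 1 },
  ],
  stacks: [
    { frameIndex: 0, parentStackIndex: null }, // stack 0: [a]
    { frameIndex: 1, parentStackIndex: 0 },    // stack 1: [a, b]
    { frameIndex: 2, parentStackIndex: 1 },    // stack 2: [a, b, c]
  ],
  frames: [
    { name: 'a', categoryIndex: 0 },
    { name: 'b', categoryIndex: 1 },
    { name: 'c', categoryIndex: 1 },
  ],
  categories: ['User', 'Ion'],
};

// Same stack index means same stack, so the index itself is the map key:
// no JSON.stringify, no hashing of big strings.
function computeHeaviestStackIndex(samples, rangeStart, rangeEnd) {
  const weights = new Map();
  let heaviest = -1;
  let heaviestWeight = 0;
  for (let i = rangeStart; i < rangeEnd; i++) {
    const { stackIndex, weight } = samples[i];
    const sum = (weights.get(stackIndex) ?? 0) + weight;
    weights.set(stackIndex, sum);
    if (sum > heaviestWeight) {
      heaviestWeight = sum;
      heaviest = stackIndex;
    }
  }
  return heaviest;
}

// Walking the parent links recovers the frame names, leaf first.
function stackFrames(profile, stackIndex) {
  const names = [];
  for (let s = stackIndex; s !== null; s = profile.stacks[s].parentStackIndex) {
    names.push(profile.frames[profile.stacks[s].frameIndex].name);
  }
  return names;
}
```

Sharing stack prefixes through parent links also shrinks the JSON itself, since each stack is stored once instead of repeated inside every sample.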
Let's see how far we can go; we want to see if there's more we can do here. So we use a profiler. I'm going to start the profiler. Oh, what I haven't shown you is how to use the profiler, so let me do that really quickly. If you use Firefox and go to profiler.firefox.com, you can click this big button here, which gives you a toolbar button, and when you click that toolbar button, it starts recording. So let's record our current implementation: do a little bit of this, capture a profile, and see where the time is spent. Well, where is the time spent? One second. Let's try that again; let me refresh this page. Ah, I can tell you where the time went: it is so fast that it barely shows up in the profiler, because we're still using the small profile size. So let's do that again. Capture profile. For localhost here there's barely any CPU usage; otherwise you would see more yellow in here. So let's switch to a bigger profile; we still had just the 1,600 samples, so let's switch to the medium profile. Here it still works okay, but it gets a little janky towards the edges. So again, we start the profiler, select, play around a little so that we get lots of samples, and capture the profile. And there we go, this is what I was expecting: now we have lots of yellow in here. I'm going to show just this thread, switch to JavaScript only, and switch to the flame graph. What we can see here is that we're spending time in "compute category breakdown with string key map" and "compute heaviest stack with map", and in both of them we're spending some time in Map.prototype.set, both over here and over there. That makes sense; we're assigning things to a map. So, can we not use a map? Wrong direction here. We're seeing the time in Map.prototype.set. For the category breakdown computation we're getting the category index out of the map and putting it back in. But we know these are integers.
They're indexes into the category list, and the category list doesn't have many elements, so we can just use an array here instead. I'm going to use a Float64Array, because the weights are floats. Using a typed array means the maximum number of elements is already preallocated and initialized to zero, so I don't need to check whether there's something in it already; I know it starts at zero and I can just add the weight. And that's it. We can make the same modification to the heaviest stack algorithm: it was also using a map, and we can use a Float64Array because we know how many stacks there are. Here the key is the index into the stacks array, and we use that key as our index into the array, and then it works as before. And as we see down here, it is three times faster to skip the map and use a typed array instead. Let's try that out. Here I'm going to switch from the basic implementation to the typed-arrays-instead-of-maps implementation. And now I select, and it's very smooth through the entire profile, and we have 500,000 samples here and we're still responsive. Let's see with an even bigger profile; this one has two million samples. How responsive are we? It's okay; it gets a little janky towards the end, but it's mostly okay. So where are we now? Let's recap. We've addressed the obvious slowdowns: we've done what the profile told us and fixed the hotspots, we changed the format so that comparing stacks is cheap, and we changed two maps into typed arrays, which got us a 3x perf boost. In the heaviest stack case the amount of memory we're using might be a bit bigger now, because we're allocating an array with an element for every single stack index, even if no sample references that stack index; so maybe some extra memory, but we have a performance boost.
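A sketch of the typed-array version of the category breakdown. To keep it short, a precomputed per-stack category column is assumed here rather than chasing stack to frame to category at query time; that precomputation is an assumption of this sketch, not necessarily how the talk's code does it.

```javascript
// Category indexes are small integers, so a preallocated, zero-initialized
// Float64Array replaces the Map entirely: no hashing, no has/get checks.
function computeCategoryBreakdownTyped(
  sampleStackIndexes,   // Int32Array: stack index per sample
  sampleWeights,        // Float64Array: weight per sample
  stackCategoryIndexes, // Int32Array: category index of each stack's top frame
  categoryCount,
  rangeStart,
  rangeEnd
) {
  const totals = new Float64Array(categoryCount); // starts at all zeros
  for (let i = rangeStart; i < rangeEnd; i++) {
    totals[stackCategoryIndexes[sampleStackIndexes[i]]] += sampleWeights[i];
  }
  return totals; // totals[c] is the summed weight for category index c
}
```

Mapping the resulting indexes back to category names is the small post-processing step mentioned above, done once per update outside the hot loop.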
And so we have the throughput here. Yeah. So for the medium profile, our throughput is like 16 nanoseconds per sample. Or let's see, sometimes it goes up and down a little bit. Yeah, let's say 16 nanoseconds for the category breakdown, 40 nanoseconds for the heaviest stack. I was seeing some other numbers when I was trying this at home. So it's pretty impressive. Modern computers are pretty fast, but maybe we can do even better. So let's try better. Let's go back to the category breakdown algorithm. We are taking these two values out of every sample. The sample is an object. It has three properties. We're ignoring the time property. We're getting these two properties out. So what does that mean at a byte level? How are arrays of objects stored in memory? Well, it depends a little bit on which JS engine you're using, how you're allocating the object, and whether you happen to be on a fast path or not. But in SpiderMonkey, this is what you might expect. We have a samples array, which is backed just by a list of pointers. Every pointer takes up 8 bytes on a 64-bit system, and it points to a JS object. So let's say the first entry in our samples array points to this JS object here. The JS object starts with a header, which in SpiderMonkey takes up 24 bytes on a 64-bit machine. Then, if we're lucky, we have the fields inline just after the header. We might not be lucky, but let's say we're lucky. We might also have a bit of padding here at the end, because the inline slots might only be sized to four or eight, and we're using three properties here, so there might be a bit of extra memory used up by that. So this is just one representation that we could have. It varies a lot by engine. For example, Chrome has pointer compression, so these things here might be four bytes each, but then the time field might be an extra pointer, because in Chrome, sometimes the floating point values are a separate heap allocation. The padding could vary, the object header size could vary.
These fields here could be behind another pointer if they're stored out of line, and so on. But anyway, what it comes down to is: we wanted to get these two fields here, 16 bytes in total, but what we ended up with is all of these other not-so-useful bytes clogging up our cache. When the CPU wants to get those bytes, it gets them in 64-byte chunks — a cache line is 64 bytes. So if you're getting this value here, you're getting the other bytes that are in the vicinity, even if you don't need them. Well, here we do need the JS object header, because the JIT needs to check that the object is of the right shape, and so on. But we really just want those values here. So can we do anything about that? We want to improve our cache line utilization, and we want to reduce the indirection. Maybe we can. Let's do something radical. Let's turn everything on its side. We have this array of objects. What we could do instead is to have an object of arrays, or struct of arrays, where we have just one key for the time column with a big array that has just the time values, one for the stack index, with just the stack index values, one for the weight, with just the weight values, and a length stored on the side. These arrays must all have the same length. So now everything's backwards. If we want to access the weight, in the past we had samples[i].weight. Now it looks a bit weird, because we have the sampleTable.weight column, and then we get the i-th element of that. But let's do it. Let's see where it goes. And so what we end up with here is a new profile format again. Now we have a sample table, a stack table, a frame table. The categories are still a list, because it's just some strings. And same thing as before: the stack index goes into the stack table, the frame index goes into the frame table. We just need to access the properties differently. So what does it do for the computation of the heaviest stack?
Here we were getting the stack index and the weight property from an object. Now we just get them from separate columns. And already we're seeing a 2x performance improvement. For the category breakdown, similar story. Instead of getting the properties from objects, we get the column first, access the i-th element, and get that. This here is even faster, like 3.5x faster. Let's see that in practice. So we're switching to format v3 now, struct of arrays. Let's get the medium-sized profile. And now it just flies. It's just responsive all the way. 4.5 nanoseconds per sample — that's really not a lot of time. This is super fast now. Let's get an even bigger profile. Still super responsive. So let's think about the memory model again, about how this is represented in memory. We're accessing these columns now, and we're accessing them in order. And what happens is that our cache lines are now fully utilized. We don't have object headers clogging up our cache anymore. We just have the numbers that we wanted. It's just super efficient now. We get all the stack indexes, we get all the weights. The time column is now pretty much irrelevant. It was clogging up our cache before, but now we're not accessing the time column at all. So it just doesn't bother us anymore. Okay, so let's recap quickly. We have a struct of arrays. Some people call it parallel arrays; it's commonly used in game engines, databases, and so on. It has a few drawbacks. It looks a bit backwards if you read it. Sometimes when you want to pass around an object, you need to manually materialize it, because you don't just want to pass around an index. It also means that the type system, at least in TypeScript, is now less of a help. We can introduce mistakes that it wouldn't catch. For example, if we build up our arrays and we end up not putting our values in every one of the columns, we end up with mismatched lengths, and that is hard to catch at the type level.
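The struct-of-arrays layout just described can be sketched like this — the column names mirror the talk's description, but the exact field names and table shapes are assumptions:

```javascript
// One typed-array column per former object property, plus a shared length.
const sampleTable = {
  length: 3,
  time:   new Float64Array([0, 1, 2]),
  stack:  new Int32Array([0, 1, 1]),
  weight: new Float64Array([2, 3, 1]),
};

// The stack and frame tables are struct-of-arrays too; here we only need
// the stack table's `frame` column and the frame table's `category` column.
function computeCategoryBreakdownSoA(sampleTable, stackFrameCol, frameCategoryCol, categoryCount) {
  const breakdown = new Float64Array(categoryCount);
  const { length, stack, weight } = sampleTable; // pull the columns out once
  for (let i = 0; i < length; i++) {
    // The loop walks two contiguous typed arrays in order, so cache lines
    // are fully utilized and the untouched time column never enters the cache.
    breakdown[frameCategoryCol[stackFrameCol[stack[i]]]] += weight[i];
  }
  return breakdown;
}
```

Note that samples[i].weight becomes sampleTable.weight[i] — backwards to read, as the talk says, but friendly to the cache.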
Also, when we pass around indexes, sometimes, yeah, you get a number and you don't really know: is this an index into the stack table, into the frame table? I don't know. The type system, at least in TypeScript, I don't think is well set up to catch these kinds of mistakes. But it's much more cache efficient. It's easier on the garbage collector: you need to traverse fewer objects, and some engines can skip over the contents of arrays of numbers, so it should speed up that too. Less memory overhead from object headers and padding. And we can treat columns separately. Sometimes we want to make a change to just one column. Let's say we want to shift the entire profile by some time delta. We can change just the time column. The other columns stay untouched. We don't need to recreate any objects. And it also gives us a little more control over sizes and how compactly our numbers are stored. We can pick our typed array: we could pick an Int32Array, we could pick an Int16Array. If we know what the domain of our values is, we can store things more compactly, and we get back in control of the representation. Okay. I want to make it even faster. If we look back at our category breakdown, we're getting the stack index, we're getting the frame index, but it's all just to look up the category from the frame table. We're not really interested in the stack or the frame. We just want the category for a sample. So what if we just got the categories for each sample and used that instead? Instead of going stack, frame, category, just go category, boom. Well, it would be great if we had this column. Where does it come from? Well, we can compute it here with the getSampleCategories method. We iterate over all the samples. We do the stack-frame-category conversion here. We cache that in the sample categories column. We pass that to our existing function, but we only want to do this once, not on every call. So we need to cache it somewhere.
We can use memoization for that. So here's a memoized call. We get the profile. We only run this once. So if we call this multiple times with the same profile — let's say our profile is immutable — we have it cached from last time. And we can make the caching even more precise. If we memoize a function which takes just the columns that we need, then we get it. We wrap this into the existing getSampleCategories function, which takes the profile, but then it takes out the columns we want and passes those separately to the memoized function, and that makes the caching even tighter. If you touch a column that is not involved, you don't invalidate your cache. And did it work? Yes, it did. Oops, wrong direction again. Memoized sample categories: we're now down to three nanoseconds. So I'm basically done with the talk. Let's just look at the graph here at the end. This V1 graph is off the charts like this. It's way higher than this. But we made it faster with every change here. And this last step of caching the categories for each sample, it looks like it's not much, like 25% on these nanoseconds. But what it actually means is we can handle more data. We can handle a higher count of samples in, let's say, a 16 millisecond interval. And 25% more data, that's massive. Okay, I want to say really quick: what is data-oriented design? It's a mindset and it's a collection of techniques. The main technique here is struct of arrays. The mindset is more about how you think about it. The shape of the data determines the algorithm and its performance. You need to know which things are small, which things are big. We might have seven elements in this array and 100,000 in that array. If you keep that in mind, you're better set up to write fast code. And if you also think about cache line utilization, you're even better set up. The rest is not that important. Thanks, everyone.
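The column-keyed memoization can be sketched like this — a hand-rolled last-call cache standing in for whatever memoization helper the real code uses; all names are assumptions:

```javascript
// Cache the result of the last call: if the same arguments (by identity)
// come in again, return the cached result without recomputing.
function memoizeOne(fn) {
  let lastArgs = null;
  let lastResult;
  return (...args) => {
    if (lastArgs !== null &&
        args.length === lastArgs.length &&
        args.every((a, i) => a === lastArgs[i])) {
      return lastResult;
    }
    lastArgs = args;
    lastResult = fn(...args);
    return lastResult;
  };
}

// Memoize on just the columns involved: touching an unrelated column
// (say, shifting the time column) does not invalidate this cache.
const getSampleCategoriesFromColumns = memoizeOne(
  (sampleStackCol, stackFrameCol, frameCategoryCol) => {
    const categories = new Int32Array(sampleStackCol.length);
    for (let i = 0; i < sampleStackCol.length; i++) {
      categories[i] = frameCategoryCol[stackFrameCol[sampleStackCol[i]]];
    }
    return categories;
  }
);

// The outer function takes the whole (immutable) profile and unpacks
// the relevant columns before calling the memoized worker.
function getSampleCategories(profile) {
  return getSampleCategoriesFromColumns(
    profile.sampleTable.stack,
    profile.stackTable.frame,
    profile.frameTable.category
  );
}
```

Calling getSampleCategories twice with the same profile returns the very same cached array the second time, so the stack-frame-category walk runs only once.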
You can find me in the Firefox Profiler channel. You can check the Firefox Profiler online. Happy profiling!
From Google AdSense to FOSS: Lightning-fast privacy-friendly banners
Good morning. I'm Tim. I work as a performance specialist at Akamai, but that's not what my talk is about. Everybody here in this room has two things in common, I assume. First, we love web performance. And two: how many of you are already thinking about food? Because I'm starving. And actually, if you don't know what to eat in the next days while you're here in Belgium, there is this waffle burger at a Belgian restaurant called Quick. And if we are performance focused, Quick is also a nice way to get there. Now, next to my day job at Akamai, I also run the largest scale modeling website in the world, with 50,000 visitors a day and 6 million page views a month. It's a bit too big for the talk of Tvetan earlier today. And it's not only the largest scale modeling website in the world. It's also the fastest one. And this — thank you. Thank you. Thank you. And this, despite the fact that I run banner advertising, because normally banners mean slow, slow, slow and annoying for your end users. And this talk is all about how I switched to an open source ad server solution in order to give my users better privacy, and then, because I also love performance, made sure that the performance is lightning fast. Who remembers this day? Anyone? Yes. GDPR. Correct. This was when GDPR, almost six years ago, was introduced. And if we travel back in time to six years ago: my website back then used Google AdSense and a few other ad serving solutions. And what is great about these solutions: you can just add some JavaScript on your website, and you start earning money. That's it. Now, the problem is that when you then look at your waterfall, you see all these extra requests to third parties, third parties calling third parties, fonts are downloaded, CSS, JavaScript, cookies are set, tracking cookies — a lot of stuff happens. And this is a tool by my ex-colleague Simon Hearne, RequestMap, that shows you — the blue circle at the bottom is the actual website.
And then you have all these spiders crawling off — additional requests going to additional things. And from a privacy perspective, this is not ideal. And this is all you need to do to create a nice banner of, in this case, a hamburger. Now, when I started, this was how my website looked, and I was basically chillax. This was just how the web worked. This was the only way; there was no different way. This was just how the web worked. Now, in April 2018, one month before GDPR, I was a little bit in panic. I was hoping that the ad providers, not only Google AdSense but all the others, would come up with a privacy friendly version for Europe, and would therefore also make the websites faster. And in April, nothing was moving. So I looked for a plan B. And luckily, I was able to find a plan B which was open source: Revive, an open source ad server. And why did I pick it? It was PHP based. My website was PHP. So that's good. It was already five years old. So it was not brand new; it was already proven. And it had fairly stable releases at a regular cadence. Today, this open source project is maintained by the Aqua Platform, by Erik and his team. They also run, of course, a paid hosted version of the solution. But I use the free download version. So what can you do? Very quickly: everything you expect from an ad server. You can manage your campaigns. People can sign up to start placing ads on your website. Basically, everything which is needed to serve ads on your website is there. And this is the result. Remember before, that spider going everywhere — now everything is hosted on the same domain. So from my privacy perspective, I was back in chillax mode. Now, let's talk about performance. Just implementing the open source solution on my own systems already gave me some performance gains by design. And the first is here: Revive itself does not require all these requests. So that's the first thing.
But as you can see, what is missing here are things like DNS lookups, TCP connections and TLS handshakes needed to talk to different systems. So that basically means that everything which is needed to serve that hamburger banner as soon as possible is not delayed, which is good. The other benefit: we already talked about INP before, and JavaScript performance. The library is, Brotli-compressed, only 1.7 kilobytes. And typically, the more JavaScript bytes you ship, the worse for things like INP, First Input Delay, Total Blocking Time. So it's a fairly small library. Other things: I work for a CDN, so I can run my website on the CDN. So I also use the image optimization services to make sure that I return modern formats like AVIF or WebP, et cetera. And then finally, last but not least, the fact that everything is under my control also means that I fully control priorities: things like fetchpriority high, fetchpriority low, preload, the order in the page. I fully control the order of things, and I decide: do I want the banner to be served first, or do I want the actual content to be served first? This of course assumes that your web server or your CDN listens to the priorities. Now, this was the basics. Just setting up Revive: great for performance, great for privacy. Now, good is not good enough. And in order to get these very, very good results, you still need to do a little bit more. So let me explain that. We'll first look at LCP, or Largest Contentful Paint. Just as an example: what is the LCP element on this page? It should be fairly obvious. It's the largest image on the screen, which is that nice helicopter which I'm currently building. Now, that's easy. Second one. What is the Largest Contentful Paint element here? Sorry, it's early and I'm hungry. It's actually, as expected, the top one, because that's the biggest image.
Now, this is not what my users perceive as the LCP element, because they come for that small picture of the car. Now, what is the problem? This image is late discovered. It first needs JavaScript to run, then it needs to do a request to a PHP server to know which ad to serve. And only then will the image download. So it's late discovered, and it means it will come in potentially a few seconds later. So what is the best solution? Just send more bytes. My website is driven by a lot of contributors. So when somebody uploads a smaller image, I basically nudge some other people like: hey, do you have a bigger image of this one? So my LCP gets better. And not only my LCP — people also like to watch nicer pictures. Now, that's plan A. That's the best. But I'm not always sure it actually happens. So sometimes I do have pages where the images are too small compared to the banner. And what is my plan B? I call that a fast fallback banner. And it's exactly what it says: it's fast, and it's a fallback. In order to make it fast, you need to make sure that it's early discovered. So it just becomes a standard image tag. In my PHP code I check: hey, when I generate this page, I already know the size of the image I will embed. Rather than using the JavaScript based version, which is slow, I fall back to a default image variant. The downside is that from an ad perspective, I can no longer track revenue. I no longer know exactly which banner should be targeted, yes or no. So I lose some functionality. But typically on a website, you don't always sell all your potential banner locations. So you have some anyway — for example, I sponsor certain scale modeling events, or I have some coffee mugs of my website with internal banners. So I can perfectly display these non-revenue-generating banners, but keep performance.
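The fast-fallback decision can be sketched like this — in JavaScript rather than the site's actual PHP, and with all names, fields, and the ad server URL invented for illustration:

```javascript
// Server-side render of a banner zone: either the slow JS-based ad tag,
// or a fast fallback -- a plain <img> the browser's preload scanner
// discovers immediately, with no ad-server round trip.
// All property names and the URL below are invented for this sketch.
function renderBannerZone(zone) {
  if (zone.useFallback) {
    // Early discovered, no JS, no ad-server request -- but also no revenue
    // tracking or targeting, so reserve this for house/sponsor banners.
    return `<img src="${zone.fallbackSrc}" width="${zone.width}" ` +
           `height="${zone.height}" alt="${zone.alt}">`;
  }
  // Late discovered: the script must run first, then ask the ad server
  // which banner to show, and only then does the image start downloading.
  return `<script async src="https://ads.example.com/zone/${zone.id}.js"></script>`;
}
```

The page generator picks the fallback branch whenever the banner would otherwise become the (late-discovered) LCP element.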
And here, what you see is that request number four is the LCP element, which is requested quite soon rather than somewhere at the end. That was for LCP: making sure that it's green under all conditions. Next is CLS, Cumulative Layout Shift. And this is something everybody knows, typically on newspaper websites: you're looking at a page, you're reading the content, and then suddenly, bam, everything goes down because the banners start loading. Now, the solution for this is quite simple. Just add a placeholder, so that the browser, while rendering and painting everything on the screen, already reserves the room for the banner. Nothing special. Now, unfortunately, this was not good enough. Why not? Because in ad systems — in all ad systems — you basically have the choice between user experience and making more money. The top one is the fixed zone. You basically say: hey, in this location, when it's a fixed zone, I only want to show banners which are this size, 300 by 250 pixels. But you can also have flexible zones. Here I can define: you know what, my design allows 300 pixels wide, but I can show bigger banners, smaller banners, a variety of things. From a money perspective, this is better: the bigger the pool of ads you can potentially serve to your users, typically the more money you make. The top one is better for the end user experience, because you know: hey, my placeholder is always this size. Which one did I implement? Of course, the top one. Now, a new problem arrived. The page is rendered. You see the nice placeholders. And then suddenly this happens. Watch carefully. Everything moves to the top. Which sane browser would do this? Safari, Chrome, Firefox? All browsers are sane. However, ad blockers are not always sane. Ad blockers assume that when you have advertising, they should try to remove everything. So what happens is they detect the ads on my website.
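The placeholder idea above amounts to reserving the slot's box up front, on an element the page itself owns; a minimal sketch, generating the markup as a string (class names and sizes are illustrative, not the site's real ones):

```javascript
// Wrap the ad markup in a container that carries the reserved dimensions.
// The browser lays out the fixed-size box before any ad loads, so nothing
// below it shifts when the banner arrives -- or when it never arrives.
function renderAdSlot(width, height, adMarkup) {
  return (
    `<div class="ad-slot" style="width:${width}px;min-height:${height}px">` +
    adMarkup + // the ad server's element lives inside the reserved box
    `</div>`
  );
}
```

Keeping the size on an outer wrapper rather than on the ad element itself is what matters for the next problem the talk describes.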
Although they're privacy friendly, although they're fast, they get removed. And you have this shift. So how do you solve that? Not by blocking the ad blockers. If my users want ad blockers, that's fine. That's okay. They are free to use that. The solution is to add an additional container around your ad. So this is the ins element — that's the ad. Make sure that the container carries the placeholder sizing. And then when the ad blocker arrives and deletes the ad, the container is left. So no layout shifts. And this is really my mantra: CLS should really be reduced to zero. Every single pixel which moves is in my view a bad thing, an annoying thing. CLS should really be reduced to zero. So we covered privacy. We covered performance. Now let's look at the revenue perspective. Because in the end, I need the money to fund the hundreds of dollars which are paid every month on the server cost. And when I started, it was easier: I just used AdSense. So, banners do not mean bad. You can implement them in a positive way. If you have full control with open source, you are perfectly able to do that. And it's also possible to make that lightning fast. Now, I didn't get any money for this. So, I'm really dreaming already about this burger later on today. There is just one small problem. It's Robin. Robin, put your hand up. Robin is the next speaker. Robin is my colleague. And we also call him Mr. Quick, after the Quick restaurant, but he normally works on the HTTP protocol and he hates it when I call him Mr. Quick with the K. He stands between us and lunch. So, Robin, please talk fast so we can all go have a great lunch. Are there any questions? Yes. Thank you. So, I've heard that your Scalemates is very popular on various continents. Do you need a replicated setup somewhere on each continent, or...? Yeah, okay. Yeah.
So, the question is: Scalemates, my website, is visited by people across the globe — here in Belgium, in Australia, Japan, Brazil, everywhere across the globe. And the question was if I need to have a replicated setup. So I use a CDN that gives you replication for static content, images, etc. So, that's a given. But I actually also replicate my servers across the globe — not all of them. I have, for example, servers in Australia and in Japan, to make sure that when a user does a database call or does a search, they get an instant response. Thank you for the question. We have a few more minutes, I think, for questions. Two minutes. Any additional questions? Yes. Yes, but the... Yeah, great question. So, the question is: in Revive, which ad providers can I introduce? In theory, in Revive, you could also make a non-privacy-friendly version, because you can also say: hey, in case I don't have any direct inventory — let's say, for example, with a scale modeling company — you can also decide to fall back to, for example, AdSense or anything else. And the only thing you need to do is add their JavaScript and your advertiser code. So in theory, you could integrate any SSP. But then you're back into the same game: then you have a performance impact and a privacy impact. So, Revive allows you, potentially, to do everything. Does that answer your question? Thank you. There was one question in the back as well. Yeah. So, the question is which frameworks or modules did I use to build the website, or just for the advertising? Everything built myself from scratch. Yes. Yes, I... Yeah. Yeah. Everything built from scratch. The only thing I used was jQuery, and I still use jQuery on some admin sides of the thing. But yeah, saying jQuery in 2024 is not cool, but I'm okay with that. So, everything, yeah, PHP, built from scratch. Thanks for the question. Any additional questions?
Robin can maybe already come up as the next speaker. There was one question in the back. Yes. You can already switch to my laptop, Robin. Yes. Just to ask: you are negotiating with these advertisers directly, or...? Yes, correct. Yeah. So, the question is: how do I get in touch with these advertisers? Because before, any banner would just show up. So, Revive also has an API. On my website, you can sign up, create an account and register as a business account. And then I have a simplified interface where you can just upload the banners, and you can say: hey, is it for all scales or for specific scales? Are you targeting all scale modelers, or just the aircraft ones, or the shipbuilders? So, I have a simplified interface and they just sign up themselves. Thank you for that question. Yes. So, the question: have you had to deal with bad ads and bad actors? Yes and no. So, the question was: do I have to deal with bad ads and bad actors? I also have a shop database, and I basically already have a database of domains which are from scale modeling companies. So, when somebody signs up with a Revell — which is a brand — domain, I basically know that it's linked to a real company, so I can give them some confidence. If I'm unsure, they can already start creating their campaigns, but I still need to enable them before they're actually published. And people can't add JavaScript on the website. In Revive, you can add JavaScript banners, but I blocked that, because JavaScript is bad for performance. Does that answer your question? Thank you. Thank you very much and have a great lunch. Thank you.
Insights from the RUM Archive
Oh, the last talk of this session is about the RUM Archive, which is a data set of anonymized real user monitoring measurements. Now I know what some of you are thinking: Robin, if it's a data set, why does it have a palm tree in the logo? That doesn't make any sense. But think about it for a second. What happens if you go to a palm tree and you shake it? Something interesting might fall out, like a coconut. And the same thing happens with the RUM Archive. If you shake it a little bit, something interesting might fall out. But for both, you need to be a little bit careful. Because if you're not, the coconut might fall straight on your face, leaving you scarred for life. So we need to be a little bit cautious in how we query the RUM Archive, and we'll get to that later. The first thing I want to explain is what is actually in there. How do we get the coconuts in there? So currently, all the data is from the Akamai mPulse product, which basically means we have a lot of Akamai customers that have mPulse, and they let us put a piece of JavaScript on each of their pages. So every time a page is loaded, we send what is called a beacon, which contains all the performance measurements and a lot of other metadata for later analysis. Now, usually, our customers only see their own data, obviously. And here, we want to make this more publicly available. So we have to do a couple of things first. First of all, we filter the data: we only take the top 100 customers in terms of traffic. We anonymize the data. This includes stripping all of the URLs, so you won't know which measurement belongs to which site — a sad but necessary operation. And then we further aggregate the data, so that many similar measurements are combined into a single histogram for later analysis. This gives us two data sets: one, of course, for the page loads, and then one for third-party resources.
These will be things like Google Analytics that are loaded from external URLs by many different customers, and so we can also offer some insights on those. We have most of the performance metrics you would expect, including some others, like rage clicks. This is when people get very frustrated: they start clicking the same area of the screen trying to make it work. For the third-party resources, we can also show if they were loaded from the cache or not. Very interesting. But crucially, one of the things we try to make a difference in is that we collect data from all the different browsers on all the different platforms. And you can, of course, also query on those in the data set as well. Now, you might be thinking: Robin, sounds fine, but don't we already have this from other public data sets? And partially, yes, this is true. We are blessed with very good web and web performance data sets, but we still feel that there are some gaps in there, gaps that we hope the RUM Archive might help fill, especially when it comes to things like cross-browser and real-user monitoring data. So let's say you're interested now and you ask: how do I actually get access to this data? The main way is through Google BigQuery, where most of the data is stored. BigQuery is a very powerful, very flexible platform. It's sadly not the cheapest — it does cost you a bit of money. And even if you're willing to pay, it can take a while until you get useful data out of it, which is something a colleague of the Mozillians here today noticed a while ago. The reasoning was sound: they were trying to look for user agent Firefox on device mobile, expecting to get Firefox mobile data, obviously. It doesn't actually work, because in the RUM Archive, Firefox is really just Firefox desktop. If you want mobile, you need Firefox Mobile for Android and Firefox iOS for iOS. This is because we at the RUM Archive put stock in consistency above all things.
Now, especially for newer users, going to BigQuery directly is sometimes a bit of a big hurdle. So we also have a cheaper way, which we call the RUM Insights. This is basically the team saying: okay, this is what we think most people will want to know about this data. We do the queries, and then we have some ready-made visualizations on the website for those as well. That also eases the access. Sadly, though, even the RUM Insights don't really help much for the Firefox mobile use case. As you can see, Firefox in our data set is definitely present on the desktop side. On the mobile side, none of the variants actually hit the 1% cutoff that we put in place for generating these diagrams. This is one of the many insights we can get from this data set, of course. Because having a nice coconut is all nice and dandy, but you can't really do much with that, right? What you really want is to get to the juicy inside of the coconut — in this case, the coconut milk. Now, I can hear some of you thinking: Robin, there is no such thing as coconut milk, okay? Coconuts cannot be milked. They do not have nipples. And you would be correct, for the latter part of course. But there are still ways to get milk out of this. You know, you could hit it with a machete. Or, if you're a bit more sophisticated, you could hammer a screwdriver into these black spots there. You could still get something out of there. The point is: there are many different ways of getting the milky insights out of the data nut, but they don't all give you the same results. And a good example of this I found when I first started querying the RUM Archive. I just wanted to know, roughly, mobile versus desktop: what are we dealing with here? And when I plotted that out, I actually saw this weird periodic pattern. You have these bumps and valleys in there, which seem to suggest that people switch the type of device they use every three months, which of course makes no sense, okay?
And anyone who's ever done this kind of analysis already knows what this is. This is of course just a bit of temporal interference. Because what I did not want to do was have a separate data point for each and every day — that would be way too expensive in BigQuery, right? So what I wanted was just one day per month. And naive as I was, I chose the first day of every month. Now, this is not always the same day of the week, of course. This can very easily be a Saturday or a Sunday or a holiday, where you would expect more people to use mobiles than desktops. The solution is of course also very simple: instead of the first day, we just use the first Tuesday of the month. Not the Monday, because that's often also still a holiday or a vacation day. But Tuesday should give us more consistent results. It's not fully foolproof though, as I found out. The first of July last year was a Saturday, so the first Tuesday of July was the fourth of July, the big US holiday. And that definitely does show up in these metrics. But this is not just something specific to the RUM Archive; every temporal data set has this. I think it bears repeating, though, because people keep making the same mistakes there, including me. Now, diving a little bit deeper, looking at the different OSes that we see: on the desktop side, it's probably somewhat as you might expect. But on the mobile side, we have a very outsized representation of iOS devices, at nearly 63%. And I say outsized, because if you look at the actual sales numbers, globally, iOS fluctuates between about 15 and 20%. Even if you look at some of the richer countries — let's say Australia — you'd expect a more 50-50 split. There are several reasons why iOS is overrepresented in the RUM Archive. One of the main ones is that Akamai, as a company, is mostly present in the richer Western countries, right?
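The "first Tuesday of the month" pick is easy to compute; a small sketch in JavaScript (working in UTC to stay independent of the machine's time zone):

```javascript
// Returns the first Tuesday of the given month as a UTC Date.
// month is 0-based, as in the JS Date API (6 = July).
function firstTuesday(year, month) {
  const first = new Date(Date.UTC(year, month, 1));
  const day = first.getUTCDay();      // 0 = Sunday ... 2 = Tuesday
  const offset = (2 - day + 7) % 7;   // days from the 1st to the next Tuesday
  return new Date(Date.UTC(year, month, 1 + offset));
}
```

For July 2023 this lands on the 4th — the US-holiday edge case the talk mentions: the 1st was a Saturday, so the first Tuesday was the fourth of July.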
And our customers are mostly from industries like e-commerce, luxury goods, and travel, which also target richer end users, who are more likely to be on, say, iOS devices. So there is definitely an ingrained bias in the current RUM archive data set that you need to be aware of. But that doesn't mean the data isn't useful, in my opinion. We can still do a lot of interesting stuff with it. For example, I think this serves nicely to highlight one of the big problems I feel we have in web performance right now, which is our maybe somewhat overreliance on the Core Web Vitals and the Google CrUX data set. You might not know this, but on iOS you actually have no browser that can give you Core Web Vitals metrics, not even Chrome. This is because on iOS, every browser is actually Safari in disguise. Apple forces you to use the underlying WebKit engine, which does not support the Core Web Vitals. And so the more iOS traffic you have, the bigger your blind spot for those users is going to be if you only use the Core Web Vitals and the CrUX data set. And you might say: Robin, that's only a problem for the customers represented in your RUM archive. And I would argue that the RUM archive currently maybe does not represent the global web, but I do think it's somewhat representative of, for example, the e-commerce industry, which is definitely one that we consistently target when we talk about web performance. So I do think this can lead to interesting insights on that front. There is a silver lining to all of this. As you probably know, the EU is trying to force Apple to properly allow other browsers on iOS. Apple is dealing with this in one of the most disgusting ways ever, in my opinion. So I'm not quite sure how much this is actually going to change in practice, but still, it is a step in the right direction. Okay. And even if this doesn't happen, we can still do some cross-browser comparisons by looking at other metrics that are readily available in all browsers.
And we actually started doing this in the RUM archive, because we have those metrics, of course. I had hoped to present them to you today. But we want to be sure that we are 100% correct in our interpretation before we release any kind of summary on that. So not yet, but soon; we are working on this. I don't want to leave you hanging for today, though. I do still want to give you something to take home. And this is because there is a shining ray of light in the darkness. A couple of months ago, Firefox actually announced that they will now start implementing Largest Contentful Paint, the first Core Web Vital available in a non-Chromium browser. And this actually went live in stable Firefox about two weeks ago. And we already have some of that data in the RUM archive, which I looked at. And if you compare this, you will see that Firefox is actually faster than Chrome for LCP, sometimes a little bit, and at the later percentiles significantly faster than Chrome. Now, what I think this means is that Firefox has won the browser speed wars, and we should all immediately switch to Firefox and dump Chrome. No, it's much too early for that. We don't know if this actually means that Firefox is faster, or if they just use a slightly different algorithm, or they identify different elements, or it's just a different type of site that Firefox users visit. We don't know, right? So don't read too much into these results. I just wanted to have something to start the discussion, to get people to actually look into what the core reasons for these results are. But so, useful things for the future. We're talking about Core Web Vitals, so you might ask: Robin, what about the upcoming INP? INP is actually already well supported in the mPulse product. So you can see here, this is an INP screenshot from the previous speaker's website, Tim's Scalemates. You can see Tim has a ton of work to do.
He claims he has the fastest website in the world, but we can all see the proof that it is not true. Shame on you, Tim, shame on you. So INP is in mPulse, it's just not piped through to the RUM archive yet. We expect this to happen in the coming months, and then we can also start analyzing data for that. So up until now, I've mostly been talking about the milk in the coconut, right? But we all know there is something else in the coconut as well: the flesh, the meat of the coconut. We rarely eat this directly. We usually process it into other foods, such as, for example, these delicious coconut cookies. These are actually kind of a Flemish specialty, I think. We call these "rotsjes". I think they are amazing, amazing cookies. Now, one thing you might see is that there are several individual cookies in this box, right? But they all look kind of the same. They're all quite similar. And sadly, that is also something that we see for the third-party resources that we have in the RUM archive. Because if you start looking into this, a lot of them are from Google, as you might think. Most of them are ads, or tracking, or analytics, right? Most of these are things that the typical end user would probably prefer not to see loaded on the pages they visit. So it's a little bit ironic that we have to go all the way down to number 98 to find the first sign of something that was created to try and mitigate some of this, which is the very first cookie consent manager, the GDPR backwash, let's say. I say "try" to deal with all of this; I'm a bit skeptical that it actually works. But I mean, the fact that you have nearly 100 entries before the first cookie consent manager, I think, is a nice one-slide summary of some of the things that are wrong with the web today. Now, this was a bit of a downer, so I also wanted to end on a better note.
So I went through the whole list, and almost at the end, at number 498, I found something that we all were hoping to see, which was, of course, the jQuery mouse wheel plugin. With 13,000 downloads every single day, half of that from Tim's site, as we just heard. So jQuery is still going strong. Let's hear it for jQuery. As I said before, we also have some other stats on these third-party resources. For example, how often they're loaded from cache. And at the median, this is actually quite low: it's only about 2%. I definitely think that browser cache partitioning plays into this. It gets better at higher percentiles. But so, most of these third parties are not actually loaded from cache. This might not be a huge problem, though, because most of them are also quite small. Most of these are tracking pixels that are just a few hundred bytes in size, though there are definitely outliers. One of the bigger ones that I found was a Google Ads JavaScript that was 131 kilobytes compressed. That's massive. And that was loaded over 260,000 times in a single day. So a very big impact just from that one external resource. Now, we had a lot of different resources, a lot of different cookies. Another thing that we have a lot of is browser versions. Because browsers, a few years ago, started updating themselves fairly regularly. So for example, Chrome releases a new version almost every month. And the question there was: how long does it take for most users to switch to the new version? The answer is actually quite good, because what we see here is that within two weeks, within the month, over 75% of Chrome users are on the latest version. And most of the remaining ones are on the previous version. There is only a very short long tail of versions present in the dataset. This is very similar for Firefox, which also updates very aggressively. But here we do see one interesting data point, which is the blue one here, which starts in August.
And even in December it still had about 13% usage. And it turns out this is something they call the Extended Support Release, which I think is a long-term support version. Probably mostly used by companies, I would imagine. So you do have a bit of a longer tail there. But other than that, Firefox is also very cutting edge, I would say. This is of course contrasted with Safari. It's not an entirely fair comparison, because with Safari we don't have the minor version numbers like with the others. But here we still see the global trends, right? The latest version of Safari is 17, and even after two months it didn't even reach 50% of the Apple product population. And if you look, even version 15, which was released over a year ago, is still at about 14% of all page loads. So clearly in Safari you do have a lot of older versions, up to a year and even older, present in your dataset. You can't really rely on newer features being readily available there. A very fun one was Facebook. They have a ton of versions, often multiple per week. And their clients apparently also update to the new versions very, very quickly. Meaning that I often only had one data point per version, which messed with my graphing library. It tries to draw a line, finds only one point, and then decides to just draw nothing at all. Now interestingly, this is exactly what would happen if you would leave me alone with these cookies. You would know that there was supposed to be something there, but there's no physical evidence of it whatsoever left. So, a couple of other things. This is again from Tim's website. Tim has his own very extensive RUM setup, as you by now all know. But even for people with their own RUM, I think it's useful to have the RUM archive next to that, so you can compare both of these. For example, this is for the navigation types dimension. The biggest part is normal navigations: you click a link, you go to the site. You can also have back-forward navigations.
People press the back button, which should be much faster because the page should still be loaded somewhere in the browser. And then you have things like reloads, so people actually hard reloading the page. Now, for the back-forward navigations, you want to see as much of that as possible, because that is the fastest navigation people can get. You can see here that Tim has clearly optimized very well for this use case, because he has a lot more people doing back-forward navigations than the averages that we see in the RUM archive. So good work, Tim. The same goes for reloads. Reloads you actually want as few of as possible, because when people reload, it usually means something has gone terribly wrong and they're doing the "have you tried turning it off and on again" method to try and fix it. Tim is only at about 1% there, which is much lower than what you see in the aggregated data as well. So it can be useful, even if you have RUM, to compare, to see where we might improve or where we are actually doing better than others. Or let's say you want to move into a new region or a new country that you don't have RUM for yet. You can try and get some idea of what the situation is there before you actually do. And so, to thank Tim for everything that he does for the web performance community, I actually brought him a little gift. It's a palm tree scale model, Tim. I don't really know what to do with these. Maybe you can have one of your tanks drive over them or something. I don't know. But so thank you, Tim. Thank you for that. Another thing I really wanted to look at was single page apps. I have to admit something: I am still on Twitter. I still call it Twitter as well. And if you're on Twitter, sometimes it seems like everything is React. Nothing else exists on the web anymore. All of it is React. All of it is single page apps, which I really hope is not the case.
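[Editor's note] The navigation-types comparison just described is simple share arithmetic. A minimal sketch in Python; the function name and the counts are invented for illustration, with a reload share around the ~1% figure mentioned for Tim's site, not actual RUM archive numbers:

```python
def nav_type_shares(counts: dict[str, int]) -> dict[str, float]:
    """Turn raw per-navigation-type counts into percentage shares."""
    total = sum(counts.values())
    return {t: round(100 * n / total, 1) for t, n in counts.items()}

# Hypothetical per-type page-load counts for one site, in the spirit of
# the Navigation Timing "type" dimension (navigate / back_forward / reload).
site = {"navigate": 700, "back_forward": 290, "reload": 10}
print(nav_type_shares(site))  # {'navigate': 70.0, 'back_forward': 29.0, 'reload': 1.0}
```

Running the same function over your own RUM counts and over an aggregate gives exactly the "am I better or worse than average" comparison the talk describes.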
But when I looked at this, I was somewhat surprised, because more than 40% of all page loads in the RUM archive are actually single page apps, which is much more than I would have thought. Now, for web performance people, this is actually good news. This means we have a lot of job security down the line. So that's good. It's a little bit weird. And another very interesting point here is the difference between hard and soft loads. So hard means the initial load of the single page app, the spinner that we saw before. That's basically the hard load. The argument being: you download more, it takes a longer time to load the very first time, but after that everything is much faster. That's usually the selling point for an SPA. But if you would take this at face value, you would say that for every hard load there was only one soft load after that, where you would expect a lot more soft loads than hard loads. And if that is actually the case, then that whole argument for SPAs doesn't actually hold true at all. Now, that's not what I'm saying. We need more research. I need to look deeper into the data. There could be other explanations for this. But it's interesting to think about. I definitely did not expect these results, and I would love to compare these with other datasets as well. I'm running out of time. I had a little bit about HTTP/3 there as well, including some things where I got very angry. But let's skip that, because I really want to get to this final page. Because we all know that coconuts are amazing. They are exceptionally delicious. You can make a lot of different products from them. But you can make them even better if you combine the coconuts with something else. For example, delicious Belgian chocolate. I think you can get into very much a 1 plus 1 equals 3 situation with this. In case you haven't tried this Belgian coconut chocolate, it is to die for. Definitely try it out. What I'm trying to say is that currently the RUM archive only has Akamai mPulse data.
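[Editor's note] Coming back to the hard versus soft loads for a moment: the surprising observation above reduces to a single ratio. A small sketch in Python; the counts are invented to roughly match the talk's near-1 ratio, not real dataset values:

```python
def soft_hard_ratio(hard: int, soft: int) -> float:
    """Soft (in-app) navigations per hard (full bootstrap) SPA load."""
    return soft / hard

# Invented counts. The usual SPA pitch predicts a ratio well above 1
# (one expensive bootstrap amortized over many cheap route changes);
# the talk's surprise was seeing a ratio close to 1 instead.
print(soft_hard_ratio(hard=1_000_000, soft=1_050_000))  # 1.05
```

If the ratio really sits near 1 across sites, the "pay once, navigate cheaply many times" argument for SPAs is much weaker than commonly assumed, which is exactly the open question the talk leaves for further research.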
We are very much open to other RUM vendors, or even large sites with a big RUM presence, contributing data to the dataset as well, to hopefully help us remove some of the biases that we've seen and get a better picture of the actual global web in the RUM archive. Some of you might think: sounds interesting, Robin, but this is going to be a lot of work, isn't it? No, no, it's actually super easy, barely an inconvenience. Because if you look at the SQL query that we use to put mPulse into the RUM archive, that is only 1.6K lines of SQL. Only 1.6K lines. Very simple: two hours tops and the data is in. Well, I guess the message is clear. The RUM archive is now open for business. What I talked about today is really just the highlights, the tip of what we can do. We have literally just started to sip the coconut milk there. So if you want to help out with that, please come. If you have any questions, if you want us to run some queries for you, if you want to help with the analysis, please let us know. If by now you are just really, really hungry, I would say please come and try out some of the excellent chocolates and cookies, because there's no way I'm taking them home with me today. Okay, so please, thank you.
Linux on a Confidential VM in a cloud: where's the challenge?
Hello, everyone. Welcome to the virtualization devroom. My name is Vitaly. I normally work for Red Hat, and you can see me being active in the KVM community, as well as taking care of Linux on all types of third-party hypervisors and public clouds. And today I wanted to talk about bringing general-purpose Linux distributions to the newly introduced VM type on public clouds, which is confidential virtual machines. So, if you haven't been living in a cave with no internet over the last couple of years, which I wouldn't blame you for, because the world is a crazy place to be in now, you may have noticed that some hyperscalers were announcing or releasing their confidential VM instance types or features. I'm not here to advertise any of them, but just for reference: Google probably was the first, with their plain AMD SEV option in 2020, and now they even have SEV-SNP in public preview, as of last week or the week before. Microsoft Azure were the first to commercialize SEV-SNP; they offered AMD SEV-SNP and it went GA in 2022, and they now also have an Intel TDX option available in public preview. And Amazon offers the SEV-SNP feature in GA. So it sounds confidential, so it must be good, right? Because we all like it when our data is confidential. But what does it actually give us? What are these technologies about? Both AMD SEV with all its variants and Intel TDX are CPU technologies. So the first thing they give you is memory encryption: your VM's memory cannot be read by your hypervisor or other guests. Second, which is important and which wasn't in the first implementations like plain SEV, is that your CPU state is encrypted, because normally the hypervisor can see, for example, your registers while your VM is executing, and if it can stop you at every cycle, it can certainly read your data.
And the last thing, which is also important, is that memory integrity guarantees are provided to you. Because even when your memory is encrypted, a hypervisor which is malicious or compromised can attack you by, for example, trying to swap two memory pages. They will remain encrypted, but your guest will access the wrong one, right? And it can probably mount an attack using this technique. So this all sounds great. But when we talk about confidentiality, normally we say that confidentiality must be achieved at runtime, at rest, and in transit, right? Very generic. And all these things which I just described give you confidentiality at runtime. So what about the rest? Confidentiality of the data in transit is not really specific to CVMs, because we have been doing this for years, right? We know that the internet is not a safe place, so we need to encrypt our data when we send it through public channels, and not only public channels. But what about storage? How do we ensure that the storage of the VM is also confidential? Because even if you have something which is confidential in memory, you will eventually need to write it to disk, and do other things: you will need to read your operating system from the disk. So you need some guarantees there. The last thing I wanted to mention is that these confidential VM technologies don't give you any additional guarantees when you're already within the VM. So if you have an application which is attacked there, nothing's going to save you, right? The hypervisor cannot see your data, but everything which is within the VM can normally see the data. That's how it works, right? We want to put general-purpose operating systems there. So yes, let's discuss a little bit about protecting data at rest, because it seems that the hardware technologies don't give us this, right? So the first point is that you want to protect at the guest level. If some cloud tells you: oh, but we are encrypting our disks, right?
You don't need to worry. Yes, but then they have the key, right? If they can encrypt and decrypt it for me in a transparent way, then it's not confidential from this perspective. So you need to do it from the guest. And the thing is, you need to somehow protect the operating system itself, and not only the data you care about. Because first, you have some data which is really sensitive. Think SSH host keys, right? If somebody can read them from your VM, they can impersonate you and pretend that they're you, you know? You don't want this. Second, you will say: oh, I'm running a general-purpose operating system there. It's open source. Why would I need to protect it? You probably don't need to protect it from arbitrary reading from the host, but you still need to protect it from writing, because a malicious host can try to mount an attack by modifying something in the operating system. Think about swapping the sshd binary with something else, you know? How would you notice? You won't. And the good thing is that we have had some mature technologies in Linux for years, like LUKS, or things like dm-verity for integrity protection, which you can use. Because even when you store your encryption key or integrity hash in memory, it is protected from the host, because remember, your memory is encrypted; the host cannot read it. The thing is, the guest needs to somehow get this key when it starts, and where would it get it from? So yes, let's take a look at how Linux normally boots and how we can implement, say, full disk encryption. You start booting from firmware; normally everything is UEFI now, and all these confidential instances are UEFI. So there is some firmware which comes from the cloud vendor, but that's another story. Why would you trust this firmware? You probably shouldn't, but anyway. So, then you will always have some unencrypted part, right?
Because the firmware cannot jump into the encrypted part without knowing the key, right? You want to do the decryption yourself; you don't want to offload this job to someone else. So you will always have something like the bootloader, kernel, and initramfs stored there in the clear. Yes, you may say that we can actually do the decryption at the bootloader level, which is true, but then we are complicating the bootloader a lot, and the only one which does it is probably GRUB, and nobody likes it. I mean, it becomes a whole operating system with all the complexity and everything, and you don't really want that for your bootloader. You want it to be really small if present; maybe you don't even want to have a bootloader at all for the confidential case. So then you will jump into this encrypted part: you will somehow get the key and then decrypt it. So that's how it's going to work. So, how can you provide the key to the VM? You cannot do it manually. With GRUB, for example, you can type it on your console. You cannot do that on a cloud, because you don't trust the console. The console is an emulated device there, right? If you type your password there, the cloud will know the password. So you're not going to do that, and you will need to provide it in an automated fashion. So, one suggestion is, if you want to have a virtual TPM device, you run a separate domain, like another virtual machine, which will have this TPM device. It's really hard to implement, and in the 1.5 TDX specification, I think, they've added partitioning, which is somewhat similar to trust levels, and I think that's what clouds are going to use. Although, you don't know; some clouds may actually implement an emulated device on the host, just like you do with QEMU and swtpm, right? You can run it as a process on the host. And not all of these solutions will give you confidentiality.
For example, the one which runs on the host obviously won't. Then there are two types of TPMs, normally: stateful and stateless. A stateful TPM is a TPM which has its state, right? Think about it this way: it has a private key and it never changes. It's generated once when your VM is created, and then every time it's loaded, you can use it for encrypting and decrypting something. A stateless TPM is just firmware which will generate a new key every time it boots. So, how can we use this? Let's first talk about the stateful TPM. All these hyperscalers give you some sort of stateful TPM. The question is: where is the state stored? Because you can turn off your VM and turn it back on, so the state needs to be saved somewhere. And it's not part of your encrypted root volume or anything; it's somewhere else, right? So far, again, not an advertisement, but publicly only Azure promises that this state is kept securely, that there is some attestation going on under the hood when this TPM loads, which protects it from the underlying hosts. You can't say much about other implementations, because no such claims were made. So, you know, you don't know whether you can use it to isolate from your host or not. What's good about a stateful TPM is that you can implement root volume pre-encryption, right? There is a device which has a private key, so it can decrypt something. So you can take your root volume, encrypt it, and upload it in an encrypted state. And that's something which, for example, Azure confidential disk encryption is doing. In theory, we don't need to pre-encrypt. We could probably do something like self-encryption. And there are such ideas floating in the air: we start with this general-purpose Linux distro, do some integrity checking, and on the first boot, you encrypt your root volume and seal the key to the TPM.
But I haven't seen such an implementation yet. It's probably possible, but it's kind of hard, because you need to prove that the environment where you were doing the initial encryption really was a confidential VM doing the initial encryption. Otherwise, someone can try doing it in some other place and attack your VM. So, stateless TPM. Currently, I only know about Azure TDX, which publicly offers this option. But what's good about a stateless TPM is that it's just a program. You know, it's just part of the firmware. So you can take the initial launch measurement and attest it. It never changes, right? You don't need to attest the state of the vTPM; it's going to get generated every time, which is good. The thing is that, again, as I said, currently you will have to trust your cloud provider with the provided vTPM. And yeah, there is nothing like bring-your-own-firmware in public clouds. You can still use it for volume disk encryption if you want to use a TPM, but you will probably have to do some attestation and then inject some intermediary key. And also, there is nothing like this in the standard Linux tools. Just sealing the root volume key to a TPM is something which is generally supported by systemd or Clevis or other solutions, but something which would do attestation to a remote server and then bring the key is just non-existent. Second: what do you do with the vTPM if the cloud provider is not telling you that its state is isolated from the host? Or doesn't tell you how it's implemented, actually. The thing is, you cannot use it, right? You probably cannot even use it for things like PCR measurements, because if it's an emulated device, it can certainly get messed with, you know, and then you will see different measurements. So the only thing you can do in this case is try ignoring this thing completely and rely on architectural attestation registers, which both SEV and TDX give you.
The thing is, again, that our standard Linux tools for volume encryption and so on don't know anything about this currently. So you will have to come up with a solution for attestation and for delivering the root volume key or password there. And it's not done yet. So, just a few words about this unencrypted part, which I told you will always be there. Even if you do what you call full disk encryption, it's not going to be full, because you need to load the kernel and so on. So, how can you prove that these things are good? Normally, we have two technologies which have been used. One is called Secure Boot, the other is called measured boot. Secure Boot without a space, measured boot with a space; nobody knows why. Anyway, Secure Boot proves that all loaded EFI binaries are signed by a trusted party, and measured boot basically measures every important fact about the boot, like the binaries and the certificates which signed the binaries, into special registers of TPM devices. And we need to check basically everything which is being loaded. And as I told you, normally, for a general-purpose Linux distro, you will end up with a kernel, initramfs, and kernel command line being available in the clear, not encrypted. And to protect these things, a concept called the unified kernel image (UKI) was introduced, which is a very simple thing. You just take all these artifacts, the kernel, initramfs, and command line, sign them together, and make a UEFI binary which extracts itself and launches the kernel after that. The implications, of course, are that it's more secure, but it's less convenient to use. The initramfs becomes static, generated when we build the UKI. And normally, for a general-purpose Linux distro, we want our vendors to build the UKI. You just want to install an RPM and get a UKI. You don't want to build it yourself.
Otherwise, you will have to get your keys provisioned in the firmware, and not all clouds allow that. They may have a vendor certificate there in UEFI by default and may not give you an option to put your own there. So you will get a static initramfs, which may or may not be a problem. Of course, you have fewer demands on the initramfs on public clouds; you don't normally need to do network boot or anything there. But it's still limited. There is a system extension feature in systemd which can be used, with limitations, to do initramfs extension. Emanuele is going to give a talk in an hour, after me, about extending UKIs, which is going to cover this topic and how this can be done. The other limitation is that the kernel command line becomes static, so it becomes one-size-fits-all. When we as a vendor like Fedora build the Fedora UKI, we need to hard-code the kernel command line. You cannot pass root=UUID anymore, so you need to rely on something like auto-discovery. And again, we just got an extension mechanism which is called signed extensions: you basically place a UEFI binary stub in the ESP and get your kernel command line extended. This is already publicly released in systemd, but the tools are still adopting it; I haven't seen a fully working solution yet. But we're actively working on it in Fedora. Last but not least is: how do you boot your UKI? It is a UEFI binary, so it must pass Secure Boot checks, so it must be signed. And you can boot it either directly from the firmware, or you can, for example, boot it from shim if you want to have shim for some reason; for example, if the cloud provider does not allow you to have your vendor certificate in the Secure Boot DB. But you will still have to manage your UEFI variables, because there is nothing like a boot menu there if you are booting directly from the firmware, right?
In Fedora, we now have a package called uki-direct which can manage it for you automatically. We do things like A/B booting: for example, when you install a new UKI, it's going to be tried once. If it boots, it becomes the default. If it doesn't boot, you will revert back to the old UKI after the reboot. Because otherwise, if it doesn't boot, you are completely screwed; you won't even be able to access your encrypted root volume. Yes, so if we speak about a stateless TPM, where we don't really need to trust the cloud provider doing attestation of the vTPM state under the hood, then we will need an attestation server and client. And again, there are some offerings, say, in the proprietary world, like what Intel was advertising as Project Amber. But there is nothing which you can use today in the open source world. There are attempts to implement this in the Confidential Containers project. There is this thing called KBS, which is both a protocol and an implementation of this key broker server. But again, we will need something in the standard tools to do attestation, and we are yet to figure out how to tell this thing which server to attest to. Yes, so we talked a little bit about encryption. As I said, for the root volume, you need to at least ensure that it wasn't tampered with, and for that you can probably use integrity checking. But then the problems are very similar, because now, instead of the password, you will have to somehow convey the right hash to use for the checked part. Yeah, so I'm a little bit out of time here. But yes, you will still need to use all the technologies which I described for encryption. You will have to ensure the integrity of this non-encrypted, non-verified part, because the UKI is still going to be on the ESP, which is VFAT; you cannot attach anything like integrity protection there. Okay, so just a few words.
Even if your VM started and everything checked out, you need to verify that you are actually connecting to the VM you expect. Think about the host starting your VM somewhere and then starting another one which looks identical but is fully host-controlled, with, you know, all the binaries there changed. How would you know that you are connecting to your VM? So you probably need runtime attestation, and clouds are offering you something, but there is also no standard open source solution for that. Okay, I'll skip to the last and the most important slide. Thank you very much for listening. We probably don't really have time for questions, but I can take as many as I can before dying in the hallway. Yeah, so thank you.
How Much Do You Know about Snapshot
Okay, hello everyone. My name is Titi Ma, and I'm from Red Hat. Today my topic is snapshots, especially their implementation in OpenShift Virtualization, OpenStack and libvirt. Actually, I'm a QE for QEMU, which is very close to libvirt, and the main products built on libvirt for us are OpenShift Virtualization and OpenStack, so I made some investigation here, and this is today's agenda. First, what is a snapshot? A snapshot is a point-in-time representation or copy of the state of a system, software, a disk, a virtual machine, or anything else; today I'm mainly focused on disks and virtual machines. Snapshots play a vital role in virtualization, as they are used for data backup and recovery. We know that data is always important for any user, and compared to a traditional data backup, a snapshot gives a quicker backup and restore. We can also take different snapshots at different points in time, which means we can restore to any historical state of our system. Here are some general use cases for snapshots. In our daily work, we may hit system failures or data corruption; if we have a snapshot, we can use it for backup and disaster recovery. Snapshots can also be used for testing or development environments: we may destroy our system during our work, and if we have a snapshot, we can make use of it in this scenario too. Snapshots can also be used for system upgrades or software updates: if an upgrade fails, we can roll back to the previous version of the system. They can also be used in training or education scenarios, where students may make mistakes during their learning; with a snapshot, we can roll back to the initial state of the system. And they can be used for customer issue replication, meaning we can save a customer environment as a snapshot.
And we can then use this snapshot to debug, which accelerates problem solving. A snapshot can also be used for security incident recovery: in today's networked world, malware is everywhere, so if our system is attacked by it, we can make use of a snapshot in this scenario as well. Okay, from now on I will talk about snapshots on the three platforms. The first part is snapshots in OpenShift Virtualization. OpenShift Virtualization is an add-on for the OpenShift Container Platform, and OpenShift provides robust snapshot capability for it, as it extends the base OpenShift snapshot feature to include guest OS operation coordination and multi-disk management. From user space, there are two methods to create a snapshot: one is through the web console, and the other is through the oc command line with a YAML file, in which we define a VirtualMachineSnapshot custom resource. The snapshot in OpenShift Virtualization can be created while the guest is powered on or powered off; both are supported. When the guest is powered on, it is usually recommended to install the guest agent software in the guest. The guest agent is used to freeze the guest's filesystem, which gives time to flush in-memory data to the disk before the disk snapshot is created, so it guarantees data consistency. Okay, actually, the VM snapshot in OpenShift Virtualization makes use of volume snapshots: the VirtualMachineSnapshot resource creates corresponding VolumeSnapshots for all the supported volumes of the VM. The source of a VolumeSnapshot is usually a PVC, a PersistentVolumeClaim. We know that the real data of a PVC is stored in a PV, a PersistentVolume, and it can be classified into different storage classes based on different storage backends.
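Going back to the YAML-based creation path: a minimal sketch of the VirtualMachineSnapshot resource described above, assembled here as a Python dict. The apiVersion (snapshot.kubevirt.io/v1beta1) and field names are my assumption of the KubeVirt snapshot API; verify them against the CRDs installed in your cluster.

```python
# Sketch of the VirtualMachineSnapshot custom resource mentioned above.
# The apiVersion and field names are assumptions based on the KubeVirt
# snapshot API; check them against your cluster's CRD version.

def vm_snapshot_manifest(snapshot_name, vm_name):
    return {
        "apiVersion": "snapshot.kubevirt.io/v1beta1",
        "kind": "VirtualMachineSnapshot",
        "metadata": {"name": snapshot_name},
        "spec": {
            "source": {
                "apiGroup": "kubevirt.io",
                "kind": "VirtualMachine",
                "name": vm_name,
            }
        },
    }

manifest = vm_snapshot_manifest("my-vm-snap", "my-vm")
```

Dumped to YAML, this is the kind of file you would pass to `oc create -f`.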
It is the same for volume snapshots: the real data of a VolumeSnapshot is stored in a VolumeSnapshotContent object, and it can also be divided into different volume snapshot classes. Yeah. Okay, let's look at the general data flow for a snapshot in OpenShift Virtualization. Say there is a user request to create a volume snapshot. The request is sent to the snapshot controller; this controller is deployed in the control plane of OpenShift, and it is watching VolumeSnapshot objects. Once it detects the object, it creates the corresponding VolumeSnapshotContent. There is another component, the CSI snapshotter, which is a sidecar container in the CSI driver pod, and it is watching VolumeSnapshotContent objects. Once it detects one, it triggers the snapshot create operation. And depending on the storage backend, the commands issued are different: for RBD, it uses RBD snapshot-related commands; for NFS, it uses the tar command to make the snapshot; for hostpath, local files, it uses the tar command also; and for block, it uses dd-related commands for the snapshot operation. Yeah. Okay. About snapshots in OpenStack: as in OpenShift Virtualization, there are VM snapshots and volume snapshots. The VM snapshot here is different from OpenShift Virtualization: it actually creates several image snapshots, and each snapshot is saved as an image file in OpenStack. That means, if you'd like to restore from this snapshot, you need to launch a new instance from the snapshot file. And also, for data consistency, the guest agent is again recommended to be installed before the snapshot is created.
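As an aside on the OpenShift Virtualization flow above, the backend-specific tooling can be summarized in a lookup table. The concrete command lines below are illustrative placeholders, not the literal invocations the CSI drivers make.

```python
# Illustrative mapping of storage backend -> tool used for the snapshot,
# per the talk. The argv lines are placeholders for illustration only,
# not the literal commands issued by the CSI drivers.

SNAPSHOT_TOOL = {
    "rbd":      ["rbd", "snap", "create", "pool/image@snap"],
    "nfs":      ["tar", "-czf", "snap.tar.gz", "data/"],
    "hostpath": ["tar", "-czf", "snap.tar.gz", "data/"],
    "block":    ["dd", "if=/dev/src", "of=/backup/snap.img", "bs=4M"],
}

def snapshot_command(backend):
    return SNAPSHOT_TOOL[backend]
```

The point of the table is that RBD has native, cheap snapshots, while file and block backends fall back to plain data copies.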
And for the volume snapshot, OpenStack is similar to OpenShift Virtualization, except the commands here are OpenStack-related, like `openstack volume snapshot create`, and for the restore it uses Cinder-related commands. Okay, let's look into the data flow here. For a volume snapshot in OpenStack, yes, it's the same: there is a user request from user space. The request is sent to the Cinder component: first to the Cinder API, which does some basic checks, and then to the Cinder scheduler, which schedules the request to the different storage backends, just like in OpenShift Virtualization. And for the different storage backends, the commands issued are again different: for RBD, it's the same, it uses RBD snapshot-related commands; for NFS, it's different, it uses qemu-img to do the snapshot; and for LVM, it uses LV-related commands. About the VM snapshot in OpenStack, it's different; it's also different from OpenShift Virtualization, as it does not make use of the volume snapshot. Here is the code flow: it's mainly implemented in Nova, and it divides into live snapshot and cold snapshot. For a live snapshot, the data flow is: first, it uses qemu-img to create a delta disk; then it uses the libvirt API blockRebase to rebase this delta disk onto the root disk file; and then it uses qemu-img to convert this delta into the snapshot. After the snapshot file is created, it deletes the delta disk. And for the cold snapshot, it just uses qemu-img convert directly to do the data transfer. Actually, when I first saw this workflow, I was confused: why not use the libvirt snapshot directly? The workflow here is just some libvirt APIs and qemu-img-related commands, so why not use the libvirt snapshot?
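The live-snapshot sequence just described can be sketched as an ordered command list. Paths and image names are invented for illustration, the libvirt API call is shown as a placeholder entry, and this function only returns the step list rather than executing anything.

```python
# Sketch of the live-snapshot sequence Nova performs, per the talk:
# create a qcow2 delta, rebase the running disk onto it via the libvirt
# blockRebase API, convert the delta into the snapshot image, then drop
# the delta. All paths are invented examples; nothing is executed here.

def nova_live_snapshot_steps(root_disk, delta, snapshot_image):
    return [
        ["qemu-img", "create", "-f", "qcow2", "-b", root_disk, delta],
        ["libvirt-api", "blockRebase", delta],  # placeholder for the API call
        ["qemu-img", "convert", "-O", "qcow2", delta, snapshot_image],
        ["rm", delta],
    ]

steps = nova_live_snapshot_steps("/var/lib/nova/root.qcow2",
                                 "/var/lib/nova/delta.qcow2",
                                 "/var/lib/nova/snap.qcow2")
```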
Actually, the reason is that the current libvirt release notes say the libvirt snapshot is not recommended to be used there. Okay, let's look at why it is not recommended, and what the current status of the VM snapshot in libvirt is. The libvirt snapshot is currently using internal snapshots. So what is an internal snapshot? An internal snapshot means that the snapshot data is saved in the same base image file itself; we can imagine that the snapshot and the base are merged into the same file. That is hard to maintain. Actually, development of this feature has stopped at the QEMU level, and it is planned to be disabled in the future. Another thing I'd like to highlight is that the VM snapshot in libvirt is truly different from the VM snapshot in OpenShift Virtualization and OpenStack. OpenShift Virtualization and OpenStack use the guest agent to help with data consistency, but the libvirt snapshot includes the complete system info: it includes the complete memory data and also the disk and device state in the snapshot file, so it can guarantee the data consistency here. And with the libvirt snapshot, we can also do disk-only snapshots. Given these disadvantages of the internal snapshot, upstream libvirt is working on external snapshots now. As for the current status, we can create external snapshots now, but restore and delete are still under development, there is an issue tracking them, and they are planned to be released in libvirt 10. So eventually, when this feature is fully supported, from the data consistency perspective this could be a perfect option for snapshots. But actually, there are still some limitations for the VM snapshot in libvirt, as it does not support all storage backends that well, and the image format of the snapshot file in libvirt must be qcow2.
And qcow2 is not ideal for some backends, like the RBD backend: from the official documentation we learn that qcow2 is not recommended over RBD, as there are some performance issues there. So, let's give a brief summary. At a high level, we can divide snapshots into two kinds, cold and live. Cold means the VM is powered off, so we can guarantee data consistency. But actually, more customers may prefer live snapshots, as there are applications running in the VM that they want to keep running while doing a snapshot. About live snapshots, we can divide them further. One option is a disk-only or volume-only snapshot: there is no memory data, which means there is potential data inconsistency here. Another choice, like in OpenShift Virtualization or OpenStack, is to make use of the guest agent, the component used to freeze the filesystem. But the caveat here is that it just quiesces the filesystem as much as possible; it also depends on the workload, meaning that if there is a very heavy workload in the VM, there is still potential data loss. And another choice is the libvirt snapshot: it includes the complete memory info in the snapshot file, but as I also said, there are some limitations with the different storage backends. So, always choose, based on your requirements and your environment, the one that suits you best. Okay, that's all of my presentation. Thanks for listening. Thank you.
UKI addons and extensions: safely extending UKIs kernel command line and initrd
Okay. Hello, everyone. My name is Emanuele Giuseppe Esposito. I'm a software engineer at Red Hat, and today I'm talking about UKI addons and extensions: how to safely extend the UKI kernel command line and initrd. So why this talk? First of all, because this is extremely new stuff, like, it's very new, and hopefully also exciting. Because there's not a lot of documentation, of course, because this stuff was just merged, and hopefully this talk will also help you understand a little bit more about what these addons are, how to use them, and so on. And because they may be very useful: the UKI, as Vitaly also explained in his talk one hour ago, is pretty static when it comes to the command line and the initrd, and with these addons we can extend these two things without sacrificing security. And also, yeah, this is an attempt to advertise UKIs a little, so that what they are becomes more widely recognized. So let's look first at Vitaly's slides; these are from last year, I think, so I will just briefly go through them. A confidential VM provides data protection from the host it runs on: we are protecting the VM from the hypervisor, because it could be malicious and it's privileged, so it can access the VM and we don't want that. The host is still able to disrupt the execution of the VM. There is specific hardware, SEV-SNP and TDX, responsible for encrypting memory and CPU state, and storage encryption is necessary for security and must be done by the guest OS. This was already explained by Vitaly. And usually the situation that we have is that while the kernel is signed by the vendor, the initramfs and the command line are locally produced, are not signed, and are also difficult to measure, of course. Whereas with the UKI, the unified kernel image, there is basically a single binary produced and signed by the vendor, in this case Red Hat.
And it basically contains the important parts as PE sections, together with the signature: there is the kernel, the initramfs, and there is also the command line as a separate section that is then fed to the kernel. Before going into the next details, I wanted to explain the use case for this talk: we have the UEFI firmware, which in turn calls shim, the bootloader, which in turn calls systemd-stub, which is the key piece for the addons; it loads both the kernel command line and the initramfs from the UKI, and then the kernel runs the OS. The issue that Vitaly also mentioned is that the kernel command line is immutable, and that is something we don't like, because there are limitations: you cannot have one static command line for every use case you have, there are crashkernel options, debugging options, and we cannot ship a different UKI for basically every use case. So what we are aiming for with the UKI kernel command line: it cannot be static, as I said, because there are different use cases; it has to be secure, so whoever modifies the command line has to be authenticated, otherwise the whole point of confidential computing is lost, and by default nobody can, because the command line is inserted inside the UKI, which is then signed, so you cannot modify it anymore; and it has to be extensible, of course, because we don't want to ship a new UKI every single time. There are already ways, for those who know UKIs, to add to the kernel command line of a UKI, but when we talk about confidential virtual machines it's a little bit tricky, because, as I'll show you option by option, you need to trust a lot of parties. So, as I said, there is the command line section: it's embedded in the UKI, it's generated with the UKI, it's secure, it's shipped with the UKI altogether, but it's static, it cannot be modified.
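The PE layout just described can be summarized in a small table. The section names follow the systemd UKI conventions as I recall them (verify against the UKI specification); the content strings are placeholders.

```python
# Sketch of the PE sections of a UKI as described above. Section names
# follow the systemd UKI conventions as I recall them; the descriptions
# are placeholders, not real section contents.

UKI_SECTIONS = {
    ".linux":   "the kernel image",
    ".initrd":  "the initramfs",
    ".cmdline": "the kernel command line fed to the kernel",
    ".osrel":   "os-release metadata",
}

def has_static_cmdline(sections):
    # The command line ships inside the signed binary, hence it is static.
    return ".cmdline" in sections
```

Because the whole binary is signed at once, changing any one section, including `.cmdline`, invalidates the signature; that is exactly the rigidity the addons address.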
Then there is the EFI shell command line, which is looked at by systemd-stub if the command line section inside the UKI is missing; many distros, for example, always ship something in the command line section inside the UKI, so it's ignored. It's usually useful for type #1 boot entries, but again, it's unsafe, because an attacker can easily inject their own parameters through the EFI shell; that's why it was disabled for CVMs, so you cannot extend the kernel command line with the EFI shell. There is SMBIOS, system management BIOS, embedded metadata: this is good, it's trusted because it's coming from the firmware, but it doesn't apply to CVMs, because again the hypervisor can easily inject a kernel command line there. So yeah, as I said, it's not good, so this was disabled as well. And then there is the QEMU firmware configuration, fw_cfg; by the name you can already tell that this is QEMU-only, and it's again coming from the hypervisor, so it's also disabled. Then what do we do? Our initial proposal upstream was an allow list: basically another PE section where you use regexes, globbing, whatever, something like that, to parse the command line that comes in, and the easy case would be: if there is something that we don't accept in the regex, we just discard the whole command line. The command line could come from the EFI shell, SMBIOS, all these sources, but we try to filter, and systemd-stub does the parsing. The advantage, of course, is that we can reject what we don't want, but the problem is just moved to another place, because then you can do attacks on the regexes and globbing, since they need to be formulated very carefully. So this was disabled too, it was rejected actually, and eventually we have the solution, the systemd solution: UKI addons.
UKI addons are basically separate binaries which contain very few PE sections, one of these being the command line, and they can be signed, and should be signed for CVMs; we take advantage of the verify function offered by shim to validate the PE signature. Basically, this means that systemd-stub will ask shim to validate whether the binary has been signed by some key that we trust in the Secure Boot database. There is a very useful tool, ukify, in systemd upstream: you can create UKIs very easily, much nicer than with dracut and objcopy, and you can also create addons; providing the command line is very easy, and you can also provide the keys when you want to sign your own addon. So this is the solution. How does it work? The workflow is: first you create the addons, so you ask ukify to create an addon with the command line that you want. Then the addon needs to be put in a specific location in the ESP; I will show you later where exactly this is. Systemd-stub looks at this location and finds the addons automatically, and asks shim, calling shim verify on the addon, to verify that the addon is trusted, that is, signed by somebody we trust. And then, if validation is successful, systemd-stub reads the addon and appends the command line inside the addon to the UKI command line section, to extend it, and then it's provided to the kernel, which starts Linux with the new command line. There are two kinds of addons: global addons, which are applied to all installed UKIs, and this is one location; and UKI-specific addons, so if you want to apply an addon to one specific UKI you have installed, it has to be in an extra.d folder, named after the UKI, in the same location where your UKI is. You have to follow the naming convention, because last time I checked, systemd-stub was also checking the extension names and this kind of stuff, so you need to get them right. UKIs are
always located under EFI/Linux on the ESP, and UKIs always end with .efi; addons end with .addon.efi, and UKI-specific addons, as I said, need to be located in the extra.d folder. Okay, so the next question is: what about revocation? Suppose that we as a vendor ship a UKI command line addon, we sign it, everybody is using it, and then we figure out the command line has an issue. Then what do we do? Because we signed it as a vendor, so it's trusted. First solution: just change the certificate. But this is basically impractical, yeah, good luck with that: it messes up all the measurements, you invalidate all the addons. Second solution: try to maintain a blocklist at the cloud provider; this is impractical too. Third solution: attestation, check whether the hash matches the addon that you don't want anymore. And the last solution is SBAT rules. So what is SBAT? It's basically another PE section inside the UKI and, for example, the addons, and it contains the component generations and also other information, but the key part is the component generation table, because the same table should also be inside your shim. And we are at component level, so, for example, every PE binary should have its own component generation: one for Linux, one for the addon, and so on. If the component generation matches what shim has, we accept it; but if the generation of the incoming addon's component is lower, then we have a mismatch, and even if the addon is signed by Red Hat or whoever, it will be rejected. This part is done by shim: when it verifies an addon, it checks the SBAT components and generations. Just an example to clarify this: in the first case, shim has sbat version 1 and myaddon version 2, and the addon contains the same versions for sbat and myaddon, so it's good, it will be accepted; of course, it also has to be signed by somebody we trust. In the second case, the sbat version is correct, but the myaddon component is lower, which means we don't accept it: even if it's signed by whoever we trust in the Secure Boot database, it won't be accepted. One open problem is combining addons: if you have two separate addons whose command lines are each safe on their own, but together they create a security issue because they enable something that we don't like, how do we solve this? To be honest, as of now I couldn't come up with a concrete example of this, and yeah, one solution would be to use attestation to see whether they are both there. Now, talking about systemd sysexts as initrd addons. Systemd sysexts already exist, they are already well known and used, and what is new is that you can also use them for UKIs. For those who don't know, a sysext, a system extension image, extends the base system with an overlay containing additional files; so you can extend the base system, and systemd-stub also provides the possibility to use this to extend the initrd inside the UKI. It's more or less the same concept as the command line addons; you just use different tools, because they are different things: they are not PE binaries with PE sections, they are system extension images, and mkosi is used instead of ukify. But the location where to put them, for example, is the same. The workflow is again more or less the same: create a sysext image, put it inside the extra.d folder, it must be a raw file, and then, and this is the only difference, systemd-stub will take the initrd addon and place it inside the initrd, where systemd-sysext will then load it and apply it to the initrd. Yeah, who can use these addons? The use cases are various; there are three groups of users who can use this. The vendors, for example Red Hat: we may want to ship a debug kernel command line for the UKI, so we ship our addon. Then there are the virt host admins, who can use host-side tools like virt-firmware or
whatever to modify these kinds of variables, more or less the same use case. And the guest admins can use guest-side tools, like MOK, to insert their key into Secure Boot, even though this is a little bit tricky in the cloud, because on Azure it's basically impossible to add a key via MOK: when the machine reboots, you cannot connect in time, and when you connect with the shell you have already skipped the MOK screen where it asks you to confirm your key. Available tools: systemd has a lot of tools, ukify is the main one; support arrived gradually across versions, first building addons and then inspecting them. And I also sent a PR to extend bootctl to find addons and display, already as a preview, what the full kernel command line will be, if there are any systemd maintainers around. Then there is mkosi to create a systemd sysext image, and then we have uki-direct for Fedora: there is kernel-bootcfg, with which you can add, update and remove UKIs, and we also added kernel-addon, which does the same thing for UKI addons. And the future work, what are we planning to do next? Maybe an RPM: the vendor ships an RPM with a collection of generic addons that we want to ship, signed by the vendor. But of course, we don't want to pollute the ESP with addons the user doesn't need, so there was agreement upstream to define two locations, /usr/lib/linux/extra.d for global addons and another one for UKI-specific addons, where the RPM should install these addons; then, when the user needs them, they can simply use kernel-addon, or just copy the addon, for example one that we as developers ask them to use for debugging the UKI, into the ESP, reboot, and it will be there. Yeah, on the cloud: if providers want to allow the user to upload their own UKI addons, there needs to be a way to inject the owner certificate, otherwise, yeah, you cannot do it. There is also a bit of an issue with measurement here, because when you add the user certificate, it has to be measured, in PCR 7 especially, and the solution we found is to simply add a dummy addon before performing attestation, so the certificate is part of the key ring and will be measured. On-prem it's more or less the same; the "who" for us is libvirt: we want to offer the same possibility to upload the certificate for Secure Boot, and yeah, there is already a way to add the dummy addon. So that's it for my talk; if you have any questions, here or outside, thank you. Yes, please. So, a second comment is on all of the addons. Right? Because you can trust the UEFI Secure Boot mechanism, whereas in a confidential computing environment you cannot today; I'm not aware of any stack right now that gives you a trustworthy UEFI Secure Boot environment. That means you need another mechanism to do that measurement for a confidential computing environment, and the most natural path for that is to use the launch digest. Because with launch measurements you need to know ahead of time, at boot time, all of the data that you need to launch, which means you need to have the UKI ready and available, including all the addons. At which point we go full circle: I think we are much better off just building a separate UKI for that one set of configuration you're doing, so you can attest that you're actually running that set of configuration. You don't want your debug addon in your production fleet; that is something you want to prevent aggressively. So I think the most natural mechanism here is to go and build a separate one-off UKI, even if it's made of addons if you want to. Okay. Okay, thank you. Okay. Thank you. We cannot do revocation only with the firmware; the firmware cannot support a revocation mechanism outside of the dbx, and the dbx has both space and turnaround problems. If you want a lot of space, ditch the Microsoft solution, don't use the Microsoft solution. Thank you.
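The SBAT generation policy discussed in the talk (and touched on again in the revocation question above) can be modeled as a small check: shim keeps a minimum generation per component, and an addon is accepted only if every component it declares is at or above that minimum, so even a correctly signed but stale addon is rejected. This is a simplified model, not shim's actual code.

```python
# Simplified model of the SBAT generation check described in the talk:
# shim holds a minimum generation per component; a PE binary (here an
# addon) passes only if every component it declares meets that minimum.

def sbat_allows(addon_sbat, shim_policy):
    """addon_sbat / shim_policy: dict of component name -> generation."""
    return all(gen >= shim_policy.get(comp, 0)
               for comp, gen in addon_sbat.items())

# The two examples from the talk, with shim at sbat=1, myaddon=2:
policy = {"sbat": 1, "myaddon": 2}
ok = sbat_allows({"sbat": 1, "myaddon": 2}, policy)       # matches: accepted
revoked = sbat_allows({"sbat": 1, "myaddon": 1}, policy)  # stale: rejected
```

Bumping the generation in shim therefore revokes every older addon at once, without touching certificates or dbx space.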
Bye. We know how it ends. Guys, you are more than welcome to present next year if you want. You are more than welcome.
From Virtualization Platform to Hybrid Cloud Solution: A Hands-On Account
So, good afternoon everyone, and thank you for joining me today. My name is Bello and I'm a software engineer at Red Hat. Over the past year I had the opportunity to be part of the Forklift team and take it for a spin. So, today I'm about to share with you our recent journey, and without further ado, let's jump in. In today's rapidly evolving world of IT, we can observe an increasing move away from traditional virtualization environments towards more hybrid cloud solutions, and at Red Hat we're not just observing this trend, we're actively participating in it. So, recently we had the opportunity to go on a journey, migrating from a well-established virtualization environment to a newer solution, and today I'm going to share with you some of the insights, challenges and benefits of such a transition. So, let's start by discussing these two very different solutions. Picture yourself on a journey through the IT computing landscape. Our first stop is oVirt. It's like an older, reliable train that's been running for years. oVirt is an open source product based on KVM technology, and it offers enterprises a cost-efficient way to manage their virtual workloads; it's an alternative to vSphere. But our journey does not end there. We then continue to the world of OKD; picture it as a high-speed train whisking us to the future of cloud computing. OKD is also an open source project, based on Kubernetes, and it provides us with cloud computing capabilities alongside enhanced Kubernetes features such as added security, automation and a user-friendly interface. And it supports both containers and virtual machines. So, when considering such a transition, it's important to take into account how it can be done. There are several paths we could take, each with its own set of advantages and challenges, but today I would like to focus on the main three. First, we can reprovision all the virtual workloads and start from scratch.
Even though this solution may sound pretty straightforward, it's both costly and time-intensive, and for complex workloads it's not always possible without risking data integrity and operational disruptions. Next, we can migrate all our virtual workloads into containers. With the use of the Konveyor project we can really reduce the cost here, but it's still not an easy task, and again we have the same issue as before: not all workloads can be containerized. So, while this may be a good solution for certain types of applications, it's not suitable for everyone. And finally, the option which seems to be the best one: keeping our virtual workloads as they are and, with the use of the Forklift tool, migrating them to the new environment. This way we don't have to worry about any data loss, and with the use of this tool we can have a simple and smooth transition. So, what is Forklift? Forklift is a tool designed to assist in migrating from traditional virtualization environments to Kubernetes-based environments, and it takes care of the entire migration process for us. It works alongside another project named KubeVirt, and KubeVirt provides us with the virtualization capabilities on top of Kubernetes-based environments. Once Forklift migrates the virtual workloads, they will be placed on top of KubeVirt. So, Forklift, as a versatile tool, supports a variety of source providers, source environments, as you can see here in this list. Now I would like to take a deeper look at Forklift's high-level functionality. Forklift supports two types of source environments, KVM-based and VMware-based, and for both of them it takes care of the entire migration process. That means creating the disks, copying the data, and, for VMware-based products, converting the virtualization stack to match KubeVirt's requirements.
And, of course, finally creating the VM itself with its original setup to run on top of KubeVirt. So, the use of this tool makes for an easier and smoother transition to the new environment. Now that we've finished discussing these different solutions and approaches, let's dive into the specifics of our own migration from oVirt to OKD, where Forklift served as a crucial tool in facilitating this migration. I would like to start with a little bit of background on why we decided to go ahead and proceed with this transition in the first place. Our oVirt environment had been in use for more than a decade, supporting hundreds of virtual machines with diverse usage, some for production while others for development and testing. While the fact that oVirt is reaching its end of life wasn't the main reason we decided to go on this transition, it certainly pushed in this direction. And, moreover, we wanted to take this opportunity to reallocate some of our resources and remove underutilized workloads, while causing as little interference to the users as possible. So, taking all this into account, the shift to OKD seemed to be the most reasonable, fitting choice. As in any success story, planning is always essential, and our migration wasn't an exception. So, we started our journey with an in-depth analysis of our current environment, to understand the migration requirements and what exactly we needed from this transition. We then continued with resource evaluation. That means we had to make sure that our target environment would have enough resources to accommodate the incoming workloads in terms of compute, storage and network. And finally, we had to create a clear timeline, to make sure that each step of the way was well known and that everyone involved, from users to IT teams, was in the loop on this transition. So, now I would like to zoom in even more on the preparation step and focus on the resource allocation.
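The resource-evaluation step just mentioned can be sketched as a small aggregation over the migration VM list: sum the disk usage and collect the IP addresses to check against target capacity. The VM records below are invented examples, not the team's actual data.

```python
# Sketch of the resource-evaluation step described above: summing disk
# usage and collecting IP addresses across the migration VM list to check
# target capacity. The VM records are invented examples.

def required_capacity(vms):
    total_gib = sum(vm["disk_gib"] for vm in vms)
    ips = {ip for vm in vms for ip in vm["ips"]}
    return total_gib, ips

vms = [
    {"name": "web-1", "disk_gib": 40, "ips": ["10.0.0.5"]},
    {"name": "db-1", "disk_gib": 120, "ips": ["10.0.0.6", "10.0.1.6"]},
]
total, ips = required_capacity(vms)
```

With the totals in hand, comparing them against the target cluster's storage quota and address pool is a simple check before any migration starts.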
So, we had to start by finalizing our VM list for migration. When we thought about what the criteria for a VM to be eligible for this transition should be, we decided to proceed with actively used VMs only, and we had close conversations with their owners to understand their specific needs. After that, we had to calculate the storage and IP addresses of all the VMs on this list to make sure that our target environment would have enough resources. This step was more than just technical preparation; it was essential to ensure that once the migration started, we wouldn't have additional downtime due to a lack of resources. And last, we had to come up with a way to reflect our original ownership and access model from the oVirt environment in OKD. So, with a well-planned approach and a tool like Forklift at our disposal, you might think this migration is going to be a walk in the park, right? Well, not quite. As we started our journey, we discovered that the path ahead of us was going to be quite challenging. So, now I would like to share with you some of the obstacles we encountered and how we tackled each of them to keep our migration on track. The first challenge was the VM selection. As I mentioned earlier, we wanted to continue with only actively used VMs. That required us to analyze the VM usage patterns and understand which VMs were actively used during a specific time period, a task that proved to be quite challenging. Then we had to gather information about these VMs, such as disk size, network and ownership, and that task turned out to be quite demanding as well, both in terms of complexity and time. On top of that, our two environments had different provisioning models: oVirt was more admin-driven and OKD more user-driven, and we had to come up with a way to bridge this gap somehow. So, in order to overcome these challenges, we went ahead and developed Python scripts specific to the migration process.
They can be broken into two categories. The first ones, based on the oVirt SDK, were mainly used for finalizing the VM list for migration and for data gathering, such as the disk sizes, IP allocation and ownership. The second sort of scripts were based on the Kubernetes API, and they were used for creating the namespaces on the target environment and for assigning the appropriate roles to the users. We also uploaded the scripts to our GitHub repository, so they can be used as a blueprint; if anyone wants to take a look, you're more than welcome. So, now I would like to focus on a specific issue we had and walk you through the different stages we took to solve it. As I mentioned earlier, our two environments had different provisioning models. Our oVirt environment followed a more centralized model, where the admin had full control of the environment, managed all the resources and created new VMs. Our OKD environment, on the other hand, is more user-driven: users have the freedom to manage and create their own resources within their namespace, and the namespace resources are set by predefined quotas. So, to bridge this gap, we decided to create new namespaces on the target environment and place in each of these namespaces all the VMs shared by the same users. By giving the users admin access, we made sure that each user would retain their original permissions. So, let's clarify this with an example. Let's say that after finalizing our VM list, we ended up having four VMs for migration. As you can see, on the new environment we created three new namespaces, and in each of them placed all the VMs shared by the same users. So Bob and Alice, who shared two VMs, will now have a shared namespace, with both having admin access to it. And Bob ended up having three projects assigned to him, which really reflects the diverse usage in the original setup. So, now I would like to guide you through the script we used for this mapping process.
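The grouping idea described above can be sketched roughly like this; all VM and user names here are made up for illustration, and the real scripts (which pull this data from the oVirt SDK) live in the speakers' GitHub repository:

```python
from collections import defaultdict

# Hypothetical VM -> owners mapping, as it might look after filtering out
# admin and system users (names are invented for this sketch).
vm_owners = {
    "vm-web": {"bob", "alice"},
    "vm-db": {"bob", "alice"},
    "vm-ci": {"bob"},
    "vm-test": {"carol"},
}

# Group VMs shared by exactly the same set of users: each group becomes
# one namespace on the target OKD cluster.
groups = defaultdict(list)
for vm, owners in vm_owners.items():
    groups[frozenset(owners)].append(vm)

# One namespace per user set, e.g. "ns-alice-bob" holds vm-web and vm-db.
namespaces = {
    "ns-" + "-".join(sorted(users)): sorted(vms)
    for users, vms in groups.items()
}
```

With the four example VMs this yields three namespaces, matching the Bob-and-Alice example from the talk.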
So, the first part is based on the oVirt SDK, and we did the following. We started by creating a list that mapped all the VMs to the users in the system. Then, based on information from another script, we removed all the admin and system users from that list. Then we created a dictionary that mapped sets of VMs to all their corresponding users. And based on this dictionary and the Kubernetes API, we created a YAML file. So, here we can see a set of actions for one set of VMs. We started by creating the new namespace on the target environment. Then we created an admin role that gives full permissions on all the resources under this namespace. And finally, we created a role binding that binds a specific user to the admin role. By that, we made sure that each user would retain their original access to their resources. So, now that we've finished with the planning and preparation phase, let's dive into the migration execution. Our first step was to deploy Forklift. Forklift can be installed from OperatorHub, and it's managed by the Operator Lifecycle Manager. In our case, we decided to install it on the same cluster as the target one, but it can also be deployed on a different, remote cluster. Next, we had to create a new namespace to hold all the migration resources, including providers, the different mappings, and the plans themselves. It's important to note that the user who creates the namespace should have sufficient permissions on the migration resources. Next, we had to create the target and source providers. Each provider represents the environment we're migrating from or to. Once we deploy Forklift, a new tab named Migration appears in the console, and from there we can manage all of our migration resources, including the addition of new providers. So, we started by creating the source provider, and here we chose Red Hat Virtualization, which is the downstream name for oVirt.
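The namespace-role-rolebinding sequence described above could be generated along these lines; this is an illustrative sketch that builds the manifests as plain dictionaries (the real script talked to the Kubernetes API, and names like `ns-admin` are made up here):

```python
def manifests_for(namespace, users):
    """Build the Namespace, an admin Role, and one RoleBinding per user
    for a single group of shared VMs (illustrative sketch only)."""
    docs = [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": namespace}},
        {"apiVersion": "rbac.authorization.k8s.io/v1", "kind": "Role",
         "metadata": {"name": "ns-admin", "namespace": namespace},
         # Full permissions, but only on resources inside this namespace.
         "rules": [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}]},
    ]
    for user in users:
        docs.append({
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "RoleBinding",
            "metadata": {"name": f"ns-admin-{user}", "namespace": namespace},
            # Bind this specific user to the admin role above.
            "subjects": [{"kind": "User", "name": user}],
            "roleRef": {"kind": "Role", "name": "ns-admin",
                        "apiGroup": "rbac.authorization.k8s.io"},
        })
    return docs

docs = manifests_for("alice-bob", ["alice", "bob"])
```

Dumping `docs` through a YAML serializer would produce the kind of multi-document file shown on the slide.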
We then had to fill in all the information about this environment, so Forklift would be able to connect to it. Here it's important to use a user that has sufficient permissions on the VMs we're about to migrate, or else the migration will fail. In our case, since we were dealing with a migration at scale, we went ahead and used an administrator account. Next, we created the target provider. Here we chose OpenShift Virtualization, which is the downstream name for OKD. Here we only need to fill in the name, and all the other information is filled in automatically. Next, we had to create our network and storage mappings. Once the migration starts, Forklift needs to know how to redirect the incoming workloads in terms of VLANs and storage classes, and this mapping tells it how to handle the incoming workloads. So, here we can see our network mapping, and we can see the new VLANs we created for our migration needs. Here we can see the storage mapping and the storage class used for accommodating our incoming workloads. Finally, with the use of a script, we had to create our migration plans. Each plan holds inside of it all the VMs that are about to be migrated to the same namespace, meaning the ones used by the same users. Once we were ready, we triggered, again with the use of a script, all the migrations, and the migration started. As you can see, it can also be triggered from the console, but since we were handling a migration at scale, we automated this process. Now, I would like to have a quick overview of the steps we took and add some additional information. We started by deploying Forklift and setting up all the custom resources for the migration. Then, with the use of scripts, we automated all the plans and the migrations. In our case, we decided to go with cold migrations. That means that during the transition the VM is shut down, because it best suited our needs.
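Building one plan per namespace, as described above, might look like the following sketch; the field names follow my understanding of the Forklift Plan custom resource and should be checked against the actual CRD before use (the namespace and VM IDs are invented):

```python
def make_plan(name, target_namespace, vm_ids):
    """Sketch of a Forklift Plan CR: one plan per target namespace,
    holding all the VMs shared by the same users."""
    return {
        "apiVersion": "forklift.konveyor.io/v1beta1",
        "kind": "Plan",
        "metadata": {"name": name, "namespace": "migration"},
        "spec": {
            "warm": False,                      # cold migration: VM shut down
            "targetNamespace": target_namespace,
            "vms": [{"id": vm_id} for vm_id in vm_ids],
        },
    }

# One plan per namespace group (made-up example data).
groups = {"alice-bob": ["vm-1", "vm-2"], "carol": ["vm-3"]}
plans = [make_plan(f"plan-{ns}", ns, vms) for ns, vms in groups.items()]
```

Submitting these documents through the Kubernetes API is what the speakers mean by triggering the migrations "with the use of a script".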
Warm migration, on the other hand, keeps the VM operational during the migration, but it leads to longer migration times, because the data needs to be backed up constantly to keep the VM operational. So, during this transition, we also monitored and troubleshot the entire process, just to make sure that we were on track. And once the migration was over, we chose some VMs at random and tested that they were up and running, and then waited for user feedback. So, although we eventually had a successful migration, we did encounter some issues during it. The first two issues related to the fact that we had a lot of migrations running simultaneously. That caused both storage and network strain, and eventually led to longer migration times than we originally anticipated. Another issue we encountered caused some of the migrations to fail, and after some investigation, we realized it was related to a bug in our codebase. We released a fix, and with that fix we were able to migrate all the VMs; the fix was included in the next version of Forklift. And finally, since downtime was involved, we had to keep communication clear and make sure everyone was in the loop on what was happening. So, once we started receiving user feedback from the field, it was clear that we still had some issues to solve in order to make this transition fully successful. The first one was a boot order issue: VMs with multiple disks were not booting from the right one. We addressed this issue manually, and later we discovered it was caused by another bug in our codebase, which was fixed in the next version of Forklift. The second issue was related to the new VLANs we used. That caused our FQDNs to change, and the workloads inside the VMs were no longer accessible. So, we had to update our DNS records, and the users had to adjust the FQDNs inside their workloads to use the new ones.
And after that, all the workloads were accessible again. So, as we're reaching the end of today's journey, I think it's a good point to reflect and draw some conclusions. Overall, we had a successful migration. We were able to migrate more than 100 VMs and copy 12 terabytes of data. We were mainly able to achieve this result through thorough, in-depth preparation and planning, and we realized how crucial that is for a successful migration. Another thing is that we understand that each migration process can be different and happen between different environments, but we do see some common ground and best practices that can be applied to similar journeys. And finally, and probably most importantly, even though Forklift is a really powerful tool and gives us great capabilities, it cannot facilitate a migration on its own; additional steps are required, such as the use of scripts and thorough preparation. So, as we're wrapping up today's session, I would like to extend my biggest gratitude to each and every one of you. I hope that today's session will be valuable for people who want to go on the same journey. I wasn't able to cover all of today's topics in detail, but we posted a blog post about this, so whoever wants more information is more than welcome to take a look. And that's it. Questions and some insights. Thank you. Yeah. How did you handle notifying the VM owners during the process? Did you automate the notifications? Yeah, we had... Can you repeat the question? Sorry, for the streaming, so people watching can hear it. So, did we automate the process of notifying the VM owners? In our case, we had a VM list that included all the owners in this environment, and we set up a spreadsheet that included all the VMs that were eligible for migration.
And then we asked the owners to let us know if they wanted to migrate their VMs, because there were people who decided to move to different environments or didn't need the VM at all. Based on this information, we also built our final migration list. So, yeah. Yes. Hi, could you please give us an example of what issues you had during the migration steps? Yeah, so I will give an example of a boot issue we had after the migration. We had a lot of VMs with multiple disks, and when you're trying to boot from a disk that doesn't have the operating system on it, the boot will fail. You just see a black screen, and the OS is not found. We understood that it was probably not booting from the right disk, because we saw there was another one. And once we manually changed that, we saw that it solved the issue. So, we adjusted this manually for all the VMs in the migration list, and after that, as I said, we released the fix in our next version. Yeah. Hi. Hi. Is this tool also performing some kind of preflight check over the plan? I don't know, checking that you have enough space on the target storage class, or checking that the VM you selected is not exposing particular devices that could make the migration fail in the middle? So, the question was whether we do some verification to make sure that we have enough space on our target environment, or on devices, like in compute. We do have a set of validations on our plans, but these ones are not included. We check more things like names matching Kubernetes conventions and more security-related things, not something like that. Yes. You mentioned 12 terabytes of data transferred. I was in a presentation yesterday about PCCOP, talking about validating that all the data was copied correctly over a large database migration. Did you do something similar? Because it seems quite a hard problem; with a lot of data you might get corruption.
So, the question is whether we do some validation that the data is copied correctly. It depends on the source environment, but we do use some external tools for that, and these external tools are supposed to make sure that all the data is copied correctly. So it really depends on the source environment you're using, because there are different flows between the different environments. But as for the tools we're using: for VMware, for example, we're using virt-v2v, which takes care of this check. For oVirt and OpenStack, we're using imageio, so it's taken care of under that tool. Okay, so if anyone wants to ask any specific question, feel free to approach us outside. Thank you.
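The kind of end-to-end validation the questioner asks about can, in the simplest case, be done by streaming both disk images through a hash; this is a generic sketch, not what virt-v2v or imageio actually do internally, and the file names are placeholders:

```python
import hashlib
import os
import tempfile

def disk_checksum(path, chunk_size=1 << 20):
    """Stream a disk image in 1 MiB chunks and return its SHA-256 digest,
    so arbitrarily large images can be verified without loading them fully."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Self-contained demo: write two identical dummy "images" and compare them.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "source.img")
    dst = os.path.join(d, "migrated.img")
    for p in (src, dst):
        with open(p, "wb") as f:
            f.write(b"\x00" * 4096 + b"disk-data")
    same = disk_checksum(src) == disk_checksum(dst)
```

In practice a per-block comparison would also tell you *where* a copy diverged, at the cost of reading both sides again.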
Making VirtIO sing - implementing virtio-sound in rust-vmm project
Hi everyone, my name is Dorinda Bassey and I work at Red Hat. I currently work on enabling the audio stack and other features in the automotive team. And with me here is Matias. Hello everyone, I'm Matias. I also work at Red Hat, in the automotive and the virtualization teams. And I'm going to talk about the virtio-sound implementation we did last year and this year too. So yeah. Okay, so in this presentation, we'll be talking about making VirtIO sing, and we'll focus on the implementation of virtio-sound in the rust-vmm project. Just a brief outline: I'll be talking about the automotive use case, and I'll go through the virtio-sound device and the driver. Then Matias will take care of the vhost-user design and implementation, the audio backend architecture and the upstream status. Okay, so let's get right into it. One might ask: why virtio-sound? Our main use case is the automotive industry. And in automotive, Android guests are being used for deploying infotainment systems. So, in order to support these Android guests, the virtual machine monitor, in our case QEMU, requires a set of virtual hardware like virtio-sound, virtio-net and virtio-gpu. And having a virtio-sound device emulation would allow Android to be deployed on different virtual machine monitors that currently support virtio device emulation. Examples of these VMMs are QEMU, crosvm and the likes of them. The Android reference platform, which I linked on the slide there, defines a set of virtio interfaces that are expected from any VMM that runs Android. So, based on our expectations for QEMU/KVM as the hypervisor, we decided to close the gap, which involves enabling the virtio-sound device emulation as an external process. So now QEMU, or any other VMM that implements the vhost-user protocol, can interact with this user-space application. But before showing you how we built this device, let's present what the device is.
So the VETAIO sound device is a parametriolized sound device and is based off on the VETAIO specification standard. It's consisting of the VETAIO driver, the PCI bus transport and the VETAIO sound device. And this is an architectural view of what the sound stack looks like. And I will show you how the different VETAIO components come together. So first we have the user application in the guest that's interacting with the driver using a set of SISC calls and common user space libraries, such as, take for example the ALSA library in the case of a normal application in the guest or tiny ALSA library as in the case of an Android application. And then the VETAIO sound driver on the other side takes the information that it received from the guest user space and shares it over a transport method. And in our case is the PCI bus. Now this PCI bus is a way to expose the VETAIO sound device to the driver that's in the guest. And the VETAIO sound device, just like any user space application that's running in your host, it sends the audio streams to the host sound drivers and the necessary sound libraries and the E-mone would route it to the host, to the sound driver that's running in the host canal space. So I mentioned something about the VHUCHESA protocol in the previous slide. So what is it? The VHUCHESA protocol is a set of messages that has been designed to offload the VETAIO data part processing from QEMU to a user space application on the host. And this user space process application is what's responsible for configuring the VETAIO rings and doing the actual processing. The VHUCHESA protocol actually uses communication over the Unix domain circuit. And it allows the control planes to initialize the shared memory regions and also exchange the file descriptors. The protocol defines two sides for communication. We have the front end and the back end. For the front end, we have it sending the message request while the back end is sending the message replies. 
The protocol itself also implements the control plane for establishing virtqueue sharing between the guest and the user-space process, and this user-space process utilizes the vhost-user library. So, I attached an example here of what the vhost-user protocol messages look like. We have the front end sending the virtqueue memory layout and configuration to the back end, and you can see the message outputs in hex format. An example of one of these messages is the VHOST_USER_GET_FEATURES message, which expects an acknowledgement reply. But not all messages from the driver expect a reply from the back end. We show here the sockdump tool, which is a tracing tool that can help you while you're debugging, in case you want to play around with the traffic. This sockdump tool dumps the socket traffic between the front end and the back end, and it's used by passing the path of the socket and also specifying the format. Maybe you want the format in hex, and sockdump can also provide the output in pcap format if you want. So, the virtio memory region, which is this guest memory here, is initially allocated by the guest, and in QEMU this is done with the mem-prealloc option. Once the memory region has been allocated by the guest, it's mmapped by both the front end and the back end using the mmap syscall. So this memory region is accessed via the file descriptors that are mmapped. Okay, so what happens during the device initialization? We have the feature bit negotiation that goes on there. During this initialization, the device and the driver both have feature bits that need to be negotiated. At this point, the driver reads the feature bits that the virtio-sound device exposes, and then the driver tells the device: okay, hey, I only support this subset of features, or I do not accept this set of features.
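The negotiation step described above is just a bitwise intersection: the device offers a feature mask, and the driver acknowledges only the bits it also supports. A minimal sketch, using the generic ring feature bits from the VirtIO spec:

```python
# Generic virtio feature bits (bit positions from the VirtIO specification).
VIRTIO_RING_F_INDIRECT_DESC = 1 << 28   # indirect descriptor support
VIRTIO_RING_F_EVENT_IDX = 1 << 29       # device controls notification rate
VIRTIO_F_VERSION_1 = 1 << 32            # modern virtio interface

def negotiate(device_features, driver_supported):
    """The driver acknowledges the intersection of what the device offers
    and what the driver itself supports."""
    return device_features & driver_supported

offered = VIRTIO_F_VERSION_1 | VIRTIO_RING_F_EVENT_IDX | VIRTIO_RING_F_INDIRECT_DESC
# Hypothetical driver that doesn't implement indirect descriptors:
supported = VIRTIO_F_VERSION_1 | VIRTIO_RING_F_EVENT_IDX
acked = negotiate(offered, supported)
```

After this exchange, only the bits present in `acked` (here, VERSION_1 and EVENT_IDX) govern the device's behavior.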
So take a example, when we have the VETL ring event IDX feature, when it's been negotiated, it would allow the device to control how the notification from the driver should be handled. And we have other features like the indirect descriptor feature. And this one thing to note about the VETL sound driver is that it doesn't have any specific features that are currently defined. So it uses a generic feature bit set of the VETL device. And there are a couple of other driver requirements for this feature bit negation, which you can find it in the VETL specification link. So in a nutshell, a VETQ is a queue of guest allocated buffers. And this VETL sound driver is consistent on four VETQs. We have the control queue, the event queue, the TX queue and the RX queue. And each of these VETQs are consistent of three parts. So first we have the descriptor table. And the descriptor table is occupied the descriptor area. We have the available ring, which is occupying the driver area. And we have the used ring that's occupying the device area. So to further explain how the VETQs are being mapped in the driver and the device, take for example, we have the user application that's running in the guest. It would notify the driver of the audio streams that needs to be processed through the corresponding libraries and interfaces. And when the driver wants to send a buffer to the device, it fills the descriptor table with the M-Mapped buffer and writes that descriptor index into the available ring. Now after writing it, it has to notify the device of those available buffers. So it would notify the device saying, hey, I have some buffers that need to be processed. Now, depending on the buffer size, it could create a descriptor chain, which it would always because of the sound buffers are usually a lot of them. So for the device side, when it's done consuming these buffers, it would write the descriptor index into the used ring and send a used buffer notification to the driver itself. 
Now, in the past, this was not how the driver worked when the user application sent messages to it, because it was unable to determine when a buffer had been updated by the user application running in the guest. Some of our upstream contributions were to ensure that this acknowledgement callback is used to notify about the updated buffers and also to prevent the reading of stale buffers. Thanks to Matias for some of those contributions. Let's see how the requests are processed for each of the virtio-sound virtqueues. So, the control queue is used for sending the control messages from the driver to the device. These control messages are translated into vhost-user requests and forwarded to the backend for processing, and the device side responds to these messages indicating the status of the operation. The event queue is used for sending notifications to the driver, but in our current implementation we did not use it, because it's not necessary. Then we have the TX queue, which is used for sending the PCM frames for output streams. This TX queue is used for playback, so it carries the PCM frames initiated by the driver, and also the replies to the previously received frames from the device. The RX queue is used to receive the PCM frames for input streams, and this is used during capture. So the RX queue carries the PCM frames initiated by the device, and also the replies to the previously transmitted frames. So I'll let Matias take over. So now I'm going to talk about the vhost-user implementation. The vhost-user implementation is split into the front end and the back end, and the back end and the front end communicate using the vhost-user protocol, as Dorinda explained before.
So for the front end, we based on the word from Alex Benet from Linario that simplified the boilerplate code in Kimu, which is common for all the VHOS user devices. So if you want to see this work, I leave the patch set there. Then for the backend, we decided to implement it under the RASP-MM project in the VHOS device repository. And the benefits of doing that are the following. So for example, we show the device implementation between multiple virtual machine monitors like Kimu or cross-PM. And we use RASP as our main language. So we leverage the features that this language have. Also the process that emulates the device runs separately from the Kimu. So that's reducing the attack surface of Kimu. And also the current implementation has less context which that, for example, the Kimu built in device. And I leave you the link to the script that you can use if you want to try it, you compare. And also you have the link to the RASP-MM project. You can look for the implementation. So now let's see how the backend is designed. So basically the current implementation is made of a device and the audio backends. The audio backends implement the driver for different libraries like PyWear or ALSA. And the whole backend is implemented by a single thread. And current implementation has called the number of strings. So we have only one for input and one for output. So when a new request comes from the guest, depending on the queue in which the request arrives, we're going to have different handler. And depending on the queue, the semantic of how we handle that request change. So I'm going to talk about that a bit. So for example, for the control queue, when the driver's in a request, what we're going to do is just to process that request immediately. So for example, we're going to pass the request and depending on the control message, we're going to call a different method. 
What we use here is a generic interface, so anyone can write a driver for the audio backends, because they share the interface. And after processing the request, we immediately notify the guest that the request has been processed; in this case, the methods in the interface are not blocking. In the case of the transmission queue, a request arriving from the guest on the transmission queue means, as Dorinda said before, that we're doing playback, so we're going to reproduce some sound on the host. The way we process that request is by picking it up, I mean, storing a pointer to the request and putting it in a FIFO queue, which is per stream. And then at some point the worker is going to wake up, pop the queued request and process it. Here we have to make sure that we consume all the payload that the request has, or at least fill the buffer that the audio engine proposes, because otherwise the worker thread is going to wake up more often and we're not going to use the whole buffer that the engine has for the playback. So we have to be sure that we consume at least a whole period. In this case, for the transmission queue, we notify the guest only after consumption. We have to wait for that, because otherwise we can make the user application run out of data; the spec says that we have to notify just after consumption. As for the reception queue: the transmission queue and the reception queue are handled almost exactly the same. The only difference is that in the case of the transmission queue the payload has data to reproduce on the host, while in the case of the reception queue we have data on the host that we want to send to the guest for capturing. So the only difference is that when we pop requests, we use that space to fill in data from the host and then send it back.
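The per-stream FIFO and notify-after-consumption behavior described above can be sketched like this; it is a toy single-threaded model (the real backend uses a worker thread and real audio periods, and the period size here is invented):

```python
from collections import deque

PERIOD_BYTES = 4       # made-up period size for this sketch

tx_fifo = deque()      # per-stream FIFO of pending TX (playback) requests
notified = []          # request IDs for which the guest has been notified

def enqueue_tx(request_id, payload):
    """Handler for a TX-queue request: just queue it; a worker drains it."""
    tx_fifo.append((request_id, payload))

def worker_run_once():
    """Pop one request, consume its whole payload period by period, and
    only then notify the guest, mirroring what the spec requires."""
    request_id, payload = tx_fifo.popleft()
    consumed = 0
    while consumed < len(payload):
        consumed += min(PERIOD_BYTES, len(payload) - consumed)
    notified.append(request_id)   # guest is notified only after consumption
    return consumed

enqueue_tx(1, b"\x01\x02\x03\x04\x05\x06")
done = worker_run_once()
```

Notifying early here would be the bug the speakers warn about: the guest application could refill the buffer before the host has actually played it.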
So if you want to try it, as I said before, you have to launch two processes. One is going to be the device emulation, and this is the command line with which you launch it, up there; for example, the backend that you want to use, in this case PipeWire. And the other command line is for QEMU, and the only parameter that you have to take into account is the Unix socket that you're going to use to communicate with the daemon. So, I would also like to mention some of the additional work that this required. For example, we fixed the virtio-sound driver because it was not respecting the VirtIO specification; that is what Dorinda mentioned before. We have also been working on the spec to make it clearer, so we upstreamed some patches to the VirtIO spec. Other work we did was to add the descriptor-utils module to the virtio-queue crate; it existed before in virtiofsd, and we moved it to the virtio-queue crate so anyone can use it. The point of doing that is that you cannot know in advance how a request is distributed over the descriptors: the guest can use any distribution of descriptors it wants, because the spec doesn't say how to do it, and we have to be independent of that. That is the reason for it. Then there were the patches to add the generic vhost-user device, which reduces the boilerplate code that you have to put in QEMU for vhost-user devices. And there was also a lot of development in the pipewire-rs crate, thanks to Dorinda; for example, we added the filter module and also the ring buffer, and there was a lot of bug fixing that we did during this work. So yeah, we are getting to the end of the presentation. If you want to get in touch, feel free to participate in the vhost-device project. We also have a Slack channel called virtio-sound if you have any questions.
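The two-process setup described above can be sketched by constructing the two command lines; the flag names below are approximations of the vhost-device-sound and QEMU options (check each program's `--help` before relying on them), and the one thing that must genuinely match on both sides is the Unix socket path:

```python
SOCKET = "/tmp/vsnd.sock"   # placeholder path; must match on both sides

# Backend process: emulates the virtio-sound device (flag names approximate).
daemon_cmd = [
    "vhost-device-sound",
    f"--socket={SOCKET}",    # vhost-user socket the backend listens on
    "--backend=pipewire",    # audio backend: pipewire, alsa, or null
]

# Front end: QEMU connects to the same socket. Guest RAM must be shareable
# so the backend can mmap the virtqueues.
qemu_cmd = [
    "qemu-system-x86_64",
    "-object", "memory-backend-memfd,id=mem,size=2G,share=on",
    "-numa", "node,memdev=mem",
    "-chardev", f"socket,id=vsnd,path={SOCKET}",
    "-device", "vhost-user-snd-pci,chardev=vsnd",
]

shared_socket = any(SOCKET in a for a in daemon_cmd) and \
                any(SOCKET in a for a in qemu_cmd)
```

Launching the daemon first and then QEMU, as the speakers describe, lets the front end connect to the already-listening socket.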
And we also submitted a proposal for Google Summer of Code, so if you're really interested in participating, we are trying to add a new audio backend for GStreamer. So feel free to submit your candidacy for that. And if you have any questions, feel free to contact us directly; we have the email here. So yeah, that's all, I think. So I think now we go to questions. The question is what happens when you want to use it. When you launch the first program, it launches the device emulation, and then you launch QEMU. And then, for example, if you are in the guest and you want to use it, you run, for example, speaker-test or aplay or something like this, and then you're going to hear something on the host. So, yes, but what happens when nothing is happening? What happens when you use the null backend? So she's asking what happens when we use the null backend. It's silent. No audio. It doesn't use any library. The PipeWire backend would use the corresponding PipeWire libraries, and the ALSA backend the ALSA libraries, but the null one, nothing. Okay. Sorry, I missed the question. Can you disclose some car brands that are using this? Can we mention some brands that are using this implementation? No. Can I ask why you chose to implement this in Rust? Okay, he's asking why we chose to implement it in Rust. So, as you all know, given Rust's design safety and its features, we chose to implement it in Rust, and also for the memory usage. So, yeah. I can complement a bit, because there was also the rust-vmm project that existed before. So for a lot of things, it was quite easy to implement the device, because we could reuse many things. For example, working with the virtqueues and notifying the guest was already all in that project, so for us it was just a matter of implementing the parsing of the requests.
But, for example, the virtqueue handling was already there, and it was also easier to implement. Yeah. That's it. Maybe it's a bit out of scope, but have you made any benchmarks compared to fully virtualized audio devices? What's the overhead of using this compared to one of the audio devices already existing in QEMU? Okay. So he's asking what the benefit is of using this audio device in comparison to the other audio devices in QEMU. Regarding the PipeWire backend, PipeWire provides reduced latency and low CPU and memory usage. Using it as the audio backend, we did some latency benchmarks; you can look up the PipeWire wiki for how to do these latency benchmarks. You could also check CPU cycles and context switches, and also latency. So that's, yeah. I think we compared it with the QEMU built-in device, for example, and it showed fewer context switches for the user application in the guest. Yeah. One of my colleagues is developing a virtio-sound device too, but a completely different one; I don't think I'm going to go into details. He said that the way the virtio-sound specification is written doesn't allow a proper implementation of the device reset functionality. So I just want to ask if you've had any trouble with device resets, or I'm just curious how you've handled that. So the question is that the virtio spec, rather the virtio-sound spec, doesn't describe the reset method very well. That's it. There are some conflicts in the spec. We didn't have that issue yet, at least. And now I'm trying to remember if we have any feature called reset or something like this, but we don't. So maybe we can talk offline if you want. Any more questions? Thank you. Thank you. Thank you.
Exercising QEMU generated ACPI/SMBIOS tables using Biosbits from within a guest VM.
Thank you. Good afternoon, everyone. Thanks for coming to my talk on using Biosbits to test QEMU's ACPI and SMBIOS implementation. My talk is structured around four points. First, we'll discuss what Biosbits is and why we're using it to test QEMU. Then I'll talk about some of the implementation choices of my test framework, then describe the test framework itself, and then, depending on how much time I have, give a brief overview of the changes I made in Biosbits to get everything working together. So what is Biosbits? It's software written by Josh Triplett while he was at Intel, and it had real-life usefulness in the sense that the BIOS developers at Intel used it to test their BIOS implementations on real physical hardware. What this software lets you do is exercise ACPI and SMBIOS objects in the BIOS directly from a GRUB environment. And even though it's a GRUB environment, it also has Python built into it, so you don't have to write tests in GRUB's native Bash-like scripting language; you can write all your tests in Python. All of this executes from ring zero, so there is no need to go from ring three to ring zero to run your tests, et cetera. All of the components, that is GRUB, Python, and ACPICA (which is what Biosbits uses to execute ACPI methods), come together in the form of a bootable ISO, which is then used to boot an actual physical box or, in our case, a virtual machine. In its simplest form, it looks like this: you run QEMU with KVM using the bits ISO, and it spawns a virtual machine, executes a bunch of tests, generates the logs and pushes them out of the virtual machine (I'll describe that a little later), and then shuts down the VM. So why use Biosbits for testing?
Well, first of all, like I said, all the tests are written in Python in a pre-operating-system environment. That means we don't have to go through an OS to execute BIOS components; we can execute ACPI directly from the GRUB environment. And it already has ACPICA built in, so we can directly execute ACPI methods. The current test framework we have in QEMU, by contrast, basically spawns a VM, extracts the ACPI tables from the virtual machine's memory, and compares those tables with golden master blobs that are checked into the QEMU repository. If there is a difference between the two, it throws an error. The main idea is that any time we make changes to QEMU that affect ACPI or SMBIOS tables, we can inspect the changes and make sure they're not breaking anything. But what we don't have is the ability to actually execute the tables from a running VM, and Biosbits gives us that ability. That's the main advantage of using Biosbits. So let's discuss some of the implementation choices of the test framework. Biosbits is software in itself, with its own repository, and then we have the QEMU repository. The QEMU repository has all the changes that determine the ACPI implementation; the Biosbits repository has all the Biosbits-specific stuff, like all the build scripts and all its internal logic, and the two things are kind of separate. Adding to the complication is the fact that Josh stopped developing Biosbits around 2017, and every effort I made to reach out to him failed; he didn't respond to my queries. So we couldn't directly use the Biosbits upstream.
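The golden-master comparison that the existing test performs can be caricatured like this. This is a sketch of the idea only, not QEMU's actual bios-tables-test code, and the directory layout is invented:

```python
# Sketch of the golden-master idea: compare tables extracted from the
# guest against reference blobs checked into the tree. Illustrative only;
# not QEMU's bios-tables-test implementation.
import pathlib

def diff_tables(extracted_dir, golden_dir):
    """Return names of golden reference blobs that differ or are missing."""
    mismatches = []
    for golden in sorted(pathlib.Path(golden_dir).iterdir()):
        candidate = pathlib.Path(extracted_dir) / golden.name
        if not candidate.exists() or candidate.read_bytes() != golden.read_bytes():
            mismatches.append(golden.name)
    return mismatches
```

Any non-empty result corresponds to the "you have new changes in the tables, have a look" failure the talk describes.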
So what we had to do is fork the upstream Biosbits software, put it on GitLab under the QEMU project, and make changes to it. Those changes involved a lot of build fixes: Biosbits turns out not to be buildable with a newer compiler and toolchain, because nobody had been maintaining it, so we had to make a lot of changes just to make it build, and then a lot of fixes to get all the parts of the test framework working together, which I'll describe a little later. And then we have the QEMU repository with the changes that potentially affect the tables. The people who are actually making changes to the ACPI implementation in QEMU care about the QEMU repository; they don't know or understand the Biosbits repository. So we had to decide how these two repositories are going to work together. One of the questions is: do we make the Biosbits repository a submodule of the QEMU repository? There has been a lot of discussion upstream on that, and it turns out people really hate submodules, for a multitude of reasons. You can look at this thread upstream; it has a lot of interesting discussion on why we don't want to have another submodule. So how do we keep the two repositories in sync with each other is an interesting question. And then, from the developer's point of view, whoever is making changes to, say, the ACPI implementation in QEMU: do we make them go back and forth between the two repositories? Say they make a change in QEMU that affects the tables, and they want to write a test for it. Do they go to the Biosbits repository, make the change, build Biosbits into an ISO, come back to the QEMU repository, point the test to the new ISO, run the test; oh, something doesn't work and it fails, OK, go back to the Biosbits repository, make changes, come back to the QEMU repository, and go back and forth? That's kind of complicated.
And developers don't like to do that, because they don't really care about Biosbits; they just want to add a test to exercise their changes. Right? Another question is: what kind of test framework do we use to write the Biosbits tests? Do we use the qtest framework, or something else like the Avocado integration test framework? Now, the existing test I just described, the one that compares the blobs, is called bios-tables-test, and it's actually a qtest. People are familiar with that framework, because any time they make changes to the ACPI implementation, that's the test that fails: it compares table blobs and right away tells you that you have new changes in the tables and you'd better have a look. So people understand how bios-tables-test works. But should we use the qtest framework here? The problem with that is that qtest is really not written for something like spawning a VM, managing all the issues of VM management, collecting the logs, dealing with errors, and then shutting down the VM, et cetera. I started writing a qtest for Biosbits and then realized it's not really suitable. So I started looking into writing a new Python-based test framework just for the VM management, to be used with Biosbits. And then, when I proposed that upstream, somebody pointed me to the Avocado framework. I looked at it, and right away: Avocado already had all the libraries that deal with VM management, and all I had to do was focus on the Biosbits part and develop that. So the Avocado test framework fit really nicely into what we wanted to do, with what was available already and without any new development. So finally, we went with the Avocado test framework. But then the question is: how do we make people familiar with how to run Avocado tests?
Because not all people are familiar with this test framework; not all people run integration tests. So we decided: how about we write documentation for the Biosbits test? And that's what we did. The QEMU repository has documentation on the few simple commands you run to execute the test framework. So, I've just described all this; let's describe what the test framework is all about. I'll skip a couple of slides and show you the diagram here. Like I said, there are two repositories: the QEMU repository and the Biosbits repository. In the Biosbits repository, we want to maintain everything related to Biosbits and nothing related to QEMU or to testing ACPI. The way we did it is that in the fork, which resides right here, we have all these branches. The qemu-bits branch is the one with all the changes specific to using Biosbits for QEMU. There we have a GitLab CI job, which is basically a Bash script that builds Biosbits. As part of this CI job, every time you commit any change to the Biosbits repository, the job gets triggered and generates a bunch of build artifacts, which are nothing but pre-built binaries for things like GRUB, Python, ACPICA, et cetera. All these build artifacts are pushed to a well-defined location; there is a URL for it, and you can just go and download them. In the QEMU repository, we maintain the actual tests that exercise ACPI and SMBIOS tables. The actual tests are here, in this location, and they are run from within the Biosbits environment. Then there is a main driver that puts all these things together: the main Avocado test, acpi-bits.py. When you run the Biosbits ACPI/SMBIOS test, you run this guy.
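The "download the pre-built artifacts" step can be pictured like this. A hedged sketch only: the base URL and artifact names below are placeholders, not the real GitLab locations:

```python
# Hypothetical sketch of fetching a pre-built Biosbits artifact from the
# well-defined URL the GitLab CI job publishes to. URL and file names
# here are placeholders.
import pathlib
import urllib.request

def fetch_artifact(base_url, name, dest_dir):
    """Download base_url/name into dest_dir and return the local path."""
    dest = pathlib.Path(dest_dir) / name
    urllib.request.urlretrieve("%s/%s" % (base_url, name), dest)
    return dest
```

The main test then unpacks artifacts fetched this way instead of building GRUB, Python, and ACPICA from source.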
What this guy does is pull in these test scripts, where you have potentially added new tests for your ACPI changes in QEMU, and pull in these build artifacts, and together it generates an ISO here. With this ISO, it spawns a QEMU VM and runs the tests. Once the tests have run, it collects the logs; the logs are pushed out of the virtual machine to a well-defined location. The test script then analyzes the logs and says whether the run failed or passed, depending on how many tests ran and whether it finds certain patterns. So basically, this mechanism does two things. First of all, you don't need to go back and forth between the two repositories. Everything Biosbits-specific resides here, and if you're not concerned with Biosbits, or don't care how it's built or what changes are in there, you don't need to touch this repository. All you need to do is stay here. Every time you make changes to the ACPI implementation, you add corresponding test code here, and then you run this guy. It pulls in your changes, uses the existing artifacts, and runs your test. It also has a verbose mode that prints more information in case of a failure, so you can analyze the failure, make changes to the test scripts, and rerun it. The advantage is that you stay within the QEMU repository, in your workspace; you're not going back and forth between the two. And because pre-built artifacts are used, generating the ISO is a lot easier: those components don't need to be built, they're already built for you by the CI job. All you need to do is put the test scripts together with this guy and generate the ISO. So this is what I just described, all these points here.
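The log-analysis step at the end of that pipeline can be sketched as a simple pattern scan. The exact markers Biosbits emits differ; PASS/FAIL here is illustrative only:

```python
# Sketch of the log-analysis step: scan the logs pushed out of the VM
# for pass/fail markers. The patterns Biosbits actually emits differ;
# these are illustrative.
import re

def summarize_logs(log_text):
    """Count PASS/FAIL markers; the run is good only if nothing failed."""
    passed = len(re.findall(r"\bPASS\b", log_text))
    failed = len(re.findall(r"\bFAIL\b", log_text))
    return {"passed": passed, "failed": failed, "ok": failed == 0 and passed > 0}
```

A verbose mode would additionally print the surrounding log lines for each failure so the developer can adjust the test scripts and rerun.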
So let's look at the advantages, which I briefly described. No need for submodules. Pre-built artifacts make things a lot easier; if you need to make changes to Biosbits itself, you make the changes, build new artifacts, and point the main test to the new artifacts. The other advantage is that when you release QEMU tarballs, the tarballs don't contain any Biosbits-specific binaries. Those are maintained completely outside the QEMU repository, so they're completely separate and you don't need to release QEMU with any Biosbits artifacts. The disadvantage is that because we're using pre-built binaries, we are very architecture-specific: right now we only support 64-bit x86 and no other platforms. Supporting other platforms is non-trivial, because you need to make sure Biosbits can actually build for those platforms, and Biosbits was never tested on platforms other than x86, so it's non-trivial work anyway. And then there are tool dependencies for building the ISO, and the environment where you run the test needs to have those tools available. So let's look at an overview of the changes in the Biosbits fork. Like I said, Biosbits was never maintained after 2017, so I had to make numerous changes to make it build with the latest toolchain and compiler, across all these components. I also had to upgrade ACPICA, because ACPICA is the main driver that knows about the various tables; if you don't upgrade ACPICA, you cannot write tests that use the newer tables. I had to find a mechanism to push the logs out so that the test framework can analyze them, and I had to make sure the console logs are available.
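As one example of the kind of check a table test can make: every ACPI table carries a checksum byte chosen so that the sum of all bytes in the table is zero modulo 256. A minimal sketch of such a check (my example, not a Biosbits test):

```python
# Every ACPI table header includes a checksum byte chosen so the whole
# table sums to 0 modulo 256; a quick validity check is therefore:
def acpi_checksum_ok(table_bytes):
    """Return True if the ACPI table's bytes sum to zero mod 256."""
    return sum(bytearray(table_bytes)) % 256 == 0
```

A test running inside the bits environment can apply a check like this to each table it extracts before exercising the table's methods.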
One other thing is that the Python that runs inside the Biosbits VM is still Python 2.7, not 3, because upgrading Python there is non-trivial work, and since it's a very closed, very controlled environment, I didn't see the value of upgrading it. So it's still running Python 2, whereas everything else in QEMU is Python 3. These are some useful resources you can have a look at. They include Josh's presentation slides and his talk on Biosbits itself, which has a lot more detail than what I described about Biosbits in this talk, plus details about the test framework itself, the fork we maintain, et cetera. Last but not least, before the demo, I would really like to thank these people. Igor originally proposed the idea of using Biosbits for exercising QEMU's ACPI tables, and I'm grateful for that. All the others gave various useful feedback throughout the process while I was submitting patches upstream, and I'm grateful to all the reviewers of my patch sets and the entire upstream QEMU community for their help. Lastly, if you really want to see a demo, there is no time for it in this presentation, but you can click on this link; there is a video with a lot more detail on how to actually run the test and all the scripts in the repository. So thank you so much, and now I can take questions if you have any. Yes. I have a question: what do you mean by Python? Is it just a copy of the built-in Python? No, no, it's Python; the interpreter is built from source. Within Biosbits, Python is built from source. Python 2.7 is the one Biosbits uses, and it builds everything, because it has to build extensions so that it can integrate with GRUB. So from GRUB, you can say "py" and run a Python script.
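Because the interpreter inside the bits ISO is Python 2.7 while the rest of QEMU's tooling is Python 3, any helper shared between the two sides is easiest to keep 2-and-3 compatible. A minimal illustrative pattern (my example, not Biosbits code):

```python
# Illustrative: code meant to run both inside the bits ISO (Python 2.7)
# and outside it (Python 3) avoids f-strings and uses __future__ imports.
from __future__ import print_function

def table_signature(blob):
    """Return the 4-byte ASCII signature at the start of an ACPI table."""
    return bytes(blob[:4])
```

Under both interpreters this returns, e.g., the `FACP` signature from a FADT blob.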
All of that works because it was built from source with GRUB integration. The only problem is that it's Python 2.7, and I didn't see the value in upgrading it to 3, but you can run whole Python scripts, and that's how all the tests work: they all run from GRUB, but they're full-fledged Python 2.7 scripts. So it's a full-fledged interpreter, not only a certain API you can use? No, no, it's full Python. Any other questions? Thanks. Thank you.
One SDN to connect them all
Okay, good afternoon. My name is Miguel Duarte; I'm a software engineer working for Red Hat in the OpenShift Virtualization networking team. In this talk we're going to discuss an SDN solution for both types of workloads, so you can have pods and virtual machines on the same network, the use cases this SDN provides, and a little bit of how it works. There will be some demos as well. So let's jump to the agenda. First, we'll explain the motivation, what drives us to do this and the actual problem we're trying to solve. From there, there's going to be a short introduction; how deep it goes depends on a few things. Then I'll walk you through the use cases for this SDN solution, show the demos, and finish with the roadmap for the future and the lessons we've learned during this development. So, first thing: how many of you have used or worked on anything that has to do with Kubernetes? Yeah, pretty much everyone. How many of you use KubeVirt or know what it is? Well, more than I thought. Okay, cool, so the introduction is not going to be that deep. First, the Kubernetes networking model. As most of you will know, it's very simple, and one of its few premises is that any pod deployed on a Kubernetes cluster can contact and reach any other pod in the cluster. Basically, you have cluster-wide communication between whatever workloads are deployed in your cluster, without NAT, by the way. Another thing you get as a byproduct is that the CNI pretty much configures a way for you to reach the outside world; you get free batteries to reach the internet. The thing it does not allow you to do is connect to a pre-existing network.
If, for instance, you want to connect to a database deployed on an existing network, well, you're out of luck; Kubernetes does not solve this. More things: if you want to deploy a VNF, for instance, and you require more than one interface, Kubernetes will also not do that for you. There are out-of-tree solutions, but we're not going to go there right now. So the motivation for our talk pretty much comes down to this: you don't have an entryway to access stuff on physical networks or to get access to additional networks. The default cluster network that comes bundled, that Kubernetes or whatever Kubernetes distribution you have gives you for free, is not suited to all types of use cases. For instance, virtualization: those of you who use KubeVirt should know that IPAM management for virtualization is extremely tricky and does not mix well, pretty much because you get different IPs when you migrate from the source to the destination pod, and that will not play along correctly. And finally, in virtualization you typically use secondary networks for all sorts of east-west communication, and you rely on the default cluster network just to get the batteries of Kubernetes services, stuff like cluster DNS and things like that. So on your secondary networks, you need to figure out other ways. You could use the bridge CNI and other types of plugins, but that means your operations teams will need to know how to debug yet another, totally different solution, your admins will need to know and configure yet another bunch of solutions, and depending on the use case you'll realize that this plugin works but that other one does not.
So the matrix of things that your operations team has to know and your administrators need to learn to configure skyrockets, and it becomes literally too expensive to handle. Now, the objectives. First, we want cluster admins to be able to go do something else: we want to push all the complexity of these different use cases, and this mix-and-match of technologies, out of their heads and out of the many tools they need to learn, and push it into the network. And finally, we want a single plugin to be able to handle a multitude of use cases. So pretty much what we want is for whatever CNI comes bundled with our Kubernetes distribution to work properly both for the cluster default network and for the secondaries. Okay, a very short introduction now. KubeVirt. KubeVirt is a Kubernetes add-on that allows you to run virtual machines inside pods. So you basically get two different types of workloads, pods and VMs, and you manage them from the same solution. As an implementation detail, the virtual machine actually runs inside a pod, and each such pod has a libvirt instance running in it, and the QEMU process, and so on. And just to finish: the networking requirements a virtual machine has are a lot more demanding than a pod's. A pod is essentially stateless; it's cattle, you just kill it and a new one will spawn and do its thing, while a VM is stateful and you need to treat it very carefully. Now, a little disclaimer: the SDN solution we developed uses OVN. OVN stands for Open Virtual Network, and it is essentially an SDN control plane for Open vSwitch. So you have Open vSwitch installed on each of the nodes of the cluster.
You have OVN on top, which renders OpenFlow from higher-level logical entities and installs it into each of the Open vSwitches on the nodes. For example, you have a logical switch that grants L2 connectivity between the workloads on two nodes, and this gets rendered into OpenFlow and installed on the nodes. Then we have OVN-Kubernetes on top of that. It's a CNI plugin whose job is to translate Kubernetes entities into OVN logical entities. So, for instance, when we have a secondary network, what we end up with is a logical switch; when we have pod attachments, we get logical switch ports connected to those logical switches; and a network policy is nothing more than a port group that associates a list of logical switch ports with a bunch of ACLs. All right, supported use cases. As I said in the motivation section, for virtualization use cases you mostly do not rely on the default cluster network for east-west communication; you use secondary networks. So the first use case we focused on is east-west communication. As you can see here, these things are pods or virtual machines, it doesn't matter which; what we actually do is attach a new network interface to each, configure it, and what we get is the logical view of having them connected via a cluster-wide switch. That's literally what we get: a cluster-wide switch, a connection to it, and everybody connected to it can communicate across that network. There's a short demo right here, and we'll see that... oh god, no internet. I knew that; that's why I have this terminal here. I'm really sorry for the font size, but if I make it a little bigger it will basically mess up the window configuration. I hope you can still see it.
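That Kubernetes-to-OVN translation can be caricatured in a few lines. This is my own toy sketch, not OVN-Kubernetes code, and all names are invented:

```python
# Toy sketch of the mapping described above: a secondary network becomes
# an OVN logical switch, each pod attachment a logical switch port, and
# a policy a port group with ACLs. Not real OVN-Kubernetes code.
def render_logical_topology(network, pod_attachments, policy=None):
    ports = ["%s.%s" % (network, pod) for pod in pod_attachments]
    topology = {"logical_switch": network, "logical_switch_ports": ports}
    if policy is not None:
        topology["port_group"] = {"name": policy["name"],
                                  "ports": ports,
                                  "acls": policy["rules"]}
    return topology
```

The real plugin writes the equivalent rows into OVN's northbound database, and OVN then renders them down to OpenFlow on each node.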
The first thing we're going to see is the network configuration. I'm not sure if you're used to using Multus; those of you who use KubeVirt, I guess you are. This is the first thing we need to look at: the network attachment definition. This pretty much holds the CNI configuration from which the CNI plugin will configure networking for your pod. The interesting thing here is the name of the network. The idea is that networks are not namespaced, but the network attachments are. This means that if your network admin wants to grant a namespace access to a network, he or she needs to provision one of these in the proper namespace, and that will connect your namespace to this network. (Oops, sorry, it does not go back.) Another interesting thing there was the topology, which is layer 2: what we have is an overlay network, totally disconnected from the physical network, that gives you east-west communication. And we do not have IPAM, because IPAM for these workloads is very tricky; we'll see more on that later. So we're connecting two different namespaces, as I said. Now we provision these into the cluster. Fun fact: this is all lazy, so I just put a bunch of stuff into the cluster but nothing has happened yet; just the definitions are provisioned and nothing else. Now let's look at the workload definitions. Here they are. We have one virtual machine; remember we do not have IPAM in the network, so we need to configure the IP statically. There at the bottom, we configure the IP statically using cloud-init in the VM, and its IP is 192.168.200.10.
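A network attachment definition along these lines might look roughly as follows. This is a hedged sketch built as a Python dict; the field values are illustrative, so check the ovn-kubernetes documentation for the authoritative schema:

```python
# Hedged sketch of a layer-2, IPAM-less secondary network attachment for
# ovn-kubernetes; values are illustrative, not copied from the demo.
import json

def tenant_network_attachment(namespace, network):
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": network, "namespace": namespace},
        "spec": {"config": json.dumps({
            "cniVersion": "0.4.0",
            "name": network,                    # the network itself is cluster-wide
            "type": "ovn-k8s-cni-overlay",
            "topology": "layer2",               # overlay, detached from physical nets
            "netAttachDefName": "%s/%s" % (namespace, network),
            # note: no "subnets" key, i.e. no IPAM; IPs are set statically
        })},
    }
```

Provisioning the same network name in two namespaces is what connects both namespaces to one logical switch.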
Then we have our pods; we have two specified here, the blue server pod and the yellow server pod. The blue pod has the .20 IP, which we configure using the network selection elements (this is Multus lingo), and what it's doing is exposing an HTTP server on port 9000. So we have two servers, blue and yellow, the VM is the client, and we're going to curl each of the servers; they're going to echo back their hostnames. That's what we're going to see. Let's first provision this; the windows at the bottom are there so we can see when they're ready. The servers are ready, and the VM is booting up; let's speed up this part. We're now going to log in via console to the virtual machine and curl both servers, and they'll echo back their hostnames. (God, this is going to take forever; if only I had internet I could play a video. It stopped again. Okay, it's playing, cool. So yeah, logging in via console; I hope it did not stop. The UI of this thing is absolutely preposterous; I don't know if it's playing or not.) Okay: we curl the .20 IP address and it replies with the blue server's hostname; we do the same to the .30 IP address and it replies with its hostname, the yellow server. This concludes the first demo, which showed us east-west communication between different workloads in different namespaces.
Now to the second use case. Remember the motivation slides where I mentioned accessing things on a pre-existing physical network; that's exactly what we're going to see now. As before, we have the logical view of a cluster-wide switch; the difference is that this switch is actually connected to a physical network, and you can access what's there. In our example it's going to be a database that has the data the VM needs. The first thing to elaborate on a little is that you need to configure the physical network. First of all, this is not something a typical user gets access to; it has to be done by a cluster admin, and for that we're going to use two tools: NMState and kubernetes-nmstate. NMState is basically a declarative tool that configures networking: you just give it the desired state, "I want my network to be like this", and it goes off punching buttons, attempting to make the current state match what you desire. If it fails, it rolls everything back, so there are no changes to your network and it cannot destroy it; if it succeeds, it tells you it succeeded. So basically, what we want is to use kubernetes-nmstate, which is a cluster-wide thing: send a YAML specification to the cluster, and my network specification will be applied on all the nodes in the cluster. It looks like this: on the left we have an example of a policy, and on the right a diagram of the topology we're trying to build. This is going to be applied to all the Kubernetes worker nodes, because of this node selector we have here. What we do is create an OVS bridge on each of these worker nodes, attach this ens4 interface to the OVS bridge, and then, using these OVN bridge mappings at the bottom, we're saying that we want traffic from the network called "defaultnetwork" to be sent
to the OVS bridge called br-ex, and we want traffic from the tenantblue physical network to be sent to the ovs1 bridge. It's literally the diagram you see there on the right. Now, we are granting workloads access to the physical network, and you should tread carefully when you do this. For that we need micro-segmentation. This is pretty much what network policies are for the primary, cluster-default network, but the exact same thing applied to secondary networks. In our example, we have a virtual machine that wants data from the database, but we do not allow it to consume the data directly from the database; instead we expose that information from a pod. The pod can connect to the database and exposes the information through a RESTful API on port 9000. So what we want is to ensure the VM cannot reach the database directly over the PostgreSQL port, but can get the data using this tiny pod as its data proxy. So, another demo; this is going to be a disaster. I'm tempted to tell you to just check this at home, but we have more than five minutes, right? (Again, this does not work; sorry, it's the other cast.) In this demo we have two namespaces, data-consumer and data-adapter, which we just provisioned. First, some information: I'm running a kind cluster here (and I botched this again), so my Kubernetes nodes are running basically as containers on my laptop, and the physical network you see in the diagram is basically my laptop; it's going to be connected by a Linux bridge on the laptop. And since I'm using a VLAN, I need to pre-provision the VLAN, and that's the interface you see here at the bottom, podman1.123: a VLAN on top of the Linux bridge management interface. I'm
going to show this again. So again: VLAN 123 on podman1, that subnet, and we have a containerized database running here, and you see we have access to the database. Now let's check our manifests. (I'm really sorry; I think you should check the demo at home. We have five minutes, and I don't think this is going to work.) Please do check the demo at home, but pretty much what you'd see is what I showed in this diagram: you have direct access from the VM to the database, you can psql to the database directly and get the data over HTTP from the pod, and then I provision some policies and you stop having access to the database. That's pretty much it. Now, the roadmap: what are we going for next? The first thing we need is IPAM for these workloads: we need to find a way to tie the IP allocation to the virtual machine and not to the pod where the VM runs. Remember, our big issue with virtualization is migration: when you migrate the VM to a new node, the pod gets a new interface while the VM still has the old one, and networking is no longer properly configured. We want to address that first, and that will unlock the next thing in our queue, which is selector-based policies. Right now, in our policies for secondary networks, you can only use IP blocks; you cannot express semantic things like "I want to allow all workloads from the namespaces having these labels to access this sort of workload". You need to specify IP ranges directly. Once we have these two things, we're going for services next: we want egress from VMs and load-balancer services, exposed via services, so that you can access them directly over the secondary networks. Finally, self-service networks: instead of having the cluster admin provision these for you, since a simple network overlay does not
touch the physical network you could directly use like a self-service functionality to just create the network yourself and connect and provide east-west connectivity to your workloads okay well lessons learned this was a let's say a joint venture like or a collaboration between both Red Hat and NVIDIA and the fun thing is we had the exact use cases in mind like both of them focusing on Qvert but with different scope so we are a lot more into the generic kind of world we want we give you a platform and you do whatever you want with your platform and NVIDIA notes they have like their use case in mind which is I guess gaming in a data center and their tooling is a lot more let's say pointed in that direction but was a really nice collaboration and we're hoping to see more in the future on a less positive note we get the user experience of this is not amazing like it's better than originally intended because like for instance the thing I've showed you about the the nm state policy that was something that we came up with because we felt like this feature is entirely unusable like people are going to be breaking their default cluster network every time they use this or risk doing that so we've provided that but we still have some sort of nightmarish stories every now and then because of the the way network attachment definitions look like and how easy it is for you to get things wrong and how silently and how these silent errors kind of creep up it's absolutely insane like sometimes things work but not in the way you expect because it just doesn't recognize one of the parameters because you typed it wrong but everything else works it's insanely hard and yeah I'm really sorry about the demo but yeah thank you for listening and if you have any questions in one minute one question it's the same thing yeah so the question yeah so the question is basically there's another cni so we're doing this in oven Kubernetes and there's another cni cni called cube oven so it's kind of 
it really screams that it does the same thing and yeah it really does the same thing the use cases are mostly the same the thing is that they do a lot more than we do like quite honestly like their feature set is a lot more complete than ours and we're trying to get there like if your question is like why didn't you just use that well we do not have any let's say current stake in that cni we do not have maintainers there we do not like we would have to try to gain entry there and in some cases we do not like their API so we're trying to do things in another way it might seem like we're reinventing the wheel sometimes but yeah kind of we kind of are like we both do the same thing and their feature set is more complete but again we're trying to do more and reach their feature set thank you for the question and I think it's that's it well
Deploy Kubernetes... From Kubernetes: an overview of Cluster API
You're famous. Yeah. That's it. Close the door behind you. Okay, let's go. I hope you hear me correctly, people on the side, in the room, and online. So, yeah, let's begin. Thanks to FOSDEM for having me today to talk about Cluster API and Kubernetes, and thanks to you for coming here. It's quite impressive to see the room fully packed. I hope you will learn things, I hope you will discover things, that's the most important, and you will get some stuff to continue investigating at home. The goal of this talk is to give you a brief introduction to Cluster API. Let me quickly introduce myself. I work as a Flatcar engineer at Microsoft. Flatcar is an operating system designed to run container workloads. If you want to learn more about Flatcar, you can go at 5:15 to see my teammate Thilo talk about Flatcar; it's a deep-dive introduction and it will give you the key elements about this operating system. But that's not the purpose of today. Outside of work there is SRE France, a DevOps association where we organize meetups and conferences in Paris and in France, so if you want to talk at a meetup, or if you're interested in organizing something, let me know. Now, context: the context is Kubernetes. Kubernetes is quite the answer to everything today. If you want to deploy something small or something big, there is a big chance you're going to use Kubernetes.
So to me, it has become a great standard, I think we can use that term. That's the cool thing with Kubernetes: you can deploy a small thing and big things and it works, and it works the same way whether it's a big thing or a small thing. Something to know about Kubernetes is that there are two aspects to this technology. You can consume Kubernetes, meaning you deploy your application on it, and that's cool. And you also have to deploy and maintain the Kubernetes cluster. You can do both if you want, or only one aspect or the other. But today, let's focus on deploying and maintaining Kubernetes clusters, not on how to use them. Two or three weeks ago, I was on Twitter checking some news, what's going on in the tech industry, and I saw a tweet from a person I've met at different conferences: hey guys, what if I write a book describing all the ways to deploy Kubernetes? It was just an idea like that, but it got some traction in the end, and he started to draft a list of all the ways to deploy Kubernetes. He is here at FOSDEM today, so if you want to talk with him about his book, or if you want to invest in his book, it's a great opportunity to meet him; he has a talk in the Go devroom this afternoon. But we're not talking about Go, we're talking about Kubernetes and the fifty shades of deploying Kubernetes. You can use binaries, you can use managed services, you can use platforms, a bunch of ways to deploy Kubernetes. But today, let's have a look at line 27 or 26, something like that: Cluster API. Cluster API, if I quote the documentation, is a Kubernetes subproject focused on providing, well, you can read it. The most important part is the last line: the Cluster API project uses Kubernetes-style APIs and patterns to automate cluster lifecycle management. In other words, use Kubernetes to deploy Kubernetes.
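To make "use Kubernetes to deploy Kubernetes" concrete, here is a minimal sketch of a Cluster API `Cluster` resource. The names, CIDR, and API versions are illustrative assumptions, not taken from the talk; check your provider's documentation for the exact versions it supports.

```yaml
# Illustrative sketch: a cluster described as an ordinary Kubernetes resource.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demo-cluster            # hypothetical cluster name
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]   # example pod CIDR
  controlPlaneRef:                     # who manages the control plane machines
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: demo-cluster-control-plane
  infrastructureRef:                   # provider-specific half of the cluster
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
    kind: OpenStackCluster
    name: demo-cluster
```

You apply this to the management cluster like any other manifest, and the Cluster API controllers reconcile it into real infrastructure.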
So that's the cool thing with Kubernetes: you can extend the technology using CRDs, custom resource definitions, and you can benefit from the Kubernetes reconciliation loop for whatever you want to do. It's already there for the basic usage of Kubernetes, but you have other projects, like Crossplane, that leverage this way of managing lifecycle on the provider side. Cluster API is this kind of thing. So if we take a really abstract look at how it works, you have two clusters. On the left is the management cluster; this is the pilot of everything, this is where things happen. And you have the workload cluster; this is where you run your workload. Your website, your SaaS, whatever, will run in the workload cluster. This is what you already do if you operate Kubernetes clusters. But before that, we have the management cluster. You're going to tell the management cluster: hey, I need a cluster on this provider, please deploy everything I need to run this Kubernetes cluster. Because to deploy a Kubernetes cluster, you need networks, you need security groups, a bunch of things to create on the provider. Well, the management cluster will take care of that; it will deploy things for you, and you don't have to do anything. That's the way to see things. In this example, my management cluster is running with Kubernetes in Docker, kind. This is pretty convenient, because I can run my management cluster on my local laptop, on tiny resources, because I just need to deploy one single control plane. I don't need high availability and things like that; I just want to use the Kubernetes APIs. And the workload cluster in this case is running on OpenStack, just for illustration. As long as you have network connectivity between those two clusters, and you have credentials, of course, it will work.
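A management cluster like the one described above needs only a single control-plane node. As a sketch, a kind configuration for that could be as small as this (the file and cluster names are my own; a plain `kind create cluster` with no config works just as well):

```yaml
# kind-config.yaml -- minimal sketch of a single-node management cluster
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane   # one control-plane node; no HA needed, we only want the APIs
```

Then something like `kind create cluster --config kind-config.yaml --name capi-mgmt` gives you the local management cluster to install Cluster API into.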
And you can even decide to migrate the management cluster from one cluster to another, but that's something else. Those are the key elements to know if you want to understand the concept of Cluster API. So this is my kind cluster: I just have one single control plane running, that's it. Nothing fancy, nothing to do, just kind create cluster, and I have my management cluster. Nothing to install on top of it yet; just a regular Kubernetes cluster, really simple. Now, how can I create things on my cloud provider using Cluster API? Well, for people who already know Terraform or Crossplane, all these kinds of projects, you know there is no secret: you need to know the APIs of the cloud provider to implement them, to consume them, and to create what we call a contract. This is the border between the Cluster API logic and the cloud providers. You need to teach Cluster API: okay, in Cluster API we say that a network is this thing; a network on GCP will be this thing, on OpenStack it will be this thing, and so forth. The idea is to teach Cluster API how to manipulate and deploy resources on the cloud providers, and for this we use what we call Cluster API providers. On the kubernetes-sigs GitHub organization, you can see all the various providers supported: OpenStack, GCP, public cloud, on-premise (that's Tinkerbell in the upper right). You have a bunch of providers, and if you have some knowledge of Go programming, of API consumption and things like that, feel free to start contributing to these providers, because this is a cool way to start contributing to Kubernetes and its ecosystem. So that's what's going on under the hood. Now I have my management cluster; I need to create my workload cluster configuration.
So I just use the clusterctl (cluster-cuddle, whatever you call it) command to generate this YAML configuration file, and I just provide a few key elements: the flavor, the Kubernetes version, how many control planes I want, how many workers I want. One interesting thing is the flavor. Cluster API relies on templates, and these templates are provided by the maintainers of the providers. For example, the Flatcar template will deploy a workload cluster based on Flatcar images. You have some flavors, for example on OpenStack, with load balancing, if you need load balancing services and things like that. So the flavor is a way to customize the deployment of your workload cluster. You will still get a Kubernetes cluster in the end, that's fine, but you can decide to customize it. These flavors, these variants, are tested with end-to-end testing, so each time there is a new release of the providers, you can be sure it passed CI and you can safely update your configuration. Of course, for clarity, I didn't mention that you need to provide a few more environment variables, for example the credentials. Cluster API is going to create things on GCP, on AWS, on OpenStack, whatever; it needs access to this infrastructure, so it needs the credentials. That's an example of things you can pass, but you can also define which instance size you want to use for your control plane and which instance size you want for your workers. These are the kinds of elements you need to configure before calling this command. But just for demo purposes, I wanted to show you this command line, which is the bare minimum to generate the cluster configuration. And now we have the capi-quickstart.yaml file. We can apply it like any other Kubernetes manifest file: kubectl apply -f capi-quickstart.yaml. It will create, as usual, some resources on my management cluster.
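For a feel of what clusterctl generates, here is a hedged sketch of the worker-node part of such a file, a MachineDeployment pointing at provider-specific and bootstrap templates. The names, the Kubernetes version, and the infrastructure API version are assumptions for illustration, not the actual demo output.

```yaml
# Illustrative excerpt of a generated capi-quickstart.yaml (worker pool only)
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: capi-quickstart-md-0
spec:
  clusterName: capi-quickstart
  replicas: 3                      # three worker nodes, as in the demo
  template:
    spec:
      clusterName: capi-quickstart
      version: v1.29.0             # example Kubernetes version
      bootstrap:
        configRef:                 # how kubeadm bootstraps each machine
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: capi-quickstart-md-0
      infrastructureRef:           # provider-specific machine shape
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
        kind: OpenStackMachineTemplate
        name: capi-quickstart-md-0
```

The equivalent section for another provider would reference that provider's machine template kind instead.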
We can see there is the cluster definition and a machine deployment; that part is common to Cluster API. Then we have the OpenStack-specific part, and from one provider to another, of course, the output will be different. But that's the idea: you just apply this. And that's pretty convenient, because you have a file, so you can put this file in a Git repository, use it in CI, use it with whatever you want. You have infrastructure as code in terms of Cluster API. Now, if I check on the provider side, I have some resources being created by themselves. Well, not by themselves, by Cluster API. You can see that I have some instances: I asked for one control plane and three worker nodes, so we can see four instances being created. I have a network, I have security groups, I have SSH keys. This is OpenStack, but once again, it's the same for every provider. And this is the cool thing about Cluster API: it does not just deploy a cluster, it deploys everything needed to create a cluster, the instances, the security groups, the firewall rules, ingress, egress, whatever you need. When everything is up and running, you just get your kubeconfig, which you can feed to kubectl, and then you have a new cluster ready to be used. So that's it for OpenStack. Now, we can ask ourselves what's under the hood on the operating system side. I'm a Flatcar engineer, I work in the operating system field, so I'm curious to know what powers my nodes. With kubectl, we can inspect the nodes and see, for example, that this one is running Flatcar, because I asked for the Flatcar variant. But with Flatcar, we do not ship kubeadm, we do not ship the kubelet service, we do not ship the miscellaneous files. So how can my nodes start acting like Kubernetes nodes? How can things work? And on top of that, Flatcar is immutable: there is no package manager.
So there is no way Cluster API is going to SSH into that node and say, okay, apt install kubeadm. No, no, no. So what's the magic behind it? It's another project called Image Builder, on the kubernetes-sigs GitHub organization. The idea is to take an OS, for example Ubuntu, to build it with Packer, so nothing new under the sun, and to inject kubeadm, the container runtime, the miscellaneous files, whatever you need to power a Kubernetes node. It's a three-step thing: you take your OS, you inject the Kubernetes components, and then you export this new image, this golden image as we sometimes call it, to the provider of your choice: OpenStack, GCP, AWS. So you understand this is something quite complicated, because in order to use Cluster API, you need to use this kind of image. It means I can wait for someone from the community to build it. Building the image is not the issue; everyone can build an image. It's more about the storage. Storing one image is something, but you have to store an image for each Kubernetes version, and there are three supported versions at any time; then I have to keep an image for each cloud provider I want to use, and an image for each different version of Ubuntu. It can be complicated to store everything and to have the time and energy to build all these images. But this is what we currently do with this provider. I will not say this is the way to do it, but this is what is commonly done in the open source world today. So we can think about an alternative, and the true spirit of the open source world, to me, is to have alternatives: there is not just one way to do things, there are alternatives, and then you choose which one is best for you. So an alternative would be: okay, I take a Linux-based OS, for example Ubuntu or Flatcar, whatever. It's already available on GCP. It's already available on AWS.
It's already available on DigitalOcean, on Azure, because the cloud providers provide these images for you: the vanilla image, a fresh image, is already available. So what if we download the Kubernetes components during the boot of the image? In the end we have the same result: a Linux-based OS with the miscellaneous files, with kubeadm, everything I need to power my nodes. This is something we implemented on the OpenStack side. You just need to use another flavor; it's called flatcar-sysext. And it leverages this new feature of systemd called systemd-sysext. Basically, a sysext is an image, a raw or SquashFS image, that you mount as an overlay on a Linux-based system, and it brings new binary files and new configuration files into your system. If you want to have a look at systemd-sysext, I really encourage you to check out this new feature of systemd, and this is basically what we do with this flavor: during boot, we download the kubeadm systemd-sysext image, and everything will be up and running to power my node. Just as an example, if I SSH into the node, I can run systemd-sysext list, and it will show me the Kubernetes image being available. What's cool with this approach is that you remove the strong binding between the Kubernetes version and the image version. If you want to update Kubernetes but you don't want to update your base OS, you can. If you want to update your base OS but not Kubernetes, you can. Before, you were supposed to build new images each time. And the cool thing is that systemd-sysext works the same way on AWS, on Azure, on-premise, wherever, so you have just one configuration for all the cloud providers. That's something to keep in mind. And we discussed with the Cluster API folks what the new approach for this could be.
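For context on what a sysext image actually contains: it is just a filesystem tree (binaries under `/usr`, for instance) plus a small metadata file that tells systemd which OS the extension may be merged into. A minimal sketch of that metadata file, with illustrative values that you would adapt to your target OS, might look like:

```ini
# usr/lib/extension-release.d/extension-release.kubernetes
# Metadata a sysext image carries; ID must match the host OS
# (or be "_any"). Values here are illustrative assumptions.
ID=flatcar
SYSEXT_LEVEL=1.0
```

Packing that tree into a SquashFS image yields the extension that `systemd-sysext merge` overlays onto the running system.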
We also attended some office hours of the Cluster API ecosystem to demo this. It's already available on OpenStack, and we hope it will be available with the other providers next. A few resources, if you want to continue checking this at home: you have of course the Cluster API website, Cluster API OpenStack (this is for the example I've shown), Flatcar, and systemd-sysext, which is the main takeaway at the end of this talk. So, to conclude, I would say this talk was to give you the key elements of what's going on in Cluster API and how it works, but also to give you an overview of what we are currently working on in this area of Cluster API. So yeah, thanks. And of course I forgot to add the QR codes, but you can find the links on the FOSDEM website. Thanks for your attention; now, if you have any questions, here or maybe on the chat? I didn't see anything there, so we can start with you. Do we have a mic? No? Then I will repeat the question. Okay: I have a philosophical question about Cluster API. What's the lifecycle of a Kubernetes cluster to you? Is it long-running and you upgrade it, or is it temporary and you just replace it with a new one? So the question is about the lifecycle of the cluster: do we need to replace each node when there is a new version, is that correct? Yeah, but what is the intended purpose: long-lived clusters, or just short-term ones? Okay, so the question is: do we use Cluster API for long-term usage or short-term usage? I'd say that Cluster API is in the philosophy of leveraging Kubernetes, which means using the reconciliation loop of Kubernetes. Things adjust by themselves: if, I don't know, an instance goes down, it can be restarted by the management cluster, which takes care of it. There is a state, and you still want to be in this state.
Because with Terraform, for example, there is a state, of course, but it's not live; it's a static state. So if something goes missing, you need to rerun plan and apply to be sure your things come back. That's one of the differences. Yes, sir? Why clusterctl to generate a template instead of using Helm templating? That's a great question. Can you repeat? Yeah, so: why use the clusterctl CLI tool instead of Helm or Kustomize or tools like that? The idea of clusterctl is that it provides some sugar on top of the template generation, so you can manage your clusters; that's one of its features. But you also have variable injection: when you generate a template, it checks whether any variables required by the provider are missing. I think you could do the same thing with other tools, but clusterctl is just handy, and with it you can be sure you don't miss an environment variable needed to configure your deployment. Yes? In terms of the overlay, the Flatcar one, how hard or easy is it to build custom overlays? Say you've got OEM integration, what's the tooling to support that? So the question is about the sysext overlays and how to build these images. You don't have to build them, because we provide them on... You were saying you wrote custom security... Yeah, you can always decide to fork the repository I mentioned, which is called the Flatcar sysext bakery, where we provide these images. You can fork it, do your stuff, and why not send a PR if it's something relevant for the community. And it's just a matter of SquashFS: if you have the SquashFS utility tools on your system, you can just build your images. Basically, everything will be in a directory, then you convert that directory to a SquashFS image. Yes? Does the machine deployment controller do any sort of...
...forcing reconciliation? So if you were to delete the instance in OpenStack, would it be recreated? Yeah, not immediately, but a few seconds or minutes after, it will say: okay, I'm supposed to have four machines, I only have three, the missing one is a worker node, so I need to go recreate it. So the question was: if an instance disappears from the OpenStack or provider dashboard, does the management cluster restart the instance? Yes? As a Kubernetes admin, I really love managing my Kubernetes clusters with Kubernetes resources, but I always run into the bootstrapping problem: how do I manage my management cluster? I've had so many projects where I'd love to use Cluster API, and it never quite makes sense, because in the end I end up using Kubespray for the management cluster, and then I can just update it using Kubespray. Yeah, so, the bootstrap issue; I get the same question a lot. The question is: how do you create the management cluster? I think this logo is representative of the issue: it's the turtle, turtles stacked all the way down, and in the end there is no perfect answer, because you can decide to handle your management cluster with another management cluster. So you can chain Cluster API if you want to, why not; that's something you can consider. And on the new workload cluster, you can just say: okay, now this is a management cluster, so I'm going to deploy Cluster API on the workload cluster. That, of course, is just theoretical; there is no practical reason to do that, and it's not the point. But for your management cluster, as I said, you can use something really simple. I think there is this new Kubernetes tool you can use, like a deployment without kubelet or something like that, because you just need the Kubernetes APIs in the end. So why not come up with something like that, just deploy a set of API servers, and that's it.
But you can do things like that. You can also decide to use kind, for example; that's the easiest way to deploy the management cluster. Time's up. Okay, thanks everyone. Have a great day. Thank you.
Operating Kubernetes Across Hypervisors with Cluster API & GitOps
Hi everyone. Welcome to this talk on Cluster API across hypervisors, with GitOps; so we've got a lot of the hype words in there. My name's Richard. I'm a maintainer of various Cluster API providers, most notably the AWS provider, the GCP provider and the RKE2 provider. Hey, I'm Alex. I work together with Richard at SUSE. I'm a CAPI Operator maintainer and I also maintain the RKE2 provider. So today we are going to talk about Cluster API. (This mic is only for the stream, so I'll just speak louder.) Today we are going to speak about Cluster API, GitOps, and a couple of virtualization providers. We'll briefly cover what CAPI is; there was a previous talk about this, but just in case you weren't there, we will repeat some of it. We will tell you and show how Proxmox integrates with CAPI and how GitOps can be added on top, and then we'll replicate the same process with KubeVirt to show that CAPI can work with different infrastructure providers. Cool. So all the demos, well, the two demos in this session, are available via this repo. Feel free to take a picture of it; it's got the full script, so you can run this yourselves when you get home. I'll leave that up for a second. So, who was in the last talk, the intro to Cluster API? Okay, cool. So you get the idea: you have a management cluster, and on that management cluster you install Cluster API. Now, Cluster API is made up of the core controllers and a bunch of providers, and you can mix and match those providers to meet your needs. So if you're provisioning in AWS, you just install the AWS provider. Once you have that, you then declare the shape of your cluster; it's fully declarative, using Kubernetes custom resources. You apply that to the management cluster, then Cluster API does its magic: it provisions the infrastructure and then bootstraps Kubernetes on top of that infrastructure.
So we're going to demonstrate how it works on Proxmox. Just a couple of words about it, in case you don't know what it is: it's an open source virtualization platform that includes anything you need for your virtualization purposes. One thing to note is that there are two Cluster API providers for Proxmox out there. One requires you to have a template pre-created within your environment; the other will essentially take a bare OS and install everything on top. We are using the one that requires a template. Yeah, so we made a diagram of what our cluster will look like in terms of Cluster API. Everything you see there is a Kubernetes resource, and all these resources represent the Proxmox cluster we are going to use. We'll have the main entities: of course our Cluster, which references the infrastructure and also references the control plane and how it should look in the Proxmox environment. And then another resource is the MachineDeployment, which is used for the worker machines; it also references a template of how they should look on Proxmox, plus some configuration for bootstrapping Kubernetes on them. Cool, so over to the demo. We were going to do the demo live, but the network is not being nice to us; luckily we did record it, so let me just set this up. Can I do full screen? Yeah, that's what I tried; obviously didn't try hard enough. Hopefully you can see this all right. This initially just shows the repo that the link pointed to before. In that repo there are two branches, one for the KubeVirt side and one for the Proxmox side; we're obviously going to use the Proxmox branch here. In it you can see all of the artifacts we would have used in a live demo, and which you can use yourself. So, moving on to the prerequisites.
So as I mentioned, if you're going to do this yourself, you are required to have a template in your Proxmox environment. The way you do this, if you want to do it in an automated way, is to use the Kubernetes Image Builder project, which has specific make targets that will provision and build that base image for you. And actually, what we should see in a minute, if I change to that window, you can see it here: virtual machine 100 has been built using the Kubernetes Image Builder project. It's got everything on there required to bootstrap Kubernetes, with versions of kubeadm, et cetera, already baked into the VM, and it's been marked as a template within Proxmox. Yeah. Cool. So the basic flow is: we're going to create the management cluster, then install the GitOps agent on it, and then create a cluster. I'm just going to fast forward here. We're using kind for our management cluster; if I fast forward, we're just preloading a bunch of images onto it, the idea being it makes the demo a lot quicker. I'm going to start k9s in another window so you can see what is actually getting installed. This is a plain vanilla Kubernetes cluster at this moment in time. One thing to note, if you're going through the instructions at a later time: we've made a slight change to the clusterctl utility configuration so that we can install an IPAM provider. In the last session you probably went through the different provider types. The main ones are the control plane provider, the infrastructure provider and the bootstrap provider, but the newer provider types are the IPAM provider, which is especially useful for virtualized and bare-metal scenarios, and the add-on provider type. So the way you create a management cluster is with clusterctl.
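The clusterctl configuration tweak mentioned above lives in a small YAML file. Here is a hedged sketch of what registering an extra provider there can look like; the provider name and the components URL are placeholders for illustration, not the actual values from the demo, so check the IPAM provider's own release notes for the real manifest location.

```yaml
# ~/.cluster-api/clusterctl.yaml -- illustrative sketch
providers:
  - name: in-cluster        # hypothetical IPAM provider entry
    type: IPAMProvider
    url: https://example.com/ipam-provider/releases/latest/ipam-components.yaml
```

With an entry like this in place, a subsequent `clusterctl init` can install the IPAM provider alongside the core, bootstrap, control plane, and infrastructure providers.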
One thing to note here is that we're specifying version numbers. That was purely to pin the versions so that we could preload the images; you don't have to do that in your environment. This will go away and install all of the providers and core CAPI into this cluster, turning it into a CAPI management cluster. If we fast forward a bit, you can see them installed now, with the IPAM provider at the top there and the Proxmox provider. So the next step: we've got a management cluster, and we want to use GitOps in this scenario. You can use whatever GitOps agent you want. We're going to use Fleet, but you could equally apply these steps, with slight modifications, using Flux, Argo CD, whatever your choice is. Since we're using Fleet, we just need to quickly install it, a couple of Helm commands, and we'll have it there. So we can fast forward a bit. Now we have the GitOps agent in our cluster, and we can start using GitOps to provision clusters. And this is where, I guess, the mixture of Cluster API and GitOps gets really interesting, because you can then create clusters via a pull request, which opens up all sorts of different scenarios. It also means you can perform all operations against that cluster via pull requests, so you have the full history of the changes, you can roll back, and all of those things you're used to with GitOps for your applications you can now apply to your actual clusters and the cluster lifecycle. So in the repo you'll see two folders. Funnily enough, the one with the cluster definitions is the clusters folder; it's got just the one cluster definition in there, so we're going to bring it up now and have a look at it. It's just pure YAML. It describes the shape of your cluster, with different resources representing different parts of the cluster, whether that's the control plane or the worker machines.
And it matches the diagram that we showed before in the presentation — basically, this YAML is what you saw in the diagram, just not visualized. So two things to note here. First, just highlighting the fact that we are using Proxmox, so the resource kinds you'll have here depend on your infrastructure provider, and likewise for the other types of providers. Second, just highlighting some labels here. Remember these labels saying the CNI is Calico — we'll come back to that in a bit. And then we just see some various other aspects. One thing to note: we're also using kube-vip here. In this type of environment, you need some sort of load balancer so that you can get to the API server. We're just using kube-vip as an easy way to do that, and it uses gratuitous ARP. So if the control plane machine that the address is currently hosted on crashes, it will move across and start advertising the address from another control plane machine. It's quite a nice setup. So we can fast forward there. Here you can just see the shape of the VMs that we want, the specifications — this could be whatever you want. One thing to note is the template ID at the bottom, which says 100, and that has to match the template that was created via the Image Builder process. If they don't match, things don't work. So we require a small amount of configuration for Fleet, and this will be similar for other GitOps agents. In this file, which we call the GitRepo, we just tell Fleet: hey, go to this source repo, download everything in there and apply it to the cluster. You'll see the repo URL, the branch that we require — we're on the Proxmox branch — and then potentially any paths, or secrets that are required to access that repo. Cool, now we've done that.
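Sketched out, the Fleet install and the GitRepo definition look roughly like this; the chart repo and Git URL are placeholders, not the demo's actual values:

```shell
# Install Fleet: a couple of Helm commands, as mentioned.
helm install fleet-crd fleet-crd \
  --repo https://rancher.github.io/fleet-helm-charts \
  -n cattle-fleet-system --create-namespace
helm install fleet fleet \
  --repo https://rancher.github.io/fleet-helm-charts \
  -n cattle-fleet-system

# Point Fleet at the Git repo containing the cluster definitions.
kubectl apply -f - <<'EOF'
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: clusters
  namespace: fleet-local
spec:
  repo: https://github.com/example/capi-gitops-demo   # placeholder URL
  branch: proxmox
  paths:
    - clusters
EOF
```

From here on, anything pushed to the `clusters` path of that branch gets applied to the management cluster automatically.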
We've applied that to the cluster, so it's going to bring all those cluster definitions into our management cluster, and then hopefully we start to get virtual machines being started and that cluster will be formed. Maybe. Cool, so you can see now that Cluster API has automatically created machines here for you. You'll see that there's one machine for the control plane and one machine for the machine deployment, or the worker nodes, and you can see that one has started to move to provisioning. What that basically means is it's going to provision the infrastructure and then start to bootstrap Kubernetes. So what does this mean from a Proxmox point of view? People with really good eyesight will probably see that there's a new VM starting up — you can see it in the events at the bottom there, a VM 104, and you'll see it on the side in the viewer. This is being orchestrated by the Cluster API provider. It's talking directly to the Proxmox API and saying: hey, create me a VM, I'm going to use it for this control plane machine. Now this part does take a while, so we're going to have to skip quite some way through. We'll just get it to the point where we bring up the console, so you can see it's using Ubuntu, and if we fast forward a little bit, eventually cloud-init will kick in. Depending on how you configure the bootstrap providers, it will use either cloud-init or Ignition currently. This is using cloud-init, so you'll start to see cloud-init running, and that will essentially be running the commands to bootstrap Kubernetes on top of this VM, using kubeadm in this instance. Oh, we missed it. You'll see it, it will come up. So it does come up, and you can see that. So at that point, we have one control plane machine ready, essentially.
Once one control plane machine is ready, you can then start to provision the worker machines — it always waits until one control plane machine is ready, and then it will start provisioning all of the worker machines in parallel. So we can fast forward that, and you'll see another VM come up, and I think you get the point: it just repeats the same things, but this time for a worker machine. So, I just skipped ahead in the top part of the terminal window and got the kubeconfig for that newly created cluster. The kubeconfigs for the newly created child clusters, or tenant clusters, are available in the management cluster, so you can get that out, run commands against the cluster, and obviously do what you want with it. In this instance, I'm just showing that stuff is running in there. You can see that Calico is running in there. We didn't explicitly put Calico in the cluster definition, but if we go back to those labels on our cluster definition that said the CNI is Calico — that is using a feature of Cluster API called cluster resource sets, and essentially this enables you to install any type of resources into a newly provisioned cluster automatically. So it's really ideal for things like CNIs or cloud providers, to be able to do the things that you want as soon as that cluster is provisioned. And again, this is all done in a declarative way, so you don't have to run any special commands — you just put all of your definitions into Git, and then Cluster API will do the orchestration. This is what is in the second folder in the repo, the CRS folder. You'll see that there's a cluster resource set with a label selector, so if your cluster matches that label selector, it will apply all the resources listed below. Those resources are essentially just config maps or secrets containing embedded Kubernetes YAML, so it will just squirt those into your cluster. So where are we now?
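A minimal sketch of what sits in that CRS folder — a ClusterResourceSet whose selector matches the CNI label and which points at a ConfigMap holding the embedded Calico YAML (the resource names here are illustrative):

```yaml
apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: calico-crs
spec:
  clusterSelector:
    matchLabels:
      cni: calico            # applied to any cluster carrying this label
  resources:
    - name: calico-manifests # ConfigMap containing the embedded Calico YAML
      kind: ConfigMap
```

Any cluster whose labels match `cni: calico` gets the referenced ConfigMap's contents applied to it as soon as it is provisioned, with no extra commands.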
So we've got one control plane and one worker machine. I said that you could go into Git and scale the cluster and do all your operations, so what we're going to show here is: we go to the cluster definition in Git, scroll down until we get to the machine deployments, where it says replicas one, and — hopefully you can see the machine deployment there — change that to two, then commit those changes to the Git repo. You can probably guess what will happen now. A new VM is spun up, Kubernetes is bootstrapped on there, and that node joins the existing cluster, and you'll see eventually that it does come up. So that is the Proxmox demo. Now we're going to show the same process with KubeVirt, and the idea is that you can use Cluster API to provision your clusters across multiple providers in the same operational way. The process for different infrastructure providers is very similar, the difference being how you define what your machines look like, but the whole idea is the same no matter where you run your clusters. The one major difference with the KubeVirt provider is it requires KubeVirt to be installed in your cluster already. So before you install the CAPI provider for KubeVirt, you must have KubeVirt already installed. What you're seeing here is we're installing MetalLB to take the place of providing the load balancers within this environment. Then we install KubeVirt. KubeVirt works on the basis that you describe your virtual machine as a custom resource, and then it will make that happen behind the scenes via QEMU. So this is what we're doing first, and this is before you get to any of the CAPI stuff — this is just setting up the Kubernetes cluster. You can see that KubeVirt is starting up. So now we've done the KubeVirt install and installed the provider, and we're going to install the GitOps agent here now.
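The scale-up via Git described above amounts to a one-line change in the cluster definition — something like this diff against a hypothetical MachineDeployment:

```diff
 apiVersion: cluster.x-k8s.io/v1beta1
 kind: MachineDeployment
 metadata:
   name: demo-cluster-workers
 spec:
-  replicas: 1
+  replicas: 2
```

Committing that change is the whole operation; the GitOps agent syncs it into the management cluster, and Cluster API does the rest.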
So it's basically the same process, just slightly modified with different providers and different prerequisites, the prerequisite being KubeVirt. Fast forward again. In this second branch, you'll see a different cluster definition that uses KubeVirt, but essentially the way it's applied to the cluster is exactly the same, via GitOps. So what you take away from this is the same operational procedure irrespective of your target infrastructure or the flavor of Kubernetes that you want — you just create some YAML that describes the cluster that you want. So you'll see this is the interesting part: it will spin up a pod per VM, and that pod will then do the interaction to actually provision the virtual machine on the host. You'll see one of these spin up for each of the virtual machines that are required for the cluster. You can look at the boot logs via VNC as well. So if you use virtctl in this scenario, just get the name of the node and use virtctl, and you'll see it's using exactly the same kubeadm commands that we saw previously with Proxmox. And then we do the same operation: we scale it, and you get the third machine. So again, probably the key takeaway from this is that it's the same process whatever the environment. So key takeaways are: CAPI can be used in many, many different infrastructure environments, not just the cloud environments where a lot of people would naturally think of it — so virtualized environments, bare metal type environments, and some really interesting environments where you want a control-plane-as-a-pod type scenario. It supports different Kubernetes flavors, so you might want pure upstream with kubeadm, or you might want something a bit more lightweight, so you can use K3s. It allows you to mix and match all of these things. And lastly, this is fully declarative, fully GitOps friendly — perform all of your cluster operations via Git. So yeah, thank you for coming. Thank you for your question.
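The virtctl step mentioned above looks roughly like this; the VMI name is illustrative:

```shell
# List the KubeVirt virtual machine instances backing the workload cluster
kubectl get vmi

# Attach to the serial console of one of them to watch cloud-init and
# kubeadm run during bootstrap (equivalent to the Proxmox console view)
virtctl console demo-cluster-control-plane-abcde
```

The console output shows the same kubeadm bootstrap sequence regardless of whether the machine is a Proxmox VM or a KubeVirt VMI.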
Thanks. Thank you. Yeah, so the question was: can you realistically provision a cluster's associated infrastructure, like load balancers, et cetera, with Cluster API currently in a hyperscaler like AWS, as an example? The answer is yes, definitely for AWS. I'd say the caveat is it will provision the infrastructure in an opinionated way. It will only provision the infrastructure that's required for the cluster and nothing more, and it will provision it in the way that it thinks best. You can slightly tweak it if you want — say you want to use ALBs instead of something else, or you want to add security groups — it does allow you to do that as well. But there are, I guess, boundaries. So if you want full flexibility, then you might need to do something else. But you can also use things like Terraform and Cluster API together; it doesn't have to be either/or. So you might provision the VPC and the network with Terraform and then get Cluster API to do the Kubernetes and the day-two operations type of stuff on Kubernetes.
#snapsafety: de-duplicating state across Virtual Machine clones
Hello. Thanks for coming to this talk. My name is Babis Chalios. I am a software engineer with Amazon Web Services, currently working with the team that maintains the Firecracker Virtual Machine Monitor. Today I will be speaking to you about virtual machine snapshots. Essentially I'm going to be speaking about some challenges we face when we clone virtual machines and then start multiple virtual machines from that same clone — a problem that we call snapshot safety. I'm going to speak a bit about the mechanisms we have today for tackling those issues, and about what we believe we need to do as a community in order to grow awareness about the issue and build systems that are safe in the presence of snapshots. A quick sneak peek at the agenda: we're going to define what a virtual machine snapshot is for us, what is problematic with virtual machine snapshots, and in which scenarios we have problems with them; then go through the mechanisms we have today for addressing those issues and how we are thinking about building solutions that are system-wide and address the problem. Finally I'm going to speak a bit about what we're planning to do next in the area. Earlier this morning there was a very nice talk about virtual machine snapshots, which went into much more detail than I'm going to, but let's think for the moment about the virtual machine as a collection of state. That state might be memory — the guest memory — and architectural state of the VM. Then you might have some devices for doing networking and storage, et cetera, and some host resources, like whatever state KVM in Linux is holding for the VM, maybe a tap device for the networking, and files that back our storage. For this talk, a snapshot is simply the serialization of this state at a given point in time into a file that we store somewhere on some storage medium.
Then we use that snapshot file in order to start one or more VMs that are exact, identical copies of the initial virtual machine. The morning talk spoke about various scenarios why you might want to do that. For example, you want a backup of your machine so you can go back in time to a previous state. Another scenario is if you are building some sort of service that uses VMs to isolate workloads and you want to spawn these VMs very, very fast, in a state where they are ready to handle user requests. You might spawn a VM, bring it into a state where everything is initialized — every service and component that you want in order for it to be ready to handle requests — and take a snapshot at that point. Whenever you have a new request in the future, instead of booting a machine from scratch — booting the whole operating system, the user space, blah, blah, blah — you just resume from a snapshot, and then you are much faster in a state where you can handle that request. What's wrong with that? Now, let's look again at the previous picture of our VM, and let's imagine for a second that somewhere in VM memory — and it doesn't have to be memory, it can be any other component of the VM — there is some piece of state, an object, some sort of state that, for the purposes of the application making use of it, needs to be unique and/or secret. It needs to have this property in order for the application to operate correctly or securely, et cetera. Now, you see where I'm going with this: once we take that snapshot, that property of this state is lost. So here we're speaking about what sorts of applications have this problem and how we can address it. We are aware even today of many classes of applications that rely on this assumption of some part of the state being unique, secret, et cetera. For example, we can think of cryptographically secure pseudo-random number generators.
Those are random number generators that have the property that it is very, very hard, if not impossible, to guess what the next byte they're going to give you is. The security of many applications relies on this property. They have other properties as well — for example, that given knowledge of the current state of the PRNG, you cannot guess the previous bytes, et cetera. But for those sorts of random number generators, imagine that once you take the VM snapshot and you start more VMs from it, the state of the PRNG is duplicated. So unless we do something else — unless we add more entropy, for example, into this PRNG in all of the VMs that start from the same snapshot — the next byte that is going to be given out by that PRNG is going to be exactly the same in all of the VMs. Another example of a use case that has this problem is network configuration. Imagine you have a VM that has some network configuration — IP addresses, MAC addresses, et cetera. Suddenly you snapshot that VM and you create new VMs from that snapshot that live in the same network as your seed VM. Suddenly there appear in the network VMs with the exact same network configuration, and depending on your use case that might be a problem. So you might want to be able to detect that this is happening and do something about it. Another class of applications affected by this is anything that uses a UUID, a GUID. Many applications rely on the uniqueness of this number in order to perform correctly. Imagine, for example, that you take a snapshot of an application that has a UUID, you start more VMs out of it, and the application running in this VM is using that number as an index in a database to modify stuff, read stuff — suddenly you have a race condition on the database. More than one entity is going to be using that same thing for accessing data.
Any sort of use case where you rely on something being unique is a problem here. And we do not really know all of the applications and use cases that have this problem; it depends on the application itself. We really need to go and see whether our applications keep state that has these semantics — the semantics of uniqueness and secrecy. And if you know that you are running some workload that has this problem, and you run in such an environment, let's speak and think about what sort of mechanisms you could use in order to make this use case safe. Okay, now that we know a bit more about the problem we are facing, let's see what kind of mechanisms we have today to address it. Essentially the most fundamental mechanism we have today is called Virtual Machine Generation ID (VMGenID). It operates as a notification mechanism for the VM, after it gets resumed from a snapshot, about that particular fact. It tells the VM: okay, now you are in a new world; you are not in the world that you thought you were in, without having rebooted. On the technical side, it's an ACPI virtual device, emulated by the monitor, and the way it provides the notification inside the guest is via a generation ID, which is a 16-byte cryptographically random number that changes every time we resume from a snapshot. So when you resume from the snapshot, the monitor makes sure that it stores a new value in the generation ID and, before resuming the vCPUs of the VM, it injects an ACPI notification into the system. Once it resumes the vCPUs, the guest kernel is going to handle that ACPI notification. What happens in Linux today is that the kernel uses the new generation ID as extra entropy for its entropy pool. So it is reseeding its entropy pool, essentially, so that it avoids the problem we were speaking about before with PRNGs. It works, apparently. It works fine.
There is still a bit of a concern regarding its asynchronicity, in the sense that there is a small race window between the moment we resume the vCPUs and the moment the ACPI notification is handled by whatever thread in the kernel handles it. Okay. Yay. Sorry about that. But at least we have something. Nice. So moving forward, recently we contributed to the Linux kernel a small change so that every time the generation ID changes, we emit a uevent to user space. Before that, the VMGenID implementation did not expose anything to user space: since it was using the generation ID as entropy for the kernel PRNG, people were nervous about exporting the value itself to user space. And in reality, user space does not really need the 16 bytes themselves — it just needs a notification. So there you have it. It got merged recently in 6.8, and it is still an asynchronous notification mechanism. So everything in user space that runs event loops, for example, can monitor for it and get notified about the fact that it is now in a new VM started from a snapshot. It is still racy, that has to be said. So if we think that we have use cases that need a more synchronous mechanism, we should continue doing work to build those. Okay, so going back to PRNGs — mainly because they are used by security-sensitive applications — let's see how these mechanisms can help us. In runtime systems that maintain their own PRNGs, like the JVM, we can now use the VMGenID uevent to be notified about snapshots. So upon resume, the runtime would eventually get that event and would reseed the PRNG as soon as possible. Now for PRNGs that are implemented within libraries, this is a bit more of a weird situation at the moment, because an asynchronous mechanism like a uevent is not a perfect fit for the programming model.
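As a rough illustration of observing that notification from user space, one could watch kernel uevents with udevadm; this is a sketch, and the exact device path and properties the vmgenid driver emits may differ:

```shell
# Watch kernel uevents with their properties. After a resume from
# snapshot, the vmgenid ACPI device emits a change event that
# event-loop-based software can react to (e.g. by reseeding a PRNG).
udevadm monitor --kernel --property
```

An application with an event loop would subscribe to the same netlink uevent stream programmatically rather than shelling out, but the information delivered is the same: a notification that the world has changed, not the generation ID value itself.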
We will need to do something else about them. One idea here would be to use what cryptographers call prediction resistance, with hardware instructions. The idea is simple: with every byte that the PRNG returns to you, you mix in some random bits that you got from a hardware instruction, which is obviously not affected by virtual machine snapshots, so the problem just goes away. If you are able to do that, it doesn't matter if you have resumed from a snapshot: the state of the PRNG is always going to include these snapshot-independent random bytes, and everything is going to be fine. Other potential solutions — for example, in cases where you do not have these instructions, or for whatever reason you don't want to use them — would be to build some sort of synchronous API on top of the asynchronous VMGenID uevent, for example. But we really think that we should do something — don't go out on me again — we really think we should do something about the use case of these libraries. Okay, so now that we know what mechanisms we have available, let's see if we can really solve the problem. Let's follow this example. It's a very simple example of a VM that has started from a snapshot. The hypervisor and the guest kernel support VMGenID. The kernel is going to use the generation ID to reseed its random number generator. And we have a user space application that does some network communication and wants to use TLS. It reads some random bits from /dev/urandom — which is safe because of VMGenID — in order to do some sort of communication. And everything works fine: the application creates the session key to start communicating with the outside world, and everything looks fine. And at that point, we take a snapshot. Now, the moment we resume the second VM from that snapshot, the session key is duplicated in essentially both VMs.
So even though we have these mechanisms built into the system that give safe interfaces over /dev/urandom, for example, the final system is not necessarily safe. The same would go, for example, for applications that have GUIDs, et cetera — they would need to adapt themselves. And it is true that the application could use the VMGenID uevent, but that event is only present in the resumed VM; in the initial VM, there is not today a mechanism to do something about that. And again, there is some sort of race window between the event, resuming the VM, and the application reacting to that event, which makes us think that there are probably things that should never be serialized at all. It would be much easier if that session key was never serialized. And that makes us think that VMGenID is a post-mortem mechanism: it is a notification in the new VMs, not the initial VM, and by the moment it arrives to us, sensitive operations might already be in flight — even if we handle that notification, there is nothing we can do about the things that are in flight. And that makes us think, as well, that what we should probably do next is control the timing of snapshot events. Snapshot events can arrive at arbitrary points in the lifetime of the VM; instead, we should control them. We should do something before we even take the snapshot, and make sure that we only take a snapshot when the machine is in a safe state to be snapshotted. And once we resume, make sure that every application that needs to has adapted to the new situation before marking the system as ready to be operational again. Thinking about these things, some time ago we were speaking with the systemd folks, and we thought about modeling this problem using four states — describing our systems as being in one of four states. Running is the normal state of your VM. Now, once you want to take a snapshot, you start quiescing.
People earlier today spoke about this as freezing, for example. During that period you do things to prepare yourself to be snapshotted, so you cannot find yourself in the previous situation. And once you are quiesced — once everybody is ready to be snapshotted — then you can take the snapshot. On the resume path, the resume-from-snapshot path, you essentially do the opposite work, right? You start from a quiesced state, then you start unquiescing, getting ready for the new world, recreating your GUIDs and whatnot. And once everything is done, then you can be up and running again. systemd has this nice concept of inhibitors, which applications can essentially use in order to say: okay, don't do that transition until I tell you I'm ready to do so. For example, there are inhibitors for systemctl suspend. At first we were thinking that maybe we could use some paravirtual agent to orchestrate everything. In reality, maybe systemctl suspend is all we need, and we can drive this from the hypervisor by sending an ACPI event. Going back to the previous example, how that would look is: we are in a running state in the VM, we have our previous application, and suddenly the control plane informs the PV agent that it needs to start quiescing. Here I say systemctl quiesce, but again, unless we find a reason why suspend should be different from some new sort of operation, we could even use suspend and get away without having a paravirtual agent in there. In any case, once that happens, the application would say: okay, do not quiesce yet, because I need to do some cleanup before you can snapshot me. And once the application does that, it says: okay, now I'm good to go. At that point the control plane knows that, okay, we can take that snapshot. On the opposite path, the control plane would resume the VM from a snapshot and then start the unquiescing operation.
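As an analogy for the inhibitor idea, this is what today's suspend inhibitors look like on the command line; a hypothetical quiesce operation would presumably be gated the same way (the quiesce verb itself does not exist yet — this sketch only uses the existing sleep inhibitor):

```shell
# Hold a blocking inhibitor so the system won't suspend until the wrapped
# command finishes, e.g. until an application has cleaned up state it
# doesn't want serialized into a snapshot.
systemd-inhibit --what=sleep --mode=block \
  --why="clean up secrets before snapshot" \
  sh -c 'echo cleaning up; sleep 5'

# Show who is currently holding inhibitors, and why
systemd-inhibit --list
```

In the model described above, the control plane would trigger the quiesce transition, and the transition would only complete once every application holding such an inhibitor releases it.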
The application might want to say: okay, wait until I know that I'm safe again, because I want to create new random numbers. And we do that. At that time, we are safe — modulo that tiny race condition — to start getting random numbers again, recreate our state, and be where we want to be in order to be up and running. That's it. So yeah, we started working on adding support in Firecracker for VMGenID. Up until now, we were telling people who were using the snapshotting feature that they should make sure to manually reseed their kernel's PRNG and their user space PRNGs after the fact. The other thing we want to pay attention to is working with PRNG owners in order to find proper ways to make their libraries snapshot-safe. Here we're speaking about the PRNGs that are implemented as libraries, such as OpenSSL, AWS-LC, et cetera. And we want to start building the system we spoke about, modeling this in systemd — some groundwork was already done some time ago. We hope that systemd is going to be just the first one that we get this into, and hopefully other management systems will follow. And that's it. With that, I'd be happy to take questions. I just wanted to ask: you mentioned the network issues, where a machine comes up with the same MAC address. You didn't appear to address that. Is there a plan to take care of that situation as well? Yeah — the question was that we mentioned during the presentation that there are problems with networking when you take snapshots and resume, and whether we plan to address those in the future. Yeah, I think this is part of the systemd work that we're going to do as well. This problem essentially appears mainly when systems are networked: if your VMs are not networked, it's not really a problem if two VMs have the same configuration or even the same random numbers.
So yes, for example, something that we would like to do is to, I guess, shut down networking before taking a snapshot, so you're sure that there are no in-flight connections and stuff like that. So I think this is going to be part of that work. Thank you. When we have to come up with a MAC address, we generally try to hash things, so we could also hash the generation ID into that element. We have been discussing this — adding the generation ID to the hash seems the most obvious thing, so that basically, once the generation ID changes, everything derived from that hash changes too and it's not going to clash with whatever else. Thank you very much. Thank you very much.
Pipewire audio backend in QEMU
Hi everyone, my name is Dorinda Bassey and I work at Red Hat. I currently work on enabling the audio stack and other features in the automotive team, and today I'll be talking about the PipeWire audio backend in QEMU. So, just a brief overview of what PipeWire is: PipeWire is a multimedia service for handling both audio and video. In this presentation I'll be focusing on the implementation that was done in QEMU, and I will also focus on its use cases on embedded platforms. So for a start, what's QEMU's audio backend? It's a software component that's responsible for managing the audio streams and also providing audio functionality to the emulated platform — in our case QEMU. It's also responsible for handling the audio inputs and outputs of the virtual machine that's running on QEMU. The PipeWire audio backend provides an interface that allows sharing these audio streams from the guest operating system to your host using the PipeWire native APIs and libraries. So how does it work? Here is an illustration of what the stack looks like. First, the application running in the guest sends the audio data to PipeWire through memfd and dmabuf, and we have the PipeWire daemon communicating with the session manager; it's also responsible for handling the media routing and in charge of talking to the ALSA driver in the guest kernel. And in QEMU, you can see that we have the emulated sound card driver — which could be AC97, GUS or Intel HDA — providing the audio software emulation, and then, like any other host application running in your host user space, QEMU with the native PipeWire backend plays these audio streams to the host PipeWire daemon directly. So now it's playing those audio streams to the PipeWire daemon in the host user space, but it's not going through the PulseAudio compatibility layer or, in the case of ALSA, the ALSA plugin.
So the PipeWire daemon process then handles the media routing to the corresponding libraries, like the ALSA library, and routes it to the sound driver in the host kernel. After many iterations of the patches, the PipeWire audio backend was merged into QEMU in May 2023, and it was added to the QEMU 8.1 release. It currently supports PipeWire version 0.3.60, although there's been a newer release of PipeWire — the PipeWire 1.0 release — which is a huge milestone for PipeWire. After the QEMU 8.1 release, some improvements were made to the backend — thanks to Marc-André and Volker for their support while optimizing it. QEMU has a number of audio backends, and you can find the latest one, PipeWire, among the list of available audio drivers. So depending on your architecture, the -audiodev help option should show you the list of audio drivers, and you can see PipeWire there. The PipeWire audio backend uses similar structures to the other audio backends, although the difference is that it's implemented with the PipeWire native C libraries. So these are the PipeWire audio backend properties that can be configured on your command line. First we have the -audiodev option, where we specify the pipewire backend that we want to use and also the ID of the PipeWire backend; then you can specify the name of the PipeWire backend, the stream name, and the latency for the output stream, and you can specify the same for your input stream, depending on the latency that you want. So this is a description of what the QAPI schema looks like for the PipeWire audio backend. You can see the name there, which maps to the PipeWire target-object key; it's used to specify the target object to link to, although it's not necessary if you do not want to set an object name.
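Put together, the properties just described end up on the command line roughly like this; the ID, stream names and latency values are illustrative (latency is given in microseconds, so 46000 corresponds to the 46 ms default mentioned below):

```shell
# Create a PipeWire audio backend with explicit per-direction stream
# names and latencies. The id (pw0) is what the emulated sound card
# device will reference later.
qemu-system-x86_64 \
  -audiodev pipewire,id=pw0,name=qemu-demo,out.stream-name=playback,out.latency=46000,in.stream-name=capture,in.latency=46000 \
  -machine q35   # ...plus the usual machine, disk and display options
```

Leaving out any of the optional properties falls back to the defaults described in the QAPI schema.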
And for the stream name: this parameter is the stream media name that is used when you're creating a new stream, and if you don't set a stream name, it will use the ID of the PipeWire backend. For the latency, you can set whatever value you want, although the default latency is 46 milliseconds for the PipeWire audio backend. Then there are other parameters like the mixing engine, the frequency, the channels and the formats. These parameters are common to the other audio backends like PulseAudio, JACK and ALSA. But one thing to note is that QEMU currently supports just one channel or two channels, that is, a mono or stereo setup. So you can configure either one or two channels, and in PipeWire, when you're using a single channel, the content of your buffer is simply the samples s1, s2, s3 and so on, one after the other. When you have two channels, the format expects the buffer to interleave one sample for the left and one sample for the right, continuously like that. Each sample on the left goes to the left speaker and each sample on the right goes to the right speaker. So in the case of two channels, the size of one frame is the size of the left sample plus the size of the right sample, which is the stride. The buffer size is specified in microseconds, in case you want to configure a buffer size. The default format is S16, although the PipeWire audio backend supports a range of formats. And for the frequency, there is a default frequency of 44.1 kilohertz. So to use the PipeWire audio backend, you need an audio device, and this audio device is an emulated sound card. It's a legacy PCI device that's plugged directly into the PCI Express root bus. So this is an example of how the audio device is configured on the command line. 
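That slide example might look roughly like the following (a sketch: the audiodev ID pw0 and the elided options are assumptions, while intel-hda and the hda-duplex/hda-output/hda-micro codecs are QEMU's actual device names):

```shell
qemu-system-x86_64 \
  -audiodev pipewire,id=pw0 \
  -device intel-hda \
  -device hda-duplex,audiodev=pw0 \
  ...
```

Swapping hda-duplex for hda-output or hda-micro restricts the guest to playback-only or capture-only, as described next.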
So first we have the -device option, where we specify an intel-hda device, and we specify a codec option like hda-duplex for streams to your host speakers and from your host microphone. If you want to only allow access to your speakers, you can use the hda-output option, or if you want access only to your microphone, you can use the hda-micro option. Here you can see that when specifying the sound card device to use, I used the ID that was specified for the PipeWire backend — that's you telling the sound card device to use the PipeWire backend. This is how the properties of the Intel HDA audio device are declared inside the code; you can look up how the properties of the other devices, like GUS and AC97, are declared as well. So QEMU allows you to configure multiple audio backends, and this is very useful in embedded platform development. Let's say, for example, that I'm emulating an infotainment system using QEMU and I want to configure a stream only for notifications on a mono channel, and then another stream only for music on two channels. This multiple audio backend configuration allows you to specify different parameters for each of the created streams. So this is a visual representation of what the backend would look like with two PipeWire audio backends. You can see that each node in the guest represents a created stream — the nodes are the colored boxes — and you can also see the host speaker nodes. For playback, the output ports of the QEMU node, which is on the right, have been routed to the speaker nodes on the host, and the input ports coming from the host microphone have been routed to the input ports on the guest. This is also very useful when you want to isolate the audio coming from different processes that are running in your guest. 
So now we'll take a technical deep dive into how the PipeWire audio backend works. What happens in playback? For playback, we first activate the stream: using the pw_stream_set_active API call, we put the stream into the streaming state. Then we call the buffer_get_free function. This function is used to know in advance the available number of bytes for writing data to the buffer, and this also improves the playback latency by a factor of two — later I will show you some latency measurements. Next, we want to lock the thread loop, because I'm using the thread loop mechanism, and this mechanism ensures that we are doing the PipeWire API calls from only one single thread at a time. You don't want to be accessing these PipeWire resources from multiple threads, because it could cause a race condition. Next, we want to get the number of bytes available for writing data to the buffer. How we get this value is that we subtract the number of bytes that are actually inside the ring buffer from the effective PipeWire backend buffer size, and to get the bytes that are inside the ring buffer, we use the spa_ringbuffer_get_write_index API call. Then we use spa_ringbuffer_write_data to do a memcpy of buffer data from the source audio device to a temporary buffer, with the index being the offset, and then we update the write pointer. Here, at this point, there is the possibility of a buffer underrun sometimes occurring. Although this happens in very rare cases, this is a situation where the audio buffer level has dropped below a certain threshold, and it can cause audio distortion or stuttering — we cannot really guarantee that the guest will be producing the audio samples fast enough. So in PipeWire we had a robust solution to fix this issue, which was to handle these buffer underruns by playing silence. 
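As a much-simplified sketch of the path just described — free space computed as the buffer size minus the bytes queued between the write and read indices, and underruns padded with silence — here is a toy ring buffer (illustrative Python, not the actual C code, which uses PipeWire's spa_ringbuffer helpers):

```python
class RingBuffer:
    """Toy model of the backend's ring buffer bookkeeping."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray(size)
        self.read_idx = 0   # total bytes consumed so far
        self.write_idx = 0  # total bytes produced so far

    def avail_write(self):
        # Free space = buffer size minus the bytes currently queued
        # (write index minus read index), as in the backend's write path.
        filled = self.write_idx - self.read_idx
        return self.size - filled

    def write(self, payload):
        # Copy guest audio data into the ring, up to the free space.
        n = min(len(payload), self.avail_write())
        for i in range(n):
            self.data[(self.write_idx + i) % self.size] = payload[i]
        self.write_idx += n
        return n

    def read(self, want):
        # Drain data for playback; if the guest didn't produce samples
        # fast enough (underrun), pad with silence (zeros) instead of
        # stalling, mirroring the backend's underrun handling.
        filled = self.write_idx - self.read_idx
        n = min(want, filled)
        out = bytes(self.data[(self.read_idx + i) % self.size]
                    for i in range(n))
        self.read_idx += n
        if n < want:
            out += bytes(want - n)  # silence padding
        return out
```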
You can look up the code to see how we handle the buffer underruns. Next, we get a buffer that can be filled for the playback stream, and then we copy this audio data from the temporary buffer to the PipeWire buffer using the spa_ringbuffer_read_data API call — I'm just giving you a summary of what the spa_ringbuffer_read_data API call does, but it does much more than that. Then we queue the buffer for playback, and this continues to happen in a loop until all the buffers have been played. So what happens on the capture side? It's more or less the opposite of what happens on the playback side — it's kind of similar, but not quite, because it's the opposite. In this case, what we do first is activate the stream, and then we use the buffer_get_free function to know in advance the available number of bytes, and then we use the thread loop lock again to ensure that we're doing the API calls from only one single thread at a time. But the difference here is that this time, instead of using the spa_ringbuffer_write_data API call, we're using the spa_ringbuffer_read_data call, and this time we're doing a memcpy of buffer data from the temporary buffer to the source audio device. With the index being the offset, we update the read pointer afterwards. Then we get a buffer that can be consumed for the capture stream, and we copy the audio data from the PipeWire buffer to the source audio device using the spa_ringbuffer_write_data API call. Next, we queue the buffer for capture, and this continues in a loop until all the buffers have been consumed for capture. 
So as regards the volume controls: in order for volume adjustments made in the virtual machine to be effective on the host, we use the PipeWire volume control API calls, and this volume control code allows for proper synchronization of the volume changes made on the guest so that they take effect on the host. When these volume changes have been applied on the node output monitor ports of the guest, they are synchronized with the host. For PipeWire I use the pw_stream_set_control API call, which is used to set the effective volume. One thing to note is that QEMU has volume levels from 0 to 255, while the PipeWire API has volume levels in a floating-point range from 0 to 1, where 0 is silence and 1 represents no attenuation, so I had to do a linear conversion of these levels — a linear conversion of the 255 levels to PipeWire floats in the range 0 to 1. Regarding the features of the PipeWire backend: these are really the features of PipeWire in general — it's not only limited by design to handling multimedia processing on Linux, it also extends to applications that have been built with the PipeWire C API, of which one use case is now the PipeWire audio backend in QEMU. On to the PipeWire low-latency features: the PipeWire backend has been developed to significantly reduce latency in several ways. One of the ways is by setting the PW_KEY_NODE_LATENCY property — we set it in the backend to 75% of the timer period for faster updates — and the other way we reduced latency was to use the buffer_get_free function, which improved the latency by a factor of two. A thing to note about the latency is that the PipeWire backend latency is mostly determined by a combination of the buffer size and the sample rate of the backend, and this is usually called the quantum. 
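The two conversions mentioned above — the volume mapping and the quantum arithmetic — can be sketched numerically (illustrative Python; the function names are made up for this example):

```python
def qemu_to_pw_volume(level):
    # QEMU volume levels are integers 0..255; the PipeWire API expects
    # floats from 0.0 (silence) to 1.0 (no attenuation), so the backend
    # does a linear conversion between the two scales.
    return level / 255.0

def quantum_latency_ms(buffer_frames, rate_hz):
    # The quantum: effective latency follows from the buffer size
    # (in frames) divided by the sample rate.
    return buffer_frames / rate_hz * 1000.0
```

For example, a 2048-frame buffer at 44.1 kHz gives roughly the 46 ms default latency mentioned earlier.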
Another feature of the audio backend is that it provides a reduced footprint and also reduced dependencies in comparison to the other audio backends that we have in QEMU, and with the native PipeWire backend we get to benefit from PipeWire features such as lower CPU usage and lower memory usage as well. So here I made some benchmarks of the round-trip latencies for the different audio backends, and all of these latencies were measured with the jack_iodelay tool and a loopback cable. Listed here are the round-trip latencies as reported by jack_iodelay; the sample rate of the device I used is set to 44.1 kilohertz. As you can see there — yeah, I have to agree that JACK is topping the charts in low latency, as expected, but that's not my focus. You can see that the latency that PipeWire offers is quite low, and then next we have PulseAudio and SDL competing with each other. So, about debugging: while I was working on this audio backend, GDB was very useful, in case you want to examine state like registers and memory, or you want to set breakpoints and watchpoints. You can also leverage QEMU's internal tracing infrastructure. I added a couple of PipeWire audio backend trace events that you can use. You can configure these trace events on the command line, and an example is the pw_write trace event: when you set it, it will show you the length in bytes of the data to be written to the buffer, and it will also show you the available number of bytes that can be written. One thing to note here with QEMU is that if you enable this pw_write trace event, it produces a lot of output, given that we are copying bytes every millisecond — so you should expect a very big log file if you enable those trace events. And then there is another tool that is very handy: the PipeWire debug logging. 
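The debugging setup described above might be invoked like this (a sketch; the trace-event name pattern is an assumption based on the talk, while PIPEWIRE_DEBUG is PipeWire's standard log-level environment variable):

```shell
# Enable the PipeWire backend trace events (very verbose output):
qemu-system-x86_64 -audiodev pipewire,id=pw0 ... -trace "pw_*"

# Raise the PipeWire client log level (0-5) for the QEMU process:
PIPEWIRE_DEBUG=3 qemu-system-x86_64 -audiodev pipewire,id=pw0 ...
```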
You can use it to set different debug levels, from 0 to 5, and these levels help you to see and have control of the behavior of the PipeWire backend, or of your own PipeWire application if you're debugging one. So here I added some helpful links. The first one is my blog about the PipeWire backend and its usage in QEMU. The next one is Gerd Hoffmann's blog about the sound configuration changes that were made in QEMU. Then we also have the QEMU invocation documentation, in case you want to know how to use this audio backend or some of the other audio backends — you can look up the QEMU invocation and how to use it. I also added PipeWire's wiki page on performance measurements; it includes scripts that you can use to measure the latency of the different audio backends, like PipeWire, PulseAudio and JACK, and you can also use it to measure context switches, CPU cycles, etc. At this point I would like to give a shout-out to Wim Taymans, the PipeWire maintainer, who assisted me while I was working on this backend. Thank you. Do you have any questions? "I'm curious what applications you tested with in QEMU." So you're asking what applications in QEMU I tested with this and how they behave. You could test a couple of applications that try to play audio — maybe you're watching YouTube on your guest. But I mostly used the loopback cable and the jack_iodelay tool to measure latency. That's very effective, because you can use it to measure the CPU cycles as well as the latency, and you can also measure other characteristics that you're interested in. Any other questions? Thank you very much.
AI-Driven Observability and Operations in Cloud-Edge Systems
Thank you. Hello everyone. Thank you so much for being here. In today's session we are going to do an introduction to AI-driven observability and operations in cloud-edge systems. First, let me introduce myself. My name is Victor Palma. I'm a cloud engineer at OpenNebula Systems. I come from Madrid, Spain, and I've been working for OpenNebula for more than two years. So let's move on to the presentation. First I would like to start with some initial context in order to introduce what we are going to see here. So first, what is observability? Observability is the ability to understand and analyze the internal behavior of a system by collecting and analyzing relevant data. That is the dictionary definition. But in other words, it's just the ability to transform data into information — into something that can be useful for us. We can have a lot of system logs or data or numbers, but if we don't give them a meaning, they're useless. So observability has multiple — sorry, I just went blank — it has some advantages, like anomaly detection, which allows us to identify anomalies or bugs in our system. It also provides the ability to do performance analysis, so we can identify areas for improvement in our system, and finally it's very useful for decision making, so we can see the impact of the changes that we make in our system in a very easy way. As the saying goes, information is power, so observability is very, very important nowadays. But now I would like to talk about AI, because AI nowadays is everywhere — you know, the marketing guys' fault. But is it really useful for observability? The quick answer is yes, certainly. AI provides the capacity to create more enhanced data analysis, automated anomaly detection, and dynamic scaling for our cloud. For example, if we have more workloads than usual, we can automatically create new nodes or deploy new VMs in our cloud in order to provide more services to our customers. 
And finally we can create predictive analytics in order to predict how our system is going to behave in the future. After finishing this part, I would also like to talk about data sovereignty and open source, because I think the most important concern about AI currently is data sovereignty — the information. Many organizations entrust sensitive data to third-party providers. Currently these providers are based outside of Europe, and we need a way to bring the data back to our servers in Europe and be more transparent. As a solution, open source is a very good answer to that problem. It provides more transparency for the cloud and helps reduce the vendor lock-in in our infrastructure, so we are not tied to a specific vendor and we can migrate between vendors whenever we need to. So what's next? How can we address all these challenges? The answer is the OneAIOps framework, the open-source solution for AI-driven observability. The OneAIOps framework combines OpenNebula as the virtualization and cloud management tool, Prometheus and Grafana as the metrics and visualization solution, and some AI and ML algorithms to predict and analyze the behavior of the infrastructure. These three technologies together create the OneAIOps framework that we are going to see here today. So let's go step by step, and first we are going to see what OpenNebula is. OpenNebula is an open-source cloud platform solution for creating your own cloud. It provides the ability to deploy virtual machines in your own private data center, in the public cloud or even at the edge. And you can deploy not only virtual machines but also application containers, microVMs or even Kubernetes clusters. 
As I've said before, one of the features of OpenNebula — since it's open source and it's oriented to provide true flexibility for the cloud — is that it avoids vendor lock-in, so you can migrate your workloads between different providers in a very easy way. OpenNebula has a lot of integrations with third-party tools like Terraform, Kubernetes, Ansible or Docker. It also has built-in tools like Sunstone, which is the web user interface — you can handle all of your infrastructure from there or from the CLI — and you can deploy virtual machines based on VMware, KVM, LXD, or microVMs with Firecracker. Finally, one of the most important features of OpenNebula is the possibility to expand your cloud to the multi-cloud or the hybrid cloud. You can create on-demand resources at the edge, in Amazon, Google Cloud or Equinix, just by clicking a button, or automatically if you configure that. So you can migrate workloads from your on-premises data center to an edge data center or the public cloud in a very straightforward way. You can deploy any infrastructure with uniform management for all of it, and you can run any application in your cloud. OpenNebula doesn't care whether the host is located in Equinix, at the edge or in your private data center; the only thing OpenNebula cares about is which VM is running the workload and how you can access that VM. Very handy. The next piece of OneAIOps is the integration of OpenNebula with Prometheus. That integration is based on Prometheus exporters, like the Prometheus Node Exporter, which is installed on every OpenNebula server. It's also installed on the hypervisor nodes, and it's combined with the OpenNebula libvirt exporter, a custom exporter created by OpenNebula to extract and collect information about the KVM machines. And we also combine this information with OpenNebula's own metrics, which it generates itself. 
These metrics are gathered by the oned daemon, the main daemon of OpenNebula, and then exported to the Prometheus server through the OpenNebula server exporter. So the next thing is the AI that we add to the formula. We created a bunch of machine learning algorithms and some decision algorithms and implemented them as an exporter for Prometheus. Gathering all the metrics that OpenNebula and the exporters produce, we apply these algorithms in order to predict and work out how to improve the performance of your cloud. So, in summary, the features and capabilities of OneAIOps are: CPU usage prediction for the VMs of your cloud — OneAIOps can predict the individual VM CPU usage per hour and the general CPU usage for your hosts — plus the accuracy of that prediction, a very important value in terms of how much you can trust the system. And then OneAIOps also suggests where you can place a VM: based on the prediction, OneAIOps may tell you to migrate a VM from one server to another in order to improve performance. There are three main policies you can configure in OneAIOps. The first one is load balancing — as the name says, balancing the load across all your nodes. It's very useful when you have an on-premises or private data center and you want to use all your hosts. The next policy is resource contention, which is very useful for public cloud environments where you want to use a small number of hosts. And the last one is reduced migration — a policy that is very useful when you want to avoid migrating VMs between hosts. This scenario is common in edge environments, where migrating virtual machines between edge nodes is sometimes quite hard. So here you can see the architecture of OneAIOps. It's based on the already existing OpenNebula architecture. 
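As a grossly simplified sketch of the idea — fit a trend to recent CPU samples, extrapolate one step ahead, and expose the result in the Prometheus text format — consider the following (illustrative Python; the metric name is made up, and the real OneAIOps models are more sophisticated than a least-squares line):

```python
def predict_next(samples):
    # Ordinary least-squares fit of y = a*x + b over x = 0..n-1,
    # then extrapolate one step ahead (x = n).
    n = len(samples)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(samples) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, samples))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a * n + b

def export_metric(vm_id, value):
    # One line in the Prometheus text exposition format,
    # as a custom exporter would serve it.
    return 'one_vm_cpu_forecast{vm="%s"} %.2f' % (vm_id, value)
```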
So everything at the bottom is what already exists in OpenNebula, and the layer at the top of the picture is the new OneAIOps architecture layer. Here you can see the modules that we have implemented in order to provide these predictions, and below them all the virtualized infrastructure being orchestrated. So let's do a demo to show you how this works. First we are going to go to the OpenNebula Sunstone portal — wait, sorry. Thank you. I'm so sorry, give me a minute. Okay. Well, this is the main dashboard of OpenNebula, the graphical user interface. Here you can see the principal information about your cloud, like how many machines we have, the images, the virtual networks, and the usage of the hosts we currently have. This is a demo environment, so we only have two hosts with some workload, and these are the VMs that we have running in this environment. This is just so you can see that we already have some workload in this environment, and this workload is fully random — it may consume a certain amount of CPU depending on the time. So when we install the OneAIOps framework — and we have documentation for that — and we go to Grafana and import the dashboard, we can see this. These are the results that OneAIOps generates. At the left we can see the average CPU predicted per host; here we can see the average CPU usage per VM, and here the real usage. As you can see, the one value is quite close to the other, and the accuracy of the prediction in this case is 92%. Here we can see the suggestions that OneAIOps provides to the user. The first policy is the core optimization policy, so it's going to try to reduce the number of hosts to the minimum. You can see that of the five VMs we have in this demo, four VMs are on one host. 
Since this host is full and no more VMs fit inside it, we have one more VM elsewhere, but it tries to concentrate the VMs on as few hosts as possible. Here you can see the migrations that OneAIOps suggests to achieve this distribution: it suggests that we move the VM with ID 3 to the host with ID 1. It's very, very easy to follow the instructions. Then we have the other policies: the load balancing optimization, where, as you can see, we have the VMs distributed across the two hosts that we have in our environment, and then the final policy, which is the migration-reduction optimization. In this case no migrations are suggested because no optimizations were found, but in case OneAIOps produced something, it would show up here. Returning to the slides — well, this is the demo that we have just seen. Now some closing thoughts on the next steps of this project. The next steps and challenges that we are facing in OneAIOps: first, implement the virtualization operations needed to apply the suggestions automatically — currently the operations are only suggested, not performed, in the cloud system. Then we would like to improve the OneAIOps distribution so it becomes part of the OpenNebula software; currently you need to install it separately. And finally, we would like to expand the functionality to provide anomaly detection, allocation based on memory prediction and network traffic — because currently we only provide CPU usage prediction — and, based on the results of the tool, create alerts and warnings. This project is totally open source, so you can go to the repository on GitHub and collaborate and suggest new features and changes. And finally, I would like to encourage you to join the OpenNebula community. 
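The core-consolidation policy shown in the demo can be approximated by a simple first-fit-decreasing heuristic (a toy sketch in Python, not the actual OneAIOps algorithm; the data shapes are made up for illustration):

```python
def consolidate(vms, current, capacity):
    """Pack predicted VM loads onto as few hosts as possible and
    report which VMs would have to move.

    vms:      {vm_id: predicted_cpu}
    current:  {vm_id: host the VM runs on now}
    capacity: per-host CPU capacity (same for all hosts here)
    Returns a list of (vm_id, from_host, to_host) suggestions.
    """
    hosts = sorted(set(current.values()))
    load = {h: 0.0 for h in hosts}
    placement = {}
    # First-fit decreasing: place the biggest predicted loads first,
    # always into the first host that still has room.
    for vm, cpu in sorted(vms.items(), key=lambda kv: -kv[1]):
        for h in hosts:
            if load[h] + cpu <= capacity:
                load[h] += cpu
                placement[vm] = h
                break
    return [(vm, current[vm], placement[vm])
            for vm in placement if placement[vm] != current[vm]]
```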
You can visit the forum and participate in discussions with other OpenNebula users, and learn and help together in the cloud community that we have created here. As a closing slide, I would also like to say that this project is funded by the European Union as part of the Horizon Europe research and innovation programme. The project is called COGNIT — I can spell it for you if you want: C-O-G-N-I-T. It's a very, very interesting project. Well, that's all. Thank you very much for your attention. So, questions? Yeah — we use linear models and Bayesian models. "And would you be able to share the slides on the FOSDEM website?" Yes, and you can also find them in the repositories — here in this repository all the data models and algorithms applied are explained. Okay, thank you. You're welcome. Any more questions? "It was basically the same question as the last one. Could you quickly go back a couple of slides, to where you had the model? Here — does it explain where the Bayesian part is in here? That was the bit where I was wondering how the model worked." It's not in there — that's a question we'll take on our side. Okay. Thank you. Perfect. Any more questions? You're welcome. Ah, it's here. "Are you optimizing for CPU utilization or other resources? So, is it also possible to say: okay, optimize for availability or network throughput?" Okay, he asked whether we optimize for CPU or for other attributes, like network or memory. Currently, in the current state of the project, we only make suggestions and predict the CPU usage. 
But the idea is to implement prediction based on the network, the memory, and other custom attributes that you want to add to the tool. The idea is that you can change the prediction in the configuration. But really, it's just a prototype. "And for storage?" For storage prediction — yeah, we are also considering that. "Sometimes the optimum changes based on the cloud service provider. Do you also consider this, or is this only based on regular hardware? What is your optimization?" Yeah, we are considering that too — I mean, a way to optimize based on the location of the VM. It's not the same to have a VM in your on-premises cloud as in the public cloud or at the edge. So it's based on different policies that we are currently defining. Yeah, we are considering that. Any more questions? Okay, thank you so much.
Bare-Metal Networking For Everyone
Okay, hello everyone. My name is Mateusz. I work at Red Hat as a principal software engineer in the Kubernetes bare-metal networking team. So yeah, as the title of the talk says, we'll be talking about bare-metal networking, and I wanted this talk to be somewhat of a gentle intro into what you need to think about when you want to start doing Kubernetes on bare metal — the things that Kubernetes doesn't tell you you should care about. We'll see in a moment what that means. I work at Red Hat, I already said this; I'm based in Switzerland. When I'm not doing computers, I'm doing farming. I actually like it much, much better, but it doesn't pay the bills, so I need to do the stuff that I'm going to tell you about here. Well, it is what it is. I don't do AI, as opposed to, you know, all the hype and all that kind of stuff, so yeah, I'm not really on the hype wave. Bare metal was never really hyped, so, well, what can I say? Some intro on why we may even think about doing containers on bare metal — like, you know, no one ever told us to do so, so what the heck is the deal? So, HPC and AI. This slide predates the AI hype, so sorry for this. I could remove it, but long story short, there are some workloads that really benefit from running on bare metal. You may have some fancy GPU from, let's not name the company, or some network adapter — something where you really want to have access to the hardware directly. Or, at the other end of the scale, something that you run that is critical to a part of the infrastructure you already have — like, for example, network equipment. You don't want to run the router of your own data center as an instance in AWS, right? That would be — yeah, we shouldn't do it that way. Or something which is almost forgotten, and then people call me and bring up this use case: benchmarking. How do you benchmark hardware, CPUs, and that kind of stuff, if not by running the workload directly on that hardware? 
Again, you don't want to create 50 VMs on some CPU only to get a benchmark of that CPU's performance. That would be chicken-and-egg; let's not do this. So now fast forward: we agree that we want to do Kubernetes, and we agree that we want to do this on bare metal. So we go to kubernetes.io — or whatever it is today. We go to the, you know, FAQ, installing a cluster, and we start reading. What do I need to do to install a cluster? Is there any tooling that would help me install this cluster? And the very first page you see is "Installing Kubernetes with deployment tools", and they tell you kubeadm and some other tools. And we're like, oh, so lucky — there are tools that are going to do this stuff for us. Okay, let's check the first one. You go to kubeadm and you start reading: "Using kubeadm, you can create a minimum viable Kubernetes cluster." And, okay, is an MVP really the production cluster that I'm going to run? Well, probably not. Let's skip that tool. The second one: we look into kOps. Okay, let's go to the kOps website and do the same. Installing Kubernetes, getting started, and we start reading: deploying to AWS, to GCP, DigitalOcean, yada yada yada. None of them is deploying to bare metal. Thank you very much, end of the story. Let's check the last one — maybe that's our chance. So we go to Kubespray. It's a set of Ansible playbooks — another story, you know — but okay, someone gives us some method to deploy Kubernetes on bare metal. So we go and run the Kubespray playbooks: with the bare-metal infrastructure deployed, Kubespray can now install Kubernetes and set up the cluster. And you start reading those playbooks and you feel like, oh, this is so opinionated. So either I build my data center the way they want me to build it, or — thank you very much — there is no tool. So let's agree that none of these three methods is for us. We need to do this stuff ourselves. 
So let's build the stuff, you know, brick by brick, from the beginning. What do we need to care about in a cluster — and not only during the installation, but in general, to have this cluster bootstrapped and then working? First of all, of course — this is bare metal — at the end, you deploy this cluster because there will be some workload, right? You want to access this workload. As well, you want to access the API. Basic operations: you don't deploy the cluster for the sake of deploying it, running, and consuming energy. Then, of course, DNS infrastructure. You are deploying this in your data center, and then what are you going to give your customers? "Now, you know, type this IP address slash something-something" to look at this fancy website or application that we deployed? No, you want it to have some very nice domain — but for that, again, you need DNS infrastructure. It doesn't come for free. The next step: we agreed that we are doing bare metal because we have some reason to do it, and it's not that we just don't like a simple VM from AWS — which means there will be some non-standard network configuration. It doesn't really matter whether it's fancy or not; it will be something more than just "plug in the server, turn it on", because in most cases people doing bare metal don't have DHCP in all the networks, or they need some storage network, and it all requires some fine-tuning which doesn't come by default when you boot your Linux distro — plus some other dirty tricks that I'm going to tell you about later, because they're Kubernetes-specific and I want to build my way up to it. So, the cluster load balancer — because I told you that you need to have the API and the ingress to your workload and all that kind of stuff. The slide is overly complicated for two reasons. The first reason is because it is complicated. The other reason is because no one ever cared to make it less complicated. 
I know it sounds bad, but it is what it is. So the only thing I want to tell you is that, you know, we are in the story of building a cluster, installing it from scratch, which means we are starting the bootstrap from somewhere. Like, you know, you may be running those kubeadm commands to create the cluster, yada yada, from this laptop, right? So this laptop will be your initial bootstrapping infrastructure. On the other hand, at the other side of this room, I have those three servers that are going to be masters. So this somehow has to tie all together. I need to have some IP address that will be this API finally, when I spawn all those nodes in the cluster. So I need to have some virtual IP which will be pointing towards this API, right? This is what I'm calling the API VIP, and it sounds complex, but at the end it boils down to one sentence: when you start doing kubectl commands, at the end you need to target some IP address. If you are deploying bare metal infrastructure, you don't ever want to target a specific node, because if this node goes down, all your tooling goes down. So you want to have some virtual IP, and you may have some load balancer from well-known companies as an appliance, or you may want to just do it yourself with Keepalived. I will show this in a second. And in this slide, what is then the other part? So at some point, we have deployed those control plane nodes, those worker nodes, and we have the API address, which should now be pointing only to the control plane nodes, not to your bootstrap, so this laptop goes away from this story. But then you have some other IP address, because you are deploying workloads. You are not only an admin now. You really have something that runs, your applications, and you don't want to expose your control plane to anyone, right? Or do you? Well, you'd rather not. So you need another IP, and it's exactly the same story. Where do you take all those IPs from, and who manages them? Yeah, you manage them.
So what do we do for this? And of course, I'm telling you about a very opinionated way of designing how to install a Kubernetes cluster, and it's opinionated because we decided: let's do Keepalived in combination with HAProxy. And I told you the story of why we need the VIP, so you should already be convinced that if we need that, then we need Keepalived, because it's very simple and it's proven in action. Why do we also put HAProxy in this story? And now it will be a fast forward to some specific use cases and requirements that we got. The only thing to remember is that it won't always be the same stack for API and ingress, because as an administrator of the cluster, I usually have different requirements than the user, so different tools, different purposes. Because it's very easy to simply deploy Keepalived and tell it, you know, let's pick this 1.2.3.4 IP and put it somewhere in the pool of these servers, right? But then, Kubernetes is about being highly available. So what happens if one of your nodes goes down? Well, the IP address should float to some other node that works, right? But what does it mean, from the network perspective, that an IP address floats? What's going to happen with the connections that you have to this IP address? We start asking ourselves these kinds of questions, because we now have three servers in the control plane, kube-apiserver runs on all three of them, we kill one kube-apiserver, and unlucky us, it was the one that was holding the IP through which we access the cluster. What happens now? No access to the cluster. So either we wait for Keepalived to move this IP address, for the ARP tables to propagate and all this kind of stuff, or, and this is what we decided, we put HAProxy in between the kube-apiserver and Keepalived. And this is something that, you know, people from Kubernetes will want to kill me for: HAProxy is much more stable than the Kubernetes API. That's it. That's it.
If you look at the statistics, kube-apiserver fails much, much more often than HAProxy, so this is our way to keep this simple. And as simple as it sounds, the problem I want to solve is that when kube-apiserver dies, I don't want the IP address to float, because propagating ARP tables and expiring the caches takes too long and I simply don't want to wait for that, so I put HAProxy there. And yeah, the only thing to remember, if you really take this path, is that you need to fine-tune the health checks, because the worst you can do is have Keepalived start to notice the outage faster than HAProxy, because HAProxy also balances the traffic, right? So the order of actions is: kube-apiserver dies, which shouldn't be happening, but it happens; HAProxy notices that; and end of the story. That's it. Keepalived should never, never notice this. And of course we may go deeper: what happens if HAProxy dies? Well, this is now a game of statistics. Has it ever happened for us that kube-apiserver and HAProxy died at the same time? Well, it never happened, apart from when you go to the server and just pull it out of the rack. So this is some corner case that we don't want to cover, and it doesn't really happen in the wild. Of course, there are some limitations, because, you know, the IP address can only live on a single node. This is a disadvantage versus some appliance. The biggest problem here is that you need to have all this stuff in one single L2 segment, so in one broadcast domain. This is because Keepalived doesn't work across subnets. We have some ways to fix that by grouping nodes into different L2s and then having different Keepaliveds in those L2s. But still, this is a pain point, and this is something that you should really design well on paper if you start doing this. But, you know, enough of load balancers, because we could be talking for ages about this.
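The Keepalived-plus-HAProxy stack described above can be sketched with two config fragments. This is an illustrative sketch, not the speaker's actual configuration; the interface name, VIP, node IPs, and ports are invented examples.

```
# /etc/keepalived/keepalived.conf -- owns the API VIP (sketch)
vrrp_instance api_vip {
    state BACKUP              # every node starts as BACKUP; priority elects the owner
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24         # the VIP that kubectl targets
    }
}

# /etc/haproxy/haproxy.cfg -- runs on every control plane node, so whichever
# node holds the VIP always has a healthy path to some kube-apiserver
frontend kube_api
    mode tcp
    bind *:8443               # VIP traffic arrives here, in front of the real apiservers
    default_backend kube_apiservers

backend kube_apiservers
    mode tcp
    balance roundrobin
    server master1 10.0.0.11:6443 check   # these health checks must trip
    server master2 10.0.0.12:6443 check   # before Keepalived ever notices
    server master3 10.0.0.13:6443 check   # an apiserver outage
```

The point of the layering is visible in the comments: an apiserver death is absorbed by HAProxy's backend checks, so the VIP itself never has to move.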
DNS, because we said that we want to do this DNS mumbo jumbo and, you know, we don't want to use only IP addresses. So of course, you are the administrator, you manage the infra. You could say: but, you know, we have this DNS infrastructure there. Maybe it's AWS, maybe Cloudflare, maybe something else. So we can just create records there. But then, you know, either you trust the user or you don't. And we don't. So another opinionated thing in our way of installing Kubernetes is that we spawn a very minimal setup of CoreDNS, which will provide the DNS resolution of what you want to all the nodes of the cluster and all the pods running in this cluster. So that when you start the installation claiming that you will have the API running on api.example.com, I don't worry whether you already created this record on the external DNS. I will just spawn a static pod running CoreDNS and I will create those records myself. So whatever I'm running in this cluster will have this. This again protects me, because now, what happens if we decouple this? You have your external DNS, like most people. And how do you want your cluster to behave when this DNS infrastructure goes down? You have your data center, everything is okay. In some other data center, you have DNS, and this DNS is out. Do you want your cluster to be, you know, dying because pods want to talk to each other and they cannot resolve DNS? It should all be self-contained, right? You don't want to have those external dependencies. So yeah, this is something that we are doing. And the part I will skip is that NetworkManager requires some tuning, because, for people who know how containers are spawned: when you start a container, a copy of /etc/resolv.conf is taken at the moment of starting the container and is plugged into the container. Meaning that if you change the DNS configuration of your host, it will not be propagated to the container unless you restart the container.
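A minimal CoreDNS configuration in the spirit of what's described, serving the cluster's own records regardless of what the external DNS does, might look like this; the domain and IPs are hypothetical, and the actual setup surely differs:

```
# Corefile for a self-contained CoreDNS static pod (sketch)
example.com {
    hosts {
        192.0.2.10 api.example.com        # API VIP record, created by the installer
        192.0.2.20 ingress.example.com    # workload ingress VIP
        fallthrough
    }
    cache 30
}
. {
    forward . /etc/resolv.conf            # everything else goes to the upstream resolvers
}
```

With this in place, in-cluster resolution of the API and ingress names keeps working even if the external DNS infrastructure is down.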
So yeah, for this reason we are also hacking this file around so that it really updates on the fly, but I don't want to go into this. Something a bit more interesting, because we are now going into Kubernetes APIs and how to extend this stuff, is the network configuration of the host. This is a static configuration file for NetworkManager, and you've probably seen this, and you've probably made mistakes in this file more than once. The problem I want to state here is that this is a static file. You go, you modify it, nothing happens. You may notice a mistake in this file five years later, because for five years you haven't rebooted your server, and we don't want to have this scenario in the Kubernetes world. When you define some configuration, it should either apply immediately or fail immediately. So this kind of stuff, where you need to do manual modifications of a file, breaks this contract we have. And the other part is, it simply doesn't scale. If you have 300 servers in your bare metal cluster, you are not doing those changes manually. Simply not. You have CRDs, and this is what should be happening. This is a very, very simple example. I make some modification, I mistype a slash as a backslash; they detect that, and that's easy. But say I'm configuring the default gateway with an IP address from outside of my subnet. This is utterly wrong, but nothing in NetworkManager will prevent me from this configuration. I simply don't want that. We have this CRD defined that creates host configuration from the API, and it may sound like chicken and egg, but it's all a matter of how we order the stuff. We define a Kubernetes CRD that defines how you configure NetworkManager on the host. You can do it per node, all this kind of stuff. I will just show you how that works very quickly. That's the one. I have this node, which has this IP address on the last interface, and what I want now is for this to be different. I want to change that.
I want to change it from Kubernetes in a declarative way, so that whenever someone modifies this, the change will get reverted. I just created a YAML which will configure an IP address on some interface. As simple as that, and I will apply it with the hope that it works as expected. At the top we can see that this CRD is now progressing with the configuration. In fact, it was as simple as that, so we can see that this IP was removed. For a moment I was wondering who was going to ask: but you already had an IP in this subnet configured, what's going to happen? Well, that configuration wouldn't fly, because you should not have two IPs from the same subnet on the same interface. This is a short demo of that. At the same time, it's a Kubernetes API. It should protect us from doing stupid things. I will try to configure a stupid DNS server which has no way of existing, because it's on the link-local IPv6 subnet. If I try to apply that, something should protect me from doing this, because that would actually break the configuration. Let's see our configuration right now. We have 1.1.1.1 as the DNS server, and let's apply this manifest. Now, that configures the wrong DNS. The change has been applied. It's wrong. At this moment your cluster starts to misbehave, your probes go down and so on. Let's give it around 10-15 seconds, and this configuration should get reverted, because there is a controller which in fact checks whether your modifications to the network infrastructure on the host, after applying, make something not work as it should. In this scenario, we see that it's degraded: failed to configure. It failed because this DNS server doesn't exist in reality. That was just a short demo of how we handle all that. It's a bunch of self-contained elements that, once you start using them all together, give you a very nice Kubernetes installer that does it all for you. Sometimes in an opinionated way, sometimes less.
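The CRD in the demo behaves like upstream kubernetes-nmstate, so a declarative host network change of the kind applied there could look roughly like this; the node name, interface, and address are invented for illustration, and the speaker's actual CRD may differ:

```yaml
# Declarative host networking via a NodeNetworkConfigurationPolicy (sketch)
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: eth1-static-ip
spec:
  nodeSelector:
    kubernetes.io/hostname: node01   # apply only on this node
  desiredState:
    interfaces:
      - name: eth1
        type: ethernet
        state: up
        ipv4:
          enabled: true
          dhcp: false
          address:
            - ip: 192.0.2.42         # desired address; the controller reverts drift
              prefix-length: 24
```

The controller reconciles the host toward `desiredState` and rolls back automatically if the applied configuration breaks connectivity, which is exactly the apply-immediately-or-fail-immediately contract described above.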
Now, I told you that there would be some dirty tricks. In the kubelet, there is a concept of node IP, and we are now moving to the Linux world. When you want your application in the Linux world to run and interact with the network, it has to bind somewhere. This somewhere is an IP address and a port. Let's forget the port; we're talking about the IP address. If you have multiple network interfaces, where should Kubernetes listen? Everywhere? On one IP address? On two IP addresses? If you have 10 interfaces, what do we do? I'd say that Kubernetes upstream doesn't solve this in a very smart way, because it was designed to run on clouds with only one network interface. As we started expanding, that's not something we still want. We developed some additional logic to check that, and I will skip the details. In general, one more problem to think about: when you configure the kubelet manually, you need to think about what the IP addresses should be there. This configuration is complicated, because you can actually say bind everywhere, or bind to one IP address, or you can say bind to IPv4, literally as the string "IPv4", and what happens then? You get even stranger syntax: an IPv6 address as a string, a comma, and then an IPv4 address. All this kind of stuff, you need to understand how it behaves and pick your choice. It's complex. You may get really confused once you start. We have some set of rules. I will skip them; you can come back to this. In general, some corner cases: I just showed you an example in which you shouldn't have multiple IP addresses in one subnet. What if you do? There are some people who do this for a reason, and how do you want the kubelet to behave then? Also, one example that I have, and this is just mind-blowing, it killed me for like two weeks: is your IPv6 address really an IPv6 address? Okay, this slide I skip. I got to this RFC, which describes IPv4-compatible IPv6 addresses, and I was like, what the heck is that? Let's go to all the libraries in all the known programming languages.
Every one of them has a function: is this IP address an IPv6 address? You go to the implementation. How does the implementation look? If the string contains a colon, return true. Thank you very much, game over. It's as simple as that. Really, for the last 30 years of my life, I thought it was as simple as that, but it's not. Let's take this: colon, colon, four times f, then colon, and then we put an IP address with dots. It is a correct address. There is an RFC for this address. It may look stupid, but it's a well-defined address and, you know, it breaks things. Try opening a netcat socket to listen on this address. It will not work, because half of the tools now think this is an IPv6 address and half of the tools think this is an IPv4 address. I ran strace on that, and what I realized is that based on this address, it was trying to open a socket on a simple IPv4 address. At this moment, how should we treat that? This is a real case scenario. I got it from a customer who was trying to install Kubernetes and wanted to use this subnet. I was like, what is that? Then we dug deeper and we realized that this is a monster. It should have never existed, but apparently it exists. If you find a set of parameters that you pass to netcat and it crashes, then something went wrong. So, in the end, yeah, choose wisely what you want to do, and once you design your infrastructure, really, you know, double-check it with someone out there, with the upstream community. Is it really how you should be doing stuff? Because in a lot of cases you realize that something misbehaves. And, yeah, one more thing: you think everything is okay, then you start deploying, and they tell you, oh, sorry, but, you know, in fact, with this cloud provider you cannot use this syntax, and then you realize: oh, I wanted to do all that, but I cannot, because you tell me that I cannot. And you realize it only at the end of the story, once you've spent two weeks on designing. So, that's it.
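The failure mode is easy to reproduce outside netcat. A small Python sketch (my illustration, not from the talk) shows an IPv4-mapped IPv6 address of exactly this `::ffff:` shape passing the naive "contains a colon" test while actually carrying an IPv4 address:

```python
import ipaddress

def naive_is_ipv6(s: str) -> bool:
    # The "if the string contains a colon, return true" shortcut the talk criticizes.
    return ":" in s

# An IPv4-mapped IPv6 address (RFC 4291): ::ffff: followed by a dotted-quad IPv4.
addr_str = "::ffff:192.0.2.1"

addr = ipaddress.ip_address(addr_str)
print(naive_is_ipv6(addr_str))   # True: the naive check calls it IPv6
print(type(addr).__name__)       # IPv6Address: the parser agrees...
print(addr.ipv4_mapped)          # 192.0.2.1: ...but it wraps an IPv4 address
```

Whether a tool opens an AF_INET or AF_INET6 socket for such an address depends entirely on which of these views its library takes, which is why half the stack disagrees with the other half.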
Instant Ramen: Quick and easy multi-cluster Kubernetes development on your laptop
Okay. All right. All right. Okay. We are ready to go. All right. Thanks, everybody. Thanks for sticking around till the end today. And a special shout-out to those of you on the live stream as well. My name is Adam Litke and this is Nir Soffer, and today we're going to get our money's worth out of this laptop. Something is not right here. I keep flipping on and off. Let's see. I'll do my best here. So we've come a long way since Linus introduced Linux to the world back in 1991. What started off on his personal computer is deployed pretty much everywhere these days, in increasingly complex scenarios. Take Kubernetes, for example. Everyone's favorite clustered container orchestrator, which runs open source up and down the entire stack, from the cluster nodes to the kubelet and to the container runtime itself. And developers haven't stopped building on it. KubeVirt is a Kubernetes add-on that allows you to seamlessly run your virtual machines in Kubernetes. And since the VMs are running in pods, like any other workload on Kubernetes, they integrate really well with whatever else is deployed there, be it your applications, storage, networking, monitoring, et cetera. And as people continue to deploy Kubernetes and KubeVirt to host their mission-critical workloads, naturally they wonder what will happen when disaster strikes. Disaster recovery software exists to protect your data and return to normal operations as quickly as possible. And this is typically achieved using redundancy. So data can be replicated from primary environments to secondary environments, and applications, including virtual machines, are able to be started on the secondary environment at a moment's notice, should that be required. In this particular scenario, we have a primary data center DR1 in the west, a secondary data center DR2 in the east, and a hub cluster located somewhere in between.
Now, we prefer to run our applications on the primary environment because it's closer to where our customers are. But thanks to continuous data replication, we can rest easy knowing we can start the application up on DR2 when required. So Ramen DR is software that enables disaster recovery for multi-cluster Kubernetes environments. It does this by working with the storage to enable data replication according to a DR policy set by the administrator. And it talks with Open Cluster Management to manage application placement, failover, and relocation flows. Today we're going to simulate this disaster for you. We're going to start by disabling the primary environment. We can then fail over our virtual machine to the secondary environment. And I just want to note here that failover is different from live migration, because live migration would require both environments to be up. In this specific scenario, obviously, we don't have access to DR1. So failover is going to take a couple of minutes, but we can be confident that the app can start back up on the secondary environment with minimal data loss. So I've been introducing a bunch of different components here; that's quite the menu of open source ingredients. KubeVirt is an operator-managed Kubernetes deployment, which packages libvirt and QEMU into a container image, allowing you to run your virtual machines inside of a pod. It also comes with other utilities to help you manage your virtual machine storage and networking requirements. Rook is software that manages and integrates Ceph storage into the Kubernetes platform. Open Cluster Management stitches together multiple Kubernetes clusters and provides for application management, placement, and scheduling. And then Ramen DR adds those DR flows on top of Open Cluster Management. So when we're considering a realistic multi-cluster DR environment, it's a beautiful thing, kind of like this bowl of ramen here to tempt you at dinner time.
However, it's also complicated and expensive to operate, especially when we consider the single-developer use case. So the question we're trying to answer here is: how can we enable development on this open source software stack without huge cloud budgets? And our answer is to scale down that environment so that it can run inside the kind of laptop that most of us are carrying around with us each day. And Nir has prepared a live demo, right on this laptop that you're looking at, that's going to show all this stuff working together, and we're going to simulate that disaster for you. So take it away. Yep. And I'm going to mute it so we don't annoy the live stream people. Okay. Put that in your pocket. Yep. Okay. So this is our stack, right? Three clusters. We have two identical clusters, and everything is managed by Ramen. And we are going to put it inside this laptop to show that we can do it, because laptops are small and cheap. So what we want to do today is to take three data centers with Ramen and KubeVirt and a lot of other components, spanning a large part of Europe, and stuff everything inside this laptop. Now, a note about this environment: the clusters are all on the same laptop, but they simulate remote clusters in different regions. And each cluster is standalone, with its own storage. So how can we prepare this laptop for the demo? I have a pack of instant Ramen DR, which is very easy to use. You need one command: drenv start, with the environment of your flavor. In this case it's a KubeVirt environment. And then you let it cook for 10 minutes until everything is ready. Sorry. So we are not going to wait 10 minutes now, because it takes a little while. I prepared the environment before the talk and we'll just use it. So whatever else we need: we need a Git repo, because we're going to use GitOps. We will give OCM some Git repo to pull the VM resources from and deploy the application. So we use Adam's repo, ocm-kubevirt-samples. And I tweaked it to customize the VM with an SSH public key.
So let's jump into the demo and increase the font size a bit. So I'm using a little tool to spare you my typing errors and make everything more colorful. So first, look what we have in this laptop. We have three clusters. DR1 is the primary cluster where we run the VM. DR2 is the secondary cluster, for when something bad happens to DR1, and something bad will happen. Don't tell anyone. And Hub is orchestrating everything and controlling the other clusters. Now, each of these is a libvirt VM inside the laptop. So let's look inside one of the clusters. We can use kubectl, just like with normal clusters. And we see that we have a lot of stuff installed for us on DR1. The most important parts for the demo are the KubeVirt parts, which will run the VM, and the CDI, which will provision the VM disk from a container image. Of course, it needs to be stored somewhere, so we have a complete Rook Ceph system inside, using the local disk of the cluster. This will provide storage for the VM disk and volume replication between the clusters. And to protect the VM, we have the Ramen DR cluster operator, which orchestrates the DR flows. And finally, we need the Open Cluster Management components that let Ramen control the clusters, because Ramen extends Open Cluster Management and depends on it. So let's look inside the Git repo. I'm running this inside a clone of the Git repo from earlier. The important parts in this repo for the demo are the VM, vm-standalone-pvc; this is a VM optimized for the environment. The subscription, which is the OCM resources for the VM. And DR, the Ramen DR resources. So let's look first at the VM. We'll have a quick look to see what we have there. We will not look inside the YAMLs; you can check the Git repo later. We have a VM configuration. This VM is using a PVC, because we are using a PVC-based VM, so we have this PVC here. And we need to provision the PVC somehow, so we have the source YAML, which tells CDI how to provision the PVC disk.
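For a feel of what a PVC-based KubeVirt VM of this kind looks like, here is a sketch of a CDI DataVolume provisioning the disk from a container image, plus a VirtualMachine consuming the resulting PVC. The names, size, and image URL are examples, not the actual contents of the demo repo:

```yaml
# CDI DataVolume: provisions the VM disk (as a PVC) from a container image (sketch)
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: vm-root-disk
spec:
  source:
    registry:
      url: "docker://quay.io/containerdisks/fedora:latest"  # example image
  storage:
    resources:
      requests:
        storage: 10Gi
---
# KubeVirt VirtualMachine using the PVC created above (sketch)
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-standalone-pvc
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: root
              disk:
                bus: virtio
        resources:
          requests:
            memory: 1Gi
      volumes:
        - name: root
          persistentVolumeClaim:
            claimName: vm-root-disk   # the PVC that Ramen will later protect
```

Keeping the disk in a PVC is what makes the DR story work: the PVC is the unit that Ramen replicates and restores on the secondary cluster.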
So we can apply this kustomization to cluster DR1, and this will start the VM on cluster DR1, but we are not going to do it, because nobody would protect this VM. It's just like a pod that you start: it goes down and nobody protects it. So what we want to do is create an OCM application. We will use a subscription-based application. These resources tell OCM how to create the application: which cluster set to use, where the Git repo is, the namespace that the VM runs in, where to place the VM; and the subscription ties everything together. So to start the VM, we apply this kustomization to the hub. Everything is done on the hub, and then OCM, and Ramen later, will do the right thing. So at this point, OCM is starting the VM on cluster DR1 and we can watch it, using kubectl to get the VM, VMI, pod, and PVC. And we can see that the PVC is already bound, and virt-launcher is running, and we have an IP address, so our VM is running. But let's inspect the VM a little bit more to see where our disk is. So we can use the Rook Ceph kubectl plugin to look at the Ceph layer. And we can run the rbd du command, in this case on cluster DR1. And we see that we have an RBD image created for our VM. If we look inside the PVC, we will find this volume there. So if something bad happens to cluster DR1, we lose the running VM and the disk. This image will be gone and we will have lost all the data. So how can we prevent it? The way we protect this VM is Ramen. So to do this, we must first tell OCM that Ramen is going to take over this VM and that OCM should not change it. We do this by annotating the placement with a special annotation, and at this point Ramen can take over. So how do we protect this with Ramen? We need to apply the Ramen resources. Basically, it's one resource, a DRPC. The DRPC tells Ramen how to find the application, how to protect it, which PVCs should be protected and what the policy is. We are not going to look inside now; you can check the Git repo later.
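To give a feel for that single resource, here is a sketch of a DRPlacementControl. The field names follow the Ramen API, but the cluster, policy, and label values are invented, not the demo repo's actual ones:

```yaml
# DRPlacementControl: tells Ramen what to protect and under which policy (sketch)
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  name: vm-drpc
  namespace: vm-namespace           # where the protected application lives
spec:
  drPolicyRef:
    name: dr-policy                 # the admin-defined replication policy
  placementRef:
    kind: Placement
    name: vm-placement              # the OCM placement Ramen takes over
  pvcSelector:
    matchLabels:
      app: vm                       # which PVCs to replicate
  preferredCluster: dr1             # where the VM normally runs
```

Ramen watches this resource, turns on volume replication for the matching PVCs, and later drives failover and relocation by acting on the same object.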
So, to protect the VM, we apply this kustomization, again on the hub; then Ramen will do the right thing on the right cluster. So once we've done it, our VM is protected by Ramen, and we can watch it again. This time I'm watching also the VRG and VR resources. VRG is the volume replication group; we have one such resource per protected application. And VolumeReplication is the resource that enables the volume replication for each PVC, so we have one of them for every PVC. Now both of them are ready and primary. Primary means this is the primary cluster, replicating data to the secondary cluster. So what does it mean that we replicate data to the other cluster? Look again at the RBD images. If you remember, we have seen that we have an RBD image on the primary cluster. Now we run the same command on the secondary cluster, this time with context DR2. And we find that we have an image on the primary cluster and the same image on the secondary cluster. So what's going on under the cover is that when Ramen enables volume replication, a secondary replica of the image is created on the secondary cluster, and the RBD mirror daemon starts to replicate writes from the image on the primary cluster to the image on the secondary cluster. So if something bad happens to cluster DR1, we can use the secondary image to start the VM at the point in time of the last replication. So the last thing to show about the VM is that we have a logger inside, updating a log file every 10 seconds. We can access the log file using virtctl ssh. We just run this command to see the start of the log, and we see the line where the service was started, and then we see one line every 10 seconds. This will help us verify later, when we recover from the disaster, that we got the right data from the disk. So now we are ready for everything. Let's try to create a disaster. So one thing that is easy to do on the laptop is to suspend cluster DR1.
If you remember, this is a libvirt VM, so we can just suspend it. Now everything running there has stopped. So let's try to access the VM again with virtctl ssh. Let's try to tail the log and see if it works. Well, it does not seem to work, because of course we suspended the VM, so nothing there is accessible. If we had an important service on this VM, our users would not be happy now. So how can we fix this? Adam, do we have any ideas? I was hoping you would tell us. Yes. So because our VM is replicated, we can just fail over to the other cluster quickly. How do we fail over? If you remember, we installed the DRPC, so we can patch the DRPC: we set the action to failover and we set the failover cluster. And once we've done it, Ramen starts the failover, and we can start watching the failover on the other cluster. I'm running this again on the DR2 cluster, because DR1 is not accessible. And we see that we have a PVC. It's pending. We have a volume replication group. We have a volume replication, but the volume replication group is not primary yet. It will take a while until the VM is stopped. So while we wait for it, let's understand what's going on under the cover. So the RBD image on the secondary cluster was a replica, pulling data from the image on the primary cluster. Ramen has to stop this replication and promote it to a primary image that will replicate to the other cluster. Once this is done, the VRG will be marked as primary, and it should happen any second. And at this point, Ramen will change the application placement. It just became primary. So now Ramen changed the placement of the application, and OCM will see the change and will redeploy the VM on the second cluster using the subscription. And this should happen any second now. When OCM deploys the application, it will reuse the PVC that Ramen has restored and connected to the right RBD image. And it just happened. We see that virt-launcher is running. The VM is up. We have an IP address.
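Triggering the failover amounts to changing two fields in the DRPC spec, roughly like this (the cluster name is an example, matching the hypothetical DRPC sketched earlier in spirit):

```yaml
# The part of the DRPC spec that the failover patch sets (sketch)
spec:
  action: Failover        # ask Ramen to start the failover flow
  failoverCluster: dr2    # the surviving cluster to fail over to
```

Everything else, promoting the secondary RBD image, marking the VRG primary, and flipping the OCM placement, follows from Ramen reconciling this change.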
So if we had this important service on the VM, the service should be up again, and it would be usable. But how do we know that we got the right data? Maybe we just got a new application with an empty disk. Let's check the disk. Again, we can use the logger. We just dump the entire log using SSH. This time I'm connecting to cluster DR2. And we see all the logs from when the VM ran on cluster DR1, until we created the disaster. And we see the new logs from when the VM started to run on cluster DR2. Note that we have a gap here between the last line logged when the VM was running on DR1 and the first new log line, which is about three minutes in this case. This gap depends on how fast we detect that there was an issue with the cluster and start the failover. So we had a short downtime. The VM is running. We got all the data. Looks like a successful failover. So what's next? In a real cluster, you would try to fix the cluster, recover it. Maybe you need to reinstall it. At this point, you can relocate the VM, or maybe later, during some maintenance window, you will relocate the VM back to the primary cluster. In this demo, we are done, and it's time for questions. The first three questions will get instant ramen. Go ahead. The question is what about the IP address of the virtual machine? You were paying attention and noted that it changed. So what would you suggest? So Ramen does not do anything about the IP address. I think in a real application, you will have some load balancer to make sure that you can switch from one cluster to the other cluster. Probably using the DNS system, because you have a nice name in the DNS. But basically, you will do what virtctl is doing when you connect to the VM: you use the VM name and the namespace, and you have to find the actual address. Yes. Very nice demo. How much do I need to run it at home? I don't have any cloud budget. Is 16G enough? 16 will be too low. Yes.
So first, you can try it at home, but you need a big enough laptop. I think for KubeVirt, you need like 32G to run it. Yes. Because we have two clusters with 8G each; maybe you can trim them down a little bit, but 16 will be too low. And... maybe you need a new laptop. Yes. Maybe you need a new laptop. Fedora would be easier, because this is what we use, but it should work on anything. We continue with the questions. Can I use two laptops, one for the primary and one for disaster recovery? Let's say an old laptop. Repeat the question. I didn't hear the question exactly. Can you repeat it? Your presentation is from the same laptop. Can I use the solution with two laptops? So the question is, can we use different machines for this? You can, but it will be harder, because you need to set up the network between them. On the same laptop, it's much easier, because minikube handles most of the stuff for you. If you use different machines, you need to make sure that the clusters are accessible to each other. So it will be harder. I've got one over here. Yes. Is it required to use Ceph, or can you use another storage system? Repeat the question. Do we have to use Ceph? Currently, we work with Ceph, so it's optimized for Ceph and it works. And we have a complete tool that sets it up and configures it. If you want to use something else, we have support for any storage, basically, but it doesn't work on upstream Kubernetes yet; it's mainly on OpenShift. It needs more work. Yes. If the primary site is down, is there any mechanism preventing the virtual machine from starting by mistake, by itself? The question was: once the cluster is down, do we have any protection that the virtual machine will not start again on the same cluster? So we don't have any protection at the Ramen level, because Ceph is protecting us.
If the same VM starts again, let's say we resume the cluster and the application is still running, then it will continue to run and try to write to the disk, which is fine, because the disk is not replicated at this point; the image on the destination cluster is now the primary one. Usually it will just fail or go into some error state, and in a real deployment, when Ramen detects that it should not run, it will shut it down. So it's safe. There is one more question. Yes. Just because it's the end of the day, just one more question. What happens when the hub that was controlling the two data centers goes down? The question was: what happens when the hub goes down? Very good question. In a real setup, in OpenShift, you have a hub recovery setup, so actually you need two hubs: one passive hub and one active hub. And there is a lot of setup to back up the hub and restore it. But for testing, it doesn't matter. And also, hopefully you're not running customer-visible or end-user-visible workloads on the hub. So if it goes down, you can repair it, and it won't be quite as urgent of a disaster. So hopefully the other sites don't fail at the same time. All right, thanks, everybody, for coming. What a good question.